Finding all href values in an HTML string with C# .NET
December 5, 2014 8 Comments
Say you’d like to collect all link URLs in an HTML text, e.g.:
<html>
  <p>
    <a href="http://www.fantasticsite.com">Visit fantasticsite!</a>
  </p>
  <div>
    <a href="http://www.cnn.com">Read the news</a>
  </div>
</html>
The goal is to find “http://www.fantasticsite.com” and “http://www.cnn.com”. An XML parser could be a solution if the HTML code is well-formed XML. That is of course not always the case, so the dreaded regular expressions provide a viable alternative.
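For the well-formed case, a minimal sketch of the XML parser route could look like the following. It assumes the input parses as XML and that the link elements are lower-case <a> tags without a namespace. The class name XmlHrefFinder is made up for this illustration; XDocument comes from System.Xml.Linq.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of the original post.
public static class XmlHrefFinder
{
    // Returns the href attribute value of every <a> element in document order.
    // Assumes the input is well-formed XML: XDocument.Parse throws an
    // XmlException on typical real-world HTML such as unclosed <br> tags.
    public static List<string> FindHrefs(string wellFormedHtml)
    {
        XDocument document = XDocument.Parse(wellFormedHtml);

        var hrefs = new List<string>();
        // Element names are case-sensitive in XML, so "A" would not match "a".
        foreach (XElement anchor in document.Descendants("a"))
        {
            XAttribute href = anchor.Attribute("href");
            if (href != null)
            {
                hrefs.Add(href.Value);
            }
        }
        return hrefs;
    }
}
```

Calling XmlHrefFinder.FindHrefs with the sample HTML above returns the two URLs. Anything that is not valid XML makes the call throw, which is why the regular expression route is still worth having.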
The following code uses a Regex to find every href attribute in the input text and capture its value:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "<html><p><a href=\"http://www.fantasticsite.com\">Visit fantasticsite!</a></p><div><a href=\"http://www.cnn.com\">Read the news</a></div></html>";
        FindHrefs(input);
        Console.WriteLine("Main done...");
        Console.ReadKey();
    }

    private static void FindHrefs(string input)
    {
        // Group 1 captures the href value: either the text between double quotes
        // or, for unquoted values, a run of non-whitespace characters.
        Regex regex = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))", RegexOptions.IgnoreCase);
        Match match;
        for (match = regex.Match(input); match.Success; match = match.NextMatch())
        {
            Console.WriteLine("Found a href. Groups: ");
            // Groups[0] is the whole match, Groups[1] is the captured value.
            foreach (Group group in match.Groups)
            {
                Console.WriteLine("Group value: {0}", group);
            }
        }
    }
}
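This gives the following output:
Found a href. Groups:
Group value: href="http://www.fantasticsite.com"
Group value: http://www.fantasticsite.com
Found a href. Groups:
Group value: href="http://www.cnn.com"
Group value: http://www.cnn.com
Main done...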
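Group 0 of each match is the whole matched text, including the href= part, while group 1 holds the URL on its own. If only the URLs are needed, a small variation along the following lines could collect them into a list. The class and method names are invented for this sketch; the regular expression is the one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original post.
public static class HrefValueExtractor
{
    // Same pattern as in FindHrefs: group 1 is the attribute value.
    private static readonly Regex HrefRegex = new Regex(
        "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase);

    // Returns only the captured URL of each match, without the href="..." wrapper.
    public static List<string> ExtractHrefValues(string input)
    {
        var values = new List<string>();
        foreach (Match match in HrefRegex.Matches(input))
        {
            // Groups[0] would be the whole match; Groups[1] is the value alone.
            values.Add(match.Groups[1].Value);
        }
        return values;
    }
}
```

Called with the input string from Main, ExtractHrefValues returns a list containing http://www.fantasticsite.com and http://www.cnn.com.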
Nice, Thanks Andras.
Good Article… Thanks.. 🙂
Excellent work. I appreciate it.
Thanks Andras… it was very much useful…Thank you…
Thanks Andras. It’s very useful.
I am able to retrieve the href value using your code, but how do I retrieve the “path” value? I tried replacing “href” with “path” but did not get the proper value. How can I replace “%20” with “ ” and “%2c” with “,” using a regular expression in C#?
The expected result is: ddn/SpecialDeals/Lists/SpecialDeals/7803_.000/Bolar, Suni – to file.pdf
from the result: href=”/ddn/SpecialDeals/_layouts/QuestSoftware/ItemHandler.ashx?path=/ddn/SpecialDeals/Lists/SpecialDeals/7803_.000/Bolar%2c%20Suni%20-%20to%20file.pdf”
Original string:
Bolar, Suni – to file.pdf
Thanks in Advance Andras
Sorry about reviving this post if that’s a problem. I am currently having problems with the following line: Console.WriteLine(“Group value: {0}”, group);
How would one get just the path without the ‘ href=”” ‘?
Perfect and very useful. Thanks.