Finding all href values in an HTML string with C# .NET
December 5, 2014
Say you’d like to collect all link URLs in an HTML text, e.g.:
<html>
<p>
<a href="http://www.fantasticsite.com">Visit fantasticsite!</a>
</p>
<div>
<a href="http://www.cnn.com">Read the news</a>
</div>
</html>
The goal is to find “http://www.fantasticsite.com” and “http://www.cnn.com”. An XML parser could be a solution if the HTML code is well-formed XML. That is of course not always the case, so the dreaded regular expressions provide a viable alternative.
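For comparison, here is a minimal sketch of the XML parser route, assuming the input really is well-formed XML. It uses LINQ to XML (XDocument); the class name XmlHrefFinder and the method name are made up for this illustration. Note that the element lookup is case-sensitive, so an upper-case <A> tag would be missed, and anything that is not well-formed makes XDocument.Parse throw an XmlException.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, only for illustration.
public static class XmlHrefFinder
{
    // Returns the href attribute values of all <a> elements.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static List<string> FindHrefs(string wellFormedHtml)
    {
        XDocument document = XDocument.Parse(wellFormedHtml);
        return document
            .Descendants("a")                                   // every <a> element, at any depth
            .Select(anchor => (string)anchor.Attribute("href")) // null if there is no href attribute
            .Where(href => href != null)
            .ToList();
    }
}
```

Calling XmlHrefFinder.FindHrefs with the sample HTML above should return the two URLs. Real-world HTML with unclosed tags or entities such as &nbsp; will typically fail to parse this way, which is why the regex approach is shown next.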
The following code uses a Regex to find those sections in the input text that match a regular expression:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "<html><p><a href=\"http://www.fantasticsite.com\">Visit fantasticsite!</a></p><div><a href=\"http://www.cnn.com\">Read the news</a></div></html>";
        FindHrefs(input);
        Console.WriteLine("Main done...");
        Console.ReadKey();
    }

    private static void FindHrefs(string input)
    {
        // Matches href="value" (quoted) or href=value (unquoted); group 1 captures the value.
        Regex regex = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))", RegexOptions.IgnoreCase);
        Match match;
        for (match = regex.Match(input); match.Success; match = match.NextMatch())
        {
            Console.WriteLine("Found a href. Groups: ");
            // Group 0 is the whole match, group 1 is the URL itself.
            foreach (Group group in match.Groups)
            {
                Console.WriteLine("Group value: {0}", group);
            }
        }
    }
}
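This gives the following output:
Found a href. Groups:
Group value: href="http://www.fantasticsite.com"
Group value: http://www.fantasticsite.com
Found a href. Groups:
Group value: href="http://www.cnn.com"
Group value: http://www.cnn.com
Main done...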
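If only the URLs are needed rather than every group, a small variation can read group 1 directly and return the values as a list. The sketch below reuses the same pattern; the class name HrefCollector and the method name CollectHrefs are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, only for illustration.
public static class HrefCollector
{
    // Same pattern as in FindHrefs: group 1 holds the attribute value.
    private static readonly Regex HrefRegex = new Regex(
        "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase);

    // Returns the captured href values in the order they appear in the input.
    public static List<string> CollectHrefs(string input)
    {
        var urls = new List<string>();
        foreach (Match match in HrefRegex.Matches(input))
        {
            urls.Add(match.Groups[1].Value);
        }
        return urls;
    }
}
```

One caveat with this pattern: values that are not wrapped in double quotes fall into the \S+ branch, which keeps going until the next whitespace, so an unquoted or single-quoted href can come back with extra characters such as a trailing > attached.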
