Finding all href values in an HTML string with C# .NET

Say you’d like to collect all link URLs in an HTML document. For example:

<html>
   <p>
     <a href="http://www.fantasticsite.com">Visit fantasticsite!</a>
   </p>
   <div>
     <a href="http://www.cnn.com">Read the news</a>
   </div>
</html>

The goal is to find “http://www.fantasticsite.com” and “http://www.cnn.com”. Using an XML parser could be a solution if the HTML is well-formed XML. This is of course not always the case, so the dreaded regular expressions provide a viable alternative.
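For comparison, here is a minimal sketch of the XML parser route, assuming the input really is well-formed XML. The class and method names (XmlHrefFinder, FindHrefsWithXml) are made up for this illustration. XDocument.Parse throws an XmlException on malformed markup, and the element and attribute lookups are case-sensitive, so something like <A HREF="..."> would be missed:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlHrefFinder
{
    // Hypothetical helper: returns the href value of every <a> element.
    // Only works when the HTML is well-formed XML; otherwise Parse throws.
    public static List<string> FindHrefsWithXml(string wellFormedHtml)
    {
        XDocument document = XDocument.Parse(wellFormedHtml);
        return document.Descendants("a")
            // The explicit cast yields null when the attribute is missing
            .Select(anchor => (string)anchor.Attribute("href"))
            .Where(href => href != null)
            .ToList();
    }

    public static void Main()
    {
        string input = "<html><p><a href=\"http://www.fantasticsite.com\">Visit fantasticsite!</a></p><div><a href=\"http://www.cnn.com\">Read the news</a></div></html>";
        foreach (string href in FindHrefsWithXml(input))
        {
            Console.WriteLine(href);
        }
    }
}
```

With the sample input this prints the two URLs, one per line. An unclosed tag such as a bare <br> is enough to make the parser fail, which is why the regular expression approach is worth having as a fallback.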

The following code uses a Regex to find those sections in the input text that match a regular expression:

using System;
using System.Text.RegularExpressions;

class Program
{
	static void Main(string[] args)
	{
		string input = "<html><p><a href=\"http://www.fantasticsite.com\">Visit fantasticsite!</a></p><div><a href=\"http://www.cnn.com\">Read the news</a></div></html>";
		FindHrefs(input);
		Console.WriteLine("Main done...");
		Console.ReadKey();
	}

	private static void FindHrefs(string input)
	{
		// Matches href = "value" (captures the text between the double quotes)
		// or href = value (captures everything up to the next whitespace).
		// Both alternatives capture into group 1.
		Regex regex = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))", RegexOptions.IgnoreCase);
		Match match;
		for (match = regex.Match(input); match.Success; match = match.NextMatch())
		{
			Console.WriteLine("Found a href. Groups: ");
			// Group 0 is the whole match, group 1 is the URL itself
			foreach (Group group in match.Groups)
			{
				Console.WriteLine("Group value: {0}", group);
			}
		}
	}
}

This gives the following output:

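Found a href. Groups: 
Group value: href="http://www.fantasticsite.com"
Group value: http://www.fantasticsite.com
Found a href. Groups: 
Group value: href="http://www.cnn.com"
Group value: http://www.cnn.com
Main done...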
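Each match carries two groups: the whole matched text and the URL on its own. If only the URLs are needed, a variation along the following lines could collect them into a list. This is a sketch rather than part of the original code: the HrefCollector class, the CollectHrefs method and the group name "url" are made up for this example, and the pattern is the same one as above, written as a verbatim string with a named group instead of the numbered group 1:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefCollector
{
    // Same pattern as in FindHrefs, but the capture group is named "url".
    // Inside a verbatim string a double quote is written as "".
    private static readonly Regex HrefRegex = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|(?<url>\S+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns only the captured URL of each match.
    public static List<string> CollectHrefs(string input)
    {
        var urls = new List<string>();
        foreach (Match match in HrefRegex.Matches(input))
        {
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // The second link has no quotes around the href value
        string input = "<a href=\"http://www.fantasticsite.com\">Visit</a> <a href=http://www.cnn.com >News</a>";
        foreach (string url in CollectHrefs(input))
        {
            Console.WriteLine(url);
        }
    }
}
```

This prints http://www.fantasticsite.com and http://www.cnn.com. Note the limits of the unquoted branch: \S+ runs until the next whitespace, so an unquoted value followed directly by > would swallow the rest of the tag, and a single-quoted value would come back with its quotes attached.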



About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

3 Responses to Finding all href values in an HTML string with C# .NET

  1. rsp says:

    Nice, Thanks Andras.

  2. Subhash PM says:

    Good Article… Thanks.. 🙂

  3. mathewpoc says:

    Excellent work. I appreciate it.
