Powerful tool for parsing HTML and collecting Web pages

Last Update: 2014-09-23 Source: Internet

Author: User


HtmlAgilityPack is a free, open-source, third-party .NET class library. It is small and is used primarily for parsing HTML documents on the server side (in a browser/server (B/S) application, the client can parse HTML with JavaScript). At the time this article was published, the latest version of HtmlAgilityPack was 1.4.0. It is available at: http://htmlagilitypack.codeplex.com/

The unpacked download contains three files. Only two are needed here: HtmlAgilityPack.dll (the assembly) and HtmlAgilityPack.xml (the documentation file, which supplies IntelliSense tips and help text in Visual Studio 2008). Add them to the solution and they are ready to use. Nothing needs to be installed, so the library is very "green".

Add a using HtmlAgilityPack; directive at the top of a C# class file to use the types in that namespace. In practice, almost everything revolves around the HtmlDocument class, which closely resembles the XmlDocument class in the Microsoft .NET Framework. XmlDocument operates on XML documents, while HtmlDocument operates on HTML documents (it can also handle XML documents), and both work in a DOM-based way. The difference is that HtmlDocument drops methods such as GetElementsByTagName and strengthens lookup by ID: its GetElementbyId method (note the lowercase "b") can be used directly, whereas XmlDocument.GetElementById only works when a DTD declares the ID attribute. Nodes in HtmlAgilityPack are generally located with XPath expressions. For a reference on XPath syntax, see: http://www.w3school.com.cn/xpath/xpath_syntax.asp

For example, to collect the titles of the recommended articles on the cnblogs home page, we can write the following code in ASP.NET:


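// Fetch the page and parse it into a DOM tree.
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument htmlDoc = htmlWeb.Load(@"http://www.cnblogs.com/");
// Select every <a> element whose class attribute is "titlelnk".
HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//a[@class='titlelnk']");
// Write the inner HTML of each matching anchor to the response.
foreach (HtmlNode anchor in anchors)
    Response.Write(anchor.InnerHtml + "<br/>");
Response.End();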

This code parses the static HTML text of the fetched home page into a DOM node tree, then uses an XPath expression to get every a element in the document whose class attribute value is titlelnk. The two most commonly used methods for getting nodes are SelectNodes("XPath expression") and SelectSingleNode("XPath expression"). The former returns a node collection, an instance of HtmlNodeCollection. The latter returns the first node that satisfies the condition, an instance of HtmlNode. The foreach loop then outputs the inner content (InnerHtml) of each a element.
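The difference between the two methods can be seen in a minimal console sketch. The markup string and the class name SelectDemo are invented for this illustration, and the null checks reflect the library's default behavior as far as I know: both methods return null, not an empty result, when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

class SelectDemo
{
    static void Main()
    {
        // Hypothetical markup, invented for this illustration.
        string html = "<div><a class='titlelnk' href='/p/1'>First</a>"
                    + "<a class='titlelnk' href='/p/2'>Second</a></div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of fetching a URL

        // SelectSingleNode returns only the first match, or null if there is none.
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//a[@class='titlelnk']");
        if (first != null)
            Console.WriteLine(first.InnerText);

        // SelectNodes returns all matches, or null if there are none.
        HtmlNodeCollection all = doc.DocumentNode.SelectNodes("//a[@class='titlelnk']");
        if (all != null)
        {
            foreach (HtmlNode a in all)
                Console.WriteLine(a.GetAttributeValue("href", ""));
        }
    }
}
```

Checking for null before iterating matters in the page-scraping case too: if the site changes its markup and the XPath no longer matches, a foreach over the result would otherwise throw.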

In general, HtmlAgilityPack is more efficient and accurate than parsing HTML with regular expressions, in terms of both development effort and runtime performance. It is also very flexible. For example, if the body of the foreach loop above is changed to Response.Write(anchor.OuterHtml + "<br/>"); the output is the hyperlink itself rather than its inner content. You can even modify the hyperlink itself:


foreach (HtmlNode anchor in anchors)
{
    // Add an inline style to each anchor, then write out the whole element.
    anchor.Attributes.Add("style", "color:red");
    Response.Write(anchor.OuterHtml + "<br/>");
}

After running this you will see red hyperlinks. You can manipulate the nodes of the DOM tree that HtmlAgilityPack builds almost arbitrarily, as if the tree were your own to trim at will. Regular expressions cannot match this. HtmlAgilityPack is also very tolerant of the structure of the source text: it works normally even without a root element, which is completely different from the very strict XmlDocument. The key to mastering HTML parsing with HtmlAgilityPack is becoming familiar with XPath expression syntax. XPath is easy to get started with, and a few hours of study is enough for most applications. Relying on the efficient and versatile DOM architecture and on the powerful, concise syntax of XPath, HtmlAgilityPack can truly be called "the divine weapon for parsing HTML and collecting Web pages."
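As a rough illustration of this tolerance and of trimming the tree, the sketch below loads a fragment that has no single root element and leaves a tag unclosed, removes one node by ID, and restyles another. The fragment, the id value "ad", and the class name LooseMarkupDemo are all hypothetical. The exact serialized output depends on how the library repairs the unclosed tag, so none is shown.

```csharp
using System;
using HtmlAgilityPack;

class LooseMarkupDemo
{
    static void Main()
    {
        // Hypothetical fragment: no single root element and an unclosed <p>.
        string html = "<p>intro<span id='ad'>banner</span><a href='/x'>link</a>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // a strict XML parser would reject this input

        // GetElementbyId (lowercase "b" in this library) looks a node up by its id.
        HtmlNode ad = doc.GetElementbyId("ad");
        if (ad != null)
            ad.Remove(); // prune the node from the tree

        // Modify a remaining node in place.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");
        if (link != null)
            link.SetAttributeValue("style", "color:red");

        // Serialize the edited tree back to HTML text.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```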

This article is an English version of an article originally published in Chinese on aliyun.com.
