Powerful tool for parsing HTML and collecting Web pages

Source: Internet
Author: User

HtmlAgilityPack is a free, open-source, third-party .NET class library, small in size, that is used primarily for parsing HTML documents on the server side (in a browser/server (B/S) application, the client can already parse HTML with JavaScript). At the time this article was published, the latest version of HtmlAgilityPack was 1.4.0. It is available at: http://htmlagilitypack.codeplex.com/

The uncompressed download contains three files, of which only two are needed here: HtmlAgilityPack.dll (the assembly) and HtmlAgilityPack.xml (the documentation file, which supplies IntelliSense hints and help text in Visual Studio 2008). Reference them in your solution and they are ready to use; nothing has to be installed, so the library is very "green".

Adding a using HtmlAgilityPack; directive at the top of a C# class file brings the types in that namespace into scope. In practice, almost everything revolves around the HtmlDocument class, which closely resembles the XmlDocument class in the Microsoft .NET Framework. XmlDocument operates on XML documents and HtmlDocument operates on HTML documents (it can also handle XML documents); both work in a DOM-based way. The difference is that HtmlDocument drops methods such as GetElementsByTagName and strengthens the get-element-by-ID method, spelled GetElementbyId in HtmlAgilityPack (HtmlDocument can call it directly, whereas XmlDocument cannot). Nodes in HtmlAgilityPack are located mostly with XPath expressions; a reference for XPath syntax is available at: http://www.w3school.com.cn/xpath/xpath_syntax.asp
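As a rough illustration of the ID lookup described above, the following is a minimal sketch rather than a definitive usage pattern. The class name LookupByIdSketch, the helper TextOfId, and the sample markup are hypothetical; the sketch assumes the HtmlAgilityPack assembly is referenced and parses an in-memory string so that no network access is needed.

```csharp
using HtmlAgilityPack;

public static class LookupByIdSketch
{
    // Hypothetical helper: returns the inner text of the element with the
    // given id, or null when the document has no such element.
    public static string TextOfId(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                      // parse a string instead of a URL

        // Note the lowercase "b": the method is spelled GetElementbyId.
        HtmlNode node = doc.GetElementbyId(id);
        return node == null ? null : node.InnerText;
    }
}
```

Calling LookupByIdSketch.TextOfId("<div><p id='intro'>Hello</p><p>World</p></div>", "intro") should give back the text of the first paragraph, and an id that does not occur should give null.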

For example, to collect the titles of the recommended articles on the cnblogs home page, you can write the following code in ASP.NET:

    // Download the page and parse it into a DOM node tree.
    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlDocument htmlDoc = htmlWeb.Load(@"http://www.cnblogs.com/");
    // Select every a element whose class attribute is "titlelnk".
    HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//a[@class='titlelnk']");
    // SelectNodes returns null when nothing matches, so guard the loop.
    if (anchors != null)
    {
        foreach (HtmlNode anchor in anchors)
            Response.Write(anchor.InnerHtml + "<br/>");
    }
    Response.End();


This code parses the static HTML text of the downloaded home page into a DOM node tree, and then uses an XPath expression to get every a element in the document whose class attribute value is titlelnk. The two most commonly used methods for getting nodes are SelectNodes("XPath expression") and SelectSingleNode("XPath expression"): the former returns a node collection, an instance of HtmlNodeCollection; the latter returns the first node that satisfies the condition, an instance of HtmlNode. The subsequent foreach loop outputs the inner HTML of each a element.
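The difference between the two selection methods can be sketched as follows. This is an illustrative sketch under stated assumptions, not the article's own code: the class XPathSelectionSketch, its two helpers, and any sample markup passed to them are hypothetical, and the HtmlAgilityPack assembly is assumed to be referenced. It also shows a detail worth keeping in mind: SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSelectionSketch
{
    private const string TitleLinks = "//a[@class='titlelnk']";

    // SelectNodes: every match, returned as an HtmlNodeCollection.
    public static List<string> AllTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(TitleLinks);
        if (anchors == null)
            return titles;                       // no match yields null, so return an empty list

        foreach (HtmlNode anchor in anchors)
            titles.Add(anchor.InnerText);
        return titles;
    }

    // SelectSingleNode: only the first match, returned as an HtmlNode (or null).
    public static string FirstTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode first = doc.DocumentNode.SelectSingleNode(TitleLinks);
        return first == null ? null : first.InnerText;
    }
}
```

For markup containing two titlelnk anchors and one ordinary anchor, AllTitles should return the two matching titles in document order, while FirstTitle should return only the first of them.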

In general, parsing HTML with HtmlAgilityPack is more efficient and more accurate than parsing it with regular expressions, in terms of both development effort and runtime performance. HtmlAgilityPack is also very flexible. For example, if the body of the foreach loop in the code above is changed to Response.Write(anchor.OuterHtml + "<br/>");, the output is the hyperlink itself rather than its inner HTML. You can even modify the hyperlink itself:

    foreach (HtmlNode anchor in anchors)
    {
        // Add an inline style so that the link renders in red.
        anchor.Attributes.Add("style", "color:red");
        Response.Write(anchor.OuterHtml + "<br/>");
    }


Running this displays red hyperlinks. You can manipulate the nodes of the DOM tree that HtmlAgilityPack builds almost arbitrarily, as if it were a tree of your own that you can prune at will; regular expressions cannot match this. HtmlAgilityPack is also very lenient about the structure of the source text: it works normally even when there is no root element, which is completely different from the very strict XmlDocument. The key to mastering HtmlAgilityPack for parsing HTML documents is becoming familiar with XPath expression syntax, which is easy to pick up: a few hours of study is enough for most applications. Building on the efficient, general-purpose DOM architecture and the powerful, concise XPath syntax, HtmlAgilityPack really can be called "the divine weapon for parsing HTML and collecting Web pages."
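The contrast in strictness can be illustrated with a small sketch. The class LenientParsingSketch and its helpers are hypothetical names introduced only for this illustration, and the sketch assumes the HtmlAgilityPack assembly is referenced alongside System.Xml. It feeds the same text to both parsers: XmlDocument.LoadXml throws an XmlException for text that is not well-formed XML, while HtmlDocument.LoadHtml still builds a node tree that can be queried.

```csharp
using System.Xml;
using HtmlAgilityPack;

public static class LenientParsingSketch
{
    // True when XmlDocument accepts the text as well-formed XML.
    public static bool XmlAccepts(string text)
    {
        try
        {
            new XmlDocument().LoadXml(text);
            return true;
        }
        catch (XmlException)
        {
            return false;                        // e.g. several root elements or an unclosed tag
        }
    }

    // Number of span elements HtmlAgilityPack finds in the same text.
    public static int CountSpans(string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(text);
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
        return spans == null ? 0 : spans.Count;
    }
}
```

With an input such as "<span>one</span><br><span class=x>two</span>", which has no single root element, an unclosed br, and an unquoted attribute value, XmlAccepts should report false while CountSpans should still find both span elements.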
