[.NET] Crawling web page data with HtmlAgilityPack

Last Update:2014-10-24 Source: Internet

Author: User

Tags xpath


I recently learned XPath path expressions, which are mainly used to find nodes in an XML document: with an XPath expression you can quickly locate and access a node in the document. HTML is also an XML-like markup language, although its syntax is less strict. HtmlAgilityPack, an open source project on CodePlex, provides XPath-based parsing of HTML files. This article records how to use the library.

I. XPath path expressions

An XPath path expression is used to select a node or a node set in an XML document.

1. Terminology: there are 7 node types: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes

2. Node relationships: parent, children, sibling, ancestor, descendant

3. Path expressions

nodename   Selects the child nodes with that name. Example: childnode selects the childnode children of the current node, not the grandchildren

/   Selects from the root node. Example: /root/childnode/grandsonnode

//   Selects matching descendant nodes at any depth. Example: //childnode selects all descendant nodes named childnode

.   Represents the current node. Example: ./childnode selects the childnode children of the current node

..   Represents the parent node. Example: ../nearnode selects the nearnode children of the parent node

@   Selects attributes. Example: /root/childnode/@id selects the id attributes of the childnode nodes
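The path expressions above can be tried out with a minimal sketch like the following. It uses the standard System.Xml XmlDocument class rather than HtmlAgilityPack so that it runs without extra packages. The sample document and the PathExpressionDemo, Count and CountFrom names are made up for illustration.

```csharp
using System;
using System.Xml;

public static class PathExpressionDemo
{
    // Hypothetical sample document; element names mirror the examples above.
    private const string Xml = @"<root>
  <childnode id='a'><grandsonnode>1</grandsonnode></childnode>
  <childnode><grandsonnode>2</grandsonnode></childnode>
  <other><childnode id='c'/></other>
</root>";

    // Number of nodes an absolute expression selects.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.SelectNodes(xpath).Count;
    }

    // Number of nodes a relative expression selects, starting from the
    // first node matched by contextXPath.
    public static int CountFrom(string contextXPath, string relativeXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode context = doc.SelectSingleNode(contextXPath);
        return context.SelectNodes(relativeXPath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/root/childnode"));               // 2: direct children only
        Console.WriteLine(Count("//childnode"));                   // 3: any depth
        Console.WriteLine(Count("/root/childnode/grandsonnode"));  // 2
        Console.WriteLine(Count("/root/childnode/@id"));           // 1: attribute nodes
        Console.WriteLine(CountFrom("/root/other", "./childnode"));   // 1
        Console.WriteLine(CountFrom("/root/other", "../childnode"));  // 2
    }
}
```

Note that /root/childnode/@id returns the attribute nodes themselves, not the elements that carry them.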

4. Predicates

Predicates restrict a node set so that the selection is more precise

/root/book[1]   The first book node in the node set

/root/book[last()]   The last book node in the node set

/root/book[last()-1]   The second-to-last book node in the node set

/root/book[position() < 5]   The first four book nodes in the node set (positions 1 to 4)

/root/book[@id]   The book nodes that have an id attribute

/root/book[@id='Chinese']   The book nodes whose id attribute has the value Chinese

/root/book[price > 35]/title   The title nodes of the book nodes whose price element value is greater than 35
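As a rough illustration of the predicates above, the sketch below evaluates each one against a small, made-up book list using System.Xml. The data and the PredicateDemo and Select names are assumptions for the example. Each expression ends in /title so that the result is easy to read.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PredicateDemo
{
    // Hypothetical data: five books, two of them with an id attribute.
    private const string Xml = @"<root>
  <book id='Chinese'><title>A</title><price>30</price></book>
  <book><title>B</title><price>40</price></book>
  <book id='math'><title>C</title><price>50</price></book>
  <book><title>D</title><price>20</price></book>
  <book><title>E</title><price>36</price></book>
</root>";

    // Returns the text of every node selected by xpath, joined with commas.
    public static string Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        var texts = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            texts.Add(node.InnerText);
        }
        return string.Join(",", texts);
    }

    public static void Main()
    {
        Console.WriteLine(Select("/root/book[1]/title"));               // A
        Console.WriteLine(Select("/root/book[last()]/title"));          // E
        Console.WriteLine(Select("/root/book[last()-1]/title"));        // D
        Console.WriteLine(Select("/root/book[position() < 5]/title"));  // A,B,C,D
        Console.WriteLine(Select("/root/book[@id]/title"));             // A,C
        Console.WriteLine(Select("/root/book[@id='Chinese']/title"));   // A
        Console.WriteLine(Select("/root/book[price > 35]/title"));      // B,C,E
    }
}
```

XPath positions start at 1, so position() < 5 matches positions 1 through 4.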

5. Wildcards: XPath paths also support the wildcards *, @*, node() and text()

Example: /bookstore/* selects all child elements of bookstore

title[@*] selects the title elements that have at least one attribute

6. XPath axes

An axis defines a node set relative to the current node

ancestor   All ancestor nodes

attribute   All attribute nodes

child   All child elements

descendant   All descendant nodes (children, grandchildren, ...)

following   All nodes after the end tag of the current node

preceding   All nodes before the start tag of the current node

following-sibling   All sibling nodes after the current node

preceding-sibling   All sibling nodes before the current node

namespace   All namespace nodes of the current node

parent   The parent node

self   The current node

Usage: axisname::nodetest[predicate]

Example: ancestor::book

child::text()
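Axes are easiest to understand when evaluated from a specific context node. The following is a minimal sketch using System.Xml. The bookstore data and the AxisDemo and Select names are hypothetical. Each result is printed as name=text.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AxisDemo
{
    // Hypothetical data: three books, each with a title and a price.
    private const string Xml = @"<bookstore>
<book id='b1'><title>A</title><price>30</price></book>
<book id='b2'><title>B</title><price>40</price></book>
<book id='b3'><title>C</title><price>50</price></book>
</bookstore>";

    // Evaluates axisXPath relative to the first node matched by contextXPath
    // and returns each selected node as "name=text", joined with commas.
    public static string Select(string contextXPath, string axisXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode context = doc.SelectSingleNode(contextXPath);
        var parts = new List<string>();
        foreach (XmlNode node in context.SelectNodes(axisXPath))
        {
            parts.Add(node.Name + "=" + node.InnerText);
        }
        return string.Join(",", parts);
    }

    public static void Main()
    {
        const string title2 = "/bookstore/book[2]/title";
        Console.WriteLine(Select(title2, "ancestor::book"));        // book=B40
        Console.WriteLine(Select(title2, "parent::*"));             // book=B40
        Console.WriteLine(Select(title2, "self::title"));           // title=B
        Console.WriteLine(Select(title2, "child::text()"));         // #text=B
        Console.WriteLine(Select(title2, "following-sibling::*"));  // price=40
        Console.WriteLine(Select(title2, "following::title"));      // title=C
        Console.WriteLine(Select(title2, "preceding::title"));      // title=A
        Console.WriteLine(Select("/bookstore/book[2]", "attribute::*"));   // id=b2
        Console.WriteLine(Select("/bookstore/book[1]", "descendant::*"));  // title=A,price=30
    }
}
```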

7. Operators

|   Union of two node sets. Example: /root/book[1] | /root/book[3]

+, -, *, div, mod   Arithmetic operators

=, !=, <, >, <=, >=   Comparison operators

or, and   Logical or, logical and
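The operators can be sketched in the same way. The example below is only an illustration under assumed data: node-set expressions go through SelectNodes, while numeric expressions are evaluated with XPathNavigator.Evaluate, which returns a double for number results. The OperatorDemo, Titles and Number names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

public static class OperatorDemo
{
    // Hypothetical data: three books with prices 30, 40 and 50.
    private const string Xml = @"<root>
<book id='Chinese'><title>A</title><price>30</price></book>
<book><title>B</title><price>40</price></book>
<book id='math'><title>C</title><price>50</price></book>
</root>";

    // Text of every node selected by a node-set expression.
    public static string Titles(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        var texts = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            texts.Add(node.InnerText);
        }
        return string.Join(",", texts);
    }

    // Result of an expression that evaluates to a number.
    public static double Number(string expression)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XPathNavigator navigator = doc.CreateNavigator();
        return (double)navigator.Evaluate(expression);
    }

    public static void Main()
    {
        Console.WriteLine(Titles("/root/book[1]/title | /root/book[3]/title"));   // A,C
        Console.WriteLine(Titles("/root/book[price > 35 and @id]/title"));        // C
        Console.WriteLine(Titles("/root/book[@id='Chinese' or price = 40]/title")); // A,B
        Console.WriteLine(Number("10 div 4"));                                    // 2.5
        Console.WriteLine(Number("7 mod 3"));                                     // 1
        Console.WriteLine(Number("sum(/root/book/price) div count(/root/book)")); // 40
    }
}
```

Division is written div because / is already the path separator.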

    // Remove the comment, script and style nodes
    node.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());

    // Iterate over all descendant nodes of node
    foreach (var htmlNode in node.Descendants())
    {
    }

II. Using the HtmlAgilityPack class library

1. First, fetch the HTML page data. This can be done with the WebRequest class

        // Requires: using System.IO; using System.Net; using System.Text;
        public static string GetHtmlStr(string url)
        {
            try
            {
                WebRequest rGet = WebRequest.Create(url);
                // Dispose the response, stream and reader when done
                using (WebResponse rSet = rGet.GetResponse())
                using (Stream s = rSet.GetResponseStream())
                using (StreamReader reader = new StreamReader(s, Encoding.UTF8))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (WebException)
            {
                // Connection failed
                return null;
            }
        }

2. Load the HTML data with the HtmlDocument class

        string htmlStr = GetHtmlStr("http://www.hao123.com");
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(htmlStr);
        HtmlNode rootNode = doc.DocumentNode;

        // Set the XPath path expression based on the content of the web page.
        // This one selects the last font child of every span node whose class attribute is "num".
        string xpathString = "//span[@class='num']/font[last()]";
        HtmlNodeCollection aa = rootNode.SelectNodes(xpathString);  // all matching nodes, as a collection

        if (aa != null)
        {
            string innerText = aa[0].InnerText;
            string color = aa[0].GetAttributeValue("color", "");  // get the color attribute; the second argument is the default value
            // Try the other properties yourself
        }

You can also get an HtmlDocument through the HtmlWeb class.

        HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(url);
        HtmlNode rootNode = doc.DocumentNode;

Addendum:

Query on multiple attribute conditions: //div[@align='center' and @height='24']

Select elements that have no class attribute: //div[not(@class)]

Matching part of an attribute value in HtmlAgilityPack

doc.DocumentNode.SelectNodes("//input[contains(@id, 'BT')]")
doc.DocumentNode.SelectNodes("//input[contains(@name, '__')]")
doc.DocumentNode.SelectNodes("//input[starts-with(@id, 'TB')]")
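The attribute queries above can be tried against an inline HTML string, without any network access. This is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The markup, the AttributeQueryDemo class and the Ids helper are hypothetical. It treats a null result from SelectNodes as "no match", the same way the earlier example checks aa != null.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeQueryDemo
{
    // Hypothetical markup with the attributes used in the queries above.
    private const string Html = @"<html><body>
<div id='d1' align='center' height='24'>x</div>
<div id='d2' align='center' class='c'>y</div>
<input id='BTSave' name='__state'>
<input id='TBName' name='user'>
</body></html>";

    // Returns the id attribute of every node selected by xpath, joined with
    // commas, or an empty string when nothing matches.
    public static string Ids(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return "";  // no match
        }
        return string.Join(",", nodes.Select(n => n.GetAttributeValue("id", "")));
    }

    public static void Main()
    {
        Console.WriteLine(Ids("//div[@align='center' and @height='24']"));
        Console.WriteLine(Ids("//div[not(@class)]"));
        Console.WriteLine(Ids("//input[contains(@id, 'BT')]"));
        Console.WriteLine(Ids("//input[contains(@name, '__')]"));
        Console.WriteLine(Ids("//input[starts-with(@id, 'TB')]"));
        Console.WriteLine(Ids("//span"));
    }
}
```

contains and starts-with are case-sensitive, so 'BT' does not match an id such as btSave.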

