". NET" crawls Web page data using Htmlagilitypack

Source: Internet
Author: User
Tags: xpath

I recently learned XPath path expressions, which are used mainly to find nodes in an XML document: with an XPath expression you can quickly locate and access a node in the document. HTML is an XML-like markup language, although its syntax is less strict. CodePlex hosts an open source project, HtmlAgilityPack, that supports parsing HTML files with XPath. The rest of this article describes how to use that class library.

First, XPath path expressions

An XPath path expression selects a node or a node set in an XML document.

1. Terminology: there are 7 node types: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes

2. Node relationships: parent, children, siblings, ancestors, descendants

3. Path expressions

nodename Selects the child nodes with that name. Example: childnode selects the childnode children of the current node, not its grandchildren

/ Selects from the root node. Example: /root/childnode/grandsonnode

// Selects matching nodes at any depth. Example: //childnode selects all nodes named childnode anywhere in the document

. Represents the current node. Example: ./childnode selects the childnode children of the current node

.. Represents the parent node. Example: ../nearnode selects the nearnode children of the parent node

@ Selects attributes. Example: /root/childnode/@id selects the id attributes of the childnode nodes
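The path expressions above can be tried out with a small console program. This is a minimal sketch that uses the standard System.Xml classes rather than HtmlAgilityPack, so it runs without extra packages; the sample document, the class name PathExpressionDemo, and the helper Select are all invented for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PathExpressionDemo
{
    // Hypothetical sample document, invented for this illustration.
    private const string SampleXml =
        "<root>" +
          "<childnode id='a'><grandsonnode/></childnode>" +
          "<childnode id='b'><childnode id='c'/></childnode>" +
          "<nearnode/>" +
        "</root>";

    // Evaluates xpath with the node found at contextPath as the current node.
    // Returns the value of each matched attribute, or the name of any other node.
    public static List<string> Select(string contextPath, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        XmlNode context = doc.SelectSingleNode(contextPath);
        var result = new List<string>();
        foreach (XmlNode n in context.SelectNodes(xpath))
        {
            result.Add(n.NodeType == XmlNodeType.Attribute ? n.Value : n.Name);
        }
        return result;
    }

    public static void Main()
    {
        // nodename: direct children of the current node only (2 matches)
        Console.WriteLine(Select("/root", "childnode").Count);
        // //: every childnode in the document, at any depth (3 matches)
        Console.WriteLine(Select("/root", "//childnode").Count);
        // /: absolute path from the root node
        Console.WriteLine(Select("/root", "/root/childnode/grandsonnode").Count);
        // ..: step up to the parent, then down to its nearnode child
        Console.WriteLine(string.Join(",", Select("/root/childnode[1]", "../nearnode")));
        // @: the id attributes of the childnode children of root
        Console.WriteLine(string.Join(",", Select("/root", "/root/childnode/@id")));
    }
}
```

Note the difference between the first two calls: the nested childnode with id c is found only by the // form, because a bare node name never reaches grandchildren.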

4. Predicates

Predicates, written in square brackets, filter a node set to make the selection more precise

/root/book[1] The first book node in the node set

/root/book[last()] The last book node in the node set

/root/book[last()-1] The second-to-last book node in the node set

/root/book[position() < 5] The first four book nodes in the node set

/root/book[@id] The book nodes that have an id attribute

/root/book[@id='Chinese'] The book nodes whose id attribute value is Chinese

/root/book[price > 35]/title The title nodes of the book nodes whose price element value is greater than 35
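To see what each predicate keeps, the sketch below runs them against a five-book catalogue and prints the titles of the selected books. It again relies on System.Xml only; the catalogue data, the class PredicateDemo, and the helper Titles are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PredicateDemo
{
    // Hypothetical catalogue, invented for this illustration.
    private const string SampleXml =
        "<root>" +
          "<book id='Chinese'><title>A</title><price>20</price></book>" +
          "<book><title>B</title><price>40</price></book>" +
          "<book id='Math'><title>C</title><price>36.5</price></book>" +
          "<book><title>D</title><price>10</price></book>" +
          "<book><title>E</title><price>50</price></book>" +
        "</root>";

    // Returns the titles of the book nodes selected by bookXPath, joined by commas.
    public static string Titles(string bookXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        var titles = new List<string>();
        foreach (XmlNode book in doc.SelectNodes(bookXPath))
        {
            titles.Add(book.SelectSingleNode("title").InnerText);
        }
        return string.Join(",", titles);
    }

    public static void Main()
    {
        Console.WriteLine(Titles("/root/book[1]"));              // A
        Console.WriteLine(Titles("/root/book[last()]"));         // E
        Console.WriteLine(Titles("/root/book[last()-1]"));       // D
        Console.WriteLine(Titles("/root/book[position() < 5]")); // A,B,C,D
        Console.WriteLine(Titles("/root/book[@id]"));            // A,C
        Console.WriteLine(Titles("/root/book[@id='Chinese']"));  // A
        Console.WriteLine(Titles("/root/book[price > 35]"));     // B,C,E
    }
}
```

XPath positions start at 1, not 0, which is why position() < 5 keeps four books rather than five.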

5. Wildcards: XPath paths also support wildcards and node tests (*, @*, node(), text())

Example: /bookstore/* selects all child elements of bookstore

title[@*] selects the title elements that have at least one attribute
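The wildcards are easiest to compare by counting what each one matches. The following is a small sketch under the same assumptions as before: System.Xml only, with an invented two-book document and a made-up helper named Count.

```csharp
using System;
using System.Xml;

public static class WildcardDemo
{
    // Hypothetical sample document, invented for this illustration.
    private const string SampleXml =
        "<bookstore>" +
          "<book><title lang='en'>A</title><price>20</price></book>" +
          "<book><title>B</title><price>40</price></book>" +
        "</bookstore>";

    // Returns how many nodes the expression selects in the sample document.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/bookstore/*"));              // 2: both book elements
        Console.WriteLine(Count("//*"));                       // 7: every element in the document
        Console.WriteLine(Count("//title[@*]"));               // 1: only the title with a lang attribute
        Console.WriteLine(Count("//@*"));                      // 1: the single lang attribute
        Console.WriteLine(Count("/bookstore/book[1]/node()")); // 2: title and price
        Console.WriteLine(Count("//title/text()"));            // 2: the text nodes A and B
    }
}
```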

6. XPath axes

An axis defines a node set relative to the current node

ancestor All ancestor nodes of the current node

attribute All attribute nodes of the current node

child All child nodes of the current node

descendant All descendant nodes (children, grandchildren, ...)

following All nodes after the end tag of the current node

preceding All nodes before the start tag of the current node

following-sibling All sibling nodes after the current node

preceding-sibling All sibling nodes before the current node

namespace All namespace nodes of the current node

parent The parent node

self The current node

Usage: axisname::nodetest[predicate]

Example: ancestor::book

child::text()
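Because every axis is relative to the current node, the sketch below fixes one current node (the title of the second book) and evaluates several axes from there. It is an illustration only, built on System.Xml; the document, the class AxisDemo, and the helper FromSecondTitle are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AxisDemo
{
    // Hypothetical sample document, invented for this illustration.
    private const string SampleXml =
        "<bookstore>" +
          "<book><title>First</title><price>10</price></book>" +
          "<book><title>Second</title><price>20</price></book>" +
          "<book><title>Third</title><price>30</price></book>" +
        "</bookstore>";

    // Evaluates xpath with the title of the second book as the current node
    // and returns the text of every selected node, joined by commas.
    public static string FromSecondTitle(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        XmlNode current = doc.SelectSingleNode("/bookstore/book[2]/title");
        var texts = new List<string>();
        foreach (XmlNode n in current.SelectNodes(xpath))
        {
            texts.Add(n.InnerText);
        }
        return string.Join(",", texts);
    }

    public static void Main()
    {
        Console.WriteLine(FromSecondTitle("self::title"));           // Second
        Console.WriteLine(FromSecondTitle("child::text()"));         // Second
        Console.WriteLine(FromSecondTitle("following-sibling::*"));  // 20
        Console.WriteLine(FromSecondTitle("ancestor::book/title"));  // Second
        Console.WriteLine(FromSecondTitle("preceding::title"));      // First
        Console.WriteLine(FromSecondTitle("following::title"));      // Third
        Console.WriteLine(FromSecondTitle("parent::book/preceding-sibling::book/title")); // First
    }
}
```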

7. Operators

| Union of two node sets. Example: /root/book[1] | /root/book[3]

+, -, *, div, mod Arithmetic operators

=, !=, <, >, <=, >= Comparison operators

or, and Logical OR and logical AND
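Operators can also be evaluated on their own, without selecting nodes. The sketch below uses XPathNavigator.Evaluate from System.Xml.XPath, which returns a boxed double for numeric expressions and a boxed bool for boolean ones; the sample document and the helper Eval are assumptions for the illustration.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class OperatorDemo
{
    // Hypothetical sample document, invented for this illustration.
    private const string SampleXml =
        "<root><book>A</book><book>B</book><book>C</book></root>";

    // Evaluates an XPath expression against the sample document.
    // The result is a double, a bool, a string, or a node iterator.
    public static object Eval(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        XPathNavigator navigator = doc.CreateNavigator();
        return navigator.Evaluate(xpath);
    }

    public static void Main()
    {
        // Union: the first and third book together form a set of 2 nodes
        Console.WriteLine(Eval("count(/root/book[1] | /root/book[3])"));
        // Arithmetic: division is written div, the remainder is mod
        Console.WriteLine(Eval("10 div 4"));
        Console.WriteLine(Eval("7 mod 3"));
        // Comparison and logical operators evaluate to a boolean
        Console.WriteLine(Eval("count(/root/book) >= 3 and 1 != 2"));
        Console.WriteLine(Eval("1 > 2 or 2 > 1"));
    }
}
```

The division operator is spelled div because the slash is already taken by path steps.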

 

    // Remove the comment, script, and style nodes (Where and ToList need "using System.Linq;")
    node.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());

    // Iterate over all remaining descendant nodes of node
    foreach (var htmlNode in node.Descendants())
    {
    }

Second, using the HtmlAgilityPack class library

1. First, fetch the HTML of the page, which can be done with the WebRequest class

        // Requires: using System.IO; using System.Net; using System.Text;
        public static string GetHtmlStr(string url)
        {
            try
            {
                WebRequest rGet = WebRequest.Create(url);
                // The using statements release the response and the streams when done
                using (WebResponse rSet = rGet.GetResponse())
                using (Stream s = rSet.GetResponseStream())
                using (StreamReader reader = new StreamReader(s, Encoding.UTF8))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (WebException)
            {
                // Connection failed
                return null;
            }
        }

2. Load the HTML data with the HtmlDocument class

        // Note: GetHtmlStr returns null when the request fails, so check for null in real code
        string htmlStr = GetHtmlStr("http://www.hao123.com");
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(htmlStr);
        HtmlNode rootNode = doc.DocumentNode;

        // XPath path expression: selects the last font child of every span node
        // whose class attribute value is num.
        // Set the XPath path expression based on the content of the web page.
        string xpathString = "//span[@class='num']/font[last()]";
        HtmlNodeCollection aa = rootNode.SelectNodes(xpathString); // all matching nodes, as a collection
        if (aa != null)
        {
            string innerText = aa[0].InnerText;
            // Get the color attribute; the second argument is the default value
            string color = aa[0].GetAttributeValue("color", "");
            // Try the other properties yourself
        }

You can also obtain an HtmlDocument through the HtmlWeb class.

        HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(url);
        HtmlNode rootNode = doc.DocumentNode;

Additional notes:

Query on multiple attribute conditions: //div[@align='center' and @height='24']

Nodes without a class attribute: //div[not(@class)]

Wildcard-style attribute matching in HtmlAgilityPack

doc.DocumentNode.SelectNodes("//input[contains(@id,'BT')]")
doc.DocumentNode.SelectNodes("//input[contains(@name,'__')]")
doc.DocumentNode.SelectNodes("//input[starts-with(@id,'TB')]")
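The attribute queries from these notes can be checked against an in-memory HTML string, with no network access. This is a rough sketch that assumes the HtmlAgilityPack NuGet package is referenced; the HTML fragment, the class AttributeQueryDemo, and the helper Count are invented for the example. It uses only LoadHtml and SelectNodes, the same calls shown earlier.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AttributeQueryDemo
{
    // Hypothetical HTML fragment, invented for this illustration.
    private const string SampleHtml =
        "<html><body>" +
          "<div align='center' height='24'>one</div>" +
          "<div class='x'>two</div>" +
          "<input id='BT_ok' name='__token'>" +
          "<input id='TB_name' name='user'>" +
        "</body></html>";

    // Returns how many nodes the expression selects.
    // SelectNodes returns null, not an empty collection, when nothing matches.
    public static int Count(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Both attribute conditions must hold
        Console.WriteLine(Count("//div[@align='center' and @height='24']"));
        // div elements that have no class attribute
        Console.WriteLine(Count("//div[not(@class)]"));
        // Substring and prefix matches on attribute values (case-sensitive)
        Console.WriteLine(Count("//input[contains(@id,'BT')]"));
        Console.WriteLine(Count("//input[contains(@name,'__')]"));
        Console.WriteLine(Count("//input[starts-with(@id,'TB')]"));
    }
}
```

With this fragment each of the five queries is expected to match exactly one node. Wrapping the null check in a helper like Count avoids repeating it at every call site.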

". NET" crawls Web page data using Htmlagilitypack
