I recently learned XPath path expressions. They are mainly used to search for nodes in an XML document: with an XPath expression you can quickly locate and access a node in the document. HTML is also an XML-like markup language, although its syntax is less strict. CodePlex hosts an open source project, HtmlAgilityPack, which supports parsing HTML files with XPath. These notes cover how to use that class library.
I. XPath path expressions
An XPath path expression is used to select a node or a node set in an XML document.
1. Terminology: there are 7 node types: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes
2. Node relationships: parent, children, sibling, ancestor, descendant
3. Path expressions
nodename   Selects the child nodes with that name. Example: childnode selects the childnode children of the current node, not the grandchildren
/   Selects from the root node. Example: /root/childnode/grandsonnode
//   Selects matching nodes at any depth. Example: //childnode selects all descendant nodes named childnode
.   The current node. Example: ./childnode selects the childnode children of the current node
..   The parent node. Example: ../nearnode selects the nearnode children of the parent node
@   Selects attributes. Example: /root/childnode/@id selects the id attributes of the childnode nodes
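The path expressions above can be tried out with a short sketch. It uses System.Xml from the .NET base class library rather than HtmlAgilityPack, so it runs without extra packages. The sample document, the class name PathExpressionDemo, and the helper names are made up for illustration.

```csharp
using System;
using System.Xml;

public static class PathExpressionDemo
{
    // Hypothetical sample document; element names follow the examples above.
    private const string SampleXml =
        "<root>" +
          "<childnode id='a'><grandsonnode/></childnode>" +
          "<childnode id='b'><childnode id='c'/></childnode>" +
          "<nearnode/>" +
        "</root>";

    public static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    // Counts the nodes selected by an XPath expression evaluated relative to `context`.
    public static int Count(XmlNode context, string xpath)
    {
        return context.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        XmlDocument doc = Load();
        XmlNode root = doc.DocumentElement;
        XmlNode first = root.SelectSingleNode("childnode[1]");

        Console.WriteLine(Count(root, "childnode"));                   // children only, no grandchildren
        Console.WriteLine(Count(doc, "/root/childnode/grandsonnode")); // absolute path from the root
        Console.WriteLine(Count(doc, "//childnode"));                  // any depth in the document
        Console.WriteLine(Count(first, "./grandsonnode"));             // relative to the current node
        Console.WriteLine(Count(first, "../nearnode"));                // reached through the parent node
        Console.WriteLine(Count(doc, "/root/childnode/@id"));          // id attribute nodes
    }
}
```

Note that childnode evaluated on the root finds two nodes while //childnode finds three, because the nested childnode is a grandchild of the root.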
4. Predicates
Predicates restrict a node set so that the selection is more precise
/root/book[1]   The first book node in the node set
/root/book[last()]   The last book node in the node set
/root/book[last()-1]   The second-to-last book node in the node set
/root/book[position() < 5]   The first four book nodes in the node set
/root/book[@id]   The book nodes that have an id attribute
/root/book[@id='Chinese']   The book nodes whose id attribute value is Chinese
/root/book[price > 35]/title   The title nodes of the book nodes whose price element value is greater than 35
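A minimal sketch of the predicates listed above, again using System.Xml so that it is self-contained. The five-book document and the class name PredicateDemo are assumptions made for the example. Each expression ends in /title so that the selected books can be told apart by their text.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PredicateDemo
{
    // Hypothetical sample data: five books with titles T1..T5.
    private const string SampleXml =
        "<root>" +
          "<book id='Chinese'><title>T1</title><price>30</price></book>" +
          "<book><title>T2</title><price>40</price></book>" +
          "<book id='math'><title>T3</title><price>50</price></book>" +
          "<book><title>T4</title><price>20</price></book>" +
          "<book><title>T5</title><price>36</price></book>" +
        "</root>";

    // Returns the text of every node selected by the XPath expression, joined with commas.
    public static string Texts(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        var texts = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            texts.Add(node.InnerText);
        }
        return string.Join(",", texts);
    }

    public static void Main()
    {
        Console.WriteLine(Texts("/root/book[1]/title"));                // T1
        Console.WriteLine(Texts("/root/book[last()]/title"));           // T5
        Console.WriteLine(Texts("/root/book[last()-1]/title"));         // T4
        Console.WriteLine(Texts("/root/book[position() < 5]/title"));   // T1,T2,T3,T4
        Console.WriteLine(Texts("/root/book[@id]/title"));              // T1,T3
        Console.WriteLine(Texts("/root/book[@id='Chinese']/title"));    // T1
        Console.WriteLine(Texts("/root/book[price > 35]/title"));       // T2,T3,T5
    }
}
```

XPath positions start at 1, so position() < 5 matches positions 1 through 4.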
5. Wildcards: XPath paths also support wildcards (*, @*, node(), text())
Example: /bookstore/*   All child elements of bookstore
title[@*]   The title elements that have at least one attribute
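The difference between the four wildcards is easiest to see by counting what each one matches. The following is a sketch only; the bookstore document and the class name WildcardDemo are invented for the example, and System.Xml stands in for HtmlAgilityPack.

```csharp
using System;
using System.Xml;

public static class WildcardDemo
{
    // Hypothetical sample: two books and one comment directly under bookstore.
    private const string SampleXml =
        "<bookstore>" +
          "<book><title lang='en'>A</title><price>10</price></book>" +
          "<book><title>B</title></book>" +
          "<!-- a comment node -->" +
        "</bookstore>";

    // Counts the nodes selected by the XPath expression in the sample document.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/bookstore/*"));      // 2: element children only
        Console.WriteLine(Count("/bookstore/node()")); // 3: elements plus the comment
        Console.WriteLine(Count("//title[@*]"));       // 1: titles with at least one attribute
        Console.WriteLine(Count("//title/@*"));        // 1: the attribute nodes themselves
        Console.WriteLine(Count("//title/text()"));    // 2: text nodes inside the titles
    }
}
```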
6. XPath axes
An axis defines a node set relative to the current node
ancestor   All ancestor nodes
attribute   All attribute nodes
child   All child nodes
descendant   All descendant nodes (children, grandchildren, ...)
following   All nodes after the end tag of the current node
preceding   All nodes before the start tag of the current node
following-sibling   All sibling nodes after the current node
preceding-sibling   All sibling nodes before the current node
namespace   All namespace nodes of the current node
parent   The parent node
self   The current node
Usage: axisname::nodetest[predicate]
Example: ancestor::book
child::text()
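As a rough illustration of the axes, the sketch below evaluates several of them from one context node, the price element of the first book. The document, the class name AxisDemo, and the choice of context node are assumptions for the example. Only counts are printed, since the order in which an API returns nodes from a reverse axis is not something this sketch relies on.

```csharp
using System;
using System.Xml;

public static class AxisDemo
{
    // Hypothetical sample document.
    private const string SampleXml =
        "<root>" +
          "<book><title>A</title><price currency='CNY'>10</price><note>n</note></book>" +
          "<book><title>B</title></book>" +
        "</root>";

    // Counts the nodes selected by `xpath`, using the first price element as the context node.
    public static int CountFromPrice(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        XmlNode price = doc.SelectSingleNode("/root/book[1]/price");
        return price.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountFromPrice("ancestor::book"));       // 1: the enclosing book
        Console.WriteLine(CountFromPrice("ancestor::*"));          // 2: book and root
        Console.WriteLine(CountFromPrice("parent::book"));         // 1
        Console.WriteLine(CountFromPrice("self::price"));          // 1
        Console.WriteLine(CountFromPrice("attribute::*"));         // 1: currency
        Console.WriteLine(CountFromPrice("child::text()"));        // 1: the text "10"
        Console.WriteLine(CountFromPrice("preceding-sibling::*")); // 1: title
        Console.WriteLine(CountFromPrice("following-sibling::*")); // 1: note
        Console.WriteLine(CountFromPrice("preceding::*"));         // 1: title (ancestors are excluded)
        Console.WriteLine(CountFromPrice("following::*"));         // 3: note, second book, its title
    }
}
```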
7. Operators
|   Union of two node sets. Example: /root/book[1] | /root/book[3]
+, -, *, div, mod
=, !=, <, >, <=, >=
or, and   Logical or, logical and
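The operators can be sketched in the same way. Node-set results are counted with SelectNodes, and numeric expressions are evaluated with XPathNavigator.Evaluate from System.Xml.XPath. The sample prices and the class name OperatorDemo are made up for the example.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class OperatorDemo
{
    // Hypothetical sample data: five books priced 30, 40, 50, 20, 36.
    private const string SampleXml =
        "<root>" +
          "<book><price>30</price></book>" +
          "<book><price>40</price></book>" +
          "<book><price>50</price></book>" +
          "<book><price>20</price></book>" +
          "<book><price>36</price></book>" +
        "</root>";

    private static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    // Counts the nodes selected by a node-set expression.
    public static int Count(string xpath)
    {
        return Load().SelectNodes(xpath).Count;
    }

    // Evaluates an expression that yields a number (XPath numbers map to double).
    public static double Number(string xpath)
    {
        XPathNavigator navigator = Load().CreateNavigator();
        return (double)navigator.Evaluate(xpath);
    }

    public static void Main()
    {
        Console.WriteLine(Count("/root/book[1] | /root/book[3]"));            // 2: union of two node sets
        Console.WriteLine(Count("/root/book[price > 20 and price < 45]"));    // 3: prices 30, 40, 36
        Console.WriteLine(Count("/root/book[price = 20 or price = 50]"));     // 2
        Console.WriteLine(Count("/root/book[position() mod 2 = 1]"));         // 3: books 1, 3, 5
        Console.WriteLine(Number("6 div 4"));                                 // 1.5
        Console.WriteLine(Number("7 mod 3"));                                 // 1
    }
}
```

Division is written div because / is already the path separator.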
Delete the comment, script, and style nodes (node is an HtmlNode; requires using System.Linq):
    node.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());
Iterate over all descendant nodes of node:
    foreach (var htmlNode in node.Descendants())
    {
    }
II. Using the HtmlAgilityPack class library
1. First, fetch the HTML of the page. This can be done with the WebRequest class.
    // Requires: using System.IO; using System.Net; using System.Text;
    public static string GetHtmlStr(string url)
    {
        try
        {
            WebRequest rGet = WebRequest.Create(url);
            WebResponse rSet = rGet.GetResponse();
            Stream s = rSet.GetResponseStream();
            StreamReader reader = new StreamReader(s, Encoding.UTF8);
            return reader.ReadToEnd();
        }
        catch (WebException)
        {
            // Connection failed
            return null;
        }
    }
2. Load the HTML data with the HtmlDocument class
    string htmlStr = GetHtmlStr("http://www.hao123.com");
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    HtmlNode rootNode = doc.DocumentNode;
    // XPath path expression: selects the last font child of every span node
    // whose class attribute value is num.
    // Set the XPath path expression according to the content of the web page.
    string xpathString = "//span[@class='num']/font[last()]";
    HtmlNodeCollection aa = rootNode.SelectNodes(xpathString); // all matching nodes, as a collection
    if (aa != null)
    {
        string innerText = aa[0].InnerText;
        // Get the color attribute; the second parameter is the default value
        string color = aa[0].GetAttributeValue("color", "");
        // Try the other properties yourself
    }
You can also obtain an HtmlDocument through the HtmlWeb class.
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);
    HtmlNode rootNode = doc.DocumentNode;
Additional notes:
Query with multiple attribute conditions: //div[@align='center' and @height='24']
Nodes that have no class attribute: //div[not(@class)]
Wildcard-style matching in HtmlAgilityPack
    doc.DocumentNode.SelectNodes("//input[contains(@id, 'BT')]")
    doc.DocumentNode.SelectNodes("//input[contains(@name, '__')]")
    doc.DocumentNode.SelectNodes("//input[starts-with(@id, 'TB')]")
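The expressions in these notes are standard XPath 1.0 (and, not(), contains(), starts-with()), so they can be checked outside HtmlAgilityPack as well. The sketch below does that with System.Xml on a small well-formed fragment. The fragment, the attribute values, and the class name FunctionDemo are hypothetical.

```csharp
using System;
using System.Xml;

public static class FunctionDemo
{
    // Hypothetical well-formed fragment standing in for a real page.
    private const string SampleXml =
        "<form>" +
          "<div align='center' height='24'>" +
            "<input id='BTOk' name='__state'/>" +
            "<input id='TBName' name='user'/>" +
          "</div>" +
          "<div class='footer'>" +
            "<input id='TBMail' name='mail'/>" +
          "</div>" +
        "</form>";

    // Counts the nodes selected by the XPath expression in the sample fragment.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("//div[@align='center' and @height='24']")); // 1: both conditions hold
        Console.WriteLine(Count("//div[not(@class)]"));                      // 1: the div without class
        Console.WriteLine(Count("//input[contains(@id, 'BT')]"));            // 1: BTOk
        Console.WriteLine(Count("//input[contains(@name, '__')]"));          // 1: __state
        Console.WriteLine(Count("//input[starts-with(@id, 'TB')]"));         // 2: TBName, TBMail
    }
}
```

String comparison in these functions is case-sensitive, so 'BT' does not match an id that starts with 'bt'.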
". NET" crawls Web page data using Htmlagilitypack