Original: Learning HtmlAgilityPack, an HTML parsing tool
HtmlAgilityPack is an open-source HTML parsing library, sometimes described as the C# version of jQuery, and it is very powerful.
This article covers only its parsing features. The library can also simulate user requests, create HTML, set up a proxy, and so on, but those features are not studied here.
----------------------------------------------------------------------------
1. Simple examples
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the page
            HtmlWeb webClient = new HtmlWeb();
            HtmlDocument doc = webClient.Load("http://www.baidu.com");
            var rootNode = doc.DocumentNode;

            // Select the <body> element with an XPath expression
            HtmlNodeCollection categoryNodeList = rootNode.SelectNodes("//html[1]/body[1]");
            foreach (var item in categoryNodeList)
            {
                Console.WriteLine("Item:" + item.Name);
            }
            Console.Read();
        }
    }
}
This is the first "hello world": it scrapes the Baidu home page.
----------------------------------------------------------------------------
2. Read
If the HTML comes from a local file, a stream, or a string instead, you can load it like this.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\xxx.mht", false);

public void LoadHtml(string html); // HTML passed directly as a string
public void Load(Stream stream);   // stream
public void Load(string path);     // local path
HtmlDocument itself also provides a method for detecting the encoding (DetectEncoding).
HtmlWeb detects the encoding automatically by default; to customize this, change its OverrideEncoding and AutoDetectEncoding properties. HtmlDocument handles the encoding differently: it is specified in the parameters of the Load overloads. Automatic detection is presumably good enough that the encoding rarely needs to be specified by hand.
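The loading and encoding options above can be sketched roughly as follows. This is only a minimal sketch: the class name LoadDemo, its method names, and the sample HTML strings are made up for illustration, and the choice of UTF-8 is an assumption rather than something the library requires.

```csharp
using System.IO;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class LoadDemo
{
    // Parse HTML that is already held in a string.
    public static string TitleFromString()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><head><title>Hello</title></head><body></body></html>");
        return doc.DocumentNode.SelectSingleNode("//title").InnerText;
    }

    // Parse HTML from a stream, passing the encoding as a parameter.
    public static string TitleFromStream()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><head><title>Stream</title></head><body></body></html>");
        using (var stream = new MemoryStream(bytes))
        {
            var doc = new HtmlDocument();
            doc.Load(stream, Encoding.UTF8);
            return doc.DocumentNode.SelectSingleNode("//title").InnerText;
        }
    }

    // Fetch a page with HtmlWeb while forcing an encoding instead of
    // relying on automatic detection (assumed here to be UTF-8).
    public static HtmlDocument FetchWithEncoding(string url)
    {
        var web = new HtmlWeb
        {
            AutoDetectEncoding = false,
            OverrideEncoding = Encoding.UTF8
        };
        return web.Load(url);
    }
}
```

In this sketch LoadDemo.TitleFromString() and LoadDemo.TitleFromStream() work entirely in memory, while FetchWithEncoding needs network access.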
----------------------------------------------------------------------------
3. Node selection
rootNode.SelectNodes
rootNode.SelectSingleNode
The first selects a collection of nodes; the second selects a single node.
Take SelectNodes as an example and look at its parameter:
rootNode.SelectNodes("//html[1]/body[1]");
"//" double slash searches all descendant nodes at any depth, starting from the document root.
"/" single slash searches only the direct (first-level) child nodes.
"./" dot slash starts the lookup from the current node.
"[]" brackets hold the index of a child node among siblings of the same name (counting from 1), or a filter condition.
var resultList = rootNode.SelectNodes("//html[1]/body[1]/div[1]/div[position()<5]"); // take the first 4 elements
resultList = rootNode.SelectNodes("//html[1]/body[1]/div[1]/div[last()]");           // take the last element
resultList = rootNode.SelectNodes("//html[1]/body[1]/div[1]/div[@id]");              // take all elements that have an id attribute
resultList = rootNode.SelectNodes("//html[1]/body[1]/div[1]/div[@id='head']");       // take the element whose id attribute is "head"
More XPath functions are listed at W3School: http://www.w3school.com.cn/xpath/xpath_functions.asp
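The difference between the "//", "/" and "./" prefixes is easiest to see when the query is run from a node other than the root. The following is a small sketch on an in-memory document; the class name XPathScopeDemo, its method names, and the sample HTML are invented for illustration.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class XPathScopeDemo
{
    const string Html =
        "<html><body>" +
        "<div id='a'><p>one</p></div>" +
        "<div id='b'><p>two</p><p>three</p></div>" +
        "</body></html>";

    // Returns the second <div> (id='b') of the sample document.
    static HtmlNode SecondDiv()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//div[2]");
    }

    // "//p" searches the whole document, even when called on a nested node.
    public static int AllParagraphs()
    {
        return SecondDiv().SelectNodes("//p").Count;
    }

    // "./p" looks only at direct <p> children of the current node.
    public static int RelativeParagraphs()
    {
        return SecondDiv().SelectNodes("./p").Count;
    }

    // "/html/body/div" walks one level at a time from the document root.
    public static int AbsoluteDivs()
    {
        return SecondDiv().SelectNodes("/html/body/div").Count;
    }
}
```

One thing worth keeping in mind: SelectNodes returns null rather than an empty collection when nothing matches, so real code should check the result before reading Count or iterating.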
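Getting an attribute
node.Attributes["id"]; // node is an HtmlNode; Attributes belongs to the node, not to the document
Getting an element
doc.GetElementbyId("id"); // note the library's spelling: GetElementbyId, with a lowercase "b"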
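Looking up an element by id and reading its attributes might be put together as in the sketch below. The class name AttributeDemo, its method names, the sample HTML, and the "(none)" default value are all assumptions made for the example.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class AttributeDemo
{
    const string Html =
        "<html><body><div id='head' class='top'>Title</div></body></html>";

    static HtmlNode Head()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        // Look the element up by its id attribute.
        return doc.GetElementbyId("head");
    }

    // Read an attribute through the Attributes indexer.
    public static string ClassOfHead()
    {
        return Head().Attributes["class"].Value;
    }

    // GetAttributeValue returns the supplied default when the attribute is
    // missing, which avoids a null check on the indexer result.
    public static string TitleOrDefault()
    {
        return Head().GetAttributeValue("title", "(none)");
    }
}
```

The indexer returns null for an attribute that does not exist, so GetAttributeValue is usually the safer choice when the attribute is optional.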