Shi linfeng: a small wrapper class library for the open-source HtmlAgilityPack
Anyone who writes software has probably had to collect information at some point, and I often build data-collection or automation tools for particular websites.
Getting the content of a target page is the easy part: in .NET you can use HttpWebRequest and HttpWebResponse, or WebClient.
The hard part is filtering out the key information once the page content has been retrieved. At first, I mainly used regular expressions to match the target data.
This approach works, but it is difficult for beginners and for developers who are not familiar with regular expressions, especially complicated ones.
It is best to test an expression in a dedicated tool first and only then put it into the program. Regextester.exe is recommended here.
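For comparison, here is a minimal sketch of the regular-expression approach described above. The HTML fragment and the pattern are made up for illustration and are not taken from any real site; the pattern only works while the markup stays exactly in this shape, which is the fragility the text refers to.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        // Hypothetical fragment of a downloaded page.
        const string html =
            "<span class=\"time\">09:00</span><span class=\"time\">10:00</span>";

        // Lazy group (.*?) captures the text between the opening and closing tag.
        var pattern = new Regex("<span class=\"time\">(.*?)</span>",
                                RegexOptions.Singleline);

        foreach (Match m in pattern.Matches(html))
        {
            // Groups[1] is the first capture group, i.e. the span's text.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```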
Later, I came across HtmlAgilityPack by chance. It is an open-source class library. To study the source code, click here: HtmlAgilityPack source code
It was convenient from the start: create a new document, find the target node, and read an attribute or the text. As I used it more, I had the idea of wrapping it in a class library of my own, which I have kept improving and updating along the way. The current version is fairly stable.
To reference HtmlAgilityPack.dll, install the package through NuGet in Visual Studio.
First, the source code of the class library:
using HtmlAgilityPack;

/// <summary>
/// Helper class for parsing HTML documents.
/// </summary>
public class HtmlParse
{
    private readonly HtmlDocument doc = new HtmlDocument();

    /// <summary>
    /// Initializes and parses the document, using UTF-8 by default.
    /// </summary>
    /// <param name="htmlOrUrl">An HTML string or a URL</param>
    public HtmlParse(string htmlOrUrl)
    {
        InitDoc(htmlOrUrl);
    }

    /// <summary>
    /// Initializes and parses the document with the given encoding.
    /// </summary>
    /// <param name="htmlOrUrl">An HTML string or a URL</param>
    /// <param name="encode">Character encoding</param>
    public HtmlParse(string htmlOrUrl, string encode)
    {
        InitDoc(htmlOrUrl, encode);
    }

    /// <summary>
    /// Loads the document from a URL or an HTML string and parses it.
    /// </summary>
    /// <param name="htmlOrUrl">An HTML string or a URL</param>
    /// <param name="encode">Encoding of the website</param>
    /// <returns>The parsed document</returns>
    public HtmlDocument InitDoc(string htmlOrUrl, string encode = "UTF-8")
    {
        // Input that starts with "http" is treated as a URL and downloaded first.
        if (htmlOrUrl.Trim().StartsWith("http"))
        {
            htmlOrUrl = NetHelper.GetPageStr(htmlOrUrl, "", encode);
        }
        doc.LoadHtml(htmlOrUrl);
        return doc;
    }

    /// <summary>
    /// Gets a node set. SelectNodes returns null when nothing matches.
    /// </summary>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The matching nodes, or null</returns>
    public HtmlNodeCollection GetNodes(string xPath)
    {
        return doc.DocumentNode.SelectNodes(xPath);
    }

    /// <summary>
    /// Gets a single node.
    /// </summary>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The first matching node, or null</returns>
    public HtmlNode GetNode(string xPath)
    {
        return doc.DocumentNode.SelectSingleNode(xPath);
    }

    /// <summary>
    /// Gets the value of a node attribute.
    /// </summary>
    /// <param name="node">Node</param>
    /// <param name="attrName">Attribute name</param>
    /// <returns>The attribute value, or an empty string</returns>
    public string GetNodeAttr(HtmlNode node, string attrName)
    {
        if (node == null || node.Attributes[attrName] == null)
        {
            return string.Empty;
        }
        return node.Attributes[attrName].Value;
    }

    /// <summary>
    /// Gets the InnerText value of a node.
    /// </summary>
    /// <param name="node">Node</param>
    /// <returns>The inner text, or an empty string</returns>
    public string GetNodeText(HtmlNode node)
    {
        if (node == null)
        {
            return string.Empty;
        }
        return node.InnerText;
    }

    /// <summary>
    /// Gets the InnerHtml or OuterHtml value of a node.
    /// </summary>
    /// <param name="node">Node</param>
    /// <param name="isOuter">Whether to return OuterHtml</param>
    /// <returns>The HTML, or an empty string</returns>
    public string GetNodeHtml(HtmlNode node, bool isOuter = false)
    {
        if (node == null)
        {
            return string.Empty;
        }
        return isOuter ? node.OuterHtml : node.InnerHtml;
    }

    /// <summary>
    /// Gets an attribute value based on the XPath and the attribute name.
    /// </summary>
    /// <param name="xPath">XPath expression</param>
    /// <param name="attrName">Attribute name</param>
    /// <returns>The attribute value, or an empty string</returns>
    public string GetNodeAttr(string xPath, string attrName)
    {
        var node = GetNode(xPath);
        return GetNodeAttr(node, attrName);
    }

    /// <summary>
    /// Gets the InnerText of a node based on XPath.
    /// </summary>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The inner text, or an empty string</returns>
    public string GetNodeText(string xPath)
    {
        var node = GetNode(xPath);
        return GetNodeText(node);
    }

    /// <summary>
    /// Gets the InnerHtml or OuterHtml of a node based on XPath.
    /// </summary>
    /// <param name="xPath">XPath expression</param>
    /// <param name="isOuter">Whether to return OuterHtml</param>
    /// <returns>The HTML, or an empty string</returns>
    public string GetNodeHtml(string xPath, bool isOuter = false)
    {
        var node = GetNode(xPath);
        // Pass isOuter through; otherwise the flag would be silently ignored.
        return GetNodeHtml(node, isOuter);
    }
}
Tip: to use HtmlAgilityPack effectively, you need to understand XPath. If you don't, please go here: XPath getting started tutorial
In fact, a handful of XPath points solve about 80% of the problems:
1. A path that starts with a single slash (/) selects from the root node. A path that starts with a double slash (//) selects matching nodes anywhere in the document, regardless of their position.
2. You can use attributes to locate the node or node set to be selected. For example, //span[@class="time"] selects all span elements with class="time" in the document.
3. Use [i] to select a node by position. For example, //span[@class="time"][1] selects the first span with class="time" under each parent element; to get the first match in the whole document, write (//span[@class="time"])[1]. Note that the index starts from 1, not 0.
4. Use | for fault-tolerant selection. For example, if a piece of data may be in <div class="a1"></div> or in <div class="a2"></div>, use //div[@class="a1"] | //div[@class="a2"] as the XPath.
5. Single quotation marks can be used in XPath. Because strings in C# are delimited by double quotation marks, use single quotation marks inside the XPath so that no escaping is required.
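The points above can be tried out directly against HtmlAgilityPack. The following is a minimal sketch; the sample markup, the class name XPathDemo, and the values in it are made up for illustration, and only HtmlDocument.LoadHtml, SelectNodes, SelectSingleNode, and InnerText from the library are used.

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // Hypothetical markup, invented for this example.
        const string html =
            "<html><body>" +
            "<div class='a2'>fallback</div>" +
            "<div><span class='time'>09:00</span><span class='time'>10:00</span></div>" +
            "<div><span class='time'>11:00</span></div>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var root = doc.DocumentNode;

        // Points 1, 2 and 5: "//" searches the whole document, the predicate
        // filters by attribute, and single quotes avoid escaping in C#.
        var all = root.SelectNodes("//span[@class='time']");
        Console.WriteLine(all.Count);

        // Point 3: positions start at 1. The parentheses make [1] apply to
        // the whole result set rather than to each parent separately.
        var first = root.SelectSingleNode("(//span[@class='time'])[1]");
        Console.WriteLine(first.InnerText);

        // Point 4: the union matches whichever of the two containers exists.
        var either = root.SelectSingleNode("//div[@class='a1'] | //div[@class='a2']");
        Console.WriteLine(either.InnerText);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var none = root.SelectNodes("//span[@class='missing']");
        Console.WriteLine(none == null);
    }
}
```

Checking the result of SelectNodes for null before iterating is worth making a habit, since a page layout change otherwise surfaces as a NullReferenceException.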
The class above also uses a NetHelper, whose main job is to fetch the content at a URL. There are plenty of implementations of this on the Internet, so I won't show mine here. You can put one together yourself.
Usage is also simple:
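Since the post does not include NetHelper, here is one possible minimal sketch so that the class above compiles. The signature is inferred only from the call NetHelper.GetPageStr(htmlOrUrl, "", encode); the purpose of the second argument is not stated, so this hypothetical version accepts it and ignores it. It uses WebClient, which the post mentions; newer .NET versions mark WebClient as obsolete in favor of HttpClient.

```csharp
using System.Net;
using System.Text;

public static class NetHelper
{
    // Hypothetical stand-in for the author's helper, not the original code.
    // "extra" mirrors the unexplained second argument and is not used here.
    public static string GetPageStr(string url, string extra, string encode = "UTF-8")
    {
        using (var client = new WebClient())
        {
            // Decode the response body with the encoding the site uses.
            client.Encoding = Encoding.GetEncoding(encode);
            return client.DownloadString(url);
        }
    }
}
```

A production version would likely also want a timeout, a User-Agent header, and error handling, all of which are left out of this sketch.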
// For example, fetch the homepage of my blog and parse the list of current articles.
var doc = new HtmlParse("http://www.cnblogs.com/jayshsoft/");
var nodeList = doc.GetNodes("//div[@class='Post post-list-item']");
// GetNodes returns null when nothing matches, so check before iterating.
if (nodeList != null)
{
    foreach (var node in nodeList)
    {
        // write your own logic here
    }
}
With the class library in place, collecting content becomes very easy. You only need to analyze the page structure, and every element in the HTML document is within reach...
In addition, I recommend a Firefox add-on: XPath Checker
With it, you can write an XPath expression directly in the browser and see the result at a glance.
Right-click a blank area of the webpage and select View XPath.
Enter an XPath expression to see the matching data. Intuitive, isn't it? I recommend it...