Recently I needed to parse an HTML page, extract some data from it, and insert that data into a local database. I know this can be done with regular expressions, but regular expressions are to me what assembly language is: I know what they are and what they can do, but I never really learned how to use them. I tried, used them too little, and finally gave up. I also know that some components can manipulate HTML, such as mshtml or the WebBrowser control, but they never felt quite right or professional enough. So I hesitated and did not get started until I discovered HtmlAgilityPack. The word "agility" in the middle of the name means agile and flexible.
Parts of the following text are excerpted from Zhou Gong's blog.
A Concise Introduction to XPath
XPath uses path expressions to select a node or a set of nodes in an XML document. Nodes are selected by following a path, which is made up of one or more steps.
The most useful path expressions are listed below:
nodename: Selects all child elements of the current node that have this name.
/: Selects from the root node.
//: Selects matching nodes below the starting point, regardless of their location and depth. At the start of an expression this means anywhere in the document.
. (one dot): Selects the current node.
.. (two dots): Selects the parent node of the current node.
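The expressions listed above can be tried out with the XPath support built into .NET. The following is only a minimal sketch: it uses System.Xml rather than HtmlAgilityPack, and the tiny Root/Item/Name document and the PathBasics class are hypothetical names invented for this illustration.

```csharp
using System;
using System.Xml;

public static class PathBasics
{
    // Hypothetical two-level document, used only for this illustration.
    public const string Xml =
        "<Root><Item><Name>a</Name></Item><Item><Name>b</Name></Item></Root>";

    public static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = Load();

        // "/" starts at the root of the document.
        XmlNode root = doc.SelectSingleNode("/Root");

        // A bare name selects child elements of the context node with that name.
        Console.WriteLine(root.SelectNodes("Item").Count);   // 2

        // "//" matches at any depth.
        XmlNodeList names = doc.SelectNodes("//Name");
        Console.WriteLine(names.Count);                      // 2

        // "." is the context node itself, ".." is its parent.
        XmlNode firstName = names[0];
        Console.WriteLine(firstName.SelectSingleNode(".").Name);   // Name
        Console.WriteLine(firstName.SelectSingleNode("..").Name);  // Item
    }
}
```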
For example, there is the following XML:
<?xml version="1.0" encoding="utf-8"?>
<Articles>
  <Article>
    <Title>Using the Highcharts JS Chart in ASP.</Title>
    <Url>http://zhoufoxcn.blog.51cto.com/792419/537324</Url>
    <CreateAt type="en">2011-04-07</CreateAt>
  </Article>
  <Article>
    <Title lang="eng">Log4net Use Details (cont.)</Title>
    <Url>http://blog.csdn.net/zhoufoxcn/archive/2010/11/23/6029021.aspx</Url>
    <CreateAt type="ZH-CN">November 23, 2010</CreateAt>
  </Article>
  <Article>
    <Title>General Steps for J2ME development</Title>
    <Url>http://blog.csdn.net/zhoufoxcn/archive/2011/06/12/6540223.aspx</Url>
    <CreateAt type="ZH-CN">June 12, 2011</CreateAt>
  </Article>
  <Article>
    <Title lang="eng">Powerdesign Advanced Applications</Title>
    <Url>http://zhoufoxcn.blog.51cto.com/792419/166415</Url>
    <CreateAt type="ZH-CN">2007-09-08</CreateAt>
  </Article>
</Articles>
For the XML file above, here are some path expressions with predicates, along with the result of each expression:
/Articles/Article[1]: Selects the first Article element that is a child of the Articles element, that is, the first <Article></Article> group. Note that positions start at 1, not 0.
/Articles/Article[last()]: Selects the last Article element that is a child of the Articles element, that is, the last <Article></Article> group.
/Articles/Article[last()-1]: Selects the second-to-last Article element that is a child of the Articles element.
/Articles/Article[position()<3]: Selects the first two Article elements that are children of the Articles element.
//Title[@lang]: Selects all Title elements that have an attribute named lang.
//CreateAt[@type='ZH-CN']: Selects all CreateAt elements that have a type attribute with the value ZH-CN.
/Articles/Article[Order>2]: Selects all Article elements of the Articles element that have an Order child element with a value greater than 2. (The sample XML above contains no Order element, so this expression and the next one assume that each Article has one.)
/Articles/Article[Order<3]/Title: Selects the Title elements of all Article elements of the Articles element that have an Order child element with a value less than 3.
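To check what each predicate selects, the expressions can be run against a trimmed copy of the sample data. This is a sketch under stated assumptions rather than part of the original article: it uses System.Xml, the element casing (Articles, Article, Title, CreateAt) is assumed, the titles are shortened to single letters, and the Order elements and the PredicateDemo class are hypothetical additions so that every expression has something to match.

```csharp
using System;
using System.Xml;

public static class PredicateDemo
{
    // Trimmed copy of the sample data. The Order elements are an assumption
    // added for illustration; the original sample does not contain them.
    private const string Xml = @"
<Articles>
  <Article><Title>A</Title><CreateAt type=""en"">2011-04-07</CreateAt><Order>1</Order></Article>
  <Article><Title lang=""eng"">B</Title><CreateAt type=""ZH-CN"">2010-11-23</CreateAt><Order>2</Order></Article>
  <Article><Title>C</Title><CreateAt type=""ZH-CN"">2011-06-12</CreateAt><Order>3</Order></Article>
  <Article><Title lang=""eng"">D</Title><CreateAt type=""ZH-CN"">2007-09-08</CreateAt><Order>4</Order></Article>
</Articles>";

    private static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml.Trim());
        return doc;
    }

    // Number of nodes an expression selects.
    public static int Count(string xpath) => Load().SelectNodes(xpath).Count;

    // Text of the first node an expression selects.
    public static string First(string xpath) => Load().SelectSingleNode(xpath).InnerText;

    public static void Main()
    {
        Console.WriteLine(First("/Articles/Article[1]/Title"));         // A
        Console.WriteLine(First("/Articles/Article[last()]/Title"));    // D
        Console.WriteLine(First("/Articles/Article[last()-1]/Title"));  // C
        Console.WriteLine(Count("/Articles/Article[position()<3]"));    // 2
        Console.WriteLine(Count("//Title[@lang]"));                     // 2
        Console.WriteLine(Count("//CreateAt[@type='ZH-CN']"));          // 3
        Console.WriteLine(Count("/Articles/Article[Order>2]"));         // 2
        Console.WriteLine(Count("/Articles/Article[Order<3]/Title"));   // 2
    }
}
```

Element names and attribute values are case-sensitive in XPath, so the expressions must use exactly the casing that appears in the document.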
A Brief Introduction to the HtmlAgilityPack API
The classes most commonly used in HtmlAgilityPack are HtmlDocument, HtmlNodeCollection, HtmlNode, and HtmlWeb.
The usual process is to obtain the HTML first. You can load static content with HtmlDocument's Load() or LoadHtml() method, or use HtmlWeb's Get() or Load() method to load the HTML for a URL on the network.
Once you have an HtmlDocument instance, you can use its DocumentNode property, which is the root node of the whole HTML document and is itself an HtmlNode. From there you can call HtmlNode's SelectNodes() method, which returns an HtmlNodeCollection holding multiple HtmlNode objects, or its SelectSingleNode() method, which returns a single HtmlNode.
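The load-then-select flow described above might look roughly like the following. It is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced by the project. The HTML fragment, the HapDemo class, and the ExtractLinks helper are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, assumed to be referenced by the project

public static class HapDemo
{
    // Hypothetical static HTML, used only for this illustration.
    public const string Html =
        "<html><body><h1>News</h1><ul>" +
        "<li><a href='/a'>First</a></li>" +
        "<li><a href='/b'>Second</a></li>" +
        "</ul></body></html>";

    // Returns "href text" for every link inside a list item.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                 // load static content from a string

        HtmlNode root = doc.DocumentNode;   // root node of the whole document
        var result = new List<string>();

        HtmlNodeCollection links = root.SelectNodes("//ul/li/a");
        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (links != null)
        {
            foreach (HtmlNode link in links)
                result.Add(link.GetAttributeValue("href", "") + " " + link.InnerText);
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        // SelectSingleNode returns one HtmlNode (or null if there is no match).
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
        Console.WriteLine(heading.InnerText);

        foreach (string line in ExtractLinks(Html))
            Console.WriteLine(line);

        // To load a page from the network instead of a string:
        // HtmlDocument remote = new HtmlWeb().Load("http://example.com/");
    }
}
```

The null check matters in practice: iterating over the result of SelectNodes without it throws a NullReferenceException on pages where the expression matches nothing.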
HtmlAgilityPack in Practice
Parsing HTML using C# and HtmlAgilityPack