I once foolishly used regular expressions to parse my school's news site, academic administration system, and library management system to extract the content I wanted. Writing those regular expressions took a great deal of effort, and the result was still unreliable: unexpected bugs kept turning up, and only after countless fixes did everything work properly. It was never a comfortable solution. Later I saw other people using this library to parse HTML and was impressed, so today I tried it myself. Code that had taken me several days to write took only a few minutes with this library. Enough chatter; on to the topic.
Html Agility Pack Home: http://htmlagilitypack.codeplex.com/
Author's homepage: http://zhoufoxcn.blog.51cto.com/792419/595344/
Using the class library
Step one: Add a reference to the class library in your project.
Step two: Load the HTML. The document.Load() method reads a local file, and the document.LoadHtml() method parses an HTML string, for example markup that has already been downloaded from a remote resource.
Step three: Get the root node:
HtmlNode rootNode = document.DocumentNode;
Step four: Find what you are looking for under the root node. I have not tried everything; below is some of my test code, which parses a NetEase news page.
// Requires: using System; using System.Text; using HtmlAgilityPack;
HtmlDocument document = new HtmlDocument();
// Load a locally saved copy of the page using the system default encoding.
document.Load(@"E:\c.htm", Encoding.Default);
HtmlNode rootNode = document.DocumentNode;

// Headline
HtmlNode titleNode = rootNode.SelectSingleNode("//h1[@id='h1title']");
Console.WriteLine("-------------------------title-------------------------------");
Console.WriteLine(titleNode.InnerHtml);

// Publication time
Console.WriteLine("-------------------------time-------------------------------");
HtmlNode timeNode = rootNode.SelectSingleNode("//div[@class='ep-info cdgray']/div[@class='left']");
Console.WriteLine(timeNode.InnerHtml);

// Article body
Console.WriteLine("-------------------------body-------------------------------");
HtmlNode newsNode = rootNode.SelectSingleNode("//div[@class='end-text']");
Console.WriteLine(newsNode.InnerHtml);

Console.ReadKey();
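One thing worth guarding against in code like the above: SelectSingleNode() returns null when nothing matches, and SelectNodes() also returns null rather than an empty collection, so a change in the page layout would surface as a NullReferenceException. The following is a minimal sketch of the same lookups with null checks, using LoadHtml() on an in-memory string. The sample markup and the ExtractTitle helper are hypothetical and exist only for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class LoadHtmlSketch
{
    // Hypothetical helper: returns the headline text, or null if the node is absent.
    public static string ExtractTitle(string html)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html); // parses a string instead of reading a file
        HtmlNode titleNode = document.DocumentNode.SelectSingleNode("//h1[@id='h1title']");
        return titleNode == null ? null : titleNode.InnerText;
    }

    static void Main()
    {
        // Hypothetical sample markup; a real news page would be far larger.
        string html =
            "<html><body><h1 id='h1title'>Sample headline</h1>" +
            "<div class='end-text'><p>First paragraph.</p><p>Second paragraph.</p></div>" +
            "</body></html>";

        Console.WriteLine(ExtractTitle(html) ?? "(title not found)");

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when there is no match, so check before iterating.
        HtmlNodeCollection paragraphs =
            document.DocumentNode.SelectNodes("//div[@class='end-text']/p");
        if (paragraphs != null)
        {
            foreach (HtmlNode paragraph in paragraphs)
            {
                Console.WriteLine(paragraph.InnerText);
            }
        }
    }
}
```

InnerText is used here instead of InnerHtml so that only the text is printed, without the nested tags.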
Nodes are selected with XPath expressions. According to the official documentation, expressions such as the following select one or more nodes below the root node:
/articles/article[1]: Selects the first article element that is a child of the articles element.
/articles/article[last()]: Selects the last article element that is a child of the articles element.
/articles/article[last()-1]: Selects the second-to-last article element that is a child of the articles element.
/articles/article[position()<3]: Selects the first two article elements that are children of the articles element.
//title[@lang]: Selects all title elements that have an attribute named lang.
//createat[@type='zh-cn']: Selects all createat elements that have a type attribute with the value zh-cn.
/articles/article[order>2]: Selects all article elements of the articles element whose order child element has a value greater than 2.
/articles/article[order<3]/title: Selects the title elements of all article elements of the articles element whose order child element has a value less than 3.
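To see what these predicates actually select, here is a small sketch run against hypothetical sample data shaped like the articles/article examples. It assumes nothing about the library: it uses XmlDocument from System.Xml rather than Html Agility Pack so that it runs without any extra reference, and the XPath 1.0 predicates themselves are written the same way in both. The Titles helper and the sample values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathPredicateSketch
{
    // Hypothetical sample data: four articles with a title and an order value.
    public const string SampleXml =
        "<articles>" +
        "<article><title lang='en'>A</title><order>1</order></article>" +
        "<article><title>B</title><order>2</order></article>" +
        "<article><title lang='zh'>C</title><order>3</order></article>" +
        "<article><title>D</title><order>4</order></article>" +
        "</articles>";

    // Hypothetical helper: evaluates the expression and joins the text of the matched nodes.
    public static string Titles(string xpath)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(SampleXml);
        List<string> texts = new List<string>();
        foreach (XmlNode node in document.SelectNodes(xpath))
        {
            texts.Add(node.InnerText);
        }
        return string.Join(",", texts);
    }

    static void Main()
    {
        Console.WriteLine(Titles("/articles/article[1]/title"));            // A
        Console.WriteLine(Titles("/articles/article[last()]/title"));       // D
        Console.WriteLine(Titles("/articles/article[last()-1]/title"));     // C
        Console.WriteLine(Titles("/articles/article[position()<3]/title")); // A,B
        Console.WriteLine(Titles("//title[@lang]"));                        // A,C
        Console.WriteLine(Titles("/articles/article[order>2]/title"));      // C,D
        Console.WriteLine(Titles("/articles/article[order<3]/title"));      // A,B
    }
}
```

Note that XPath positions start at 1, not 0, so article[1] is the first article.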
The most useful path expressions are listed below:
nodename: Selects the child elements of the current node that have this name.
/: Selects from the root node.
//: Selects matching nodes anywhere in the document, regardless of their location.
.: Selects the current node.
..: Selects the parent of the current node.
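The difference between these expressions is easiest to see when the starting point is not the root. The sketch below evaluates them from a nested node. The markup and the Count helper are hypothetical. One detail to watch: an expression that begins with // searches the whole document even when it is called on a nested node, so .// is the form to use for a search limited to the current node's descendants.

```csharp
using System;
using HtmlAgilityPack;

static class PathExpressionSketch
{
    // Hypothetical markup: two containers that each hold paragraphs.
    public const string SampleHtml =
        "<html><body>" +
        "<div id='a'><p>one</p><p>two</p></div>" +
        "<div id='b'><p>three</p></div>" +
        "</body></html>";

    // Hypothetical helper: number of matches, treating a null result as zero.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Hypothetical helper: loads the sample and returns the div with id 'a'.
    public static HtmlNode LoadDivA()
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(SampleHtml);
        return document.DocumentNode.SelectSingleNode("//div[@id='a']");
    }

    static void Main()
    {
        HtmlNode divA = LoadDivA();

        Console.WriteLine(Count(divA, "p"));    // children of div 'a' named p
        Console.WriteLine(Count(divA, "//p"));  // every p in the whole document
        Console.WriteLine(Count(divA, ".//p")); // p elements below div 'a' only
        Console.WriteLine(divA.SelectSingleNode("..").Name); // parent element name
        Console.WriteLine(Count(divA.OwnerDocument.DocumentNode, "/html/body/div")); // absolute path from the root
    }
}
```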
A powerful tool for parsing HTML in C#: Html Agility Pack