HTML Agility Pack: a powerful tool for parsing HTML in C#

Source: Internet
Author: User

I once foolishly used regular expressions to parse my school's news site, educational management system, and library management system to extract the content I wanted. Writing those regular expressions took a great deal of effort, and the result was still fragile: unexpected bugs kept appearing, and it only worked reliably after countless fixes. Even then it was unpleasant to use. Later I saw others using HTML Agility Pack to parse HTML and was impressed. Today I tried it myself: code that had taken me several days took only a few minutes with this class library. Enough preamble; on to the subject.

Html Agility Pack Home: http://htmlagilitypack.codeplex.com/

Author's homepage: http://zhoufoxcn.blog.51cto.com/792419/595344/

Using the class library

Step one: reference the class library in your project.

Step two: load the HTML. Local files are supported through document.Load(). The class library also provides document.LoadHtml(), which parses HTML supplied as a string, for example markup that has been downloaded from a remote resource.
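As a rough sketch of the loading options in step two, the snippet below shows three ways to obtain an HtmlDocument. The class HtmlLoadingSketch and its method names are made up for illustration. Load, LoadHtml, and HtmlWeb.Load are the library calls assumed here, and the HtmlWeb route is an addition that the text above does not mention.

```csharp
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HTML Agility Pack.
public static class HtmlLoadingSketch
{
    // Parse a local file, decoding it with the given encoding.
    public static HtmlDocument FromFile(string path, Encoding encoding)
    {
        var document = new HtmlDocument();
        document.Load(path, encoding);
        return document;
    }

    // Parse markup that is already in memory as a string.
    public static HtmlDocument FromString(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }

    // Download a page and parse it in one call (needs network access).
    public static HtmlDocument FromUrl(string url)
    {
        var web = new HtmlWeb();
        return web.Load(url);
    }
}
```

Whichever route is used, the result is the same kind of HtmlDocument object, so the later steps do not change.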

Step three: Get the root node:

HtmlNode rootNode = document.DocumentNode;

Step four: find what you are looking for under the root node. I have not tried everything the library offers. Below is some of my test code, which parses a NetEase news page:

using System;
using System.Text;
using HtmlAgilityPack;

HtmlDocument document = new HtmlDocument();
// Load a saved copy of the news page from disk.
document.Load(@"E:\c.htm", Encoding.Default);
HtmlNode rootNode = document.DocumentNode;

// Headline
HtmlNode titleNode = rootNode.SelectSingleNode("//h1[@id='h1title']");
Console.WriteLine("-------------------------title-------------------------------");
Console.WriteLine(titleNode.InnerHtml);

// Publication time
Console.WriteLine("-------------------------Time-------------------------------");
HtmlNode timeNode = rootNode.SelectSingleNode("//div[@class='ep-info cdgray']/div[@class='left']");
Console.WriteLine(timeNode.InnerHtml);

// Article body
Console.WriteLine("-------------------------body-------------------------------");
HtmlNode newsNode = rootNode.SelectSingleNode("//div[@class='end-text']");
Console.WriteLine(newsNode.InnerHtml);
Console.ReadKey();
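One thing the test code above does not guard against is a missing node: SelectSingleNode returns null when nothing matches, so reading InnerHtml would throw if the page layout changes. A minimal sketch of a defensive wrapper follows. SafeSelect and InnerHtmlOrDefault are hypothetical names, not library members.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, not part of HTML Agility Pack.
public static class SafeSelect
{
    // Returns the inner HTML of the first node matching xpath,
    // or the fallback text when no node matches.
    public static string InnerHtmlOrDefault(HtmlNode root, string xpath, string fallback)
    {
        HtmlNode match = root.SelectSingleNode(xpath);
        return match == null ? fallback : match.InnerHtml;
    }
}
```

Called as SafeSelect.InnerHtmlOrDefault(document.DocumentNode, "//h1[@id='h1title']", "(no title)"), it would print a placeholder instead of crashing when the headline is absent.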

According to the official documentation, XPath expressions such as the following select one or more child nodes below the root node:

/articles/article[1]: Selects the first article element that is a child of the articles element.
/articles/article[last()]: Selects the last article element that is a child of the articles element.
/articles/article[last()-1]: Selects the second-to-last article element that is a child of the articles element.
/articles/article[position()<3]: Selects the first two article elements that are children of the articles element.
//title[@lang]: Selects all title elements that have an attribute named lang.
//createat[@type='zh-cn']: Selects all createat elements whose type attribute has the value zh-cn.
/articles/article[order>2]: Selects all article elements of the articles element whose order child element has a value greater than 2.
/articles/article[order<3]/title: Selects all title elements of those article elements of the articles element whose order child element has a value less than 3.
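The predicates in this list can be tried out directly. The sketch below runs a query against a small, made-up articles fragment and returns the text of each match. XPathPredicateSketch, its Sample string, and the Texts helper are assumptions for illustration only. Note that SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HTML Agility Pack.
public static class XPathPredicateSketch
{
    // Made-up sample markup that mirrors the element names used in the list.
    public const string Sample =
        "<articles>" +
        "<article><title lang='en'>First</title><order>1</order></article>" +
        "<article><title>Second</title><order>2</order></article>" +
        "<article><title lang='en'>Third</title><order>3</order></article>" +
        "</articles>";

    // Runs an XPath query and returns the trimmed inner text of every match.
    public static List<string> Texts(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes yields null when no node matches the expression.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(node => node.InnerText.Trim()).ToList();
    }
}
```

For instance, XPathPredicateSketch.Texts(XPathPredicateSketch.Sample, "/articles/article[order>2]/title") should yield only the title of the third article in this sample.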

The most useful path expressions are listed below:
nodename: Selects all child nodes of the named node.
/: Selects from the root node.
//: Selects matching nodes regardless of their location in the document. At the start of an expression it searches the whole document; after a step, as in .//span, it searches below that step.
.: Selects the current node.
..: Selects the parent node of the current node.
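To make the difference between these expressions concrete, here is a small sketch against made-up markup. PathExpressionSketch, its Sample string, and the Select helper are hypothetical. It relies on the XPath rule that an expression starting with // is evaluated from the document root even when it is called on an inner node, whereas one starting with .// stays below the context node.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, not part of HTML Agility Pack.
public static class PathExpressionSketch
{
    // Made-up sample markup: two sibling div elements, each holding one span.
    public const string Sample =
        "<div id='a'><span>one</span></div>" +
        "<div id='b'><span>two</span></div>";

    // Finds a context node with contextXPath, then evaluates relativeXPath on it.
    // Returns null when either expression matches nothing.
    public static HtmlNode Select(string html, string contextXPath, string relativeXPath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode context = document.DocumentNode.SelectSingleNode(contextXPath);
        return context?.SelectSingleNode(relativeXPath);
    }
}
```

With the second div as the context node, ".//span" finds the span holding "two", while "//span" finds the span holding "one" because the search restarts at the document root. With that span as the context node, ".." returns the enclosing div.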
