Parsing HTML documents in C#

Source: Internet
Author: User

Many people need to parse HTML documents. For example, when we crawl the page data of a website, the data comes back as HTML. We used to parse it with regular expressions, but ran into problems. Parsing an HTML document this way is not easy, and if the format of the document changes even slightly, the expressions are likely to stop matching correctly. So we need specialized tools to help us parse HTML documents easily.
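To make the fragility of regular expressions concrete, here is a minimal sketch that uses only System.Text.RegularExpressions. The pattern, the RegexCellDemo class, and the sample markup are made up for illustration and are not from the original code; the point is only that a small, harmless change in the markup can make a hand-written pattern stop matching.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCellDemo
{
    // Hypothetical pattern: matches only a bare, lower-case <td> tag with no attributes.
    private static readonly Regex CellPattern = new Regex(@"<td>(.*?)</td>");

    // Returns the text captured between each matching <td>...</td> pair.
    public static string[] ExtractCells(string html)
    {
        return CellPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }

    public static void Main()
    {
        string original = "<tr><td>2013-12-29</td><td>Invoice 1</td></tr>";
        // Same data, but one cell gained a class attribute and the other uses upper-case tags.
        string changed = "<tr><td class=\"date\">2013-12-29</td><TD>Invoice 1</TD></tr>";

        Console.WriteLine(ExtractCells(original).Length); // prints 2
        Console.WriteLine(ExtractCells(changed).Length);  // prints 0
    }
}
```

The pattern could be patched to allow attributes and ignore case, but each patch only covers the variations someone thought of in advance, which is the problem a real HTML parser avoids.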

In fact, very good tools for this already exist, such as HtmlAgilityPack. It lets us parse HTML documents about as easily as parsing XML with the XmlDocument class.

This tool can be downloaded from http://htmlagilitypack.codeplex.com/, which provides DLLs for various versions of the .NET Framework.

Here is a simple example. Once you understand it, you can extend it to other cases.

For example, suppose we want to parse the following HTML:

    <table>
        <thead>
            <tr>
                <th>Time</th>
                <th>Type</th>
                <th>Name</th>
                <th>Unit</th>
                <th>Amount</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>2013-12-29</td>
                <td>Invoice 1</td>
                <td>Purchase material Invoice 1</td>
                <td>XXX Company 1</td>
                <td>123 USD</td>
            </tr>
            <tr>
                <td>2013-12-29</td>
                <td>Invoice 2</td>
                <td>Purchase material Invoice 2</td>
                <td>XXX Company 2</td>
                <td>321 USD</td>
            </tr>
        </tbody>
    </table>

Using a console project as the example, first add a reference to the HtmlAgilityPack.dll file so that you can use the classes and methods inside the DLL.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            string strWebContent = @"<table>
        <thead>
            <tr>
                <th>Time</th>
                <th>Type</th>
                <th>Name</th>
                <th>Unit</th>
                <th>Amount</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>2013-12-29</td>
                <td>Invoice 1</td>
                <td>Purchase material Invoice 1</td>
                <td>XXX Company 1</td>
                <td>123 USD</td>
            </tr>
            <tr>
                <td>2013-12-29</td>
                <td>Invoice 2</td>
                <td>Purchase material Invoice 2</td>
                <td>XXX Company 2</td>
                <td>321 USD</td>
            </tr>
        </tbody>
    </table>";

            // Define a list for saving the results
            List<Data> datas = new List<Data>();

            // Load the HTML string; for a file, use the htmlDocument.Load method instead
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(strWebContent);

            // As with XPath, navigate to the node we want
            HtmlNodeCollection collection = htmlDocument.DocumentNode.SelectSingleNode("table/tbody").ChildNodes;
            foreach (HtmlNode node in collection)
            {
                // Split on line breaks only and trim each piece to get the text of each td.
                // The cell text itself contains spaces, so a space must not be a separator.
                string[] line = node.InnerText
                    .Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(s => s.Trim())
                    .Where(s => s.Length > 0)
                    .ToArray();

                // Add to the object list only if the row has exactly five cells
                // (this also skips the whitespace text nodes between rows)
                if (line.Length == 5)
                    datas.Add(new Data() { Time = line[0], Type = line[1], Name = line[2], Unit = line[3], Amount = line[4] });
            }

            // Loop over the results to check that they are correct
            foreach (var v in datas)
            {
                Console.WriteLine(string.Join(",", v.Time, v.Type, v.Name, v.Unit, v.Amount));
            }
        }
    }

    /// <summary>
    /// Entity class used to hold one row of data
    /// </summary>
    public class Data
    {
        public string Time { get; set; }
        public string Type { get; set; }
        public string Name { get; set; }
        public string Unit { get; set; }
        public string Amount { get; set; }
    }
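Splitting a row's InnerText, as the code above does, depends on how the markup is laid out and on which characters the cell text contains. As a possible alternative, the sketch below reads each td cell directly. It assumes the HtmlAgilityPack library is referenced; the TableReader class and the ReadRows method are hypothetical names introduced here for illustration and are not part of the library or of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableReader
{
    // Hypothetical helper: returns the trimmed text of every <td> in each <tbody> row.
    public static List<string[]> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("//tbody/tr");
        if (trNodes == null) return rows;

        foreach (HtmlNode tr in trNodes)
        {
            // A relative XPath: only the td elements that are direct children of this row.
            HtmlNodeCollection cells = tr.SelectNodes("td");
            if (cells == null) continue;

            // Decode entities such as &amp; and strip surrounding whitespace.
            rows.Add(cells.Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim()).ToArray());
        }
        return rows;
    }

    public static void Main()
    {
        string html = "<table><tbody><tr><td>2013-12-29</td><td> Invoice 1 </td></tr></tbody></table>";
        foreach (string[] row in ReadRows(html))
            Console.WriteLine(string.Join(",", row));
    }
}
```

Because each cell is read as its own node, spaces inside a cell and the way the source HTML is indented no longer affect how many fields a row yields. The caller can still check that a row has five cells before mapping it onto the Data class.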

This is the complete code, and the comments explain each step.

Finally, look at the result of the parse: the program prints one comma-separated line for each data row in the table.
