Many developers need to parse HTML documents. For example, when we crawl page data from a web site, that data arrives as HTML. We used to parse it with regular expressions, but ran into problems: writing a correct expression for an HTML document is not easy, and if the format of the document changes even slightly, the expression will probably no longer match. So we need a specialized tool that makes parsing HTML documents easy.
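To illustrate the fragility described above, here is a minimal sketch using only the standard `System.Text.RegularExpressions` namespace. The class `RegexFragility`, the helper `CountCells`, and both sample strings are hypothetical and invented for this illustration. They are not taken from the original crawler.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexFragility
{
    // Hypothetical helper: counts table cells with a naive pattern that
    // only recognizes a bare, lowercase <td> tag without attributes.
    public static int CountCells(string html)
    {
        return Regex.Matches(html, "<td>(.*?)</td>").Count;
    }

    public static void Main()
    {
        string original = "<tr><td>2013-12-29</td><td>Invoice 1</td></tr>";

        // Same data, but one cell gained a class attribute and the other
        // is written with uppercase tags.
        string changed = "<tr><td class=\"date\">2013-12-29</td><TD>Invoice 1</TD></tr>";

        Console.WriteLine(CountCells(original)); // prints 2
        Console.WriteLine(CountCells(changed));  // prints 0
    }
}
```

Both strings hold the same two values, yet the pattern finds nothing in the second one. The pattern could be loosened, but every new variation in the markup needs another adjustment, which is the maintenance problem an HTML parser avoids.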
Fortunately, a very good tool already exists: HtmlAgilityPack. It lets us parse HTML documents as easily as we parse XML with the XmlDocument class.
The tool can be downloaded from http://htmlagilitypack.codeplex.com/, which provides DLLs for various versions of the .NET Framework.
Here is a simple example that you can adapt to your own cases.
Suppose we want to parse the following HTML:
<table>
  <thead>
    <tr><th>Time</th><th>Type</th><th>Name</th><th>Unit</th><th>Amount</th></tr>
  </thead>
  <tbody>
    <tr><td>2013-12-29</td><td>Invoice 1</td><td>Purchase material Invoice 1</td><td>XXX Company 1</td><td>123 USD</td></tr>
    <tr><td>2013-12-29</td><td>Invoice 2</td><td>Purchase material Invoice 2</td><td>XXX Company 2</td><td>321 USD</td></tr>
  </tbody>
</table>
The example is a console project. First add a reference to HtmlAgilityPack.dll so that you can use the classes and methods inside the DLL.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        string strWebContent =
            @"<table><thead><tr><th>Time</th><th>Type</th><th>Name</th><th>Unit</th><th>Amount</th></tr></thead><tbody>"
            + @"<tr><td>2013-12-29</td><td>Invoice 1</td><td>Purchase material Invoice 1</td><td>XXX Company 1</td><td>123 USD</td></tr>"
            + @"<tr><td>2013-12-29</td><td>Invoice 2</td><td>Purchase material Invoice 2</td><td>XXX Company 2</td><td>321 USD</td></tr></tbody></table>";

        // Define a list for saving the results.
        List<Data> datas = new List<Data>();

        // Load the HTML string. For a file, use the htmlDocument.Load method instead.
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(strWebContent);

        // As with XmlDocument, an XPath expression navigates to the node we want.
        HtmlNodeCollection collection =
            htmlDocument.DocumentNode.SelectSingleNode("//table/tbody").ChildNodes;

        foreach (HtmlNode node in collection)
        {
            // Skip anything that is not a row, such as whitespace text nodes.
            if (node.Name != "tr") continue;

            // Read each td separately. The cell text itself contains spaces,
            // so splitting the row's InnerText on whitespace would break values apart.
            HtmlNodeCollection cells = node.SelectNodes("td");

            // Add the row to the list only if it has the expected five cells.
            if (cells != null && cells.Count == 5)
                datas.Add(new Data
                {
                    Time = cells[0].InnerText.Trim(),
                    Type = cells[1].InnerText.Trim(),
                    Name = cells[2].InnerText.Trim(),
                    Unit = cells[3].InnerText.Trim(),
                    Amount = cells[4].InnerText.Trim()
                });
        }

        // Print every entry to check that the results are correct.
        foreach (var v in datas)
        {
            Console.WriteLine(string.Join(",", v.Time, v.Type, v.Name, v.Unit, v.Amount));
        }
    }
}

/// <summary>
/// Entity class used to receive the parsed data.
/// </summary>
public class Data
{
    public string Time { get; set; }
    public string Type { get; set; }
    public string Name { get; set; }
    public string Unit { get; set; }
    public string Amount { get; set; }
}
This is the complete code, and the comments explain each step.
Finally, running the program prints one comma-separated line for each data row of the table.