I previously posted an article about HTML-to-XML conversion, which attracted a fair amount of attention from readers. That implementation used htmlparser to break down the HTML content and then generated the XML string piece by piece according to the DOM structure. Before it was fully implemented, I thought this approach would solve the problem. In actual use, however, it turned out to be very slow, and it could not convert some special HTML attributes, so the results were not satisfactory.
While browsing the CodePlex website, I came across a very good HTML parsing and conversion tool: the Html Agility Pack mentioned in the title of this article. Html Agility Pack is an open-source framework hosted on CodePlex. Its main purpose is to let you work with HTML content through an object model, so that XML technologies such as XPath can be applied easily and flexibly to HTML document parsing. As its introduction says, the framework is well suited to building crawlers and web data mining tools. More importantly, it is written entirely in C#, which makes it easy to modify and to study in depth.
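To make the XPath point concrete, here is a minimal sketch, assuming the HtmlAgilityPack package is referenced by the project. The class name LinkLister and the sample HTML are made up for illustration; only HtmlDocument, LoadHtml, DocumentNode, SelectNodes, and GetAttributeValue come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class LinkLister
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }
        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    public static void Main()
    {
        // Illustrative input only; the <p> tags are deliberately left unclosed.
        string html = "<html><body><p><a href='/one'>One</a><p><a href='/two'>Two</a></body></html>";
        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check matters in practice: a page without any matching element would otherwise throw when the result is enumerated.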
The following describes how to convert HTML to XML.
First, create an HtmlDocument object (this HtmlDocument is the class in Html Agility Pack, not the one in Windows Forms). All HTML operations go through this object.
HtmlDocument htmlDoc = new HtmlDocument();
Next, set the option that switches the output to XML.
// Output in XML format
htmlDoc.OptionOutputAsXml = true;
Load the HTML string and output the conversion result:
// Load the HTML content (the second <td> is deliberately left unclosed)
htmlDoc.LoadHtml(@"<html><body>
<table>
<tr>
<td>dafd</td>
<td>
</tr>
</table>
</body>
</html>");
// Save the output to a string writer
StringBuilder sbXml = new StringBuilder();
StringWriter sw = new StringWriter(sbXml);
htmlDoc.Save(sw);
Console.WriteLine(sbXml.ToString());
The HTML supplied above is not well-formed XML. The converted result is as follows:
<?xml version="1.0" encoding="gb2312"?>
<html>
<body>
<table>
<tr>
<td>dafd</td>
<td></td>
</tr>
</table>
</body>
</html>
After the conversion, the unmatched tag has been closed automatically and an XML declaration has been added.
In addition, if the given HTML document does not have a single root node, a root node named span is added automatically during conversion.
For example, suppose the input HTML document is as follows:
<script> var B = 'B'; </script>
<html> <body>
<table>
<tr>
<td>dafd</td>
<td>
</tr>
</table>
</body>
</html>
The conversion result is as follows:
<?xml version="1.0" encoding="gb2312"?><span><script>
//<![CDATA[
var B = 'B';
//]]>//
</script><html><body>
<table>
<tr>
<td>dafd</td>
<td>
</td></tr>
</table>
</body>
</html></span>
Wrapping the content this way keeps the conversion safe, since the output always has a single root element. Whether to use it depends on the requirements of the specific project.
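The string-based conversion described above can be collected into one small helper. This is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the class name HtmlToXmlConverter and the method name Convert are hypothetical, while HtmlDocument, OptionOutputAsXml, LoadHtml, and Save are the library members already used in this article.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Hypothetical wrapper, not part of Html Agility Pack.
public static class HtmlToXmlConverter
{
    // Parses a (possibly malformed) HTML string and returns it serialized as XML.
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true; // serialize as XML instead of HTML
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Same shape as the article's sample: the second <td> is never closed.
        string html = "<html><body><table><tr><td>dafd</td><td></tr></table></body></html>";
        Console.WriteLine(Convert(html));
    }
}
```

Keeping the conversion behind a single method makes it easy to swap the input source later, for example from a literal string to downloaded content.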
The method above works on an HTML string you already have. There is a more convenient way: give a URL directly, and HtmlWeb will download the page and convert it. The implementation is as follows:
StringBuilder sbXml = new StringBuilder();
StringWriter sw = new StringWriter(sbXml);
XmlTextWriter tw = new XmlTextWriter(sw);
HtmlWeb htmlWeb = new HtmlWeb();
// Download the page and write it to the XmlTextWriter as XML
htmlWeb.LoadHtmlAsXml("http://htmlagilitypack.codeplex.com/", tw);
Console.WriteLine(sbXml.ToString());
Although this method is convenient, it has one source of instability: the downloaded HTML document may come out garbled when its character encoding is not detected correctly, and this does happen in practice. To make it more usable, I modified the source code so that it determines the encoding automatically during download.
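The following is not the modification I made to the library source. It is a separate, minimal sketch of one way to avoid garbled text without touching the library: download the raw bytes yourself, decode them with the charset announced in the HTTP Content-Type header, and pass the decoded string to LoadHtml. The class EncodingAwareDownloader and its methods are hypothetical names, and falling back to UTF-8 is an assumption. On .NET Core and later, legacy encodings such as gb2312 are only available after registering CodePagesEncodingProvider.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class EncodingAwareDownloader
{
    // Maps a charset name to an Encoding; falls back to UTF-8 (an assumption)
    // when the name is missing or not recognized on this platform.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers send the charset wrapped in quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads a page, decodes it with the server-declared charset,
    // and returns the document serialized as XML.
    public static async Task<string> DownloadAsXmlAsync(string url)
    {
        using (var client = new HttpClient())
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            byte[] body = await response.Content.ReadAsByteArrayAsync();
            string charset = response.Content.Headers.ContentType?.CharSet;
            string html = ResolveEncoding(charset).GetString(body);

            var doc = new HtmlDocument();
            doc.OptionOutputAsXml = true;
            doc.LoadHtml(html);

            using (var writer = new StringWriter())
            {
                doc.Save(writer);
                return writer.ToString();
            }
        }
    }
}
```

This only trusts the HTTP header. Pages that declare their charset solely in a meta tag would need an extra step, which is left out here.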
Html Agility Pack is much faster than htmlparser. However, when processing large pages you still have to wait. Another drawback is that the conversion result still does not match standard HTML formatting 100%. It is perhaps 95% of the way there, which is far short of the HTML parsing in Firebug.
Download link for Html Agility Pack:
http://htmlagilitypack.codeplex.com/
Modified DLL (fixes the garbled characters that appear after a document is downloaded):
HtmlAgilityPack_Shenba