A nice HTML-to-XML tool: Html Agility Pack

Source: Internet
Author: User

[Repost] A very good HTML-to-XML tool: Html Agility Pack

An earlier article of mine on converting HTML to XML drew the attention of many readers. That approach used htmlparser to parse the HTML content and then generated the XML string node by node, following the structure of the DOM. Without enough practical experience, I thought this solution would solve the problem. In real use, however, it turned out to be very slow, and it did not support the conversion of some special HTML attributes, so the results were unsatisfactory.

While browsing the CodePlex website, I came across a good HTML parsing and conversion tool: the Html Agility Pack named in the title. Html Agility Pack is an open-source framework hosted on CodePlex. Its main function is to manipulate HTML content through an object model, so that XPath and other XML techniques can be applied to HTML documents simply and flexibly. As its introduction says, the framework is well suited to building crawlers and web data mining tools. More importantly, it is written entirely in C#, which makes it easy to modify and to study in depth.
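
To illustrate the object-model and XPath style of access described above, here is a minimal sketch. It assumes the HtmlAgilityPack assembly is referenced; the class name XPathSketch and the method CellTexts are hypothetical names chosen for this example and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class XPathSketch
{
    // Hypothetical helper: returns the trimmed text of every <td> cell in the HTML.
    public static List<string> CellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells != null)
        {
            foreach (HtmlNode cell in cells)
                texts.Add(cell.InnerText.Trim());
        }
        return texts;
    }

    static void Main()
    {
        foreach (string text in CellTexts("<table><tr><td>dafd</td><td>x</td></tr></table>"))
            Console.WriteLine(text);
    }
}
```

The null check matters in practice: a page without any matching node would otherwise cause a NullReferenceException in the loop.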

Here is how to convert HTML into XML.

First create an HtmlDocument object (this HtmlDocument is the class in Html Agility Pack, not the one in WinForms). All HTML operations go through this object.

using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();

Then set the option that makes the output XML:

// Output in XML format
htmlDoc.OptionOutputAsXml = true;

Load the HTML string and output the conversion result:

// Load the HTML content (deliberately malformed: unclosed <td>, stray </body>)
htmlDoc.LoadHtml(@"<table>
<tr>
<td>dafd</td>
<td>
</tr>
</table>
</body>");

// Save the output to a string writer
StringBuilder sbXml = new StringBuilder();
StringWriter sw = new StringWriter(sbXml);
htmlDoc.Save(sw);

Console.WriteLine(sbXml.ToString());

The HTML content provided is not well-formed XML. The result after conversion is:

<?xml version="1.0" encoding="gb2312"?>
<html>
<body>
<table>
<tr>
<td>dafd</td>
<td></td>
</tr>
</table>
</body>
</html>

The conversion automatically repairs unmatched tags and adds an XML declaration.
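
Once the markup is well-formed, ordinary XML tooling can consume it. The following is a minimal sketch using only System.Xml from the .NET base library. The input string is an assumption: it is shaped like the converted result above with the XML declaration omitted, and the names ConvertedXmlSketch and CountCells are hypothetical.

```csharp
using System;
using System.Xml;

static class ConvertedXmlSketch
{
    // Hypothetical helper: counts <td> elements in an XML string.
    public static int CountCells(string xml)
    {
        var doc = new XmlDocument();
        // LoadXml throws XmlException if the input is not well-formed,
        // which is exactly what raw, unrepaired HTML tends to trigger.
        doc.LoadXml(xml);
        return doc.SelectNodes("//td").Count;
    }

    static void Main()
    {
        string converted =
            "<html><body><table><tr><td>dafd</td><td></td></tr></table></body></html>";
        Console.WriteLine(CountCells(converted));
    }
}
```

Passing the original, unrepaired HTML string to the same method would throw instead, which is the practical reason for converting first.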

In addition, if the given HTML document does not have a single root node, a root node named span is automatically added during conversion.

For example, the input HTML document is as follows:

<script>var b='b';</script>
<html><body>
<table>
<tr>
<td>dafd</td>
<td>
</tr>
</table>
</body>
</html>

The conversion results are as follows:

<?xml version="1.0" encoding="gb2312"?><span><script>
//<![CDATA[
var b='b';
//]]>//
</script><html><body>
<table>
<tr>
<td>dafd</td>
<td>
</td></tr>
</table>
</body></html></span>

This approach keeps the conversion safe. Whether to rely on it depends on the requirements of the specific project.

The method above works on an HTML string you already have. There is another, more convenient way: pass a URL directly and let HtmlWeb handle both the download and the conversion. Here is how it is done:

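// Requires: using System.Xml; in addition to System, System.IO, System.Text and HtmlAgilityPack
StringBuilder sbXml = new StringBuilder();
StringWriter sw = new StringWriter(sbXml);
XmlTextWriter tw = new XmlTextWriter(sw);

// Download the page and write it to the XML writer in one step
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.LoadHtmlAsXml("http://htmlagilitypack.codeplex.com/", tw);

Console.WriteLine(sbXml.ToString());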

Although this method is convenient, it has one unstable factor: the downloaded HTML document may come out garbled, and this does happen in practice. To make the library more usable, I modified the source code so that it automatically determines the encoding at download time.
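
The article does not show how the source was modified. As a rough illustration of one common way to determine a page's encoding, the sketch below looks for a charset declaration in a meta tag near the start of the downloaded bytes. This is an assumption about the general technique, not the author's actual change; the class CharsetSniffer, the method Sniff, the 1024-byte window and the utf-8 fallback are all hypothetical choices.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=... inside a <meta> tag, with or without quotes.
    static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the declared charset name in lower case,
    // or the fallback when no declaration is found.
    public static string Sniff(byte[] page, string fallback = "utf-8")
    {
        // Inspect only the first 1024 bytes. Decoding as ASCII is enough here
        // because the tag and the charset name are themselves ASCII.
        int length = Math.Min(page.Length, 1024);
        string head = Encoding.ASCII.GetString(page, 0, length);
        Match match = MetaCharset.Match(head);
        return match.Success ? match.Groups[1].Value.ToLowerInvariant() : fallback;
    }

    static void Main()
    {
        byte[] page = Encoding.ASCII.GetBytes(
            "<html><head><meta http-equiv=\"Content-Type\" " +
            "content=\"text/html; charset=gb2312\"></head></html>");
        Console.WriteLine(Sniff(page));
    }
}
```

The returned name can then be handed to Encoding.GetEncoding to decode the full page. On .NET Core and later, legacy code pages such as gb2312 are only available after registering CodePagesEncodingProvider. Later releases of Html Agility Pack also expose AutoDetectEncoding and OverrideEncoding on HtmlWeb, which may make a source modification unnecessary.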

Html Agility Pack is much more efficient than htmlparser, although very large pages still take some time to process. Another drawback is that the converted result does not match the original HTML structure 100%. It is perhaps 95% close, which is still far behind Firebug's HTML parsing.

Download link for Html Agility Pack:

http://htmlagilitypack.codeplex.com/

Modified DLL (fixes the garbled text in downloaded documents):

Htmlagilitypack_shenba
