There are several ways to parse HTML files and extract data on the .NET platform. The simplest and safest is to use a tool that converts the HTML document into well-formed XML, and then work with the data flexibly through the XML DOM or XPath. SgmlReader is a class library for tidying HTML documents in this way:
Microsoft's XML guru Chris Lovett has developed an SGML parser called SgmlReader, which parses HTML files and can even convert them into a well-formed XML structure. SgmlReader derives from XmlReader, which means you can parse an HTML file the same way you would parse an XML file with a class such as XmlTextReader.
Here is some sample code:
using System.IO;
using System.Xml;
using Sgml;

public static XmlDocument ConvertHtmlToXml(string html)
{
    using (SgmlReader sgmlReader = new SgmlReader())
    using (StringWriter stringWriter = new StringWriter())
    {
        // Tell SgmlReader to parse the input as HTML.
        sgmlReader.DocType = "HTML";
        sgmlReader.InputStream = new StringReader(html);

        // Copy every node from the reader into well-formed XML text.
        using (XmlTextWriter xmlWriter = new XmlTextWriter(stringWriter))
        {
            while (!sgmlReader.EOF)
            {
                xmlWriter.WriteNode(sgmlReader, true);
            }
        }

        // Load the XML text while stringWriter is still in scope.
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(stringWriter.ToString());
        return xmlDoc;
    }
}
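Once the HTML is available as an XmlDocument, XPath can be used to pull data out of it. The following is only a minimal sketch using System.Xml alone: the LinkExtractor class, the ExtractLinks helper, and the sample markup are hypothetical, and the hard-coded string merely stands in for the result of the conversion shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class LinkExtractor
{
    // Hypothetical helper: collects the href value of every <a> element
    // in the document that carries an href attribute.
    public static List<string> ExtractLinks(XmlDocument doc)
    {
        var links = new List<string>();
        // "//a[@href]" matches <a> elements at any depth that have an href.
        foreach (XmlNode node in doc.SelectNodes("//a[@href]"))
        {
            links.Add(node.Attributes["href"].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Assumption: well-formed markup standing in for converted HTML.
        var doc = new XmlDocument();
        doc.LoadXml("<html><body><a href=\"/one\">One</a>" +
                    "<p><a href=\"/two\">Two</a></p><a>no href</a></body></html>");

        foreach (string href in ExtractLinks(doc))
        {
            Console.WriteLine(href);
        }
    }
}
```

Note that XPath is case-sensitive, so the element names in the expression have to match the case of the element names in the converted document.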
Home: http://code.msdn.microsoft.com/SgmlReader
Language: English
License: Open source
Related website:
http://msdn.microsoft.com/en-us/library/aa302299.aspx
Download: SgmlReader 1.8 (SourceForge, MSDN Code Gallery)