There are several ways to parse HTML files and extract data from them on the .NET platform. The simplest and safest is to first use a tool to convert the HTML document into well-formed XML, and then process the data flexibly with the XML DOM or XPath. SgmlReader is a library for tidying HTML documents in this way:
Chris Lovett, an XML expert at Microsoft, developed an SGML parser called SgmlReader, which can parse HTML files and convert them into well-formed XML. SgmlReader derives from XmlReader, which means you can read HTML files with it in the same way you would read XML files with a class such as XmlTextReader.
Here is some sample code:
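using System.IO;
using System.Xml;
using Sgml; // namespace of the SgmlReader library

public static XmlDocument ConvertHtmlToXml(string html)
{
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        // Parse the input as HTML.
        sgmlReader.DocType = "HTML";
        sgmlReader.InputStream = new StringReader(html);
        using (StringWriter stringWriter = new StringWriter())
        {
            using (XmlTextWriter xmlWriter = new XmlTextWriter(stringWriter))
            {
                // Copy every node from the reader to the XML writer.
                while (!sgmlReader.EOF)
                {
                    xmlWriter.WriteNode(sgmlReader, true);
                }
            }
            // Load the XML text while stringWriter is still in scope.
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(stringWriter.ToString());
            return xmlDoc;
        }
    }
}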
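Once the HTML is available as an XmlDocument, the XPath step mentioned above might look like the following. This is only a minimal sketch: the class name XPathSketch, the method ExtractLinks, and the sample markup are made up for illustration, and the document is loaded from a hard-coded well-formed string as a stand-in for the converter's output so that the example depends only on System.Xml. It also assumes the document declares no default XML namespace.

```csharp
using System;
using System.Xml;

public static class XPathSketch
{
    // Hypothetical helper: returns the href value of every <a> element
    // that has an href attribute.
    public static string[] ExtractLinks(XmlDocument doc)
    {
        // "//a[@href]" selects all <a> elements carrying an href attribute.
        XmlNodeList anchors = doc.SelectNodes("//a[@href]");
        string[] links = new string[anchors.Count];
        for (int i = 0; i < anchors.Count; i++)
        {
            links[i] = anchors[i].Attributes["href"].Value;
        }
        return links;
    }

    public static void Main()
    {
        // Stand-in for a converted document: already well-formed XML.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><a href=\"/one\">One</a><p>text</p>" +
                    "<a href=\"/two\">Two</a></body></html>");

        foreach (string link in ExtractLinks(doc))
        {
            Console.WriteLine(link);
        }
    }
}
```

XPath is case-sensitive, so the element names in the query must match the case of the names in the converted document.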
Home: http://code.msdn.microsoft.com/SgmlReader
Language: English. License: open source
Related URLs:
http://msdn.microsoft.com/en-us/library/aa302299.aspx
Download page (SourceForge), SgmlReader 1.8, MSDN Code Library