Using .NET Framework classes to parse HTML files and read data from them is not easy. You can use classes such as StreamReader to read a file line by line, but the APIs provided by XmlReader do not work "out of the box" because HTML is usually not well-formed XML. You can also use regular expressions, but if you are not familiar with them, you may find them difficult at first.
Microsoft's XML guru Chris Lovett recently released a new SGML parser called SgmlReader on the http://www.gotdotnet.com website. It can parse HTML files and even convert them into a well-formed structure. SgmlReader derives from XmlReader, which means that you can parse HTML files much as you would parse XML files with classes such as XmlTextReader. In this article, I will show how to use the SgmlReader class to parse HTML files and generate well-formed HTML, so that you can use XPath statements to read data.
Create an SgmlReader instance to parse HTML
Before using SgmlReader, download it from gotdotnet.com and put the assembly in your application's bin folder. Once the assembly is in place, you can write code to read the HTML you want to parse. In this example, we use the HttpWebRequest and HttpWebResponse objects to access a remote HTML file:

    // uri is the address of the remote HTML file
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sReader = new StreamReader(res.GetResponseStream());

After obtaining the remote HTML file, you can create an instance of the SgmlReader class. Setting its DocType property to "HTML" tells the reader that it is processing an HTML file:

    SgmlReader reader = new SgmlReader();
    reader.DocType = "HTML";

The response stream of the HTML file can be handed to the SgmlReader instance through its InputStream property. First, load the contents of the response stream into a TextReader object (here a StringReader), and then assign that TextReader to the InputStream property:

    reader.InputStream = new StringReader(sReader.ReadToEnd());

Now you can call SgmlReader's Read() method to parse the HTML and write it out as indented, well-formed markup:

    StringWriter sw = new StringWriter();
    XmlTextWriter writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;
    while (reader.Read()) {
        // Skip whitespace-only nodes; copy everything else to the writer
        if (reader.NodeType != XmlNodeType.Whitespace) {
            writer.WriteNode(reader, true);
        }
    }

Because SgmlReader produces well-formed HTML, you can use XPath statements to read different nodes. The following code shows how to load the output generated by SgmlReader into an XPathNavigator and then use an XPath statement to query the structure of the HTML file:

    StringBuilder sb = new StringBuilder();
    // sw holds the well-formed markup written in the previous step
    XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator nodes = nav.Select(xpath);
    while (nodes.MoveNext()) {
        sb.Append(nodes.Current.Value);
    }
    return sb.ToString();

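To make the XPath step easier to try on its own, here is a minimal sketch that runs a query against a string that is already well-formed, using only the standard System.Xml.XPath classes. The class name XPathDemo and the method name SelectText are hypothetical and are not part of SgmlReader. The sketch assumes the input is the kind of well-formed text that the parsing step produces.

```csharp
using System.IO;
using System.Text;
using System.Xml.XPath;

// Illustrative helper; the class and method names are hypothetical.
public static class XPathDemo
{
    // Returns the value of every node matched by xpath, one per line.
    // Assumes wellFormedHtml is already well-formed markup, for example
    // the text written to the StringWriter after parsing.
    public static string SelectText(string wellFormedHtml, string xpath)
    {
        XPathDocument doc = new XPathDocument(new StringReader(wellFormedHtml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator nodes = nav.Select(xpath);

        StringBuilder sb = new StringBuilder();
        while (nodes.MoveNext())
        {
            sb.Append(nodes.Current.Value);
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

For the input `<html><body><a href="a.htm">One</a><a href="b.htm">Two</a></body></html>`, the expression `//a/@href` yields the two link targets `a.htm` and `b.htm`, and `//a` yields the link texts `One` and `Two`. XPath is case-sensitive, so the element names in the expression have to match the case used in the generated markup.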
If you are familiar with the XPath language and understand the different XML parsing APIs in the .NET Framework, you can easily use the SgmlReader class to parse HTML and read data.