Parse HTML With XPath

Source: Internet
Author: User

Parsing HTML files and reading data from them with the .NET Framework classes is not the easiest task. You can use classes such as StreamReader to read a file line by line, but the APIs provided by XmlReader do not work "out of the box" on HTML, because HTML is often not well-formed. You can also use regular expressions, but if you are not familiar with them, you may find them difficult at first.

Microsoft's XML guru Chris Lovett recently released a new SGML parser called SgmlReader on the http://www.gotdotnet.com website. It can parse HTML files and even convert them into a well-formed structure. SgmlReader derives from XmlReader, which means that you can parse HTML files much as you would parse XML files with classes such as XmlTextReader. In this article, I will show how to use the SgmlReader class to parse HTML files and generate well-formed HTML, so that you can use XPath statements to read data.

Create an SgmlReader instance to parse HTML
Before using SgmlReader, download it from gotdotnet.com and put the assembly in your application's bin folder. Once the assembly is in place, you can write code to read the HTML you want to parse. In this example, we use the HttpWebRequest and HttpWebResponse objects to access a remote HTML file:

    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sReader = new StreamReader(res.GetResponseStream());

After obtaining the remote HTML file, you can create an instance of the SgmlReader class. Setting its DocType property to "HTML" tells the parser that you are processing an HTML file:

    SgmlReader reader = new SgmlReader();
    reader.DocType = "HTML";

The HTML response can then be loaded into the SgmlReader instance through its InputStream property. Read the response into a TextReader object (here a StringReader) and assign that TextReader to InputStream:

    reader.InputStream = new StringReader(sReader.ReadToEnd());

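Now you can call SgmlReader's Read() method to parse the HTML file and write the result, indented, to a StringWriter:

    StringWriter sw = new StringWriter();
    XmlTextWriter writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;
    while (reader.Read()) {
        // Skip whitespace-only nodes; WriteNode copies the current node and its whole subtree.
        if (reader.NodeType != XmlNodeType.Whitespace) {
            writer.WriteNode(reader, true);
        }
    }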

Because SgmlReader produces well-formed HTML, you can use XPath statements to read different nodes. The following code shows how to load the output generated by SgmlReader into an XPathNavigator and then use an XPath statement to query the HTML structure:

    StringBuilder sb = new StringBuilder();
    XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
    XPathNavigator nav = doc.CreateNavigator();
    // xpath is a string that holds the XPath expression to evaluate.
    XPathNodeIterator nodes = nav.Select(xpath);
    while (nodes.MoveNext()) {
        sb.Append(nodes.Current.Value);
    }
    return sb.ToString();
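
To try the XPath step on its own, without downloading SgmlReader, the following minimal sketch may help. It uses only the .NET Framework classes already shown above. The class name XPathQueryDemo, the helper SelectText, and the sample markup are hypothetical; the markup is a hand-written stand-in for the well-formed output that SgmlReader would produce.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.XPath;

class XPathQueryDemo
{
    // Hypothetical helper: returns the value of every node matched by the
    // XPath expression, one value per line.
    static string SelectText(string wellFormedHtml, string xpath)
    {
        StringBuilder sb = new StringBuilder();
        XPathDocument doc = new XPathDocument(new StringReader(wellFormedHtml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator nodes = nav.Select(xpath);
        while (nodes.MoveNext())
        {
            sb.Append(nodes.Current.Value).Append('\n');
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Assumption: stand-in for markup already cleaned up by SgmlReader.
        string html =
            "<html><body>" +
            "<a href=\"/one\">First</a>" +
            "<a href=\"/two\">Second</a>" +
            "</body></html>";

        // Select the href attribute of every anchor element.
        Console.Write(SelectText(html, "//a/@href"));
    }
}
```

With this sample markup, the expression //a/@href should yield the two href values. XPath is case-sensitive, so the element names in the expression must match the case used in the parsed output.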


If you are familiar with the XPath language and understand the different XML parsing APIs in the .NET Framework, you can easily use the SgmlReader class to parse HTML and read data.
