Parse HTML With XPath

Last Update: 2018-12-04 Source: Internet

Author: User

Parsing HTML files and reading data from them with the .NET Framework classes is not easy. You can use classes such as StreamReader to read a file line by line, but the APIs provided by XmlReader do not work "out of the box" because HTML is usually not well-formed XML. Regular expressions are another option, but if you are not familiar with them you may find them difficult at first.
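To make the XmlReader limitation concrete, the following minimal sketch feeds two markup fragments to XmlReader and reports whether each one can be read to the end. The class and method names (WellFormedCheck, IsWellFormedXml) and the sample markup are illustrative assumptions, not part of the original article.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper for illustration only.
public static class WellFormedCheck
{
    // Returns true if XmlReader can read the markup to the end,
    // false if it rejects the markup as not well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup)))
            {
                // Walk every node; a parse error surfaces as XmlException.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Typical HTML such as `<p>one<br>two</p>` leaves the `br` element unclosed, so the helper returns false for it, while the XHTML-style `<p>one<br />two</p>` is read without error. This gap is what a more tolerant parser has to bridge.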

Microsoft's XML master Chris Lovett recently released a new SGML parser called SgmlReader on the http://www.gotdotnet.com website. It can parse HTML files and even convert them into a well-formed, regular structure. SgmlReader is derived from XmlReader, which means that you can parse HTML files the same way you would use classes such as XmlTextReader to parse XML files. In this article, I will introduce how to use the SgmlReader class to parse HTML files and generate well-formed HTML, so that you can use XPath statements to read data.

Create an SgmlReader instance to parse HTML
Before using SgmlReader, download it from gotdotnet.com and put the assembly in your application's bin folder. After referencing the assembly, you can write code to read the HTML you want to parse. In this example, we use the HttpWebRequest and HttpWebResponse objects to access a remote HTML file:

// uri holds the address of the remote HTML file
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
StreamReader sReader = new StreamReader(res.GetResponseStream());

After obtaining the remote HTML file, you can create an instance of the SgmlReader class. Setting its DocType property to "HTML" tells the reader that it is processing HTML:

SgmlReader reader = new SgmlReader();
reader.DocType = "HTML";

The response stream of the HTML file can be passed to the SgmlReader instance through its InputStream property. First, load the HTML file stream into a TextReader object, and then assign the TextReader to the InputStream property:

reader.InputStream = new StringReader(sReader.ReadToEnd());

Now you can call SgmlReader's Read() method to parse the HTML file:

// sw (a StringWriter) and writer (an XmlTextWriter) are assumed to be declared elsewhere
sw = new StringWriter();
writer = new XmlTextWriter(sw);
writer.Formatting = Formatting.Indented;
while (reader.Read()) {
    if (reader.NodeType != XmlNodeType.Whitespace) {
        writer.WriteNode(reader, true);
    }
}

Because SgmlReader produces well-formed HTML, you can use XPath statements to read different nodes. The following code shows how to load the output generated by SgmlReader into an XPathNavigator, and then use an XPath statement to query the HTML file structure:

// xpath holds the XPath expression to evaluate
StringBuilder sb = new StringBuilder();
XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
XPathNavigator nav = doc.CreateNavigator();
XPathNodeIterator nodes = nav.Select(xpath);
while (nodes.MoveNext()) {
    sb.Append(nodes.Current.Value);
}
return sb.ToString();
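As a rough illustration of what such a query looks like against concrete markup, the sketch below wraps the same XPathDocument and XPathNavigator calls in a standalone method. The class name XPathQuery, the method SelectValues, the newline separator, and the sample markup in the note that follows are assumptions made for this example. It takes already well-formed markup as a string instead of SgmlReader output.

```csharp
using System.IO;
using System.Text;
using System.Xml.XPath;

// Hypothetical helper for illustration only.
public static class XPathQuery
{
    // Evaluates an XPath expression against well-formed markup and
    // returns the string value of each matching node, one per line.
    public static string SelectValues(string wellFormedMarkup, string xpath)
    {
        StringBuilder sb = new StringBuilder();
        XPathDocument doc = new XPathDocument(new StringReader(wellFormedMarkup));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator nodes = nav.Select(xpath);
        while (nodes.MoveNext())
        {
            // Separator added here so that adjacent values stay readable.
            sb.Append(nodes.Current.Value).Append('\n');
        }
        return sb.ToString();
    }
}
```

Given markup containing two links, `<a href="a.html">First</a>` and `<a href="b.html">Second</a>`, the expression `//a/@href` selects the two link targets and `//a` selects the link text. XPath matching is case-sensitive, so the element names in the expression need to match the case of the names in the parsed output.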

If you are familiar with the XPath language and understand the different XML parsing APIs in the .NET Framework, you can easily use the SgmlReader class to parse HTML and read data.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness, ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
