Parsing HTML Files: Using the SgmlReader Class

Source: Internet
Author: User

Parsing HTML files and reading data from them with the .NET Framework classes is not straightforward. Many classes in the .NET Framework (such as StreamReader) let you read a file line by line, but the API provided by XmlReader does not work on HTML "out of the box", because HTML is usually not well-formed XML. You can use regular expressions instead, but if you are not familiar with them, they can be difficult to get right.
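To make the well-formedness problem concrete, here is a minimal sketch that feeds markup to XmlReader and reports whether it parsed. The class name WellFormedCheck and the method IsWellFormed are hypothetical names chosen for illustration; only the System.Xml calls come from the .NET Framework.

```csharp
using System.IO;
using System.Xml;

public static class WellFormedCheck
{
    // Hypothetical helper: returns true if the markup parses as
    // well-formed XML, false if XmlReader rejects it.
    public static bool IsWellFormed(string markup)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup)))
            {
                // Walk every node; Read() throws XmlException on malformed input.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling WellFormedCheck.IsWellFormed("<p>one<br>two</p>") returns false, because the unclosed <br> tag is acceptable in HTML but not in XML. The same fragment written with <br/> returns true.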

Microsoft's XML guru Chris Lovett recently published a new SGML parser called SgmlReader on the http://www.gotdotnet.com website. It parses HTML files and can even convert them into a well-formed structure. SgmlReader derives from XmlReader, which means you can parse an HTML file the same way you would parse an XML file with a class such as XmlTextReader. In this article, I'll show you how to use the SgmlReader class to parse an HTML file and generate well-formed HTML so that you can read the data using XPath statements.

Create an SgmlReader instance to parse the HTML
Before you start using SgmlReader, download it from gotdotnet.com and put the assembly in your application's Bin folder. Once the assembly is referenced, write the code that reads the HTML you want to parse. The example in this article uses the HttpWebRequest and HttpWebResponse objects to access a remote HTML file:

    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sReader = new StreamReader(res.GetResponseStream());

Once you have the remote HTML file, you can create an instance of the SgmlReader class. Tell the reader that you are working with an HTML file by setting its DocType property to "HTML":

    SgmlReader reader = new SgmlReader();
    reader.DocType = "HTML";

The response stream for the HTML file is passed to the SgmlReader instance through its InputStream property. First the contents of the stream are loaded into a TextReader object (here, a StringReader), and then that TextReader is assigned to the InputStream property:

    reader.InputStream = new StringReader(sReader.ReadToEnd());

Now you can parse the HTML file by calling SgmlReader's Read() method and writing each node to an XmlTextWriter:

    sw = new StringWriter();
    writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;
    while (reader.Read()) {
        if (reader.NodeType != XmlNodeType.Whitespace) {
            writer.WriteNode(reader, true);
        }
    }

Because SgmlReader produces well-formed HTML, you can use XPath statements to read different nodes. The following code shows how to load the output generated by SgmlReader into an XPathNavigator, and then how to query the HTML structure with an XPath statement:

    StringBuilder sb = new StringBuilder();
    XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator nodes = nav.Select(xpath);
    while (nodes.MoveNext()) {
        sb.Append(nodes.Current.Value);
    }
    return sb.ToString();
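The XPath step can be tried on its own, without the SgmlReader assembly, by running it against a hand-written well-formed fragment. The sketch below assumes such a fragment as input. The class name XPathDemo and the method SelectValues are hypothetical, and it returns a list of values rather than one concatenated string, which is a design choice for this illustration only.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathDemo
{
    // Hypothetical helper: returns the string value of every node
    // matched by the XPath expression, in document order.
    public static List<string> SelectValues(string wellFormedHtml, string xpath)
    {
        XPathDocument doc = new XPathDocument(new StringReader(wellFormedHtml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator nodes = nav.Select(xpath);

        List<string> values = new List<string>();
        while (nodes.MoveNext())
        {
            // For an attribute node, Value is the attribute's text.
            values.Add(nodes.Current.Value);
        }
        return values;
    }
}
```

For example, passing the fragment <html><body><a href="a.html">A</a><p><a href="b.html">B</a></p></body></html> with the expression //a/@href yields the two values a.html and b.html. XPath names are case-sensitive, so the element names in the expression must match the case used in the parsed output.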

If you are already familiar with the XPath language and understand the different XML parsing APIs in the .NET Framework, you can easily parse HTML and read data from it using the SgmlReader class.

Sample code (C#)

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using Sgml; // namespace of the SgmlReader assembly

private string GetWellFormedHtml(string uri, string xpath) {
    StreamReader sReader = null;
    StringWriter sw = null;
    SgmlReader reader = null;
    XmlTextWriter writer = null;
    try {
        if (uri == String.Empty) uri = "http://www.XMLforASP.NET";
        // Fetch the remote HTML file
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
        HttpWebResponse res = (HttpWebResponse)req.GetResponse();
        sReader = new StreamReader(res.GetResponseStream());
        // Parse it as HTML and write the nodes out as well-formed markup
        reader = new SgmlReader();
        reader.DocType = "HTML";
        reader.InputStream = new StringReader(sReader.ReadToEnd());
        sw = new StringWriter();
        writer = new XmlTextWriter(sw);
        writer.Formatting = Formatting.Indented;
        //writer.WriteStartElement("Test");
        while (reader.Read()) {
            if (reader.NodeType != XmlNodeType.Whitespace) {
                writer.WriteNode(reader, true);
            }
        }
        //writer.WriteEndElement();
        if (xpath == null) {
            return sw.ToString();
        } else { // filter out nodes from the HTML
            StringBuilder sb = new StringBuilder();
            XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
            XPathNavigator nav = doc.CreateNavigator();
            XPathNodeIterator nodes = nav.Select(xpath);
            while (nodes.MoveNext()) {
                sb.Append(nodes.Current.Value + "");
            }
            return sb.ToString();
        }
    } catch (Exception exp) {
        return exp.Message;
    } finally {
        // Close whatever was opened; any of these may still be null
        // if an earlier step failed.
        if (writer != null) writer.Close();
        if (reader != null) reader.Close();
        if (sw != null) sw.Close();
        if (sReader != null) sReader.Close();
    }
}
