Sometimes the applications we develop need to grab content from web pages for their own use, for example the weather information or news on the QQ website. Unlike a search-engine crawler such as Google's, the developer knows the target page in advance. That gives us good reason to avoid the tedious work of picking the page apart with a pile of regular expressions. It would be much nicer to fetch the HTML of the target page and then parse it through the DOM. There are two problems with that. First, DOM operations on HTML are normally available only on the client, through scripting languages such as JavaScript or VBScript. Second, HTML is not a strict format: real pages are often not well-formed, so you cannot process them with XML tools the way you would use XSL or XPath on an XML document. Still, since this blog post exists, there must be a solution :)
Many thanks to Microsoft XML guru Chris Lovett for giving us the open-source SgmlReader project. XML is a subset of SGML and HTML is an application of SGML, so an SGML parser can read HTML and emit it as XML. With SgmlReader you can convert ordinary HTML into well-formed HTML (that is not an official term, but it is what we will call it for now), and then use XPath to read data from the page. In the .NET Framework, the problem we ran into becomes easy.
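To make the idea concrete, here is a minimal sketch of the conversion described above. It is not the example program from this post. The HTML string, the `span` element, and the `temp` class are made-up placeholders, and the sketch assumes the SgmlReader assembly (namespace `Sgml`) is referenced in the project.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // SgmlReader library (third-party; must be referenced separately)

class SgmlReaderSketch
{
    // Convert loosely written HTML into an XmlDocument via SgmlReader.
    static XmlDocument LoadHtml(string html)
    {
        using (var input = new StringReader(html))
        {
            var sgml = new SgmlReader
            {
                DocType = "HTML",                             // parse with the built-in HTML DTD
                CaseFolding = CaseFolding.ToLower,            // normalize tag names to lower case
                WhitespaceHandling = WhitespaceHandling.All,
                InputStream = input
            };

            var doc = new XmlDocument();
            doc.Load(sgml); // SgmlReader is an XmlReader, so XmlDocument can consume it
            return doc;
        }
    }

    static void Main()
    {
        // Hypothetical markup with an unclosed <p> and <br>, as real pages often have.
        string html = "<html><body><p>Weather<br><span class='temp'>25</span></body></html>";

        XmlDocument doc = LoadHtml(html);

        // Hypothetical XPath; a real page needs an expression that matches its own markup.
        XmlNode temp = doc.SelectSingleNode("//span[@class='temp']");
        Console.WriteLine(temp != null ? temp.InnerText : "(not found)");
    }
}
```

In a real program the HTML string would come from an HTTP request (for example `WebClient.DownloadString`) rather than a literal, and the XPath would be written against the structure of the actual page.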
I wrote a simple example program that captures the weather information from the QQ website. By changing the city name and the XPath expression, you can pull other content out of the page.
Code: Download
PS: Besides SgmlReader, Simon Mourier's Html Agility Pack for .NET offers similar functionality.
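For comparison, a rough sketch of the same lookup with Html Agility Pack might look like the following. It assumes the HtmlAgilityPack package is referenced, and the markup and XPath are again made-up placeholders.

```csharp
using System;
using HtmlAgilityPack; // third-party package; must be referenced separately

class HtmlAgilityPackSketch
{
    static void Main()
    {
        // Hypothetical markup with unclosed tags.
        string html = "<html><body><p>Weather<br><span class='temp'>25</span></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser; no separate conversion to XML is needed

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode temp = doc.DocumentNode.SelectSingleNode("//span[@class='temp']");
        Console.WriteLine(temp != null ? temp.InnerText : "(not found)");
    }
}
```

The main difference is that Html Agility Pack exposes its own DOM with XPath support, while SgmlReader plugs into the standard `System.Xml` classes.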