HTML Agility Pack: simple and easy-to-use HTML Parser

Last Update:2018-12-05 Source: Internet

Author: User

Tags xml parser


CodePlex package information
Package name	HTML Agility Pack
Author	Simon Mourier
Current version	1.4.0 Beta 2
URL	http://htmlagilitypack.codeplex.com/
Ease of use	Medium
Helper tools	HAP Explorer (available from the URL above); Internet Explorer 8 Developer Tools
Prerequisite knowledge	HTML, XML, and XPath. Familiarity with the XmlDocument class in the System.Xml namespace and its SelectNodes() and SelectSingleNode() methods is recommended.

Parsing HTML: a pain point for web developers

Since web applications first appeared around 1993, HTML has gone through several versions (1.0, 2.0, 3.0, 3.2, 4.0, 4.01) and has become the foundation of every web page and web application. Anyone who wants to design web pages or develop web applications has to learn it. Even with convenient control-based frameworks (such as ASP.NET), HTML still needs to be understood: not knowing HTML amounts to not knowing how web pages work.

Thanks to the rapid growth of HTML and web browsers, all kinds of applications have sprung up on the Internet: e-commerce, enterprise portals, online ordering, enterprise-level applications, and community, personal, and Web 2.0 sites run by businesses and organizations. In this era of information explosion, many information-aggregation applications have appeared as well. These applications connect to different websites to fetch their resources and then extract the desired information (such as a stock's price, gain or loss, and trading volume) from the HTML.

However, HTML itself is not a strictly structured language. It allows tags to be left unclosed, partly because browser parsers are designed to be highly fault tolerant. As a result, an HTML file cannot be parsed by strict rules, and the HTML structure of the target website may change at any time. This makes parsing HTML hard work. Although the W3C has promoted XHTML (HTML rewritten in the strict syntax of XML), few sites are designed with it, and most websites still use HTML. We therefore need a tool that can parse HTML quickly and retrieve the information we need.

How HTML is parsed in code

As we all know, HTML is really just a string made up of HTML tags, so the first approach that comes to mind for parsing it is string matching: define a pattern for the HTML structure and then search for it, for example:

[C#]

string pattern = "<td id='stockprice'>";
int position = html.IndexOf(pattern); // -1 if the pattern is not found

However, plain string matching is inefficient and inflexible, so regular expressions came into use, for example:

[Regular Expression]

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

Source: http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx

However, regular expressions have a steep learning curve. Developers who want to use them to parse HTML, let alone customize the patterns, generally find them unapproachable.
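The trade-off between the two approaches can be illustrated with a small sketch. The markup strings, the class name MatchingSketch, and the simplified pattern below are assumptions made for this illustration only; they are not taken from any real page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchingSketch
{
    // Literal search: succeeds only if the markup matches character for character.
    public static bool FoundByIndexOf(string html)
    {
        return html.IndexOf("<td id='stockprice'>", StringComparison.Ordinal) >= 0;
    }

    // Regex search: tolerates either quote style, extra whitespace, and tag case.
    public static string PriceByRegex(string html)
    {
        Match m = Regex.Match(
            html,
            @"<td\s+id\s*=\s*['""]stockprice['""]\s*>\s*([^<]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        // Two hypothetical renderings of the same cell.
        string a = "<td id='stockprice'>101.5</td>";
        string b = "<TD id = \"stockprice\" >101.5</TD>";

        Console.WriteLine(FoundByIndexOf(a)); // True
        Console.WriteLine(FoundByIndexOf(b)); // False
        Console.WriteLine(PriceByRegex(a));   // 101.5
        Console.WriteLine(PriceByRegex(b));   // 101.5
    }
}
```

The regular expression survives cosmetic changes that defeat the literal search, but the pattern is already harder to read, and it still knows nothing about how tags are nested.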

HTML has another characteristic: it is hierarchical. That is why a browser parses it into a document tree. Regular expressions do not support hierarchical parsing. The tool that comes closest to it is an XML parser, whose DOM and XPath features make parsing XML much easier. However, an XML parser cannot parse general HTML (XHTML is fine), because general HTML is a loosely structured hierarchy, while an XML parser checks at load time whether the document is well-formed and throws an exception if it is not. An XML parser therefore cannot be used directly.
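A minimal sketch of this behavior, using only the XmlDocument class from System.Xml. The two markup strings and the class name WellFormedSketch are made up for the illustration.

```csharp
using System;
using System.Xml;

public static class WellFormedSketch
{
    // Returns true if the markup loads as XML, false if the parser rejects it.
    public static bool LoadsAsXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown when the markup is not well-formed XML.
            return false;
        }
    }

    public static void Main()
    {
        string xhtml = "<html><body><p>Hello<br /></p></body></html>";
        string html = "<html><body><p>Hello<br></body></html>"; // <br> and <p> never closed

        Console.WriteLine(LoadsAsXml(xhtml)); // True
        Console.WriteLine(LoadsAsXml(html));  // False
    }
}
```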

HTML document hierarchy (IE8 Developer Tools)

However, someone has developed a tool that can use XPath-style queries to access loosely structured HTML: the HTML Agility Pack, which this article introduces.

HTML Agility Pack Introduction

HTML Agility Pack is a development library created by Simon Mourier, a French software architect, together with DarthObiwan and Jessynoo. It makes parsing loosely structured HTML as simple as parsing XML, and its object model differs very little from the XML DOM in the System.Xml namespace. Besides walking the HTML through its node objects, it also supports searching HTML with XPath, which is more precise than the string matching or regular expressions used in the past. For example:

In the example above, the W3C's latest news area is outlined with a colored frame. Its HTML is as follows:

In the past, with regular expressions, we might need many steps (a match returns a lot of results unless the pattern is refined) before reaching the framed area. With the HTML Agility Pack, we can locate it like this:

[XPath]

/html[1]/body[1]/div[1]/div[2]/div[3]/div[2]/div[1]/div[1]/div[1]

This way of locating a node is the same as in XPath, which is an advantage for developers already familiar with XPath or the DOM. The differences between the HTML Agility Pack classes and the XML DOM parser are as follows:

Differences between HTML Agility Pack classes and the XML DOM

Based on the above, we can write the following program to retrieve the latest news published on the W3C home page:

[C#]

using System;
using HtmlAgilityPack;

class Program
{
    public static void Main(string[] args)
    {
        // Download and parse the W3C home page
        HtmlWeb webClient = new HtmlWeb();
        HtmlDocument doc = webClient.Load("http://www.w3.org/");

        // Select the news items; SelectNodes returns null when nothing matches
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[3]/div[2]/div[1]/div[1]/div[1]/div");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
        }

        Console.WriteLine("Completed.");
        Console.ReadLine();
    }
}

Program that retrieves the latest news from the W3C home page (project type: Console Application)
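The same pattern can be tried without a network connection. The following is a sketch only: the inline markup, the class name HapQuerySketch, and the helper SelectTexts are hypothetical, and it assumes the HtmlAgilityPack assembly is referenced. It loads a loosely written fragment (unquoted attributes, unclosed tags) and guards against SelectNodes returning null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HapQuerySketch
{
    // Returns the trimmed text of every node matched by the XPath, or an empty list.
    public static List<string> SelectTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return result; // no match: SelectNodes returns null, not an empty collection
        }
        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical fragment: unquoted attributes, <br> and the outer tags never closed.
        string html = "<html><body><div id=news><span class=item>First item</span><br>"
                    + "<span class=item>Second item</span>";

        foreach (string text in SelectTexts(html, "//div[@id='news']/span[@class='item']"))
        {
            Console.WriteLine(text);
        }
        Console.WriteLine(SelectTexts(html, "//table").Count);
    }
}
```

Using a relative query such as //div[@id='news'] rather than a fully indexed path is a design choice: it is less likely to break when the page layout around the target changes.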

The HTML Agility Pack depends only on the .NET Framework, so no other HTML parser components are needed. Having the .NET Framework installed is enough.

Usage

To use the HTML Agility Pack, first download the binaries from the HTML Agility Pack site on CodePlex (the source code, the documentation, and the HAP Explorer tool can be downloaded there as well). After extracting the archive, add a reference to HtmlAgilityPack.dll in your project:

Then add the using directive:

[C#]

using HtmlAgilityPack;

You can then use the HTML Agility Pack in your program.

Example: parsing stock information from Yahoo! Kimo Stock

The author believes this is the main goal of many applications that collect stock market data. Obtaining licensed data from the stock exchange may cost money, whereas parsing the information shown on Yahoo! Kimo Stock is free of charge. However, the HTML of Yahoo! Kimo Stock has long been loosely structured. Unlike the W3C site, whose HTML is XHTML (it was used above only to demonstrate what the HTML Agility Pack can do), it takes real effort to parse. Now we can use the HTML Agility Pack to do this job.

On Yahoo! Kimo Stock, the information for a stock looks like this:

Its HTML structure is as follows:

Therefore, to parse it with XPath, we first use the following XPath to locate the outer table, and then read the HTML content inside it:

[XPath]

/html[1]/body[1]/center[1]/table[2]/tr[1]/td[1]/table[1]

We can then write the following program:

[C#]

using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program
{
    public static void Main(string[] args)
    {
        // Download the Yahoo! Kimo Stock page (stock code 2317 as an example)
        WebClient client = new WebClient();
        MemoryStream ms = new MemoryStream(client.DownloadData("http://tw.stock.yahoo.com/q/q?s=2317"));

        // Load the HTML from the stream with an explicit encoding
        HtmlDocument doc = new HtmlDocument();
        doc.Load(ms, Encoding.Default);

        // Locate the outer table and load its inner HTML into a second document
        HtmlDocument docStockContext = new HtmlDocument();
        docStockContext.LoadHtml(doc.DocumentNode.SelectSingleNode("/html[1]/body[1]/center[1]/table[2]/tr[1]/td[1]/table[1]").InnerHtml);

        // Get the header cells
        HtmlNodeCollection nodeHeaders = docStockContext.DocumentNode.SelectNodes("./tr[1]/th");
        // Get the cell values, one per line of the second row's text
        string[] values = docStockContext.DocumentNode.SelectSingleNode("./tr[2]").InnerText.Trim().Split('\n');
        int i = 0;

        // Output the data
        foreach (HtmlNode nodeHeader in nodeHeaders)
        {
            Console.WriteLine("Header: {0}, Value: {1}", nodeHeader.InnerText, values[i].Trim());
            i++;
        }

        ms.Close();
        Console.WriteLine("Completed.");
        Console.ReadLine();
    }
}

Program that retrieves an individual stock's data from Yahoo! Kimo Stock (project type: Console Application)

NOTE

Currently, the HTML Agility Pack appears to default to a French encoding. Therefore, to retrieve Chinese HTML content, do not pass it directly to the HtmlDocument.LoadHtml() method. Instead, wrap the downloaded bytes in a MemoryStream and use the HtmlDocument.Load() method with an explicit encoding.
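As a rough sketch of the stream-based approach, the helper below parses raw bytes with an encoding chosen by the caller. The class name EncodingSketch, the helper HeadingOf, and the sample markup are hypothetical, and UTF-8 is used here only so the sample is self-contained; a real page would need whatever encoding the site actually serves.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class EncodingSketch
{
    // Parses raw bytes as HTML using the encoding the caller names explicitly,
    // then returns the text of the first <h1>, or null if there is none.
    public static string HeadingOf(byte[] rawHtml, Encoding encoding)
    {
        var doc = new HtmlDocument();
        using (var ms = new MemoryStream(rawHtml))
        {
            doc.Load(ms, encoding);
        }
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
        return heading == null ? null : heading.InnerText;
    }

    public static void Main()
    {
        // Hypothetical page body, encoded as UTF-8 for the purposes of the sketch.
        byte[] raw = Encoding.UTF8.GetBytes("<html><body><h1>股市</h1></body></html>");
        Console.WriteLine(HeadingOf(raw, Encoding.UTF8));
    }
}
```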

The result of this example is as follows:

The author thinks there is a lot more that can be done with this program: for example, transforming the result with XSLT into different HTML for display, or saving it to a database for other work (such as analysis or reporting).

From the programs listed above, we can see that using the HTML Agility Pack is not much different from using the XML DOM, and there is no need to wrestle with unapproachable regular expressions. Developers can use XPath to parse HTML, even loosely structured HTML.

NOTE

Although the HTML Agility Pack can parse loosely structured HTML, note that badly broken HTML may cause HtmlNode lookups to fail, or, when parsing a table, the headers and the cell contents may fall out of sync. Developers should pay extra attention to this.

NOTE

HTML Agility Pack can also be used to generate HTML content, just as XmlDocument produces XML. It provides methods such as CreateElement(), CreateAttribute(), CreateComment(), and CreateTextNode(). They are used in much the same way as their XmlDocument counterparts, so the author does not describe them here. Please refer to: http://msdn.microsoft.com/zh-tw/library/t058x2df.aspx
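For orientation, a minimal sketch of building a fragment with these methods follows. The class name HapBuildSketch, the helper BuildSnippet, and the element and attribute values are hypothetical; the exact serialized output may vary between HTML Agility Pack versions, so none is shown.

```csharp
using System;
using HtmlAgilityPack;

public static class HapBuildSketch
{
    // Builds a <div> with one attribute and one text child and returns its markup.
    public static string BuildSnippet(string text)
    {
        var doc = new HtmlDocument();

        HtmlNode div = doc.CreateElement("div");
        div.Attributes.Add(doc.CreateAttribute("class", "quote"));
        div.AppendChild(doc.CreateTextNode(text)); // text is written as given, not HTML-encoded

        return div.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(BuildSnippet("Latest price"));
    }
}
```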

WARNING

This article uses Yahoo! Kimo only as an example. If you want to use it in practice, pay attention to its content licensing terms and to whether use is free of charge. The author uses this example to demonstrate the HTML Agility Pack. It does not mean that using this example grants any license to Yahoo! Kimo content. If you want to reuse it, check whether you hold a lawful content license. If reusing this sample leads to an infringement issue, the responsibility lies with the person who reused it.

Helper tools

The HTML Agility Pack ships with a tool called HAP Explorer, which lets developers quickly work out the XPath expression they need and use it to locate information within the HTML.

However, the author finds this tool too bare-bones (maybe Simon Mourier is not familiar with Windows Forms or WPF) and suggests that developers use the developer tools in Firefox or IE8 to work out their XPath expressions instead. These tools can also highlight a selected part of a page and show its position in the HTML, which makes building an XPath faster than writing it by hand.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
