HTML Agility Pack with ScrapySharp: taking the pain out of HTML parsing

Web applications have kept growing since the W3C was founded in 1994, and HTML has evolved through several versions (1.0, 2.0, 3.0, 3.2, 4.0, 4.01) to become the foundation of every web page and web application. Anyone who designs web pages or develops web applications has to learn it. Even with convenient controls (such as those in ASP.NET), you still need to understand HTML; not knowing HTML is like never having learned web development at all.

Thanks to the rapid growth of HTML and web browsers, all kinds of applications have spread across the network: e-commerce, enterprise portals, online ordering, collaboration between enterprises, and even social networking, personalization, Web 2.0, and other capabilities for business and organizational use. In this age of information explosion, many information-integration applications have also appeared. They connect to different websites, download their pages, and dig the desired data (such as a share price, its rise or fall, and its trading volume) out of a heavy pile of HTML.

However, HTML is not a strictly structured language. It allows tags to be left unclosed, which works only because browsers are designed to be highly fault tolerant. As a result, parsing HTML files by rule is almost impossible, and a site's HTML structure may change at any time, which makes it harder still. The W3C has promoted XHTML (HTML in XML form), but few sites are built with it and most still use plain HTML. So we need a tool that can quickly parse HTML and retrieve the data we need.

As we all know, an HTML document is really just a string with tags in it. So the first approach that comes to mind for parsing HTML is string comparison: write a pattern for the HTML structure and look for it in the text, for example:

[C#]

string pattern = "<td id='stockprice'>";

int index = html.IndexOf(pattern);
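
A slightly fuller sketch of the same idea may help show where it breaks. The sample markup and the helper name ExtractAfterMarker are made up for illustration; they do not come from any real site or library.

```csharp
using System;

static class StringSearchDemo
{
    // Hypothetical helper: returns the text between `marker` and the next '<',
    // or null when the marker does not occur in the page.
    public static string ExtractAfterMarker(string html, string marker)
    {
        int start = html.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += marker.Length;
        int end = html.IndexOf('<', start);
        if (end < 0) end = html.Length;
        return html.Substring(start, end - start).Trim();
    }

    static void Main()
    {
        string marker = "<td id='stockprice'>";

        // Exact match: the value right after the marker is returned.
        Console.WriteLine(
            ExtractAfterMarker("<tr><td id='stockprice'>12.34</td></tr>", marker));

        // Same data with a double-quoted attribute: the literal marker no longer matches.
        Console.WriteLine(
            ExtractAfterMarker("<tr><td id=\"stockprice\">12.34</td></tr>", marker)
            ?? "(not found)");
    }
}
```

The second call fails even though a browser would render both fragments identically, which is the kind of fragility that pushes people away from raw string comparison.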

However, plain string comparison performs poorly and cannot express any regularity in the markup. That is where regular expressions come in, such as the following:

[Regular expression]

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

However, regular expressions have a steep learning curve. For the average developer, using and customizing them to parse HTML is far from friendly.
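
To make the trade-off concrete, here is a minimal sketch using System.Text.RegularExpressions. The pattern, the sample markup, and the names RegexDemo and ExtractPrice are illustrative assumptions, not something taken from a real page.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexDemo
{
    // Tolerates either quote style, extra whitespace, and tag-name casing,
    // but still assumes that id is the first attribute of the cell.
    static readonly Regex PriceCell = new Regex(
        @"<td\s+id\s*=\s*['""]stockprice['""]\s*>(?<price>[^<]*)<",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the cell text, or null when the pattern does not match.
    public static string ExtractPrice(string html)
    {
        Match m = PriceCell.Match(html);
        return m.Success ? m.Groups["price"].Value.Trim() : null;
    }

    static void Main()
    {
        // Spacing, casing, and quoting differences are absorbed by the pattern.
        Console.WriteLine(ExtractPrice("<TD id = \"stockprice\"> 12.34 </TD>"));

        // One extra attribute in front of id is enough to defeat it.
        Console.WriteLine(
            ExtractPrice("<td class='num' id='stockprice'>12.34</td>") ?? "(not found)");
    }
}
```

The pattern is already hard to read, and every new variation in the markup means another round of edits to it.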

Another characteristic of HTML is that it is hierarchical. A browser interprets it as a document tree and walks it recursively, but regular expressions have no notion of hierarchy. The most useful tool for hierarchical analysis is an XML parser, which offers DOM and XPath support. However, an XML parser cannot read ordinary HTML (it can read XHTML), because ordinary HTML is loosely structured. While reading, an XML parser checks that the syntax is complete (that is, well-formed) and throws an exception if it is not, so it cannot be used on HTML directly.
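
The failure is easy to reproduce with the XML parser in System.Xml. In this small sketch the class name, the helper name IsWellFormed, and the sample fragments are illustrative assumptions.

```csharp
using System;
using System.Xml;

static class XmlParserDemo
{
    // Returns true when System.Xml accepts the markup as well-formed XML.
    public static bool IsWellFormed(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unclosed or mismatched tags end up here.
            return false;
        }
    }

    static void Main()
    {
        // Every tag is closed (XHTML style): accepted.
        Console.WriteLine(IsWellFormed("<ul><li>one</li><li>two</li></ul>"));  // True

        // Typical loose HTML that browsers accept: rejected.
        Console.WriteLine(IsWellFormed("<ul><li>one<li>two</ul>"));            // False
        Console.WriteLine(IsWellFormed("<p>line<br>break</p>"));               // False
    }
}
```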

HTML Agility Pack is a tool written by Simon Mourier, a French software architect, and developed together with DarthObiwan and Jessynoo. It makes parsing loosely formatted HTML as simple as parsing XML. Many of its types resemble the XML DOM types in the System.Xml namespace, and besides object-based access to the HTML it also supports searching it with XPath, which is much clearer than the text comparison or regular expression approaches above.
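
As a rough sketch of what that looks like in practice, the following loads a deliberately sloppy fragment and queries it with XPath. It assumes the HtmlAgilityPack package is referenced; the sample markup and the names HapXPathDemo and ReadPrice are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class HapXPathDemo
{
    // Hypothetical helper: returns the text of the cell with id "stockprice",
    // or null when no such cell exists.
    public static string ReadPrice(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // loose markup is accepted instead of raising an exception

        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td[@id='stockprice']");
        return cell == null ? null : cell.InnerText.Trim();
    }

    static void Main()
    {
        // Invented sample: unquoted attributes, mixed-case tags,
        // an unclosed <br> and a <tr> that is never closed.
        string html =
            "<table><tr><TD class=num id=stockprice>12.34</TD><td>+0.56<br></td></table>";

        Console.WriteLine(ReadPrice(html) ?? "(not found)");
    }
}
```

The XPath expression stays the same no matter where the id attribute sits in the tag or how the attribute is quoted, which is exactly what the string and regular expression approaches could not offer.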

To use HTML Agility Pack, first download the binaries from the HTML Agility Pack site on CodePlex (the source code, the documentation file, and the HAP Explorer tool are also available there), unzip the package, and add a reference to the DLL.

There are about 28 classes in the HTML Agility Pack source code. It is not a very complex class library, but it is far from weak: it offers solid support for parsing the DOM, similar to jQuery's DOM operations. It does not have many core classes, either. For DOM parsing only two classes are in common use, HtmlDocument and HtmlNode, plus the HtmlNodeCollection collection class.
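
A short sketch can show how these three classes fit together. It assumes the HtmlAgilityPack package is referenced; the sample markup and the names HapNodesDemo and CollectLinks are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HapNodesDemo
{
    // Hypothetical helper: collects the href of every <a> element that has one.
    public static List<string> CollectLinks(string html)
    {
        var doc = new HtmlDocument();   // HtmlDocument: the parsed page
        doc.LoadHtml(html);

        var links = new List<string>();

        // HtmlNodeCollection: the result of an XPath query.
        // It may be null when nothing matches, so check before iterating.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode a in anchors) // HtmlNode: a single element in the tree
        {
            links.Add(a.GetAttributeValue("href", ""));
        }
        return links;
    }

    static void Main()
    {
        // Invented sample with unclosed <li> tags.
        string html =
            "<ul><li><a href='/first'>First</a><li><a href='/second'>Second</a><li>no link</ul>";

        foreach (string href in CollectLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```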

Working with HTML Agility Pack directly is still rather tedious. The component introduced below, ScrapySharp, wraps HTML Agility Pack in two ways so that parsing HTML pages is no longer painful, and the happiness index rises to 90 points.

First, ScrapySharp has a class that behaves like a real browser (handling the Referer header, cookies, and so on). Second, it offers jQuery-like CSS selectors and LINQ syntax. Together these make it comfortable to use. Its code is hosted at https://bitbucket.org/rflechner/scrapysharp. You can also add it through NuGet.

Let's look at code that parses a blog post on cnblogs (the Blog Garden):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace HtmlAgilityDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download the page with ScrapySharp's browser class.
            var uri = new Uri("http://www.cnblogs.com/shanyou/archive/2012/05/20/2509435.html");
            var browser1 = new ScrapingBrowser();
            var html1 = browser1.DownloadString(uri);

            // Parse the downloaded markup with HTML Agility Pack.
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html1);
            var html = htmlDocument.DocumentNode;

            // Select by tag name.
            var title = html.CssSelect("title");
            foreach (var htmlNode in title)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by tag name and CSS class.
            var divs = html.CssSelect("div.postbody");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by element ID.
            divs = html.CssSelect("#cnblogs_post_body");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }
        }
    }
}

Basic examples of CssSelect usage:

var divs = html.CssSelect("div"); // all div elements

var nodes = html.CssSelect("div.content"); // all div elements with CSS class 'content'

var nodes = html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes

var nodes = html.CssSelect("#postpaging"); // all HTML elements with the ID postpaging

var nodes = html.CssSelect("div#postpaging.testclass"); // all div elements with the ID postpaging and CSS class testclass

var nodes = html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with CSS class 'content'

var nodes = html.CssSelect("input[type=text].login"); // text boxes with CSS class login

We can also select ancestors of elements:

var nodes = html.CssSelect("p.para").CssSelectAncestors("div.content > div.widget");
