HTML Agility Pack with ScrapySharp: taking the pain out of HTML parsing (repost)


Web applications have been evolving since 1993, and HTML has gone through several versions along the way (1.0, 2.0, 3.0, 3.2, 4.0, 4.01). It has become the foundation of web pages and web applications, so anyone who wants to design web pages or develop web applications has to learn it. Even with convenient controls such as those in ASP.NET, HTML is still required knowledge: if you do not know HTML, you have not really learned web development.

Thanks to the boom in HTML and web browsers, a wide variety of applications have developed rapidly on the Internet, including e-commerce, enterprise portals, online ordering, enterprise collaboration, and social, personalized, Web 2.0 style services for businesses and organizations. In this era of information explosion, many information-integration applications have also appeared. They connect to different websites, download their pages, and extract the desired data from large amounts of HTML (for example, share price, percentage change, and volume).

But HTML itself is not a strictly structured language: it allows tags to be used without being closed. Browsers are also designed to be highly fault tolerant, so real-world pages rarely follow the rules, and parsing an HTML file strictly by the rules is almost impossible. On top of that, the HTML structure of the target site may change at any time, which makes parsing HTML very laborious. XHTML (HTML that follows the strict XML format) exists, but pages written in it are still a minority, and most sites still use plain HTML. So we need a tool that can quickly parse HTML and get at the data we need.

As we all know, HTML is really just a string made up of HTML tags, so the first approach that comes to mind for parsing it is string comparison: write a pattern based on the structure of the HTML in question, then use string functions to look for it, for example:

[C#]

string pattern = "";
html.IndexOf(pattern);
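As a slightly fuller illustration of this string-comparison approach, here is a minimal sketch that pulls the text between two literal markers with IndexOf. The helper name Between and the sample markup are hypothetical, chosen only for this example.

```csharp
using System;

static class StringScan
{
    // Returns the text between the first occurrence of `open` and the next
    // occurrence of `close`, or null when either marker is missing.
    public static string Between(string html, string open, string close)
    {
        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += open.Length;
        int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
        return end < 0 ? null : html.Substring(start, end - start);
    }

    static void Main()
    {
        // Hypothetical page fragment, used only for illustration.
        string html = "<html><head><title>Quotes</title></head><body></body></html>";
        Console.WriteLine(Between(html, "<title>", "</title>")); // prints "Quotes"
    }
}
```

This works only while the markers appear exactly as written. An extra attribute such as <title lang="en"> is enough to make the search fail, which is the weakness of the approach.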

However, plain string comparison performs poorly and cannot express rules, which is why regular expressions were developed. For example:

[Regular Expression]

\s]+))?) +\s*|\s*)/?<

But regular expressions have a steep learning curve, and using them to parse HTML, let alone customizing the patterns afterwards, is not approachable for the average developer.
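To make the trade-off concrete, here is a minimal sketch that extracts link targets with System.Text.RegularExpressions. The pattern is a deliberately simple one written for this example (it is not the expression quoted above), and the class name and sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexScan
{
    // Deliberately simple: matches <a ... href="..."> with double quotes only.
    static readonly Regex Href =
        new Regex("<a\\s+[^>]*href=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Returns the href value of every anchor the pattern recognizes.
    public static List<string> Links(string html)
    {
        var result = new List<string>();
        foreach (Match m in Href.Matches(html))
            result.Add(m.Groups[1].Value);
        return result;
    }

    static void Main()
    {
        // Hypothetical markup: the third link uses single quotes.
        string html = "<a href=\"/a\">A</a> <A class='x' HREF=\"/b\">B</a> <a href='/c'>C</a>";
        foreach (string link in Links(html))
            Console.WriteLine(link); // prints /a and /b, but misses /c
    }
}
```

Covering single quotes, unquoted values, and odd whitespace means growing the pattern further each time, which is where the maintenance cost comes from.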

HTML has another characteristic: it is hierarchical. A browser interprets it as a document tree and then processes that tree recursively, but regular expressions have no support for hierarchical analysis. The closest useful tool for this kind of parsing is an XML parser, whose DOM and XPath features make parsing XML much easier. However, an XML parser cannot read ordinary HTML (only XHTML), because ordinary HTML is loosely structured. An XML parser checks that the syntax is complete (that is, that the document is well-formed), and it throws an exception when it reads loosely structured content, so it cannot be used directly.
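The strictness of an XML parser is easy to demonstrate with System.Xml. The following is a minimal sketch; the helper name IsWellFormedXml and the sample markup are hypothetical.

```csharp
using System;
using System.Xml;

static class XmlStrictness
{
    // Returns true when the markup loads as XML, false when the parser rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // XHTML style: every tag is closed.
        Console.WriteLine(IsWellFormedXml("<p>line one<br/>line two</p>")); // True
        // Ordinary HTML: <br> is never closed, so the XML parser throws.
        Console.WriteLine(IsWellFormedXml("<p>line one<br>line two</p>"));  // False
    }
}
```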

HTML Agility Pack is a library created by the French software architect Simon Mourier, with Darthobiwan and Jessynoo as co-developers. It makes parsing loosely formatted HTML as simple as parsing XML. It provides many classes similar to the XML DOM classes in the System.Xml namespace, and in addition to accessing HTML hierarchically, it supports searching HTML with XPath, which is far more explicit than the earlier approaches based on string literals or regular expressions.
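As a minimal sketch of this XPath support, the following loads a loosely written fragment and queries it with SelectSingleNode. It assumes the HtmlAgilityPack assembly is referenced; the class name, helper name, and sample markup are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

static class HapXPathSketch
{
    // Returns the inner text of the first node matching the XPath,
    // or null when nothing matches.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant: does not throw on loose HTML
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    static void Main()
    {
        // Hypothetical quote fragment with unquoted attributes and an unclosed <br>.
        string html = "<div id=quote><span class=price>12.5</span><br>"
                    + "<span class=change>+0.8%</span></div>";
        Console.WriteLine(FirstText(html, "//span[@class='price']"));
        Console.WriteLine(FirstText(html, "//span[@class='volume']") ?? "(no match)");
    }
}
```

Note that SelectSingleNode and SelectNodes return null rather than an empty result when nothing matches, so the null check is needed.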

To use the HTML Agility Pack, first download the binaries from the HTML Agility Pack site on CodePlex (the source code, documentation, and the HAP Explorer utility are also available there). After unzipping, add a reference to HtmlAgilityPack.dll in your project.

The HTML Agility Pack source contains only about 28 classes, so it is not a very complex class library, but it is far from weak: it provides enough functionality for DOM parsing to be compared with manipulating the DOM in jQuery :) There are not many commonly used base classes. For parsing the DOM, the two you will use most are HtmlDocument and HtmlNode, plus the HtmlNodeCollection collection class.
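The roles of these three classes can be sketched with a small recursive walk: HtmlDocument loads the markup, each element is an HtmlNode, and the ChildNodes property of a node is an HtmlNodeCollection. This assumes the HtmlAgilityPack assembly is referenced; the class name, method names, and sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HapTreeWalk
{
    // Lists element names in document order, indented two spaces per level.
    public static List<string> ElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var names = new List<string>();
        Collect(doc.DocumentNode, 0, names);
        return names;
    }

    static void Collect(HtmlNode node, int depth, List<string> names)
    {
        HtmlNodeCollection children = node.ChildNodes;
        foreach (HtmlNode child in children)
        {
            // Skip text and comment nodes; only elements are listed.
            if (child.NodeType != HtmlNodeType.Element) continue;
            names.Add(new string(' ', depth * 2) + child.Name);
            Collect(child, depth + 1, names);
        }
    }

    static void Main()
    {
        string html = "<div><h1>Title</h1><ul><li>one</li><li>two</li></ul></div>";
        foreach (string line in ElementNames(html))
            Console.WriteLine(line);
    }
}
```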

Even so, working with the HTML Agility Pack directly is still rather cumbersome. The component introduced below is ScrapySharp, which wraps the HTML Agility Pack in two ways so that parsing HTML pages is no longer painful, and the happiness index shoots straight up to 90 points.

First, ScrapySharp has a wrapper class that simulates a real browser (handling the referrer, cookies, and so on). Second, it lets you use jQuery-like CSS selectors together with LINQ syntax. Both are very pleasant to use. Its code is hosted at https://bitbucket.org/rflechner/scrapysharp, and you can also add it through NuGet.

Let's look at code that parses a blog post on cnblogs (Blog Park):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace HtmlAgilityDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download the page with ScrapySharp's browser wrapper.
            var uri = new Uri("http://www.cnblogs.com/shanyou/archive/2012/05/20/2509435.html");
            var browser1 = new ScrapingBrowser();
            var html1 = browser1.DownloadString(uri);

            // Parse the downloaded markup with the HTML Agility Pack.
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html1);
            var html = htmlDocument.DocumentNode;

            // Select by tag name.
            var title = html.CssSelect("title");
            foreach (var htmlNode in title)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by tag name and CSS class.
            var divs = html.CssSelect("div.postBody");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by element ID.
            divs = html.CssSelect("#cnblogs_post_body");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }
        }
    }
}

Basic examples of CssSelect usage:

var divs = html.CssSelect("div"); // all div elements

var nodes = html.CssSelect("div.content"); // all div elements with CSS class 'content'

var nodes = html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes

var nodes = html.CssSelect("#postPaging"); // all HTML elements with the ID postPaging

var nodes = html.CssSelect("div#postPaging.testClass"); // all div elements with the ID postPaging and CSS class testClass

var nodes = html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with CSS class 'content'

var nodes = html.CssSelect("input[type=text].login"); // text boxes with CSS class login

We can also select ancestors of elements:

var nodes = html.CssSelect("p.para").CssSelectAncestors("div.content > div.widget");
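The selectors above can be tried without any network access by loading a small inline fragment. This is a minimal sketch, assuming both the HtmlAgilityPack and ScrapySharp assemblies are referenced; the class name, helper names, and sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

static class CssSelectSketch
{
    // Parses the markup and returns its root node.
    static HtmlNode Load(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode;
    }

    // Returns the inner text of every node matched by the CSS selector.
    public static List<string> Texts(string html, string selector)
    {
        return Load(html).CssSelect(selector).Select(n => n.InnerText).ToList();
    }

    static void Main()
    {
        // Hypothetical fragment, used only for illustration.
        string html = "<div class='widget'><div class='content'>"
                    + "<p class='para'>first</p><p>second</p></div></div>";

        // Child combinator: only p.para directly under div.content.
        foreach (string text in Texts(html, "div.content > p.para"))
            Console.WriteLine(text);

        // Walk upwards from the matched paragraphs to enclosing div.widget elements.
        var ancestors = Load(html).CssSelect("p.para").CssSelectAncestors("div.widget");
        Console.WriteLine(ancestors.Count());
    }
}
```

Because CssSelect returns an IEnumerable of HtmlNode, its results can be combined with ordinary LINQ operators such as Select, Where, and Count.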

Original address: http://www.cnblogs.com/shanyou/archive/2012/05/27/2520603.html
