HtmlAgilityPack: a powerful tool for parsing HTML and collecting web pages

Last Update:2014-09-24 Source: Internet

Author: User

HtmlAgilityPack is a free, open-source, third-party class library for .NET, mainly used to parse HTML documents on the server side (in browser/server applications, the client can use JavaScript to parse HTML). At the time of writing, the latest version of HtmlAgilityPack is 1.4.0. Download: http://htmlagilitypack.codeplex.com/

After downloading and extracting the package you will find three files. HtmlAgilityPack.dll (the assembly) and HtmlAgilityPack.xml (the documentation file that Visual Studio 2008 uses for IntelliSense and help text) can be added to a solution and used without installing anything, which makes the library very "green" (portable).

Add using HtmlAgilityPack; at the top of the C# class file to use the types in that namespace. In practice almost everything is built on the HtmlDocument class, which closely resembles the XmlDocument class in the Microsoft .NET Framework. XmlDocument operates on XML documents, while HtmlDocument operates on HTML documents (it can in fact handle XML documents as well). Both are based on the DOM. The difference is that HtmlDocument drops methods such as GetElementsByTagName and strengthens lookup by id: its method (spelled GetElementbyId in HtmlAgilityPack) works directly on any HTML document, whereas XmlDocument.GetElementById only finds elements whose ID attribute is declared in a DTD. In HtmlAgilityPack, nodes are located almost entirely with XPath expressions. For a reference on XPath syntax, see: http://www.w3school.com.cn/xpath/xpath_syntax.asp.
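
The two ways of locating a node described above can be sketched as follows. This is a minimal console example rather than the author's code: it assumes the HtmlAgilityPack assembly is referenced, and the sample markup, the id "main" and the class "intro" are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class LocateNodesSketch
{
    static void Main()
    {
        // Hypothetical sample markup; any HTML string would do.
        string html = "<html><body><div id='main'>"
                    + "<p class='intro'>Hello</p><p>World</p>"
                    + "</div></body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of downloading

        // Lookup by id. Note the lowercase 'b' in the method name.
        HtmlNode main = doc.GetElementbyId("main");
        if (main == null)
        {
            Console.WriteLine("No element with id 'main'");
            return;
        }
        Console.WriteLine(main.Name);

        // XPath over the whole document: first <p> with class "intro".
        HtmlNode intro = doc.DocumentNode.SelectSingleNode("//p[@class='intro']");
        if (intro != null)
            Console.WriteLine(intro.InnerText);

        // XPath relative to a node: the <p> children of the div.
        HtmlNodeCollection paragraphs = main.SelectNodes("./p");
        if (paragraphs != null)
            foreach (HtmlNode p in paragraphs)
                Console.WriteLine(p.InnerText);
    }
}
```

Loading from a string with LoadHtml keeps the sketch independent of any live web page; the same XPath calls apply to a document obtained with HtmlWeb.Load.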

For example, to collect the titles of the recommended articles on the cnblogs homepage, you can write the following code in ASP.NET:

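// Download the page and parse it into a DOM tree.
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument htmlDoc = htmlWeb.Load(@"http://www.cnblogs.com/");
// Select every <a> element whose class attribute is exactly "titlelnk".
HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//a[@class='titlelnk']");
// SelectNodes returns null when nothing matches, so guard the loop.
if (anchors != null)
    foreach (HtmlNode anchor in anchors)
        Response.Write(anchor.InnerHtml + "<br/>");
Response.End();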

This code parses the static HTML text of the downloaded homepage into a DOM node tree, then uses an XPath expression to select every a element in the document whose class attribute is titlelnk. The two most commonly used methods for obtaining nodes are SelectNodes("xpath expression") and SelectSingleNode("xpath expression"). The former returns an HtmlNodeCollection; the latter returns the first node that matches, as an HtmlNode. The foreach loop that follows writes out the inner HTML of each a element.
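
To make the difference between the two methods concrete, here is a small sketch that runs both against an in-memory document. The markup and the href values are invented for illustration, and the code assumes the HtmlAgilityPack assembly is referenced. One detail worth showing is that SelectNodes typically returns null rather than an empty collection when nothing matches, so the result should be checked before looping.

```csharp
using System;
using HtmlAgilityPack;

class SelectSketch
{
    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        // Hypothetical markup that imitates a list of article links.
        doc.LoadHtml("<ul>"
            + "<li><a class='titlelnk' href='/a'>First</a></li>"
            + "<li><a class='titlelnk' href='/b'>Second</a></li>"
            + "</ul>");

        // SelectNodes: every node that matches the expression.
        HtmlNodeCollection all = doc.DocumentNode.SelectNodes("//a[@class='titlelnk']");
        if (all != null)
        {
            foreach (HtmlNode a in all)
                Console.WriteLine(a.GetAttributeValue("href", "") + " -> " + a.InnerText);
        }

        // SelectSingleNode: only the first match, or null if there is none.
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//a[@class='titlelnk']");
        Console.WriteLine(first != null ? first.InnerText : "(no match)");

        // An expression that matches nothing; handle null as well as empty.
        HtmlNodeCollection none = doc.DocumentNode.SelectNodes("//a[@class='missing']");
        Console.WriteLine(none == null ? "no matches" : none.Count + " matches");
    }
}
```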

In general, parsing HTML with HtmlAgilityPack is more efficient and more accurate than parsing it with regular expressions, in terms of both development effort and runtime performance. HtmlAgilityPack is also very flexible. For example, if you change the body of the foreach loop in the code above to Response.Write(anchor.OuterHtml + "<br/>"), the output is the hyperlink element itself rather than its inner content. You can even modify the hyperlink itself:

foreach (HtmlNode anchor in anchors)
{
    // Add an inline style so the link renders in red.
    anchor.Attributes.Add("style", "color:red");
    Response.Write(anchor.OuterHtml + "<br/>");
}

After this change, the hyperlinks are displayed in red. You can manipulate the DOM node tree that HtmlAgilityPack produces however you like, as if you had a Christmas tree of your own to trim and prune at will. Regular expressions cannot match this either. HtmlAgilityPack is also very tolerant of the structure of the source text: it works normally even when there is no root element, in sharp contrast to XmlDocument, whose requirements are very strict. Familiarity with XPath syntax is the key to parsing HTML documents with HtmlAgilityPack. Fortunately, XPath syntax is relatively simple, and a few hours of study is enough for most applications. Built on the efficient, general-purpose structure of the DOM and the powerful, concise syntax of XPath, HtmlAgilityPack can fairly be called "a powerful weapon for parsing HTML and collecting web pages".
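
The tolerance for rootless input and the free editing of the tree can be sketched together. This example is illustrative only: the fragment and the class name "ad" are hypothetical, and it assumes the HtmlAgilityPack assembly is referenced. It loads markup that has several top-level elements and no html root (which XmlDocument.LoadXml would reject), removes one node, restyles another, and prints the result.

```csharp
using System;
using HtmlAgilityPack;

class EditTreeSketch
{
    static void Main()
    {
        // Hypothetical fragment: no <html> root, three top-level siblings.
        string fragment = "<h1>Title</h1>"
                        + "<div class='ad'>Remove this</div>"
                        + "<span>Body</span>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(fragment); // accepted even though there is no single root

        // Prune a node from the tree.
        HtmlNode ad = doc.DocumentNode.SelectSingleNode("//div[@class='ad']");
        if (ad != null)
            ad.Remove();

        // Decorate another node; SetAttributeValue adds or overwrites.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//h1");
        if (title != null)
            title.SetAttributeValue("style", "color:red");

        // Serialize the modified tree back to HTML text.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

SetAttributeValue is used here instead of Attributes.Add because it overwrites an existing attribute of the same name rather than appending a second one.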

How do I get outerHTML content when collecting web page data with PHP?

A PHP script simply requests the page to obtain its source code. This has nothing to do with the browser or with how the page is displayed.

If there are many pages, collect one or a few pages at a time, then refresh or jump to the next page.

How can I use Java to parse the JavaScript on an HTML page? Some web pages contain many JavaScript scripts.

I have never done this.
My personal view is that this is the browser's job.
I'm afraid it would not be easy to do it yourself.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
