HtmlAgilityPack: a powerful tool for parsing HTML and collecting web page data

Source: Internet
Author: User

HtmlAgilityPack is a free, open-source, third-party class library for .NET. It is small and is mainly used to parse HTML documents on the server side (in browser/server applications, the client can use JavaScript to parse HTML). As of this writing, the latest version of HtmlAgilityPack is 1.4.0. Download: http://htmlagilitypack.codeplex.com/

After downloading and unpacking the package, you will find three files. Of these, HtmlAgilityPack.dll (the assembly) and HtmlAgilityPack.xml (the documentation file that Visual Studio 2008 uses for IntelliSense and help text) can be referenced in a solution without installing anything, which makes the library very "green" (portable).

Add using HtmlAgilityPack; at the top of a C# class file to use the types in that namespace. In practice, almost everything is built on the HtmlDocument class, which closely resembles the XmlDocument class in the Microsoft .NET Framework. XmlDocument operates on XML documents, while HtmlDocument operates on HTML documents (it can in fact handle XML documents as well). Both are DOM-based. The difference is that HtmlDocument drops methods such as GetElementsByTagName and strengthens lookup by id: its method (spelled GetElementbyId in HtmlAgilityPack) works directly on any parsed document, whereas XmlDocument.GetElementById only finds elements whose ID attribute is declared in a DTD. In HtmlAgilityPack, nodes are mostly located with XPath expressions. For XPath syntax, see: http://www.w3school.com.cn/xpath/xpath_syntax.asp
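The two ways of locating a node described above can be sketched as follows. This is a minimal sketch, assuming the HtmlAgilityPack assembly is referenced by the project; the class name LocateNodesSketch, its helper methods, and the sample markup are hypothetical and exist only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class LocateNodesSketch
{
    // Hypothetical sample markup, not taken from any real page.
    public const string SampleHtml =
        "<html><body><div id='main'><span class='intro'>Hello</span><span>World</span></div></body></html>";

    // Looks a node up by its id attribute; returns null when no such id exists.
    public static string TextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                      // parse from a string, no network access
        HtmlNode node = doc.GetElementbyId(id);  // note the lowercase "b" in the method name
        return node == null ? null : node.InnerText;
    }

    // Locates the first node matching an XPath expression; returns null when nothing matches.
    public static string TextByXPath(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(TextById(SampleHtml, "main"));
        Console.WriteLine(TextByXPath(SampleHtml, "//span[@class='intro']"));
    }
}
```

Loading from a string with LoadHtml keeps the sketch independent of any live site; the same lookups apply to a document fetched over HTTP.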

For example, to collect the titles of the recommended articles on the blog home page (cnblogs.com), you can write the following code in ASP.NET:

    // Requires: using HtmlAgilityPack;
    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlDocument htmlDoc = htmlWeb.Load("http://www.cnblogs.com/");
    HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes("//a[@class='titlelnk']");
    // SelectNodes returns null when nothing matches, so check before looping.
    if (anchors != null)
    {
        foreach (HtmlNode anchor in anchors)
            Response.Write(anchor.InnerHtml + "<br/>");
    }
    Response.End();


This code parses the static HTML text of the fetched home page into a DOM node tree, then uses an XPath expression to get every a element in the document whose class attribute is titlelnk. The two most commonly used methods for getting nodes from a node are SelectNodes("xpath expression") and SelectSingleNode("xpath expression"). The former returns an HtmlNodeCollection; the latter returns the first node that matches, an instance of HtmlNode. The foreach loop that follows writes out the inner HTML of each a element.
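To make the difference between the two methods concrete, here is a small sketch that runs both against an in-memory string. It assumes the HtmlAgilityPack assembly is referenced; the class SelectSketch, its methods, and the sample list markup are hypothetical. Both helpers guard against null, which is what the library returns when an expression matches nothing.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SelectSketch
{
    // Hypothetical sample markup that imitates a list of article links.
    public const string SampleHtml =
        "<ul><li><a class='titlelnk' href='/a'>First</a></li>" +
        "<li><a class='titlelnk' href='/b'>Second</a></li></ul>";

    // SelectNodes: every match, or an empty list when nothing matches.
    public static List<string> AllTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var texts = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return texts;   // no match yields null, not an empty collection
        foreach (HtmlNode node in nodes)
            texts.Add(node.InnerText);
        return texts;
    }

    // SelectSingleNode: only the first match, or null when nothing matches.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        foreach (string text in AllTexts(SampleHtml, "//a[@class='titlelnk']"))
            Console.WriteLine(text);
        Console.WriteLine(FirstText(SampleHtml, "//a[@class='titlelnk']"));
    }
}
```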

 

Generally, HtmlAgilityPack is more efficient and more accurate than parsing HTML with regular expressions, in terms of both development effort and runtime performance. It is also very flexible. For example, if you change the body of the foreach loop in the code above to Response.Write(anchor.OuterHtml + "<br/>"), the output is the hyperlink element itself rather than its inner content. You can even modify the hyperlink:

    foreach (HtmlNode anchor in anchors)
    {
        // Add an inline style to each link before writing it out.
        anchor.Attributes.Add("style", "color:red");
        Response.Write(anchor.OuterHtml + "<br/>");
    }


After this change, the hyperlinks are rendered in red. You can manipulate the DOM node tree that HtmlAgilityPack produces however you like, as if you had a Christmas tree of your own to trim and prune at will. Regular expressions cannot match this. HtmlAgilityPack is also very tolerant of the structure of the source text and works even when there is no root element, unlike XmlDocument, which is very strict. Familiarity with XPath syntax is the key to parsing HTML documents with HtmlAgilityPack. Fortunately, XPath syntax is relatively simple, and a few hours of study is enough for most applications. With the efficient, general-purpose DOM structure and the powerful, concise XPath syntax behind it, HtmlAgilityPack can fairly be called "a powerful weapon for parsing HTML and collecting web pages".
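The contrast with XmlDocument can be illustrated with a fragment that has no single root element. The sketch below assumes the HtmlAgilityPack assembly is referenced; the class TolerantParsingSketch, its methods, and the fragment are hypothetical. It feeds the same text to both parsers, then edits the links in the tree that HtmlAgilityPack builds.

```csharp
using System;
using System.Xml;
using HtmlAgilityPack;

public static class TolerantParsingSketch
{
    // Hypothetical fragment: several top-level elements and no single root.
    public const string Fragment =
        "<a href='/x'>one</a><span>loose text</span><a href='/y'>two</a>";

    // Returns true when System.Xml's XmlDocument refuses to load the markup.
    public static bool XmlDocumentRejects(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return false;
        }
        catch (XmlException)
        {
            return true;   // strict XML parsing fails on multiple root elements
        }
    }

    // Parses the markup with HtmlAgilityPack, colors every link red,
    // and returns the markup of the modified tree.
    public static string ColorLinksRed(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links != null)
        {
            foreach (HtmlNode link in links)
                link.SetAttributeValue("style", "color:red");
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(XmlDocumentRejects(Fragment));
        Console.WriteLine(ColorLinksRed(Fragment));
    }
}
```

SetAttributeValue is used here instead of Attributes.Add so that running the method twice on the same tree would overwrite the style rather than append a second attribute.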

