Web Page Grabber Writing Experience, Part 1: Static Page Grabbers

Strictly speaking, a collector and a crawler are not the same thing: a collector parses a data source with a known, specific structure and extracts the required data from it, while a crawler's main goal is to gather the links and titles of pages.

I have written quite a few collectors, so here I am casually writing down some experience, mainly as a memo to myself.

The first and simplest case is the static page grabber. That is, the data source pages being collected are static, or at least the part of the data the collector cares about is static, and the full page code containing the target data can be obtained by directly requesting the page URL. This kind of collector is the most commonly used and also the most basic. There are plenty of mature, commercially available collector products, but to me they feel overly complicated: some of the issues I pay attention to when writing my own collectors do not seem to be handled by these products, or at least are not named the way I would look for them. After trying them a few times, I found it quicker to simply write my own.

Prerequisite knowledge: HTTP protocol basics, HTML basics, regular expressions, and any programming tool that supports regular expressions (.NET, Java, PHP, Python, Ruby, and so on).

The first step is to download the target page's HTML.

There is nothing too difficult about this step. .NET has the HttpWebRequest and HttpWebResponse classes, and other languages have similar facilities. The important point is that the downloader's parameter configuration must be flexible: fields such as User-Agent, Referer, and Cookie must be settable to match a normal browser, and proxy servers must be supported, in order to get past the target server's access restrictions or robot-detection policies. Related techniques, such as common robot detection and how to counter it, will be covered in later articles.
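
As a minimal sketch (not the author's actual code), here is such a downloader in C# using the HttpWebRequest/HttpWebResponse classes mentioned above; all parameter values are supplied by the caller:

```csharp
using System.IO;
using System.Net;
using System.Text;

class PageDownloader
{
    public static string DownloadPage(string url, string userAgent, string referer,
                                      CookieContainer cookies, IWebProxy proxy = null)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Flexible, caller-supplied fields so the request resembles a real browser.
        request.UserAgent = userAgent;
        request.Referer = referer;
        request.CookieContainer = cookies;

        // Optional proxy, useful against per-IP access restrictions.
        if (proxy != null)
            request.Proxy = proxy;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(),
                                             Encoding.UTF8)) // assumes a UTF-8 page
        {
            return reader.ReadToEnd();
        }
    }
}
```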

After the page code has been downloaded locally, it has to be parsed. There are two ways to parse it:

1. Parse it as HTML

If you are familiar with HTML, you can parse the downloaded page directly as HTML, which is also the quickest and most efficient approach. Iterate over the HTML elements and attributes, locate the part that contains the target data, and read the data through its elements, element attributes, and child elements. .NET has no built-in HTML parsing library, but third-party libraries are easy to find, and most of them are quite usable; at least for the common case of pulling a few pieces of data out of a page, they are more than enough. The only things to watch for are an incompletely downloaded page and structural errors in the target page.
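
For instance, with the third-party HtmlAgilityPack library (one of the available .NET libraries; the article does not name a specific one), walking the elements might look like this sketch; the XPath is illustrative:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package

class HtmlParseExample
{
    public static void Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find every link under the node that holds the target data.
        var links = doc.DocumentNode.SelectNodes("//ol[@id='b_results']//a");
        if (links == null)
            return; // incomplete download or changed page structure

        foreach (var a in links)
        {
            string url = a.GetAttributeValue("href", "");
            string title = a.InnerText;
            Console.WriteLine(title + " -> " + url);
        }
    }
}
```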

2. Treat it as a string and parse it with regular expressions

The benefit of regular expressions is flexibility; consider them when method 1 fails or is cumbersome (for example, when the path to the HTML element holding the target data is not fixed). The idea is to find the characteristic strings of the target data and its context, and then write a regular expression to extract it.

The following example analyzes Bing's search results page and illustrates the fundamentals of how a static collector works.

The first step is page fetching. A couple of clicks are enough to work out the pattern of the page parameters, for example:

http://cn.bing.com/search?q=MOLLE+II&first=31

This URL represents a search for the two keywords "MOLLE" and "II", with the fourth page of results currently shown. The first parameter is the index of the first search result displayed on the page; the fourth page shows results 31-40.

Here the parameters are passed via GET, which is the case for most pages. If the target page passes its parameters via POST, capture a request in the browser's developer tools and see what the parameters are.
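
As a worked example of the paging rule above (assuming 10 results per page, as in the sample URL), the page URL can be built like this:

```csharp
using System;

class BingUrlBuilder
{
    // first = index of the first result on the page:
    // page 4 -> (4 - 1) * 10 + 1 = 31
    public static string BuildUrl(string query, int page)
    {
        int first = (page - 1) * 10 + 1;
        return "http://cn.bing.com/search?q=" + Uri.EscapeDataString(query)
             + "&first=" + first;
    }
}
```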

Then we download the target page and open it in a regular expression tester.

The off-the-shelf testers felt clumsy for this kind of work, so I simply wrote myself a tool that fits my own workflow.

Our goal is to extract the link text and the link URL from the search results. When two or more matching sets of data need to be parsed from the same page, there are two strategies: either write a separate expression for each kind of data and run each over the whole page (for example, first run one regex over the page to get all the link title texts, then run another to get all the link URLs), or analyze the page structure to find the smallest structural unit that contains one complete data item (such as a <tr> row element in an HTML table), extract those units first, and then parse each one. The latter is somewhat more reliable and avoids a lot of interference, at the cost of a little more work. The latter method is the one used below.

Analyzing the page code with the browser's inspect tool (Chrome used to call it "Inspect Element"; newer versions just call it "Inspect", which took me a while to find), we can see that all the search results are contained in an <ol> tag whose id attribute is "b_results". Write an expression to extract it:
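
A sketch of such an expression in .NET (the id casing and the assumption that no nested </ol> appears inside the block are based on the live page, not the article):

```csharp
using System.Text.RegularExpressions;

class ResultListExtractor
{
    public static string ExtractResultList(string html)
    {
        // Singleline lets '.' match newline characters too (explained below).
        var match = Regex.Match(html,
            @"<ol\s+id=""b_results"".*?</ol>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return match.Success ? match.Value : null;
    }
}
```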

When parsing HTML with regular expressions, zero-width assertions (lookahead and lookbehind) are frequently used to extract strings that sit between specific prefixes and suffixes. Technical blogs have plenty of articles on regular expressions, so I will not go into them here.
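
A generic illustration (not tied to the Bing page): extracting just the value of an href attribute, with the prefix and suffix asserted but not captured:

```csharp
using System;
using System.Text.RegularExpressions;

class LookaroundExample
{
    public static void Demo()
    {
        string input = "<a href=\"http://example.com/\">Example</a>";

        // (?<=href=") is a zero-width lookbehind, (?=") a zero-width lookahead;
        // neither is part of the matched text.
        var m = Regex.Match(input, "(?<=href=\")[^\"]+(?=\")");
        Console.WriteLine(m.Value); // prints: http://example.com/
    }
}
```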

Note, however, that the .NET regular expression library has a few option switches to be aware of. For parsing HTML, the Singleline option is usually needed, so that the engine treats the newline characters in the string as ordinary characters rather than as the end of a line of data. This is not absolute, though; configure it flexibly according to the actual situation.

There is also a little trick. Now that mobile devices are everywhere, some websites serve different pages depending on the User-Agent in the browser's request: a request that appears to come from a phone gets the mobile version of the page. To save the visitor's bandwidth, the mobile version is generally cleaner than the PC version, with less noise.
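
For example, using the downloader sketched earlier with a phone-style User-Agent (the string below is an illustrative old Nokia-style value, not one quoted in the article):

```csharp
// Many sites return their lighter mobile page when they see a UA like this.
string mobileUa = "Nokia6280/2.0 (03.60) Profile/MIDP-2.0 Configuration/CLDC-1.1";
string html = PageDownloader.DownloadPage(
    "http://cn.bing.com/search?q=MOLLE+II&first=31",
    mobileUa, null, new System.Net.CookieContainer());
```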

Back to page parsing. We have just found the page structure that contains all the target elements. In fact, if the smallest structure containing the target data also has features that are unique within the page, there is no harm in extracting it directly:
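
A sketch of that direct extraction; the class name "b_algo" on the result <li> tags is an assumption based on the live Bing page, not taken from the article:

```csharp
using System.Text.RegularExpressions;

class ResultItemExtractor
{
    public static MatchCollection ExtractItems(string html)
    {
        // Each search result sits in its own <li>; grab them one by one.
        return Regex.Matches(html,
            @"<li\s+class=""b_algo"".*?</li>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```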

This gives us the content of every <li> tag that contains the target data. Incidentally, because the tool sends the User-Agent of a Nokia phone, I got the mobile version of the page, which differs slightly from the PC version and is a little cleaner.

The next step is to parse each item. Since all the <li> tags share the same structure, we can parse them all with the same regular expression.

Our goal is the link title and the link URL; to put it plainly, the <a> tag's href attribute and the tag's content.

So the expression can be written directly:
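
A sketch of that expression, using .NET named capture groups (the pattern itself is illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

class LinkParser
{
    public static void ParseItem(string liContent)
    {
        // href attribute -> "url" group, tag content -> "title" group.
        var m = Regex.Match(liContent,
            @"<a\s[^>]*href=""(?<url>[^""]+)""[^>]*>(?<title>.*?)</a>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        if (m.Success)
        {
            // Strip inner tags (e.g. <strong>) left inside the title text.
            string title = Regex.Replace(m.Groups["title"].Value, "<[^>]+>", "");
            Console.WriteLine(title + " -> " + m.Groups["url"].Value);
        }
    }
}
```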

Then simply apply the same expression to the content of each <li> tag.

With that, the basic principle of the collector is complete. The regex tool I wrote can be found on my blog; feel free to use it, and bug reports and feature suggestions are welcome.

The next article will introduce dynamic page data acquisition.
