A Guide to Building a Massive Data Acquisition Crawler Framework

Source: Internet
Author: User
Tags: regular expression, XPath

As the concept of big data has taken hold, everyone faces the question of how to build a system that can collect data at massive scale: how to achieve "what you see is what you can collect" without being blocked, how to quickly structure and store irregular pages, and how to satisfy ever-growing collection demands within limited time. This article is based on our own project experience.

Let's first look at how a person collects Web data by hand:

1. Open a browser and enter the URL to access the page.

2. Copy the title, author, and body of the page.

3. Save them to a text file or Excel.

From a technical point of view, the whole process consists of three steps: network access, structured extraction, and storage. Let's look at how to implement this process with a Java program.

import java.io.IOException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.lang.StringUtils;

public class HttpCrawler {

    public static void main(String[] args) {
        String content = null;
        try {
            HttpClient httpClient = new HttpClient();
            // 1. Network request
            GetMethod method = new GetMethod("http://www.baidu.com");
            int statusCode = httpClient.executeMethod(method);
            if (statusCode == HttpStatus.SC_OK) {
                content = method.getResponseBodyAsString();
                // 2. Structured extraction
                String title = StringUtils.substringBetween(content, "<title>", "</title>");
                // 3. Storage (here simply printed to the console)
                System.out.println(title);
            }
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
        }
    }
}

In this example, we use HttpClient to fetch the page, string manipulation to extract the title, and System.out to output the result. Doesn't writing a crawler seem quite simple? This is a basic getting-started example; next we will go into detail on how to build a distributed crawler framework suitable for massive data acquisition.

The framework consists of the following parts: resource management, anti-monitoring management, crawl management, and monitoring management.

Resource management refers to the management and maintenance of basic resources such as the website classification system, the websites themselves, and each website's entry URLs.

Anti-monitoring management addresses the fact that websites (especially social media) try to block crawler access; keeping them from detecting that our traffic comes from crawler software is the anti-monitoring mechanism.

A good collection framework should be able to collect anything a user can see, no matter where the target data lives: "what you see is what you collect," without being blocked, regardless of whether the data requires a login. Most social networking sites now require a login, so the crawler system must be able to simulate user login in order to access data normally. However, social sites want to form a closed loop and are reluctant to let their data leave the site, so they are not as open as news sites and other public content. Most of them impose restrictions to prevent robotic crawlers from scraping data, and an account that is detected crawling is usually blocked. Does that mean we cannot crawl data from these sites? Certainly not: as long as a social site does not shut off web access entirely, whatever data a normal person can reach, we can reach too. In the final analysis it comes down to simulating the normal behavior of a human operator, which is what we call "anti-monitoring."

What restrictions do websites generally impose?

A limit on the number of visits from a single IP within a given period. No one browses that fast for a sustained period; unless someone is idly clicking around, it never lasts long. This can be simulated with a large pool of irregularly rotating proxy IPs.

A limit on the number of visits from a single account within a given period. Same as above: a normal person does not operate like that. This can be handled with a large pool of accounts that exhibit normal behavior, i.e., the way ordinary people use social networking sites. An account that hits one data interface 24 hours a day is probably a robot.

If we can control the account and IP access policies, the problem is basically solved. Of course, the other side's operations team will also adjust its strategy; after all, this is a war between two parties hiding behind their screens. The crawler must be able to perceive that the other side's counter-measures have changed and notify the administrator to handle it promptly. Ideally, in the future, machine learning algorithms would adjust the policy automatically so that crawling never stops.
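To make the idea concrete, here is a minimal sketch of rotating proxies and accounts between requests with a per-account quota. The pool contents, the hourly limit, and the ProxyEndpoint/Account types are illustrative assumptions fed by the resource management module, not the article's actual implementation.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: pick a different proxy and account for each request and keep
// per-account request counts under a configurable threshold.
public class AccessPolicy {

    public static class ProxyEndpoint {
        public final String host;
        public final int port;
        public ProxyEndpoint(String host, int port) { this.host = host; this.port = port; }
    }

    public static class Account {
        public final String name;
        public final AtomicInteger requestsThisHour = new AtomicInteger();
        public Account(String name) { this.name = name; }
    }

    private final List<ProxyEndpoint> proxies;
    private final List<Account> accounts;
    private final int maxRequestsPerAccountPerHour;

    public AccessPolicy(List<ProxyEndpoint> proxies, List<Account> accounts, int maxPerHour) {
        this.proxies = proxies;
        this.accounts = accounts;
        this.maxRequestsPerAccountPerHour = maxPerHour;
    }

    // Rotate proxies randomly so requests do not all come from a single IP.
    public ProxyEndpoint nextProxy() {
        return proxies.get(ThreadLocalRandom.current().nextInt(proxies.size()));
    }

    // Pick the least-used account that is still under its hourly budget.
    public Account nextAccount() {
        Account best = null;
        for (Account a : accounts) {
            if (a.requestsThisHour.get() >= maxRequestsPerAccountPerHour) continue;
            if (best == null || a.requestsThisHour.get() < best.requestsThisHour.get()) best = a;
        }
        if (best != null) best.requestsThisHour.incrementAndGet();
        return best; // null means every account is exhausted; the scheduler should back off
    }
}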

Crawl management refers to fetching and storing data by combining URLs with the resource and anti-monitoring layers. Most crawler systems today require developers to hand-write regular expressions, or to hard-code against HtmlParser, Jsoup and similar libraries, to solve structured extraction. Anyone who writes crawlers soon discovers that writing one class per site is acceptable for a handful of sites, but if thousands of sites need to be crawled, do we really want to develop thousands of classes? So we developed a generic crawl class whose internal scheduling logic is driven by parameters. For example, when the parameters specify Sina Weibo, the crawler dispatches the Sina Weibo extraction rules to pull the node data from the page and calls the storage rules to persist it; whatever the site type, the same class ultimately handles it. Users only need to configure the crawl rules; the subsequent processing is left to the crawl platform. A sketch of what such a rule might look like follows.
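The following is only a sketch of a parameter-driven rule and the generic crawler that consumes it; the field names are assumptions for illustration, not the platform's real schema.

// Minimal sketch of a rule-driven design: one generic crawler class is driven by
// per-site rules instead of one hard-coded class per site.
public class GenericCrawler {

    public static class CrawlRule {
        public String siteName;       // e.g. "Sina Weibo"
        public String entryUrl;       // where crawling starts
        public String listXpath;      // XPath selecting the list of items on a page
        public String titleXpath;     // XPath for the title inside one item
        public String contentXpath;   // XPath for the body inside one item
        public String datePattern;    // regular expression for fields XPath cannot reach
        public String storageTarget;  // which storage rule / table receives the record
    }

    // The same method handles every site: only the rule changes per site.
    public void crawl(CrawlRule rule) {
        // 1. fetch rule.entryUrl through the anti-monitoring layer (proxy + account rotation)
        // 2. apply rule.listXpath / titleXpath / contentXpath to extract structured fields
        // 3. apply rule.datePattern where XPath alone is not enough
        // 4. hand the structured record to the storage rule named by rule.storageTarget
    }
}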

The crawl layer uses XPath, regular expressions, message-oriented middleware, and a multithreaded scheduling framework (reference). XPath is a structured selector for web page elements; it supports extracting both lists and single nodes, which makes structured crawling of web data straightforward. We use the Google Chrome plugin XPath Helper, which generates an XPath when you click an element on the page, saving the effort of working it out by hand and making later "what you see is what you get" configuration easier. Regular expressions complement XPath for data it cannot reach and can also filter out special characters. The message-oriented middleware forwards crawl tasks and decouples the crawl platform from each consumer: a business system only needs to send a crawl instruction to the middleware, the platform writes the result back to the middleware, and the business system picks up the feedback, completing the cycle. The multithreaded scheduling framework, mentioned earlier, lets the platform work on more than one task at a time, but it cannot crawl without limit or resources would be exhausted and a vicious circle would follow; it schedules tasks in parallel while capping their number so that resource consumption stays normal.
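As an illustration of combining HTML cleaning, XPath, and a regular expression, here is a minimal sketch using the HtmlCleaner library; the sample HTML, the class names, and the date pattern are made up for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

// Clean possibly broken HTML first, then apply XPath; fall back to a
// regular expression for fields XPath cannot express.
public class PageExtractor {

    public static void main(String[] args) throws XPatherException {
        String html = "<html><head><title>Demo page</title></head>"
                + "<body><div class='post'>Posted on 2013-05-01</div></body>"; // deliberately incomplete HTML

        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(html);          // repairs the incomplete markup

        // XPath for structured nodes
        Object[] titles = root.evaluateXPath("//title");
        if (titles.length > 0) {
            System.out.println("title: " + ((TagNode) titles[0]).getText());
        }

        // Regular expression as a complement, e.g. to pull a date out of free text
        Object[] posts = root.evaluateXPath("//div[@class='post']");
        if (posts.length > 0) {
            Matcher m = Pattern.compile("\\d{4}-\\d{2}-\\d{2}")
                               .matcher(((TagNode) posts[0]).getText());
            if (m.find()) {
                System.out.println("date: " + m.group());
            }
        }
    }
}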

No matter how well we simulate a human, there will always be exceptions, so an exception-handling module is needed. Some sites demand a verification code after a period of access; if it is not handled, no subsequent request will ever return correct data. We need a mechanism for exceptions like verification codes: the simple approach is to have a person enter the code, and a more advanced one is a recognition algorithm that cracks the code and enters it automatically.
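A minimal sketch of that idea, assuming the site marks a captcha page in some recognizable way; the marker string and the handler interface are hypothetical placeholders, not part of the original system.

// Sketch: detect a captcha / abnormal response and hand it off instead of
// treating it as normal data.
public class FetchResultHandler {

    public interface CaptchaHandler {
        // e.g. show the captcha image to a person, or call a recognition algorithm
        String solve(String captchaPageHtml);
    }

    private final CaptchaHandler captchaHandler;

    public FetchResultHandler(CaptchaHandler captchaHandler) {
        this.captchaHandler = captchaHandler;
    }

    public boolean isCaptchaPage(String html) {
        // hypothetical marker; each site needs its own detection rule
        return html != null && html.contains("captcha");
    }

    public String handle(String html) {
        if (isCaptchaPage(html)) {
            // pause normal crawling for this account/IP and resolve the captcha first
            return captchaHandler.solve(html);
        }
        return null; // no exception detected, continue the normal pipeline
    }
}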

To expand: is "what you see is what you get" really achievable? Is rule configuration an endlessly recurring task? Which web pages cannot be crawled directly?

1. Some sites generate their content with JavaScript, so viewing the source yields only a pile of JS. Toolkits that embed a browser engine, such as Mozilla or WebKit, can parse the JS and Ajax, but they are somewhat slow.

2. Some pages hide text with CSS. Use a toolkit to strip out the CSS-hidden text.

3. Information embedded in images or Flash. Text in images is the easier case and can be extracted with OCR; for Flash, all we can do is store the URL.

4. One site may use several different page structures. A single set of extraction rules will not work; multiple rules are needed to cover the crawl.

5. The HTML is incomplete and cannot be extracted in the normal way. XPath definitely cannot parse it in this state, so we first clean the page with HtmlCleaner and then parse it (as in the extraction sketch above).

6. If there are many sites, the workload of configuring rules becomes very large. How can the system help generate rules quickly? First, rule configuration can be made visual: when a user sees data on a page they want to crawl, they simply click the needed elements with an extraction plugin and the rule is generated automatically. When the volume is too large for visual configuration alone, we can first group sites of the same type, crawl and cluster some of their content, and then statistically and visually derive several candidate extraction templates for users to correct; the confirmed rules become the rules for the new sites. The algorithms for this will be covered later. This part needs elaboration (thanks to Zicjin for the suggestion):

Background: if we need to crawl a great many sites, spending a lot of manpower on visual configuration is a real cost, and whether business staff who do not understand HTML can configure rules accurately is questionable, so in the end technology still has to do most of the work. Can we use tooling to help generate rules and reduce the labor cost, or help non-technical business users accurately extract the data fragments they need?

Approach: first classify the sites, e.g. news, forums, video, and so on; sites within one category have similar structures. When a business user opens a page that needs extraction and is not yet in our rules library, they first set the category of the page (this could also be pre-judged by a machine and then confirmed by the user; this judgment step matters). Given the category, we use the "statistics plus visual judgment" approach to identify the category's field rules. But these are machine-recognized rules and may not be accurate, so after machine recognition a person still has to verify them. Once verified, the rules for the new site are finally formed.

7. Handling duplicate pages: crawling duplicates wastes resources, but deciding not to crawl them requires a huge cache of what has already been seen, used both to decide whether to crawl and whether to store after crawling, and this cache must be fast to read and write. Common techniques are Bloom filters, similarity-based aggregation, and classification by Hamming distance.
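As an example of the Bloom filter option, here is a minimal URL de-duplication sketch using Guava's BloomFilter; the expected URL count and false-positive rate are made-up tuning values.

import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Sketch of URL de-duplication with a Bloom filter. A Bloom filter may report
// false positives (occasionally skipping a URL we have not actually seen) but
// never false negatives, and it is far cheaper than keeping every crawled URL.
public class UrlDeduplicator {

    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                               10000000,   // expected number of URLs
                               0.001);     // acceptable false-positive rate

    // Returns true the first time a URL is offered, false if it was (probably) seen before.
    public synchronized boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false;
        }
        seen.put(url);
        return true;
    }
}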

Monitoring management: any system can run into problems. If the other side's server goes down, a page is redesigned, or an address changes, we need to know immediately, so the monitoring system must detect problems in time and notify the contact person.

At present, such a framework can basically handle large-scale crawl requirements. Through the interface we can manage resources, anti-monitoring rules, page extraction rules, message middleware status, and data monitoring charts, and we can adjust resource allocation from the back end and update it dynamically to keep the crawl running at full strength. However, a particularly large task may still take 24 hours or several days. For example, crawling the reposts of a microblog with 300,000 reposts page by page is bound to be slow; if we can split those 300,000 into many small tasks, our parallel computing capacity improves greatly. It is also worth mentioning that very large crawl tasks can be put onto Hadoop.
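Here is a minimal sketch of that splitting idea, turning one 300,000-repost task into page-range subtasks that can be run in parallel or distributed; the items-per-page and pages-per-subtask numbers are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Sketch: split one huge crawl task into many small page-range subtasks.
public class TaskSplitter {

    public static class SubTask {
        public final String targetId;
        public final int startPage;
        public final int endPage;   // inclusive
        public SubTask(String targetId, int startPage, int endPage) {
            this.targetId = targetId;
            this.startPage = startPage;
            this.endPage = endPage;
        }
    }

    public static List<SubTask> split(String targetId, int totalItems, int itemsPerPage, int pagesPerSubTask) {
        int totalPages = (totalItems + itemsPerPage - 1) / itemsPerPage;
        List<SubTask> subTasks = new ArrayList<SubTask>();
        for (int start = 1; start <= totalPages; start += pagesPerSubTask) {
            int end = Math.min(start + pagesPerSubTask - 1, totalPages);
            subTasks.add(new SubTask(targetId, start, end));
        }
        return subTasks;
    }

    public static void main(String[] args) {
        // 300,000 reposts, 20 per page, 100 pages per subtask -> 150 subtasks
        List<SubTask> tasks = split("weibo-repost-demo", 300000, 20, 100);
        System.out.println(tasks.size() + " subtasks");
    }
}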

That is all for today. A follow-up article will cover a real-world deployment that collects tens of millions of records per day.
