Building a crawler framework for massive social data collection

Source: Internet
Author: User

As the concept of big data takes hold, the question of how to build an architecture that can collect massive amounts of data is now in front of everyone. How do we achieve "what you see is what you get"? How do we quickly structure and store irregular pages? How do we meet ever-growing data-collection demands within a limited time? This article is based on our own project experience.

First, let's look at how a person collects webpage data manually:

1. Open a browser and enter the URL to access the page.
2. Copy the title, author, and body text of the page.
3. Store them in a text file or Excel file.

From a technical point of view, the entire process mainly involves network access, structured data extraction, and storage. Let's see how a Java program implements this process.

import java.io.IOException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.lang.StringUtils;

public class HttpCrawler {

    public static void main(String[] args) {
        String content = null;
        HttpClient httpClient = new HttpClient();
        // The URL to fetch was elided in the original article.
        GetMethod method = new GetMethod(" ");
        try {
            // 1. Network request
            int statusCode = httpClient.executeMethod(method);
            if (statusCode == HttpStatus.SC_OK) {
                content = method.getResponseBodyAsString();
                // 2. Structured extraction
                String title = StringUtils.substringBetween(content, "<title>", "</title>");
                // 3. Storage
                System.out.println(title);
            }
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the connection back to the pool.
            method.releaseConnection();
        }
    }
}

In this example, the page is fetched through HttpClient, the title is extracted with a string operation, and the result is output through System.out. Doesn't building a crawler seem quite simple? This is a basic example; now let's look, step by step, at how to build a distributed crawler framework for massive data collection.

The entire framework includes the following parts: resource management, anti-monitoring management, capture management, and monitoring management. The structural diagram of the framework breaks down as follows:

  • Resource management: management and maintenance of basic resources such as the website classification system, websites, and website entry URLs;
  • Anti-monitoring management: countering the mechanisms that websites (especially social media sites) use to detect and block crawlers;

    A good collection framework should be able to collect our target data wherever it is, as long as a user can see it: what you see is what you get. Even data that requires login should be collectable without obstruction. Most social websites currently require login; to cope with this, the crawler system needs to simulate user login in order to obtain data normally. However, social websites all want to form a closed loop and are reluctant to let data leave the site, so they are not as open as news sites and similar content. Most of them adopt restrictions to prevent robot crawlers from harvesting data; usually a single account cannot crawl for long before it is detected and its access forbidden. Does that mean we cannot crawl these websites' data? Certainly not: as long as a social website does not close off webpage access, we can also access the data a normal person can access. In the end it comes down to simulating the normal operations of a person, known professionally as "anti-monitoring".

    What restrictions do websites impose?

    Number of visits from a single IP address within a given period: no one visits a website extremely fast over a sustained period, unless they are clicking around at random, and even then not for long. This can be simulated with a large pool of proxy IP addresses used at irregular intervals.

    Number of visits from a single account within a given period: the same applies. Use a large number of accounts exhibiting normal behavior, meaning the way ordinary people operate on social networking sites. A "person" who hits a data interface 24 hours a day is probably a robot.

    If you can control the access policies of accounts and IP addresses, the problem is basically solved. Of course, the other party's operations team will also make adjustments; in the end this is a war, and the crawler must be able to perceive the other party's changes to its anti-monitoring policies and notify the administrator to handle them in a timely manner. The ideal future is to adjust policies automatically through machine-learning algorithms to ensure uninterrupted crawling.
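The account/IP access-policy idea above can be sketched as a sliding-window limiter. This is a hypothetical minimal version, not the article's actual system; the class and method names are our own:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a sliding-window limiter that decides whether an
// account (or proxy IP) may issue another request right now.
public class AccessPolicy {

    private final int maxRequests;      // max requests per window
    private final long windowMillis;    // window length in milliseconds
    private final Map<String, Deque<Long>> history = new HashMap<>();

    public AccessPolicy(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    // Returns true and records the hit if `key` (account id or IP) is
    // still under its quota for the current window.
    public synchronized boolean tryAcquire(String key, long nowMillis) {
        Deque<Long> hits = history.computeIfAbsent(key, k -> new ArrayDeque<>());
        // Drop timestamps that have slid out of the window.
        while (!hits.isEmpty() && nowMillis - hits.peekFirst() >= windowMillis) {
            hits.pollFirst();
        }
        if (hits.size() >= maxRequests) {
            return false;   // quota exhausted: rotate to another account/IP
        }
        hits.addLast(nowMillis);
        return true;
    }
}
```

When `tryAcquire` returns false, the scheduler would switch to another account or proxy IP rather than hammer the site, which is exactly the "behave like a normal person" policy described above.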

  • Capture management: fetching and storing data through URLs, combining the resources and anti-monitoring above. Many current crawler systems require you to write your own regular expressions, or to hard-code against HtmlParser, Jsoup, and similar libraries to solve structured extraction. Developing one class per crawled website is acceptable at small scale, but if thousands of websites must be crawled, we cannot develop hundreds of classes. For this reason we developed a general crawling class that uses parameters to drive its internal logic and scheduling. For example, if the parameters specify Sina Weibo, the crawler schedules the Sina Weibo page-extraction rules to capture node data and calls the storage rules to save it; whatever the site type, the same class handles it. For our users, only the crawling rules need to be set, and all subsequent processing is handed over to the crawling platform.
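The parameter-driven "one generic class for every site" idea might look roughly like this. This is a simplified sketch using delimiter rules instead of the real system's XPath rules; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "one generic class, many rule sets" idea:
// the crawler itself never hard-codes a site; it is driven entirely by
// the rule objects passed in as parameters.
public class GenericCrawler {

    // A rule says how to cut one field out of a page (delimiter-based
    // here for brevity; the real system would use XPath).
    public static class ExtractRule {
        final String field, before, after;
        public ExtractRule(String field, String before, String after) {
            this.field = field; this.before = before; this.after = after;
        }
    }

    // Apply every rule of the selected site to the raw page content.
    public static Map<String, String> extract(String content, ExtractRule[] rules) {
        Map<String, String> record = new HashMap<>();
        for (ExtractRule r : rules) {
            int start = content.indexOf(r.before);
            if (start < 0) continue;            // field absent on this page
            start += r.before.length();
            int end = content.indexOf(r.after, start);
            if (end < 0) continue;
            record.put(r.field, content.substring(start, end));
        }
        return record;
    }
}
```

Adding a new website then means adding a new rule set in configuration, not a new class.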

    The entire crawling process relies on XPath, regular expressions, message middleware, and a multi-threaded scheduling framework. XPath is a structured selector for webpage elements that supports fetching data from both lists and single nodes; its benefit is regularized capture of webpage data. We use the Google Chrome plug-in XPath Helper, which lets you click an element on a webpage to generate its XPath, saving the effort of working out XPaths by hand; it also paves the way for a future what-you-see-is-what-you-get tool. Regular expressions supplement the data that XPath cannot capture and filter out special characters. Message middleware forwards capture tasks and avoids coupling between the crawler and its various consumers: a business system that needs data simply sends a capture command to the middleware; when the capture platform finishes, it returns a message to the middleware, the business system receives that feedback, and the whole capture is done. The multi-threaded scheduling framework is needed because, as mentioned before, our capture platform can neither handle only one task at a time nor accept tasks without limit, which would exhaust resources and lead to a vicious circle; it schedules parallel multi-threaded capture tasks and caps their number to keep resource consumption normal.
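As a minimal sketch of the XPath step, the JDK's built-in javax.xml.xpath API can evaluate an expression against a parsed page. The class name and the fall-back-to-null behavior are our own illustrative choices; real pages would first be cleaned into well-formed XML (e.g. with HtmlCleaner) before this step:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Minimal XPath extraction using only the JDK's javax.xml.xpath API.
public class XPathExtractor {

    // Evaluate one XPath expression against well-formed markup;
    // returns null on parse failure so the caller can fall back to
    // the regular-expression rules mentioned above.
    public static String extract(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            return null;
        }
    }
}
```

The expression itself would come from the rule configuration generated by XPath Helper, so the extraction code stays generic.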

    No matter how well we simulate, there will always be exceptions, so an exception-handling module is needed. For instance, some websites demand a verification code after a period of crawling; if it is not handled, correct data will never be returned again. We need a mechanism to handle cases like verification codes: at the simple end, codes are entered manually; at the advanced end, a captcha-recognition algorithm can crack and enter them automatically.

    Extension: have we really achieved "what you see is what you get"? Is rule configuration a large, repetitive task? How do we avoid capturing duplicate webpages?

    1. Some websites generate page content with JavaScript, so viewing the source shows only a pile of JS. Toolkits that embed a browser engine, such as Mozilla's or WebKit's, can execute the JS and Ajax, though parsing will be somewhat slower.
    2. Some webpages hide text with CSS. Use a toolkit to strip out CSS-hidden text.
    3. Information in images and Flash: for text inside images, OCR recognition works reasonably well; for Flash, you can only store the whole URL.
    4. One webpage may contain multiple page structures. A single set of extraction rules will certainly fail; multiple rules must be configured and applied together.
    5. Incomplete HTML cannot be extracted in the normal way, because XPath cannot parse it. We can clean the page with HtmlCleaner before parsing.
    6. With many websites, the rule-configuration workload is also very large. How can the system generate rules quickly? First, rules can be created through visual configuration: to capture data on the page you see, just open the plug-in and click the desired places, and the rules are generated automatically. When the volume grows beyond what visual work can handle, you can first group websites of the same type, then cluster the captured content; with statistics and visual sampling, several candidate extraction versions can be produced for users to correct, and the confirmed version becomes the rule for the new website. (These algorithms will be discussed later.)
    7. Dealing with duplicate webpages: re-capturing pages that are already cached wastes resources, while never re-capturing means updated content is lost, and the cache itself must support fast reads and writes. Common practices include Bloom filters, similarity clustering, and classification by Hamming distance.
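Of the deduplication practices just listed, a Bloom filter is the simplest to sketch. Below is a hypothetical minimal version that simulates k independent hashes by salting the string hash; a real deployment would use stronger hash functions and a properly sized, shared bit array:

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter for "have we crawled this URL
// already?" checks. False positives occasionally skip a fresh page,
// but there are no false negatives, so nothing is crawled twice.
public class UrlSeenFilter {

    private final BitSet bits;
    private final int size;
    private final int hashes;

    public UrlSeenFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th "independent" hash by salting the string hash.
    private int bucket(String url, int salt) {
        return Math.floorMod((url + "#" + salt).hashCode(), size);
    }

    // Returns true if the URL was possibly seen before, and marks it seen.
    public boolean checkAndMark(String url) {
        boolean seen = true;
        for (int i = 0; i < hashes; i++) {
            int b = bucket(url, i);
            if (!bits.get(b)) {
                seen = false;     // at least one bit unset: definitely new
                bits.set(b);
            }
        }
        return seen;
    }
}
```

The crawler would consult this filter before enqueuing a URL, falling back to similarity clustering or Hamming distance for near-duplicate content rather than identical URLs.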

  • Monitoring management: any system can have problems. If the other party's server goes down, a webpage is redesigned, or an address changes, we need to know right away. The monitoring system detects such problems and notifies a contact person.

A framework like this can basically meet large-scale crawling needs. Through the interface you can manage resources, anti-monitoring rules, webpage extraction rules, message-middleware status, and data-monitoring charts, and in the background you can adjust resource allocation and update resources dynamically to keep crawling uninterrupted. However, a single task can still be very large and take 24 hours or even several days. For example, crawling the reposts of a microblog that has been forwarded 300,000 times is very slow if the pages are crawled linearly one by one; if those 300,000 can be split into many small tasks, our parallel computing capability improves greatly. This brings us to hadoop-based large-scale crawling tasks.
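The task-splitting idea can be sketched as cutting the page range of one big job into slices that independent workers (or hadoop map tasks) fetch in parallel. The numbers here (20 items per page, 500 pages per sub-task) are illustrative assumptions, not figures from the article:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a 300,000-item repost crawl into independent page-range
// slices instead of walking all pages in one linear job.
public class TaskSplitter {

    // One slice of the big job: fetch pages [firstPage, lastPage].
    public static class SubTask {
        public final int firstPage, lastPage;
        SubTask(int firstPage, int lastPage) {
            this.firstPage = firstPage;
            this.lastPage = lastPage;
        }
    }

    public static List<SubTask> split(int totalItems, int itemsPerPage, int pagesPerTask) {
        int totalPages = (totalItems + itemsPerPage - 1) / itemsPerPage; // ceiling division
        List<SubTask> tasks = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += pagesPerTask) {
            tasks.add(new SubTask(first, Math.min(first + pagesPerTask - 1, totalPages)));
        }
        return tasks;
    }
}
```

With 300,000 items at 20 per page and 500 pages per sub-task, the job becomes 30 independent slices that can run concurrently instead of one multi-day linear crawl.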

I will stop here for today. Next time, I will introduce the practice of large collection projects averaging 10 million items per day.

