Reflecting on How We Collected Data a Year Ago: Web Crawlers


This is the first time I have written something like this, so my wording may not be precise. Please forgive me if anything is unclear, and I welcome your suggestions. Thank you.

Web crawlers are often overlooked, especially compared with search engines, and I rarely see articles or documents that describe crawler implementation in detail. Yet the crawler is actually a very important system, especially in today's era where data is king. If you are a company or project just starting out, without any accumulation of raw data, using a crawler to find valuable data on the Internet and then cleaning and organizing it is an important way to obtain data quickly.

This article focuses on the design and implementation of part of a crawler system. The content comes from three sources. First, a data collection project I worked on from January to August, where the requirement was that a single PC server download no fewer than 0.8 million valid pages per day. Second, a large amount of reference material from the Internet: because the company's confidentiality requirements were not very strict at the time, we could turn to the Internet whenever we hit problems during development, and the resources I used most were Google search and http://stackoverflow.com/. Third, the book "The Beauty of Mathematics", which I recently read after it had sat in a corner at Huawei for a long time. Most documents on crawler systems were published around 2000, and few have appeared since, which suggests that crawler system design was essentially solved more than a decade ago. In addition, since this article focuses on system issues, some topics are not covered, such as how to crawl hidden Web data, how to crawl Ajax pages, and how to adjust the crawling frequency dynamically.

Body

A formal, complete web crawler is actually a complicated system. First, it is a massive data-processing system, because it faces the webpages of the entire Internet; even a small, vertical crawler usually needs to fetch billions or tens of billions of webpages. Second, it is a system with demanding performance requirements: it may need to download thousands of webpages at the same time, quickly extract URLs from those pages, deduplicate massive numbers of URLs, and so on. Finally, it is a system not intended for end users: although stability is still required, an occasional machine failure is not a disaster, there is no such thing as a sudden surge in access traffic, and a short-lived drop in performance is not a problem. In these respects, crawler system design is much simpler.

(Figure: overall framework of a crawler system)

The framework in the figure includes essentially all the modules required by a crawler system.

In any crawler system design diagram you will find a loop. This loop represents the crawler's general workflow: download the webpage for a URL, extract new URLs from that page, and then download the webpages for those new URLs in turn. The crawler's sub-modules all sit on this loop, each performing a specific function (a minimal sketch tying them together follows the module list below).

These sub-modules generally include:

Fetcher: used to download the corresponding webpage based on the URL;

DNS resolver: DNS resolution;

Content seen: deduplication of webpage content;

Extractor: extract the URL or other content from the webpage;

URL filter: filters out URLs that do not need to be downloaded;

URL seen: URL deduplication;

URL set: stores all URLs;

URL frontier: scheduler, which decides which URLs to download next;
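
To make this loop concrete, here is a minimal single-process sketch in Python that strings the modules together. The helper names (fetch_page, extract_urls, crawl), the regex-based link extraction, and the use of hash() as a content fingerprint are illustrative assumptions, not the original system's implementation; each role is discussed in its own module below.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def fetch_page(url: str) -> str:
    """Simplistic stand-in for the fetcher: download and decode one page."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_urls(page: str, base_url: str) -> list[str]:
    """Simplistic stand-in for the extractor: collect href values as absolute URLs."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', page)]

def crawl(seed_urls, url_filter=lambda u: u.startswith("http"), max_pages=100):
    frontier = deque(seed_urls)   # URL frontier: decides what to fetch next
    url_seen = set(seed_urls)     # URL seen: URL deduplication
    content_seen = set()          # content seen: webpage-content deduplication
    downloaded = 0

    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        try:
            page = fetch_page(url)            # fetcher + DNS resolver
        except OSError:
            continue                          # skip pages that fail to download
        downloaded += 1
        fp = hash(page)                       # toy content fingerprint
        if fp in content_seen:
            continue                          # duplicate (e.g. mirror) content
        content_seen.add(fp)
        for new_url in extract_urls(page, url):                  # extractor
            if url_filter(new_url) and new_url not in url_seen:  # URL filter + URL seen
                url_seen.add(new_url)
                frontier.append(new_url)      # back into the URL set / frontier
    return downloaded

if __name__ == "__main__":
    print(crawl(["http://example.com/"], max_pages=3), "pages downloaded")
```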

Fetcher and DNS resolver

These two modules are simple, independent services: the DNS resolver is responsible for domain name resolution, while the fetcher takes a resolved URL as input and returns the webpage content for that URL. Fetching any webpage requires calling both modules.

For a general-purpose crawler, these two modules can be very simple and can even be merged. In systems with high performance requirements, however, they can become performance bottlenecks, mainly because both domain name resolution and page fetching are time-consuming. For example, fetching a webpage usually takes hundreds of milliseconds, and if a website is slow it may take several or even dozens of seconds, leaving the worker thread blocked and waiting for a long time. If you want the fetcher to download thousands of webpages per second or more, you would need to start a huge number of worker threads.

Therefore, crawler systems with high performance requirements generally use epoll or similar technology to make these two modules asynchronous. In addition, DNS resolution results are cached, which greatly reduces the number of DNS lookups.
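
As one way to illustrate the idea, here is a minimal asyncio sketch of an asynchronous fetcher with a DNS cache. It speaks plain HTTP/1.0 on port 80 to keep the example short; the function names and the cache structure are assumptions, not the original epoll-based implementation.

```python
import asyncio
import socket

_dns_cache: dict[str, str] = {}   # hostname -> resolved IP address

async def resolve(host: str) -> str:
    """Resolve a hostname once and cache the result to avoid repeated DNS lookups."""
    if host not in _dns_cache:
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 80, type=socket.SOCK_STREAM)
        _dns_cache[host] = infos[0][4][0]     # first resolved address
    return _dns_cache[host]

async def fetch(host: str, path: str = "/") -> bytes:
    """Download one page over plain HTTP using non-blocking sockets."""
    ip = await resolve(host)
    reader, writer = await asyncio.open_connection(ip, 80)
    writer.write(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    body = await reader.read()                # HTTP/1.0: server closes when done
    writer.close()
    await writer.wait_closed()
    return body

async def main():
    # Many downloads run concurrently in one thread instead of one thread per page.
    pages = await asyncio.gather(fetch("example.com"), fetch("example.org"))
    for page in pages:
        print(len(page), "bytes")

if __name__ == "__main__":
    asyncio.run(main())
```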

Content seen

Some websites on the Internet have mirror sites: two sites whose content is identical but whose domain names differ. This causes a crawler to fetch the same webpage repeatedly. To avoid this, every fetched webpage first passes through the content seen module, which checks whether the page's content is identical to that of a previously downloaded page; if so, the page is not sent on for further processing. This can significantly reduce the number of webpages the crawler needs to download and process.

To determine whether the content of two webpages is identical, the general approach is as follows: instead of comparing the two pages directly, a fingerprint (information fingerprint) is computed from each page's content. A fingerprint is generally a fixed-length string, much shorter than the webpage body. If two webpages have the same fingerprint, their content is considered identical.
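
A minimal sketch of this idea, assuming a plain cryptographic hash of the normalized page body is acceptable as the fingerprint (production systems often prefer similarity-preserving schemes such as SimHash, which the article does not specify):

```python
import hashlib

def fingerprint(page_body: str) -> str:
    """Return a fixed-length fingerprint, much shorter than the page itself."""
    normalized = " ".join(page_body.split())   # collapse whitespace differences
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen_fingerprints: set[str] = set()

def is_duplicate(page_body: str) -> bool:
    """True if a page with the same fingerprint has already been processed."""
    fp = fingerprint(page_body)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

print(is_duplicate("<html>hello   world</html>"))   # False, first time seen
print(is_duplicate("<html>hello world</html>"))     # True, same fingerprint
```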

Extractor and URL filter

The extractor pulls out all the URLs contained in a downloaded webpage. This is meticulous work: you need to consider every possible URL form. For example, webpages often contain URLs with relative paths, which must be converted to absolute paths during extraction.

The URL filter then filters the extracted URLs. The filtering criteria differ from application to application. For a search engine such as Baidu or Google, generally nothing is filtered; but a vertical-search or targeted crawler may only want URLs that meet certain conditions, for example excluding image URLs or keeping only URLs from a specific website. The URL filter is therefore a module closely tied to the application.
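
Here is a minimal sketch of an extractor plus URL filter using only the standard library. The single-domain, no-images filtering rule is an example of an application-specific policy, not a rule from the original system.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, converted to absolute URLs."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative paths become absolute here.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(page_html: str, base_url: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    return parser.links

def url_filter(url: str, allowed_domain: str = "example.com") -> bool:
    """Example policy: keep http(s) URLs on one domain and skip common image files."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if not parsed.netloc.endswith(allowed_domain):
        return False
    return not parsed.path.lower().endswith((".jpg", ".jpeg", ".png", ".gif"))

html = '<a href="/news/1.html">news</a> <a href="logo.png">logo</a>'
links = extract_links(html, "http://example.com/index.html")
print([u for u in links if url_filter(u)])   # ['http://example.com/news/1.html']
```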

URL seen

URL seen performs URL deduplication. URL deduplication is covered separately, so we will not discuss it in detail here.

A large crawler system may already hold tens or even hundreds of billions of URLs, so quickly determining whether a new URL has already been seen is critical. Because a large crawler may download thousands of webpages per second, and each webpage typically yields dozens of URLs, every one of which must be checked, the system may perform tens of thousands of deduplication operations per second. URL seen is therefore one of the most technically demanding parts of the whole crawler system. (The same problem exists for content seen.)
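
One widely used way to make this check cheap in both time and memory is a Bloom filter, sketched below. This is offered as a common technique rather than the original system's implementation, and the bit-array size and hash count are illustrative.

```python
import hashlib

class BloomFilter:
    """Probabilistic URL-seen set: no false negatives, small false-positive rate."""

    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url: str):
        # Derive several bit positions per URL from salted MD5 digests.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

url_seen = BloomFilter()
url_seen.add("http://example.com/")
print("http://example.com/" in url_seen)        # True
print("http://example.com/other" in url_seen)   # almost certainly False
```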

URL set

After passing through the preceding modules, a URL is placed into the URL set, where it waits to be scheduled for fetching. Because the number of URLs is huge, only a small portion can be kept in memory; most must be written to disk. The URL set is generally implemented as a set of files or a database.
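
As a toy illustration of a disk-backed URL set, here is a sketch using SQLite from the Python standard library; the schema, batch size, and file name are assumptions, and a production crawler would more likely use purpose-built files or a key-value store.

```python
import sqlite3

conn = sqlite3.connect("url_set.db")   # hypothetical on-disk store
conn.execute(
    "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, crawled INTEGER DEFAULT 0)"
)

def add_url(url: str) -> None:
    """Store a new URL on disk; duplicates are silently ignored."""
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()

def next_batch(n: int = 100) -> list[str]:
    """Move a small batch of uncrawled URLs into memory for the frontier."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE crawled = 0 LIMIT ?", (n,)
    ).fetchall()
    urls = [row[0] for row in rows]
    conn.executemany(
        "UPDATE urls SET crawled = 1 WHERE url = ?", [(u,) for u in urls]
    )
    conn.commit()
    return urls
```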

URL frontier

The URL frontier is discussed last because it is the engine and driver of the entire crawler system, organizing and invoking the other modules.

When the crawler starts, the frontier holds a number of seed URLs. It sends these seed URLs to the fetcher to be downloaded, then passes the fetched pages to the extractor to extract new URLs; after deduplication, the new URLs are added to the URL set. When the URLs in the frontier have all been crawled, it pulls fresh URLs from the URL set and repeats the process. There are many ways to implement frontier scheduling; here we introduce only one of the most common.

Before that, one point needs explaining. Although a good fetcher can download hundreds of webpages per second, for any specific target website, such as www.sina.com, the crawler fetches at a much slower rate, on the order of one page every few seconds. This ensures that the target website is not overwhelmed by the crawler.

To achieve this, the frontier maintains a FIFO queue for each domain name, storing the URLs under that domain. Each time, the frontier takes a URL from one of these queues. Each queue records the last time the frontier drew from it; only after a certain interval has passed can the queue be used again.

The frontier may hold thousands of such queues at once. It round-robins over them to find a queue that is currently eligible, then pulls a URL from that queue to fetch. Once the URLs in the queues have been consumed to a certain extent, the frontier pulls a new batch of URLs from the URL set and distributes them into the corresponding queues.
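
A minimal sketch of such a frontier is shown below: one FIFO queue per domain, plus a per-domain timestamp before which the domain may not be fetched again. The two-second politeness delay and the simple eligibility scan are illustrative choices, not the original system's parameters.

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlparse

POLITENESS_DELAY = 2.0   # assumed seconds between requests to the same domain

class Frontier:
    def __init__(self):
        self.queues = defaultdict(deque)        # domain -> FIFO queue of URLs
        self.next_allowed = defaultdict(float)  # domain -> earliest next fetch time

    def add(self, url: str) -> None:
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self) -> Optional[str]:
        """Return a URL from some domain whose politeness delay has expired."""
        now = time.time()
        for domain, queue in self.queues.items():
            if queue and now >= self.next_allowed[domain]:
                self.next_allowed[domain] = now + POLITENESS_DELAY
                return queue.popleft()
        return None   # every non-empty queue is still within its delay window

frontier = Frontier()
frontier.add("http://example.com/a")
frontier.add("http://example.com/b")
print(frontier.next_url())   # http://example.com/a
print(frontier.next_url())   # None until example.com's delay expires
```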

Distributed

When a single-machine crawler cannot meet the performance requirements, you should consider building a distributed crawler system out of multiple machines. A distributed crawler architecture is actually much simpler than you might imagine. One simple approach: suppose there are n machines, each running a complete crawler system. When the extractor module on any machine produces a new URL, that machine hashes the URL's domain name and takes the result modulo n to get an index i, then places the URL into the URL set of machine i. In this way, all URLs of the same website are handled by one machine, and the URLs of different websites are spread across different machines.
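
The routing rule itself is only a few lines: hash the URL's domain name and take it modulo the number of machines, so every URL of a given website always lands on the same machine. The machine list below is hypothetical.

```python
import hashlib
from urllib.parse import urlparse

MACHINES = ["crawler-0", "crawler-1", "crawler-2", "crawler-3"]   # n = 4, hypothetical

def machine_for(url: str) -> str:
    """Pick the machine responsible for a URL based on its domain name."""
    domain = urlparse(url).netloc
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return MACHINES[int.from_bytes(digest[:8], "big") % len(MACHINES)]

print(machine_for("http://www.sina.com/news/1.html"))
print(machine_for("http://www.sina.com/sports/2.html"))   # same machine as above
```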

At the time, our design goal was for the crawler program to run normally and stably on a device such as a router, since it would not need to store useless information on the device: all useful pages would be sent to a designated server over a socket connection. But by the time I left the company, no one had pursued this solution further, because we found that we lacked the hardware support and it affected the router's speed.

The above is a complete crawler system implementation. Of course, due to limited space, some details have been left out; for example, some websites provide a sitemap, from which all of the site's URLs can be obtained directly, and so on. There is also a structure chart of a crawling platform found online whose structure is basically the same as the one described above.
