About web crawler design issues


There are already several open-source web crawlers: larbin, Nutch, and Heritrix each have their own user base. To build our own crawler, we need to solve many problems, such as scheduling algorithms, update policies, and distributed storage. Let's look at them one by one.

The main tasks a crawler needs to perform are as follows:

  1. Starting from a set of webpage entry points or an RSS source list, crawl the pages or feeds, analyze their links, and traverse layer by layer;
  2. Obtain the source code of each page and save it to disk or to a database;
  3. Process the captured webpages, for example by extracting the body text and removing duplicates;
  4. Index, classify, or cluster the processed text according to its purpose.

The above is my understanding of the process. Along the way, the following problems come up:

How can I obtain webpage or RSS sources?
For a general-purpose crawler, you give it a few portal pages and then follow the hyperlinks, traversing the web as a graph to crawl page after page. Few seed pages are needed in this case; you can start crawling from a directory site such as hao123. For vertical search, manually collect some websites in the target industry into a list and crawl from that list. If you want to crawl RSS, you first need to collect RSS sources. The news channels of large portals and mainstream blog systems both offer RSS, so you can crawl a website first, find the candidate RSS links, fetch the content of each link, and check whether it is in RSS format; if so, save the link to the RSS source database and then crawl that source's feed. Sources can also be collected manually: the RSS address of a blog is usually regular, with the primary domain name and a user name followed by a fixed RSS path. After sorting out the RSS sources, manually set the weight and refresh interval of each source.
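
The feed-detection step described above can be sketched roughly as follows. This is a minimal illustration rather than the author's implementation: the candidate links and URLs are placeholders, and the check simply looks at the root element of the returned XML to decide whether a link is an RSS or Atom feed.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.Linq;

class RssDiscovery
{
    static readonly HttpClient Http = new HttpClient();

    // Fetch a candidate link and decide whether it is an RSS/Atom feed
    // by inspecting the root element of the returned document.
    static async Task<bool> IsFeedAsync(string url)
    {
        try
        {
            string body = await Http.GetStringAsync(url);
            var root = XDocument.Parse(body).Root;
            if (root == null) return false;
            string name = root.Name.LocalName.ToLowerInvariant();
            return name == "rss" || name == "feed";   // "feed" covers Atom
        }
        catch
        {
            // Unreachable or not well-formed XML: treat as "not a feed".
            return false;
        }
    }

    static async Task Main()
    {
        // Hypothetical candidate links extracted from crawled pages.
        var candidates = new List<string>
        {
            "http://example.com/rss.xml",
            "http://example.com/about.html"
        };

        var rssSources = new List<string>();
        foreach (var link in candidates)
        {
            if (await IsFeedAsync(link))
                rssSources.Add(link);   // would be saved to the RSS source database
        }

        Console.WriteLine($"Found {rssSources.Count} feed(s).");
    }
}
```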

If there are many source pages, how can we use multiple threads to schedule and process them efficiently, without the threads waiting on each other or processing the same page twice?
If 5 million pages are to be crawled, they must be handled by multiple threads or by distributed, multi-process workers. You can partition the pages horizontally so that each thread handles its own segment; then no synchronization between threads is needed and each thread simply processes its own share. For example, assign an auto-incrementing ID to every page: with two threads, let the first thread crawl pages 1, 3, 5, ... and the second thread pages 2, 4, 6, ... This keeps the threads roughly balanced; they never wait for each other, never process a page twice, and never miss a page. Each thread pulls a batch of pages at a time and records the highest source-page ID it has seen; after finishing the batch, it fetches the next batch with IDs greater than that value from the database, until all pages assigned to it have been crawled. The batch size can be tuned to the machine's memory. To survive a crash partway through, capture must be resumable, so the progress of each thread has to be saved: for every batch, record the largest webpage ID the thread has processed to the database; after a restart, the thread reads this ID back and continues with the following pages.
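
A rough sketch of this batch scheme, assuming the ID-modulo partitioning described above. The in-memory page list and checkpoint dictionary stand in for the real database tables, and the worker count and batch size are illustrative only.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class IdRangeScheduler
{
    // In-memory stand-ins for the source-page table and per-thread checkpoints;
    // in the real system these would live in the database.
    static readonly List<(long Id, string Url)> Pages =
        Enumerable.Range(1, 100).Select(i => ((long)i, $"http://example.com/page/{i}")).ToList();
    static readonly ConcurrentDictionary<int, long> Checkpoints = new ConcurrentDictionary<int, long>();

    static async Task RunWorkerAsync(int workerIndex, int workerCount, int batchSize)
    {
        // Resume from the last recorded high-water mark (0 on a fresh start).
        long lastId = Checkpoints.TryGetValue(workerIndex, out var saved) ? saved : 0;

        while (true)
        {
            // Next batch of IDs above the checkpoint that belong to this worker (Id % workerCount).
            var batch = Pages
                .Where(p => p.Id > lastId && p.Id % workerCount == workerIndex)
                .OrderBy(p => p.Id)
                .Take(batchSize)
                .ToList();
            if (batch.Count == 0) break;

            foreach (var (id, url) in batch)
            {
                await Task.Yield();              // placeholder for fetching and storing `url`
                lastId = Math.Max(lastId, id);   // track the highest ID seen in this batch
            }

            Checkpoints[workerIndex] = lastId;   // persist progress so a restart can resume here
        }
    }

    static async Task Main()
    {
        const int workerCount = 2;
        await Task.WhenAll(
            RunWorkerAsync(0, workerCount, batchSize: 10),
            RunWorkerAsync(1, workerCount, batchSize: 10));
        Console.WriteLine($"Done; checkpoints: {string.Join(", ", Checkpoints)}");
    }
}
```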

How can I keep the CPU busy, so that threads are never idle, waiting, sleeping, or blocked, while using as few threads as possible to reduce context switching?
A crawler performs I/O in two places: it goes through the NIC to fetch a webpage from the network, and after fetching it writes the content to disk or to a database. Both parts should use asynchronous I/O, so that no thread blocks while waiting for a page to download or for a file to be written. Both the NIC and the hard disk support direct memory access, so large numbers of I/O operations queue up in the hardware driver without consuming CPU. .NET asynchronous operations use the thread pool, which avoids frequently creating and destroying threads and reduces overhead, so the threading model needs no special design; the I/O model does not either, because .NET asynchronous I/O is built directly on completion ports, which are very efficient; and the memory model needs little attention, because during the whole capture process the threads share no resources apart from the source pages in the database, and since those are segmented per thread, the code can essentially be lock-free.
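
A minimal sketch of what the asynchronous fetch-and-save path might look like in .NET; the URLs and file names are placeholders and error handling is omitted. Neither the download nor the disk write holds a thread while waiting.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class AsyncFetcher
{
    static readonly HttpClient Http = new HttpClient();

    // Both the download and the disk write are awaited asynchronously, so no
    // thread sits blocked on the NIC or the disk while the I/O is in flight.
    static async Task FetchAndSaveAsync(string url, string path)
    {
        string html = await Http.GetStringAsync(url);   // async network I/O
        await File.WriteAllTextAsync(path, html);       // async disk I/O
    }

    static async Task Main()
    {
        // Hypothetical URLs; a real crawler would pull these from its work queue.
        var jobs = new[]
        {
            ("http://example.com/a", "a.html"),
            ("http://example.com/b", "b.html")
        };

        // All downloads run concurrently on a handful of pooled threads.
        await Task.WhenAll(jobs.Select(j => FetchAndSaveAsync(j.Item1, j.Item2)));
        Console.WriteLine("All pages saved.");
    }
}
```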

How can I avoid collecting duplicate webpages?
A Bloom filter can be used for deduplication. Each thread keeps a BitArray holding the hashes of the pages already crawled for its current batch of source pages. After the links on a source page have been extracted, each link is checked against the BitArray: if its bit is not set, crawl the page; if it is, skip it. Assuming each source page has about 30 links, a batch of 10 million source pages yields roughly 300 million links, and a bit array of that size is only a few tens of megabytes, so memory stays reasonable even with five or six threads running at the same time.
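
A minimal sketch of the BitArray-based deduplication idea, under assumptions: only one home-made hash function is used here, whereas a real Bloom filter would combine several independent hashes to keep the false-positive rate down, and the bit-array size is illustrative.

```csharp
using System;
using System.Collections;

class UrlDeduplicator
{
    readonly BitArray seen;
    readonly int size;

    public UrlDeduplicator(int sizeInBits)
    {
        size = sizeInBits;
        seen = new BitArray(sizeInBits);   // one bit per hash bucket
    }

    // Map a URL to a bucket. A real Bloom filter would use several
    // independent hash functions instead of this single one.
    int Bucket(string url)
    {
        unchecked
        {
            int h = 23;
            foreach (char c in url) h = h * 31 + c;
            return (int)((uint)h % (uint)size);
        }
    }

    // Returns true the first time a URL is seen, false afterwards
    // (or on a hash collision, which makes the crawler skip a page it has
    // never actually fetched -- the usual Bloom-filter trade-off).
    public bool MarkIfNew(string url)
    {
        int i = Bucket(url);
        if (seen[i]) return false;
        seen[i] = true;
        return true;
    }

    static void Main()
    {
        var dedup = new UrlDeduplicator(1 << 24);   // ~16 million bits (2 MB), illustrative
        Console.WriteLine(dedup.MarkIfNew("http://example.com/1"));   // True
        Console.WriteLine(dedup.MarkIfNew("http://example.com/1"));   // False
    }
}
```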

How can captured pages be saved quickly? Should they go into a distributed file system or into a database?
If you save pages to disk, you can create a folder per domain name and put all of that site's pages inside it; as long as the file names differ, no conflicts occur. If you save pages to a database, the database has its own lock management, and you can simply use bulk copy to insert them. In general, frequent small disk writes tend to drive CPU usage up, while batched database writes behave better. SQL Server 2008 also supports FILESTREAM columns, which perform well for large text fields and can still be accessed through the database API. So, without a distributed file system as efficient and mature as GFS, I think storing the pages in SQL Server is the better choice.
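
A sketch of what the bulk-copy path might look like with SqlBulkCopy. The table name, column names, and connection string are assumptions, and the FILESTREAM option mentioned above is not shown.

```csharp
using System.Data;
using System.Data.SqlClient;   // or Microsoft.Data.SqlClient in newer projects
using System.Threading.Tasks;

class PageBulkWriter
{
    // Writes a batch of crawled pages in one round trip with SqlBulkCopy.
    // The table and column names here (CrawledPages, Url, Html) are assumptions.
    public static async Task SaveBatchAsync(string connectionString, (string Url, string Html)[] pages)
    {
        var table = new DataTable();
        table.Columns.Add("Url", typeof(string));
        table.Columns.Add("Html", typeof(string));
        foreach (var (url, html) in pages)
            table.Rows.Add(url, html);

        using (var connection = new SqlConnection(connectionString))
        {
            await connection.OpenAsync();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "CrawledPages";
                bulk.ColumnMappings.Add("Url", "Url");
                bulk.ColumnMappings.Add("Html", "Html");
                await bulk.WriteToServerAsync(table);   // one bulk insert instead of row-by-row writes
            }
        }
    }
}
```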

How can we effectively adjust the crawler's collection interval based on how often webpages are updated?
The crawler needs to understand some of the HTTP protocol. If the page to be captured supports the Last-Modified or ETag headers, we can first send a HEAD request to test whether the page has changed and decide whether to re-capture it. However, many websites do not support this at all, so it is hard for a crawler to avoid imposing extra load on them. We therefore need to record an update interval and a weight for each source page, and use some algorithm to derive the spider's update policy from these two values.
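
The HEAD-based change check might look roughly like this; the stored Last-Modified and ETag values would come from the crawler's source-page table and are only placeholders here. Sites that send neither header fall through to "assume changed" and are simply re-fetched.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ChangeChecker
{
    static readonly HttpClient Http = new HttpClient();

    // Issues a HEAD request and compares Last-Modified / ETag with the values
    // recorded on the previous crawl of this page.
    static async Task<bool> HasChangedAsync(string url, DateTimeOffset? storedLastModified, string storedEtag)
    {
        var request = new HttpRequestMessage(HttpMethod.Head, url);
        using (var response = await Http.SendAsync(request))
        {
            var lastModified = response.Content?.Headers.LastModified;
            var etag = response.Headers.ETag?.Tag;

            if (etag != null && storedEtag != null)
                return etag != storedEtag;
            if (lastModified.HasValue && storedLastModified.HasValue)
                return lastModified > storedLastModified;

            return true;   // no usable headers: re-crawl to be safe
        }
    }

    static async Task Main()
    {
        // Stored values are placeholders; a real crawler keeps them per source page.
        bool changed = await HasChangedAsync("http://example.com/feed.xml",
                                             DateTimeOffset.UtcNow.AddDays(-1), null);
        Console.WriteLine(changed ? "Re-crawl this page." : "Skip it this round.");
    }
}
```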

What is the purpose of the collected data?
You can crawl the websites of one industry, perform word segmentation and indexing locally, and build a vertical search engine. You can use training algorithms to automatically classify the captured pages and build a news portal. You can also apply popular text-similarity algorithms to cluster the text.

How can we avoid hurting the performance of the websites we crawl?
Many websites are wary of crawlers these days, because a spider can hit a site so hard that its normal users can no longer access it, and webmasters have come up with all kinds of countermeasures. We therefore need to follow the robots protocol and keep the number of requests sent to any one website within a limit per unit of time.
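
A minimal per-host politeness limiter along these lines might look as follows; the two-second interval is illustrative, and a real crawler would also parse robots.txt, which is not shown.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PolitenessLimiter
{
    readonly TimeSpan minDelay;
    readonly ConcurrentDictionary<string, SemaphoreSlim> hostLocks = new ConcurrentDictionary<string, SemaphoreSlim>();
    readonly ConcurrentDictionary<string, DateTime> lastRequest = new ConcurrentDictionary<string, DateTime>();

    public PolitenessLimiter(TimeSpan minDelay) => this.minDelay = minDelay;

    // Waits until at least `minDelay` has passed since the previous request to the
    // same host, so no single site receives a burst of requests from the crawler.
    public async Task WaitForTurnAsync(Uri url)
    {
        var gate = hostLocks.GetOrAdd(url.Host, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();   // serialize requests per host
        try
        {
            if (lastRequest.TryGetValue(url.Host, out var last))
            {
                var wait = last + minDelay - DateTime.UtcNow;
                if (wait > TimeSpan.Zero)
                    await Task.Delay(wait);
            }
            lastRequest[url.Host] = DateTime.UtcNow;
        }
        finally
        {
            gate.Release();
        }
    }

    static async Task Main()
    {
        var limiter = new PolitenessLimiter(TimeSpan.FromSeconds(2));   // interval is illustrative
        var urls = new[] { new Uri("http://example.com/a"), new Uri("http://example.com/b") };
        foreach (var url in urls)
        {
            await limiter.WaitForTurnAsync(url);
            Console.WriteLine($"{DateTime.UtcNow:HH:mm:ss} fetch {url}");   // actual fetch would go here
        }
    }
}
```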

Other problems:
Http://notes.zhourenjian.com/whizznotes/xhtml/4308.html
