Basic Principles of Web Crawlers


Reproduced from: http://www.cnblogs.com/wawlian/archive/2012/06/18/2553061.html

A web crawler is an important component of a search engine's crawling system. Its primary purpose is to download webpages from the Internet to a local machine, forming a mirror backup of the online content. This post gives a brief overview of crawlers and crawling systems.

I. Basic Structure and Workflow of Web Crawlers

A general web crawler framework:

The basic workflow of a web crawler is as follows:

1. Start with a set of carefully selected seed URLs;

2. Put these seed URLs into the queue of URLs to be crawled;

3. Take a URL from the queue, resolve its DNS to obtain the host's IP address, download the corresponding webpage, and store it in the library of downloaded pages. Then move the URL into the set of crawled URLs;

4. Analyze the pages behind the crawled URLs, extract the new URLs they contain, and put those into the queue of URLs to be crawled, entering the next loop. A minimal code sketch of this loop follows the list.
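
As a rough illustration of this loop, here is a minimal Python sketch. The helpers fetch() and extract_links() are hypothetical placeholders for the page downloader and the link extractor, which the post does not specify; everything else follows the four steps above.

```python
# Minimal sketch of the crawl loop described above.
# fetch() and extract_links() are hypothetical helpers supplied by the caller.
import socket
from collections import deque
from urllib.parse import urlparse

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    to_crawl = deque(seed_urls)   # queue of URLs to be crawled (step 2)
    crawled = set()               # set of already-crawled URLs
    page_store = {}               # library of downloaded webpages

    while to_crawl and len(page_store) < max_pages:
        url = to_crawl.popleft()             # step 3: take a URL from the queue
        if url in crawled:
            continue
        host = urlparse(url).hostname
        ip = socket.gethostbyname(host)      # DNS resolution -> host IP address
        page_store[url] = fetch(url, ip)     # download and store the page
        crawled.add(url)                     # mark the URL as crawled
        for link in extract_links(page_store[url]):   # step 4: extract new URLs
            if link not in crawled and link not in to_crawl:
                to_crawl.append(link)        # enqueue for the next loop
    return page_store
```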

II. Dividing the Internet from the Crawler's Perspective

Correspondingly, all pages on the Internet can be divided into five parts:

1. Downloaded, unexpired webpages.

2. Downloaded, expired webpages: the crawled pages are a mirror backup of the Internet's content, but the Internet is dynamic, and once content changes online, the corresponding downloaded copies become stale.

3. Webpages to be downloaded: the pages whose URLs sit in the queue to be crawled.

4. Known webpages: pages that have not been crawled and are not yet in the queue to be crawled, but whose URLs can be obtained by analyzing already-crawled pages or the URLs waiting to be crawled. Such pages are considered known.

5. Unknown webpages: pages that the crawler cannot directly discover or download.

III. Crawling Policies

In a crawler system, the queue of URLs to be crawled is a key component, and how the URLs in this queue are ordered is an important question, because it decides which pages get fetched first. The method that determines this order is called the crawling policy. Several common crawling policies are described below:

1. Depth-First Traversal Policy

Under the depth-first traversal policy, the crawler starts from a page and follows one chain of links as far as it can go; only when that chain is exhausted does it return to the start page and follow its next link. Take the following figure as an example:

Traversal path: A-F-G, E-H-I, B, C, D
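
Since the figure from the original post is not reproduced here, the sketch below uses a hypothetical link graph chosen so that a stack-based depth-first traversal reproduces the path quoted above; it only illustrates the "follow one chain of links to the end before backtracking" behaviour.

```python
# Depth-first traversal sketch. The adjacency list `links` is a hypothetical
# stand-in for the missing figure from the original post.
def depth_first_order(start, links):
    visited, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()                 # LIFO: keep following the newest link
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push out-links in reverse so they are explored in their listed order.
        for nxt in reversed(links.get(page, [])):
            if nxt not in visited:
                stack.append(nxt)
    return order

links = {"A": ["F", "E", "B", "C", "D"], "F": ["G"], "E": ["H"], "H": ["I"]}
print(depth_first_order("A", links))  # ['A', 'F', 'G', 'E', 'H', 'I', 'B', 'C', 'D']
```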

2. Breadth-First Traversal Policy

The basic idea of the breadth-first traversal policy is to append the links found in a newly downloaded page to the end of the queue of URLs to be crawled. In other words, the crawler first fetches all pages linked from the starting page, then picks one of those pages and fetches all pages linked from it, and so on. Take the same figure as an example:

Traversal path: A-B-C-D-E-F, G, H, I
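
Again using a hypothetical link graph in place of the missing figure (chosen so the output matches the path quoted above), the sketch below shows the queue-based behaviour: newly found links go to the end of the frontier, so pages are visited level by level.

```python
# Breadth-first traversal sketch using a FIFO queue as the crawl frontier.
# The adjacency list `links` is a hypothetical stand-in for the missing figure.
from collections import deque

def breadth_first_order(start, links):
    visited, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()        # FIFO: finish one level before the next
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)     # new links go to the end of the queue
    return order

links = {"A": ["B", "C", "D", "E", "F"], "D": ["G"], "E": ["H"], "H": ["I"]}
print(breadth_first_order("A", links))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
```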

3. Backlink Count Policy

The backlink count of a page is the number of other webpages that link to it. It reflects the degree to which a page's content is recommended by others, so a search engine's crawling system often uses this metric to gauge page importance and decide the order in which pages are crawled.

In a real network environment, because of advertising links and spam links, the raw backlink count cannot be fully equated with importance. Search engines therefore tend to count only reliable backlinks.
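
The sketch below illustrates the idea: count how many already-seen pages link to each candidate URL and order the frontier accordingly. The link data is a made-up example, not something from the original post.

```python
# Backlink-count ordering sketch: the frontier is sorted by how many known
# pages link to each candidate URL. All URLs here are hypothetical examples.
from collections import defaultdict

def order_frontier_by_backlinks(frontier, known_links):
    """known_links maps a source URL to the list of URLs it links to."""
    backlinks = defaultdict(int)
    for targets in known_links.values():
        for t in targets:
            backlinks[t] += 1                 # one more inbound link for t
    # Pages with the most backlinks are crawled first.
    return sorted(frontier, key=lambda u: backlinks[u], reverse=True)

known_links = {
    "a.html": ["c.html", "d.html"],
    "b.html": ["c.html"],
    "e.html": ["c.html", "d.html"],
}
print(order_frontier_by_backlinks(["c.html", "d.html", "f.html"], known_links))
# ['c.html', 'd.html', 'f.html']  (3, 2 and 0 backlinks respectively)
```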

4. Partial PageRank Policy

The Partial PageRank policy borrows the idea of the PageRank algorithm: the already-downloaded pages, together with the URLs in the queue to be crawled, form a page set, and a PageRank value is computed for every page in this set. The URLs in the queue are then sorted by their PageRank values and crawled in that order.

Recomputing PageRank after every single page is crawled would be far too expensive. A compromise is to recompute it only after every K pages have been downloaded. This raises another problem: links extracted from downloaded pages that have not yet been crawled themselves have no PageRank value. To handle them, such a page is given a temporary PageRank value: the PageRank contributions passed in from all of its inbound links are summed up to form its value, which then takes part in the sorting. The figure below gives an example:
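
The sketch below is a simplified rendering of this idea, assuming a hypothetical link graph: PageRank is iterated over the downloaded pages plus the frontier URLs, so frontier pages accumulate the contributions of their inbound links as described above, and the frontier is then sorted by score (dangling-page handling and other refinements are omitted for brevity).

```python
# Partial PageRank sketch: scores are computed over downloaded pages plus the
# frontier, and the frontier is crawled in decreasing PageRank order.
# The graph below is a hypothetical example.
def partial_pagerank_order(out_links, frontier, damping=0.85, iterations=20):
    pages = set(out_links) | set(frontier)       # downloaded pages + frontier URLs
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    if t in new_rank:
                        new_rank[t] += share     # inbound contribution from src
        rank = new_rank
    return sorted(frontier, key=lambda u: rank[u], reverse=True)

downloaded = {"A": ["B", "C", "D"], "B": ["C"]}  # out-links of crawled pages
frontier = ["C", "D"]                            # URLs waiting to be crawled
print(partial_pagerank_order(downloaded, frontier))  # ['C', 'D'] (C gets more inbound rank)
```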

5. OPIC Policy (Online Page Importance Computation)

OPIC in effect assigns each page an importance score. Before the algorithm starts, every page is given the same amount of initial cash. When a page P is downloaded, P's cash is divided among the pages it links to, and P's own cash is cleared. The pages in the queue of URLs to be crawled are then sorted by the amount of cash they have accumulated.
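
A minimal sketch of this cash-propagation idea is shown below, on a made-up link graph; the alphabetical tie-break is only there to keep the example deterministic and is not part of the policy as described.

```python
# OPIC sketch: every page starts with the same cash; downloading a page spreads
# its cash over its out-links and clears it; the richest frontier page is next.
from collections import defaultdict

def opic_order(seeds, out_links, initial_cash=1.0):
    cash = defaultdict(lambda: initial_cash)   # equal initial cash for all pages
    frontier = set(seeds)
    order = []
    while frontier:
        # Crawl the page with the most accumulated cash (alphabetical tie-break
        # added only to keep this sketch deterministic).
        page = max(sorted(frontier), key=lambda p: cash[p])
        frontier.remove(page)
        order.append(page)
        targets = out_links.get(page, [])
        if targets:
            share = cash[page] / len(targets)  # split P's cash over its out-links
            for t in targets:
                cash[t] += share
                if t not in order:
                    frontier.add(t)
        cash[page] = 0.0                       # clear P's cash after downloading
    return order

out_links = {"A": ["B", "C"], "B": ["C", "D"]}
print(opic_order(["A"], out_links))  # ['A', 'B', 'C', 'D']: C outranks D, cash from both A and B
```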

6. Big Site Priority Policy

The URLs in the queue to be crawled are grouped by the website they belong to, and websites with the largest number of pages waiting to be downloaded are crawled first. This is why it is called the big site priority policy.
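
A sketch of this grouping, using made-up URLs:

```python
# Big-site-priority sketch: group frontier URLs by host and schedule the hosts
# with the largest backlog of pages first. All URLs here are fictitious.
from collections import defaultdict
from urllib.parse import urlparse

def order_by_big_site(frontier):
    by_host = defaultdict(list)
    for url in frontier:
        by_host[urlparse(url).hostname].append(url)
    # Hosts with the most pages waiting to be downloaded come first.
    ordered_hosts = sorted(by_host, key=lambda h: len(by_host[h]), reverse=True)
    return [url for host in ordered_hosts for url in by_host[host]]

frontier = [
    "http://big.example.com/page1", "http://small.example.org/x",
    "http://big.example.com/page2", "http://big.example.com/page3",
]
print(order_by_big_site(frontier))
# big.example.com's three pages are scheduled before small.example.org's single page
```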

 

Bibliography:

1. This Is the Search Engine: A Detailed Explanation of Core Technologies, Zhang Junlin, Electronics Industry Press.

2. Search Engine Technology Basics, Liu Yiqun et al., Tsinghua University Press.

Author: wawlian