Search Engine: Web Crawler


A general-purpose search engine processes Internet webpages, which currently number in the tens of billions. The search engine's web crawler efficiently downloads this massive volume of webpage data to local storage, creating a local mirror of the Internet's webpages. It is a key, foundational component of the search engine system.

1. Web crawlers essentially issue the same HTTP requests as a browser

Web browsers and web crawlers are two different kinds of network client, but both retrieve webpages in the same way:

1) First, the client program connects to the Domain Name System (DNS) server, and the DNS server converts the host name to an IP address.

2) Next, the client connects to the server at that IP address. On the server, several different processes may be running, each listening on the network for new connections, and each listening on a different network port. A port is a 16-bit number used to identify different services. HTTP requests use port 80 by default.

3) Once a connection is established, the client sends an HTTP request to the server. After receiving the request, the server returns the response to the client.

4) The client closes the connection.
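To make these four steps concrete, here is a minimal sketch in Python using only the standard library; the host name example.com and the request path "/" are arbitrary illustrations, not part of the original text.

```python
# Minimal sketch of the four steps above using Python's standard library only.
# The host name (example.com) and the request path ("/") are arbitrary examples.
import socket

host = "example.com"

# 1) Resolve the host name to an IP address via DNS.
ip_address = socket.gethostbyname(host)

# 2) Connect to the server on port 80, the default HTTP port.
conn = socket.create_connection((ip_address, 80), timeout=10)

# 3) Send an HTTP request and read the response.
request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
conn.sendall(request.encode("ascii"))
response = b""
while chunk := conn.recv(4096):
    response += chunk

# 4) Close the connection.
conn.close()

print(response.split(b"\r\n\r\n", 1)[0].decode())  # show only the response headers
```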

For a more detailed understanding of how HTTP works, see: Network Interconnection Reference Model (detailed description) and Apache Running Mechanism Analysis.

2. Search Engine Crawler Architecture

However, a browser issues an HTTP request only when the user initiates one, whereas a crawler must issue HTTP requests automatically. A web crawler therefore needs a complete architecture to do its work.

Although crawler technology has developed over decades and its overall framework is relatively mature, the continuous evolution of the Internet still presents challenging new problems. The general crawler framework is as follows:

General crawler framework

General crawler framework process:

1) First, carefully select a set of webpages from the Internet and use their links as seed URLs;

2) Put these seed URLs into the URL queue to be crawled;

3) The crawler reads URLs from the to-be-crawled queue in sequence and uses DNS resolution to convert each link address into the IP address of the corresponding web server;

4) Then the IP address and the relative path of the webpage are handed over to the webpage downloader;

5) The webpage downloader downloads the page content.

6) For a webpage that has been downloaded locally, on the one hand it is stored in the page repository, where it waits for indexing and other subsequent processing; on the other hand, its URL is placed in the crawled-URL queue, which records the URLs of webpages the crawler system has already downloaded so that the same webpage is not crawled repeatedly.

7) For each newly downloaded webpage, extract all the links it contains and check them against the crawled-URL queue. If a link has not been crawled yet, append it to the end of the to-be-crawled URL queue.

8, 9) Finally, the webpage corresponding to that URL is downloaded in a later round of crawl scheduling. In this way a loop is formed that continues until the to-be-crawled URL queue is empty, as sketched below.
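As a rough illustration of steps 1-9, the following Python sketch wires the two queues together; download(), extract_links(), and store() are assumed placeholder functions standing in for the downloader, link extractor, and page repository, not real APIs.

```python
# A minimal sketch of the crawl loop described above. download(), extract_links(),
# and store() are assumed helpers for fetching a page, parsing its links, and
# saving it to the page repository.
from collections import deque

def crawl(seed_urls, download, extract_links, store, max_pages=1000):
    to_crawl = deque(seed_urls)   # URL queue to be crawled (steps 1-2)
    crawled = set()               # URLs already crawled (step 6)

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()          # step 3: take the next URL in sequence
        if url in crawled:
            continue
        page = download(url)              # steps 4-5: resolve and fetch the page
        store(url, page)                  # step 6: save to the page repository
        crawled.add(url)
        for link in extract_links(page):  # step 7: extract new links
            if link not in crawled:
                to_crawl.append(link)     # steps 8-9: schedule for later crawling
    return crawled
```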

3. Crawling Policies

In a crawler system, the to-be-crawled URL queue is an important component. How the URLs in this queue are ordered is also an important issue, because it determines which pages are crawled first and which later. The method that determines this order is called the crawling policy.


3.1 Depth-First Search Strategy (following the vine)

This is the graph depth-first traversal algorithm: the web crawler starts from a page, follows one link, then a link on the resulting page, and so on down a single line of links; when that line is exhausted, it returns to the next starting page and continues following links.

The following figure illustrates this:

Assume that the Internet is a directed graph in which each vertex represents a webpage. If the initial state is that no vertex in the graph has been visited, depth-first search can start from some vertex v: first visit v, then in turn perform a depth-first search from each of v's unvisited adjacent vertices, until every vertex on a path from v has been visited. If unvisited vertices remain in the graph, select one of them as a new starting point and repeat the process until every vertex in the graph has been visited.

The following example uses the undirected graph G1 to illustrate depth-first search:

G1

Depth-first search process:

Assume that the search and crawl starts from vertex page V1. After visiting V1, select its adjacent vertex V2. Since V2 has not been visited, the search continues from V2, and so on from V4, V8, and then V5. After V5 is visited, all of V5's adjacent vertices have already been visited, so the search backtracks to V8. For the same reason it backtracks further to V4, V2, and finally V1. Since another adjacent vertex of V1 (V3) has not yet been visited, the search proceeds from V1 to V3, and then to V6 and V7. The resulting vertex visit sequence is:

V1 → V2 → V4 → V8 → V5 → V3 → V6 → V7
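The following short Python sketch reproduces this depth-first traversal. Since the figure of G1 is not reproduced here, the adjacency lists below are an assumption inferred from the traversal described in the text.

```python
# Depth-first traversal of the example graph G1. The adjacency lists are an
# assumption reconstructed from the traversal order described in the text.
graph = {
    "V1": ["V2", "V3"],
    "V2": ["V1", "V4", "V5"],
    "V3": ["V1", "V6", "V7"],
    "V4": ["V2", "V8"],
    "V5": ["V2", "V8"],
    "V6": ["V3", "V7"],
    "V7": ["V3", "V6"],
    "V8": ["V4", "V5"],
}

def dfs(start, graph, visited=None):
    visited = visited if visited is not None else []
    visited.append(start)                  # visit the current vertex
    for neighbor in graph[start]:
        if neighbor not in visited:        # recurse into unvisited neighbors
            dfs(neighbor, graph, visited)
    return visited

print(" -> ".join(dfs("V1", graph)))  # V1 -> V2 -> V4 -> V8 -> V5 -> V3 -> V6 -> V7
```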

3.2 Breadth-First Search Strategy

The basic idea of the breadth-first traversal strategy is to insert the links found on a newly downloaded page at the end of the to-be-crawled URL queue. That is, the web crawler first crawls all the webpages linked from the starting page, then picks one of them and crawls all the webpages linked from it, and so on. The design and implementation of this algorithm are relatively simple. To cover as many webpages as possible, the breadth-first search method is generally used. Many studies also apply breadth-first search to focused crawling; the basic idea is that webpages within a certain number of links from the initial URL have a high probability of being topically relevant. Another approach combines breadth-first search with webpage filtering: first crawl pages with the breadth-first strategy, then filter out the irrelevant ones. The disadvantage of these methods is that as the number of crawled webpages grows, a large number of irrelevant webpages are downloaded and filtered, and the efficiency of the algorithm drops.

Take the preceding graph G1 as an example. The crawl process is as follows:

Breadth-first search process:

First, visit page V1 and V1's adjacent vertices V2 and V3; then visit V2's adjacent vertices V4 and V5 and V3's adjacent vertices V6 and V7; finally visit V4's adjacent vertex V8. At this point the adjacent vertices of all these vertices have been visited, and since every vertex in the graph has been visited, the graph traversal is complete. The resulting vertex visit sequence is:

V1 → V2 → V3 → V4 → V5 → V6 → V7 → V8

As with depth-first search, a visited-flag array is also needed during traversal. In addition, in order to visit the vertices at path length 2, 3, ... in sequence, a queue must be used to store the already-visited vertices at path length 1, 2, ....
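For comparison, here is a breadth-first sketch over the same assumed adjacency lists, using a FIFO queue exactly as described above.

```python
# Breadth-first traversal of the same assumed graph G1, using a FIFO queue
# to hold visited vertices of path length 1, 2, ... as described in the text.
from collections import deque

graph = {"V1": ["V2", "V3"], "V2": ["V1", "V4", "V5"], "V3": ["V1", "V6", "V7"],
         "V4": ["V2", "V8"], "V5": ["V2", "V8"], "V6": ["V3", "V7"],
         "V7": ["V3", "V6"], "V8": ["V4", "V5"]}

def bfs(start, graph):
    visited = [start]              # visited-flag structure (order of first visits)
    queue = deque([start])         # FIFO queue of vertices awaiting expansion
    while queue:
        vertex = queue.popleft()
        for neighbor in graph[vertex]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited

print(" -> ".join(bfs("V1", graph)))  # V1 -> V2 -> V3 -> V4 -> V5 -> V6 -> V7 -> V8
```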

3.3 Best-First Search Strategy

The best-first search strategy uses a certain webpage analysis algorithm to predict the similarity between a candidate URL and the target webpage, or its relevance to the topic, and selects one or more of the best-evaluated URLs to crawl.

3.4 Reverse Link Count Strategy
The reverse link count of a webpage is the number of links from other webpages that point to it. It indicates the degree to which the page's content is recommended by others. Therefore, the crawling system of a search engine often uses this indicator to evaluate the importance of webpages and to decide the order in which they are crawled.

In a real network environment, because of advertising links and spam links, the reverse link count cannot be treated as fully equivalent to importance. Therefore, search engines usually consider only the number of reliable reverse links.
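A tiny illustration of the idea, counting in-links from an outgoing-link map; the page names and links below are made up for the example.

```python
# Count reverse (incoming) links from a map of outgoing links. The data is
# purely illustrative.
from collections import Counter

out_links = {
    "pageA": ["pageB", "pageC"],
    "pageB": ["pageC"],
    "pageD": ["pageC", "pageB"],
}

in_link_count = Counter(target for targets in out_links.values() for target in targets)
print(in_link_count.most_common())  # pageC has 3 in-links, pageB has 2
```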

3.5 Partial PageRank Strategy (a best-first search strategy)
The Partial PageRank algorithm borrows the idea of the PageRank algorithm: based on a webpage analysis algorithm, it predicts which candidate URLs are most similar to the target webpages, or most relevant to the topic, and selects one or more of the best-evaluated URLs to crawl. Concretely, the already-downloaded webpages, together with the URLs in the to-be-crawled queue, form a set of webpages, and a PageRank value is computed for every page in this set. After the computation, the URLs in the to-be-crawled queue are sorted by their PageRank values and crawled in that order.

It visits only the webpages that the webpage analysis algorithm predicts to be "useful". One problem is that many relevant webpages along the crawl path may be ignored, because best-first search is a locally optimal algorithm. Therefore, best-first search needs to be improved in combination with the specific application in order to escape such local optima. Research shows that this kind of closed-loop adjustment can reduce the number of irrelevant webpages by 30% to 90%.
Recomputing the PageRank values after every single page is crawled would be too expensive; a compromise is to recompute PageRank after every K pages have been crawled. This raises another problem: the links extracted from downloaded pages, that is, the unknown webpages mentioned earlier, have no PageRank value yet. To solve this, these pages are given a temporary PageRank value: the PageRank contributions passed in over all of a page's in-links are summed to form the PageRank value of that unknown page, which then participates in the sorting.
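The following sketch shows one way the Partial PageRank idea could be applied: recompute PageRank over the pages seen so far (downloaded pages plus queued URLs) and reorder the to-be-crawled queue by the resulting scores. The function names, damping factor, and iteration count are illustrative assumptions, not the exact formulation from the source.

```python
# Rough sketch of Partial PageRank: compute PageRank over the known link graph,
# then sort the to-be-crawled queue by the resulting scores.
def partial_pagerank(link_graph, damping=0.85, iterations=20):
    # link_graph maps each downloaded page to the pages it links to; queued URLs
    # appear only as link targets and receive rank purely from their in-links.
    pages = set(link_graph) | {t for targets in link_graph.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in link_graph.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share   # sum of contributions over in-links
        rank = new_rank
    return rank

def reorder_queue(to_crawl, link_graph):
    # Called every K downloads: highest-ranked URLs are crawled first.
    rank = partial_pagerank(link_graph)
    return sorted(to_crawl, key=lambda url: rank.get(url, 0.0), reverse=True)
```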

3.6 OPIC Strategy (Online Page Importance Computation)
This algorithm also essentially rates the importance of pages. Before the algorithm starts, every page is given the same initial amount of "cash". After a page P is downloaded, P's cash is distributed among all the links extracted from P, and P's own cash is cleared. All pages in the to-be-crawled URL queue are then sorted by the amount of cash they hold.
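A minimal sketch of the cash bookkeeping, assuming placeholder download() and extract_links() helpers:

```python
# Minimal sketch of the OPIC cash-distribution idea described above.
# download() and extract_links() are assumed helpers, not real APIs.
def opic_crawl(seed_urls, download, extract_links, max_pages=100):
    cash = {url: 1.0 for url in seed_urls}   # every page starts with the same cash
    frontier = set(seed_urls)
    downloaded = set()
    while frontier and len(downloaded) < max_pages:
        url = max(frontier, key=lambda u: cash.get(u, 0.0))  # richest page first
        frontier.remove(url)
        links = extract_links(download(url))
        if links:
            share = cash.get(url, 0.0) / len(links)
            for link in links:               # distribute P's cash to its out-links
                cash[link] = cash.get(link, 0.0) + share
                if link not in downloaded:
                    frontier.add(link)
        cash[url] = 0.0                      # clear P's cash
        downloaded.add(url)
    return downloaded
```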

3.7 Big Site Priority Strategy
All webpages in the to-be-crawled URL queue are classified according to the website they belong to, and the websites with the largest number of pages waiting to be downloaded are downloaded first. This is therefore called the big site priority strategy.
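A small sketch of big-site-first ordering using only the Python standard library; it simply counts pending URLs per host and sorts the queue accordingly.

```python
# Order the to-be-crawled queue so that URLs from sites with the most pending
# pages come first. Uses only the standard library.
from collections import Counter
from urllib.parse import urlparse

def big_site_first(to_crawl):
    # Count how many pending URLs each website (host) has in the queue.
    pending_per_host = Counter(urlparse(url).netloc for url in to_crawl)
    # URLs belonging to the sites with the most pending pages come first.
    return sorted(to_crawl,
                  key=lambda url: pending_per_host[urlparse(url).netloc],
                  reverse=True)
```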

4. Webpage Update Policies

The Internet changes in real time and is highly dynamic. A webpage update policy decides when to re-crawl a page that has been downloaded before. There are three common update policies:

1. Historical reference policy

As the name suggests, this policy predicts when a page will change in the future based on its historical updates, usually by modeling the update history as a Poisson process.

2. User experience policy

Although a search engine can return a large number of results for a given query, users usually only look at the first few pages of results. The crawling system can therefore update the pages that appear on the first few result pages first and update the remaining pages later. This policy also needs historical information: it keeps multiple historical versions of a webpage and, based on how each past content change affected search quality, derives an average value that is used to decide when to re-crawl the page.

3. Clustering sampling strategy

Both of the preceding policies rely on a page's history, which raises two problems: first, storing multiple historical versions of every page adds a considerable burden to the system; second, a newly discovered page has no history at all, so no update policy can be derived for it. The clustering sampling strategy assumes that a webpage has many attributes and that webpages with similar attributes have similar update frequencies. To estimate the update frequency of a category of webpages, it is enough to sample some pages from that category and use their update cycle as the update cycle of the whole category.

 

 

5. Cloud Storage of Webpage Documents

The following background technologies are applied:

1. GFS: the GFS distributed file system is used to store massive numbers of files;

2. BigTable: a BigTable data model built on top of GFS;

3. External Store: a storage model also based on BigTable's storage and computing model;

4. Map/Reduce: the cloud computing model and the system's computing framework.

5.1 Storing Raw Webpage Information in BigTable

The logical model is shown in Figure 4-1. The example crawldb table is used to store the webpage information fetched by the crawler.

Here, the row key is the URL of the webpage. For sorting efficiency, the host domain name in the URL is often reversed; for example, www.facebook.com is stored as com.facebook.www.

The column families include title, content, and anchor. The title family stores the title of the webpage, the content family stores the HTML content of the webpage, and the anchor family stores links from other webpages that reference this page: the qualifier is the URL of the referencing webpage, and the cell value is the anchor text displayed for the link on that page; the host domain part of the anchor URL is also reversed. Content of the same webpage fetched at different times is distinguished by timestamp, so the different versions can be seen along the time axis.

 

Figure 4-1 Crawldb table logical model
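The row-key convention described above (reversing the host name so that pages of the same domain sort next to each other) can be sketched as follows; the URL is an arbitrary example.

```python
# Build a row key by reversing the host domain name, as described above.
from urllib.parse import urlparse

def row_key(url):
    parsed = urlparse(url)
    reversed_host = ".".join(reversed(parsed.netloc.split(".")))  # www.x.com -> com.x.www
    return reversed_host + parsed.path

print(row_key("http://www.facebook.com/index.html"))  # com.facebook.www/index.html
```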

In actual storage, the multi-dimensional logical structure shown in Figure 4-1 is flattened into two dimensions as sorted (Key, Value) pairs. The Key is composed of four parts: the row key, the column family (stored as an 8-bit encoding), the column qualifier, and the timestamp; Figure 4-2 shows the actual structure of the Key. During sorting, keys with the most recent timestamp are placed first. A flag field indicates the operation to apply to the (Key, Value) record, such as insert, delete, or update.

Figure 4-2 key structure
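A short sketch of how such four-part keys could be ordered, with the newest timestamp first; the numeric column-family codes and timestamps below are purely illustrative.

```python
# Sort four-part keys by (row key, column family, qualifier), newest timestamp first.
def sort_key(entry):
    row_key, column_family, qualifier, timestamp = entry
    return (row_key, column_family, qualifier, -timestamp)  # negate for newest-first

keys = [
    ("com.facebook.www/", 2, "", 1001),  # content family, older version
    ("com.facebook.www/", 2, "", 1005),  # content family, newer version
    ("com.facebook.www/", 1, "", 1001),  # title family
]
for key in sorted(keys, key=sort_key):
    print(key)  # within the content family, timestamp 1005 comes before 1001
```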

Figure 4-3 shows the sorted format after the two-dimensional flattening of the crawldb table. In the figure, the Key column is composed of the row key (the page URL), the column family, the column qualifier, and the timestamp. The Key's flag field is not shown; it is mainly used when processing table entries.

Figure 4-3 Key/Value list of the crawldb table

Figure 4-4 shows the CellStore file format of the crawldb table. A CellStore file stores the sorted Key-Value pairs. Physically, the data is compressed and organized in blocks of about 64 KB. At the end of the file, three index structures are kept: a Bloom filter, a block index (row key plus the block's offset within the file), and a trailer.

5.2 Processing Webpage Information with the Map/Reduce Computing Model: Webpage Deduplication and Inverted Index Generation

We adopt a simple policy to deduplicate webpages: the goal is to find all webpages in the collection that have the same content, which is done by hashing the page content, for example with MD5; if two webpages have the same MD5 value, they are considered identical. In the Map/Reduce framework, the input data is the webpages themselves: the URL of a webpage serves as the Key of the input record and the page content as the Value. The Map operation computes the MD5 hash of each page's content and emits the hash as the Key of the intermediate data, with the page URL as the Value. The Reduce operation collects the URLs that share the same intermediate Key into a linked list; each such list represents the set of webpages whose content hashes to the same value. This completes the task of identifying webpages with identical content.
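The deduplication job can be simulated in a few lines of plain Python to show the Map and Reduce roles; the page URLs and contents below are made-up examples.

```python
# Simulated Map/Reduce deduplication: Map emits (MD5-of-content, URL) pairs,
# Reduce groups URLs whose content hashes to the same value.
import hashlib
from collections import defaultdict

def map_phase(pages):
    for url, content in pages.items():
        yield hashlib.md5(content.encode("utf-8")).hexdigest(), url  # (hash, URL)

def reduce_phase(mapped):
    groups = defaultdict(list)
    for content_hash, url in mapped:
        groups[content_hash].append(url)  # URLs with identical content group together
    return groups

pages = {"http://a.example/1": "<html>same</html>",
         "http://b.example/2": "<html>same</html>",
         "http://c.example/3": "<html>different</html>"}
for content_hash, urls in reduce_phase(map_phase(pages)).items():
    print(content_hash[:8], urls)
```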

For the task of building an inverted index, as shown in Figure 4-6, the input data is also the webpages, with the DOCID of each webpage as the Key of the input record and the set of words appearing on the page as the Value. The Map operation transforms the input into (word, DOCID) pairs, that is, the word becomes the Key of the intermediate data and the DOCID the Value, meaning "this word appears on this webpage". The Reduce operation merges records with the same Key to obtain, for each word, the list of webpage IDs in which it occurs: <word, List(DOCID)>. This is the posting list for the word, and in this way a simple inverted index is built. More complex processing can be performed in the Reduce stage to produce richer inverted indexes.

Figure 4-6
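Similarly, the inverted-index job can be sketched in plain Python; the documents below are made-up examples, and a real system would of course tokenize far more carefully.

```python
# Simulated Map/Reduce inverted-index construction: Map emits (word, DOCID)
# pairs, Reduce merges them into posting lists <word, List(DOCID)>.
from collections import defaultdict

def map_phase(docs):
    for docid, text in docs.items():
        for word in set(text.lower().split()):  # emit (word, DOCID) pairs
            yield word, docid

def reduce_phase(mapped):
    index = defaultdict(list)
    for word, docid in mapped:
        index[word].append(docid)               # posting list for each word
    return index

docs = {1: "web crawler downloads pages", 2: "search engine indexes pages"}
print(dict(reduce_phase(map_phase(docs))))      # "pages" maps to [1, 2]
```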

References:

This is the search engine: detailed explanation of core technologies

Search Engines: Information Retrieval in Practice
