[Search engine] Web crawler technology for search engines

Source: Internet
Author: User
Tags: domain name server

With the development of the Internet, the web has become the main carrier of information, and collecting that information efficiently is one of the major challenges in the field. What is web crawler technology? It refers to crawling data across the network: because the crawl follows links from page to page, much like a spider crawling over a web, the technique is vividly called a web crawler. A web crawler is also known as a web robot or a web chaser.

Web crawler technology is the most fundamental data technology in a search engine architecture. Through it, tens of billions of web pages on the Internet can be downloaded and stored locally as a mirror, providing the data on which the rest of the search engine is built.

1. Basic workflow and infrastructure of web crawlers

A web crawler obtains web pages in exactly the same way a browser does: over the HTTP protocol. The process mainly includes the following steps:

1) Connect to a DNS server and resolve the domain name of the URL to be crawled (URL → IP);

2) Send an HTTP request according to the HTTP protocol and obtain the content of the web page.
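As a rough illustration, the two steps above can be reproduced with Python's standard library; the URL below is only a placeholder.

```python
import socket
from urllib.request import urlopen
from urllib.parse import urlparse

url = "http://example.com/index.html"   # placeholder seed URL

# Step 1: resolve the host name to an IP address (URL -> IP)
host = urlparse(url).hostname
ip = socket.gethostbyname(host)
print(f"{host} resolved to {ip}")

# Step 2: send an HTTP request and read the page content
with urlopen(url, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
print(f"fetched {len(html)} characters")
```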

A complete web crawler infrastructure is as follows:

The overall architecture generally works as follows:

1) The demand side provides a list of seed URLs to be crawled; based on this list and the corresponding priorities, a queue of URLs to be crawled is created (first in, first out);

2) Pages are fetched in the order given by the queue of URLs to be crawled;

3) The downloaded page content is stored in a local web page library, and the URL is added to the list of crawled URLs (used for de-duplication and for judging whether a page has already been crawled);

4) Links extracted from the crawled pages are placed back into the queue of URLs to be crawled, and the loop continues.
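Putting the four steps together, a minimal single-machine crawl loop might look like the sketch below. Link extraction is deliberately simplified to a regex, and the seed URL is a placeholder.

```python
import re
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin

seed_urls = ["http://example.com/"]            # 1) seed list from the demand side
frontier = deque(seed_urls)                    # queue of URLs to be crawled (FIFO)
crawled = set()                                # set of crawled URLs, for de-duplication
page_store = {}                                # local "web page library"

while frontier:
    url = frontier.popleft()                   # 2) crawl in queue order
    if url in crawled:
        continue
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        continue
    page_store[url] = html                     # 3) save the downloaded content locally
    crawled.add(url)
    for link in re.findall(r'href="([^"]+)"', html):
        new_url = urljoin(url, link)
        if new_url not in crawled:
            frontier.append(new_url)           # 4) put extracted links back into the queue
```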

2. Crawling strategies of web crawlers

In a crawler system, the queue of URLs to be crawled is an important component. The order of URLs in this queue also matters, because it determines which page is fetched first and which next. The method used to determine this order is called the crawling strategy. Several common strategies are described below:

  1) Depth-first traversal strategy

The depth-first traversal strategy is easy to understand: it is the same as depth-first traversal of a graph, because the web itself is a graph. The idea is to start crawling from a seed page and follow its links one after another, going deeper and deeper; when no further links can be followed, the crawler returns to the previous page and continues along its remaining links.
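A minimal sketch of this ordering: the frontier is used as a stack, so the most recently discovered link is followed first. Here get_links(url) is a hypothetical helper that downloads a page and returns the URLs it links to.

```python
# Depth-first crawl order: the frontier is used as a stack (LIFO).
def depth_first_crawl(seed_url, get_links, max_pages=1000):
    stack = [seed_url]
    visited = set()
    order = []
    while stack and len(order) < max_pages:
        url = stack.pop()                 # take the most recently found link first
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in reversed(get_links(url)):
            if link not in visited:
                stack.append(link)        # follow the new chain before backtracking
    return order
```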

An example of depth-first search on a graph is as follows:

The left figure is a directed graph, and the right figure shows the depth-first traversal process. The result of the depth-first traversal is:

  2) Breadth-first search strategy

Breadth-first search works in exactly the opposite way to depth-first search: links found in a newly downloaded page are appended to the end of the queue of URLs to be crawled. In other words, the crawler first fetches all pages linked from the start page, then picks one of those linked pages and fetches all pages linked from it, and so on.

For the directed graph in the example above, the breadth-first traversal result is:

v1 → v2 → v3 → v4 → v5 → v6 → v7 → v8

Seen from the structure of a tree, breadth-first traversal of a graph corresponds to the level-order (hierarchical) traversal of a tree.
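The same sketch as before becomes breadth-first simply by using a FIFO queue instead of a stack; get_links is the same hypothetical page-fetching helper.

```python
from collections import deque

# Breadth-first crawl order: the frontier is a FIFO queue, so pages are
# visited level by level, like a level-order traversal of a tree.
def breadth_first_crawl(seed_url, get_links, max_pages=1000):
    queue = deque([seed_url])
    visited = {seed_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()             # oldest discovered link first
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)        # new links go to the end of the queue
    return order
```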

  3) Backlink count strategy

The number of backlinks of a page is the number of other pages that link to it. It indicates the extent to which the page's content is recommended by others. Many search engine crawl systems therefore use this metric to evaluate the importance of pages and thus to decide the order in which different pages are crawled.

In the real web environment, because of advertising links and spam links, the raw backlink count does not fully reflect a page's importance. Search engines therefore tend to count only backlinks that are considered reliable.
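A toy sketch of this prioritisation, assuming a small in-memory link graph (the page names and links are made up): count how many pages point to each page, then sort the frontier by that count.

```python
from collections import Counter

# link_graph maps each known page to the pages it links to (example data).
link_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "B"],
}

# Count backlinks: how many pages link *to* each page.
backlinks = Counter(target for targets in link_graph.values() for target in targets)

# Order the pages still to be crawled by descending backlink count.
frontier = ["A", "B", "C"]
frontier.sort(key=lambda url: backlinks[url], reverse=True)
print(frontier)          # "C" has the most backlinks, so it is crawled first
```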

  4) Major station priority strategy

  All pages in the queue of URLs to be crawled are categorized by the site they belong to; sites with more pages waiting to be downloaded are downloaded first. This is why the strategy is called the major station priority strategy.
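A minimal sketch of this grouping, with made-up site names: pending URLs are bucketed by host, and the site with the most queued pages is served first.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Pending URLs, grouped by the site they belong to (example data).
pending = [
    "http://big-site.example/a", "http://big-site.example/b",
    "http://big-site.example/c", "http://small-site.example/x",
]

by_site = defaultdict(list)
for url in pending:
    by_site[urlparse(url).netloc].append(url)

# Major station first: download from the site with the most pages waiting.
for site, urls in sorted(by_site.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(site, len(urls), "pages queued")
```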

  5) Other search strategies

Other commonly used crawler search strategies include the partial PageRank strategy (the next URL to crawl is chosen according to its PageRank score) and the OPIC strategy (another way of ranking pages by importance). Finally, crawl intervals can be set per page according to need, which helps ensure that content from important or very active sites is not missed.
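The OPIC idea can be sketched very roughly as follows: every page holds some "cash"; when a page is crawled its cash is split among its outlinks, and the uncrawled page with the most accumulated cash is fetched next. This is only an illustrative approximation on a made-up link graph, not the full algorithm.

```python
# Rough OPIC-style prioritisation (illustrative only).
link_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "B"]}

cash = {page: 1.0 for page in link_graph}     # every page starts with the same cash
crawled_order = []

for _ in range(len(link_graph)):
    # pick the uncrawled page with the largest cash balance
    page = max((p for p in cash if p not in crawled_order), key=lambda p: cash[p])
    crawled_order.append(page)
    outlinks = link_graph.get(page, [])
    if outlinks:
        share = cash[page] / len(outlinks)    # distribute its cash to its outlinks
        for target in outlinks:
            cash[target] = cash.get(target, 0.0) + share
    cash[page] = 0.0                          # its cash has been passed on

print(crawled_order)
```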

3. Web crawler update strategies

The Internet changes in real time and is highly dynamic. The web page update strategy mainly decides when to re-fetch pages that have already been downloaded. Three common update strategies are described below.

  1) Historical reference strategy

As the name implies, this strategy predicts when a page will change in the future based on its past update history. The prediction is usually modeled as a Poisson process (see the sketch after this list).

  2) User experience strategy

Although a search engine can return an enormous number of results for a query, users tend to look only at the first few pages of results. The crawl system can therefore update first the pages that actually appear on the first few result pages, and update the later ones afterwards. This strategy also requires historical information: the system keeps multiple historical versions of a page and, based on how much each past content change affected search quality, derives an average value that is used to decide when the page should be re-crawled.

  3) Cluster sampling strategy

The two strategies above share a premise: they need the page's historical information. This raises two problems: first, if the system stores multiple historical versions for every page, it adds a considerable burden; second, a brand-new page has no history at all, so no update strategy can be derived for it. The cluster sampling strategy assumes that pages with similar attributes have similar update frequencies. To estimate the update frequency of a category of pages, one only needs to sample pages from that category and use their update cycle as the update cycle of the whole category; this is the basic idea of the strategy.
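Returning to the historical reference strategy above: under the Poisson assumption, the probability that a page has changed since the last crawl follows directly from its estimated change rate. The numbers below are made up for illustration.

```python
import math

def probability_changed(change_count, observed_days, days_since_last_crawl):
    """Estimate how likely a page has changed since it was last crawled,
    assuming changes arrive as a Poisson process."""
    lam = change_count / observed_days                  # estimated changes per day
    return 1.0 - math.exp(-lam * days_since_last_crawl)

# A page that changed 10 times in 100 days, last crawled 7 days ago:
print(round(probability_changed(10, 100, 7), 3))        # about 0.503
```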

4. Distributed crawl system architecture

In general, a crawl system has to face hundreds of millions of pages across the entire Internet. A single crawler cannot complete such a task alone; multiple crawlers must work on it together. A crawl system is therefore usually built as a distributed, three-layer structure:

The bottom layer consists of geographically distributed data centers; each data center contains several crawl servers, and each crawl server may run several sets of crawlers. Together these make up a basic distributed crawl system. The servers within a data center can divide the work in several ways:

  1) Master-slave

The basic structure of the master-slave mode:

In the master-slave mode, a dedicated master server maintains the queue of URLs to be crawled and distributes URLs to the different slave servers, while the slave servers do the actual page downloading. Besides maintaining the URL queue and distributing URLs, the master is also responsible for balancing the load of the slave servers, so that no slave is left too idle or overworked. In this mode, the master tends to become the bottleneck of the system.
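A very small sketch of the master's role: it owns the frontier and hands each URL to the currently least-loaded slave. The slave names, URLs, and loads are made up, and real systems would send the work over a network queue or RPC.

```python
from collections import deque

# The master owns the URL frontier and tracks how busy each slave is.
frontier = deque(["http://example.com/1", "http://example.com/2", "http://example.com/3"])
slave_load = {"slave-1": 0, "slave-2": 0}        # number of URLs currently assigned

assignments = []
while frontier:
    url = frontier.popleft()
    slave = min(slave_load, key=slave_load.get)   # pick the least-loaded slave
    slave_load[slave] += 1
    assignments.append((slave, url))              # in reality: dispatch to that slave

print(assignments)
```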

  2) Peer-to-peer

The basic structure of the peer-to-peer mode:

In this mode there is no division of labor among the crawl servers. Each server can take a URL from the queue of URLs to be crawled, compute a hash value h of the URL's primary domain name, and then compute h mod m (where m is the number of servers, for example m = 3); the result is the number of the server that should handle that URL. For example, suppose the URL www.baidu.com hashes to h = 8 and m = 3; then h mod m = 2, so the link is fetched by the server numbered 2. If server 0 happens to take this URL from the queue, it transfers it to server 2, which then crawls it (a minimal sketch of this assignment appears below). There is a problem with this mode: when a server goes down or a new server is added, the hash result changes for essentially every URL. In other words, this approach scales poorly, and an improved scheme has been proposed for it.
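The mod-m division of labor can be written directly as below. The hash function and server count are placeholders, so the exact server number produced for a given domain depends on the hash used and need not match the h = 8 example above.

```python
import hashlib
from urllib.parse import urlparse

NUM_SERVERS = 3            # m: number of crawl servers

def server_for(url):
    """Map a URL's host name to the server responsible for crawling it
    (a real system would use the primary domain name)."""
    domain = urlparse(url).hostname
    h = int(hashlib.md5(domain.encode()).hexdigest(), 16)   # stable hash value h
    return h % NUM_SERVERS                                  # h mod m = server number

print(server_for("http://www.baidu.com/"))   # some server number in 0..2
```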

The improved scheme uses consistent hashing to determine the division of labor among the servers. A consistent hash maps the primary domain name of a URL to a number in the range 0 to 2^32 and divides this range evenly among the m servers; the hash value of the URL's primary domain name then determines which server crawls it. If a server has a problem, the pages it was responsible for are deferred clockwise and crawled by the next server, so a failure on one server does not affect the work of the others.
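A hedged sketch of such a ring: each server owns a point on a 0 to 2^32 ring, and a domain is handled by the first server whose point lies at or after the domain's hash, wrapping around. The server and domain names are placeholders.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    """Minimal consistent-hash ring: each server owns one point on the ring,
    and a domain is handled by the first server clockwise from its hash."""
    def __init__(self, servers):
        self.points = sorted((_hash(s), s) for s in servers)

    def server_for(self, domain):
        h = _hash(domain)
        keys = [p for p, _ in self.points]
        i = bisect.bisect_left(keys, h) % len(self.points)   # wrap around the ring
        return self.points[i][1]

    def remove(self, server):
        # When a server fails, its domains fall through to the next server clockwise.
        self.points = [(p, s) for p, s in self.points if s != server]

ring = ConsistentHashRing(["server-0", "server-1", "server-2"])
print(ring.server_for("www.example.com"))
ring.remove("server-1")          # only server-1's share of domains has to move
print(ring.server_for("www.example.com"))
```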
5. References

[1] Wawlian: Web crawler fundamentals (I), (II)
[2] Guisu: Search engine: web crawler
[3] This Is the Search Engine: Core Technologies in Detail
