Basic Principles of Web Crawlers (II)


IV. Update Strategy

The Internet changes in real time and is highly dynamic. A page update strategy decides when to re-crawl pages that have already been downloaded. Three common update strategies are described below.

1. Historical Reference Strategy

As the name implies, this strategy predicts when a page will change in the future from the page's past update history. The change behavior is usually modeled as a Poisson process, and the prediction is derived from that model.
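As an illustration, here is a minimal Python sketch of such a prediction, assuming the crawler has recorded the timestamps at which the page was observed to have changed; the function names and the daily granularity are illustrative assumptions rather than anything from the original text.

```python
import math

def estimate_change_rate(change_timestamps):
    """Estimate a Poisson rate (changes per day) from the page's history.

    change_timestamps: sorted UNIX timestamps (seconds) at which the page
    was observed to have changed on previous crawls.
    """
    if len(change_timestamps) < 2:
        return None  # not enough history to estimate a rate
    span_days = (change_timestamps[-1] - change_timestamps[0]) / 86400.0
    return (len(change_timestamps) - 1) / span_days

def probability_changed(rate_per_day, days_since_last_crawl):
    """P(at least one change in the interval) under a Poisson process."""
    return 1.0 - math.exp(-rate_per_day * days_since_last_crawl)

# Example: 10 observed changes spread over 30 days gives a rate of 9/30 per day;
# the page then has roughly a 59% chance of having changed 3 days after the last crawl.
rate = 9 / 30.0
print(probability_changed(rate, days_since_last_crawl=3))
```

Pages with a high probability of having changed can then be scheduled for re-crawling first.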
2. User Experience Strategy

Although a search engine can return a huge number of results for a query, users tend to look only at the first few pages of results. The crawling system can therefore update the pages that appear on those first result pages first, and update the later ones afterwards. This strategy also relies on historical information: it keeps multiple historical versions of each page and, based on how much each past content change affected search quality, computes an average impact value that is used to decide when the page needs to be re-crawled.
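A rough sketch of that averaging step, assuming a per-page history of (change time, measured impact on result quality) pairs; the impact metric and the threshold are assumptions made only for illustration.

```python
def average_quality_impact(history):
    """history: list of (change_time, impact) pairs recorded for past
    versions of one page, where `impact` measures how much that change
    affected search-result quality (the metric itself is an assumption)."""
    if not history:
        return 0.0
    return sum(impact for _, impact in history) / len(history)

def needs_recrawl(history, threshold=0.1):
    """Pages whose past changes mattered to users get re-crawled sooner."""
    return average_quality_impact(history) >= threshold

# Example with two historical versions of one page
history = [("2024-01-02", 0.30), ("2024-02-15", 0.05)]
print(needs_recrawl(history))   # True: average impact 0.175 >= 0.1
```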
3. Cluster Sampling Strategy

The two update strategies above share a prerequisite: they require the page's historical information. This raises two problems. First, if the system stores multiple historical versions for every page, it adds a considerable storage and processing burden. Second, a newly discovered page has no history at all, so no update strategy can be derived for it. The cluster sampling strategy assumes that pages with similar attributes also have similar update frequencies. To estimate the update frequency for a whole category of pages, one only needs to sample some pages from that category and use their observed update cycle as the update cycle of the entire category.
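A minimal sketch of the sampling step, assuming pages have already been clustered by their attributes and that earlier crawls recorded an observed update cycle for at least the sampled pages; the field and function names are illustrative.

```python
import random

def estimate_category_cycle(pages_in_category, sample_size=100):
    """Estimate an update cycle (in days) for a whole category of pages
    by sampling a few of them and averaging their observed cycles.

    pages_in_category: list of page records, each carrying an
    'observed_update_cycle_days' value filled in by earlier crawls.
    """
    sample = random.sample(pages_in_category,
                           min(sample_size, len(pages_in_category)))
    cycles = [p["observed_update_cycle_days"] for p in sample]
    return sum(cycles) / len(cycles)

# Every page in the category, including brand-new pages with no history,
# is then re-crawled on this shared cycle.
```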

V. The Structure of a Distributed Crawling System
In general, a crawling system has to handle hundreds of millions of pages spread across the entire Internet. A single crawler cannot complete such a task on its own; multiple crawlers usually have to work on it together. A crawling system is therefore typically organized as a three-layer distributed structure:

The bottom layer is a set of geographically distributed data centers; each data center contains several crawl servers, and each crawl server may run several sets of crawler programs. Together these form a basic distributed crawling system. The servers within a data center can cooperate in several different ways:

1. Master-Slave

The basic Master-Slave structure:

In Master-Slave mode, a dedicated master server maintains the queue of URLs to be crawled and is responsible for handing URLs out to the different slave servers, while the slave servers do the actual page downloading. Besides maintaining the URL queue and distributing URLs, the master server also balances the load across the slave servers, so that no slave sits idle while another is overworked. In this mode, the master tends to become the bottleneck of the system.
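A toy sketch of this division of labor, assuming one master thread that owns the URL queue and a few slave threads that only download pages; the names are illustrative, and the load balancing mentioned in the text is reduced here to a shared work queue.

```python
import queue
import threading
import urllib.request

url_queue = queue.Queue()   # queue of URLs to be crawled, owned by the master
results = queue.Queue()     # downloaded results reported back to the master

def slave(worker_id):
    while True:
        url = url_queue.get()          # URLs are "distributed" via the shared queue
        if url is None:                # sentinel: no more work for this slave
            break
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
            results.put((worker_id, url, len(html)))
        except Exception as exc:
            results.put((worker_id, url, exc))
        finally:
            url_queue.task_done()

def master(seed_urls, num_slaves=3):
    workers = [threading.Thread(target=slave, args=(i,)) for i in range(num_slaves)]
    for w in workers:
        w.start()
    for url in seed_urls:              # the master distributes the URLs
        url_queue.put(url)
    url_queue.join()                   # wait until every URL has been processed
    for _ in workers:
        url_queue.put(None)            # tell each slave to stop
    for w in workers:
        w.join()

if __name__ == "__main__":
    master(["https://example.com/"])
```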

2. Peer-to-Peer

In this mode, there is no division of labor among the crawl servers: each server can take a URL from the queue of URLs to be crawled. The server then computes a hash value H of the URL's primary domain name and takes H mod m (where m is the number of servers; for example, m = 3); the result is the number of the server that should handle that URL. For example, suppose that for the URL www.baidu.com the hash value is H = 8 and m = 3; then H mod m = 2, so the link is fetched by server number 2. If server number 0 happens to take this URL from the queue instead, it forwards it to server 2, and server 2 crawls the page. The problem with this mode is that when a server goes down or a new server is added, the hash assignment of every URL changes; in other words, this approach scales poorly.
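A minimal sketch of the mod-m assignment. The H = 8 in the text is only an illustrative value; a real hash function returns a large number, but the assignment logic is the same. The md5 choice and the primary-domain extraction here are assumptions.

```python
import hashlib
from urllib.parse import urlparse

M = 3  # number of crawl servers, matching the example in the text

def primary_domain(url):
    """Very rough primary-domain extraction, e.g. 'http://www.baidu.com/' -> 'baidu.com'."""
    host = urlparse(url).hostname or url
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def server_for(url, m=M):
    """Map a URL to a server number: hash the primary domain, then take mod m."""
    h = int(hashlib.md5(primary_domain(url).encode("utf-8")).hexdigest(), 16)
    return h % m

# Whichever server pops the URL forwards it to the server this function names.
print(server_for("http://www.baidu.com/"))   # a number in 0..2
```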

To address this, an improved scheme has been proposed: consistent hashing is used to decide the division of labor among the servers. Consistent hashing hashes the primary domain name of the URL and maps it to a number in the range 0 to 2^32. This range is divided evenly among the m servers, and where the hash of a URL's primary domain name falls determines which server crawls it. If a server fails, the pages it was responsible for are deferred clockwise and crawled by the next server. In this way, even if a server fails at some point, the work of the other servers is not affected.
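A sketch of this scheme as the text describes it: the ring from 0 to 2^32 is divided evenly among m servers, and a failed server's arc falls to the next live server clockwise. Real deployments often place servers on the ring by hashing them, usually with virtual nodes; the server names and the md5 hash below are assumptions.

```python
import bisect
import hashlib

RING = 2 ** 32
M = 3  # number of crawl servers

def ring_hash(domain):
    """Map a primary domain name onto the 0 .. 2^32 - 1 ring."""
    return int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16) % RING

# Divide the ring evenly among the M servers: server i owns the arc that
# ends at position (i + 1) * RING // M.
server_points = sorted(((i + 1) * RING // M, f"crawler-{i}") for i in range(M))
alive = {name for _, name in server_points}

def server_for(domain):
    """Walk clockwise from the domain's hash to the first live server."""
    h = ring_hash(domain)
    positions = [p for p, _ in server_points]
    i = bisect.bisect_left(positions, h) % len(server_points)
    for step in range(len(server_points)):      # skip over failed servers, clockwise
        name = server_points[(i + step) % len(server_points)][1]
        if name in alive:
            return name
    raise RuntimeError("no live crawl servers")

print(server_for("baidu.com"))
alive.discard("crawler-1")    # if crawler-1 fails, only its arc moves to the next server
print(server_for("baidu.com"))
```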

Bibliography:

1. Zhang Junlin, This Is the Search Engine: A Detailed Explanation of Core Technologies, Publishing House of Electronics Industry.

2. Liu Yiqun, Search Engine Technology Basics, Tsinghua University Press.
