An analysis of how Google's spiders work

Last Update:2017-01-13 Source: Internet

Author: User

Tags website domain names


First, let's talk about the origins of Google's spiders:

When Google's search engine was first established, it had a very powerful server that released a large number of spiders every day. We call this kind the No. 1 spider. It crawls very fast: the fact that it collects information from the entire Internet every day shows how fast the server is. More importantly, Google has extended its servers to many cities, which is why its computing speed is now so far ahead.

The server classifies and sorts the collected information into a large database.

One of these databases is used to store website domain names.

As long as the domain name is indexed by the search engine, it will be automatically stored in this database.

This database is the core of spider 1.

It is divided into 10 smaller databases, one for each PR (PageRank) level. Although each is called a small database, it is still enormous!

The databases for the 10 levels have different crawl cycles.

Basically, for a website with PR = 4, spider 1 crawls on a cycle of about 7 days.

Therefore, you will find that the site is indexed on some day within each seven-day period.

Careful webmasters will find that 7 days is sometimes quite accurate, but only for PR = 4.

The higher the PR, the shorter the cycle; the lower the PR, the longer the cycle.

Of course, at this point many webmasters may have doubts, because they may feel that the spider sometimes collects their website on a daily basis.

That brings us to the next topic: spider 2.

Spider 2 is usually released while spider 1 is crawling.

It mainly follows the external links of the websites crawled by spider 1.

PS: since it is called No. 2, this spider must be much smaller in scale than the No. 1 spider.

★Of course, there is not just a No. 2 spider, but also a No. 3

The so-called No. 3 works like this: spider 1 crawls site A, spider 2 follows a link from site A to site B, and spider 3 follows a link from site B to site C.

Currently, to keep this chain from looping indefinitely, Google divides its spiders into only three levels and sets a clear crawl rate for each level.
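
The three-level limit can be pictured as a depth-limited walk over outbound links. The sketch below is only an illustration and not Google's implementation: the link graph, the function name `crawl`, and the reading that each hop hands off to the next spider level are all assumptions made for this example.

```python
from collections import deque

# Hypothetical link graph: each site maps to the sites it links out to.
LINKS = {
    "A": ["B"],
    "B": ["C"],
    "C": ["D"],
    "D": ["A"],  # a cycle, to show that the walk still terminates
}

MAX_LEVEL = 3  # the article says spiders are divided into only three levels


def crawl(seeds, links, max_level=MAX_LEVEL):
    """Return {site: level of the spider that first reached it}."""
    level_of = {seed: 1 for seed in seeds}  # spider 1 crawls the seed sites
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        level = level_of[site]
        if level == max_level:
            continue  # the last level does not hand off to a further spider
        for target in links.get(site, []):
            if target not in level_of:
                level_of[target] = level + 1
                queue.append(target)
    return level_of


print(crawl(["A"], LINKS))  # {'A': 1, 'B': 2, 'C': 3}
```

Site D is never reached in this run, because it would require a fourth level.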

In addition, spider 2 and spider 3 crawl in chronological order.

★For example:

Suppose the last article on website A that spider 1 crawled was dated [...].

When spider 2 later reaches website A from another website, several recently published articles, such as those dated [...] and May 30, may be visited a 2nd or 3rd time.

It then crawls the information published after [...]-1. If your website has no updates, it crawls the last month's changes a second time.

If more No. 2 and No. 3 spiders arrive from outside links, the same article may be crawled several times.

The following is official data provided by Google: <Secret>

★Spider 1

The basic capture rate is 5% ~ 10%

With PR = 0, no inbound links, and the site submitted, the crawl cycle may be 6 months ~ 12 months

With PR = 1, no inbound links, and the site submitted, the crawl cycle may be 4 months ~ 8 months

With PR = 2, no inbound links, and the site submitted, the crawl cycle may be 2 months ~ 4 months

With PR = 3, no inbound links, and the site submitted, the crawl cycle may be 1 month ~ 2 months

With PR = 4 and no inbound links, the crawl cycle may be 1 week ~ 1 month

Of course, a website without any inbound links cannot reach PR = 4.

The maximum it can reach is PR = 3.
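
As a compact restatement of the figures listed above, the baseline cycles can be kept in a small lookup table. This is only a sketch: the names `BASE_CYCLE_DAYS` and `base_cycle` are made up for this example, and converting a month to 30 days and a week to 7 days is an assumption, not something the article states.

```python
# Baseline spider 1 crawl cycles for sites with no inbound (import) links,
# as (shortest, longest) in days. Assumes 1 month = 30 days, 1 week = 7 days.
BASE_CYCLE_DAYS = {
    0: (180, 360),  # 6 ~ 12 months
    1: (120, 240),  # 4 ~ 8 months
    2: (60, 120),   # 2 ~ 4 months
    3: (30, 60),    # 1 ~ 2 months
    4: (7, 30),     # 1 week ~ 1 month
}


def base_cycle(pr):
    """Return the (shortest, longest) baseline crawl cycle in days for a PR level."""
    if pr not in BASE_CYCLE_DAYS:
        # The article only lists figures for PR = 0 through PR = 4.
        raise ValueError(f"no baseline figure listed for PR = {pr}")
    return BASE_CYCLE_DAYS[pr]


for pr in sorted(BASE_CYCLE_DAYS):
    shortest, longest = base_cycle(pr)
    print(f"PR = {pr}: every {shortest} to {longest} days")
```

Reading the table top to bottom shows the rule stated earlier: the higher the PR, the shorter the cycle.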

The above data is only a baseline provided by Google.

It describes the cycle on which spider 1 crawls your website on its own initiative.

Spiders 2 and 3 crawl your website by following your inbound links.

That is why you will find that your website is sometimes re-indexed every day.

★Spider 2

The basic capture rate is 2.5% ~ 5% <re-collects based on the data records crawled by spider 1, revisiting content dated just before and after the last collection date>

★Spider 3

The basic capture rate is 1.25% ~ 2.5% <re-collects based on the data records of spider 1 and spider 2, revisiting content dated just before and after the last collection date>
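
The three capture-rate ranges quoted for spiders 1, 2, and 3 happen to halve at each level. The sketch below encodes that pattern. The halving rule is inferred from the three quoted ranges and is not stated anywhere as a formula, and the names `BASE_RATE_PERCENT` and `capture_rate` are invented for this example.

```python
# Spider 1's quoted basic capture rate, as (low, high) in percent.
BASE_RATE_PERCENT = (5.0, 10.0)


def capture_rate(level):
    """Return the (low, high) capture rate in percent for spider level 1, 2, or 3.

    Assumes each level's range is half the previous one, which matches the
    three ranges quoted in the article but is only an inferred pattern.
    """
    if level not in (1, 2, 3):
        raise ValueError("the article describes only three spider levels")
    factor = 2 ** (level - 1)
    low, high = BASE_RATE_PERCENT
    return (low / factor, high / factor)


for level in (1, 2, 3):
    low, high = capture_rate(level)
    print(f"Spider {level}: {low}% ~ {high}%")
```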

Google currently has three levels of spider.

Of course, there are also different types of spider.

Only the web page spider is covered here, because it is the only one I am interested in.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
