Search engines face a huge number of web pages and cannot crawl every one of them in parallel: no matter how much a search engine's database expands, it can never keep pace with the growth of the web. A search engine therefore crawls the most important pages first. This saves database space, and it also serves ordinary users, who do not need an enormous number of results, only the most important ones. A good collection strategy is thus to prioritize important pages, so that the most important pages are crawled in the shortest possible time.
So how does a search engine crawl the most important pages first?
By analyzing the features of a massive number of web pages, search engines have concluded that important pages share some basic characteristics. These are not necessarily completely accurate, but they hold true most of the time:
1. The page is linked to by other pages: if it is linked many times, or is linked by important pages, it is a very important page;
2. The page's parent page is linked many times or is linked by important pages. For example, if a page is an internal page of a site, its homepage is linked many times, and the homepage links to this page, then this page is also relatively important;
3. The page's content is reproduced and disseminated widely;
4. The URL directory depth of the page is small, making it easy for users to browse. "URL directory depth" is defined here as the number of directory levels in the URL once the domain name is removed: for http://www.domain.com the directory depth is 0, for http://www.domain.com/cs it is 1, and so on (a minimal sketch of this calculation appears after this list). Note that pages with a small URL directory depth are not always important, and pages with a deep directory are not all unimportant; some academic paper pages, for instance, have URLs with very deep directories. Most important web pages exhibit these four features at the same time.
5. Website homepages are collected first and given a high weight value. The number of websites is far smaller than the number of web pages, and important pages are almost always reachable through links from a homepage, so the crawl should give priority to obtaining as many homepages as possible.
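The directory depth defined in feature 4 can be computed from the URL string alone. The following Python sketch is one possible interpretation; the function name and the handling of trailing segments are assumptions made for illustration, not something prescribed by the article:

```python
from urllib.parse import urlparse

def url_directory_depth(url: str) -> int:
    """Count the directory levels that remain after the domain name is removed."""
    path = urlparse(url).path
    # Ignore empty segments produced by leading or trailing slashes.
    segments = [s for s in path.split("/") if s]
    return len(segments)

# Examples matching the definition above:
print(url_directory_depth("http://www.domain.com"))     # 0
print(url_directory_depth("http://www.domain.com/cs"))  # 1
```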
A problem arises here: when a search engine first starts crawling, it cannot know how a page is linked or how widely its content has been reproduced. In other words, at the beginning it does not know the first three features; these can only be determined after the page, or nearly the entire link structure of the web, has been obtained. So how is this solved? Features 4 and 5 can be determined at crawl time; feature 4 in particular can be judged from the URL alone, before the page is fetched, to decide whether a URL meets the "important" criterion, and computing URL directory depth is simply string processing. Statistics show that the average URL is less than 256 characters long, which makes identifying the URL directory depth easy to implement. For determining the collection strategy, therefore, features 4 and 5 are the most important guiding factors.
However, features 4 and 5 have limitations, because URL directory depth alone does not fully indicate how important a page is. How is this problem solved? Search engines use the following approach:
1. URL weight setting: determined by the URL's directory depth; the weight is reduced according to the depth, with a minimum of zero.
2. The initial URL weight is set to a fixed value.
3. Each occurrence of the character "/", "?", or "&" in the URL reduces the weight by a value, and each occurrence of "search", "proxy", or "gate" reduces it by a further value, down to a minimum of zero. (A URL containing "?" or "&" points to a form with parameters: the page must be generated by a program on request, and it is not the kind of static page a search engine system focuses on, so its weight is reduced accordingly. "search", "proxy", or "gate" indicates that the page is very likely a search-results page or a proxy page, so its weight is also reduced.)
4. A selection policy for unvisited URLs. Because a small weight does not necessarily mean a page is unimportant, unvisited URLs with small weight values must still be given some chance of being collected. The selection policy can therefore rotate: pick URLs by weight value in one round and pick them at random in the next, or pick at random once every n rounds.
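Taken together, steps 1 through 4 describe a simple URL-scoring and selection heuristic. The Python sketch below is one hedged interpretation: the penalty constants, function names, and the exact rotation scheme are assumptions made for illustration, not values given in the article.

```python
import random
from urllib.parse import urlparse

INITIAL_WEIGHT = 100   # step 2: fixed starting value (assumed number)
DEPTH_PENALTY = 10     # step 1: subtracted per directory level (assumed)
CHAR_PENALTY = 5       # step 3: subtracted per "/", "?" or "&" (assumed)
WORD_PENALTY = 20      # step 3: subtracted per "search", "proxy", "gate" (assumed)

def url_weight(url: str) -> int:
    """Heuristic URL weight following steps 1-3; never drops below zero."""
    parsed = urlparse(url)
    weight = INITIAL_WEIGHT
    # Step 1: reduce the weight according to directory depth.
    depth = len([s for s in parsed.path.split("/") if s])
    weight -= DEPTH_PENALTY * depth
    # Step 3: penalize "/", "?" and "&" in the path and query string.
    tail = parsed.path + ("?" + parsed.query if parsed.query else "")
    for ch in "/?&":
        weight -= CHAR_PENALTY * tail.count(ch)
    # Step 3: penalize likely search-result, proxy and gateway pages.
    for word in ("search", "proxy", "gate"):
        weight -= WORD_PENALTY * url.lower().count(word)
    return max(weight, 0)

def pick_next(unvisited: list[str], round_no: int) -> str:
    """Step 4: alternate between the highest-weight URL and a random one,
    so that low-weight URLs still get a chance to be collected."""
    if round_no % 2 == 0:
        return max(unvisited, key=url_weight)
    return random.choice(unvisited)

# Tiny usage example with a hypothetical frontier of unvisited URLs.
frontier = [
    "http://www.domain.com/",
    "http://www.domain.com/cs/papers/2004/long/path",
    "http://www.domain.com/search?q=seo&page=2",
]
for i in range(len(frontier)):
    url = pick_next(frontier, i)
    print(i, url, url_weight(url))
    frontier.remove(url)
```

In a real crawler the penalty constants and the rotation interval n would be tuned from data; here they are only placeholders to show how the four steps fit together.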
After the search engine has crawled a large number of pages, it enters a later stage in which the first three features can be evaluated for each page; a large set of algorithms is then applied to judge page quality and assign a relative ranking.
This article was originally provided by the webmaster of 51 Lotus Leaf Tea (http://www.51heyecha.com/).