How do search engines crawl the most important pages first?



Search engines face an enormous number of web pages, and they cannot crawl every page in parallel: no matter how much a search engine expands its database, it cannot keep pace with the growth of the Web. Search engines therefore crawl the most important pages first. On the one hand this saves database space; on the other hand, ordinary users do not need an enormous number of results, only the most important ones. A good collection strategy thus prioritizes important pages, so that the most important pages are crawled in the shortest possible time.

So how does a search engine crawl the most important pages first?

By analyzing a massive number of web pages, search engines have identified the basic features that important pages tend to share. These features are not always completely accurate, but they hold most of the time:

1. The page is linked to by other pages. If it is linked many times, or is linked by important pages, it is a very important page.

2. The page's parent page is linked many times or is linked by important pages. For example, if a page is an internal page of a site, its homepage is linked many times, and the homepage also links to this page, then this page is also relatively important.

3. The page's content is widely reproduced and disseminated.

4. The page has a small URL directory depth, making it easy for users to browse. "URL directory depth" is defined here as the number of directory levels in the page's URL after the domain-name portion is removed: http://www.domain.com has directory depth 0, http://www.domain.com/cs has directory depth 1, and so on. Note that pages with a small URL directory depth are not always important, and pages with a large directory depth are not all unimportant; some academic-paper pages have URLs with a very long directory depth. Most important pages exhibit all four of these features at the same time.

5. Website homepages are collected first and given a high weight value. The number of websites is far smaller than the number of pages, and important pages are almost always reachable by links from a homepage, so the collection work should give priority to obtaining as many homepages as possible.
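As a rough illustration of feature 5, a crawler's frontier can simply sort homepage URLs (URLs whose path is empty or "/") ahead of everything else. This is a hypothetical sketch, not a description of any real engine's implementation:

from urllib.parse import urlparse

def is_homepage(url: str) -> bool:
    """Treat a URL with no path beyond '/' as a site's homepage."""
    return urlparse(url).path in ("", "/")

def order_frontier(urls: list[str]) -> list[str]:
    # Homepages first (feature 5); other URLs keep their relative order.
    return sorted(urls, key=lambda u: not is_homepage(u))

print(order_frontier([
    "http://www.domain.com/cs/page.html",
    "http://www.domain.com/",
]))
# ['http://www.domain.com/', 'http://www.domain.com/cs/page.html']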

A problem arises here: when a search engine starts crawling, it may not yet know what links to a page or how widely the page has been reproduced. In other words, at the beginning it cannot know the first three features; these only become known after the page, or nearly the entire link structure of the Web, has been obtained. So how is this solved? Features 4 and 5 can be determined at crawl time, and only feature 4 can be judged without knowing the page content (that is, before the page is crawled), by checking whether a URL meets the "important" criterion. Computing a URL's directory depth is simply string processing, and statistics show that the average URL length is less than 256 characters, which makes directory-depth identification easy to implement. For determining the collection strategy, features 4 and 5 are therefore the most important guiding factors.
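Since directory-depth identification is just string processing, it can be sketched in a few lines. A minimal illustration follows; counting every non-empty path segment is one possible convention, and whether a trailing file name counts as a level is not fixed by the article:

from urllib.parse import urlparse

def url_directory_depth(url: str) -> int:
    """Number of path levels once the domain-name portion is removed."""
    path = urlparse(url).path
    # Split on '/' and drop the empty segments left by leading/trailing slashes.
    return len([segment for segment in path.split("/") if segment])

assert url_directory_depth("http://www.domain.com") == 0
assert url_directory_depth("http://www.domain.com/cs") == 1
assert url_directory_depth("http://www.domain.com/cs/papers/p1.html") == 3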

However, features 4 and 5 have limitations, because directory depth alone does not fully indicate how important a page is. How do search engines address this? They use the following approach (a code sketch of the whole scheme follows the list):

1. URL weight setting: the weight is decided by the URL's directory depth; the deeper the URL, the more the weight is reduced, down to a minimum of zero.

2. The initial URL weight is set to a fixed value.

3. For each occurrence of the character "/", "?", or "&" in the URL, the weight is reduced by one value; for each occurrence of "search", "proxy", or "gate", the weight is reduced by another value, down to a minimum of zero. (A URL containing "?" or "&" is a form URL with parameters: the page must be obtained by requesting a program service, and it is not one of the static pages that search engine systems focus on, so its weight is reduced accordingly. A URL containing "search", "proxy", or "gate" very likely points to a search-results page, a proxy page, or a gateway page, so its weight is also reduced.)

4. A selection policy for URLs that have not yet been visited. A small weight does not necessarily mean a page is unimportant, so unvisited URLs with small weight values must still be given some opportunity to be collected. The selection policy can alternate: one pass selects by weight value and the next selects at random, or a random selection is made once every n passes.
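A minimal sketch of this weighting scheme (rules 1-3), assuming an initial weight of 100 and a penalty of 10 per matching character or keyword; the article gives no concrete values, so both numbers are placeholders:

from urllib.parse import urlparse

INITIAL_WEIGHT = 100  # rule 2: assumed fixed starting value, not specified in the article
PENALTY = 10          # assumed deduction per occurrence, also not specified

def url_weight(url: str) -> int:
    """Score a URL before crawling it, using only the string itself."""
    parsed = urlparse(url)
    # Examine only the part after the domain, so the "//" in the scheme does
    # not penalize every URL (an implementation choice, not from the article).
    tail = parsed.path + ("?" + parsed.query if parsed.query else "")
    weight = INITIAL_WEIGHT
    # Rule 3a: each '/', '?' or '&' lowers the weight; deep paths and
    # parameterized form URLs are less likely to be important static pages.
    weight -= PENALTY * sum(tail.count(ch) for ch in "/?&")
    # Rule 3b: each "search", "proxy" or "gate" lowers the weight; these
    # usually mark result pages, proxy pages, or gateway pages.
    weight -= PENALTY * sum(url.lower().count(word) for word in ("search", "proxy", "gate"))
    return max(weight, 0)  # rule 1: the weight never drops below zero

print(url_weight("http://www.domain.com/"))                     # 90
print(url_weight("http://www.domain.com/search?q=seo&page=2"))  # 60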
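Rule 4 can then be sketched as an alternating frontier that reuses url_weight from the previous sketch: most passes pick the highest-weight unvisited URL, and every n-th pass picks a random one so that low-weight URLs are not starved. The period n = 3 is an arbitrary assumption; the article leaves it open:

import random

def pick_next(unvisited: list[str], pass_number: int, n: int = 3) -> str:
    """Alternate between weight-ordered and random selection (rule 4)."""
    if pass_number % n == 0:
        choice = random.choice(unvisited)        # give low-weight URLs a chance
    else:
        choice = max(unvisited, key=url_weight)  # otherwise the highest weight wins
    unvisited.remove(choice)
    return choice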

When the search engine has crawled a large number of pages, it enters the next stage: the first three features can now be evaluated for each page, and a large set of algorithms judges page quality and assigns a relative ranking.
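At that stage, feature 1, for example, becomes directly countable from the stored link graph. A toy illustration with a hypothetical three-page graph; real engines use far more elaborate link-analysis algorithms such as PageRank:

from collections import Counter

# Hypothetical crawl result: page URL -> list of URLs it links to.
link_graph = {
    "http://a.com/": ["http://b.com/", "http://c.com/x"],
    "http://b.com/": ["http://c.com/x"],
    "http://c.com/x": ["http://a.com/"],
}

# Feature 1: count incoming links for every page, then rank by that count.
in_links = Counter(target for targets in link_graph.values() for target in targets)
ranking = sorted(in_links, key=in_links.get, reverse=True)
print(ranking)  # http://c.com/x has two in-links, so it ranks first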

This article was originally provided by the webmaster of 51 Lotus Leaf Tea (http://www.51heyecha.com/).
