How the index-page link complement mechanism works


I. Background

The spider sits at the most upstream point of the search engine's data stream. It is responsible for collecting resources from the Internet into the local system to support subsequent search, which makes it the search engine's most important source of data. The goal of the spider system is to discover and crawl every valuable page on the Internet, and the first step toward that goal is to discover the links to those valuable pages. The spider currently has a variety of link-discovery mechanisms for finding resource links as quickly as possible. This article describes one of them, the complement mechanism for a particular type of index page, and proposes a processing specification for this type of page to improve how completely a site's resources are collected.

Most Internet sites currently organize their resources as an index page plus a series of paginated pages: when new resources are added, older resources are pushed back through the paging series.

As shown in the following figures:

Figure 1: http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml

Figure 2 shows the fourth page of the same series 18 hours later. More than three pages' worth of resources were added over that period, and the resources circled in the red box in Figure 1 have been pushed back, in order, to the red box on page 4.

Figure 2: the fourth page, 18 hours later
http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml

For the spider, this type of index page is an effective channel for discovering resource links. The spider checks these pages periodically to pick up new resource links, but its checking cycle inevitably differs from the cycle at which resource links are published (the spider tries to detect a page's publishing cycle and check it at a reasonable frequency). When the two cycles differ, resource links may be pushed past the first page before the spider sees them, so the spider needs to complement this special type of paging series, crawling beyond the first page, to make sure resources are collected completely.

II. Main Ideas

This article focuses on index pages whose resources are arranged in publishing order, that is, newly published resources appear on page 1 (or on the last page) and older resources are pushed back (or forward) in sequence, and on the complement mechanism for such pages. The main idea is to treat the whole paging series as a single unit and to judge its crawl state comprehensively: record the resource links discovered on each crawl of a page, then compare the links discovered this time with the links discovered historically. If the two sets intersect, the crawl has already found all newly added resources; if they do not intersect, the crawl has not yet found all the new resources, and the spider needs to continue to the next page, and possibly the page after that, until all new resources have been discovered.
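
As a minimal sketch of this comparison (the function name and inputs are hypothetical, since Baidu's internal interfaces are not published), the test reduces to a set intersection:

    def found_all_new_resources(links_on_page: set, links_recorded_last_round: set) -> bool:
        """True when the links discovered on the current page overlap the links
        recorded during the previous scheduling round, meaning nothing published
        since then has been pushed past this page."""
        return bool(links_on_page & links_recorded_last_round)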

2.1 Determining whether resource links are sorted by time

Being organized by publishing time is a prerequisite for this kind of page, so how do we determine whether the resources on a page are published in time order? As shown in Figure 1 above, on some pages each resource link is followed by its corresponding publishing time. By collecting the times that correspond to the resource links and checking whether that sequence of times runs from largest to smallest or from smallest to largest, we can decide: if it does, the resources on the page are organized by publishing time; otherwise they are not. In Figure 1 the times decrease from top to bottom, so the resources are ordered by publishing time.
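
A small sketch of that check, assuming the publish-time strings next to each link have already been extracted in page order (the time format below is an assumption made for illustration):

    from datetime import datetime

    def is_ordered_by_publish_time(time_strings, fmt="%Y-%m-%d %H:%M"):
        """Return True if the extracted publish times are monotonically
        non-increasing or non-decreasing, i.e. the page lists resources
        in publishing-time order."""
        times = [datetime.strptime(s, fmt) for s in time_strings]
        descending = all(a >= b for a, b in zip(times, times[1:]))
        ascending = all(a <= b for a, b in zip(times, times[1:]))
        return descending or ascending

    # Example: the times shrink from top to bottom, so this returns True.
    print(is_ordered_by_publish_time(["2013-08-12 14:05",
                                      "2013-08-12 13:58",
                                      "2013-08-12 13:41"]))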

There is also a class of pages, shown in Figure 3 below, whose content can be sorted in several ways, for example by sales volume, by price, by number of comments, or by time. By identifying and extracting the currently active sort and then checking whether it is a time-based sort, we can again decide whether the resources on the page are organized chronologically. The sort in Figure 3 is by shelf time, which is a time sort, so the resources on this page are ordered by publishing time.
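
A heavily simplified, purely illustrative sketch of that decision, assuming the label of the currently active sort option has already been extracted (the keyword list is an assumption, not Baidu's actual classifier):

    TIME_SORT_KEYWORDS = ("time", "newest", "latest", "date")

    def is_time_sort(active_sort_label: str) -> bool:
        """Guess whether the currently selected sort option is time-based
        by matching its label against time-related keywords."""
        label = active_sort_label.lower()
        return any(keyword in label for keyword in TIME_SORT_KEYWORDS)

    print(is_time_sort("Shelf time"))    # True: a time sort, as in Figure 3
    print(is_time_sort("Sales volume"))  # False: not a time sort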

In addition, the publishing times obtained after the linked resources themselves have been crawled back can be used to make a comprehensive judgment.

Figure 3: an index page with multiple sorting options

2.2 The complement mechanism

For an index page series ordered by publishing time, how do we ensure that newly published resource links are collected completely? As described above, after 18 hours the resource links in Figure 1 have been pushed back, in order, to page 4. In other words, new resource links have been added across pages 2, 3, and 4 of the index during this period, and the spider needs to collect all of these additional resources.

First, when the spider crawls page 1 after the 18 hours, it compares the set of newly discovered resource links with the set of resource links recorded when page 1 was scheduled 18 hours earlier, and finds that the two sets do not intersect, so links may have been missed. It therefore goes on to schedule page 2. The links discovered on page 2 still do not intersect the recorded set, so links may still have been missed, and the spider continues with pages 3 and 4. Finally, as shown in Figure 2, the links in the red box intersect the resource links recorded in earlier index-page scheduling, so it can be concluded that all resources added during this period have been collected. Scheduling of the paging series then ends, every link in the series has been collected, and the collection effect for search products is improved.
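
The walkthrough above can be condensed into a minimal sketch of the complement loop; fetch_page and the stored history set are hypothetical stand-ins, since Baidu's internal interfaces are not published:

    def complement_crawl(fetch_page, history: set, max_pages: int = 10) -> set:
        """Crawl successive pages of the index series until the links found on
        a page intersect the links recorded in the previous scheduling round,
        meaning no newly published link has been pushed past that page."""
        new_links = set()
        for page_no in range(1, max_pages + 1):
            links = set(fetch_page(page_no))   # resource links found on this page
            new_links |= links - history       # keep only links not seen before
            if links & history:
                # Overlap with the previous round's records: every link published
                # since then has now been seen, so stop scheduling further pages.
                break
        return new_links

In the scenario of Figures 1 and 2, this loop would run through pages 1 to 3 without finding an overlap and would stop at page 4, where the links in the red box intersect the previous round's records.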

2.3 Identifying the page-turn bar and the link block that corresponds to it

To achieve the effect above, in addition to identifying whether the paging series is sorted by time, the spider also needs to identify the page-turn bar and its corresponding link block.

Without identification of the page-turn bar, the spider system cannot treat all the links in the paging series as a whole and consider their state together; the results of scheduling and crawling then become effectively random, and complete collection cannot be guaranteed. The current system therefore uses machine learning over features of the page to identify the page-turn block and the paging depth, as well as the links to the previous and next pages, providing the basic data for the complement mechanism.

On the other hand, even when the page-turn bar is identified, the complement mechanism still cannot work without identification of the corresponding link block, because the mechanism described above has to compare sets of discovered links to decide its termination condition. The link block corresponding to the page-turn bar therefore also needs to be identified, so that the link sets from each page can be compared.

In special cases, a page may contain multiple page-turn bars, and each page-turn bar then needs to be matched with its corresponding link block.
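
The text above refers to a machine-learning classifier over page features; as a purely illustrative stand-in, a crude heuristic for spotting a page-turn-bar candidate among extracted blocks might look like this (the inputs and the threshold are assumptions):

    import re

    PAGING_WORDS = re.compile(r"next\s*page|previous\s*page|last\s*page", re.I)

    def looks_like_paging_bar(anchor_texts) -> bool:
        """Score a candidate block (the anchor texts extracted from one div or
        ul) by how many of its anchors look like page numbers or paging labels."""
        if not anchor_texts:
            return False
        hits = sum(1 for text in anchor_texts
                   if text.strip().isdigit() or PAGING_WORDS.search(text))
        return hits / len(anchor_texts) > 0.6  # mostly page-number style anchors

    print(looks_like_paging_bar(["1", "2", "3", "4", "Next page"]))       # True
    print(looks_like_paging_bar(["Politics", "Sports", "Tech", "About"])) # False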

III. Recommended Methods and Standards

The current Baidu spider system already makes judgments about the type of page, the position of the page-turn bar within it, the index list that corresponds to the bar, and whether that list is sorted by time, and handles each case accordingly. However, automatic machine judgment cannot reach 100% recognition accuracy. If webmasters add some Baidu-recommended tags to their pages to mark the corresponding functional areas, the accuracy of our recognition can be improved greatly, which improves how promptly the spider system discovers a site's resources and therefore improves the site's collection.

For link complement, the spider currently cares most about the page-turn bar and the block containing the index link list that corresponds to it, so the class attribute of the block element (such as a div or ul) can be used to mark the corresponding feature for the Baidu spider to recognize. We recommend marking with the following attributes:

Table 1: supported class attribute extensions

    class value                        meaning
    baidu_paging_indicator             marks the block element that contains the page-turn bar
    baidu_paging_content_indicator     marks the block element that contains the index link list controlled by the page-turn bar
    orderby_posttime                   indicates that the links in that list are ordered by publishing time

For example, the Baidu News page can be set up like this:

The block element that contains the page-turn bar can be given the class attribute baidu_paging_indicator, and the block element (such as a div) that contains the main link list controlled by the page-turn bar can be given the class attributes baidu_paging_content_indicator and orderby_posttime. This pairs the page-turn bar with its corresponding link block and tells Baidu that the list is ordered by publishing time, which lets the spider system optimize its crawling behavior and improves the site's collection.
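
A small sketch of how a crawler might detect these recommended class values in a page, using only the Python standard library (the sample HTML is illustrative and not taken from any real site):

    from html.parser import HTMLParser

    MARKERS = {"baidu_paging_indicator",
               "baidu_paging_content_indicator",
               "orderby_posttime"}

    class MarkerFinder(HTMLParser):
        """Record which recommended marker classes appear on which tags."""
        def __init__(self):
            super().__init__()
            self.found = {}

        def handle_starttag(self, tag, attrs):
            classes = set((dict(attrs).get("class") or "").split())
            for marker in classes & MARKERS:
                self.found.setdefault(marker, tag)

    sample = """
    <div class="pager baidu_paging_indicator">1 2 3 Next page</div>
    <ul class="news-list baidu_paging_content_indicator orderby_posttime">
      <li><a href="/news/1.html">A headline</a> 2013-08-12 14:05</li>
    </ul>
    """

    finder = MarkerFinder()
    finder.feed(sample)
    print(finder.found)  # all three markers are found: the div and the ul are paired up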

IV. Summary

In addition to the link-discovery method described above, Baidu's crawling system has many other means of ensuring coverage of valuable sites; the method above is only a specific measure for a specific type of index page, and webmasters can use it for reference. Webmasters can also learn from the Baidu webmaster platform how to get their sites collected faster and better, for example by pushing links directly through the Sitemap protocol. The webmaster platform address is http://zhanzhang.baidu.com/; it has just been revised and offers new features.
