How search engines crawl paginated (page-flipping) web pages

Source: Internet
Author: User


The goal of a spider system is to find and crawl every valuable web page on the Internet. Baidu has officially stated that its spider aims to crawl as many valuable resources as possible, and to keep the pages in its system consistent with the live pages, without putting pressure on the website's user experience. This means the spider will not crawl every page of a site. Instead, the spider uses many crawling strategies to discover resource links as quickly and completely as possible and so improve crawl efficiency. Only in this way can the spider serve most sites as well as it can, which is also why webmasters need to build a good link structure. In this article, Wood-Wood SEO looks at one point only: the spider's crawl mechanism for paginated (page-flipping) list pages. (This article does not cover the spider's other crawl mechanisms; it analyzes this single point.)

Why is this crawl mechanism needed?

Most websites today use pagination to lay out their resources in order: when new articles are added, older ones are pushed back onto later pages of the list. For a spider, this kind of index page is an effective crawl channel. However, the spider's crawl frequency and the site's update frequency are not the same, so an article link may already have been pushed off the first page by the time the spider returns. The spider cannot crawl from page 1 to page 80 every day, fetch every article, and compare each one against its database. That would waste the spider's time as well as the time available for indexing your site. The spider therefore needs an additional crawl mechanism for this special type of paginated page, to ensure that resources are collected completely.

How can I tell if it's an ordered page?

Whether the articles are listed in order of release time is a necessary condition for this type of page, as explained below. So how can the spider tell whether resources are ordered by release time? On some pages, each article link is followed by its release time. The spider collects the times that correspond to the links and checks whether that set is sorted from newest to oldest or from oldest to newest. If it is, the resources on the page are ordered by release time; if not, they are not. Even when no release time is shown on the list page, the spider can judge by the actual release time of each article itself.
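The ordering check described above can be sketched in a few lines of Python. This is only an illustration, not Baidu's actual implementation: the function name `is_time_ordered` and the sample dates are assumptions made for the example, and the release times are assumed to have already been extracted and parsed from the list page.

```python
from datetime import datetime

def is_time_ordered(release_times):
    """Return True if the times are sorted entirely newest-to-oldest
    or entirely oldest-to-newest (ties are allowed)."""
    pairs = list(zip(release_times, release_times[1:]))
    ascending = all(earlier <= later for earlier, later in pairs)
    descending = all(earlier >= later for earlier, later in pairs)
    return ascending or descending

# Hypothetical release times, in the order their links appear on a list page.
newest_first = [datetime(2013, 5, 3), datetime(2013, 5, 2), datetime(2013, 5, 1)]
shuffled = [datetime(2013, 5, 2), datetime(2013, 5, 3), datetime(2013, 5, 1)]

ordered_page = is_time_ordered(newest_first)   # True: treat as an ordered page
unordered_page = is_time_ordered(shuffled)     # False: not ordered by time
```

A real spider would also need to cope with missing or unparseable dates, which this sketch leaves out.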

How does the crawl mechanism work?

For paginated pages of this kind, the spider records the article links it finds on each crawl, then compares the links found this time with the links found on earlier crawls. If the two sets intersect, this crawl has found all the new articles, and the spider can stop crawling further pages. Otherwise, the crawl has not yet found all the new articles, and the spider needs to go on to the next page, or even the next several pages, until it has found them all.

This may sound a little hard to follow, so Wood-Wood SEO offers a very simple example. Suppose 29 new articles have been added to the site's pages directory, so the newest article from the last crawl is now in 30th position, and the spider fetches 10 article links at a time. The first fetch yields 10 links that do not intersect with the last crawl, so the spider continues. The second fetch yields another 10, making 20 in total, still with no intersection, so it continues again. The third fetch reaches the 30th link, which does intersect with the last crawl. That means the spider has now crawled all 29 articles published on the site since its last crawl.
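The stop-on-intersection idea, and the 29-article example, can be sketched as follows. This is a minimal illustration under stated assumptions rather than how any real spider is written: `find_new_links`, `fetch_page`, the `max_pages` limit, and the `/pages/article-N.html` URLs are all hypothetical names invented for the example, and `fetch_page` stands in for actually downloading and parsing a list page.

```python
def find_new_links(fetch_page, known_links, max_pages=100):
    """Walk list pages from page 1 and collect links not seen before.

    Stops at the first page that contains any previously seen link,
    since everything after it was already found on an earlier crawl.
    Returns (new_links, pages_fetched).
    """
    new_links = []
    pages_fetched = 0
    for page_no in range(1, max_pages + 1):
        links = fetch_page(page_no)
        if not links:          # ran past the last list page
            break
        pages_fetched += 1
        overlap = False
        for link in links:
            if link in known_links:
                overlap = True  # intersection with the previous crawl
            else:
                new_links.append(link)
        if overlap:
            break
    return new_links, pages_fetched

# Simulated site: 50 articles, newest first, 10 links per list page.
all_links = [f"/pages/article-{i}.html" for i in range(1, 51)]
# The previous crawl saw articles 30..50; articles 1..29 are new.
known = set(all_links[29:])

def fetch_page(page_no, per_page=10):
    start = (page_no - 1) * per_page
    return all_links[start:start + per_page]

new_links, pages_fetched = find_new_links(fetch_page, known)
# new_links holds the 29 new articles; pages_fetched is 3.
```

The sketch relies on the list being ordered by release time. On an unordered list, an old link could appear on page 1 and end the crawl before all new articles are found, which is why the ordering check matters.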

Recommendations

Baidu's spider currently makes judgments about the type of a web page, the position of the pagination bar within the page, the links the pagination bar points to, and whether the list is sorted by time, and it handles each page according to the actual situation. Even so, a spider cannot achieve 100% recognition accuracy. Webmasters can help by not using JS or Flash for the pagination bar, and by updating articles at a regular frequency that matches the spider's crawling. This can greatly improve the accuracy of the spider's recognition, and with it the spider's crawl efficiency on your site.

As a reminder, this article explains only one of the spider's crawl mechanisms. It does not mean the spider relies on this mechanism alone; in practice, many mechanisms operate at the same time. Author: Wood-Wood SEO, http://blog.sina.com.cn/mumuhouzi
