How to avoid paginated pages being repeatedly crawled

Source: Internet
Author: User
Keywords: avoid, crawl, page, analyze, website


Analyzing the site's logs shows that spiders crawl the site's paginated pages over and over, which is bad for site optimization. So how do we avoid paginated pages being crawled repeatedly by spiders?
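The log analysis described above can be sketched as follows. This is a minimal illustration, not the author's tooling: it assumes the server writes the common "combined" access-log format, that spiders can be recognized by "spider" or "bot" in the user-agent string, and that paginated URLs end in `/page/<number>/`. The sample lines and the function name `count_spider_pagination_hits` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample lines in the "combined" access-log format.
SAMPLE_LOG = [
    '1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /page/2/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0)"',
    '5.6.7.8 - - [10/Oct/2023:14:01:10 +0000] "GET /page/2/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '1.2.3.4 - - [10/Oct/2023:14:05:00 +0000] "GET /category/news/page/3/ HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0)"',
    '9.9.9.9 - - [10/Oct/2023:14:06:00 +0000] "GET /page/2/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '1.2.3.4 - - [10/Oct/2023:14:07:00 +0000] "GET /about/ HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0)"',
]

# Request path and user agent from one combined-format line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Assumption: paginated URLs end in /page/<number>/ as in WordPress.
PAGINATION_RE = re.compile(r"/page/\d+/?$")
# Assumption: a crude user-agent test is enough for a first look.
SPIDER_HINTS = ("spider", "bot")


def count_spider_pagination_hits(lines):
    """Count how often spiders requested each paginated URL."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        if not any(hint in agent for hint in SPIDER_HINTS):
            continue  # ordinary visitor, not a spider
        path = match.group("path")
        if PAGINATION_RE.search(path):
            hits[path] += 1
    return hits


if __name__ == "__main__":
    for path, count in count_spider_pagination_hits(SAMPLE_LOG).most_common():
        print(count, path)
```

Paths with high counts are the candidates to block or consolidate.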

First, block these pages with the robots file. The syntax is as follows:

Disallow: /page/ # restrict crawling of WordPress pagination

If your site needs them, you can also add the following statements to avoid too many duplicate pages:

* Disallow: /category/*/page/* # restrict crawling of category pagination
* Disallow: /tag/ # restrict crawling of tag pages
* Disallow: */trackback/ # restrict crawling of Trackback content
* Disallow: /category/* # restrict crawling of all category lists

What is a spider? A spider, also known as a crawler, is simply a program. Its job is to follow your website's URLs layer by layer, read some information, do some simple processing, and feed the results back to a back-end server for centralized processing. We must understand a spider's preferences in order to optimize the site well. Next, let's look at how a spider works.
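Before deploying robots rules like those in the first section, it can help to check which URLs they actually block. The sketch below uses Python's standard `urllib.robotparser` for that. Some assumptions: the `User-agent: *` line is added here because the rules as listed do not show one, and this parser ignores `Disallow` lines that precede any `User-agent` line. Only the plain prefix rules are tested, since the standard-library parser matches path prefixes rather than `*` wildcards. The host `example.com`, the agent name `ExampleSpider`, and the helper `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Prefix rules taken from the article, plus an assumed User-agent line.
RULES = """\
User-agent: *
Disallow: /page/   # restrict crawling of WordPress pagination
Disallow: /tag/    # restrict crawling of tag pages
"""


def is_allowed(path, rules=RULES, agent="ExampleSpider"):
    """Return True if the rules allow the given agent to fetch the path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for path in ("/page/2/", "/tag/seo/", "/about/"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Note that a prefix rule such as `Disallow: /page/` does not cover `/category/news/page/2/`, which is why the article lists a separate rule for category pagination.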

Second, how spiders handle dynamic pages

Handling dynamic web pages is a hard problem for spiders. A dynamic web page is one that is generated automatically by a program. As the Internet has developed, more and more scripting languages have appeared, and so naturally there are more and more kinds of dynamic pages, such as those built with JSP, ASP, and PHP. Spiders can have difficulty handling pages generated by these scripting languages. This is why optimizers always stress using as little JS code as possible: to handle these languages well, a spider needs its own script-processing program. When optimizing a site, remove unnecessary script code so that spiders can crawl it easily, which leads to fewer paginated pages being crawled repeatedly.

Third, the spider's update cycle

Website content changes often, either through content updates or through template changes. Spiders likewise keep refreshing and re-crawling page content. A spider's developers set an update cycle for the crawler so that it rescans the site at the specified interval and compares pages to see which ones need updating, for example: whether the home page title has changed, which pages are new, and which pages are expired dead links. A powerful search engine keeps tuning its update cycle, because the cycle has a great impact on the engine's recall. If the update cycle is too long, the search engine's accuracy and completeness drop, and some newly generated pages cannot be found in search. If the update cycle is too short, it is technically harder to implement and wastes bandwidth and server resources.
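The comparison step in an update cycle (changed pages, new pages, dead links) can be illustrated with a small sketch. This is not how any particular search engine works. It assumes the crawler stores a content hash per URL from the previous scan and that the current scan yields either the page text or `None` for a URL that no longer resolves. The names `fingerprint` and `compare_snapshots` are hypothetical.

```python
import hashlib


def fingerprint(html):
    """Hash the page text so two scans can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def compare_snapshots(previous, current):
    """Classify URLs from the current scan.

    previous: dict mapping URL -> fingerprint from the last scan
    current:  dict mapping URL -> page text, or None for a dead link
    """
    report = {"new": [], "changed": [], "dead": []}
    for url, html in current.items():
        if html is None:
            if url in previous:
                report["dead"].append(url)  # existed before, gone now
        elif url not in previous:
            report["new"].append(url)  # not seen in the last scan
        elif previous[url] != fingerprint(html):
            report["changed"].append(url)  # content differs from last scan
    for urls in report.values():
        urls.sort()
    return report


if __name__ == "__main__":
    last_scan = {
        "/": fingerprint("<title>Old title</title>"),
        "/a": fingerprint("page A"),
        "/gone": fingerprint("old page"),
    }
    this_scan = {
        "/": "<title>New title</title>",
        "/a": "page A",
        "/b": "page B",
        "/gone": None,
    }
    print(compare_snapshots(last_scan, this_scan))
```

Unchanged pages such as `/a` appear in none of the lists, so only the pages that need work are reprocessed.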

Fourth, the spider's no-repeat crawl strategy

The number of web pages is enormous, so crawling is a very large undertaking that consumes a great deal of bandwidth, hardware resources, and time. Crawling the same page repeatedly not only greatly reduces the system's efficiency but also hurts accuracy. Common search engine systems therefore design a no-repeat crawl strategy, which ensures that the same page is crawled only once within a given period.
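A no-repeat strategy of this kind can be sketched with a record of when each URL was last crawled. This is a simplified assumption about the idea, not a description of any real search engine: it keeps the history in memory, compares exact URL strings, and takes the current time as a parameter so the behavior is easy to follow. The class name `CrawlHistory` and the one-hour interval are hypothetical.

```python
class CrawlHistory:
    """Allow each URL to be crawled at most once per interval."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_crawled = {}  # URL -> time of the last allowed crawl

    def should_crawl(self, url, now):
        """Return True and record the crawl if the URL is due, else False."""
        last = self._last_crawled.get(url)
        if last is not None and now - last < self.min_interval:
            return False  # crawled too recently, skip it
        self._last_crawled[url] = now
        return True


if __name__ == "__main__":
    history = CrawlHistory(min_interval_seconds=3600)
    for moment in (0, 100, 3600):
        print(moment, history.should_crawl("/page/2/", now=moment))
```

A real crawler would also normalize URLs before the lookup, so that variants of the same address are not treated as different pages.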

That concludes this introduction to how to avoid paginated pages being crawled repeatedly. This article was compiled by the editors of Global Trade network.
