Be careful when blocking link crawling with robots.txt

Source: Internet
Author: User
Keywords: blocking


We know that many webmasters look for ways to keep spiders from crawling certain pages on their own sites, and the robots.txt file is one of the usual tools. While this is a good practice, it also creates a problem: confusion about what using robots.txt to stop Google/Yahoo!/MSN or other search engine spiders from crawling actually does. A brief explanation follows:

Blocking with robots.txt: the spider is not allowed to fetch certain URLs, but those URLs can still be indexed and appear on search engine results pages.

Blocking with a meta noindex tag: the page can be crawled and accessed, but it will not be listed in search results (the markup is shown just below these three points).

Blocking by removing the links to the page: this is not a very sensible move, because the page can still be reached, crawled, and indexed through other links pointing to it. (If you don't mind spiders wasting their time on the page, you can do it this way, but don't expect it to keep the page out of the search engine's results pages.)
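For the second method, the standard markup, placed in the <head> of the page you want kept out of results, is:

  <meta name="robots" content="noindex">

Spiders can still fetch such a page; they simply will not list it.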

Here's a simple example of a robots.txt rule that blocks spiders from crawling URLs which nevertheless still appear in Google's search results.

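The original screenshot is not reproduced here; based on the about.com directory discussed below, the blocking rule would look roughly like this:

  User-agent: *
  Disallow: /library/nosearch/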

(A robots.txt file works for subdomains as well.)

We can see that the about.com/library/nosearch/ directory has been blocked. The following figure shows what Google returns when we search for URLs under that directory:

[Figure: Google search results for URLs under the blocked /library/nosearch/ directory]

Note that Google still returns about 2,760 results for the supposedly blocked directory. It has not crawled these pages, so each result is just a bare URL with no title and no description, because Google cannot see the pages' content.

Let's take this further. Suppose you have a large number of pages that you don't want search engines to crawl. Those URLs still accumulate link equity, traffic, and other independent ranking signals, but because the spider cannot crawl through them, the links pointing out of those pages are never seen and that equity cannot flow on. Look at the following figure:

[Figure: link equity flows into the blocked pages but cannot flow out through their links]

Here are two practical ways to deal with this:

1. Conserve this link equity by adding nofollow to links that point into a directory blocked in robots.txt (see the sketch after this list).

2. If you know which of these blocked pages receive steady link equity (especially from external links), consider using a meta noindex,follow tag instead, so the spider passes that link equity on and spends its time retrieving the pages on your site that you actually want indexed.
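A minimal sketch of both remedies (the link URL and path below are placeholders):

  <!-- remedy 1: nofollow on a link pointing into a robots.txt-blocked directory -->
  <a href="/nosearch/page.html" rel="nofollow">blocked page</a>

  <!-- remedy 2: on the blocked pages themselves, allow crawling and use this tag instead -->
  <meta name="robots" content="noindex,follow">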

This article is from Reamo's personal SEO and network promotion blog: http://www.aisxin.cn. Please credit the source when reprinting.
