Can robots.txt completely block search spiders from crawling?


When it comes to blocking search spiders, robots.txt naturally comes to mind. What is robots.txt? Huizhou SEO's Ye Jianhui has covered the basics before: robots.txt is a text file stored in the root directory of a website, a protocol that tells search spiders which parts of the site may be crawled and which may not. Here, however, Ye Jianhui has a question: can robots.txt completely block spiders from crawling?
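
For reference, a minimal robots.txt looks like this (a sketch only; the /private/ directory is hypothetical, and User-agent: * means the rules apply to all spiders):

User-agent: *
Disallow: /private/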

Robots.txt can block spider crawling

Recently, a friend, Xiangyang SEO's Tea Sister, ran into the following problem: "I clearly used robots.txt to block the directory files in WordPress, so why are the spiders still crawling them as usual?" Ye Jianhui naturally checked the site's robots.txt file; the WordPress directory blocking rules were set as follows:

Disallow: /wp-admin
Disallow: /wp-content
Disallow: /wp-includes

On inspection the problem became clear: these directory rules were missing the trailing /. When Ye Jianhui pointed this out, the friend replied: "They are directory files; adding the / or not should block them just the same." Ye Jianhui took a different view: to a spider, a path with and without the trailing / are two different concepts. With the slash you are telling the spider it is a folder; without it you are telling the spider it is a file. That, in his view, is why the rules were set in robots.txt yet failed to block effectively. Of course, this is only Ye Jianhui's personal opinion.
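
Whatever the interpretation, the two forms are not equivalent in practice. A minimal sketch using Python's standard urllib.robotparser (the example.com URLs are hypothetical, and real search engines may match somewhat differently):

from urllib import robotparser

def allowed(rules, url):
    # Parse an in-memory robots.txt and test whether a URL may be fetched.
    rp = robotparser.RobotFileParser()
    rp.parse(rules)
    return rp.can_fetch("*", url)

no_slash = ["User-agent: *", "Disallow: /wp-admin"]
with_slash = ["User-agent: *", "Disallow: /wp-admin/"]

for url in ("http://example.com/wp-admin",
            "http://example.com/wp-admin/post.php"):
    print(url,
          "| no slash blocked:", not allowed(no_slash, url),
          "| with slash blocked:", not allowed(with_slash, url))

Under this parser, the rule without the slash matches any path beginning with that prefix, including the bare /wp-admin URL, while the rule with the slash matches only paths inside the directory; either way, a single / changes exactly what is blocked.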

The friend then took the suggestion and revised robots.txt to:

Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/

After the modification, the file was submitted on the Baidu Webmaster Platform with a rebuild request (note: if you do not submit it actively and instead leave the spider to pick it up on its own, it takes much longer), and it took effect the same day. A check of the logs the next day showed that the spider really was no longer crawling those three WordPress directories.
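
Such a log check is easy to script. A minimal sketch, assuming a hypothetical access.log in a common format where each request line contains the user-agent string:

# Count spider requests to the blocked WordPress directories.
BLOCKED = ("/wp-admin/", "/wp-content/", "/wp-includes/")

with open("access.log", encoding="utf-8", errors="replace") as f:
    hits = sum(1 for line in f
               if "Baiduspider" in line
               and any(d in line for d in BLOCKED))

print(hits, "spider requests to blocked directories")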

Seen this way, no detail can be ignored in website SEO work; a single / can produce a completely different result.

Robots.txt cannot completely block spider crawling

But with that problem solved, the friend ran into another one with the same blocked directories: "The directory is clearly blocked, so why can spiders still crawl and index a file inside it?"

So here Ye Jianhui needs to make a point: the robots.txt protocol is not a standard or a specification, only a convention. Search engines will usually honor the file, but there are special cases (such as the earlier 360 incident, which is not the subject of this discussion).

Whether on Baidu or Google, as long as other sites link to a page, that page may still be indexed and included. To completely prevent a page from being indexed by Google (even when other sites link to it), you need to insert a noindex meta tag in the page head, or send the equivalent X-Robots-Tag HTTP header. The tag is as follows:

<meta name="googlebot" content="noindex">

When Googlebot sees the noindex meta tag on a page, it drops the page completely from Google's search results, regardless of whether other pages link to it.
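
The same directive can also be sent outside the HTML as an HTTP response header, which is the route for non-HTML files such as PDFs; a sketch of what the relevant part of the response would look like:

HTTP/1.1 200 OK
X-Robots-Tag: noindex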

And Baidu? Baidu does not, like Google, support removing a page from the index entirely via noindex; it only supports the noarchive meta tag, which forbids Baidu from showing a snapshot of the page. The exact statement is as follows:

<meta name="Baiduspider" content="noarchive">

The tag above only forbids Baidu from displaying a snapshot of the page; Baidu will still index the page and show a summary of it in the search results.

As everyone knows, Taobao blocks Baiduspider through robots.txt, yet when we search for Taobao on Baidu, the first result is still the Taobao homepage address, and its snapshot shows up blank. It seems, then, that a site can only stop Baidu from displaying its snapshot; it cannot stop Baidu from indexing it.


Conclusion:

Back to the question Ye Jianhui raised in the first paragraph: can robots.txt completely block spiders from crawling? Seeing such a question, some friends will no doubt answer yes. To that one can only say: we lack the heart for discovering problems, and when spiders are not effectively blocked, it only shows that our optimization work is not done finely enough.

From these two small robots.txt problems, Ye Jianhui believes that what it takes to step up to SEO expert may be precisely a heart that is good at discovering problems, together with meticulous execution.

Original source: Huizhou SEO blog, http://www.huizhouseo.cn/seo/823.html, an original post by Ye Jianhui.
