A brief discussion of methods for blocking search engine crawlers (spiders) from crawling and indexing web pages


Once a website is built, we naturally want its pages indexed by search engines, the more the better. Sometimes, however, we run into situations where a site should not be indexed at all.

For example, suppose you register a new domain for a mirror site used mainly for PPC promotion. In that case you need some way to stop search engine spiders from crawling and indexing any page of the mirror site, because if the mirror site were indexed too, it could well hurt the official site's weight in the search engines, which is certainly not a result we want to see.

Below are a few approaches to blocking the major search engine crawlers (spiders) from crawling and indexing web pages. Note: the goal is to block the whole site, and to shield it from all mainstream search engine crawlers (spiders) as far as possible.

1. Block via the robots.txt file

It is fair to say that the robots.txt file is the most important channel, since it establishes a direct dialogue with the search engines. Based on an analysis of my own blog's server log files, I suggest the following rules (additions from readers are welcome):

User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Mediapartners-Google
Disallow: /

User-agent: Adsbot-Google
Disallow: /

User-agent: Feedfetcher-Google
Disallow: /

User-agent: Yahoo! Slurp
Disallow: /

User-agent: Yahoo! Slurp China
Disallow: /

User-agent: Yahoo!-AdCrawler
Disallow: /

User-agent: YoudaoBot
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: Sogou spider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: msnbot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: Tomato Bot
Disallow: /

User-agent: *
Disallow: /

2. Block via the META tag

Add the following tag to the header of every page:

<meta name="robots" content="noindex, nofollow">

3. Block via the server (e.g., Linux/Nginx) configuration file

Filter the spiders'/robots' IP ranges directly.

Note: methods 1 and 2 only work on the "gentlemen"; to guard against the "villains," use method 3 ("gentlemen" and "villains" refer to spiders/robots that do and do not honor the robots.txt protocol, respectively). So after the site goes live, keep tracking and analyzing the server logs, pick out these bad-bot IPs, and then block them.

Here is a bad-bot IP database: http://www.spam-whackers.com/bad.bots.htm
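As a sketch of what this server-side filtering could look like in an Nginx configuration (the User-Agent patterns and the IP range below are illustrative placeholders, not an authoritative bad-bot list; build your own from log analysis):

```nginx
server {
    listen 80;
    server_name mirror.example.com;  # hypothetical mirror-site domain

    # Return 403 when the User-Agent matches a bad-bot pattern
    # (case-insensitive match; patterns here are examples only)
    if ($http_user_agent ~* (HTTrack|WebCopier|EmailCollector)) {
        return 403;
    }

    # Block an IP range identified as a bad bot in the access logs
    # (example range only)
    deny 203.0.113.0/24;
    allow all;
}
```

The advantage over robots.txt is that this is enforced by the server itself, so it works even against crawlers that ignore the robots protocol.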

4. Delete page snapshots via the webmaster tools that search engines provide

For example, Baidu sometimes does not strictly honor the robots.txt protocol; in that case you can use the "webpage complaint" portal Baidu provides to have page snapshots deleted. Baidu webpage complaint center: http://tousu.baidu.com/webmaster/add

In one of my own complaints, the Baidu snapshot of the page in question was deleted after about three days, which shows that this method can also be effective. Of course, it is not a preventive measure but a remedy after the fact.

5. Supplementary notes

You can also detect whether HTTP_USER_AGENT indicates a crawler/spider and, if so, return a 403 status code directly to block it. For example, because of API permissions and microblog privacy protection, the Xweibo 2.0 release forbids search engine indexing.
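A minimal sketch of this idea in Python (the bot keyword list and the `status_for` function are illustrative assumptions, not Xweibo's actual implementation):

```python
import re

# Illustrative list of crawler User-Agent keywords; extend it from
# your own server-log analysis.
BOT_PATTERN = re.compile(
    r"baiduspider|googlebot|slurp|sogou|msnbot|youdaobot|sosospider",
    re.IGNORECASE,
)

def status_for(user_agent: str) -> int:
    """Return 403 for a known crawler User-Agent, 200 otherwise."""
    if user_agent and BOT_PATTERN.search(user_agent):
        return 403
    return 200

if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # 403
    print(status_for("Mozilla/5.0 (Windows NT 10.0) Firefox/120"))  # 200
```

In a real deployment this check would run inside your web framework's request handler, before any page content is rendered.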

If you have other or better suggestions or methods for blocking search engine crawlers (spiders) from crawling and indexing web pages, feel free to share them in the comments. We look forward to exchanging ideas with you.

This article is by Bruce.

Original address: http://www.wuzhisong.com/blog/67/

Copyright notice: you are welcome to reprint this article, but you must include a hyperlink indicating its original source!

