When building a website, we naturally hope that as many of its pages as possible get indexed by the search engines. Sometimes, however, we run into the opposite case: a site that should not be indexed at all.
For example, you might put a mirror site on a new domain purely for PPC promotion. In that case you need a way to block search engine spiders from crawling and indexing every page of the mirror, because if the mirror site also gets indexed, it is likely to hurt the official site's weight in the search engines, which is certainly not a result we want to see.
Below are a few ways to block the major search engine crawlers (spiders) from crawling and indexing web pages. Note: the goal here is to block the entire site, and to shut out all of the mainstream search engine crawlers (spiders) as far as possible.
1. Block via the robots.txt file
The robots.txt file is arguably the most important channel, since it establishes a direct dialogue with the search engines. Based on an analysis of my own blog's server log files, I suggest the following rules (reader additions are welcome):
User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Mediapartners-Google
Disallow: /

User-agent: Adsbot-Google
Disallow: /

User-agent: Feedfetcher-Google
Disallow: /

User-agent: Yahoo! Slurp
Disallow: /

User-agent: Yahoo! Slurp China
Disallow: /

User-agent: Yahoo!-AdCrawler
Disallow: /

User-agent: YoudaoBot
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: Sogou spider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: MSNBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: Tomato Bot
Disallow: /

User-agent: *
Disallow: /
2. Block via the meta robots tag
Add the following tag to the <head> of every page:
<meta name="robots" content="noindex, nofollow">
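If editing every page template is impractical, the same noindex/nofollow directive can also be delivered as an HTTP response header from the server, which additionally covers non-HTML files. Here is a minimal Nginx sketch (the domain name and document root are hypothetical placeholders, not from the original article):

# Minimal sketch: send the robots directive as an HTTP header on
# every response. Domain and paths below are placeholders.
server {
    listen 80;
    server_name mirror.example.com;   # hypothetical mirror domain

    # "always" adds the header on error responses as well
    add_header X-Robots-Tag "noindex, nofollow" always;

    location / {
        root /var/www/mirror;         # hypothetical document root
    }
}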
3. Block via the server configuration file (e.g. on Linux/Nginx)
Filter the spiders'/robots' IP ranges directly at the server.
Note: methods 1 and 2 only work on the "gentlemen"; method 3 is there to guard against the "villains" ("gentlemen" and "villains" meaning spiders/robots that do and do not obey the robots.txt protocol, respectively). So after the site goes live, keep tracking and analyzing the logs, screen out these badbot IPs, and then block them (a sketch follows the link below).
Here's a Badbot IP database: http://www.spam-whackers.com/bad.bots.htm
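As a minimal Nginx sketch of this approach (the IP ranges below are documentation placeholders, not real spider addresses; substitute the badbot IPs you screen out of your own logs, and note the domain and paths are hypothetical):

# Minimal sketch: deny requests from badbot IP ranges identified
# in the access logs. The addresses below are placeholders from
# the RFC 5737 documentation ranges, not real spider IPs.
server {
    listen 80;
    server_name mirror.example.com;   # hypothetical mirror domain

    deny 192.0.2.0/24;      # placeholder badbot range
    deny 198.51.100.25;     # placeholder single badbot IP
    allow all;              # everyone else gets through

    location / {
        root /var/www/mirror;         # hypothetical document root
    }
}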
4. Delete page snapshots via the webmaster tools the search engines provide
For example, Baidu sometimes does not strictly obey the robots.txt protocol; in that case you can use the "web page complaint" portal Baidu provides to have page snapshots deleted. Baidu webmaster complaint center: Http://tousu.baidu.com/webmaster/add
Taking one of my own complaints as an example: about three days later, the Baidu snapshot of the page in question had indeed been deleted, which shows that this method also works. Of course, it is no cure-all; it is an after-the-fact remedy.
5. Supplementary update
You can also check whether http_user_agent identifies a crawler/spider and, if it does, directly return a 403 status code to block the request. For example, because of API permissions and microblog privacy protection, the Xweibo 2.0 release forbids search engines from indexing it.
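A minimal Nginx sketch of this idea (the user-agent pattern list is illustrative rather than exhaustive, the domain and paths are placeholders, and this is not Xweibo's actual implementation):

# Minimal sketch: inspect the User-Agent header and return 403 to
# known crawlers. Extend the patterns with whatever agents appear
# in your own logs.

# The map block belongs in the http {} context.
map $http_user_agent $is_spider {
    default                                              0;
    "~*(baiduspider|googlebot|bingbot|slurp|sosospider)" 1;
    "~*(sogou|youdaobot|msnbot|ia_archiver)"             1;
}

server {
    listen 80;
    server_name mirror.example.com;   # hypothetical mirror domain

    # "0" counts as false in an Nginx if condition
    if ($is_spider) {
        return 403;
    }

    location / {
        root /var/www/mirror;         # hypothetical document root
    }
}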
If you have other or better suggestions or methods for blocking search engine crawlers (spiders) from crawling and indexing web pages, please leave a comment! We look forward to exchanging ideas with you.
By Bruce
Original address: http://www.wuzhisong.com/blog/67/
Copyright notice: you are welcome to reprint this article, but you must include a hyperlink indicating its original source!