When building a website, we naturally hope that as many of its pages as possible get indexed by the search engines. Sometimes, however, we run into the opposite case: a site that should not be indexed at all.
For example, you might put a mirror site on a new domain purely for PPC promotion. In that case you need a way to block search engine spiders from crawling and indexing every page of the mirror, because if the mirror site also gets indexed, it is likely to hurt the official site's weight in the search engines, which is certainly not a result we want to see.
Below are a few ways to block the major search engine crawlers (spiders) from crawling and indexing web pages. Note: the goal here is to block the entire site, and to shut out all of the mainstream search engine crawlers (spiders) as far as possible.
1. Block via the robots.txt file
The robots.txt file is arguably the most important channel, since it establishes a direct dialogue with the search engines. Based on an analysis of my own blog's server log files, I suggest the following rules (reader additions are welcome):
User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Mediapartners-Google
Disallow: /

User-agent: Adsbot-Google
Disallow: /

User-agent: Feedfetcher-Google
Disallow: /

User-agent: Yahoo! Slurp
Disallow: /

User-agent: Yahoo! Slurp China
Disallow: /

User-agent: Yahoo!-AdCrawler
Disallow: /

User-agent: YoudaoBot
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: Sogou spider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: MSNBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: Tomato Bot
Disallow: /

User-agent: *
Disallow: /
2. Block via the meta robots tag
Add the following tag to the <head> of every page:
<meta name="robots" content="noindex, nofollow">
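If editing every page template is impractical, the same noindex/nofollow directive can also be delivered as an HTTP response header from the server, which additionally covers non-HTML files. Here is a minimal Nginx sketch (the domain name and document root are hypothetical placeholders, not from the original article):

# Minimal sketch: send the robots directive as an HTTP header on
# every response. Domain and paths below are placeholders.
server {
    listen 80;
    server_name mirror.example.com;   # hypothetical mirror domain

    # "always" adds the header on error responses as well
    add_header X-Robots-Tag "noindex, nofollow" always;

    location / {
        root /var/www/mirror;         # hypothetical document root
    }
}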
3. Block via the server configuration file (e.g. on Linux/Nginx)
Filter the spiders'/robots' IP ranges directly at the server.
Note: methods 1 and 2 only work on the "gentlemen"; method 3 is there to guard against the "villains" ("gentlemen" and "villains" meaning spiders/robots that do and do not obey the robots.txt protocol, respectively). So after the site goes live, keep tracking and analyzing the logs, screen out these badbot IPs, and then block them (a sketch follows the link below).
Here's a Badbot IP database: http://www.spam-whackers.com/bad.bots.htm
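As a minimal Nginx sketch of this approach (the IP ranges below are documentation placeholders, not real spider addresses; substitute the badbot IPs you screen out of your own logs, and note the domain and paths are hypothetical):

# Minimal sketch: deny requests from badbot IP ranges identified
# in the access logs. The addresses below are placeholders from
# the RFC 5737 documentation ranges, not real spider IPs.
server {
    listen 80;
    server_name mirror.example.com;   # hypothetical mirror domain

    deny 192.0.2.0/24;      # placeholder badbot range
    deny 198.51.100.25;     # placeholder single badbot IP
    allow all;              # everyone else gets through

    location / {
        root /var/www/mirror;         # hypothetical document root
    }
}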
4. Delete page snapshots via the webmaster tools the search engines provide
For example, Baidu sometimes does not strictly obey the robots.txt protocol; in that case you can use the "web page complaint" portal Baidu provides to have page snapshots deleted. Baidu webmaster complaint center: Http://tousu.baidu.com/webmaster/add
Taking one of my own complaints as an example: about three days later, the Baidu snapshot of the page in question had indeed been deleted, which shows that this method also works. Of course, it is no cure-all; it is an after-the-fact remedy.
5. Supplementary update
You can also check whether http_user_agent identifies a crawler/spider and, if it does, directly return a 403 status code to block the request. For example, because of API permissions and microblog privacy protection, the Xweibo 2.0 release forbids search engines from indexing it.
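A minimal Nginx sketch of this idea (the user-agent pattern list is illustrative rather than exhaustive, the domain and paths are placeholders, and this is not Xweibo's actual implementation):

# Minimal sketch: inspect the User-Agent header and return 403 to
# known crawlers. Extend the patterns with whatever agents appear
# in your own logs.

# The map block belongs in the http {} context.
map $http_user_agent $is_spider {
    default                                              0;
    "~*(baiduspider|googlebot|bingbot|slurp|sosospider)" 1;
    "~*(sogou|youdaobot|msnbot|ia_archiver)"             1;
}

server {
    listen 80;
    server_name mirror.example.com;   # hypothetical mirror domain

    # "0" counts as false in an Nginx if condition
    if ($is_spider) {
        return 403;
    }

    location / {
        root /var/www/mirror;         # hypothetical document root
    }
}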
If you have other or better suggestions or methods for blocking search engine crawlers (spiders) from crawling and indexing web pages, please leave a comment! We look forward to exchanging ideas with you.
By Bruce
Original address: http://www.wuzhisong.com/blog/67/
Copyright notice: you are welcome to reprint this article, but you must include a hyperlink indicating its original source!