The robots.txt file is a dialogue between a website and the search engine spiders that follow the robots protocol.
Let's look at an example that allows search engines to crawl all content. The code is as follows:
User-agent: *
Allow: /
Here, User-agent is followed by the name of a spider. To address every spider at once, use * in place of a specific name; to address one particular spider, name it explicitly. If you do not want a spider to crawl something, simply change Allow to Disallow. The path that follows the directive is the content that is prohibited (or permitted) to be crawled.
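These Allow/Disallow rules can be checked programmatically. The sketch below uses Python's standard urllib.robotparser to test a small, hypothetical rule set (the domain and the /private/ path are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt in memory instead of fetching it from a site.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

# The parser checks rules in order: /private/ is blocked, everything else allowed.
print(rp.can_fetch("*", "http://example.com/index.html"))  # True
print(rp.can_fetch("*", "http://example.com/private/x"))   # False
```

This is a quick way to confirm a rule set behaves as intended before uploading it.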
Sometimes a crawler visits too frequently, and we need to add the Crawl-delay directive, which tells the spider how many seconds to wait between requests. Consider this example:
User-agent: *
Crawl-delay: 500
The User-agent line is the same as before; the difference is that Crawl-delay must be followed by a number, and only a positive integer. Note that Crawl-delay is a nonstandard extension: some engines (such as Bing and Yandex) honor it, while Google ignores it.
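Python's urllib.robotparser (3.6+) also exposes the Crawl-delay value, which makes it easy to confirm the directive above parses as a positive integer:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 500
""".splitlines())

# crawl_delay() returns the delay (in seconds) declared for the given agent.
print(rp.crawl_delay("*"))  # 500
```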
Common directives include User-agent, Disallow, Allow, and Crawl-delay.
A good way to practice is to set up a robots.txt file yourself. Taking the Baidu spider as an example, add the following to the robots.txt file on your site:
User-agent: baiduspider
Disallow: /
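Before deploying, this block can be verified with the same parser: the Baidu spider should be denied everywhere while other agents, which have no matching rules, remain allowed (the Googlebot name below is just an illustrative second agent):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: baiduspider
Disallow: /
""".splitlines())

# The rule matches baiduspider only; agents without rules default to allowed.
print(rp.can_fetch("baiduspider", "http://example.com/"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/"))    # True
```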
Sometimes this method cannot completely block Baidu's crawler, that is, the spider does not comply with the robots protocol. To block it completely, we can add some rules to the site's .htaccess file. Two methods are described below.
Method 1:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]
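The key part of this rule is the RewriteCond pattern: it anchors at the start of the User-Agent string, and the [NC] flag makes the match case-insensitive. That condition can be mirrored with a small Python regex to sanity-check which agents would receive the 403 (the UA strings below are illustrative):

```python
import re

# Rough analogue of: RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
baidu_ua = re.compile(r"^baiduspider", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    # The pattern only matches at the start of the string, like the ^ anchor above.
    return baidu_ua.search(user_agent) is not None

print(is_blocked("Baiduspider/2.0"))                       # True
print(is_blocked("Mozilla/5.0 (compatible; Googlebot)"))   # False
```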
Method 2:
SetEnvIfNoCase User-Agent "^Baiduspider" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Source: Anhui Children's Network Co., http://www.ahyuer.com.