At an interview with a software company, the interviewer asked: "You write crawlers. Do you know that many sites have a robots file?"
My answer: I don't know.
So the interviewer gave me a demo.
Then I was rejected. Off to a bad start.
Afterwards I went to Wikipedia to get a basic understanding of robots: https://zh.wikipedia.org/wiki/Robots.txt
For example, Bing (www.bing.com) has such a file in its root directory, http://www.bing.com/robots.txt, with the following content:
User-agent: msnbot-media
Disallow: /
Allow: /shopping/$
Allow: /shopping$
Allow: /th?

User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /account/
Disallow: /bfp/search
Disallow: /bing-site-safety
Disallow: /blogs/search/
Disallow: /entities/search
Disallow: /fd/
Disallow: /history
Disallow: /hotels/search
...
The purpose of this file is to tell search engines which files under the domain may be crawled and which may not.
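As a rough illustration of how a crawler might consult these rules before fetching a page, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are a small hand-copied subset of the Bing listing, and the helper name allowed and the agent name my-crawler are made up for this example. Note that this parser does simple prefix matching and does not interpret the "$" or "*" wildcards, so the sketch sticks to plain paths.

```python
from urllib.robotparser import RobotFileParser

# A small subset of the rules shown above, kept inline so that the
# example does not need network access.
RULES = """\
User-agent: *
Disallow: /account/
Disallow: /history
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="my-crawler"):
    """Hypothetical helper: may `agent` fetch this path on www.bing.com?"""
    return parser.can_fetch(agent, "http://www.bing.com" + path)


print(allowed("/account/settings"))  # False
print(allowed("/search?q=robots"))   # True
```

Against a live site, the same parser can load the file itself with set_url() followed by read() instead of parse().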
The following is an excerpt from Wikipedia:
robots.txt (always lowercase) is an ASCII-encoded text file stored in the root directory of a website. It usually tells a web search engine's bots (also known as web spiders) which content on the site should not be fetched by them and which content may be. Because URLs are case-sensitive on some systems, the file name robots.txt should always be written in lowercase. The robots.txt file should be placed in the root directory of the website. If you want to define how search engine bots access individual subdirectories, you can merge those custom settings into the robots.txt at the root, or use robots metadata (also known as meta data).
The robots.txt protocol is not a specification but only a convention, so it does not guarantee the privacy of a site. Note that robots.txt decides whether a URL may be fetched by string comparison, so a directory path with a trailing slash "/" and the same path without it represent different URLs. robots.txt also allows wildcards, for example "Disallow: *.gif".
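To make the string-comparison point concrete, here is a minimal sketch of a path matcher. It is an assumption about how such matching is commonly done (prefix comparison, "*" matching any run of characters, a trailing "$" anchoring the end of the path), not the code of any particular search engine; the function name rule_matches is hypothetical.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Plain patterns are compared as prefixes. '*' matches any run of
    characters and a trailing '$' anchors the end of the path.
    An empty pattern (a bare "Disallow:") is not handled here.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


# A trailing slash changes what the rule covers.
print(rule_matches("/fd/", "/fd/page"))          # True
print(rule_matches("/fd/", "/fd"))               # False
print(rule_matches("/history", "/history/2020")) # True

# Wildcard and end-of-path anchor.
print(rule_matches("*.gif", "/images/logo.gif")) # True
print(rule_matches("/shopping$", "/shopping/x")) # False
```

A real crawler would also need to pick between competing Allow and Disallow rules, which this sketch leaves out.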
When reposting, please credit: Kangrui tribe » Robots under the website
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.