A Robot is an automated program that helps search engines collect Web pages. When it accesses a Web site, it extracts most of the site's content by following the links in each webpage, creates indexes for these webpages, and stores them in the search engine database. In some cases, the Web administrator or webpage author may not want the Robot to extract certain content from the site. In that case, there are methods for limiting the access scope of the Robot.
There are two methods to restrict Robot access to a Web site. One is the Robot restriction protocol, which is used by the Web administrator of the site. Currently, most robots comply with this protocol. The other is the Robot META tag, which is used by the webpage author. Currently, only a small portion of robots support this tag.
Robot Protocol
The core of the Robot restriction protocol is a file named robots.txt placed in the root directory of the Web site. When a Robot accesses a Web site, it first reads this file, analyzes its content, and then avoids the files that the Web administrator has ruled out. The following is an example of robots.txt:
# http://www.yoursite.com/robots.txt
User-agent: *
Disallow: /tmp/ # these files will soon be deleted
Disallow: /test.html

User-agent: InfoSeek Robot 1.0
Disallow: /
The content after "#" is a comment. The User-agent command specifies the Robot to which the Disallow commands under it apply. "*" indicates that they apply to all robots. In the preceding example, the second User-agent command indicates that the Disallow command under it applies only to version 1.0 of the Infoseek Robot. The Disallow command specifies which directories or files cannot be accessed. If "/" is specified, no file on the site may be accessed. Each Disallow line can name only one directory or one file. If there are multiple directories, each must be placed on its own line.
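The rules above can be checked programmatically. The following is a minimal sketch in Python that uses the standard library module urllib.robotparser to evaluate the example file. The robot name "ExampleBot" and the helper build_parser are hypothetical names introduced only for illustration, and the matching shown is that of Python's parser, which may differ in detail from what a particular search engine's Robot does.

```python
from urllib.robotparser import RobotFileParser

# The example file from the text, inlined so that no network access is needed.
ROBOTS_TXT = """\
# http://www.yoursite.com/robots.txt
User-agent: *
Disallow: /tmp/ # these files will soon be deleted
Disallow: /test.html

User-agent: InfoSeek Robot 1.0
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content supplied as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
SITE = "http://www.yoursite.com"

# "ExampleBot" is a made-up robot name; it falls under the "*" record.
print(parser.can_fetch("ExampleBot", SITE + "/index.html"))    # True
print(parser.can_fetch("ExampleBot", SITE + "/tmp/old.html"))  # False
print(parser.can_fetch("ExampleBot", SITE + "/test.html"))     # False

# The second record shuts this robot out of the whole site.
print(parser.can_fetch("InfoSeek Robot 1.0", SITE + "/index.html"))  # False
```

Parsing from a string rather than fetching the file keeps the sketch self-contained. A real Robot would download robots.txt from the root directory of the site before requesting any other file.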
The robots.txt file described here belongs to the early version of the Robot restriction protocol. An Internet draft on how to restrict robots is being developed. It extends the earlier version of the protocol, but it has not yet entered practical use.
Robot META tag
The second method is the Robot META tag. A META tag is used to place invisible information in an HTML file, and it must be placed in the head section of the file. The Robot META tag is a special META tag. The following are some examples:
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, nofollow">
The name part of the Robot META tag is "robots", and the content part can be a combination of "index", "noindex", "follow", and "nofollow". "index" indicates that the search engine can index the HTML file. "follow" indicates that the search engine can follow the links in the HTML file to access other files. "noindex" and "nofollow" are the opposites of "index" and "follow". When these commands are combined, there must be no logical conflict. That is, you cannot specify "index" together with "noindex", or "follow" together with "nofollow". In addition, "all" can be used in place of "index, follow", and "none" can be used in place of "noindex, nofollow".
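How a Robot might read this tag can be sketched in Python with the standard library module html.parser. The class RobotsMetaParser and the function robots_policy are hypothetical names used only for illustration. The sketch makes two assumptions that the text does not state: a page without the tag is treated as "index, follow", and if conflicting values appear, the restrictive one wins.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (may_index, may_follow) for an HTML document.

    Assumptions: a missing tag means "index, follow", and a restrictive
    value ("noindex", "nofollow", "none") overrides a permissive one.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not (found & {"noindex", "none"})
    may_follow = not (found & {"nofollow", "none"})
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_policy(page))  # (False, True)
```

With this sketch, content="none" yields (False, False) and content="all" yields (True, True), which matches the shorthand described above.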
The disadvantage of the Robot META tag is that every HTML file has to be modified individually, which is troublesome. In addition, many robots do not support this tag.