How to restrict Robot access to Web sites

Last Update:2013-11-26 Source: Internet

Author: User


A robot is an automated program that helps search engines collect web pages. When it accesses a web site, it follows the links in each page to retrieve most of the site's content, builds indexes for those pages, and stores them in the search engine's database. Sometimes the web administrator or a page author does not want the robot to retrieve certain content from the site. In that case, there are ways to limit the scope of the robot's access.

There are two methods to restrict robot access to a web site. One is the robot exclusion protocol, which is used by the web administrator of the site; most robots currently comply with this protocol. The other is the Robot META tag, which is used by the page author; currently only a small proportion of robots support this tag.

Robot Protocol

The core of the robot exclusion protocol is a file named robots.txt placed in the root directory of the web site. When a robot accesses the site, it first reads this file, parses its content, and then stays away from the files that the web administrator has ruled out. The following is an example of robots.txt:

    # http://www.yoursite.com/robots.txt
    User-agent: *
    Disallow: /tmp/ # these files will soon be deleted
    Disallow: /test.html

    User-agent: InfoSeek Robot 1.0
    Disallow: /

The text after "#" is a comment. A User-agent line specifies which robot the Disallow lines beneath it apply to; "*" means they apply to all robots. In the preceding example, the second User-agent line means that its Disallow line applies only to version 1.0 of the InfoSeek robot. A Disallow line specifies a directory or file that must not be accessed. If "/" is specified, no file on the site may be accessed. Each Disallow line can name only one directory or one file, so multiple directories must be listed on separate lines.
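To see how a robot might apply these rules, here is a minimal sketch using Python's standard urllib.robotparser module. It parses the example rules from a string (with "*" as the wildcard user agent) rather than fetching them from a site. The robot name "ExampleBot" and the path /tmp/cache.html are made up for illustration, and a real robot may match user agents differently from this module.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text, with a blank line between the two records.
ROBOTS_TXT = """\
# http://www.yoursite.com/robots.txt
User-agent: *
Disallow: /tmp/ # these files will soon be deleted
Disallow: /test.html

User-agent: InfoSeek Robot 1.0
Disallow: /
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    rules = RobotFileParser()
    rules.parse(text.splitlines())
    return rules


rules = load_rules(ROBOTS_TXT)
site = "http://www.yoursite.com"

# "ExampleBot" is a hypothetical robot name, so only the "*" record applies to it.
for agent, path in [
    ("ExampleBot", "/index.html"),
    ("ExampleBot", "/tmp/cache.html"),
    ("ExampleBot", "/test.html"),
    ("InfoSeek Robot 1.0", "/index.html"),
]:
    print(agent, path, rules.can_fetch(agent, site + path))
```

Under the rules described above, only the first request should be reported as allowed: the general record blocks /tmp/ and /test.html, and the InfoSeek record blocks the whole site.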

The robots.txt file in current use follows the early version of the robot exclusion protocol. An Internet draft on how to restrict robots is being developed; it extends the earlier version of the protocol but has not yet entered practical use.

Robot META tag

The other method is the Robot META tag.

A META tag places invisible information in an HTML file. It must appear in the head of the HTML file. The Robot META tag is a special META tag. The following are some examples:

    <meta name="robots" content="index, follow">
    <meta name="robots" content="noindex, follow">
    <meta name="robots" content="index, nofollow">
    <meta name="robots" content="noindex, nofollow">

The name attribute of the Robot META tag is "robots", and the content attribute can be a combination of "index", "noindex", "follow", and "nofollow". "index" indicates that the search engine may index the HTML file, and "follow" indicates that it may follow the links in the HTML file to reach other files. "noindex" and "nofollow" mean the opposite of "index" and "follow". Values used together must not conflict logically: you cannot specify both "index" and "noindex", or both "follow" and "nofollow", in the same tag. In addition, "all" can be used in place of "index, follow", and "none" in place of "noindex, nofollow".

The disadvantage of the Robot META tag is that every HTML file has to be modified, which is tedious. In addition, many robots do not support this tag.
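As an illustration of these rules, the following sketch shows how a robot could read the tag with Python's standard html.parser module. The names RobotsMetaParser and robots_policy are hypothetical helpers written for this example. Treating a page without the tag as "index, follow" is an assumption, not something the tag itself specifies here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content values of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.append(part.strip().lower())


def robots_policy(html: str) -> tuple:
    """Return (may_index, may_follow) for a page.

    Assumption: a page with no robots META tag may be indexed and followed.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    tokens = set(parser.directives)
    # "all" and "none" are shorthand for the two extreme combinations.
    if "all" in tokens:
        tokens |= {"index", "follow"}
    if "none" in tokens:
        tokens |= {"noindex", "nofollow"}
    # Reject logically conflicting combinations.
    if {"index", "noindex"} <= tokens or {"follow", "nofollow"} <= tokens:
        raise ValueError("conflicting robots directives")
    return ("noindex" not in tokens, "nofollow" not in tokens)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
may_index, may_follow = robots_policy(page)
print("index:", may_index, "follow:", may_follow)
```

For the sample page, the sketch reports that the page should not be indexed but that its links may be followed.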

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
