Robots.txt Guide
When a search engine visits a Web site, it first checks whether there is a plain text file called robots.txt under the root of the site. The robots.txt file is used to limit the search engine's access to the site: it tells the engine which files it is allowed to retrieve (download). This is the "Robots Exclusion Standard" you often see mentioned on the Web, abbreviated below as RES.
Robots.txt file format:
The format of the robots.txt file is simple: it consists of records, and these records are separated by blank lines. Each record consists of two kinds of fields:
1) a User-agent (user agent) line;
2) one or more Disallow lines.
The format of each record is: <field> ":" <value>
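For illustration, a minimal robots.txt containing a single record might look like this (the path shown is just a placeholder):
User-agent: *
Disallow: /private/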
Let's look at each of these two fields in turn.
User-agent (user agent):
The User-agent line specifies the name of the search engine robot. For Google's crawler, Googlebot, for example, the record contains: User-agent: googlebot
There must be at least one User-agent record in a robots.txt file. If there are several User-agent records, then several robots are bound by the RES standard. Of course, if you want to address all robots at once, just use the wildcard "*": User-agent: *
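For example, a file with two records, one for Googlebot and one for all other robots, is written with a blank line between the records (the paths here are only placeholders):
User-agent: googlebot
Disallow: /logs/

User-agent: *
Disallow: /tmp/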
Disallow (Deny access declaration):
In the robots.txt file, the second field of each record is a Disallow: instruction line. These Disallow lines declare the files and/or directories on the site that you do not want accessed. For example, "Disallow: /email.htm" forbids spiders from downloading the site's email.htm file, while "Disallow: /cgi-bin/" denies spiders access to the cgi-bin directory and its subdirectories. A Disallow declaration also works as a prefix match: "Disallow: /cgi-bin/" blocks the cgi-bin directory and everything below it, while "Disallow: /bob" blocks both /bob.html and /bob/index.html (that is, neither a file named bob nor anything in a directory named bob may be accessed by the search engine). If the Disallow value is left empty, all parts of the site are open to the search engine.
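To see this prefix matching in practice, here is a minimal sketch using Python's standard urllib.robotparser module; the example.com URLs are placeholders:

from urllib.robotparser import RobotFileParser

# Parse the "Disallow: /bob" record discussed above.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /bob",
])

# "/bob" matches as a prefix, so both of these are blocked ...
print(rp.can_fetch("*", "http://example.com/bob.html"))        # False
print(rp.can_fetch("*", "http://example.com/bob/index.html"))  # False
# ... while an unrelated path remains allowed.
print(rp.can_fetch("*", "http://example.com/index.html"))      # True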
Spaces and comments:
In a robots.txt file, lines that begin with "#" are treated as comments, following the UNIX convention. But you need to pay attention to two points:
1) The RES standard allows comments at the end of an instruction line, but not all spiders support this. For example, not every spider will correctly interpret an instruction such as "Disallow: bob #comment"; some spiders will misread the value as "bob#comment". The safest approach is to put comments on their own lines, as in the example after this list.
2) The RES standard allows whitespace at the beginning of an instruction line, as in "  Disallow: bob #comment", but we do not recommend it.
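Following point 1, the safer form puts the comment on its own line:
# keep robots away from bob (comment on its own line)
User-agent: *
Disallow: /bob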
Robots.txt file creation:
It is important to note that robots.txt should be created as a plain text file with Unix-style line endings. A good text editor can usually save in Unix mode, or your FTP client software should be able to do the conversion for you. If you try to generate your robots.txt with an HTML editor that has no plain-text mode, you are like a blind man swatting mosquitoes: wasting your energy.
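If you generate robots.txt from a script, you can force Unix line endings regardless of platform. A minimal Python sketch (the rules listed are placeholders):

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

# newline="\n" forces LF (Unix) line endings, even on Windows.
with open("robots.txt", "w", newline="\n") as f:
    f.write("\n".join(rules) + "\n")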
Extensions to the RES standard:
Although some extensions have been proposed, such as an Allow line or robot version control (for example, ignoring case and version numbers), they have not yet been formally approved by the RES working group.
Appendix I. Robots.txt Usage Examples:
Use the wildcard "*" to set access rights for all robots.
User-agent: *
Disallow:
Indicates: allows all search engines to access all content on the site.
User-agent: *
Disallow: /
Indicates: prohibits all search engines from accessing all pages under the site.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Indicates: prohibits all search engines from entering the site's cgi-bin and images directories and all their subdirectories. Note that each directory must be declared on its own Disallow line.
User-agent: roverdog
Disallow: /
Indicates: prohibits Roverdog from accessing any file on the site.
User-agent: googlebot
Disallow: /cheese.htm
Indicates: prohibits Google's Googlebot from accessing the cheese.htm file on the site.
The examples above cover some simple settings. For more complex settings, refer to the robots.txt files of large sites such as CNN or LookSmart (www.cnn.com/robots.txt, www.looksmart.com/robots.txt).
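To experiment with such live files, Python's standard urllib.robotparser can download and query them; a minimal sketch (the page URL checked is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.cnn.com/robots.txt")
rp.read()  # download and parse the live file
# Ask whether a given robot may fetch a given page.
print(rp.can_fetch("googlebot", "http://www.cnn.com/"))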


