Robots.txt Guide
When a search engine visits a Web site, it first checks whether there is a plain text file called robots.txt under the root of the site. The robots.txt file is used to limit the search engine's access to the site: it tells the engine which files it is allowed to retrieve (download). This is the "Robots Exclusion Standard" you often see mentioned on the Web, abbreviated below as RES.
Robots.txt file format:
The format of the robots.txt file is simple: it consists of records, and these records are separated by blank lines. Each record consists of two kinds of fields:
1) a User-agent (user agent) line;
2) one or more Disallow lines.
The format of each record is: <field> ":" <value>
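For illustration, a minimal robots.txt containing a single record might look like this (the path shown is just a placeholder):
User-agent: *
Disallow: /private/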
Let's look at each of these two fields in turn.
User-agent (user agent):
The User-agent line specifies the name of the search engine robot. For Google's crawler, Googlebot, for example, the record contains: User-agent: googlebot
There must be at least one User-agent record in a robots.txt file. If there are several User-agent records, then several robots are bound by the RES standard. Of course, if you want to address all robots at once, just use the wildcard "*": User-agent: *
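For example, a file with two records, one for Googlebot and one for all other robots, is written with a blank line between the records (the paths here are only placeholders):
User-agent: googlebot
Disallow: /logs/

User-agent: *
Disallow: /tmp/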
Disallow (Deny access declaration):
In the robots.txt file, the second field of each record is a Disallow: instruction line. These Disallow lines declare the files and/or directories on the site that you do not want accessed. For example, "Disallow: /email.htm" forbids spiders from downloading the site's email.htm file, while "Disallow: /cgi-bin/" denies spiders access to the cgi-bin directory and its subdirectories. A Disallow declaration also works as a prefix match: "Disallow: /cgi-bin/" blocks the cgi-bin directory and everything below it, while "Disallow: /bob" blocks both /bob.html and /bob/index.html (that is, neither a file named bob nor anything in a directory named bob may be accessed by the search engine). If the Disallow value is left empty, all parts of the site are open to the search engine.
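To see this prefix matching in practice, here is a minimal sketch using Python's standard urllib.robotparser module; the example.com URLs are placeholders:

from urllib.robotparser import RobotFileParser

# Parse the "Disallow: /bob" record discussed above.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /bob",
])

# "/bob" matches as a prefix, so both of these are blocked ...
print(rp.can_fetch("*", "http://example.com/bob.html"))        # False
print(rp.can_fetch("*", "http://example.com/bob/index.html"))  # False
# ... while an unrelated path remains allowed.
print(rp.can_fetch("*", "http://example.com/index.html"))      # True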
Spaces and comments:
In a robots.txt file, lines that begin with "#" are treated as comments, following the UNIX convention. But you need to pay attention to two points:
1) The RES standard allows comments at the end of an instruction line, but not all spiders support this. For example, not every spider will correctly interpret an instruction such as "Disallow: bob #comment"; some spiders will misread the value as "bob#comment". The safest approach is to put comments on their own lines, as in the example after this list.
2) The RES standard allows whitespace at the beginning of an instruction line, as in "  Disallow: bob #comment", but we do not recommend it.
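Following point 1, the safer form puts the comment on its own line:
# keep robots away from bob (comment on its own line)
User-agent: *
Disallow: /bob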
Robots.txt file creation:
It is important to note that robots.txt should be created as a plain text file with Unix-style line endings. A good text editor can usually save in Unix mode, or your FTP client software should be able to do the conversion for you. If you try to generate your robots.txt with an HTML editor that has no plain-text mode, you are like a blind man swatting mosquitoes: wasting your energy.
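If you generate robots.txt from a script, you can force Unix line endings regardless of platform. A minimal Python sketch (the rules listed are placeholders):

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

# newline="\n" forces LF (Unix) line endings, even on Windows.
with open("robots.txt", "w", newline="\n") as f:
    f.write("\n".join(rules) + "\n")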
Extensions to the RES standard:
Although some extensions have been proposed, such as an Allow line or robot version control (for example, ignoring case and version numbers), they have not yet been formally approved by the RES working group.
Appendix I. Robots.txt Usage Examples:
Use the wildcard "*" to set access rights for all robots.
User-agent: *
Disallow:
Indicates: allows all search engines to access all content on the site.
User-agent: *
Disallow: /
Indicates: prohibits all search engines from accessing all pages under the site.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Indicates: prohibits all search engines from entering the site's cgi-bin and images directories and all their subdirectories. Note that each directory must be declared on its own Disallow line.
User-agent: roverdog
Disallow: /
Indicates: prohibits Roverdog from accessing any file on the site.
User-agent: googlebot
Disallow: /cheese.htm
Indicates: prohibits Google's Googlebot from accessing the cheese.htm file on the site.
The examples above cover some simple settings. For more complex settings, refer to the robots.txt files of large sites such as CNN or LookSmart (www.cnn.com/robots.txt, www.looksmart.com/robots.txt).
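To experiment with such live files, Python's standard urllib.robotparser can download and query them; a minimal sketch (the page URL checked is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.cnn.com/robots.txt")
rp.read()  # download and parse the live file
# Ask whether a given robot may fetch a given page.
print(rp.can_fetch("googlebot", "http://www.cnn.com/"))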


