What's robots.txt?

Last Update:2017-02-28 Source: Internet

Author: User



A Basic Introduction to robots.txt

robots.txt is a plain text file in which site administrators can declare the parts of a site that robots should not access, or specify that a search engine should include only certain content.

When a search robot (sometimes called a search spider) visits a site, it first checks whether a robots.txt file exists in the root directory of the site. If it does, the robot uses the contents of the file to determine what it may access. If the file does not exist, the robot simply crawls along the links it finds.

In addition, robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
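As a rough illustration of where a robot looks, the sketch below derives the robots.txt address from any page URL using Python's standard urllib.parse module. The helper name robots_url and the sample page path are my own assumptions, not part of any library or of the site above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, point the path at the root-level file,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path, used only for illustration.
print(robots_url("http://www.csswebs.org/articles/page.html?id=1"))
# http://www.csswebs.org/robots.txt
```

Whatever page the robot starts from, the file it asks for always sits directly under the root of the host.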

robots.txt Syntax

First, let's look at an example robots.txt: http://www.csswebs.org/robots.txt

Visiting the address above shows the following contents:

# Robots.txt file from http://www.csswebs.org
# All robots would spider the domain

user-agent: *
Disallow:

The rules above allow all search robots to access every file on the www.csswebs.org site.

How to read the syntax: text following # is a comment. User-agent: gives the name of the search robot the rules apply to, and * means all search robots. Disallow: gives the file or directory that must not be accessed; an empty value disallows nothing.
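To check this reading of the example, here is a minimal sketch that feeds the same rules to Python's standard urllib.robotparser module. The robot name "AnyBot" and the page path are placeholders I made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The example file from the text (field names are matched case-insensitively).
EXAMPLE = """\
# Robots.txt file from http://www.csswebs.org
# All robots would spider the domain

User-agent: *
Disallow:
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed here.
parser.parse(EXAMPLE.splitlines())

# "AnyBot" and the page path are hypothetical.
allowed = parser.can_fetch("AnyBot", "http://www.csswebs.org/some/page.html")
print(allowed)  # True
```

The comment lines are ignored, and the empty Disallow: value leaves every path open to every robot.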

Below, I'll list some specific uses of robots.txt:

Allow all robots to access the site

user-agent: *
Disallow:

Alternatively, you can create an empty "/robots.txt" file.

Prohibit all search engines from accessing any part of the site

user-agent: *
Disallow:/
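The two extremes, an empty file and a blanket Disallow, can be compared with a short sketch based on urllib.robotparser. The helper can_crawl, the robot name "AnyBot", and the www.example.com address are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "AnyBot") -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


URL = "http://www.example.com/index.html"  # hypothetical page

# An empty robots.txt places no restriction on any robot.
print(can_crawl("", URL))  # True

# "Disallow: /" matches every path, so nothing may be fetched.
print(can_crawl("User-agent: *\nDisallow: /\n", URL))  # False
```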

Prohibit all search engines from accessing certain parts of the site (the 01, 02, and 03 directories in the following example)

user-agent: *
Disallow:/01/
Disallow:/02/
Disallow:/03/
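A quick way to see which paths these three rules cover is to test a few URLs against them. This is only a sketch using urllib.robotparser; the host www.example.com, the robot name, and the page names are invented for the test.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Check a path on a hypothetical host for a hypothetical robot."""
    return parser.can_fetch("AnyBot", "http://www.example.com" + path)


print(allowed("/01/page.html"))  # False: inside a blocked directory
print(allowed("/03/a/b.html"))   # False: rules match by path prefix
print(allowed("/04/page.html"))  # True: no rule matches this path
```

Because the rules are matched as path prefixes, everything below a listed directory is blocked while sibling directories stay open.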

Prohibit a specific search engine from accessing the site (badbot in the following example)

User-agent:badbot
Disallow:/

Allow only one search engine to access the site (crawler in the following example)

User-agent:crawler
Disallow:

user-agent: *
Disallow:/
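The two per-robot cases above (blocking only badbot, and admitting only crawler) can be checked with the same standard-library parser. The helper can_crawl, the name "OtherBot", and the www.example.com address are assumptions for this sketch.

```python
from urllib.robotparser import RobotFileParser

BLOCK_BADBOT = """\
User-agent: badbot
Disallow: /
"""

ONLY_CRAWLER = """\
User-agent: crawler
Disallow:

User-agent: *
Disallow: /
"""

URL = "http://www.example.com/page.html"  # hypothetical page


def can_crawl(robots_txt: str, agent: str) -> bool:
    """Parse robots_txt and report whether agent may fetch URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, URL)


# Only badbot is shut out; a robot with no matching record is unrestricted.
print(can_crawl(BLOCK_BADBOT, "badbot"))    # False
print(can_crawl(BLOCK_BADBOT, "OtherBot"))  # True

# crawler has its own record; every other robot falls back to the * record.
print(can_crawl(ONLY_CRAWLER, "crawler"))   # True
print(can_crawl(ONLY_CRAWLER, "OtherBot"))  # False
```

The blank line between the two records matters: it separates the rules for crawler from the catch-all rules for everyone else.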

In addition, I think it is worth introducing the robots meta tag:

The robots meta tag applies to one specific page. Like other meta tags (such as those for the language used, the page description, and the keywords), the robots meta tag is placed in the <head> of the page, and it tells search engines how to crawl the content of that page.

How to write the robots meta tag:

The robots meta tag is not case-sensitive. name="Robots" addresses all search engines; to address a specific search engine, use its name instead, for example name="Baiduspider". The content attribute accepts four directives: index, noindex, follow, and nofollow, separated by ",".

The index directive tells the search robot that it may crawl and index the page.

The follow directive tells the search robot that it may continue crawling along the links on the page. noindex and nofollow forbid the respective action.

The default value of the robots meta tag is index,follow, except for Inktomi, whose default is index,nofollow.

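This gives four combinations in total: index,follow; noindex,follow; index,nofollow; and noindex,nofollow.

Of these, index,follow can also be written as all, and noindex,nofollow can also be written as none.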
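The four combinations boil down to two yes/no decisions, which the following sketch makes explicit. The function parse_robots_content is a hypothetical helper of my own, not an existing API, and it handles only the four directives named in the text.

```python
def parse_robots_content(content: str) -> dict:
    """Interpret a robots meta content value as index/follow flags."""
    # Start from the default described in the text: index and follow.
    flags = {"index": True, "follow": True}
    for token in content.split(","):
        directive = token.strip().lower()  # the tag is not case-sensitive
        if directive in ("index", "noindex"):
            flags["index"] = directive == "index"
        elif directive in ("follow", "nofollow"):
            flags["follow"] = directive == "follow"
        # Anything else is ignored in this simplified sketch.
    return flags


for value in ("index,follow", "noindex,follow",
              "index,nofollow", "NOINDEX, NOFOLLOW"):
    print(value, "->", parse_robots_content(value))
```

Each combination sets the two flags independently, so a page can be left out of the index while its links are still followed, or the other way around.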

It seems that the vast majority of search engine robots comply with the rules in robots.txt. Support for the robots meta tag was less widespread at the time of writing, but it is gradually increasing. Google, for example, supports it fully and has also added an archive directive (written as noarchive in the content attribute) that controls whether Google keeps a snapshot of the page, for example a meta tag with name="googlebot" and content="index,follow,noarchive".
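To show how a robot might read these tags from a page, here is a minimal sketch built on Python's standard html.parser module. The class RobotsMetaParser and the sample page are my own assumptions; a real crawler would need to handle much more than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots-style meta tags as {name: [directives]}."""

    def __init__(self, names=("robots", "googlebot")):
        super().__init__()
        self.names = {n.lower() for n in names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values keep their case.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in self.names:
            content = attr.get("content", "")
            self.found[name] = [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


# Hypothetical page used only for illustration.
PAGE = """<html><head>
<meta name="Robots" content="INDEX,NOFOLLOW">
<meta name="googlebot" content="index,follow,noarchive">
</head><body></body></html>"""

parser = RobotsMetaParser()
parser.feed(PAGE)
print(parser.found)
```

In this sample, the general tag asks all robots not to follow links, while the googlebot tag additionally asks Google not to keep a snapshot of the page.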

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
