Standardized format of the robots.txt file (controlling search engine inclusion)


Search engine indexing and the robots.txt file

First, the file name must be all lowercase: robots.txt. It is placed in the root directory of the site (for example, the web folder),
so the access path is http://www.xxsite.com/robots.txt

Second, the file must be plain ASCII text (a text editor in UNIX mode is recommended).
The content of robots.txt is as follows; each directive takes the form of a field name, a colon, and then the value:

Comments start with #.
A comment should be written on its own line; do not append it to the end of a directive line. [Example]
# robots.txt file from http://www.xxsite.com

Scenario 1: allow all search engines to access everything
User-agent: *
Disallow:

Alternatively, create an empty robots.txt file.

Scenario 2: prohibit all search engines from accessing any part of the website
User-agent: *
Disallow: /

Scenario 3: prohibit all search engines from accessing certain folders or files on the website.
Directory and file names must match the case used on the server.
To block an entire folder, you can specify the folder name without listing every file in it.
[Example]
User-agent: *
Disallow: /12345.html
Disallow: /001/
Disallow: /Photo/
Disallow: /cgi-bin/images/

Note the difference between a path with and without a trailing slash (/):
Disallow: /help
blocks both /help.html and /help/index.html

Disallow: /help/
allows /help.html to be crawled but blocks /help/index.html
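This trailing-slash behavior can be checked with Python's standard urllib.robotparser module, which applies the same prefix matching; the rules and URLs below are illustrative only:

```python
from urllib.robotparser import RobotFileParser

# Two hypothetical rule sets illustrating the trailing-slash difference.
p1 = RobotFileParser()
p1.parse(["User-agent: *", "Disallow: /help"])   # no trailing slash

p2 = RobotFileParser()
p2.parse(["User-agent: *", "Disallow: /help/"])  # trailing slash

# "Disallow: /help" blocks every path that starts with /help ...
print(p1.can_fetch("*", "http://example.com/help.html"))        # False
print(p1.can_fetch("*", "http://example.com/help/index.html"))  # False

# ... while "Disallow: /help/" blocks only the /help/ directory.
print(p2.can_fetch("*", "http://example.com/help.html"))        # True
print(p2.can_fetch("*", "http://example.com/help/index.html"))  # False
```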

Scenario 4: prohibit access by a specific search engine
[Example] blocking the searchbot crawler
User-agent: searchbot
Disallow: /

Scenario 5: allow access by only one search engine.
[Example] only searchbot is allowed
User-agent: searchbot
Disallow:

User-agent: *
Disallow: /
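Feeding these rules into Python's standard urllib.robotparser confirms that only searchbot is admitted (example.com and otherbot are placeholder names):

```python
from urllib.robotparser import RobotFileParser

# The "allow only searchbot" record group from the scenario above.
rp = RobotFileParser()
rp.parse([
    "User-agent: searchbot",
    "Disallow:",          # empty Disallow = allow everything
    "",
    "User-agent: *",
    "Disallow: /",        # everyone else is blocked entirely
])

print(rp.can_fetch("searchbot", "http://example.com/page.html"))  # True
print(rp.can_fetch("otherbot", "http://example.com/page.html"))   # False
```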

[Other situations]
<meta name="robots" content="index,follow"> is equivalent to <meta name="robots" content="all">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow"> is equivalent to <meta name="robots" content="none">
[Robots meta tag usage] tells crawlers how to handle the page content.
At present only a few search engines support this tag, chiefly Google.
1. Lowercase letters are recommended.
2. name="robots" applies to all crawlers.
3. name="googlebot" targets one specific crawler, in this case Google's.
4. The content attribute takes the following values:
index allows the crawler to index the page;
noindex forbids the crawler from indexing the page;
follow allows the crawler to follow the links on the page;
nofollow forbids the crawler from following the links on the page;
archive allows the crawler to keep a cached snapshot of the page (currently only Google supports this syntax);
noarchive forbids the crawler from keeping a cached snapshot of the page (currently only Google supports this syntax).
[Example]
<meta name="googlebot" content="index,follow,noarchive">
5. The original robots.txt convention defines no other directives; Allow is a later extension that is not universally supported.
6. Use of wildcards (currently only Google supports this syntax)
[Example]
User-agent: googlebot
Disallow: /*.cgi
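Python's urllib.robotparser does plain prefix matching and ignores this wildcard extension, so here is a rough sketch of the matching idea using fnmatch; the helper name blocked_by_wildcard is made up for illustration:

```python
import fnmatch
from urllib.parse import urlparse

# Sketch of Google-style wildcard matching; this is NOT part of the
# original robots.txt convention, and the helper name is hypothetical.
def blocked_by_wildcard(url, pattern):
    path = urlparse(url).path
    # A pattern like "*.cgi" should match anywhere in the path,
    # so anchor it loosely with a leading "*".
    return fnmatch.fnmatch(path, "*" + pattern)

print(blocked_by_wildcard("http://example.com/scripts/form.cgi", "*.cgi"))  # True
print(blocked_by_wildcard("http://example.com/index.html", "*.cgi"))        # False
```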

Finally, note that not all crawlers honor the robots.txt protocol;
some spiders even disguise themselves as ordinary client browsers.

[Reference content]
http://www.robotstxt.org
http://www.google.com/webmasters/tools
http://www.mcanerin.com/EN/search-engine/robots-txt.asp
http://whois.domaintools.com
[Common spiders] user-agent names of common search engine robots:
Googlebot
Baiduspider
Yahoo! Slurp
Msnbot
Outfoxbot
Yahoo! Slurp China
Sogou spider
Sohu agent
Iaskspider
...and spiders disguised as Firefox or IE.
----------------------------------------------------------------------
Statement:
robots.txt is an ASCII text file stored in the root directory of a website. It tells search engine crawlers which parts of the site may be fetched and which may not. A typical file looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
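A crawler that honors such a file would evaluate it roughly as Python's standard urllib.robotparser does (the host and user-agent names below are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse a typical robots.txt and query it for two URLs.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /private/",
])

print(rp.can_fetch("anybot", "http://example.com/index.html"))      # True
print(rp.can_fetch("anybot", "http://example.com/cgi-bin/run.sh"))  # False
```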

On some systems URLs are case-sensitive, so the file name should be written uniformly in lowercase: robots.txt. The file must be placed in the root directory of the website. To control crawler behavior within a subdirectory, either merge those settings into the robots.txt in the root directory or use robots meta tags.

The robots.txt protocol is not a formal standard but a convention, so it cannot guarantee a website's privacy. robots.txt decides whether a URL may be fetched by string prefix comparison, so a directory path with and without the trailing slash "/" produces two different rules, and wildcards such as "Disallow: *.gif" are not part of the original convention.

Another way to influence search engine behavior is the robots meta tag. Like robots.txt, it is a convention rather than a standard; a search engine that recognizes it will, when so instructed, neither index the page nor follow the page's outbound links.

robots.txt is a plain text file in which the website administrator declares which parts of the website should not be visited by robots, or which content a given search engine may include.

When a search robot (also called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its access scope from the file's contents; if it does not, the robot simply crawls along the site's links.

In addition, robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.

Robots.txt writing syntax

First, let's look at a robots.txt example: http://www.csswebs.org/robots.txt

The specific content of robots.txt is as follows:

# robots.txt file from http://www.csswebs.org
# All robots will spider the domain

User-agent: *
Disallow:

The above declaration allows all search robots to crawl every file on www.csswebs.org.

Syntax, line by line: text after # is a comment; User-agent names the search robot, and * means every robot; Disallow lists a file or directory that must not be accessed.

Next, let me list the specific usage of robots.txt:

Allow access by all robots

User-agent: *
Disallow:

Alternatively, you can create an empty "/robots.txt" file.

Prohibit all search engines from accessing any part of the website

User-agent: *
Disallow: /

Prohibit all search engines from accessing certain parts of the website (the 01, 02, and 03 directories in the example below)

User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/

Prohibit access by a specific search engine (badbot in the example below)

User-agent: badbot
Disallow: /

Allow access by only one search engine (Crawler in the example below)

User-agent: Crawler
Disallow:

User-agent: *
Disallow: /

In addition, it is worth expanding on the robots meta tag:

The robots meta tag targets a specific page. Like other meta tags (such as the language used, page description, and keywords), it is placed in the <head> section of the page.

Syntax of the robots meta tag:

The robots meta tag is case-insensitive. name="robots" applies to all search engines; to target a specific engine, use its crawler name, for example name="baiduspider". The content attribute takes four directives: index, noindex, follow, and nofollow, separated by commas.

The index directive tells the search robot that it may index the page;

the follow directive tells the search robot that it may continue crawling along the links on the page.

The default for the robots meta tag is index, follow, except for Inktomi, whose default is index, nofollow.

This gives four combinations:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

where

<meta name="robots" content="index,follow"> can be written as <meta name="robots" content="all">;

<meta name="robots" content="noindex,nofollow"> can be written as <meta name="robots" content="none">.

Today the vast majority of search engine robots comply with robots.txt rules. Support for the robots meta tag is still limited but growing; Google, the best-known search engine, supports it fully and also adds the archive directive to control whether Google keeps a cached snapshot of the page. For example:
<meta name="googlebot" content="index,follow,noarchive">
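Extracting these meta directives from a page can be sketched with Python's standard html.parser module; RobotsMetaParser is a hypothetical helper written for this example, not a library class:

```python
from html.parser import HTMLParser

# Minimal sketch: collect the directives of a page's robots meta tag.
class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # HTMLParser lowercases tag and attribute names; the meta tag
        # itself is case-insensitive, so lowercase the name value too.
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "")
            self.directives = [d.strip().lower() for d in content.split(",")]

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="Robots" content="noindex, follow"></head></html>')
print(parser.directives)  # ['noindex', 'follow']
```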
