Robots Exclusion Protocol

(1) Introduction to the Robots Exclusion Protocol
When a robot accesses a Web site, such as http://www.some.com/, it first checks for the file http://www.some.com/robots.txt. If the file exists, the robot analyzes the records it contains, which follow this format:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

These records determine which files on the site the robot should retrieve. They are intended only for web robots; ordinary visitors will generally never see this file.
A website can have only one "/robots.txt" file, and every letter of the file name must be lowercase. In the record format, each individual "Disallow" line names a URL that you do not want the robot to access. Each URL must be on its own line; a combined form such as "Disallow: /cgi-bin/ /tmp/" is not allowed. Blank lines must not appear inside a record, because a blank line is the separator between records.
The User-Agent line names the robot or other agent the record applies to. In a User-Agent line, '*' stands for all robots.
Below are several examples of robots.txt files:
Deny all robots access to the entire server:

User-Agent: *
Disallow: /

Allow all robots to access the entire site (or create an empty "/robots.txt" file):

User-Agent: *
Disallow:

Deny all robots access to part of the server:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Deny a specific robot:

User-Agent: BadBot
Disallow: /

Allow only one robot:

User-Agent: WebCrawler
Disallow:

User-Agent: *
Disallow: /
Finally, here is the robots.txt file used by the W3C site:

# For use by search.w3.org
User-Agent: w3crobot/1
Disallow:

User-Agent: *
Disallow: /member/      # This is restricted to W3C members only
Disallow: /team/        # This is restricted to W3C team only
Disallow: /tands/member # This is restricted to W3C members only
Disallow: /tands/team   # This is restricted to W3C team only
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /team
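
As a practical aside, Python's standard library ships a parser for this protocol. The sketch below is only an illustration; the robot name "MyCrawler" is hypothetical, and the site URL is the placeholder from the introduction above.

from urllib.robotparser import RobotFileParser

robot_name = "MyCrawler"            # hypothetical robot name, for illustration
site = "http://www.some.com"        # placeholder site from the introduction

parser = RobotFileParser()
parser.set_url(site + "/robots.txt")
parser.read()                       # download and parse the robots.txt file

# can_fetch() applies the User-Agent and Disallow rules described above.
for path in ("/index.html", "/cgi-bin/search", "/tmp/cache.html"):
    url = site + path
    print(url, "->", "allowed" if parser.can_fetch(robot_name, url) else "blocked")

A well-behaved crawler would run such a check before every fetch and skip any URL the rules block.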
In addition to the robots.txt file, a page can carry instructions for robots in a robots meta tag. Its format is as follows:
<Meta name = "Robots" content = "noindex, nofollow"> "〉
Like other meta tags, it should be placed in the head area of the HTML file:
<HTML> 〉
<Head> 〉
<Meta name = "Robots" content = "noindex, nofollow"> "〉
<Meta name = "Description" content = "this page..."> ...."〉
<Title>... </title> 〉
</Head> 〉
<Body> 〉
...

Commands in the robots meta tag are separated by commas. The available commands are [no]index and [no]follow. The index command specifies whether an indexing robot may index this page; the follow command specifies whether the robot may follow the links on this page. The default values are index and follow. For example:

<Meta name = "Robots" content = "index, follow"> "〉
<Meta name = "Robots" content = "noindex, follow"> "〉
<Meta name = "Robots" content = "index, nofollow"> "〉
<Meta name = "Robots" content = "noindex, nofollow"> "〉
Management of Web robot programs should be considered when creating and maintaining web applications.

------------------------------------
Creating a robots.txt file:
If your website is www.yingz.com and you want to deny robots access to the img directory, write the following in robots.txt:
# For use by search.yingz.com
User-Agent: *
Disallow: /img/
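
You can sanity-check a file like this before deploying it. The following is a small sketch with Python's urllib.robotparser, parsing the lines in memory so no web server is needed:

from urllib.robotparser import RobotFileParser

rules = [
    "# For use by search.yingz.com",
    "User-Agent: *",
    "Disallow: /img/",
]

parser = RobotFileParser()
parser.parse(rules)    # parse the lines directly, no network access

print(parser.can_fetch("*", "http://www.yingz.com/index.html"))    # expected: True
print(parser.can_fetch("*", "http://www.yingz.com/img/logo.gif"))  # expected: False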
----------------------------------

Transferred from:

http://blog.sina.com.cn/s/blog_4cd012fc010085mb.html

------------------------------------

(2) Overview of the Robots Exclusion Protocol

1. What is a robots.txt file?

Search engines use a program called a robot (also known as a spider) to automatically visit webpages on the Internet and collect webpage information.

You can create a plain-text file named robots.txt on your website to declare which parts of the site you do not want robots to visit. In this way, some or all of the site's content can be kept out of search engine indexes, or a specified search engine can be directed to index only specified content.

2. Where should the robots.txt file be placed?

The robots.txt file must be placed in the root directory of the website. When a robot visits a website (such as http://www.abc.com), it first checks whether http://www.abc.com/robots.txt exists; if it does, the robot determines the scope of its access from the file's contents.

Website URL                      Corresponding robots.txt URL
http://www.w3.org/               http://www.w3.org/robots.txt
http://www.w3.org:80/            http://www.w3.org:80/robots.txt
http://www.w3.org:1234/          http://www.w3.org:1234/robots.txt
http://w3.org/                   http://w3.org/robots.txt
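
The mapping shown above is mechanical: take the URL's scheme and network location (including any port number) and append /robots.txt. A brief sketch using Python's standard urllib.parse module:

from urllib.parse import urlsplit

def robots_url(page_url):
    # Keep the scheme and the network location (host plus optional port),
    # then append the fixed path /robots.txt.
    parts = urlsplit(page_url)
    return "%s://%s/robots.txt" % (parts.scheme, parts.netloc)

for url in ("http://www.w3.org/",
            "http://www.w3.org:1234/Project/index.html",
            "http://w3.org/"):
    print(url, "->", robots_url(url))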

3. robots.txt File Format

The "robots.txt" file contains one or more records separated by empty rows (with Cr, CR/NL, or NL as the terminator). The format of each record is as follows:

"<Field >:< optionalspace> <value> <optionalspace> ".

In this file, you can use # For annotation. The usage is the same as that in UNIX. The record in this file usually starts with one or more lines of User-Agent, followed by several disallow lines. The details are as follows:

User-Agent:
The value of this field names the search engine robot to which the record applies. If the "robots.txt" file contains multiple User-Agent records, the protocol restricts more than one robot; the file must contain at least one User-Agent record. If the value is set to *, the record applies to all robots, and only one "User-Agent: *" record may appear in the file.

Disallow:
The value of this field describes a URL that should not be visited. The value may be a complete path or a prefix; any URL beginning with the Disallow value will not be accessed by the robot. For example, "Disallow: /help" denies search engines access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows the robot to access /help.html but not /help/index.html.
If the value of a Disallow record is empty, all parts of the website may be accessed. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the website is open to all search engine robots.
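
To make the prefix rule concrete, here is a deliberately minimal sketch in Python. It is not a complete robots.txt implementation, and the helper names are invented for illustration; it simply reproduces the /help versus /help/ behavior described above:

def parse_disallows(robots_txt, agent="*"):
    # Collect Disallow prefixes from records whose User-Agent line matches.
    prefixes = []
    applies = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments, as described above
        if not line:
            applies = False                    # a blank line ends the record
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies = applies or value == "*" or value.lower() == agent.lower()
        elif field == "disallow" and applies and value:
            prefixes.append(value)
    return prefixes

def allowed(path, prefixes):
    # A path is blocked if it begins with any Disallow prefix.
    return not any(path.startswith(p) for p in prefixes)

rules = parse_disallows("User-Agent: *\nDisallow: /help/")
print(allowed("/help.html", rules))        # True:  does not start with /help/
print(allowed("/help/index.html", rules))  # False: blocked by the /help/ prefix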

4. Examples of robots.txt File Usage

Example 1. Prohibit all search engines from accessing any part of the website

User-Agent: *
Disallow: /

Example 2. Allow all robots to access the entire site

(Alternatively, you can create an empty "/robots.txt" file.)

User-Agent: *
Disallow:

Example 3. Deny a specific search engine

User-Agent: BadBot
Disallow: /

Example 4. Allow only a specific search engine

User-Agent: baiduspider
Disallow:

User-Agent: *
Disallow: /

Example 5. A simple example

In this example, the website restricts search engine access to three directories; that is, search engines will not visit any of these three directories.
Note that each directory must be declared on its own line, not combined as "Disallow: /cgi-bin/ /tmp/".
Also note that the * following "User-Agent:" has the special meaning of "any robot", so records such as "Disallow: /tmp/*" or "Disallow: *.gif" must not appear in the file.

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

5. robots.txt File References

For more detailed information on robots.txt settings, see the following references:

· Web Server Administrator's Guide to the Robots Exclusion Protocol
· HTML Author's Guide to the Robots Exclusion Protocol
· The original 1994 protocol description, as currently deployed
· The revised Internet-Draft specification, which is not yet completed or implemented

Transferred from:

http://hi.baidu.com/devil_1832/blog/item/bf98e91f75cfc805314e1531.html

------------------------------------
