Robots Exclusion Protocol

(1) Introduction to the Robots Exclusion Protocol
When a robot accesses a Web site, such as http://www.some.com/, it first checks for the file http://www.some.com/robots.txt. If the file exists, the robot analyzes the records it contains, which follow this format:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

These records determine which files on the site the robot should retrieve. They are intended only for web robots; ordinary visitors will generally never see this file.
A website can have only one "/robots.txt" file, and every letter of the file name must be lowercase. In the record format, each individual "Disallow" line names a URL that you do not want the robot to access. Each URL must be on its own line; a combined form such as "Disallow: /cgi-bin/ /tmp/" is not allowed. Blank lines must not appear inside a record, because a blank line is the separator between records.
The User-Agent line names the robot or other agent the record applies to. In a User-Agent line, '*' stands for all robots.
Below are several examples of robots.txt files:
Deny all robots access to the entire server:

User-Agent: *
Disallow: /

Allow all robots to access the entire site (or create an empty "/robots.txt" file):

User-Agent: *
Disallow:

Deny all robots access to part of the server:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Deny a specific robot:

User-Agent: BadBot
Disallow: /

Allow only one robot:

User-Agent: WebCrawler
Disallow:

User-Agent: *
Disallow: /
Finally, here is the robots.txt file used by the W3C site:

# For use by search.w3.org
User-Agent: w3crobot/1
Disallow:

User-Agent: *
Disallow: /member/      # This is restricted to W3C members only
Disallow: /team/        # This is restricted to W3C team only
Disallow: /tands/member # This is restricted to W3C members only
Disallow: /tands/team   # This is restricted to W3C team only
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /team
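
As a practical aside, Python's standard library ships a parser for this protocol. The sketch below is only an illustration; the robot name "MyCrawler" is hypothetical, and the site URL is the placeholder from the introduction above.

from urllib.robotparser import RobotFileParser

robot_name = "MyCrawler"            # hypothetical robot name, for illustration
site = "http://www.some.com"        # placeholder site from the introduction

parser = RobotFileParser()
parser.set_url(site + "/robots.txt")
parser.read()                       # download and parse the robots.txt file

# can_fetch() applies the User-Agent and Disallow rules described above.
for path in ("/index.html", "/cgi-bin/search", "/tmp/cache.html"):
    url = site + path
    print(url, "->", "allowed" if parser.can_fetch(robot_name, url) else "blocked")

A well-behaved crawler would run such a check before every fetch and skip any URL the rules block.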
In addition to the robots.txt file, a page can carry instructions for robots in a robots meta tag. Its format is as follows:
<Meta name = "Robots" content = "noindex, nofollow"> "〉
Like other meta tags, it should be placed in the head area of the HTML file:
<HTML> 〉
<Head> 〉
<Meta name = "Robots" content = "noindex, nofollow"> "〉
<Meta name = "Description" content = "this page..."> ...."〉
<Title>... </title> 〉
</Head> 〉
<Body> 〉
...

Commands in the robots meta tag are separated by commas. The available commands are [no]index and [no]follow. The index command specifies whether an indexing robot may index this page; the follow command specifies whether the robot may follow the links on this page. The default values are index and follow. For example:

<Meta name = "Robots" content = "index, follow"> "〉
<Meta name = "Robots" content = "noindex, follow"> "〉
<Meta name = "Robots" content = "index, nofollow"> "〉
<Meta name = "Robots" content = "noindex, nofollow"> "〉
Management of Web robot programs should be considered when creating and maintaining web applications.

------------------------------------
Creating a robots.txt file:
If your website is www.yingz.com and you want to deny robots access to the img directory, write the following in robots.txt:
# For use by search.yingz.com
User-Agent: *
Disallow: /img/
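
You can sanity-check a file like this before deploying it. The following is a small sketch with Python's urllib.robotparser, parsing the lines in memory so no web server is needed:

from urllib.robotparser import RobotFileParser

rules = [
    "# For use by search.yingz.com",
    "User-Agent: *",
    "Disallow: /img/",
]

parser = RobotFileParser()
parser.parse(rules)    # parse the lines directly, no network access

print(parser.can_fetch("*", "http://www.yingz.com/index.html"))    # expected: True
print(parser.can_fetch("*", "http://www.yingz.com/img/logo.gif"))  # expected: False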
----------------------------------

Transferred from:

http://blog.sina.com.cn/s/blog_4cd012fc010085mb.html

------------------------------------

(2) Overview of the Robots Exclusion Protocol

1. What is a robots.txt file?

Search engines use a program called a robot (also known as a spider) to automatically visit webpages on the Internet and collect webpage information.

You can create a plain-text file named robots.txt on your website to declare which parts of the site you do not want robots to visit. In this way, some or all of the site's content can be kept out of search engine indexes, or a specified search engine can be directed to index only specified content.

2. Where should the robots.txt file be placed?

The robots.txt file must be placed in the root directory of the website. When a robot visits a website (such as http://www.abc.com), it first checks whether http://www.abc.com/robots.txt exists; if it does, the robot determines the scope of its access from the file's contents.

Website URL                      Corresponding robots.txt URL
http://www.w3.org/               http://www.w3.org/robots.txt
http://www.w3.org:80/            http://www.w3.org:80/robots.txt
http://www.w3.org:1234/          http://www.w3.org:1234/robots.txt
http://w3.org/                   http://w3.org/robots.txt
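
The mapping shown above is mechanical: take the URL's scheme and network location (including any port number) and append /robots.txt. A brief sketch using Python's standard urllib.parse module:

from urllib.parse import urlsplit

def robots_url(page_url):
    # Keep the scheme and the network location (host plus optional port),
    # then append the fixed path /robots.txt.
    parts = urlsplit(page_url)
    return "%s://%s/robots.txt" % (parts.scheme, parts.netloc)

for url in ("http://www.w3.org/",
            "http://www.w3.org:1234/Project/index.html",
            "http://w3.org/"):
    print(url, "->", robots_url(url))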

3. robots.txt File Format

The "robots.txt" file contains one or more records separated by empty rows (with Cr, CR/NL, or NL as the terminator). The format of each record is as follows:

"<Field >:< optionalspace> <value> <optionalspace> ".

In this file, you can use # For annotation. The usage is the same as that in UNIX. The record in this file usually starts with one or more lines of User-Agent, followed by several disallow lines. The details are as follows:

User-Agent:
The value of this field names the search engine robot to which the record applies. If the "robots.txt" file contains multiple User-Agent records, the protocol restricts more than one robot; the file must contain at least one User-Agent record. If the value is set to *, the record applies to all robots, and only one "User-Agent: *" record may appear in the file.

Disallow:
The value of this field describes a URL that should not be visited. The value may be a complete path or a prefix; any URL beginning with the Disallow value will not be accessed by the robot. For example, "Disallow: /help" denies search engines access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows the robot to access /help.html but not /help/index.html.
If the value of a Disallow record is empty, all parts of the website may be accessed. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the website is open to all search engine robots.
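
To make the prefix rule concrete, here is a deliberately minimal sketch in Python. It is not a complete robots.txt implementation, and the helper names are invented for illustration; it simply reproduces the /help versus /help/ behavior described above:

def parse_disallows(robots_txt, agent="*"):
    # Collect Disallow prefixes from records whose User-Agent line matches.
    prefixes = []
    applies = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments, as described above
        if not line:
            applies = False                    # a blank line ends the record
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies = applies or value == "*" or value.lower() == agent.lower()
        elif field == "disallow" and applies and value:
            prefixes.append(value)
    return prefixes

def allowed(path, prefixes):
    # A path is blocked if it begins with any Disallow prefix.
    return not any(path.startswith(p) for p in prefixes)

rules = parse_disallows("User-Agent: *\nDisallow: /help/")
print(allowed("/help.html", rules))        # True:  does not start with /help/
print(allowed("/help/index.html", rules))  # False: blocked by the /help/ prefix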

4. Examples of robots.txt File Usage

Example 1. Prohibit all search engines from accessing any part of the website

User-Agent: *
Disallow: /

Example 2. Allow all robots to access the entire site

(Alternatively, you can create an empty "/robots.txt" file.)

User-Agent: *
Disallow:

Example 3. Deny a specific search engine

User-Agent: BadBot
Disallow: /

Example 4. Allow only a specific search engine

User-Agent: baiduspider
Disallow:

User-Agent: *
Disallow: /

Example 5. A simple example

In this example, the website restricts search engine access to three directories; that is, search engines will not visit any of these three directories.
Note that each directory must be declared on its own line, not combined as "Disallow: /cgi-bin/ /tmp/".
Also note that the * following "User-Agent:" has the special meaning of "any robot", so records such as "Disallow: /tmp/*" or "Disallow: *.gif" must not appear in the file.

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

5. robots.txt File References

For more detailed information on robots.txt settings, see the following references:

· Web Server Administrator's Guide to the Robots Exclusion Protocol
· HTML Author's Guide to the Robots Exclusion Protocol
· The original 1994 protocol description, as currently deployed
· The revised Internet-Draft specification, which is not yet completed or implemented

Transferred from:

http://hi.baidu.com/devil_1832/blog/item/bf98e91f75cfc805314e1531.html

------------------------------------
