Methods to prohibit search engine inclusion

Source: Internet
Author: User

One. What is a robots.txt file?

A search engine uses a program called a robot (also known as a spider) to automatically access Web pages on the Internet and collect web information.

You can create a plain text file named robots.txt on your site. In this file you declare which parts of the site you do not want robots to access, so that part or all of the site's content is excluded from search engines, or so that a specified search engine indexes only the specified content.

Two. Where do I put the robots.txt file?

The robots.txt file should be placed in the root directory of the Web site. For example, when a robot visits a Web site (such as http://www.abc.com), it first checks whether the file http://www.abc.com/robots.txt exists on the site. If the robot finds the file, it determines the scope of its access rights based on the contents of the file.

Site URL                  URL of the corresponding robots.txt
http://www.w3.org/        http://www.w3.org/robots.txt
http://www.w3.org:80/     http://www.w3.org:80/robots.txt
http://www.w3.org:1234/   http://www.w3.org:1234/robots.txt
http://w3.org/            http://w3.org/robots.txt
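The mapping in the table above can be sketched in code: keep the scheme and host (including any port) of the site URL, and replace the path with /robots.txt. This is a minimal illustration using Python's standard urllib.parse; the function name robots_txt_url is made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url: str) -> str:
    """Derive the robots.txt URL for a site: the scheme and host
    (with any port) are kept, and the path becomes /robots.txt."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.w3.org/"))       # http://www.w3.org/robots.txt
print(robots_txt_url("http://www.w3.org:1234/"))  # http://www.w3.org:1234/robots.txt
```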

Three. Format of the robots.txt file

The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each record has the following format:

"<field>:<optionalspace><value><optionalspace>".

You can use # for comments in this file, following the same convention as in Unix. The records in this file typically start with one or more User-agent lines, followed by several Disallow lines, as detailed below:

User-agent:
The value of this field is the name of a search engine robot. If the "robots.txt" file contains multiple User-agent records, more than one robot is restricted by the protocol; the file must contain at least one User-agent record. If the value is set to *, the protocol applies to any robot, and there can be only one "User-agent: *" record in the "robots.txt" file.

Disallow:
The value of this field is a URL that you do not want robots to access; it can be a full path or a prefix. Any URL beginning with the Disallow value will not be accessed by a robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows robots to access /help.html but not /help/index.html.
If a Disallow record is empty, all parts of the site may be accessed; at least one Disallow record must be present in the "/robots.txt" file. If "/robots.txt" is an empty file, the site is open to all search engine robots.

Four. Examples of robots.txt file usage

Example 1. Prohibit all search engines from accessing any part of the site


User-agent: *
Disallow: /

Example 2. Allow all robots access

(Alternatively, you can create an empty "/robots.txt" file.)

User-agent: *
Disallow:

Example 3. Prohibit a specific search engine from accessing the site

User-agent: badbot
Disallow: /

Example 4. Allow access to only one search engine

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /

Example 5. A simple example

In this example, the site restricts search engine access to three directories; that is, search engines may not access these three directories.
Note that each directory must be declared on its own Disallow line and cannot be combined into one, such as "Disallow: /cgi-bin/ /tmp/".
Also note that * has a special meaning only after "User-agent:", where it stands for "any robot"; records such as "Disallow: /tmp/*" or "Disallow: *.gif" must not appear in this file.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
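The rules from this example can be verified with Python's standard urllib.robotparser; the host example.com below is a placeholder assumed for illustration.

```python
from urllib.robotparser import RobotFileParser

# Parse the rules from Example 5: three directories are off-limits
# to every robot, everything else is allowed.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /~joe/",
])

for path in ("/cgi-bin/test.cgi", "/tmp/a.html", "/~joe/index.html", "/index.html"):
    print(path, rp.can_fetch("*", "http://example.com" + path))
# The three restricted directories report False; /index.html reports True.
```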

Five. robots.txt File Reference

For more specific settings for the robots.txt file, see the following links:

· Web Server Administrator's Guide to the Robots Exclusion Protocol
· HTML Author's Guide to the Robots Exclusion Protocol
· The original 1994 protocol description, as currently deployed
· The revised Internet-Draft specification, which is not yet completed or implemented
