Detailed guide to writing robots.txt


What is robots.txt?

Search engines use spider programs to automatically visit web pages on the Internet and collect their content. When a spider visits a website, it first checks whether a plain text file named robots.txt exists in the root of the site. You can create a robots.txt file for your website that declares which parts of the site should not be visited by robots, or states that a given search engine may access only specific sections.
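As an illustration of this check, the sketch below uses Python's standard urllib.robotparser module to decide whether a crawler may fetch a URL under a given set of rules. The robot name "SomeBot", the host www.abc.com, and the rules themselves are placeholders for illustration, not part of any real site.

# A minimal sketch, assuming only the Python standard library.
from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /tmp/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("SomeBot", "http://www.abc.com/index.html"))   # True: not covered by any rule
print(rp.can_fetch("SomeBot", "http://www.abc.com/tmp/a.html"))   # False: /tmp/ is disallowed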

Please note that you only need a robots.txt file if your site contains content that you do not want search engines to index. If you want search engines to index everything on your site, do not create a robots.txt file at all, or create one with empty content.

Where to place robots.txt

The robots.txt file must be placed in the root directory of the website. For example, when a spider visits a website such as http://www.abc.com, it first checks whether the file http://www.abc.com/robots.txt exists. If the spider finds it, it determines the scope of its access rights based on the contents of that file.
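To make the placement rule concrete, here is a small sketch (my own illustration, not from the original article) of how a crawler derives the robots.txt location from any page URL on the same site; the page path used below is a placeholder.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # robots.txt always lives at the root of the host, whatever the page path is
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.abc.com/news/2024/index.html"))   # http://www.abc.com/robots.txt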

robots.txt format

The file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each record consists of lines of the form "<field>:<optional space><value>". The "#" character can be used for comments in this file. Records typically start with one or more User-agent lines, followed by a number of Disallow and Allow lines, as detailed below.
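The rough sketch below (an illustration only, not an official parser) shows one way to read this record format: text after "#" is treated as a comment, each remaining line is split into a field and a value at the first ":", and a blank line closes the current record. The rules and the robot name in the sample are placeholders.

def parse_records(text):
    records, current = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # strip comments and surrounding whitespace
        if not line:                           # a blank line ends the current record
            if current:
                records.append(current)
                current = []
            continue
        field, _, value = line.partition(":")
        current.append((field.strip(), value.strip()))
    if current:
        records.append(current)
    return records

sample = """\
# keep every robot out of a temporary directory
User-agent: *
Disallow: /tmp/

User-agent: SomeBot
Disallow: /
"""
for record in parse_records(sample):
    print(record)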

User-agent:

The value of this field is the name of a search engine robot. If the robots.txt file contains multiple User-agent records, the restrictions apply to each of the named robots; the file must contain at least one User-agent record. If the value is set to *, the record is valid for every robot, and there can be only one "User-agent: *" record in the file. If the file contains "User-agent: SomeBot" followed by several Disallow and Allow lines, then the robot named "SomeBot" is bound only by the Disallow and Allow lines that follow that "User-agent: SomeBot" line.

Disallow:

The value of this field describes a set of URLs that you do not want to be accessed. It can be either a full path or a non-empty prefix of a path; any URL that begins with the value of a Disallow field will not be accessed by the robot. For example, "Disallow: /help" blocks robot access to /help.html, /helpabc.html, and /help/index.html, while "Disallow: /help/" allows robot access to /help.html and /helpabc.html but blocks /help/index.html.

"Disallow:" description allows robot to access all URLs for the site, and at least one Disallow record in the "/robots.txt" file. If "/robots.txt" does not exist or is an empty file, the site is open for all search engine robot.

Allow:

The value of this field describes a set of URLs that you do want to be accessed. Like Disallow, it can be either a full path or a path prefix; any URL that begins with the value of an Allow field may be accessed by the robot. For example, "Allow: /hibaidu" allows robot access to /hibaidu.htm, /hibaiducom.html, and /hibaidu/com.html. All URLs of a site are allowed by default, so Allow is typically used together with Disallow to permit access to a subset of pages while blocking all other URLs.

It is important to note that the order of Disallow and Allow lines is significant: the robot decides whether to access a URL based on the first Allow or Disallow line that matches it.
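As a sanity check of this first-match rule, the sketch below uses Python's standard urllib.robotparser, whose current implementation also applies the first matching rule, so it can illustrate the effect of ordering on the /hibaidu example; the host and robot name are placeholders.

from urllib import robotparser

def allowed(rules, url, agent="SomeBot"):
    rp = robotparser.RobotFileParser()
    rp.parse(rules)
    return rp.can_fetch(agent, url)

url = "http://www.abc.com/hibaidu/com.html"
print(allowed(["User-agent: *", "Allow: /hibaidu", "Disallow: /"], url))   # True: "Allow: /hibaidu" matches first
print(allowed(["User-agent: *", "Disallow: /", "Allow: /hibaidu"], url))   # False: "Disallow: /" now matches first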

Use "*" and "$": Baiduspider supports using wildcard characters "*" and "$" to blur matching URLs. "$" matches the line terminator. "*" matches 0 or more arbitrary characters.

robots.txt File Usage Examples:

1. Allow all robots to access the whole site

User-agent: *
Allow: /

or

User-agent: *
Disallow:

2. Prohibit all search engines from accessing any part of the site

User-agent: *
Disallow: /

3. Only prohibit Baiduspider from accessing your website

User-agent: Baiduspider
Disallow: /

4. Only allow Baiduspider to access your website

User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /

5. Prevent spiders from accessing specific directories

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

6. Allow access to some URLs in a specific directory

User-agent: *
Allow: /cgi-bin/see
Allow: /tmp/hi
Allow: /~joe/look
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

7. Use "*" to restrict access to URLs

Block access to all URLs with the suffix ".htm" under the /cgi-bin/ directory (including its subdirectories).

User-agent: *
Disallow: /cgi-bin/*.htm

8. Restrict access to URLs with "$"

Only URLs with the suffix ".htm" may be accessed.

User-agent: *
Allow: /*.htm$
Disallow: /

9. Block access to all dynamic pages on the site

User-agent: *
Disallow: /*?*

10. Prevent Baiduspider from crawling any images on the website

Only web pages may be crawled; no images may be crawled.

User-agent: Baiduspider
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$

11. Only allow Baiduspider to crawl web pages and .gif images

Web pages and .gif images may be crawled; images in other formats may not be crawled.

User-agent: Baiduspider
Allow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.bmp$

12. Only prevent Baiduspider from crawling .jpg images

User-agent: Baiduspider
Disallow: /*.jpg$

Note: robots.txt rules are case-sensitive, and the file name itself should be all lowercase. Pay attention to case when writing rules: for example, to block http://lbicc.com/Abc.html, a rule written as "Disallow: /abc.html" will not take effect, because it only blocks /abc.html; the rule must be written as "Disallow: /Abc.html" to be effective.
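Using the same standard-library parser as in the earlier sketches (robot name and rules are placeholders), the case-sensitivity of rules can be verified directly:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /abc.html"])

print(rp.can_fetch("SomeBot", "http://lbicc.com/Abc.html"))   # True: the lowercase rule does not match /Abc.html
print(rp.can_fetch("SomeBot", "http://lbicc.com/abc.html"))   # False: only the exact-case path is blocked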

