What is robots.txt?
Search engines use spider programs to automatically visit Web pages on the Internet and collect their content. When a spider visits a Web site, it first checks whether a plain text file named robots.txt exists at the root of the site's domain. You can create a robots.txt file on your website to declare which parts of the site should not be visited by robots, or to specify that a given search engine may access only certain sections.
Note that you need a robots.txt file only if your site contains content that you do not want search engines to index. If you want search engines to index everything on your site, do not create a robots.txt file at all, or create one with empty content.
Where to Place robots.txt
The robots.txt file should be placed in the root directory of the Web site. For example, when a spider visits a Web site such as http://www.abc.com, it first checks whether http://www.abc.com/robots.txt exists; if the spider finds the file, it determines the scope of its access rights from the file's contents.
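As a rough sketch of this behavior, a polite crawler written in Python can load the file from the site root with the standard urllib.robotparser module and consult it before fetching a page. Here http://www.abc.com is the placeholder domain from the example above, and "MyBot" is a hypothetical robot name used only for illustration:

from urllib import robotparser

# Load the robots.txt at the site root, then check a target URL against it.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.abc.com/robots.txt")
rp.read()  # fetches the file over the network
print(rp.can_fetch("MyBot", "http://www.abc.com/some/page.html"))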
robots.txt format
The file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each line of a record has the form "<field>:<optional space><value>". The "#" character can be used for comments in this file. A record typically starts with one or more User-agent lines, followed by a number of Disallow and Allow lines, as detailed below.
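The small sketch below, again using Python's urllib.robotparser, parses a file with two records to illustrate this format; "ExampleBot" is a hypothetical robot name used only for illustration:

from urllib import robotparser

rules = """# block ExampleBot from /private/, leave all other robots unrestricted
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("ExampleBot", "/private/data.html"))  # False
print(rp.can_fetch("OtherBot", "/private/data.html"))    # True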
User-agent:
The value of this field is the name of a search engine robot. If the "robots.txt" file contains multiple User-agent records, then multiple robots are restricted by the file; the file must contain at least one User-agent record. If the value is set to *, the record applies to every robot, and there can be only one "User-agent: *" record in the "robots.txt" file. If the file contains "User-agent: SomeBot" followed by a number of Disallow and Allow lines, then only the robot named "SomeBot" is subject to those Disallow and Allow lines.
Disallow:
The value of this field describes a set of URLs that should not be accessed. It can be a full path or a non-empty path prefix; any URL that begins with the value of a Disallow entry will not be accessed by the robot. For example, "Disallow: /help" blocks robot access to /help.html, /helpabc.html, and /help/index.html, while "Disallow: /help/" allows the robot to access /help.html and /helpabc.html but not /help/index.html.
"Disallow:" description allows robot to access all URLs for the site, and at least one Disallow record in the "/robots.txt" file. If "/robots.txt" does not exist or is an empty file, the site is open for all search engine robot.
Allow:
The value of this field describes a set of URLs that are allowed to be accessed. Like the Disallow field, it can be a full path or a path prefix; any URL that begins with the value of an Allow entry may be accessed by the robot. For example, "Allow: /hibaidu" allows robot access to /hibaidu.htm, /hibaiducom.html, and /hibaidu/com.html. All URLs on a site are allowed by default, so Allow is typically used together with Disallow to permit access to a subset of pages while blocking all other URLs.
It is important to note that the order of Disallow and Allow lines is significant: the robot decides whether a URL may be accessed based on the first Allow or Disallow line that matches it.
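A brief sketch of this first-match rule, borrowing the /tmp/ rules from example 6 further below (Python's urllib.robotparser also evaluates rules in file order and stops at the first match):

from urllib import robotparser

rules = """User-agent: *
Allow: /tmp/hi
Disallow: /tmp/
"""
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("*", "/tmp/hi.html"))     # True: "Allow: /tmp/hi" matches first
print(rp.can_fetch("*", "/tmp/other.html"))  # False: only "Disallow: /tmp/" matches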
Use "*" and "$": Baiduspider supports using wildcard characters "*" and "$" to blur matching URLs. "$" matches the line terminator. "*" matches 0 or more arbitrary characters.
robots.txt File Usage Examples:
1. Allow all robots to access the entire site
User-agent: *
Allow: /
or:
User-agent: *
Disallow:
2. Prohibit all search engines from accessing any part of the site
User-agent: *
Disallow: /
3. Block only Baiduspider from accessing your site
User-agent: Baiduspider
Disallow: /
4. Allow only Baiduspider to access your site
User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
5. Prevent spiders from accessing specific directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
6. Allow access to some URLs in a specific directory
User-agent: *
Allow: /cgi-bin/see
Allow: /tmp/hi
Allow: /~joe/look
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
7. Use "*" to restrict access to URLs
Block access to all URLs in the /cgi-bin/ directory (including its subdirectories) with the ".htm" suffix.
User-agent: *
Disallow: /cgi-bin/*.htm
8. Use "$" to restrict access to URLs
Only URLs ending with the ".htm" suffix may be accessed.
User-agent: *
Allow: .htm$
Disallow: /
9. Prohibit access to all dynamic pages on the site
User-agent: *
Disallow: /*?*
10. Prohibit Baiduspider from crawling all images on the site
Only Web pages may be crawled; no images may be fetched.
User-agent: Baiduspider
Disallow: .jpg$
Disallow: .jpeg$
Disallow: .gif$
Disallow: .png$
Disallow: .bmp$
11. Allow Baiduspider to crawl only Web pages and .gif images
Web pages and GIF images may be crawled; images in other formats may not.
User-agent: Baiduspider
Allow: .gif$
Disallow: .jpg$
Disallow: .jpeg$
Disallow: .png$
Disallow: .bmp$
12. Prohibit Baiduspider from crawling only .jpg images
User-agent: Baiduspider
Disallow: .jpg$
Note: robots.txt matching is case-sensitive. The file name itself should be all lowercase (robots.txt), and the case used in the rules must match the URLs. For example, to block http://lbicc.com/Abc.html, writing "Disallow: /abc.html" in the rules has no effect on that URL; it only blocks /abc.html. The rule must be written as "Disallow: /Abc.html" to take effect.
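A short sketch of this case sensitivity, again with Python's urllib.robotparser (which compares paths with a plain, case-sensitive prefix check):

from urllib import robotparser

rules = """User-agent: *
Disallow: /abc.html
"""
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("*", "/Abc.html"))  # True: the case differs, so the rule does not apply
print(rp.can_fetch("*", "/abc.html"))  # False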
Detailed robots.txt Syntax