Introduction to robots.txt
Example: http://www.baidu.com/robots.txt
robots.txt is a plain text file in which a website administrator can declare which parts of the site robots should not access, or specify that a particular search engine may include only certain content.
When a search robot (also called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines which parts of the site it may access from the contents of that file. If the file does not exist, the robot simply crawls the site by following its links.
In addition, robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.
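Because the file always sits in the root directory, a crawler can work out where to look from any page URL. The following is only a minimal sketch of that step in Python; the function name robots_url and the sample page path are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt lives in the site root,
    # and its file name is always lowercase.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is a hypothetical example.
print(robots_url("http://www.seovip.cn/some/dir/page.html?id=1"))
```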
robots.txt syntax
First, let's look at a robots.txt example: http://www.seovip.cn/robots.txt
The specific content of robots.txt is as follows:
# Robots.txt file from http://www.seovip.cn
# All robots will spider the domain
User-agent: *
Disallow:
The above text indicates that all search robots are allowed to access all files under www.seovip.cn.
Syntax breakdown: text following # is a comment. User-agent gives the name of the search robot that the rules apply to, and * stands for all search robots. Disallow gives the file or directory that must not be accessed; an empty Disallow value means nothing is off limits.
Next, let me list some specific uses of robots.txt:
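To make the three elements concrete (comments, User-agent, Disallow), here is a small hand-written parser in Python. It is only a sketch under simple assumptions: it handles nothing beyond these three elements, and the function name parse_robots_txt and the dictionary layout are my own choices, not part of any standard.

```python
def parse_robots_txt(text: str) -> list[dict]:
    """Split robots.txt text into groups of user agents and Disallow paths."""
    groups = []
    current = None
    last_field = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()  # field names are matched case-insensitively
        if field == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            if current is None or last_field != "user-agent":
                current = {"user_agents": [], "disallow": []}
                groups.append(current)
            current["user_agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
        else:
            continue  # fields outside this sketch are ignored
        last_field = field
    return groups


EXAMPLE = """\
# Robots.txt file from http://www.seovip.cn
# All robots will spider the domain
User-agent: *
Disallow:
"""

print(parse_robots_txt(EXAMPLE))
```

The empty string in the disallow list is the "nothing is forbidden" case from the example above.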
Allow access by all robots
User-agent: *
Disallow:
Alternatively, you can create an empty "/robots.txt" file.
Prohibit all search engines from accessing any part of the website
User-agent: *
Disallow: /
Prohibit all search engines from accessing certain parts of the website (the 01, 02, and 03 directories in the following example)
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Prohibit access by one particular search engine (badbot in the following example)
User-agent: badbot
Disallow: /
Allow access by only one search engine (Crawler in the following example)
User-agent: Crawler
Disallow:
User-agent: *
Disallow: /
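If you want to check what such rules actually permit, Python's standard library includes urllib.robotparser. The sketch below feeds it two of the rule sets above and asks whether a given robot may fetch a given URL. The helper name allowed, the robot names AnyBot and OtherBot, and the www.example.com URLs are placeholders of mine.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return parser.can_fetch(user_agent, url)


# Block three directories for every robot.
BLOCK_DIRS = """\
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
"""

# Let only "Crawler" in and keep every other robot out.
ONLY_CRAWLER = """\
User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
"""

print(allowed(BLOCK_DIRS, "AnyBot", "http://www.example.com/01/a.html"))
print(allowed(BLOCK_DIRS, "AnyBot", "http://www.example.com/04/a.html"))
print(allowed(ONLY_CRAWLER, "Crawler", "http://www.example.com/a.html"))
print(allowed(ONLY_CRAWLER, "OtherBot", "http://www.example.com/a.html"))
```

To check a live site instead, the same class offers set_url() and read() to download the file before calling can_fetch().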
In addition, I think it is worth extending the discussion to introduce the robots meta tag:
The robots meta tag mainly targets individual pages. Like other meta tags (such as the language used, page description, and keywords), the robots meta tag is placed in the <head> section of the page.
Syntax of the robots meta tag:
The robots meta tag is case-insensitive. name="robots" addresses all search engines; to address a specific search engine, use its robot name instead, for example name="baiduspider". The content attribute has four command options: index, noindex, follow, and nofollow. Commands are separated by commas.
The index command tells the search robot that it may index the page; noindex forbids it.
The follow command tells the search robot that it may continue crawling along the links on the page; nofollow forbids it.
The default values of the robots meta tag are index and follow, except for Inktomi, whose defaults are index and nofollow.
In this way, there are four combinations:
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, nofollow">
Of these,
<meta name="robots" content="index, follow"> can also be written as <meta name="robots" content="all">;
<meta name="robots" content="noindex, nofollow"> can also be written as <meta name="robots" content="none">.
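The way a content value resolves to one of the four combinations can be sketched in a few lines of Python. This is an illustration only: the function name parse_robots_content is invented, the defaults are the index and follow values mentioned above, and any command other than the ones listed here is simply ignored.

```python
def parse_robots_content(content: str) -> dict:
    """Resolve a robots meta content string into index/follow flags."""
    flags = {"index": True, "follow": True}  # default: index, follow
    for token in content.lower().split(","):  # commands are case-insensitive
        token = token.strip()
        if token == "all":
            flags = {"index": True, "follow": True}
        elif token == "none":
            flags = {"index": False, "follow": False}
        elif token in ("index", "follow"):
            flags[token] = True
        elif token in ("noindex", "nofollow"):
            flags[token[2:]] = False  # strip the "no" prefix
    return flags


print(parse_robots_content("noindex, follow"))
print(parse_robots_content("NONE"))
```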
Currently, the vast majority of search engine robots comply with the robots.txt rules. Support for the robots meta tag is less widespread so far, but it is gradually increasing. For example, the well-known search engine Google supports it fully, and Google has also added the "noarchive" command, which controls whether Google keeps a snapshot of the page. For example:
<meta name="googlebot" content="index, follow, noarchive">
This tells Google to index the page and crawl along the links on it, but not to keep a snapshot of the page on Google.
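A robot reading a page has to pull these tags out of the HTML first. The following is a rough sketch using Python's built-in html.parser module. The class name RobotsMetaCollector, the function collect_robots_meta, and the default list of robot names are assumptions made for this example.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect the commands of robots meta tags, keyed by robot name."""

    def __init__(self, names):
        super().__init__()
        self.names = {name.lower() for name in names}
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name not in self.names:
            return  # skip description, keywords, and other meta tags
        tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
        self.directives.setdefault(name, []).extend(t for t in tokens if t)


def collect_robots_meta(html: str, names=("robots", "googlebot")) -> dict:
    collector = RobotsMetaCollector(names)
    collector.feed(html)
    collector.close()
    return collector.directives


PAGE = """<html><head>
<meta name="keywords" content="seo, robots">
<meta name="googlebot" content="index, follow, noarchive">
</head><body></body></html>"""

print(collect_robots_meta(PAGE))
```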