For Web site managers and content providers, there is sometimes content on a site that they do not want robots to crawl and expose. To solve this problem, the robots community offers two tools: one is robots.txt, and the other is the Robots META tag.
I, robots.txt
1, What is robots.txt?
robots.txt is a plain text file in which a site declares the parts it does not want robots to visit. In this way, part or all of a site's content can be excluded from search engines, or a specified search engine can be limited to indexing only specified content.
When a search robot visits a site, it first checks whether robots.txt exists in the root directory of the site. If it does, the robot determines the scope of its access from the contents of that file; if the file does not exist, the robot simply crawls along the links.
robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.
Web site URL                Corresponding robots.txt URL
http://www.w3.org/          http://www.w3.org/robots.txt
http://www.w3.org:80/       http://www.w3.org:80/robots.txt
http://www.w3.org:1234/     http://www.w3.org:1234/robots.txt
http://w3.org/              http://w3.org/robots.txt
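As a quick illustration of the table above, here is a minimal sketch (Python standard library only; the function name is our own) of how a crawler derives the robots.txt URL from a site URL:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url):
    # Keep the scheme, host and port; replace path, query and fragment
    # with /robots.txt, as in the table above.
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.w3.org/"))       # http://www.w3.org/robots.txt
print(robots_txt_url("http://www.w3.org:1234/"))  # http://www.w3.org:1234/robots.txt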
2, robots.txt syntax
The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the form:
"<field>:<optional space><value><optional space>"
Comments can be added in this file with #, following the same convention as in UNIX. Records in this file usually begin with one or more User-agent lines, followed by several Disallow lines, as detailed below:
User-agent:
The value of this field describes the name of the search engine robot. If there are multiple User-agent records in "robots.txt", more than one robot is restricted by the protocol, so the file must contain at least one User-agent record. If the value is set to *, the protocol applies to every robot, and there can be only one "User-agent: *" record in the file.
Disallow:
The value of this field describes a URL that should not be visited. It can be a full path or a prefix; any URL beginning with the Disallow value will not be visited by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, whereas "Disallow: /help/" lets the robot visit /help.html but not /help/index.html.
An empty Disallow record means every part of the site may be visited, and there must be at least one Disallow record in "/robots.txt". If "/robots.txt" is an empty file, the site is open to all search robots.
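The prefix behavior of Disallow can be checked with Python's standard urllib.robotparser; the following sketch feeds the rules in directly (example.com is a placeholder host):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /help",   # prefix match: blocks /help.html and /help/index.html
])
print(rp.can_fetch("*", "http://example.com/help.html"))        # False
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /help/",  # blocks only URLs under /help/
])
print(rp.can_fetch("*", "http://example.com/help.html"))        # True
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False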
Below are some basic uses of robots.txt:
● Disallow all search engines from any part of the site:
User-agent: *
Disallow: /

● Allow all robots full access:
User-agent: *
Disallow:
Or simply create an empty "/robots.txt" file.

● Disallow all search engines from several sections of the site (the cgi-bin, tmp, and private directories in this example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

● Disallow a particular search engine (BadBot in this example):
User-agent: BadBot
Disallow: /

● Allow only a particular search engine (WebCrawler in this example; a sketch verifying these rules follows this list):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
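The last example can be verified with the same standard-library parser; this is only a sketch, and example.com is again a placeholder:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: WebCrawler",
    "Disallow:",          # empty value: WebCrawler may fetch everything
    "",
    "User-agent: *",
    "Disallow: /",        # every other robot is shut out
])
print(rp.can_fetch("WebCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("BadBot", "http://example.com/index.html"))      # False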
3, Common search engine robot names
Robot name          Search engine
Baiduspider         http://www.baidu.com
Scooter             http://www.altavista.com
ia_archiver         http://www.alexa.com
Googlebot           http://www.google.com
Inktomi Slurp       http://www.yahoo.com
FAST-WebCrawler     http://www.alltheweb.com
Slurp               http://www.inktomi.com
MSNBOT              http://search.msn.com
4, robots.txt examples
Below are the robots.txt files of some well-known sites:
http://www.cnn.com/robots.txt
http://www.google.com/robots.txt
http://www.ibm.com/robots.txt
http://www.sun.com/robots.txt
http://www.eachnet.com/robots.txt
5, Common robots.txt errors
● Reversed order:
Wrong:
User-agent: *
Disallow: GoogleBot
Right:
User-agent: GoogleBot
Disallow: /
● Multiple Disallow directives on one line:
Wrong:
Disallow: /css/ /cgi-bin/ /images/
Right:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
● Lots of whitespace at the start of a line:
Wrong:
      Disallow: /cgi-bin/
Although the standard says nothing about this, leading spaces are an easy way to go wrong.
● 404 redirects to another page:
Many sites that have no robots.txt configured automatically redirect a 404 to another HTML page when a robot requests the file, and the robot may then process that HTML page as if it were robots.txt. Although this usually causes no problem, it is best to put an empty robots.txt file in the site root (a sketch for detecting this appears after this list).
● Using all uppercase, such as:
USER-AGENT: EXCITE
DISALLOW:
Although the standard is case-insensitive, directory and file names should be lowercase:
User-agent: GoogleBot
Disallow:
● The syntax has only Disallow, no Allow!
Wrong:
User-agent: Baiduspider
Disallow: /john/
Allow: /jane/
● Forgetting the slash /:
Wrong:
User-agent: Baiduspider
Disallow: css
Right:
User-agent: Baiduspider
Disallow: /css/
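The 404-redirect problem mentioned above can be detected with a request like the following sketch (Python standard library only; the URL is just an example):

from urllib.error import HTTPError
from urllib.request import urlopen

def check_robots(url):
    try:
        with urlopen(url) as resp:
            ctype = resp.headers.get("Content-Type", "")
            print(resp.status, ctype)
            if "text/html" in ctype:
                # An HTML answer suggests a 404 silently redirected to a page.
                print("Warning: robots.txt looks like an HTML page")
    except HTTPError as e:
        print("HTTP error:", e.code)  # a clean 404 is harmless to robots

check_robots("http://www.w3.org/robots.txt")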
The following tool checks the validity of robots.txt files:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
II, Robots META tag
1, What is the Robots META tag?
The robots.txt file mainly restricts a search engine's access to an entire site or directory, whereas the Robots META tag targets one specific page. Like other META tags (such as the language used, the page description, and keywords), the Robots META tag is placed in the <head> section of the page and tells search engines how to crawl the content of that page. Its form is similar:
<meta name="robots" content="index,follow">
2, How to write the Robots META tag:
There is no case distinction in the Robots META tag. name="robots" addresses all search engines; to address a specific engine, write its name instead, e.g. name="Baiduspider". The content section has four directive options: index, noindex, follow, and nofollow, separated by ",".
The index directive tells the search robot that the page may be crawled;
The follow directive tells the search robot that it may continue crawling along the links on the page;
The defaults of the Robots META tag are index and follow, except for Inktomi, whose defaults are index, nofollow.
In this way there are four combinations:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
Of these, "index,follow" can be written as "all", and "noindex,nofollow" can be written as "none".
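How a robot might read the tag can be sketched with Python's standard html.parser; the HTML snippet below is a made-up example:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Both the tag name and the directives are case-insensitive.
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives = [d.strip().lower()
                               for d in a.get("content", "").split(",")]

p = RobotsMetaParser()
p.feed('<html><head><meta name="ROBOTS" content="NOINDEX,FOLLOW"></head></html>')
print(p.directives)  # ['noindex', 'follow']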
Note that the robots.txt and Robots META tag restrictions described above are only conventions for search engine robots; they depend on the robots' cooperation, and not every robot follows them.
At present most search engine robots comply with robots.txt, while support for the Robots META tag is much scarcer, though growing. The well-known search engine Google fully supports it, and Google has also added a directive, "archive", that controls whether Google keeps a snapshot of the page. For example:
<meta name="googlebot" content="index,follow,noarchive">
means: crawl the page in this site and follow its links, but do not keep a snapshot of the page on Google.
Example:
# robots, Scram

User-agent: *
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: mozilla/3.01 (hotwired-test/0.1)
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Slurp
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Scooter
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Ultraseek
Disallow: /cgi-bin
#Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: smallbear
Disallow: /cgi-bin
Disallow: /java
Disallow: /images
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /webmaster_logs
Disallow: /virtual
Disallow: /shockwave
Disallow: /transcripts
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search
Disallow: /alt_index.html

User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /java
Disallow: /images
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /webmaster_logs
Disallow: /virtual
Disallow: /shockwave
Disallow: /transcripts
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search
Disallow: /alt_index.html