Search engine spiders and robots explained in detail

Source: Internet
Author: User
Keywords: search engine, robots.txt


For website administrators and content providers, there is sometimes content on a site that they do not want crawled by robots and made public. To solve this problem, the robots community offers two tools: one is robots.txt, and the other is the Robots META tag.

I, robots.txt

1, What is robots.txt?

Robots.txt is a plain text file. By declaring in this file which parts of a site should not be visited by robots, a site can keep some or all of its content out of search engines, or ensure that a designated search engine indexes only the content it specifies.

When a search robot visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the robot determines the scope of its visit according to the file's contents; if it does not, the robot simply crawls along the site's links.

Robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.

Website URL               Corresponding robots.txt URL
http://www.w3.org/        http://www.w3.org/robots.txt
http://www.w3.org:80/     http://www.w3.org:80/robots.txt
http://www.w3.org:1234/   http://www.w3.org:1234/robots.txt
http://w3.org/            http://w3.org/robots.txt
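
This lookup can also be done programmatically. Python's standard urllib.robotparser module downloads and evaluates a site's robots.txt; a minimal sketch (the page URL is only an illustration):

import urllib.robotparser

# Point the parser at the robots.txt in the site's root directory.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.w3.org/robots.txt")
rp.read()  # download and parse the file

# True if any robot ("*") may fetch the given URL under these rules.
print(rp.can_fetch("*", "http://www.w3.org/some/page.html"))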





2, robots.txt syntax







The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the following format:

    "<field>:<optionalspace><value><optionalspace>"

Comments can be written in this file using #, following the same convention as in UNIX. Records usually begin with one or more User-agent lines, followed by several Disallow lines, as detailed below:








User-agent:








The value of this field names the search-engine robot the record applies to. If a "robots.txt" file contains multiple User-agent records, more than one robot is restricted by the protocol; the file must contain at least one User-agent record. If the field is set to *, the record applies to any robot, and there can be only one "User-agent: *" record in the "robots.txt" file.








Disallow:








The value of this field is a URL that should not be visited. It can be a complete path or a prefix; any URL that begins with a Disallow value will not be visited by the robot. For example, "Disallow: /help" blocks search-engine access to both /help.html and /help/index.html, while "Disallow: /help/" lets the robot visit /help.html but not /help/index.html.





If a Disallow record is left empty, every part of the site may be visited; a "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search-engine robots.
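
The prefix behaviour of Disallow is easy to verify with urllib.robotparser, which can also parse rules fed to it in memory. A small sketch:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /help
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/help.html"))        # False: /help matches as a prefix
print(rp.can_fetch("*", "/help/index.html"))  # False
# With "Disallow: /help/" instead, /help.html would be allowed
# and /help/index.html would still be blocked.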








Here are some basic uses of robots.txt:





• Block all search engines from any part of the site:
User-agent: *
Disallow: /





• Allow all robots full access:
User-agent: *
Disallow:
(Alternatively, create an empty "/robots.txt" file.)





• Block all search engines from several sections of the site (the cgi-bin, tmp, and private directories in this example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/





• Block a single search engine (BadBot in this example):
User-agent: BadBot
Disallow: /





• Allow only a single search engine (WebCrawler in this example):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
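
That the specific record really takes precedence over the "*" record can be sanity-checked with the same urllib.robotparser module (the robot names are taken from the example above):

from urllib.robotparser import RobotFileParser

rules = """
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("WebCrawler", "/index.html"))   # True: the empty Disallow applies
print(rp.can_fetch("SomeOtherBot", "/index.html")) # False: falls under "User-agent: *"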








3, Common search engine robot names








Robot name        Search engine
Baiduspider       http://www.baidu.com
Scooter           http://www.altavista.com
ia_archiver       http://www.alexa.com
Googlebot         http://www.google.com
Inktomi Slurp     http://www.yahoo.com
FAST-WebCrawler   http://www.alltheweb.com
Slurp             http://www.inktomi.com
MSNBOT            http://search.msn.com








4, robots.txt examples





Here are the robots.txt files of some well-known sites:





http://www.cnn.com/robots.txt
http://www.google.com/robots.txt
http://www.ibm.com/robots.txt
http://www.sun.com/robots.txt
http://www.eachnet.com/robots.txt








5, Common robots.txt errors








• Reversed order:
Wrong:
User-agent: *
Disallow: Googlebot

Right:
User-agent: Googlebot
Disallow: /





• Putting multiple Disallow directives on one line:
Wrong:
Disallow: /css/ /cgi-bin/ /images/

Right:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/




• Leading spaces before a line:
For example:
      Disallow: /cgi-bin/
Although the standard does not mention it, a file written this way is easily misparsed.





• 404 redirecting to another page:
When a robot requests robots.txt from one of the many sites that lack the file, the 404 is often automatically redirected to another HTML page, and robots frequently process that HTML page as if it were a robots.txt file. Although this generally causes no problem, it is best to place an empty robots.txt file in the site's root directory.
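
One way to catch this is to request /robots.txt and verify that the response is a 200 with a plain-text body rather than a redirect to an HTML page. A minimal sketch in Python, assuming the site is reachable over HTTP (urlopen follows redirects, so a silent redirect shows up as a text/html body):

import urllib.error
import urllib.request

def robots_txt_looks_sane(site_root):
    """Fetch /robots.txt and check it is actually served as plain text."""
    url = site_root.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url) as resp:
            content_type = resp.headers.get("Content-Type", "")
            # A 404 silently redirected to an HTML page typically
            # comes back as text/html, not text/plain.
            return resp.status == 200 and content_type.startswith("text/plain")
    except urllib.error.HTTPError:
        return False  # a real 404 is at least unambiguous

print(robots_txt_looks_sane("http://www.w3.org"))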





• Using the wrong case. For example:
USER-AGENT: EXCITE
DISALLOW:
Although the standard's field names are not case-sensitive, directory and file names are, and should be written in lowercase:
User-agent: Googlebot
Disallow:





• The grammar has only Disallow, no Allow!
Wrong:
User-agent: Baiduspider
Disallow: /john/
Allow: /jane/





• Forgetting the slash (/):
Wrong:
User-agent: Baiduspider
Disallow: css

Right:
User-agent: Baiduspider
Disallow: /css/





The following small tool checks the validity of robots.txt files:





http://www.searchengineworld.com/cgi-bin/robotcheck.cgi





II, The Robots META tag








1, What is the Robots META tag?








The robots.txt file mainly restricts a search engine's access to an entire site or directory, whereas the Robots META tag targets one specific page. Like other META tags (such as the language used, the page description, and keywords), the Robots META tag is placed in the <head> section of the page and tells search engines how to treat that page's content. Its form looks like this (the robots line is the relevant part):

<head>
<meta name="robots" content="index,follow">
</head>








2, Writing the Robots META tag:








The Robots META tag is not case-sensitive. name="robots" addresses all search engines; to address a specific engine, write its robot name, such as name="Baiduspider". The content part has four directive options: index, noindex, follow, and nofollow, separated by commas (",").





The index directive tells the search robot that the page may be indexed;

The follow directive tells the search robot that it may continue crawling along the links on the page;

The default values of the Robots META tag are index and follow, except for Inktomi, for which the default is index, nofollow.








In this way, there are four combinations:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Of these,

<meta name="robots" content="index,follow"> can be written as <meta name="robots" content="all">;

<meta name="robots" content="noindex,nofollow"> can be written as <meta name="robots" content="none">.


Note that robots.txt and the Robots META tag described above merely state rules restricting how search-engine robots crawl a site's content; they depend on the robots' cooperation, and not every robot follows them.








At present, most search-engine robots comply with the rules in robots.txt. Support for the Robots META tag is much thinner, though it is gradually increasing; the well-known search engine Google fully supports it, and Google has also added a directive, "archive", which controls whether Google keeps a snapshot of the page. For example:

<meta name="googlebot" content="index,follow,noarchive">

means that the robot should index the page and crawl along its links, but Google should not keep a snapshot of the page.
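
For a sense of how a robot might read these directives, here is a small sketch using Python's standard html.parser; the class and the sample page string are illustrative only, not any search engine's actual code:

from html.parser import HTMLParser

class RobotsMetaReader(HTMLParser):
    """Collect directives from <meta name="robots"/"googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            content = attrs.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

reader = RobotsMetaReader()
reader.feed('<head><meta name="googlebot" content="index,follow,noarchive"></head>')
print(reader.directives)  # ['index', 'follow', 'noarchive']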





Example:

#robots, Scram

User-agent: *
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Mozilla/3.01 (hotwired-test/0.1)
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Slurp
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Scooter
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: Ultraseek
Disallow: /cgi-bin
#Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /PR
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: smallbear
Disallow: /cgi-bin
Disallow: /java
Disallow: /images
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /webmaster_logs
Disallow: /virtual
Disallow: /shockwave
Disallow: /transcripts
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search
Disallow: /alt_index.html

User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /java
Disallow: /images
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /webmaster_logs
Disallow: /virtual
Disallow: /shockwave
Disallow: /transcripts
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search
Disallow: /alt_index.html




