Robots.txt and Robots meta tags

Source: Internet
Author: User
Keywords: search engines, robots, robots.txt, meta tags


Author: Pingwen




We know that search engines have their own "search robots", which build their databases by continuously crawling pages on the web and following their links (typically HTTP and src links). For website administrators and content providers, there is sometimes content on a site that they would prefer not to have crawled by robots and exposed. To solve this problem, the robots community offers two tools: one is robots.txt, and the other is the robots meta tag.




I. robots.txt




1. What is robots.txt?




Robots.txt is a plain text file in which a site declares the parts it does not want robots to visit. In this way, some or all of a site's content can be kept out of search engines, or a site can tell a particular search engine to include only specified content.




When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory. If the file is found, the robot determines the scope of its access from the contents of the file; if the file does not exist, the robot simply crawls along the site's links.




Robots.txt must be placed at the root of a site, and the filename must be all lowercase.




Web site URL               Corresponding robots.txt URL
http://www.w3.org/         http://www.w3.org/robots.txt
http://www.w3.org:80/      http://www.w3.org:80/robots.txt
http://www.w3.org:1234/    http://www.w3.org:1234/robots.txt
http://w3.org/             http://w3.org/robots.txt
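In code, the mapping in the table is simply "keep the scheme, host, and port; replace the path with /robots.txt". Below is a minimal sketch using Python's standard urllib.parse module; the helper name robots_txt_url is this article's own illustration, not a standard function.

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url):
    # Keep scheme, host, and port; force the path to /robots.txt.
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

for site in ("http://www.w3.org/", "http://www.w3.org:1234/", "http://w3.org/"):
    print(site, "->", robots_txt_url(site))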


2. robots.txt syntax




The "robots.txt" file contains one or more records that are separated by a blank line (with CR,CR/NL, or NL as The Terminator), and each record is formatted as follows:




"<field>:<optionalspace><value><optionalspace>".




Comments can be added in this file with #, in the same way as in UNIX. A record usually starts with one or more User-agent lines, followed by a number of Disallow lines. The details are as follows:




User-agent:




The value of this field names the search engine robot. If "robots.txt" contains multiple User-agent records, more than one robot is bound by the protocol; the file must contain at least one User-agent record. If the value is set to *, the protocol applies to every robot, and in that case "robots.txt" may contain only one "User-agent: *" record.




Disallow:




The value of this field describes a URL that should not be visited; it can be a full path or a prefix, and any URL beginning with that value will not be accessed by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" lets the robot access /help.html but not /help/index.html.
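To make the /help example concrete, here is a minimal sketch using Python's standard urllib.robotparser module, one common implementation of the protocol (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

# "Disallow: /help" is a prefix rule: it blocks /help.html and /help/index.html.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
print(rp.can_fetch("*", "http://example.com/help.html"))        # False
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False

# "Disallow: /help/" only blocks the /help/ directory, so /help.html stays accessible.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help/"])
print(rp.can_fetch("*", "http://example.com/help.html"))        # True
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False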




An empty Disallow value means that every part of the site may be visited; the "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.




Here are some basic uses of robots.txt:




• Block all search engines from any part of the site:
User-agent: *
Disallow: /




• Allow all robots full access:
User-agent: *
Disallow:
Alternatively, create an empty "/robots.txt" file.




• Block all search engines from several parts of the site (the cgi-bin, tmp, and private directories in this example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/




• Block a single search engine (BadBot in this example):
User-agent: BadBot
Disallow: /




• Allow only a single search engine (WebCrawler in this example):
User-agent: WebCrawler
Disallow:




User-agent: *
Disallow: /
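Such rules can also be tested programmatically. The sketch below, again using Python's standard urllib.robotparser and written only as an illustration, parses the last example and confirms that only WebCrawler is admitted (example.com and BadBot are placeholders):

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: WebCrawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# WebCrawler may fetch anything; every other robot is turned away.
print(rp.can_fetch("WebCrawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("BadBot", "http://example.com/page.html"))      # False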




3. Names of common search engine robots




Robot name         Search engine
Baiduspider        http://www.baidu.com
Scooter            http://www.altavista.com
ia_archiver        http://www.alexa.com
Googlebot          http://www.google.com
FAST-WebCrawler    http://www.alltheweb.com
Slurp              http://www.inktomi.com
MSNBOT             http://search.msn.com


4. robots.txt examples




Here are the robots.txt files of some well-known sites:







http://www.cnn.com/robots.txt
http://www.google.com/robots.txt
http://www.ibm.com/robots.txt
http://www.sun.com/robots.txt
http://www.eachnet.com/robots.txt


5. Common robots.txt Errors




• Reversed order:
Incorrect:
User-agent: *
Disallow: Googlebot

Correct:
User-agent: Googlebot
Disallow: /




• Multiple disallowed directories on one line:
Incorrect, for example:
Disallow: /css/ /cgi-bin/ /images/

Correct:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/




• Leading whitespace at the start of a line:
For example:
     Disallow: /cgi-bin/
Although the standard says nothing about this, it can easily cause problems.




• 404 redirects to another page:
When a robot requests robots.txt from one of the many sites that have not set one up, it is often redirected to a custom 404 HTML page, and the robot may then treat that HTML page as if it were the robots.txt file. Although this usually does no harm, it is best to put an empty robots.txt file in the site's root directory.
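One way for a site owner to spot this situation is to request /robots.txt directly and look at the final URL, status code, and Content-Type. Below is a minimal sketch using Python's standard urllib; http://example.com/ is a placeholder for your own site.

import urllib.request

url = "http://example.com/robots.txt"   # placeholder; use your own site

with urllib.request.urlopen(url) as resp:
    print("Final URL:   ", resp.geturl())                     # reveals any redirect
    print("Status:      ", resp.status)
    print("Content-Type:", resp.headers.get("Content-Type"))  # should be text/plain

# A healthy robots.txt is served from the original URL as text/plain;
# a redirect to an HTML 404 page shows up here immediately.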




• Uppercase directives. For example:
User-agent: excite
DISALLOW:
Although the standard is case-insensitive, directory and file names should be lowercase:
User-agent: googlebot
Disallow:




• The syntax has only Disallow, not Allow!
Incorrect:
User-agent: Baiduspider
Disallow: /john/
Allow: /jane/




• Forgetting the slash (/)
Incorrect:
User-agent: Baiduspider
Disallow: css

Correct:
User-agent: Baiduspider
Disallow: /css/
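Most of these mistakes can be caught mechanically. The following deliberately small lint pass is this article's own illustration (lint_robots_txt is not a standard tool): it flags all-uppercase directives, leading whitespace, Disallow values without a leading slash, and Allow lines, which the original standard does not define.

def lint_robots_txt(text):
    """Flag a few of the common robots.txt mistakes described above."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0]              # comments work as in UNIX
        if not line.strip() or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip(), value.strip()
        if raw != raw.lstrip():
            problems.append(f"line {lineno}: leading whitespace may confuse some robots")
        if field.isupper() and len(field) > 1:
            problems.append(f"line {lineno}: avoid all-uppercase directives")
        if field.lower() == "allow":
            problems.append(f"line {lineno}: the original standard has no Allow directive")
        if field.lower() == "disallow" and value and not value.startswith("/"):
            problems.append(f"line {lineno}: Disallow value should start with /")
    return problems

# The 'reversed order' and 'forgot the slash' mistakes from above:
bad = "User-agent: *\nDisallow: googlebot\n   DISALLOW: css\nAllow: /jane/\n"
for problem in lint_robots_txt(bad):
    print(problem)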




The following small online tool checks the validity of a robots.txt file:




http://www.searchengineworld.com/cgi-bin/robotcheck.cgi




II. Robots meta tags




1. What is a robots meta tag




While robots.txt mainly restricts a search engine's access to an entire site or directory, the robots meta tag targets a specific page. Like other meta tags (such as the language used, the page description, keywords, and so on), the robots meta tag is placed in the <head></head> section of the page and tells search engines how to crawl the content of that page. The exact form looks like this (see the robots meta line):




<html>
<head>
<title>Times Marketing -- the professional network marketing portal</title>
<meta name="robots" content="index,follow">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="keywords" content="marketing ...">
<meta name="description" content="Times Marketing Network is ...">
<link rel="stylesheet" href="/public/css.css" type="text/css">
</head>
<body>
...
</body>
</html>




2. How to write the robots meta tag:




The robots meta tag is not case-sensitive. name="robots" means the tag applies to all search engines; for a specific search engine it can be written as, for example, name="Baiduspider". The content part has four directive options: index, noindex, follow, and nofollow, separated by ",".




The index directive tells the search robot that the page may be indexed;




The follow directive tells the search robot that it may continue crawling along the links on the page;




The default values for the robots meta tag are index and follow, except for Inktomi, whose default is index,nofollow.




This gives four combinations:




<meta name= "ROBOTS" content= "Index,follow" >




<meta name= "ROBOTS" content= "Noindex,follow" >




<meta name= "ROBOTS" content= "Index,nofollow" >




<meta name= "ROBOTS" content= "Noindex,nofollow" >




Of these,




<meta name= "ROBOTS" content= "Index,follow" > can be written




<meta name= "ROBOTS" content= "All" >;




<meta name= "ROBOTS" content= "Noindex,nofollow" > can be written




<meta name= "ROBOTS" content= "NONE" >




Note that the robots.txt file and robots meta tags described above merely state rules that restrict search engine robots from crawling site content; they rely on the cooperation of the robots, and not every robot follows them.




At present, the vast majority of search engine robots comply with the rules in robots.txt. Support for the robots meta tag is more limited, but it is gradually increasing; for example, the well-known search engine Google fully supports it, and Google has also added the directive "noarchive", which controls whether Google keeps a snapshot of the page. For example:




<meta name= "Googlebot" content= "index,follow,noarchive" >




This means: index the page and follow the links on it, but do not keep a snapshot of the page on Google.




