Robots meta tags and robots.txt files

Search engines run automated programs, often called "search robots" or "spiders", that build their databases by continuously crawling the web, following the links on each page (typically href and src links).

Website administrators and content providers sometimes have content that they do not want robots to crawl and expose. To address this, the robots community offers two mechanisms: the robots.txt file and the robots meta tag.

I. robots.txt
1. What is robots.txt?

robots.txt is a plain text file. By declaring in this file which parts of the site should not be visited by robots, a site can keep some or all of its content out of search engine indexes, or allow a designated search engine to index only specified content.

When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its scope of access from the contents of the file; if the file does not exist, the robot crawls along links without restriction.

robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.

Website URL                  Corresponding robots.txt URL
http://www.w3.org/           http://www.w3.org/robots.txt
http://www.w3.org:80/        http://www.w3.org:80/robots.txt
http://www.w3.org:1234/      http://www.w3.org:1234/robots.txt
http://w3.org/               http://w3.org/robots.txt
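The mapping in the table above is mechanical: keep the scheme, host, and port, and replace the path with /robots.txt. A minimal Python sketch of this derivation:

```python
from urllib.parse import urlparse

def robots_txt_url(site_url):
    """Derive the robots.txt URL for a site: keep scheme, host,
    and port; replace the path with /robots.txt."""
    parts = urlparse(site_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("http://www.w3.org/"))       # http://www.w3.org/robots.txt
print(robots_txt_url("http://www.w3.org:1234/"))  # http://www.w3.org:1234/robots.txt
```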

2. robots.txt syntax

The robots.txt file contains one or more records separated by blank lines (with CR, CR/LF, or LF as the line terminator). Each line of a record has the following format:

"<field>:<optionalspace><value><optionalspace>"


Comments can be added with #, using the same convention as in UNIX. Records in this file usually begin with one or more User-agent lines, followed by a number of Disallow lines, as detailed below:

User-agent:

The value of this field names the search engine robot that the record applies to. If the robots.txt file contains multiple User-agent records, more than one robot is restricted by the protocol; the file must contain at least one User-agent record. If the value is set to *, the record applies to every robot, and there can be only one "User-agent: *" record in the file.

Disallow:

The value of this field describes a URL that should not be visited. It can be a complete path or a prefix; any URL that begins with a Disallow value will not be visited by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows the robot to visit /help.html but not /help/index.html.

An empty Disallow value means that every part of the site may be visited; each record in robots.txt must contain at least one Disallow line. If /robots.txt is an empty file, the site is open to all search engine robots.
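The prefix-matching behavior of Disallow described above can be sketched in a few lines of Python. This is an illustrative simplification of the rule, not a full robots.txt implementation:

```python
def is_allowed(path, disallow_rules):
    """Prefix-match semantics of the robots.txt standard: a path is
    blocked if it starts with any non-empty Disallow value."""
    for rule in disallow_rules:
        if rule and path.startswith(rule):
            return False
    return True

# "Disallow: /help" blocks both /help.html and /help/index.html:
print(is_allowed("/help.html", ["/help"]))        # False
print(is_allowed("/help/index.html", ["/help"]))  # False
# "Disallow: /help/" blocks only paths under /help/:
print(is_allowed("/help.html", ["/help/"]))       # True
print(is_allowed("/help/index.html", ["/help/"])) # False
```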

Here are some basic robots.txt examples:

• Block all search engines from any part of the site:
User-agent: *
Disallow: /

• Allow all robots full access:
User-agent: *
Disallow:

Alternatively, create an empty /robots.txt file.

• Block all search engines from several sections of the site (the cgi-bin, tmp, and private directories in this example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

• Block a specific search engine (BadBot in this example):
User-agent: BadBot
Disallow: /

• Allow only a specific search engine (WebCrawler in this example):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
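The examples above can be checked with Python's standard-library robots.txt parser. Here the "allow only WebCrawler" file is fed to `urllib.robotparser` and queried for two different bots (the example.com URL is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The "allow only WebCrawler" example, as an inline robots.txt.
rules = """\
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# WebCrawler matches the first record (empty Disallow = full access);
# every other robot falls through to the "User-agent: *" record.
print(rp.can_fetch("webcrawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("badbot", "http://example.com/page.html"))      # False
```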

3. Robot names of common search engines

Robot name        Search engine
Baiduspider       http://www.baidu.com
Scooter           http://www.altavista.com
ia_archiver       http://www.alexa.com
Googlebot         http://www.google.com
FAST-WebCrawler   http://www.alltheweb.com
Slurp             http://www.inktomi.com
MSNBot            http://search.msn.com

4. robots.txt examples

Here are the robots.txt files of some well-known sites:

http://www.cnn.com/robots.txt

http://www.google.com/robots.txt

http://www.ibm.com/robots.txt

http://www.sun.com/robots.txt

http://www.eachnet.com/robots.txt

5. Common robots.txt Errors

• Reversed order:
Wrong:
User-agent: *
Disallow: GoogleBot

Correct:
User-agent: GoogleBot
Disallow: /

• Multiple disallowed paths on one line:
Wrong:
Disallow: /css/ /cgi-bin/ /images/

Correct:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/

• Leading whitespace before a line:
Wrong:
    Disallow: /cgi-bin/
Although the standard does not mention this explicitly, leading whitespace can easily cause problems.

• 404 redirecting to another page:
When a robot requests robots.txt from a site that does not have one, many sites automatically redirect the 404 to another HTML page, and the robot may then process that HTML page as if it were a robots.txt file. Although this rarely causes problems, it is best to place an empty robots.txt file in the site root.

• Wrong case. For example:
USER-AGENT: EXCITE
DISALLOW:
Although field names are case-insensitive in the standard, directory and file names in URL paths are case-sensitive:
User-agent: Excite
Disallow:

• Using Allow, which does not exist in the syntax (there is only Disallow):
Wrong:
User-agent: Baiduspider
Disallow: /john/
Allow: /jane/
(Some modern robots, such as Googlebot, do honor an Allow extension, but it is not part of the original standard.)

• Forgetting the slash /:
Wrong:
User-agent: Baiduspider
Disallow: css

Correct:
User-agent: Baiduspider
Disallow: /css/

The following tool checks robots.txt files for validity:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

II. The robots meta tag
1. What is the robots meta tag?

robots.txt restricts search engine access to an entire site or to directories, whereas the robots meta tag targets a specific page. Like other meta tags (the language used, a description of the page, keywords, and so on), the robots meta tag is placed in the <head> section of the page, for example:

<html>
<head>
<title>Times Marketing -- Network Marketing professional portal</title>
<meta name="robots" content="index,follow">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="keywords" content="marketing ...">
<meta name="description" content="Era Marketing Network is ...">
<link rel="stylesheet" href="/public/css.css" type="text/css">
</head>
<body>
...
</body>
</html>
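A crawler reads this directive by scanning the page's meta tags. The sketch below uses Python's standard-library HTML parser to collect robots directives from a page; the sample markup is illustrative only:

```python
from html.parser import HTMLParser

class RobotsMetaExtractor(HTMLParser):
    """Collect the content of <meta name="robots" ...> tags.
    The name comparison is case-insensitive, as the tag itself is."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                self.directives.append(d.get("content", ""))

page = '<head><meta name="ROBOTS" content="index,follow"></head>'
p = RobotsMetaExtractor()
p.feed(page)
print(p.directives)  # ['index,follow']
```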

2. Writing the robots meta tag:

The robots meta tag is case-insensitive. name="robots" means the tag applies to all search engines; to address a specific engine, write its robot name instead, for example name="Baiduspider". The content attribute has four directive options: index, noindex, follow, nofollow, separated by ",".

The index directive tells the search robot that the page may be indexed;

The follow directive tells the search robot that it may continue crawling along the links on the page.

The default for the robots meta tag is index,follow, except for Inktomi, for which the default is index,nofollow.

This yields four combinations in total:

<meta name="robots" content="index,follow">

<meta name="robots" content="noindex,follow">

<meta name="robots" content="index,nofollow">

<meta name="robots" content="noindex,nofollow">

Of these,

<meta name="robots" content="index,follow"> can be written as

<meta name="robots" content="all">;

<meta name="robots" content="noindex,nofollow"> can be written as

<meta name="robots" content="none">
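The all and none shorthands are pure aliases for the explicit directive pairs. A small illustrative helper (not part of any standard library) that expands them:

```python
def normalize_robots_content(content):
    """Expand the all / none shorthands of the robots meta tag into
    the explicit index/follow directive pair; pass other values
    through with whitespace and case normalized."""
    value = content.strip().lower()
    if value == "all":
        return "index,follow"
    if value == "none":
        return "noindex,nofollow"
    return ",".join(part.strip() for part in value.split(","))

print(normalize_robots_content("ALL"))              # index,follow
print(normalize_robots_content("NONE"))             # noindex,nofollow
print(normalize_robots_content("Index, NoFollow"))  # index,nofollow
```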

It is important to note that robots.txt and the robots meta tag restrict search engine robots only by convention: they rely on the cooperation of the robots, and not every robot complies with them.

At present, the vast majority of search engine robots comply with robots.txt rules. Support for the robots meta tag is still limited, but it is growing; Google, for example, supports it fully and has added an extra directive, archive, which controls whether Google keeps a cached snapshot of the page. For example:

<meta name="googlebot" content="index,follow,noarchive">

tells Googlebot to index the page and follow its links, but not to keep a snapshot of the page on Google.
