Robots.txt and Robots meta tags


Author: Pingwen




We know that search engines have their own "search robots", which build their databases by continuously following the links on web pages (typically href and src links) and crawling the content they find. For site administrators and content providers, however, there is sometimes content that they do not want robots to crawl and expose. To solve this problem, the robots community offers two options: one is robots.txt, and the other is the robots META tag.




I. robots.txt




1. What is robots.txt?




Robots.txt is a plain text file in which a site declares the parts it does not want robots to visit, so that some or all of the site's content is kept out of search engines, or so that a specified search engine indexes only the specified content.




When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory. If the file is found, the robot determines the scope of its visit according to the file's contents; if the file does not exist, the robot simply crawls along the links.
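As a concrete illustration of this check, here is a minimal sketch using Python's standard urllib.robotparser module; the robot name and site are hypothetical placeholders:

from urllib.robotparser import RobotFileParser

# Hypothetical crawler name and site, for illustration only.
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the file from the site root

if rp.can_fetch("ExampleBot", "http://www.example.com/private/page.html"):
    print("robots.txt allows crawling this URL")
else:
    print("robots.txt blocks this URL")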




Robots.txt must be placed at the root of a site, and the filename must be all lowercase.




Web site URL                 Corresponding robots.txt URL
http://www.w3.org/           http://www.w3.org/robots.txt
http://www.w3.org:80/        http://www.w3.org:80/robots.txt
http://www.w3.org:1234/      http://www.w3.org:1234/robots.txt
http://w3.org/               http://w3.org/robots.txt
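The mapping above amounts to keeping the scheme and host (including any port) and replacing the path with /robots.txt; a small illustrative sketch with Python's standard urllib.parse:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url):
    # Keep scheme and host (with port, if any); replace the path with /robots.txt.
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.w3.org:1234/some/page.html"))
# -> http://www.w3.org:1234/robots.txt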


2. robots.txt syntax




The "robots.txt" file contains one or more records that are separated by a blank line (with CR,CR/NL, or NL as The Terminator), and each record is formatted as follows:




"<field>:<optionalspace><value><optionalspace>".




You can use # for comments in this file, following the same convention as in UNIX. Records in this file usually begin with one or more User-agent lines, followed by a number of Disallow lines, as detailed below:




User-agent:




The value of this field describes the name of a search engine robot. If there are multiple User-agent records in the "robots.txt" file, then multiple robots are restricted by this protocol, so the file must contain at least one User-agent record. If the value of this field is set to *, the protocol applies to any robot, and in that case there can be only one "User-agent: *" record in the "robots.txt" file.




Disallow:




The value of this field describes a URL that should not be visited. It can be a full path or a path prefix; any URL that begins with the value of a Disallow field will not be accessed by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" allows the robot to access /help.html but not /help/index.html.




If a Disallow record is left empty, all parts of the site may be accessed. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.
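The prefix matching described above can be summed up in a few lines; the following Python sketch is purely illustrative and not any search engine's actual implementation:

def is_allowed(path, disallow_values):
    # A path is blocked when it starts with any non-empty Disallow value.
    return not any(value and path.startswith(value) for value in disallow_values)

# "Disallow: /help" blocks both /help.html and /help/index.html;
# "Disallow: /help/" blocks /help/index.html but not /help.html.
print(is_allowed("/help.html", ["/help"]))         # False
print(is_allowed("/help/index.html", ["/help"]))   # False
print(is_allowed("/help.html", ["/help/"]))        # True
print(is_allowed("/help/index.html", ["/help/"]))  # False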




Here are some basic examples of robots.txt usage:




• Prohibit all search engines from accessing any part of the site:
User-agent: *
Disallow: /




• Allow all robots full access:
User-agent: *
Disallow:
(Alternatively, you can simply create an empty "/robots.txt" file.)




• Prohibit all search engines from accessing several parts of the site (the /cgi-bin/, /tmp/, and /private/ directories in the following example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/




• Prohibit a particular search engine (BadBot in the following example):
User-agent: BadBot
Disallow: /




• Allow only a particular search engine (WebCrawler in the following example):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
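Rules like the last example can be sanity-checked with Python's standard urllib.robotparser; a short sketch (the second robot name is made up):

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: WebCrawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]
rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("WebCrawler", "/index.html"))    # True: WebCrawler may crawl everything
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False: all other robots are blocked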




3. Names of common search engine robots




Robot name            Search engine
Baiduspider           http://www.baidu.com
Scooter               http://www.altavista.com
ia_archiver           http://www.alexa.com
Googlebot             http://www.google.com
FAST-WebCrawler       http://www.alltheweb.com
Slurp                 http://www.inktomi.com
MSNBOT                http://search.msn.com


4. robots.txt examples




Here are the robots.txt files of some well-known sites:







http://www.cnn.com/robots.txt
http://www.google.com/robots.txt
http://www.ibm.com/robots.txt
http://www.sun.com/robots.txt
http://www.eachnet.com/robots.txt


5. Common robots.txt Errors




• Reversed order:
Wrong:
User-agent: *
Disallow: GoogleBot

The right way is:
User-agent: GoogleBot
Disallow: /




• Putting multiple Disallow paths on one line:
For example, incorrectly written as:
Disallow: /css/ /cgi-bin/ /images/

The right way is:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/




• Lots of whitespace at the start of a line:
For example, writing:
     Disallow: /cgi-bin/
Although the standard says nothing about this, leading whitespace can easily cause problems.




• Redirecting 404s to another page:
Many sites have no robots.txt at all and answer the resulting 404 with an automatic redirect to another HTML page. A robot will often try to process that HTML page as if it were a robots.txt file. Although this usually causes no problems, it is best to place an empty robots.txt file in the site's root directory.
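One defensive measure on the crawler side is to check the Content-Type of whatever comes back before treating it as robots.txt; the following sketch with Python's standard urllib.request uses a hypothetical site and bot name:

import urllib.error
import urllib.request

req = urllib.request.Request("http://www.example.com/robots.txt",
                             headers={"User-Agent": "ExampleBot"})
try:
    with urllib.request.urlopen(req) as response:
        content_type = response.headers.get("Content-Type", "")
        if "text/plain" not in content_type:
            # Probably a redirect to an HTML page, not a real robots.txt.
            print("No usable robots.txt; assuming the whole site may be crawled")
        else:
            print(response.read().decode("utf-8", errors="replace"))
except urllib.error.HTTPError as err:
    if err.code == 404:
        print("No robots.txt at all; assuming the whole site may be crawled")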




• Using uppercase. For example:
User-agent: EXCITE
DISALLOW:
Although the standard is case-insensitive for field names, directory and file names should be lowercase:
User-agent: GoogleBot
Disallow:




• The syntax has only Disallow, no Allow!
The incorrect form is:
User-agent: Baiduspider
Disallow: /john/
Allow: /jane/




• Forgetting the slash /:
Wrong:
User-agent: Baiduspider
Disallow: css

The right way is:
User-agent: Baiduspider
Disallow: /css/




The following tool checks the validity of robots.txt files:




http://www.searchengineworld.com/cgi-bin/robotcheck.cgi




II. Robots META tags




1. What is a robots META tag?




The robots.txt file mainly restricts a search engine's access to an entire site or directory, whereas the robots META tag targets a specific page. Like other META tags (such as the language used, the page description, keywords, and so on), the robots META tag is placed in the <head></head> section of the page and tells search engines how to handle that page's content. Its form is as follows (see the robots META tag line below):




<html>
<head>
<title>Times Marketing -- Network Marketing Professional Portal</title>
<meta name="Robots" content="index,follow">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="keywords" content="marketing...">
<meta name="description" content="Times Marketing Network is...">
<link rel="stylesheet" href="/public/css.css" type="text/css">
</head>
<body>
...
</body>
</html>
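To show how a crawler might pick this tag out of a page, here is a small sketch built on Python's standard html.parser; the class name is invented for this example:

from html.parser import HTMLParser

class RobotsMetaExtractor(HTMLParser):
    # Collects the content attribute of every <meta name="robots"> tag.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                self.directives.append(attrs.get("content", ""))

extractor = RobotsMetaExtractor()
extractor.feed('<head><meta name="Robots" content="index,follow"></head>')
print(extractor.directives)  # ['index,follow']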




2. How to write robots META tags:




The robots META tag is case-insensitive. name="robots" applies to all search engines; for a specific search engine you can write, for example, name="Baiduspider". The content field has four directive options: index, noindex, follow, and nofollow, separated by ",".




The index directive tells the search robot that the page may be crawled and indexed;




The follow directive tells the search robot that it may continue crawling along the links on the page;




The default values for the robots META tag are index and follow, except for Inktomi, whose default is index, nofollow.




This gives the following four combinations:




<meta name= "ROBOTS" content= "Index,follow" >




<meta name= "ROBOTS" content= "Noindex,follow" >




<meta name= "ROBOTS" content= "Index,nofollow" >




<meta name= "ROBOTS" content= "Noindex,nofollow" >




which




<meta name= "ROBOTS" content= "Index,follow" > can be written




<meta name= "ROBOTS" content= "All" >;




<meta name= "ROBOTS" content= "Noindex,nofollow" > can be written




<meta name= "ROBOTS" content= "NONE" >




Note that the robots.txt file and the robots META tags described above are only conventions for restricting how search engine robots crawl site content; they depend on the cooperation of the robots, and not every robot follows them.




At present, the vast majority of search engine robots comply with the rules of robots.txt. Support for the robots META tag is not yet as widespread, but it is gradually increasing; the well-known search engine Google fully supports it, and Google has also added a directive, "archive", which can limit whether Google keeps a snapshot of the page. For example:




<meta name= "Googlebot" content= "index,follow,noarchive" >




This means that the page may be crawled and the links on it may be followed, but Google will not keep a snapshot of the page.



