What is robots.txt


Search engines visiting your site frequently and indexing your content is great, but there are often cases when indexing parts of your online content is not what you want. For example, if you have two versions of a page (one for viewing in the browser and one for printing), you would rather have the printing version excluded from crawling; otherwise you risk a duplicate content penalty. Also, if you happen to have confidential information on your site that you do not want the world to see, you will prefer that search engines do not index those pages (although in this case the only reliable way to keep sensitive data from being indexed is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and JavaScript from indexing, you need a way to tell spiders to keep away from these items.
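As a preview of the syntax explained below, such exclusions could look something like this in a robots.txt file (the directory names here are placeholders for illustration, not anything prescribed by the standard):

# Keep spiders away from a hypothetical print-friendly copy
# and from images, stylesheets and scripts.
User-agent: *
Disallow: /print/
Disallow: /images/
Disallow: /css/
Disallow: /js/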

One way to tell search engines which files and folders on your site to avoid is the robots metatag. But since not all search engines read metatags, the robots metatag can simply go unnoticed. A better way to communicate your wishes to search engines is a robots.txt file.
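For reference, the robots metatag mentioned above is a single line in a page's HTML head; this particular combination of values is just one common example:

<meta name="robots" content="noindex,nofollow">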

What is robots.txt?

Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but generally search engines obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection); putting up a robots.txt file is something like hanging a note saying "Please do not enter" on an unlocked door: you cannot prevent burglars from coming in, but the good guys will not open the door and enter. That is why we say that if you really have sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.

The location of the robots.txt is very important.

It must be in the root directory, because otherwise user agents (search engines) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look first in the root directory (i.e. http://mydomain.com/robots.txt), and if they do not find it there, they simply assume the site has no robots.txt file and index everything they find along the way. So if you do not put robots.txt in the right place, do not be surprised when search engines index your whole site.
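For example (keeping the mydomain.com placeholder from above), only the first of these two locations is ever checked:

http://mydomain.com/robots.txt - found and obeyed
http://mydomain.com/pages/robots.txt - never looked for, silently ignored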

The concept and structure of robots.txt were developed more than ten years ago; if you are interested to learn more about it, visit http://www.robotstxt.org/ or go straight to the Standard for Robot Exclusion, because in this article we will deal only with the most important aspects of robots.txt files. Next we continue with the structure of robots.txt files.

Structure of a robots.txt file

The structure of a robots.txt is fairly simple (and barely flexible): it is an endless list of user agents and disallowed files and directories. Basically, the syntax is as follows:

User-agent:
Disallow:

"User" is the search engine's crawl tool, and is not allowed: listed files and directories are excluded from indexing. In addition, "User:" and "Disallow:": "entries, you can include the comment line-just put the number on the first route:

# All user agents are disallowed to see the /temp/ directory.
User-agent: *
Disallow: /temp/
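Two edge cases of the Disallow line are worth remembering: an empty value excludes nothing, while a single slash excludes the whole site. A minimal sketch of both:

# Nothing is disallowed; the whole site may be crawled.
User-agent: *
Disallow:

# Everything is disallowed; the whole site is off limits.
User-agent: *
Disallow: /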

Traps of the robots.txt file

When you start making complicated files, i.e. you decide to allow different user agents access to different directories, problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user agents or directories, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.

A more serious problem is logical errors. For example:

User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/

The above example is from a robots.txt that allows all agents to access everything on the site except the /temp/ directory. Up to here it is fine, but later on there is another record that specifies more restrictive terms for Googlebot. When Googlebot starts reading the robots.txt, it will see that all user agents (including Googlebot itself) are allowed to all folders except /temp/. This is enough for Googlebot to know, so it will not read the file to the end and will index everything except /temp/, including /images/ and /cgi-bin/, which you thought you had told it not to touch. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
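If, as described above, a crawler settles on the first record it reads that matches it, one way to avoid this trap is to put the more specific record first and repeat the shared rules inside it. This is a sketch under that assumption, not guaranteed behavior of any particular engine:

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/

User-agent: *
Disallow: /temp/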

Tools to generate and validate robots.txt files

Having in mind the simple syntax of a robots.txt file, you can always read through it yourself to see if everything is OK, but it is much easier to use a validator, like this one: http://tool.motoricerca.info/robots-checker.phtml. These tools report common mistakes like missing slashes or colons, which, if undetected, compromise your efforts. For example, if you enter:

User Agent: *
Disallow: /temp/

this is wrong, because the hyphen between "User" and "Agent" is missing and the syntax is incorrect.
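The valid form, for comparison, uses the hyphenated directive:

User-agent: *
Disallow: /temp/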

In cases when you have a complex robots.txt file, i.e. you give different instructions to different user agents or you have a long list of directories and subdirectories to exclude, writing the file by hand can be a real pain. But you do not have to worry: there are tools that will generate the file for you. What is more, there are visual tools that let you point and select the files and folders to be excluded. And even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For example, the server-side robots generator provides a drop-down list of user agents and a text box for you to list the files you do not want indexed. Honestly, this is not much help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories, but it is more than nothing.
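The output of such a generator, with per-engine rules, might look something like this (the user agents and directories here are purely illustrative):

User-agent: Googlebot-Image
Disallow: /photos/

User-agent: Bingbot
Disallow: /private/

User-agent: *
Disallow: /temp/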

