Site administrators often seem to pay little attention to robots.txt. At the request of some friends, I would like to use this article to talk briefly
about how to write a robots.txt file.
A Basic Introduction to robots.txt
robots.txt is a plain text file in which site administrators can declare the parts of the site that they do not want robots to access,
or specify that a given search engine may index only the specified content.
When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory.
If the file exists, the robot uses its contents to determine which parts of the site it may access;
if it does not exist, the robot simply crawls along the links it finds.
In addition, robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
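Because the file always lives at the root of the site under a lowercase name, its address can be derived from any page URL. The following is only a minimal sketch of that idea; the function name robots_url and the sample page path are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page path, query and fragment,
    # and point at the lowercase file name in the root directory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path, used only to show the call.
print(robots_url("http://www.beidou365.cn/some/dir/page.html"))
```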
robots.txt Syntax
First, let's look at a robots.txt example: http://www.beidou365.cn/robots.txt
Visiting that address shows the following robots.txt contents:
# Robots.txt file from http://www.beidou365.cn
# All robots would spider the domain
User-agent: *
Disallow:
The text above allows all search robots to access every file under the www.beidou365.cn site.
Syntax breakdown: the text after # is a comment; User-agent: is followed by the name of a search robot, and * stands for
all search robots; Disallow: is followed by the file or directory that may not be accessed. An empty Disallow: value blocks nothing.
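To make the syntax concrete, here is a small hand-written sketch (not how any particular search engine does it) that splits robots.txt text into field and value pairs and drops comments. The function name parse_lines is an assumption made for this example.

```python
def parse_lines(text: str) -> list:
    """Split robots.txt text into (field, value) pairs, ignoring comments."""
    pairs = []
    for raw in text.splitlines():
        # Everything after "#" is descriptive text, not a rule.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Field names are matched case-insensitively; values keep their case.
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


EXAMPLE = """\
# Robots.txt file from http://www.beidou365.cn
# All robots would spider the domain
User-agent: *
Disallow:
"""

print(parse_lines(EXAMPLE))
```

The empty string returned for Disallow is what "nothing is disallowed" looks like after parsing.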
Below are some specific uses of robots.txt:
Allow all robots to access everything
User-agent: *
Disallow:
Alternatively, create an empty "/robots.txt" file.
Prohibit all search engines from accessing any part of the site
User-agent: *
Disallow: /
Prohibit all search engines from accessing several parts of the site (the 01, 02 and 03 directories in the following example)
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Block one particular search engine (BadBot in the following example)
User-agent: BadBot
Disallow: /
Allow only one search engine to access the site (Crawler in the following example)
User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
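One way to check rules like these before publishing them is Python's standard urllib.robotparser module. The sketch below is only an illustration: the helper name allowed, the robot name AnyBot and the page paths are assumptions, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


SITE = "http://www.beidou365.cn"

BLOCK_DIRS = """\
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
"""

ONLY_CRAWLER = """\
User-agent: crawler
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical page paths, used only to exercise the rules.
print(allowed(BLOCK_DIRS, "AnyBot", SITE + "/01/page.html"))    # inside a blocked directory
print(allowed(BLOCK_DIRS, "AnyBot", SITE + "/04/page.html"))    # directory not listed
print(allowed(ONLY_CRAWLER, "crawler", SITE + "/index.html"))   # the one permitted robot
print(allowed(ONLY_CRAWLER, "badbot", SITE + "/index.html"))    # any other robot
```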
In addition, I think it is worth saying a little more about the Robots META tag:
The Robots META tag applies to one specific page. Like other META tags (such as the language used, the page description,
keywords, and so on), the Robots META tag is placed in the head of the page, and its purpose is to tell search engine robots
how to crawl the contents of that page.
How to write the Robots META tag:
The name attribute of the META tag is not case-sensitive. name="Robots" addresses all search engines; to target one specific search
engine, write, for example, name="BaiduSpider". The content attribute has four directive options: index, noindex, follow, nofollow.
The directives are separated by ",".
The index directive tells the search robot that it may index the page;
The follow directive indicates that the search robot may continue to crawl along the links on the page;
The default value of the Robots META tag is index,follow, except for Inktomi, for which the default is index,nofollow.
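This gives four combinations:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Of these, <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">,
and <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.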
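The combinations above can also be read from a page programmatically. The following is a rough sketch using Python's standard html.parser module; the class name RobotsMetaParser and the function robots_meta are invented for this example, and the defaults assume the index,follow behavior described earlier rather than any engine-specific default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect index/follow permissions from Robots META tags."""

    def __init__(self, robot_name: str = "robots"):
        super().__init__()
        self.robot_name = robot_name.lower()
        self.index = True    # assumed default: index
        self.follow = True   # assumed default: follow

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The name attribute is compared without regard to case.
        name = (attr.get("name") or "").lower()
        if name not in ("robots", self.robot_name):
            return
        for token in (attr.get("content") or "").lower().split(","):
            token = token.strip()
            if token in ("index", "all"):
                self.index = True
            if token in ("follow", "all"):
                self.follow = True
            if token in ("noindex", "none"):
                self.index = False
            if token in ("nofollow", "none"):
                self.follow = False


def robots_meta(html: str, robot_name: str = "robots") -> tuple:
    """Return (index, follow) for the given HTML text."""
    parser = RobotsMetaParser(robot_name)
    parser.feed(html)
    return parser.index, parser.follow


print(robots_meta('<head><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>'))
```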
It appears that the vast majority of search engine robots obey robots.txt rules, whereas support for the Robots META tag is
currently less widespread, though it is growing. The well-known search engine Google supports it fully, and Google has also added
an archive directive that controls whether Google keeps a snapshot of a web page. For example: