Basic robots.txt Writing for Search Engines


To make it easier for SEOers to use robots.txt precisely, let us start with its definition and work through how it is used.

I. Definition

robots.txt is a plain text file stored in the root directory of a site for search spiders to read. The filename must be lowercase: "robots.txt".

II. Role

Through robots.txt you can control what search engines index: it tells spiders which files and directories may be included and which may not.

III. Where to place robots.txt

The robots.txt file must be placed in the root directory of the site. For example, when a spider visits a website such as http://www.abc.com, it first checks whether http://www.abc.com/robots.txt exists. If the spider finds this file, it determines the scope of its access rights from the contents of the file.
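As a rough illustration of this first step, the sketch below uses Python's standard urllib.robotparser module to do what a spider does: derive the robots.txt address from the site root, fetch it, and then check individual URLs against it. The domain www.abc.com is the hypothetical one from the example above, and the page path is made up; real spiders implement their own fetching and parsing logic.

from urllib.robotparser import RobotFileParser

# The robots.txt of a site always lives directly under the root: <scheme>://<host>/robots.txt
rp = RobotFileParser()
rp.set_url("http://www.abc.com/robots.txt")
rp.read()  # fetch the file; if it is missing, the whole site is treated as open to crawling

# Every candidate URL is then checked against the rules that were just read.
print(rp.can_fetch("Baiduspider", "http://www.abc.com/some/page.html"))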

IV. Format of robots.txt

The file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each record has the form "<field>:<optional space><value><optional space>". The "#" character can be used for comments in this file. The records usually start with one or more User-agent lines, followed by several Disallow and Allow lines, as detailed below.
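Below is a minimal sketch of such a file, with two records separated by a blank line and "#" comments. It is fed to Python's standard urllib.robotparser only to show that the format parses as described; the directories /tmp/ and /secret/ and the test URLs are illustrative.

from urllib.robotparser import RobotFileParser

rules = """\
# A comment line: block every robot from the /tmp/ directory
User-agent: *
Disallow: /tmp/

# A second record (separated by a blank line) with rules only for Baiduspider
User-agent: Baiduspider
Disallow: /secret/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("Baiduspider", "http://www.abc.com/secret/a.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.abc.com/tmp/b.html"))       # False
print(rp.can_fetch("Baiduspider", "http://www.abc.com/index.html"))     # True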

User-agent:

The value of this field is the name of a search engine robot (spider). If there are multiple User-agent records in "robots.txt", then more than one robot is constrained by the file; there must be at least one User-agent record. If the value is set to *, the record applies to every robot, and there can be only one "User-agent: *" record in the file. If you add "User-agent: SomeBot" followed by a number of Disallow and Allow lines, then the robot named "SomeBot" is constrained only by the Disallow and Allow lines that follow "User-agent: SomeBot".

Disallow:

The value of this field describes a set of URLs that should not be visited. It can be a complete path or a non-empty prefix of a path; any URL that begins with the value of the Disallow field will not be visited by the robot. For example, "Disallow: /help" prohibits the robot from accessing /help.html and /help/index.html, while "Disallow: /help/" lets the robot access /help.html but not /help/index.html.
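The difference between "Disallow: /help" and "Disallow: /help/" can be checked with a small sketch. Python's standard urllib.robotparser is used here only as a convenient prefix-matching reference, and the helper name allowed() is a hypothetical one.

from urllib.robotparser import RobotFileParser

def allowed(rules, url, agent="*"):
    # Parse an in-memory robots.txt and ask whether `agent` may fetch `url`.
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)

# "Disallow: /help" blocks every URL whose path starts with /help ...
print(allowed("User-agent: *\nDisallow: /help", "http://www.abc.com/help.html"))        # False
print(allowed("User-agent: *\nDisallow: /help", "http://www.abc.com/help/index.html"))  # False

# ... while "Disallow: /help/" blocks only URLs inside the /help/ directory.
print(allowed("User-agent: *\nDisallow: /help/", "http://www.abc.com/help.html"))        # True
print(allowed("User-agent: *\nDisallow: /help/", "http://www.abc.com/help/index.html"))  # False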

The "Disallow:" description allows robot to access all URLs of the site, and at least one disallow record in the "/robots.txt" file. If "/robots.txt" does not exist or is an empty file, then for all search engine robot, the site is open.

Allow:

The value of this field describes a set of URLs that are allowed to be visited. Like the Disallow field, it can be a complete path or a path prefix; any URL that begins with the value of the Allow field may be visited by the robot. For example, "Allow: /hibaidu" allows the robot to access /hibaidu.htm, /hibaiducom.html and /hibaidu/com.html. All URLs of a site are allowed by default, so Allow is usually used together with Disallow: access is permitted to a subset of pages while all other URLs are blocked.

It is important to note that the order of the Disallow and Allow lines matters: the robot decides whether to access a URL based on the first Allow or Disallow line that matches it.
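A short sketch of this first-match rule, using urllib.robotparser (which also decides by the first matching line) and the hypothetical /haha/ directories that appear in example 4 below:

from urllib.robotparser import RobotFileParser

def allowed(rules, url):
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

url = "http://www.abc.com/haha/test/page.html"

# Allow line first: the URL matches it before the broader Disallow line is reached.
print(allowed("User-agent: *\nAllow: /haha/test/\nDisallow: /haha/", url))  # True

# Disallow line first: the URL is rejected before the Allow line is ever considered.
print(allowed("User-agent: *\nDisallow: /haha/\nAllow: /haha/test/", url))  # False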

Use "*" and "$": Baiduspider supports using wildcard "*" and "$" to blur matching URLs. "$" matches the line terminator. "*" matches 0 or more arbitrary characters.

V. Examples:

1. Prohibit all search engines from crawling directory1, directory2 and directory3:

User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/

2. Prohibit Baiduspider from crawling the secret directory:

User-agent: Baiduspider
Disallow: /secret/

3. Prohibit all search engines from crawling the cgi directory, but allow Slurp to crawl everything:

User-agent: *
Disallow: /cgi/

User-agent: Slurp
Disallow:

4. Prohibit all search engines from crawling the haha directory, but allow the test directory under haha (the Allow line is placed first so that it is matched before the broader Disallow line):

User-agent: *
Allow: /haha/test/
Disallow: /haha/

Examples of how individual lines are written:

User-agent: *    # "*" is a wildcard representing all types of search engines

Disallow: /admin/    # Prohibits crawling of anything under the admin directory

Disallow: /require/    # Prohibits crawling of anything under the require directory

Disallow: /abc/    # Prohibits crawling of anything under the abc directory

Disallow: /cgi-bin/*.htm    # Prohibits access to all URLs with the ".htm" suffix under the /cgi-bin/ directory (including subdirectories)

Disallow: /*?*    # Prohibits access to all URLs on the site that contain a question mark (?)

Disallow: /*.jpg$    # Prohibits crawling of all .jpg images on the site

Disallow: /ab/adc.html    # Prohibits crawling of the adc.html file under the /ab/ directory

Allow: /cgi-bin/    # Allows crawling of anything under the cgi-bin directory

Allow: /tmp    # Allows crawling of the entire tmp directory

Allow: /*.htm$    # Allows access to URLs ending in ".htm"; combined with Disallow lines, this can be used to permit only such URLs

Allow: /*.gif$    # Allows crawling of web pages and images in .gif format

Sitemap: (URL of the sitemap)    # Tells the crawler where the sitemap of this site is. Note: this line is often forgotten
5. Names of common search engine spiders:

# Google: Googlebot

# MSN: MSNBot

# 360: 360Spider

# Sogou: Sogou spider

# Baidu: Baiduspider

# Bing: Bingbot

Summary:

The above is how robots.txt is used. You may ask: if I want search engines to crawl everything anyway, is robots.txt of any use to me? In fact, from an SEO point of view there are still good reasons to set it up. Some examples:

1. During site optimization, many different URLs often point to essentially the same page. For search engines this wastes their server resources, and it does the site itself no good.

2. After a site redesign, or after URLs are made static, many dead links and search-engine-unfriendly links remain. They all need to be blocked, and that is done with robots.txt.

3. For pages that have no keyword settings, blocking their URLs gives a better SEO result.

4. Many sites have a site-wide search; its result pages are dynamic and temporary. Blocking those pages with robots.txt improves the site's optimization.
