Where can I write robots.txt?

Source: Internet
Author: User

Introduction to robots.txt

Example: http://www.baidu.com/robots.txt
robots.txt is a plain text file in which a website administrator declares which parts of the site should not be accessed by robots, or which search engines may index which content.

When a search robot (also called a search spider) crawls a site, it first checks whether the site's root directory contains a robots.txt file. If it does, the robot determines its access range based on the contents of the file. If the file does not exist, the robot crawls the site's links without restriction.

In addition, robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.
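This check is exactly what Python's standard-library `urllib.robotparser` implements. A minimal sketch (the sample rules and the URL path are illustrative):

```python
import urllib.robotparser

# A permissive robots.txt: every robot may fetch everything.
sample = """\
User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
# Against a live site you would fetch the file from the root directory, e.g.:
#   rp.set_url("http://www.example.com/robots.txt"); rp.read()
# Here we parse the sample text directly instead of fetching it.
rp.parse(sample.splitlines())

print(rp.can_fetch("AnySpider", "/some/page.html"))  # True
```

A polite crawler calls `can_fetch()` before requesting each URL, just as the search robots described above do.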

Robots.txt writing syntax

First, let's look at a robots.txt example: http://www.seovip.cn/robots.txt

The specific content of robots.txt is as follows:

# Robots.txt file from http://www.seovip.cn
# All robots will spider the domain

User-agent: *
Disallow:

The above text indicates that all search robots are allowed to access all files under www.seovip.cn.

Syntax breakdown: lines beginning with # are comments; User-agent names the search robot the rules apply to, with * matching all robots; Disallow lists the paths that robot may not access.

Next, here are the common usage patterns of robots.txt:

Allow access by all robots

User-agent: *
Disallow:

Alternatively, you can create an empty "/robots.txt" file.

Prohibit all search engines from accessing any part of the website

User-agent: *
Disallow: /

Prohibit all search engines from accessing parts of the website (the 01, 02, and 03 directories in the following example)

User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
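The directory rules above can be verified with `urllib.robotparser` (directory and page names are taken from the example, and the test paths are illustrative):

```python
import urllib.robotparser

# The example rules: block three directories for every robot.
rules = """\
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnySpider", "/01/page.html"))    # False: inside a blocked directory
print(rp.can_fetch("AnySpider", "/news/page.html"))  # True: everything else stays open
```

Note that a Disallow path is a prefix match: blocking /01/ blocks everything beneath that directory, while paths outside it remain crawlable.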

Prohibit access by a specific search robot (badbot in the following example)

User-agent: badbot
Disallow: /

Allow access by only one search robot (Crawler in the following example)

User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
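This allow-only-one-robot pattern works because each robot obeys the first group whose User-agent matches it, falling back to the * group otherwise. A sketch checking both cases with `urllib.robotparser` (the robot names and test path are illustrative):

```python
import urllib.robotparser

# Allow only "Crawler"; every other robot hits the catch-all block.
rules = """\
User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Crawler", "/page.html"))   # True: Crawler matches its own group
print(rp.can_fetch("OtherBot", "/page.html"))  # False: everyone else is blocked
```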

In addition, it is worth expanding the description to introduce the robots meta tag:

The robots meta tag targets individual pages. Like other meta tags (such as those for the language used, the page description, and keywords), the robots meta tag is placed in the <head> section of the HTML page.

Syntax of the robots meta tag:

The robots meta tag is case-insensitive. name="robots" addresses all search engines; to target a specific one, use its spider name instead, for example name="baiduspider". The content attribute takes four directives: index, noindex, follow, and nofollow, separated by commas.

The index directive tells the search robot to index the page, and noindex tells it not to;

the follow directive tells the search robot that it may continue crawling along the links on the page, and nofollow tells it not to.

The default values of the robots meta tag are index, follow, except for Inktomi, whose defaults are index, nofollow.

In this way, there are four combinations:

<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, nofollow">

Here,

<meta name="robots" content="index, follow"> can be written as <meta name="robots" content="all">;

<meta name="robots" content="noindex, nofollow"> can be written as <meta name="robots" content="none">.
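Because the tag is case-insensitive and comma-separated, a crawler can extract the directives with Python's standard-library `html.parser`. A minimal sketch (the class name and sample markup are illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from a page's robots meta tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            # The tag is case-insensitive, so normalize name and directives.
            if d.get("name", "").lower() == "robots":
                self.directives = [c.strip().lower()
                                   for c in d.get("content", "").split(",")]

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="Robots" content="noindex, follow"></head></html>')
print(parser.directives)  # ['noindex', 'follow']
```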

Today, the vast majority of search engine robots comply with the rules in robots.txt. Support for the robots meta tag is less widespread but growing. Google, for example, supports it fully, and additionally recognizes the noarchive directive, which controls whether Google keeps a cached snapshot of the page. For example:

<meta name="googlebot" content="index, follow, noarchive">

This tells Google to index the page and follow the links on it, but not to keep a cached snapshot of the page.
