Methods to prohibit search engine inclusion

Source: Internet
Author: User

One. What is a robots.txt file?

A search engine uses a program called a robot (also known as a spider) to automatically access Web pages on the Internet and collect web information.

You can create a plain text file named robots.txt on your site. In this file you declare which parts of the site you do not want robots to access, so that part or all of the site's content is excluded from search engines, or so that a specified search engine indexes only the specified content.

Two. Where do I put the robots.txt file?

The robots.txt file should be placed in the root directory of the Web site. For example, when a robot visits a Web site (such as http://www.abc.com), it first checks whether the file http://www.abc.com/robots.txt exists on the site. If the robot finds the file, it determines the scope of its access rights based on the contents of the file.

Site URL                  URL of the corresponding robots.txt
http://www.w3.org/        http://www.w3.org/robots.txt
http://www.w3.org:80/     http://www.w3.org:80/robots.txt
http://www.w3.org:1234/   http://www.w3.org:1234/robots.txt
http://w3.org/            http://w3.org/robots.txt
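The mapping in the table above can be sketched in code: keep the scheme and host (including any port) of the site URL, and replace the path with /robots.txt. This is a minimal illustration using Python's standard urllib.parse; the function name robots_txt_url is made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(site_url: str) -> str:
    """Derive the robots.txt URL for a site: the scheme and host
    (with any port) are kept, and the path becomes /robots.txt."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.w3.org/"))       # http://www.w3.org/robots.txt
print(robots_txt_url("http://www.w3.org:1234/"))  # http://www.w3.org:1234/robots.txt
```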

Three. Format of the robots.txt file

The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each record has the following format:

"<field>:<optionalspace><value><optionalspace>".

You can use # for comments in this file, following the same convention as in Unix. The records in this file typically start with one or more User-agent lines, followed by several Disallow lines, as detailed below:

User-agent:
The value of this field is the name of a search engine robot. If the "robots.txt" file contains multiple User-agent records, more than one robot is restricted by the protocol; the file must contain at least one User-agent record. If the value is set to *, the protocol applies to any robot, and there can be only one "User-agent: *" record in the "robots.txt" file.

Disallow:
The value of this field is a URL that you do not want robots to access; it can be a full path or a prefix. Any URL beginning with the Disallow value will not be accessed by a robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows robots to access /help.html but not /help/index.html.
If a Disallow record is empty, all parts of the site may be accessed; at least one Disallow record must be present in the "/robots.txt" file. If "/robots.txt" is an empty file, the site is open to all search engine robots.

Four. Examples of robots.txt file usage

Example 1. Prohibit all search engines from accessing any part of the site


User-agent: *
Disallow: /

Example 2. Allow all robots access

(Alternatively, you can create an empty "/robots.txt" file.)

User-agent: *
Disallow:

Example 3. Prohibit a specific search engine from accessing the site

User-agent: badbot
Disallow: /

Example 4. Allow access to only one search engine

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /

Example 5. A simple example

In this example, the site restricts search engine access to three directories; that is, search engines may not access these three directories.
Note that each directory must be declared on its own Disallow line and cannot be combined into one, such as "Disallow: /cgi-bin/ /tmp/".
Also note that * has a special meaning only after "User-agent:", where it stands for "any robot"; records such as "Disallow: /tmp/*" or "Disallow: *.gif" must not appear in this file.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
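The rules from this example can be verified with Python's standard urllib.robotparser; the host example.com below is a placeholder assumed for illustration.

```python
from urllib.robotparser import RobotFileParser

# Parse the rules from Example 5: three directories are off-limits
# to every robot, everything else is allowed.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /~joe/",
])

for path in ("/cgi-bin/test.cgi", "/tmp/a.html", "/~joe/index.html", "/index.html"):
    print(path, rp.can_fetch("*", "http://example.com" + path))
# The three restricted directories report False; /index.html reports True.
```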

Five. robots.txt File Reference

For more specific settings for the robots.txt file, see the following links:

· Web Server Administrator's Guide to the Robots Exclusion Protocol
· HTML Author's Guide to the Robots Exclusion Protocol
· The original 1994 protocol description, as currently deployed
· The revised Internet-Draft specification, which is not yet completed or implemented
