The Robots Protocol and Prohibiting Search Engine Indexing


A supplementary note on prohibiting search engines from indexing a website.

1. What is the robots.txt file?

A search engine robot program (also known as a spider) automatically visits web pages on the Internet and collects web page information.

You can create a plain-text file named robots.txt on your website to declare which parts of the site you do not want robots to visit. In this way, some or all of the site's content can be kept out of search engine indexes, or you can direct a specific search engine to index only the content you specify.

2. Where can I store the robots.txt file?

The robots.txt file must be placed in the root directory of the website. When a robot visits a website (for example, http://www.abc.com), it first checks whether http://www.abc.com/robots.txt exists. If the robot finds this file, it determines the scope of its access permissions from the file's contents, as sketched after the table below.

Website URL                 Corresponding robots.txt URL
http://www.w3.org/          http://www.w3.org/robots.txt
http://www.w3.org:80/       http://www.w3.org:80/robots.txt
http://www.w3.org:1234/     http://www.w3.org:1234/robots.txt
http://w3.org/              http://w3.org/robots.txt
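This check can be approximated with Python's standard urllib.robotparser module. The following is a minimal sketch, not a full crawler; www.abc.com is just the placeholder domain from the example above:

from urllib import robotparser

# Fetch the site's robots.txt from the root directory, as a robot would.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.abc.com/robots.txt")  # placeholder domain
rp.read()  # download and parse the file

# Ask whether a given robot may fetch a given URL.
print(rp.can_fetch("*", "http://www.abc.com/some/page.html"))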

3. robots.txt File Format

The robots.txt file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the following format:

";:;".

In this file, # can be used for comments, following the same convention as in UNIX. A record usually begins with one or more User-Agent lines, followed by several Disallow lines. The details are as follows:

User-Agent:
The value of this field names the search engine robot that the record applies to. If the robots.txt file contains multiple User-Agent records, multiple robots are restricted by this protocol; the file must contain at least one User-Agent record. If the value is set to *, the protocol applies to all robots, and there can be only one "User-Agent: *" record in the file.
Disallow:
The value of this field describes a URL that should not be visited. It can be a complete path or a path prefix; any URL that begins with the Disallow value will not be retrieved by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" allows the robot to access /help.html but not /help/index.html.
An empty Disallow value means that all parts of the website may be visited. The robots.txt file must contain at least one Disallow record. If robots.txt is an empty file, the website is open to all search engine robots.
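The prefix matching described above can be verified with Python's standard urllib.robotparser; this is a minimal sketch, and the rules and example.com URLs are hypothetical:

from urllib import robotparser

# Hypothetical rules demonstrating the prefix matching described above.
rules = """\
# '#' starts a comment, as in UNIX
User-Agent: *
Disallow: /help
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# "Disallow: /help" matches any path that begins with /help:
print(rp.can_fetch("*", "http://www.example.com/help.html"))        # False
print(rp.can_fetch("*", "http://www.example.com/help/index.html"))  # False

# With "Disallow: /help/" instead, /help.html would be allowed,
# while /help/index.html would still be blocked.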

4. Examples of robots.txt File Usage

Example 1. Prohibit all search engines from accessing any part of the website

User-Agent: *
Disallow: /

Example 2. Allow all robots to access the entire website

(Alternatively, you can create an empty /robots.txt file.)

User-Agent: *
Disallow:

Example 3. Block a specific search engine
User-Agent: badbot
Disallow: /

Example 4. Allow only a specific search engine
User-Agent: baiduspider
Disallow:

User-Agent: *
Disallow: /

Example 5. A simple example

In this example, the website has three directories that are off-limits to search engines; that is, search engines should not visit these three directories.
Note that each directory must be declared on its own line, rather than written as "Disallow: /cgi-bin/ /tmp/".
The "*" after "User-Agent:" has a special meaning, standing for "any robot", so records such as "Disallow: /tmp/*" or "Disallow: *.gif" must not appear in the file.
User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
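Per-robot matching, as in Example 4 above, can also be checked with the same urllib.robotparser module; this sketch is illustrative, and somebot is an arbitrary robot name:

from urllib import robotparser

# The records from Example 4: only baiduspider may crawl the site.
rules = """\
User-Agent: baiduspider
Disallow:

User-Agent: *
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("baiduspider", "http://www.example.com/"))  # True
print(rp.can_fetch("somebot", "http://www.example.com/"))      # False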



What is the name of the Baidu spider in robots.txt?
It is "baiduspider", written entirely in lowercase.
