# What is the robots.txt file?
Search engines use a program called a robot (also known as a spider) to automatically access webpages on the Internet and collect their content.
You can create a plain-text file named robots.txt on your website to declare which parts of the site you do not want robots to visit. This way, some or all of the site's content can be kept out of search engine indexes, or you can instruct a specific search engine to index only the content you designate.
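As a minimal illustration (the directives are standard; the file itself is hypothetical), a robots.txt that asks every robot to stay away from the whole site looks like this:

```
User-agent: *
Disallow: /
```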
# Where is the robots.txt file stored?
The robots.txt file must be placed in the root directory of the website. When a robot visits a site (say http://www.abc.com), it first checks whether http://www.abc.com/robots.txt exists; if it does, the robot determines the scope of its access permissions from the file's contents.
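As a sketch of that first request using only Python's standard library (http://www.abc.com is the placeholder domain from the example above):

```python
from urllib.request import urlopen
from urllib.error import HTTPError

# A robot's first stop: the robots.txt at the site root.
# http://www.abc.com is the placeholder domain from the text above.
try:
    with urlopen("http://www.abc.com/robots.txt") as resp:
        print(resp.read().decode("utf-8", errors="replace"))
except HTTPError:
    # No robots.txt found: the robot may crawl without restrictions.
    print("no robots.txt; crawling is unrestricted")
```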
# Robots.txt File Format
The "robots.txt" file contains one or more records separated by empty rows (with CR, CR/NL, or NL as the terminator). The format of each record is as follows:
"<Field >:< optionalspace> <value> <optionalspace> ".
Comments can be added to this file with "#", following the same convention as in UNIX. A record typically begins with one or more User-agent lines, followed by several Disallow lines, as described below:
User-agent:
The value of this field names the search engine robot that the record applies to. If there are multiple User-agent records in "robots.txt", multiple robots are restricted by the protocol; the file must contain at least one User-agent record. If the value is set to "*", the protocol applies to all robots, and only one "User-agent: *" record may appear in the file.
Disallow:
The value of this field describes a URL that robots should not visit. The value can be a full path or a prefix; any URL that begins with the Disallow value will not be retrieved by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" lets the robot access /help.html but not /help/index.html. An empty Disallow value means every part of the site may be visited, and the file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.
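The /help versus /help/ distinction above can be checked with Python's built-in urllib.robotparser; this sketch reuses the illustrative paths from the paragraph:

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /help" blocks every URL whose path starts with /help.
rp1 = RobotFileParser()
rp1.parse(["User-agent: *", "Disallow: /help"])
print(rp1.can_fetch("*", "http://www.abc.com/help.html"))        # False
print(rp1.can_fetch("*", "http://www.abc.com/help/index.html"))  # False

# "Disallow: /help/" blocks only the /help/ directory.
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "http://www.abc.com/help.html"))        # True
print(rp2.can_fetch("*", "http://www.abc.com/help/index.html"))  # False
```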
For example, http://www.slenk.net/robots.txt can be retrieved. If the file exists, it might reveal something like the following:
User-agent: *
Disallow: /admini/
The Disallow line points straight at a directory the administrator does not want visitors to know about.
Once we have that directory, we can apply the same approach and point a scanner at it to look for the admin backend, as sketched below.
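A hedged sketch of that reconnaissance step (http://www.slenk.net is the example host from above, and the parsing is deliberately naive):

```python
from urllib.request import urlopen

# Example host from the text; substitute whatever target is in scope.
base = "http://www.slenk.net"
with urlopen(base + "/robots.txt") as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Keep the path after every "Disallow:" line. These are exactly the
# directories the administrator asked robots to skip.
for line in body.splitlines():
    if line.strip().lower().startswith("disallow:"):
        path = line.split(":", 1)[1].strip()
        if path:
            print(base + path)  # e.g. http://www.slenk.net/admini/
```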
Of course, if the webmaster has renamed the backend files to something unguessable like "@ # $ D $ # $", you are out of luck.