Using robots.txt to keep search engines from indexing your content

One. What is a robots.txt file?

Search engines use a program called a robot (also known as a spider) to automatically visit web pages on the Internet and collect information about them.

You can create a plain text file named robots.txt on your website. In it you declare the parts of the site that you do not want robots to visit, so that some or all of the site's content is kept out of search engines, or so that a given search engine indexes only the content you specify.

Two. Where should the robots.txt file be placed?

The robots.txt file must be placed in the root directory of the site. When a robot visits a website, for example http://www.abc.com, it first checks whether the file http://www.abc.com/robots.txt exists. If it finds the file, it determines the scope of its access rights from the file's contents.

Website URL                Corresponding robots.txt URL
http://www.w3.org/         http://www.w3.org/robots.txt
http://www.w3.org:80/      http://www.w3.org:80/robots.txt
http://www.w3.org:1234/    http://www.w3.org:1234/robots.txt
http://w3.org/             http://w3.org/robots.txt
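
As the table shows, the robot simply keeps the scheme and host (including any port) and replaces the path with /robots.txt. Here is a minimal Python sketch of that rule (the helper name robots_url is our own, for illustration only):

from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url: str) -> str:
    # Keep scheme and host:port, replace the path, drop query and fragment.
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

for url in ("http://www.w3.org/", "http://www.w3.org:1234/", "http://w3.org/"):
    print(robots_url(url))
# Prints the corresponding robots.txt URLs from the table above.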

Three. robots.txt file format

The "robots.txt" file contains one or more records that are separated by a blank line (with CR,CR/NL, or NL as The Terminator), and each record is formatted as follows:

"<field>:<optionalspace><value><optionalspace>".

You can use # for comments in this file, following the same convention as in UNIX. A record usually starts with one or more User-agent lines, followed by a number of Disallow lines, as detailed below:

User-agent:
The value of this field is the name of a search-engine robot. If the "robots.txt" file contains more than one User-agent record, then more than one robot is constrained by this protocol; in any case, the file must contain at least one User-agent record. If this field is set to *, the record applies to every robot, and there can be only one "User-agent: *" record in the "robots.txt" file.

Disallow:
The value of this field is a URL that you do not want robots to visit. It can be a complete path or a prefix of one: any URL that begins with the Disallow value will not be fetched. For example, "Disallow: /help" forbids search-engine access to both /help.html and /help/index.html, while "Disallow: /help/" lets robots visit /help.html but not /help/index.html.
An empty Disallow value means that all parts of the site may be visited. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search-engine robots.
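
To see the Disallow prefix rule in action, here is a small sketch using Python's standard urllib.robotparser module (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

# "Disallow: /help" blocks every URL whose path begins with /help.
rp1 = RobotFileParser()
rp1.parse(["User-agent: *", "Disallow: /help"])
print(rp1.can_fetch("*", "http://example.com/help.html"))        # False
print(rp1.can_fetch("*", "http://example.com/help/index.html"))  # False

# "Disallow: /help/" blocks only the /help/ subtree.
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "http://example.com/help.html"))        # True
print(rp2.can_fetch("*", "http://example.com/help/index.html"))  # False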

Four. robots.txt usage examples

Example 1. Prohibit all search engines from accessing any part of the site

User-agent: *
Disallow: /

Example 2. Allow all robots to access everything

(Alternatively, you can create an empty "/robots.txt" file.)

User-agent: *
Disallow:

Example 3. Block a specific search engine

User-agent: badbot
Disallow: /

Example 4. Allow only one search engine

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
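
As a quick check of this record, here is a sketch using Python's urllib.robotparser (example.com is a placeholder host):

from urllib.robotparser import RobotFileParser

# Example 4: only baiduspider may crawl; every other robot is blocked.
rules = """\
User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("baiduspider", "http://example.com/page.html"))   # True
print(rp.can_fetch("SomeOtherBot", "http://example.com/page.html"))  # False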

Example 5. A simple example

In this example, the site keeps search engines out of three directories; that is, search engines should not visit these three directories.
Note that each directory must be declared on a separate line; they must not be combined as "Disallow: /cgi-bin/ /tmp/".
Also note that * in the User-agent field has the special meaning "any robot", so records such as "Disallow: /tmp/*" or "Disallow: *.gif" must not appear in the file.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

Five. robots.txt references

For more specific settings for robots.txt files, refer to the following links:

· Web Server Administrator's Guide to the Robots Exclusion Protocol
· HTML Author's Guide to the Robots Exclusion Protocol
· The original 1994 protocol description, as currently deployed
· The revised Internet-Draft specification, which is not yet completed or implemented
Designing a roadmap for Web robots on your home page

The Internet keeps getting cooler, and the popularity of the WWW is at its zenith. Publishing company information and conducting e-commerce on the Internet has evolved from vogue into fashion. As a web master, you may be familiar with HTML, JavaScript, Java, and ActiveX, but do you know what a Web robot is? Do you know what Web robots have to do with the home page you are designing?

Tramps on the Internet: the Web robot

Sometimes you will discover, without knowing why, that your home page has been indexed by a search engine, even though you have never had any contact with that site. This is the work of Web robots. A Web robot is a program that can traverse the hypertext structure of large numbers of Internet URLs and recursively retrieve all the content of a website. These programs are sometimes called spiders, Web wanderers, Web worms, or Web crawlers. Well-known search-engine sites such as Lycos, WebCrawler, and AltaVista all run dedicated Web robot programs to collect information, as do Chinese search-engine sites such as Polaris, NetEase, and GOYOYO.

A Web robot is like an uninvited guest: whether you care or not, it will loyally carry out its owner's errand, laboring tirelessly across the World Wide Web, and of course it will visit your home page, retrieve its content, and generate the record format it needs. Perhaps you are happy for some of your content to be known to the world, but there may be parts you would rather not have examined and indexed. Must you simply let robots run rampant through your pages, or can you command and control their whereabouts? The answer is yes. After reading this article, you will be able to post signs like a traffic officer, telling Web robots how to crawl your home page: what may be retrieved and what may not be visited.

In fact, Web robots can understand you

Do not assume that Web robots roam around without rules or restraint. Many Web robot programs give website administrators and web content authors two ways to limit where robots go:

1. The Robots Exclusion Protocol

The website administrator can create a specially formatted file on the site to state which parts of the site robots may visit. The file is placed in the root directory of the site, i.e. http://.../robots.txt.

2. The Robots META tag

A web page author can use a special HTML META tag to indicate whether a page may be indexed, analyzed, or followed for links.

These methods work with most Web robots, but whether they are honored depends on each robot's developer; they are not guaranteed to be effective for every robot. If you urgently need to protect your content, consider additional protection methods, such as passwords.

Using the Robots Exclusion Protocol

When a robot visits a website, such as http://www.sti.net.cn/, it first checks for the file http://www.sti.net.cn/robots.txt. If the file exists, the robot parses it according to the record format:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

to determine whether it should retrieve the site's files. These records are intended for Web robots only; ordinary visitors will probably never see this file, so do not whimsically add HTML-like statements or fake greetings such as "How are you? Where are you from?".
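
This check can be sketched with Python's standard urllib.robotparser, which downloads and parses the file for you ("MyCrawler" is a made-up agent name for illustration):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.sti.net.cn/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Ask before fetching, exactly as a polite robot would.
for path in ("/", "/tmp/page.html", "/~joe/index.html"):
    url = "http://www.sti.net.cn" + path
    print(url, "->", rp.can_fetch("MyCrawler", url))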

A site can have only one "/robots.txt" file, and every letter of the file name must be lowercase. Each "Disallow" line in a robot record stands for one URL that you do not want robots to visit; every URL must be on a line of its own, so incorrect statements such as "Disallow: /cgi-bin/ /tmp/" must not appear. A record may not contain blank lines, either, because a blank line is the separator between multiple records.

The User-agent line gives the name of the robot or other agent. In a User-agent line, '*' has a special meaning: all robots.

Here are a few robots.txt examples:

Reject all robots on the entire server:
User-agent: *
Disallow: /

Allow all robots to access the entire site:
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.

Allow all robots to access part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To reject a specific robot:
User-agent: badbot
Disallow: /

Allow only one robot to visit:
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

Finally, here is the robots.txt of the site http://www.w3.org/:

# For use by search.w3.org
User-agent: W3Crobot/1
Disallow:

User-agent: *
Disallow: /member/ # This is restricted to W3C Members only
Disallow: /team/ # This is restricted to the Consortium only
Disallow: /TandS/member # This is restricted to W3C Members only
Disallow: /TandS/team # This is restricted to the Consortium only
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /Team

Using the Robots META tag

The Robots META tag lets an HTML page author indicate whether a page may be indexed and whether its links may be followed to find more files. At present only some robots implement this feature.

The format of the Robots META tag is:
<meta name="robots" content="noindex,nofollow">
Like any other META tag, it should be placed in the HEAD section of the HTML file:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ...">
<title>...</title>
</head>
<body>
...

Robots META tag directives are separated by commas. The directives available are [NO]INDEX and [NO]FOLLOW. The INDEX directive tells an indexing robot whether it may index the page; the FOLLOW directive tells a robot whether it may follow the page's links. The defaults are INDEX and FOLLOW. For example:
<meta name= "Robots" content= "Index,follow" >
<meta name= "Robots" content= "Noindex,follow" >
<meta name= "Robots" content= "Index,nofollow" >
<meta name= "Robots" content= "Noindex,nofollow" >

A good website administrator takes robot management into account, letting robots serve the site's pages without compromising the security of those pages.
