Search engine spiders and the website robots.txt file [reprint]

Source: Internet
Author: User

We know that search engines have their own "search robots" (ROBOTS): programs that follow the links on web pages (usually HTTP hyperlinks and src references) and continually crawl data to build their own databases.

For site managers and content providers, there is sometimes content on a site that they do not want robots to crawl and expose. To solve this problem, the robots community offers two tools: one is robots.txt, and the other is the Robots META tag.

Note: writing robots.txt correctly is very important for how search engines crawl a site. Try to follow the standard format when writing the declarations; otherwise mistakes may prevent search engines from crawling the site properly. The robots.txt detection tool in Google Sitemaps checks whether a robots.txt file exists on a site and whether it is written correctly.

First, robots.txt

1. What is robots.txt?

robots.txt is a plain text file. By making declarations in this file, a site can specify the parts it does not want robots to visit, so that some or all of the site's content is not indexed by search engines, or so that a designated search engine indexes only the specified content.

When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory. If the file is found, the robot determines the scope of its access from the file's contents; if the file does not exist, the robot simply crawls along the links.

robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.

Site URL                  Corresponding robots.txt URL

http://www.w3.org/        http://www.w3.org/robots.txt

http://www.w3.org:80/     http://www.w3.org:80/robots.txt

2. robots.txt syntax

The "robots.txt" file contains one or more records that are separated by a blank line (with CR,CR/NL,ORNL as The Terminator) and can be annotated with # in the file, using the same methods as in Unix. The records in this file typically start with one or more lines of user-agent, followed by several disallow lines, as detailed below:

User-agent:
The value of this field is the name of the search engine robot being addressed. If the robots.txt file contains multiple User-agent records, more than one robot is bound by the protocol, and the file must contain at least one User-agent record. If the value is set to *, the record applies to any robot, and there can be only one "User-agent: *" record in the file. For the names of search engine robots, see the article "Complete List of Search Engine Spider Program Names".
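Several robots can also share one record by listing multiple User-agent lines before the Disallow lines, for example (a sketch; the blocked path is hypothetical):

User-agent: Googlebot
User-agent: Baiduspider
Disallow: /cgi-bin/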

Disallow:
The value of this field is a URL that should not be visited; it can be a full path or a prefix, and any URL that begins with the value of a Disallow line will not be visited by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" lets the robot access /help.html but not /help/index.html.
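Side by side, using the paths from the example above (the trailing comments assume the parser accepts end-of-line # comments):

Disallow: /help     # blocks /help.html and /help/index.html
Disallow: /help/    # blocks /help/index.html, but /help.html stays reachable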

An empty Disallow value means that all parts of the site may be visited, and the robots.txt file must contain at least one Disallow line. If robots.txt is an empty file, the site is open to all search engine robots.


Here are some basic uses of robots.txt:

All search engines are prohibited from accessing any part of the site:
User-agent: *
Disallow: /

Allow all robots full access:
User-agent: *
Disallow:
Alternatively, you can create an empty robots.txt file.

Prohibit all search engines from accessing several parts of the site (the cgi-bin, tmp, and private directories in the following example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Prohibit access by one particular search engine (BadBot in the following example):
User-agent: BadBot
Disallow: /

Allow access by only one search engine (WebCrawler in the following example; the wildcard record blocks everyone else):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
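A single page can also be excluded by giving its path in full, for example (a sketch; the file name is hypothetical):

User-agent: *
Disallow: /private/old-page.html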


3. Common search engine robot names

Robot name        Search engine URL

Baiduspider       http://www.baidu.com

Scooter           http://www.altavista.com

ia_archiver       http://www.alexa.com

Googlebot         http://www.google.com

Inktomi Slurp     http://www.yahoo.com

FAST-WebCrawler   http://www.alltheweb.com

Slurp             http://www.inktomi.com

MSNBot            http://search.msn.com


4. robots.txt examples

Here are the robots.txt files of some well-known sites:

http://www.google.com/robots.txt

http://www.ibm.com/robots.txt

http://www.sun.com/robots.txt

http://www.eachnet.com/robots.txt

Baidu's robots.txt: http://www.baidu.com/robots.txt

The robots.txt file of the Black Dream SEO blog: http://www.bloghuman.com/robots.txt


5. Common robots.txt Errors

• Reversed order:
The wrong way to write it:
User-agent: *
Disallow: Googlebot

The correct way:
User-agent: Googlebot
Disallow: /

• Multiple forbidden paths on one line:
For example, it is wrong to write:
Disallow: /css/ /cgi-bin/ /images/

The correct way:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/

• Leading whitespace before a line:
For example:
    Disallow: /cgi-bin/
Although the standard says nothing about this, leading spaces are prone to cause problems.

• 404 redirects to another page:
When a robot requests robots.txt from a site that has none, it may be automatically redirected (via the 404 handler) to another HTML page, and the robot will often process that HTML page as if it were a robots.txt file. Although this generally causes no harm, it is best to place an empty robots.txt file in the site's root directory.

• Using all uppercase:
USER-AGENT: EXCITE
DISALLOW:
Although the standard's field names are not case-sensitive, directories and file names are, and the conventional form is:
User-agent: Excite
Disallow:

• Using Allow: the syntax has only Disallow, no Allow!
The wrong way to write it:
User-agent: Baiduspider
Disallow: /john/
Allow: /jane/
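Under this syntax the same intent is expressed with Disallow alone, since anything not disallowed is implicitly allowed (a sketch):

User-agent: Baiduspider
Disallow: /john/

/jane/ is not mentioned, so it remains open to Baiduspider.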

• Forgetting the slash /
The wrong way to write it:
User-agent: Baiduspider
Disallow: css

The correct way:
User-agent: Baiduspider
Disallow: /css/

The following tool checks the validity of robots.txt files:

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi


Second, the Robots META tag

1. What is the Robots META tag?

The robots.txt file mainly restricts a search engine's access to an entire site or to directories, whereas the Robots META tag targets one specific page. Like other META tags (such as those for the language used, the page description, or keywords), the Robots META tag is placed in the <head> section of the page and is used specifically to tell search engine robots how to crawl that page's content.
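A minimal sketch of where the tag sits (the title and description here are placeholders):

<html>
<head>
<title>Example page</title>
<meta name="description" content="a placeholder description">
<meta name="robots" content="index,follow">
</head>
<body>...</body>
</html>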


2. How to write the Robots META tag

The Robots META tag is not case-sensitive. name="robots" addresses all search engines; to address a specific engine, use its robot name instead, as in name="Baiduspider". The content part has four directive options: index, noindex, follow, and nofollow, separated by commas.
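The four directives combine into these standard forms; the last line shows the engine-specific variant (Baiduspider is used here only as an example name):

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
<meta name="Baiduspider" content="index,follow">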

The index directive tells the search robot that the page may be crawled and indexed;

The follow directive tells the search robot that it may continue crawling along the links on the page;

The default for the Robots META tag is "index, follow", except for Inktomi, whose default is "index, nofollow".


Note: the robots.txt and Robots META tag methods of restricting what search engine robots (ROBOTS) crawl are only conventions. They require the cooperation of the robots, and not every robot obeys them.


At present, the vast majority of search engine robots abide by the rules of robots.txt. Support for the Robots META tag is not yet as widespread, but it is growing; the well-known search engine Google fully supports it, and Google has added a directive, "archive", that controls whether Google keeps a snapshot of the page. For example:

<meta name="googlebot" content="index,follow,noarchive">

tells Google to crawl the page and follow the links on it, but not to keep a snapshot of the page.

Example: a complete robots.txt file from a real site, with one record per robot (the file opens with a # comment):

# robots, scram

User-agent: *
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /pr
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: mozilla/3.01 (hotwired-test/0.1)
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /pr
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: slurp
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /pr
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: scooter
Disallow: /cgi-bin
Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /pr
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: ultraseek
Disallow: /cgi-bin
# Disallow: /transcripts
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /jobs
Disallow: /pr
Disallow: /interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search

User-agent: smallbear
Disallow: /cgi-bin
Disallow: /java
Disallow: /images
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /webmaster_logs
Disallow: /virtual
Disallow: /shockwave
Disallow: /transcripts
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search
Disallow: /alt_index.html

User-agent: googlebot
Disallow: /cgi-bin
Disallow: /java
Disallow: /images
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /webmaster_logs
Disallow: /virtual
Disallow: /shockwave
Disallow: /transcripts
Disallow: /newscenter
Disallow: /virtual
Disallow: /digest
Disallow: /quicknews
Disallow: /search
Disallow: /alt_index.html
Disallow:/alt_index.html
