On Writing a Website's robots.txt


A webmaster's job is to design an attractive website and present its rich content to visitors. Naturally, we also want a well-designed site to achieve good rankings, which requires studying the ranking rules of search engines so the site gets the greatest possible exposure to customers. However, there are many search engines, and each has its own rules: a page that ranks well on one engine may not rank the same on another. To cope with this, some people duplicate the same content to target different engines' ranking rules. But once a search engine discovers a large number of "cloned" pages, it will penalize the site and refuse to index the duplicates. On the other hand, some of a website's content is private and should not be exposed to search engines at all. The robots.txt file exists to solve both of these problems.

I. Search engines and their corresponding user-agents

So, what are the common search engines and their corresponding user-agents? Some are listed below for reference.

Search Engine        User-agent
AltaVista            Scooter
Baidu                Baiduspider
InfoSeek             InfoSeek
HotBot               Slurp
AOL Search           Slurp
Excite               ArchitextSpider
Google               Googlebot
GoTo                 Slurp
Lycos                Lycos
MSN                  Slurp
Netscape             Googlebot
NorthernLight        Gulliver
WebCrawler           ArchitextSpider
iWon                 Slurp
FAST                 FAST
DirectHit            Grabber
Yahoo Web Pages      Googlebot
LookSmart Web Pages  Slurp

II. Basic concepts of robots.txt

Robots.txt is a plain-text file at the root of a website, written for search engine spiders to read. When a spider crawls a site, it fetches this file first and uses its contents to determine which files on the site it may access. This lets us keep certain files from being exposed to search engines and effectively control the spider's crawl path, creating the conditions a webmaster needs for good SEO. It is particularly useful when a site has just been created and some of its content is not yet ready to be indexed.

Note that crawlers only honor the robots.txt located at the site root (e.g. http://www.example.com/robots.txt); a copy placed inside a subdirectory is ignored, although the root file's rules can restrict crawling of any directory.

A few notes:

Every site should have a robots.txt file, even an empty one.

The filename must be entirely lowercase: robots.txt (not robot.txt).

To completely block a file from search results, robots.txt needs to be combined with the meta robots attribute.

III. Basic syntax of robots.txt

The basic format of each record is a key: value pair.

1) The User-agent key

The value names the crawler of a specific search engine, e.g. Baiduspider for Baidu or Googlebot for Google.

Generally we write this:

User-agent: *

This indicates that the rules apply to all search engine spiders. If you want to address only a certain spider, list its name instead; for several spiders, write a User-agent line (with its own rules) for each.

Note: there is a space after User-agent:. In robots.txt, every key is followed by a colon and a space, which separates it from its value.
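When rules differ per crawler, each spider gets its own User-agent group, with a catch-all group for everyone else. A minimal sketch (the directory names here are hypothetical):

```text
User-agent: Baiduspider
Disallow: /no-baidu/

User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:
```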

2) The Disallow key

This key lists a URL path that search engine spiders are not allowed to crawl.

Example: Disallow: /index.php forbids crawling of the site's index.php file.

3) The Allow key

This key lists a URL path that search engine spiders are allowed to crawl.

Example: Allow: /index.php allows crawling of the site's index.php file.

4) The wildcard character *

Matches any sequence of characters.

Example: Disallow: /*.jpg blocks all of the site's .jpg files.

5) The end-of-URL character $

Matches URLs that end with the preceding characters.

Example: Disallow: /?$ blocks every URL whose path ends in /?.
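The two wildcard rules above can be captured in a few lines of Python by translating a robots.txt path pattern into a regular expression. This is a simplified sketch of the matching behavior described here, not any search engine's exact implementation:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern using * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore * as "any characters".
    regex = re.escape(pattern).replace(r"\*", ".*")
    # Without a trailing $, a robots rule is a prefix match.
    return re.compile(regex + ("$" if anchored else ""))

def blocked(pattern: str, path: str) -> bool:
    """Return True if the given URL path is matched by the pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None

# Disallow: /*.jpg$  -- blocks any path that ends in .jpg
print(blocked("/*.jpg$", "/images/photo.jpg"))      # True
print(blocked("/*.jpg$", "/images/photo.jpg?x=1"))  # False

# Disallow: /*?*  -- blocks any dynamic URL (one containing "?")
print(blocked("/*?*", "/index.php?id=3"))           # True
print(blocked("/*?*", "/index.php"))                # False
```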

IV. robots.txt case analysis

Example 1. Prohibit all search engines from accessing any part of the site

User-agent: *

Disallow: /

Example 2. Allow all search engines to access any part of the site

User-agent: *

Disallow:

Example 3. Prohibit only Baiduspider from accessing your website

User-agent: Baiduspider

Disallow: /

Example 4. Allow only Baiduspider to access your website

User-agent: Baiduspider

Disallow:

Example 5. Prohibit spider access to specific directories

User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/

Disallow: /data/

Notes: 1) The three directories must be written as separate Disallow lines. 2) Note the trailing slash after each directory. 3) With the trailing slash, only the contents of the directory are blocked; without it, any path that begins with those characters is blocked as well.
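The directory rules above can be checked with Python's standard urllib.robotparser, which implements the simple prefix matching described here (note it does not understand the * and $ wildcards):

```python
from urllib import robotparser

# The example rules above: block three directories for all spiders.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /data/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Anything inside a blocked directory is off limits.
print(rp.can_fetch("*", "https://example.com/tmp/file.html"))  # False

# With the trailing slash, the rule matches only paths under /tmp/,
# so the bare path /tmp itself is still fetchable.
print(rp.can_fetch("*", "https://example.com/tmp"))            # True
```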

Example 6. Allow access to partial URLs in a specific directory

Suppose we want only b.htm inside directory /a/ to be accessible; how is that written?

User-agent: *

Allow: /a/b.htm

Disallow: /a/

Note: an Allow rule takes priority over a Disallow rule here. In Google's implementation, the most specific (longest) matching rule wins; some other parsers simply apply rules in order, so it is safest to put the more specific Allow line first.
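The Allow-plus-Disallow pattern can also be verified with Python's urllib.robotparser. Since this parser applies the first matching rule, the more specific Allow line is listed first (the example.com URLs are illustrative):

```python
from urllib import robotparser

# Allow one file inside an otherwise blocked directory.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /a/b.htm
Disallow: /a/
""".splitlines())

# The Allow rule matches first, so b.htm is fetchable.
print(rp.can_fetch("*", "https://example.com/a/b.htm"))  # True

# Everything else under /a/ falls through to the Disallow rule.
print(rp.can_fetch("*", "https://example.com/a/c.htm"))  # False
```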

Examples 7 and 8 illustrate the use of the wildcard characters: "$" (end of URL) and "*" (any sequence of characters).

Example 7. Disable access to all dynamic pages in a Web site

User-agent: *

Disallow: /*?*

Example 8. Prevent search engines from crawling all the pictures on the site

User-agent: *

Disallow: /*.jpg$

Disallow: /*.jpeg$

Disallow: /*.gif$

Disallow: /*.png$

Disallow: /*.bmp$

Many other situations need to be analyzed case by case, but once you understand these syntax rules and the wildcard characters, most of them can be handled.

V. The meta robots tag

The meta robots tag is content placed in the head section of a web page's HTML file. It states the crawl rules of that page for search engines. Unlike robots.txt, it applies only to the HTML file in which it is written.

It is written as:

<meta name="robots" content="..." />

where the content attribute takes one or more of the values listed below:

noindex - prevents the page from being indexed.

nofollow - prevents the crawler from following the hyperlinks in the page.

noarchive - prevents a cached snapshot of the page from being saved.

nosnippet - prevents a summary of the page from being shown in search results, and also prevents a cached snapshot from being saved.

noodp - prevents the description from the Open Directory Project from being used as the page's summary in search results.

VI. Testing robots.txt

In Google Webmaster Tools, after adding a site, use the crawler access tool in the left-hand menu to test the site's robots.txt; see the figure for details.

  

This concludes the introduction to robots.txt and the meta robots tag; hopefully you now have a more detailed understanding of robots. Using robots well plays an important role in a site's SEO: it effectively shields the pages we do not want search engines to crawl (pages that offer users little value), so that the pages that should rank for keywords are fully exposed to customers. This improves the search engines' evaluation of the site and helps us achieve better keyword rankings.

This article was written by Gizhigang of Idsem Group. Copyright link: http://www.idsem.com. Please respect copyright and credit the source when reproducing.


