A webmaster's job is to design an attractive website and present its content to the public. Naturally, we also want a well-designed site to rank well, which means studying how search engines rank pages so that the site is shown to as many potential customers as possible. There are many search engines, however, and a site that ranks well on one may not rank as well on another, because each engine applies its own rules. To cope with this, some people copy the same content into several pages, each tuned to a different engine's ranking rules. But once a search engine detects a large number of such "clone" pages, it penalizes the site and stops indexing the duplicates. In addition, some of the content on our site is private and should not be exposed to search engines. robots.txt exists to solve both of these problems.
I. Search engines and their corresponding user-agents
So which search engines are there, and which user-agent does each one use? Some are listed below for reference.
Search Engine          User-agent
AltaVista              scooter
Baidu                  Baiduspider
InfoSeek               InfoSeek
HotBot                 slurp
AOL Search             Slurp
Excite                 Architextspider
Google                 Googlebot
Goto                   slurp
Lycos                  Lycos
MSN                    slurp
Netscape               Googlebot
Northernlight          Gulliver
WebCrawler             Architextspider
Iwon                   slurp
Fast                   Fast
Directhit              Grabber
Yahoo Web Pages        Googlebot
LookSmart Web Pages    slurp
II. Basic concepts of robots
robots.txt is a file on the website that is meant to be read by search engine spiders. When a spider arrives at our site, this is the first file it fetches, and it decides which parts of the site it may access based on the file's contents. The file keeps some of our content from being exposed to search engines and so lets us control the spider's crawl path, which creates the conditions a webmaster needs for good SEO. It is especially useful when a site has just been launched and some content is unfinished and should not be indexed yet.
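robots.txt must be placed in the root directory of the site, because spiders only request /robots.txt and ignore copies placed in subdirectories. The crawl scope for the files in a particular directory is set by writing that directory's path in the rules of the root file.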
A few notes:
The website should have a robots.txt file.
The filename must be all lowercase.
To keep a page out of search results completely, combine robots.txt with the robots meta tag in the page itself.
III. Basic syntax of robots.txt
Each entry has the basic format of a key: value pair.
1) User-agent key
The value is the name of a specific search engine crawler. For example, Baidu's is Baiduspider and Google's is Googlebot.
Generally we write this:
User-agent: *
This means the rules that follow apply to all search engine spiders. If the rules should apply only to a certain spider, write its name from the list above instead of *. For more than one spider, repeat the User-agent line for each name.
Note: User-agent: is followed by a space.
In robots.txt, the key is followed by a colon and then a space, which separates it from the value.
2) Disallow key
This key gives a URL path that search engine spiders are not allowed to crawl.
For example: Disallow: /index.php blocks the site's index.php file.
3) Allow key
This key gives a URL path that search engine spiders are allowed to crawl.
For example: Allow: /index.php allows the site's index.php file.
4) Wildcard character *
Represents any number of characters.
For example: Disallow: /*.jpg blocks all jpg files on the site.
5) End character $
Matches a URL that ends with the preceding characters.
For example: Disallow: /*?$ blocks every URL on the site that ends with ?.
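To make the key: value format concrete, here is a minimal Python sketch that splits robots.txt text into key and value pairs. The function name parse_directives and the sample rules are my own illustration, not part of any library, and the sketch deliberately ignores how rules are grouped under each User-agent.

```python
def parse_directives(text):
    """Split robots.txt text into (key, value) pairs.

    Hypothetical helper for illustration only: keys are lower-cased,
    '#' comments and blank lines are skipped.
    """
    directives = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        directives.append((key.strip().lower(), value.strip()))
    return directives


sample = """\
# sample rules, made up for this sketch
User-agent: *
Disallow: /index.php
Allow: /public/
"""

pairs = parse_directives(sample)
for key, value in pairs:
    print(f"{key} -> {value}")
```

Splitting on the first colon only matters because a value may itself contain a colon, for example a full URL.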
IV. robots.txt case analysis
Example 1. Block all search engines from accessing any part of the site
User-agent: *
Disallow: /
Example 2. Allow all search engines to access any part of the site
User-agent: *
Disallow:
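Example 3. Block only Baiduspider from accessing your website
User-agent: Baiduspider
Disallow: /
Example 4. Allow only Baiduspider to access your website
User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
Note: The second record is what blocks every other spider. Without it, all spiders would still be allowed in.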
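Example 5. Block spiders from accessing specific directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /data/
Note: 1) Each of the three directories must be declared on its own line. 2) Remember the trailing slash. 3) A path with a trailing slash differs from one without: Disallow: /tmp/ blocks only the contents of the /tmp/ directory, while Disallow: /tmp blocks every path that begins with /tmp, such as /tmp.html.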
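The effect of the trailing slash can be checked with a short Python sketch using the standard urllib.robotparser module. The helper name build_parser, the Googlebot agent string, and the test paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


with_slash = build_parser("User-agent: *\nDisallow: /tmp/\n")
without_slash = build_parser("User-agent: *\nDisallow: /tmp\n")

# Compare how each rule treats a file inside /tmp/, a file whose
# name merely starts with /tmp, and an unrelated page.
for path in ("/tmp/cache.html", "/tmp.html", "/index.html"):
    print(
        path,
        "| /tmp/ rule allows:", with_slash.can_fetch("Googlebot", path),
        "| /tmp rule allows:", without_slash.can_fetch("Googlebot", path),
    )
```

With the slash, only paths under the directory are blocked; without it, /tmp.html is blocked as well because the rule is a plain prefix match.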
Example 6. Allow access to some URLs in a specific directory
Suppose that in directory a only b.htm should be accessible. How do we write that?
User-agent: *
Allow: /a/b.htm
Disallow: /a/
Note: Here the Allow rule takes priority over the Disallow rule. Keep the Allow line first, since some parsers apply the first rule that matches.
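A quick way to confirm the behavior of Example 6 is to feed the same rules to Python's standard urllib.robotparser. This is only a sketch: the Googlebot agent string and the extra paths /a/c.htm and /index.html are made up for the check, and this parser applies the first matching rule in file order, which may differ from how a particular search engine resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_6 = """\
User-agent: *
Allow: /a/b.htm
Disallow: /a/
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_6.splitlines())  # parse in-memory text, no download

allowed_page = parser.can_fetch("Googlebot", "/a/b.htm")    # matched by Allow
blocked_page = parser.can_fetch("Googlebot", "/a/c.htm")    # matched by Disallow
outside_page = parser.can_fetch("Googlebot", "/index.html") # no rule matches

print(allowed_page, blocked_page, outside_page)
```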
The examples from Example 7 onward illustrate the use of wildcard characters. The wildcards are "$" (end of URL) and "*" (any sequence of characters).
Example 7. Block access to all dynamic pages on a website
User-agent: *
Disallow: /*?*
Example 8. Prevent search engines from crawling all the images on the site
User-agent: *
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
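To see how the wildcard rules in Examples 7 and 8 match URLs, the patterns can be translated into regular expressions. The Python sketch below is a simplified model of that matching, not the code any search engine actually runs; the names pattern_to_regex and is_blocked and the test paths are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    match at the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


dynamic_rules = ["/*?*"]  # Example 7
image_rules = ["/*.jpg$", "/*.jpeg$", "/*.gif$", "/*.png$", "/*.bmp$"]  # Example 8

print(is_blocked("/list.php?id=3", dynamic_rules))       # dynamic page
print(is_blocked("/list.php", dynamic_rules))            # static path
print(is_blocked("/images/logo.png", image_rules))       # ends with .png
print(is_blocked("/images/logo.png?v=2", image_rules))   # '$' no longer matches
```

The last call shows why the "$" terminator matters: an image URL followed by a query string no longer ends with the extension, so the Example 8 rules do not cover it.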
Many other cases call for case-by-case analysis. Once you understand these syntax rules and the use of wildcard characters, I believe most situations can be handled.
V. Meta robots tags
The meta tag sits inside the head tag of a web page's HTML file. It tells search engines the crawl rules for that HTML file. Unlike robots.txt, it applies only to the HTML file in which it is written.
Syntax:
<meta name="robots" content="..." />
The values that can replace ... are listed below:
noindex - Prevents the page from being indexed.
nofollow - Prevents the spider from following any hyperlinks in the page.
noarchive - Does not save a snapshot (cached copy) of the page.
nosnippet - Does not display the page's summary in the search results, and does not save a snapshot of the page.
noodp - Does not use the page's Open Directory Project description as its summary in the search results.
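To check which of these values a page actually declares, the meta tag can be read with Python's standard html.parser module. The sketch below is illustrative only: the class name RobotsMetaParser, the helper robots_directives, and the sample page are assumptions of mine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Values are comma-separated, e.g. "noindex, nofollow"
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow" />'
    "</head><body>Unfinished page</body></html>"
)
found = robots_directives(page)
print(sorted(found))  # ['nofollow', 'noindex']
```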
VI. Testing robots.txt
In Google Webmaster Tools, after adding a site, you can use the crawler access tool in the left-hand menu to test the site's robots.txt. See the figure for details.
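Rules can also be checked locally before they are uploaded. The sketch below runs the rules from Example 3 through Python's standard urllib.robotparser; the helper name check_access and the path /index.html are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /
"""


def check_access(robots_txt, agents, path):
    """Return {agent: allowed} for one path, using in-memory rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


results = check_access(ROBOTS_TXT, ["Baiduspider", "Googlebot"], "/index.html")
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

As far as I know, this module matches paths by plain prefix and does not interpret the * and $ wildcards, so rules like those in Examples 7 and 8 need a different checker.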
That concludes this introduction to robots.txt and the robots meta tag. I hope it has given you a more detailed understanding of robots. Using robots well plays an important role in a site's SEO. Done properly, it shields the pages we do not want search engines to crawl, that is, pages with a poor user experience, so that the pages carrying our ranking keywords are the ones shown to customers. This improves the site's standing with search engines and, in turn, helps our keyword rankings.
This article was written by Gizhigang of Idsem Group. Copyright link: http://www.idsem.com. Please respect the copyright and credit the source when reproducing.