First, let me introduce what robots.txt is: robots.txt is the first file a search engine looks at when it visits a website. The robots.txt file tells the spider which files on the server may be viewed. When a search spider visits a site, it first checks whether a robots.txt file exists in the site's root directory; if it does, the spider determines the scope of its crawl from the file's contents, and if it does not, the spider can access every page on the site that is not password protected. Finally, note that robots.txt must be placed in the root directory of the site.
You can refer to the robots.txt files of Google, Baidu, and Tencent:
http://www.google.com/robots.txt
http://www.baidu.com/robots.txt
http://www.qq.com/robots.txt
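The file itself is plain text with a very simple syntax: a User-agent line names a spider (or * for all of them), and the Disallow rules below it apply to that spider. A minimal sketch, using a hypothetical directory:
User-agent: *        # the rules below apply to every spider
Disallow: /private/  # keep spiders out of this hypothetical directory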
Now that you understand robots.txt, what can we do with it?
1. Use robots.txt to block highly similar pages or pages with no content.
We know that search engines "audit" the pages they index: when two pages are very similar, the search engine will drop one of them and will also lower your site's score. Suppose the two links below actually serve similar content; then the first link should be blocked:
/xxx?123
/123.html
There can be a great many links like the first one, so how do we block them? In fact, we only need to block the /xxx? prefix.
The code is as follows:
Disallow: /xxx?
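Note that a Disallow rule only takes effect inside a User-agent group, so a complete file for this case might look like the sketch below (/xxx? is just the placeholder path from the example above):
User-agent: *      # apply to all spiders
Disallow: /xxx?    # block the dynamic duplicate; /123.html stays crawlable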
By the same token, we can use this method to block pages that have no content.
2. Use robots.txt to block redundant links, generally keeping the static link (.html, .htm, .shtml, and so on).
A site often contains multiple links that point to the same page, and this lowers the search engine's friendliness toward the site. To avoid this situation, we can use robots.txt to remove every link to a page except the main one.
For example, the following two links point to the same page:
/ooo?123
/123.html
Then we should get rid of the first, redundant one. The code is as follows:
Disallow: /ooo?123
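If dynamic query-string links like this appear all over the site, you could also block them in one stroke with a wildcard. This is only a sketch: the * wildcard is an extension supported by major engines such as Google and Baidu, not part of the original robots.txt specification:
User-agent: *
Disallow: /*?    # block every URL that contains a query string
Static links such as /123.html are unaffected, which matches the advice above to keep the static version.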
3. Use robots.txt to block dead links
A dead link is a page that used to exist but lost its function after a site redesign or for other reasons; it looks like a normal link to a page, but clicking it no longer opens the corresponding page. For example, suppose all the links that used to sit under the /seo directory have become dead links because the directory's address changed. We can then use robots.txt to block them; the code is as follows:
Disallow: /seo/
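Note that Disallow matches by URL prefix, so this single rule covers every dead link under the old directory. A quick sketch of what it does and does not match (the paths are hypothetical):
Disallow: /seo/
# blocked:     /seo/old-post.html, /seo/2010/archive.html
# not blocked: /seo-tips.html (the prefix differs)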
4. Tell search engines the address of your sitemap.xml
You can use robots.txt to tell search engines the address of your sitemap.xml file without adding a sitemap.xml link to the site itself. The specific code is as follows:
Sitemap: Your sitemap address
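For example, if your sitemap lives in the site root (the domain here is just a placeholder):
Sitemap: http://www.example.com/sitemap.xml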
These are the basic uses of robots.txt. A good site will have a good robots.txt, because robots.txt is one of the ways a search engine gets to know your site. In addition, here I recommend a robots.txt better suited to WordPress users:
User-agent: *             # the rules below apply to all spiders
Disallow: /wp-            # block WordPress core paths (/wp-admin/, /wp-includes/, /wp-content/, ...)
Disallow: /feed/          # block the main feed
Disallow: /comments/feed  # block the comments feed
Disallow: /trackback/     # block trackback URLs
Sitemap: http://rainjer.com/sitemap.xml
Finally, if the above does not meet your needs, you can learn more from Google's or Baidu's official robots.txt guides:
Baidu: http://www.baidu.com/search/robots.html
Google: http://www.google.com/support/forum/p/webmasters/thread?tid=4dbbe5f3cd2f6a13&hl=zh-CN