After installing WordPress, many webmasters struggle to write the robots.txt file. Robots.txt implements the Robots Exclusion Protocol, also called the search engine robot protocol: before crawling a site, a search engine spider first checks whether a robots.txt file exists in the site's root directory, and then crawls only the content the site owner allows, following the rules in that file. The robots.txt file tells crawlers which pages may be crawled and which may not. It helps protect user privacy, saves crawl bandwidth, and makes the site easier for spiders to crawl, which in turn helps indexing.
First, a quick look at the basic robots.txt rules:
1. Allow all search engines to crawl everything
User-agent: *
Disallow:
This allows all search engines to crawl every page. Although Disallow means "do not allow", the value after it is empty here, meaning there is no page that is disallowed.
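By contrast (my own illustration, not from the original article), putting a single slash after Disallow blocks the entire site, so the difference between an empty value and "/" matters a great deal:

```
User-agent: *
Disallow: /
```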
2. Block one or more search engines; here we use 360 Search, newly launched at the time, as an example
User-agent: 360Spider
Disallow: /
User-agent: *
Disallow:
The first two lines tell the 360 Search spider not to crawl any page; the last two lines are explained in point 1. Likewise, if you want to block Baidu's spider in addition to 360 Search, add another such block at the beginning.
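For instance, a combined file blocking both spiders might look like the sketch below (Baiduspider is Baidu's user-agent token):

```
User-agent: 360Spider
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
```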
3. Block search engines from crawling certain pages; here we stop all search engines from crawling the WordPress admin pages as an example
User-agent: *
Disallow: /wp-admin/
As we all know, the WordPress admin backend lives in the wp-admin folder under the root directory, so adding /wp-admin/ after Disallow tells search engine spiders not to crawl it.
For other combinations, such as blocking only Baidu from the backend while allowing other engines to crawl it, or blocking only 360 Search from the backend, combine the three patterns above.
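As one such combination (my own sketch), blocking only Baidu from the backend while leaving other engines unrestricted could be written as:

```
User-agent: Baiduspider
Disallow: /wp-admin/

User-agent: *
Disallow:
```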
Back to the topic: writing the WordPress robots.txt file. WordPress's robots file is actually very simple and comes down to three main points:
1. Keep spiders out of the site backend
First, stop search engines from crawling the WordPress admin pages. This is almost every webmaster's first goal when writing robots.txt, and not just for WordPress; of course, different types of sites keep their admin pages in differently named folders.
2. Keep spiders from crawling dynamic URLs
WordPress URLs are best made static, because too many dynamic parameters are bad for crawling. But many webmasters find that after switching to static URLs, the search engine indexes both the static and the dynamic URL of each newly published article. This obviously splits the page's ranking weight between duplicates, and the duplicate pages may even be penalized by the search engine. Avoiding this is simple: add a rule to robots.txt so that spiders do not crawl dynamic URLs, and the dynamic URLs will not be indexed by Baidu.
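The rule in question matches any URL containing a question mark, which is what marks a URL as dynamic. Note that the "*" wildcard in paths is an extension honored by major engines such as Google and Baidu, not part of the original robots.txt standard:

```
User-agent: *
Disallow: /*?*
```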
3. Add the XML sitemap at the end
Adding the sitemap address at the end of robots.txt lets the spider find the sitemap on its very first visit, which helps get pages indexed.
Putting it all together, the simplest WordPress robots.txt reads as follows:
User-agent: *
Disallow: /wp-admin/
Disallow: /*?*
# This line means: do not crawl URLs containing "?", which is the mark of a dynamic URL
Sitemap: http://www.yourdomain.com/sitemap.xml
Remove the line starting with #, change yourdomain in the Sitemap line to your own domain, and the WordPress robots.txt file is done. Finally, upload the file to the site's root directory.
A few things to note when writing a robots.txt file:
1. Slashes
The leading slash at the start of the path is required. A trailing slash means the rule covers everything inside that directory; without the trailing slash, the rule also matches any path that merely begins with the same characters, such as /wp-admin.html or /wp-admin.php (for example). These are two different things, so always consider whether to add the trailing slash.
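A quick way to check this difference (a sketch using Python's standard-library urllib.robotparser, which does simple prefix matching; the URLs are made up for illustration):

```python
from urllib import robotparser

# Rule WITHOUT a trailing slash: matches anything starting with /wp-admin
no_slash = robotparser.RobotFileParser()
no_slash.parse("User-agent: *\nDisallow: /wp-admin".splitlines())

# Rule WITH a trailing slash: matches only paths inside the directory
with_slash = robotparser.RobotFileParser()
with_slash.parse("User-agent: *\nDisallow: /wp-admin/".splitlines())

page = "http://www.example.com/wp-admin.html"
inside = "http://www.example.com/wp-admin/options.php"

print(no_slash.can_fetch("*", page))      # False: "/wp-admin.html" starts with "/wp-admin"
print(with_slash.can_fetch("*", page))    # True: "/wp-admin.html" is not under "/wp-admin/"
print(with_slash.can_fetch("*", inside))  # False: inside the blocked directory
```

Running this shows that the slash-less rule blocks both the directory and similarly named files, while the rule with the trailing slash blocks only the directory's contents.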
2. Capitalization
By convention, each directive starts with a capital letter and the rest is lowercase (User-agent, Disallow, Sitemap). Directive names are actually case-insensitive, but the paths in the rules are case-sensitive: /wp-admin/ and /WP-Admin/ are different paths.
3. Disallow and Allow
In fact, for many novice webmasters, mastering one of these two directives is enough; being taught both at once makes them easy to mix up. So if you are a beginner, it is best to use only one of them in your robots.txt file, to prevent errors caused by confusing the two.

This article was originally written by Zhang Donglong, webmaster of the SEO Learning Network. If you need to reprint it, please keep the original address http://www.zhangdonglong.com/archives/578. Thank you.