After installing WordPress, many webmasters struggle to write the robots.txt file. Robots.txt implements the Robots Exclusion Protocol, also called the search engine robot protocol: before crawling a site, a search engine spider first checks whether a robots.txt file exists in the site's root directory, and then crawls only the content the site owner allows, following the rules in that file. The robots.txt file tells crawlers which pages may be crawled and which may not. It helps protect user privacy, saves crawl bandwidth, and makes the site easier for spiders to crawl, which in turn helps indexing.
First, a quick look at the basic robots.txt rules:
1. Allow all search engines to crawl everything
User-agent: *
Disallow:
This allows all search engines to crawl every page. Although Disallow means "do not allow", the value after it is empty here, meaning there is no page that is disallowed.
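By contrast (my own illustration, not from the original article), putting a single slash after Disallow blocks the entire site, so the difference between an empty value and "/" matters a great deal:

```
User-agent: *
Disallow: /
```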
2. Block one or more search engines; here we use 360 Search, newly launched at the time, as an example
User-agent: 360Spider
Disallow: /
User-agent: *
Disallow:
The first two lines tell the 360 Search spider not to crawl any page; the last two lines are explained in point 1. Likewise, if you want to block Baidu's spider in addition to 360 Search, add another such block at the beginning.
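For instance, a combined file blocking both spiders might look like the sketch below (Baiduspider is Baidu's user-agent token):

```
User-agent: 360Spider
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
```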
3. Block search engines from crawling certain pages; here we stop all search engines from crawling the WordPress admin pages as an example
User-agent: *
Disallow: /wp-admin/
As we all know, the WordPress admin backend lives in the wp-admin folder under the root directory, so adding /wp-admin/ after Disallow tells search engine spiders not to crawl it.
For other combinations, such as blocking only Baidu from the backend while allowing other engines to crawl it, or blocking only 360 Search from the backend, combine the three patterns above.
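As one such combination (my own sketch), blocking only Baidu from the backend while leaving other engines unrestricted could be written as:

```
User-agent: Baiduspider
Disallow: /wp-admin/

User-agent: *
Disallow:
```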
Back to the topic: writing the WordPress robots.txt file. WordPress's robots file is actually very simple and comes down to three main points:
1. Keep spiders out of the site backend
First, stop search engines from crawling the WordPress admin pages. This is almost every webmaster's first goal when writing robots.txt, and not just for WordPress; of course, different types of sites keep their admin pages in differently named folders.
2. Keep spiders from crawling dynamic URLs
WordPress URLs are best made static, because too many dynamic parameters are bad for crawling. But many webmasters find that after switching to static URLs, the search engine indexes both the static and the dynamic URL of each newly published article. This obviously splits the page's ranking weight between duplicates, and the duplicate pages may even be penalized by the search engine. Avoiding this is simple: add a rule to robots.txt so that spiders do not crawl dynamic URLs, and the dynamic URLs will not be indexed by Baidu.
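The rule in question matches any URL containing a question mark, which is what marks a URL as dynamic. Note that the "*" wildcard in paths is an extension honored by major engines such as Google and Baidu, not part of the original robots.txt standard:

```
User-agent: *
Disallow: /*?*
```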
3. Add the XML sitemap at the end
Adding the sitemap address at the end of robots.txt lets the spider find the sitemap on its very first visit, which helps get pages indexed.
Putting it all together, the simplest WordPress robots.txt reads as follows:
User-agent: *
Disallow: /wp-admin/
Disallow: /*?*
# This line means: do not crawl URLs containing "?", which is the mark of a dynamic URL
Sitemap: http://www.yourdomain.com/sitemap.xml
Remove the line starting with #, change yourdomain in the Sitemap line to your own domain, and the WordPress robots.txt file is done. Finally, upload the file to the site's root directory.
A few things to note when writing a robots.txt file:
1. Slashes
The leading slash at the start of the path is required. A trailing slash means the rule covers everything inside that directory; without the trailing slash, the rule also matches any path that merely begins with the same characters, such as /wp-admin.html or /wp-admin.php (for example). These are two different things, so always consider whether to add the trailing slash.
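A quick way to check this difference (a sketch using Python's standard-library urllib.robotparser, which does simple prefix matching; the URLs are made up for illustration):

```python
from urllib import robotparser

# Rule WITHOUT a trailing slash: matches anything starting with /wp-admin
no_slash = robotparser.RobotFileParser()
no_slash.parse("User-agent: *\nDisallow: /wp-admin".splitlines())

# Rule WITH a trailing slash: matches only paths inside the directory
with_slash = robotparser.RobotFileParser()
with_slash.parse("User-agent: *\nDisallow: /wp-admin/".splitlines())

page = "http://www.example.com/wp-admin.html"
inside = "http://www.example.com/wp-admin/options.php"

print(no_slash.can_fetch("*", page))      # False: "/wp-admin.html" starts with "/wp-admin"
print(with_slash.can_fetch("*", page))    # True: "/wp-admin.html" is not under "/wp-admin/"
print(with_slash.can_fetch("*", inside))  # False: inside the blocked directory
```

Running this shows that the slash-less rule blocks both the directory and similarly named files, while the rule with the trailing slash blocks only the directory's contents.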
2. Capitalization
By convention, each directive starts with a capital letter and the rest is lowercase (User-agent, Disallow, Sitemap). Directive names are actually case-insensitive, but the paths in the rules are case-sensitive: /wp-admin/ and /WP-Admin/ are different paths.
3. Disallow and Allow
In fact, for many novice webmasters, mastering one of these two directives is enough; being taught both at once makes them easy to mix up. So if you are a beginner, it is best to use only one of them in your robots.txt file, to prevent errors caused by confusing the two.

This article was originally written by Zhang Donglong, webmaster of the SEO Learning Network. If you need to reprint it, please keep the original address http://www.zhangdonglong.com/archives/578. Thank you.