A long time ago I promised Ah Bin that I would write an article to thank him for his help, but I never got around to it. A few days ago I saw Zhuo Less ask a question about robots.txt, so I have put together an overview of it for everyone. The robots.txt file sits in the root directory of a site and is the first file a search engine looks at when it visits. When a search spider arrives at a site, it first checks whether robots.txt exists in the root directory; if it does, the spider determines the scope of its crawl from the file's contents, and if it does not, the spider can access every page on the site that is not password protected. Every site should have a robots.txt: it tells search engines which parts of the site must not be crawled and which pages are welcome to be crawled and indexed.
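To make this concrete, here is a minimal sketch of how a well-behaved spider consults robots.txt before fetching anything, using Python's standard urllib.robotparser module; the rules and URLs below are hypothetical examples, not taken from any real site.

# Minimal sketch: a polite crawler checks robots.txt before requesting a URL.
# The rules and URLs here are hypothetical examples.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() is called before requesting any page.
print(rp.can_fetch("*", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://www.example.com/index.html"))         # True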
Several uses of robots.txt:
1. Block every search engine from crawling the site. If your site is private and you do not want many people to know about it, such as a personal blog you write for yourself, you can use robots.txt to shut out all the search engines:
User-agent: *
Disallow: /
2. If you want only one particular search engine to crawl your site, robots.txt can arrange that too. For example, if I want my site indexed only by Baidu and not by any other search engine, I can set:
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
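As a quick sanity check, the same standard-library parser can confirm that these rules behave as intended (the spider names are real crawler names; the page path is just an example):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("Baiduspider", "/index.html"))  # True: Baidu may crawl
print(rp.can_fetch("Googlebot", "/index.html"))    # False: everyone else is blocked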
3. You can use wildcards to fine-tune which parts of the site are blocked. For example, if I do not want search engines to crawl any of my images, I can use $, which anchors the end of a URL. The common image formats are BMP, JPG, GIF and JPEG, so the rules are (a small matcher sketch after point 4 shows how these patterns are interpreted):
User-agent: *
Disallow: /*.bmp$
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.jpeg$
4. You can also use the * wildcard to block whole classes of URLs. Some sites do not want search engines to crawl their dynamic addresses, and * handles this kind of matching. Dynamic URLs are generally marked by a "?", so we can match on that character to block them:
User-agent: *
Disallow: /*?*
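One caveat: the * and $ wildcards are extensions honored by the major engines such as Google and Baidu, and simpler parsers (including Python's standard robotparser) treat them as literal text. Purely as an illustration, here is a minimal sketch of a Google-style matcher showing how the patterns from points 3 and 4 are interpreted; the function name and sample paths are made up for this example.

import re

def rule_matches(rule, path):
    # Hypothetical helper: Google-style robots.txt matching, where *
    # matches any run of characters and a trailing $ anchors the URL end.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

print(rule_matches("/*.jpg$", "/images/photo.jpg"))  # True: blocked by point 3
print(rule_matches("/*.jpg$", "/photo.jpg?v=2"))     # False: $ requires the URL to end there
print(rule_matches("/*?*", "/product.php?id=7"))     # True: blocked by point 4
print(rule_matches("/*?*", "/about.html"))           # False: no "?" in the URL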
5. If the site has been redesigned and an entire folder no longer exists, you should consider blocking that folder outright. For example, if the site's ab folder was deleted entirely during a redesign, you can set:
User-agent: *
Disallow: /ab/
6. If there is a folder you do not want indexed, but one item inside it should still be indexed, use the Allow directive together with Disallow. For example, if my site's ab folder must not be crawled, but the page cd inside ab is allowed to be crawled, I can set the following (putting the Allow line first, since stricter parsers apply rules in order, while Google simply picks the most specific rule):
User-agent: *
Allow: /ab/cd
Disallow: /ab/
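Here too, Python's standard parser can verify the behavior (the paths are hypothetical):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /ab/cd
Disallow: /ab/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "/ab/cd"))          # True: the exception is crawlable
print(rp.can_fetch("*", "/ab/other.html"))  # False: the rest of /ab/ is blocked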
7. You can declare the location of the sitemap in robots.txt, which helps search engines discover and index the site's pages:
Sitemap: <sitemap URL>
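For example, with a purely illustrative domain:

Sitemap: https://www.example.com/sitemap.xml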
8. Sometimes you set up robots.txt and still find a blocked URL showing up in the index. This happens because the spider discovers the URL through links on other pages rather than by crawling the page itself. Google generally lists such a URL without a title or description, while Baidu lists it with a title and description, which is why many people say their robots.txt settings have no effect. What has actually happened is that the link was indexed without the content of the page behind it being crawled.
A site's homepage carries the highest weight, and weight is passed along through links. We set up robots.txt so that weight flows to the pages that really need it, while the pages that do not need to be crawled and indexed are kept out of the search engines' way.