A long time ago I promised Ah Bin that I would write an article to thank him for his help, but I never got around to it. A few days ago I saw Zhuo Less ask a question about robots.txt, so I have put together an overview of it for everyone. The robots.txt file sits in the root directory of a site and is the first file a search engine looks at when it visits. When a search spider arrives at a site, it first checks whether robots.txt exists in the root directory; if it does, the spider determines the scope of its crawl from the file's contents, and if it does not, the spider can access every page on the site that is not password protected. Every site should have a robots.txt: it tells search engines which parts of the site must not be crawled and which pages are welcome to be crawled and indexed.
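To make this concrete, here is a minimal sketch of how a well-behaved spider consults robots.txt before fetching anything, using Python's standard urllib.robotparser module; the rules and URLs below are hypothetical examples, not taken from any real site.

# Minimal sketch: a polite crawler checks robots.txt before requesting a URL.
# The rules and URLs here are hypothetical examples.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() is called before requesting any page.
print(rp.can_fetch("*", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://www.example.com/index.html"))         # True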
Several uses of robots.txt:
1. Block every search engine from crawling the site. If your site is private and you do not want many people to know about it, such as a personal blog you write for yourself, you can use robots.txt to shut out all the search engines:
User-agent: *
Disallow: /
2. If you want only one particular search engine to crawl your site, robots.txt can arrange that too. For example, if I want my site indexed only by Baidu and not by any other search engine, I can set:
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
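As a quick sanity check, the same standard-library parser can confirm that these rules behave as intended (the spider names are real crawler names; the page path is just an example):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("Baiduspider", "/index.html"))  # True: Baidu may crawl
print(rp.can_fetch("Googlebot", "/index.html"))    # False: everyone else is blocked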
3. You can use wildcards to fine-tune which parts of the site are blocked. For example, if I do not want search engines to crawl any of my images, I can use $, which anchors the end of a URL. The common image formats are BMP, JPG, GIF and JPEG, so the rules are (a small matcher sketch after point 4 shows how these patterns are interpreted):
User-agent: *
Disallow: /*.bmp$
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.jpeg$
4. You can also use the * wildcard to block whole classes of URLs. Some sites do not want search engines to crawl their dynamic addresses, and * handles this kind of matching. Dynamic URLs are generally marked by a "?", so we can match on that character to block them:
User-agent: *
Disallow: /*?*
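One caveat: the * and $ wildcards are extensions honored by the major engines such as Google and Baidu, and simpler parsers (including Python's standard robotparser) treat them as literal text. Purely as an illustration, here is a minimal sketch of a Google-style matcher showing how the patterns from points 3 and 4 are interpreted; the function name and sample paths are made up for this example.

import re

def rule_matches(rule, path):
    # Hypothetical helper: Google-style robots.txt matching, where *
    # matches any run of characters and a trailing $ anchors the URL end.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

print(rule_matches("/*.jpg$", "/images/photo.jpg"))  # True: blocked by point 3
print(rule_matches("/*.jpg$", "/photo.jpg?v=2"))     # False: $ requires the URL to end there
print(rule_matches("/*?*", "/product.php?id=7"))     # True: blocked by point 4
print(rule_matches("/*?*", "/about.html"))           # False: no "?" in the URL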
5. If the site has been redesigned and an entire folder no longer exists, you should consider blocking that folder outright. For example, if the site's ab folder was deleted entirely during a redesign, you can set:
User-agent: *
Disallow: /ab/
6. If there is a folder you do not want indexed, but one item inside it should still be indexed, use the Allow directive together with Disallow. For example, if my site's ab folder must not be crawled, but the page cd inside ab is allowed to be crawled, I can set the following (putting the Allow line first, since stricter parsers apply rules in order, while Google simply picks the most specific rule):
User-agent: *
Allow: /ab/cd
Disallow: /ab/
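Here too, Python's standard parser can verify the behavior (the paths are hypothetical):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /ab/cd
Disallow: /ab/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "/ab/cd"))          # True: the exception is crawlable
print(rp.can_fetch("*", "/ab/other.html"))  # False: the rest of /ab/ is blocked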
7. You can declare the location of the sitemap in robots.txt, which helps search engines discover and index the site's pages:
Sitemap: <sitemap URL>
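For example, with a purely illustrative domain:

Sitemap: https://www.example.com/sitemap.xml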
8. Sometimes you set up robots.txt and still find a blocked URL showing up in the index. This happens because the spider discovers the URL through links on other pages rather than by crawling the page itself. Google generally lists such a URL without a title or description, while Baidu lists it with a title and description, which is why many people say their robots.txt settings have no effect. What has actually happened is that the link was indexed without the content of the page behind it being crawled.
A site's homepage carries the highest weight, and weight is passed along through links. We set up robots.txt so that weight flows to the pages that really need it, while the pages that do not need to be crawled and indexed are kept out of the search engines' way.