Recently, friends have often been asking me questions about robots.txt, so today I will share some experience and techniques with you; your criticism is welcome.
The nature of robots.txt: it is a plain-text (.txt) file.
How it works: when a search robot (also called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If the file exists, the robot determines the scope of its visit according to the file's contents; if it does not exist, the robot simply crawls along the site's links.
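This check is easy to reproduce with Python's standard urllib.robotparser module. Below is a minimal sketch; the rules and URLs are invented for illustration:

from urllib.robotparser import RobotFileParser

# A tiny robots.txt, parsed from memory rather than fetched over HTTP
rules = """
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved spider asks before every fetch
print(parser.can_fetch("*", "http://example.com/index.html"))       # True: crawl it
print(parser.can_fetch("*", "http://example.com/admin/login.php"))  # False: skip it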
Syntax: a User-agent: line names the robot to which the rules that follow apply, and each Disallow: line names a Web path that robot should not crawl. Below are a number of commonly used robots.txt rules for your reference.
The role of robots.txt:
1. Site managers can declare which parts of the site they do not want search spiders to visit, or specify that a given search engine include only designated content, which effectively protects the site manager's information.
2. Following Razor's Law (http://www.chengshinet.com/seo/20090114-88.html), it keeps search-result pages and the duplicate pages produced by dynamic links from being crawled and dragging down the overall quality of the website.
3. The robots.txt file can also point search engines directly to the location of the site's sitemap file.
4. It keeps spiders from requesting files that do not exist and generating 404 error messages.
Here are some robots.txt techniques and the specific operations:
1. Limiting the scope of files that search spiders may crawl:
Allow all search spiders access:
User-agent: *
Disallow:
Prohibit all search engines from accessing any part of the site:
User-agent: *
Disallow: /
Prevent all search engines from accessing the site's admin login back end, to avoid leaking site information:
User-agent: *
Disallow: /admin/
Block a particular search engine (this is how Taobao blocks Baidu, as discussed below):
User-agent: Baiduspider
Disallow: /
Allow only one search engine to access the site (Google in this example):
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
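To confirm that these per-robot rules behave as intended, the same urllib.robotparser sketch can parse the "allow only Google" file from memory (URLs invented for illustration):

from urllib.robotparser import RobotFileParser

only_google = """
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(only_google)

# Googlebot matches its own group; every other robot falls through to *
print(parser.can_fetch("Googlebot", "http://example.com/page.html"))    # True
print(parser.can_fetch("Baiduspider", "http://example.com/page.html"))  # False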
To prevent spiders from crawling all files of a particular type (for example, .txt), use the following robots.txt entry:
User-agent: *
Disallow: /*.txt$

To prevent spiders from crawling every URL that contains a ? (specifically, a URL that begins with your domain name, followed by any string, then a question mark, then any string), you can use the following entry (see the pattern-matching sketch after this list):

User-agent: Googlebot
Disallow: /*?

2. Preventing certain search-result pages and print pages from being crawled. From http://www.chengshinet.com/robots.txt:

User-agent: *
Disallow: /plus/feedback_js.php
Disallow: /plus/mytag_js.php
Disallow: /plus/rss.php
Disallow: /plus/search.php

3. Submitting the sitemap location in the robots file:

User-agent: *
Sitemap: http://www.xxx.com/sitemap.xml
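The * and $ in the entries under point 1 are wildcard extensions; the standard-library parser used earlier does not understand them, so the sketch below hand-rolls the matching to show what the two rules mean (the paths are invented):

import re

def robots_pattern(pattern):
    # '$' anchors a rule only when it is the final character
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn '*' back into "any run of characters"
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

# Disallow: /*.txt$  -- blocks every path that ends in .txt
txt_rule = robots_pattern("/*.txt$")
print(bool(txt_rule.match("/docs/readme.txt")))      # True: blocked
print(bool(txt_rule.match("/docs/readme.txt.bak")))  # False: '$' stops the match

# Disallow: /*?  -- blocks every path that contains a question mark
query_rule = robots_pattern("/*?")
print(bool(query_rule.match("/plus/search.php?q=seo")))  # True: blocked
print(bool(query_rule.match("/plus/search.php")))        # False: no '?'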
At present, the vast majority of search-engine robots comply with the rules in robots.txt, but sometimes, to keep search results complete and the user experience good, a search engine will still make requests for some important URLs that have been disallowed. Let's look at a robots.txt example: http://www.taobao.com/robots.txt. Visiting that address, we can see the specific contents of the file:

User-agent: Baiduspider
Disallow: /

The text above means that Baidu's search robot is prohibited from accessing any file on the www.taobao.com site.
Specific syntax analysis: User-agent: is followed by the name of a search robot; * means the rules apply to every search robot, while a specific name, such as Baiduspider for Baidu's spider, targets that robot alone. Disallow: is followed by the directory or file that may not be accessed. As noted, most search-engine robots obey these rules, yet when you search for Taobao on Baidu, the site's entry still appears in the index.
What cannot be denied, though, is that the robots file can itself leak a site's back-end login address. Many webmasters believe the site is safe once search spiders are blocked, when in fact the file sometimes gives the site's information away. How do you keep that information from leaking? The most effective approach is to set up a 301 redirect for the robots file.
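The article does not spell out the redirect configuration, so the following is only a sketch of one reading of that advice, using Python's built-in http.server with hypothetical paths: requests for /robots.txt are answered with a 301 to a sanitized copy, so a file naming the back-end directory is never served directly:

from http.server import BaseHTTPRequestHandler, HTTPServer

SAFE_RULES = b"User-agent: *\nDisallow: /\n"  # sanitized rules: no back-end paths named

class RobotsRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            # Permanently redirect the robots file itself
            self.send_response(301)
            self.send_header("Location", "/public-robots.txt")
            self.end_headers()
        elif self.path == "/public-robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(SAFE_RULES)))
            self.end_headers()
            self.wfile.write(SAFE_RULES)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), RobotsRedirect).serve_forever()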
This article first appeared on the Net Marketing blog: http://www.chengshinet.com/blog/. Webmaster QQ: 573218273. Criticism is welcome, and SEO enthusiasts are welcome to join the new SEO group, 61842653, so we can make progress together!