How do I write a robots.txt file? Robots syntax:

1. User-agent specifies which search engine the rules apply to. Most robots.txt files contain "User-agent: *", where "*" means the rules apply to all search engines. To target Baidu specifically, write "User-agent: Baiduspider"; for Google, "User-agent: Googlebot".

2. Disallow prohibits crawling. For example, to block my admin folder:
Disallow: /admin/
To block login.html inside the admin folder:
Disallow: /admin/login.html

3. Allow permits crawling. Since everything is allowed by default, why does this directive exist? Suppose I want to block everything under the admin folder except the .html pages. We could deny the other files one by one with Disallow, but that takes too much time and effort. Allow solves the problem neatly, so we can write (a small Python check of this Allow/Disallow interplay appears after the examples below):
Allow: /admin/*.html$
Disallow: /admin/

4. $ is the end-of-URL terminator. Example:
Disallow: .php$
This blocks every URL ending in .php, no matter how long the preceding path is; for example, abc/aa/bb//index.php is also blocked.

5. * is a wildcard matching zero or more arbitrary characters. Example:
Disallow: /*?*
This blocks every URL containing "?", i.e. all dynamic URLs.

Examples of how to write the robots.txt file:

Prohibit all search engines (Google, Baidu, and so on) from accessing the entire website:
User-agent: *
Disallow: /

Allow all search engine spiders to access the entire website (instead of the empty Disallow you can also use "Allow: /"):
User-agent: *
Disallow:

Prohibit Baiduspider from accessing your website; other search engines such as Google are not blocked:
User-agent: Baiduspider
Disallow: /

Only allow Googlebot to access your website; Baidu and other search engines are blocked:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Prohibit search engine spiders from accessing specified directories (the spiders will not access these directories; each directory must be declared on its own line and they cannot be combined):
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /~jjjj/

Prohibit spiders from accessing a specified directory, but allow access to one of its subdirectories:
User-agent: *
Allow: /admin/far
Disallow: /admin/

Use the wildcard "*" to set URLs that may not be accessed (prohibit all search engines from crawling the ".html" pages in the /cgi-bin/ directory, including subdirectories):
User-agent: *
Disallow: /cgi-bin/*.html

Use the dollar sign "$" to block access to files with a certain suffix (only URLs ending in ".html" may be accessed):
User-agent: *
Allow: .html$
Disallow: /

Prevent all search engines (Google, Baidu, and so on) from accessing dynamic URLs containing "?":
User-agent: *
Disallow: /*?*

Prevent Googlebot from accessing images of a certain format on the website (block .jpg images):
User-agent: Googlebot
Disallow: .jpg$

Only allow Googlebot to crawl webpages and .gif images (Googlebot may crawl webpages and gif images; images in other formats are blocked; other search engines are not restricted):
User-agent: Googlebot
Allow: .gif$
Disallow: .jpg$
.......

Only prohibit Googlebot from crawling .jpg images (other search engines and other image formats are not blocked):
User-agent: Googlebot
Disallow: .jpg$

Declare the sitemap. This tells the search engine where your sitemap is, for example:
Sitemap: http://www.AAAA.com/sitemap.xml
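If you want to verify rules like these before publishing them, Python's standard urllib.robotparser module can parse a robots.txt and answer can-fetch queries. Below is a minimal sketch under these assumptions: the robots.txt content, the "MyCrawler" user-agent, and the example.com URLs are made up for illustration, and the standard-library parser implements the original prefix-matching rules, so it may not interpret the "*" and "$" extensions described above; the Allow line therefore uses a plain path.

# A minimal sketch, assuming Python 3; the rules, the "MyCrawler"
# user-agent and the example.com URLs are hypothetical.
from urllib import robotparser

# Block everything under /admin/ except the login page.
rules = """
User-agent: *
Allow: /admin/login.html
Disallow: /admin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyCrawler", "http://example.com/admin/login.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/admin/config.php"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))        # True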
Google's and Baidu's own introductions to the robots.txt file are also worth reading: Google robots.txt and Baidu robots.txt. PS: Domestic search engine spiders: Baidu spider: Baiduspider; Sogou spider: Sogou spider; Youdao spider: YodaoBot and OutfoxBot; Soso spider: Sosospider. Foreign search engine spiders: Google spider: Googlebot; Yahoo spider: Yahoo! Slurp; Alexa spider: ia_archiver; Bing (MSN) spider: msnbot.
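These spider names are what goes after "User-agent:" when you want per-spider rules. As a rough illustration (the robots.txt content and the example.com URL below are made up, and Python's standard urllib.robotparser is used only as a convenient checker), here is how the "block Baiduspider, allow everyone else" example above behaves:

# A minimal sketch, assuming Python 3; the rules and the example.com URL
# are hypothetical and mirror the "block only Baiduspider" example above.
from urllib import robotparser

rules = """
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The Baiduspider group blocks the whole site for Baidu's spider; the
# catch-all group leaves every other spider unrestricted.
print(rp.can_fetch("Baiduspider", "http://example.com/index.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))    # True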