How to use robots.txt to prevent search engines from indexing your pages
Source: Internet
Author: User
Keywords: search engine, crawl, prohibit
Robots.txt is the first file a search engine looks at when it visits a website. The robots.txt file tells the spider which files on the server may be viewed. When a search spider visits a site, it first checks whether a robots.txt file exists in the site's root directory; if it does, the spider determines the scope of its access from the file's contents. If the file does not exist, every search spider will be able to access all pages on the site that are not password protected.
Robots.txt must be placed in the root directory of the site, and the filename must be all lowercase. Syntax: the simplest robots.txt file uses two rules: User-agent: names the bots the following rules apply to, and Disallow: names the pages to block.
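The two rules above can be combined into a minimal sketch (the directory name below is illustrative, not from the original article):

```text
# Rules apply to all spiders
User-agent: *
# Block everything under /private/ (example path)
Disallow: /private/
```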
Misunderstanding one: every file on my site should be crawled by spiders, so I don't need a robots.txt file. After all, if the file does not exist, all search spiders will by default access every page on the site that is not password protected.
Whenever a user tries to access a URL that does not exist, the server records a 404 error (file not found) in its log. Likewise, whenever a search spider requests a robots.txt file that does not exist, the server also records a 404 error in the log. For this reason, you should add a robots.txt file to your site even if you want everything crawled.
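A minimal robots.txt that allows all spiders to crawl everything, while still avoiding the 404 entries described above, uses an empty Disallow rule (an empty value blocks nothing):

```text
# Applies to all spiders; empty Disallow permits all pages
User-agent: *
Disallow:
```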
Myth two: if the robots.txt file allows search spiders to crawl every file, this will increase the site's inclusion rate.
Even if spiders index a site's program scripts, style sheets, and similar files, doing so will not increase the site's inclusion rate; it only wastes server resources. Therefore, you should configure robots.txt so that search spiders are not allowed to index these files.
Exactly which files should be excluded is described in detail in the article on robots.txt usage tips.
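As a sketch, a robots.txt that keeps spiders out of script and stylesheet directories might look like the following (the directory names are illustrative assumptions; substitute your site's actual paths):

```text
# Keep all spiders out of non-content directories (example paths)
User-agent: *
Disallow: /cgi-bin/
Disallow: /js/
Disallow: /css/
```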
Myth three: search spiders waste too many server resources crawling web pages, so robots.txt should forbid all spiders from crawling any page.
If you do this, the entire site cannot be indexed by search engines.
If your site contains private or confidential web pages, how do you tell search engines not to crawl or index them? Houqingrong offers the following methods, in the hope that they help keep such pages out of search engine results.
The first method: robots.txt
Search engines by default comply with the robots.txt protocol. Create a robots.txt text file in the site's root directory and edit the code as follows:
User-agent: *
Disallow: /
This code tells search engines not to crawl or index any page on the site.
The second method: web page code. In the page's code, between <head> and </head>, add the tag <meta name= "robots" content= "noindex,noarchive" >. The noindex directive tells search engines not to index the page, and noarchive tells them not to display a cached snapshot of it.
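A minimal page head using these robots meta directives could look like this (a sketch; the title is a placeholder):

```text
<head>
  <!-- noindex: do not list this page; noarchive: do not show a cached snapshot -->
  <meta name="robots" content="noindex,noarchive">
  <title>Private page</title>
</head>
```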
How to prohibit the Baidu search engine from crawling and indexing a page
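To target Baidu specifically, address its crawler by its user-agent name, Baiduspider, in robots.txt. The rules below block Baidu from the whole site while leaving other spiders unaffected:

```text
# Block only Baidu's crawler from the entire site
User-agent: Baiduspider
Disallow: /
```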