Robots.txt techniques: how to prohibit search engines from crawling and indexing your site

Source: Internet
Author: User
Keywords: search engine, crawl, prohibit

Robots.txt is the first file a search engine views when visiting a Web site. The robots.txt file tells the spider which files on the server may be viewed.
When a search spider accesses a site, it first checks whether a robots.txt file exists in the root directory of the site. If it does, the spider determines the scope of its access according to the contents of that file; if the file does not exist, all search spiders will be able to access every page on the site that is not password protected.

Robots.txt must be placed at the root of a site, and the filename must be all lowercase.
Syntax: The simplest robots.txt file uses two rules:
User-agent: the bots to which the following rules apply
Disallow: the Web pages to block
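For example, a minimal robots.txt combining the two rules might look like this (the /private/ directory is only an illustration; use your site's actual paths):

```
User-agent: *
Disallow: /private/
```

This tells every spider not to crawl anything under /private/, while leaving the rest of the site open.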

Myth One: Every file on my website should be crawled by spiders, so I don't need a robots.txt file. After all, if the file doesn't exist, all search spiders will by default access every page on the site that is not password protected.

Every time a user attempts to access a URL that does not exist, the server logs a 404 error (file not found). Likewise, whenever a search spider looks for a robots.txt file that doesn't exist, the server records a 404 error in the log, so you should add a robots.txt file to your site.

Myth Two: Setting the robots.txt file so that all files can be crawled by search spiders will increase the site's indexing rate.

Even if spiders index a site's program scripts, style sheets, and similar files, doing so will not increase the site's indexing rate; it only wastes server resources. Therefore, you should configure the robots.txt file to disallow search spiders from indexing these files.
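As a sketch, a robots.txt that excludes such files could look like the following (the /cgi-bin/, /css/, and /js/ directory names are assumptions; substitute the paths where your site actually keeps scripts and style sheets):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /css/
Disallow: /js/
```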

Exactly which files need to be excluded is described in detail in the article on robots.txt usage tips.

Myth Three: Search spiders crawling Web pages wastes too many server resources, so the robots.txt file should forbid all search spiders from crawling any page.

If you do this, the entire site cannot be indexed by search engines.

If your site contains private or confidential Web pages and you need to tell search engines not to crawl and index them, Houqingrong offers the following methods, which I hope will help those who do not want their sites crawled and indexed by search engines.

Method one: the robots.txt file

Search engines comply with the robots.txt protocol by default. Create a robots.txt text file in the site's root directory and edit its contents as follows:

User-agent: *
Disallow: /

This code tells all search engines not to crawl or index the site.
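You can verify rules like these offline with Python's standard urllib.robotparser module; the example.com URL below is only a placeholder:

```python
import urllib.robotparser

# The block-all rules from above.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# With "Disallow: /" under "User-agent: *", no spider may fetch any page.
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))    # False
print(rp.can_fetch("Baiduspider", "http://example.com/index.html"))  # False
```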

Method two: Web page code
In the page code between <head> and </head>, add the tag <meta name= "robots" content= "noindex,noarchive" >. The noindex directive prohibits search engines from indexing the page, and noarchive prohibits them from displaying a cached snapshot of it.
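A minimal sketch of where this tag sits in a page (the title text is a placeholder):

```html
<head>
    <title>Example page</title>
    <!-- noindex blocks indexing; noarchive blocks the cached snapshot -->
    <meta name="robots" content="noindex,noarchive">
</head>
```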

How to prohibit the Baidu search engine from crawling and indexing a page

1. Edit the robots.txt file and add the following rules:

User-agent: Baiduspider
Disallow: /
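The same urllib.robotparser check (again with example.com as a placeholder) confirms that these rules block Baiduspider while leaving other spiders unaffected:

```python
import urllib.robotparser

# Rules that apply only to Baidu's spider.
rules = [
    "User-agent: Baiduspider",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Baiduspider is blocked; spiders with no matching rules default to allowed.
print(rp.can_fetch("Baiduspider", "http://example.com/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/page.html"))    # True
```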
