When it comes to writing ROBOTS.TXT,



Remember a while back, when the news that Sina was blocking the Baidu spider caused such a stir? In fact, once you have learned how to write a ROBOTS.TXT file, it is simple to tell at a glance whether such a claim is true or false. Learn the technology, and you can better know the truth for yourself.

First, let's meet our dear spiders:

Domestic (Chinese) search engine spiders

Baidu Spider: Baiduspider

Sogou Spider: Sogou spider

Youdao Spider: YodaoBot and OutfoxBot

Soso Spider: Sosospider

Foreign search engine spiders

Google Spider: Googlebot

Yahoo Spider: Yahoo! Slurp

Alexa Spider: ia_archiver

Bing Spider (MSN): msnbot

Common Robots.txt directives and what they mean

User-agent: the bot to which the rules that follow apply

Allow: paths that may be crawled

Disallow: paths to block from crawling

Two common wildcard symbols in Robots.txt

"*": matches 0 or more arbitrary characters (i.e., "all")

"$": matches the end of the URL.
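To make the matching concrete, here is a minimal sketch (my addition, not part of the original article) that translates a robots.txt path pattern into a regular expression, roughly the way Google-style parsers treat "*" and "$"; the helper name robots_pattern_to_regex is made up for illustration:

import re

def robots_pattern_to_regex(pattern):
    # "*" matches any run of characters; "$" anchors the end of the URL path.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(regex)

rule = robots_pattern_to_regex("/*.htm$")
print(bool(rule.match("/index.htm")))   # True: the URL ends in .htm
print(bool(rule.match("/index.html")))  # False: "$" requires the URL to end right there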

That is enough introduction; now let's get to the point, writing Robots.txt:

One, allow all spiders to crawl:

User-agent: *
Disallow:

Or

User-agent: *
Allow: /

(The "*" can be understood to mean "all".)

Two, forbid all robots from crawling:

User-agent: *
Disallow: /

Three, forbid one particular spider from crawling:

User-agent: spider name (see the list above)
Disallow: /

Four, allow only one particular spider to crawl:

User-agent: spider name (see the list above)
Disallow:

User-agent: *
Disallow: /

The top half lets the named spider crawl everything; the bottom half blocks everyone else. In short: permit that one spider and shut out all the others.

Five, forbid spiders from crawling certain directories

For example, to disallow crawling of the admin and manage directories:

User-agent: *
Disallow: /admin/
Disallow: /manage/

Six, forbid spiders from crawling files with a specific suffix; this uses the "*" symbol

For example, to disallow crawling of .htm files:

User-agent: *
Disallow: /*.htm (after "/*" comes the dot and the extension, e.g. /*.asp, /*.php)

Seven, allow crawling only of files with a specific suffix; this uses the "$" symbol

For example, to allow only .htm files:

User-agent: *
Allow: /*.htm$
Disallow: /

(Image files can be handled the same way as examples six and seven, e.g. Disallow: /*.jpg$.)

Eight, forbid crawling of dynamic web pages

User-agent: *
Disallow: /*?*

This is very useful for forums, which are generally made pseudo-static; there is no need to have search engines index the dynamic addresses as well. Friends who run forums, take note.

Nine, declare the site map (Sitemap)

This tells the search engines where your sitemap is:

Sitemap: http://your domain/sitemap.xml
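For reference, a minimal sitemap.xml following the sitemaps.org protocol might look like the sketch below (the URL is a placeholder; real sitemaps can also carry optional tags such as lastmod):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://your domain/</loc>
  </url>
</urlset>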

How do we check that our robots.txt file is valid? Google Webmaster Tools are recommended: go to "Tools -> Analyze robots.txt" to check the validity of the file.
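You can also sanity-check the rules locally. Here is a minimal sketch (my addition, not from the original article) using Python's standard urllib.robotparser; note that this stdlib parser follows the original robots.txt specification and does not understand the "*" and "$" wildcards described above, so it is best suited to plain path rules like the ones below:

from urllib.robotparser import RobotFileParser

# Illustrative rules, matching example five above.
rules = """
User-agent: *
Disallow: /admin/
Disallow: /manage/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse rules from a list of lines instead of fetching a URL

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("Baiduspider", "http://www.example.com/admin/login.htm"))  # False
print(rp.can_fetch("Baiduspider", "http://www.example.com/index.htm"))        # True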

Original article; when reproducing, please credit: Wuhan SEO-SEM Say

This article address: http://www.semsay.com/seo/37.html
