Abstract: Remember the big news a while back when Sina blocked Baidu's spider? If you have learned how to write robots.txt, it is actually simple to tell whether such a story is true or false. Learning the technique helps you see the truth for yourself.
First, let's meet our dear spiders:
Domestic search engine Spiders
Baidu Spider: Baiduspider
Sogou Spider: Sogou spider
Youdao Spider: Yodaobot and Outfoxbot
Soso Spider: Sosospider
Foreign search engine Spiders
Google Spider: Googlebot
Yahoo Spider: Yahoo! Slurp
Alexa Spider: ia_archiver
Bing Spider (MSN): msnbot
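To tell which search engine a request in your access log came from, you can look for these tokens in the User-Agent header. The sketch below is a minimal illustration: the tokens come from the list above, while the substring-matching approach and the sample User-Agent strings are my own assumptions (real User-Agent strings vary between versions).

```python
# Identify a search engine spider from an HTTP User-Agent header by
# case-insensitive substring matching against the tokens listed above.
SPIDER_TOKENS = {
    "baiduspider": "Baidu",
    "sogou spider": "Sogou",
    "yodaobot": "Youdao",
    "outfoxbot": "Youdao",
    "sosospider": "Soso",
    "googlebot": "Google",
    "slurp": "Yahoo",
    "ia_archiver": "Alexa",
    "msnbot": "Bing/MSN",
}

def identify_spider(user_agent):
    """Return the engine name, or None if no known spider token matches."""
    ua = user_agent.lower()
    for token, engine in SPIDER_TOKENS.items():
        if token in ua:
            return engine
    return None

print(identify_spider("Mozilla/5.0 (compatible; Baiduspider/2.0)"))  # Baidu
print(identify_spider("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # Google
```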
Common robots.txt directives:
User-agent: the robot(s) the rules below apply to
Allow: paths the robot is allowed to crawl
Disallow: paths the robot must not crawl
Two common wildcard characters in robots.txt:
"*": matches zero or more characters of any kind (in other words, "all")
"$": matches the end of the URL
That's enough introduction; now to the point: writing robots.txt.
1. Allow all spiders to crawl:
User-agent: *
Disallow:
Or:
User-agent: *
Allow: /
(The "*" here can be understood to mean "all".)
2. Block all robots from crawling:
User-agent: *
Disallow: /
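You can check how these first two rule sets behave with Python's standard urllib.robotparser. This is a sketch, not part of the original article; example.com is a placeholder domain.

```python
from urllib import robotparser

def can_fetch(rules, agent, url):
    """Parse a robots.txt string and ask whether `agent` may fetch `url`."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)

# Rule 1: an empty Disallow means everything is allowed.
allow_all = "User-agent: *\nDisallow:"
# Rule 2: Disallow: / blocks the whole site.
block_all = "User-agent: *\nDisallow: /"

print(can_fetch(allow_all, "Baiduspider", "http://example.com/page.html"))  # True
print(can_fetch(block_all, "Baiduspider", "http://example.com/page.html"))  # False
```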
3. Block one particular spider from crawling:
User-agent: spider name (see the list above)
Disallow: /
4. Allow only one particular spider to crawl:
User-agent: spider name (see the list above)
Disallow:
User-agent: *
Disallow: /
The first record lets the named spider crawl everything; the second blocks everyone else. Taken together, the file allows only that one spider and shuts out the rest.
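The "allow only one spider" pattern can be verified with the standard library parser as well. A sketch, with Baiduspider chosen as the permitted spider and example.com as a placeholder:

```python
from urllib import robotparser

# Only Baiduspider may crawl; every other bot falls through to the
# catch-all "*" record and is blocked.
rules = """\
User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "http://example.com/page.html"))  # True
print(rp.can_fetch("Googlebot", "http://example.com/page.html"))    # False
```

Note that a robot reads only the first record that matches its name, which is why the specific record must come before the catch-all.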
5. Block spiders from certain directories:
For example, to block crawling of the admin and manage directories:
User-agent: *
Disallow: /admin/
Disallow: /manage/
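Directory rules are plain prefix matches, which the standard library parser handles directly. A quick sketch (paths under example.com are made up for illustration):

```python
from urllib import robotparser

# Block only the admin and manage directories; everything else stays open.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /manage/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "http://example.com/admin/login.php"))  # False
print(rp.can_fetch("Baiduspider", "http://example.com/news/1.html"))      # True
```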
6. Block files with a particular extension, using "*":
For example, to block crawling of .htm files:
User-agent: *
Disallow: /*.htm
(Put "*" before the dot and extension; the same works for .asp, .php, and so on.)
7. Allow only files with a particular extension, using "$":
For example, to allow only .htm files:
User-agent: *
Allow: /*.htm$
Disallow: /
(Images can be handled the same way as rules 6 and 7.)
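A caveat: "*" and "$" are extensions supported by the major engines (Google, Baidu, and others), and Python's urllib.robotparser does not understand them. The sketch below is my own simplified illustration of how such wildcard matching works, not the exact algorithm any engine uses; in particular it ignores the longest-match precedence real engines apply between Allow and Disallow rules.

```python
import re

def robots_pattern_matches(pattern, path):
    """Simplified sketch of wildcard matching in robots.txt patterns:
    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL, and everything else is a literal match from the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# Rule 6: Disallow: /*.htm  -- matches any path containing ".htm"
print(robots_pattern_matches("/*.htm", "/news/index.htm"))   # True
print(robots_pattern_matches("/*.htm", "/news/index.php"))   # False
# Rule 7: Allow: /*.htm$    -- matches only paths ending exactly in ".htm"
print(robots_pattern_matches("/*.htm$", "/index.htm"))       # True
print(robots_pattern_matches("/*.htm$", "/index.html"))      # False
# Rule 8 below: Disallow: /*?*  -- matches any URL with a query string
print(robots_pattern_matches("/*?*", "/forum.php?tid=1"))    # True
```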
8. Block dynamic pages:
User-agent: *
Disallow: /*?*
This is very useful on forums. If your URLs are pseudo-static, you do not need search engines to index the dynamic addresses as well. Forum owners, take note.
9. Declare the sitemap:
This tells search engines where your sitemap is:
Sitemap: http://your domain/sitemap.xml
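Since Python 3.8, urllib.robotparser can read Sitemap lines out of a robots.txt file directly. A small sketch (the sitemap URL is a placeholder):

```python
from urllib import robotparser

# The Sitemap directive is independent of any User-agent record.
rules = """\
User-agent: *
Disallow:
Sitemap: http://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.site_maps())  # ['http://example.com/sitemap.xml']
```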
How do you check that your robots.txt file is valid? Google Webmaster Tools is recommended: log in and open "Tools -> Analyze robots.txt" to check the file.
Original article; for reprints, please credit: Wuhan SEO - SEM Say
This article's address: http://www.semsay.com/seo/37.html