2017-07-25 21:08:16
I. The scale of web crawlers
II. Restrictions on web crawlers
- Source review: check the User-Agent to restrict access
The site inspects the User-Agent field of incoming HTTP request headers and responds only to browsers and known friendly crawlers.
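This source-review idea can be sketched server-side. A minimal sketch, assuming a hypothetical allowed-prefix list and function name (both illustrative, not from the original):

```python
# Minimal sketch of source review: answer only requests whose User-Agent
# header starts with a browser or friendly-crawler prefix.
FRIENDLY_AGENTS = ("Mozilla", "Baiduspider", "Googlebot")  # illustrative list

def is_allowed(user_agent: str) -> bool:
    """Return True if the User-Agent matches a friendly prefix."""
    return user_agent.startswith(FRIENDLY_AGENTS)

print(is_allowed("Mozilla/5.0 (Windows NT 10.0)"))  # browser -> True
print(is_allowed("EvilScraper/1.0"))                # unknown bot -> False
```

Real sites apply this check in the web server or application framework; prefix matching is easy to spoof, which is why the announcement-based Robots protocol below exists as a complement.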
- Announcement: the Robots protocol
The site announces its crawling policy to all crawlers and requires them to comply.
III. The Robots protocol
Role: the website tells web crawlers which pages may be crawled and which may not.
Form: a robots.txt file placed in the root directory of the website.
If a site does not provide robots.txt, it is taken to allow any crawler to crawl any content, any number of times.
Human-like behavior (low-volume access resembling a person browsing) may, in principle, ignore the Robots protocol.
https://www.baidu.com/robots.txt
http://news.sina.com.cn/robots.txt
Example:
https://www.jd.com/robots.txt
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /

# Syntax notes: "*" matches all crawlers, "/" represents the root directory.
# A file that bans every crawler from everything:
User-agent: *
Disallow: /
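Rules like these can be checked programmatically with the standard-library `urllib.robotparser`. A sketch, feeding the parser a few rules modeled on (but abbreviated from) the example above; in practice you would point it at a live file with `set_url(...)` and `read()` instead:

```python
from urllib import robotparser

# Abbreviated rules in the spirit of the example above: one fully banned
# crawler, plus a path ban for everyone else.
rules = """\
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /pop/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# EtaoSpider is banned from the whole site.
print(rp.can_fetch("EtaoSpider", "https://www.jd.com/"))
# Any other crawler (name here is hypothetical) is banned only under /pop/.
print(rp.can_fetch("MyCrawler", "https://www.jd.com/pop/x.html"))
print(rp.can_fetch("MyCrawler", "https://www.jd.com/index.html"))
```

Note that `urllib.robotparser` does not interpret wildcard patterns such as `*.html` inside paths, so the sketch uses a plain path prefix; a compliant crawler calls `can_fetch` before every request.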