My previous Scrapy notes only introduced some of its concepts without an actual example, so this post records a concrete one.
The environment is Python 2.7 with Scrapy 1.2.0.
First, create a project. In the directory where you want it, run:

scrapy startproject tutorial

Scrapy generates the project skeleton. Then create a spider:

scrapy genspider zhihuspider zhihu.com
The current file structure is:

--tutorial
    --spiders
        --__init__.py
        --zhihuspider.py
    --__init__.py
    --items.py
    --pipelines.py
    --settings.py
Three files are mainly used: zhihuspider.py, items.py, and settings.py.
zhihuspider.py contains the crawling spider, items.py defines the structure that holds the scraped data, and settings.py holds the configuration.
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from tutorial.items import ZhihuItem


# main body of the spider
class ZhihuSpider(Spider):
    name = "zhihuspider"
    allowed_domains = ['zhihu.com']
    start_urls = []

    # build the urls to crawl; I put the search keywords in zhihu.txt
    def start_requests(self):
        url_head = 'https://www.zhihu.com/search?type=content&q='
        with open('zhihu.txt', 'r') as f:
            datas = f.readlines()
        for data in datas:
            url = url_head + data.strip()
            print url
            self.start_urls.append(url)
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    # parse the returned response; use the browser's developer tools
    # to work out the XPath expressions
    def parse(self, response):
        # remove all <em> tags so text() is not split across them
        response = response.replace(body=response.body.replace('<em>', ''))
        hxs = Selector(response)
        contents = hxs.xpath('//*[@class="zu-main-content"]//*[contains(@class, "list")]')
        for content in contents:
            item = ZhihuItem()
            item['search_title'] = content.xpath('//*[@class="title"]/a/text()').extract()
            item['search_title_link'] = content.xpath('//*[@class="title"]/a/@href').extract()
            item['search_answer'] = content.xpath('//*[@class="content"]//*[contains(@class, "entry-content")]//*[contains(@class, "summary")]/text()').extract()
            item['search_answer_link'] = content.xpath('//*[@class="content"]//*[contains(@class, "entry-content")]//*[contains(@class, "summary")]/a/@href').extract()
            item['search_answer_writer'] = content.xpath('//*[@class="content"]//*[contains(@class, "entry-meta")]//a[contains(@class, "author")]/text()').extract()
            print item
            yield item
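As a quick illustration of what start_requests does with zhihu.txt, the sketch below builds the same search URLs from a list of keyword lines. The URL-encoding step (quote) is my addition for safety with spaces and non-ASCII keywords; the post's code concatenates the raw line, and build_search_urls is a hypothetical helper name.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

URL_HEAD = 'https://www.zhihu.com/search?type=content&q='


def build_search_urls(lines):
    # strip the trailing newline that readlines() leaves on each keyword,
    # skip blank lines, and URL-encode the keyword
    return [URL_HEAD + quote(line.strip()) for line in lines if line.strip()]


urls = build_search_urls(['scrapy\n', 'python\n'])
# urls[0] == 'https://www.zhihu.com/search?type=content&q=scrapy'
```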
from scrapy import Item, Field


class ZhihuItem(Item):
    search_title = Field()
    search_title_link = Field()
    search_answer = Field()
    search_answer_link = Field()
    search_answer_writer = Field()

The items.py file defines the fields that hold the scraped data.
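Scrapy Items behave like dicts but raise KeyError for fields that were not declared. The stand-in class below only illustrates that contract; it is not Scrapy's implementation, and the field list mirrors items.py above.

```python
# Illustrative stand-in for how a Scrapy Item rejects undeclared fields.
class FakeItem(dict):
    fields = ('search_title', 'search_title_link', 'search_answer',
              'search_answer_link', 'search_answer_writer')

    def __setitem__(self, key, value):
        # a real Scrapy Item also raises KeyError for unknown fields
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        dict.__setitem__(self, key, value)


item = FakeItem()
item['search_title'] = ['some title']   # declared field: accepted
try:
    item['undeclared'] = 'x'            # undeclared field: rejected
except KeyError:
    pass
```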
Because Zhihu restricts crawlers, we need to add some anti-blocking measures.
There are basically four: adding a random User-Agent, using proxies, disabling cookies, and throttling the crawl rate.
Under the current project, in the same directory as settings.py, add a Python package named middlewares (containing only an __init__.py), and create a new file randomuseragent.py inside it:
# coding: utf-8
import random


class RandomUserAgent(object):
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # read the agent pool from the USER_AGENTS setting
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # set a random User-Agent unless one is already present
        request.headers.setdefault('User-Agent', random.choice(self.agents))
This simply attaches a random User-Agent header to each request.
Finally, enable it in settings.py:
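The effect of process_request can be sketched with a plain dict standing in for request.headers; the three agent strings here are made up.

```python
import random

# hypothetical agent pool and a dict in place of request.headers
agents = ['UA-one', 'UA-two', 'UA-three']
headers = {}

# what the middleware does per request: pick one at random
headers.setdefault('User-Agent', random.choice(agents))
chosen = headers['User-Agent']

# setdefault never overwrites an existing header, so a User-Agent set
# explicitly by the spider would take precedence over the random one
headers.setdefault('User-Agent', 'UA-other')
```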
DOWNLOADER_MIDDLEWARES = {
    # 'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,
    'tutorial.middlewares.randomuseragent.RandomUserAgent': 1,
}

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-CN) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]

COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
That completes the project. Run scrapy crawl zhihuspider -o zhihu.json to export the data. It's that simple.
Finally, when crawling, remember to set a reasonable download delay and avoid burdening the server.
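With -o zhihu.json, Scrapy's feed export writes the items as a JSON array, which can be read back with plain json. The sample record below is made up, using the same field names as items.py.

```python
import json

# a hypothetical one-item export, shaped like ZhihuItem
sample = ('[{"search_title": ["How to learn scrapy?"],'
          ' "search_title_link": ["/question/123"]}]')

items = json.loads(sample)
# each element is a dict whose keys are the declared item fields
first = items[0]
```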
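DOWNLOAD_DELAY works together with RANDOMIZE_DOWNLOAD_DELAY (on by default), which waits between 0.5x and 1.5x the configured delay, so DOWNLOAD_DELAY = 3 means roughly 1.5 to 4.5 seconds between requests to the same site:

```python
import random

DOWNLOAD_DELAY = 3

# with RANDOMIZE_DOWNLOAD_DELAY enabled (the default), the actual wait is
# drawn uniformly from [0.5 * delay, 1.5 * delay]
delay = random.uniform(0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)
```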
The project has been pushed to GitHub at https://github.com/lin344902118/doubanSpider.git.
It also includes a spider for Douban.
Python Scrapy simple crawler record (for simple crawls)