Target task: Modify the previous Sina Scrapy crawler project so that it is based on the RedisSpider class of the scrapy-redis distributed crawler framework, and store the crawled data in the Redis database.
First, the items file: as before, it does not need to change.
# -*- coding: utf-8 -*-
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsItem(scrapy.Item):
    # title and URL of the large category
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()
    # title and sub-URL of the small category
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()
    # storage path of the small-category directory
    subFilename = scrapy.Field()
    # sub-links under the small category
    sonUrls = scrapy.Field()
    # article title and content
    head = scrapy.Field()
    content = scrapy.Field()
Second, the spiders file: use the RedisSpider class to replace the previous Spider class; only a few other changes are needed. The specific code is as follows:
# -*- coding: utf-8 -*-
import scrapy
import os
import sys
from sinanews.items import SinanewsItem
from scrapy_redis.spiders import RedisSpider
reload(sys)
sys.setdefaultencoding("utf-8")


class SinaSpider(RedisSpider):
    name = "sina"
    # the Redis key used to start the crawler
    redis_key = "sinaspider:start_urls"

    # dynamically define the crawler's allowed domain range
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(SinaSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        items = []
        # URLs and titles of all large categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        # URLs and titles of all small categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # loop over all large categories
        for i in range(0, len(parentTitle)):
            # loop over all small categories
            for j in range(0, len(subUrls)):
                item = SinanewsItem()
                # save the title and URL of the large category
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]

                # check whether the small-category URL starts with the URL of the same
                # large category, e.g. sports.sina.com.cn and sports.sina.com.cn/nba
                if_belong = subUrls[j].startswith(item['parentUrls'])
                # if it belongs to this large category, store it under that category
                if if_belong:
                    # save the small-category URL and title field data
                    item['subUrls'] = subUrls[j]
                    item['subTitle'] = subTitle[j]
                    items.append(item)

        # send a request for each small-category URL; the response, together with the
        # meta data, is handed to the callback second_parse for processing
        for item in items:
            yield scrapy.Request(url=item['subUrls'],
                                 meta={'meta_1': item},
                                 callback=self.second_parse)

    # recursively request the URLs returned for the small categories
    def second_parse(self, response):
        # extract the meta data of each response
        meta_1 = response.meta['meta_1']

        # take out all the sub-links of the small category
        sonUrls = response.xpath('//a/@href').extract()

        items = []
        for i in range(0, len(sonUrls)):
            # check that each link starts with the large-category URL and ends with .shtml
            if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(meta_1['parentUrls'])

            # if it belongs to this category, copy the field values into a new item for easy transfer
            if if_belong:
                item = SinanewsItem()
                item['parentTitle'] = meta_1['parentTitle']
                item['parentUrls'] = meta_1['parentUrls']
                item['subUrls'] = meta_1['subUrls']
                item['subTitle'] = meta_1['subTitle']
                item['sonUrls'] = sonUrls[i]
                items.append(item)

        # send a request for each sub-link URL; the response, together with the
        # meta data, is handed to the callback detail_parse for processing
        for item in items:
            yield scrapy.Request(url=item['sonUrls'],
                                 meta={'meta_2': item},
                                 callback=self.detail_parse)

    # data parsing method: get the article title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()').extract()
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()

        # merge the text content of the <p> tags together
        for content_one in content_list:
            content += content_one

        item['head'] = head[0] if len(head) > 0 else "NULL"
        item['content'] = content
        yield item
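Compared with the original Spider version, the essential changes are confined to the class header: inherit from RedisSpider, replace the hard-coded start_urls list with a redis_key, and build allowed_domains dynamically. The following minimal skeleton isolates just that pattern; the class name ExampleSpider and the key "example:start_urls" are illustrative, not part of this project:

# -*- coding: utf-8 -*-
# Minimal sketch of the Spider -> RedisSpider conversion pattern (illustrative names).
import scrapy
from scrapy_redis.spiders import RedisSpider


class ExampleSpider(RedisSpider):
    name = "example"
    # start URLs are read from this Redis list instead of a start_urls attribute
    redis_key = "example:start_urls"

    def __init__(self, *args, **kwargs):
        # allowed domains can be supplied at start-up, e.g. -a domain=example.com
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(ExampleSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # normal parsing logic; yielded requests go through the shared Redis scheduler
        yield {'url': response.url}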
Third, the settings file:
SPIDER_MODULES = ['sinanews.spiders']
NEWSPIDER_MODULE = 'sinanews.spiders'

# use the de-duplication component from scrapy-redis instead of Scrapy's default one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# use the scheduler component from scrapy-redis instead of the default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# allow pausing; the request records in Redis are not lost
SCHEDULER_PERSIST = True

# default scrapy-redis request queue form (by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"

# queue form: requests are handled first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"

# stack form: requests are handled last in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

# just put the data into the Redis database; there is no need to write a pipelines file
ITEM_PIPELINES = {
    # 'Sina.pipelines.SinaPipeline':
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# LOG_LEVEL = 'DEBUG'

# introduce an artificial delay to make use of parallelism
DOWNLOAD_DELAY = 1

# specify the host IP of the Redis database
REDIS_HOST = "192.168.13.26"
# specify the port number of the Redis database
REDIS_PORT = 6379
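With RedisPipeline enabled, every yielded item is serialized and pushed into a Redis list, from which it can be processed later. As a rough sketch of how the stored data could be consumed (this script is not part of the original project; it assumes the redis-py package is installed and that the pipeline keeps its default items key, "<spider name>:items", i.e. "sina:items" here):

# -*- coding: utf-8 -*-
# Sketch: pop crawled items back out of Redis for further processing.
import json
import redis

r = redis.Redis(host="192.168.13.26", port=6379)

while True:
    # blocking pop: waits until the crawler pushes another serialized item
    key, data = r.blpop("sina:items")
    item = json.loads(data)
    # the fields defined in SinanewsItem are now plain dictionary keys
    print(item['sonUrls'])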
Fourth, execute the commands. This time the local Redis database is used directly, so comment out the REDIS_HOST and REDIS_PORT lines in the settings file.

Start the crawler:

scrapy runspider sina.py
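Since the spider's __init__ accepts a domain keyword argument, the allowed domains can presumably also be supplied at start-up with Scrapy's -a option, for example: scrapy runspider sina.py -a domain=sina.com.cn. Because the scheduler and request queue live in Redis, the same command can be started in several terminals or on several machines, and all instances share the same crawl.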
After the command is executed, the terminal window shows that the program is in a waiting state. At this point, execute the following command on the Redis side:
redis-cli> lpush sinaspider:start_urls http://news.sina.com.cn/guide/
http://news.sina.com.cn/guide/ is the starting URL; at this point the program starts executing.
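The same push could also be done from Python, for instance with the redis-py client (a sketch, assuming redis-py is installed and Redis runs locally; the key must match the spider's redis_key):

# -*- coding: utf-8 -*-
# Sketch: push the starting URL into the Redis list that the waiting spiders read from.
import redis

r = redis.Redis(host="localhost", port=6379)
r.lpush("sinaspider:start_urls", "http://news.sina.com.cn/guide/")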