Target task: Modify the previous Sina Scrapy crawler project so that it is based on the RedisSpider class of the scrapy-redis distributed crawler framework, and store the crawled data in the Redis database.
First, the items file: as before, it does not need to change.
# -*- coding: utf-8 -*-
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsItem(scrapy.Item):
    # title and URL of the large category
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()
    # title and sub-URL of the small category
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()
    # storage path of the small-category directory
    subFilename = scrapy.Field()
    # sub-links under the small category
    sonUrls = scrapy.Field()
    # article title and content
    head = scrapy.Field()
    content = scrapy.Field()
Second, the spiders file: use the RedisSpider class to replace the previous Spider class; only a few other changes are needed. The specific code is as follows:
# -*- coding: utf-8 -*-
import scrapy
import os
import sys
from sinanews.items import SinanewsItem
from scrapy_redis.spiders import RedisSpider
reload(sys)
sys.setdefaultencoding("utf-8")


class SinaSpider(RedisSpider):
    name = "sina"
    # the Redis key used to start the crawler
    redis_key = "sinaspider:start_urls"

    # dynamically define the crawler's allowed domain range
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(SinaSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        items = []
        # URLs and titles of all large categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        # URLs and titles of all small categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # loop over all large categories
        for i in range(0, len(parentTitle)):
            # loop over all small categories
            for j in range(0, len(subUrls)):
                item = SinanewsItem()
                # save the title and URL of the large category
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]

                # check whether the small-category URL starts with the URL of the same
                # large category, e.g. sports.sina.com.cn and sports.sina.com.cn/nba
                if_belong = subUrls[j].startswith(item['parentUrls'])
                # if it belongs to this large category, store it under that category
                if if_belong:
                    # save the small-category URL and title field data
                    item['subUrls'] = subUrls[j]
                    item['subTitle'] = subTitle[j]
                    items.append(item)

        # send a request for each small-category URL; the response, together with the
        # meta data, is handed to the callback second_parse for processing
        for item in items:
            yield scrapy.Request(url=item['subUrls'],
                                 meta={'meta_1': item},
                                 callback=self.second_parse)

    # recursively request the URLs returned for the small categories
    def second_parse(self, response):
        # extract the meta data of each response
        meta_1 = response.meta['meta_1']

        # take out all the sub-links of the small category
        sonUrls = response.xpath('//a/@href').extract()

        items = []
        for i in range(0, len(sonUrls)):
            # check that each link starts with the large-category URL and ends with .shtml
            if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(meta_1['parentUrls'])

            # if it belongs to this category, copy the field values into a new item for easy transfer
            if if_belong:
                item = SinanewsItem()
                item['parentTitle'] = meta_1['parentTitle']
                item['parentUrls'] = meta_1['parentUrls']
                item['subUrls'] = meta_1['subUrls']
                item['subTitle'] = meta_1['subTitle']
                item['sonUrls'] = sonUrls[i]
                items.append(item)

        # send a request for each sub-link URL; the response, together with the
        # meta data, is handed to the callback detail_parse for processing
        for item in items:
            yield scrapy.Request(url=item['sonUrls'],
                                 meta={'meta_2': item},
                                 callback=self.detail_parse)

    # data parsing method: get the article title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()').extract()
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()

        # merge the text content of the <p> tags together
        for content_one in content_list:
            content += content_one

        item['head'] = head[0] if len(head) > 0 else "NULL"
        item['content'] = content
        yield item
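Compared with the original Spider version, the essential changes are confined to the class header: inherit from RedisSpider, replace the hard-coded start_urls list with a redis_key, and build allowed_domains dynamically. The following minimal skeleton isolates just that pattern; the class name ExampleSpider and the key "example:start_urls" are illustrative, not part of this project:

# -*- coding: utf-8 -*-
# Minimal sketch of the Spider -> RedisSpider conversion pattern (illustrative names).
import scrapy
from scrapy_redis.spiders import RedisSpider


class ExampleSpider(RedisSpider):
    name = "example"
    # start URLs are read from this Redis list instead of a start_urls attribute
    redis_key = "example:start_urls"

    def __init__(self, *args, **kwargs):
        # allowed domains can be supplied at start-up, e.g. -a domain=example.com
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(ExampleSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # normal parsing logic; yielded requests go through the shared Redis scheduler
        yield {'url': response.url}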
Third, the settings file:
SPIDER_MODULES = ['sinanews.spiders']
NEWSPIDER_MODULE = 'sinanews.spiders'

# use the de-duplication component from scrapy-redis instead of Scrapy's default one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# use the scheduler component from scrapy-redis instead of the default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# allow pausing; the request records in Redis are not lost
SCHEDULER_PERSIST = True

# default scrapy-redis request queue form (by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"

# queue form: requests are handled first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"

# stack form: requests are handled last in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

# just put the data into the Redis database; there is no need to write a pipelines file
ITEM_PIPELINES = {
    # 'Sina.pipelines.SinaPipeline':
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# LOG_LEVEL = 'DEBUG'

# introduce an artificial delay to make use of parallelism
DOWNLOAD_DELAY = 1

# specify the host IP of the Redis database
REDIS_HOST = "192.168.13.26"
# specify the port number of the Redis database
REDIS_PORT = 6379
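With RedisPipeline enabled, every yielded item is serialized and pushed into a Redis list, from which it can be processed later. As a rough sketch of how the stored data could be consumed (this script is not part of the original project; it assumes the redis-py package is installed and that the pipeline keeps its default items key, "<spider name>:items", i.e. "sina:items" here):

# -*- coding: utf-8 -*-
# Sketch: pop crawled items back out of Redis for further processing.
import json
import redis

r = redis.Redis(host="192.168.13.26", port=6379)

while True:
    # blocking pop: waits until the crawler pushes another serialized item
    key, data = r.blpop("sina:items")
    item = json.loads(data)
    # the fields defined in SinanewsItem are now plain dictionary keys
    print(item['sonUrls'])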
Fourth, execute the commands. This time the local Redis database is used directly, so comment out the REDIS_HOST and REDIS_PORT lines in the settings file.

Start the crawler:

scrapy runspider sina.py
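Since the spider's __init__ accepts a domain keyword argument, the allowed domains can presumably also be supplied at start-up with Scrapy's -a option, for example: scrapy runspider sina.py -a domain=sina.com.cn. Because the scheduler and request queue live in Redis, the same command can be started in several terminals or on several machines, and all instances share the same crawl.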
After the command is executed, the terminal window shows that the program is in a waiting state. At this point, execute the following command on the Redis side:
redis-cli> lpush sinaspider:start_urls http://news.sina.com.cn/guide/
http://news.sina.com.cn/guide/ is the starting URL; at this point the program starts executing.
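The same push could also be done from Python, for instance with the redis-py client (a sketch, assuming redis-py is installed and Redis runs locally; the key must match the spider's redis_key):

# -*- coding: utf-8 -*-
# Sketch: push the starting URL into the Redis list that the waiting spiders read from.
import redis

r = redis.Redis(host="localhost", port=6379)
r.lpush("sinaspider:start_urls", "http://news.sina.com.cn/guide/")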