Python Crawler: Scrapy-redis Distributed Example (Part 1)


Target task: Convert the earlier Sina news Scrapy crawler project into a distributed crawler based on the RedisSpider class from scrapy-redis, and store the scraped data in a Redis database.

First, the items file is the same as before and does not need to change:

# -*- coding: utf-8 -*-
import scrapy
import sys

reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsItem(scrapy.Item):
    # title and URL of the large (parent) category
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()

    # title and URL of the small (sub) category
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()

    # storage path of the sub-category directory
    subFilename = scrapy.Field()

    # links under the sub-category
    sonUrls = scrapy.Field()

    # article title and content
    head = scrapy.Field()
    content = scrapy.Field()
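Note that the reload(sys) / sys.setdefaultencoding("utf-8") trick only exists in Python 2, which is what this project targets; Python 3 removed setdefaultencoding entirely. If you wanted to run the same items file under Python 3, a minimal sketch (assuming the rest of the project is also ported) would simply drop that block:

# -*- coding: utf-8 -*-
# Python 3 sketch: str is already Unicode, so no default-encoding workaround is needed
import scrapy


class SinanewsItem(scrapy.Item):
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()
    subFilename = scrapy.Field()
    sonUrls = scrapy.Field()
    head = scrapy.Field()
    content = scrapy.Field()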

Second, in the spiders file, replace the previous Spider base class with the RedisSpider class from scrapy-redis; the rest of the code needs only a few changes. The full code is as follows:

# -*- coding: utf-8 -*-
import scrapy
import os
import sys
from sinanews.items import SinanewsItem
from scrapy_redis.spiders import RedisSpider

reload(sys)
sys.setdefaultencoding("utf-8")


class SinaSpider(RedisSpider):
    name = "sina"
    # key used to feed start URLs to the crawler
    redis_key = "sinaspider:start_urls"

    # dynamically define the allowed crawl domains
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(SinaSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        items = []
        # URLs and titles of all large categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        # URLs and titles of all sub-categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # iterate over all large categories
        for i in range(0, len(parentTitle)):
            # iterate over all sub-categories
            for j in range(0, len(subUrls)):
                item = SinanewsItem()
                # save the large-category title and URL
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]

                # check whether the sub-category URL starts with the large-category URL,
                # e.g. sports.sina.com.cn and sports.sina.com.cn/nba
                if_belong = subUrls[j].startswith(item['parentUrls'])

                # if it belongs to this large category, store the sub-category fields
                if if_belong:
                    item['subUrls'] = subUrls[j]
                    item['subTitle'] = subTitle[j]
                    items.append(item)

        # send a request for each sub-category URL and pass the item
        # to the second_parse callback through meta
        for item in items:
            yield scrapy.Request(url=item['subUrls'], meta={'meta_1': item}, callback=self.second_parse)

    # handle the pages returned for each sub-category URL
    def second_parse(self, response):
        # take the meta data out of the response
        meta_1 = response.meta['meta_1']

        # take out all links inside the sub-category page
        sonUrls = response.xpath('//a/@href').extract()

        items = []
        for i in range(0, len(sonUrls)):
            # keep links that start with the large-category URL and end with .shtml
            if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(meta_1['parentUrls'])

            # if the link belongs to this category, copy the field values into a new item
            if if_belong:
                item = SinanewsItem()
                item['parentTitle'] = meta_1['parentTitle']
                item['parentUrls'] = meta_1['parentUrls']
                item['subUrls'] = meta_1['subUrls']
                item['subTitle'] = meta_1['subTitle']
                item['sonUrls'] = sonUrls[i]
                items.append(item)

        # send a request for each article URL and pass the item
        # to the detail_parse callback through meta
        for item in items:
            yield scrapy.Request(url=item['sonUrls'], meta={'meta_2': item}, callback=self.detail_parse)

    # parse the article page: get the article title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()').extract()
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()

        # merge the text of all <p> tags into one string
        for content_one in content_list:
            content += content_one

        item['head'] = head[0] if len(head) > 0 else "NULL"
        item['content'] = content

        yield item

Third, the settings file:

SPIDER_MODULES = ['sinanews.spiders']
NEWSPIDER_MODULE = 'sinanews.spiders'

# use the scrapy-redis de-duplication component instead of scrapy's default
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# use the scrapy-redis scheduler component instead of the default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# allow pause/resume: the request records kept in Redis are not lost
SCHEDULER_PERSIST = True

# default scrapy-redis request queue (by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"

# queue form: first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# stack form: last in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

# just write the data into the Redis database; no custom pipelines file is needed
ITEM_PIPELINES = {
    # 'Sina.pipelines.SinaPipeline': ...,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# LOG_LEVEL = 'DEBUG'

# introduce an artificial delay to make use of parallelism
DOWNLOAD_DELAY = 1

# host IP of the Redis database
REDIS_HOST = "192.168.13.26"
# port number of the Redis database
REDIS_PORT = 6379
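Because SCHEDULER_PERSIST = True, the request queue and the duplicate filter live in Redis and survive a pause or restart. If you ever want to restart a crawl completely from scratch, one option, sketched here assuming scrapy-redis's default key names for a spider named sina, is to delete those keys by hand:

redis-cli> del sina:requests
redis-cli> del sina:dupefilter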

Fourth, execute the commands:

This time we use the local Redis database directly, so comment out REDIS_HOST and REDIS_PORT in the settings file.
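In other words, the bottom of the settings file now looks like this; with both lines commented out, scrapy-redis falls back to its defaults and should connect to Redis on the local machine (localhost:6379):

# Specify the host IP of the database (commented out to use the local Redis)
# REDIS_HOST = "192.168.13.26"
# Specify the port number of the database
# REDIS_PORT = 6379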

Start the crawler

scrapy runspider sina.py
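The spider's __init__ shown above pops an optional domain argument and turns it into allowed_domains, so the crawl range can be restricted from the command line with Scrapy's -a option. To distribute the work, run the same command on every machine that points at the shared Redis instance; a sketch (the domain value here is only an example):

scrapy runspider sina.py -a domain=sina.com.cn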

After the program is launched, the spider starts up and then sits idle: it is waiting for start URLs to appear in Redis. At this point, execute the following command on the Redis side:

redis-cli> lpush sinaspider:start_urls http://news.sina.com.cn/guide/

http://news.sina.com.cn/guide/ is the start URL; as soon as it is pushed, the crawler begins executing.
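Because the item pipeline is scrapy_redis.pipelines.RedisPipeline, scraped items are serialized and pushed into a Redis list as they arrive. A quick way to check the results (a sketch, assuming the pipeline's default items key for a spider named sina):

redis-cli> keys *
redis-cli> llen sina:items
redis-cli> lrange sina:items 0 0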

