(8) Distributed crawling with Scrapy: image download (source code included)


Reprinted from: http://www.cnblogs.com/codefish/p/4968260.html (please indicate the source when reposting)

 

When writing crawlers we frequently need to download files and images. In other languages or frameworks, you usually filter the data first and then call an asynchronous download class yourself to get the job done. The Scrapy framework has file and image download pipelines built in, which is quite convenient: a few lines of code are enough to complete the download. The following shows how to use Scrapy to download all the images on the front page of a Douban photo album.

Advantages:

1) Automatic deduplication of already-downloaded URLs

2) Asynchronous, non-blocking downloads

3) Thumbnails of specified sizes can be generated (see the settings sketch below)

4) Expiration is handled via a configurable expiry interval

5) Image format conversion
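As a hedged illustration of points 3) and 4): the stock Scrapy ImagesPipeline reads the settings below from settings.py. The setting names are standard; the sizes and the day count are made-up example values.

IMAGES_THUMBS = {
    'small': (50, 50),     # thumbnails saved under thumbs/small/<sha1>.jpg
    'big': (270, 270),     # thumbnails saved under thumbs/big/<sha1>.jpg
}
IMAGES_EXPIRES = 90        # skip re-downloading images fetched within the last 90 days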

 

 

Coding process:

1. Define the Item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Item, Field


class DoubanImgsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = Field()
    images = Field()
    image_paths = Field()
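A note added to this reprint: the field names image_urls and images are not arbitrary. The stock ImagesPipeline looks for exactly these two fields unless you point it elsewhere in settings.py; both settings below are standard Scrapy ones, shown with their default values.

IMAGES_URLS_FIELD = 'image_urls'      # where the pipeline reads the URLs to fetch
IMAGES_RESULT_FIELD = 'images'        # where the pipeline writes download results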

 

 

2. Define the Spider

# coding=utf-8
from scrapy.spiders import Spider
from douban_imgs.items import DoubanImgsItem

# Python 2 encoding workaround: without it, non-ASCII page content raises errors
import sys
reload(sys)
sys.setdefaultencoding('utf8')


class download_douban(Spider):
    name = 'download_douban'

    def __init__(self, url='152686895', *args, **kwargs):
        self.allowed_domains = ['douban.com']
        self.start_urls = ['http://www.douban.com/photos/album/%s/' % url]
        self.url = url
        # call the base class constructor
        super(download_douban, self).__init__(*args, **kwargs)

    def parse(self, response):
        """Collect every image URL on the album page and emit one item."""
        list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
        if list_imgs:
            item = DoubanImgsItem()
            item['image_urls'] = list_imgs
            yield item
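Because __init__ accepts a url keyword argument, a different album id can be passed at run time through Scrapy's standard -a spider-argument mechanism (the id below is simply the default from the code):

scrapy crawl download_douban -a url=152686895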

 

3. Define the pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanImgDownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule one download request per image URL in the item
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # keep only the storage paths of successful downloads
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
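Not part of the original post, but a common follow-up: by default ImagesPipeline stores files as full/<SHA1 of the URL>.jpg under IMAGES_STORE. To keep the original file names instead, file_path can be overridden; the class name below is hypothetical.

import os
from scrapy.pipelines.images import ImagesPipeline

class NamedImgDownloadPipeline(ImagesPipeline):  # hypothetical name
    def file_path(self, request, response=None, info=None):
        # use the last URL segment as the file name,
        # e.g. full/p2262345678.jpg instead of full/<sha1>.jpg
        return 'full/%s' % os.path.basename(request.url)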

 

 

4. Edit settings.py and enable the pipeline

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_imgs.pipelines.DoubanImgDownloadPipeline': 300,
}
IMAGES_STORE = 'C:\\doubanimgs'
IMAGES_EXPIRES = 90
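One more optional knob, not used in the original post: the pipeline can silently drop images below a minimum size. Both setting names are standard Scrapy ones; the pixel values are illustrative.

IMAGES_MIN_HEIGHT = 100   # ignore images shorter than 100 px
IMAGES_MIN_WIDTH = 100    # ignore images narrower than 100 px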

 

Running effect: (the screenshots from the original post are not included in this reprint)

 


If this Scrapy/crawler series is helpful to you, please recommend it; more posts in the series will follow.
