(8) Scrapy for distributed crawlers: image downloads (with source code)
Reprint the main indicated Source: http://www.cnblogs.com/codefish/p/4968260.html
When crawling, we often need to download files and images. In other languages or frameworks we might filter out the data first and then hand it to an asynchronous download class. The Scrapy framework ships with built-in file and image download pipelines, which is quite convenient: only a few lines of code are required to complete the download. The following shows how to use Scrapy to download the images of a Douban album.
Advantages:
1) Automatic deduplication of already-downloaded URLs
2) Asynchronous, non-blocking downloads
3) Thumbnails of a specified size can be generated
4) An expiration time can be set, so recently downloaded images are not fetched again
5) Image format conversion
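The features above are all controlled from settings.py. A minimal sketch of the relevant image-pipeline settings (the values here are illustrative examples, not taken from the original post):

```python
# Illustrative image-pipeline settings; adjust values to your project.

# Skip re-downloading images fetched within the last 90 days (feature 4).
IMAGES_EXPIRES = 90

# Generate thumbnails at these sizes alongside the full image (feature 3).
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Drop images smaller than these dimensions.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```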
Implementation steps:
1. Define Item
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Item, Field


class DoubanImgsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = Field()
    images = Field()
    image_paths = Field()
```
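The three fields divide the work: the spider fills `image_urls`, and the images pipeline fills `images` and `image_paths` after downloading. A sketch of that flow, with a plain dict standing in for the Scrapy item so it runs without Scrapy installed (the URL and path values are made up):

```python
# Plain-dict stand-in for DoubanImgsItem, showing which side fills each field.
item = {'image_urls': [], 'images': [], 'image_paths': []}

# Spider side: collect source URLs (example URL, not a real album image).
item['image_urls'].append('http://img.example.com/photo1.jpg')

# Pipeline side, after download: Scrapy records one result dict per image.
item['images'] = [{'url': item['image_urls'][0],
                   'path': 'full/abc123.jpg',
                   'checksum': 'd41d8cd9...'}]
item['image_paths'] = [r['path'] for r in item['images']]
print(item['image_paths'])  # ['full/abc123.jpg']
```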
2. Define the spider
```python
# coding=utf-8
from scrapy.spiders import Spider
from douban_imgs.items import DoubanImgsItem
from scrapy.http.request import Request

# Pay attention to the default encoding of the page info,
# otherwise a UnicodeDecodeError is raised (Python 2 only).
import sys
reload(sys)
sys.setdefaultencoding('utf8')


class download_douban(Spider):
    name = 'download_douban'

    def __init__(self, url='152686895', *args, **kwargs):
        self.allowed_domains = ['douban.com']
        self.start_urls = ['http://www.douban.com/photos/album/%s/' % url]
        self.url = url
        # call the base class constructor
        super(download_douban, self).__init__(*args, **kwargs)

    def parse(self, response):
        """Extract every image URL on the album page and yield one item."""
        list_imgs = response.xpath(
            '//div[@class="photolst clearfix"]//img/@src').extract()
        if list_imgs:
            item = DoubanImgsItem()
            item['image_urls'] = list_imgs
            yield item
```
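The XPath in `parse` pulls every `img src` that sits inside the `photolst clearfix` div. The same extraction can be sketched with only the standard library, which makes the selector's behavior easy to verify without Scrapy (the HTML below is made-up markup mimicking the album page structure):

```python
from html.parser import HTMLParser


class PhotoListImgParser(HTMLParser):
    """Collect <img src> values inside the photolst div, mirroring the
    spider's XPath //div[@class="photolst clearfix"]//img/@src."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the photolst div
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth > 0:
                self.depth += 1
            elif attrs.get('class') == 'photolst clearfix':
                self.depth = 1
        elif tag == 'img' and self.depth > 0 and 'src' in attrs:
            self.srcs.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1


html = '''<div class="photolst clearfix">
  <div class="photo_wrap"><img src="http://img.example.com/a.jpg"></div>
  <div class="photo_wrap"><img src="http://img.example.com/b.jpg"></div>
</div>
<img src="http://img.example.com/outside.jpg">'''

parser = PhotoListImgParser()
parser.feed(html)
print(parser.srcs)
# ['http://img.example.com/a.jpg', 'http://img.example.com/b.jpg']
```

Note how the image outside the `photolst` div is ignored, just as the XPath ignores it.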
3. Define the pipeline
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanImgDownloadPieline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule one download request per collected URL
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # keep the storage paths of the successful downloads only
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
```
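`item_completed` receives `results` as a list of `(success, info)` tuples, and the list comprehension keeps paths only from the successful downloads. A self-contained sketch of that filtering with simulated results (the URLs, paths, and checksums are invented for illustration):

```python
# Simulated `results` as ImagesPipeline passes them to item_completed:
# one (success_flag, info) tuple per requested image.
results = [
    (True,  {'url': 'http://img.example.com/a.jpg',
             'path': 'full/a1.jpg', 'checksum': 'aaa'}),
    (False, Exception('download failed')),   # a failed download
    (True,  {'url': 'http://img.example.com/b.jpg',
             'path': 'full/b2.jpg', 'checksum': 'bbb'}),
]

# Same filtering as in the pipeline's item_completed:
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/a1.jpg', 'full/b2.jpg']
```

If every download failed, `image_paths` would be empty and the pipeline raises `DropItem`, so items with no images never reach later pipeline stages.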
4. Edit settings.py and enable the item pipeline
```python
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_imgs.pipelines.DoubanImgDownloadPieline': 300,
}
IMAGES_STORE = 'C:\\doubanimgs'
IMAGES_EXPIRES = 90
```
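With the settings in place, the spider can be run from the project directory. Since the spider's `__init__` accepts a `url` argument, a different album ID can be passed with Scrapy's `-a` option (the ID below is the spider's default; substitute your own):

```shell
# Run the spider with the default album ID
scrapy crawl download_douban

# Or pass a different Douban album ID via the spider argument
scrapy crawl download_douban -a url=152686895
```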
Result: the album images are saved under the IMAGES_STORE directory.
If this Scrapy crawler series helps you, please recommend it. More crawler articles will follow.