(8) Scrapy for distributed crawlers: image downloads (with source code)
Reprint the main indicated Source: http://www.cnblogs.com/codefish/p/4968260.html
When crawling, we often need to download files and images. In other languages or frameworks we might filter out the data first and then hand it to an asynchronous download class. The Scrapy framework ships with built-in file and image download pipelines, which is quite convenient: only a few lines of code are required to complete the download. The following shows how to use Scrapy to download the images of a Douban album.
Advantages:
1) Automatic deduplication of already-downloaded URLs
2) Asynchronous, non-blocking downloads
3) Thumbnails of a specified size can be generated
4) An expiration time can be set, so recently downloaded images are not fetched again
5) Image format conversion
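The features above are all controlled from settings.py. A minimal sketch of the relevant image-pipeline settings (the values here are illustrative examples, not taken from the original post):

```python
# Illustrative image-pipeline settings; adjust values to your project.

# Skip re-downloading images fetched within the last 90 days (feature 4).
IMAGES_EXPIRES = 90

# Generate thumbnails at these sizes alongside the full image (feature 3).
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Drop images smaller than these dimensions.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```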
Implementation steps:
1. Define Item
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Item, Field


class DoubanImgsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = Field()
    images = Field()
    image_paths = Field()
```
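The three fields divide the work: the spider fills `image_urls`, and the images pipeline fills `images` and `image_paths` after downloading. A sketch of that flow, with a plain dict standing in for the Scrapy item so it runs without Scrapy installed (the URL and path values are made up):

```python
# Plain-dict stand-in for DoubanImgsItem, showing which side fills each field.
item = {'image_urls': [], 'images': [], 'image_paths': []}

# Spider side: collect source URLs (example URL, not a real album image).
item['image_urls'].append('http://img.example.com/photo1.jpg')

# Pipeline side, after download: Scrapy records one result dict per image.
item['images'] = [{'url': item['image_urls'][0],
                   'path': 'full/abc123.jpg',
                   'checksum': 'd41d8cd9...'}]
item['image_paths'] = [r['path'] for r in item['images']]
print(item['image_paths'])  # ['full/abc123.jpg']
```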
2. Define the spider
```python
# coding=utf-8
from scrapy.spiders import Spider
from douban_imgs.items import DoubanImgsItem
from scrapy.http.request import Request

# Pay attention to the default encoding of the page info,
# otherwise a UnicodeDecodeError is raised (Python 2 only).
import sys
reload(sys)
sys.setdefaultencoding('utf8')


class download_douban(Spider):
    name = 'download_douban'

    def __init__(self, url='152686895', *args, **kwargs):
        self.allowed_domains = ['douban.com']
        self.start_urls = ['http://www.douban.com/photos/album/%s/' % url]
        self.url = url
        # call the base class constructor
        super(download_douban, self).__init__(*args, **kwargs)

    def parse(self, response):
        """Extract every image URL on the album page and yield one item."""
        list_imgs = response.xpath(
            '//div[@class="photolst clearfix"]//img/@src').extract()
        if list_imgs:
            item = DoubanImgsItem()
            item['image_urls'] = list_imgs
            yield item
```
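The XPath in `parse` pulls every `img src` that sits inside the `photolst clearfix` div. The same extraction can be sketched with only the standard library, which makes the selector's behavior easy to verify without Scrapy (the HTML below is made-up markup mimicking the album page structure):

```python
from html.parser import HTMLParser


class PhotoListImgParser(HTMLParser):
    """Collect <img src> values inside the photolst div, mirroring the
    spider's XPath //div[@class="photolst clearfix"]//img/@src."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the photolst div
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth > 0:
                self.depth += 1
            elif attrs.get('class') == 'photolst clearfix':
                self.depth = 1
        elif tag == 'img' and self.depth > 0 and 'src' in attrs:
            self.srcs.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1


html = '''<div class="photolst clearfix">
  <div class="photo_wrap"><img src="http://img.example.com/a.jpg"></div>
  <div class="photo_wrap"><img src="http://img.example.com/b.jpg"></div>
</div>
<img src="http://img.example.com/outside.jpg">'''

parser = PhotoListImgParser()
parser.feed(html)
print(parser.srcs)
# ['http://img.example.com/a.jpg', 'http://img.example.com/b.jpg']
```

Note how the image outside the `photolst` div is ignored, just as the XPath ignores it.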
3. Define the pipeline
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanImgDownloadPieline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule one download request per collected URL
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # keep the storage paths of the successful downloads only
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
```
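`item_completed` receives `results` as a list of `(success, info)` tuples, and the list comprehension keeps paths only from the successful downloads. A self-contained sketch of that filtering with simulated results (the URLs, paths, and checksums are invented for illustration):

```python
# Simulated `results` as ImagesPipeline passes them to item_completed:
# one (success_flag, info) tuple per requested image.
results = [
    (True,  {'url': 'http://img.example.com/a.jpg',
             'path': 'full/a1.jpg', 'checksum': 'aaa'}),
    (False, Exception('download failed')),   # a failed download
    (True,  {'url': 'http://img.example.com/b.jpg',
             'path': 'full/b2.jpg', 'checksum': 'bbb'}),
]

# Same filtering as in the pipeline's item_completed:
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/a1.jpg', 'full/b2.jpg']
```

If every download failed, `image_paths` would be empty and the pipeline raises `DropItem`, so items with no images never reach later pipeline stages.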
4. Edit settings.py and enable the item pipeline
```python
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_imgs.pipelines.DoubanImgDownloadPieline': 300,
}
IMAGES_STORE = 'C:\\doubanimgs'
IMAGES_EXPIRES = 90
```
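With the settings in place, the spider can be run from the project directory. Since the spider's `__init__` accepts a `url` argument, a different album ID can be passed with Scrapy's `-a` option (the ID below is the spider's default; substitute your own):

```shell
# Run the spider with the default album ID
scrapy crawl download_douban

# Or pass a different Douban album ID via the spider argument
scrapy crawl download_douban -a url=152686895
```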
Result: the album images are saved under the IMAGES_STORE directory.
If this Scrapy crawler series helps you, please recommend it. More crawler articles will follow.