In Scrapy, to crawl images you put the URLs of the images to be downloaded in the item's image_urls field. When the item is returned from the spider, the ImagesPipeline schedules those URLs for download at a higher priority, and the item is locked until the downloads finish. Once the images have been downloaded, the download path, the original URL, and other information are filled into the images field.
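As a plain-Python illustration of that flow (this is a sketch of the item shape, not Scrapy code; the URL, path, and checksum values are illustrative):

```python
# What a hypothetical item looks like before and after the pipeline runs.
# The spider yields only image_urls; the pipeline fills in images.
item_from_spider = {
    'image_urls': ['http://www.example.com/image.jpg'],
}

# After download, the pipeline adds one dict per downloaded image.
item_after_pipeline = {
    'image_urls': ['http://www.example.com/image.jpg'],
    'images': [
        {'url': 'http://www.example.com/image.jpg',
         'path': 'full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg',
         'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'},  # illustrative value
    ],
}
```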
To crawl images successfully, do the following:
(1) Add the image_urls and images fields to items.py, as follows:
image_urls = Field()
images = Field()
(2) Enable the ImagesPipeline by adding the following to settings.py:
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/path/to/valid/dir'  # the image download path
(3) Image storage
Images are stored under file names derived from the SHA1 hash of their original URLs.
For example, for the URL http://www.example.com/image.jpg:
SHA1 hash: 3afec3b4765f8f0a07b78f98c07b83f013567a0a
Stored as: <IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
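The naming rule above can be reproduced with the standard library's hashlib; this is only a sketch of the rule, not Scrapy's internal code, and IMAGES_STORE here just mirrors the setting shown earlier:

```python
import hashlib

IMAGES_STORE = '/path/to/valid/dir'  # same placeholder as in settings.py

def image_path(url):
    # Scrapy names the stored file after the SHA1 hash of the original URL
    image_id = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s/full/%s.jpg' % (IMAGES_STORE, image_id)

print(image_path('http://www.example.com/image.jpg'))
# -> /path/to/valid/dir/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
```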
(4) Avoiding repeated downloads
Recently downloaded images are not downloaded again. You can set the expiration period, in days, in settings.py:
IMAGES_EXPIRES = 90  # images downloaded within the last 90 days are not re-downloaded
(5) Thumbnail generation
The pipeline can also generate thumbnails of the downloaded images:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
Thumbnails are stored in: <IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
(6) Image filtering
You can drop images that are too small by setting minimum dimensions:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
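Conceptually, only images at least as large as both minimums are kept; a small sketch of that check (the function name is ours, not part of Scrapy's API):

```python
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

def passes_size_filter(width, height):
    # Images smaller than either minimum are dropped by the pipeline
    return width >= IMAGES_MIN_WIDTH and height >= IMAGES_MIN_HEIGHT

print(passes_size_filter(270, 270))  # large enough -> True
print(passes_size_filter(50, 270))   # too narrow  -> False
```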
(7) Customizing the ImagesPipeline
Override the following methods:
get_media_requests(item, info):
The ImagesPipeline downloads the URLs listed in image_urls; get_media_requests generates a Request for each URL. For example:
def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield Request(image_url)
After the images are downloaded, the results are passed to the item_completed() method as a list of two-element tuples of the form:
(success, image_info_or_failure)
The first element indicates whether the download succeeded. On success, the second element is a dictionary with the following keys:
- url: the URL the image was downloaded from
- path: the storage path of the image, relative to IMAGES_STORE
- checksum: an MD5 hash of the image contents
For example, the results received by item_completed() might look like this:
[(True,
  {'checksum': '2b00042f7481c7b056c4b0d28f33cf',
   'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
   'url': 'http://www.example.com/images/product1.jpg'}),
 (True,
  {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
   'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
   'url': 'http://www.example.com/images/product2.jpg'}),
 (False,
  Failure(...))]
item_completed(results, item, info):
item_completed() is called after all image requests for an item have finished, whether they succeeded or failed. By default it simply returns the item. A typical override collects the image paths and drops items that contain no images:
from scrapy.exceptions import DropItem
def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
(8) A complete custom ImagesPipeline example:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
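Finally, to use the custom pipeline, point ITEM_PIPELINES at it instead of the stock class. The module path myproject.pipelines below is hypothetical; use whatever module actually holds your MyImagesPipeline:

```python
# settings.py -- the module path is illustrative
ITEM_PIPELINES = ['myproject.pipelines.MyImagesPipeline']
IMAGES_STORE = '/path/to/valid/dir'
```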