Scrapy: Using the ImagesPipeline

Source: Internet
Author: User

In Scrapy, to crawl images you put the URLs of the images to be downloaded into the item's image_urls field. When the item is returned from the spider, the ImagesPipeline schedules requests for these URLs with high priority, and the item is held until the images have been fetched. Once downloading completes, the download path, the original URL, and other information are filled into the images field.

To download images successfully, do the following:

(1) Add the image_urls and images fields in items.py, as follows:

image_urls = Field()
images = Field()

(2) Enable the ImagesPipeline by adding the following to settings.py:

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']

IMAGES_STORE = '/path/to/valid/dir'  # the image download directory

(3) Image Storage

The stored filename is derived from the SHA1 hash of the image's original URL. For example, for the URL http://www.example.com/image.jpg, the SHA1 hash is 3afec3b4765f8f0a07b78f98c07b83f013567a0a, so the image is stored as:

<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
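This naming scheme can be reproduced with Python's standard hashlib. The helper name image_file_path below is my own, for illustration; it is not part of Scrapy's API:

```python
import hashlib

def image_file_path(url):
    # Scrapy names the stored file after the SHA1 hex digest of the image URL.
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

print(image_file_path('http://www.example.com/image.jpg'))
# full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
```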

(4) Avoiding Repeated Downloads

Images that were downloaded recently do not need to be downloaded again. You can set the expiration period, in days, in settings.py:

IMAGES_EXPIRES = 90  # images downloaded within the last 90 days are not re-downloaded
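The expiration check amounts to comparing the stored file's age against the configured number of days. A minimal stdlib-only sketch of that rule (is_expired and its timestamp arguments are illustrative, not Scrapy's internal code):

```python
import time

IMAGES_EXPIRES = 90  # days

def is_expired(last_downloaded, now=None):
    # An image is re-downloaded only if its stored copy is older than
    # IMAGES_EXPIRES days; both arguments are Unix timestamps in seconds.
    if now is None:
        now = time.time()
    age_days = (now - last_downloaded) / 86400.0
    return age_days > IMAGES_EXPIRES
```

For example, is_expired(0, now=91 * 86400) is True (91 days old), while is_expired(0, now=89 * 86400) is False.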

(5) Thumbnail Generation

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

Thumbnails are stored at: <IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
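Thumbnail paths follow the same SHA1-of-URL naming scheme, just under a per-size directory. A sketch (thumb_path is an illustrative helper, not part of Scrapy's API):

```python
import hashlib

def thumb_path(url, size_name):
    # Thumbnails reuse the SHA1-of-URL file name, placed under
    # thumbs/<size_name>/ instead of full/.
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'thumbs/%s/%s.jpg' % (size_name, image_guid)

print(thumb_path('http://www.example.com/image.jpg', 'small'))
# thumbs/small/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
```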

(6) Image Filtering

You can filter out small images with the following settings:

IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
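The filter simply discards any image smaller than either minimum. A stdlib-only sketch of that rule (passes_size_filter is illustrative, not Scrapy's internal code):

```python
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

def passes_size_filter(width, height):
    # An image is kept only if it meets BOTH minimum dimensions;
    # anything narrower or shorter is dropped by the pipeline.
    return width >= IMAGES_MIN_WIDTH and height >= IMAGES_MIN_HEIGHT
```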

(7) Customizing the ImagesPipeline

Override the following methods:

get_media_requests(item, info):

The ImagesPipeline downloads the URLs given in image_urls. get_media_requests() generates a Request for each URL. For example:

def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield Request(image_url)

After the images are downloaded, the results are passed to the item_completed() function as a list of two-element tuples. Each tuple has the form:

(success, image_info_or_failure)

The first element indicates whether the download succeeded. When it did, the second element is a dictionary with the following keys:

- url: the URL the image was downloaded from
- path: the storage path of the image, relative to IMAGES_STORE
- checksum: a hash of the image content

An example of the results list passed to item_completed():

[(True,
  {'checksum': '2b00042f7481c7b056c4b1d28f33cf',
   'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
   'url': 'http://www.example.com/images/product1.jpg'}),
 (True,
  {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
   'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
   'url': 'http://www.example.com/images/product2.jpg'}),
 (False,
  Failure(...))]
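To make the structure concrete, here is a hand-built results list in that shape and the usual way of pulling the successful storage paths out of it (the checksum value is a placeholder, and a plain Exception stands in for the Twisted Failure object):

```python
# A results list in the (success, info_or_failure) shape shown above.
results = [
    (True, {'checksum': 'placeholder',
            'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
            'url': 'http://www.example.com/images/product1.jpg'}),
    (False, Exception('download failed')),  # stands in for a Twisted Failure
]

# Keep only the paths of images that downloaded successfully.
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)
# ['full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg']
```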

item_completed(results, item, info):

item_completed() is called after all image requests for a single item have completed, whether they succeeded or failed. By default, item_completed() returns the item unchanged. An example:

from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item

(8) A complete custom ImagesPipeline example:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
