In Scrapy, to crawl images you put the URLs of the images to be downloaded in the item's image_urls field. When the item is returned from the spider, the ImagesPipeline schedules those URLs for download at a higher priority, and the item is locked until the downloads finish. Once the images have been downloaded, the download path, the original URL, and other information are filled into the images field.
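As a plain-Python illustration of that flow (this is a sketch of the item shape, not Scrapy code; the URL, path, and checksum values are illustrative):

```python
# What a hypothetical item looks like before and after the pipeline runs.
# The spider yields only image_urls; the pipeline fills in images.
item_from_spider = {
    'image_urls': ['http://www.example.com/image.jpg'],
}

# After download, the pipeline adds one dict per downloaded image.
item_after_pipeline = {
    'image_urls': ['http://www.example.com/image.jpg'],
    'images': [
        {'url': 'http://www.example.com/image.jpg',
         'path': 'full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg',
         'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'},  # illustrative value
    ],
}
```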
To crawl images successfully, do the following:
(1) Add the image_urls and images fields to items.py, as follows:
image_urls = Field()
images = Field()
(2) Enable the ImagesPipeline by adding the following to settings.py:
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/path/to/valid/dir'  # the image download path
(3) Image storage
Images are stored under file names derived from the SHA1 hash of their original URLs.
For example, for the URL http://www.example.com/image.jpg:
SHA1 hash: 3afec3b4765f8f0a07b78f98c07b83f013567a0a
Stored as: <IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
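The naming rule above can be reproduced with the standard library's hashlib; this is only a sketch of the rule, not Scrapy's internal code, and IMAGES_STORE here just mirrors the setting shown earlier:

```python
import hashlib

IMAGES_STORE = '/path/to/valid/dir'  # same placeholder as in settings.py

def image_path(url):
    # Scrapy names the stored file after the SHA1 hash of the original URL
    image_id = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s/full/%s.jpg' % (IMAGES_STORE, image_id)

print(image_path('http://www.example.com/image.jpg'))
# -> /path/to/valid/dir/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
```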
(4) Avoiding repeated downloads
Recently downloaded images are not downloaded again. You can set the expiration period, in days, in settings.py:
IMAGES_EXPIRES = 90  # images downloaded within the last 90 days are not re-downloaded
(5) Thumbnail generation
The pipeline can also generate thumbnails of the downloaded images:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
Thumbnails are stored in: <IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
(6) Image filtering
You can drop images that are too small by setting minimum dimensions:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
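Conceptually, only images at least as large as both minimums are kept; a small sketch of that check (the function name is ours, not part of Scrapy's API):

```python
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

def passes_size_filter(width, height):
    # Images smaller than either minimum are dropped by the pipeline
    return width >= IMAGES_MIN_WIDTH and height >= IMAGES_MIN_HEIGHT

print(passes_size_filter(270, 270))  # large enough -> True
print(passes_size_filter(50, 270))   # too narrow  -> False
```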
(7) Customizing the ImagesPipeline
Override the following methods:
get_media_requests(item, info):
The ImagesPipeline downloads the URLs listed in image_urls; get_media_requests generates a Request for each URL. For example:
def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield Request(image_url)
After the images are downloaded, the results are passed to the item_completed() method as a list of two-element tuples of the form:
(success, image_info_or_failure)
The first element indicates whether the download succeeded. On success, the second element is a dictionary with the following keys:
- url: the URL the image was downloaded from
- path: the storage path of the image, relative to IMAGES_STORE
- checksum: an MD5 hash of the image contents
For example, the results received by item_completed() might look like this:
[(True,
  {'checksum': '2b00042f7481c7b056c4b0d28f33cf',
   'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
   'url': 'http://www.example.com/images/product1.jpg'}),
 (True,
  {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
   'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
   'url': 'http://www.example.com/images/product2.jpg'}),
 (False,
  Failure(...))]
item_completed(results, item, info):
item_completed() is called after all image requests for an item have finished, whether they succeeded or failed. By default it simply returns the item. A typical override collects the image paths and drops items that contain no images:
from scrapy.exceptions import DropItem
def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
(8) A complete custom ImagesPipeline example:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
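Finally, to use the custom pipeline, point ITEM_PIPELINES at it instead of the stock class. The module path myproject.pipelines below is hypothetical; use whatever module actually holds your MyImagesPipeline:

```python
# settings.py -- the module path is illustrative
ITEM_PIPELINES = ['myproject.pipelines.MyImagesPipeline']
IMAGES_STORE = '/path/to/valid/dir'
```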