Previously we explained how Scrapy works and how to use it to crawl beautiful pictures. Today we continue the topic of crawling beautiful pictures with Scrapy, but take a different approach and implementation, making deeper use of Scrapy's features.
While working through the official Scrapy documentation, I found that Scrapy already implements image and file downloading itself, so we do not need to write the download code ourselves (although the underlying principle is the same).
The official documentation says the following: Scrapy provides reusable item pipelines for downloading the files attached to items (for example, when you crawl a product and also want to save its pictures). These pipelines share some common methods and structure (what we call the Media Pipeline). In general you will use either the Files Pipeline or the Images Pipeline.
Both of these pipelines implement the following features:
Avoiding re-downloading media that was downloaded recently
Specifying where to store the media (filesystem directory, Amazon S3 bucket)
The Images Pipeline has a few extra functions for processing images. Both pipelines also keep an internal queue of the images currently scheduled for download, and attach items that arrive with the same image to that queue. This avoids downloading the same picture more than once when it is shared by several items.
From the above we can see that Scrapy can not only download pictures but also generate thumbnails of a specified size, which is very useful.
Using the Files Pipeline

When using FilesPipeline, the typical workflow is as follows:

1. In a spider, you crawl an item and put the URLs of the files into its file_urls field.
2. The item is returned from the spider and enters the item pipeline.
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download by the Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), with higher priority, so they are processed before other pages are crawled. The item remains "locked" at this particular pipeline stage until the file downloads complete (or fail for some reason).
4. When the files have finished downloading, another field, files, is populated with the results. This field holds a list of dicts with information about each downloaded file, such as the download path, the source URL (taken from the file_urls field), and the file checksum. The files in the files list keep the same order as the source file_urls field. If a file fails to download, an error is logged and that file will not appear in the files field.
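To make this workflow concrete, here is a minimal sketch (not part of this post's project; the item name is hypothetical) of an item and the settings needed to enable the built-in FilesPipeline. file_urls and files are the default field names the pipeline expects:

    # items.py -- a minimal item for the built-in FilesPipeline
    import scrapy

    class ProductItem(scrapy.Item):   # hypothetical item name
        file_urls = scrapy.Field()    # input: URLs of the files to download
        files = scrapy.Field()        # output: filled in by the pipeline

    # settings.py -- enable the pipeline and choose where to store files
    # ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
    # FILES_STORE = '/path/to/store'  # filesystem directory, or e.g. 's3://bucket/files/'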
Using the Images Pipeline

When using ImagesPipeline, the typical workflow is as follows:

1. In a spider, you crawl an item and put the URLs of the pictures into its image_urls field.
2. The item is returned from the spider and enters the item pipeline.
3. When the item reaches the ImagesPipeline, the URLs in the image_urls field are scheduled for download by the Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), with higher priority, so they are processed before other pages are crawled. The item remains "locked" at this particular pipeline stage until the image downloads complete (or fail for some reason).
4. When the images have finished downloading, another field, images, is populated with the results. This field holds a list of dicts with information about each downloaded image, such as the download path, the source URL (taken from the image_urls field), and the image checksum. The images in the images list keep the same order as the source image_urls field. If an image fails to download, an error is logged and that image will not appear in the images field.
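As a rough illustration (all values below are made up), the dicts stored in the images field, and the results argument later passed to item_completed(), look like this:

    # the 'images' field after a successful download (illustrative values)
    [{'url': 'http://jandan.net/ooxx/example.jpg',  # source fetch address
      'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',  # relative to IMAGES_STORE
      'checksum': '2b00042f7481c7b056c4b410d28f33cf'}]  # checksum of the file

    # item_completed(results, item, info) receives (success, value) tuples,
    # where value is a dict like the one above on success, or a Failure on error:
    # [(True, {'url': ..., 'path': ..., 'checksum': ...}), (False, Failure(...))]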
Pillow is used to generate the thumbnails and to convert images to JPEG/RGB format, so you need to install this library in order to use the Images Pipeline. The Python Imaging Library (PIL) also works in most cases, but it is known to cause problems in some setups, so we recommend using Pillow instead of PIL.
This time we will use the Images Pipeline to download images and Pillow to generate thumbnails. With Scrapy already installed, install the module with pip install Pillow.
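For reference, here is a minimal Pillow sketch of the two operations the pipeline performs internally (the file names are hypothetical; the pipeline's own code differs in detail):

    # convert an image to JPEG/RGB and generate a thumbnail with Pillow
    from PIL import Image

    im = Image.open('example.png')   # hypothetical input file
    im = im.convert('RGB')           # JPEG has no alpha channel, so force RGB mode
    im.save('example.jpg', 'JPEG')   # re-encode as JPEG
    im.thumbnail((200, 200))         # shrink in place, preserving aspect ratio
    im.save('example_thumb.jpg', 'JPEG')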
Open cmd and enter scrapy startproject jiandan; this generates the project. I then copy the whole project into PyCharm (developing in an IDE is faster).
The structure of the project is as follows:
jiandanspider.py ------ the spider
items.py --------------- the item models defining the data to crawl
pipelines.py ----------- where the crawled data is finally stored
settings.py ------------ the Scrapy configuration
Next I paste the code (it is also available on my blog):
jiandanspider.py (unchanged from the previous post):

    #coding: utf-8
    # the Pillow module needs to be installed
    import scrapy
    from jiandan.items import JiandanItem


    class JiandanSpider(scrapy.Spider):
        name = 'jiandan'
        allowed_domains = []
        start_urls = ["http://jandan.net/ooxx"]

        def parse(self, response):
            item = JiandanItem()
            item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image links
            yield item
            # pagination: follow the "previous comment page" link
            new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
            if new_url:
                yield scrapy.Request(new_url, callback=self.parse)
items.py (one field added; see the description of the Images Pipeline above):

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    import scrapy


    class JiandanItem(scrapy.Item):
        # define the fields for your item here like:
        image_urls = scrapy.Field()  # the image links
        images = scrapy.Field()      # filled in by the pipeline with the download results
pipelines.py (the biggest changes; see the comments):

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline


    class JiandanPipeline(ImagesPipeline):
        # inherit from ImagesPipeline to get the download machinery for free

        def get_media_requests(self, item, info):
            # Override ImagesPipeline.get_media_requests(): as the workflow
            # above shows, the pipeline takes the file URLs from the item and
            # downloads them, so we return a Request for each image URL.
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # Called when all image requests for a single item have completed
            # (either downloaded successfully or failed for some reason).
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            return item
settings.py (mainly the thumbnail settings):

    # -*- coding: utf-8 -*-
    # Scrapy settings for the jiandan project
    #
    # For simplicity, this file contains only settings considered important
    # or commonly used. You can find more settings consulting the documentation:
    # http://doc.scrapy.org/en/latest/topics/settings.html
    BOT_NAME = 'jiandan'

    SPIDER_MODULES = ['jiandan.spiders']
    NEWSPIDER_MODULE = 'jiandan.spiders'

    ITEM_PIPELINES = {
        'jiandan.pipelines.JiandanPipeline': 1,
    }
    # ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}  # the stock pipeline would also work

    IMAGES_STORE = 'E:\\jiandan2'  # where the downloaded images are saved
    DOWNLOAD_DELAY = 0.25

    IMAGES_THUMBS = {
        # thumbnail sizes; setting this makes the pipeline generate thumbnails
        'small': (200, 200),
        'big': (400, 400),  # the original snippet is cut off here; (400, 400) is a guessed size
    }
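With these settings, the pipeline lays files out under IMAGES_STORE roughly as follows (file names are a SHA1 hash of the image URL; the hash shown is illustrative):

    E:\jiandan2\
        full\                         # the full-size downloaded images
            0a79c4...3910.jpg
        thumbs\
            small\0a79c4...3910.jpg   # 200x200 thumbnails
            big\0a79c4...3910.jpg     # the second size from IMAGES_THUMBS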
Finally we run the program: open cmd, switch to the project directory,
and enter scrapy crawl jiandan to start the spider...
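As an aside, you can also start the spider from a plain Python script instead of the scrapy command line; a minimal sketch, assuming the spider module is jiandan/spiders/jiandanspider.py:

    # run.py -- start the crawl from a script instead of "scrapy crawl"
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from jiandan.spiders.jiandanspider import JiandanSpider  # assumed module path

    process = CrawlerProcess(get_project_settings())  # loads settings.py
    process.crawl(JiandanSpider)
    process.start()  # blocks until the crawl finishes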
After about 25 minutes, the spider finishes its work...
Let's go and have a look at the beautiful pictures.
Let's open the thumbs folder and look at the thumbnails; they come in the different sizes we configured.
That's it for today's share. If you found it useful, remember to leave a reward.
You are welcome to support me by following my WeChat public account:
This article is an original work; everyone is welcome to repost and share it. Please respect the original and credit the source: Seven Night Story, http://www.cnblogs.com/qiyeboy/