Scrapy Crawling Beautiful Pictures, Continued (original)


  Previously we explained how Scrapy works and how to use it to crawl beautiful pictures. Today we continue the topic of crawling beautiful pictures with Scrapy, but take a different approach and implementation, making deeper use of Scrapy's features.

  While working through the official Scrapy documentation, I found that Scrapy already implements image and file downloading itself, so we do not need to write the image download logic ourselves (though the underlying principle is the same).

In the official documentation we can read the following: Scrapy provides reusable item pipelines for downloading the files attached to an item (for example, when you crawl a product and also want to save its images). These pipelines share some common methods and structure (collectively called the Media Pipeline). Generally you will use either the Files Pipeline or the Images Pipeline.

Both of these pipelines implement the following features:

    • Avoid re-downloading media that was downloaded recently

    • Specify where to store the media (a filesystem directory, Amazon S3 buckets)

The Images Pipeline has a few extra features for processing images:

    • Convert all downloaded images to a common format (JPG) and mode (RGB)

    • Thumbnail generation

    • Check the width/height of images to make sure they meet minimum size limits

The pipeline also keeps an internal queue of the images currently scheduled for download and attaches items that arrive with the same image to that queue. This avoids downloading the same image more than once when it is shared by several items.
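To make these features concrete, here is a minimal sketch of how they map onto Scrapy settings (the store path and the values below are assumed examples, not taken from the original article):

# settings.py -- minimal sketch of enabling the stock Images Pipeline
# (assumed example values)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'E:\\images'          # filesystem directory (an S3 bucket also works)
IMAGES_EXPIRES = 90                  # days before a stored image may be re-downloaded
IMAGES_THUMBS = {'small': (50, 50)}  # thumbnail sizes to generate
IMAGES_MIN_WIDTH = 110               # skip images below these minimum dimensions
IMAGES_MIN_HEIGHT = 110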

From the above we can see that Scrapy can not only download images but also generate thumbnails of specified sizes, which is very useful.

Using the Files Pipeline

When using FilesPipeline, the typical workflow is as follows:

  1. In a spider, you scrape an item and put the URLs of the files into its file_urls field.

  2. The item is returned from the spider and enters the item pipeline.

  3. When the item reaches FilesPipeline, the URLs in the file_urls field are scheduled for download by the Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), with higher priority, so they are processed before other pages are crawled. The item stays "locked" in this pipeline stage until the file downloads finish (or fail for some reason).

  4. When the files have finished downloading, another field, files, is filled in with the results. It holds a list of dicts with information about each downloaded file, such as the download path, the original URL (taken from file_urls), and the file checksum. The entries in files keep the same order as the source file_urls field; if a file fails to download, an error is logged and that file does not appear in files.
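As a minimal sketch of the pieces this workflow assumes (the field names follow the Scrapy defaults; the item class and store path are assumed examples):

# items.py -- the two fields FilesPipeline expects
import scrapy

class ProductItem(scrapy.Item):
    file_urls = scrapy.Field()  # you fill this with the URLs to download
    files = scrapy.Field()      # the pipeline fills this with the results

# settings.py -- enable the stock FilesPipeline
# ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
# FILES_STORE = 'E:\\files'    # assumed example path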

Using the Images Pipeline

When using ImagesPipeline, the typical workflow is as follows:

  1. In a spider, you scrape an item and put the URLs of the images into its image_urls field.

  2. The item is returned from the spider and enters the item pipeline.

  3. When the item reaches ImagesPipeline, the URLs in the image_urls field are scheduled for download by the Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), with higher priority, so they are processed before other pages are crawled. The item stays "locked" in this pipeline stage until the image downloads finish (or fail for some reason).

  4. When the images have finished downloading, another field, images, is filled in with the results. It holds a list of dicts with information about each downloaded image, such as the download path, the original URL (taken from image_urls), and the image checksum. The entries in images keep the same order as the source image_urls field; if an image fails to download, an error is logged and that image does not appear in images.
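For illustration, a successfully downloaded item ends up with an images field roughly like the following (the URL, path, and checksum here are made-up examples):

# made-up example of item['images'] after the pipeline has run
[{'url': 'http://jandan.net/some/picture.jpg',  # original URL from image_urls
  'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',  # relative to IMAGES_STORE
  'checksum': '2cd7e5d2a5287f0d1d63b1d4a3e39f2e'}]  # MD5 checksum of the image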


Pillow is used to generate the thumbnails and to convert images to JPEG/RGB format, so you need to install this library in order to use the image pipeline. The Python Imaging Library (PIL) works in most cases, but it is known to cause problems in some setups, so we recommend using Pillow instead of PIL.

This time we use the Images Pipeline to download the images and Pillow to generate the thumbnails. With Scrapy already installed, install the module with pip install Pillow.

Open cmd and enter scrapy startproject jiandan. This generates the project skeleton, which I then copy into PyCharm (developing in an IDE is faster).

Below is the structure of the project:

jiandanspider.py ------ the spider itself

items.py -------------- model definitions for the data we crawl

pipelines.py ---------- where we finally store the data

settings.py ----------- Scrapy configuration

Next I paste the code (you can copy it from my blog):

jiandanspider.py (unchanged from last time):

#coding: utf-8
# note: the Pillow module needs to be installed
import scrapy
from jiandan.items import JiandanItem


class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = []
    start_urls = ["http://jandan.net/ooxx"]

    def parse(self, response):
        item = JiandanItem()
        # extract the image links from the page
        item['image_urls'] = response.xpath('//img//@src').extract()
        yield item

        # follow the "previous comment page" link to paginate
        new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)
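Note the division of labor here: the spider only yields an item carrying the image URLs and then follows the previous-comment-page link to keep paginating; the actual downloading is left entirely to the pipeline.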


items.py (one new field added; see the Images Pipeline description above):

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class JiandanItem(scrapy.Item):
    # define the fields for your item here
    image_urls = scrapy.Field()  # the image links the spider extracts
    images = scrapy.Field()      # download results, filled in by the pipeline


pipelines.py (the biggest changes; see the comments):

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class JiandanPipeline(ImagesPipeline):
    # inherit from ImagesPipeline to reuse its download machinery

    def get_media_requests(self, item, info):
        """Overrides ImagesPipeline.get_media_requests(). As the workflow
        above shows, the pipeline takes the URLs from the item and downloads
        them, so we return a Request for each image URL."""
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        """Called when all image requests for a single item have completed
        (either downloaded successfully or failed for some reason)."""
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
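These two methods are the standard override points of ImagesPipeline: get_media_requests() turns the item's URLs into download requests, and item_completed() inspects the results, raising DropItem so that items for which no image was downloaded are discarded instead of passed on.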


settings.py (mainly the thumbnail settings; the value of the 'big' thumbnail size is cut off in the source, so the one below is an assumed example):

# -*- coding: utf-8 -*-
# Scrapy settings for the jiandan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jiandan'

SPIDER_MODULES = ['jiandan.spiders']
NEWSPIDER_MODULE = 'jiandan.spiders'

ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,
}
# ITEM_PIPELINES = {'jiandan.pipelines.ImagesPipeline': 1}

IMAGES_STORE = 'E:\\jiandan2'  # where the downloaded images are saved
DOWNLOAD_DELAY = 0.25

IMAGES_THUMBS = {  # thumbnail sizes; setting this makes the pipeline generate thumbnails
    'small': (200, 200),
    'big': (600, 600),  # assumed example; the original value is truncated in the source
}


Finally we run the program. Open cmd, switch to the project directory, and enter scrapy crawl jiandan to start the spider...

After about 25 minutes or so, the spider finishes its work...

Let's go and see the beautiful pictures.

Let's open the thumbs folder and look at the thumbnails, which come in the different sizes we configured.
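For reference, with the settings above the pipeline lays its output out under IMAGES_STORE roughly like this (the file names are SHA1 hashes of the image URLs; the one shown is made up):

E:\jiandan2\
    full\                # the original downloads, converted to JPEG
        0a79c461...jpg
    thumbs\
        small\           # 200x200 thumbnails
            0a79c461...jpg
        big\             # the assumed 'big' size from settings.py
            0a79c461...jpg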

That's all for today's share. If you found it useful, remember to leave a tip!

You are welcome to support me by following my public WeChat account:

This article is an original work; everyone is welcome to reprint and share it. Please respect the original and credit the source: Seven Night story, http://www.cnblogs.com/qiyeboy/

