Crawling Beautiful Pictures with Scrapy (original)




I haven't updated in half a month; things have genuinely been busy lately. First there was a competition, then a project in the lab, and then some new things to learn, so the articles fell behind. To make it up to you, here is a wave of goodies ...









What we're talking about today is a crawler framework. When I previously used Python to crawl web videos, I built everything by hand on top of the basic crawling mechanics, which never felt very polished, so I have recently been playing with Scrapy, Python's powerful crawler framework.





Scrapy is a crawler framework written in Python: simple, lightweight, and very handy. Scrapy uses the Twisted asynchronous networking library to handle network communication, has a clear structure, and includes a variety of middleware interfaces that let you flexibly satisfy all kinds of requirements. The overall architecture looks like this:






The green lines are the data flow. Starting from the initial URLs, the Scheduler hands them to the Downloader; once downloaded, the response is given to the Spider for parsing. The Spider's analysis produces two kinds of results: links to crawl further, such as the "next page" link mentioned earlier, which are sent back to the Scheduler; and the data that needs to be saved, which is delivered to the Item Pipeline, the place where data is post-processed (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed along the data-flow channels to do whatever processing is necessary.
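To make that flow concrete, here is a minimal, framework-free sketch of the loop (Python 3; every name below is illustrative, not Scrapy's real API): a queue plays the Scheduler, a stub function plays the Downloader, and the spider's parse results are routed either back to the scheduler or on to the "pipeline" list.

```python
from collections import deque

def downloader(url):
    # Stand-in for the real Downloader: pretend every page carries one
    # image and a "next page" link until page 3.
    page = int(url.rsplit('-', 1)[-1])
    next_url = 'http://example.com/page-%d' % (page + 1) if page < 3 else None
    return {'image': 'http://example.com/img-%d.jpg' % page, 'next': next_url}

def spider_parse(response):
    # A spider yields two kinds of results: items to save and URLs to crawl next.
    yield {'image_url': response['image']}       # item  -> pipeline
    if response['next']:
        yield {'request': response['next']}      # request -> scheduler

def crawl(start_url):
    scheduler, items = deque([start_url]), []
    while scheduler:                             # the Scheduler feeds the Downloader
        response = downloader(scheduler.popleft())
        for result in spider_parse(response):
            if 'request' in result:
                scheduler.append(result['request'])   # back to the Scheduler
            else:
                items.append(result)                  # on to the Item Pipeline
    return items

print(crawl('http://example.com/page-1'))   # three image items, one per page
```

The point of the sketch is only the routing: whatever a spider yields is inspected, and requests and items take different paths through the engine.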






Having briefly introduced Scrapy's workflow, let's get straight to the topic: using Scrapy to crawl beautiful pictures.





Note that today is not a Scrapy basics tutorial; the basics will be explained later, during the development of the Seven Night music platform. So let's get started.





Many of you must already know Jandan, the "Fried Egg" site (http://jandan.net) ... I can almost see the wicked smiles ...





Back when I first wanted to crawl pictures, a senior classmate recommended this site to me (in truth, he just wanted the pictures ...), and my young mind was 'influenced' by it.












We arrive at the Jandan homepage, which has a "sister pics" column; that is today's target.












The pictures in this section are laid out page by page, so to crawl all of them we need to simulate turning the pages.












Open Firebug in Firefox and inspect the elements.









These are the image links we need; we just have to extract each link and download the picture.
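The extraction itself doesn't require Scrapy. As a quick illustration (the HTML snippet below is made up, not Jandan's real markup), Python 3's standard html.parser can pull the src attribute out of every img tag:

```python
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.image_urls.append(value)

html = '<p><img src="//ww1.example.com/a.jpg" /><img src="//ww2.example.com/b.jpg" /></p>'
parser = ImgSrcParser()
parser.feed(html)
print(parser.image_urls)
```

The spider shown later does the same job in one line with an XPath expression; this is just the hand-rolled equivalent.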





Now let's see what the link becomes after turning the page.












We only need to parse out the tag circled in red to know the link to the next page; it is that simple. OK, time to write the code ...
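For illustration, here is a Python 3 sketch of that parsing step with a regular expression (the markup is hypothetical, modeled on the previous-comment-page class name that the spider looks for later):

```python
import re

# Hypothetical markup modeled on the page-turn link the spider targets.
html = ('<div class="comments">'
        '<a href="http://jandan.net/ooxx/page-2333#comments" '
        'class="previous-comment-page">Older</a></div>')

# First isolate the <a> tag carrying the class, then pull out its href.
tag = re.search(r'<a[^>]*class="previous-comment-page"[^>]*>', html).group(0)
next_url = re.search(r'href="([^"]+)"', tag).group(1)
print(next_url)
```

Feeding that URL back into the crawl loop is exactly what "simulating a page turn" means here.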





Open cmd and enter scrapy startproject jiandan, which generates a project skeleton; I then copy the whole project into PyCharm (developing in an IDE is faster).















This is the structure of the project:

jiandanspider.py ------ the spider itself

items.py -------------- defines the model of the data to crawl

pipelines.py ---------- stores the data we end up with

settings.py ----------- Scrapy configuration





Next I'll post the code:


jiandanspider.py:

# coding: utf-8
import scrapy

from jiandan.items import JiandanItem


class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = []
    start_urls = ["http://jandan.net/ooxx"]

    def parse(self, response):
        item = JiandanItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image links
        yield item
        # the "older posts" link, used to turn the page
        new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)







items.py:

# -*- coding: utf-8 -*-
import scrapy


class JiandanItem(scrapy.Item):
    # define the fields for your item here like:
    image_urls = scrapy.Field()  # the image links





pipelines.py:

# -*- coding: utf-8 -*-
import os
import urllib

from jiandan import settings


class JiandanPipeline(object):
    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)  # storage path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            list_name = image_url.split('/')
            file_name = list_name[len(list_name) - 1]  # image file name
            file_path = '%s/%s' % (dir_path, file_name)
            if os.path.exists(file_path):  # skip images we already have
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image
                file_writer.write(conn.read())
        return item
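One caveat: the pipeline above is Python 2 code (urllib.urlopen no longer exists in Python 3). If you are on Python 3, the same download step would look roughly like this sketch, where the function names and example paths are mine, not part of the original project:

```python
import os
import urllib.request

def file_name_from_url(image_url):
    # Same trick as the pipeline: the file name is the last path segment.
    return image_url.split('/')[-1]

def download_image(image_url, dir_path):
    """Fetch image_url into dir_path, skipping files that already exist."""
    os.makedirs(dir_path, exist_ok=True)
    file_path = os.path.join(dir_path, file_name_from_url(image_url))
    if os.path.exists(file_path):
        return file_path
    with urllib.request.urlopen(image_url) as conn, open(file_path, 'wb') as fw:
        fw.write(conn.read())
    return file_path

# e.g. download_image('http://ww1.example.com/large/abc.jpg', 'E:/jiandan')
```

On Python 3 you could also skip hand-rolled downloading entirely and use Scrapy's built-in ImagesPipeline, which consumes the same image_urls field.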




 
settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for the jiandan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jiandan'

SPIDER_MODULES = ['jiandan.spiders']
NEWSPIDER_MODULE = 'jiandan.spiders'

ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,
}

IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25





Finally, we run the program: open cmd, switch to the project directory, and enter scrapy crawl jiandan to start the crawler ...












After about 20 minutes, the crawler's work was done ...












Let's go look at the beautiful pictures; there are 1.21 GB of them ...


















That's all for today's share. If you thought it was good, remember to give it a like (especially for the pictures).









You are welcome to support me by following my public account:






This article is an original work; everyone is welcome to repost and share it. Please respect the original and credit reposts to: Seven Night's Story, http://www.cnblogs.com/qiyeboy/







