It has been half a month since my last update; I have been rather busy lately. First a competition, then a lab project, then some new things to learn, so the articles fell behind. To make it up to you, here is a little treat ...
What we're talking about today is a crawler framework. I previously used Python to crawl web videos with a crawler I pieced together myself around the basic crawling mechanism, and it never felt very polished, so recently I have been playing with Scrapy, Python's powerful crawler framework.
Scrapy is a crawler framework written in Python: simple, lightweight, and very handy. Scrapy uses the Twisted asynchronous networking library to handle network communication, has a clear structure, and includes a variety of middleware interfaces, so you can flexibly meet all kinds of requirements. The overall architecture looks like this:
The green lines are the data flow. Starting from the initial URL, the Scheduler hands it to the Downloader to fetch, and the downloaded page is given to the Spider for analysis. The Spider's analysis yields two kinds of results: one is links to crawl further, such as the "next page" links analyzed earlier, which are sent back to the Scheduler; the other is data to be saved, which is delivered to the Item Pipeline, the place where data is post-processed (analyzed in detail, filtered, stored, and so on). In addition, various middleware can be installed along the data-flow channel to do whatever processing is needed.
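To make the green-line data flow concrete, here is a toy simulation in plain Python (no Scrapy; the page names and contents are made up). The "spider" returns both items and new requests; items flow to a "pipeline" list, new URLs flow back to a "scheduler" queue:

```python
from collections import deque

def parse(url, page_store):
    """Toy 'spider': for a fake page, return (items, further URLs to crawl)."""
    html = page_store[url]
    items = [w for w in html.split() if w.endswith(".jpg")]
    links = [w.split(":", 1)[1] for w in html.split() if w.startswith("page:")]
    return items, links

# Fake "website": page p1 links onward to page p2
pages = {
    "p1": "a.jpg b.jpg page:p2",
    "p2": "c.jpg",
}

scheduler = deque(["p1"])   # plays the role of start_urls + Scheduler
pipeline = []               # stands in for the Item Pipeline's storage

while scheduler:
    url = scheduler.popleft()             # Scheduler hands a URL to the Downloader
    items, new_urls = parse(url, pages)   # downloaded page goes to the Spider
    pipeline.extend(items)                # saved data flows to the Item Pipeline
    scheduler.extend(new_urls)            # new links flow back to the Scheduler

print(pipeline)  # all images from all pages, in crawl order
```

This is only a mental model; the real Scrapy engine does all of this asynchronously on top of Twisted.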
Having briefly introduced Scrapy's workflow, let's get straight to the topic: using Scrapy to crawl beautiful pictures.
Note that today is not a basic Scrapy tutorial; the basics will be explained later, when we develop the Qiye music platform. So let's get started.
Many of you must already know Fried Egg Net (http://jandan.net) ... I can almost see the wicked smiles ...
Back when I first wanted to crawl pictures, a senior recommended this site to me (really, he just wanted the pictures ...), and my young mind was 'influenced' by it.
On the Fried Egg Net homepage there is a column of girls' pictures, and that is today's target.
The pictures in this category are arranged by page, so to crawl all of them we need to simulate turning the pages.
Open Firebug in Firefox and inspect the elements.
These are the image links we need; with a link in hand, we can simply download the image.
So what does the link look like after turning the page?
We only need to parse out the tag circled in red to know the link to the next page. It is that simple. OK, time to write the code ...
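As a sketch of the two extractions we need (the image src attributes and the "previous-comment-page" link), here is a standalone example using Python's built-in xml.etree.ElementTree on a made-up, simplified page fragment; on the real site, Scrapy's response.xpath does the same job with the same XPath ideas:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for part of the real page
html = """<div>
  <p><img src="http://img.example.com/pic1.jpg"/></p>
  <p><img src="http://img.example.com/pic2.jpg"/></p>
  <a class="previous-comment-page" href="http://jandan.net/ooxx/page-100">older</a>
</div>"""

root = ET.fromstring(html)

# All image links, analogous to the XPath //img//@src
image_urls = [img.get("src") for img in root.iter("img")]

# The next-page link, analogous to //a[@class="previous-comment-page"]//@href
next_link = root.find(".//a[@class='previous-comment-page']")
next_url = next_link.get("href") if next_link is not None else None
```

ElementTree only supports a limited XPath subset and requires well-formed markup, which real HTML rarely is; that is exactly why Scrapy's selectors (built on lxml) are used in the actual spider below.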
Open cmd and enter scrapy startproject jiandan; this generates the project. Then I copy the whole project into PyCharm (developing in an IDE is faster).
This is the structure of the project:
jiandanspider.py ------ the spider itself
items.py ---------------- defines the model of the data to crawl
pipelines.py ------------ stores the data we end up with
settings.py ------------- the Scrapy configuration
Next I'll post the code:
jiandanspider.py:

```python
# coding: utf-8
import scrapy

from jiandan.items import JiandanItem


class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = []
    start_urls = ["http://jandan.net/ooxx"]

    def parse(self, response):
        item = JiandanItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image links
        yield item

        # turn the page: follow the "previous-comment-page" link
        new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)
```
items.py:

```python
# -*- coding: utf-8 -*-
import scrapy


class JiandanItem(scrapy.Item):
    # define the fields for your item here:
    image_urls = scrapy.Field()  # the image links
```
pipelines.py:

```python
# -*- coding: utf-8 -*-
import os
import urllib

from jiandan import settings


class JiandanPipeline(object):

    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)  # storage path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            list_name = image_url.split('/')
            file_name = list_name[len(list_name) - 1]  # image file name
            file_path = '%s/%s' % (dir_path, file_name)
            if os.path.exists(file_path):  # skip images we already have
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image
                file_writer.write(conn.read())
        return item
```
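The pipeline above is Python 2 code (urllib.urlopen). Under Python 3 the same call lives in urllib.request; here is a rough sketch of the equivalent download loop, with the function names being illustrative, not part of the original project:

```python
import os
import urllib.request


def file_name_from_url(image_url):
    """Take the last path segment as the file name, like the pipeline does."""
    return image_url.split('/')[-1]


def download_images(image_urls, dir_path):
    """Python 3 version of the pipeline's download loop (illustrative sketch)."""
    os.makedirs(dir_path, exist_ok=True)
    for image_url in image_urls:
        file_path = os.path.join(dir_path, file_name_from_url(image_url))
        if os.path.exists(file_path):  # skip images we already have
            continue
        with urllib.request.urlopen(image_url) as conn, open(file_path, 'wb') as fw:
            fw.write(conn.read())
```

For a real site you would also want a User-Agent header and error handling around urlopen, which the original pipeline omits as well.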
settings.py:

```python
# -*- coding: utf-8 -*-

# Scrapy settings for the jiandan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jiandan'

SPIDER_MODULES = ['jiandan.spiders']
NEWSPIDER_MODULE = 'jiandan.spiders'

ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,
}

IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25
```
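As an aside, Scrapy also ships a built-in image pipeline that can replace the hand-written one above; since our item field is already named image_urls (the field the built-in pipeline reads by default), enabling it is just a settings change. This fragment assumes a recent Scrapy version (module path scrapy.pipelines.images) and requires the Pillow library; the storage directory shown is illustrative:

```python
# settings.py fragment: use Scrapy's built-in ImagesPipeline instead of JiandanPipeline.
# Requires Pillow; module path valid for Scrapy 1.0+.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'E:/jiandan'  # illustrative storage directory
```

The built-in pipeline adds deduplication, expiration, and optional thumbnail generation for free, at the cost of less control over file names.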
Finally we run the program: open cmd, switch to the project directory,
and enter scrapy crawl jiandan to start the crawler ...
After about 20 minutes, the crawler finished its work ...
Let's go and look at the beautiful pictures: 1.21 GB of them ...
That is it for today's share. If you think it was worthwhile, remember to leave a tip (especially for the pictures).
You are welcome to follow and support my public account:
This article is an original work; everyone is welcome to repost and share it. Please respect the original and note the source: Seven Night Story, http://www.cnblogs.com/qiyeboy/
Scrapy Crawling Beautiful Pictures (Original)