Scraping Beautiful Pictures with Scrapy (original)


It has been half a month since the last update, and I have been genuinely busy lately: first the Huawei competition, then a lab project, and then some new material to learn, so the articles fell behind. To make it up to you, here is a little treat ...

Today's topic is a crawler framework. I previously used Python to crawl web videos with a hand-rolled crawler built on the basic request mechanics, which never felt very polished, so recently I have been playing with Scrapy, Python's powerful crawler framework.

Scrapy is a crawler framework written in Python: simple, lightweight, and very handy. Scrapy uses the Twisted asynchronous networking library to handle network communication. Its structure is clear, and it exposes a variety of middleware interfaces, so you can flexibly adapt it to all kinds of requirements. The overall architecture looks like this:

The green lines are the data flow. Crawling starts from the initial URLs: the Scheduler hands each URL to the Downloader to fetch, and the downloaded response is passed to the Spider for parsing. The Spider produces two kinds of results. One is links to crawl further, such as the "next page" link mentioned earlier; these are sent back to the Scheduler. The other is data that needs to be saved, which is delivered to the Item Pipeline, the place for post-processing the data (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed along the data-flow channels to do whatever processing is necessary.
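That green-line loop can be sketched in a few lines of plain Python. This is only a toy illustration with made-up stand-ins for the Scheduler, Downloader, Spider, and Item Pipeline (the fake pages and URLs are invented); the real Scrapy runs this loop asynchronously on top of Twisted:

```python
from collections import deque

# A fake "web": url -> (image links on the page, next page url or None)
PAGES = {
    "/page/1": (["img1.jpg", "img2.jpg"], "/page/2"),
    "/page/2": (["img3.jpg"], None),
}

def downloader(url):
    return PAGES[url]  # stand-in for an HTTP fetch

def spider_parse(response):
    images, next_url = response
    yield {"image_urls": images}      # an item: flows to the Item Pipeline
    if next_url:
        yield ("request", next_url)   # a request: flows back to the Scheduler

def crawl(start_url):
    scheduler, stored = deque([start_url]), []
    while scheduler:
        url = scheduler.popleft()                  # Scheduler hands a URL out
        response = downloader(url)                 # Downloader fetches it
        for result in spider_parse(response):      # Spider parses the response
            if isinstance(result, dict):
                stored.append(result)              # "pipeline": just store it
            else:
                scheduler.append(result[1])        # new request re-enters the loop
    return stored

print(crawl("/page/1"))
# [{'image_urls': ['img1.jpg', 'img2.jpg']}, {'image_urls': ['img3.jpg']}]
```

The point is only the shape of the flow: items go one way, new requests loop back around.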

With the Scrapy workflow briefly introduced, let's go straight to the topic: using Scrapy to crawl beautiful pictures.

Note that this is not a basic Scrapy tutorial; the basics will be explained later, during the development of the Qiye music platform. So let's get started.

We'll take Jiandan, the "fried egg" site (http://jandan.net), as an example:

On the Jiandan homepage there is a column of "sister" pictures, and that is today's target.

The pictures are arranged page by page, so to crawl all of them we need to simulate turning the pages.

Open Firebug in Firefox and inspect the elements.

[screenshot: the img elements whose src attributes are the image links]

These are the image links we need; we just have to extract them and download the files.

Now let's see what the link looks like after turning the page:

[screenshot: the pagination anchor, circled in red, whose href is the next page's URL]

We just have to parse out the tag circled in red, and we have the link to the next page. It's that simple. OK, time to write some code ...
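As a rough sketch of that extraction, here is the same idea using only the standard library's html.parser as a stand-in for Scrapy's XPath selectors (the HTML fragment and the page-42 URL are made up for illustration):

```python
from html.parser import HTMLParser

class NextPageFinder(HTMLParser):
    """Collects the href of <a class="previous-comment-page"> tags."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "previous-comment-page":
            self.next_url = attrs.get("href")

# Made-up fragment resembling the page's pagination markup
html = '<a class="previous-comment-page" href="http://jandan.net/ooxx/page-42">older</a>'
finder = NextPageFinder()
finder.feed(html)
print(finder.next_url)  # http://jandan.net/ooxx/page-42
```

In the spider below, the equivalent one-liner is the XPath query `//a[@class="previous-comment-page"]//@href`.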

Open cmd and enter scrapy startproject jiandan. This generates a project skeleton, which I then open in PyCharm (developing in an IDE is faster).

This is the structure of the project:

jiandanSpider.py ------ the spider itself

items.py ----------------- defines the data model for what we crawl

pipelines.py ------------- where we finally store the data

settings.py -------------- the Scrapy configuration

Next I'll post the code:

jiandanSpider.py:

#coding:utf-8
import scrapy
from jiandan.items import JiandanItem


class jiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = []
    start_urls = ["http://jandan.net/ooxx"]

    def parse(self, response):
        item = JiandanItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image links
        yield item
        # the "previous-comment-page" anchor points at the next page to crawl
        new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)


items.py:

# -*- coding: utf-8 -*-
import scrapy


class JiandanItem(scrapy.Item):
    # define the fields for your item here
    image_urls = scrapy.Field()  # the image links


pipelines.py:

# -*- coding: utf-8 -*-
import os
import urllib

from jiandan import settings


class JiandanPipeline(object):

    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)  # storage path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            file_name = image_url.split('/')[-1]  # image file name
            file_path = '%s/%s' % (dir_path, file_name)
            if os.path.exists(file_path):  # skip images we already have
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image (Python 2)
                file_writer.write(conn.read())
        return item
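Note that pipelines.py above is Python 2 code (urllib.urlopen no longer exists in Python 3, where it became urllib.request.urlopen). As a hedged sketch, the same save-and-skip logic could look like this under Python 3; the function name and the injectable fetch parameter are my own additions for illustration and testability, not part of the original project:

```python
import os
import urllib.request

def save_image(image_url, dir_path, fetch=urllib.request.urlopen):
    """Download image_url into dir_path, skipping files that already exist."""
    os.makedirs(dir_path, exist_ok=True)
    file_name = image_url.split('/')[-1]           # same name-from-URL rule as the pipeline
    file_path = os.path.join(dir_path, file_name)
    if os.path.exists(file_path):                  # same skip-if-present check
        return file_path
    with open(file_path, 'wb') as f:
        f.write(fetch(image_url).read())           # fetch defaults to a real HTTP request
    return file_path
```

Passing a fake fetch callable makes the logic testable without hitting the network.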

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for jiandan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jiandan'

SPIDER_MODULES = ['jiandan.spiders']
NEWSPIDER_MODULE = 'jiandan.spiders'

ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,
}

IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25
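Two of these settings are worth a word. The number attached to each entry in ITEM_PIPELINES is a priority in the 0-1000 range, and lower numbers run earlier when several pipelines are enabled; DOWNLOAD_DELAY throttles requests so we don't hammer the server. A minimal annotated fragment:

```python
# Fragment of settings.py, annotated
ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,  # priority 0-1000; lower runs earlier
}
DOWNLOAD_DELAY = 0.25  # seconds to wait between requests, to be polite to the site
```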

  

Finally, we run the program. Open cmd and switch to the project directory,

then enter scrapy crawl jiandan to start the crawler ...

After about 20 minutes, the crawler's work was done ...

Let's go look at the beautiful pictures; there are 1.21 GB of them ...

That's it for today's share. If you found it useful, remember to leave a tip.

You are welcome to follow and support me on my public account:

This article is an original work; everyone is welcome to reprint and share it. Please respect the original and credit the source: Qiye's story, http://www.cnblogs.com/qiyeboy/

