Scraping Beautiful Pictures with Scrapy (original)


It has been half a month since the last update, and I have been genuinely busy lately: first the Huawei competition, then a lab project, and then some new material to learn, so the articles fell behind. To make it up to you, here is a little treat ...

Today's topic is a crawler framework. I previously used Python to crawl web videos with a hand-rolled crawler built on the basic request mechanics, which never felt very polished, so recently I have been playing with Scrapy, Python's powerful crawler framework.

Scrapy is a crawler framework written in Python: simple, lightweight, and very handy. Scrapy uses the Twisted asynchronous networking library to handle network communication. Its structure is clear, and it exposes a variety of middleware interfaces, so you can flexibly adapt it to all kinds of requirements. The overall architecture looks like this:

The green lines are the data flow. Crawling starts from the initial URLs: the Scheduler hands each URL to the Downloader to fetch, and the downloaded response is passed to the Spider for parsing. The Spider produces two kinds of results. One is links to crawl further, such as the "next page" link mentioned earlier; these are sent back to the Scheduler. The other is data that needs to be saved, which is delivered to the Item Pipeline, the place for post-processing the data (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed along the data-flow channels to do whatever processing is necessary.
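That green-line loop can be sketched in a few lines of plain Python. This is only a toy illustration with made-up stand-ins for the Scheduler, Downloader, Spider, and Item Pipeline (the fake pages and URLs are invented); the real Scrapy runs this loop asynchronously on top of Twisted:

```python
from collections import deque

# A fake "web": url -> (image links on the page, next page url or None)
PAGES = {
    "/page/1": (["img1.jpg", "img2.jpg"], "/page/2"),
    "/page/2": (["img3.jpg"], None),
}

def downloader(url):
    return PAGES[url]  # stand-in for an HTTP fetch

def spider_parse(response):
    images, next_url = response
    yield {"image_urls": images}      # an item: flows to the Item Pipeline
    if next_url:
        yield ("request", next_url)   # a request: flows back to the Scheduler

def crawl(start_url):
    scheduler, stored = deque([start_url]), []
    while scheduler:
        url = scheduler.popleft()                  # Scheduler hands a URL out
        response = downloader(url)                 # Downloader fetches it
        for result in spider_parse(response):      # Spider parses the response
            if isinstance(result, dict):
                stored.append(result)              # "pipeline": just store it
            else:
                scheduler.append(result[1])        # new request re-enters the loop
    return stored

print(crawl("/page/1"))
# [{'image_urls': ['img1.jpg', 'img2.jpg']}, {'image_urls': ['img3.jpg']}]
```

The point is only the shape of the flow: items go one way, new requests loop back around.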

With the Scrapy workflow briefly introduced, let's go straight to the topic: using Scrapy to crawl beautiful pictures.

Note that this is not a basic Scrapy tutorial; the basics will be explained later, during the development of the Qiye music platform. So let's get started.

We'll take Jiandan, the "fried egg" site (http://jandan.net), as an example:

On the Jiandan homepage there is a column of "sister" pictures, and that is today's target.

The pictures are arranged page by page, so to crawl all of them we need to simulate turning the pages.

Open Firebug in Firefox and inspect the elements.

[screenshot: the img elements whose src attributes are the image links]

These are the image links we need; we just have to extract them and download the files.

Now let's see what the link looks like after turning the page:

[screenshot: the pagination anchor, circled in red, whose href is the next page's URL]

We just have to parse out the tag circled in red, and we have the link to the next page. It's that simple. OK, time to write some code ...
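As a rough sketch of that extraction, here is the same idea using only the standard library's html.parser as a stand-in for Scrapy's XPath selectors (the HTML fragment and the page-42 URL are made up for illustration):

```python
from html.parser import HTMLParser

class NextPageFinder(HTMLParser):
    """Collects the href of <a class="previous-comment-page"> tags."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "previous-comment-page":
            self.next_url = attrs.get("href")

# Made-up fragment resembling the page's pagination markup
html = '<a class="previous-comment-page" href="http://jandan.net/ooxx/page-42">older</a>'
finder = NextPageFinder()
finder.feed(html)
print(finder.next_url)  # http://jandan.net/ooxx/page-42
```

In the spider below, the equivalent one-liner is the XPath query `//a[@class="previous-comment-page"]//@href`.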

Open cmd and enter scrapy startproject jiandan. This generates a project skeleton, which I then open in PyCharm (developing in an IDE is faster).

This is the structure of the project:

jiandanSpider.py ------ the spider itself

items.py ----------------- defines the data model for what we crawl

pipelines.py ------------- where we finally store the data

settings.py -------------- the Scrapy configuration

Next I'll post the code:

jiandanSpider.py:

#coding:utf-8
import scrapy
from jiandan.items import JiandanItem


class jiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = []
    start_urls = ["http://jandan.net/ooxx"]

    def parse(self, response):
        item = JiandanItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image links
        yield item
        # the "previous-comment-page" anchor points at the next page to crawl
        new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)


items.py:

# -*- coding: utf-8 -*-
import scrapy


class JiandanItem(scrapy.Item):
    # define the fields for your item here
    image_urls = scrapy.Field()  # the image links


pipelines.py:

# -*- coding: utf-8 -*-
import os
import urllib

from jiandan import settings


class JiandanPipeline(object):

    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)  # storage path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            file_name = image_url.split('/')[-1]  # image file name
            file_path = '%s/%s' % (dir_path, file_name)
            if os.path.exists(file_path):  # skip images we already have
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image (Python 2)
                file_writer.write(conn.read())
        return item
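Note that pipelines.py above is Python 2 code (urllib.urlopen no longer exists in Python 3, where it became urllib.request.urlopen). As a hedged sketch, the same save-and-skip logic could look like this under Python 3; the function name and the injectable fetch parameter are my own additions for illustration and testability, not part of the original project:

```python
import os
import urllib.request

def save_image(image_url, dir_path, fetch=urllib.request.urlopen):
    """Download image_url into dir_path, skipping files that already exist."""
    os.makedirs(dir_path, exist_ok=True)
    file_name = image_url.split('/')[-1]           # same name-from-URL rule as the pipeline
    file_path = os.path.join(dir_path, file_name)
    if os.path.exists(file_path):                  # same skip-if-present check
        return file_path
    with open(file_path, 'wb') as f:
        f.write(fetch(image_url).read())           # fetch defaults to a real HTTP request
    return file_path
```

Passing a fake fetch callable makes the logic testable without hitting the network.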

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for jiandan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jiandan'

SPIDER_MODULES = ['jiandan.spiders']
NEWSPIDER_MODULE = 'jiandan.spiders'

ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,
}

IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25
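Two of these settings are worth a word. The number attached to each entry in ITEM_PIPELINES is a priority in the 0-1000 range, and lower numbers run earlier when several pipelines are enabled; DOWNLOAD_DELAY throttles requests so we don't hammer the server. A minimal annotated fragment:

```python
# Fragment of settings.py, annotated
ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,  # priority 0-1000; lower runs earlier
}
DOWNLOAD_DELAY = 0.25  # seconds to wait between requests, to be polite to the site
```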

  

Finally, we run the program. Open cmd and switch to the project directory,

then enter scrapy crawl jiandan to start the crawler ...

After about 20 minutes, the crawler's work was done ...

Let's go look at the beautiful pictures; there are 1.21 GB of them ...

That's it for today's share. If you found it useful, remember to leave a tip.

You are welcome to follow and support me on my public account:

This article is an original work; everyone is welcome to reprint and share it. Please respect the original and credit the source: Qiye's story, http://www.cnblogs.com/qiyeboy/

