It has been half a month since my last update; I have been rather busy lately. First a competition, then a lab project, then some new things to learn, so the articles fell behind. To make it up to you, here is a little treat ...
What we're talking about today is a crawler framework. I previously used Python to crawl web videos with a crawler I pieced together myself around the basic crawling mechanism, and it never felt very polished, so recently I have been playing with Scrapy, Python's powerful crawler framework.
Scrapy is a crawler framework written in Python: simple, lightweight, and very handy. Scrapy uses the Twisted asynchronous networking library to handle network communication, has a clear structure, and includes a variety of middleware interfaces, so you can flexibly meet all kinds of requirements. The overall architecture looks like this:
The green lines are the data flow. Starting from the initial URL, the Scheduler hands it to the Downloader to fetch, and the downloaded page is given to the Spider for analysis. The Spider's analysis yields two kinds of results: one is links to crawl further, such as the "next page" links analyzed earlier, which are sent back to the Scheduler; the other is data to be saved, which is delivered to the Item Pipeline, the place where data is post-processed (analyzed in detail, filtered, stored, and so on). In addition, various middleware can be installed along the data-flow channel to do whatever processing is needed.
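To make the green-line data flow concrete, here is a toy simulation in plain Python (no Scrapy; the page names and contents are made up). The "spider" returns both items and new requests; items flow to a "pipeline" list, new URLs flow back to a "scheduler" queue:

```python
from collections import deque

def parse(url, page_store):
    """Toy 'spider': for a fake page, return (items, further URLs to crawl)."""
    html = page_store[url]
    items = [w for w in html.split() if w.endswith(".jpg")]
    links = [w.split(":", 1)[1] for w in html.split() if w.startswith("page:")]
    return items, links

# Fake "website": page p1 links onward to page p2
pages = {
    "p1": "a.jpg b.jpg page:p2",
    "p2": "c.jpg",
}

scheduler = deque(["p1"])   # plays the role of start_urls + Scheduler
pipeline = []               # stands in for the Item Pipeline's storage

while scheduler:
    url = scheduler.popleft()             # Scheduler hands a URL to the Downloader
    items, new_urls = parse(url, pages)   # downloaded page goes to the Spider
    pipeline.extend(items)                # saved data flows to the Item Pipeline
    scheduler.extend(new_urls)            # new links flow back to the Scheduler

print(pipeline)  # all images from all pages, in crawl order
```

This is only a mental model; the real Scrapy engine does all of this asynchronously on top of Twisted.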
Having briefly introduced Scrapy's workflow, let's get straight to the topic: using Scrapy to crawl beautiful pictures.
Note that today is not a basic Scrapy tutorial; the basics will be explained later, when we develop the Qiye music platform. So let's get started.
Many of you must already know Fried Egg Net (http://jandan.net) ... I can almost see the wicked smiles ...
Back when I first wanted to crawl pictures, a senior recommended this site to me (really, he just wanted the pictures ...), and my young mind was 'influenced' by it.
On the Fried Egg Net homepage there is a column of girls' pictures, and that is today's target.
The pictures in this category are arranged by page, so to crawl all of them we need to simulate turning the pages.
Open Firebug in Firefox and inspect the elements.
These are the image links we need; with a link in hand, we can simply download the image.
So what does the link look like after turning the page?
We only need to parse out the tag circled in red to know the link to the next page. It is that simple. OK, time to write the code ...
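As a sketch of the two extractions we need (the image src attributes and the "previous-comment-page" link), here is a standalone example using Python's built-in xml.etree.ElementTree on a made-up, simplified page fragment; on the real site, Scrapy's response.xpath does the same job with the same XPath ideas:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for part of the real page
html = """<div>
  <p><img src="http://img.example.com/pic1.jpg"/></p>
  <p><img src="http://img.example.com/pic2.jpg"/></p>
  <a class="previous-comment-page" href="http://jandan.net/ooxx/page-100">older</a>
</div>"""

root = ET.fromstring(html)

# All image links, analogous to the XPath //img//@src
image_urls = [img.get("src") for img in root.iter("img")]

# The next-page link, analogous to //a[@class="previous-comment-page"]//@href
next_link = root.find(".//a[@class='previous-comment-page']")
next_url = next_link.get("href") if next_link is not None else None
```

ElementTree only supports a limited XPath subset and requires well-formed markup, which real HTML rarely is; that is exactly why Scrapy's selectors (built on lxml) are used in the actual spider below.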
Open cmd and enter scrapy startproject jiandan; this generates the project. Then I copy the whole project into PyCharm (developing in an IDE is faster).
This is the structure of the project:
jiandanspider.py ------ the spider itself
items.py ---------------- defines the model of the data to crawl
pipelines.py ------------ stores the data we end up with
settings.py ------------- the Scrapy configuration
Next I'll post the code:
jiandanspider.py:

```python
# coding: utf-8
import scrapy

from jiandan.items import JiandanItem


class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = []
    start_urls = ["http://jandan.net/ooxx"]

    def parse(self, response):
        item = JiandanItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image links
        yield item

        # turn the page: follow the "previous-comment-page" link
        new_url = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first()
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)
```
items.py:

```python
# -*- coding: utf-8 -*-
import scrapy


class JiandanItem(scrapy.Item):
    # define the fields for your item here:
    image_urls = scrapy.Field()  # the image links
```
pipelines.py:

```python
# -*- coding: utf-8 -*-
import os
import urllib

from jiandan import settings


class JiandanPipeline(object):

    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)  # storage path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            list_name = image_url.split('/')
            file_name = list_name[len(list_name) - 1]  # image file name
            file_path = '%s/%s' % (dir_path, file_name)
            if os.path.exists(file_path):  # skip images we already have
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image
                file_writer.write(conn.read())
        return item
```
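The pipeline above is Python 2 code (urllib.urlopen). Under Python 3 the same call lives in urllib.request; here is a rough sketch of the equivalent download loop, with the function names being illustrative, not part of the original project:

```python
import os
import urllib.request


def file_name_from_url(image_url):
    """Take the last path segment as the file name, like the pipeline does."""
    return image_url.split('/')[-1]


def download_images(image_urls, dir_path):
    """Python 3 version of the pipeline's download loop (illustrative sketch)."""
    os.makedirs(dir_path, exist_ok=True)
    for image_url in image_urls:
        file_path = os.path.join(dir_path, file_name_from_url(image_url))
        if os.path.exists(file_path):  # skip images we already have
            continue
        with urllib.request.urlopen(image_url) as conn, open(file_path, 'wb') as fw:
            fw.write(conn.read())
```

For a real site you would also want a User-Agent header and error handling around urlopen, which the original pipeline omits as well.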
settings.py:

```python
# -*- coding: utf-8 -*-

# Scrapy settings for the jiandan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jiandan'

SPIDER_MODULES = ['jiandan.spiders']
NEWSPIDER_MODULE = 'jiandan.spiders'

ITEM_PIPELINES = {
    'jiandan.pipelines.JiandanPipeline': 1,
}

IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25
```
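As an aside, Scrapy also ships a built-in image pipeline that can replace the hand-written one above; since our item field is already named image_urls (the field the built-in pipeline reads by default), enabling it is just a settings change. This fragment assumes a recent Scrapy version (module path scrapy.pipelines.images) and requires the Pillow library; the storage directory shown is illustrative:

```python
# settings.py fragment: use Scrapy's built-in ImagesPipeline instead of JiandanPipeline.
# Requires Pillow; module path valid for Scrapy 1.0+.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'E:/jiandan'  # illustrative storage directory
```

The built-in pipeline adds deduplication, expiration, and optional thumbnail generation for free, at the cost of less control over file names.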
Finally we run the program: open cmd, switch to the project directory,
and enter scrapy crawl jiandan to start the crawler ...
After about 20 minutes, the crawler finished its work ...
Let's go and look at the beautiful pictures: 1.21 GB of them ...
That is it for today's share. If you think it was worthwhile, remember to leave a tip (especially for the pictures).
You are welcome to follow and support my public account:
This article is an original work; everyone is welcome to repost and share it. Please respect the original and note the source: Seven Night Story, http://www.cnblogs.com/qiyeboy/
Scrapy Crawling Beautiful Pictures (Original)