Python Show-me-the-code No. 0013: grab girl pictures using Scrapy

Source: Internet
Author: User
Tags: xpath

Question No. 0013: write a Python program that crawls pictures, and use it to crawl the Japanese girl pictures at this link :-)


Ideas:

Strictly speaking, Scrapy is not required here; regular-expression matching plus the requests library would be enough for the task. But I wanted to practise Scrapy, so I used Scrapy for this one.

Only one page of pictures needs to be crawled, so there is no need to write any follow rules, which keeps things simple. Inspecting the girl pictures on the linked page shows that the images in the Baidu Tieba post all carry the BDE_Image class, so the job is straightforward: use XPath to pull out every img tag with that class, put the needed fields into an item, and hand the item over to the pipeline.
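The heart of that extraction is a single XPath expression. Below is a minimal sketch of it outside a Scrapy project (an illustration only: it uses requests plus Scrapy's Selector directly, and assumes the post's images still carry the BDE_Image class and a bdwater attribute):

# -*- coding: utf-8 -*-
# Minimal sketch: fetch the page with requests and pull out every img tag
# that carries the BDE_Image class, printing name and URL for each one.
import requests
from scrapy.selector import Selector

html = requests.get('http://tieba.baidu.com/p/2166231880').text
for img in Selector(text=html).xpath('//img[@class="BDE_Image"]'):
    print img.xpath('@bdwater').extract()[0]  # picture name (watermark attribute)
    print img.xpath('@src').extract()[0]      # picture URL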

In the pipeline I first check whether the item's information is complete, then check whether the picture has already been downloaded; if it has, I skip it, otherwise I download it. For convenience, besides saving the picture itself, I also store the picture's information (name and storage path) in MongoDB.
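Stripped of the Scrapy plumbing, that per-item logic boils down to something like the following (a sketch only: handle_image is a made-up helper name, and the MongoDB host, database and collection mirror the settings shown further down):

# -*- coding: utf-8 -*-
# Sketch of the pipeline logic: skip the file if it already exists,
# otherwise stream it to disk and record name/URL/path in MongoDB.
import os
import requests
import pymongo

def handle_image(img_name, img_url, dir_path):
    file_path = os.path.join(dir_path, img_url.split('/')[-1])
    if os.path.exists(file_path):
        return file_path  # already downloaded, skip
    response = requests.get(img_url, stream=True)
    with open(file_path, 'wb') as handle:
        for block in response.iter_content(1024):
            if block:
                handle.write(block)
    collection = pymongo.MongoClient('localhost', 27017)['meizidb']['meizi']
    collection.insert({'img_name': img_name, 'img_url': img_url,
                       'file_path': file_path})
    return file_path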

Steps:
  • Generate a Scrapy project called baidutieba: scrapy startproject baidutieba
  • Enter the project folder: cd baidutieba
  • Generate a spider called meizi: scrapy genspider meizi baidu.com
  • Write the relevant code
  • Run the spider: scrapy crawl meizi
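For reference, the genspider step above produces a spider skeleton roughly like the one below (a sketch; the exact template depends on the Scrapy version), which the meizi.py shown later fills in:

# -*- coding: utf-8 -*-
import scrapy

class MeiziSpider(scrapy.Spider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass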
Code:

Spider
meizi.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from baidutieba.items import BaidutiebaItem
from scrapy.selector import Selector
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MeiziSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    print "begin to crawl the girl pictures"
    start_urls = (
        'http://tieba.baidu.com/p/2166231880',
    )

    # Parse the response and extract the image information
    def parse(self, response):
        # Find all img tags with the BDE_Image class
        allimg = Selector(response).xpath('//img[@class="BDE_Image"]')
        for img in allimg:
            item = BaidutiebaItem()
            item['img_name'] = img.xpath('@bdwater').extract()[0]
            item['img_url'] = img.xpath('@src').extract()[0]
            yield item

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
import requests
import os


class ImageDownloadAndMongoDBPipeline(object):

    def __init__(self):
        # Create a MongoDB connection
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        # Check whether the item is complete
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            # Build the directory path
            dir_path = '%s/%s' % (settings['IMAGES_STORE'], spider.name)
            # Create the directory if it does not exist
            if not os.path.exists(dir_path):
                log.msg("No directory exists, creating it",
                        level=log.DEBUG, spider=spider)
                os.makedirs(dir_path)
            image_url = item['img_url']
            # Build the file name from the URL
            us = image_url.split('/')[3:]
            image_file_name = '_'.join(us)
            file_path = '%s/%s' % (dir_path, image_file_name)
            if not os.path.exists(file_path):
                # The picture has not been downloaded yet, so download it
                with open(file_path, 'wb') as handle:
                    response = requests.get(image_url, stream=True)
                    for block in response.iter_content(1024):
                        if block:
                            handle.write(block)
                item['file_path'] = file_path
                log.msg("Downloaded picture!",
                        level=log.DEBUG, spider=spider)
                # Record it in the database
                self.collection.insert(dict(item))
                log.msg("Stored in database!",
                        level=log.DEBUG, spider=spider)
            else:
                log.msg("The picture has already been downloaded, skipping",
                        level=log.DEBUG, spider=spider)
        return item


class ImageDownloadPipeline(object):

    def process_item(self, item, spider):
        print item
        return item

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BaidutiebaItem(scrapy.Item):
    img_name = scrapy.Field()
    img_url = scrapy.Field()
    file_path = scrapy.Field()
    pass

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for the baidutieba project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'baidutieba'

SPIDER_MODULES = ['baidutieba.spiders']
NEWSPIDER_MODULE = 'baidutieba.spiders'

ITEM_PIPELINES = {'baidutieba.pipelines.ImageDownloadAndMongoDBPipeline': 1}

# Path where the pictures are stored
IMAGES_STORE = '/home/bill/pictures'

# MongoDB configuration
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "meizidb"
MONGODB_COLLECTION = "meizi"

# Crawl responsibly by identifying yourself (and your website) on the User-Agent
#USER_AGENT = 'baidutieba (+http://www.yourdomain.com)'

Crawl process:

Database:

The crawled girl pictures:

