Question No. 0013: write a Python program that crawls pictures — crawl the sister pictures from this link :-)
Full code
Ideas:
In fact, this does not require Scrapy; regular-expression matching plus requests would be enough to complete the task. But I wanted to practice Scrapy, so I used Scrapy for this.
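As a sketch of the non-Scrapy approach mentioned above (a hypothetical illustration, assuming the pictures carry the same BDE_Image class described below), a regex + requests version of the extraction step might look like:

```python
import re

# hypothetical helper: pull the src attribute of every <img> tag that
# carries the BDE_Image class, using a regular expression instead of XPath
IMG_RE = re.compile(r'<img[^>]*class="BDE_Image"[^>]*src="([^"]+)"')

def extract_image_urls(html):
    # findall returns the captured src value of each matching img tag
    return IMG_RE.findall(html)

html = ('<img class="BDE_Image" src="http://example.com/a.jpg">'
        '<img class="other" src="http://example.com/b.jpg">')
print(extract_image_urls(html))  # only the BDE_Image src is returned
```

Each returned URL could then be fetched with `requests.get` and written to disk, which is exactly what the pipeline below does.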
Only one page of pictures needs to be crawled, so there is no need to write any follow rules, which keeps things simple. Analyzing the tags of the sister pictures on the linked page shows that pictures in Baidu Tieba posts carry the BDE_Image class, so it is easy: use XPath to extract all img tags with the BDE_Image class — those are the pictures we need — put the necessary fields into an item, then hand it over to the pipeline.
In the pipeline I first check whether the item's information is complete, then check whether the picture has already been downloaded; if it has, skip it, otherwise download it. For convenience, when saving a picture I also store its information (name, storage path) in MongoDB.
Steps:
Generate a Scrapy project called baidutieba: scrapy startproject baidutieba
Enter the project folder: cd baidutieba
Generate a spider called meizi: scrapy genspider meizi baidu.com
Write the relevant code, then run it: scrapy crawl meizi
Code:
Spider
meizi.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from baidutieba.items import BaidutiebaItem
from scrapy.selector import Selector
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MeiziSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    print "begin to crawl the sister pictures"
    start_urls = (
        'http://tieba.baidu.com/p/2166231880',
    )

    # the parse method parses each response
    def parse(self, response):
        # find all pictures with the class BDE_Image
        allimg = Selector(response).xpath('//img[@class="BDE_Image"]')
        for img in allimg:
            item = BaidutiebaItem()
            item['img_name'] = img.xpath('@bdwater').extract()[0]
            item['img_url'] = img.xpath('@src').extract()[0]
            yield item
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
import requests
import os


class ImageDownloadAndMongoDBPipeline(object):
    def __init__(self):
        # create a MongoDB connection
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        # check that the item's fields are all present
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            # build the directory path for this spider
            dir_path = '%s/%s' % (settings['IMAGES_STORE'], spider.name)
            # create the directory if it does not exist
            if not os.path.exists(dir_path):
                log.msg("No directory exists, create", level=log.DEBUG, spider=spider)
                os.makedirs(dir_path)
            image_url = item['img_url']
            # build the file name from the URL path
            us = image_url.split('/')[3:]
            image_file_name = '_'.join(us)
            file_path = '%s/%s' % (dir_path, image_file_name)
            if not os.path.exists(file_path):
                # the picture has not been downloaded yet, so fetch it
                with open(file_path, 'wb') as handle:
                    response = requests.get(image_url, stream=True)
                    for block in response.iter_content(1024):
                        if block:
                            handle.write(block)
                item['file_path'] = file_path
                log.msg("Downloaded picture!", level=log.DEBUG, spider=spider)
                # record it in the database
                self.collection.insert(dict(item))
                log.msg("Stored in database!", level=log.DEBUG, spider=spider)
            else:
                log.msg("The picture has already been downloaded, skip", level=log.DEBUG, spider=spider)
        return item


class ImageDownloadPipeline(object):
    def process_item(self, item, spider):
        print item
        return item
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BaidutiebaItem(scrapy.Item):
    img_name = scrapy.Field()
    img_url = scrapy.Field()
    file_path = scrapy.Field()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for baidutieba project
#
# For simplicity, this file contains only the most important settings by
# default. All of the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
BOT_NAME = 'baidutieba'

SPIDER_MODULES = ['baidutieba.spiders']
NEWSPIDER_MODULE = 'baidutieba.spiders'

ITEM_PIPELINES = {'baidutieba.pipelines.ImageDownloadAndMongoDBPipeline': 1}

# picture storage path
IMAGES_STORE = '/home/bill/pictures'

# MongoDB configuration
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "meizidb"
MONGODB_COLLECTION = "meizi"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'baidutieba (+http://www.yourdomain.com)'
Crawl process:
Database:
The crawled sister pictures: