Tags: python, web crawler, scrapy, mongodb, image scraping
Problem 0013: Write a Python program that crawls images, downloading the Japanese girl photos from this link :-)
Full code
Approach:
This could actually be done without Scrapy; regex matching plus requests would be enough for the task. But I wanted to practice Scrapy, so I used it here.
The task only asks for the images on a single page, so there is no need to write any follow rules, which keeps things fairly simple. Inspecting the image tags on the linked page shows that images posted in Baidu Tieba carry the BDE_Image class, which makes this easy: a single XPath expression pulls out every img tag with the BDE_Image class, and those are exactly the images we need. The required fields go into an item, which is then handed off to the pipeline.
In the pipeline I first check that the item's fields are complete, then check whether the image has already been downloaded; if so it is skipped, otherwise the image is downloaded. For convenience, after saving the image I also store its metadata (name and file path) in MongoDB.
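The regex + requests alternative mentioned above can be sketched as follows. This is a minimal sketch, not the code used in this post; the helper name and the sample HTML are made up for illustration, and in real use the HTML string would come from requests.get(page_url).text.

```python
# -*- coding: utf-8 -*-
# Sketch of the "regex + requests" alternative: pull the src of every
# <img> tag that carries the BDE_Image class out of raw HTML.
import re

def extract_bde_images(html):
    # Match <img ... class="BDE_Image" ... src="..."> in either attribute order;
    # [^>]* cannot cross a '>', so each alternative stays inside one tag.
    pattern = re.compile(
        r'<img\b[^>]*class="BDE_Image"[^>]*\bsrc="([^"]+)"'
        r'|<img\b[^>]*\bsrc="([^"]+)"[^>]*class="BDE_Image"'
    )
    return [a or b for a, b in pattern.findall(html)]

# Illustrative fragment; a real page would be fetched with requests
sample = (
    '<img class="BDE_Image" src="http://imgsrc.baidu.com/a.jpg">'
    '<img src="http://example.com/logo.png">'
    '<img src="http://imgsrc.baidu.com/b.jpg" class="BDE_Image">'
)
print(extract_bde_images(sample))
```

Note that regexes over HTML are fragile (quoting styles, extra classes); that fragility is one reason to reach for XPath instead, as the Scrapy version below does.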
Steps:
1. Generate a Scrapy project called baidutieba: scrapy startproject baidutieba
2. Enter the project folder: cd baidutieba
3. Generate a spider called meizi: scrapy genspider meizi baidu.com
4. Write the code below
5. Run it: scrapy crawl meizi
代碼:
Spider:
meizi.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from baidutieba.items import BaidutiebaItem
from scrapy.selector import Selector
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MeiziSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    print "Start crawling girl images"
    start_urls = (
        'http://tieba.baidu.com/p/2166231880',
    )

    # Parse the response page
    def parse(self, response):
        # Find every image carrying the BDE_Image class
        AllImg = Selector(response).xpath('//img[@class="BDE_Image"]')
        for img in AllImg:
            item = BaidutiebaItem()
            item['Img_name'] = img.xpath('@bdwater').extract()[0]
            item['Img_url'] = img.xpath('@src').extract()[0]
            yield item
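The XPath filter the spider relies on can be tried outside Scrapy. This sketch uses only the standard library's xml.etree.ElementTree, whose limited XPath subset supports the same attribute predicate; the fragment and the bdwater values are made up to mirror the page structure described above.

```python
# -*- coding: utf-8 -*-
# Stand-alone check of the '//img[@class="BDE_Image"]' idea using only
# the standard library, applied to a small well-formed fragment.
import xml.etree.ElementTree as ET

fragment = """
<div>
  <img class="BDE_Image" bdwater="pic_1" src="http://imgsrc.baidu.com/1.jpg"/>
  <img class="ad_banner" src="http://example.com/ad.png"/>
  <img class="BDE_Image" bdwater="pic_2" src="http://imgsrc.baidu.com/2.jpg"/>
</div>
"""

root = ET.fromstring(fragment)
# Keep only the BDE_Image tags, collecting the same two fields as the spider
items = [
    {"Img_name": img.get("bdwater"), "Img_url": img.get("src")}
    for img in root.findall('.//img[@class="BDE_Image"]')
]
print(items)
```

The ad image without the BDE_Image class is filtered out, which is exactly why the class predicate works so well on this page.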
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
import requests
import os


class ImageDownloadAndMongoDBPipeline(object):
    def __init__(self):
        # Open the MongoDB connection
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop the item if any field is missing
        for data in item:
            if not item.get(data):
                raise DropItem("Missing {0}!".format(data))
        # Build the target directory path
        dir_path = '%s/%s' % (settings['IMAGES_STORE'], spider.name)
        # Create the directory if it does not exist yet
        if not os.path.exists(dir_path):
            log.msg("Directory missing, creating it", level=log.DEBUG, spider=spider)
            os.makedirs(dir_path)
        image_url = item['Img_url']
        # Build the file name from the URL path segments
        us = image_url.split('/')[3:]
        image_file_name = '_'.join(us)
        file_path = '%s/%s' % (dir_path, image_file_name)
        if not os.path.exists(file_path):
            # Not downloaded yet: stream the image to disk
            with open(file_path, 'wb') as handle:
                response = requests.get(image_url, stream=True)
                for block in response.iter_content(1024):
                    if block:
                        handle.write(block)
            item['File_path'] = file_path
            log.msg("Image downloaded!", level=log.DEBUG, spider=spider)
            # Record the item in the database
            self.collection.insert(dict(item))
            log.msg("Saved to the database!", level=log.DEBUG, spider=spider)
        else:
            log.msg("Image already downloaded, skipping", level=log.DEBUG, spider=spider)
        return item


class ImageDownloadPipeline(object):
    def process_item(self, item, spider):
        print item
        return item
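The pipeline derives the local file name by joining the URL's path segments with underscores, and that derived name is what makes the already-downloaded check work. Isolated, the rule looks like this; the example URL is made up for illustration.

```python
# -*- coding: utf-8 -*-
# The pipeline's file-name rule in isolation: split the URL on '/',
# drop the first three pieces ('http:', '', host), and join the
# remaining path segments with underscores into one flat name.
def image_file_name(image_url):
    return '_'.join(image_url.split('/')[3:])

url = 'http://imgsrc.baidu.com/forum/pic/item/abc123.jpg'
print(image_file_name(url))  # forum_pic_item_abc123.jpg
```

Keeping the whole path in the name (rather than just abc123.jpg) reduces the chance that two different images on the host collide on the same base name.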
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BaidutiebaItem(scrapy.Item):
    Img_name = scrapy.Field()
    Img_url = scrapy.Field()
    File_path = scrapy.Field()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for baidutieba project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
BOT_NAME = 'baidutieba'

SPIDER_MODULES = ['baidutieba.spiders']
NEWSPIDER_MODULE = 'baidutieba.spiders'

ITEM_PIPELINES = {'baidutieba.pipelines.ImageDownloadAndMongoDBPipeline': 1}

# Where downloaded images are stored
IMAGES_STORE = '/home/bill/Pictures'

# MongoDB configuration
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "meizidb"
MONGODB_COLLECTION = "meizi"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'baidutieba (+http://www.yourdomain.com)'
The crawl run:
The database:
The downloaded images: