Python Show-Me-the-Code Problem 0013: Scraping Girl Pictures with Scrapy


Tags: python, crawler, scrapy, mongodb, image scraping

Problem 0013: write an image crawler in Python that scrapes the Japanese girl pictures from this thread :-)

  • Reference code

Full code

Approach:

Strictly speaking, scrapy is not needed here; regex matching plus requests would be enough for this task. But I wanted to practice scrapy, so I used it for this one.
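For comparison, the regex-only route mentioned above might look roughly like this. The HTML snippet and pattern are illustrative stand-ins, not taken from the actual page:

```python
import re

# Hypothetical fragment standing in for the thread's HTML.
html = ('<img class="BDE_Image" src="http://example.com/a.jpg">'
        '<img class="icon" src="http://example.com/i.png">'
        '<img class="BDE_Image" src="http://example.com/b.jpg">')

# Grab the src of every img tagged with the BDE_Image class.
pattern = r'<img class="BDE_Image" src="([^"]+)"'
urls = re.findall(pattern, html)
print(urls)
```

This skips everything Scrapy provides (scheduling, retries, pipelines), which is fine for a single page.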

The problem only asks for the images on a single page, so no follow rules are needed, which keeps things fairly simple. Inspecting the image tags in the linked thread shows that pictures posted on Baidu Tieba carry the BDE_Image class, which makes this easy: a single XPath query pulls out every img tag with the BDE_Image class, and those are exactly the pictures we want. The needed fields go into an item, which is then handed off to the pipeline.
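The same class-based selection can be sketched outside Scrapy with the standard library's limited XPath support. The fragment below is a made-up stand-in mimicking how Tieba marks up posted images:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking Tieba post markup.
html = """
<div>
  <img class="BDE_Image" src="http://example.com/a.jpg" bdwater="pic_a"/>
  <img class="icon" src="http://example.com/icon.png"/>
  <img class="BDE_Image" src="http://example.com/b.jpg" bdwater="pic_b"/>
</div>
"""

root = ET.fromstring(html)
# Same predicate the spider uses: keep only imgs carrying the BDE_Image class.
images = root.findall('.//img[@class="BDE_Image"]')
pairs = [(img.get('bdwater'), img.get('src')) for img in images]
print(pairs)
```

On real pages the markup is rarely well-formed XML, which is why the spider uses Scrapy's Selector instead.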

In the pipeline I first check that the item's fields are complete, then check whether the picture has already been downloaded; if so, it is skipped, otherwise it is downloaded. For convenience, after saving the picture I also store its metadata (name and file path) in MongoDB.
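The skip-if-present logic can be isolated into a small helper. This is only a sketch: the function name and the injected `fetch` callable are mine, not from the original pipeline, and it is written in Python 3 form:

```python
import os

def download_if_missing(url, dir_path, fetch):
    """Derive a file name from the URL and only fetch the image when it
    is not already on disk, mirroring the pipeline's existence check.
    `fetch` is a callable returning raw bytes (injected for testing)."""
    os.makedirs(dir_path, exist_ok=True)
    # Same naming scheme as the pipeline: join the URL path segments.
    file_name = '_'.join(url.split('/')[3:])
    file_path = os.path.join(dir_path, file_name)
    if os.path.exists(file_path):
        return file_path, False  # already downloaded: skip
    with open(file_path, 'wb') as handle:
        handle.write(fetch(url))
    return file_path, True
```

Calling it twice with the same URL downloads once and then skips, which is exactly the behavior the pipeline wants.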

Steps:

  • Generate a scrapy project named baidutieba: scrapy startproject baidutieba
  • Enter the project directory: cd baidutieba
  • Generate a spider named meizi: scrapy genspider meizi baidu.com
  • Write the code below
  • Run it: scrapy crawl meizi
Code:

spider:
meizi.py

```python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from baidutieba.items import BaidutiebaItem
from scrapy.selector import Selector
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MeiziSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    print "Starting to crawl pictures"
    start_urls = (
        'http://tieba.baidu.com/p/2166231880',
    )

    # parse method, used to parse the response
    def parse(self, response):
        # find all images carrying the BDE_Image class
        AllImg = Selector(response).xpath('//img[@class="BDE_Image"]')
        for img in AllImg:
            item = BaidutiebaItem()
            item['Img_name'] = img.xpath('@bdwater').extract()[0]
            item['Img_url'] = img.xpath('@src').extract()[0]
            yield item
```

pipelines.py

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
import requests
import os


class ImageDownloadAndMongoDBPipeline(object):
    def __init__(self):
        # create the MongoDB connection
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        # make sure every field actually has a value
        for data in item:
            if not item.get(data):
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            # directory to store images in
            dir_path = '%s/%s' % (settings['IMAGES_STORE'], spider.name)
            # create the directory if it does not exist
            if not os.path.exists(dir_path):
                log.msg("Directory missing, creating it",
                        level=log.DEBUG, spider=spider)
                os.makedirs(dir_path)
            image_url = item['Img_url']
            # build the file name from the URL path segments
            us = image_url.split('/')[3:]
            image_file_name = '_'.join(us)
            file_path = '%s/%s' % (dir_path, image_file_name)
            if not os.path.exists(file_path):
                # not downloaded yet, so fetch the image
                with open(file_path, 'wb') as handle:
                    response = requests.get(image_url, stream=True)
                    for block in response.iter_content(1024):
                        if block:
                            handle.write(block)
                item['File_path'] = file_path
                log.msg("Image downloaded!",
                        level=log.DEBUG, spider=spider)
                # record it in the database
                self.collection.insert(dict(item))
                log.msg("Saved to database!",
                        level=log.DEBUG, spider=spider)
            else:
                log.msg("Image already downloaded, skipping",
                        level=log.DEBUG, spider=spider)
        return item


class ImageDownloadPipeline(object):
    def process_item(self, item, spider):
        print item
        return item
```

items.py

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BaidutiebaItem(scrapy.Item):
    Img_name = scrapy.Field()
    Img_url = scrapy.Field()
    File_path = scrapy.Field()
```

settings.py

```python
# -*- coding: utf-8 -*-

# Scrapy settings for baidutieba project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'baidutieba'

SPIDER_MODULES = ['baidutieba.spiders']
NEWSPIDER_MODULE = 'baidutieba.spiders'

ITEM_PIPELINES = {'baidutieba.pipelines.ImageDownloadAndMongoDBPipeline': 1}

# where downloaded images are stored
IMAGES_STORE = '/home/bill/Pictures'

# MongoDB configuration
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "meizidb"
MONGODB_COLLECTION = "meizi"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'baidutieba (+http://www.yourdomain.com)'
```

Crawling process:

Database:

Crawled pictures:

