Using MongoDB to Store Scraped Douban Movie Data


  1. Create the Scrapy project douban

    scrapy startproject douban
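
    Running the command above generates the standard Scrapy project skeleton (the exact files vary slightly by Scrapy version). The files edited in steps 2-5 live inside the inner douban package; the layout is roughly:

    douban/
        scrapy.cfg              # deploy configuration
        douban/
            __init__.py
            items.py            # step 2
            middlewares.py
            pipelines.py        # step 4
            settings.py         # step 5
            spiders/
                __init__.py
                doubanmovies.py # step 3 (added by hand or with scrapy genspider)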
  2. Set up the items.py file, defining the fields to be stored

    # -*- coding: utf-8 -*-
    import scrapy


    class DoubanItem(scrapy.Item):
        # title
        title = scrapy.Field()
        # description
        content = scrapy.Field()
        # rating
        rating_num = scrapy.Field()
        # one-line quote
        quote = scrapy.Field()
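
    Because DoubanItem subclasses scrapy.Item, a populated item behaves like a dictionary, which is exactly what the pipeline in step 4 relies on when it calls dict(item). A minimal interactive sketch (illustration only, not one of the project files):

    from douban.items import DoubanItem

    item = DoubanItem()
    item['title'] = ['The Shawshank Redemption']   # extract() always returns a list
    item['rating_num'] = ['9.7']
    print(dict(item))   # {'title': ['The Shawshank Redemption'], 'rating_num': ['9.7']}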
  3. Set up the spider file doubanmovies.py

    # -*- coding: utf-8 -*-
    import scrapy

    from douban.items import DoubanItem


    class DoubanmoviesSpider(scrapy.Spider):
        name = 'doubanmovies'
        allowed_domains = ['movie.douban.com']
        offset = 0
        url = 'https://movie.douban.com/top250?start='
        start_urls = [url + str(offset)]

        def parse(self, response):
            item = DoubanItem()
            info = response.xpath("//div[@class='info']")
            for each in info:
                item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
                item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
                item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
                item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
                yield item
            # The Top 250 list spans pages start=0, 25, ..., 225
            self.offset += 25
            if self.offset < 250:
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
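
    The XPath expressions above can be checked interactively before running the full crawl, for example in Scrapy's shell (a quick sketch, assuming the page is reachable with the USER_AGENT configured in step 5 and its markup has not changed):

    scrapy shell 'https://movie.douban.com/top250'
    >>> response.xpath("//div[@class='info']//span[@class='title'][1]/text()").extract_first()
    >>> response.xpath("//span[@class='rating_num']/text()").extract()[:3]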
  4. Set up the pipeline file, using MongoDB to store the scraped data. This is the key part.

    # -*- coding: utf-8 -*-
    from scrapy.conf import settings
    import pymongo


    class DoubanPipeline(object):
        def __init__(self):
            self.host = settings['MONGODB_HOST']
            self.port = settings['MONGODB_PORT']

        def process_item(self, item, spider):
            # Create the MongoDB client. This example reads the host and port from
            # settings.py, but they could also be written here directly.
            self.client = pymongo.MongoClient(self.host, self.port)
            # Create (or open) the douban database
            self.mydb = self.client['douban']
            # Create the doubanmovies collection inside the douban database
            self.mysheetname = self.mydb['doubanmovies']
            # Convert the dict-like item into a plain Python dict
            content = dict(item)
            # Insert the document into the collection
            self.mysheetname.insert(content)
            return item
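
    Note that process_item above opens a new MongoClient for every single item. A common refinement is to connect once when the spider starts and close the connection when it finishes. The sketch below is a hypothetical rewrite along those lines (not the original author's code); it also reads the settings through from_crawler, which works on current Scrapy versions where scrapy.conf is no longer available:

    import pymongo


    class MongoDoubanPipeline(object):
        def __init__(self, host, port):
            self.host = host
            self.port = port

        @classmethod
        def from_crawler(cls, crawler):
            # Read the same keys defined in settings.py (step 5)
            return cls(
                host=crawler.settings.get('MONGODB_HOST', '127.0.0.1'),
                port=crawler.settings.get('MONGODB_PORT', 27017),
            )

        def open_spider(self, spider):
            # Connect once per crawl instead of once per item
            self.client = pymongo.MongoClient(self.host, self.port)
            self.sheet = self.client['douban']['doubanmovies']

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # insert_one is the current PyMongo API; insert() was removed in PyMongo 4
            self.sheet.insert_one(dict(item))
            return item

    If this variant were used, the ITEM_PIPELINES entry in step 5 would need to point at MongoDoubanPipeline instead of DoubanPipeline.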
  5. Set up the settings.py file

    # -*- coding: utf-8 -*-
    BOT_NAME = 'douban'

    SPIDER_MODULES = ['douban.spiders']
    NEWSPIDER_MODULE = 'douban.spiders'

    USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;'

    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'douban.pipelines.DoubanPipeline': 300,
    }

    # MongoDB connection settings
    MONGODB_HOST = '127.0.0.1'
    MONGODB_PORT = 27017
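
    The custom MONGODB_* keys are ordinary Scrapy settings, so they can be sanity-checked from the project directory before crawling (a quick check; the settings command reads the active project settings module):

    scrapy settings --get MONGODB_HOST
    scrapy settings --get MONGODB_PORT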
  6. Test from the terminal

    scrapy crawl douban
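
Once the crawl finishes, the stored documents can be inspected directly, for example with a short PyMongo snippet (a minimal sketch, assuming MongoDB is running locally on the default port as configured in step 5):

    import pymongo

    client = pymongo.MongoClient('127.0.0.1', 27017)
    collection = client['douban']['doubanmovies']

    # count_documents requires PyMongo 3.7+; expect up to 250 documents
    print(collection.count_documents({}))
    for doc in collection.find().limit(3):
        print(doc.get('title'), doc.get('rating_num'))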

Do code snippets on this blogging platform really need four-space indentation? I found that only four-space indentation fixes the indentation of code blocks like the ones above.
