A complete record of scraping the Douban Top 250 books with Python and storing them in MongoDB
It took a full week and a fair number of pitfalls, but I finally got everything debugged and working. Here is the record:
1. First, create the douban crawler project from the command line in cmd:
scrapy startproject douban
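For reference, scrapy startproject generates a skeleton roughly like the following (from memory, so minor differences between Scrapy versions are possible); the files edited in the steps below all live inside it:

    douban/
        scrapy.cfg            # deploy/config entry point
        douban/
            __init__.py
            items.py          # field definitions (step 1 below)
            pipelines.py      # item processing (step 4 below)
            settings.py       # headers and MongoDB config (step 3 below)
            spiders/          # spider code (step 2 below)
                __init__.py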
2. I used PyCharm; after importing the project:
1) Define the fields to scrape in items.py. The items.py code is as follows:
    # -*- coding: utf-8 -*-
    import scrapy

    class DoubanBookItem(scrapy.Item):
        name = scrapy.Field()          # book title
        price = scrapy.Field()         # price
        edition_year = scrapy.Field()  # year of publication
        publisher = scrapy.Field()     # publisher
        ratings = scrapy.Field()       # rating
        author = scrapy.Field()        # author
        content = scrapy.Field()       # raw info line, parsed later in the pipeline
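A quick illustration (not part of the project files, just to show the behavior): a scrapy.Item works like a dict that only accepts its declared fields.

    from douban.items import DoubanBookItem

    book = DoubanBookItem()
    book['name'] = '小王子'     # fine: 'name' is a declared Field
    print(dict(book))           # {'name': '小王子'}
    # book['isbn'] = '...'      # would raise KeyError: 'isbn' is not declared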
2) Create a new spider, bookspider.py, under the spiders folder; all the page-scraping code goes in there. The bookspider.py code is below:
    # -*- coding: utf-8 -*-
    import scrapy
    from douban.items import DoubanBookItem

    class BookSpider(scrapy.Spider):
        name = 'douban-book'
        allowed_domains = ['douban.com']
        start_urls = ['https://book.douban.com/top250']

        def parse(self, response):
            # request the first page
            yield scrapy.Request(response.url, callback=self.parse_next)

            # request the remaining pages via the pagination links
            for page in response.xpath('//div[@class="paginator"]/a'):
                link = page.xpath('@href').extract()[0]
                yield scrapy.Request(link, callback=self.parse_next)

        def parse_next(self, response):
            for item in response.xpath('//tr[@class="item"]'):
                book = DoubanBookItem()
                book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
                book['content'] = item.xpath('td[2]/p/text()').extract()[0]
                # book_info = item.xpath("td[2]/p[1]/text()").extract()[0]
                # book_info_content = book_info.strip().split(" / ")
                # book["author"] = book_info_content[0]
                # book["publisher"] = book_info_content[-3]
                # book["edition_year"] = book_info_content[-2]
                # book["price"] = book_info_content[-1]
                book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
                yield book
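Before running the whole spider, it can save time to probe the XPath expressions interactively with scrapy shell (a usage sketch; the exact output depends on the live page, and Douban may reject the default User-Agent, which is what the headers in the next step work around):

    scrapy shell "https://book.douban.com/top250"
    >>> response.xpath('//tr[@class="item"]/td[2]/div[1]/a/@title').extract()[0]
    >>> response.xpath('//div[@class="paginator"]/a/@href').extract()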
3) Configure the request headers and the MongoDB connection info in settings.py:
    from faker import Factory
    f = Factory.create()
    USER_AGENT = f.user_agent()   # fake a random browser User-Agent

    DEFAULT_REQUEST_HEADERS = {
        'Host': 'book.douban.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
    }
    MONGODB_HOST = "127.0.0.1"    # this address when debugging locally
    MONGODB_PORT = 27017          # default MongoDB port
    MONGODB_DBNAME = "jkxy"       # database name
    MONGODB_DOCNAME = "Book"      # collection name, the equivalent of a table
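One step this write-up does not show, but which Scrapy needs before the pipeline in step 4 will run at all: the pipeline has to be enabled in settings.py, along these lines:

    ITEM_PIPELINES = {
        'douban.pipelines.DoubanBookPipeline': 300,   # lower numbers run earlier
    }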
4) Write the processing code in pipelines.py:
    # -*- coding: utf-8 -*-
    # "from scrapy.conf import settings" is deprecated; import settings this way instead:
    from scrapy.utils.project import get_project_settings
    # the MongoDB host, port, etc. are all configured in settings.py, so load it here
    settings = get_project_settings()
    import pymongo   # the MongoDB driver

    class DoubanBookPipeline(object):
        def __init__(self):
            host = settings["MONGODB_HOST"]      # read the host address from settings
            port = settings["MONGODB_PORT"]
            dbname = settings["MONGODB_DBNAME"]
            client = pymongo.MongoClient(host=host, port=port)  # create a MongoClient instance
            tdb = client[dbname]                 # the "jkxy" database
            self.post = tdb[settings["MONGODB_DOCNAME"]]  # the "Book" collection, i.e. the "table"

        def process_item(self, item, spider):
            # e.g. "[法] 聖埃克蘇佩裡 / 馬振聘 / 人民文學出版社 / 2003-8 / 22.00元"
            info = item['content'].split(' / ')
            item['price'] = info[-1]
            item['edition_year'] = info[-2]
            item['publisher'] = info[-3]
            bookinfo = dict(item)        # turn the scraped item into a plain dict
            self.post.insert(bookinfo)   # insert it into MongoDB (newer pymongo uses insert_one)
            return item
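To make the negative indexing in process_item concrete, this is what the split does to the sample line from the comment:

    content = '[法] 聖埃克蘇佩裡 / 馬振聘 / 人民文學出版社 / 2003-8 / 22.00元'
    info = content.split(' / ')
    print(info[-1])   # '22.00元'        -> price
    print(info[-2])   # '2003-8'         -> edition_year
    print(info[-3])   # '人民文學出版社'  -> publisher
    # everything before info[-3] is author/translator, whose count varies,
    # which is why counting from the end is the safer direction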
3. In cmd, go into the douban project's spiders directory and run the spider with scrapy runspider bookspider.py.
And with that, mission accomplished!
[Screenshot: the scraped data as stored in MongoDB]
Reflections:
1. PyCharm cannot create a Scrapy project directly; it has to be created in cmd with scrapy startproject douban.
2. Likewise, clicking Run in PyCharm does nothing without extra configuration. (At first I had no idea what was wrong: I clicked Run, it reported success, yet nothing landed in the database. Only after searching online did I learn that the spider must be launched with a scrapy command in cmd. Some articles claim that a suitable PyCharm run configuration works too, but I never got that to succeed.)
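A related tip for this "reports success but the database is empty" symptom: it is worth ruling out MongoDB connectivity separately. A minimal standalone check, assuming pymongo is installed and the settings above are in effect:

    import pymongo

    # same values as configured in settings.py
    client = pymongo.MongoClient(host="127.0.0.1", port=27017)
    print(client.server_info()["version"])     # raises ServerSelectionTimeoutError if unreachable
    print(client["jkxy"]["Book"].find_one())   # a stored book, or None if nothing was inserted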
3. Articles online say to cd into the project directory and run scrapy crawl followed by the project name, so I ran scrapy crawl douban in cmd, but it failed no matter how many times I tried. (In hindsight, scrapy crawl takes the spider's name attribute, which here is douban-book, not the project name douban, so scrapy crawl douban-book from the project root should have been the right invocation.)
Wondering whether the crawl command itself was the problem, I ran scrapy -h to see what commands exist. crawl was not in the list (it only appears when scrapy -h is run from inside the project directory, which is presumably why), but runspider was, so I went into the spiders directory and ran bookspider.py with it directly, and it finally worked.
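For anyone hitting the same wall, the difference between the two launch commands is in what the argument refers to (both are standard Scrapy commands; paths assume this project's layout):

    # from the project root (the directory containing scrapy.cfg);
    # the argument is the spider's "name" attribute, not the project name:
    scrapy crawl douban-book

    # from the spiders directory; the argument is a path to the spider file:
    scrapy runspider bookspider.py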