Learning Web Crawlers: Getting Started with the Scrapy Framework
The pages to crawl are the question-answer pairs on Baidu Muzhi (http://muzhi.baidu.com), using the Scrapy crawling framework. A doctor's page shows at most 760 question-answer pairs, so only those are crawled. First open a cmd command prompt, cd into the target directory, and run scrapy startproject projectname there to create the crawler project. I open the project folder with VS Code. Under the spiders folder, create a knowledge.py file; this is where the crawling logic lives.
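For reference, the layout that scrapy startproject generates looks roughly like the sketch below (the exact set of files varies slightly between Scrapy versions; the project name baidumuzhi is inferred from the imports used later, and knowledge.py is added by hand):

scrapy startproject baidumuzhi      # run in the target directory
baidumuzhi/
    scrapy.cfg                      # deploy configuration
    baidumuzhi/
        __init__.py
        items.py                    # item classes (edited below)
        pipelines.py                # item pipelines (edited below)
        settings.py                 # project settings (edited below)
        spiders/
            __init__.py
            knowledge.py            # created by hand, holds the spider logic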
import scrapy, re, requests, json
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from baidumuzhi.items import BaidumuzhiItem

# uids used for testing; each one identifies a doctor
uids = ['73879479', '1246344246', '1231532126', '618625720', '484658950',
        '201748607', '200140822', '1690937', '38227344', '930048074',
        '797647705', '795334291', '161087120', '83187968', '949887302',
        '591339998', '359728620', '111266359', '63320665', '924213326',
        '900849154', '838701150', '838701150', '680796252']

class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=3450738847']

    def parse(self, response):
        for uid in uids:
            # the endpoint returns JSON; data.list holds the answers, data.total the count
            site = json.loads(response.text)
            targets = site['data']['list']
            num_of_page = site['data']['total'] // 10 + 1
            if num_of_page > 76:
                num_of_page = 76
            for target in targets:
                item = BaidumuzhiItem()
                item['qid'] = target['qid']
                item['title'] = target['title']
                item['createTime'] = target['createTime']
                item['answer'] = target['answer']
                yield item
            # follow the paginated answer lists for this doctor
            urls = ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
                    for i in range(num_of_page)]
            for url in urls:
                yield Request(url, callback=self.parse)
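The parse method assumes the endpoint returns JSON shaped roughly like the sketch below; the field names are taken from the code above, while the concrete values are made up for illustration. The pagination math then caps each doctor at 76 pages of 10 answers, which is exactly the 760-answer limit mentioned earlier:

# Hypothetical response shape, inferred from the fields the spider reads:
# {
#     "data": {
#         "total": 1234,      # total number of answers for this doctor
#         "list": [
#             {"qid": "...", "title": "...", "createTime": "...", "answer": "..."},
#             ...
#         ]
#     }
# }
total = 1234
num_of_page = total // 10 + 1        # 10 answers per page (rn=10 in the URL)
num_of_page = min(num_of_page, 76)   # 76 pages * 10 answers = at most 760 per doctor
print(num_of_page)                   # 76 for this example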
Note that yield behaves differently from return: parse is a generator, so each yield hands one item (or one new Request) to Scrapy without ending the function (a small illustration follows the items.py code below). The items.py file defines the item class for the data to extract; here we extract the question id, the question title, the creation time, and the answer. The code is as follows:
import scrapy

class BaidumuzhiItem(scrapy.Item):
    # fields to extract: question id, question title, creation time, answer
    qid = scrapy.Field()
    title = scrapy.Field()
    createTime = scrapy.Field()
    answer = scrapy.Field()
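To make the yield point above concrete, here is a minimal standalone sketch (plain Python, not part of the project) showing the difference between returning a whole list and yielding values one at a time, which is how Scrapy consumes parse:

# return builds the whole result before the caller sees anything;
# yield produces one value at a time, which is how Scrapy drives parse().
def collect_with_return(n):
    results = []
    for i in range(n):
        results.append(i * i)
    return results                        # everything is materialised up front

def collect_with_yield(n):
    for i in range(n):
        yield i * i                       # the caller gets each value as it is produced

print(collect_with_return(3))             # [0, 1, 4]
print(list(collect_with_yield(3)))        # [0, 1, 4], but produced lazily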
pipelines.py defines how the crawled data is written into the database. MongoDB is used here. The code is as follows:
import pymongo

class BaidumuzhiPipeline(object):
    def __init__(self):
        # connect to the local MongoDB instance and use the qandaLast
        # collection inside the mydata database
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        qanda = mydata['qandaLast']
        self.post = qanda

    def process_item(self, item, spider):
        # convert the Scrapy item to a plain dict and store it
        infor = dict(item)
        self.post.insert_one(infor)
        return item
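A quick way to check that the pipeline is actually writing documents is to query the same collection from a separate script. This is just a sketch against the localhost / mydata / qandaLast names used above, and count_documents requires pymongo 3.7 or newer:

import pymongo

# Connect to the same database and collection the pipeline writes to.
client = pymongo.MongoClient('localhost', 27017)
qanda = client['mydata']['qandaLast']

print(qanda.count_documents({}))   # how many Q&A documents have been stored
print(qanda.find_one())            # inspect one stored document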
The request headers need to be configured in settings.py; the User-Agent can simply be copied from a browser. The site did not block the crawler, so DOWNLOAD_DELAY is left unset. The Scrapy documentation (Chinese translation) is at http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html. That is basically everything; once MongoDB is set up, we can start crawling! A sketch of the relevant settings follows.
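As a rough sketch of what settings.py ends up containing: the User-Agent string below is a placeholder to be replaced with one copied from your browser; ITEM_PIPELINES is what makes Scrapy actually call the pipeline defined above; and ROBOTSTXT_OBEY is my own addition (not mentioned in the original steps), since newer Scrapy versions obey robots.txt by default:

# settings.py (sketch, assuming the project is named baidumuzhi)
BOT_NAME = 'baidumuzhi'

SPIDER_MODULES = ['baidumuzhi.spiders']
NEWSPIDER_MODULE = 'baidumuzhi.spiders'

# Copy a real User-Agent string from your browser's developer tools.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'

# Assumption: newer Scrapy versions obey robots.txt by default, which may block the crawl.
ROBOTSTXT_OBEY = False

# Enable the MongoDB pipeline defined in pipelines.py.
ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# The site did not block the crawler, so DOWNLOAD_DELAY is left unset.
# DOWNLOAD_DELAY = 1

With the settings in place, running scrapy crawl knowledge from the project root starts the spider (the name matches name = 'knowledge' in the spider class).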
I am writing this blog post to record all of this before I forget it.
A freshman majoring in information security at Northeastern University, an English enthusiast, devoted to algorithms. There is still a lot I need guidance on along the way; where the code is imperfect, please point it out in the comments.