Learning Web Scraping: An Introduction to the Scrapy Framework

The pages to crawl are the question-and-answer pairs on Baidu Muzhi (http://muzhi.baidu.com), using the Scrapy framework. A doctor's page shows at most 760 Q&A pairs, so only those are crawled. First open a cmd window, use cd to move into the target directory, and run scrapy startproject projectname to create the project. I opened the project folder in VS Code and created knowledge.py under the spiders directory; this file contains the crawling logic.
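For orientation, scrapy startproject baidumuzhi generates a skeleton roughly like the one below; the exact files vary slightly between Scrapy versions, and the project name baidumuzhi is assumed here to match the imports used later:

baidumuzhi/
    scrapy.cfg              # deploy configuration
    baidumuzhi/
        __init__.py
        items.py            # item definitions (edited below)
        pipelines.py        # item pipelines (edited below)
        settings.py         # project settings
        spiders/
            __init__.py     # knowledge.py is created in this directory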
import json

import scrapy
from scrapy.http import Request
from scrapy.spiders import CrawlSpider

from baidumuzhi.items import BaidumuzhiItem

# Test uids, one per doctor whose answers we want to crawl.
uids = ['73879479', '1246344246', '1231532126', '618625720', '484658950',
        '201748607', '200140822', '1690937', '38227344', '930048074',
        '797647705', '795334291', '161087120', '83187968', '949887302',
        '591339998', '359728620', '111266359', '63320665', '924213326',
        '900849154', '838701150', '680796252']

class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    # Start with page 0 for every doctor.
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid={0}'.format(uid)
                  for uid in uids]

    def parse(self, response):
        # The endpoint returns JSON, so no Selector is needed.
        site = json.loads(response.text)
        targets = site['data']['list']
        # At most 760 answers are shown per doctor, i.e. 76 pages of 10.
        num_of_page = site['data']['total'] // 10 + 1
        if num_of_page > 76:
            num_of_page = 76
        for target in targets:
            item = BaidumuzhiItem()   # a fresh item for every Q&A pair
            item['qid'] = target['qid']
            item['title'] = target['title']
            item['createTime'] = target['createTime']
            item['answer'] = target['answer']
            yield item
        # Schedule this doctor's remaining pages; Scrapy's built-in
        # dupefilter silently drops URLs that were already requested.
        uid = response.url.split('uid=')[-1]
        for i in range(num_of_page):
            url = 'http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
            yield Request(url, callback=self.parse)
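For reference, the parse logic above assumes the endpoint returns JSON shaped roughly like this; only the keys the spider actually reads are real, and all values are made-up placeholders:

# Assumed shape of the JSON returned by
# http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=...
# (keys taken from what parse() reads; values are illustrative only)
{
    "data": {
        "total": 760,                      # total answers; drives the paging
        "list": [
            {
                "qid": "123456",           # question id (placeholder)
                "title": "...",            # question title
                "createTime": "2017-06-01",# placeholder format
                "answer": "..."
            }
        ]
    }
}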
Note that yield differs from return: because parse uses yield, it is a generator and can keep emitting items and follow-up requests from a single call. items.py defines the item class for the fields to extract; here these are the question id, the question title, the creation time, and the answer. The code is as follows:

import scrapy

class BaidumuzhiItem(scrapy.Item):
    qid = scrapy.Field()         # question id
    title = scrapy.Field()       # question title
    createTime = scrapy.Field()  # time the question was created
    answer = scrapy.Field()      # the doctor's answer
pipelines.py defines how the scraped items are written to the database; MongoDB is used here:

import pymongo

class BaidumuzhiPipeline(object):
    def __init__(self):
        # Connect to the local MongoDB instance and select the collection.
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        self.post = mydata['qandaLast']

    def process_item(self, item, spider):
        # Convert the item to a plain dict and insert it as one document.
        self.post.insert_one(dict(item))
        return item
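To confirm that items are actually landing in MongoDB, a quick check with pymongo can be run after a short crawl; this is a sketch that reuses the database and collection names from the pipeline above:

import pymongo

# Connect the same way the pipeline does and peek at stored documents.
client = pymongo.MongoClient('localhost', 27017)
qanda = client['mydata']['qandaLast']
print(qanda.count_documents({}))      # number of stored Q&A pairs
for doc in qanda.find().limit(3):     # sample a few records
    print(doc['qid'], doc['title'])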
In settings.py, the request headers need to be configured; the User-Agent can simply be copied from a browser. The site didn't block the crawler, so DOWNLOAD_DELAY is left unset. The Scrapy documentation (in Chinese) is at http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html. That's basically everything; once MongoDB is set up, the crawl can begin!
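A minimal settings.py sketch follows. The User-Agent string is a placeholder to be replaced with one copied from a browser, and the module paths assume the project is named baidumuzhi, matching the imports above; note that the pipeline must be registered in ITEM_PIPELINES, or process_item is never called:

BOT_NAME = 'baidumuzhi'
SPIDER_MODULES = ['baidumuzhi.spiders']
NEWSPIDER_MODULE = 'baidumuzhi.spiders'

# Replace with a real User-Agent copied from your browser's dev tools.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'

# Register the pipeline; without this entry, items never reach MongoDB.
ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# DOWNLOAD_DELAY is deliberately left unset, as the site did not block us.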

I'm writing this post mainly as a record, so I don't forget the process.

A freshman majoring in information security at Northeastern University, an English enthusiast, devoted to algorithms. There is still a lot I need guidance on along the way; where the code falls short, please point it out in the comments.
