Crawler learning: an introduction to the Scrapy framework
The pages crawled are question-and-answer pairs from Baidu Muzhi ([http://muzhi.baidu.com]), using the Scrapy crawler framework. A doctor's page displays at most 760 answered questions, so only those can be crawled. First, open a cmd command line, use `cd` to move to the desired path, and run `scrapy startproject projectname` there to create a crawler project. I opened the project folder in VS Code and created a knowledge.py file under the spiders folder; it holds the crawler logic.
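For orientation, the layout that `scrapy startproject` generates looks roughly like this (assuming the project is named `baidumuzhi`, to match the imports in the spider code; knowledge.py is the file we add by hand):

```
baidumuzhi/
├── scrapy.cfg
└── baidumuzhi/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── knowledge.py   <- our spider goes here
```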
```python
import json

from scrapy.http import Request
from scrapy.spiders import CrawlSpider

from baidumuzhi.items import BaidumuzhiItem

# UIDs of the doctors to crawl (test values)
uids = ['000000', '000000', '000000', '000000', '000000', '000000', '000000',
        '123', '123', '123', '123', '123', '123', '123', '123', '123', '123',
        '123', '63320665', '000000', '000000', '000000', '000000']


class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=3450738847']

    def parse(self, response):
        site = json.loads(response.text)
        targets = site['data']['list']
        num_of_page = site['data']['total'] // 10 + 1
        if num_of_page > 76:  # at most 760 answers (76 pages of 10) are visible
            num_of_page = 76
        for target in targets:
            item = BaidumuzhiItem()
            item['qid'] = target['qid']
            item['title'] = target['title']
            item['createTime'] = target['createTime']  # field name matches items.py
            item['answer'] = target['answer']
            yield item
        for uid in uids:
            urls = ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
                    for i in range(num_of_page)]
            for url in urls:
                yield Request(url, callback=self.parse)
```
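The paging arithmetic above caps the crawl at 76 pages of 10 answers each, matching the 760-answer limit mentioned earlier. A standalone sketch of that calculation and of the URL construction (the uid `'123'` here is just a placeholder, not a real doctor id):

```python
# Sketch of the spider's pagination logic: 10 answers per page,
# capped at 76 pages because at most 760 answers are visible per doctor.
def page_count(total_answers, per_page=10, max_pages=76):
    pages = total_answers // per_page + 1
    return min(pages, max_pages)


def answer_urls(uid, total_answers):
    # Same URL template the spider uses; pn is the answer offset.
    return ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
            for i in range(page_count(total_answers))]


print(page_count(95))    # 10: ten pages cover 95 answers
print(page_count(2000))  # 76: capped at the 760-answer limit
print(answer_urls('123', 25)[2])
```

Note that `page_count` keeps the original `// 10 + 1` formula, so an exact multiple of 10 requests one extra (empty) page; the real API simply returns an empty list for it.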
`yield` differs from `return`: the method becomes a generator and keeps producing items and requests instead of exiting on the first result. The items.py file defines the data class for the fields to extract; here we extract the question id, the question title, the time, and the answer. The code is as follows:

```python
import scrapy


class BaidumuzhiItem(scrapy.Item):
    qid = scrapy.Field()
    title = scrapy.Field()
    createTime = scrapy.Field()
    answer = scrapy.Field()
```
pipelines.py defines how the crawled data is written into a database; here we use a MongoDB database:

```python
import pymongo


class BaidumuzhiPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        self.post = mydata['qandaLast']

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))  # store a copy of the item as a document
        return item
```
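The pipeline pattern itself can be exercised without a running MongoDB instance by injecting a stand-in collection. `FakeCollection` and `QandaPipeline` below are illustrative names I made up; only the `process_item` shape comes from the pipeline above:

```python
class FakeCollection:
    """Stand-in for a pymongo collection, for illustration only."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class QandaPipeline:
    # Same shape as BaidumuzhiPipeline, but the collection is injected
    # so the logic can be tested without a database.
    def __init__(self, collection):
        self.post = collection

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))  # dict() copies, so later mutation is safe
        return item


coll = FakeCollection()
pipeline = QandaPipeline(coll)
item = {'qid': '1', 'title': 'q?', 'createTime': 0, 'answer': 'a'}
pipeline.process_item(item, spider=None)
print(coll.docs[0]['title'])  # q?
```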
Set the request headers in settings.py, copying the User-Agent from your browser. The site did not block the crawler, so DOWNLOAD_DELAY is not set. For anything beyond this, see the Scrapy documentation (http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html). That's basically everything; start MongoDB and let it crawl!
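For reference, the relevant settings.py entries might look like this; the User-Agent string is an example, not the value from the original project, and the pipeline path assumes the project is named `baidumuzhi`:

```python
# settings.py (fragment)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'  # copy yours from the browser

ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# DOWNLOAD_DELAY is left unset because the site did not block the crawler,
# but something like DOWNLOAD_DELAY = 0.5 is a polite default.
```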
I'm recording this in a blog post so that I don't forget it.
/A freshman in the information security major at Northeastern University who loves English, persistence, and algorithms. There is still a lot to learn on the way forward; if the code is imperfect, please point it out in the comments./