Crawler learning: an introduction to the scrapy framework



The pages crawled are question-and-answer pairs from Baidu Muzhi Doctor (http://muzhi.baidu.com), using the Scrapy crawler framework. A doctor's page displays at most 760 answered questions, so only those can be crawled. First, open the cmd command line, use the cd command to move to the desired path, and run scrapy startproject projectname there to create a crawler project. I open the project folder with VS Code and create a knowledge.py file under the spiders directory; it holds the crawler logic shown below.
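For reference, scrapy startproject generates roughly the following standard layout (newer Scrapy versions add a few more files, such as middlewares.py); knowledge.py is the file we add by hand:

baidumuzhi/
    scrapy.cfg
    baidumuzhi/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            knowledge.py    # created manually; contains the spider below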
import json

from scrapy.http import Request
from scrapy.spiders import CrawlSpider

from baidumuzhi.items import BaidumuzhiItem

# uids of the doctors to crawl; the values below are test placeholders
uids = ['000000', '123', '63320665']

class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=3450738847']

    def parse(self, response):
        site = json.loads(response.text)
        targets = site['data']['list']
        # at most 760 answers are shown per doctor, i.e. 76 pages of 10
        num_of_page = site['data']['total'] // 10 + 1
        if num_of_page > 76:
            num_of_page = 76
        for target in targets:
            item = BaidumuzhiItem()    # field names must match items.py
            item['qid'] = target['qid']
            item['title'] = target['title']
            item['createTime'] = target['createTime']
            item['answer'] = target['answer']
            yield item
        for uid in uids:
            # Scrapy's default duplicate filter drops requests already seen
            urls = ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
                    for i in range(num_of_page)]
            for url in urls:
                yield Request(url, callback=self.parse)
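The parse method above assumes the answer-list interface returns JSON of roughly this shape (shown as a Python literal for illustration; the field names are inferred from the keys the code reads, not verified against the live API):

response_example = {
    'data': {
        'total': 345,        # total answers for this doctor; drives pagination
        'list': [
            {
                'qid': '...',
                'title': '...',
                'createTime': '...',
                'answer': '...',
            },
        ],
    },
}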
Note that yield is different from return: because parse uses yield, it is a generator that can hand many items and follow-up requests back to Scrapy one at a time (a short standalone sketch of the difference follows the items code below). The items.py file defines the data class to be extracted; here we extract the question id, title, creation time, and answer. The code is as follows:

import scrapy

class BaidumuzhiItem(scrapy.Item):
    qid = scrapy.Field()
    title = scrapy.Field()
    createTime = scrapy.Field()
    answer = scrapy.Field()
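As promised, a minimal Scrapy-free sketch of the difference: return ends a function after producing a single value, while yield turns the function into a generator that can hand back many values one at a time, which is exactly what lets parse emit both items and follow-up Requests.

def with_return():
    return 1              # function exits; exactly one value comes back

def with_yield():
    yield 1               # execution pauses here and resumes on the next request
    yield 2

print(with_return())          # 1
print(list(with_yield()))     # [1, 2]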
pipelines.py stores the crawled data in the database; here we use MongoDB through pymongo:

import pymongo

class BaidumuzhiPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        self.post = mydata['qandaLast']

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item
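After a crawl finishes, a quick sanity check on the stored data might look like this (a hypothetical snippet, reusing the connection details above):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
qanda = client['mydata']['qandaLast']
print(qanda.count_documents({}))    # how many Q&A pairs were stored
print(qanda.find_one())             # inspect one stored document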
Set the request header in settings.py by copying the User-Agent from your browser. The site does not block the crawler, so DOWNLOAD_DELAY is not set. The Scrapy documentation (http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html) covers everything else. That is basically all there is to it; start MongoDB and crawl away!
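For completeness, the relevant settings.py entries might look like the sketch below. The User-Agent value is a placeholder to be replaced with the string copied from your browser, and ITEM_PIPELINES must reference the pipeline class above or items will never reach MongoDB; the pipeline path assumes the project is named baidumuzhi, as the imports suggest.

# settings.py (excerpt)
USER_AGENT = 'Mozilla/5.0 ...'    # placeholder: paste the real string from your browser

ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# DOWNLOAD_DELAY = 1    # left unset, since the site did not block the crawler

The crawl is then started from the project root with scrapy crawl knowledge.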

Recording this in a blog post so I don't forget it.

A freshman in the information security major at Northeastern University who loves English and algorithms. There is still a lot to learn on the road ahead; if the code is imperfect, please point it out in the comments.
