Crawler learning: an introduction to the Scrapy framework
The pages crawled are question-and-answer pairs from Baidu Muzhi ([http://muzhi.baidu.com]), using the Scrapy crawler framework. A doctor's page displays at most 760 answered questions, so only those can be crawled. First, open a cmd command line, use `cd` to move to the desired path, and run `scrapy startproject projectname` there to create a crawler project. I opened the project folder in VS Code and created a knowledge.py file under the spiders folder; it holds the crawler logic.
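For orientation, the layout that `scrapy startproject` generates looks roughly like this (assuming the project is named `baidumuzhi`, to match the imports in the spider code; knowledge.py is the file we add by hand):

```
baidumuzhi/
├── scrapy.cfg
└── baidumuzhi/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── knowledge.py   <- our spider goes here
```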
```python
import json

from scrapy.http import Request
from scrapy.spiders import CrawlSpider

from baidumuzhi.items import BaidumuzhiItem

# UIDs of the doctors to crawl (test values)
uids = ['000000', '000000', '000000', '000000', '000000', '000000', '000000',
        '123', '123', '123', '123', '123', '123', '123', '123', '123', '123',
        '123', '63320665', '000000', '000000', '000000', '000000']


class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=3450738847']

    def parse(self, response):
        site = json.loads(response.text)
        targets = site['data']['list']
        num_of_page = site['data']['total'] // 10 + 1
        if num_of_page > 76:  # at most 760 answers (76 pages of 10) are visible
            num_of_page = 76
        for target in targets:
            item = BaidumuzhiItem()
            item['qid'] = target['qid']
            item['title'] = target['title']
            item['createTime'] = target['createTime']  # field name matches items.py
            item['answer'] = target['answer']
            yield item
        for uid in uids:
            urls = ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
                    for i in range(num_of_page)]
            for url in urls:
                yield Request(url, callback=self.parse)
```
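The paging arithmetic above caps the crawl at 76 pages of 10 answers each, matching the 760-answer limit mentioned earlier. A standalone sketch of that calculation and of the URL construction (the uid `'123'` here is just a placeholder, not a real doctor id):

```python
# Sketch of the spider's pagination logic: 10 answers per page,
# capped at 76 pages because at most 760 answers are visible per doctor.
def page_count(total_answers, per_page=10, max_pages=76):
    pages = total_answers // per_page + 1
    return min(pages, max_pages)


def answer_urls(uid, total_answers):
    # Same URL template the spider uses; pn is the answer offset.
    return ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
            for i in range(page_count(total_answers))]


print(page_count(95))    # 10: ten pages cover 95 answers
print(page_count(2000))  # 76: capped at the 760-answer limit
print(answer_urls('123', 25)[2])
```

Note that `page_count` keeps the original `// 10 + 1` formula, so an exact multiple of 10 requests one extra (empty) page; the real API simply returns an empty list for it.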
`yield` differs from `return`: the method becomes a generator and keeps producing items and requests instead of exiting on the first result. The items.py file defines the data class for the fields to extract; here we extract the question id, the question title, the time, and the answer. The code is as follows:

```python
import scrapy


class BaidumuzhiItem(scrapy.Item):
    qid = scrapy.Field()
    title = scrapy.Field()
    createTime = scrapy.Field()
    answer = scrapy.Field()
```
pipelines.py defines how the crawled data is written into a database; here we use a MongoDB database:

```python
import pymongo


class BaidumuzhiPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        self.post = mydata['qandaLast']

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))  # store a copy of the item as a document
        return item
```
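The pipeline pattern itself can be exercised without a running MongoDB instance by injecting a stand-in collection. `FakeCollection` and `QandaPipeline` below are illustrative names I made up; only the `process_item` shape comes from the pipeline above:

```python
class FakeCollection:
    """Stand-in for a pymongo collection, for illustration only."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class QandaPipeline:
    # Same shape as BaidumuzhiPipeline, but the collection is injected
    # so the logic can be tested without a database.
    def __init__(self, collection):
        self.post = collection

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))  # dict() copies, so later mutation is safe
        return item


coll = FakeCollection()
pipeline = QandaPipeline(coll)
item = {'qid': '1', 'title': 'q?', 'createTime': 0, 'answer': 'a'}
pipeline.process_item(item, spider=None)
print(coll.docs[0]['title'])  # q?
```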
Set the request headers in settings.py, copying the User-Agent from your browser. The site did not block the crawler, so DOWNLOAD_DELAY is not set. For anything beyond this, see the Scrapy documentation (http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html). That's basically everything; start MongoDB and let it crawl!
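For reference, the relevant settings.py entries might look like this; the User-Agent string is an example, not the value from the original project, and the pipeline path assumes the project is named `baidumuzhi`:

```python
# settings.py (fragment)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'  # copy yours from the browser

ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# DOWNLOAD_DELAY is left unset because the site did not block the crawler,
# but something like DOWNLOAD_DELAY = 0.5 is a polite default.
```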
I'm recording this in a blog post so that I don't forget it.
/A freshman in the information security major at Northeastern University who loves English, persistence, and algorithms. There is still a lot to learn on the way forward; if the code is imperfect, please point it out in the comments./