It took me a week and a lot of pitfalls to finally get this working. Here is a record of the successful debugging process:
1. First create the Douban crawler project from the command line in cmd:

scrapy startproject douban
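For orientation, startproject generates the standard Scrapy layout below (reproduced from memory, so minor details may differ by Scrapy version); the files edited in the following steps all live in it:

douban/
    scrapy.cfg              # project configuration file
    douban/
        __init__.py
        items.py            # field definitions (step 1 below)
        pipelines.py        # item processing (step 4 below)
        settings.py         # headers and MongoDB config (step 3 below)
        spiders/
            __init__.py     # bookspider.py is added here (step 2 below)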
2. I use PyCharm; after importing the project:

1) Define the fields to crawl in items.py.

The items.py code is as follows:
# -*- coding: utf-8 -*-
import scrapy


class DoubanBookItem(scrapy.Item):
    name = scrapy.Field()          # book title
    price = scrapy.Field()         # price
    edition_year = scrapy.Field()  # publication year
    publisher = scrapy.Field()     # publisher
    ratings = scrapy.Field()       # rating
    author = scrapy.Field()        # author
    content = scrapy.Field()       # raw info line, parsed later in the pipeline
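As a quick aside, a scrapy.Item behaves like a dict restricted to its declared fields; a short shell check with made-up values:

from douban.items import DoubanBookItem

book = DoubanBookItem(name='小王子')  # fields can be set at construction...
book['price'] = '22.00元'             # ...or assigned dict-style afterwards
print(dict(book))                     # {'name': '小王子', 'price': '22.00元'}
# book['isbn'] = '...' would raise KeyError: only declared fields are accepted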
2) Create a new spider bookspider.py under the spiders folder; the code that crawls the pages lives here.

Here is the bookspider.py code:
# -*- coding:utf-8 -*-
import scrapy
from douban.items import DoubanBookItem


class BookSpider(scrapy.Spider):
    name = 'douban-book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        # request the first page
        yield scrapy.Request(response.url, callback=self.parse_next)

        # request the remaining pages
        for page in response.xpath('//div[@class="paginator"]/a'):
            link = page.xpath('@href').extract()[0]
            yield scrapy.Request(link, callback=self.parse_next)

    def parse_next(self, response):
        for item in response.xpath('//tr[@class="item"]'):
            book = DoubanBookItem()
            book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
            book['content'] = item.xpath('td[2]/p/text()').extract()[0]
            # book_info = item.xpath("td[2]/p[1]/text()").extract()[0]
            # book_info_content = book_info.strip().split(" / ")
            # book["author"] = book_info_content[0]
            # book["publisher"] = book_info_content[-3]
            # book["edition_year"] = book_info_content[-2]
            # book["price"] = book_info_content[-1]
            book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
            yield book
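A small note on the pagination loop: it yields the paginator hrefs as-is, which works because on this page they appear to be absolute URLs. If they were relative, the requests would fail; a defensive variant (my own adjustment, not from the original post) normalizes them with response.urljoin:

# defensive version of the pagination loop; same behavior when hrefs are already absolute
for page in response.xpath('//div[@class="paginator"]/a'):
    link = response.urljoin(page.xpath('@href').extract_first())
    yield scrapy.Request(link, callback=self.parse_next)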
3) Configure the request headers and the MongoDB information in the settings.py file:
from faker import Factory

f = Factory.create()
USER_AGENT = f.user_agent()  # faker generates a realistic browser User-Agent

DEFAULT_REQUEST_HEADERS = {
    'Host': 'book.douban.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}

MONGODB_HOST = "127.0.0.1"   # use this address when debugging locally
MONGODB_PORT = 27017         # MongoDB's default port
MONGODB_DBNAME = "jkxy"      # database name
MONGODB_DOCNAME = "Book"     # collection name, the equivalent of a table
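One step the write-up never shows explicitly: Scrapy only invokes a pipeline that is registered in settings.py. Assuming the default module layout that startproject generates, the registration would look like this (the number is an ordering priority between 0 and 1000):

ITEM_PIPELINES = {
    'douban.pipelines.DoubanBookPipeline': 300,
}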
4) Write the item-processing code in pipelines.py:
# -*- coding: utf-8 -*-
# "from scrapy.conf import settings" is deprecated; import the settings like this instead:
from scrapy.utils.project import get_project_settings
import pymongo  # MongoDB driver

# the MongoDB host, port, etc. are configured in settings.py, so load that file
settings = get_project_settings()


class DoubanBookPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]      # read the host address from settings
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)  # create a MongoClient instance
        tdb = client[dbname]                                # the "jkxy" database
        self.post = tdb[settings["MONGODB_DOCNAME"]]        # the "Book" collection, the equivalent of a table

    def process_item(self, item, spider):
        # example content: [法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元
        info = item['content'].split(' / ')
        item['price'] = info[-1]
        item['edition_year'] = info[-2]
        item['publisher'] = info[-3]
        bookinfo = dict(item)       # convert the scraped item into a plain dict
        self.post.insert(bookinfo)  # insert it into MongoDB (newer pymongo versions use insert_one)
        return item
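To make the negative indexing concrete, here is how the split behaves on the sample content string from the comment (a quick shell check). The trailing elements are fixed while the number of author/translator entries varies, which is why the code counts from the end:

info = '[法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元'.split(' / ')
print(info[-1])  # 22.00元           -> price
print(info[-2])  # 2003-8            -> edition_year
print(info[-3])  # 人民文学出版社     -> publisher
print(info[0])   # [法] 圣埃克苏佩里  -> author (plus any translators in between)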
3. In cmd, change into the project's spiders directory and run the finished crawler with: scrapy runspider bookspider.py
That's it!
The following is a screenshot of the data crawled into MongoDB (not reproduced here).
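Since the screenshot doesn't carry over, a quick way to verify the crawl from a Python shell is to query the collection directly; a minimal sketch, assuming the settings above and a reasonably recent pymongo (count_documents needs 3.7+):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
book = client['jkxy']['Book']       # database and collection from settings.py
print(book.count_documents({}))     # expect 250 after a full crawl of the Top 250
for doc in book.find().limit(3):    # peek at a few stored documents
    print(doc['name'], doc['ratings'], doc['price'])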
Lessons learned:
1. PyCharm cannot create a Scrapy project directly; it has to be created in cmd with scrapy startproject douban.
2. Similarly, without extra configuration, simply clicking Run in PyCharm is useless. At first I didn't know this: I clicked Run, it reported success, but nothing appeared in the database. I later found out online that the scrapy command has to be used in cmd. Some articles say PyCharm can be set up to run the crawler directly, but I never got that to work (see the launcher sketch after this list for one common approach).
3. Articles online say to cd into the project directory and run scrapy crawl <spider name>; I tried scrapy crawl douban in cmd many times without success.
Suspecting the command was wrong, I ran scrapy -h to see what commands were available: there was no crawl command listed, but there was a runspider command, so I changed into the spiders directory, ran bookspider.py directly with it, and finally succeeded. (In hindsight, two things were at play: scrapy crawl expects the spider's name attribute, which is 'douban-book' here, not the project name 'douban'; and crawl is a project-only command, so scrapy -h only lists it when run from inside the project directory.)
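On point 2 above: a pattern I have seen used to run a Scrapy spider straight from PyCharm is a small launcher script in the project root that starts Scrapy programmatically (the file name main.py is my own choice, not from the original post):

# main.py -- lets PyCharm's Run button start the crawl instead of cmd
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py (headers, pipeline, MongoDB config)
process.crawl('douban-book')  # the spider's `name` attribute, not the project name
process.start()               # blocks until the crawl finishes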
A full record of crawling the Douban Book Top 250 into MongoDB with Python.