Python: crawling the Douban Top 250 books into MongoDB, a full record

Source: Internet
Author: User
Tags: xpath

It took a week and a number of pitfalls to finally get this working. The process is recorded below:

1. First, create the Douban crawler project from the command line (cmd):

scrapy startproject douban
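For reference, the command above should generate a project skeleton roughly like this (everything except the project name is Scrapy's default layout):

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py         # fields to crawl (step 1 below)
        pipelines.py     # MongoDB writing (step 4 below)
        settings.py      # headers and MongoDB config (step 3 below)
        spiders/
            __init__.py  # bookspider.py is added here (step 2 below)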

2. I use PyCharm; after importing the project:

1) Define the fields to crawl in items.py.

The items.py code is as follows:

# -*- coding: utf-8 -*-
import scrapy


class DoubanBookItem(scrapy.Item):
    name = scrapy.Field()            # book title
    price = scrapy.Field()           # price
    edition_year = scrapy.Field()    # publication year
    publisher = scrapy.Field()       # publisher
    ratings = scrapy.Field()         # rating
    author = scrapy.Field()          # author
    content = scrapy.Field()

  

2) Create a new spider, bookspider.py, under the spiders folder; the code that crawls the pages goes here.

Here is the bookspider.py crawler code:

# -*- coding:utf-8 -*-
import scrapy
from douban.items import DoubanBookItem


class BookSpider(scrapy.Spider):
    name = 'douban-book'
    allowed_domains = ['douban.com']
    start_urls = [
        'https://book.douban.com/top250'
    ]

    def parse(self, response):
        # request the first page
        yield scrapy.Request(response.url, callback=self.parse_next)
        # request the other pages
        for page in response.xpath('//div[@class="paginator"]/a'):
            link = page.xpath('@href').extract()[0]
            yield scrapy.Request(link, callback=self.parse_next)

    def parse_next(self, response):
        for item in response.xpath('//tr[@class="item"]'):
            book = DoubanBookItem()
            book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
            book['content'] = item.xpath('td[2]/p/text()').extract()[0]
            # book_info = item.xpath("td[2]/p[1]/text()").extract()[0]
            # book_info_content = book_info.strip().split(" / ")
            # book["author"] = book_info_content[0]
            # book["publisher"] = book_info_content[-3]
            # book["edition_year"] = book_info_content[-2]
            # book["price"] = book_info_content[-1]
            book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
            yield book

3) Configure the request header and MongoDB information in the settings.py file

from faker import Factory

f = Factory.create()
USER_AGENT = f.user_agent()

DEFAULT_REQUEST_HEADERS = {
    'Host': 'book.douban.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}

MONGODB_HOST = "127.0.0.1"     # use this address when debugging on the local machine
MONGODB_PORT = 27017           # default MongoDB port
MONGODB_DBNAME = "jkxy"        # database name
MONGODB_DOCNAME = "Book"       # collection name, the equivalent of a table name
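One detail the snippet above does not show: Scrapy only calls a pipeline class if it is registered in settings.py. A minimal sketch, assuming the project module is named douban and the pipeline class from step 4 below is DoubanBookPipeline:

ITEM_PIPELINES = {
    'douban.pipelines.DoubanBookPipeline': 300,   # lower numbers run earlier if several pipelines are enabled
}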

  

4) Write the processing code in pipelines.py.

# -*- coding: utf-8 -*-
# "from scrapy.conf import settings" is deprecated; import the settings as shown below instead
from scrapy.utils.project import get_project_settings  # the MongoDB host, port, etc. are configured in settings, so load that file here
import pymongo  # MongoDB connection module

settings = get_project_settings()


class DoubanBookPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]      # read the host address from settings
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)  # create a MongoClient instance
        tdb = client[dbname]                                # use the "jkxy" database (dbname = "jkxy")
        self.post = tdb[settings["MONGODB_DOCNAME"]]        # use the "Book" collection, the equivalent of a table

    def process_item(self, item, spider):
        info = item['content'].split(' / ')  # e.g. [法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元
        item['name'] = item['name']
        item['price'] = info[-1]
        item['edition_year'] = info[-2]
        item['publisher'] = info[-3]
        bookinfo = dict(item)                # turn the scraped item into a dict
        self.post.insert(bookinfo)           # insert the scraped data into MongoDB
        return item

  

3. In cmd, change into the project's spiders directory and run scrapy runspider bookspider.py to start the crawler.
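For reference, a minimal sketch of those commands (the paths assume the project was created with the startproject command above):

cd douban\douban\spiders
scrapy runspider bookspider.py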

That's it!

Below is a screenshot of the data crawled into MongoDB (not reproduced here).
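As a quick substitute for the screenshot, a minimal sketch of checking the result from a Python shell (assuming the MongoDB settings above, and the older pymongo API used in the pipeline):

import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)   # same host/port as in settings.py
collection = client["jkxy"]["Book"]                # database "jkxy", collection "Book"

print(collection.count())                          # number of books stored (ideally 250)
for doc in collection.find().limit(3):             # peek at a few stored documents
    print(doc["name"], doc["ratings"], doc["price"])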

Reflections:

1. PyCharm cannot create a Scrapy project directly; it has to be created in cmd with scrapy startproject douban.

2. Similarly, clicking Run in PyCharm does nothing useful unless it is configured for Scrapy. (At first I did not know this: I clicked Run, it reported success, but nothing appeared in the database. I later found out online that the crawler has to be launched with the scrapy command in cmd.) Some articles say PyCharm can be set up to run Scrapy directly, but I did not manage to get that to work.

3. Articles online say to first cd into the project directory in cmd and run scrapy crawl followed by a name. I tried scrapy crawl Douban from the cmd directory many times, but it never worked.

Suspecting the crawl command itself was the problem, I ran scrapy -h in cmd to see which commands were available. There was no crawl command listed, but there was a runspider command, so I went into the spiders directory, ran bookspider.py directly, and it finally worked.
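A likely explanation in hindsight: scrapy crawl takes the spider's name attribute ('douban-book' in bookspider.py), not the project name, and crawl is only listed by scrapy -h when it is run from inside the project directory (the one containing scrapy.cfg). So, assuming the layout above, this should also work:

cd douban
scrapy crawl douban-book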
