It took me a week and a lot of pitfalls to finally get this working. Here is a record of the successful debugging process:
1. First create the Douban crawler project from the command line in cmd:

scrapy startproject douban
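For orientation, startproject generates the standard Scrapy layout below (reproduced from memory, so minor details may differ by Scrapy version); the files edited in the following steps all live in it:

douban/
    scrapy.cfg              # project configuration file
    douban/
        __init__.py
        items.py            # field definitions (step 1 below)
        pipelines.py        # item processing (step 4 below)
        settings.py         # headers and MongoDB config (step 3 below)
        spiders/
            __init__.py     # bookspider.py is added here (step 2 below)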
2. I use PyCharm; after importing the project:

1) Define the fields to crawl in items.py.

The items.py code is as follows:
# -*- coding: utf-8 -*-
import scrapy


class DoubanBookItem(scrapy.Item):
    name = scrapy.Field()          # book title
    price = scrapy.Field()         # price
    edition_year = scrapy.Field()  # publication year
    publisher = scrapy.Field()     # publisher
    ratings = scrapy.Field()       # rating
    author = scrapy.Field()        # author
    content = scrapy.Field()       # raw info line, parsed later in the pipeline
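As a quick aside, a scrapy.Item behaves like a dict restricted to its declared fields; a short shell check with made-up values:

from douban.items import DoubanBookItem

book = DoubanBookItem(name='小王子')  # fields can be set at construction...
book['price'] = '22.00元'             # ...or assigned dict-style afterwards
print(dict(book))                     # {'name': '小王子', 'price': '22.00元'}
# book['isbn'] = '...' would raise KeyError: only declared fields are accepted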
2) Create a new spider bookspider.py under the spiders folder; the code that crawls the pages lives here.

Here is the bookspider.py code:
# -*- coding:utf-8 -*-
import scrapy
from douban.items import DoubanBookItem


class BookSpider(scrapy.Spider):
    name = 'douban-book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        # request the first page
        yield scrapy.Request(response.url, callback=self.parse_next)

        # request the remaining pages
        for page in response.xpath('//div[@class="paginator"]/a'):
            link = page.xpath('@href').extract()[0]
            yield scrapy.Request(link, callback=self.parse_next)

    def parse_next(self, response):
        for item in response.xpath('//tr[@class="item"]'):
            book = DoubanBookItem()
            book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
            book['content'] = item.xpath('td[2]/p/text()').extract()[0]
            # book_info = item.xpath("td[2]/p[1]/text()").extract()[0]
            # book_info_content = book_info.strip().split(" / ")
            # book["author"] = book_info_content[0]
            # book["publisher"] = book_info_content[-3]
            # book["edition_year"] = book_info_content[-2]
            # book["price"] = book_info_content[-1]
            book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
            yield book
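A small note on the pagination loop: it yields the paginator hrefs as-is, which works because on this page they appear to be absolute URLs. If they were relative, the requests would fail; a defensive variant (my own adjustment, not from the original post) normalizes them with response.urljoin:

# defensive version of the pagination loop; same behavior when hrefs are already absolute
for page in response.xpath('//div[@class="paginator"]/a'):
    link = response.urljoin(page.xpath('@href').extract_first())
    yield scrapy.Request(link, callback=self.parse_next)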
3) Configure the request headers and the MongoDB information in the settings.py file:
from faker import Factory

f = Factory.create()
USER_AGENT = f.user_agent()  # faker generates a realistic browser User-Agent

DEFAULT_REQUEST_HEADERS = {
    'Host': 'book.douban.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}

MONGODB_HOST = "127.0.0.1"   # use this address when debugging locally
MONGODB_PORT = 27017         # MongoDB's default port
MONGODB_DBNAME = "jkxy"      # database name
MONGODB_DOCNAME = "Book"     # collection name, the equivalent of a table
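One step the write-up never shows explicitly: Scrapy only invokes a pipeline that is registered in settings.py. Assuming the default module layout that startproject generates, the registration would look like this (the number is an ordering priority between 0 and 1000):

ITEM_PIPELINES = {
    'douban.pipelines.DoubanBookPipeline': 300,
}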
4) Write the item-processing code in pipelines.py:
# -*- coding: utf-8 -*-
# "from scrapy.conf import settings" is deprecated; import the settings like this instead:
from scrapy.utils.project import get_project_settings
import pymongo  # MongoDB driver

# the MongoDB host, port, etc. are configured in settings.py, so load that file
settings = get_project_settings()


class DoubanBookPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]      # read the host address from settings
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)  # create a MongoClient instance
        tdb = client[dbname]                                # the "jkxy" database
        self.post = tdb[settings["MONGODB_DOCNAME"]]        # the "Book" collection, the equivalent of a table

    def process_item(self, item, spider):
        # example content: [法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元
        info = item['content'].split(' / ')
        item['price'] = info[-1]
        item['edition_year'] = info[-2]
        item['publisher'] = info[-3]
        bookinfo = dict(item)       # convert the scraped item into a plain dict
        self.post.insert(bookinfo)  # insert it into MongoDB (newer pymongo versions use insert_one)
        return item
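To make the negative indexing concrete, here is how the split behaves on the sample content string from the comment (a quick shell check). The trailing elements are fixed while the number of author/translator entries varies, which is why the code counts from the end:

info = '[法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元'.split(' / ')
print(info[-1])  # 22.00元           -> price
print(info[-2])  # 2003-8            -> edition_year
print(info[-3])  # 人民文学出版社     -> publisher
print(info[0])   # [法] 圣埃克苏佩里  -> author (plus any translators in between)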
3. In cmd, change into the project's spiders directory and run the finished crawler with: scrapy runspider bookspider.py
That's it!
The following is a screenshot of the data crawled into MongoDB (not reproduced here).
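Since the screenshot doesn't carry over, a quick way to verify the crawl from a Python shell is to query the collection directly; a minimal sketch, assuming the settings above and a reasonably recent pymongo (count_documents needs 3.7+):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
book = client['jkxy']['Book']       # database and collection from settings.py
print(book.count_documents({}))     # expect 250 after a full crawl of the Top 250
for doc in book.find().limit(3):    # peek at a few stored documents
    print(doc['name'], doc['ratings'], doc['price'])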
Lessons learned:
1. PyCharm cannot create a Scrapy project directly; it has to be created in cmd with scrapy startproject douban.
2. Similarly, without extra configuration, simply clicking Run in PyCharm is useless. At first I didn't know this: I clicked Run, it reported success, but nothing appeared in the database. I later found out online that the scrapy command has to be used in cmd. Some articles say PyCharm can be set up to run the crawler directly, but I never got that to work (see the launcher sketch after this list for one common approach).
3. Articles online say to cd into the project directory and run scrapy crawl <spider name>; I tried scrapy crawl douban in cmd many times without success.
Suspecting the command was wrong, I ran scrapy -h to see what commands were available: there was no crawl command listed, but there was a runspider command, so I changed into the spiders directory, ran bookspider.py directly with it, and finally succeeded. (In hindsight, two things were at play: scrapy crawl expects the spider's name attribute, which is 'douban-book' here, not the project name 'douban'; and crawl is a project-only command, so scrapy -h only lists it when run from inside the project directory.)
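On point 2 above: a pattern I have seen used to run a Scrapy spider straight from PyCharm is a small launcher script in the project root that starts Scrapy programmatically (the file name main.py is my own choice, not from the original post):

# main.py -- lets PyCharm's Run button start the crawl instead of cmd
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py (headers, pipeline, MongoDB config)
process.crawl('douban-book')  # the spider's `name` attribute, not the project name
process.start()               # blocks until the crawl finishes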
A full record of crawling the Douban Book Top 250 into MongoDB with Python.