First of all, I wish everyone a good start back to work!
This article starts from a single user, captures that user's details, discovers new users through the followee list, and stores the captured results in MongoDB.
1 Environment requirements
The base environment is the same as in the previous article; we just add MongoDB (a non-relational database) and pymongo (Python's MongoDB client library). I assume everyone has already installed and started the MongoDB service.
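As a quick way to confirm that the service and pymongo are working, the minimal sketch below simply connects to a local MongoDB instance; the URI here is an assumption and should match your own setup:

import pymongo

# Assumes MongoDB is listening on the default local port; adjust the URI if needed
client = pymongo.MongoClient('mongodb://localhost:27017')
print(client.server_info()['version'])  # raises an error if the MongoDB service is unreachable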
Project creation, spider creation, and disabling the ROBOTSTXT_OBEY setting are the same as before (refer to the previous article).
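For reference, disabling the robots.txt check is a single line in settings.py:

# settings.py
ROBOTSTXT_OBEY = False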
2 Testing the crawler
Let's write a simple spider here that crawls a user's followee count and follower count; the code is as follows:
# -*- coding: utf-8 -*-
import scrapy


class ZhuHuSpider(scrapy.Spider):
    """
    Zhihu crawler
    """
    name = 'zhuhu'
    allowed_domains = ['zhihu.com']
    start_urls = ['https://www.zhihu.com/people/wo-he-shui-jiu-xing/following']

    def parse(self, response):
        # Number of people he follows
        tnum = response.css("strong.NumberBoard-itemValue::text").extract()[0]
        # Number of his followers
        fnum = response.css("strong.NumberBoard-itemValue::text").extract()[1]
        print("Number of people he follows: %s" % tnum)
        print("Number of his followers: %s" % fnum)
The result of running it in PyCharm is as follows:
We get a 500 error, so let's add headers and try again. We set them directly in settings.py, as follows:
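The exact headers I used are shown in the screenshot; as a minimal sketch, a browser-style User-Agent in settings.py is usually enough to get past the 500 response (the User-Agent string below is only an example):

# settings.py
DEFAULT_REQUEST_HEADERS = {
    # Example browser User-Agent; copy the one from your own browser if you prefer
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}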
Run again to see the results:
This time we get the information we need, as expected.
3 Crawl Analysis
Let's use the satoshi_nakamoto homepage as the entry point for analysis; the homepage is as follows:
https://www.zhihu.com/people/satoshi_nakamoto/following
The followee list we want to analyze looks like this:
Hovering the mouse over a user's avatar shows that user's details, as follows:
Note that I am using Firefox here; select the Network tab and filter by XHR to capture this information.
The core of Ajax is the XMLHttpRequest object (XHR for short), a feature first introduced by Microsoft that other browser vendors later implemented as well. XHR provides a clean interface for sending requests to the server and parsing its responses, so additional information can be fetched from the server asynchronously; in other words, the user can click to get new data without refreshing the page.
From the requests above we can obtain the following links:
# User detail information
https://www.zhihu.com/api/v4/members/li-kang-65?include=allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics
https://www.zhihu.com/api/v4/members/jin-xiao-94-7?include=allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics
# Followee information
https://www.zhihu.com/api/v4/members/satoshi_nakamoto/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20
By analyzing the links above, we can see:
1. The user detail link has the form https://www.zhihu.com/api/v4/members/{user}?include={include}, where user is the url_token and include is allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics.
2. The followee list link has the form https://www.zhihu.com/api/v4/members/satoshi_nakamoto/followees?include={include}&offset={offset}&limit={limit}, where include is data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics, offset is the paging offset, and limit is the number of users per page. This can be seen from the following requests (a short sketch of how the offset advances follows the page captions):
First page
Second page
Third page
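A minimal sketch of the paging, assuming the offset simply advances by limit (20) per page, as the three requests above suggest; the include value is the one listed earlier:

# Followee list URL template; only offset changes from page to page
follow_url = ('https://www.zhihu.com/api/v4/members/satoshi_nakamoto/followees'
              '?include={include}&offset={offset}&limit={limit}')
include = ('data[*].answer_count,articles_count,gender,follower_count,'
           'is_followed,is_following,badge[?(type=best_answerer)].topics')

# Page 1 -> offset=0, page 2 -> offset=20, page 3 -> offset=40
for page in range(3):
    print(follow_url.format(include=include, offset=page * 20, limit=20))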
4 Starting the crawl
Let's write a simple spider first and get the functionality working; the code is as follows:
# -*- coding: utf-8 -*-
import scrapy


class ZhuHuSpider(scrapy.Spider):
    """
    Zhihu crawler
    """
    name = 'zhuhu'
    allowed_domains = ['zhihu.com']

    # User detail address
    user_detail = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    # include parameter for user details
    user_include = ('allow_message,is_followed,is_following,'
                    'is_org,is_blocking,employments,'
                    'answer_count,follower_count,'
                    'articles_count,gender,'
                    'badge[?(type=best_answerer)].topics')

    # Followee list address
    follow_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    # include parameter for the followee list
    follow_include = ('data[*].answer_count,articles_count,gender,'
                      'follower_count,is_followed,is_following,'
                      'badge[?(type=best_answerer)].topics')

    # Initial user
    start_user = 'satoshi_nakamoto'

    def start_requests(self):
        # Override the start_requests method; note the use of yield here
        yield scrapy.Request(self.user_detail.format(user=self.start_user, include=self.user_include),
                             callback=self.parse_user)
        yield scrapy.Request(self.follow_url.format(user=self.start_user, include=self.follow_include,
                                                    offset=20, limit=20),
                             callback=self.parse_follow)

    def parse_user(self, response):
        print('user:%s' % response.text)

    def parse_follow(self, response):
        print('follow:%s' % response.text)
The output results are as follows:
Note that the authorization information must be added to the headers, otherwise the requests will fail. The authorization entry in the headers looks like this:
Testing shows that the authorization value does not change for a period of time; whether it is permanent remains to be verified.
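As a sketch, the authorization value copied from the browser's XHR request headers can be added next to the User-Agent in settings.py (the token below is a placeholder, not a real value):

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    # Placeholder: copy the real value from the authorization header shown in the browser
    'authorization': 'oauth <token copied from the browser>',
}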
5 Writing parse_user
The parse_user method parses a user's detailed data, stores it, discovers the user's followee list, and hands that list to the parse_follow method for processing. The detail fields to store are as follows:
For convenience I add all the fields to items.py (if the spider errors out saying a field is not found, just add that field), as follows:
import scrapy


class UserItem(scrapy.Item):
    """
    Defines the fields of the JSON in the response body
    """
    is_followed = scrapy.Field()
    avatar_url_template = scrapy.Field()
    user_type = scrapy.Field()
    answer_count = scrapy.Field()
    is_following = scrapy.Field()
    url = scrapy.Field()
    type = scrapy.Field()
    url_token = scrapy.Field()
    id = scrapy.Field()
    allow_message = scrapy.Field()
    articles_count = scrapy.Field()
    is_blocking = scrapy.Field()
    name = scrapy.Field()
    headline = scrapy.Field()
    gender = scrapy.Field()
    avatar_url = scrapy.Field()
    follower_count = scrapy.Field()
    is_org = scrapy.Field()
    employments = scrapy.Field()
    badge = scrapy.Field()
    is_advertiser = scrapy.Field()
The parse_user method code is as follows:
def parse_user(self, response):
    """
    Parse the user's detail information
    :param response: the fetched content, converted to JSON format
    """
    # Convert to JSON with json.loads (requires `import json` at the top of the spider file)
    results = json.loads(response.text)
    # Instantiate the item class (UserItem must be imported from items.py)
    item = UserItem()
    # Loop over the item fields and, if a field exists in the result, store it in the item
    for field in item.fields:
        if field in results.keys():
            item[field] = results.get(field)
    # Return the item directly
    yield item
    # Build the followee URL for this user via format and hand it to parse_follow for parsing
    yield scrapy.Request(self.follow_url.format(user=results.get('url_token'),
                                                include=self.follow_include,
                                                offset=0, limit=20),
                         callback=self.parse_follow)
6 Writing parse_follow
The first thing to do is convert the response to JSON, get the users being followed, and continue crawling each of them; paging also has to be handled. See the following two screenshots:
The rewritten parse_follow method is as follows:
def parse_follow(self, response):
    """
    Parse the followee list
    """
    # Convert the response to JSON
    results = json.loads(response.text)
    # If 'data' exists, keep calling parse_user to parse each user's details
    if 'data' in results.keys():
        for result in results.get('data'):
            yield scrapy.Request(self.user_detail.format(user=result.get('url_token'),
                                                         include=self.user_include),
                                 callback=self.parse_user)
    # If 'paging' exists and is_end is False, crawl the next page; is_end True means this is the last page
    if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
        next_page = results.get('paging').get('next')
        yield scrapy.Request(next_page, callback=self.parse_follow)
After running the crawler, the results are as follows:
You can see that it keeps fetching content continuously.
7 Storing data in MongoDB
7.1 Item Pipeline
To store the data in MongoDB we need to modify the Item Pipeline; following the example on the official site, the modified code is as follows:
import pymongo


class ZhihuSpiderPipeline(object):
    """
    Stores the data in a MongoDB database; refer to the example on the official site
    """
    collection_name = 'user'

    def __init__(self, mongo_uri, mongo_db):
        """
        Initialize parameters
        :param mongo_uri: mongo uri
        :param mongo_db: db name
        """
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Open the connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        # db_auth: my MongoDB has authentication enabled, so these two steps are needed;
        # comment them out if you have not set up authentication
        self.db_auth = self.client.admin
        self.db_auth.authenticate("admin", "password")
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the update method here
        self.db[self.collection_name].update({'url_token': item['url_token']}, dict(item), True)
        return item
The update() method used here updates an existing document. Its syntax is as follows:
db.collection.update(
    <query>,     # the query condition for the update, similar to what follows WHERE in a SQL UPDATE
    <update>,    # the update object and update operators (such as $, $inc, ...), similar to what follows SET in a SQL UPDATE
    {
        upsert: <boolean>,        # optional; if no record matches the query, whether to insert objNew: true inserts, the default is false (do not insert)
        multi: <boolean>,         # optional; defaults to false, updating only the first matching record; if true, all records matching the condition are updated
        writeConcern: <document>  # optional; the level at which exceptions are thrown
    }
)
With the update method, if a document matching the query already exists it is updated; if it does not exist, dict(item) is inserted, which deduplicates the stored users.
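Note that this legacy update() call is deprecated in pymongo 3 and was removed in pymongo 4. A minimal sketch of the same upsert written against the newer API, as a drop-in replacement for process_item in the pipeline above:

    def process_item(self, item, spider):
        # Same upsert with the newer pymongo API: update_one() with upsert=True
        # inserts the document when url_token is not found, otherwise updates it
        self.db[self.collection_name].update_one(
            {'url_token': item['url_token']},
            {'$set': dict(item)},
            upsert=True,
        )
        return item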
7.2 Settings configuration
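The configuration behind the screenshot amounts to enabling the pipeline and telling it where MongoDB lives. A minimal sketch in settings.py, assuming the Scrapy project module is named zhihu (adjust the dotted path, URI, and database name to your own setup):

# settings.py
ITEM_PIPELINES = {
    # Assumed module path; replace 'zhihu' with your own project module name
    'zhihu.pipelines.ZhihuSpiderPipeline': 300,
}

# Read by from_crawler() in the pipeline above
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'zhihu'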
After running the spider again, the results are as follows:
You can also see the data in MongoDB, as follows:
This section references: https://www.cnblogs.com/qcloud1001/p/6744070.html
That's all for this article.
Ops Learning Python: Crawler Advanced (7): Scrapy crawls followed users and stores them in MongoDB