R language crawler practice: crawling Zhihu Live course data
Du Yu, EasyCharts team member and column author for the R Language Chinese Community, with the following interests: Excel business charts, R language data visualization, and geographic information data visualization. Personal public account: Data Cube (ID: datamofang), founder of "Data Cube".
This article is an R language crawler exercise. It uses the httr package to complete the whole data-crawling process, combining cookie login, form submission, and json data packets, with no need to write complicated xpath or css paths, or even complicated regular expressions (although these three skills remain important for data crawling).
I have previously demonstrated how to use httr to crawl the courses on NetEase cloud classroom, using the POST method with form submission.
Today's target is the information on Zhihu Live courses, using the GET method combined with cookie login and parameter submission. This article should give you some useful techniques for handling the details.
library("httr") library("dplyr") library("jsonlite")
library("curl")
library("magrittr")
library("plyr")
library("rlist")
Step 1: Determine the technical framework used by the target webpage:
We can see that the page initially displays only 10 course items and loads more automatically as we scroll down. This is typical asynchronous page loading.
Press F12 in Chrome to open the developer tools. In the All menu you can find an asynchronous-loading link named homefeed?limit=10&offset=10&includes=live (an XHR request). Asynchronous loading links usually appear in the XHR menu, but there are exceptions; in that case you have to search through the All menu (there may be many requests of other media types, so be patient; if you cannot find it, press F5 to refresh and see which request is new).
This link looks like a course list. limit is a parameter limiting the number of records a page displays (exactly 10), and offset is an offset parameter (also 10 by default; that is, every time you scroll down, 10 new course records are loaded, so the offset grows in integer multiples of 10 and each pull-down request adds 10 records). includes specifies the module; here it is live.
Is that really the case? Click this xhr request and look at the preview pane at the lower right to view the details. As it turns out, the information we want is all stored as a json data packet at https://api.zhihu.com/lives/homefeed?limit=10&offset=10&includes=live.
Next, we need to analyze in detail the request method, header parameter settings, Cookie settings, parameters to be submitted, and other information.
Step 2: Construct header information, parameter table information, and cookies
Go to the Headers menu on the right of the developer tools; you will see the following four modules of information:
General:
Request URL: https://api.zhihu.com/lives/homefeed?limit=10&offset=10&includes=live
Request Method: GET
Status Code: 200 OK
Remote Address: 47.95.51.100:443
Referrer Policy: no-referrer-when-downgrade
Response Headers:
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Authorization,Content-Type,X-API-Version
Access-Control-Allow-Methods: GET,PATCH,PUT,POST,DELETE,OPTIONS
Access-Control-Allow-Origin: https://www.zhihu.com
Connection: keep-alive
Content-Encoding: gzip
Content-Type: application/json; charset=utf-8
Date: Wed, 11 Oct 2017 13:24:38 GMT
Etag: W/"0b0bb047bb0eaf6962481a517b4276e48a774d54"
Server: ZWS
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Backend-Server: zhihu-live.liveweb.761d7228---10.4.165.2:31030[10.4.165.2:31030]
X-Req-ID: 10B3371859DE1B96
X-Req-SSL: proto=TLSv1.2,sni=,cipher=ECDHE-RSA-AES256-GCM-SHA384
X-Tracing-Servicename: liveweb
X-Tracing-Spanname: LiveHomeFeedHandler_get
Request Headers:
accept: application/json, text/plain, */*
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4
Authorization: oauth 8274ffb553d%e6a7fdacbc328e205d
Connection: keep-alive
Cookie: _zap="paste your personal Zhihu cookies here"
Host: api.zhihu.com
If-None-Match: W/"ba0517e8bddf8a450ffcda75a507295dc4024786"
Origin: https://www.zhihu.com
Referer: https://www.zhihu.com/lives/
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.20.3.79 Safari/537.36
X-Api-Version: 3.0.63
Query String Parameters:
limit: 10
offset: 10
includes: live
(Note: I have replaced my own login cookies in the Request Headers with a placeholder. After all, Zhihu has sixty or seventy million users now; my account is not a big one, but it matters to me, and login information should not be disclosed casually ~)
OK, perfect. We can now confirm that this request uses the GET method and that the target URL is
https://api.zhihu.com/lives/homefeed?limit=10&offset=10&includes=live
Note the differences between this URL and the one shown in your browser. When you open the Live section in the browser, the address bar shows the ordinary page URL (the Referer above, https://www.zhihu.com/lives/).
However, the asynchronous request actually sent in the background is the following URL with the parameters already attached. Because it is a GET request, this URL can be opened directly in the browser, but since it returns a json page you will only see raw, unrendered text. The URL is as follows:
https://api.zhihu.com/lives/homefeed?limit=10&offset=10&includes=live
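In httr terms, this URL is simply the base endpoint plus a query string. As a small illustration (not part of the crawl itself), httr's modify_url can rebuild it from a parameter list, which is a handy way to check that your list matches what the browser sent:

library(httr)
# rebuild the asynchronous-loading URL from its parts
modify_url("https://api.zhihu.com/lives/homefeed",
           query = list(limit = 10, offset = 10, includes = "live"))
# expected result: "https://api.zhihu.com/lives/homefeed?limit=10&offset=10&includes=live"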
# Construct cookies:
cookie <- 'paste your personal cookies here'
# Construct browser header information (all of it comes from the Request Headers module):
headers <- c(
  'Accept' = 'application/json',
  'Content-Type' = 'application/json; charset=UTF-8',
  'User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.20.3.79 Safari/537.36',
  'Referer' = 'https://www.zhihu.com/lives/',
  'Connection' = 'keep-alive',
  'Cookie' = cookie
)
In the header information above, Accept tells the Zhihu server what format of content I want to receive; Content-Type declares the encoding of the content returned by the server; User-Agent identifies the local browser (important); and Cookie carries the login information.
Header parameters can differ a great deal between web pages. In many cases you simply have to experiment to find out which ones a given page actually requires, but the common parameters above are the ones to focus on.
# Construct the parameter table to submit:
payload <- list(
  'limit' = 10,
  'offset' = 10,
  'includes' = 'live'
)
With the GET method the parameters can also be written directly into the url, but for pages that must be traversed over many pages it is more convenient to keep them in a separate parameter table when building a loop. The parameter table must be a list. Only three parameters are needed here, and they are passed to the query argument of httr's GET function (recall that with the POST method they would go into the form body instead).
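To make the contrast concrete, here is a minimal sketch of how the same parameter list is passed in the two cases. Headers and cookies are omitted for brevity, and the POST endpoint shown is only a placeholder, not a real URL:

library(httr)
payload <- list('limit' = 10, 'offset' = 10, 'includes' = 'live')
# GET: httr appends the list to the URL as a query string
# (equivalent to .../homefeed?limit=10&offset=10&includes=live)
r_get <- GET("https://api.zhihu.com/lives/homefeed", query = payload)
# POST (as in the NetEase cloud classroom case): the same list would be
# placed in the request body instead, e.g. form-encoded
# r_post <- POST("https://example.com/some-endpoint", body = payload, encode = "form")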
Step 3: Try a single request and view the structure of the returned content:
baseurl<-"https://api.zhihu.com/lives/homefeed"
r <- GET(baseurl, add_headers(.headers = headers), query = payload, encode = "json", verbose())
myresult <- r %>% content()
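Before digging into the content, it can help to confirm that the response really succeeded. These two httr helper calls are just a quick sanity check, not part of the original walkthrough:

status_code(r)   # 200 means the request succeeded (matches the General module above)
http_type(r)     # should be "application/json", as in the response headers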
Perfect. There is no problem with the webpage response. Next, check the output content structure:
myresult<-r %>% content() %>% `[[`(2)
The course information we need sits in the second element of the returned list, so the statement above extracts all of the course records, exactly 10 of them.
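To give a feel for what comes next, here is a minimal sketch of flattening those records into a data frame. The field names (item$live, subject, seats$taken, fee$amount) are assumptions about the json structure and should be checked against the actual packet:

library(dplyr)
# helpers that tolerate missing fields in individual records
safe_chr <- function(x) if (is.null(x)) NA_character_ else as.character(x)
safe_num <- function(x) if (is.null(x)) NA_real_ else as.numeric(x)
course_df <- bind_rows(lapply(myresult, function(item) {
  live <- item$live                              # assumed nesting of each record
  data.frame(
    subject     = safe_chr(live$subject),        # course title (assumed field)
    seats_taken = safe_num(live$seats$taken),    # participants (assumed field)
    fee_amount  = safe_num(live$fee$amount),     # course fee (assumed field)
    stringsAsFactors = FALSE
  )
}))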
The biggest difficulty in this case is that we cannot tell the total number of course records in advance, and the backend does not expose a parameter for it... but I have a bold idea.
Since the Query String Parameters module shows a default per-request record count (limit) and a record offset (offset), can I not simply set the limit and offset by hand? For example, set limit to 100 and try offset values of 0, 100, 200; after a few attempts the total can be pinned down.
When I set limit = 200 and offset = 150, no course information comes back, which means there is no data beyond record 150, so the number of courses must be below 150. The page returns the following:
https://api.zhihu.com/lives/homefeed?limit=200&offset=150&includes=live
{"paging": {"is_end": true, "next": "https://api.zhihu.com/lives/homefeed?limit=200&offset=350", "previous": ""}, "data": [], "attached_info": "MiBlODVkMzE1NDYyNGQ0MDY5YTA4OTVhM2FhMGIxZWVhNw=="}
When I set limit = 200 and offset = 100, content comes back normally, so the total number of courses lies between 100 and 150. If we then set the limit high enough and the offset to 0, a single request returns all of the records with no offset at all, and everything lands on one page.
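The same probing can also be done from R instead of the browser. Below is a minimal sketch that wraps the request in a small helper and repeats the two checks just described (it assumes the headers vector built in Step 2):

# a minimal sketch of probing limit/offset from R; assumes the `headers` above
probe <- function(limit, offset) {
  r <- GET("https://api.zhihu.com/lives/homefeed",
           add_headers(.headers = headers),
           query = list(limit = limit, offset = offset, includes = "live"))
  body <- content(r)
  list(records = length(body$data), is_end = isTRUE(body$paging$is_end))
}
probe(200, 150)   # expect 0 records, is_end TRUE  -> fewer than 150 courses
probe(200, 100)   # expect some records            -> more than 100 courses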
Judging by how far the scroll bar moves, this looks feasible. You can also press Ctrl + S to save the page as a json file, so that we can verify the result later.
In the following step, when constructing the payload, set limit to 200 and offset to 0, so that all course data can be fetched in a single request:
payload<-list('limit'=200,'offset'=0,'includes'='live')
baseurl<-"https://api.zhihu.com/lives/homefeed"
r <- GET(baseurl, add_headers(.headers = headers), query = payload, encode = "json", verbose())
myresult <- r %>% content() %>% `[[`(2)
length(myresult)
[1] 144
OK. The length is exactly 144, which falls within the 100 to 150 range.
Next, use the fromJSON function from the jsonlite package to import the saved json file and check whether the manually saved file contains the same number of records as the data we just requested in code.
homefeed <- fromJSON("C:/Users/RAINDU/Desktop/homefeed.json", simplifyVector = FALSE)
length(homefeed$data)
[1] 144
It seems my guess was correct: this limit is only a default per-request cap, not something fixed. Raising it lets us skip the loop in which we would otherwise have to submit the paging parameters page by page.
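If the limit could not have been raised, the usual fallback would be exactly that loop: walk the offset page by page and stitch the batches together. A rough sketch of what it would have looked like here (again assuming the headers above):

# a minimal sketch of paging 10 records at a time -- the loop that raising the limit let us skip
all_courses <- list()
offset <- 0
repeat {
  r <- GET("https://api.zhihu.com/lives/homefeed",
           add_headers(.headers = headers),
           query = list(limit = 10, offset = offset, includes = "live"))
  batch <- content(r)$data
  if (length(batch) == 0) break          # no more records
  all_courses <- c(all_courses, batch)
  offset <- offset + 10
  Sys.sleep(1)                           # be polite to the server
}
length(all_courses)                      # should also come out to 144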
Next, save the json file above. As for content analysis and visualization, we will of course continue with that in the next installment ~
You can save the page as json directly in the browser, or save the fetched content as a local json file with the list.save function from the rlist package.
list.save(homefeed,"C:/Users/RAINDU/Desktop/zhihulive.json")
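If you want to read the saved file back into R later, rlist provides a matching list.load function (jsonlite::fromJSON works just as well, as shown earlier):

zhihulive <- list.load("C:/Users/RAINDU/Desktop/zhihulive.json")   # read the saved list back
length(zhihulive$data)                                             # should again be 144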
If you cannot wait and want to do the data analysis and visualization yourself, you can crawl and extract the content following the steps above. I will also upload this uncleaned data to GitHub so you can download it directly.
For online courses, click the link at the end of the article:
For previous case data, go to GitHub:
https://github.com/ljtyduyu/DataWarehouse/tree/master/File
Related course recommendations
R language crawler case studies: NetEase cloud classroom, Zhihu Live, toutiao.com, and Bilibili videos
Shared content: all the content and cases in this course come from my own experience and notes along the way. I hope to take this opportunity to share my crawler learning experience, make a small contribution to improving the R language crawler ecosystem and promoting its tools, and give a staged summary of my own crawler learning.
☟☟☟ Read the original article and join the course now.