This article shows how to write a Python 3 crawler that scrapes the hot comments on NetEase Cloud Music. The sample code is explained in detail and should be a useful reference for study or work; readers who need it can follow along below.
Objective
I had only just started learning Python crawlers when I went about half a month without writing any Python, and I had nearly forgotten everything. So I decided to write a simple crawler for practice. To my mind, the best features of NetEase Cloud Music are its accurate song recommendations and its distinctive user comments, so I wrote this crawler for the hot comments of the songs on the NetEase Cloud Music hot song chart. I am still a beginner at crawling, so comments and questions are welcome; let's improve together.
Without further ado, here is the detailed walkthrough.
Our goal is to crawl the hot comments of every song on the NetEase Cloud Music hot song chart.
Restricting ourselves to hot comments reduces the amount of data we need to crawl and means we save only high-quality comments.
Implementation analysis
First, we open the web version of NetEase Cloud Music,
click the leaderboard, and then click the Cloud Music hot song chart on the left.
Let's start by opening one song to work out how to crawl the hot comments of a given song. I picked a song I have liked recently as an example:
The comments appear further down the page, and we need to find a way to fetch them.
Next, open the web console. In Chrome, press F12 to open the developer tools; other browsers should be similar.
Select the Network tab, then press F5 to refresh. After the refresh we get the data shown here:
You can see that the browser sends a lot of requests, so which one do we want? We can make a first cut by status code. The status code indicates the outcome of a request to the server: 200 means the request succeeded, while 304 (Not Modified) means the browser reused a cached copy instead of receiving fresh data (there are many other status codes; look them up if you want to know more). So we generally only look at requests with status code 200. In addition, the Preview tab in the right-hand panel gives a rough view of what the server returned (or use the Response tab). Combining these two methods, we can quickly find the request we want to analyze. After some searching, we finally found the request that contains the comments,
The screenshot may not be clear on CSDN: the comments are in a POST request named R_SO_4_489998494?csrf_token=. Here is that request on its own so that it is easier to see:
Basic request information:
Request headers:
Form data in the request:
We can see that the request URL for this song's comments is http://music.163.com/weapi/v1/resource/comments/R_SO_4_489998494?csrf_token=. After trying a few other songs, we found that the first part of the URL is always the same and only the string of digits right after R_SO_4_ differs. We can infer that every song has its own ID and that R_SO_4_ is followed by the ID of the song.
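As a minimal sketch of this URL pattern, the helper below builds the comment URL for a given song ID. The function name comment_url is my own and exists only for illustration; the base URL and the R_SO_4_ prefix are taken from the request observed above.

```python
COMMENT_URL_PREFIX = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_'


def comment_url(song_id):
    """Return the comment request URL for one song (hypothetical helper)."""
    # The song ID goes right after R_SO_4_; the csrf_token value stays empty,
    # exactly as in the request captured in the browser.
    return COMMENT_URL_PREFIX + str(song_id) + '?csrf_token='


print(comment_url(489998494))
```

Accepting either an int or a str for the ID is just a convenience of this sketch; the final script passes the ID as a string.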
Looking at the submitted form data, we find that the form needs two fields, params and encSecKey, each followed by a long string of characters. Trying a few songs shows that params and encSecKey differ from song to song, so both values are probably produced by some encryption algorithm.
The data returned by the server is in JSON format and contains rich information (the commenter, the comment date, the number of likes, the comment content, and so on). The hotComments field holds the hot comments we are looking for, 15 in total:
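To make the structure concrete, here is a small sketch that pulls the comment text out of a response of this shape. The sample JSON is invented and heavily trimmed; the sketch relies only on the hot-comment list (assumed here to be keyed hotComments) and on the content field of each comment, which is the field the final script reads. The function name extract_hot_comments is hypothetical.

```python
import json

# Invented, heavily trimmed sample of a response body. Assumption: real
# responses carry many more fields per comment than shown here.
sample_response = '''
{
  "hotComments": [
    {"content": "first sample comment"},
    {"content": "second sample comment"}
  ]
}
'''


def extract_hot_comments(response_text):
    """Return the text of every hot comment in a JSON response string."""
    data = json.loads(response_text)
    # .get() keeps the sketch from crashing if the key is missing.
    return [item['content'] for item in data.get('hotComments', [])]


for number, text in enumerate(extract_hot_comments(sample_response), start=1):
    print(str(number) + '. ' + text)
```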
At this point the direction is clear: we only need to determine the values of params and encSecKey. But both are encrypted by some algorithm, so what can we do? I noticed a pattern. The number after R_SO_4_ in http://music.163.com/weapi/v1/resource/comments/R_SO_4_489998494?csrf_token= is the ID of the song, and the params and encSecKey values are interchangeable between songs for the same page number. In other words, if we take the two values captured for the first page of some song A and send them with any other song's URL, we get the first page of comments for that other song; the same holds for the second page, the third page, and so on.
Since we only need the 15 hot comments on the first page, we just pick any song, copy the params and encSecKey from its first-page request, and reuse them.
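The reuse trick can be sketched as follows. The code only builds the POST request and does not send it, so it runs without network access. The two form values are placeholders, not real values, the second song ID is made up, and the function name build_comment_request is hypothetical.

```python
import urllib.parse
import urllib.request

# Placeholders: paste the two values copied from the browser's first-page
# comment request here (these strings are NOT real values).
FORM_DATA = {
    'params': 'PASTE_PARAMS_HERE',
    'encSecKey': 'PASTE_ENCSECKEY_HERE',
}


def build_comment_request(song_id):
    """Build (but do not send) the POST request for one song's first page."""
    url = ('http://music.163.com/weapi/v1/resource/comments/R_SO_4_'
           + str(song_id) + '?csrf_token=')
    body = urllib.parse.urlencode(FORM_DATA).encode('utf8')
    # Supplying data= makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body)


# The same form data is reused; only the song ID in the URL changes.
# '123456' is a made-up ID used purely for illustration.
for song_id in ('489998494', '123456'):
    request = build_comment_request(song_id)
    print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request.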
As for how these two parameters are encrypted, there is already an answer on Zhihu for anyone interested (https://www.zhihu.com/question/36081767). Here we just use our lazy approach to get the job done.
So far we have worked out how to crawl the hot comments of a single song. Next we work out how to get the information for all the songs on the Cloud Music hot song chart.
We need the name and the corresponding ID of every song on the chart.
Following steps similar to those above, we first open the URL of the hot song chart,
press F12 to open the developer tools,
and find all the song information for this chart in a GET request named toplist?id=3778678.
Information for this request:
Let's preview the result returned by the request,
The song information is on line 524 of the returned HTML,
so we only need to filter that part out of the HTML returned by the request.
Here we use regular expressions to filter the data.
Looking at its structure, we can extract the song information we need with two passes of regular-expression filtering.
The first pass extracts the line containing the song list from everything returned by the request.
The first regular expression is as follows:
<ul class="f-hide"><li><a href="/song\?id=\d*?">.*</a></li></ul>
The second pass extracts the song information we need from that line. We need each song's name and ID; the corresponding regular expressions are as follows:
Get the song name:
<li><a href="/song\?id=\d*?">(.*?)</a></li>
Get the song ID:
<li><a href="/song\?id=(\d*?)">.*?</a></li>
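The two-pass filtering can be tried out on a small sample before touching the real page. The HTML below is invented to match the shape described above (a ul with class f-hide holding one li per song), and the song names and IDs are made up.

```python
import re

# Invented sample in the shape described above; the real page lists many songs.
sample_html = (
    '<div>other markup</div>'
    '<ul class="f-hide">'
    '<li><a href="/song?id=111">Song A</a></li>'
    '<li><a href="/song?id=222">Song B</a></li>'
    '</ul>'
    '<div>more markup</div>'
)

list_pattern = r'<ul class="f-hide"><li><a href="/song\?id=\d*?">.*</a></li></ul>'
name_pattern = r'<li><a href="/song\?id=\d*?">(.*?)</a></li>'
id_pattern = r'<li><a href="/song\?id=(\d*?)">.*?</a></li>'

# Pass 1: isolate the hidden song list from the rest of the page.
song_list = re.findall(list_pattern, sample_html)[0]
# Pass 2: pull the names and the IDs out of that list.
names = re.findall(name_pattern, song_list)
ids = re.findall(id_pattern, song_list)

for song_id, name in zip(ids, names):
    print(song_id, name)
```

The greedy .* in the first pattern is what lets it span every li up to the closing ul, while the lazy .*? in the second pass stops at the end of each individual song entry.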
That completes the analysis of the whole process; see the code for the specific details.
The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import urllib.request
import urllib.error
import urllib.parse
import json


def get_all_hotsong():
    # Get the name and ID of every song on the hot song chart
    url = 'http://music.163.com/discover/toplist?id=3778678'  # NetEase Cloud Music hot song chart URL
    html = urllib.request.urlopen(url).read().decode('utf8')  # open the URL
    html = str(html)  # convert to str
    pat1 = r'<ul class="f-hide"><li><a href="/song\?id=\d*?">.*</a></li></ul>'  # regex for the first pass
    result = re.compile(pat1).findall(html)  # filter with the regex
    result = result[0]  # take the first match
    pat2 = r'<li><a href="/song\?id=\d*?">(.*?)</a></li>'  # regex for song names
    pat3 = r'<li><a href="/song\?id=(\d*?)">.*?</a></li>'  # regex for song IDs
    hot_song_name = re.compile(pat2).findall(result)  # names of all hot songs
    hot_song_id = re.compile(pat3).findall(result)  # IDs of all hot songs
    return hot_song_name, hot_song_id


def get_hotcomments(hot_song_name, hot_song_id):
    url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_' + hot_song_id + '?csrf_token='  # comment URL
    header = {  # request headers
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    # POST form data: params and encSecKey copied from the first-page comment
    # request of one song (replace them with values captured in your own
    # browser if the server rejects these)
    data = {
        'params': 'zc7fzwbkxxsm6tz3pirjd056g9ightbtc8vjtpbxshkiboapnuyaxkze+kni9qiez/iieyrnzfnztp7yvtfybxolvqp/jdynzw2+'
                  'Grqdg7gror2zjroqoou2z0tnhy+qdhksv8zxonxuf93w3da51addqhb0ingl+v6n8kthdvzezbe0d3esufs8zjltnruj',
        'encSecKey': '4801507e42c326dfc6b50539395a4fe417594f7cf122cf3d061d1447372ba3aa804541a8ae3b3811c081eb0f2b71827850af59af411a10a1795f7a16a'
                     '5189d163bc9f67b3d1907f5e6fac652f7ef66e5a1f12d6949be851fcf4f39a0c2379580a040dc53b306d5c807bf313cc0e8f39bf7d35de691c497cda1'
                     'd436b808549acc',
    }
    postdata = urllib.parse.urlencode(data).encode('utf8')  # encode the form
    request = urllib.request.Request(url, headers=header, data=postdata)
    reponse = urllib.request.urlopen(request).read().decode('utf8')
    json_dict = json.loads(reponse)  # parse the JSON
    hot_commit = json_dict['hotComments']  # hot comments in the JSON
    num = 0
    # Append to the output file; utf-8 so Chinese comments are written correctly
    with open('./song_comments', 'a', encoding='utf-8') as fhandle:
        fhandle.write(hot_song_name + ':' + '\n')
        for item in hot_commit:
            num += 1
            fhandle.write(str(num) + '.' + item['content'] + '\n')
        fhandle.write('\n==============================================\n\n')


hot_song_name, hot_song_id = get_all_hotsong()  # names and IDs of all songs on the chart
num = 0
while num < len(hot_song_name):  # save the hot comments of every song on the chart
    print('Fetching hot comments for song %d ...' % (num + 1))
    get_hotcomments(hot_song_name[num], hot_song_id[num])
    print('Hot comments for song %d fetched successfully' % (num + 1))
    num += 1
Running the code gives the following results:
Compare the comments for the song "If I Love You" on the web page with the comments we saved:
The information matches.
Summary