Share a method to crawl popular comments of NetEase Cloud Music using Python

This article describes in detail, with examples, how to obtain the popular comments of NetEase Cloud Music using Python. It has good reference value; let's take a look.

Recently, I have been studying text mining. As the saying goes, even a clever housewife cannot cook without rice: to analyze text, you first have to obtain it. There are many ways to obtain text, such as downloading existing text corpora from the Internet or fetching data through an API provided by a third party. Sometimes, however, the data we want cannot be obtained directly, because no download channel or API is provided for it. What should we do then? A good option is a web crawler: a computer program that poses as a regular user and requests the data. With the efficiency of a computer, we can obtain data easily and quickly.

How do you write a crawler? Many languages can be used, such as Java, PHP, and Python. I personally prefer Python, because it not only ships with powerful built-in network libraries but also has many excellent third-party libraries; where someone else has already built the wheel, we can use it directly, which makes writing crawlers very convenient. In fact, a small crawler can be written in fewer than 10 lines of Python, while other languages require much more code. Being concise and easy to understand is a huge advantage of Python.
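To make the "fewer than 10 lines" claim concrete, here is a minimal sketch using only the standard library (in Python 3, the urllib and urllib2 libraries mentioned below were merged into urllib.request). The URL and User-Agent string are illustrative placeholders, not values from the article.

```python
import urllib.request

def fetch(url):
    # Send a browser-like User-Agent so the server treats us as a normal user.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", "ignore")

# fetch("http://example.com") would return the page's HTML as text.
```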

Well, enough digression. In recent years, NetEase Cloud Music has been on fire. I have been a user for several years; I used QQ Music and Kugou before, and from my own experience, the best features of NetEase Cloud Music are its precise song recommendations and its unique user comments (solemn declaration!!! This is not sponsored content, not an advertisement!!! Personal opinion only!). Under a song there are often "god-tier" comments with many likes. Moreover, NetEase Cloud Music recently printed selected user comments in subway ads, and its comments caught fire all over again. So I wanted to analyze NetEase Cloud comments and look for patterns, especially the characteristics of the hot comments. With this purpose, I began to crawl the comments.

Python has two built-in network libraries, urllib and urllib2, but they are not very convenient to use, so here we use the well-regarded third-party library requests. With only a few lines of code, requests can handle setting up a proxy, simulating login, and other complicated crawler operations. If pip is installed, run pip install requests to install it. (urllib and urllib2 are also useful; I will introduce them later.)

Before formally introducing the crawler, let's talk about the basic working principle of crawlers. When we open a browser and visit a website, the browser sends a request to the server; after receiving our request, the server returns data based on it, and the browser parses that data and presents it to us. If we use code instead, we skip the browser step: we send the request to the server directly, then take the returned data and extract the information we want.

The problem is that the server sometimes validates the requests it receives; if it decides a request is not legitimate, it returns no data, or wrong data. To avoid this, we sometimes need to disguise the program as a normal user in order to get a response from the server. How do we pretend? The key is the difference between visiting a page through a browser and through a program. Generally, when a browser requests a page, besides the URL it also sends additional information to the server, such as headers. Headers serve as the request's proof of identity: seeing them, the server concludes that we are browsing normally and returns the data. So our program needs to carry the same identifying information a browser would when it sends a request; then we can get the data smoothly.

Sometimes we must also log in to obtain certain data, so we must simulate login. In essence, logging in through a browser means POSTing some form information to the server (username, password, and so on); after the server verifies it, we are logged in. Our program can do exactly the same: whatever the browser posts, we post as-is. I will introduce simulated login in detail later.
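The "post some form information" idea can be sketched as follows. The field names here are hypothetical; real sites use their own names, which you can read from the browser developer tools' Network panel.

```python
from urllib.parse import urlencode

def build_login_body(username, password):
    # A browser login is essentially a POST whose body is a URL-encoded form,
    # e.g. "username=...&password=...". We reproduce that body here.
    return urlencode({"username": username, "password": password})
```

The resulting string would be sent as the POST body with a Content-Type of application/x-www-form-urlencoded, exactly as a browser does.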
Of course, things do not always go so smoothly, because some websites have anti-crawling measures. For example, if your access rate is too high, some sites will block your IP address (Douban does this). In that case we have to use a proxy server, that is, change our IP address: if one IP is blocked, we switch to another. How exactly? These topics will be discussed later.
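The "switch IP when blocked" idea can be sketched with a small rotation helper. The proxy addresses below are placeholders, and the mapping format matches what the requests library expects.

```python
import itertools

def proxy_cycle(proxy_list):
    # Yield the candidate proxies round-robin, forever; on a block or
    # failure, the caller just takes the next one.
    return itertools.cycle(proxy_list)

def as_requests_proxies(proxy_url):
    # The requests library expects a {scheme: proxy URL} mapping.
    return {"http": proxy_url, "https": proxy_url}
```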

Finally, a little trick I find useful when writing crawlers: if you use Firefox or Chrome, you may have noticed the browser's developer tools (open them and look at the Network panel). They are very useful, because they clearly show what information the browser sends and what the server returns when you visit a website, and this information is the key to writing a crawler. You will see how useful they are shortly.

---------------------------------------------------- The official walkthrough starts here ---------------------------------------------------

First, open the web version of NetEase Cloud Music and select a song to open its page. Here I take Jay Chou's "Sunny Day" (晴天) as an example.


Now we have a clear direction: we only need to determine the values of the params and encSecKey parameters. This problem plagued me for a whole afternoon; for a long time I could not figure out how these two parameters are encrypted. But I did find a pattern. In http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token= the number after R_SO_4_ is the song's id, and for the same page number, the params and encSecKey values are interchangeable between songs: the two parameter values taken from page 1 of song A can be passed for any other song B to get page 1 of B's comments, and likewise for page 2, page 3, and so on. Unfortunately, the parameters differ from page to page, so this trick can only capture a limited number of pages (though that is enough to get the total comment count and the popular comments). To capture all the data, you must understand how these two parameter values are encrypted. I thought I would never figure it out, but last night I searched the question and found the answer. With that, all the comments on NetEase Cloud Music can be captured.
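Before the full listing, the unencrypted payload that params is ultimately derived from follows a simple rule stated in the article: offset = (page - 1) * 20, and total is "true" only on the first page. A small helper makes this concrete:

```python
def build_first_param(page):
    # offset counts comments, 20 per page; total is "true" only on page 1.
    offset = (page - 1) * 20
    total = "true" if page == 1 else "false"
    return '{rid:"", offset:"%d", total:"%s", limit:"20", csrf_token:""}' % (offset, total)
```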

As usual, here is the final code (Python 2.7), tested and working:

```python
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
# @Author: Lyrichu
# @Email: 919987476@qq.com
# @File: netCloud_spider3.py
'''
@Description: Crawl NetEase Cloud Music comments.
The POST encryption part is based on this Zhihu answer:
https://www.zhihu.com/question/36081767/answer/140287795
'''
from Crypto.Cipher import AES
import base64
import requests
import json
import codecs
import time

# Header information sent with every request
headers = {
    'Host': "music.163.com",
    'Accept-Language': "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    'Accept-Encoding': "gzip,deflate",
    'Content-Type': "application/x-www-form-urlencoded",
    'Cookie': "...",  # the long per-account cookie string was garbled in the source; use your own
    'Connection': "keep-alive",
    'Referer': 'http://music.163.com/'
}
# Proxy servers (optional)
proxies = {'http': 'http://121.232.146.184', 'https': 'https://144.255.48.197'}

# offset = (comment page - 1) * 20; total is "true" on the first page and "false" otherwise
second_param = "010001"  # second parameter (RSA public exponent)
third_param = "...7b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"  # third parameter; the beginning was truncated in the source
forth_param = "0CoJUm6Qyw8W8jud"  # fourth parameter (AES key)

# Build the encrypted "params" value for a given page number
def get_params(page):
    iv = "0102030405060708"
    first_key = forth_param
    second_key = 16 * 'F'
    if page == 1:  # first page
        first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
    else:
        offset = str((page - 1) * 20)
        first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'false')
    h_encText = AES_encrypt(first_param, first_key, iv)
    h_encText = AES_encrypt(h_encText, second_key, iv)
    return h_encText

# Return the (fixed) encSecKey value
def get_encSecKey():
    # The original article hard-codes a long hex constant here; it was garbled
    # in this copy, so take the value from the Zhihu answer linked above.
    encSecKey = "..."
    return encSecKey

# AES-CBC encryption with PKCS#7-style padding, base64-encoded
def AES_encrypt(text, key, iv):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    encrypt_text = encryptor.encrypt(text)
    encrypt_text = base64.b64encode(encrypt_text)
    return encrypt_text

# POST the encrypted form and return the raw comment JSON
def get_json(url, params, encSecKey):
    data = {"params": params, "encSecKey": encSecKey}
    response = requests.post(url, headers=headers, data=data, proxies=proxies)
    return response.content

# Fetch the hot comments and return them as a list of lines
def get_hot_comments(url):
    hot_comments_list = []
    hot_comments_list.append(u"userID nickname avatarUrl commentTime likedCount comment\n")
    params = get_params(1)  # hot comments live on the first page
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    hot_comments = json_dict['hotComments']  # hot comments
    print("There are %d hot comments in total!" % len(hot_comments))
    for item in hot_comments:
        comment = item['content']  # comment content
        likedCount = item['likedCount']  # number of likes
        comment_time = item['time']  # comment time (timestamp)
        userID = item['user']['userId']  # commenter id
        nickname = item['user']['nickname']  # nickname
        avatarUrl = item['user']['avatarUrl']  # avatar address
        comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " \
            + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
        hot_comments_list.append(comment_info)
    return hot_comments_list

# Fetch all comments of a song
def get_all_comments(url):
    all_comments_list = []  # holds all comments
    all_comments_list.append(u"userID nickname avatarUrl commentTime likedCount comment\n")  # header line
    params = get_params(1)
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    comments_num = int(json_dict['total'])
    if comments_num % 20 == 0:
        page = comments_num / 20
    else:
        page = int(comments_num / 20) + 1
    print("There are %d pages of comments in total!" % page)
    for i in range(page):  # fetch page by page
        params = get_params(i + 1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        json_dict = json.loads(json_text)
        if i == 0:
            print("There are %d comments in total!" % comments_num)
        for item in json_dict['comments']:
            comment = item['content']  # comment content
            likedCount = item['likedCount']  # number of likes
            comment_time = item['time']  # comment time (timestamp)
            userID = item['user']['userId']  # commenter id
            nickname = item['user']['nickname']  # nickname
            avatarUrl = item['user']['avatarUrl']  # avatar address
            comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " \
                + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
            all_comments_list.append(comment_info)
        print("Page %d captured!" % (i + 1))
    return all_comments_list

# Write the comment lines to a text file
def save_to_file(list, filename):
    with codecs.open(filename, 'a', encoding='utf-8') as f:
        f.writelines(list)
    print("File written successfully!")

if __name__ == "__main__":
    start_time = time.time()  # start time
    url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016/?csrf_token="
    filename = u".txt"  # the file name was lost in the source; fill in your own
    all_comments_list = get_all_comments(url)
    save_to_file(all_comments_list, filename)
    end_time = time.time()  # end time
    print("The program took %f seconds." % (end_time - start_time))
```

I ran the code above and captured the comments of two popular Jay Chou songs, "Sunny Day" (more than 1.3 million comments) and "Love Confession" (告白气球, more than 0.2 million comments). The former ran for more than 20 minutes, and the latter for more than 6600 seconds (nearly 2 hours).

Note that the fields are separated by spaces; each line contains the user ID, user nickname, avatar URL, comment time, number of likes, and the comment content.
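A small sketch for reading such lines back: since the comment itself may contain spaces, split at most five times so everything after the fifth space stays in the comment field. (A nickname containing spaces would break this simple scheme; that is a limitation of space-separated output.)

```python
def parse_comment_line(line):
    # Fields, in order: userID, nickname, avatar URL, time, likes, comment.
    user_id, nickname, avatar_url, timestamp, likes, comment = line.rstrip("\n").split(" ", 5)
    return {"userID": user_id, "nickname": nickname, "avatarUrl": avatar_url,
            "time": int(timestamp), "likedCount": int(likes), "comment": comment}
```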

That is the full walkthrough of a Python method for crawling NetEase Cloud Music's popular comments. I hope you find it helpful.
