Crawling NetEase Cloud Music's popular comments with Python

Source: Internet
Author: User


Lately I have been studying text mining. As the saying goes, even the cleverest cook cannot prepare a meal without rice: before you can analyze text, you first have to obtain it. There are many ways to get text, such as downloading existing text corpora from the Internet or pulling data through a third-party API. Sometimes, however, the data we want cannot be obtained directly, because no download channel or API is provided for it. What then? A good option is a web crawler, that is, a program written to pose as a normal user and retrieve the data for us. With the efficiency of a computer, we can obtain data easily and quickly.

How do you write a crawler? Many languages can be used, such as Java, PHP, and Python. I personally prefer Python, because it not only has powerful networking modules built in but also many excellent third-party libraries. Where someone else has already built the wheel, we can simply use it, which makes writing crawlers very convenient. A small crawler can literally be written in fewer than 10 lines of Python, while other languages would need much more code. Being concise and easy to read is a huge advantage of Python.

With that said, let's get to the point. NetEase Cloud Music has been very popular in recent years. I have been a user for several years, having used QQ Music and Kugou before; in my own experience, its best features are its accurate song recommendations and its distinctive user comments (solemn declaration!!! This is not sponsored content, not an advertisement!!! It only represents my personal opinion!). Under many songs there are brilliant comments that collect huge numbers of likes. A few days ago NetEase Cloud Music even put selected user comments on subway ads, and its comment culture became a hot topic once again. So I wanted to analyze NetEase Cloud's comments and look for patterns, especially the characteristics of the hot comments. With that goal, I started crawling them.

Python 2 ships with two networking libraries, urllib and urllib2, but they are not very convenient to use, so here we use the well-regarded third-party library requests. With only a few lines of code, requests lets you set up a proxy, simulate login, and perform other relatively complicated crawler operations. If pip is installed, run pip install requests to install it; its Chinese documentation is easy to find online. The urllib and urllib2 libraries are also useful, and I will introduce them another time.
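For instance, a minimal sketch of fetching a page with a browser-style User-Agent and a proxy might look like this (the proxy address below is only a placeholder; substitute one you actually control):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # pretend to be a normal browser
proxies = {"http": "http://127.0.0.1:8080"}  # placeholder proxy; replace with your own or drop the argument
r = requests.get("http://music.163.com/", headers=headers, proxies=proxies, timeout=10)
print(r.status_code)  # 200 means the request went through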

Before formally introducing the crawler, let's briefly cover how crawlers work. When we open a website in a browser, the browser sends a request to the server; the server returns data based on that request, and the browser parses the data and presents it to us. If we use code instead, we skip the browser step: we send the request data to the server directly, then take whatever the server returns and extract the information we want. The catch is that the server sometimes validates the requests it receives, and if it decides a request is illegitimate it returns no data, or wrong data. To avoid this, we sometimes need to disguise the program as a normal user in order to get a proper response.

How do we disguise it? The answer lies in the difference between visiting a page through a browser and through a program. When a browser requests a page, besides the URL it also sends extra information such as headers, which serve as a kind of identity proof for the request; when the server sees them, it concludes that we are visiting through a normal browser and returns the data. So our program needs to carry the same identifying information a browser would when it sends a request, and then it can get the data smoothly.

Sometimes we must be logged in to obtain certain data, so we also have to simulate login. Essentially, logging in through a browser just means POSTing some form information to the server (username, password, and so on); once the server verifies it, we are logged in. A program can do exactly the same thing: whatever the browser posts, we post as-is. I will cover simulated login in detail later. Of course, things do not always go smoothly, because some sites have anti-crawling measures; for example, if you access them too quickly, some sites will block your IP address (Douban does this). Then we have to use proxy servers, that is, change our IP; if one IP gets blocked, we switch to another. How exactly? Those topics will also be discussed later.
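To make the simulated-login idea concrete, here is a minimal sketch using requests.Session; the URL and form field names are hypothetical and are not NetEase's actual login interface:

import requests

session = requests.Session()
login_data = {"username": "me@example.com", "password": "secret"}  # hypothetical form field names
resp = session.post("https://example.com/login", data=login_data,
                    headers={"User-Agent": "Mozilla/5.0"})
if resp.ok:
    # cookies obtained at login are reused automatically by the session
    profile = session.get("https://example.com/profile")
    print(profile.status_code)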

Finally, let me mention a small trick that I find very useful when crawling. If you use Firefox or Chrome, you may have noticed the built-in developer tools (the web console). They are extremely handy, because they show clearly what information the browser sends and what the server returns when you visit a website, and this information is the key to writing a crawler. You will see how useful it is shortly.

---------------------------------------------------- The official start ---------------------------------------------------

First, open the web version of NetEase Cloud Music in a browser and pick a song to open its page. Here I use Jay Chou's "Sunny Day" as an example, as shown in Figure 1.

Figure 1

Next, open the web console (in Chrome, open the developer tools; other browsers are similar), as shown in Figure 2.

Figure 2

Now click the Network tab, clear all the entries, and then resend the request (equivalent to refreshing the browser). This way we can see directly what the browser sends and what the server responds with, as shown in Figure 3.

Figure 3

Figure 4 shows the data obtained after refreshing:

Figure 4

We can see that the browser sends a lot of requests, so which one is the one we want? A first filter is the status code, which marks the outcome of a request: 200 means the request succeeded, while 304 means Not Modified (there are many status codes; if you want to learn more, look them up yourself, and I won't go into the details of 304 here). So generally we only need to look at requests with status code 200. In addition, we can use the preview pane on the right (or the response view) to get a rough idea of what the server returned, as shown in Figure 5:

Figure 5
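In code, the same filter is just a check on the status code; a rough sketch:

import requests

r = requests.get("http://music.163.com/", headers={"User-Agent": "Mozilla/5.0"})
if r.status_code == 200:  # only successful responses are worth parsing
    print(r.text[:200])  # peek at the first part of the returned body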

Combining these two methods, we can quickly find the request we want to analyze. Note that in Figure 5 the Request URL column is the URL we need to request; there are two request methods, GET and POST; and the request headers deserve special attention, since they contain the User-Agent (client information), Referer (which page we came from), and other fields. Generally we include this header information in both GET and POST requests. Figure 6 shows the header information:

Figure 6

Note also that a GET request usually appends its parameters directly to the URL in the form ?parameter1=value1&parameter2=value2, so no extra request parameters are needed, whereas a POST request generally carries extra parameters in the request body rather than in the URL, so sometimes we also have to pay attention to the parameters column. After some careful searching, we finally find the comment-related request at http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token=, as shown in Figure 7:

Figure 7
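The difference between the two methods is easy to see with requests itself; a small illustration against the neutral test service httpbin.org (not NetEase's API):

import requests

# GET: parameters are appended to the URL as ?key1=value1&key2=value2
r1 = requests.get("https://httpbin.org/get", params={"key1": "value1", "key2": "value2"})
print(r1.url)  # https://httpbin.org/get?key1=value1&key2=value2

# POST: parameters travel in the request body, not in the URL
r2 = requests.post("https://httpbin.org/post", data={"params": "...", "encSecKey": "..."})
print(r2.status_code)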

Opening this request, we find that it is a POST request with two request parameters, params and encSecKey, whose values are very long and look encrypted, as shown in Figure 8:

Figure 8

The comment data returned by the server is in JSON format and contains rich information (such as the user information, comment date, number of likes, and comment content), as shown in Figure 9 (hotComments is the array of popular comments, and comments is the array of ordinary comments):

Figure 9
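Picking the interesting fields out of that JSON is straightforward; the snippet below uses a trimmed-down stand-in for the real response, with field names as seen in Figure 9 and in the full code further down:

import json

json_text = '{"total": 2, "hotComments": [{"user": {"nickname": "somebody"}, "likedCount": 12345, "content": "..."}], "comments": []}'  # stand-in for the real response body
json_dict = json.loads(json_text)
print(json_dict['total'])  # total number of comments
for item in json_dict['hotComments']:  # the popular-comments array
    print(item['user']['nickname'], item['likedCount'], item['content'])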

So the direction is clear: all we have to do is determine the values of the params and encSecKey parameters. This problem bothered me for a whole afternoon; for a long time I could not work out how these two parameters are encrypted, but I did find a pattern. In http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token=, the number after R_SO_4_ is the song's id, and for a given page number the params and encSecKey values are interchangeable between songs: if you take the two parameter values from page 1 of song A and send them with song B's URL, you get page 1 of song B's comments, and the same holds for page 2, page 3, and so on. Unfortunately, the parameters differ from page to page, so this trick can only fetch a limited number of pages (which is, however, enough to get the total comment count and the popular comments). If you want to crawl all the data, you have to understand how these two parameters are encrypted. I thought I would never figure it out, but when I searched the question last night I found an answer. With that, all the comments on NetEase Cloud Music can be crawled.
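Since only the song id changes in the URL, building the request URL for any song is just string formatting; for example (186016 is the id used throughout this article):

song_id = 186016  # the number that appears after R_SO_4_ in the URL
url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_%d?csrf_token=" % song_id
print(url)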

As usual, here is the final code, which I have tested:

#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
# @Author : Lyrichu
# @Email  : 919987476@qq.com
# @File   : netCloud_spider3.py
'''
@Description: NetEase Cloud Music comment crawler, capable of crawling all comments of a song.
The POST encryption part refers to a Zhihu answer:
https://www.zhihu.com/question/36081767/answer/140287795
'''
from Crypto.Cipher import AES
import base64
import requests
import json
import codecs
import time

# header information
headers = {
    'Host': "music.163.com",
    'Accept-Language': "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    'Accept-Encoding': "gzip,deflate",
    'Content-Type': "application/x-www-form-urlencoded",
    'Cookie': "",  # the cookie string in the original article is garbled; paste your own browser cookie here
    'Connection': "keep-alive",
    'Referer': 'http://music.163.com/'
}
# proxy servers (replace with proxies you actually control, or drop the proxies argument below)
proxies = {
    'http': 'http://121.232.146.184',
    'https': 'https://144.20.48.197'
}
# offset = (comment page - 1) * 20; total is true for the first page and false for the rest
# first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'  # first parameter
second_param = "010001"  # second parameter
third_param = "F52741d546b8e289dc6935b3ece0462db0a22b8e7"  # third parameter (only the tail survives; the full hex value is truncated in the original article)
forth_param = "0CoJUm6Qyw8W8jud"  # fourth parameter

# build the params value
def get_params(page):  # page is the comment page number
    iv = "0102030405060708"
    first_key = forth_param
    second_key = 16 * 'F'
    if page == 1:  # first page
        first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
        h_encText = AES_encrypt(first_param, first_key, iv)
    else:
        offset = str((page - 1) * 20)
        first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'false')
        h_encText = AES_encrypt(first_param, first_key, iv)
    h_encText = AES_encrypt(h_encText, second_key, iv)
    return h_encText

# get encSecKey (a fixed value matching second_key above)
def get_encSecKey():
    encSecKey = ""  # the long fixed value is garbled in the original article; see the Zhihu answer referenced above
    return encSecKey

# AES encryption
def AES_encrypt(text, key, iv):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    encrypt_text = encryptor.encrypt(text)
    encrypt_text = base64.b64encode(encrypt_text)
    return encrypt_text

# fetch the comment json data
def get_json(url, params, encSecKey):
    data = {
        "params": params,
        "encSecKey": encSecKey
    }
    response = requests.post(url, headers=headers, data=data, proxies=proxies)
    return response.content

# fetch the hot comments, return a list
def get_hot_comments(url):
    hot_comments_list = []
    hot_comments_list.append(u"User ID  Nickname  Avatar URL  Comment time  Likes  Comment content\n")
    params = get_params(1)
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    hot_comments = json_dict['hotComments']  # hot comments
    print("%d hot comments in total!" % len(hot_comments))
    for item in hot_comments:
        comment = item['content']  # comment content
        likedCount = item['likedCount']  # number of likes
        comment_time = item['time']  # comment time (timestamp)
        userID = item['user']['userId']  # commenter id
        nickname = item['user']['nickname']  # nickname
        avatarUrl = item['user']['avatarUrl']  # avatar URL
        comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
        hot_comments_list.append(comment_info)
    return hot_comments_list

# fetch all comments of a song
def get_all_comments(url):
    all_comments_list = []  # store all comments
    all_comments_list.append(u"User ID  Nickname  Avatar URL  Comment time  Likes  Comment content\n")  # header line
    params = get_params(1)
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    comments_num = int(json_dict['total'])
    if comments_num % 20 == 0:
        page = comments_num / 20
    else:
        page = int(comments_num / 20) + 1
    print("Total %d pages of comments!" % page)
    for i in range(page):  # fetch page by page
        params = get_params(i + 1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        json_dict = json.loads(json_text)
        if i == 0:
            print("%d comments in total!" % comments_num)  # total number of comments
        for item in json_dict['comments']:
            comment = item['content']  # comment content
            likedCount = item['likedCount']  # number of likes
            comment_time = item['time']  # comment time (timestamp)
            userID = item['user']['userId']  # commenter id
            nickname = item['user']['nickname']  # nickname
            avatarUrl = item['user']['avatarUrl']  # avatar URL
            comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
            all_comments_list.append(comment_info)
        print("Page %d captured!" % (i + 1))
    return all_comments_list

# write the comments to a text file
def save_to_file(list, filename):
    with codecs.open(filename, 'a', encoding='utf-8') as f:
        f.writelines(list)
    print("File written successfully!")

if __name__ == "__main__":
    start_time = time.time()  # start time
    url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token="
    filename = u"comments.txt"  # output filename (the original name is garbled in the article)
    all_comments_list = get_all_comments(url)
    save_to_file(all_comments_list, filename)
    end_time = time.time()  # end time
    print("Program took %f seconds." % (end_time - start_time))
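If you only want the popular comments rather than the complete history, you can call get_hot_comments from the code above instead; a minimal usage sketch (the output filename is just an example):

url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token="
hot_comments_list = get_hot_comments(url)
save_to_file(hot_comments_list, u"hot_comments.txt")  # example output filename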

I ran the code above and crawled two popular Jay Chou songs, "Sunny Day" (over 1.3 million comments) and "Confession Balloon" (over 0.2 million comments); the former ran for more than 20 minutes, and the latter for more than 6,600 seconds (nearly 2 hours), as shown below:

Note that the fields are separated by spaces; each line contains the user ID, user nickname, user avatar URL, comment time, number of likes, and comment content. I uploaded the two txt files to Baidu Cloud, and anyone interested in the data can download them directly for text analysis: "Sunny Day" (http://pan.baidu.com/s/1kU50rBL) and "Confession Balloon" (http://pan.baidu.com/s/1i4PNjff). Or you can run the code yourself and crawl them (just be careful not to open too many threads and put too much pressure on NetEase Cloud's servers ~~ at one point the server returned data very slowly, and I am not sure whether my access was being throttled, but it recovered later). I may do a visualization analysis of the comment data later, so stay tuned!

That is all for this article. I hope it helps you with your study or work, and I also hope you will continue to support us!
