Crawling NetEase Cloud Music with Python

Source: Internet
Author: User
Tags: base64, modulus

    • Results

If you are not interested in the process, skip straight to the results here.

List of songs with more than 50,000 comments

First of all, congratulations to my favorite singer, Jay Chou: his "Sunny Day" became the first song on NetEase Cloud Music to pass one million comments!

The results show that only 10 songs currently have more than 100,000 comments. Looking at this top 10:

    • Joker Xue is really hot right now.
    • Almost all of them are male singers. Do male singers draw more attention? (Don't hit me.) Jay Chou, Joker Xue, and Xu Song (three singers I like) account for nearly half the list ...
    • "Fade" brings a strong electronic sound and a great sense of rhythm (turn it up while writing code and you can't stop ...)

Based on the results I made two NetEase Cloud Music playlists:

    • Songs with more than 100,000 comments
    • Songs with more than 50,000 comments

Note: due to copyright issues, a few songs in the 50,000+ playlist are temporarily off the shelf and have been replaced with other good versions for now.

High-energy warning: play No. 29, "Lost Rivers", with caution; if you insist on playing it, read the comments first ...

      • Process

        1. Examine the HTML structure of NetEase Cloud Music's official pages
          Home page (http://music.163.com/)
          Playlist category page (http://music.163.com/discover/playlist)
          Playlist page (http://music.163.com/playlist?id=499518394)
          Song detail page (http://music.163.com/song?id=109998)
        2. Crawl the song IDs
          Observing the song detail page's URL shows that it is built from the song ID, and all of a song's information lives on that page, so collecting every song ID is enough to reach every song. Where do the IDs come from? From playlists. And where do the playlists come from? The playlist page's URL also carries an ID, which can be crawled from the playlist category page. So the crawl runs: category page to playlist IDs to song IDs, eventually collecting the IDs of all songs.
        3. Filter out qualifying songs by comment count

          Unfortunately, although the comment count also appears on the detail page, NetEase Cloud Music applies anti-crawling measures: the comment data is filled in asynchronously by an Ajax call to a comment API, so the count we scrape from the raw HTML is empty. Hunting for that API by watching the XHR requests in the browser's developer tools turns up the request below.

          The response is rich: all the comment-related data is there. The API's request parameters turn out to be encrypted, but that is no obstacle ...

        4. Crawl the details (name, singer, etc.) of qualifying songs

          This step is simple: the name and the singer can be read straight from the song detail page's HTML.
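Step 2's ID harvesting can be sketched in a few lines. This runs against an inline sample instead of a live page (the sample IDs are made up); the real script below uses BeautifulSoup, but a plain regex is enough to show the idea of pulling every /song?id=... link out of a playlist page and de-duplicating:

```python
import re

# A trimmed, hypothetical sample of the hidden <ul class="f-hide"> track
# list that a NetEase playlist page embeds in its HTML (IDs are made up).
SAMPLE_HTML = """
<ul class="f-hide">
  <li><a href="/song?id=186016">Sunny Day</a></li>
  <li><a href="/song?id=27678655">Actor</a></li>
  <li><a href="/song?id=186016">Sunny Day</a></li>
</ul>
"""

def extract_song_ids(html):
    """Pull every /song?id=<digits> link out of the page, de-duplicated
    while preserving order (playlists often share songs)."""
    ids = re.findall(r'/song\?id=(\d+)', html)
    seen, unique = set(), []
    for song_id in ids:
        if song_id not in seen:
            seen.add(song_id)
            unique.append(song_id)
    return unique

print(extract_song_ids(SAMPLE_HTML))  # → ['186016', '27678655']
```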
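Step 3's encrypted API call, as reproduced in the source below, AES-CBC encrypts the JSON payload twice with PKCS#7-style padding and ships the random AES key alongside it as "encSecKey", produced by textbook RSA (public exponent 0x010001) over the reversed key. Here is a minimal Python 3 sketch of just the padding and RSA pieces; the small modulus is a stand-in, not NetEase's real one:

```python
import codecs

def pkcs7_pad(text, block=16):
    # pad the plaintext up to a multiple of the AES block size;
    # each pad byte encodes the pad length
    n = block - len(text) % block
    return text + n * chr(n)

def rsa_encrypt(text, pubkey_hex, modulus_hex):
    text = text[::-1]  # the client reverses the key before encrypting
    m = int(codecs.encode(text.encode(), 'hex'), 16)
    n, e = int(modulus_hex, 16), int(pubkey_hex, 16)
    # three-argument pow() does modular exponentiation efficiently
    return format(pow(m, e, n), 'x').zfill(256)

padded = pkcs7_pad('{"rid":""}')
print(len(padded))  # → 16

demo_modulus = 'f' * 64  # stand-in 256-bit modulus, not the real one
print(len(rsa_encrypt('abcdef0123456789', '010001', demo_modulus)))  # → 256
```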
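Step 4 boils down to splitting the detail page's <title>, which reads roughly "<name> - <singer> - NetEase Cloud Music". A small sketch of the parsing (the sample title is illustrative):

```python
def parse_song_title(title):
    """Split a song detail page title into (name, singer)."""
    parts = title.split('-')
    name, singer = parts[0].strip(), parts[1].strip()
    # drop anything from an opening parenthesis onward, e.g. "(Live)",
    # as the full script does for cleaner playlist output
    idx = name.find('(')
    if idx > 0:
        name = name[:idx].strip()
    return name, singer

print(parse_song_title('Sunny Day (Live) - Jay Chou - NetEase Cloud Music'))
# → ('Sunny Day', 'Jay Chou')
```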

    • Source

# encoding=utf8
# NOTE: this script targets Python 2 (print statements, str.encode('hex')).
import requests
from bs4 import BeautifulSoup
import os, json
import base64
from Crypto.Cipher import AES
from prettytable import PrettyTable
import warnings

warnings.filterwarnings("ignore")

BASE_URL = 'http://music.163.com/'
_session = requests.session()
# only keep songs whose comment count exceeds this threshold
COMMENT_COUNT_LET = 100000


class Song(object):
    def __lt__(self, other):
        # reversed comparison so list.sort() orders by comment count, descending
        return self.commentCount > other.commentCount


# NetEase fills the comments in via Ajax, so they cannot be scraped from the
# HTML; the comment API has to be called directly, and its parameters are
# encrypted. The helpers below reproduce the client-side encryption.
def aesEncrypt(text, secKey):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    encryptor = AES.new(secKey, AES.MODE_CBC, '0102030405060708')
    ciphertext = encryptor.encrypt(text)
    return base64.b64encode(ciphertext)


def rsaEncrypt(text, pubKey, modulus):
    text = text[::-1]
    # three-argument pow() does modular exponentiation efficiently
    rs = pow(int(text.encode('hex'), 16), int(pubKey, 16), int(modulus, 16))
    return format(rs, 'x').zfill(256)


def createSecretKey(size):
    return (''.join(map(lambda xx: (hex(ord(xx))[2:]), os.urandom(size))))[0:16]


# Get all song IDs through a third-party site.
# Lazy shortcut: http://grri94kmi4.app.tianmaying.com/songs has already crawled
# the official site's songs, which saves a lot of time.
# You can also use getSongIdList() to crawl the official site instead:
# slower, but more accurate.
def getSongIdListBy3party():
    pageMax = 1  # number of pages to crawl; set it as needed
    songIdList = []
    for page in range(pageMax):
        url = 'http://grri94kmi4.app.tianmaying.com/songs?page=' + str(page)
        soup = BeautifulSoup(_session.get(url).content)
        aList = soup.findAll('a', attrs={'target': '_blank'})
        for a in aList:
            songId = a['href'].split('=')[1]
            songIdList.append(songId)
    return songIdList


# Crawl all song IDs from the official site's playlist discovery pages
def getSongIdList():
    pageMax = 1  # 42 pages in total; crawling them all takes a long time
    songIdList = []
    for i in range(1, pageMax + 1):
        url = 'http://music.163.com/discover/playlist/?order=hot&cat=全部&limit=35&offset=' + str(i * 35)
        soup = BeautifulSoup(_session.get(url).content)
        aList = soup.findAll('a', attrs={'class': 'tit f-thide s-fc0'})
        for a in aList:
            uri = a['href']
            playlistUrl = BASE_URL + uri[1:]
            soup = BeautifulSoup(_session.get(playlistUrl).content)
            ul = soup.find('ul', attrs={'class': 'f-hide'})
            for li in ul.findAll('li'):
                songId = (li.find('a'))['href'].split('=')[1]
                print 'crawled song ID ' + songId
                songIdList.append(songId)
    # playlists share songs, so drop duplicate IDs
    songIdList = list(set(songIdList))
    return songIdList


# Keep the song only if its comment count is greater than `let`
def matchSong(songId, let):
    url = BASE_URL + 'weapi/v1/resource/comments/R_SO_4_' + str(songId) + '/?csrf_token='
    headers = {'Cookie': 'appver=1.5.0.75771;', 'Referer': 'http://music.163.com/'}
    text = {'username': '', 'password': '', 'rememberLogin': 'true'}
    modulus = '00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7'
    nonce = '0CoJUm6Qyw8W8jud'
    pubKey = '010001'
    text = json.dumps(text)
    secKey = createSecretKey(16)
    encText = aesEncrypt(aesEncrypt(text, nonce), secKey)
    encSecKey = rsaEncrypt(secKey, pubKey, modulus)
    data = {'params': encText, 'encSecKey': encSecKey}
    req = requests.post(url, headers=headers, data=data)
    total = req.json()['total']
    if int(total) > let:
        song = Song()
        song.id = songId
        song.commentCount = total
        return song


# Fill in a song's name and singer from its detail page
def setSongInfo(song):
    url = BASE_URL + 'song?id=' + str(song.id)
    soup = BeautifulSoup(_session.get(url).content)
    strArr = soup.title.string.split('-')
    song.singer = strArr[1]
    name = strArr[0].encode('utf-8')
    # strip a trailing "(...)" from the song name; drop these three lines
    # if you want to keep it
    index = name.find('(')
    if index > 0:
        name = name[0:index]
    song.name = name


# Get the list of songs that match the comment-count threshold
def getSongList():
    print '## crawling song IDs ... ##'
    # songIdList = getSongIdList()
    songIdList = getSongIdListBy3party()
    print '## done, crawled ' + str(len(songIdList)) + ' song IDs ##'
    songList = []
    print '## matching songs with more than ' + str(COMMENT_COUNT_LET) + ' comments ... ##'
    for id in songIdList:
        song = matchSong(id, COMMENT_COUNT_LET)
        if None != song:
            setSongInfo(song)
            songList.append(song)
            print 'matched {name: ', song.name, '-', song.singer, ', comments: ', song.commentCount, '}'
    print '## done, ' + str(len(songList)) + ' songs qualify ##'
    return songList


def main():
    songList = getSongList()
    # sort by comment count, descending
    songList.sort()
    # print the result as a table
    table = PrettyTable([u'Rank', u'Comments', u'Song', u'Singer'])
    for index, song in enumerate(songList):
        table.add_row([index + 1, song.commentCount, song.name, song.singer])
    print table
    print 'End'


if __name__ == '__main__':
    main()

