How to crawl Himalaya full-site audio files with Python

Last Update:2018-07-05 Source: Internet

Author: User

Tags xpath

What is Himalaya

Himalaya FM (Ximalaya) is a Chinese audio-sharing platform. Its mobile client went online in March, and in a little over two years it passed 200 million mobile users [1], becoming the fastest-growing and largest online mobile audio-sharing platform in China.

This article shows how to crawl the audio files of a single complete book (album) on Himalaya, and then the audio of the whole site.

Environment configuration:

Windows + Python 3.6

Modules used to crawl a single book's audio

import json
import re

import requests

Modules used to crawl the full site

import re

import requests
from lxml import etree
from onexima import Xima  # the single-book script below, saved as onexima.py

If you open any audio page on the site, you will see that each one has an ID.

All we need to do is get the ID and name of each book and the address of every audio track in it, then download and save the files.

The idea is quite simple.
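The idea can be sketched without touching the network. The snippet below is only an illustration: it assumes the paginated URL pattern and the `data` / `tracksAudioPlay` / `trackName` / `src` JSON fields used by the script in this article, and the sample payload is made up. The helper names `page_urls` and `parse_tracks` are hypothetical.

```python
import json

# URL pattern as used in this article; whether the endpoint still
# answers in this form is an assumption.
PAGE_URL = ("https://www.ximalaya.com/revision/play/album"
            "?albumId={album_id}&pageNum={page}&sort=-1&pageSize=30")


def page_urls(album_id, pages):
    """Return the URL of every page (numbered from 1) of one book."""
    return [PAGE_URL.format(album_id=album_id, page=p)
            for p in range(1, pages + 1)]


def parse_tracks(payload):
    """Extract name/src pairs from the JSON text of one page."""
    data = json.loads(payload)
    return [{"name": t["trackName"], "src": t["src"]}
            for t in data["data"]["tracksAudioPlay"]]


# Made-up payload in the assumed shape, for illustration only.
SAMPLE = json.dumps({"data": {"tracksAudioPlay": [
    {"trackName": "Episode 1", "src": "http://example.com/1.m4a"},
    {"trackName": "Episode 2", "src": "http://example.com/2.m4a"},
]}})

urls = page_urls("3385980", 2)
tracks = parse_tracks(SAMPLE)
print(urls[0])
print(tracks)
```

Keeping URL building and JSON parsing in separate small functions makes each step easy to check before any download is attempted.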

Here is the Python code that crawls the audio of a single book:

import json
import os
import re

import requests


class Xima(object):
    def __init__(self, book_name_id, book_name):
        # Book name, used as the prefix of every saved file name
        self.book_name = book_name
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
        }
        # URL template for the pages of the current book; the book ID is a
        # parameter so that other books can be fetched the same way
        self.start_url = "https://www.ximalaya.com/revision/play/album?albumId=%s&pageNum={}&sort=-1&pageSize=30" % book_name_id
        # URL of each page of the current book
        self.book_url = []
        # To use the exact number of pages, read the page count from the
        # first page and put it in range()
        for i in range(8):
            url = self.start_url.format(i + 1)
            self.book_url.append(url)
        # print(self.book_url)

    def get_book_msg(self):
        """Get the audio address and title of every track in the book."""
        all_list = []  # Audio address and title of every track in the book
        for url in self.book_url:
            # Go through the URL of each page and extract its audio data
            r = requests.get(url, headers=self.headers)
            # JSON data of this page as a dictionary
            python_dict = json.loads(r.content.decode())
            # List of dictionaries, one per track on this page
            book_list = python_dict['data']['tracksAudioPlay']
            for book in book_list:
                # Put the playback address and name of each track into a dictionary
                track = {}
                track['src'] = book['src']
                track['name'] = book['trackName']
                # Collect every track in the list
                all_list.append(track)
        print(all_list)
        return all_list

    def save(self, all_list):
        """Save every track of the book to the local disk."""
        os.makedirs('xima', exist_ok=True)  # Make sure the output folder exists
        # Go through each track and save it
        for i in all_list:
            print(i)
            # {'src': 'http://audio.xmcdn.com/group44/M01/67/B4/wKgKkVss32fCcK5xAIMTfNZL0Fo411.m4a', 'name': 'Do you still have money to invest in Japan?'}
            # Some names contain a double quote, which breaks the file name,
            # so replace it with an empty string
            i['name'] = re.sub('"', '', i['name'])
            # 'wb' rather than 'ab', so re-running does not append to an old file
            with open(r'xima/{}.m4a'.format(self.book_name + i['name']), 'wb') as f:
                r = requests.get(i['src'], headers=self.headers)
                # The response body is the binary audio data; writing it out
                # gives the audio file
                f.write(r.content)

    def run(self):
        """Fetch the track list, then download every track."""
        all_list = self.get_book_msg()
        self.save(all_list)


if __name__ == '__main__':
    # Pass in the ID of the book so that the correct JSON data is fetched
    xima = Xima('3385980', 'Static said Japan')
    xima.run()
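Two details of the script above can be made more robust: the fixed page count in `range(8)` and the file name clean-up, which only removes double quotes. The sketch below is one possible approach, not the author's method. It assumes that a page past the end of the book yields an empty track list, which is not confirmed here. `collect_tracks`, `safe_name`, and the `fetch_page` callable are hypothetical names, and the network call is replaced by a stand-in dictionary.

```python
import re


def safe_name(name):
    """Replace characters that Windows forbids in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def collect_tracks(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page is a hypothetical callable that returns the list of track
    dictionaries for one page number; in a real crawler it would wrap
    requests.get and the JSON parsing.
    """
    tracks = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # assumed signal that there are no more pages
        tracks.extend(batch)
    return tracks


# Stand-in for the network: two pages of made-up data, then nothing.
FAKE_PAGES = {
    1: [{"name": 'He said "hello"', "src": "http://example.com/1.m4a"}],
    2: [{"name": "Part 2: what now?", "src": "http://example.com/2.m4a"}],
}

tracks = collect_tracks(lambda page: FAKE_PAGES.get(page, []))
names = [safe_name(t["name"]) for t in tracks]
print(names)
```

The `max_pages` cap keeps the loop from running forever if the assumption about empty pages turns out to be wrong.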

Code for crawling the audio of the full site

import re

import requests
from lxml import etree
from onexima import Xima  # the single-book script above, saved as onexima.py


def get_id():
    """Get the ID and name of each book in the leaderboard."""
    main_url = "https://www.ximalaya.com/shangye/top/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }
    r = requests.get(main_url, headers=headers)
    # Parse the HTML of the current page
    html = etree.HTML(r.content.decode())
    # Locate the element that holds each book
    div_list = html.xpath("//div[contains(@class, 'e-2997888007 rrc-album-item')]")
    all_list = []  # One dictionary per book
    for div in div_list:
        author = {}  # Dictionary pairing the ID of a book with its name
        # Link that contains the ID of the current book, e.g. /renwen/4859823/
        href = div.xpath("./a/@href")[0]
        print(href)
        # Extract the ID with a regular expression; only the correct ID
        # returns the correct JSON data
        author['id'] = re.search(r'\/.*?\/(.*)\/', href).group(1)
        author['book_name'] = div.xpath("./a/div[3]/div[1]/span/text()")[0]
        # Add the information of each book to the list
        all_list.append(author)
    print(all_list)
    return all_list


if __name__ == '__main__':
    # Call the function to get the information of every book as a list
    all_list = get_id()
    for i in all_list:
        # Go through the list and pass the ID and title of each book to the class
        x = Xima(i['id'], i['book_name'])
        x.run()
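The ID extraction step relies on links of the form `/renwen/4859823/`. As a hedged alternative to the loose pattern in the script, the sketch below accepts only links whose second path segment is numeric and returns `None` for anything else. The helper name `album_id` and the stricter pattern are assumptions for illustration, not part of the original script.

```python
import re

# Matches "/<category>/<digits>" with an optional trailing slash.
ALBUM_HREF = re.compile(r"^/[^/]+/(\d+)/?$")


def album_id(href):
    """Return the numeric book ID from a link, or None if it has none."""
    match = ALBUM_HREF.match(href)
    return match.group(1) if match else None


hrefs = ["/renwen/4859823/", "/shangye/top/", "/renwen/4859823"]
ids = [album_id(h) for h in hrefs]
print(ids)
```

With this check, a link such as `/shangye/top/` is skipped instead of passing the word `top` on as a book ID.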

This article is an English version of an article originally published in Chinese on aliyun.com.