What is Himalaya
Himalaya FM (Ximalaya FM) is a Chinese audio sharing platform. Its mobile client went online in March, and in a little over two years its mobile user base passed 200 million [1], making it the fastest-growing and largest online mobile audio sharing platform in China.
Today's post shows how to crawl the audio files of a single complete book (album) on Himalaya, and then the audio of the whole site.
Environment configuration:
Windows + Python 3.6
Modules used to crawl a single book's audio:
    import json
    import re

    import requests
Modules used to crawl the whole site:
    import re

    import requests
    from lxml import etree

    from onexima import Xima  # the single-book script, saved as onexima.py
If you open any book on the site, you will see that it has an ID.
All we need to do is get the ID and name of each book, use the ID to fetch the list of audio tracks in that book, and then download and save each track.
The idea is quite simple.
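As a minimal sketch of that idea, the snippet below turns one book ID into the list of page URLs to request. The endpoint and query string are taken from the listing that follows. The helper name `build_page_urls`, the camelCase spelling of the query parameters, and the page count are assumptions made for illustration.

```python
# Endpoint as used in the single-book listing; parameter capitalization is an assumption.
ALBUM_API = "https://www.ximalaya.com/revision/play/album"


def build_page_urls(book_id, pages=8, page_size=30):
    """Hypothetical helper: return one URL per page of a book.

    `pages` is a guess supplied by the caller, not a value read from the site.
    """
    template = ALBUM_API + "?albumId={}&pageNum={}&sort=-1&pageSize={}"
    return [template.format(book_id, n, page_size) for n in range(1, pages + 1)]


urls = build_page_urls("3385980", pages=2)
for u in urls:
    print(u)
```

Keeping the ID as a parameter means the same helper works for any other book once its ID is known.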
Here is the Python code that crawls the audio of one book:
    import json
    import os
    import re

    import requests


    class Xima(object):
        def __init__(self, book_id, book_name):
            # Book name, used as a prefix when saving the files
            self.book_name = book_name
            self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
            }
            # URL template for the pages of the current book. The book ID is kept
            # as a parameter so that other books can be fetched the same way.
            self.start_url = "https://www.ximalaya.com/revision/play/album?albumId=%s&pageNum={}&sort=-1&pageSize=30" % book_id
            # URL of each page of the current book
            self.book_url = []
            # To get the exact number of pages, read it from the first page
            # and pass it to range().
            for i in range(8):
                url = self.start_url.format(i + 1)
                self.book_url.append(url)
            # print(self.book_url)

        def get_book_msg(self):
            """Get the audio URL and title of every track in the book."""
            all_list = []  # audio URL and title of every track in the current book
            for url in self.book_url:
                # Go through each page URL and extract the audio data on that page
                r = requests.get(url, headers=self.headers)
                python_dict = json.loads(r.content.decode())  # JSON data for this page
                book_list = python_dict['data']['tracksAudioPlay']  # all tracks on this page
                for book in book_list:
                    # Keep the playback address and name of each track
                    track = {}
                    track['src'] = book['src']
                    track['name'] = book['trackName']
                    all_list.append(track)
            print(all_list)
            return all_list

        def save(self, all_list):
            """Save every track to local disk."""
            os.makedirs('xima', exist_ok=True)  # make sure the output folder exists
            for i in all_list:
                print(i)
                # {'src': 'http://audio.xmcdn.com/group44/M01/67/B4/wKgKkVss32fCcK5xAIMTfNZL0Fo411.m4a', 'name': 'Do you still have money to invest in Japan?'}
                # Some names contain a double quote, which breaks the file name,
                # so replace it with an empty string.
                i['name'] = re.sub('"', '', i['name'])
                with open(r'xima/{}.m4a'.format(self.book_name + i['name']), 'wb') as f:
                    r = requests.get(i['src'], headers=self.headers)
                    ret = r.content
                    # The binary content of the response is the audio file itself
                    f.write(ret)

        def run(self):
            """Fetch the track list, then download every track."""
            all_list = self.get_book_msg()
            self.save(all_list)


    if __name__ == '__main__':
        # Pass in the ID of the current book to get the correct JSON data
        xima = Xima('3385980', 'Static said Japan')
        xima.run()
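The listing above only strips double quotes from track names, but on Windows (the environment used here) a file name also cannot contain `\ / : * ? < > |`. The following is a small sketch of a more general clean-up. `safe_file_name` is a hypothetical helper, not part of the original script.

```python
import re

# Characters that Windows does not allow in file names.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(name):
    """Hypothetical helper: remove characters that are invalid in Windows file names."""
    return _ILLEGAL_CHARS.sub("", name).strip()


cleaned = safe_file_name('Do you still have money to invest in Japan? "Part 1"')
print(cleaned)
```

Calling such a helper on the track name before building the path would avoid errors from `open()` for titles that contain question marks or colons.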
Code for crawling the audio of the whole site:
    import re

    import requests
    from lxml import etree

    from onexima import Xima  # the single-book script above, saved as onexima.py


    def get_id():
        """Get the ID and name of each book on the leaderboard."""
        main_url = "https://www.ximalaya.com/shangye/top/"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
        }
        r = requests.get(main_url, headers=headers)
        # Parse the current page into an element tree
        html = etree.HTML(r.content.decode())
        # Locate the element that holds each book
        div_list = html.xpath("//div[contains(@class, 'e-2997888007 rrc-album-item')]")
        all_list = []  # one dictionary per book
        for div in div_list:
            # We want the ID and the name of the book, matched one to one
            author = {}
            href = div.xpath("./a/@href")[0]  # link that contains the book ID, e.g. /renwen/4859823/
            print(href)
            # Extract the ID with a regular expression; the correct ID is needed
            # to request the correct JSON data
            author['id'] = re.search(r'\/.*?\/(.*)\/', href).group(1)
            author['book_name'] = div.xpath("./a/div[3]/div[1]/span/text()")[0]
            all_list.append(author)
        print(all_list)
        return all_list


    # Call the function to get the information for every book (a list of dictionaries)
    all_list = get_id()
    for i in all_list:
        # Go through the list and pass the ID and title of each book to the class
        x = Xima(i['id'], i['book_name'])
        x.run()
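The regular expression in the listing above assumes every link has the shape `/category/id/`. If a link does not match, `re.search` returns `None` and calling `.group(1)` raises `AttributeError`. As a hedged sketch, a slightly more defensive version could look like this. `extract_book_id` is a hypothetical helper, and the assumption that book IDs are purely numeric is based only on the example `/renwen/4859823/`.

```python
import re


def extract_book_id(href):
    """Hypothetical helper: return the trailing numeric ID of a book link, or None."""
    match = re.search(r"/(\d+)/?$", href)
    return match.group(1) if match else None


print(extract_book_id("/renwen/4859823/"))  # link shape taken from the listing above
print(extract_book_id("/renwen/"))          # no ID present, so the result is None
```

Books whose link yields `None` can then be skipped instead of stopping the whole crawl.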