How to crawl Himalaya full-site audio files with Python

Source: Internet
Author: User
Tags: xpath

What is Himalaya

Himalaya FM (Ximalaya) is a Chinese audio-sharing platform. Its mobile client went online in March, and in a little over two years its mobile user base exceeded 200 million [1], making it the fastest-growing and largest online mobile audio-sharing platform in China.

This post shows how to crawl the audio files of a single complete book (album) on Himalaya, and then the audio of the whole site.

Environment configuration:

Windows + Python 3.6

Modules used to crawl a single book's audio

import json
import re
import requests

Modules used to crawl the whole site

import re
import requests
from lxml import etree
from onexima import Xima

If you open any audio item on the site, you will see that it has an ID.

All we need to do is get the ID and name of each book, use the ID to fetch the address and title of every audio track in that book, and then download and save the files.

The idea is quite simple.
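As a minimal sketch of that idea, the helper below turns one book ID into the list of page URLs to request. The URL template is the one used by the single-book crawler in this post; the camel-case parameter names (albumId, pageNum, pageSize) and the helper name build_page_urls are assumptions for illustration, and the endpoint may have changed since the post was written.

```python
# Assumed URL template, copied from the single-book crawler in this post.
# The parameter casing (albumId, pageNum, pageSize) is an assumption.
PAGE_URL = (
    "https://www.ximalaya.com/revision/play/album"
    "?albumId={album_id}&pageNum={page}&sort=-1&pageSize=30"
)


def build_page_urls(album_id, pages):
    """Return the URLs for pages 1..pages of one book (hypothetical helper)."""
    return [PAGE_URL.format(album_id=album_id, page=p) for p in range(1, pages + 1)]


# The book ID below is the one used as the example in this post.
urls = build_page_urls("3385980", 2)
for u in urls:
    print(u)
```

Keeping the ID as the only variable part of the URL is what lets the same code fetch any other book later.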

Here is the Python code that crawls the audio of a single book

import json
import os
import re

import requests


class Xima(object):
    def __init__(self, book_id, book_name):
        # Book title, used as a prefix when saving the files
        self.book_name = book_name
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
        }
        # URL template for the pages of the current book; the book ID is kept
        # as a parameter so that other books can be fetched the same way
        self.start_url = "https://www.ximalaya.com/revision/play/album?albumId=%s&pageNum={}&sort=-1&pageSize=30" % book_id
        # URL of each page of the current book
        self.book_url = []
        # To fetch the exact number of pages, read the page count from the
        # first page and pass it to range()
        for i in range(8):
            url = self.start_url.format(i + 1)
            self.book_url.append(url)
        # print(self.book_url)

    def get_book_msg(self):
        """Get the audio address and title of every track in the book."""
        all_list = []  # audio address and title of every track in the current book
        for url in self.book_url:
            # Go through the URL of each page and extract its audio data
            r = requests.get(url, headers=self.headers)
            python_dict = json.loads(r.content.decode())  # JSON data of the page as a dict
            book_list = python_dict['data']['tracksAudioPlay']  # all tracks on this page
            for book in book_list:
                # Put the playback address and the name of each track into a dict
                track = {}
                track['src'] = book['src']
                track['name'] = book['trackName']
                # Collect every track in the list
                all_list.append(track)
        print(all_list)
        return all_list

    def save(self, all_list):
        """Save every track to the local disk."""
        # The target directory must exist before files are written to it
        os.makedirs('xima', exist_ok=True)
        # Go through each track and save it
        for i in all_list:
            print(i)
            # {'src': 'http://audio.xmcdn.com/group44/M01/67/B4/wKgKkVss32fCcK5xAIMTfNZL0Fo411.m4a', 'name': 'Do you still have money to invest in Japan?'}
            # Some names contain a double quote ("), which breaks the file
            # name, so replace it with an empty string
            i['name'] = re.sub('"', '', i['name'])
            # 'wb' overwrites any partial file left by an earlier run
            with open(r'xima/{}.m4a'.format(self.book_name + i['name']), 'wb') as f:
                r = requests.get(i['src'], headers=self.headers)
                ret = r.content
                # The binary content of the response is the audio file itself
                f.write(ret)

    def run(self):
        """Fetch the track list, then download every track."""
        all_list = self.get_book_msg()
        self.save(all_list)


if __name__ == '__main__':
    # Pass in the ID of the current book to get the correct JSON data
    xima = Xima('3385980', 'Static said Japan')
    xima.run()
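The code above hard-codes eight pages and only strips double quotes from track names. The sketch below shows one possible way to loosen both points: keep requesting pages until one comes back empty, and remove every character that Windows forbids in file names. Both helpers (collect_tracks, safe_filename) and the fetch_page callable are hypothetical names introduced here, and stopping on an empty page is an assumption about how the site responds past the last page, not something the post confirms.

```python
import re


def collect_tracks(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page has no tracks.

    fetch_page is a hypothetical callable that returns a list of
    {'src': ..., 'name': ...} dicts for the given page number.
    max_pages is a safety limit so the loop always terminates.
    """
    tracks = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        tracks.extend(batch)
    return tracks


def safe_filename(name, replacement=""):
    """Remove the characters Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name)
    return cleaned.strip()


# Stand-in for a real HTTP call: two pages of data, then an empty page.
fake_pages = {
    1: [{"src": "http://example.com/a.m4a", "name": 'Part "1"'}],
    2: [{"src": "http://example.com/b.m4a", "name": "Part 2: why?"}],
}
tracks = collect_tracks(lambda page: fake_pages.get(page, []))
names = [safe_filename(t["name"]) for t in tracks]
print(names)
```

In the crawler, fetch_page would wrap the requests.get call and the JSON parsing for one page, and safe_filename would replace the single re.sub call in save.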

Code for crawling the audio of the whole site

import re

import requests
from lxml import etree

from onexima import Xima


def get_id():
    """Get the ID and title of each book on the leaderboard."""
    main_url = "https://www.ximalaya.com/shangye/top/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }
    r = requests.get(main_url, headers=headers)
    # Parse the current page into an element tree
    html = etree.HTML(r.content.decode())
    # Locate the element that holds each book
    div_list = html.xpath("//div[contains(@class, 'e-2997888007 rrc-album-item')]")
    all_list = []  # one dict per book, holding its ID and title
    for div in div_list:
        author = {}  # the ID and the title of one book, kept together
        href = div.xpath("./a/@href")[0]  # link that contains the book ID, e.g. /renwen/4859823/
        print(href)
        # Extract the ID with a regular expression; it is needed to
        # request the correct JSON data
        author['id'] = re.search(r'\/.*?\/(.*)\/', href).group(1)
        author['book_name'] = div.xpath("./a/div[3]/div[1]/span/text()")[0]
        # Add the book to the list
        all_list.append(author)
    print(all_list)
    return all_list


# Call the function to get the information for every book, as a list
all_list = get_id()
for i in all_list:
    # Go through the list and pass the ID and title of each book to the class
    x = Xima(i['id'], i['book_name'])
    x.run()
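The ID extraction in the code above calls .group(1) directly on the result of re.search, which raises AttributeError whenever a link does not match, because re.search returns None in that case. A slightly stricter variant might look like the sketch below; the helper name album_id_from_href is hypothetical, and the assumption that book links always have the form /category/digits/ comes only from the single example shown in the code comment.

```python
import re

# Assumed link shape, based on the example "/renwen/4859823/" in this post.
ALBUM_HREF = re.compile(r"^/[^/]+/(\d+)/?$")


def album_id_from_href(href):
    """Return the numeric book ID from a link, or None if it does not match."""
    match = ALBUM_HREF.match(href)
    return match.group(1) if match else None


print(album_id_from_href("/renwen/4859823/"))
print(album_id_from_href("/shangye/top/"))
```

Returning None instead of raising lets the loop skip links that are not books rather than stopping the whole crawl.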
