Python Crawler: Ximalaya Audio Data


One: Preface

This project crawls information about every channel in the popular section of Ximalaya across the whole site, together with every audio item inside each channel, and saves the crawled data to MongoDB for later use. The data volume is about 700,000 records, covering audio files, channel information, descriptions, and more.
Yesterday I had the first interview of my life, with an artificial intelligence and big-data company, for the summer internship of my sophomore year. They asked whether I had ever crawled audio data, so I decided to crawl Ximalaya's audio data and analyze it. At the moment I am still waiting for the third-round interview, or rather for the message with the final interview result. (Either way I am happy, because I gained something from it whether it succeeds or not.)

Two: Operating Environment
    • IDE: PyCharm 2017
    • Python 3.6
    • pymongo 3.4.0
    • requests 2.14.2
    • lxml 3.7.2
    • BeautifulSoup 4.5.3
Three: Example Analysis

1. First open the main page for this crawl, http://www.ximalaya.com/dq/all/. Each page shows 12 channels, every channel contains a lot of audio, and some channels span many pages. The crawl plan: loop over all 84 pages, parse each one, and save every channel's name, image link, and channel link to MongoDB.


(Figure: popular channels)

2. Open the browser's developer tools and analyze the page; you will quickly locate the data you want. The following code crawls the information of all the popular channels, which can then be saved to MongoDB.

start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
for start_url in start_urls:
    html = requests.get(start_url, headers=headers1).text
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.find_all(class_="albumfaceOutter"):
        content = {
            'href': item.a['href'],
            'title': item.img['alt'],
            'img_url': item.img['src']
        }
        print(content)
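The snippet above only prints each channel dict; a minimal sketch of how the same records could be persisted with pymongo (the database and collection names match the full code later in this article):

import pymongo

# Connect to a local MongoDB instance (assumes mongod is running on localhost).
clients = pymongo.MongoClient('localhost')
col1 = clients["Ximalaya"]["album"]

# 'content' is one channel dict as built in the loop above.
content = {'href': 'http://www.ximalaya.com/...', 'title': '...', 'img_url': '...'}
col1.insert_one(content)  # pymongo 3.x style; the article's full code uses the older insert()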

(Figure: channel analysis)

3. Next we collect all the audio data inside each channel, using the channel links obtained by the parsing above. For example, enter the link http://www.ximalaya.com/6565682/album/237771 and analyze the page structure. You can see that every audio item has a specific ID, which can be read from an attribute of a div; split() (plus int(), if numeric values are needed) turns that attribute string into individual IDs, as sketched after the figure below.


(Figure: channel page analysis)
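As a tiny illustration of the split()/int() step described in point 3 (the sound_ids string below is a made-up example, not real data):

# The div's sound_ids attribute holds a comma-separated string of audio IDs.
sound_ids = '16838106,16839063,16839064'  # hypothetical value for illustration
ids = [int(s) for s in sound_ids.split(',')]
print(ids)  # [16838106, 16839063, 16839064]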

4. Then click an audio link, open developer mode, refresh the page, and click XHR; click one of the JSON links and you can see the full details of that audio item.

html = requests.get(url, headers=headers2).text
numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
for i in numlist:
    murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
    html = requests.get(murl, headers=headers1).text
    dic = json.loads(html)
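The article does not show what is inside the track JSON, so field names can only be guessed at; a hedged sketch that checks the response status and reads a couple of fields defensively ('title' and 'play_path' are assumptions, not confirmed by the source):

import json
import requests

murl = 'http://www.ximalaya.com/tracks/16838106.json'  # hypothetical track ID
resp = requests.get(murl)  # the article passes custom headers (headers1), omitted here
if resp.status_code == 200:
    dic = json.loads(resp.text)
    # Field names are assumptions about the JSON layout; .get() avoids KeyError.
    print(dic.get('title'), dic.get('play_path'))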

(Figure: audio page analysis)


5. The code above only parses the audio information on a channel's main page; in reality, a channel's audio list is usually spread over many pages.

html = requests.get(url, headers=headers2).text
ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
if len(ifanother):
    num = ifanother[0]
    print('This channel has ' + num + ' pages of resources')
    for n in range(1, int(num)):
        print('Start parsing page {} of {}'.format(n, num))
        url2 = url + '?page={}'.format(n)
        # then call the audio-page parsing function above; the full code follows
                        

(Figure: pagination)

6. Full code
The complete code is available at github.com/rieuse/learnpython

__author__ = '布咕咕_rieuse'

import json
import random
import time
import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient('localhost')
db = clients["Ximalaya"]
col1 = db["album"]
col2 = db["detaile"]

ua_list = []   # many User-Agents, picked at random to avoid bans; omitted here by the author
headers1 = {}  # request headers for the list pages; omitted here by the author
headers2 = {}  # request headers for the channel pages; omitted here by the author


def get_url():
    start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, 'lxml')
        for item in soup.find_all(class_="albumfaceOutter"):
            content = {
                'href': item.a['href'],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            col1.insert(content)
            print('Wrote a channel: ' + item.a['href'])
            print(content)
            another(item.a['href'])
        time.sleep(1)


def another(url):
    html = requests.get(url, headers=headers2).text
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    if len(ifanother):
        num = ifanother[0]
        print('This channel has ' + num + ' pages of resources')
        for n in range(1, int(num)):
            print('Start parsing page {} of {}'.format(n, num))
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    get_m4a(url)


def get_m4a(url):
    time.sleep(1)
    html = requests.get(url, headers=headers2).text
    numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        col2.insert(dic)
        print(murl + ' inserted into MongoDB')


if __name__ == '__main__':
    get_url()
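The author omits ua_list, headers1, and headers2 above. A minimal sketch of the random User-Agent rotation the comment describes (the UA strings below are merely illustrative):

import random

# A couple of illustrative User-Agent strings; in practice use a longer, current list.
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36',
]

def make_headers():
    # Pick a random User-Agent per request to reduce the chance of being banned.
    return {'User-Agent': random.choice(ua_list)}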

7. If you change the crawler to an asynchronous form it becomes faster; only the parts below need to be modified. In my test it fetched nearly 100 more records per minute than the normal version. This source code is also on GitHub.
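The author's asynchronous source is on GitHub and is not reproduced here; as a rough sketch of the idea (not the author's code), here is one way to fetch many track JSON URLs concurrently with asyncio and aiohttp (the aiohttp dependency and the sample IDs are assumptions):

import asyncio
import json
import aiohttp

async def fetch(session, url):
    # Fetch one URL and return the response body as text.
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(sound_ids):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, 'http://www.ximalaya.com/tracks/{}.json'.format(i))
                 for i in sound_ids]
        # Requests run concurrently; results are handled as they complete.
        for coro in asyncio.as_completed(tasks):
            print(json.loads(await coro))

# Hypothetical IDs for illustration only.
loop = asyncio.get_event_loop()
loop.run_until_complete(crawl(['16838106', '16839063']))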


(Figure: the asynchronous version)

Five: Summary

The amount of data captured is around 700,000 records, which leaves plenty of room for follow-up research, such as ranking by playback count, ranking by time period, per-channel audio statistics, and so on. Next I will keep learning to use scientific computing and plotting tools to analyze and clean this data.
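As a starting point for that follow-up analysis, a small sketch of sanity-checking what landed in MongoDB (collection names are taken from the full code above; 'plays_count' is a purely hypothetical field name inside the track JSON):

import pymongo

db = pymongo.MongoClient('localhost')["Ximalaya"]
print('channels:', db["album"].count())   # count() as in the pymongo 3.4 era
print('tracks:', db["detaile"].count())

# Sort by a hypothetical play-count field to get a top-10 ranking.
top = db["detaile"].find().sort([('plays_count', pymongo.DESCENDING)]).limit(10)
for doc in top:
    print(doc.get('title'), doc.get('plays_count'))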

If you run into problems while learning or want to get learning resources, you are welcome to join the learning exchange group 626062078, where we learn Python together!

