Python Crawler: Ximalaya Audio Data


One: Preface

This project crawls information about every channel in the popular section of Ximalaya across the whole site, together with every audio item inside each channel, and saves the crawled data to MongoDB for later use. The data volume is about 700,000 records, covering audio files, channel information, descriptions, and more.
Yesterday I had the first interview of my life, with an artificial intelligence and big-data company, for the summer internship of my sophomore year. They asked whether I had ever crawled audio data, so I decided to crawl Ximalaya's audio data and analyze it. At the moment I am still waiting for the third-round interview, or rather for the message with the final interview result. (Either way I am happy, because I gained something from it whether it succeeds or not.)

Two: Operating Environment
    • IDE: PyCharm 2017
    • Python 3.6
    • pymongo 3.4.0
    • requests 2.14.2
    • lxml 3.7.2
    • BeautifulSoup 4.5.3
Three: Example Analysis

1. First open the main page for this crawl, http://www.ximalaya.com/dq/all/. Each page shows 12 channels, every channel contains a lot of audio, and some channels span many pages. The crawl plan: loop over all 84 pages, parse each one, and save every channel's name, image link, and channel link to MongoDB.


(Figure: popular channels)

2. Open the browser's developer tools and analyze the page; you will quickly locate the data you want. The following code crawls the information of all the popular channels, which can then be saved to MongoDB.

start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
for start_url in start_urls:
    html = requests.get(start_url, headers=headers1).text
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.find_all(class_="albumfaceOutter"):
        content = {
            'href': item.a['href'],
            'title': item.img['alt'],
            'img_url': item.img['src']
        }
        print(content)
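The snippet above only prints each channel dict; a minimal sketch of how the same records could be persisted with pymongo (the database and collection names match the full code later in this article):

import pymongo

# Connect to a local MongoDB instance (assumes mongod is running on localhost).
clients = pymongo.MongoClient('localhost')
col1 = clients["Ximalaya"]["album"]

# 'content' is one channel dict as built in the loop above.
content = {'href': 'http://www.ximalaya.com/...', 'title': '...', 'img_url': '...'}
col1.insert_one(content)  # pymongo 3.x style; the article's full code uses the older insert()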

(Figure: channel analysis)

3. Next we collect all the audio data inside each channel, using the channel links obtained by the parsing above. For example, enter the link http://www.ximalaya.com/6565682/album/237771 and analyze the page structure. You can see that every audio item has a specific ID, which can be read from an attribute of a div; split() (plus int(), if numeric values are needed) turns that attribute string into individual IDs, as sketched after the figure below.


(Figure: channel page analysis)
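As a tiny illustration of the split()/int() step described in point 3 (the sound_ids string below is a made-up example, not real data):

# The div's sound_ids attribute holds a comma-separated string of audio IDs.
sound_ids = '16838106,16839063,16839064'  # hypothetical value for illustration
ids = [int(s) for s in sound_ids.split(',')]
print(ids)  # [16838106, 16839063, 16839064]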

4. Then click an audio link, open developer mode, refresh the page, and click XHR; click one of the JSON links and you can see the full details of that audio item.

html = requests.get(url, headers=headers2).text
numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
for i in numlist:
    murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
    html = requests.get(murl, headers=headers1).text
    dic = json.loads(html)
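The article does not show what is inside the track JSON, so field names can only be guessed at; a hedged sketch that checks the response status and reads a couple of fields defensively ('title' and 'play_path' are assumptions, not confirmed by the source):

import json
import requests

murl = 'http://www.ximalaya.com/tracks/16838106.json'  # hypothetical track ID
resp = requests.get(murl)  # the article passes custom headers (headers1), omitted here
if resp.status_code == 200:
    dic = json.loads(resp.text)
    # Field names are assumptions about the JSON layout; .get() avoids KeyError.
    print(dic.get('title'), dic.get('play_path'))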

(Figure: audio page analysis)


5. The code above only parses the audio information on a channel's main page; in reality, a channel's audio list is usually spread over many pages.

html = requests.get(url, headers=headers2).text
ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
if len(ifanother):
    num = ifanother[0]
    print('This channel has ' + num + ' pages of resources')
    for n in range(1, int(num)):
        print('Start parsing page {} of {}'.format(n, num))
        url2 = url + '?page={}'.format(n)
        # then call the audio-page parsing function above; the full code follows
                        

(Figure: pagination)

6. Full code
The complete code is available at github.com/rieuse/learnpython

__author__ = '布咕咕_rieuse'

import json
import random
import time
import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient('localhost')
db = clients["Ximalaya"]
col1 = db["album"]
col2 = db["detaile"]

ua_list = []   # many User-Agents, picked at random to avoid bans; omitted here by the author
headers1 = {}  # request headers for the list pages; omitted here by the author
headers2 = {}  # request headers for the channel pages; omitted here by the author


def get_url():
    start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, 'lxml')
        for item in soup.find_all(class_="albumfaceOutter"):
            content = {
                'href': item.a['href'],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            col1.insert(content)
            print('Wrote a channel: ' + item.a['href'])
            print(content)
            another(item.a['href'])
        time.sleep(1)


def another(url):
    html = requests.get(url, headers=headers2).text
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    if len(ifanother):
        num = ifanother[0]
        print('This channel has ' + num + ' pages of resources')
        for n in range(1, int(num)):
            print('Start parsing page {} of {}'.format(n, num))
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    get_m4a(url)


def get_m4a(url):
    time.sleep(1)
    html = requests.get(url, headers=headers2).text
    numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        col2.insert(dic)
        print(murl + ' inserted into MongoDB')


if __name__ == '__main__':
    get_url()
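The author omits ua_list, headers1, and headers2 above. A minimal sketch of the random User-Agent rotation the comment describes (the UA strings below are merely illustrative):

import random

# A couple of illustrative User-Agent strings; in practice use a longer, current list.
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36',
]

def make_headers():
    # Pick a random User-Agent per request to reduce the chance of being banned.
    return {'User-Agent': random.choice(ua_list)}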

7. If you change the crawler to an asynchronous form it becomes faster; only the parts below need to be modified. In my test it fetched nearly 100 more records per minute than the normal version. This source code is also on GitHub.
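The author's asynchronous source is on GitHub and is not reproduced here; as a rough sketch of the idea (not the author's code), here is one way to fetch many track JSON URLs concurrently with asyncio and aiohttp (the aiohttp dependency and the sample IDs are assumptions):

import asyncio
import json
import aiohttp

async def fetch(session, url):
    # Fetch one URL and return the response body as text.
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(sound_ids):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, 'http://www.ximalaya.com/tracks/{}.json'.format(i))
                 for i in sound_ids]
        # Requests run concurrently; results are handled as they complete.
        for coro in asyncio.as_completed(tasks):
            print(json.loads(await coro))

# Hypothetical IDs for illustration only.
loop = asyncio.get_event_loop()
loop.run_until_complete(crawl(['16838106', '16839063']))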


(Figure: the asynchronous version)

Five: Summary

The amount of data captured is around 700,000 records, which leaves plenty of room for follow-up research, such as ranking by playback count, ranking by time period, per-channel audio statistics, and so on. Next I will keep learning to use scientific computing and plotting tools to analyze and clean this data.
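As a starting point for that follow-up analysis, a small sketch of sanity-checking what landed in MongoDB (collection names are taken from the full code above; 'plays_count' is a purely hypothetical field name inside the track JSON):

import pymongo

db = pymongo.MongoClient('localhost')["Ximalaya"]
print('channels:', db["album"].count())   # count() as in the pymongo 3.4 era
print('tracks:', db["detaile"].count())

# Sort by a hypothetical play-count field to get a top-10 ranking.
top = db["detaile"].find().sort([('plays_count', pymongo.DESCENDING)]).limit(10)
for doc in top:
    print(doc.get('title'), doc.get('plays_count'))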

If you run into problems while learning or want to get learning resources, you are welcome to join the learning exchange group 626062078, where we learn Python together!

