Python crawler for audio data

Source: Internet
Author: User

I. Preface

This crawler scrapes the information of every channel listed under Ximalaya's popular category, together with the details of every audio track in each channel, and saves the results to MongoDB for later use. The data volume this time is about 700,000 records. The audio data includes the audio details, channel information, and introductions.
I had my first interview yesterday, with an AI big-data company, and I plan to intern there this summer. They need to crawl audio data, so I decided to analyze Ximalaya's audio data in advance. I am still waiting for the third-round interview or the final notification. (I wanted to be prepared either way, whether it works out or not.)

II. Runtime Environment
  • IDE: PyCharm 2017
  • Python 3.6
  • pymongo 3.4.0
  • requests 2.14.2
  • lxml 3.7.2
  • BeautifulSoup 4.5.3
III. Example Analysis

1. First open the start page http://www.ximalaya.com/dq/all/. Each page shows 12 channels, each channel contains many audio tracks, and some channels span many pages. Crawling plan: loop over the 84 pages, parse each page, capture each channel's name, image link, and channel link, and save them to MongoDB.


Popular Channels

2. Open the browser's developer tools and analyze the page to quickly locate the data you want. The following code captures the information of all popular channels and stores it in MongoDB.

start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
for start_url in start_urls:
    html = requests.get(start_url, headers=headers1).text
    soup = BeautifulSoup(html, 'lxml')
    for item in soup.find_all(class_="albumfaceOutter"):
        content = {
            'href': item.a['href'],
            'title': item.img['alt'],
            'img_url': item.img['src']
        }
        print(content)
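The snippet above only prints each channel record; storing it in MongoDB takes one extra pymongo call, using the same connection and collection names as the full code in section 6:

import pymongo

# Connect to a local MongoDB instance and select the database and collection
clients = pymongo.MongoClient('localhost')
db = clients["XiMaLaYa"]
col1 = db["album"]

col1.insert(content)  # insert the channel dict built in the loop above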

Channel analysis

3. The following figure shows how to obtain all the audio data of each channel. The channel links are obtained by parsing the listing page. For example, open http://www.ximalaya.com/6565682/album/237771 and analyze its page structure. Each audio track has a specific ID, which can be found in an attribute of a div. Use split() and int() to convert the attribute into individual IDs.
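As a small illustration (the attribute value here is made up), split() and int() turn the comma-separated string into numeric IDs:

# Hypothetical value of the div's sound_ids attribute
sound_ids = "16196651,16196650,16196649"
id_list = [int(x) for x in sound_ids.split(',')]  # -> [16196651, 16196650, 16196649]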


Channel page analysis

4. Click an audio link, open developer tools, refresh the page, click XHR, and then click a JSON link to view all the details of that audio track.

html = requests.get(url, headers=headers2).text
numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
for i in numlist:
    murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
    html = requests.get(murl, headers=headers1).text
    dic = json.loads(html)

Audio page analysis


5. The code above only parses the audio information on the first page of a channel, but a channel's audio list can actually span many pages.

html = requests.get(url, headers=headers2).text
ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
if len(ifanother):
    num = ifanother[0]
    print('The current channel has ' + num + ' pages of resources')
    for n in range(1, int(num)):
        print('Start parsing page {} of {}'.format(n, num))
        url2 = url + '?page={}'.format(n)
        # Then call the function that parses the audio page; the full code is given below.

Paging

6. All code
Complete Code address: github.com/rieuse/learnPython

__author__ = 'giggle_rieuse'

import json
import random
import time

import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient('localhost')
db = clients["XiMaLaYa"]
col1 = db["album"]
col2 = db["detaile"]

UA_LIST = []  # many User-Agents, chosen at random to avoid being banned; not shown here
headers1 = {}  # request headers for the listing pages; not shown here
headers2 = {}  # request headers for the channel pages; not shown here


def get_url():
    start_urls = ['http://www.ximalaya.com/dq/all/{}'.format(num) for num in range(1, 85)]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, 'lxml')
        for item in soup.find_all(class_="albumfaceOutter"):
            content = {
                'href': item.a['href'],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            col1.insert(content)
            print('Wrote a channel: ' + item.a['href'])
            print(content)
            another(item.a['href'])
        time.sleep(1)


def another(url):
    html = requests.get(url, headers=headers2).text
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    if len(ifanother):
        num = ifanother[0]
        print('The current channel has ' + num + ' pages of resources')
        for n in range(1, int(num)):
            print('Start parsing page {} of {}'.format(n, num))
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    get_m4a(url)


def get_m4a(url):
    time.sleep(1)
    html = requests.get(url, headers=headers2).text
    numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        col2.insert(dic)
        print(murl + ' data has been successfully inserted into MongoDB')


if __name__ == '__main__':
    get_url()

7. If you switch the requests to asynchronous mode, only a few modifications are needed, as shown in the sketch below; in my test it fetched nearly 100 records per minute. This source code is also on GitHub.
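The asynchronous version appeared only as an image in the source article; below is a minimal sketch, not the author's exact code, of how the per-track JSON requests could be made concurrent with asyncio and aiohttp (headers1 and the ID list are assumed to come from the code above):

import asyncio
import json

import aiohttp

headers1 = {}  # request headers, same as in the code above (left empty here)


async def fetch_track(session, track_id):
    # Fetch one track's JSON detail page
    murl = 'http://www.ximalaya.com/tracks/{}.json'.format(track_id)
    async with session.get(murl, headers=headers1) as resp:
        return json.loads(await resp.text())


async def fetch_all(track_ids):
    # Issue all track requests concurrently and gather the parsed results
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_track(session, tid) for tid in track_ids]
        return await asyncio.gather(*tasks)

# Usage (Python 3.6): results = asyncio.get_event_loop().run_until_complete(fetch_all(numlist))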


Asynchronous crawling

IV. Summary

The amount of data captured this time is about 700,000 records. The data can be studied later, for example by ranking channels, ranking time segments, and counting the audio tracks per channel. Next, I will continue learning to use scientific computing and plotting tools for data analysis and cleaning.

If you have any questions during learning or want to get learning resources, join the learning exchange group 626062078 and let's learn Python together!
