A summary of scraping Ximalaya FM with the Python libraries Beautiful Soup and MongoDB

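Beautiful Soup is a third-party Python library for extracting data from HTML/XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document, and it can save hours of work. pymongo is the bridge between the MongoDB NoSQL database and Python: data is stored in MongoDB through pymongo. This article combines the two to scrape data from Ximalaya FM...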

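Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. This article uses lxml; for installation, see the post "Installing lxml on Python 3.6 and notes on using etree".
This article also uses XPath to pick out the parts of the page we want. For an introduction to Beautiful Soup and XPath, see the Beautiful Soup 4.4.0 documentation and an XPath introduction.
The Beautiful Soup and XPath knowledge involved here is not deep. Reading the official documentation is enough to follow it, and I have added comments to the code as well...
As for pymongo, I won't go on about it here; for details see the post "A first look at Python's pymongo module".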

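Sometimes a server needs to determine what kind of client is sending it a request. That is the job of the User-Agent, or UA for short. The browser we use to view web pages is one kind of UA. In HTTP, the User-Agent request header carries an identifier describing the browser type, the operating system, the browser engine, and similar details. Using this identifier, the site being visited can serve different versions of a page to give users a better experience, or collect statistics. Some sites also rely on the UA to keep out hackers, or idle people like us, who want to scrape their data.
For that reason, the code in this article starts by listing a pool of UA strings to make the scraping that follows easier.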
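The UA pool can also be used to choose a fresh User-Agent for every request. The following is a minimal sketch of that idea, not the article's own code: `UA_POOL` is a shortened, hypothetical pool and `build_headers` is an illustrative helper name.

```python
import random

# Hypothetical, shortened pool for illustration; a real pool would be longer.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_headers(referer=None):
    """Return request headers with a User-Agent drawn at random from UA_POOL."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "User-Agent": random.choice(UA_POOL),
    }
    if referer is not None:
        # Only send a Referer when the caller supplies one.
        headers["Referer"] = referer
    return headers


if __name__ == "__main__":
    # Each call may report a different client to the server.
    print(build_headers()["User-Agent"])
```

Calling `build_headers()` right before each request means the UA is drawn again every time, instead of being fixed once when a headers dictionary is first created.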

Now let's be clear about what data we want to scrape:


What we need is each image's link, its alt text, and so on.
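As a rough illustration of pulling those fields out with Beautiful Soup, here is a small sketch. The markup in `SAMPLE_HTML` is a hypothetical stand-in modelled on the attributes the script reads, and the real page may look different. `extract_albums` is an illustrative name, and the sketch uses the built-in `html.parser` backend so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name and attributes mirror the script.
SAMPLE_HTML = """
<div class="albumfaceOutter">
  <a href="http://www.ximalaya.com/15836959/album/303085">
    <img alt="Example album" src="http://example.com/cover.jpg">
  </a>
</div>
"""


def extract_albums(html):
    """Collect link, title (alt text), and cover URL for each album block."""
    soup = BeautifulSoup(html, "html.parser")
    albums = []
    for item in soup.find_all(class_="albumfaceOutter"):
        albums.append({
            "href": item.a["href"],      # first <a> inside the block
            "title": item.img["alt"],    # alt text of the first <img>
            "img_url": item.img["src"],  # cover image address
        })
    return albums


if __name__ == "__main__":
    for album in extract_albums(SAMPLE_HTML):
        print(album)
```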

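After following the image link, we fetch the details on the page behind it. If a station spans several pages, we use XPath to visit each page in turn. We also collect the sound_ids from the sound module of the album on each page...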
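The pagination and sound_ids steps can be sketched on a tiny document. This is an illustration under stated assumptions: the markup in `SAMPLE` is hypothetical, the second-to-last paging link is assumed to carry the last page number, and the function names are made up for the example. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed input, in place of lxml so that nothing has to be installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an album page.
SAMPLE = """
<html><body>
  <div class="personal_body" sound_ids="101,102,103"></div>
  <div class="pagingBar_wrapper">
    <a data-page="1">1</a>
    <a data-page="2">2</a>
    <a data-page="3">3</a>
    <a data-page="2">next</a>
  </div>
</body></html>
"""


def last_page(root):
    """Read the page count from the second-to-last paging link (1 if no pager)."""
    link = root.find('.//div[@class="pagingBar_wrapper"]/a[last()-1]')
    return int(link.get("data-page")) if link is not None else 1


def page_urls(album_url, pages):
    """Build one URL per page, from page 1 through the last page inclusive."""
    return ["{}?page={}".format(album_url, n) for n in range(1, pages + 1)]


def track_urls(root):
    """Turn the comma-separated sound_ids attribute into per-track JSON URLs."""
    body = root.find('.//div[@class="personal_body"]')
    if body is None:
        return []
    ids = [s for s in body.get("sound_ids", "").split(",") if s]
    return ["http://www.ximalaya.com/tracks/{}.json".format(i) for i in ids]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    print(page_urls("http://www.ximalaya.com/15836959/album/303085", last_page(root)))
    print(track_urls(root))
```

With lxml the same lookups would be written as full XPath expressions ending in `/@data-page` and `/@sound_ids`, which ElementTree does not support; here the attribute is read with `.get()` instead.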

The program is as follows:

import json
import random

import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient("localhost", 27017)
db = clients["XiMaLaYa"]
collection_1 = db["album"]
collection_2 = db["detail"]

UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

headers1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(UA_LIST)  # User-Agent identifies the client
}

headers2 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://www.ximalaya.com/dq/all/2',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(UA_LIST)
}


# Beautiful Soup is used to process XML and HTML...
# BeautifulSoup parses the HTML source fetched by the requests module.
# lxml parses the HTML source into a tree, and XPath selects nodes in that tree.
def get_url():
    start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1, 85)]
    # start_urls = ["http://www.ximalaya.com/dq/all/1"]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, "lxml")  # parse with lxml
        for item in soup.find_all(class_="albumfaceOutter"):  # find the album nodes
            content = {
                'href': item.a["href"],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            # insert_one replaces Collection.insert, which PyMongo 4 no longer provides
            collection_1.insert_one(content)
            # another(item.a["href"])
    print('Write complete...')


# Enter a station's own page, e.g. http://www.ximalaya.com/15836959/album/303085,
# and handle recordings that are split across pages...
def another(url):
    html = requests.get(url, headers=headers1).text
    # /  : select from the root node
    # // : select matching nodes anywhere in the document, regardless of position
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')  # last page number; ifanother is a list
    if len(ifanother):  # the album's recordings are split across several pages
        num = ifanother[0]  # number of pages
        print('This channel is stored on ' + num + ' pages')
        for n in range(1, int(num) + 1):  # visit every page, including the last one
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    else:
        get_m4a(url)  # single-page album


# Fetch the details of each recording listed on one page...
def get_m4a(url):
    html = requests.get(url, headers=headers2).text
    numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        collection_2.insert_one(dic)


if __name__ == "__main__":
    get_url()