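Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify a document, and it can save hours of work. pymongo is the bridge between the MongoDB NoSQL database and Python; data is stored in MongoDB through pymongo. This article combines the two to scrape data from the Ximalaya radio site.
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. This article uses lxml; for installation, see "python 3.6 lxml: installing lxml and notes on using etree".
This article also uses XPath to pick out the parts we want. For an introduction to XPath and Beautiful Soup and how to use them, see the Beautiful Soup 4.4.0 documentation and "XPath Introduction".
The Beautiful Soup and XPath knowledge involved here is not deep. The official documentation is enough to follow along, and I have added comments to the code as well.
As for pymongo, I won't ramble on; for details, see "Python libraries: a first experience with the pymongo module".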
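Sometimes a server needs to know what kind of client is sending it a request. That client is what is usually called the User-Agent, or UA for short. The browser we use to view web pages is one kind of UA. In HTTP, the User-Agent request header identifies the user's browser type, operating system, browser engine, and similar information. With this identifier, the site being visited can serve different versions of a page to give users a better experience, or collect statistics. Some sites also use the UA to keep attackers, or bored people like us, from scraping their data.
So the code in this article starts by listing a pool of UA strings, to make the later scraping easier.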
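A minimal sketch of the UA idea, using only the standard library. The helper name build_headers and the shortened pool UA_POOL are assumptions for illustration (the two strings are copied from the list used in this article). Choosing the UA inside a function means every call can present a different one:

```python
import random

# Shortened, illustrative pool; the two strings are copied from this article's list.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_headers(ua_pool, referer=None):
    """Return request headers with a User-Agent picked at random from ua_pool.

    Hypothetical helper: the pick happens on every call, not once at import time.
    """
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "User-Agent": random.choice(ua_pool),
    }
    if referer is not None:
        headers["Referer"] = referer  # only sent when the caller supplies one
    return headers


headers = build_headers(UA_POOL, referer="http://www.ximalaya.com/dq/all/2")
```

The resulting dictionary can be passed as the headers argument of requests.get.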
Now, let's be clear about what data we want to scrape:
We need the image link, the alt text, and so on.
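To make those fields concrete, here is a dependency-free sketch that pulls href, alt, and src out of one album tile. It uses the standard library's html.parser as a stand-in for the Beautiful Soup lookup in the full program. The SAMPLE markup is made up, and the class AlbumTileParser is a hypothetical name; only the class name albumfaceOutter and the three field names come from the article's code.

```python
from html.parser import HTMLParser

# Made-up markup shaped like one album tile on a listing page.
SAMPLE = """
<div class="albumfaceOutter">
  <a href="http://www.ximalaya.com/15836959/album/303085">
    <img src="http://example.com/cover.jpg" alt="Sample album">
  </a>
</div>
"""


class AlbumTileParser(HTMLParser):
    """Collect href, title (alt) and img_url (src) from albumfaceOutter elements."""

    def __init__(self):
        super().__init__()
        self.albums = []       # finished records
        self._current = None   # record being filled, None when outside a tile
        self._tile_tag = None  # tag name of the tile element
        self._depth = 0        # nesting depth of same-named tags inside the tile

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if "albumfaceOutter" in (attrs.get("class") or "").split():
                self._current, self._tile_tag, self._depth = {}, tag, 1
            return
        if tag == self._tile_tag:
            self._depth += 1
        elif tag == "a" and "href" not in self._current:
            self._current["href"] = attrs.get("href")
        elif tag == "img" and "img_url" not in self._current:
            self._current["title"] = attrs.get("alt")
            self._current["img_url"] = attrs.get("src")

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tile_tag:
            self._depth -= 1
            if self._depth == 0:  # the tile element itself just closed
                self.albums.append(self._current)
                self._current = None


parser = AlbumTileParser()
parser.feed(SAMPLE)
albums = parser.albums
```

Each entry in albums has the same three keys that the full program stores in the album collection.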
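Then we follow the image link and fetch the details behind it. If a station spans multiple pages, we use XPath to visit them one by one. On each page we also fetch the sound_id values from the sound module of the album.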
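The paging and track lookup can be sketched without any network access. The album URL and the tracks URL pattern are the ones that appear in this article's code; the helper names page_urls and track_urls, the page count 3, and the id string "111,222,333" are made-up values for illustration.

```python
# URL shapes as they appear in this article's code.
ALBUM_URL = "http://www.ximalaya.com/15836959/album/303085"
TRACK_URL = "http://www.ximalaya.com/tracks/{}.json"


def page_urls(album_url, last_page):
    """Return one URL per page, from page 1 through last_page inclusive."""
    return ["{}?page={}".format(album_url, n) for n in range(1, last_page + 1)]


def track_urls(sound_ids):
    """Turn a comma-separated sound_ids attribute value into per-track JSON URLs."""
    ids = [s.strip() for s in sound_ids.split(",") if s.strip()]
    return [TRACK_URL.format(i) for i in ids]


pages = page_urls(ALBUM_URL, 3)       # sample page count
tracks = track_urls("111,222,333")    # sample attribute value
```

Note the + 1 in range: without it the last page would be skipped.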
The full program is as follows:
import json
import random

import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient("localhost", 27017)
db = clients["XiMaLaYa"]
collection_1 = db["album"]   # one document per album found on the listing pages
collection_2 = db["detail"]  # one document per track

# Pool of User-Agent strings to choose from.
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
]

# Note: random.choice runs once when each dict is built, so a given run
# keeps the same User-Agent in headers1 and in headers2 throughout.
headers1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(UA_LIST)  # User-Agent identifies the client
}
headers2 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://www.ximalaya.com/dq/all/2',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(UA_LIST)
}


# Beautiful Soup handles XML and HTML.
# BeautifulSoup processes the HTML source fetched by requests;
# lxml parses the HTML source into a tree, and XPath selects nodes in that tree.
def get_url():
    start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1, 85)]
    # start_urls = ["http://www.ximalaya.com/dq/all/1"]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, "lxml")  # parse with lxml
        for item in soup.find_all(class_="albumfaceOutter"):  # find the album nodes
            content = {
                'href': item.a["href"],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            # insert_one replaces Collection.insert, which was removed in PyMongo 4
            collection_1.insert_one(content)
            # another(item.a["href"])
    print('Write complete...')


# Open a station's own page, e.g. http://www.ximalaya.com/15836959/album/303085,
# and handle recordings that are split across pages.
def another(url):
    html = requests.get(url, headers=headers1).text
    # /  : selects from the root node
    # // : selects matching nodes anywhere in the document, regardless of position
    # ifanother is a list holding the number of the last page, if there is a paging bar
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    if len(ifanother):  # the recordings are split across several pages
        num = ifanother[0]  # number of pages
        print('This channel spans ' + num + ' pages')
        for n in range(1, int(num) + 1):  # + 1 so that the last page is included
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    else:
        get_m4a(url)  # single page: no paging bar


# Fetch the details of every recording listed on one page.
def get_m4a(url):
    html = requests.get(url, headers=headers2).text
    sound_ids = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')
    if not sound_ids:  # no track list on this page, nothing to store
        return
    numlist = sound_ids[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        collection_2.insert_one(dic)


if __name__ == "__main__":
    get_url()