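I have been scraping website data a lot lately, which means frequent use of BeautifulSoup, but I am not yet very fluent with it. This post goes through BeautifulSoup's main usage for future reference. The test data comes from the official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
1. Basic usage of BeautifulSoup
1.1 Introduction to BeautifulSoup
BeautifulSoup is a third-party Python library for extracting data from HTML or XML pages. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.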
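A BeautifulSoup object is normally built from two arguments: the first is the HTML text string to parse, and the second tells BeautifulSoup which parser to use (for example Python's built-in html.parser, or the third-party parsers lxml and html5lib).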
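To make the role of the second argument concrete, here is a minimal sketch that tries lxml first and falls back to the built-in html.parser when lxml is not installed. The helper name `make_soup` and the tiny `snippet` string are assumptions made for illustration; they are not part of BeautifulSoup or of the original test data.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is available."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name)
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is available")


# A deliberately small document, used only for this illustration.
snippet = "<p class='title'><b>The Dormouse's story</b></p>"

soup = make_soup(snippet)

# The parser choice affects how the tree is built, not how it is queried.
bold_text = soup.b.get_text()
print(bold_text)
```

Because html.parser ships with Python, keeping it last in the list means the helper should always find a usable parser.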
In this article the BeautifulSoup object is built as follows:
soup = BeautifulSoup(html_doc, 'lxml')
1.2 Pretty-printing an HTML document
The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Thu May 4 13:56:00 2017

@author: zch
"""

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the document with lxml and print it with one tag or string per line
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())
The formatted output is as follows:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
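1.3 Ways to navigate the structured data
(1) Get the attributes of the HTML document's title
(2) Get the attributes of the HTML hyperlinks (a)
(3) Get the attributes of the HTML paragraphs (p)
(4) Look up matching items in the HTML with the find method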
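The four approaches listed above can be sketched roughly as follows, reusing the same test document. The built-in html.parser is used here only so the sketch runs without extra installs, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# (1) The title tag: its tag name, its text, and the name of its parent tag
title_name = soup.title.name
title_text = soup.title.string
title_parent = soup.title.parent.name

# (2) Attribute access such as soup.a returns only the FIRST <a> tag
first_href = soup.a["href"]
first_link_class = soup.a["class"]  # class is multi-valued, so this is a list

# (3) Likewise soup.p is the first <p> tag in the document
first_p_class = soup.p["class"]
first_p_text = soup.p.get_text()

# (4) find() returns the first tag matching the given filters
link3 = soup.find(id="link3")

print(title_name, title_text, title_parent)
print(first_href, first_link_class)
print(first_p_class, first_p_text)
print(link3.get_text())
```

Note that dotted access (soup.a, soup.p) only ever reaches the first matching tag; find_all is the usual way to collect every match.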
2. BeautifulSoup worked example
The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Thu May 4 15:11:23 2017

@author: zch
"""

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print('Test 1: get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Test 2: get a link by regular-expression match')
# find() returns the first <a> whose href contains "cie"
link_node = soup.find('a', href=re.compile(r"cie"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Test 3: get the story text')
# class_ (with a trailing underscore) is used because class is a Python keyword
p_text = soup.find('p', class_='story')
print(p_text.name, p_text.get_text())
# print(soup.p.get_text())
The test results are shown in the figure below:
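As a cross-check on the three tests, the following minimal sketch reruns the same queries but collects the results into variables instead of printing them, so each one can be inspected or asserted on. It assumes the built-in html.parser rather than lxml, and the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Test 1: one (tag name, href, text) tuple per link, in document order
all_links = [(a.name, a["href"], a.get_text()) for a in soup.find_all("a")]

# Test 2: of the three hrefs, only ".../lacie" contains the substring "cie"
regex_link = soup.find("a", href=re.compile(r"cie"))

# Test 3: find() stops at the first <p class="story">;
# find_all() also returns the second one, whose text is "..."
first_story = soup.find("p", class_="story")
all_stories = soup.find_all("p", class_="story")

for row in all_links:
    print(*row)
print(regex_link["href"])
print(len(all_stories))
```

One design point worth noticing: Test 3 in the original uses find, so only the first story paragraph is printed. Switching to find_all would be the way to get both.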