Python Data Extraction with BeautifulSoup


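I have been scraping website data a lot recently and use BeautifulSoup frequently, but I am not yet very fluent with it, so I am organizing its main usages here for future reference. The test data in this article comes from the official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

1. BeautifulSoup basic usage

1.1 Introduction to BeautifulSoup

BeautifulSoup is a third-party Python library for extracting data from HTML and XML documents. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.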

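A BeautifulSoup object is typically constructed with two arguments. The first is the HTML text string to be parsed. The second tells BeautifulSoup which parser to use for the HTML (for example Python's built-in html.parser, or the third-party parsers lxml and html5lib).
A BeautifulSoup object is constructed as follows:

soup = BeautifulSoup(html_doc, 'lxml')

1.2 Pretty-printing an HTML document

The code is as follows: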

# -*- coding: utf-8 -*-
"""
Created on Thu May  4 13:56:00 2017
@author: zch
"""
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())

The pretty-printed output is as follows:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

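Section 1.1 notes that the second constructor argument selects the parser. As a minimal sketch of what that choice can look like in practice, the snippet below tries lxml first and falls back to Python's built-in html.parser when lxml is not installed. The helper name make_soup and the short sample string are hypothetical and not part of the original article; the sketch relies on bs4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: use the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


# Short, well-formed sample used only for this illustration.
sample = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>...</p></body></html>"
)

soup = make_soup(sample)
title_text = soup.title.string
print(title_text)
```

Different parsers can build slightly different trees from invalid markup, so it is probably safest to pick one parser per project and keep it fixed.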
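1.3 Ways to navigate the structured data

(1) Get the attributes of the HTML document's title

(2) Get the attributes of an HTML hyperlink (a)

(3) Get the attributes of an HTML paragraph (p)

(4) Look up matching items in the HTML with the find method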
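The four approaches listed above can be sketched roughly as follows, using the same test document. This is an illustration rather than the author's original code: it assumes the built-in html.parser (so that no third-party parser is required), and the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Assumption: html.parser is used here instead of lxml to avoid an extra dependency.
soup = BeautifulSoup(html_doc, "html.parser")

# (1) The title tag: its name, its text, and the name of its parent tag.
title_tag = soup.title
print(title_tag.name, title_tag.string, title_tag.parent.name)

# (2) The first hyperlink: attributes are read like dictionary keys.
first_link = soup.a
print(first_link["href"], first_link["id"], first_link["class"])

# (3) The first paragraph: "class" is multi-valued, so it comes back as a list.
first_p = soup.p
print(first_p["class"], first_p.get_text())

# (4) find returns the first tag that matches the given filter.
tillie = soup.find(id="link3")
print(tillie.get_text())
```

Note that soup.a and soup.p return only the first matching tag; find_all is the usual way to get every match.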

2. BeautifulSoup example tests

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu May  4 15:11:23 2017
@author: zch
"""
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print('Test 1: get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Test 2: get a link via regex match')
link_node = soup.find('a', href=re.compile(r"cie"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Test 3: get the story text')
p_text = soup.find('p', class_='story')
print(p_text.name, p_text.get_text())
# print(soup.p.get_text())

The test results are shown in the figure below:

[Figure: screenshot of the console output of the three tests]
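For readers who want to check the three tests without running the full script, the sketch below reruns them and stores each result in a variable. It is an illustration under stated assumptions, not the author's code: it uses the built-in html.parser instead of lxml, the variable names are hypothetical, and whitespace in the story text is collapsed to single spaces for readability.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Assumption: html.parser is used here so that lxml does not need to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

# Test 1: every <a> tag as a (tag name, href, text) tuple.
all_links = [(a.name, a["href"], a.get_text()) for a in soup.find_all("a")]

# Test 2: the first <a> whose href matches the regular expression "cie".
link_node = soup.find("a", href=re.compile(r"cie"))
matched_link = (link_node.name, link_node["href"], link_node.get_text())

# Test 3: the first <p class="story">, with runs of whitespace collapsed.
story = soup.find("p", class_="story")
story_text = " ".join(story.get_text().split())

for row in all_links:
    print(*row)
print(*matched_link)
print(story.name, story_text)
```

Under these assumptions, test 2 matches only the Lacie link, because "cie" occurs in http://example.com/lacie but not in the other two URLs.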
