Learning Python Modules: bs4

Last updated: 2015-04-07 Source: Internet


1. Installing bs4

I am using Ubuntu 14.04, where the apt-get command is all that is needed:

sudo apt-get install python-bs4

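2. Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library, and it also supports several third-party parsers. One of these is lxml.

sudo apt-get install python-lxml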
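After installing, it can be useful to confirm which packages Python can actually see. The following is a minimal sketch using only the standard library; the helper name parser_availability and the fallback rule are illustrative assumptions, not part of Beautiful Soup.

```python
import importlib.util


def parser_availability():
    """Report whether bs4 and lxml can be imported in this interpreter."""
    modules = ["bs4", "lxml"]
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = parser_availability()
for name, available in status.items():
    print(name, "is installed" if available else "is missing")

# html.parser ships with Python, so it is always available as a fallback.
preferred_parser = "lxml" if status["lxml"] else "html.parser"
print("parser name to pass to BeautifulSoup:", preferred_parser)
```

The chosen name can then be passed as the second argument to the BeautifulSoup constructor.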

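3. How to use it

Pass a document to the BeautifulSoup constructor to get an object that represents the document. You can pass in either a string or an open file handle.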

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
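As a hedged, self-contained variant of the two constructor calls, the sketch below names the parser explicitly ("html.parser", the one bundled with Python) and uses io.StringIO as a stand-in for a real file handle, so no index.html is needed. The variable names are illustrative.

```python
import io

from bs4 import BeautifulSoup

# Naming the parser explicitly keeps the result the same on every machine.
soup_from_string = BeautifulSoup("<html>data</html>", "html.parser")

# Any object with a read() method works as a file handle. io.StringIO
# stands in for open("index.html") here.
fake_file = io.StringIO("<html><p>from a file</p></html>")
soup_from_file = BeautifulSoup(fake_file, "html.parser")

print(soup_from_string.html.string)
print(soup_from_file.p.string)
```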

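4. Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is one of four kinds of object: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document: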

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

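Every tag has a name, available as .name:

tag.name
# u'b'

A tag may have any number of attributes. Because class can hold several values, Beautiful Soup returns it as a list.

tag['class']
# [u'boldest']

tag.attrs
# {u'class': [u'boldest']}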
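A short sketch of reading and changing attributes, assuming the standard-library parser. The id="intro" attribute is added here purely for illustration and does not appear in the example above.

```python
from bs4 import BeautifulSoup

markup = '<b class="boldest" id="intro">Extremely bold</b>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.b

name = tag.name            # the tag name as a string
classes = tag["class"]     # class is multi-valued, so this is a list
missing = tag.get("href")  # get() returns None instead of raising KeyError

# Attributes behave like dictionary entries: they can be set and deleted.
tag["id"] = "summary"
del tag["class"]
print(tag)
```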

NavigableString

Text is usually contained inside a tag. Beautiful Soup uses the NavigableString class to hold it:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
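The sketch below, which assumes the standard-library parser, shows that a NavigableString behaves like an ordinary Python string while still remembering where it sits in the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
text = soup.b.string

print(type(text).__name__)
print(text.upper())  # ordinary string methods work on it

# str() gives a plain string that no longer points back into the tree.
plain = str(text)

# The NavigableString itself still knows which tag contains it.
parent_name = text.parent.name
print(parent_name)
```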

BeautifulSoup

The BeautifulSoup object represents the entire content of a document.

soup
# <html><body><b class="boldest">Extremely bold</b></body></html>
type(soup)
# <class 'bs4.BeautifulSoup'>

Comment

A Comment object generally represents a comment in the document.
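The article gives no example for Comment, so here is a minimal sketch. The markup string is an assumption chosen for illustration, and the standard-library parser is used.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

# The comment is the only child of <b>, so .string returns it.
comment = soup.b.string

print(type(comment).__name__)
print(comment)

# Comment is a special kind of NavigableString.
is_string_like = isinstance(comment, NavigableString)
```

Checking isinstance(node, Comment) is one way to skip comments when collecting the visible text of a page.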

5. Navigating the tree

Tag names

You can get a tag by using its name as an attribute, and you can chain these accesses.

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

Attribute access returns only the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all of the <a> tags:

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
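The snippets in this section rely on a sample document that the article never shows. Judging by the outputs, it appears to be the "three sisters" example from the Beautiful Soup documentation. The sketch below rebuilds it under that assumption as html_doc and parses it with the standard-library parser.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown above.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.head)
print(soup.body.b)  # chained access: the first <b> inside <body>

first_link = soup.a               # only the first <a>
all_links = soup.find_all("a")    # every <a> in document order
link_ids = [a["id"] for a in all_links]
print(link_ids)
```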

6. Searching the tree

The two most important search methods in Beautiful Soup are find() and find_all().
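The difference between the two methods is easiest to see side by side. This sketch assumes a shortened version of the sample document (html_doc is an illustrative reconstruction) and the standard-library parser.

```python
from bs4 import BeautifulSoup

# Shortened, assumed sample document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match, or None when nothing matches.
first = soup.find("a")
nothing = soup.find("table")

# find_all() always returns a list-like result, which may be empty.
every = soup.find_all("a")
empty = soup.find_all("table")

print(first["id"], nothing, len(every), len(empty))
```

Because find() can return None, it is worth checking the result before indexing into it.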

Filters

The simplest filter is a string:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

You can also pass a regular expression as the argument:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

Or pass a list:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If none of these filters fits, you can define your own function and pass it as the filter.
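A function filter receives each tag and returns True for the ones to keep. The sketch below is one possible example; the function name has_class_but_no_id and the reconstructed html_doc are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in the article.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Keep tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
for tag in matches:
    print(tag.name, tag["class"])
```

Only the three <p> tags qualify: the <a> tags have both class and id, and the remaining tags have neither.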

find_all()

find_all(name, attrs, recursive, text, **kwargs)

The name argument

The name argument finds every tag with the given name, such as title, head, body, or p.
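The recursive argument in the signature is not described further in this article. As a hedged sketch of how name and recursive interact, assuming a shortened sample document and the standard-library parser:

```python
from bs4 import BeautifulSoup

# Shortened, assumed sample document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# By default the search covers all descendants.
titles = soup.find_all("title")

# recursive=False restricts the search to direct children.
# <title> sits inside <head>, so it is not a direct child of <html>.
direct_under_html = soup.html.find_all("title", recursive=False)
direct_under_head = soup.head.find_all("title", recursive=False)

print(len(titles), len(direct_under_html), len(direct_under_head))
```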

Keyword arguments

If a keyword argument is not one of the built-in argument names of the search method, it is treated as a filter on the tag attribute with that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's id attribute.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass an href argument, Beautiful Soup filters on each tag's href attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

When searching on a named attribute, the value can be a string, a regular expression, a list, or True.

The following example finds every tag in the tree that has an id attribute, whatever its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing several keyword arguments filters on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
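The keyword filters can be checked together in one runnable sketch. The shortened html_doc and the helper ids are assumptions introduced for illustration.

```python
import re

from bs4 import BeautifulSoup

# Shortened, assumed sample document.
html_doc = """
<html><body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Collect the id attribute of each tag for easy comparison."""
    return [tag["id"] for tag in tags]


by_id = ids(soup.find_all(id="link2"))
by_href = ids(soup.find_all(href=re.compile("elsie")))
with_any_id = ids(soup.find_all(id=True))

# Several keyword arguments must all match the same tag.
both = ids(soup.find_all(href=re.compile("elsie"), id="link1"))
conflicting = ids(soup.find_all(href=re.compile("elsie"), id="link2"))

print(by_id, by_href, with_any_id, both, conflicting)
```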

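Searching by CSS class

Because class is a reserved word in Python, Beautiful Soup uses class_ as the keyword argument instead.

Like other keyword arguments, class_ accepts different kinds of filter: a string, a regular expression, a function, or True.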
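The four kinds of class_ filter might look like the following sketch. The function name has_six_characters and the reconstructed html_doc are illustrative assumptions.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in the article.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_six_characters(css_class):
    """Match class names that are exactly six characters long."""
    return css_class is not None and len(css_class) == 6


by_string = soup.find_all("a", class_="sister")
by_regex = soup.find_all(class_=re.compile("itl"))    # matches "title"
by_function = soup.find_all(class_=has_six_characters)  # matches "sister"
any_class = soup.find_all(class_=True)

print(len(by_string), len(by_regex), len(by_function), len(any_class))
```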

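The text argument

The text argument searches for strings in the document instead of tags. Like the name argument, text accepts a string, a regular expression, a list, or True.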
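A hedged sketch of searching for strings follows. Note one assumption: Beautiful Soup 4.4.0 and later call this argument string and keep text as the older name, so the sketch uses string and presumes a reasonably recent release. The html_doc is an illustrative reconstruction.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in the article.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Without a tag name, the results are the matching strings themselves.
exact = soup.find_all(string="Elsie")
from_list = soup.find_all(string=["Tillie", "Elsie", "Lacie"])
from_regex = soup.find_all(string=re.compile("Dormouse"))

# Combined with a tag name, the results are tags whose .string matches.
tags = soup.find_all("a", string="Elsie")

print(exact, from_list, from_regex, tags)
```

Results come back in document order, not in the order of the list that was passed in.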

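Calling a tag is like calling find_all()

find_all() is probably the most commonly used search method in Beautiful Soup, so it has a shortcut. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling that object's find_all() method. The following two lines are equivalent: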

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(text=True)
soup.title(text=True)
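The equivalence can be verified directly. This sketch assumes a shortened sample document and uses string=True, the newer spelling of text=True in Beautiful Soup 4.4.0 and later.

```python
from bs4 import BeautifulSoup

# Shortened, assumed sample document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p></body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Calling the object forwards the arguments to find_all().
links_long = soup.find_all("a")
links_short = soup("a")

strings_long = soup.title.find_all(string=True)
strings_short = soup.title(string=True)

print(links_long == links_short, strings_long == strings_short)
```

The short form is compact, although the explicit find_all() call is easier for readers who do not know the shortcut.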

CSS selectors

Beautiful Soup supports most CSS selectors [6]. Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags, level by level:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags that are direct children of another tag [6]:

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []

Find the siblings of a tag:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
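The descendant, child, and sibling combinators can be compared in one runnable sketch. The reconstructed html_doc and the helper ids are assumptions for illustration, and the standard-library parser is used.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in the article.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(selector):
    """Return the id of every tag matched by a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]


descendants = ids("body a")           # any depth below <body>
children = ids("body > a")            # direct children only: none here
second_link = ids("p > a:nth-of-type(2)")
later_siblings = ids("#link1 ~ .sister")
next_sibling = ids("#link1 + .sister")
third_paragraph = soup.select("p:nth-of-type(3)")

print(descendants, children, second_link, later_siblings, next_sibling)
```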

Find tags by CSS class:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
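The class, id, and attribute selectors can be exercised the same way. In this sketch the shortened html_doc and the helper ids are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Shortened, assumed sample document.
html_doc = """
<html><body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(selector):
    """Return the id of every tag matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


by_class = ids(".sister")
by_id = ids("a#link2")
has_href = ids("a[href]")

# Attribute operators: = exact, ^= prefix, $= suffix, *= substring.
exact = ids('a[href="http://example.com/elsie"]')
prefix = ids('a[href^="http://example.com/"]')
suffix = ids('a[href$="tillie"]')
substring = ids('a[href*=".com/el"]')

print(by_class, by_id, has_href, exact, prefix, suffix, substring)
```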


