Notes on Python Plugins (4): BeautifulSoup, a Web Page Parsing Tool for Python

Last updated: 2018-12-07 Source: Internet

Uploader: User

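Ahem, the Meego Chinese core site, 米趣網, has a new blog post.
Earlier I introduced PyQuery; now let's turn to BeautifulSoup. Beautiful Soup is a third-party web page parsing library for Python (it is installed separately and is not part of the standard library). The name is whimsical, and at times the library really does feel as pleasant as it sounds.
First, a short introduction:
Beautiful Soup is an HTML/XML parser for Python, designed for quick-turnaround projects such as screen scraping. Three features make it powerful: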

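Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to write a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it automatically; in that case you only need to specify the original encoding yourself.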
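
To make the first and third points concrete, here is a minimal sketch. It assumes the current `bs4` package (Beautiful Soup 4) on Python 3 with the standard-library `"html.parser"` backend, which is not what this article's own examples use (they target the older Beautiful Soup 3 API). The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module used in this article

# Deliberately broken markup: neither <p> nor <b> is ever closed.
broken = "<html><body><p>Price: <b>42</body></html>"
soup = BeautifulSoup(broken, "html.parser")

# The tree is still navigable despite the missing closing tags.
bold_text = soup.b.get_text()

# Bytes in, Unicode out. from_encoding is only a hint for cases where
# the document does not declare its encoding and detection may fail.
raw = "<p>caf\u00e9</p>".encode("latin-1")
latin_soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
paragraph_text = latin_soup.p.get_text()

# Serialising a tag back to bytes, here explicitly as UTF-8.
utf8_bytes = latin_soup.p.encode("utf-8")

print(bold_text)
print(utf8_bytes)
```

The point of the sketch is only that broken input still yields a usable tree, and that text comes out as Unicode strings regardless of the input encoding.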

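Beautiful Soup parses anything you give it and does the tree traversal for you. You can tell it "find all the links", or "find all the links whose class is externalLink", or "find all the links whose URLs match foo.com", or even "find the table heading that has bold text, then give me that text".
Valuable data that was once locked up in poorly designed websites is now within reach. Work that would have taken hours can be done in minutes with Beautiful Soup.
Now let's get started quickly.
First, import the package: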
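
The four requests quoted above could be sketched as follows. This is an illustration under assumptions: it uses the `bs4` package (Beautiful Soup 4) on Python 3 rather than the Beautiful Soup 3 API shown in the rest of this article, and the HTML snippet and variable names are invented for the example.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Hypothetical sample page, invented for this sketch.
html = """
<a class="externalLink" href="http://foo.com/a">A</a>
<a href="/local">B</a>
<a class="externalLink" href="http://bar.org/c">C</a>
<table><tr><th><b>Name</b></th><th>Age</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = [a["href"] for a in soup.find_all("a")]

# "Find all the links whose class is externalLink"
external = [a["href"] for a in soup.find_all("a", class_="externalLink")]

# "Find all the links whose URLs match foo.com"
foo_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"foo\.com"))]

# "Find the table heading that has bold text, then give me that text"
bold_headings = [th.b.get_text() for th in soup.find_all("th") if th.b]

print(all_links, external, foo_links, bold_headings)
```

Each request maps to a single `find_all` call with a tag name plus an optional attribute filter, which is what keeps scraping scripts short.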

from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything

Here is some code demonstrating the basic usage of Beautiful Soup. You can copy and paste it and run it yourself.

from BeautifulSoup import BeautifulSoup
import re
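doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>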

Here are some ways to navigate the parsed document:

soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
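
For readers on a newer installation, the same navigation could look roughly like this in Beautiful Soup 4, where `nextSibling` and `next` are spelled `next_sibling` and `next_element`. This is a sketch under assumptions: the `bs4` package on Python 3 with the `"html.parser"` backend, and a sample document with explicitly closed `<p>` tags, since that parser does not close them automatically.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Sample document with explicitly closed <p> tags and no whitespace between tags.
markup = (
    "<html><head><title>Page title</title></head>"
    '<body><p id="firstpara">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(markup, "html.parser")

head = soup.html.head
parent_name = head.parent.name           # tag that contains <head>
first_inside = head.next_element.name    # first node parsed after <head> opens
sibling_name = head.next_sibling.name    # tag that follows <head> at the same level

first_para = head.next_sibling.contents[0]
second_para_id = first_para.next_sibling["id"]

print(parent_name, first_inside, sibling_name, second_para_id)
```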

Next are a few ways to search the document for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>
titleTag.string
# u'Page title'
len(soup('p'))
# 2
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'

Of course, it is also easy to modify the document:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>
soup.p.extract()
soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
soup.p.replaceWith(soup.b)
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " tags!")
soup.body
# <body>This page used to have <b>two</b> tags!</body>
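
The same kind of edit can be sketched with Beautiful Soup 4, where `replaceWith` is spelled `replace_with`. As before, this assumes the `bs4` package on Python 3 with the `"html.parser"` backend rather than the Beautiful Soup 3 API used above, and it starts from a shortened, hypothetical version of the document.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

soup = BeautifulSoup(
    "<html><head><title>Page title</title></head>"
    "<body><p>This is paragraph <b>two</b>.</p></body></html>",
    "html.parser",
)

title = soup.title
title["id"] = "theTitle"                 # set an attribute on the tag
title.string.replace_with("New title")   # swap the text node inside <title>
head_html = str(soup.head)

bold = soup.b.extract()                  # detach <b> from the tree first
soup.p.replace_with(bold)                # then put it where the <p> was
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " tags!")
body_html = str(soup.body)

print(head_html)
print(body_html)
```

Extracting the `<b>` tag before the replacement is a deliberate choice in this sketch: it avoids replacing a tag with one of its own descendants in a single step.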

Finally, for more detail, see the Beautiful Soup documentation. I hope this has been helpful.

If you repost this article, please credit 米趣網 as the source.
