Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and produces a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree, which can save you a great deal of programming time.
When writing web-page analysis code in Python, you can lean on this library's parsing instead of hand-written regular expressions, which is far more convenient. To use it you need to import the module, as follows.

Importing the Beautiful Soup library in a program:
from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
Beautiful Soup handles HTML quite well; its XML handling is not as polished. For example:
#!/usr/bin/python
# coding: utf-8
from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
The output is as follows:
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
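As a quick illustration of the navigating, searching, and modifying operations mentioned at the start, here is a minimal sketch that continues from the soup object built above; the tag and attribute names come from the doc list, while the variable names are only for illustration:

# Navigating: walk the tree by tag name and read attributes like a dict
title_tag = soup.title
print title_tag.string                     # u'Page title'
print soup.body.p['id']                    # 'firstpara'

# Searching: find tags by name and by attribute values
for p in soup.findAll('p', align='center'):
    print p['id']                          # only 'firstpara' matches

# Modifying: change an attribute, then re-print the tree
second = soup.find('p', id='secondpara')
second['align'] = 'center'
print soup.body.prettify()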
Of course, the library is capable of much more. Below is an example that extracts the title from several web pages:
#!/usr/bin/env python
# coding: utf-8
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded URL grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            # grab a host from the queue
            host = self.queue.get()

            # open the host URL and read a chunk of the web page
            url = urllib2.urlopen(host)
            chunk = url.read()

            # place the chunk into the out queue
            self.out_queue.put(chunk)

            # signal to the queue that the job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded HTML parsing"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grab a chunk from the out queue
            chunk = self.out_queue.get()

            # parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            # signal to the queue that the job is done
            self.out_queue.task_done()

start = time.time()

def main():
    # spawn a pool of threads, and pass them the queue instances
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # populate the queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    # wait on the queues until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
This example uses multiple threads together with queues. Queues simplify multithreaded development by following a divide-and-conquer idea: each thread has a single, independent job, and data is shared through the queues, which keeps the program logic simple. The output is as follows:
[<title>IBM - United States</title>]
[<title>Google</title>]
[<title>Yahoo!</title>]
[<title>Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more</title>]
Elapsed Time: 12.5929999352
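Since the imports above also pull in BeautifulStoneSoup for XML, here is a minimal sketch of the XML side; the sample XML string is invented purely for illustration, and as noted earlier the XML support is less polished than the HTML support:

#!/usr/bin/python
# coding: utf-8
from BeautifulSoup import BeautifulStoneSoup

# A made-up XML fragment, used only to illustrate the API
xml = '<doc><item name="first">text one</item><item name="second">text two</item></doc>'

soup = BeautifulStoneSoup(xml)
print soup.prettify()                 # indented view of the XML tree
for item in soup.findAll('item'):     # same findAll interface as the HTML parser
    print item['name'], item.string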
Chinese documentation: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html
Official site: http://www.crummy.com/software/BeautifulSoup/#Download/