Analyzing Web Pages with Beautiful Soup in Python


Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and produces a parse tree, and it offers simple, commonly needed operations for navigating, searching, and modifying that tree. It can save you a great deal of programming time.

When building web-page analysis features in Python, you can lean on this library's parser; it is very convenient, far more so than writing regular expressions by hand. To use it you need to import the module, as follows:

Import the Beautiful Soup library in your program:

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
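(The snippets in this article are Python 2 with the old BeautifulSoup 3 package; under Python 3 the library now lives in the bs4 package.) To get a feel for what the library saves you, here is a rough comparison point: even a bare-bones title extractor built directly on the standard library's html.parser, sketched below in Python 3, needs explicit state tracking, where Beautiful Soup gives you the title in a single expression.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # only accumulate text while we are inside <title>...</title>
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>Page title</title></head><body></body></html>")
print(parser.title)  # Page title
```

All of this bookkeeping is what a parse-tree library hides behind its navigation and search methods.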

Beautiful Soup handles HTML well; its XML handling is not quite as polished. For example:

#!/usr/bin/python
# coding: utf-8
from BeautifulSoup import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()

The output is as follows:

# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

It is quite powerful, of course. Below is an example that extracts the title from several web pages:

#!/usr/bin/env python
# coding: utf-8
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded URL grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            # grab a host from the queue
            host = self.queue.get()

            # fetch the host's page and read a chunk of it
            url = urllib2.urlopen(host)
            chunk = url.read()

            # place the chunk into the out queue
            self.out_queue.put(chunk)

            # signal to the queue that the job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parse"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grab a chunk from the out queue
            chunk = self.out_queue.get()

            # parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            # signal to the queue that the job is done
            self.out_queue.task_done()

start = time.time()

def main():
    # spawn a pool of threads, and pass them the queue instances
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # populate the queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    # wait on the queues until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

This example uses multiple threads and queues. Queues simplify multithreaded development in a divide-and-conquer spirit: each thread has a single, independent job, and the threads share data through queues, which keeps the program logic simple. The output looks like this:

[<title>IBM - United States</title>]
[<title>Google</title>]
[<title>Yahoo!</title>]
[<title>Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more</title>]
Elapsed Time: 12.5929999352
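The queue-and-worker structure of the example above is independent of Beautiful Soup. A minimal Python 3 sketch of the same divide-and-conquer pattern, using only the standard library (squaring numbers stands in for fetching and parsing a page):

```python
import queue
import threading

task_queue = queue.Queue()
results = queue.Queue()

def worker():
    """Each worker has one job: take an item, process it, report back."""
    while True:
        item = task_queue.get()
        results.put(item * item)   # stand-in for "fetch and parse"
        task_queue.task_done()

# spawn a small pool of daemon threads sharing the same queue
for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

# populate the queue with work
for n in [1, 2, 3, 4]:
    task_queue.put(n)

task_queue.join()                  # block until every item is processed

collected = [results.get() for _ in range(4)]
print(sorted(collected))           # [1, 4, 9, 16]
```

Because each worker only talks to the queues, you can change the pool size or swap out the processing step without touching the coordination logic.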

Chinese documentation: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html

Official site: http://www.crummy.com/software/BeautifulSoup/#Download/
 
