Scraping and Parsing Web Pages with Python and Beautiful Soup

Last updated: 2015-03-11. Source: Internet

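Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. In short, it is a library for parsing XML, HTML and similar markup, and it is fairly pleasant to use.

Official site: http://www.crummy.com/software/BeautifulSoup/

The rest of this post shows how to use Python and Beautiful Soup to scrape PM2.5 data from a web page.

The site with the PM2.5 data: http://www.pm25.com/city/wuhan.html

The site publishes PM2.5 readings from monitors installed at several locations, updated roughly once an hour (readings from an instrument are sometimes missing). The data we want to scrape is a set of air quality indicators for each monitoring site.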

    # Python 2.7
    import logging
    import urllib2

    from bs4 import BeautifulSoup as BSoup


    def getPM25():
        """Fetch the Wuhan PM2.5 page and append the readings to pm2dot5.txt.

        Returns 0 on success and 1 on failure.
        """
        url = "http://www.pm25.com/city/wuhan.html"

        # Browser-like headers so the request looks like it comes from Firefox.
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
        }
        # CSS classes of the fields extracted for each monitoring site.
        fields = (".pjadt_location", ".pjadt_aqi", ".pjadt_quality",
                  ".pjadt_wuranwu", ".pjadt_pm25", ".pjadt_pm10")
        try:
            req = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(req)
            content = response.read()
            response.close()
            pm = BSoup(content, from_encoding="utf-8")
            update_time = pm.select(".citydata_updatetime")[0].get_text()
            logging.info(update_time + u" ")
            with open('pm2dot5.txt', 'a') as f:
                # get_text() returns unicode, so encode before writing to the file.
                print >>f, update_time.encode("utf-8")
                for locate in pm.select(".pj_area_data ul:nth-of-type(1) li"):
                    values = [locate.select(css)[0].get_text().rjust(15) for css in fields]
                    print >>f, u"\t".join(values).encode("utf-8")
                print >>f, "\n\n\n"
            return 0
        except Exception as e:
            logging.error(e)
            return 1

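The request is made mainly with Python's urllib2 library:

Build the headers to send, disguising the request as coming from the Firefox browser.
Use them to construct a request, then open that request.
Call response.read() to get the HTML content.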
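
For readers on Python 3, where urllib2 became urllib.request, the steps above might look roughly like the sketch below. It only builds the request object with a browser-like User-Agent; the `build_request` helper name is made up for illustration, and the actual network call is left commented out.

```python
import urllib.request


def build_request(url):
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
    }
    return urllib.request.Request(url, headers=headers)


req = build_request("http://www.pm25.com/city/wuhan.html")
# urllib stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))

# To actually send the request and read the HTML:
# with urllib.request.urlopen(req, timeout=10) as response:
#     content = response.read()
```

Keeping the request construction separate from the network call makes it easy to check the headers before anything is sent.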

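Extracting tag content

Next, Beautiful Soup parses the HTML content and extracts the values inside the tags. For the specific functions, refer to the official documentation.

The code here mainly uses the select method and the get_text method.

select picks elements with a CSS selector: by tag name (for example a, li, body), by CSS class, or by id.

get_text returns the text of the matched element; for "<h1>hello</h1>" it returns "hello".

To find the class names of the elements you need, use the browser's Inspect Element feature.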
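
As a small illustration of select and get_text, the sketch below parses a made-up HTML fragment. The class names are taken from the scraping code in this post, but the site names and numbers are invented sample data, and the built-in "html.parser" backend is an assumption chosen so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented fragment that imitates the structure the scraper expects.
html = """
<h1 id="title">hello</h1>
<div class="pj_area_data">
  <ul>
    <li><span class="pjadt_location">Site A</span><span class="pjadt_aqi">85</span></li>
    <li><span class="pjadt_location">Site B</span><span class="pjadt_aqi">102</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select by id, then read the text inside the tag.
title = soup.select("#title")[0].get_text()

# Select every <li> under the container class, then pick fields by class.
rows = [
    (li.select(".pjadt_location")[0].get_text(), li.select(".pjadt_aqi")[0].get_text())
    for li in soup.select(".pj_area_data li")
]

print(title)
print(rows)
```

select always returns a list, which is why the code indexes with [0] before calling get_text.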

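Writing to a text file:

This mainly relies on Python's with statement, which ensures that the opened file is closed even if an exception occurs. It also uses a small stream-redirection trick:

print >> f, "hello" where f is the opened file object. This statement redirects whatever print would output into the file.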
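
The `print >> f` form is Python 2 syntax. A rough Python 3 equivalent, shown here as a sketch, passes the file through the `file=` argument of print. The temporary directory and the sample values are assumptions made only so the example can run anywhere.

```python
import os
import tempfile

# Write into a throwaway directory so the example leaves no stray files.
path = os.path.join(tempfile.mkdtemp(), "pm2dot5.txt")

# "with" closes the file even if an exception is raised inside the block.
with open(path, "a", encoding="utf-8") as f:
    print("hello", file=f)
    # Right-justify each field to 15 characters and separate them with a tab.
    print("Site A".rjust(15), "85".rjust(15), sep="\t", file=f)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
```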

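Logging:

Because this program runs in the background for a long time, it is best to record error messages to make debugging easier. This uses Python's built-in logging module.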

    import logging
    from logging.handlers import RotatingFileHandler

    # Root logger: write DEBUG and above to debug.log, truncating it on start.
    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                        datefmt='%a, %d %b %Y %H:%M:%S',
                        filename='debug.log',
                        filemode='w')

    # Also echo INFO and above to the console.
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)

    # Rotating file handler: at most 1 MB per file, keeping 5 backups.
    # Note that it targets the same debug.log as basicConfig above.
    Rthandler = RotatingFileHandler('debug.log', maxBytes=1*1024*1024, backupCount=5)
    Rthandler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
    Rthandler.setFormatter(formatter)
    logging.getLogger('').addHandler(Rthandler)

Among other things, this sets the log format and the maximum size of the log file.
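
To see the handler level and the size limit in isolation, here is a minimal Python 3 sketch. The `build_logger` helper, the "pm25_demo" logger name, and the temporary log path are all hypothetical; it uses a named logger rather than the root logger so that it does not interfere with other logging configuration.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(log_path, name="pm25_demo"):
    """Create a logger that writes INFO and above to a size-limited file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo output out of the root logger

    handler = RotatingFileHandler(
        log_path, maxBytes=1 * 1024 * 1024, backupCount=5, encoding="utf-8"
    )
    handler.setLevel(logging.INFO)  # DEBUG records are dropped by this handler
    handler.setFormatter(
        logging.Formatter("%(name)-12s: %(levelname)-8s %(message)s")
    )
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "debug.log")
logger = build_logger(log_path)
logger.debug("not written: below the handler level")
logger.info("fetch ok")
logger.error("fetch failed")

for handler in logger.handlers:
    handler.flush()

with open(log_path, encoding="utf-8") as f:
    log_lines = f.read().splitlines()

print(log_lines)
```

Once debug.log grows past maxBytes it is renamed to debug.log.1 and a fresh file is started, with at most backupCount old files kept.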

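Running on a schedule:

Running on a schedule makes it possible to capture the PM2.5 data at a specified time every day, which can then be combined with satellite overpass times for further analysis. Scheduling also uses a built-in Python module, sched.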

    import logging
    import sched
    import time


    def run():
        while True:
            # First run: wait until the scheduled time, then fetch once.
            s = sched.scheduler(time.time, time.sleep)
            s.enterabs(each_day_time(9, 50, 30), 1, getPM25, ())
            try:
                s.run()
            except Exception:
                s.run()
            # Three more runs, one hour apart; retry each until getPM25() returns 0.
            for label in ("second run", "third run", "fourth run"):
                time.sleep(60 * 60)
                logging.info(label)
                while getPM25():
                    pass
            logging.info(u"\n\nWaiting for the next run...")

Here each_day_time is a function that returns the timestamp for the specified time of day:

    import datetime
    import time


    def each_day_time(hour, minute, sec):
        """Return the Unix timestamp of hour:minute:sec tomorrow."""
        today = datetime.datetime.today()
        today = datetime.datetime(today.year, today.month, today.day, hour, minute, sec)
        tomorrow = today + datetime.timedelta(days=1)
        xtime = time.mktime(tomorrow.timetuple())
        # xtime = time.mktime(today.timetuple())  # use today's date instead
        return xtime
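
each_day_time always targets tomorrow. One possible variation, sketched below in Python 3, picks today's slot when it has not passed yet and tomorrow's otherwise. The `next_run_time` name and its `now` parameter are hypothetical additions for illustration and are not part of the original program.

```python
import datetime


def next_run_time(hour, minute, sec, now=None):
    """Return the next datetime at hour:minute:sec, today if still ahead."""
    now = now or datetime.datetime.now()
    candidate = now.replace(hour=hour, minute=minute, second=sec, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so move to tomorrow.
        candidate += datetime.timedelta(days=1)
    return candidate


# A fixed "now" keeps the example deterministic.
morning = datetime.datetime(2015, 3, 11, 8, 0, 0)
evening = datetime.datetime(2015, 3, 11, 20, 0, 0)
print(next_run_time(9, 50, 30, now=morning))
print(next_run_time(9, 50, 30, now=evening))
```

Calling `.timestamp()` on the result gives a value that can be passed to the scheduler in place of the time.mktime result.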

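Also note that if the specified time has already passed, the task does not wait and simply keeps running, because sched executes an event whose time is in the past immediately.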
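
The behavior with past times can be checked without waiting by giving sched a simulated clock. The sketch below is Python 3; the `FakeClock` class and the sample times are invented for illustration, while the scheduler calls (`enterabs`, `run`) are the standard library ones.

```python
import sched


class FakeClock:
    """Simulated clock: sleeping just moves the current time forward."""

    def __init__(self, start):
        self.now = start

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock(start=100.0)
calls = []


def job(label):
    # Record which job ran and at what simulated time.
    calls.append((label, clock.time()))


s = sched.scheduler(clock.time, clock.sleep)
s.enterabs(160.0, 1, job, ("future",))  # 60 seconds ahead of the clock
s.enterabs(50.0, 1, job, ("past",))     # already in the past
s.run()

print(calls)
```

The event scheduled for a time that has already passed runs at once, at the current clock value, and only the future event makes the scheduler wait.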

Full code download (Python 2.7): http://files.cnblogs.com/files/pasion-forever/pm2-5.v1.rar

Also: double-clicking a .pyw file runs it with pythonw.exe, so a script without a GUI runs in the background by default.

Scraped results: [screenshot not preserved]
