Python Libraries in Detail: Networking (2): Parsing Web Pages

Last updated: 2018-12-03  Source: Internet

Uploaded by: User

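Yesterday I tried parsing web pages with the HTMLParser class, and the results were not ideal. Even so, I am writing down the process in the hope that later readers can build on it and solve the problems I ran into.

I wrote two solutions, and each of them only works for one specific site. Here I describe how I parsed the BBC home page (www.bbc.co.uk) and the NetEase home page (www.163.com).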

For the BBC:

This one is much simpler, probably because the page's encoding is fairly standard.

import html.parser
import urllib.request

class parseHtml(html.parser.HTMLParser):
    # Each handler just reports which kind of markup was encountered.
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    # Note: with convert_charrefs=True (the default since Python 3.5),
    # handle_charref and handle_entityref are not called.
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")

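# Start reading here. The subclass above is simple: it overrides every handler of the parent class.

# Save the BBC page in binary write mode. This was covered in the previous post
# (http://blog.csdn.net/xiadasong007/archive/2009/09/03/4516683.aspx), so I will not repeat it.

# The mode is 'wb', not 'w'. The with statement closes the file before it is reopened for parsing.
with urllib.request.urlopen("http://www.bbc.co.uk/") as url, open("bbc.html", 'wb') as file:
    for line in url:
        file.write(line)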

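# Create a parser object.

pht = parseHtml()

# For this site the file has to be opened as 'utf-8', otherwise decoding fails;
# other sites may not need this. UTF-8 is one of the encodings of Unicode.
with open("bbc.html", encoding='utf-8', mode='r') as file:
    # Feed the page to the parser line by line.
    for line in file:
        pht.feed(line)
pht.close()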
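
The handlers above only print the kind of markup they see. As a rough sketch of how the same callbacks could collect something useful, the following subclass records the `href` of every `<a>` tag. The names `LinkCollector` and `collect_links` and the sample bytes are made up for illustration and are not part of the original scripts; the page is decoded in memory with an explicitly assumed charset instead of being saved to disk first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example class: records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(raw, charset="utf-8"):
    """Decode raw page bytes and return the href values found in them."""
    # errors="replace" keeps parsing going when some bytes are invalid
    # for the assumed charset, instead of raising UnicodeDecodeError.
    text = raw.decode(charset, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Made-up sample input standing in for a downloaded page.
sample = b'<html><body><a href="/news">News</a> <a href="/sport">Sport</a></body></html>'
print(collect_links(sample))  # ['/news', '/sport']
```

For a real page, `raw` would be the bytes returned by `urllib.request.urlopen(...).read()`, and the charset would have to match the one the site actually uses.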

For 163:

# Parsing this page with the method above raises exceptions in the CSS and JavaScript parts,

# so I strip those two parts out here. The code:

import html.parser
import urllib.request

# Start reading here. I define four functions to deal with the CSS and JavaScript parts.

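def EncounterCSS(line):
    # True if this line opens an inline CSS block.
    return '<style type="text/css">' in line

def PassCSS(file, line):
    # Skip lines until the closing </style> tag, or until end of file
    # (readline() returns an empty string there, which ends the loop).
    while line and "</style>" not in line:
        line = file.readline()

def EncounterJavascript(line):
    # True if this line opens an inline JavaScript block.
    return '<script type="text/javascript">' in line

def PassJavascript(file, line):
    # Skip lines until the closing </script> tag, or until end of file.
    while line and "</script>" not in line:
        line = file.readline()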

website = "http://www.163.com"
# The mode is 'wb', not 'w'. The with statement closes the file before it is reopened for parsing.
with urllib.request.urlopen(website) as url, open("163.html", mode='wb') as file:
    for line in url:
        file.write(line)

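# parseHtml is the class defined in the BBC example above.
pht = parseHtml()
# No encoding is given here, so the platform default encoding is used.
with open("163.html", mode='r') as file:
    while True:
        line = file.readline()
        if len(line) == 0:
            break
        # Inside this loop, skip the CSS and JavaScript parts first.
        # Both skip functions take the file and the current line.
        if EncounterCSS(line) == 1:
            PassCSS(file, line)
        elif EncounterJavascript(line) == 1:
            PassJavascript(file, line)
        else:
            pht.feed(line)
pht.close()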

Both approaches work, but neither is what I really wanted: I would like a general way to process web pages.
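
One possible direction toward a more general method, offered only as a sketch, is to let the parser itself track when it is inside a `<script>` or `<style>` element instead of filtering lines by hand before feeding them. The class `TextExtractor`, the function `extract_text`, and the sample page below are hypothetical names and data introduced for illustration. The sketch assumes the page can be parsed by a current `html.parser` without raising, which was not the case for the 163 page with the Python version used in this post.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical example class: collects visible text, ignoring script and style."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script and style elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html_text):
    """Return the list of visible text chunks found in html_text."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.chunks


# Made-up sample page with inline CSS and JavaScript.
page = """<html><head>
<style type="text/css">body { color: red; }</style>
<script type="text/javascript">var greeting = "hello";</script>
</head><body><h1>Headline</h1><p>First paragraph.</p></body></html>"""
print(extract_text(page))  # ['Headline', 'First paragraph.']
```

Because the skipping is driven by tags rather than by exact strings such as `<style type="text/css">`, it does not depend on how a particular site writes its attributes or where it breaks its lines.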

I had hoped that BeautifulSoup could help here, but unfortunately my Python version is too new for it, so that will have to wait.

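Of course, processing a web page may not require the HTMLParser class at all. We can write code aimed at exactly what we need; only then do we have full control over the parsing, and doing so also improves our own skills. In other words, we only need Python to download the page (the page elements) for us, and the parsing part is something we can handle ourselves.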
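
As a small illustration of parsing without HTMLParser, the sketch below pulls the page title out of already-downloaded text with a regular expression. The function name `extract_title` and the sample strings are hypothetical, and the pattern only covers the simple case of a single well-formed `<title>` element; it is not a general HTML parser.

```python
import re

# Matches <title ...>text</title>, ignoring case and spanning line breaks.
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_text):
    """Return the stripped text of the first <title> element, or None if absent."""
    match = TITLE_PATTERN.search(html_text)
    if match is None:
        return None
    return match.group(1).strip()


# Made-up sample input standing in for a downloaded page.
print(extract_title("<html><head><TITLE> BBC - Home </TITLE></head></html>"))  # BBC - Home
```

Hand-written extraction like this is quick for one known page layout, but it has to be revisited whenever the site changes its markup.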
