Fetching Chinese Web Pages with Python

Last updated: 2018-12-06. Source: Internet

Uploaded by: User


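I have long wanted to record my blog's daily traffic. I recently registered a GAE (Google App Engine) application and have just started learning Python, so this makes a good practice project. The plan is to use Python to store the access records locally first, and once I am comfortable with that, deploy it to GAE and use the cron service GAE provides to update the traffic figures every day. OK, let's begin.

First, a simple page-fetching program: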

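    import sys, urllib2  # Python 2; in Python 3, urllib2 became urllib.request

    req = urllib2.Request("http://blog.csdn.net/nevasun")
    fd = urllib2.urlopen(req)

    # Read the response in 1 KB chunks and echo each chunk to stdout.
    while True:
        data = fd.read(1024)
        if not len(data):  # an empty read means the response is finished
            break
        sys.stdout.write(data)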
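For readers on Python 3, the same chunked-read loop can be sketched as follows. This is a minimal sketch, not the original author's code: it assumes `urllib.request` (the Python 3 successor of `urllib2`), and the helper names `copy_in_chunks` and `fetch_to_buffer` are hypothetical. The demonstration at the bottom uses an in-memory stream in place of a real HTTP response so that it runs without network access.

```python
import io
import urllib.request


def copy_in_chunks(src, dst, chunk_size=1024):
    """Copy bytes from src to dst chunk by chunk; return the byte count."""
    total = 0
    while True:
        data = src.read(chunk_size)
        if not data:  # an empty bytes object signals end of stream
            break
        dst.write(data)
        total += len(data)
    return total


def fetch_to_buffer(url, timeout=10):
    """Hypothetical helper: download url into memory via copy_in_chunks."""
    buffer = io.BytesIO()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        copy_in_chunks(response, buffer)
    return buffer.getvalue()


# Offline demonstration: an in-memory stream stands in for the HTTP response.
fake_response = io.BytesIO(b"x" * 2500)
sink = io.BytesIO()
copied = copy_in_chunks(fake_response, sink)
```

Reading in fixed-size chunks keeps memory use bounded for large pages; for a small page, a single `read()` call would work just as well.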

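Running this program in a terminal fails with urllib2.HTTPError: HTTP Error 403: Forbidden. What went wrong?

The site rejects requests that look like they come from a crawler. The workaround is to add header information to the request so that it appears to come from a browser. Add the headers and change the Request call as follows: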

    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request("http://blog.csdn.net/nevasun", headers=headers)
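As a rough Python 3 equivalent of the header change, the sketch below builds the request with a browser-like User-Agent and reports HTTP errors such as 403 instead of letting them propagate. The function names `build_request` and `fetch` are hypothetical, and the timeout value is an arbitrary assumption. Nothing is fetched until `fetch` is called.

```python
import urllib.error
import urllib.request

# The same browser identification string used in the article.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the server refuses."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # For example: 403 Forbidden when the site rejects the client.
        print("HTTP error:", err.code, err.reason)
        return None


# Building the request does not touch the network.
req = build_request("http://blog.csdn.net/nevasun")
```

Note that `Request` normalizes header names, so the stored key is `User-agent`; `req.get_header("User-agent")` returns the string set above.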

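Try again: the HTTP Error 403 is gone, but all of the Chinese text comes out garbled. What is wrong this time?

The site serves its pages in UTF-8, while the terminal expects the local system's encoding, so the content has to be converted before it is printed: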

    import sys, urllib2

    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request("http://blog.csdn.net/nevasun", headers=headers)
    content = urllib2.urlopen(req).read()  # raw bytes, UTF-8 encoded

    # Renamed from `type` to avoid shadowing the built-in of the same name.
    local_encoding = sys.getfilesystemencoding()  # local encoding
    print content.decode("UTF-8").encode(local_encoding)  # convert the encoding

OK, that's it: Chinese pages can now be fetched. The next step is to build a simple application on GAE.
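The decode-then-encode conversion used in the last listing can be sketched in Python 3 as below. This is an illustration under stated assumptions, not the author's code: `convert_encoding` is a hypothetical helper, it defaults to `sys.stdout.encoding` rather than the filesystem encoding, and it uses `errors="replace"` so that characters the target encoding cannot represent become `?` instead of raising an exception.

```python
import sys


def convert_encoding(raw, source="utf-8", target=None):
    """Decode raw bytes from `source`, then re-encode them as `target`."""
    if target is None:
        # Assumption: fall back to UTF-8 when stdout reports no encoding.
        target = sys.stdout.encoding or "utf-8"
    text = raw.decode(source)  # bytes -> str
    return text.encode(target, errors="replace")  # str -> bytes


# Offline demonstration with a short Chinese string instead of a real page.
page_bytes = "中文".encode("utf-8")
gbk_bytes = convert_encoding(page_bytes, target="gbk")
ascii_bytes = convert_encoding(page_bytes, target="ascii")
```

In Python 3, `print()` on a decoded `str` already encodes the text for the terminal, so an explicit re-encode like this is mostly useful when writing bytes to a file or pipe that expects a specific encoding.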

Reposted from: http://linux.chinaitlab.com/Python/878184.html
