Notes on Writing a Python Crawler

Last updated: 2015-05-26 Source: Internet

Uploaded by: User


Tags: python tools

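I recently wrote a small crawler in Python that automatically collects content from a few websites and saves it to local files. Here is a short summary of what I learned.

The main libraries are urllib and BeautifulSoup.

urllib and urllib2 are convenient libraries for fetching web pages. At their core, they send customizable URL requests and return the page content. The simplest function checks whether a page exists: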

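    import urllib2

    # Example value only; the original leaves `headers` undefined.
    headers = {'User-Agent': 'Mozilla/5.0'}

    def isUrlExists(url):
        req = urllib2.Request(url, headers=headers)
        try:
            urllib2.urlopen(req)
        except Exception:
            # Any failure (HTTP error, unreachable host, bad URL) counts as "not found".
            return False
        return True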

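headers can be customized or left empty (an empty dict). The main reason to customize them is to mimic the headers of an ordinary browser and get around sites that block crawlers.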
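As a minimal sketch of what such a header dict might look like, the snippet below builds a request with browser-like headers without sending it. It uses `urllib.request`, the Python 3 counterpart of urllib2; the header values and the `example.com` URL are placeholders chosen for illustration, not values from the original crawler.

```python
import urllib.request

# Assumed browser-like header values, for illustration only.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.8",
}

# example.com is a placeholder; building the Request sends nothing yet.
req = urllib.request.Request("http://example.com/", headers=headers)

# Request stores header names via str.capitalize(), e.g. "User-agent".
user_agent = req.get_header("User-agent")
print(user_agent)
```

Passing `req` to `urllib.request.urlopen` would then send these headers along with the request.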

To fetch a page's content and also catch and report any errors that come back, you can do this:

    def fetchLink(url):
        req = urllib2.Request(url, headers=headers)
        try:
            response = urllib2.urlopen(req)
        # HTTPError is a subclass of URLError, so it must be caught first.
        except urllib2.HTTPError as e:
            print('Got HTTP error while retrieving: %s with response code: %s '
                  'and exception: %s' % (url, e.getcode(), e.reason))
        except urllib2.URLError as e:
            print('Got URL error while retrieving: %s and the exception is: %s'
                  % (url, e.reason))
        else:
            return response.read()

The code above returns the raw HTML directly, or None if an error occurred.

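BeautifulSoup (documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) is a concise library for parsing HTML. Once you have the HTML, you could extract what you need with Python's built-in regular expressions, but that gets tedious. BeautifulSoup parses the HTML into a standard tree structure, much like parsed JSON, so you can pull out elements directly. It also supports searching for elements, among other things.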

    import bs4

    # `content` holds the raw HTML bytes fetched earlier.
    content = bs4.BeautifulSoup(content, 'html.parser', from_encoding='GB18030')
    posts = content.find_all(class_='post-content')
    for post in posts:
        postText = post.find(class_='entry-title').get_text()

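In this example, content is first converted into a bs4 object. The code then finds every block with class=post-content and extracts the text of the class=entry-title element inside each one. Note that you can choose the encoding when parsing on the first line; here it is GB18030, a Simplified Chinese encoding.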
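To make the tree navigation and element search concrete, here is a small self-contained sketch that runs on an inline HTML string instead of a fetched page. The HTML snippet, its titles, and its links are made up for illustration; only the class names mirror the example above.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page (illustrative only).
html = """
<html><head><title>Demo blog</title></head>
<body>
  <div class="post-content">
    <h2 class="entry-title">First post</h2>
    <a href="/posts/1">Read more</a>
  </div>
  <div class="post-content">
    <h2 class="entry-title">Second post</h2>
    <a href="/posts/2">Read more</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk the tree directly: soup.title is the first <title> tag.
page_title = soup.title.get_text()

# Search by class, then pull text and attributes out of each match.
titles = []
links = []
for post in soup.find_all(class_="post-content"):
    titles.append(post.find(class_="entry-title").get_text())
    links.append(post.find("a")["href"])

print(page_title, titles, links)
```

Attribute access such as `soup.title` follows the tree, while `find` and `find_all` search beneath the tag they are called on, which is why each `post` can be queried on its own.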

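Everything above is about extracting HTML text. If the content contains images, the code above simply keeps the image links pointing at the original image locations. If you need to download anything, you can use the urlretrieve method (urllib.urlretrieve). I won't go into detail here.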
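For reference, a download with urlretrieve might look like the sketch below. It uses Python 3's `urllib.request.urlretrieve` (the counterpart of `urllib.urlretrieve` in Python 2). The `download` helper and the file names are hypothetical, and a temporary local file served through a `file://` URL stands in for a remote image so the example runs offline.

```python
import os
import pathlib
import tempfile
import urllib.request


def download(url, dest_path):
    """Save the resource at `url` to `dest_path` and return the local path."""
    local_path, _headers = urllib.request.urlretrieve(url, dest_path)
    return local_path


# Offline demo: a temporary local file stands in for a remote image.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.bin")
with open(source, "wb") as f:
    f.write(b"fake image bytes")

saved = download(pathlib.Path(source).as_uri(),
                 os.path.join(workdir, "copy.bin"))
with open(saved, "rb") as f:
    data = f.read()
```

In a real crawler the first argument would be the image URL taken from the parsed page, for example the `src` attribute of an `img` tag.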
