Python Mini Example 1: A Simple Web Crawler

Last updated: 2018-07-28 Source: Internet

Uploaded by: User

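The crawler described here requests a URL from the local machine, reads the page at that URL as source text, parses that source, and extracts the data it needs, which amounts to a simple form of data mining. This article downloads the images from one web page and saves them locally. The example is deliberately simple and was written for Python 3.5.2. In earlier versions the packages to import may have different names and the library functions are called slightly differently (in Python 2, for instance, urlopen lives in urllib2 rather than urllib.request). The code is as follows: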

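    # coding=utf-8
    import re
    import urllib.request


    def getHtml(url):
        page = urllib.request.urlopen(url)  # open the page
        html = page.read()  # read the source of the target page (bytes)
        return html


    def getImg(html):
        # Regex that selects the target image format; some pages use
        # 'data-original="(.+?\.jpg)"' instead.
        reg = r'src="(.+?\.png)"'
        img = re.compile(reg)
        html = html.decode('utf-8')  # the page is encoded as UTF-8
        imglist = re.findall(img, html)  # parse the source to get the image list
        # print(imglist)
        # Save the first 6 images; slicing avoids an IndexError on shorter lists.
        for x, imgurl in enumerate(imglist[:6]):
            # imgurl = re.sub('"(.*?)"', r'\1', imgurl)  # keep only the text inside double quotes
            # print(imgurl)
            # Download the image and save it locally. The pattern matches
            # .png files, so the local file name uses .png as well.
            urllib.request.urlretrieve(imgurl, '%s.png' % x)


    # The target page sometimes fails to open and needs several attempts;
    # here the number of attempts is set to 1.
    Max_Num = 1
    for i in range(Max_Num):
        try:
            # "view-source:" is a browser-only prefix that urllib cannot open,
            # so the plain http:// URL is used.
            html = getHtml("http://www.shangxueba.com/jingyan/2438398.html")
            getImg(html)
            break
        except Exception as e:
            if i < Max_Num - 1:
                continue
            print('All %d attempts failed: %s' % (Max_Num, e))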
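
The regular expression in the listing only matches src attributes written with double quotes, and it returns each path exactly as it appears in the page, so a relative path such as /img/a.png cannot be passed to urlretrieve directly. As a rough sketch of one alternative (not the approach the article uses), the standard library's html.parser and urllib.parse.urljoin can collect the image addresses and resolve them against the page URL. The names ImgSrcParser and extract_image_urls, the suffix parameter, and the example.com addresses are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page URL.
                self.sources.append(urljoin(self.base_url, value))


def extract_image_urls(html_text, base_url, suffix=".png"):
    """Return absolute image URLs whose path ends with the given suffix."""
    parser = ImgSrcParser(base_url)
    parser.feed(html_text)
    return [u for u in parser.sources if u.lower().endswith(suffix)]


# Made-up page fragment; no network access is needed to try this.
sample = ('<p><img src="/img/a.png">'
          '<img src="http://example.com/b.jpg">'
          '<IMG SRC="c.png"></p>')
print(extract_image_urls(sample, "http://example.com/page/index.html"))
```

The resulting list could then be handed to urllib.request.urlretrieve one entry at a time, in the same way getImg does.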

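The code comments note that the target page sometimes cannot be opened and has to be tried several times. A minimal sketch of that idea as a reusable function might look like the following, assuming a timeout on each request and a short pause between attempts are acceptable. The function name fetch_with_retry, its parameters, and their defaults are illustrative assumptions and not part of the original code; the opener parameter exists only so the function can be exercised without a network connection.

```python
import time
import urllib.request


def fetch_with_retry(url, max_tries=3, timeout=10, delay=1.0,
                     opener=urllib.request.urlopen):
    """Return the page body as bytes, retrying when the request fails.

    Hypothetical helper: every name and default here is an assumption.
    """
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            with opener(url, timeout=timeout) as page:
                return page.read()
        except OSError as exc:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            last_error = exc
            if attempt < max_tries:
                time.sleep(delay)  # wait briefly before the next attempt
    raise RuntimeError(
        "all %d attempts failed for %s" % (max_tries, url)
    ) from last_error
```

With this in place, getHtml(url) in the listing could be swapped for fetch_with_retry(url, max_tries=3), and the surrounding for loop would no longer be needed.
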
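This article was originally written in Chinese and published on aliyun.com; an English version is also provided, for information purposes only. This website makes no representation or warranty, express or implied, as to the accuracy, completeness, or reliability of the article or of any translation of it. If you have any concerns or complaints about the article, please email info-contact@alibabacloud.com with a detailed description. Staff will contact you within 5 working days, and the infringing content will be removed once verified.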
