Tags: linux, crawler, python
When Google's founders wrote their first crude crawler in Python and ran it on equally crude servers, few could have imagined how thoroughly, over the following decades, they would upend the Internet and, with it, the human world.
Wherever there is a network, there are crawlers. A crawler (English: spider) is a program that scrapes data from websites. For example: with a small program we can periodically scrape listings from sites like Baidu Nuomi or Dazhong Dianping, store them in a database, add a presentation layer, and a group-buying aggregation site is born. Without question, crawlers are the initial data source for many websites.
1. Implementing a first crawler
— fetch the URL of the first article on the blog's article-list page
First, import the urllib module. Use the string method find() to locate the markers around the link in the page source, then slice the string between them to extract the URL itself.
```python
#!/usr/bin/env python
# Python 2: urllib.urlopen fetches the page source as a string
import urllib

con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')    # position of the first article anchor
href = con.find(r'href=', title)  # position of its href attribute
html = con.find(r'.html', href)   # the URL ends in .html
url = con[href + 6:html + 5]      # slice out the URL between href=" and .html
print url
```
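The same find-and-slice technique can be tried without touching the live page. Below is a Python 3 sketch run against a small made-up HTML fragment (the blog path in it is invented for illustration); the offsets match the original: +6 skips `href="`, +5 keeps the trailing `.html`.

```python
# Find-and-slice on a made-up HTML fragment (Python 3)
con = ('<a title="Hello" '
       'href="http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html">Hello</a>')

title = con.find(r'<a title=')    # position of the anchor tag
href = con.find(r'href=', title)  # position of its href attribute
html = con.find(r'.html', href)   # position of the '.html' suffix
url = con[href + 6:html + 5]      # slice out the URL between them
print(url)  # http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html
```

Note that this approach is brittle: it depends on the exact attribute order and quoting in the page source, which is why real projects usually reach for an HTML parser instead.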
2. Fetch the URLs of all articles on the first article-list page
A:
```python
#!/usr/bin/env python
import urllib

url = [''] * 40                   # one list page holds at most 40 articles
i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
while title != -1 and href != -1 and html != -1 and i < 40:
    url[i] = con[href + 6:html + 5]
    print url[i]
    title = con.find(r'<a title=', html)  # resume searching after the last match
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
```
Or B:
```python
#!/usr/bin/env python
import urllib

i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
while title != -1 and href != -1 and html != -1 and i < 50:
    url = con[href + 6:html + 5]  # extract the current URL before advancing
    print url
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
```
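The loop structure is the key idea in both variants: keep calling find() starting from the previous match until it returns -1. Here is a Python 3 sketch of that loop on a made-up two-link fragment, collecting the results into a list instead of printing them:

```python
# Repeated find() until -1, on a made-up HTML fragment (Python 3)
con = ('<a title="A" href="http://example.com/a.html">A</a>'
       '<a title="B" href="http://example.com/b.html">B</a>')

urls = []
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
while title != -1 and href != -1 and html != -1:
    urls.append(con[href + 6:html + 5])   # current URL
    title = con.find(r'<a title=', html)  # resume after this match
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)

print(urls)  # ['http://example.com/a.html', 'http://example.com/b.html']
```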
3. Download all articles on the first article-list page
A:
```python
#!/usr/bin/env python
import time
import urllib

i = 0
url = [''] * 40
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
while target != -1 and base != -1 and end != -1 and i < 20:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    print url[i]
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    i = i + 1

j = 0
while j < i:                      # download each URL found above
    content = urllib.urlopen(url[j]).read()
    # use only the last path segment as the file name: the full URL
    # contains slashes and is not a valid file name
    open(r'zhihu/' + url[j].split('/')[-1], 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    time.sleep(15)                # pause between requests
```
Or B:
```python
#!/usr/bin/env python
import time
import urllib

i = 0
url = [''] * 30
name = [''] * 30
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
while target != -1 and base != -1 and end != -1 and i < 30:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    name[i] = con[base + 16:end - 1]  # tail of the href, used as the file name
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    content = urllib.urlopen(url[i]).read()
    open(r'zhihu/' + name[i] + '.html', 'w+').write(content)
    print 'downloading', name[i]
    time.sleep(5)
    i = i + 1
```
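For comparison, a Python 3 sketch of the same extract-then-download flow, using `urllib.request`. `extract_links()` works on any HTML string with the slice offsets used above; the download loop assumes the page structure and a local `zhihu/` directory from the original, and is illustration only.

```python
# Extract-then-download with urllib.request (Python 3 sketch)
import time
import urllib.request

def extract_links(con, limit=30):
    """Collect (name, url) pairs using the find-and-slice offsets above."""
    links = []
    target = con.find(r'<a target="_blank')
    base = con.find(r'href=', target)
    end = con.find('>', base)
    while target != -1 and base != -1 and end != -1 and len(links) < limit:
        url = 'http://www.zhihu.com' + con[target + 25:end - 1]
        name = con[base + 16:end - 1]  # tail of the href, used as file name
        links.append((name, url))
        target = con.find(r'<a target="_blank', end)
        base = con.find(r'href=', target)
        end = con.find('>', base)
    return links

def download_all(links):
    # assumes an existing 'zhihu/' directory, as in the original
    for name, url in links:
        content = urllib.request.urlopen(url).read()
        with open('zhihu/' + name + '.html', 'wb') as f:
            f.write(content)
        print('downloading', name)
        time.sleep(5)              # pause between requests
```

Separating extraction from downloading also makes the parsing half easy to test against a fixed string.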
4. Download all articles
A:
```python
import time
import urllib

page = 1
url = [''] * 350
i = 0
link = 1
while page <= 7:                  # the article list spans 7 pages
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        link = link + 1
        i = i + 1
    else:
        print 'find end!'
    page = page + 1
else:
    print 'all find end'

j = 0
while j < i:                      # download every URL collected above
    content = urllib.urlopen(url[j]).read()
    open(r'tmp/' + url[j][-26:], 'w+').write(content)
    j = j + 1
    time.sleep(5)
else:
    print 'Download over!'
```
B:
```python
#!/usr/bin/env python
import time
import urllib

i = 0
link = 1
page = 1
url = [''] * 350
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        content = urllib.urlopen(url[i]).read()
        open(r'/tmp/sina/' + url[i][-26:], 'w+').write(content)
        time.sleep(5)
        link = link + 1
        i = i + 1
    page = page + 1
else:
    print 'Download Over!'
```
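The pagination step in both variants relies on the seven list pages differing only in the trailing page number, so the page URLs can be generated directly. A small Python 3 sketch of just that step:

```python
# Generate the seven article-list page URLs (Python 3)
base = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_{}.html'
pages = [base.format(page) for page in range(1, 8)]

print(len(pages))  # 7
print(pages[0])    # http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
```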
Run result:

[screenshot: 1.png]
This post is from the "World" blog; please keep this source link when reposting: http://xiajie.blog.51cto.com/6044823/1679997
Implementing Simple Crawler Functionality in Python