Tags: python, web spider, web crawler
A web spider written in Python:
<span style="font-size:14px;"># web spider
# author vince 2015/7/29
import urllib2
import re

# get href content
pattern = '<a(?:\\s+.+?)*?\\s+href=\"([h]{1}[^\"]*?)\"'
t = set("")  # collection of url

def fecth(url):
    http_request = urllib2.Request(url)
    http_request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36')
    http_response = urllib2.urlopen(http_request)
    print http_response.code
    if http_response.code == 200:
        for i in range(0,2000):  # 2000 rows
            html = http_response.readline()
            if html == '':
                break
            else:
                a = re.search(pattern, html)
                if a:
                    for href in a.groups():
                        print href
                        t.add(href)

# main start
#if __name__ == '__main__':
url = 'http://blog.csdn.net/'  # target site
t.clear()
t.add(url)
while (len(t) != 0):
    uu = t.pop()
    print uu
    fecth(uu)</span>
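The spider above uses `re.search`, so it records at most one link per line, and it keeps no record of URLs it has already fetched, so the same page can be queued again and again. Below is a minimal sketch of one way to handle both points. It is written for Python 3 rather than the Python 2 of the listing, and the names `HREF_PATTERN`, `extract_links`, `crawl`, `fetch_page`, and `max_pages` are hypothetical, introduced only for this illustration. The regular expression is a simplified variant of the original pattern, not an exact equivalent.

```python
import re
from collections import deque

# Simplified variant of the original pattern (an assumption, not the author's regex):
# capture href values inside <a ...> tags that start with "h", e.g. http:// or https://.
HREF_PATTERN = re.compile(r'<a\s+(?:[^>]*?\s+)?href="(h[^"]*)"')


def extract_links(html):
    """Return every matching href in the text, in document order."""
    return HREF_PATTERN.findall(html)


def crawl(start_url, fetch_page, max_pages=10):
    """Breadth-first crawl that visits each URL at most once.

    fetch_page is a caller-supplied function (hypothetical) that takes a URL
    and returns the page text as a string.
    """
    seen = {start_url}             # every URL ever queued
    frontier = deque([start_url])  # URLs waiting to be fetched
    visited = []                   # URLs fetched so far, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for href in extract_links(fetch_page(url)):
            if href not in seen:
                seen.add(href)
                frontier.append(href)
    return visited
```

Passing the download step in as `fetch_page` keeps the crawl logic separate from the network code, so it can be tried against a small dictionary of fake pages before it is pointed at a real site. The `max_pages` cap is there so that the loop always ends.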
If the User-Agent header is not set, some sites refuse the request and respond with HTTP 403 (Forbidden).
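As a rough illustration of the point about User-Agent, here is a sketch in Python 3 (the listing above uses Python 2's `urllib2`; `urllib.request` is its Python 3 counterpart). The helper names `build_request` and `fetch_status`, the constant `DEFAULT_USER_AGENT`, and the timeout value are assumptions made for this example. Whether a given site actually answers 403 without the header depends on that site.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent string, the same one used in the spider above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return error.code
```

Calling `fetch_status('http://blog.csdn.net/')` would perform a real network request, so the result depends on the site and on connectivity.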
Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.