A web spider written in Python:
<span style= "FONT-SIZE:14PX;" >
# Web spider
# author Vince 2015/7/29
import urllib2
import re

# Regex that captures the href value of <a> tags (only links starting with "h")
pattern = '<a(?:\\s+.+?)*?\\s+href=\"([h]{1}[^\"]*?)\"'
t = set()  # collection of URLs still to fetch

def fetch(url):
    http_request = urllib2.Request(url)
    # Send a browser-like User-Agent header
    http_request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36')
    http_response = urllib2.urlopen(http_request)
    print http_response.code
    if http_response.code == 200:
        for i in range(0, 2000):  # read at most 2000 lines
            html = http_response.readline()
            if html == '':
                break
            else:
                a = re.search(pattern, html)
                if a:
                    for href in a.groups():
                        print href
                        t.add(href)

# main start
#if __name__ == '__main__':
url = 'http://blog.csdn.net/'  # target site
t.clear()
t.add(url)
while (len(t) != 0):
    uu = t.pop()
    print uu
    fetch(uu)
</span>
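The loop above pops URLs from a set but does not record which pages have already been fetched, so a link that shows up again is queued again. Below is a minimal sketch of one way to avoid that, written for Python 3 rather than the Python 2 of the script. The names `extract_links` and `crawl`, the `max_pages` limit, the simplified pattern, and the `fetch_page` callable are assumptions made for illustration; they are not part of the original script.

```python
import re
from collections import deque

# Simplified variant of the original idea: capture href values that start
# with "h" (http/https links) inside <a ...> tags.
LINK_PATTERN = re.compile(r'<a\s+(?:[^>]*?\s+)?href="(h[^"]*)"')


def extract_links(html):
    """Return every matching link found in an HTML string, in order."""
    return LINK_PATTERN.findall(html)


def crawl(start_url, fetch_page, max_pages=10):
    """Visit pages breadth-first and return the set of URLs visited.

    fetch_page is a hypothetical callable: fetch_page(url) -> HTML text.
    """
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already fetched, skip it
        visited.add(url)
        for link in extract_links(fetch_page(url)):
            if link not in visited:
                queue.append(link)
    return visited
```

Passing the page fetcher in as a callable keeps the traversal logic separate from the network code, so it can be tried against a dictionary of canned pages before pointing it at a real site.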
If you do not set the User-Agent header, some websites refuse access and return HTTP 403.
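As a rough illustration of the point about the User-Agent header, the sketch below sets it with Python 3's `urllib.request` (the successor to `urllib2`) and reports the status code even when the server answers with an error such as 403. The function names `build_request` and `fetch_status` and the 10-second timeout are assumptions chosen for this example.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent string, as used in the script above.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def fetch_status(url):
    """Return the HTTP status code, including error codes such as 403."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code
```

Catching `HTTPError` matters for a crawler because one refused page would otherwise raise and stop the whole run.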
Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.