I've been practicing writing crawlers recently. At first I crawled a few picture galleries as a test, but after a few dozen requests the server started returning 403 errors: the site had spotted me and blocked my IP.
The way around that is to use proxy IPs. To make them easy to reuse later, I decided to write a crawler that automatically scrapes proxy IPs; as the saying goes, sharpening the axe doesn't delay the woodcutting!
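For reference, pointing urllib2 at a proxy only takes a ProxyHandler. A minimal sketch, assuming a placeholder address (swap in a real 'ip:port'):

import urllib2

# Build an opener whose HTTP traffic goes through the given proxy.
# '1.2.3.4:8080' is a made-up placeholder; substitute a real proxy.
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': '1.2.3.4:8080'}))
print opener.open('http://www.baidu.com', timeout=5).getcode()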
Now for the crawler itself. First, what a run prints: the function returns a list of 'ip:port' strings.
Enough talk, here's the code:
# -*- coding: utf-8 -*-
import urllib2
import re


# Obtain some IPs and ports for a spider from a site: xicidaili.com.
class ObtainProxy:
    def __init__(self, region='Domestic General'):
        # Each region maps to a path on the site's proxy listing pages.
        self.region = {'Domestic General': 'nt/',
                       'Domestic High Stealth': 'nn/',
                       'Foreign General': 'wt/',
                       'Foreign High Stealth': 'wn/',
                       'SOCKS': 'qq/'}
        self.url = 'http://www.xicidaili.com/' + self.region[region]
        # Pretend to be a desktop browser so the site does not reject us.
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                                     'Chrome/31.0.1650.63 Safari/537.36'}

    def get_proxy(self):
        req = urllib2.Request(self.url, headers=self.header)
        resp = urllib2.urlopen(req)
        content = resp.read()
        # The listing is an HTML table: one <td> holds the IP, the next the port.
        self.get_ip = re.findall(r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)</td>',
                                 content)
        self.pro_list = []
        for each in self.get_ip:
            a_info = each[0] + ':' + each[1]
            self.pro_list.append(a_info)
        return self.pro_list

    def save_pro_info(self):
        # Dump the proxies, one 'ip:port' per line, to a file named 'proxy'.
        with open('proxy', 'w') as f:
            for each in self.get_ip:
                f.write(each[0] + ':' + each[1] + '\n')


if __name__ == '__main__':
    proxy = ObtainProxy()
    print proxy.get_proxy()
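One caveat: free proxies from lists like this go stale quickly, so in practice I'd filter the list before relying on it. A rough sketch of that idea (check_proxy is a hypothetical helper of mine, not part of the class above):

import urllib2

def check_proxy(proxy, test_url='http://www.baidu.com'):
    # Route one request through the candidate proxy;
    # any error or timeout means it is dead.
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    try:
        return opener.open(test_url, timeout=5).getcode() == 200
    except Exception:
        return False

alive = [p for p in ObtainProxy().get_proxy() if check_proxy(p)]
print alive

Only the proxies that survive the check are worth feeding back into the image crawler.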
This works quite well.