First, why build a crawler proxy IP pool
Among the anti-crawling measures many websites use, one is to limit access by IP based on request frequency: once an IP exceeds a certain number of requests within a given period, that IP is blacklisted and blocked from the site for a while.
You can deal with this either by lowering the crawler's request frequency or by changing the IP address it uses. The latter requires a pool of usable proxy IPs that the crawler can switch between while it runs, as sketched below.
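To make the switching idea concrete, here is a minimal sketch (not part of the script later in this post) of how a single request can be routed through one proxy IP with requests; the address 10.10.1.10:3128 is only a placeholder, not a real proxy:

import requests

# Placeholder proxy in host:port form; replace with an IP from your pool
proxy_ip = "10.10.1.10:3128"
proxies = {"http": "http://" + proxy_ip, "https": "https://" + proxy_ip}

# The target site sees the proxy's IP instead of the crawler's own IP
response = requests.get("http://www.example.com", proxies=proxies, timeout=5)
print(response.status_code)

Switching IPs then just means building a different proxies dict before each request.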
Second, how to build a crawler proxy IP pool
Approach:
1. Find a website that publishes free proxy IPs (for example, Xicidaili).
2. Crawl the IPs (an ordinary crawl with requests + BeautifulSoup).
3. Verify that each IP works (request a specified URL through the crawled IP and check whether the returned status code is 200).
4. Record the valid IPs (write them to a file).
The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests, threading, datetime
from bs4 import BeautifulSoup
import random

"""
1. Grab proxy IPs from a proxy website
2. Verify the validity of the crawled IPs against the specified target URL
3. Save the valid IPs to the specified path
"""

# --------------------------------- file handling ---------------------------------
# Append a line of text to the file
def write(path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')

# Empty the file
def truncatefile(path):
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()

# Read the file into a list of lines
def read(path):
    with open(path, 'r', encoding='utf-8') as f:
        txt = []
        for s in f.readlines():
            txt.append(s.strip())
    return txt

# ----------------------------------------------------------------------------------
# Compute a time difference, formatted as hours:minutes:seconds
def gettimediff(start, end):
    seconds = (end - start).seconds
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    diff = ("%02d:%02d:%02d" % (h, m, s))
    return diff

# ----------------------------------------------------------------------------------
# Return a random request header
def getheaders():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    useragent = random.choice(user_agent_list)
    headers = {'User-Agent': useragent}
    return headers

# --------------------------------- check whether an IP works ---------------------------------
def checkip(targeturl, ip):
    headers = getheaders()  # custom request headers
    proxies = {"http": "http://" + ip, "https": "https://" + ip}  # proxy IP
    try:
        response = requests.get(url=targeturl, proxies=proxies, headers=headers, timeout=5).status_code
        if response == 200:
            return True
        else:
            return False
    except:
        return False

# --------------------------------- fetch the proxies ---------------------------------
# Free proxies from xicidaili
def findip(type, pagenum, targeturl, path):  # IP type, page number, target URL, path to save IPs
    list = {
        '1': 'http://www.xicidaili.com/nt/',  # xicidaili domestic normal proxies
        '2': 'http://www.xicidaili.com/nn/',  # xicidaili domestic high-anonymity proxies
        '3': 'http://www.xicidaili.com/wn/',  # xicidaili domestic HTTPS proxies
        '4': 'http://www.xicidaili.com/wt/',  # xicidaili foreign HTTP proxies
    }
    url = list[str(type)] + str(pagenum)  # build the URL
    headers = getheaders()  # custom request headers
    html = requests.get(url=url, headers=headers, timeout=5).text
    soup = BeautifulSoup(html, 'lxml')
    all = soup.find_all('tr', class_='odd')
    for i in all:
        t = i.find_all('td')
        ip = t[1].text + ':' + t[2].text
        is_avail = checkip(targeturl, ip)
        if is_avail == True:
            write(path=path, text=ip)
            print(ip)

# --------------------------------- multi-threaded crawl entry point ---------------------------------
def getip(targeturl, path):
    truncatefile(path)  # empty the file before crawling
    start = datetime.datetime.now()  # start time
    threads = []
    for type in range(4):  # four types of IPs, first three pages of each type, 12 threads in total
        for pagenum in range(3):
            t = threading.Thread(target=findip, args=(type + 1, pagenum + 1, targeturl, path))
            threads.append(t)
    print('Start crawling proxy IPs')
    for s in threads:  # start the crawling threads
        s.start()
    for e in threads:  # wait for all threads to finish
        e.join()
    print('Crawl complete')
    end = datetime.datetime.now()  # end time
    diff = gettimediff(start, end)  # elapsed time
    ips = read(path)  # read the crawled IPs
    print('Total proxy IPs crawled: %s, total time: %s \n' % (len(ips), diff))

# --------------------------------- start ---------------------------------
if __name__ == '__main__':
    path = 'ip.txt'  # file that stores the crawled IPs
    targeturl = 'http://www.cnblogs.com/rianley/'  # URL used to verify IP validity
    getip(targeturl, path)
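Once ip.txt has been filled, the pool can be consumed by picking a random proxy for each request. The helper below is a hypothetical sketch, not part of the original script; it assumes it lives in the same file and reuses the read() and getheaders() functions defined above:

# Hypothetical helper: send a request through a random proxy from the pool
def get_with_random_proxy(url, path='ip.txt'):
    ips = read(path)                        # load the saved proxy pool
    ip = random.choice(ips)                 # pick one proxy at random
    proxies = {"http": "http://" + ip, "https": "https://" + ip}
    return requests.get(url, proxies=proxies, headers=getheaders(), timeout=5)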
Results: running the script prints each valid proxy IP as it is found, then a summary with the total number of IPs crawled and the time taken; the valid IPs are also written to ip.txt.
"Python3" How to set up a reptile proxy IP pool