"Python3" How to set up a reptile proxy IP pool

Source: Internet
Author: User

First, why you need to set up a crawler proxy IP pool

Among the many anti-crawling measures websites use, one is to rate-limit by IP: within a given time window, once an IP's number of requests crosses a certain threshold, that IP is blacklisted and blocked from access for a period of time.

This can be dealt with either by lowering the crawler's request frequency or by switching IP addresses. The latter requires a pool of available proxy IPs that the crawler can rotate through while it runs.
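Concretely, the first option is simple throttling, and with the requests library the second amounts to passing a proxies dict. A minimal sketch of both (the proxy address and target URL below are placeholders, not from the original article):

import time
import requests

# Option 1: lower the request frequency - sleep between requests
time.sleep(2)

# Option 2: switch IPs - send the request through a proxy
proxies = {"http": "http://1.2.3.4:8080",    # placeholder proxy address
           "https": "http://1.2.3.4:8080"}
resp = requests.get("http://example.com", proxies=proxies, timeout=5)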

Second, how to set up a crawler proxy IP pool

Approach: 1. Find a free IP proxy website (for example, Xici Proxy, i.e. xicidaili)

2. Crawl the IPs (a routine crawl with requests + BeautifulSoup)

3. Verify each IP's validity (send a request to a specified URL through the crawled IP and check whether the returned status code is 200; see the short sketch after this list)

4. Record the working IPs (write them to a file)
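To make step 3 concrete, here is a minimal sketch of such a validity check. It assumes http://httpbin.org/ip as the test URL (an assumption, not part of the original article); httpbin echoes the origin IP, so a 200 response through the proxy is a reasonable sign that the proxy works:

import requests

def is_proxy_alive(ip_port):
    # Hypothetical helper: route a test request through the candidate proxy.
    # httpbin.org/ip is an assumed test URL, not from the original article.
    proxies = {"http": "http://" + ip_port, "https": "https://" + ip_port}
    try:
        r = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

print(is_proxy_alive("1.2.3.4:8080"))  # placeholder address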

The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
1. Crawl proxy IPs from a free proxy site (xicidaili).
2. Verify each crawled IP against a specified target URL.
3. Save the working IPs to a file.
"""
import datetime
import random
import threading

import requests
from bs4 import BeautifulSoup

# ----------------------------- file handling -----------------------------
def write(path, text):
    # Append one IP per line
    with open(path, 'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')

def truncatefile(path):
    # Empty the file
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()

def read(path):
    # Read the file into a list of stripped lines
    with open(path, 'r', encoding='utf-8') as f:
        return [s.strip() for s in f.readlines()]

# ----------------------------- timing -----------------------------
def gettimediff(start, end):
    # Elapsed time, formatted as HH:MM:SS
    seconds = (end - start).seconds
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    return "%02d:%02d:%02d" % (h, m, s)

# ----------------------------- request headers -----------------------------
def getheaders():
    # Return a random User-Agent header
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    return {'User-Agent': random.choice(user_agent_list)}

# ----------------------------- check whether an IP works -----------------------------
def checkip(targeturl, ip):
    headers = getheaders()  # random request headers
    proxies = {"http": "http://" + ip, "https": "https://" + ip}  # route through the proxy
    try:
        status = requests.get(url=targeturl, proxies=proxies, headers=headers, timeout=5).status_code
        return status == 200
    except requests.RequestException:
        return False

# ----------------------------- fetch proxies -----------------------------
# Free proxies from xicidaili
def findip(ip_type, pagenum, targeturl, path):
    # ip_type: proxy category, pagenum: page number, targeturl: validation URL, path: output file
    urls = {'1': 'http://www.xicidaili.com/nt/',  # xicidaili domestic normal proxies
            '2': 'http://www.xicidaili.com/nn/',  # xicidaili domestic high-anonymity proxies
            '3': 'http://www.xicidaili.com/wn/',  # xicidaili domestic HTTPS proxies
            '4': 'http://www.xicidaili.com/wt/'}  # xicidaili foreign HTTP proxies
    url = urls[str(ip_type)] + str(pagenum)  # build the page URL
    headers = getheaders()  # random request headers
    html = requests.get(url=url, headers=headers, timeout=5).text
    soup = BeautifulSoup(html, 'lxml')
    rows = soup.find_all('tr', class_='odd')
    for row in rows:
        tds = row.find_all('td')
        ip = tds[1].text + ':' + tds[2].text  # host:port
        if checkip(targeturl, ip):
            write(path=path, text=ip)
            print(ip)

# ----------------------------- multithreaded crawl entry point -----------------------------
def getip(targeturl, path):
    truncatefile(path)  # empty the file before crawling
    start = datetime.datetime.now()  # start time
    threads = []
    for ip_type in range(4):       # four proxy categories ...
        for pagenum in range(3):   # ... three pages each: 12 threads in total
            t = threading.Thread(target=findip, args=(ip_type + 1, pagenum + 1, targeturl, path))
            threads.append(t)
    print('Starting to crawl proxy IPs')
    for t in threads:  # start all threads
        t.start()
    for t in threads:  # wait for every thread to finish
        t.join()
    print('Crawl finished')
    end = datetime.datetime.now()  # end time
    diff = gettimediff(start, end)  # elapsed time
    ips = read(path)  # count the IPs that were saved
    print('Total proxy IPs crawled: %s, total time: %s\n' % (len(ips), diff))

# ----------------------------- start -----------------------------
if __name__ == '__main__':
    path = 'ip.txt'  # file that stores the crawled IPs
    targeturl = 'http://www.cnblogs.com/rianley/'  # URL used to verify that a proxy works
    getip(targeturl, path)
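Once ip.txt has been populated by getip(), the crawler can rotate through the pool. A minimal usage sketch (the helper below is hypothetical, and real code would want smarter retry and pool-refresh logic):

import random
import requests

def get_with_random_proxy(url, path='ip.txt', retries=3):
    # Load the pool written by getip()
    with open(path, 'r', encoding='utf-8') as f:
        pool = [line.strip() for line in f if line.strip()]
    for _ in range(retries):
        ip = random.choice(pool)  # pick a different proxy on each attempt
        proxies = {"http": "http://" + ip, "https": "https://" + ip}
        try:
            return requests.get(url, proxies=proxies, timeout=5)
        except requests.RequestException:
            continue  # this proxy failed; try another
    return None  # no working proxy found within the retry budget

resp = get_with_random_proxy('http://www.cnblogs.com/rianley/')  # same test URL as above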


Results:

(The original post shows the output as a screenshot: each working proxy IP is printed as it is verified, followed by the total count and elapsed time.)
