"Python3" How to set up a reptile proxy IP pool

Source: Internet
Author: User

First, why you need to set up a crawler proxy IP pool

Among the many anti-crawling measures websites use, one is to rate-limit by IP: within a given time window, once an IP's number of requests crosses a certain threshold, that IP is blacklisted and blocked from access for a period of time.

This can be dealt with either by lowering the crawler's request frequency or by switching IP addresses. The latter requires a pool of available proxy IPs that the crawler can rotate through while it runs.
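Concretely, the first option is simple throttling, and with the requests library the second amounts to passing a proxies dict. A minimal sketch of both (the proxy address and target URL below are placeholders, not from the original article):

import time
import requests

# Option 1: lower the request frequency - sleep between requests
time.sleep(2)

# Option 2: switch IPs - send the request through a proxy
proxies = {"http": "http://1.2.3.4:8080",    # placeholder proxy address
           "https": "http://1.2.3.4:8080"}
resp = requests.get("http://example.com", proxies=proxies, timeout=5)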

Second, how to set up a crawler proxy IP pool

Approach: 1. Find a free IP proxy website (for example, Xici Proxy, i.e. xicidaili)

2. Crawl the IPs (a routine crawl with requests + BeautifulSoup)

3. Verify each IP's validity (send a request to a specified URL through the crawled IP and check whether the returned status code is 200; see the short sketch after this list)

4. Record the working IPs (write them to a file)
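To make step 3 concrete, here is a minimal sketch of such a validity check. It assumes http://httpbin.org/ip as the test URL (an assumption, not part of the original article); httpbin echoes the origin IP, so a 200 response through the proxy is a reasonable sign that the proxy works:

import requests

def is_proxy_alive(ip_port):
    # Hypothetical helper: route a test request through the candidate proxy.
    # httpbin.org/ip is an assumed test URL, not from the original article.
    proxies = {"http": "http://" + ip_port, "https": "https://" + ip_port}
    try:
        r = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

print(is_proxy_alive("1.2.3.4:8080"))  # placeholder address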

The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
1. Crawl proxy IPs from a free proxy site (xicidaili).
2. Verify each crawled IP against a specified target URL.
3. Save the working IPs to a file.
"""
import datetime
import random
import threading

import requests
from bs4 import BeautifulSoup

# ----------------------------- file handling -----------------------------
def write(path, text):
    # Append one IP per line
    with open(path, 'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')

def truncatefile(path):
    # Empty the file
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()

def read(path):
    # Read the file into a list of stripped lines
    with open(path, 'r', encoding='utf-8') as f:
        return [s.strip() for s in f.readlines()]

# ----------------------------- timing -----------------------------
def gettimediff(start, end):
    # Elapsed time, formatted as HH:MM:SS
    seconds = (end - start).seconds
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    return "%02d:%02d:%02d" % (h, m, s)

# ----------------------------- request headers -----------------------------
def getheaders():
    # Return a random User-Agent header
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    return {'User-Agent': random.choice(user_agent_list)}

# ----------------------------- check whether an IP works -----------------------------
def checkip(targeturl, ip):
    headers = getheaders()  # random request headers
    proxies = {"http": "http://" + ip, "https": "https://" + ip}  # route through the proxy
    try:
        status = requests.get(url=targeturl, proxies=proxies, headers=headers, timeout=5).status_code
        return status == 200
    except requests.RequestException:
        return False

# ----------------------------- fetch proxies -----------------------------
# Free proxies from xicidaili
def findip(ip_type, pagenum, targeturl, path):
    # ip_type: proxy category, pagenum: page number, targeturl: validation URL, path: output file
    urls = {'1': 'http://www.xicidaili.com/nt/',  # xicidaili domestic normal proxies
            '2': 'http://www.xicidaili.com/nn/',  # xicidaili domestic high-anonymity proxies
            '3': 'http://www.xicidaili.com/wn/',  # xicidaili domestic HTTPS proxies
            '4': 'http://www.xicidaili.com/wt/'}  # xicidaili foreign HTTP proxies
    url = urls[str(ip_type)] + str(pagenum)  # build the page URL
    headers = getheaders()  # random request headers
    html = requests.get(url=url, headers=headers, timeout=5).text
    soup = BeautifulSoup(html, 'lxml')
    rows = soup.find_all('tr', class_='odd')
    for row in rows:
        tds = row.find_all('td')
        ip = tds[1].text + ':' + tds[2].text  # host:port
        if checkip(targeturl, ip):
            write(path=path, text=ip)
            print(ip)

# ----------------------------- multithreaded crawl entry point -----------------------------
def getip(targeturl, path):
    truncatefile(path)  # empty the file before crawling
    start = datetime.datetime.now()  # start time
    threads = []
    for ip_type in range(4):       # four proxy categories ...
        for pagenum in range(3):   # ... three pages each: 12 threads in total
            t = threading.Thread(target=findip, args=(ip_type + 1, pagenum + 1, targeturl, path))
            threads.append(t)
    print('Starting to crawl proxy IPs')
    for t in threads:  # start all threads
        t.start()
    for t in threads:  # wait for every thread to finish
        t.join()
    print('Crawl finished')
    end = datetime.datetime.now()  # end time
    diff = gettimediff(start, end)  # elapsed time
    ips = read(path)  # count the IPs that were saved
    print('Total proxy IPs crawled: %s, total time: %s\n' % (len(ips), diff))

# ----------------------------- start -----------------------------
if __name__ == '__main__':
    path = 'ip.txt'  # file that stores the crawled IPs
    targeturl = 'http://www.cnblogs.com/rianley/'  # URL used to verify that a proxy works
    getip(targeturl, path)
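Once ip.txt has been populated by getip(), the crawler can rotate through the pool. A minimal usage sketch (the helper below is hypothetical, and real code would want smarter retry and pool-refresh logic):

import random
import requests

def get_with_random_proxy(url, path='ip.txt', retries=3):
    # Load the pool written by getip()
    with open(path, 'r', encoding='utf-8') as f:
        pool = [line.strip() for line in f if line.strip()]
    for _ in range(retries):
        ip = random.choice(pool)  # pick a different proxy on each attempt
        proxies = {"http": "http://" + ip, "https": "https://" + ip}
        try:
            return requests.get(url, proxies=proxies, timeout=5)
        except requests.RequestException:
            continue  # this proxy failed; try another
    return None  # no working proxy found within the retry budget

resp = get_with_random_proxy('http://www.cnblogs.com/rianley/')  # same test URL as above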


Results:

(The original post shows the output as a screenshot: each working proxy IP is printed as it is verified, followed by the total count and elapsed time.)
