This article shares a Python example for fetching proxy IPs. It may be a useful reference for anyone who needs it.

When crawling data, we often run into sites that block repeated access from the same IP. In that case we should use proxy IPs, disguising ourselves before each visit so that the "enemy" cannot detect us.
OK, let's have a pleasant start!

This is the file that gets the proxy IPs; I have modularized it into three functions.

Note: the comments in the code are in English. This is simply more convenient when writing code; a word or two of English is enough.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Author: dasuda"""
import urllib2
import re
import socket
import threading

IP_data = []             # "ip:port" strings scraped from the site
IP_data_checked = []     # proxies that passed the availability check
available_table = []     # indices of the available proxies in IP_data
lock = threading.Lock()  # one shared lock for all checker threads


def GetIP(url_target):
    """Scrape proxy IPs and return the ones that pass the check."""
    patternIP = re.compile(r'(?<=<td>)\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    patternPORT = re.compile(r'(?<=<td>)\d{2,5}(?=</td>)')
    print "Now, start to refresh proxy IP..."
    for page in range(1, 4):
        url = 'http://www.xicidaili.com/nn/' + str(page)
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
        findIP = re.findall(patternIP, str(content))
        findPORT = re.findall(patternPORT, str(content))
        # assemble the IP and port
        for i in range(len(findIP)):
            findIP[i] = findIP[i] + ":" + findPORT[i]
        IP_data.extend(findIP)
        print "Get page", page
    print "Refresh done!!!"
    # use multithreading to check availability
    mul_thread_check(url_target)
    return IP_data_checked


def check_one(url_check, i):
    """Thread function: try url_check through proxy IP_data[i]."""
    # set timeout
    socket.setdefaulttimeout(8)
    try:
        ppp = {"http": IP_data[i]}
        proxy_support = urllib2.ProxyHandler(ppp)
        opener_check = urllib2.build_opener(proxy_support)
        urllib2.install_opener(opener_check)
        request = urllib2.Request(url_check)
        request.add_header('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64)")
        html = urllib2.urlopen(request).read()
        lock.acquire()
        print IP_data[i], 'is OK'
        # record the index of the available proxy
        available_table.append(i)
        lock.release()
    except Exception:
        lock.acquire()
        print 'Error'
        lock.release()


def mul_thread_check(url_mul_check):
    threads = []
    for i in range(len(IP_data)):
        # create one thread per proxy IP
        thread = threading.Thread(target=check_one, args=[url_mul_check, i])
        threads.append(thread)
        thread.start()
        print "New thread start", i
    for thread in threads:
        thread.join()
    # assemble IP_data_checked[] from the recorded indices
    for idx in available_table:
        IP_data_checked.append({'http': IP_data[idx]})
    print "Available proxy IP:", len(available_table)
```
First, GetIP(url_target): the main function. Its parameter is the URL used to verify proxy IP availability; Ipchina is recommended.

The proxy IPs are fetched from http://www.xicidaili.com/nn/, a free proxy IP site. However, not every IP listed there actually works. Depending on your actual location, network conditions, the target server, and so on, usually less than 20% are usable; at least that has been my experience.

The http://www.xicidaili.com/nn/ page is fetched in the normal (non-proxied) way, and the required IPs and corresponding ports are extracted from the returned page with regular expressions, as follows:
```python
patternIP = re.compile(r'(?<=<td>)\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
patternPORT = re.compile(r'(?<=<td>)\d{2,5}(?=</td>)')
...
findIP = re.findall(patternIP, str(content))
findPORT = re.findall(patternPORT, str(content))
```
You can refer to other articles on how to construct regular expressions.

The scraped IPs are stored in findIP and the corresponding ports in findPORT; the two lists are aligned by index. A page normally yields 100 IPs.
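To see the two patterns in action, here is a small self-contained Python 3 sketch run against a hypothetical HTML fragment in the same `<td>` layout (the fragment and its values are made up for illustration):

```python
import re

# Hypothetical snippet in the same <td> layout as the proxy listing page
content = "<td>121.31.199.31</td><td>8123</td><td>110.73.0.4</td><td>8888</td>"

# Same patterns as in the article: lookbehind/lookahead anchored on the <td> tags
patternIP = re.compile(r'(?<=<td>)\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
patternPORT = re.compile(r'(?<=<td>)\d{2,5}(?=</td>)')

findIP = re.findall(patternIP, content)
findPORT = re.findall(patternPORT, content)
print(findIP)    # ['121.31.199.31', '110.73.0.4']
print(findPORT)  # ['8123', '8888']
```

The port pattern cannot accidentally match inside an IP cell, because the lookahead requires the digits to be followed immediately by `</td>`.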
Next, the IP and port are stitched together into "ip:port" strings.
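The stitching step amounts to joining the two index-aligned lists; a minimal Python 3 sketch with made-up values:

```python
# Hypothetical scraped values, aligned by index as on the listing page
findIP = ['121.31.199.31', '110.73.0.4']
findPORT = ['8123', '8888']

# Join each IP with its port, as the article's for-loop does in place
IP_data = [ip + ":" + port for ip, port in zip(findIP, findPORT)]
print(IP_data)  # ['121.31.199.31:8123', '110.73.0.4:8888']
```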
Finally, the availability check.

Second, check_one(url_check, i): the thread function.

This function accesses url_check through the proxy, still as a normal page request. If the page comes back, the proxy IP works, and its current index value is recorded so that all available IPs can be collected afterwards.
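A rough Python 3 equivalent of this check, using urllib.request instead of the article's urllib2. The fetch step is injectable here purely so the recording logic can be demonstrated offline, and the lock is a single module-level object shared by all threads (a per-thread Lock would provide no real synchronization):

```python
import socket
import threading
import urllib.request

IP_data = ['121.31.199.31:8123']  # hypothetical candidate proxies
available_table = []              # indices of proxies that responded
lock = threading.Lock()           # one shared lock for all checker threads

def check_one(url_check, i, fetch=None):
    """Try url_check through proxy IP_data[i]; record i on success."""
    socket.setdefaulttimeout(8)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': IP_data[i]}))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')]
    try:
        # fetch is injectable only so the function can be exercised offline
        (fetch or (lambda: opener.open(url_check).read()))()
        with lock:  # guard the shared list (and interleaved printing)
            print(IP_data[i], 'is OK')
            available_table.append(i)
    except Exception:
        with lock:
            print(IP_data[i], 'failed')

# Offline demo with a stub fetch that "succeeds":
check_one('http://example.com', 0, fetch=lambda: b'ok')
print(available_table)  # [0]
```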
Third, mul_thread_check(url_mul_check): multi-threaded checking.

This function checks proxy IP availability with multiple threads, opening one thread per IP.
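The thread bookkeeping can be sketched like this in Python 3, with a stub in place of the real network check so the flow is runnable offline (the IPs and the alive flags are made up):

```python
import threading

IP_data = ['1.2.3.4:80', '5.6.7.8:8080', '9.9.9.9:3128']  # hypothetical
available_table = []
lock = threading.Lock()

def check_one_stub(i, alive):
    # stand-in for the real proxy check; alive says whether it would succeed
    if alive:
        with lock:
            available_table.append(i)

# one thread per candidate IP, as in the article
threads = []
for i, alive in enumerate([True, False, True]):
    t = threading.Thread(target=check_one_stub, args=(i, alive))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

# assemble the checked list in the article's final format
# (sorted, since thread completion order is nondeterministic)
IP_data_checked = [{'http': IP_data[i]} for i in sorted(available_table)]
print(IP_data_checked)
```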
The project calls GetIP() directly, passing in the URL used for the availability check, and gets back the list of proxies that passed. Each entry is already wrapped in the dict form that ProxyHandler expects:

[{'http': 'ip1:port1'}, {'http': 'ip2:port2'}, ...]
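An entry from the checked list can be plugged straight into a proxy handler; here is a Python 3 sketch using urllib.request (the proxy address is hypothetical, and the actual network call is left commented out):

```python
import urllib.request

# One entry from the checked list (hypothetical proxy address)
proxy = {'http': '121.31.199.31:8123'}

# Build an opener that routes http traffic through that proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')]

# html = opener.open('http://www.xicidaili.com/nn/').read()  # network call
print(proxy)
```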