Sharing a Python example for fetching proxy IPs

Source: Internet
Author: User
This article introduces a Python example for fetching proxy IPs. It has some reference value and is shared here for anyone who needs it.

Usually, when we need to crawl some data, some sites block repeated access from the same IP. In that case we should use proxy IPs, disguising ourselves before each visit so that the "enemy" cannot detect us.

OK, let us have a pleasant start!

This is the file that gets the proxy IPs; I have modularized it into three functions.

Note: the code contains some English comments; they were convenient to write, since each is only a word or two of English.

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Author: dasuda"""
import urllib2
import re
import socket
import threading

findIP = []              # raw IPs scraped from the page
IP_data = []             # IPs after the port has been appended
IP_data_checked = []     # IPs that passed the availability check
findPORT = []            # ports corresponding to each IP
available_table = []     # indices of the available IPs
lock = threading.Lock()  # one shared lock for all checker threads

def getIP(url_target):
    patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
    patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
    print "Now, start to refresh proxy IP..."
    for page in range(1, 4):
        url = 'http://www.xicidaili.com/nn/' + str(page)
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
        findIP = re.findall(patternIP, str(content))
        findPORT = re.findall(patternPORT, str(content))
        # assemble the IP and port
        for i in range(len(findIP)):
            findIP[i] = findIP[i] + ":" + findPORT[i]
        IP_data.extend(findIP)
        print "Get page", page
    print "Refresh done!!!"
    # use multithreading to check availability
    mul_thread_check(url_target)
    return IP_data_checked

def check_one(url_check, i):
    # set the connection timeout
    socket.setdefaulttimeout(8)
    try:
        ppp = {"http": IP_data[i]}
        proxy_support = urllib2.ProxyHandler(ppp)
        opener_check = urllib2.build_opener(proxy_support)
        urllib2.install_opener(opener_check)
        request = urllib2.Request(url_check)
        request.add_header('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64)")
        html = urllib2.urlopen(request).read()
        lock.acquire()
        print IP_data[i], 'is OK'
        # record the index of the available IP
        available_table.append(i)
        lock.release()
    except Exception:
        lock.acquire()
        print 'Error'
        lock.release()

def mul_thread_check(url_mul_check):
    threads = []
    for i in range(len(IP_data)):
        # create one checker thread per IP
        thread = threading.Thread(target=check_one, args=[url_mul_check, i])
        threads.append(thread)
        thread.start()
        print "New thread start", i
    for thread in threads:
        thread.join()
    # build IP_data_checked[] from the recorded indices
    for idx in range(len(available_table)):
        assembled_ip = {'http': IP_data[available_table[idx]]}
        IP_data_checked.append(assembled_ip)
    print "Available proxy IP:", len(available_table)

First, getIP(url_target): the main function. Its parameter is the URL used to verify proxy IP availability; Ipchina is recommended.

The proxy IPs are fetched from the http://www.xicidaili.com/nn/ website, a free proxy IP site. Not all of the IPs there work, though; depending on your actual location, network conditions, the target server, and so on, fewer than 20% are usable — at least that was my experience.

The http://www.xicidaili.com/nn/ website is accessed in the normal way, and the returned page content is searched with regular expressions to obtain the required IPs and their corresponding ports. The code is as follows:

patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
...
findIP = re.findall(patternIP, str(content))
findPORT = re.findall(patternPORT, str(content))

You can refer to other articles on how to construct regular expressions.

The scraped IPs are saved in findIP and the corresponding ports in findPORT; the two lists are paired by index. A normal page yields 100 IPs.
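As a quick illustration, here is how the two patterns behave on a small hand-made HTML fragment (the fragment and its values are invented for this example; a real page uses the same <td> layout):

```python
import re

# hypothetical table fragment in the same <td> layout as the target page
content = "<td>61.135.217.7</td><td>80</td><td>112.114.93.50</td><td>8118</td>"

patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')

findIP = re.findall(patternIP, content)
findPORT = re.findall(patternPORT, content)

print(findIP)    # ['61.135.217.7', '112.114.93.50']
print(findPORT)  # ['80', '8118']
```

Note that the port pattern's lookahead (?=</td>) keeps it from matching digit runs inside the IP addresses.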

Next, the IPs and ports are stitched together.
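The stitching step can also be written with zip, which pairs the two lists by index (the list names follow the article; the sample values are made up):

```python
# sample values for illustration
findIP = ['61.135.217.7', '112.114.93.50']
findPORT = ['80', '8118']

# pair each IP with its port by index, then join with ":"
IP_data = [ip + ":" + port for ip, port in zip(findIP, findPORT)]
print(IP_data)  # ['61.135.217.7:80', '112.114.93.50:8118']
```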

Finally, the availability check is performed.

Second, check_one(url_check, i): the thread function.

This function accesses url_check through the proxy being tested. If a page comes back, the proxy IP is usable, and its index is recorded so that all available IPs can be extracted afterwards.
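For readers on Python 3, where urllib2 has become urllib.request, a single-proxy check can be sketched like this (the function name check_proxy and the return-a-bool design are our own choices, not from the original code):

```python
import urllib.request

def check_proxy(proxy, url_check, timeout=8):
    """Return True if url_check is reachable through proxy ('ip:port')."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(
        url_check,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"})
    try:
        # any HTTP or socket error means the proxy is unusable
        opener.open(request, timeout=timeout).read()
        return True
    except Exception:
        return False
```

Calling opener.open directly, instead of install_opener as the original does, keeps the proxy local to this one check — with install_opener, concurrent threads overwrite each other's global opener.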

Third, mul_thread_check(url_mul_check): multithreaded checking.

This function turns on multithreading to check proxy IP availability, and each IP opens a thread for checking.
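The fan-out-and-join pattern described above, with a lock guarding the shared result list, looks like this in isolation (the dummy even-number check stands in for the real network test):

```python
import threading

results = []
lock = threading.Lock()

def check_one(i):
    # dummy "availability" test standing in for the real network check
    if i % 2 == 0:
        with lock:  # guard the shared list
            results.append(i)

# one thread per item, mirroring one thread per proxy IP
threads = [threading.Thread(target=check_one, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every checker to finish

print(sorted(results))  # [0, 2, 4, 6, 8]
```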

The project calls getIP() directly, passing in the URL used for the availability check. It returns the list of proxies that passed the check; each entry is a dict ready to hand to a proxy handler, in the format

[{'http': 'ip1:port1'}, {'http': 'ip2:port2'}, ...]