Scanning for proxies and collecting available proxy IP addresses in Python

Source: Internet
Author: User


Today I wrote a small practical tool that scans for proxies and collects the usable ones.

First, a quick Baidu search turned up a website to use as an example: http://www.xicidaili.com

The site publishes many usable proxy IP addresses and ports, both domestic and foreign.

As usual, let's analyze the site first; we will start by scanning all the domestic proxies.

Clicking into the domestic section, we find that the domestic proxy list pages all follow this URL pattern:

http://www.xicidaili.com/nn/x

Here x runs to more than two thousand pages, so it looks like we will need threads to get through them all...
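For orientation, a minimal sketch of the URL list that will eventually feed the worker threads; the upper bound of 2288 pages is taken from the full script at the end of this post and reflects the site at the time of writing:

# Build the list-page URLs that the worker threads will later consume.
page_urls = ['http://www.xicidaili.com/nn/' + str(i) for i in range(1, 2288)]
print(len(page_urls))  # a couple of thousand pages, hence the threads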

As always, we first try to fetch the content with the simplest possible requests.get().

The bare request returns 503, so we add a simple headers dict.

With the headers added, it returns 200.
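A minimal sketch of that first probe; the User-Agent value is only an example, since the original post does not show which headers it actually used:

import requests

url = 'http://www.xicidaili.com/nn/1'

# A bare request is rejected by the site.
r = requests.get(url=url)
print(r.status_code)       # 503

# Adding a simple browser-like header is enough to get through.
headers = {'User-Agent': 'Mozilla/5.0'}   # example value only
r = requests.get(url=url, headers=headers)
print(r.status_code)       # 200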

Now we can parse the page and pull out the content we need.

The rows containing the IP information live in <tr> tags, so we can easily grab them with BeautifulSoup (bs).

Within each <tr>, the IP, port, and protocol sit in three of its <td> tags: the 2nd, 3rd, and 6th.

So we start writing the code, with the following idea:

When processing a page, extract the <tr> tags first, then extract the <td> tags inside each <tr>.

So two bs passes are used, and the second pass needs the row converted to a str first.

After getting a <tr>, we want its 2nd, 3rd, and 6th cells,

but the single element i yielded by the for loop cannot be indexed for those cells directly.

So we run a second find_all over the soup of each row to get its <td> tags, and then pick out the 2nd, 3rd, and 6th entries directly.

After extracting a cell, append .string to get its text content.

r = requests.get(url = url, headers = headers)
soup = bs(r.content, "html.parser")
data = soup.find_all(name = 'tr', attrs = {'class': re.compile('|[^odd]')})
for i in data:
    soup = bs(str(i), 'html.parser')
    data2 = soup.find_all(name = 'td')
    ip = str(data2[1].string)
    port = str(data2[2].string)
    types = str(data2[5].string).lower()

    proxy = {}
    proxy[types] = '%s:%s' % (ip, port)

Each iteration of the loop then produces the corresponding proxy dictionary, which we can go on to verify for availability.

Note the conversion of types to lowercase: the protocol key in the proxies dict passed to requests.get must be lowercase, while the page shows it in uppercase, hence the case conversion.
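To make the point concrete, this is roughly what the proxy dict ends up looking like and how requests expects it; the address is made up for illustration:

# The page lists 'HTTP' / 'HTTPS'; requests wants lowercase keys in proxies.
types = 'HTTP'.lower()                     # -> 'http'
proxy = {types: '118.144.67.65:8080'}      # made-up ip:port, for illustration only
# requests.get('http://example.com', proxies=proxy)  # traffic then goes via the proxy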

How can we verify ip availability?

It's easy: we issue a get through our proxy to this website:

http://1212.ip138.com/ic.asp

This is a handy little site: it tells you what your public Internet IP address is.

url = 'http://1212.ip138.com/ic.asp'
r = requests.get(url = url, proxies = proxy, timeout = 6)

We add a timeout here to weed out proxies that take too long to respond; I set it to 6 seconds.
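Slow or dead proxies surface as exceptions from requests, so in practice the check gets wrapped in a try/except; a hedged sketch (the helper name is mine, not from the original code):

import requests

def try_proxy(proxy, url='http://1212.ip138.com/ic.asp'):
    # Returns the response if the proxy answers within 6 seconds, otherwise None.
    try:
        return requests.get(url=url, proxies=proxy, timeout=6)
    except requests.exceptions.RequestException:
        # Covers timeouts, refused connections and other bad-proxy failures.
        return None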

We try it with one IP address and analyze the returned page.

The returned content is a small HTML page whose <center> element reports the visitor's IP address in square brackets.

So we only need to extract the content between the square brackets from the page.
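A small sketch of that extraction, assuming (as the final script does) that the IP appears between square brackets inside the page's <center> element; the surrounding wording here is paraphrased:

import re

center_text = 'Your IP address is [118.144.67.65] ...'   # paraphrased example text
found = re.findall(r'\[(.*?)\]', center_text)
print(found[0])   # '118.144.67.65'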

If our proxy is available, the proxy ip address will be returned.

(Sometimes the returned address is still our own machine's public IP. I am not entirely sure why, but I exclude that case and treat the proxy as unavailable anyway.)

So the check is simple: if the returned IP matches the IP in the proxy dictionary, we treat it as a usable proxy and write it to a file.

Finally, we put it all together with a Queue and threading worker threads.

Code:

#coding=utf-8
import requests
import re
from bs4 import BeautifulSoup as bs
import Queue
import threading

# Worker thread: keeps pulling list-page URLs from the queue and scanning them.
class proxyPick(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while not self._queue.empty():
            url = self._queue.get()
            proxy_spider(url)

# Scrape one list page: pull ip/port/protocol out of each table row.
def proxy_spider(url):
    headers = {
        .......
    }
    r = requests.get(url = url, headers = headers)
    soup = bs(r.content, "html.parser")
    data = soup.find_all(name = 'tr', attrs = {'class': re.compile('|[^odd]')})
    for i in data:
        soup = bs(str(i), 'html.parser')
        data2 = soup.find_all(name = 'td')
        ip = str(data2[1].string)
        port = str(data2[2].string)
        types = str(data2[5].string).lower()

        proxy = {}
        proxy[types] = '%s:%s' % (ip, port)
        try:
            proxy_check(proxy, ip)
        except Exception, e:
            print e
            pass

# Verify one proxy: request ip138 through it and compare the reported IP.
def proxy_check(proxy, ip):
    url = 'http://1212.ip138.com/ic.asp'
    r = requests.get(url = url, proxies = proxy, timeout = 6)
    f = open('E:/url/ip_proxy.txt', 'a+')
    soup = bs(r.text, 'html.parser')
    data = soup.find_all(name = 'center')
    for i in data:
        a = re.findall(r'\[(.*?)\]', i.string)
        if a[0] == ip:
            #print proxy
            f.write('%s' % proxy + '\n')
            print 'write down'
    f.close()

#proxy_spider()

def main():
    queue = Queue.Queue()
    for i in range(1, 2288):
        queue.put('http://www.xicidaili.com/nn/' + str(i))
    threads = []
    thread_count = 10
    for i in range(thread_count):
        spider = proxyPick(queue)
        threads.append(spider)
    for i in threads:
        i.start()
    for i in threads:
        i.join()
    print "It's down,sir!"

if __name__ == '__main__':
    main()

With this, all of the usable proxies found on the site are written to the ip_proxy.txt file.
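If you later want to reuse the results, note that each line of ip_proxy.txt is the str() of a proxy dict, so the file can be read back, for example, like this (a sketch under that assumption):

import ast

proxies = []
with open('E:/url/ip_proxy.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            # Each line was written as str(proxy), e.g. "{'http': '1.2.3.4:8080'}"
            proxies.append(ast.literal_eval(line))

print(len(proxies))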

That is the whole example of scanning for proxies and collecting available proxy IP addresses in Python. I hope it serves as a useful reference.
