Scanning for proxies and collecting available proxy IP addresses in Python

Source: Internet
Author: User


Today I wrote a small practical tool that scans for proxies and collects the usable ones.

First, a quick Baidu search turned up a website to use as an example: http://www.xicidaili.com

The site publishes many usable proxy IP addresses and ports, both domestic and foreign.

As usual, let's analyze the site first; we will start by scanning all the domestic proxies.

Clicking into the domestic section, we find that the domestic proxy list pages all follow this URL pattern:

http://www.xicidaili.com/nn/x

Here x runs to more than two thousand pages, so it looks like we will need threads to get through them all...
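For orientation, a minimal sketch of the URL list that will eventually feed the worker threads; the upper bound of 2288 pages is taken from the full script at the end of this post and reflects the site at the time of writing:

# Build the list-page URLs that the worker threads will later consume.
page_urls = ['http://www.xicidaili.com/nn/' + str(i) for i in range(1, 2288)]
print(len(page_urls))  # a couple of thousand pages, hence the threads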

As always, we first try to fetch the content with the simplest possible requests.get().

The bare request returns 503, so we add a simple headers dict.

With the headers added, it returns 200.
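A minimal sketch of that first probe; the User-Agent value is only an example, since the original post does not show which headers it actually used:

import requests

url = 'http://www.xicidaili.com/nn/1'

# A bare request is rejected by the site.
r = requests.get(url=url)
print(r.status_code)       # 503

# Adding a simple browser-like header is enough to get through.
headers = {'User-Agent': 'Mozilla/5.0'}   # example value only
r = requests.get(url=url, headers=headers)
print(r.status_code)       # 200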

Now we can parse the page and pull out the content we need.

The rows containing the IP information live in <tr> tags, so we can easily grab them with BeautifulSoup (bs).

Within each <tr>, the IP, port, and protocol sit in three of its <td> tags: the 2nd, 3rd, and 6th.

So we start writing the code, with the following idea:

When processing a page, extract the <tr> tags first, then extract the <td> tags inside each <tr>.

So two bs passes are used, and the second pass needs the row converted to a str first.

After getting a <tr>, we want its 2nd, 3rd, and 6th cells,

but the single element i yielded by the for loop cannot be indexed for those cells directly.

So we run a second find_all over the soup of each row to get its <td> tags, and then pick out the 2nd, 3rd, and 6th entries directly.

After extracting a cell, append .string to get its text content.

r = requests.get(url = url, headers = headers)
soup = bs(r.content, "html.parser")
data = soup.find_all(name = 'tr', attrs = {'class': re.compile('|[^odd]')})
for i in data:
    soup = bs(str(i), 'html.parser')
    data2 = soup.find_all(name = 'td')
    ip = str(data2[1].string)
    port = str(data2[2].string)
    types = str(data2[5].string).lower()

    proxy = {}
    proxy[types] = '%s:%s' % (ip, port)

Each iteration of the loop then produces the corresponding proxy dictionary, which we can go on to verify for availability.

Note the conversion of types to lowercase: the protocol key in the proxies dict passed to requests.get must be lowercase, while the page shows it in uppercase, hence the case conversion.
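To make the point concrete, this is roughly what the proxy dict ends up looking like and how requests expects it; the address is made up for illustration:

# The page lists 'HTTP' / 'HTTPS'; requests wants lowercase keys in proxies.
types = 'HTTP'.lower()                     # -> 'http'
proxy = {types: '118.144.67.65:8080'}      # made-up ip:port, for illustration only
# requests.get('http://example.com', proxies=proxy)  # traffic then goes via the proxy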

How can we verify ip availability?

It's easy: we issue a get through our proxy to this website:

http://1212.ip138.com/ic.asp

This is a handy little site: it tells you what your public Internet IP address is.

url = 'http://1212.ip138.com/ic.asp'
r = requests.get(url = url, proxies = proxy, timeout = 6)

We add a timeout here to weed out proxies that take too long to respond; I set it to 6 seconds.
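Slow or dead proxies surface as exceptions from requests, so in practice the check gets wrapped in a try/except; a hedged sketch (the helper name is mine, not from the original code):

import requests

def try_proxy(proxy, url='http://1212.ip138.com/ic.asp'):
    # Returns the response if the proxy answers within 6 seconds, otherwise None.
    try:
        return requests.get(url=url, proxies=proxy, timeout=6)
    except requests.exceptions.RequestException:
        # Covers timeouts, refused connections and other bad-proxy failures.
        return None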

We try it with one IP address and analyze the returned page.

The returned content is a small HTML page whose <center> element reports the visitor's IP address in square brackets.

So we only need to extract the content between the square brackets from the page.
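A small sketch of that extraction, assuming (as the final script does) that the IP appears between square brackets inside the page's <center> element; the surrounding wording here is paraphrased:

import re

center_text = 'Your IP address is [118.144.67.65] ...'   # paraphrased example text
found = re.findall(r'\[(.*?)\]', center_text)
print(found[0])   # '118.144.67.65'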

If our proxy is available, the proxy ip address will be returned.

(Sometimes the returned address is still our own machine's public IP. I am not entirely sure why, but I exclude that case and treat the proxy as unavailable anyway.)

So the check is simple: if the returned IP matches the IP in the proxy dictionary, we treat it as a usable proxy and write it to a file.

Finally, we put it all together with a Queue and threading worker threads.

Code:

#coding=utf-8
import requests
import re
from bs4 import BeautifulSoup as bs
import Queue
import threading

# Worker thread: keeps pulling list-page URLs from the queue and scanning them.
class proxyPick(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while not self._queue.empty():
            url = self._queue.get()
            proxy_spider(url)

# Scrape one list page: pull ip/port/protocol out of each table row.
def proxy_spider(url):
    headers = {
        .......
    }
    r = requests.get(url = url, headers = headers)
    soup = bs(r.content, "html.parser")
    data = soup.find_all(name = 'tr', attrs = {'class': re.compile('|[^odd]')})
    for i in data:
        soup = bs(str(i), 'html.parser')
        data2 = soup.find_all(name = 'td')
        ip = str(data2[1].string)
        port = str(data2[2].string)
        types = str(data2[5].string).lower()

        proxy = {}
        proxy[types] = '%s:%s' % (ip, port)
        try:
            proxy_check(proxy, ip)
        except Exception, e:
            print e
            pass

# Verify one proxy: request ip138 through it and compare the reported IP.
def proxy_check(proxy, ip):
    url = 'http://1212.ip138.com/ic.asp'
    r = requests.get(url = url, proxies = proxy, timeout = 6)
    f = open('E:/url/ip_proxy.txt', 'a+')
    soup = bs(r.text, 'html.parser')
    data = soup.find_all(name = 'center')
    for i in data:
        a = re.findall(r'\[(.*?)\]', i.string)
        if a[0] == ip:
            #print proxy
            f.write('%s' % proxy + '\n')
            print 'write down'
    f.close()

#proxy_spider()

def main():
    queue = Queue.Queue()
    for i in range(1, 2288):
        queue.put('http://www.xicidaili.com/nn/' + str(i))
    threads = []
    thread_count = 10
    for i in range(thread_count):
        spider = proxyPick(queue)
        threads.append(spider)
    for i in threads:
        i.start()
    for i in threads:
        i.join()
    print "It's down,sir!"

if __name__ == '__main__':
    main()

With this, all of the usable proxies found on the site are written to the ip_proxy.txt file.
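If you later want to reuse the results, note that each line of ip_proxy.txt is the str() of a proxy dict, so the file can be read back, for example, like this (a sketch under that assumption):

import ast

proxies = []
with open('E:/url/ip_proxy.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            # Each line was written as str(proxy), e.g. "{'http': '1.2.3.4:8080'}"
            proxies.append(ast.literal_eval(line))

print(len(proxies))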

That is the whole example of scanning for proxies and collecting available proxy IP addresses in Python. I hope it serves as a useful reference.
