Python proxy scanner: an example of scanning for and collecting available proxy IPs

Source: Internet
Author: User
The following article shares an example of scanning for and obtaining available proxy IPs with Python. The author found it quite useful and shares it here for everyone's reference. Let's take a look.

Today we'll write a very useful tool: a scanner that finds and collects available proxies.

First, a quick Baidu search turns up a website to use as an example: www.xicidaili.com

This site publishes many available proxy IPs and ports, both domestic and foreign.

As usual, we start by analyzing the site; let's first scan all the domestic proxies.

Clicking into the domestic section and inspecting it, we find that the domestic proxy listings live under URLs of the following form:

www.xicidaili.com/nn/x

Here x runs to almost 2,000 pages, so it looks like we'll need threading again...

As usual, we first try to fetch the content with the simplest requests.get().

It returns 503, so we add a simple headers dict.

Now it returns 200. Let's continue.
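
For reference, here is a minimal sketch of that first fetch; the User-Agent value is my own assumption, and any common browser string should do:

import requests

url = 'http://www.xicidaili.com/nn/1'
# Assumed User-Agent value -- the point is just to look like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

r = requests.get(url=url, headers=headers)
print(r.status_code)  # 503 without the headers, 200 with them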

OK, let's start analyzing the page content and extract what we want.

We find that the IP information is contained in <tr> tags, so we can easily use BeautifulSoup (BS) to grab the tag contents.

However, we then find that the IP, port, and protocol sit in the 2nd, 3rd, and 6th <td> tags inside each extracted <tr>.

So we start writing. The idea is:

When processing the page, extract the <tr> tags first, then extract the <td> tags inside each <tr>.

That means two BeautifulSoup passes, and for the second pass the <tr> has to be converted to a str first.

Because after we get a <tr>, we need its 2nd, 3rd, and 6th items,

but when we output them in a for loop we can't operate on them as a group,

so we simply re-soup each <tr> separately and, in the second pass, extract the 2nd, 3rd, and 6th <td> directly.

After extraction, just append .string to pull out the text content.


r = requests.get(url=url, headers=headers)
soup = BS(r.content, "html.parser")
data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
for i in data:
    soup = BS(str(i), 'html.parser')
    data2 = soup.find_all(name='td')
    ip = str(data2[1].string)
    port = str(data2[2].string)
    types = str(data2[5].string).lower()
    proxy = {}
    proxy[types] = '%s:%s' % (ip, port)

This way, each loop iteration generates a corresponding proxy dictionary, which we can then use to verify the IP's availability.

One thing to note about this dictionary: we convert types to lowercase, because the protocol name used as a key in the proxies argument of requests.get() must be lowercase, while the page gives it in uppercase, hence the case conversion.
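
For illustration (the IP and port below are made-up values), the dictionary ends up looking like this:

# Made-up example values, just to show the shape of the proxies dict
types = str('HTTP').lower()        # the page gives 'HTTP', requests wants 'http'
proxy = {types: '1.2.3.4:8080'}    # -> {'http': '1.2.3.4:8080'}
# later passed to requests.get(..., proxies=proxy)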

So what's the idea of verifying IP availability?

Very simply, we send a GET request through our proxy to the following website:

http://1212.ip138.com/ic.asp

This handy site returns whatever your external IP happens to be.


url = 'http://1212.ip138.com/ic.asp'
r = requests.get(url=url, proxies=proxy, timeout=6)

Here we need to add a timeout to weed out proxies that take too long to respond; I set it to 6 seconds.
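
As a rough sketch of what the timeout gives us (looks_alive is a hypothetical helper of my own; the original script just wraps the check in a broad try/except, as shown in the full code further down):

import requests

# A proxy that doesn't answer within 6 seconds raises an exception,
# which we treat as "unavailable" and skip.
def looks_alive(proxy):
    try:
        requests.get('http://1212.ip138.com/ic.asp', proxies=proxy, timeout=6)
        return True
    except requests.exceptions.RequestException:
        # covers Timeout, ConnectionError, ProxyError, ...
        return False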

We try it with one IP and parse the returned page.

The returned page shows the requesting IP inside square brackets (in a <center> element).

Then we just need to extract the contents of the [] from the page.

If our proxy is working, the page returns the proxy's IP.

(Sometimes the returned address is actually our own local network IP. I'm not entirely sure why, but I exclude that case too; it should still mean the proxy is unusable.)

Then we can make a simple check: if the returned IP matches the IP in the proxy dictionary, the proxy is considered usable and we write it to a file.
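
As a rough sketch of that check (sample_text below is a made-up stand-in for the text of the <center> element that ic.asp returns):

import re

ip = '1.2.3.4'                                  # the proxy's own IP (made up)
sample_text = 'Your IP address is [1.2.3.4]'    # made-up stand-in for the page text
a = re.findall(r'\[(.*?)\]', sample_text)       # -> ['1.2.3.4']
if a and a[0] == ip:
    print('proxy works, write it down')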

That's the whole idea; finally we handle the pages with a Queue and threading threads.

Here's the code:


#coding=utf-8
import requests
import re
from bs4 import BeautifulSoup as BS
import Queue
import threading


# Worker thread: keeps pulling page URLs from the queue and scanning them
class proxyPick(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while not self._queue.empty():
            url = self._queue.get()
            proxy_spider(url)


# Scrape one listing page and check every proxy found on it
def proxy_spider(url):
    headers = {...}  # request headers dict (elided in the original post)
    r = requests.get(url=url, headers=headers)
    soup = BS(r.content, "html.parser")
    data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
    for i in data:
        soup = BS(str(i), 'html.parser')
        data2 = soup.find_all(name='td')
        ip = str(data2[1].string)
        port = str(data2[2].string)
        types = str(data2[5].string).lower()
        proxy = {}
        proxy[types] = '%s:%s' % (ip, port)
        try:
            proxy_check(proxy, ip)
        except Exception, e:
            print e
            pass


# Request ic.asp through the proxy; if it reports the proxy's own IP,
# the proxy works and is written to the file
def proxy_check(proxy, ip):
    url = 'http://1212.ip138.com/ic.asp'
    r = requests.get(url=url, proxies=proxy, timeout=6)
    f = open('E:/url/ip_proxy.txt', 'a+')
    soup = BS(r.text, 'html.parser')
    data = soup.find_all(name='center')
    for i in data:
        a = re.findall(r'\[(.*?)\]', i.string)
        if a[0] == ip:
            #print proxy
            f.write('%s' % proxy + '\n')
            print 'write down'
    f.close()

#proxy_spider()

def main():
    queue = Queue.Queue()
    for i in range(1, 2288):
        queue.put('http://www.xicidaili.com/nn/' + str(i))
    threads = []
    thread_count = 10
    for i in range(thread_count):
        spider = proxyPick(queue)
        threads.append(spider)
    for i in threads:
        i.start()
    for i in threads:
        i.join()
    print "It's down, sir!"

if __name__ == '__main__':
    main()

With this, we can write all the available proxy IPs from the site into the file ip_proxy.txt.
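
As a side note (not part of the original script): since each line of ip_proxy.txt is the repr of a proxy dictionary, one possible way to read the results back later is:

import ast

# Assumes each line looks like {'http': '1.2.3.4:8080'}, which is what
# f.write('%s' % proxy + '\n') produces above.
proxies_list = []
with open('E:/url/ip_proxy.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            proxies_list.append(ast.literal_eval(line))

# each entry can then be passed straight to requests.get(..., proxies=entry)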
