This post walks you through writing a Python proxy scanner that collects usable proxy IPs. I found it quite handy, so I'm sharing it here for reference.
Today we'll build a very useful little tool: a scanner that finds and verifies available proxies.
First of all, a quick Baidu search turned up a website to use as an example: www.xicidaili.com
The site publishes a large number of proxy IPs and ports, both domestic and foreign.
As usual, we start by analyzing the site; we'll sweep all the domestic proxies first.
Opening the domestic section and inspecting it, we find that the domestic proxy pages live under URLs of the form:
www.xicidaili.com/nn/x
Here x runs up to almost 2000 pages, so it looks like we'll need threading again...
As usual, we first try to fetch the content with the simplest possible requests.get().
It returns 503, so we add a simple headers dict.
Now it returns 200. Let's keep going.
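The post doesn't show the exact headers, so here is a minimal sketch of what that first request might look like; the User-Agent string and the page number are assumed examples, not taken from the original.

import requests

# assumed example: any common browser User-Agent is usually enough to get past the 503
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

url = 'http://www.xicidaili.com/nn/1'
r = requests.get(url=url, headers=headers)
print(r.status_code)  # expect 200 once the headers are sent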
Okay, let's analyze the page content and pull out what we want.
We find that the IP information is contained in <tr> tags, so we can easily use BeautifulSoup (BS) to grab those tags.
However, we then notice that the IP, port and protocol sit in the 2nd, 3rd and 6th <td> tags inside each extracted <tr>.
So we start writing. The idea is:
When processing a page, extract the <tr> tags first, then extract the <td> tags inside each <tr>.
That means two BS passes, and the second pass needs the <tr> converted back to a str first.
After we get a <tr>, we only need the 2nd, 3rd and 6th items,
but a plain for loop over the <td> tags doesn't let us pick them out as a group,
so we simply run a second soup over each <tr> and index out items 2, 3 and 6 directly.
Once extracted, appending .string gives us the text content.
r = requests.get(url=url, headers=headers)
soup = BS(r.content, "html.parser")
data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
for i in data:
    soup = BS(str(i), 'html.parser')
    data2 = soup.find_all(name='td')
    ip = str(data2[1].string)
    port = str(data2[2].string)
    types = str(data2[5].string).lower()
    proxy = {}
    proxy[types] = '%s:%s' % (ip, port)
Each pass through the loop generates a proxy dictionary for that row, which we can then use to verify whether the IP actually works.
One note on the dictionary: we convert types to lowercase, because the protocol key in the proxies dict passed to requests.get() must be lowercase, while the page lists it in uppercase, hence the case conversion.
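For illustration (the IP and port below are made up), one loop iteration might end up with a dict like this:

# a row whose protocol column reads "HTTP"
types = 'HTTP'.lower()                   # -> 'http'
proxy = {types: '112.85.164.50:9999'}    # made-up IP and port
# later handed to requests as: requests.get(url, proxies=proxy, timeout=6)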
So how do we verify that an IP is actually usable?
Very simply, we send a GET request through the proxy to this site:
http://1212.ip138.com/ic.asp
This is a handy little site that tells you what your external IP is.
url = 'http://1212.ip138.com/ic.asp'
r = requests.get(url=url, proxies=proxy, timeout=6)
Note that we add a timeout here to weed out proxies that take too long to respond; I set it to 6 seconds.
We try it with one IP and parse the returned page.
The returned page contains the IP wrapped in square brackets, so we just need to extract the content inside the [].
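A quick sketch of that extraction, using a simplified stand-in for the real page text (the actual ip138 page is in Chinese):

import re

# simplified stand-in for the text returned by ip138
text = 'Your IP address is [112.85.164.50]'
a = re.findall(r'\[(.*?)\]', text)
print(a[0])  # -> 112.85.164.50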
If our proxy works, the page returns the proxy's IP.
(Sometimes the returned address is our own local network IP instead; I'm not entirely sure why, but I rule that case out and treat it as the proxy being unavailable.)
Then we just make a comparison: if the returned IP matches the IP in the proxy dictionary, we consider it a usable proxy and write it to a file.
That's the whole idea; finally we wire it up with a Queue and threading.
Here's the code:
#coding=utf-8
import requests
import re
from bs4 import BeautifulSoup as BS
import Queue
import threading


class proxyPick(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        # keep pulling page URLs off the queue until it is empty
        while not self._queue.empty():
            url = self._queue.get()
            proxy_spider(url)


def proxy_spider(url):
    headers = {...}  # fill in your own headers here (a User-Agent is enough)
    r = requests.get(url=url, headers=headers)
    soup = BS(r.content, "html.parser")
    data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
    for i in data:
        soup = BS(str(i), 'html.parser')
        data2 = soup.find_all(name='td')
        ip = str(data2[1].string)
        port = str(data2[2].string)
        types = str(data2[5].string).lower()
        proxy = {}
        proxy[types] = '%s:%s' % (ip, port)
        try:
            proxy_check(proxy, ip)
        except Exception, e:
            print e
            pass


def proxy_check(proxy, ip):
    url = 'http://1212.ip138.com/ic.asp'
    r = requests.get(url=url, proxies=proxy, timeout=6)
    f = open('E:/url/ip_proxy.txt', 'a+')
    soup = BS(r.text, 'html.parser')
    data = soup.find_all(name='center')
    for i in data:
        # the page shows the visitor's IP inside square brackets
        a = re.findall(r'\[(.*?)\]', i.string)
        if a[0] == ip:
            #print proxy
            f.write('%s' % proxy + '\n')
            print 'write down'
    f.close()

#proxy_spider()

def main():
    queue = Queue.Queue()
    for i in range(1, 2288):
        queue.put('http://www.xicidaili.com/nn/' + str(i))
    threads = []
    thread_count = 10
    for i in range(thread_count):
        spider = proxyPick(queue)
        threads.append(spider)
    for i in threads:
        i.start()
    for i in threads:
        i.join()
    print "It's down, sir!"


if __name__ == '__main__':
    main()
And with that, all the available proxy IPs on the site get written to the file ip_proxy.txt.
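As a quick follow-up (not part of the original script), here is one way the saved proxies could be loaded and reused later; the target URL is just a placeholder, and the file path matches the one used in the code above.

import ast
import requests

# each line in the file is the str() of a proxy dict, e.g. {'http': '1.2.3.4:8080'}
with open('E:/url/ip_proxy.txt') as f:
    proxies_list = [ast.literal_eval(line.strip()) for line in f if line.strip()]

# try the first saved proxy against a placeholder URL
r = requests.get('http://www.example.com', proxies=proxies_list[0], timeout=6)
print(r.status_code)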