Preface
Take a free proxy IP site I came across recently as an example: http://www.xicidaili.com/nn/. When I actually tried the listed proxies, I found that many of the IPs do not work.
So I wrote a Python script that detects which of the proxy IPs are actually usable.
The script (written for Python 2) is as follows:
# encoding=utf8
import urllib2
import urllib
import socket
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
header = {'User-Agent': user_agent}

def getProxyIp():
    """Get all proxy IP addresses from the list pages."""
    proxy = []
    for i in range(1, 2):  # crawl only page 1; widen the range for more pages
        try:
            url = 'http://www.xicidaili.com/nn/' + str(i)
            req = urllib2.Request(url, headers=header)
            res = urllib2.urlopen(req).read()
            soup = BeautifulSoup(res)
            ips = soup.findAll('tr')
            for x in range(1, len(ips)):  # skip the table header row
                tds = ips[x].findAll('td')
                # column 1 holds the IP, column 2 the port
                ip_temp = tds[1].contents[0] + '\t' + tds[2].contents[0]
                proxy.append(ip_temp)
        except Exception:
            continue
    return proxy

def validateIp(proxy):
    """Verify which of the collected proxy IPs are usable."""
    url = 'http://ip.chinaz.com/getip.aspx'
    f = open('E:\\ip.txt', 'w')
    socket.setdefaulttimeout(3)  # drop proxies that take longer than 3 seconds
    for i in range(0, len(proxy)):
        try:
            ip = proxy[i].strip().split('\t')
            proxy_host = 'http://' + ip[0] + ':' + ip[1]
            proxy_temp = {'http': proxy_host}
            urllib.urlopen(url, proxies=proxy_temp).read()
            f.write(proxy[i] + '\n')
            print proxy[i]
        except Exception:
            continue
    f.close()

if __name__ == '__main__':
    proxy = getProxyIp()
    validateIp(proxy)
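For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched using only the standard library. This is a rough sketch, not a drop-in replacement: the helper names parse_proxies and check_proxy are my own, and the regex assumes the page keeps the simple <td>ip</td><td>port</td> table layout that the script above parses.

```python
import re
import urllib.request

def parse_proxies(html):
    """Extract (ip, port) pairs from an HTML table where an IP cell
    is immediately followed by a port cell, as on the proxy list pages."""
    cells = re.findall(r'<td>\s*([^<]*?)\s*</td>', html)
    pairs = []
    for i, cell in enumerate(cells[:-1]):
        # keep a cell only if it looks like an IPv4 address and the
        # next cell is a plain number (the port)
        if re.fullmatch(r'\d{1,3}(?:\.\d{1,3}){3}', cell) and cells[i + 1].isdigit():
            pairs.append((cell, cells[i + 1]))
    return pairs

def check_proxy(ip, port, test_url='http://ip.chinaz.com/getip.aspx', timeout=3):
    """Return True if the test URL answers through the proxy within the timeout."""
    handler = urllib.request.ProxyHandler({'http': 'http://%s:%s' % (ip, port)})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        return False
```

check_proxy simply opens the test URL through the candidate proxy and reports whether it answered in time, which mirrors what the Python 2 script does with socket.setdefaulttimeout and urllib.urlopen.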
After the script runs successfully, open E:\ip.txt to see the usable proxy IP addresses and ports.
Summary
This crawls only the first page of IP addresses; you can crawl a few more pages if you need to. That said, since the site is updated constantly and older entries go stale quickly, I recommend crawling only the first few pages anyway. That is the whole content of this article; I hope it is helpful to those of you learning Python.