Use Python to crawl available proxy IP addresses
Preface
Take a free proxy IP listing site as an example: http://www.xicidaili.com/nn/. Many of the IP addresses it lists cannot actually be used.
So I wrote a Python script that checks the listed proxies and keeps only the usable ones.
The script is as follows:
# encoding=utf8
import urllib2
import urllib
import socket
from bs4 import BeautifulSoup

User_Agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
header = {}
header['User-Agent'] = User_Agent

'''
Get all proxy IP addresses from the listing page
'''
def getProxyIp():
    proxy = []
    for i in range(1, 2):  # only the first listing page; raise the upper bound for more
        try:
            url = 'http://www.xicidaili.com/nn/' + str(i)
            req = urllib2.Request(url, headers=header)
            res = urllib2.urlopen(req).read()
            soup = BeautifulSoup(res, 'html.parser')
            ips = soup.findAll('tr')
            for x in range(1, len(ips)):  # skip the table header row
                tds = ips[x].findAll("td")
                ip_temp = tds[1].contents[0] + "\t" + tds[2].contents[0]  # IP and port columns
                proxy.append(ip_temp)
        except:
            continue
    return proxy

'''
Verify that the obtained proxy IP addresses are usable
'''
def validateIp(proxy):
    url = "http://ip.chinaz.com/getip.aspx"
    f = open("E:\\ip.txt", "w")  # usable proxies are written to this file on drive E
    socket.setdefaulttimeout(3)  # give up on a proxy after 3 seconds
    for i in range(0, len(proxy)):
        try:
            ip = proxy[i].strip().split("\t")
            proxy_host = "http://" + ip[0] + ":" + ip[1]
            proxy_temp = {"http": proxy_host}
            res = urllib.urlopen(url, proxies=proxy_temp).read()
            f.write(proxy[i] + '\n')
            print proxy[i]
        except Exception, e:
            continue
    f.close()

if __name__ == '__main__':
    proxy = getProxyIp()
    validateIp(proxy)
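The script above targets Python 2 (urllib2 and the print statement). For readers on Python 3, a rough equivalent of the same two-step idea (scrape the listing page, then test each proxy with a short timeout) might look like the sketch below. This is only a sketch under assumptions: it relies on the third-party requests and beautifulsoup4 packages, and it assumes the same site and column layout as the original script.
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def get_proxy_ips(page=1):
    """Scrape one listing page and return 'host:port' strings."""
    url = 'http://www.xicidaili.com/nn/%d' % page
    res = requests.get(url, headers=HEADERS, timeout=5)
    soup = BeautifulSoup(res.text, 'html.parser')
    proxies = []
    for row in soup.find_all('tr')[1:]:      # skip the table header row
        tds = row.find_all('td')
        if len(tds) > 2:                     # IP and port columns, as in the original
            proxies.append(tds[1].get_text() + ':' + tds[2].get_text())
    return proxies

def validate(proxies):
    """Keep only the proxies through which a test request succeeds."""
    usable = []
    for p in proxies:
        try:
            requests.get('http://ip.chinaz.com/getip.aspx',
                         proxies={'http': 'http://' + p}, timeout=3)
            usable.append(p)
        except requests.RequestException:
            continue
    return usable

if __name__ == '__main__':
    print(validate(get_proxy_ips()))
The 3-second timeout serves the same purpose as socket.setdefaulttimeout(3) in the original: a free proxy that cannot answer quickly is treated as unusable and skipped.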
After the script runs successfully, open E:\ip.txt on drive E and you will see the usable proxy IP addresses and ports that were found.
Summary
This script only crawls the IP addresses on the first listing page. If necessary, you can crawl a few more pages (a small sketch follows below). The site is updated from time to time, so it is recommended to crawl only the first few pages. That is all the content of this article; I hope it helps you learn how to use Python.
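For reference, one possible way to pull several pages, building on the Python 3 sketch shown earlier; the pages parameter and the helper name are hypothetical additions, not part of the original script.
def get_proxy_ips_multi(pages=3):
    """Collect proxies from listing pages /nn/1 .. /nn/<pages>."""
    proxies = []
    for page in range(1, pages + 1):
        proxies.extend(get_proxy_ips(page))  # single-page scraper from the sketch above
    return proxies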