This article introduces a Python implementation for checking whether a proxy IP address can get through the wall, and gives the code directly; refer to it if you need it.

The wall is really hateful! In the IT world you constantly need gg to look things up (you can also get to 1024, ^_^ ...). Of course you can use Baidu too. It's not that I dislike Baidu on principle; it just worked out this way, so let me explain. Once I wanted to see whether anyone had copied one of my blog posts (not that my blog gets many readers), so I gave Baidu a try, and the result was startling: even when I searched for the complete title of my own post, what came back was mostly a pile of scraper sites rather than the original. I won't go into the details here; try it with your own blog and see.

For a while I collected a handful of proxy IPs by hand, and collected a few more whenever they went stale. Repetitive and annoying! So I decided to write a crawler to grab proxy IPs and then pull a few from the database whenever I needed them. The trouble is that a large share of the crawled IPs are already dead, and testing them by hand just brings the drudgery back. So I wrote a program that checks whether a proxy IP is still usable and lets it do the testing for me; that way I always end up with proxies that actually work. The crawler itself is built with scrapy for ease of maintenance, and the IP check is one part of that crawler. So we have the following program:
1. Create a file: checkproxy.py
# coding=utf-8
import urllib2
import urllib
import time
import socket

ip_check_url = 'http://www.google.com.hk/'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'
socket_timeout = 30

# Check whether the given proxy (ip:port) can reach the check URL
def check_proxy(protocol, pip):
    try:
        proxy_handler = urllib2.ProxyHandler({protocol: pip})
        opener = urllib2.build_opener(proxy_handler)
        # opener.addheaders = [('User-agent', user_agent)]  # with this line the check stops working; I don't know why
        urllib2.install_opener(opener)
        req = urllib2.Request(ip_check_url)
        time_start = time.time()
        conn = urllib2.urlopen(req)
        # conn = urllib2.urlopen(ip_check_url)
        time_end = time.time()
        detected_pip = conn.read()
        proxy_detected = True
    except urllib2.HTTPError, e:
        print "ERROR: Code", e.code
        return False
    except Exception, detail:
        print "ERROR:", detail
        return False
    return proxy_detected

def main():
    socket.setdefaulttimeout(socket_timeout)
    print
    protocol = "http"
    current_proxy = "212.82.126.32:80"
    proxy_detected = check_proxy(protocol, current_proxy)
    if proxy_detected:
        print ("WORKING: " + current_proxy)
    else:
        print "FAILED: %s" % (current_proxy,)

if __name__ == '__main__':
    main()
2. Test:
[root@bogon proxyipspider]# python checkproxy.py WORKING: 212.82.126.32:80
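The script above is written for Python 2 (urllib2 and print statements). If you run Python 3, a minimal sketch of the same check, using urllib.request and the same check URL and proxy from the test above, might look like this (a rough port, not the original code):

# coding=utf-8
# Rough Python 3 sketch of the same proxy check; urllib2 became urllib.request in Python 3.
import urllib.request

ip_check_url = 'http://www.google.com.hk/'
socket_timeout = 30

def check_proxy(protocol, pip):
    # Route the request through the given proxy and see whether the check URL answers
    try:
        proxy_handler = urllib.request.ProxyHandler({protocol: pip})
        opener = urllib.request.build_opener(proxy_handler)
        opener.open(ip_check_url, timeout=socket_timeout).read()
        return True
    except Exception as detail:
        print("ERROR:", detail)
        return False

if __name__ == '__main__':
    current_proxy = '212.82.126.32:80'
    status = 'WORKING' if check_proxy('http', current_proxy) else 'FAILED'
    print('%s: %s' % (status, current_proxy))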
Of course, this is only a prototype; a real detection program still needs to be combined with database or file operations (a rough file-based sketch follows below). Once the proxy IPs have been checked, all that is left is configuration, and after that you can enjoy gg. As for 1024, see for yourself how long it lasts, but better not to look too much, you know. If you want to get onto Facebook, Twitter and the rest, that is up to you; here we only deal with gg.
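As a rough idea of that file-based step, the sketch below imports check_proxy() from the checkproxy.py script above, tests every candidate listed in an input file, and writes the working ones back out. The file names proxies.txt and working.txt are just placeholders I made up; the style stays Python 2 to match the script it imports from:

# coding=utf-8
# Hypothetical batch filter built on check_proxy() from checkproxy.py above.
# Input: proxies.txt, one "ip:port" per line (placeholder name).
# Output: working.txt containing only the proxies that passed the check.
import socket
from checkproxy import check_proxy

socket_timeout = 30

def filter_proxies(in_file='proxies.txt', out_file='working.txt', protocol='http'):
    socket.setdefaulttimeout(socket_timeout)
    with open(in_file) as f:
        candidates = [line.strip() for line in f if line.strip()]
    working = [pip for pip in candidates if check_proxy(protocol, pip)]
    with open(out_file, 'w') as f:
        for pip in working:
            f.write(pip + '\n')
    print "kept %d of %d proxies" % (len(working), len(candidates))

if __name__ == '__main__':
    filter_proxies()

The same loop could just as well read from and write to the crawler's database instead of flat files; the point is simply that the check runs before a proxy ever reaches the crawler.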
Programmers always want to solve problems with their own hands; the urge to change the world has not gone anywhere, much like the Blog Garden's (Cnblogs) slogan, "code changes the world". If something bothers you, build your own. The IT world is full of such examples, like Vi and GitHub, which we use every day. All right, that's it for now. 1024, here we come...
That wall is really hateful!