That wall is really hateful! In IT circles, you often need Google (GG) to look things up (and, ahem, to reach certain 1024 sites...). Of course, you can also use Baidu. It's not that I dislike Baidu for no reason; let me explain. Once, out of idle curiosity, I wanted to see whether anyone had copied my blog posts (not that the blog is anything special), so I searched on Baidu. The results were astonishing: even searching with a post's full title, I often could not find my own articles; what came back was a pile of scraper-site copies. I won't go into details here; anyone can try this with their own blog.

Previously I would manually collect a handful of proxy IPs, which would stop working after a while, forcing me to collect more; repeating this cycle was annoying. So I wanted to write a crawler to scrape proxy IPs, and then just pull a few from the database whenever I needed them. However, many of the scraped IPs turned out to be dead, which put me back to testing them by hand; wasn't that just making more trouble for myself? So I wrote a program to check whether a proxy IP is usable and let it do the testing for me. Now I can always get working proxies. Since the crawler is written with Scrapy, for ease of maintenance the IP check is also part of the Scrapy project. The checking program is as follows:
1. Create the file checkproxy.py:
# coding=utf-8
import urllib2
import urllib
import time
import socket

ip_check_url = 'http://www.google.com.hk/'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'
socket_timeout = 30  # seconds; the original value was garbled

# Check whether the given proxy can fetch ip_check_url
def check_proxy(protocol, pip):
    try:
        proxy_handler = urllib2.ProxyHandler({protocol: pip})
        opener = urllib2.build_opener(proxy_handler)
        # opener.addheaders = [('User-agent', user_agent)]
        # With the line above enabled, the check no longer works; I don't know why.
        urllib2.install_opener(opener)
        req = urllib2.Request(ip_check_url)
        time_start = time.time()
        conn = urllib2.urlopen(req)
        # conn = urllib2.urlopen(ip_check_url)
        time_end = time.time()
        detected_pip = conn.read()
        proxy_detected = True
    except urllib2.HTTPError, e:
        print "ERROR: Code", e.code
        return False
    except Exception, detail:
        print "ERROR:", detail
        return False
    return proxy_detected

def main():
    socket.setdefaulttimeout(socket_timeout)
    print
    protocol = "http"
    current_proxy = "212.82.126.32:80"
    proxy_detected = check_proxy(protocol, current_proxy)
    if proxy_detected:
        print "WORKING: " + current_proxy
    else:
        print "FAILED: %s" % (current_proxy,)

if __name__ == '__main__':
    main()
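The script above is Python 2 (urllib2, print statements). On Python 3, where urllib2 was folded into urllib.request, an equivalent check might look like the following sketch; the check URL, timeout, and sample proxy are carried over from the original script:

```python
# Python 3 port of the check in checkproxy.py (a sketch, not the author's code).
import urllib.request

IP_CHECK_URL = 'http://www.google.com.hk/'
SOCKET_TIMEOUT = 30  # seconds

def check_proxy(protocol, pip, timeout=SOCKET_TIMEOUT):
    """Return True if the proxy pip (e.g. '212.82.126.32:80') can fetch IP_CHECK_URL."""
    proxy_handler = urllib.request.ProxyHandler({protocol: pip})
    opener = urllib.request.build_opener(proxy_handler)
    try:
        # Route the request through the proxy; any network or HTTP error
        # means the proxy is treated as dead.
        with opener.open(IP_CHECK_URL, timeout=timeout) as conn:
            conn.read()
        return True
    except Exception as detail:
        print('ERROR:', detail)
        return False

# Usage: check_proxy('http', '212.82.126.32:80')
```

Passing the opener per call (instead of urllib2.install_opener) avoids mutating global state, which also sidesteps the addheaders oddity noted in the original.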
2. Test:
[root@bogon proxyipspider]# python checkproxy.py
WORKING: 212.82.126.32:80
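Since the check is meant to live inside a Scrapy project, one common way to wire validated proxies into the crawler is a downloader middleware. This is a sketch of how that could look, not the author's actual setup; the proxy list here is illustrative:

```python
import random

# Hypothetical Scrapy downloader middleware: picks a random proxy from a
# pre-validated list for each outgoing request. Scrapy calls
# process_request(request, spider) for every request, and setting
# request.meta['proxy'] routes that request through the proxy. The class
# needs no base class, so it is shown standalone here.
class RandomProxyMiddleware(object):
    def __init__(self, proxies):
        # proxies: list like ['212.82.126.32:80', ...], e.g. loaded from
        # the database of checked IPs
        self.proxies = proxies

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = 'http://' + random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

The middleware would then be enabled in the project's settings.py via DOWNLOADER_MIDDLEWARES.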
Of course, this is just a prototype; a real checker needs to be combined with database or file operations to be complete. Once the proxy IPs are verified, the rest is just configuration. After setting things up, enjoy GG. As for 1024, watch as much as you like, though moderation is better, you understand. Whether you go on to Facebook, YouTube, and Twitter is up to you; this post only covers GG.
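The database-or-file integration mentioned above can be sketched as a simple batch filter: read candidate proxies from one file, keep only the ones that pass the check, and write the survivors back out. The file names and module name here are assumptions for illustration:

```python
# Hypothetical batch filter around check_proxy from checkproxy.py.

def filter_proxies(candidates, check=None):
    """Return the subset of candidate proxies that pass the check."""
    if check is None:
        from checkproxy import check_proxy  # assumed module name
        check = lambda pip: check_proxy('http', pip)
    return [pip for pip in candidates if check(pip)]

def refresh_proxy_file():
    # Illustrative file names: one proxy per line in candidates.txt,
    # working proxies written to working.txt.
    with open('candidates.txt') as f:
        candidates = [line.strip() for line in f if line.strip()]
    working = filter_proxies(candidates)
    with open('working.txt', 'w') as f:
        f.write('\n'.join(working) + '\n')
    print('%d/%d proxies working' % (len(working), len(candidates)))
```

The injectable check argument keeps the filtering logic testable without real network traffic.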
Programmers always want to solve problems with their own hands. The world changes, but that instinct doesn't, just like the Cnblogs slogan: "Code changes the world." If something bothers you, build your own replacement. There are plenty of examples of this, tools we use every day like vi, GitHub, and so on. All right, that's it. Go, 1024.
That wall is really hateful!