Getting usable free proxies with Python
When a crawler repeatedly requests pages from the same site, the site's anti-crawler mechanism often bans the crawler's IP address. Routing the requests through a proxy works around this. Many websites publish up-to-date lists of free proxies. Some of the hosts in these lists work and others do not, so a list has to be filtered before it is useful. With Python, narrowing a list down to the working proxies is straightforward.
Taking the free proxy list published by IPCN (http://proxy.ipcn.org/country/) as an example, the following program scrapes the proxy entries from that page and keeps only the hosts that actually work. It relies mainly on requests and lxml. The full code is:
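# -*- coding: utf-8 -*-
import requests
from lxml import etree


def get_proxies_from_site():
    """Scrape the proxy list page and return the text of each table cell."""
    url = 'http://proxy.ipcn.org/country/'
    xpath = '/html/body/div[last()]/table[last()]/tr/td/text()'
    r = requests.get(url)
    tree = etree.HTML(r.text)
    results = tree.xpath(xpath)
    # Strip surrounding whitespace and skip empty cells.
    proxies = [line.strip() for line in results if line.strip()]
    return proxies


# Use the http://lwons.com/wx page to test whether a proxy host is available.
def get_valid_proxies(proxies, count):
    """Return up to `count` proxies that successfully fetch the test page."""
    url = 'http://lwons.com/wx'
    results = []
    cur = 0
    for p in proxies:
        proxy = {'http': 'http://' + p}
        succeed = False
        try:
            # The timeout (an added safeguard) keeps a dead proxy from
            # stalling the loop.
            r = requests.get(url, proxies=proxy, timeout=10)
            # The test page is expected to answer with the text 'default'
            # (compared case-insensitively).
            if r.text.strip().lower() == 'default':
                succeed = True
        except requests.RequestException:
            print('error:', p)
            succeed = False
        if succeed:
            print('succeed:', p)
            results.append(p)
            cur += 1
            if cur >= count:
                break
    return results


if __name__ == '__main__':
    # Number of working proxies to collect (illustrative value; adjust as needed).
    wanted = 10
    valid = get_valid_proxies(get_proxies_from_site(), wanted)
    print('get ' + str(len(valid)) + ' proxies')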
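The XPath in the program selects the text of every table cell, so depending on the page layout the scraped list may contain entries that are not proxies at all. As a minimal sketch, assuming each proxy appears as an IPv4 address followed by a colon and a port, the list could be cleaned before any network test is made. The helper names parse_proxy and clean_proxy_list are hypothetical and not part of the original program, and the sample addresses come from the reserved documentation ranges.

```python
import ipaddress


def parse_proxy(entry):
    """Return (host, port) if entry looks like 'IPv4:port', else None."""
    host, sep, port = entry.strip().rpartition(':')
    # Require a colon and a port made only of ASCII digits.
    if not sep or not (port.isascii() and port.isdigit()):
        return None
    try:
        ipaddress.IPv4Address(host)  # raises ValueError for a malformed address
    except ValueError:
        return None
    port_number = int(port)
    if not 1 <= port_number <= 65535:
        return None
    return host, port_number


def clean_proxy_list(entries):
    """Keep well-formed entries and drop duplicates, preserving order."""
    seen = set()
    cleaned = []
    for entry in entries:
        parsed = parse_proxy(entry)
        if parsed is None:
            continue
        host, port_number = parsed
        normalized = f'{host}:{port_number}'
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Hypothetical cell texts, as they might come back from the XPath query.
cells = ['192.0.2.1:8080', ' 192.0.2.1:8080 ', 'China',
         '198.51.100.7:3128', '198.51.100.7:99999']
print(clean_proxy_list(cells))  # ['192.0.2.1:8080', '198.51.100.7:3128']
```

Passing the cleaned list to get_valid_proxies instead of the raw one would avoid spending a request on entries that cannot be proxies.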