In the previous article, we used the `re` module to match parts of a long string. Next we will run some matches against the string "[email protected] Advantage 314159265358 1892673 3.14 little Girl try_your_best [email protected] Python 3".
Our goal is to match '56'. Here `\d` matches a single digit, and `{2}` repeats the preceding element exactly twice. More generally, `{m,n}` (where m and n are non-negative integers with m <= n) matches between m and n repetitions. Prefixing the pattern with `r` marks it as a raw string, so backslashes are passed to the regex engine unchanged instead of being interpreted as Python escape sequences.
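A minimal sketch of these quantifiers in action (the sample strings below are mine, chosen for illustration):

```python
import re

# \d{2} matches exactly two digits.
assert re.search(r'\d{2}', 'abc56def').group() == '56'

# {m,n} matches between m and n repetitions; matching is greedy,
# so '12345' yields '1234' (four digits), leaving a lone '5' unmatched.
assert re.findall(r'\d{2,4}', '1 12 123 12345') == ['12', '123', '1234']

# An r-prefixed (raw) string keeps the backslash literal:
# r'\d' is the two characters backslash and d.
assert r'\d' == '\\d'
```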
In practice, regular expressions are usually compiled into pattern objects with `re.compile()` before use. Let's match an IP address such as 192.168.1.1.
```python
import re

s = '192.168.1.1'  # renamed from str to avoid shadowing the built-in
re.search(r'(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}'
          r'([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])', s)
```
As you can see, using regular expressions is not simple. The rule above contains three subgroups, and when a pattern contains groups, `findall` returns tuples of what the groups captured rather than the whole match. So if we used it to collect every IP on a web page, we would get back fragments of each address instead of strings like '192.168.1.1'. Obviously that is not what we want. This is where `(?:...)` comes in: it is a non-capturing group, meaning the subpattern is grouped (for repetition, alternation, and so on) but its match cannot be retrieved afterwards, so `findall` returns the full match.
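A minimal illustration of the difference (the sample text and simplified `\d{1,3}` pattern are mine; the simplified pattern does not validate octet ranges):

```python
import re

ip_text = "host 192.168.1.1 and 10.0.0.2"

# With capturing groups, findall returns tuples of the group captures
# (for a repeated group, only its last repetition), not the whole match.
capturing = re.findall(r'((\d{1,3}\.){3}\d{1,3})', ip_text)
# capturing -> [('192.168.1.1', '1.'), ('10.0.0.2', '0.')]

# With (?:...) the group still repeats but captures nothing,
# so findall returns the entire match.
non_capturing = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', ip_text)
# non_capturing -> ['192.168.1.1', '10.0.0.2']
```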
With the foundation from earlier articles, I tried writing the following code: it crawls IP addresses from a website and verifies each one by accessing a site through it as a proxy, using Python's exception-handling mechanism along the way. The code is not mature, but I am sharing it anyway and will improve it gradually.
```python
import urllib.request
import urllib.error
import re

url = "http://www.xicidaili.com/"
useful_ip = []

UA = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36")


def loadPage(url):
    headers = {"User-Agent": UA}
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request).read().decode("utf-8")
    return html


def getProxy():
    html = loadPage(url)
    # Grab the port cells, e.g. <td>8080</td>
    pattern = re.compile(r'(<td>\d+</td>)')
    duankou = pattern.findall(html)
    # Grab the IP addresses; (?:...) keeps findall returning whole matches
    pattern = re.compile(r'(?:(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}'
                         r'(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])')
    content_list = pattern.findall(html)
    list_num = []
    for num in duankou:
        list_num.append(num[4:-5])  # strip the <td> and </td> tags
    for i in range(len(list_num)):
        ip = content_list[i] + ":" + list_num[i]
        while True:
            proxy_support = urllib.request.ProxyHandler({'http': ip})
            opener = urllib.request.build_opener(proxy_support)
            # note: the attribute is addheaders, not add_handler
            opener.addheaders = [("User-Agent", UA)]
            urllib.request.install_opener(opener)
            try:
                print("attempting to access using %s ..." % ip)
                ip_filter = "http://www.whatsmyip.org/"
                ip_response = urllib.request.urlopen(ip_filter)
            except urllib.error.URLError:
                print("access error, this IP is not available")
                break
            else:
                print("Access success!")
                print("The available IP is: %s" % ip)
                useful_ip.append(ip)
                if input("Continue crawling? ") == "N":
                    print("Valid IPs are as follows:")
                    for key in useful_ip:
                        print(key)
                    exit()
                else:
                    break


if __name__ == "__main__":
    getProxy()
```
When pairing each IP address with its port number I used a rather clumsy method; there is actually a better solution, which you can think about. The code above touches a range of knowledge points: accessing a website with urllib, building a custom opener with a handler, Python exception handling, matching IPs with regular expressions, and so on. Whatever the knowledge, skill comes with practice.
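One possible improvement (a sketch of my own against hypothetical HTML modeled on a proxy-list table, not necessarily the author's intended solution): capture the IP and the port in a single pattern with two groups, so they come out already paired instead of being extracted separately and matched up by index.

```python
import re

# Hypothetical table row layout: an IP cell followed by a port cell.
html = "<td>192.168.1.1</td><td>8080</td><td>10.0.0.2</td><td>3128</td>"

# Two capturing groups: findall returns (ip, port) tuples directly,
# so no slicing of <td> tags and no positional bookkeeping is needed.
pattern = re.compile(r'<td>((?:\d{1,3}\.){3}\d{1,3})</td>\s*<td>(\d+)</td>')
proxies = ["%s:%s" % (ip, port) for ip, port in pattern.findall(html)]
# proxies -> ['192.168.1.1:8080', '10.0.0.2:3128']
```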
You can see that it runs successfully, and whenever an available IP is found, you are asked whether to continue crawling. Of course, we could also build an IP pool by hand: write a function that saves each verified IP to a file. I won't go into that here. There are mature IP-pool projects on GitHub that are worth downloading and reading; this was just a simple experiment with the techniques covered earlier, not a polished piece of code.
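A minimal sketch of such a save-to-file helper (the function and file names are mine, not from the article): appending verified proxies to a text file, one per line, so later runs can extend and reload the pool.

```python
def save_ip_pool(ips, path="ip_pool.txt"):
    # Append each verified proxy on its own line,
    # so repeated runs extend the pool instead of overwriting it.
    with open(path, "a", encoding="utf-8") as f:
        for ip in ips:
            f.write(ip + "\n")


def load_ip_pool(path="ip_pool.txt"):
    # Read the pool back, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```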
Python crawler (5)--Regular expression (ii)