A few days ago I read a bit about regular expressions and came to understand them a little, so I wanted to write a simple program to practice. I remembered that many of the free proxies I had found before didn't work, so I tried writing a small crawler to collect and test them. My first version was clumsy and overly complex, so I rewrote it more concisely.
The full code comes first; the key parts are discussed afterwards.
import re
import urllib.request

def get_line(html):
    '''Return the contents of the <td> cells as a list.'''
    line_re = re.compile(r'(?:td>)(.+)(?:</td>)')
    return line_re.findall(html)

def get_ip(html):
    '''Collect the proxy records (ip, port, type) as a list of strings.'''
    lines = get_line(html)
    ip_re = re.compile(r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}'
                       r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b')
    ans_list = []
    record = ''
    for item in lines:
        if ip_re.match(item) is not None:
            # A new IP starts a new record; save the previous one first.
            if record:
                ans_list.append(record)
            record = item
            continue
        if re.search('[\u4e00-\u9fa5]+', item) is None:
            # Keep cells without Chinese characters (port, proxy type).
            record += '\t' + item
    if record:
        ans_list.append(record)
    return ans_list

def judge_ip(ip_list):
    '''Check each proxy and write the usable ones to a file.'''
    url = 'http://ip.chinaz.com/getip.aspx'
    f = open('E:\\python_py\\output.txt', 'w')
    # socket.setdefaulttimeout(3) would also set the time limit for
    # fetching pages; instead we pass timeout=3 to opener.open() below.
    for entry in ip_list:
        ip = entry.split('\t')
        if len(ip) == 3:
            try:
                proxy = {ip[2]: ip[0] + ':' + ip[1]}
                proxy_support = urllib.request.ProxyHandler(proxy)
                opener = urllib.request.build_opener(proxy_support)
                opener.open(url, timeout=3).read()
                f.write(entry + '\n')
            except Exception:
                print('proxy ' + ip[0] + ' not available')
                continue
    f.close()

if __name__ == '__main__':
    url = 'http://www.xicidaili.com/'
    rep = urllib.request.Request(url)
    rep.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/61.0.3163.100 Safari/537.36')
    response = urllib.request.urlopen(rep)
    html = response.read().decode('utf-8')
    ip_list = get_ip(html)
    judge_ip(ip_list)
The regular expressions used:

ip_re = re.compile(r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b')
line_re = re.compile(r'(?:td>)(.+)(?:</td>)')
re.search('[\u4e00-\u9fa5]+', item)
ip_re matches a proxy IP address.
Looking at the page source, the useful information for each proxy sits between <td> and </td> tags, so we first extract the content between them; that is what line_re does.
The third expression matches Chinese characters. What we want from each row is the IP address, the port number, and the proxy type; the remaining cells contain Chinese text, so discarding any cell that matches this expression leaves only the useful fields.
A quick test confirmed that all three expressions match as intended.
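Such a sanity check might look like this (the table row below is made up, in the same shape as the proxy-list page, with one <td> cell per line):

```python
import re

# The three expressions from the script, reassembled.
line_re = re.compile(r'(?:td>)(.+)(?:</td>)')
ip_re = re.compile(r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}'
                   r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b')
cn_re = re.compile('[\u4e00-\u9fa5]+')

# Made-up sample row: IP, port, type, and a Chinese location cell.
html = '''<td>61.135.217.7</td>
<td>80</td>
<td>HTTP</td>
<td>北京</td>'''

cells = line_re.findall(html)
print(cells)                               # ['61.135.217.7', '80', 'HTTP', '北京']
print(ip_re.match(cells[0]) is not None)   # True: the first cell is an IP
print(cn_re.search(cells[3]) is not None)  # True: Chinese text, so discarded
```

Since `.` does not match a newline, findall picks up one cell per line, which is how the real page is laid out.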
If a connection attempt blocks for more than 3 seconds, we assume the proxy IP is problematic: opener.open(url, timeout=3) raises an exception, we catch it, and continue checking whether the next IP is usable.
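The check can be pulled out into a small helper. This is only a sketch of the availability test; the function name and parameters are mine, not part of the original script:

```python
import urllib.request

def check_proxy(host, port, scheme='http', timeout=3,
                test_url='http://ip.chinaz.com/getip.aspx'):
    '''Return True if the proxy answers within `timeout` seconds.'''
    handler = urllib.request.ProxyHandler({scheme: host + ':' + port})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout).read()
        return True
    except Exception:
        # socket.timeout, URLError, etc.: any failure, including a wait
        # longer than `timeout`, marks the proxy as unusable.
        return False

# An unroutable address fails (refused or timed out) and returns False.
print(check_proxy('10.255.255.1', '80', timeout=1))
```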
The page used to check proxy availability is:
Http://ip.chinaz.com/getip.aspx
Each proxy that passes the check is then written to the output file.
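Each record produced by get_ip is a tab-separated string of the form ip\tport\ttype, which is why judge_ip splits on '\t', expects exactly three fields, and builds the proxy dict from them. A quick illustration with a made-up record:

```python
record = '61.135.217.7\t80\tHTTP'     # made-up record in the script's format
ip = record.split('\t')               # ['61.135.217.7', '80', 'HTTP']
assert len(ip) == 3
proxy = {ip[2]: ip[0] + ':' + ip[1]}  # the dict handed to ProxyHandler
print(proxy)                          # {'HTTP': '61.135.217.7:80'}
```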