Python 3 crawler: getting usable proxy IP addresses (for beginners)

Source: Internet
Author: User

A few days ago I started reading about regular expressions. I still only understand a little, so I wanted to write a simple program to practice. I remembered that many of the free proxies I had searched for earlier did not work, so I tried writing a crawler to check them. My first version was not very good and was overly complex, so I rewrote it to be more concise.

Here is the code first; the key points are discussed afterwards.

import re
import urllib.request
import socket


def get_line(html):
    '''Return the table-cell contents that may hold IP information, as a list.'''
    line_re = re.compile(r'(?:td>)(.+)(?:</td>)')
    return line_re.findall(html)


def get_ip(html):
    '''Return all proxy entries as a list of "ip<TAB>port<TAB>type" strings.'''
    cells = get_line(html)
    ip_re = re.compile(r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b')
    ans_list = []
    entry = ''
    for item in cells:
        if ip_re.match(item) is not None:
            # A new IP address starts a new entry; store the previous one first.
            if entry:
                ans_list.append(entry)
            entry = item
            continue
        # Skip cells that contain Chinese characters, and cells seen before the first IP.
        if entry and re.search('[\u4e00-\u9fa5]+', item) is None:
            entry += '\t' + item
    # Store the last entry, which the loop itself never appends.
    if entry:
        ans_list.append(entry)
    return ans_list


def judge_ip(ip_list):
    '''Check which proxies are usable and write them to a file.'''
    url = 'http://ip.chinaz.com/getip.aspx'
    # socket.setdefaulttimeout(3)  # sets a global time limit for fetching pages;
    # the alternative used below is to pass timeout=3 to open()
    with open('E:\\python_py\\output.txt', 'w') as f:
        for entry in ip_list:
            ip = entry.split('\t')
            if len(ip) == 3:
                try:
                    # ProxyHandler expects a lowercase scheme as the key.
                    proxy = {ip[2].lower(): ip[0] + ':' + ip[1]}
                    proxy_support = urllib.request.ProxyHandler(proxy)
                    opener = urllib.request.build_opener(proxy_support)
                    opener.open(url, timeout=3).read()
                    f.write(entry + '\n')
                except Exception:
                    print('proxy ' + ip[0] + ' not available')
                    continue


if __name__ == '__main__':
    url = 'http://www.xicidaili.com/'
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    ip_list = get_ip(html)
    judge_ip(ip_list)


The regular expressions used:

ip_re = re.compile(r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b')
line_re = re.compile(r'(?:td>)(.+)(?:</td>)')
re.search('[\u4e00-\u9fa5]+', item)

ip_re matches a proxy IP address.
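
As a quick check of what ip_re accepts, here is a minimal sketch. The helper name is_ip and the sample strings are made up for illustration and are not part of the original program.

```python
import re

# The same pattern as ip_re in the article: three "octet." groups, then a final octet.
ip_re = re.compile(
    r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}'
    r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'
)


def is_ip(text):
    """Return True if text starts with a dotted IPv4 address (hypothetical helper)."""
    return ip_re.match(text) is not None


# Made-up sample cells: an address, a port, a proxy type, and an out-of-range address.
samples = ['61.135.217.7', '8080', 'HTTP', '999.1.1.1']
results = [is_ip(s) for s in samples]
print(results)
```

Because match() only anchors at the start of the string, a port such as 8080 or a type such as HTTP is rejected simply because it contains no dotted groups.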


We found that the useful information about each IP address is enclosed by <td> and </td>, so we first extract the content between the two tags. This is what the regular expression line_re does.
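
The extraction step can be sketched on a small piece of HTML. The sample markup below is invented for illustration and is only assumed to resemble the page being crawled.

```python
import re

# The same pattern as line_re in the article.
line_re = re.compile(r'(?:td>)(.+)(?:</td>)')

# Made-up table row with one cell per line.
sample_html = (
    '<tr>\n'
    '<td>61.135.217.7</td>\n'
    '<td>80</td>\n'
    '<td>北京</td>\n'
    '<td>HTTP</td>\n'
    '</tr>\n'
)

# "." does not match a newline, so each line yields its own cell.
cells = line_re.findall(sample_html)
print(cells)
```

Note that (.+) is greedy: if several cells sat on the same line, one match would span all of them, so this approach relies on the page placing each cell on its own line.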


The third expression checks whether a cell contains Chinese characters. The fields we want are the IP address, the port number, and the proxy type. The remaining cells contain Chinese characters, so this expression lets us skip them and keep only the useful information.
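
A minimal sketch of this filtering and grouping, under the assumption that every cell without Chinese characters belongs to the preceding IP address. The function name group_cells and the sample cells are hypothetical.

```python
import re

ip_re = re.compile(
    r'(?:25[0-5]\.|2[0-4]?\d\.|[01]?\d\d?\.){3}'
    r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'
)
han_re = re.compile('[\u4e00-\u9fa5]+')  # the Chinese-character range used in the article


def group_cells(cells):
    """Group cells into [ip, port, type] records, dropping cells with Chinese text."""
    records = []
    current = None
    for cell in cells:
        if ip_re.match(cell):
            # An IP address starts a new record.
            if current is not None:
                records.append(current)
            current = [cell]
        elif current is not None and han_re.search(cell) is None:
            current.append(cell)
    if current is not None:
        records.append(current)  # do not lose the last record
    return records


# Made-up cells in page order: ip, port, location (Chinese), type.
cells = ['61.135.217.7', '80', '北京', 'HTTP',
         '122.114.31.177', '808', '河南郑州', 'HTTPS']
records = group_cells(cells)
print(records)
```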


In the part of the code that tests whether a proxy is usable, if the connection attempt blocks for more than 3 seconds we treat the IP address as faulty. The timeout raises an exception, which is caught, and the loop continues with the next IP address.
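
The timeout check can be isolated in a small function. This is a sketch, not the original code: the name proxy_works and its parameters are assumptions made for illustration.

```python
import urllib.request


def proxy_works(proxy_type, host, port, test_url, timeout=3):
    """Return True if test_url can be fetched through the proxy within timeout seconds.

    Hypothetical helper: proxy_type is 'HTTP' or 'HTTPS', host and port are strings.
    """
    # ProxyHandler maps a lowercase URL scheme to "host:port".
    proxy = {proxy_type.lower(): host + ':' + port}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    try:
        opener.open(test_url, timeout=timeout).read()
    except Exception:
        # Timeouts, refused connections and HTTP errors all count as "not usable".
        return False
    return True
```

It could be called as proxy_works('HTTP', '61.135.217.7', '80', 'http://ip.chinaz.com/getip.aspx'). One design point to keep in mind: the dictionary key decides which URL scheme is routed through the proxy, so an entry stored under 'https' is not used when the test URL starts with http://.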


The website used to check IP availability is:

http://ip.chinaz.com/getip.aspx

The usable IP addresses are then written to the file.
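
Writing the results can be sketched as follows. The function name save_proxies, the sample record, and the temporary output location are assumptions for illustration; the original program writes to its own fixed path.

```python
import os
import tempfile


def save_proxies(path, records):
    """Write one tab-separated "ip, port, type" line per usable proxy (hypothetical helper)."""
    # The with-statement closes the file even if a write fails.
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write('\t'.join(record) + '\n')


# Made-up record; a temporary directory keeps the example portable.
usable = [['61.135.217.7', '80', 'HTTP']]
out_path = os.path.join(tempfile.gettempdir(), 'output.txt')
save_proxies(out_path, usable)
```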

