Python crawler (5)--Regular expression (ii)

In the previous article, we used the re module to match parts of a long string. Next we'll match against the string "[email protected] Advantage 314159265358 1892673 3.14 little Girl try_your_best [email protected] Python 3".

Our goal is to match '56'. Here \d matches a single digit and {2} means the preceding element repeats exactly two times; more generally, {m,n}, where m and n are non-negative integers with m <= n, means the element repeats between m and n times. Adding r in front of the pattern marks it as a raw string, so backslashes are passed through to the regex engine unescaped.
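A minimal sketch of these quantifiers, using a throwaway string of my own rather than the author's test string:

import re

re.search(r'\d{2}', 'abc56def').group()   # '56': exactly two consecutive digits
re.findall(r'\d{1,3}', '3.14')            # ['3', '14']: one to three digits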

In practice, when we use a regular expression repeatedly, we usually compile it into a pattern object with the re.compile() method. Let's match an IP address such as 192.168.1.1.

import re

s = '192.168.1.1'
re.search(r'(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])', s)

As you can see, regular expressions are not simple to use. The rule above contains three subgroups, so if we use findall() to collect all the IPs on a web page, it returns the text captured by the subgroups (and for a repeated group, only its last repetition) instead of the whole address, e.g. fragments like ('1.', '1', '1'). Obviously that's not what we want. This is where (?:...) comes in: it marks a non-capturing group, that is, a group whose matched string cannot be retrieved afterwards.
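A quick demonstration on a made-up snippet (my own example, not from the article):

import re

text = 'ip1: 192.168.1.1, ip2: 10.0.0.138'   # hypothetical sample text

# With capturing groups, findall() returns the group contents:
cap = re.compile(r'(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])')
print(cap.findall(text))      # [('1.', '1', '1'), ('0.', '0', '138')]

# With (?:...) non-capturing groups, findall() returns the whole match:
noncap = re.compile(r'(?:(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])')
print(noncap.findall(text))   # ['192.168.1.1', '10.0.0.138']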

With that foundation, I tried to write the code below: it crawls IP addresses from a proxy-listing site and then verifies each one is usable by accessing a site through it as a proxy, exercising Python's exception-handling mechanism along the way. The code is not mature, but I'll share it anyway and improve it gradually.

import urllib.request
import urllib.error
import re

url = "http://www.xicidaili.com/"
useful_ip = []

def loadPage(url):
    # Disguise the crawler as a regular browser via the User-Agent header
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request).read().decode("utf-8")
    return html

def getProxy():
    html = loadPage(url)
    # Port numbers, still wrapped in their <td> tags
    pattern = re.compile(r'<td>\d+</td>')
    duankou = pattern.findall(html)
    # IP addresses, matched with non-capturing groups
    pattern = re.compile(r'(?:(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])')
    content_list = pattern.findall(html)
    list_num = []
    for num in duankou:
        list_num.append(num[4:-5])  # strip the '<td>' and '</td>' wrappers
    for i in range(len(list_num)):
        ip = content_list[i] + ":" + list_num[i]
        while True:
            # Build and install an opener that routes HTTP traffic through this proxy
            proxy_support = urllib.request.ProxyHandler({'http': ip})
            opener = urllib.request.build_opener(proxy_support)
            opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36")]
            urllib.request.install_opener(opener)
            try:
                print("attempting to access using %s ..." % ip)
                ip_filter = "http://www.whatsmyip.org/"
                ip_response = urllib.request.urlopen(ip_filter)
            except urllib.error.URLError:
                print("access error, this IP is not available")
                break
            else:
                print("Access success!")
                print("the available IP is: %s" % ip)
                useful_ip.append(ip)
                if input("continue crawling? ") == "N":
                    print("valid IPs are as follows:")
                    for key in useful_ip:
                        print(key)
                    exit()
                else:
                    break

if __name__ == "__main__":
    getProxy()

When pairing each IP address with its port number, I used a rather clumsy method; there is a better solution, which you can think about (one possibility is sketched below). The code above exercises a whole range of knowledge points: fetching a website with urllib, customizing an opener with a handler, Python exception handling, matching IPs with regular expressions, and so on. Whatever the technique, the more you use it, the more skilled you become.
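For instance (my own sketch, not necessarily the author's intended solution), a capturing group keeps only the digits, so no slicing of the '<td>'/'</td>' wrappers is needed:

import re

html = "<td>8080</td><td>3128</td>"          # hypothetical snippet of the page
ports = re.findall(r'<td>(\d+)</td>', html)  # ['8080', '3128']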

You can see that it runs successfully: each time an available IP is found, you are asked whether to continue crawling. Of course, we could also build an "ippool", that is, an IP pool, by hand: write a custom function that saves every usable IP to a file. I won't go into detail here; there is mature IP-pool code on GitHub that you can download and read. This was just a simple experiment with the techniques from earlier, so it is by no means perfect code.
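A minimal sketch of that idea, assuming useful_ip holds strings like '1.2.3.4:8080' (the filename ippool.txt is my own choice, not from the article):

def save_ippool(ips, path="ippool.txt"):
    # Append each verified proxy to a plain-text pool file, one per line
    with open(path, "a", encoding="utf-8") as f:
        for ip in ips:
            f.write(ip + "\n")

save_ippool(useful_ip)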
