In the previous article, we used the `re` module to match parts of a long string. Next we will run some matches against the string "[email protected] Advantage 314159265358 1892673 3.14 little Girl try_your_best [email protected] Python 3".
Our goal is to match '56'. Here `\d` matches a single digit, and `{2}` repeats the preceding element exactly twice. More generally, `{m,n}` (where m and n are non-negative integers with m <= n) matches between m and n repetitions. Prefixing the pattern with `r` marks it as a raw string, so backslashes are passed to the regex engine unchanged instead of being interpreted as Python escape sequences.
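A minimal sketch of these quantifiers in action (the sample strings below are mine, chosen for illustration):

```python
import re

# \d{2} matches exactly two digits.
assert re.search(r'\d{2}', 'abc56def').group() == '56'

# {m,n} matches between m and n repetitions; matching is greedy,
# so '12345' yields '1234' (four digits), leaving a lone '5' unmatched.
assert re.findall(r'\d{2,4}', '1 12 123 12345') == ['12', '123', '1234']

# An r-prefixed (raw) string keeps the backslash literal:
# r'\d' is the two characters backslash and d.
assert r'\d' == '\\d'
```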
In practice, regular expressions are usually compiled into pattern objects with `re.compile()` before use. Let's match an IP address such as 192.168.1.1.
```python
import re

s = '192.168.1.1'  # renamed from str to avoid shadowing the built-in
re.search(r'(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}'
          r'([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])', s)
```
As you can see, using regular expressions is not simple. The rule above contains three subgroups, and when a pattern contains groups, `findall` returns tuples of what the groups captured rather than the whole match. So if we used it to collect every IP on a web page, we would get back fragments of each address instead of strings like '192.168.1.1'. Obviously that is not what we want. This is where `(?:...)` comes in: it is a non-capturing group, meaning the subpattern is grouped (for repetition, alternation, and so on) but its match cannot be retrieved afterwards, so `findall` returns the full match.
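A minimal illustration of the difference (the sample text and simplified `\d{1,3}` pattern are mine; the simplified pattern does not validate octet ranges):

```python
import re

ip_text = "host 192.168.1.1 and 10.0.0.2"

# With capturing groups, findall returns tuples of the group captures
# (for a repeated group, only its last repetition), not the whole match.
capturing = re.findall(r'((\d{1,3}\.){3}\d{1,3})', ip_text)
# capturing -> [('192.168.1.1', '1.'), ('10.0.0.2', '0.')]

# With (?:...) the group still repeats but captures nothing,
# so findall returns the entire match.
non_capturing = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', ip_text)
# non_capturing -> ['192.168.1.1', '10.0.0.2']
```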
With the foundation from earlier articles, I tried writing the following code: it crawls IP addresses from a website and verifies each one by accessing a site through it as a proxy, using Python's exception-handling mechanism along the way. The code is not mature, but I am sharing it anyway and will improve it gradually.
```python
import urllib.request
import urllib.error
import re

url = "http://www.xicidaili.com/"
useful_ip = []

UA = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36")


def loadPage(url):
    headers = {"User-Agent": UA}
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request).read().decode("utf-8")
    return html


def getProxy():
    html = loadPage(url)
    # Grab the port cells, e.g. <td>8080</td>
    pattern = re.compile(r'(<td>\d+</td>)')
    duankou = pattern.findall(html)
    # Grab the IP addresses; (?:...) keeps findall returning whole matches
    pattern = re.compile(r'(?:(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}'
                         r'(?:[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])')
    content_list = pattern.findall(html)
    list_num = []
    for num in duankou:
        list_num.append(num[4:-5])  # strip the <td> and </td> tags
    for i in range(len(list_num)):
        ip = content_list[i] + ":" + list_num[i]
        while True:
            proxy_support = urllib.request.ProxyHandler({'http': ip})
            opener = urllib.request.build_opener(proxy_support)
            # note: the attribute is addheaders, not add_handler
            opener.addheaders = [("User-Agent", UA)]
            urllib.request.install_opener(opener)
            try:
                print("attempting to access using %s ..." % ip)
                ip_filter = "http://www.whatsmyip.org/"
                ip_response = urllib.request.urlopen(ip_filter)
            except urllib.error.URLError:
                print("access error, this IP is not available")
                break
            else:
                print("Access success!")
                print("The available IP is: %s" % ip)
                useful_ip.append(ip)
                if input("Continue crawling? ") == "N":
                    print("Valid IPs are as follows:")
                    for key in useful_ip:
                        print(key)
                    exit()
                else:
                    break


if __name__ == "__main__":
    getProxy()
```
When pairing each IP address with its port number I used a rather clumsy method; there is actually a better solution, which you can think about. The code above touches a range of knowledge points: accessing a website with urllib, building a custom opener with a handler, Python exception handling, matching IPs with regular expressions, and so on. Whatever the knowledge, skill comes with practice.
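One possible improvement (a sketch of my own against hypothetical HTML modeled on a proxy-list table, not necessarily the author's intended solution): capture the IP and the port in a single pattern with two groups, so they come out already paired instead of being extracted separately and matched up by index.

```python
import re

# Hypothetical table row layout: an IP cell followed by a port cell.
html = "<td>192.168.1.1</td><td>8080</td><td>10.0.0.2</td><td>3128</td>"

# Two capturing groups: findall returns (ip, port) tuples directly,
# so no slicing of <td> tags and no positional bookkeeping is needed.
pattern = re.compile(r'<td>((?:\d{1,3}\.){3}\d{1,3})</td>\s*<td>(\d+)</td>')
proxies = ["%s:%s" % (ip, port) for ip, port in pattern.findall(html)]
# proxies -> ['192.168.1.1:8080', '10.0.0.2:3128']
```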
You can see that it runs successfully, and whenever an available IP is found, you are asked whether to continue crawling. Of course, we could also build an IP pool by hand: write a function that saves each verified IP to a file. I won't go into that here. There are mature IP-pool projects on GitHub that are worth downloading and reading; this was just a simple experiment with the techniques covered earlier, not a polished piece of code.
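A minimal sketch of such a save-to-file helper (the function and file names are mine, not from the article): appending verified proxies to a text file, one per line, so later runs can extend and reload the pool.

```python
def save_ip_pool(ips, path="ip_pool.txt"):
    # Append each verified proxy on its own line,
    # so repeated runs extend the pool instead of overwriting it.
    with open(path, "a", encoding="utf-8") as f:
        for ip in ips:
            f.write(ip + "\n")


def load_ip_pool(path="ip_pool.txt"):
    # Read the pool back, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```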
Python crawler (5)--Regular expression (ii)