The shortest matching implementation code in python Regular Expressions

Last Update:2018-01-24 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Let's start with an example:

Use regular expressions to parse the following XML/HTML tags:

<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author> <city>London</city>

You want to automatically format and rewrite the format:

Composer: Wolfgang Amadeus shortart
Author: Samuel Beckett
City: London

A code is like this:

# Coding: UTF-8 import re s = "<composer> WolfgangAmadeus audio art </composer> <author> SamuelBeckett </author> <city> London </city>" pattern1 = re. compile ("<\ w +>") # match any character in <> pattern2 = re. compile (">. + </") # match> <any character in listNames = pattern1.findall (s) # obtain the list of all strings meeting the regular expression pattern1 listContents = pattern2.findall (s) # obtain the list of all strings meeting the regular expression pattern2 # xml is a one-to-one correspondence because it is standard (for incorrect input, do not consider it for the moment) for I in range (len (listNames): # discard unnecessary symbols by using slice during output, such as: <>/print (listNames [I] [1: len (listNames [I])-1], ":", listContents [I] [1: len (listContents [I])-2])

This code runs and the result is OK.

Next we will modify the format of s:

# Coding: utf-8import res = "<composer> Wolfgang Amadeus shortart </composer> <author> Samuel Beckett </author> <city> London </city>" pattern1 = re. compile ("<\ w +>") # match any character in <> # This mode is not greedy, so s can match pattern2 = re if not multiple lines. compile (">. + </") # match> <any character. The question mark must be added ,"? "Non-Greedy match listNames = pattern1.findall (s) # obtain the list of all strings meeting the regular expression pattern1 listContents = pattern2.findall (s) # obtain the list of all strings meeting the regular expression pattern2 # xml is a one-to-one correspondence because it is standard (for incorrect input, do not consider it for the moment) for I in range (len (listNames): # discard unnecessary symbols by using slice during output, such as: <>/print (listNames [I] [1: len (listNames [I])-1], ":", listContents [I] [1: len (listContents [I])-2])

The answer is as follows:

Let's take a look at the two matching results. The modified code is as follows:

# Coding: utf-8import res = "<composer> Wolfgang Amadeus shortart </composer> <author> Samuel Beckett </author> <city> London </city>" pattern1 = re. compile ("<\ w +>") # match any character in <> # This mode is not greedy, so s can match pattern2 = re if not multiple lines. compile (">. + </") # match> <any character. The question mark must be added ,"? "Non-Greedy match listNames = pattern1.findall (s) # obtain the list of all strings meeting the regular expression pattern1 listContents = pattern2.findall (s) # obtain the print (listNames) print (listContents) list of all strings meeting the regular expression pattern2 # Because xml is standard, it corresponds one to one (for incorrect input, do not consider it for the moment) for I in range (len (listNames): # discard unnecessary symbols by using slice during output, such as: <>/print (listNames [I] [1: len (listNames [I])-1], ":", listContents [I] [1: len (listContents [I])-2])

The result is as follows:

From the first arrow, we can see that this processing is correct. Then, looking at the second arrow, the matching result is obviously incorrect. Why?
This is because in the regular expression, '*', '+ ','? 'These are greedy matching. For example, if a * is used, the operation results will be as many matching modes as possible. So when you try to match a pair of symmetric delimiters, such as angle brackets in the HTML sign. The pattern matching a single HTML sign cannot work normally, because. * is essentially greedy. In this case, the solution is to use a non-Greedy qualifier *? , +? ,?? Or {m, n }?, Match as little text as possible.

The code can be modified as follows:

# Coding: utf-8import res = "<composer> Wolfgang Amadeus shortart </composer> <author> Samuel Beckett </author> <city> London </city>" pattern1 = re. compile ("<\ w +?> ") # Match any character in <> # This mode is not greedy, so s can match pattern2 = re. compile (">. +? </") # Match> <any character. The question mark must be added ,"? "Non-Greedy match listNames = pattern1.findall (s) # obtain the list of all strings meeting the regular expression pattern1 listContents = pattern2.findall (s) # obtain the list of all strings meeting the regular expression pattern2 # xml is a one-to-one correspondence because it is standard (for incorrect input, do not consider it for the moment) for I in range (len (listNames): # discard unnecessary symbols by using slice during output, such as: <>/print (listNames [I] [1: len (listNames [I])-1], ":", listContents [I] [1: len (listContents [I])-2])

Finally, use grouping to optimize the regular expression of the Code as follows:

# Coding: utf-8import res = "<composer> Wolfgang Amadeus shortart </composer> <author> Samuel Beckett </author> <city> London </city>" pattern1 = re. compile ("<(\ w +?)> ") # Match any character in <> # This mode is not greedy, so s can match pattern2 = re. compile (" <\ w +?> (. + ?) </\ W +?> ") # Match any character in <a>... </a>. The question mark must be added ,"? "Non-Greedy match listNames = pattern1.findall (s) # obtain the list of all strings meeting the regular expression pattern1 listContents = pattern2.findall (s) # obtain the list of all strings meeting the regular expression pattern2 # xml is a one-to-one correspondence because it is standard (for incorrect input, do not consider it for the moment) for I in range (len (listNames): print (listNames [I], ":", listContents [I])

This article introduces the python regular expression.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

The shortest matching implementation code in python Regular Expressions

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

The shortest matching implementation code in python Regular Expressions

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support