Shortest (non-greedy) matching in Python regular expressions

Let's start with an example:

Use regular expressions to parse the following XML/HTML tags:

<composer>Wolfgang Amadeus Mozart</composer>
<author>Samuel Beckett</author>
<city>London</city>

The goal is to rewrite them automatically in the following format:

composer: Wolfgang Amadeus Mozart
author: Samuel Beckett
city: London

The code looks like this:

# coding: utf-8
import re

# Each element sits on its own line.
s = """<composer>Wolfgang Amadeus Mozart</composer>
<author>Samuel Beckett</author>
<city>London</city>"""

pattern1 = re.compile(r"<\w+>")   # an opening tag such as <composer>
pattern2 = re.compile(r">.+</")   # everything from a > up to a </ (greedy)

listNames = pattern1.findall(s)      # all strings that match pattern1
listContents = pattern2.findall(s)   # all strings that match pattern2

# Well-formed XML gives a one-to-one correspondence between the two lists
# (malformed input is not considered for now).
for i in range(len(listNames)):
    # Slice off the surrounding symbols: < > and </
    print(listNames[i][1:-1] + ":", listContents[i][1:-2])

This code runs and produces the expected result.
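
The greedy pattern >.+</ works here provided each element of s is on its own line: by default "." does not match a newline, so a match can never run past the end of a line. A minimal sketch with made-up input (the tags a and b and the variable names are illustrative, not taken from the code above):

```python
import re

pattern = re.compile(r">.+</")  # greedy, same shape as pattern2

multi_line = "<a>x</a>\n<b>y</b>"   # one element per line
single_line = "<a>x</a><b>y</b>"    # the same elements on one line

multi_hits = pattern.findall(multi_line)
single_hits = pattern.findall(single_line)

print(multi_hits)   # ['>x</', '>y</']
print(single_hits)  # ['>x</a><b>y</']
```

On the single-line input the greedy match swallows everything between the first > and the last </.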

Next, change the format of s so that all the elements are on a single line:

# coding: utf-8
import re

# The same elements, now on a single line.
s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<\w+>")   # an opening tag such as <composer>
pattern2 = re.compile(r">.+</")   # greedy: runs from the first > to the last </ on the line

listNames = pattern1.findall(s)      # all strings that match pattern1
listContents = pattern2.findall(s)   # all strings that match pattern2

# Well-formed XML gives a one-to-one correspondence between the two lists
# (malformed input is not considered for now).
for i in range(len(listNames)):
    # Slice off the surrounding symbols: < > and </
    print(listNames[i][1:-1] + ":", listContents[i][1:-2])

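This time the result is wrong: the content printed for composer runs on to the last closing tag, and the loop then stops with an IndexError.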

To see why, print the two lists of matches. The modified code is as follows:

# coding: utf-8
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<\w+>")   # an opening tag such as <composer>
pattern2 = re.compile(r">.+</")   # greedy: runs from the first > to the last </ on the line

listNames = pattern1.findall(s)      # all strings that match pattern1
listContents = pattern2.findall(s)   # all strings that match pattern2

# Show the raw matches before they are sliced.
print(listNames)
print(listContents)

# Well-formed XML gives a one-to-one correspondence between the two lists
# (malformed input is not considered for now).
for i in range(len(listNames)):
    # Slice off the surrounding symbols: < > and </
    print(listNames[i][1:-1] + ":", listContents[i][1:-2])

The printed lists show the problem. listNames is correct: it holds the three opening tags. listContents is clearly wrong: it holds a single string that runs from the first > all the way to the last </. Why?
This is because the quantifiers '*', '+' and '?' are greedy in regular expressions: they match as much text as possible. With .+, for example, the match extends as far as it can while still letting the rest of the pattern succeed. This gets in the way when you try to match a pair of symmetric delimiters, such as the angle brackets around an HTML tag. A pattern meant to match a single HTML tag does not work, because .* is greedy and runs on to the last closing delimiter. The solution is to use the non-greedy quantifiers *?, +?, ?? or {m,n}?, which match as little text as possible.
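
The difference is easy to see on a small string. The sketch below is only an illustration; the sample text and variable names are made up for the purpose:

```python
import re

text = "<a> b <c>"

# Greedy: .* grabs as much as possible, so the match ends at the last >.
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? grabs as little as possible, so the match ends at the first >.
lazy = re.search(r"<.*?>", text).group()

# With findall, the non-greedy pattern finds each bracketed item separately.
all_lazy = re.findall(r"<.*?>", text)

print(greedy)    # <a> b <c>
print(lazy)      # <a>
print(all_lazy)  # ['<a>', '<c>']
```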

The code can be modified as follows:

# coding: utf-8
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<\w+?>")   # an opening tag; \w cannot match >, so +? behaves like + here
pattern2 = re.compile(r">.+?</")   # non-greedy: stop at the first </ instead of the last

listNames = pattern1.findall(s)      # all strings that match pattern1
listContents = pattern2.findall(s)   # all strings that match pattern2
# Note: a match still starts at the leftmost >, so the second and third
# entries of listContents begin with the next opening tag.

# Well-formed XML gives a one-to-one correspondence between the two lists
# (malformed input is not considered for now).
for i in range(len(listNames)):
    # Slice off the surrounding symbols: < > and </
    print(listNames[i][1:-1] + ":", listContents[i][1:-2])

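Non-greedy quantifiers alone are not quite enough here. A match still begins at the leftmost possible position, so the second and third matches of >.+?</ start at the > that ends the previous closing tag and drag the next opening tag along (the second match includes <author> as well as Samuel Beckett). Anchoring the pattern on the opening tag and using groups to capture only the parts we need fixes this and makes the slicing unnecessary. The final code is as follows: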

# coding: utf-8
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

# The group captures the tag name inside < >.
pattern1 = re.compile(r"<(\w+?)>")
# The group captures the text between <a> and </a>; the "?" after ".+" is
# required so that the match stops at the first closing tag (non-greedy).
pattern2 = re.compile(r"<\w+?>(.+?)</\w+?>")

listNames = pattern1.findall(s)      # with one group, findall returns the captured names
listContents = pattern2.findall(s)   # with one group, findall returns the captured contents

# Well-formed XML gives a one-to-one correspondence between the two lists
# (malformed input is not considered for now).
for i in range(len(listNames)):
    print(listNames[i] + ":", listContents[i])
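
As a possible variation (not part of the original code), the two patterns can be folded into one, so the name and the content come from the same match. The backreference \1 additionally requires the closing tag to carry the same name as the opening tag. This is a sketch for flat, attribute-free tags like the ones in the example; the names pattern and pairs are illustrative:

```python
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

# Group 1: tag name. Group 2: content, matched non-greedily.
# \1 must repeat whatever group 1 matched, i.e. the same tag name.
pattern = re.compile(r"<(\w+)>(.+?)</\1>")

# With two groups, findall returns a list of (name, content) tuples.
pairs = pattern.findall(s)

for name, content in pairs:
    print(name + ":", content)
```

Because each name stays paired with its own content, there is no need to assume that two separate lists line up.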

This article has introduced shortest (non-greedy) matching in Python regular expressions.
