Shortest (non-greedy) matching in Python regular expressions
Let's start with an example:
Use regular expressions to parse the following XML/HTML tags:
<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author> <city>London</city>
The goal is to reformat them automatically as:
Composer: Wolfgang Amadeus Mozart
Author: Samuel Beckett
City: London
The code looks like this:
# coding: utf-8
import re

# s spans several lines, one tag pair per line
s = """<composer>Wolfgang Amadeus Mozart</composer>
<author>Samuel Beckett</author>
<city>London</city>"""

pattern1 = re.compile(r"<\w+>")  # matches the word characters inside < >
pattern2 = re.compile(r">.+</")  # matches any characters between > and </

listNames = pattern1.findall(s)     # list of all strings matching pattern1
listContents = pattern2.findall(s)  # list of all strings matching pattern2

# Well-formed XML pairs names and contents one to one (malformed input is not considered for now)
for i in range(len(listNames)):
    # Use slices to drop the unneeded symbols < > / when printing
    print(listNames[i][1:len(listNames[i]) - 1], ":", listContents[i][1:len(listContents[i]) - 2])
This code runs and the result is OK.
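A likely reason the greedy pattern behaves well here is that `.` does not match a newline by default, so a greedy `.+` cannot run past the end of a line. The sketch below illustrates this under the assumption that each tag pair sits on its own line; the short sample strings and variable names are made up for illustration.

```python
import re

# Illustrative inputs (not from the article): the same tags, with and without a line break
multi_line = "<a>one</a>\n<b>two</b>"
single_line = "<a>one</a><b>two</b>"

greedy = re.compile(r">.+</")  # greedy: takes as much as it can, but "." stops at a newline

per_line = greedy.findall(multi_line)   # one match per line
whole = greedy.findall(single_line)     # a single match spanning both tag pairs

print(per_line)
print(whole)
```

With the line break, each match stays inside its own line; without it, the greedy pattern swallows everything up to the last `</`.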
Next, we change the format of s so that everything sits on a single line:
# coding: utf-8
import re

# The same data, now on one line
s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<\w+>")  # matches the word characters inside < >
# This pattern is greedy, so the match goes wrong unless s spans multiple lines
pattern2 = re.compile(r">.+</")  # matches any characters between > and </; a "?" must be added for a non-greedy match

listNames = pattern1.findall(s)     # list of all strings matching pattern1
listContents = pattern2.findall(s)  # list of all strings matching pattern2

# Well-formed XML pairs names and contents one to one (malformed input is not considered for now)
for i in range(len(listNames)):  # raises IndexError once listContents runs out of items
    # Use slices to drop the unneeded symbols < > / when printing
    print(listNames[i][1:len(listNames[i]) - 1], ":", listContents[i][1:len(listContents[i]) - 2])
This time the output is wrong. To see why, print the two lists of matches. The modified code is as follows:
# coding: utf-8
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<\w+>")  # matches the word characters inside < >
# This pattern is greedy, so the match goes wrong unless s spans multiple lines
pattern2 = re.compile(r">.+</")  # matches any characters between > and </; a "?" must be added for a non-greedy match

listNames = pattern1.findall(s)     # list of all strings matching pattern1
listContents = pattern2.findall(s)  # list of all strings matching pattern2
print(listNames)
print(listContents)

# Well-formed XML pairs names and contents one to one (malformed input is not considered for now)
for i in range(len(listNames)):  # raises IndexError once listContents runs out of items
    # Use slices to drop the unneeded symbols < > / when printing
    print(listNames[i][1:len(listNames[i]) - 1], ":", listContents[i][1:len(listContents[i]) - 2])
In the printed result, the first list (listNames) is correct. The second list (listContents) is obviously wrong: instead of one entry per tag, it holds a single string running from the first > to the last </. Why?
This is because the regular expression quantifiers '*', '+' and '?' are greedy: they match as much text as possible. So when you try to match a pair of symmetric delimiters, such as the angle brackets of an HTML tag, a pattern meant to match a single tag does not work as intended, because .* is greedy by nature. The solution is to use the non-greedy qualifiers *?, +?, ?? or {m,n}?, which match as little text as possible.
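The difference between the greedy and non-greedy qualifiers can be seen in a small sketch. The sample strings and variable names below are illustrative and do not come from the article.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative input

# Greedy: ".+" runs from the first "<" to the last ">"
greedy_tags = re.findall(r"<.+>", html)

# Non-greedy: ".+?" stops at the nearest ">"
lazy_tags = re.findall(r"<.+?>", html)

print(greedy_tags)
print(lazy_tags)

# The same idea applies to bounded repetition: {m,n} versus {m,n}?
digits = "123456"
greedy_run = re.match(r"\d{2,4}", digits).group()   # takes as many digits as allowed
lazy_run = re.match(r"\d{2,4}?", digits).group()    # takes as few digits as allowed

print(greedy_run, lazy_run)
```

The greedy pattern returns the whole string as one match, while the non-greedy pattern returns each tag separately.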
The code can be modified as follows:
# coding: utf-8
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<\w+?>")  # matches the word characters inside < >
# This pattern is non-greedy, so s no longer has to span multiple lines.
# The "?" is required for the non-greedy match. The leading [^<] keeps a match from
# starting at the ">" that ends a closing tag when the next tag follows immediately.
pattern2 = re.compile(r">[^<].*?</")  # matches the shortest text between > and </

listNames = pattern1.findall(s)     # list of all strings matching pattern1
listContents = pattern2.findall(s)  # list of all strings matching pattern2

# Well-formed XML pairs names and contents one to one (malformed input is not considered for now)
for i in range(len(listNames)):
    # Use slices to drop the unneeded symbols < > / when printing
    print(listNames[i][1:len(listNames[i]) - 1], ":", listContents[i][1:len(listContents[i]) - 2])
Finally, grouping can be used to simplify the regular expressions and remove the slicing:
# coding: utf-8
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

pattern1 = re.compile(r"<(\w+?)>")  # captures the word characters inside < >
# This pattern is non-greedy, so s is matched correctly even on a single line
pattern2 = re.compile(r"<\w+?>(.+?)</\w+?>")  # captures the text in <a>...</a>; the "?" is required for the non-greedy match

listNames = pattern1.findall(s)     # list of all groups captured by pattern1
listContents = pattern2.findall(s)  # list of all groups captured by pattern2

# Well-formed XML pairs names and contents one to one (malformed input is not considered for now)
for i in range(len(listNames)):
    print(listNames[i], ":", listContents[i])
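As a possible further step, the two patterns could be folded into one that captures the tag name and the content together. The sketch below is not from the original code: the backreference \1, the name `pair`, and the `capitalize()` call are assumptions added here so that the output follows the "Composer: ..." layout shown at the start.

```python
import re

s = "<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>"

# Group 1: the tag name. Group 2: the shortest possible content.
# \1 requires the closing tag to repeat the name captured by group 1.
pair = re.compile(r"<(\w+)>(.+?)</\1>")

# With two groups, findall returns a list of (name, content) tuples
lines = [f"{name.capitalize()}: {content}" for name, content in pair.findall(s)]

print("\n".join(lines))
```

Because each match carries both the name and the content, the two lists can no longer get out of step with each other.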
This article introduced shortest (non-greedy) matching in Python regular expressions.