Regular expressions are a Swiss army knife for finding patterns in text. They come with a large toolbox of features, some of which are often overlooked or underused. Today I will show you some advanced regular expression techniques.
For example, this is a regular expression that we might use to detect telephone numbers in the US:
    r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'
We can add some comments and spaces to make it more readable.
    r'^'
    r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
    r'(\()?'       # optional opening parenthesis
    r'\d{3}'       # the area code
    r'(?(2)\))'    # if there was an opening parenthesis, close it
    r'[-\s.]?'     # followed by '-' or '.' or space
    r'\d{3}'       # first 3 digits
    r'[-\s.]?'     # followed by '-' or '.' or space
    r'\d{4}$'      # last 4 digits
Let's put it in a code snippet:
    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123 55 6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'       # optional '1-', '1.' or '1 '
                           r'(\()?'            # optional opening parenthesis
                           r'\d{3}'            # the area code
                           r'(?(2)\))'         # if there was an opening parenthesis, close it
                           r'[-\s.]?'          # followed by '-' or '.' or space
                           r'\d{3}'            # first 3 digits
                           r'[-\s.]?'          # followed by '-' or '.' or space
                           r'\d{4}$', number)  # last 4 digits

        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))
Output:
    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123 55 6789 is not valid
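As an aside, Python's re module also offers the re.VERBOSE flag, which ignores whitespace and allows # comments inside a single pattern string. The sketch below rewrites the same phone pattern that way. The names PHONE and is_valid_phone are made up for this illustration and are not part of the original example.

```python
import re

# Hypothetical names; same pattern as above, written with re.VERBOSE so that
# whitespace and "#" comments inside the pattern string are ignored.
PHONE = re.compile(r"""
    ^
    (1[-\s.])?   # optional '1-', '1.' or '1 '
    (\()?        # optional opening parenthesis
    \d{3}        # the area code
    (?(2)\))     # if there was an opening parenthesis, close it
    [-\s.]?      # followed by '-' or '.' or space
    \d{3}        # first 3 digits
    [-\s.]?      # followed by '-' or '.' or space
    \d{4}$       # last 4 digits
""", re.VERBOSE)


def is_valid_phone(number):
    """Return True if `number` matches the phone pattern."""
    return PHONE.match(number) is not None


for number in ["123 555 6789", "(123-555-6789"]:
    print(number, is_valid_phone(number))
```

Compiling the pattern once and reusing it also avoids rebuilding the pattern string inside the loop.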
Regular expressions are a great feature of Python, but they are hard to debug and easy to get wrong.
Fortunately, Python can print the parse tree of a regular expression when you pass the re.DEBUG flag (actually the integer 128) to re.compile or re.match.
    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123 55 6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                           r'(\()?'       # optional opening parenthesis
                           r'\d{3}'       # the area code
                           r'(?(2)\))'    # if there was an opening parenthesis, close it
                           r'[-\s.]?'     # followed by '-' or '.' or space
                           r'\d{3}'       # first 3 digits
                           r'[-\s.]?'     # followed by '-' or '.' or space
                           r'\d{4}$',     # last 4 digits
                           number, re.DEBUG)

        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))
Parse tree (this dump contains max_repeat 0 2147483648 nodes over category_space, i.e. \s*, so it was generated from a variant of the pattern that allows extra whitespace between its parts; the exact format also varies between Python versions):
    at at_beginning
    max_repeat 0 1
      subpattern 1
        literal 49
        in
          literal 45
          category category_space
          literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      subpattern 2
        literal 40
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    subpattern None
      groupref_exists 2
        literal 41
    None
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 4 4
      in
        category category_digit
    at at_end
    max_repeat 0 2147483648
      in
        category category_space
    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123 55 6789 is not valid
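The same flag works with re.compile. The following is a minimal sketch, assuming Python 3, that captures the dump as a string so it can be inspected or logged. The helper name dump_parse_tree is invented for this example. The dump is written to standard output, so redirecting stdout is enough to collect it.

```python
import contextlib
import io
import re


def dump_parse_tree(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`.

    Hypothetical helper: the re module itself only prints the dump.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


# A literal 'a' (character code 97) followed by exactly three digits.
print(dump_parse_tree(r'a\d{3}'))
```

Starting with a tiny pattern like this one makes it easier to learn how to read the node names before dumping a long expression.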
Greedy vs. non-greedy
Before I explain this concept, I would like to show an example first. We want to find anchor tags in a piece of HTML text:
    import re

    html = 'Hello <a href="http://pypix.com" title="Pypix">Pypix</a>'

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)
The result is as expected:
    ['<a href="http://pypix.com" title="Pypix">Pypix</a>']
Let's change the input and add a second anchor tag:
    import re

    html = ('Hello <a href="http://pypix.com" title="Pypix">Pypix</a>'
            'Hello <a href="http://example.com" title="Example">Example</a>')

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)
At first glance the pattern still looks right. But don't be fooled! With two anchor tags on the same line, it no longer works correctly:
    ['<a href="http://pypix.com" title="Pypix">Pypix</a>Hello <a href="http://example.com" title="Example">Example</a>']
This time the pattern matched from the first opening tag to the last closing tag, with everything in between, as a single match instead of two separate matches. This is because the default matching mode is "greedy".
In greedy mode, quantifiers (such as * and +) match as many characters as possible.
When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.
    import re

    html = ('Hello <a href="http://pypix.com" title="Pypix">Pypix</a>'
            'Hello <a href="http://example.com" title="Example">Example</a>')

    m = re.findall(r'<a.*?>.*?</a>', html)
    if m:
        print(m)
Now the result is correct:
    ['<a href="http://pypix.com" title="Pypix">Pypix</a>', '<a href="http://example.com" title="Example">Example</a>']
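To isolate the effect, here is a small sketch on a shorter string (the sample text is made up for illustration). It also shows that the ? suffix applies to other quantifiers, such as + and {m,n}, not only to *.

```python
import re

text = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last closing tag on the line.
greedy = re.findall(r'<b>.*</b>', text)
# Non-greedy: .*? stops at the first closing tag it can reach.
lazy = re.findall(r'<b>.*?</b>', text)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']

# The same suffix works on + and {m,n}.
print(re.match(r'\d+', '12345').group())       # 12345
print(re.match(r'\d+?', '12345').group())      # 1
print(re.match(r'\d{2,4}?', '12345').group())  # 12
```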
Lookahead and lookbehind assertions
A lookahead assertion checks what comes right after the current match position without making it part of the match. It is best explained with an example.
The following pattern first matches foo and then checks whether it is followed by bar:
    import re

    strings = ["hello foo",     # returns False
               "hello foobar"]  # returns True

    for string in strings:
        pattern = re.search(r'foo(?=bar)', string)
        if pattern:
            print('True')
        else:
            print('False')
On its own this doesn't seem very useful, since we could simply search for foobar directly. However, a lookahead can also be negative. The following example matches foo if and only if it is not followed by bar.
    import re

    strings = ["hello foo",     # returns True
               "hello foobar",  # returns False
               "hello foobaz"]  # returns True

    for string in strings:
        pattern = re.search(r'foo(?!bar)', string)
        if pattern:
            print('True')
        else:
            print('False')
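What makes a lookahead different from simply searching for foobar is that the text it checks is not consumed. The sketch below (sample strings chosen for illustration) shows that the match covers only foo, and that a negative lookahead can be used to replace just the occurrences that are not followed by bar.

```python
import re

# The lookahead checks for 'bar' but does not include it in the match.
match = re.search(r'foo(?=bar)', 'hello foobar')
print(match.group())  # foo
print(match.span())   # (6, 9)

# Replace only the 'foo' occurrences that are NOT followed by 'bar'.
replaced = re.sub(r'foo(?!bar)', 'X', 'foo foobar foobaz')
print(replaced)  # X foobar Xbaz
```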
A lookbehind assertion works the same way, but it looks at what precedes the current match position. Use (?<= for a positive assertion and (?<! for a negative one.
The following pattern matches bar only when it is not preceded by foo.
    import re

    strings = ["hello bar",     # returns True
               "hello foobar",  # returns False
               "hello bazbar"]  # returns True

    for string in strings:
        pattern = re.search(r'(?<!foo)bar', string)
        if pattern:
            print('True')
        else:
            print('False')
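A common use of lookbehind is to pick out values by the marker in front of them without including the marker itself. The following sketch uses a made-up sentence. Note that Python's re module requires a lookbehind to have a fixed width, so something like (?<=a+) is rejected when the pattern is compiled.

```python
import re

text = 'cost $30 and 40 euros and $5'

# Positive lookbehind: digits that come right after a dollar sign.
dollars = re.findall(r'(?<=\$)\d+', text)
print(dollars)  # ['30', '5']

# Negative lookbehind: digit runs not preceded by '$' or by another digit.
others = re.findall(r'(?<![$\d])\d+', text)
print(others)  # ['40']
```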
Conditional (if-then-else) patterns
Regular expressions can test a condition. The general format is:
    (?(?=regex)then|else)
The condition can be a number, in which case it refers to a previously captured group.
For example, we can use this regular expression to check for matching opening and closing angle brackets:
    import re

    strings = ["<pypix>",  # returns True
               "<foo",     # returns False
               "bar>",     # returns False
               "hello"]    # returns True

    for string in strings:
        pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
        if pattern:
            print('True')
        else:
            print('False')
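In the example above, 1 refers to the group (<), which can of course be empty because it is followed by a question mark. The closing angle bracket is required only when that condition is true, that is, when the group matched.
In some regex flavors the condition can also be a lookaround assertion. Python's re module only accepts a group number or a group name as the condition.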
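The else branch can be used as well. As a small sketch with an invented mini-format: a number must either be wrapped in parentheses or, if it is not, end with a semicolon.

```python
import re

# If group 1 (the opening parenthesis) matched, require ')'; otherwise require ';'.
pattern = re.compile(r'^(\()?\d+(?(1)\)|;)$')

samples = ['(12)', '12;', '(12;', '12)', '12']
results = [bool(pattern.search(sample)) for sample in samples]

for sample, ok in zip(samples, results):
    print(sample, ok)
```

Only the first two samples match: the others either open a parenthesis without closing it or lack the semicolon that the else branch demands.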
Non-capturing groups
A group, enclosed in parentheses, is captured into an array so that it can be referenced later. But we can also choose not to capture it.
Let's look at a very simple example:
    import re

    string = 'Hello foobar'
    pattern = re.search(r'(f.*)(b.*)', string)
    print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
    print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
Now let's change it a little and add another group, (H.*), in front:
    import re

    string = 'Hello foobar'
    pattern = re.search(r'(H.*)(f.*)(b.*)', string)
    print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello
    print("b* => {0}".format(pattern.group(2)))  # prints b* => foo
The pattern array has changed, and depending on how we use those values in our code, this may break our script. Now we have to find every place where the pattern array is used and adjust the indices accordingly. If we are not actually interested in the content of the newly added group, we can make it "non-capturing", like this:
    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)
    print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
    print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
By adding ?: at the start of the group, we no longer capture it in the pattern array, so the other values in the array do not need to shift.
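Non-capturing groups are especially handy when parentheses are needed only for alternation or repetition. The sketch below uses made-up sample names to compare what groups() and findall() return with and without ?:.

```python
import re

text = 'Hello Ms. Smith'

# With a capturing group, the title occupies index 1 and the name index 2.
capturing = re.search(r'(Mr|Ms)\. (\w+)', text)
print(capturing.groups())      # ('Ms', 'Smith')

# With (?:...), the alternation still works but only the name is captured.
non_capturing = re.search(r'(?:Mr|Ms)\. (\w+)', text)
print(non_capturing.groups())  # ('Smith',)

# findall() returns just the captured part when there is exactly one group.
names = re.findall(r'(?:Mr|Ms)\. (\w+)', 'Mr. Jones and Ms. Smith')
print(names)                   # ['Jones', 'Smith']
```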
Named groups
Like the previous technique, named groups are another way to keep us out of this trap. We can give groups names and then refer to them by name instead of by array index. The format is: (?P<name>pattern)
We can rewrite the previous example, just like this:
    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
    print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
    print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
Now we can add another group without affecting the other existing groups in the pattern array:
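    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)
    print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
    print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar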
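Names pay off in a few more places. As a brief sketch with an invented date string: groupdict() returns all named groups at once, and a replacement string can refer to a group with \g<name>.

```python
import re

date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')

match = date.search('Date: 2014-03')
print(match.group('year'))  # 2014
print(match.groupdict())    # {'year': '2014', 'month': '03'}

# Refer to the groups by name inside the replacement string.
swapped = date.sub(r'\g<month>/\g<year>', 'Date: 2014-03')
print(swapped)              # Date: 03/2014
```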
Using callback functions
In Python, re.sub() accepts a callback function that computes each replacement in a regular expression substitution.
Let's take a look at this example, an e-mail template:
    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase is [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # assume dic has all the replacement data
    # such as dic['first_name'], dic['product_price'], etc.
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$",
        "ship_price": "$",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn",
    }

    # Non-greedy, so the match stops at the first closing bracket
    result = re.compile(r'\[(.*?)\]')
    # Replaces only the first placeholder, with a hard-coded value
    print(result.sub('John', template, count=1))
Notice that every placeholder has something in common: it is enclosed in a pair of square brackets. We can capture them all with a single regular expression and let a callback function handle each specific substitution.
So using a callback function is a better approach:
    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase is [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # assume dic has all the replacement data
    # such as dic['first_name'], dic['product_price'], etc.
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$",
        "ship_price": "$",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn",
    }

    def multiple_replace(dic, text):
        # Build one alternation of all escaped placeholders: \[first_name\]|\[last_name\]|...
        pattern = "|".join(map(lambda key: re.escape("[" + key + "]"), dic.keys()))
        # The callback strips the brackets from the match and looks up the value
        return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

    print(multiple_replace(dic, template))
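A variation on the same idea, sketched here under the assumption that placeholder names consist only of word characters: instead of building the pattern from the dictionary keys, match any bracketed name and let the callback decide what to do. The function name fill_template and the choice to leave unknown placeholders untouched are illustrative, not part of the original example.

```python
import re


def fill_template(template, values):
    """Replace [name] placeholders using `values`; unknown names are left as-is."""
    def replace(match):
        key = match.group(1)                    # the name without the brackets
        return values.get(key, match.group(0))  # fall back to the original text
    return re.sub(r'\[(\w+)\]', replace, template)


message = fill_template(
    'Hello [first_name] [last_name], order [order_id]',
    {'first_name': 'John', 'last_name': 'Doe'},
)
print(message)  # Hello John Doe, order [order_id]
```

Because the callback receives the match object, it can apply any logic it likes, such as formatting numbers or logging missing keys.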
Don't reinvent the wheel
It may be even more important to know when not to use regular expressions. In many cases you can find alternative tools.
Parsing [X]HTML
An answer on Stack Overflow, with a brilliant explanation, tells us why we shouldn't use regular expressions to parse [X]HTML.
You should use an HTML parser instead, and Python has plenty of options:
- ElementTree is part of the standard library
- BeautifulSoup is a popular third-party library
- lxml is a full-featured, fast, C-based library
The latter two handle even malformed HTML gracefully, which is a blessing given the large number of ugly websites out there.
An example of ElementTree:
    from xml.etree import ElementTree

    tree = ElementTree.parse('filename.html')
    # iter() walks the whole tree; findall('h1') would only see direct children of the root
    for element in tree.iter('h1'):
        print(ElementTree.tostring(element, encoding='unicode'))
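For a version that runs without a file on disk, here is a minimal sketch that parses a made-up, well-formed snippet from a string. ElementTree is an XML parser, so it expects well-formed markup.

```python
from xml.etree import ElementTree

# Illustrative document; it must be well-formed for ElementTree to parse it.
document = ('<html><body><h1>First</h1><p>Some text</p>'
            '<h1>Second</h1></body></html>')

root = ElementTree.fromstring(document)

# iter() visits matching elements at any depth below the root.
headings = [h1.text for h1 in root.iter('h1')]
print(headings)  # ['First', 'Second']
```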
Other tools
There are a number of other tools to consider before using regular expressions.
Thanks for reading!
Advanced Regular Expression Technology (Python version)