Advanced regular expression techniques (Python version)
Regular expressions are the Swiss Army knife for searching text for specific patterns. They offer a huge toolbox, parts of which are often overlooked or underused. Today I will show you some advanced uses of regular expressions.
For example, this is a regular expression we might use to detect US phone numbers:
r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'
We can add some comments and spaces to make it more readable.
r'^'
r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
r'(\()?'       # optional opening parenthesis
r'\d{3}'       # the area code
r'(?(2)\))'    # if there was an opening parenthesis, close it
r'[-\s.]?'     # followed by '-' or '.' or space
r'\d{3}'       # first 3 digits
r'[-\s.]?'     # followed by '-' or '.' or space
r'\d{4}$'      # last 4 digits
Let's put it in a code snippet:
import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    pattern = re.match(r'^'
                       r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                       r'(\()?'       # optional opening parenthesis
                       r'\d{3}'       # the area code
                       r'(?(2)\))'    # if there was an opening parenthesis, close it
                       r'[-\s.]?'     # followed by '-' or '.' or space
                       r'\d{3}'       # first 3 digits
                       r'[-\s.]?'     # followed by '-' or '.' or space
                       r'\d{4}$',     # last 4 digits
                       number)
    if pattern:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))
Output:
123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
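Another way to get the same readability, sketched here as an alternative to concatenating commented string fragments, is the re.VERBOSE flag, which makes the regex engine ignore whitespace and `#` comments inside the pattern itself. The names `phone_re` and `results` are illustrative only.

```python
import re

# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored,
# so the whole expression can live in one triple-quoted raw string.
phone_re = re.compile(r"""
    ^
    (1[-\s.])?   # optional '1-', '1.' or '1 '
    (\()?        # optional opening parenthesis
    \d{3}        # the area code
    (?(2)\))     # if there was an opening parenthesis, close it
    [-\s.]?      # followed by '-' or '.' or space
    \d{3}        # first 3 digits
    [-\s.]?      # followed by '-' or '.' or space
    \d{4}$       # last 4 digits
""", re.VERBOSE)

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

# Map each candidate number to True (valid) or False (not valid).
results = {number: bool(phone_re.match(number)) for number in numbers}
for number, valid in results.items():
    print('{0} is {1}'.format(number, 'valid' if valid else 'not valid'))
```

The pattern is the same as before; only the way it is written down changes.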
Regular expressions are a great feature of Python, but they are error-prone and hard to debug.
Fortunately, Python lets you pass the re.DEBUG flag (actually the integer 128) to re.compile or re.match, which prints the parse tree of the regular expression.
import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

# compile once with re.DEBUG so the parse tree is printed a single time
phone_re = re.compile(r'^'
                      r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                      r'(\()?'       # optional opening parenthesis
                      r'\d{3}'       # the area code
                      r'(?(2)\))'    # if there was an opening parenthesis, close it
                      r'[-\s.]?'     # followed by '-' or '.' or space
                      r'\d{3}'       # first 3 digits
                      r'[-\s.]?'     # followed by '-' or '.' or space
                      r'\d{4}$',     # last 4 digits
                      re.DEBUG)

for number in numbers:
    if phone_re.match(number):
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))
Parse tree (the exact format varies between Python versions; the \s* nodes in this dump, shown as max_repeat 0 2147483648 over category_space, suggest it was produced from a variant of the pattern that allows optional whitespace between the parts):
at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
Greedy and non-greedy
Before explaining this concept, I would like to show an example. We want to find the anchor tags in a piece of HTML text:
import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)
The result is as expected:
['<a href="http://pypix.com" title="pypix">Pypix</a>']
Let's change the input and add a second anchor tag:
import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title="example">Example</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)
It is tempting to expect the same kind of result again, but don't be fooled. With two anchor tags on the same line, the pattern no longer works correctly:
['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="example">Example</a>']
The pattern matched the first opening tag, the last closing tag, and everything in between as a single match instead of two separate matches. This is because the default matching mode is "greedy".
In greedy mode, quantifiers (such as * and +) match as many characters as possible.
When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.
import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title="example">Example</a>'
m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)
The result is correct now:
['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="example">Example</a>']
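To see the difference in isolation, here is a minimal sketch that runs a greedy and a non-greedy quantifier over the same short string. The sample text and the variable names are made up for illustration.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as it can, so the match runs from the
# first "<" to the very last ">".
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r'<.+?>', text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```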
Lookahead and lookbehind assertions
A lookahead assertion matches the current pattern and then checks whether another pattern follows, without consuming it. This is best explained with an example.
The following pattern matches foo and then checks that it is followed by bar:
import re

strings = ["hello foo",      # prints False
           "hello foobar"]   # prints True

for string in strings:
    pattern = re.search(r'foo(?=bar)', string)
    if pattern:
        print('True')
    else:
        print('False')
This seems pointless, since we could simply search for foobar directly. However, a lookahead can also be negative. The following example matches foo if and only if it is not followed by bar.
import re

strings = ["hello foo",      # prints True
           "hello foobar",   # prints False
           "hello foobaz"]   # prints True

for string in strings:
    pattern = re.search(r'foo(?!bar)', string)
    if pattern:
        print('True')
    else:
        print('False')
A lookbehind assertion is similar, but it looks at what precedes the current match. Use (?<= for a positive lookbehind and (?<! for a negative one.
The following pattern matches a bar that is not preceded by foo.
import re

strings = ["hello bar",      # prints True
           "hello foobar",   # prints False
           "hello bazbar"]   # prints True

for string in strings:
    pattern = re.search(r'(?<!foo)bar', string)
    if pattern:
        print('True')
    else:
        print('False')
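For completeness, here is a small sketch of the positive lookbehind, (?<=...), which the examples above do not exercise. The price string and the variable names are invented for illustration. Note that Python's re module requires the lookbehind pattern to have a fixed width.

```python
import re

strings = ["hello bar", "hello foobar", "hello bazbar"]

# (?<=foo)bar matches "bar" only when it is immediately preceded by "foo".
preceded_by_foo = [bool(re.search(r'(?<=foo)bar', s)) for s in strings]
print(preceded_by_foo)  # [False, True, False]

# A lookbehind is handy for extracting a value without its prefix:
# match the digits that follow a dollar sign, but not the sign itself.
prices = re.findall(r'(?<=\$)\d+', "The cost was $500 plus $10 for shipping")
print(prices)  # ['500', '10']
```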
Conditional (if-then-else) patterns
Regular expressions support conditional matching. In Python the format is as follows:
(?(id/name)yes-pattern|no-pattern)
The condition is the number or name of a previously captured group. If that group matched, yes-pattern is tried; otherwise the optional no-pattern is tried.
For example, we can use this regular expression to check that opening and closing angle brackets are paired:
import re

strings = ["<pypix>",   # prints True
           "<foo",      # prints False
           "bar>",      # prints False
           "hello"]     # prints True

for string in strings:
    pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if pattern:
        print('True')
    else:
        print('False')
In the preceding example, group 1 is (<). It may also match nothing, because it is followed by a question mark. The closing angle bracket is required only when group 1 has matched.
In some other regex engines (PCRE, for example) the condition can also be a lookaround assertion, written (?(?=regex)then|else). Python's re module only accepts a group number or name as the condition.
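As a further sketch, the condition can refer to a named group, and an else branch can be supplied after the `|`. The rule used here (a bracketed word must close with `>`, an unbracketed word must end with `!`) and the group name `open` are invented purely to exercise both branches.

```python
import re

# If the named group "open" matched, require ">" (yes-pattern);
# otherwise require "!" (no-pattern).
pattern = re.compile(r'^(?P<open><)?[a-z]+(?(open)>|!)$')

strings = ["<pypix>", "hello!", "hello", "<pypix!"]
results = [bool(pattern.search(s)) for s in strings]
print(results)  # [True, True, False, False]
```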
Non-capturing groups
A group, enclosed in parentheses, captures what it matches so that it can be referenced later. But we can also choose not to capture it.
Let's take a look at a very simple example:
import re

string = 'Hello foobar'
pattern = re.search(r'(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
Now let's change it a little and add another group, (H.*), at the front:
import re

string = 'Hello foobar'
pattern = re.search(r'(H.*)(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello
print("b* => {0}".format(pattern.group(2)))  # prints b* => foo
The group numbering has changed, and depending on how we use these groups in the code, that may break our script. Now we have to find every place in the code where the groups are used and adjust the indexes accordingly. If we are not really interested in the content of the newly added group, we can make it non-capturing, like this:
import re

string = 'Hello foobar'
pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar
By adding ?: at the start of the group, we no longer capture it, so the numbers of the other groups do not shift.
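Non-capturing groups are also useful when parentheses are needed only for grouping, for example around an alternation. The following sketch, using a made-up sample sentence, shows how re.findall returns only the captured groups, so switching the scheme group to (?:...) changes the shape of the result.

```python
import re

text = "see http://example.com and ftp://files.example.org"

# Both groups capture: findall returns a tuple per match.
with_capture = re.findall(r'(https?|ftp)://(\S+)', text)

# The scheme group is non-capturing: findall returns only the host part.
without_capture = re.findall(r'(?:https?|ftp)://(\S+)', text)

print(with_capture)     # [('http', 'example.com'), ('ftp', 'files.example.org')]
print(without_capture)  # ['example.com', 'files.example.org']
```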
Named groups
Like the technique in the previous example, this is another way to keep us out of that trap. We can give groups names and then refer to them by name instead of by index. The format is (?P<name>pattern).
We can rewrite the previous example as follows:
import re

string = 'Hello foobar'
pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
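Now we can add another group without affecting how the existing groups are referenced:
import re

string = 'Hello foobar'
# 'hello' is an illustrative name for the newly added group
pattern = re.search(r'(?P<hello>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar

Using callback functions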
In Python, re.sub() accepts a callback function as the replacement.
Let's take a look at an example. This is an e-mail template:
import re

template = "Hello [first_name] [last_name], \
Thank you for purchasing [product_name] from [store_name]. \
The total cost of your purchase was [product_price] plus [ship_price] for shipping. \
You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. \
Sincerely, \
[store_manager_name]"

# assume dic has all the replacement data,
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

result = re.compile(r'\[(.*)\]')
print(result.sub('John', template, count=1))
Note that the placeholders have one thing in common: each is enclosed in a pair of square brackets. We can capture them all with a single regular expression and let a callback function handle the specific replacement.
Therefore, using a callback function is the better approach:
import re

template = "Hello [first_name] [last_name], \
Thank you for purchasing [product_name] from [store_name]. \
The total cost of your purchase was [product_price] plus [ship_price] for shipping. \
You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. \
Sincerely, \
[store_manager_name]"

# assume dic has all the replacement data,
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

def multiple_replace(dic, text):
    # build one alternation of all escaped "[key]" placeholders
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # the callback strips the brackets and looks the key up in dic
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

print(multiple_replace(dic, template))
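A possible variation, sketched here under the assumption that placeholder names consist only of word characters, is to match any bracketed name with one generic pattern and let the callback decide what to do with it. The function name `fill_template` and the sample sentence are hypothetical.

```python
import re

def fill_template(text, values):
    """Replace every [name] placeholder whose name is a key in values."""
    def replace(match):
        key = match.group(1)
        # Unknown placeholders are left untouched instead of raising KeyError.
        return values.get(key, match.group(0))
    # \w+ cannot run across "]" the way a greedy .* would.
    return re.sub(r'\[(\w+)\]', replace, text)

filled = fill_template(
    "Hello [first_name] [last_name], your [item] shipped.",
    {"first_name": "John", "last_name": "Doe"},
)
print(filled)  # Hello John Doe, your [item] shipped.
```

Compared with building an alternation from the dictionary keys, this keeps the pattern constant and moves the lookup logic entirely into the callback.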
Do not reinvent the wheel
Perhaps even more important is knowing when not to use a regular expression. In many cases you can find a better tool.
A well-known Stack Overflow answer about parsing [X]HTML explains why [X]HTML should not be parsed with regular expressions.
You should use an HTML parser instead. Python has many options:
- ElementTree is part of the standard library
- BeautifulSoup is a popular third-party library
- lxml is a feature-rich C-based library
The latter two even handle malformed HTML gracefully, which is a blessing given the many ugly websites out there.
An example of ElementTree:
from xml.etree import ElementTree

tree = ElementTree.parse('filename.html')
# iter() walks the whole tree, so nested h1 elements are found too
for element in tree.iter('h1'):
    print(ElementTree.tostring(element, encoding='unicode'))
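ElementTree expects well-formed XML, so for ordinary HTML the standard library's html.parser module may be a better fit. The following is only a sketch: it revisits the anchor-tag task from the greedy matching section, and the class name `AnchorCollector` is made up for illustration.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

collector = AnchorCollector()
collector.feed(html)
collector.close()
print(collector.links)  # ['http://pypix.com', 'http://example.com']
```

Two anchors on the same line are no problem here, because the parser tracks tag boundaries itself.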
Other tools
There are many other tools to consider before using regular expressions.
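As a rough illustration of what such tools can look like, the sketch below uses plain string methods and the standard library's urllib.parse module for tasks that are sometimes solved with a regular expression. The URL and file name are invented sample data, and this selection of tools is my own, not a list from the original text.

```python
from urllib.parse import urlparse, parse_qs

# Simple containment and suffix checks need no regular expression.
has_foo = "foo" in "hello foobar"
is_csv = "report.csv".endswith(".csv")

# URLs are better taken apart with a dedicated parser.
parts = urlparse("http://pypix.com/tools/index.html?lang=python")
query = parse_qs(parts.query)

print(has_foo, is_csv)  # True True
print(parts.netloc)     # pypix.com
print(parts.path)       # /tools/index.html
print(query)            # {'lang': ['python']}
```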
Thank you for reading!