Advanced Regular Expression Techniques (Python Version)

Source: Internet
Author: User

Regular expressions are a Swiss Army knife for searching text for specific patterns. They offer a huge toolbox, parts of which are often overlooked or underused. Today I will show you some advanced regular expression techniques.

For example, this is a regular expression that we may use to detect American phone numbers:

r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add some comments and spaces to make it more readable.

r'^'
r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
r'(\()?'       # optional opening parenthesis
r'\d{3}'       # the area code
r'(?(2)\))'    # if there was an opening parenthesis, close it
r'[-\s.]?'     # followed by '-' or '.' or space
r'\d{3}'       # first 3 digits
r'[-\s.]?'     # followed by '-' or '.' or space
r'\d{4}$'      # last 4 digits
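Python can also keep such comments inside a single pattern string by using the re.VERBOSE flag, which ignores whitespace and treats # as the start of a comment. The following is a minimal sketch; the names PHONE_RE and is_valid_phone are chosen here for illustration only.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and everything
# after '#' on a line is a comment.
PHONE_RE = re.compile(r"""
    ^
    (1[-\s.])?    # optional '1-', '1.' or '1 '
    (\()?         # optional opening parenthesis
    \d{3}         # the area code
    (?(2)\))      # if there was an opening parenthesis, close it
    [-\s.]?       # followed by '-' or '.' or space
    \d{3}         # first 3 digits
    [-\s.]?       # followed by '-' or '.' or space
    \d{4}$        # last 4 digits
""", re.VERBOSE)


def is_valid_phone(number):
    """Return True if `number` matches the phone pattern above."""
    return PHONE_RE.match(number) is not None


print(is_valid_phone("(123).555.6789"))
print(is_valid_phone("(123-555-6789"))
```

Compiling the pattern once and reusing it also avoids repeating the long expression at every call site.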

Let's put it in a code snippet:

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    pattern = re.match(r'^'
                       r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                       r'(\()?'       # optional opening parenthesis
                       r'\d{3}'       # the area code
                       r'(?(2)\))'    # if there was an opening parenthesis, close it
                       r'[-\s.]?'     # followed by '-' or '.' or space
                       r'\d{3}'       # first 3 digits
                       r'[-\s.]?'     # followed by '-' or '.' or space
                       r'\d{4}$',     # last 4 digits
                       number)
    if pattern:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

Output:

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid

Regular expressions are a great feature of Python, but they are error-prone and hard to debug.

Fortunately, Python lets you pass the re.DEBUG flag (actually the integer 128) to re.compile or re.match, which prints the parse tree of the regular expression.

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    pattern = re.match(r'^'
                       r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                       r'(\()?'       # optional opening parenthesis
                       r'\d{3}'       # the area code
                       r'(?(2)\))'    # if there was an opening parenthesis, close it
                       r'[-\s.]?'     # followed by '-' or '.' or space
                       r'\d{3}'       # first 3 digits
                       r'[-\s.]?'     # followed by '-' or '.' or space
                       r'\d{4}$',     # last 4 digits
                       number, re.DEBUG)
    if pattern:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))
Parse tree
at at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
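For experimenting with smaller patterns, it can help to capture the debug output as a string instead of letting it print. The following is a rough sketch for Python 3; the helper name dump_parse_tree is hypothetical, and the exact layout of the dump differs between Python versions.

```python
import contextlib
import io
import re


def dump_parse_tree(pattern):
    """Return whatever re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    # re.DEBUG writes to standard output, so redirect it temporarily.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


# Inspect one small piece of the phone pattern at a time.
print(dump_parse_tree(r'\d{3}[-\s.]?'))
```

Reading the dump for one fragment at a time is usually easier than reading the tree for the whole expression.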
Greedy and non-greedy

Before explaining this concept, I would like to show an example. We want to find the anchor tag in a piece of HTML text:

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

The result is as expected:

['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title"example">Example</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

This time the result is not what we want. Because both anchor tags sit on the same line, the pattern returns one long match instead of two:

['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title"example">Example</a>']

The pattern matches the first opening tag, the last closing tag, and everything in between as a single match instead of two separate matches. This is because the default matching mode is "greedy".

In greedy mode, quantifiers (such as * and +) match as many characters as possible.

When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title"example">Example</a>'
m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)

The result is correct now.

['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title"example">Example</a>']
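The difference is easy to see side by side on a short string. The sketch below uses a made-up sample text and also shows a negated character class, a common alternative to the non-greedy quantifier for this kind of match.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' on the line.
greedy = re.findall(r'<.*>', text)

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.findall(r'<.*?>', text)

# Negated character class: never crosses a '>' in the first place.
negated = re.findall(r'<[^>]*>', text)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```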
Lookahead and lookbehind assertions

A lookahead assertion checks what follows the current match position without consuming it. An example explains it best.

The following pattern matches foo and then checks that it is followed by bar:

import re

strings = ["hello foo",      # returns False
           "hello foobar"]   # returns True

for string in strings:
    pattern = re.search(r'foo(?=bar)', string)
    if pattern:
        print('True')
    else:
        print('False')

This may seem useless, because we could simply search for foobar directly. However, a lookahead can also be negated. The following example matches foo if and only if it is not followed by bar.

import re

strings = ["hello foo",      # returns True
           "hello foobar",   # returns False
           "hello foobaz"]   # returns True

for string in strings:
    pattern = re.search(r'foo(?!bar)', string)
    if pattern:
        print('True')
    else:
        print('False')

A lookbehind assertion is similar, but it checks the text that precedes the current match position. Use (?<= for a positive lookbehind and (?<! for a negative one.

The following pattern matches a bar that is not preceded by foo.

import re

strings = ["hello bar",      # returns True
           "hello foobar",   # returns False
           "hello bazbar"]   # returns True

for string in strings:
    pattern = re.search(r'(?<!foo)bar', string)
    if pattern:
        print('True')
    else:
        print('False')
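The positive lookbehind has no example above, so here is a small sketch with a made-up sample string. It extracts only the numbers that directly follow a dollar sign, without including the dollar sign in the result.

```python
import re

text = "apple $30, pear 12, plum $7"

# (?<=\$) requires a '$' immediately before the digits,
# but the '$' itself is not part of the match.
prices = re.findall(r'(?<=\$)\d+', text)

print(prices)  # ['30', '7']
```

Note that Python's re module requires a lookbehind to match strings of a fixed length, so something like (?<=\$+) is rejected.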
Conditional (If-Then-Else) patterns

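Regular expressions also support conditional matching. In Python, the format is as follows:

(?(id/name)then|else)

The condition is the number or name of a previously defined capturing group. If that group matched, the then branch is used; otherwise the optional else branch is used.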

For example, we can use this regular expression to check that an opening angle bracket is paired with a closing one:

import re

strings = ["<pypix>",   # returns True
           "<foo",      # returns False
           "bar>",      # returns False
           "hello"]     # returns True

for string in strings:
    pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if pattern:
        print('True')
    else:
        print('False')

In the preceding example, group 1 is (<), which may also match nothing because it is followed by a question mark. The closing angle bracket is required only if group 1 matched.

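In some other regex engines the condition can also be a lookaround assertion, but Python's re module only accepts a group number or name.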
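A conditional can also carry an else branch. The sketch below adapts the deliberately simplified e-mail pattern from the Python re documentation; it is not a real e-mail validator, and the helper name is_bracket_consistent is chosen here for illustration.

```python
import re

# If group 1 (the '<') matched, require '>'; otherwise require end of string.
EMAIL_RE = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')


def is_bracket_consistent(text):
    """Return True if the address is bare or fully wrapped in <...>."""
    return EMAIL_RE.match(text) is not None


for candidate in ['<user@host.com>', 'user@host.com',
                  '<user@host.com', 'user@host.com>']:
    print(candidate, is_bracket_consistent(candidate))
```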

Non-capturing groups

A group, enclosed in parentheses, captures the text it matches so that it can be referenced later. But we can also choose not to capture it.

Let's take a look at a very simple example:

import re

string = 'Hello foobar'
pattern = re.search(r'(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar

Now let's change it a little and add another group, (H.*), at the front:

import re

string = 'Hello foobar'
pattern = re.search(r'(H.*)(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello
print("b* => {0}".format(pattern.group(2)))  # prints b* => foo

The group numbering has changed, and depending on how our code uses these groups, the script may stop working properly. We would now have to find every place where the groups are used and adjust the indexes accordingly. If we are not actually interested in the content of the newly added group, we can make it non-capturing, like this:

import re

string = 'Hello foobar'
pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)
print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar

Adding ?: at the start of a group means it is no longer captured, so the numbering of the other groups does not shift.
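Non-capturing groups also matter with re.findall, which returns the captured group rather than the whole match as soon as the pattern contains one. The sketch below uses a made-up sample string and a loose pattern for dotted numbers (it does not validate IP address ranges).

```python
import re

text = "hosts 10.0.0.1 and 192.168.1.20"

# Capturing group: findall returns only the last repetition of the group.
capturing = re.findall(r'(\d{1,3}\.){3}\d{1,3}', text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', text)

print(capturing)      # ['0.', '1.']
print(non_capturing)  # ['10.0.0.1', '192.168.1.20']
```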

Named groups

As in the previous example, this is another way to avoid that trap. We can name groups and then reference them by name instead of by index. The format is (?P<name>pattern). We can rewrite the previous example as follows:

import re

string = 'Hello foobar'
pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar

Now we can add another group without affecting how the existing groups are referenced:
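As a minimal sketch of that idea, the snippet below adds a third named group in front of the other two. The group name greeting is hypothetical and chosen only for this illustration.

```python
import re

string = 'Hello foobar'

# 'greeting' is an illustrative name; 'fstar' and 'bstar' are looked up
# by name, so adding a group in front does not disturb them.
match = re.search(r'(?P<greeting>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

print("f* => {0}".format(match.group('fstar')))  # prints f* => foo
print("b* => {0}".format(match.group('bstar')))  # prints b* => bar

# groupdict() returns all named groups at once.
print(match.groupdict())
```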

import re

string = 'Hello foobar'
pattern = re.search(r'(?P

Use callback functions

In Python, re.sub() accepts a callback function as the replacement argument.

Let's take a look at this example. This is an e-mail template:

import re

template = """Hello [first_name] [last_name],
Thank you for purchasing [product_name] from [store_name].
The total cost of your purchase was [product_price] plus [ship_price] for shipping.
You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days.
Sincerely,
[store_manager_name]"""

# assume dic has all the replacement data
# such as dic['first_name'] dic['product_price'] etc...
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

# naive approach: replace a single placeholder with a fixed value
result = re.compile(r'\[(.*?)\]')
print(result.sub('John', template, count=1))

Note that all the placeholders have one thing in common: each is enclosed in a pair of square brackets. We can use a single regular expression to capture them and a callback function to handle each specific replacement.

Therefore, using callback functions is a better method:

import re

template = """Hello [first_name] [last_name],
Thank you for purchasing [product_name] from [store_name].
The total cost of your purchase was [product_price] plus [ship_price] for shipping.
You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days.
Sincerely,
[store_manager_name]"""

# assume dic has all the replacement data
# such as dic['first_name'] dic['product_price'] etc...
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

def multiple_replace(dic, text):
    # build one alternation that matches any "[key]" placeholder
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # the callback strips the brackets and looks the key up in dic
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

print(multiple_replace(dic, template))
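A variation on the same idea uses one generic placeholder pattern and a named callback, which makes it easy to leave unknown placeholders untouched. This is only a sketch; the function name fill_template and the sample values are made up for illustration.

```python
import re


def fill_template(template, values):
    """Replace each [key] in `template` with values[key], if present."""
    def replace(match):
        key = match.group(1)
        # Fall back to the original "[key]" text when the key is unknown.
        return values.get(key, match.group(0))

    return re.sub(r'\[(\w+)\]', replace, template)


print(fill_template("Hello [first_name] [last_name], code [unknown]",
                    {"first_name": "John", "last_name": "Doe"}))
# prints Hello John Doe, code [unknown]
```

The trade-off is that this version scans for anything that looks like a placeholder, whereas multiple_replace only matches the keys present in the dictionary.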
Do not reinvent the wheel

More importantly, you should know when not to use a regular expression. In many cases, a better tool is available.

A well-known Stack Overflow answer on parsing [X]HTML explains why [X]HTML should not be parsed with regular expressions.

You should use an HTML parser instead. Python has many options:

  • ElementTree is part of the standard library
  • BeautifulSoup is a popular third-party library
  • lxml is a feature-rich C-based library

The last two handle even malformed HTML gracefully, which is a blessing given how many badly written websites exist.

An example of ElementTree:

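from xml.etree import ElementTree

# ElementTree expects well-formed markup such as XHTML
tree = ElementTree.parse('filename.html')
# './/h1' searches the whole tree, not just the children of the root
for element in tree.findall('.//h1'):
    print(ElementTree.tostring(element, encoding='unicode'))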
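To connect this back to the anchor tag example, here is a sketch that collects link targets with a parser instead of a regular expression. It uses html.parser from the Python 3 standard library, which is not one of the options listed above; the class name LinkCollector is made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['http://pypix.com', 'http://example.com']
```

Unlike the regular expression, this does not depend on how the tags are laid out across lines.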
Others

There are many other tools to consider before using regular expressions.

Thank you for reading!
