Advanced Regular Expression Techniques (Python version)

Source: Internet
Author: User
A regular expression is a Swiss Army knife for finding patterns in text. Regular expressions offer a large toolbox, parts of which are often overlooked or underused. Today I will show you some advanced ways to use them.


For example, this is a regular expression that we might use to detect telephone numbers in the US:

    r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add some comments and spaces to make it more readable.

    r'^'
    r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
    r'(\()?'        # optional opening parenthesis
    r'\d{3}'        # the area code
    r'(?(2)\))'     # if there was an opening parenthesis, close it
    r'[-\s.]?'      # followed by '-' or '.' or space
    r'\d{3}'        # first 3 digits
    r'[-\s.]?'      # followed by '-' or '.' or space
    r'\d{4}$'       # last 4 digits


Let's put it in a code snippet:

    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123-6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
                           r'(\()?'        # optional opening parenthesis
                           r'\d{3}'        # the area code
                           r'(?(2)\))'     # if there was an opening parenthesis, close it
                           r'[-\s.]?'      # followed by '-' or '.' or space
                           r'\d{3}'        # first 3 digits
                           r'[-\s.]?'      # followed by '-' or '.' or space
                           r'\d{4}$',      # last 4 digits
                           number)
        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))


Output:

    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123-6789 is not valid
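The commented pattern above is built from adjacent string literals. As an alternative sketch, the standard re.VERBOSE flag lets the comments live inside a single pattern string. The helper name is_valid_phone is hypothetical and only used for illustration here:

```python
import re

# Same phone-number pattern, written with re.VERBOSE: whitespace and
# '#' comments inside the pattern are ignored by the regex engine.
PHONE = re.compile(r"""
    ^
    (1[-\s.])?      # optional '1-', '1.' or '1 '
    (\()?           # optional opening parenthesis
    \d{3}           # the area code
    (?(2)\))        # if there was an opening parenthesis, close it
    [-\s.]?         # followed by '-' or '.' or space
    \d{3}           # first 3 digits
    [-\s.]?         # followed by '-' or '.' or space
    \d{4}$          # last 4 digits
""", re.VERBOSE)


def is_valid_phone(number):
    """Hypothetical helper: True if number matches the pattern."""
    return PHONE.match(number) is not None


for number in ["123 555 6789", "(123-555-6789"]:
    print(number, is_valid_phone(number))
```

Compiling the pattern once also avoids passing it to re.match again on every loop iteration.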


Regular expressions are a great feature of Python, but they are hard to debug, and it is easy to get them wrong.


Fortunately, Python lets you pass the re.DEBUG flag (which is actually the integer 128) to re.compile or re.match to print the parse tree of the regular expression.

    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123-6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'   # optional '1-', '1.' or '1 '
                           r'(\()?'        # optional opening parenthesis
                           r'\d{3}'        # the area code
                           r'(?(2)\))'     # if there was an opening parenthesis, close it
                           r'[-\s.]?'      # followed by '-' or '.' or space
                           r'\d{3}'        # first 3 digits
                           r'[-\s.]?'      # followed by '-' or '.' or space
                           r'\d{4}$',      # last 4 digits
                           number, re.DEBUG)
        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))


Parse tree (as printed by Python 2; newer versions lay it out differently). The repeated max_repeat 0 2147483648 entries over category_space come from a variant of the pattern that also allows optional whitespace (\s*) between the parts:

    at at_beginning
    max_repeat 0 1
      subpattern 1
        literal 49
        in
          literal 45
          category category_space
          literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      subpattern 2
        literal 40
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    subpattern None
      groupref_exists 2
        literal 41
    None
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 4 4
      in
        category category_digit
    at at_end
    max_repeat 0 2147483648
      in
        category category_space
    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123-6789 is not valid

Greedy and non-greedy matching


Before I explain this concept, I would like to show an example first. We want to find anchor tags in a piece of HTML text:

    import re

    html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)


The result is as expected:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

    import re

    html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
           'Hello <a href="http://example.com" title="example">Example</a>'

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)


At first glance this looks like it should work again. But don't be fooled! Because both anchor tags are on the same line, the result is no longer what we want:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="example">Example</a>']

This time the pattern matches from the first opening tag to the last closing tag, and everything in between, as a single match instead of two separate matches. This is because quantifiers are "greedy" by default.


In greedy mode, quantifiers (such as * and +) match as many characters as possible.


When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.

    import re

    html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
           'Hello <a href="http://example.com" title="example">Example</a>'

    m = re.findall(r'<a.*?>.*?</a>', html)
    if m:
        print(m)


Now the result is correct.

    ['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="example">Example</a>']
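To see the difference in isolation, here is a minimal sketch that runs both quantifier forms over the same short input. The sample text is made up for illustration:

```python
import re

text = '<b>one</b> and <b>two</b>'

# Greedy: .* runs on to the last '</b>' in the string.
greedy = re.findall(r'<b>.*</b>', text)

# Non-greedy: .*? stops at the first '</b>' it can reach.
lazy = re.findall(r'<b>.*?</b>', text)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```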

Lookahead and lookbehind assertions


A lookahead assertion checks what follows the current match position without making it part of the match. This is best explained with an example.


The following pattern first matches foo and then checks whether it is followed by bar:

    import re

    strings = ["hello foo",      # returns False
               "hello foobar"]   # returns True

    for string in strings:
        pattern = re.search(r'foo(?=bar)', string)
        if pattern:
            print('True')
        else:
            print('False')


This doesn't seem very useful, since we could simply test for foobar directly. However, a lookahead can also be negated. The following example matches foo if and only if it is not followed by bar.

    import re

    strings = ["hello foo",      # returns True
               "hello foobar",   # returns False
               "hello foobaz"]   # returns True

    for string in strings:
        pattern = re.search(r'foo(?!bar)', string)
        if pattern:
            print('True')
        else:
            print('False')
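What sets a lookahead apart from simply matching foobar is that the text it checks is not consumed. A small sketch, using made-up input strings, shows that the match covers only foo, which is useful in substitutions:

```python
import re

# The lookahead checks for 'bar' but does not include it in the match.
match = re.search(r'foo(?=bar)', 'hello foobar')
print(match.group())  # foo
print(match.span())   # (6, 9)

# With a negative lookahead, only the 'foo' occurrences that are not
# followed by 'bar' are replaced; the text after them is left untouched.
renamed = re.sub(r'foo(?!bar)', 'FOO', 'foo foobar foobaz')
print(renamed)  # FOO foobar FOObaz
```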


A lookbehind assertion is similar, but it looks at what precedes the current position. Use (?<=...) for a positive assertion and (?<!...) for a negative one.


The following pattern matches a bar that is not preceded by foo.

    import re

    strings = ["hello bar",      # returns True
               "hello foobar",   # returns False
               "hello bazbar"]   # returns True

    for string in strings:
        pattern = re.search(r'(?<!foo)bar', string)
        if pattern:
            print('True')
        else:
            print('False')
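The example above covers only the negative form. A minimal sketch of the positive form (?<=...), on a made-up input string, extracts the digits that directly follow a dollar sign. Note that Python's re module requires the lookbehind pattern to have a fixed width:

```python
import re

text = 'total: $42, discount: 10%, shipping: $7'

# Match digits only when they are immediately preceded by '$'.
# The '$' itself is checked but not included in the result.
prices = re.findall(r'(?<=\$)\d+', text)
print(prices)  # ['42', '7']
```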


Conditional (if-then-else) patterns

Regular expressions can test a condition. The format is as follows:

    (?(?=regex)then|else)

The condition can be a number, which refers to a previously captured group: (?(1)then|else).


For example, we can use this regular expression to detect open and closed angle brackets:

    import re

    strings = ["<pypix>",    # returns True
               "<foo",       # returns False
               "bar>",       # returns False
               "hello"]      # returns True

    for string in strings:
        pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
        if pattern:
            print('True')
        else:
            print('False')


In the example above, 1 refers to the group (<), which may match nothing because it is followed by a question mark. The closing angle bracket is required only when that group actually matched.

In some regex flavors the condition can also be a lookaround assertion; Python's re module accepts only a group number or name.
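The conditional can also carry an else branch. The sketch below is adapted from the example in the re module documentation, which itself calls it a poor email matcher: if the opening bracket matched, a closing bracket is required; otherwise the string must end there. The helper name check is hypothetical:

```python
import re

# (?(1)>|$): if group 1 ('<') matched, require '>', else require end of string.
EMAIL = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')


def check(candidate):
    """Hypothetical helper: True if candidate matches from its start."""
    return EMAIL.match(candidate) is not None


for candidate in ['<user@host.com>', 'user@host.com',
                  '<user@host.com', 'user@host.com>']:
    print(candidate, check(candidate))
```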

Non-capturing groups

A group, enclosed in parentheses, captures the text it matches so that it can be referenced later. But we can also choose not to capture it.

Let's look at a very simple example:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(f.*)(b.*)', string)
    print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
    print("b* => {0}".format(pattern.group(2)))  # prints b* => bar


Now let's make a small change and add another group (H.*) in front:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(H.*)(f.*)(b.*)', string)
    print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello
    print("b* => {0}".format(pattern.group(2)))  # prints b* => foo


The group numbering has changed, and depending on how our code uses those numbers, the script may stop working correctly. We would now have to find every place where the groups are referenced and adjust the indices. If we are not interested in the content of the newly added group, we can make it non-capturing, like this:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)
    print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
    print("b* => {0}".format(pattern.group(2)))  # prints b* => bar


By adding ?: at the start of the group, we no longer capture it, so the numbers of the other groups do not shift.
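The effect on numbering is easy to see by comparing what groups() returns for both variants. This is a small sketch using the same input string as above:

```python
import re

string = 'Hello foobar'

capturing = re.search(r'(H.*)(f.*)(b.*)', string)
non_capturing = re.search(r'(?:H.*)(f.*)(b.*)', string)

# With a capturing first group, every later group moves up by one.
print(capturing.groups())      # ('Hello ', 'foo', 'bar')

# With (?:...), the first group still has to match but is not numbered.
print(non_capturing.groups())  # ('foo', 'bar')
```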

Named groups

Like the previous technique, this is another way to avoid that trap. We can name the groups and then refer to them by name instead of by index. The format is: (?P<name>pattern). We can rewrite the previous example like this:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
    print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
    print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar

Now we can add another group without affecting how the existing groups are referenced:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?P
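The snippet above is cut off in the source. As a hedged sketch of the idea it describes, the code below adds a third named group in front of the other two. The group name hstar is hypothetical, chosen here only to match the naming of fstar and bstar:

```python
import re

string = 'Hello foobar'

# 'hstar' is a hypothetical name for the newly added leading group.
pattern = re.search(r'(?P<hstar>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

# Lookups by name are unaffected by the extra group.
print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar

# groupdict() returns all named groups at once.
print(pattern.groupdict())
```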

Using callback functions

In Python, re.sub() can take a callback function as the replacement.

Let's take a look at this example, an e-mail template:

    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase is [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # Assume dic has all the replacement data,
    # such as dic['first_name'], dic['product_price'], etc.
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$0",
        "ship_price": "$",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn",
    }

    # Non-greedy, so that only the first placeholder is replaced.
    result = re.compile(r'\[(.*?)\]')
    print(result.sub('John', template, count=1))


Notice that every placeholder has something in common: it is enclosed in a pair of square brackets. We can capture them all with a single regular expression and use a callback function to handle each specific substitution.


So using a callback function is a better approach:

    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase is [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # Assume dic has all the replacement data,
    # such as dic['first_name'], dic['product_price'], etc.
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$0",
        "ship_price": "$",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn",
    }

    def multiple_replace(dic, text):
        # Build one alternation of all escaped placeholders: \[first_name\]|\[last_name\]|...
        pattern = "|".join(map(lambda key: re.escape("[" + key + "]"), dic.keys()))
        # The callback strips the brackets from the match and looks up the replacement.
        return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

    print(multiple_replace(dic, template))
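A variation on the same idea, sketched under the assumption that placeholder names contain only word characters: one generic pattern captures every bracketed name and the callback looks it up. The names values and fill, and the shortened template, are hypothetical:

```python
import re

template = ("Hello [first_name] [last_name], your order ships in "
            "[ship_days_min] days. [unknown_key]")
values = {"first_name": "John", "last_name": "Doe", "ship_days_min": "1"}


def fill(match):
    """Callback for re.sub: receives a match object, returns the replacement."""
    key = match.group(1)
    # Leave placeholders without a value untouched instead of raising KeyError.
    return values.get(key, match.group(0))


result = re.sub(r'\[(\w+)\]', fill, template)
print(result)  # Hello John Doe, your order ships in 1 days. [unknown_key]
```

Compared with building an alternation from the dictionary keys, this form does not need to rebuild the pattern when the data changes, at the cost of also visiting bracketed words that are not placeholders.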


Don't reinvent the wheel

It may be even more important to know when not to use regular expressions. In many cases, better tools are available.

Parsing [X]HTML

An answer on Stack Overflow gives a wonderful explanation of why we shouldn't use regular expressions to parse [X]HTML.

You should use an HTML parser instead; Python has plenty of options:

    • ElementTree is part of the standard library

    • BeautifulSoup is a popular third-party library

    • lxml is a full-featured, fast, C-based library

The last two handle even malformed HTML gracefully, which is a blessing given how many messy websites there are.

An example of ElementTree:

    from xml.etree import ElementTree

    tree = ElementTree.parse('filename.html')
    for element in tree.findall('h1'):
        print(ElementTree.tostring(element))
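To tie this back to the anchor-tag example from the greedy matching section, here is a minimal sketch that extracts the same links with ElementTree instead of a regular expression. It assumes the markup is well formed, and the wrapping <p> element is added here only so that the snippet has a single root:

```python
from xml.etree import ElementTree

# Assumption: well-formed markup with a single root element.
snippet = ('<p>Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
           'Hello <a href="http://example.com" title="example">Example</a></p>')

root = ElementTree.fromstring(snippet)

# iter('a') walks every <a> element, however deeply nested.
links = [(a.get('href'), a.text) for a in root.iter('a')]
print(links)  # [('http://pypix.com', 'Pypix'), ('http://example.com', 'Example')]
```

Each anchor comes back as its own element, so the greedy versus non-greedy question never comes up.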

Other

There are a number of other tools to consider before using regular expressions.
