Advanced Regular Expression Technology (Python version)

Last Update:2017-02-07 Source: Internet

Author: User

Regular expressions are the Swiss army knife for finding particular patterns in text. They offer a huge toolbox, parts of which are often overlooked or underused. Today I will show you some advanced uses of regular expressions.

For example, this is a regular expression that we might use to detect telephone numbers in the US:

    r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add some comments and spaces to make it more readable.

    r'^'
    r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
    r'(\()?'       # optional opening parenthesis
    r'\d{3}'       # the area code
    r'(?(2)\))'    # if there was an opening parenthesis, close it
    r'[-\s.]?'     # followed by '-' or '.' or space
    r'\d{3}'       # first 3 digits
    r'[-\s.]?'     # followed by '-' or '.' or space
    r'\d{4}$'      # last 4 digits

Let's put it in a code snippet:

    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123-6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                           r'(\()?'       # optional opening parenthesis
                           r'\d{3}'       # the area code
                           r'(?(2)\))'    # if there was an opening parenthesis, close it
                           r'[-\s.]?'     # followed by '-' or '.' or space
                           r'\d{3}'       # first 3 digits
                           r'[-\s.]?'     # followed by '-' or '.' or space
                           r'\d{4}$',     # last 4 digits
                           number)
        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))

Output:

    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123-6789 is not valid
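
As an alternative to concatenating commented raw-string pieces, the same pattern can be written as one string with the re.VERBOSE flag, which makes the regex engine ignore whitespace and # comments inside the pattern. The following is a minimal sketch of that idea; the name PHONE is made up for illustration and is not part of the original example.

```python
import re

# Same phone-number pattern, written with re.VERBOSE so that whitespace
# and comments inside the pattern are ignored by the regex engine.
PHONE = re.compile(r"""
    ^
    (1[-\s.])?    # optional '1-', '1.' or '1 '
    (\()?         # optional opening parenthesis
    \d{3}         # the area code
    (?(2)\))      # if there was an opening parenthesis, close it
    [-\s.]?       # followed by '-' or '.' or space
    \d{3}         # first 3 digits
    [-\s.]?       # followed by '-' or '.' or space
    \d{4}$        # last 4 digits
""", re.VERBOSE)

for number in ["123 555 6789", "(123-555-6789"]:
    verdict = "valid" if PHONE.match(number) else "not valid"
    print('{0} is {1}'.format(number, verdict))
```

Compiling the pattern once also avoids repeating the long expression inside the loop.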

Regular expressions are a great feature of Python, but they are hard to debug, and it is easy to make mistakes in them.

Fortunately, you can pass the re.DEBUG flag (which is actually the integer 128) to re.compile or re.match to print the parse tree of the regular expression.

    import re

    numbers = ["123 555 6789",
               "1-(123)-555-6789",
               "(123-555-6789",
               "(123).555.6789",
               "123-6789"]

    for number in numbers:
        pattern = re.match(r'^'
                           r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                           r'(\()?'       # optional opening parenthesis
                           r'\d{3}'       # the area code
                           r'(?(2)\))'    # if there was an opening parenthesis, close it
                           r'[-\s.]?'     # followed by '-' or '.' or space
                           r'\d{3}'       # first 3 digits
                           r'[-\s.]?'     # followed by '-' or '.' or space
                           r'\d{4}$',     # last 4 digits
                           number, re.DEBUG)
        if pattern:
            print('{0} is valid'.format(number))
        else:
            print('{0} is not valid'.format(number))

Parse tree:

    at at_beginning
    max_repeat 0 1
      subpattern 1
        literal 49
        in
          literal 45
          category category_space
          literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      subpattern 2
        literal 40
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    subpattern None
      groupref_exists 2
        literal 41
    None
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 3 3
      in
        category category_digit
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 0 1
      in
        literal 45
        category category_space
        literal 46
    max_repeat 0 2147483648
      in
        category category_space
    max_repeat 4 4
      in
        category category_digit
    at at_end
    max_repeat 0 2147483648
      in
        category category_space
    123 555 6789 is valid
    1-(123)-555-6789 is valid
    (123-555-6789 is not valid
    (123).555.6789 is valid
    123-6789 is not valid

The dump was evidently produced from a variant of the pattern that also allows optional whitespace (\s*) between its parts; that is where the max_repeat 0 2147483648 entries over category_space come from.

Greedy and non-greedy matching

Before I explain this concept, I would like to show an example first. We want to find anchor tags in a piece of HTML text:

    import re

    html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)

The result is as expected:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

    import re

    html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
            'Hello <a href="http://example.com" title="Example">Example</a>')

    m = re.findall(r'<a.*>.*</a>', html)
    if m:
        print(m)

At first glance this looks fine again. But don't be fooled! Because both anchor tags are on the same line, the pattern no longer works correctly:

    ['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="Example">Example</a>']

The pattern matches from the first opening tag to the last closing tag, with everything in between, as one match instead of two separate ones. This is because matching is "greedy" by default.

In greedy mode, quantifiers (such as * and +) match as many characters as possible.

Adding a question mark after the quantifier (.*?) makes it "non-greedy".

    import re

    html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
            'Hello <a href="http://example.com" title="Example">Example</a>')

    m = re.findall(r'<a.*?>.*?</a>', html)
    if m:
        print(m)

Now the result is correct.

    ['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="Example">Example</a>']
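
The difference between greedy and non-greedy quantifiers is easier to see on a shorter string. The sketch below uses a made-up sample text and also shows a common alternative that avoids lazy quantifiers altogether: a negated character class such as [^>]+, which simply cannot run past the closing bracket.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', text)      # one match spanning the whole string
lazy = re.findall(r'<.+?>', text)       # each tag matched separately
negated = re.findall(r'<[^>]+>', text)  # same tags, without relying on laziness

print(greedy)
print(lazy)
print(negated)
```

Here the lazy and the negated-class versions return the same four tags, while the greedy one returns a single match.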

Lookahead and lookbehind assertions

A lookahead assertion checks what comes right after the current match position, without making it part of the match. This is best explained with an example.

The following pattern first matches foo and then checks whether it is followed by bar:

    import re

    strings = ["hello foo",     # returns False
               "hello foobar"]  # returns True

    for string in strings:
        pattern = re.search(r'foo(?=bar)', string)
        if pattern:
            print('True')
        else:
            print('False')

This doesn't seem very useful, since we could simply test for foobar directly. However, a lookahead can also be negated. The following example matches foo if and only if it is not followed by bar.

    import re

    strings = ["hello foo",     # returns True
               "hello foobar",  # returns False
               "hello foobaz"]  # returns True

    for string in strings:
        pattern = re.search(r'foo(?!bar)', string)
        if pattern:
            print('True')
        else:
            print('False')

A lookbehind assertion is similar, but it looks at what precedes the current match position. Use (?<=...) for a positive lookbehind and (?<!...) for a negative one.

The following pattern matches a bar that is not preceded by foo.

    import re

    strings = ["hello bar",     # returns True
               "hello foobar",  # returns False
               "hello bazbar"]  # returns True

    for string in strings:
        pattern = re.search(r'(?<!foo)bar', string)
        if pattern:
            print('True')
        else:
            print('False')
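
The examples so far cover the negative lookbehind only. As a small sketch of the positive form, the snippet below picks out numbers that are preceded by a dollar sign, and then the ones that are not. The sample string and variable names are invented for illustration.

```python
import re

prices = "apple $30, pear 25, plum $7"

# (?<=\$) is a positive lookbehind: the digits must be preceded by '$',
# but the '$' itself is not part of the match.
with_dollar = re.findall(r'(?<=\$)\d+', prices)

# (?<!\$) is the negative counterpart; \b keeps the match from starting
# in the middle of a number such as the '0' of '$30'.
without_dollar = re.findall(r'(?<!\$)\b\d+', prices)

print(with_dollar)
print(without_dollar)
```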

Conditional (if-then-else) patterns

Regular expressions can test a condition. In Python the format is as follows:

    (?(id/name)yes-pattern|no-pattern)

The condition is the number or name of a group; it is true if that group has already matched something. The else part is optional.

For example, we can use this regular expression to check for matching opening and closing angle brackets:

    import re

    strings = ["<pypix>",  # returns True
               "<foo",     # returns False
               "bar>",     # returns False
               "hello"]    # returns True

    for string in strings:
        pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
        if pattern:
            print('True')
        else:
            print('False')

In the example above, 1 refers to the group (<), which may of course be empty because it is followed by a question mark. The closing angle bracket is required only when that group matched.

In some other regex flavors, such as PCRE, the condition can also be a lookaround, written (?(?=regex)then|else); Python's re module does not support that form.
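
The angle-bracket example uses only the "then" branch. A minimal sketch with both branches might look like the following: if the number starts with an opening parenthesis, a closing one is required after the first three digits, otherwise a hyphen is required. The pattern and sample strings are illustrative assumptions, not taken from the original article.

```python
import re

# (?(1)\)|-) : if group 1 (the opening parenthesis) matched, require ')',
# otherwise require '-'.
pattern = re.compile(r'^(\()?\d{3}(?(1)\)|-)\d{4}$')

samples = ["(123)4567", "123-4567", "(123-4567", "1234567"]
results = {s: bool(pattern.search(s)) for s in samples}

for sample in samples:
    print('{0}: {1}'.format(sample, results[sample]))
```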

Non-capturing groups

Groups, enclosed in parentheses, are captured as numbered groups that can be referenced later. But we can also choose not to capture them.

Let's look at a very simple example:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(f.*)(b.*)', string)
    print("f* = {0}".format(pattern.group(1)))  # prints f* = foo
    print("b* = {0}".format(pattern.group(2)))  # prints b* = bar

Now let's change it a little and add another group (H.*) in front:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(H.*)(f.*)(b.*)', string)
    print("f* = {0}".format(pattern.group(1)))  # prints f* = Hello (with a trailing space)
    print("b* = {0}".format(pattern.group(2)))  # prints b* = foo

The group numbering changes, and depending on how our code uses the groups, this may break our script. Now we have to find every place where the groups are used in the code and adjust the indexes accordingly. If we are not really interested in the content of the newly added group, we can make it "non-capturing", like this:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)
    print("f* = {0}".format(pattern.group(1)))  # prints f* = foo
    print("b* = {0}".format(pattern.group(2)))  # prints b* = bar

By adding ?: at the start of the group, we no longer capture it. So the numbers of the other groups do not shift.
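
Whether a group captures also changes what re.findall returns: with one capturing group it returns the group's content, without any it returns the whole match. The short sketch below, on a made-up string, shows the effect, and confirms that groups() skips the non-capturing group from the example above.

```python
import re

text = 'ababab ab'

captured = re.findall(r'(ab)+', text)      # content of the group (its last repetition)
uncaptured = re.findall(r'(?:ab)+', text)  # the whole match

groups = re.search(r'(?:H.*)(f.*)(b.*)', 'Hello foobar').groups()

print(captured)
print(uncaptured)
print(groups)
```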

Named groups

As with the previous example, this is another way to avoid that trap. We can actually name the groups and then refer to them by name instead of by index. The format is: (?P<name>pattern). We can rewrite the previous example like this:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
    print("f* = {0}".format(pattern.group('fstar')))  # prints f* = foo
    print("b* = {0}".format(pattern.group('bstar')))  # prints b* = bar

Now we can add another group without affecting the other existing groups:

    import re

    string = 'Hello foobar'
    pattern = re.search(r'(?P
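
The listing above breaks off in the source. A minimal sketch of what such an example could look like follows; the group name greeting is a made-up placeholder, since the name used in the original is not recoverable. The point is that fstar and bstar are still reachable by name after a new group is added in front.

```python
import re

string = 'Hello foobar'

# 'greeting' is a hypothetical name for the newly added leading group.
pattern = re.search(r'(?P<greeting>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

print("f* = {0}".format(pattern.group('fstar')))
print("b* = {0}".format(pattern.group('bstar')))
print(pattern.groupdict())  # all named groups as a dictionary
```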

Using callback functions

In Python, re.sub() accepts a callback function as the replacement in a regular expression substitution.

Let's take a look at this example, this is an e-mail template:

    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase is [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # Assume dic has all the replacement data,
    # such as dic['first_name'], dic['product_price'] etc.
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$ 0",
        "ship_price": "$ ",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn",
    }

    # Non-greedy, so that only the first placeholder is replaced
    result = re.compile(r'\[(.*?)\]')
    print(result.sub('John', template, count=1))

Notice that every placeholder has something in common: it is enclosed in a pair of square brackets. We can capture them all with a single regular expression and use a callback function to handle each specific substitution.

So using a callback function is a better approach:

    import re

    template = ("Hello [first_name] [last_name], "
                "Thank you for purchasing [product_name] from [store_name]. "
                "The total cost of your purchase is [product_price] plus [ship_price] for shipping. "
                "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
                "Sincerely, "
                "[store_manager_name]")

    # Assume dic has all the replacement data,
    # such as dic['first_name'], dic['product_price'] etc.
    dic = {
        "first_name": "John",
        "last_name": "Doe",
        "product_name": "iphone",
        "store_name": "Walkers",
        "product_price": "$ 0",
        "ship_price": "$ ",
        "ship_days_min": "1",
        "ship_days_max": "5",
        "store_manager_name": "DoeJohn",
    }

    def multiple_replace(dic, text):
        # Build one alternation out of all escaped "[key]" placeholders
        pattern = "|".join(map(lambda key: re.escape("[" + key + "]"), dic.keys()))
        # The callback strips the brackets and looks the key up in dic
        return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

    print(multiple_replace(dic, template))
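
A variation on the same idea is to match any bracketed name with one generic pattern and let a named callback decide what to do with it. The sketch below is only one possible way to write this; fill_template and its behaviour of leaving unknown placeholders untouched are assumptions made for illustration, not part of the original article.

```python
import re

def fill_template(template, values):
    """Replace every [name] placeholder in template using the values dict."""
    def replace(match):
        key = match.group(1)
        # Leave unknown placeholders untouched instead of raising KeyError.
        return values.get(key, match.group(0))
    return re.sub(r'\[(\w+)\]', replace, template)

message = fill_template(
    "Hello [first_name] [last_name], your [product_name] ships in "
    "[ship_days_min] days. [unknown]",
    {"first_name": "John", "last_name": "Doe",
     "product_name": "iphone", "ship_days_min": "1"},
)
print(message)
```

Compared with building an alternation from the dictionary keys, this version needs no escaping and keeps working when the dictionary is empty.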


Don't reinvent the wheel

It may be more important to know when not to use regular expressions. In many cases you can find alternative tools.

Parsing [X]HTML

An answer on StackOverflow explains wonderfully why we shouldn't use regular expressions to parse [X]HTML.

You should use an HTML parser instead. Python has plenty of options:

   ElementTree is part of the standard library
   BeautifulSoup is a popular third-party library
   lxml is a full-featured, fast, C-based library

The latter two handle even malformed HTML gracefully, which is a blessing given how many ugly sites are out there.

An example of ElementTree:

    from xml.etree import ElementTree

    # ElementTree is an XML parser, so the file must be well-formed (e.g. XHTML)
    tree = ElementTree.parse('filename.html')
    for element in tree.findall('h1'):
        print(ElementTree.tostring(element))
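
To tie this back to the anchor-tag example from the greedy matching section, here is a rough sketch that collects link targets with a parser instead of a regular expression. It uses html.parser from the Python 3 standard library, which is not one of the three options listed above but needs no installation; the class name AnchorCollector is made up for illustration.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="Example">Example</a>')

collector = AnchorCollector()
collector.feed(html)
print(collector.links)
```

Both links are found separately, with no need to think about greedy or non-greedy quantifiers.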

Other

There are a number of other tools to consider before using regular expressions.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
