Describes the usage of regular expressions in Python.

I. Introduction

Regular expressions (REs) provide the foundation for advanced text pattern matching, searching, and search-and-replace. A regular expression is a string made up of ordinary characters and special characters that describe how those characters repeat or combine. A single pattern can therefore match a whole set of strings that share similar features.

II. Details
1. Special symbols and characters used by regular expressions

Metacharacters are the special characters and symbols that give regular expressions their power and flexibility. The most common ones are described below.

(1) Use the pipe symbol (|) to match multiple regular expression patterns
The pipe symbol (|) denotes alternation: it selects one of the regular expressions separated by the pipes, so a single pattern can match more than one string. This operation is also called OR, union, or logical OR.

(2) Match any single character (.)
The dot, or period (.), matches any single character except a newline. (Python has a compilation flag, S or DOTALL, that lifts this restriction so that the dot matches newlines as well.) Letters, digits, whitespace other than "\n", printable and non-printable characters, and symbols all match. To match a literal dot, escape it with a backslash (\.).
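As a minimal sketch of the DOTALL flag mentioned above (the sample string and variable names are made up for illustration):

```python
import re

text = "a\nb"  # hypothetical sample: 'a', a newline, then 'b'

# By default the dot does not match a newline, so this search fails.
default_match = re.search(r"a.b", text)

# With re.DOTALL (also available as re.S) the dot matches the newline too.
dotall_match = re.search(r"a.b", text, re.DOTALL)

print(default_match)                # None
print(repr(dotall_match.group()))   # 'a\nb'
```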

(3) Match at the beginning or end of a string, or at a word boundary (^, $, \b, \B)
Some symbols and special characters anchor a pattern to the beginning or end of a string. To match a pattern at the beginning of a string, use the caret (^) or the special character \A; to match at the end of a string, use the dollar sign ($) or \Z. To match a literal caret or dollar sign, escape it with a backslash.
The special characters \b and \B relate to word boundaries. \b matches at a word boundary, so a pattern that begins with \b must start at the beginning of a word, whether that word is preceded by other characters (a word in the middle of a string) or by nothing (a word at the start of a line). Conversely, \B matches only where there is no word boundary, that is, inside a word.
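The anchors can be tried out with a short sketch like the following. The sample strings and variable names are assumptions chosen for illustration:

```python
import re

line = "The end of the line"  # hypothetical sample text

starts_caret = re.search(r"^The", line)    # ^ anchors at the start
starts_A = re.search(r"\AThe", line)       # \A also anchors at the start
ends_dollar = re.search(r"line$", line)    # $ anchors at the end
ends_Z = re.search(r"line\Z", line)        # \Z also anchors at the end
not_at_end = re.search(r"end$", line)      # 'end' is not at the end: None

# An escaped dollar sign matches a literal '$'.
literal_dollar = re.search(r"\$\d+", "cost: $25")

print(starts_caret.group(), ends_dollar.group(), not_at_end)
print(literal_dollar.group())
```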

(4) Create a character class ([])
Although the dot matches any character, you sometimes need to match only certain specific characters. This is what square brackets ([]) are for: a bracketed expression matches any single one of the characters listed inside the brackets.

(5) Specify a range (-) and negation (^)
In addition to listing single characters, square brackets support character ranges. A hyphen (-) between two characters inside the brackets denotes a range: for example, A-Z, a-z, and 0-9 denote the uppercase letters, the lowercase letters, and the decimal digits, respectively. Ranges follow character-code (lexicographic) order, so they are not limited to letters and digits. In addition, if the first character after the left bracket is a caret (^), the set is negated: it matches any character that is not in the specified set.
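A small sketch of ranges and negation, using made-up sample strings. The range [!-/] is only an illustration that ranges follow character codes; it covers the ASCII characters from '!' to '/':

```python
import re

# One lowercase letter followed by one decimal digit.
letter_digit = re.match(r"[a-z][0-9]", "x7")

# Ranges are not limited to letters and digits:
# '!' through '/' spans a block of ASCII punctuation.
punct = re.findall(r"[!-/]", "a+b, c/d!")

# A leading ^ negates the set: runs of characters that are not vowels.
non_vowels = re.findall(r"[^aeiou]+", "regular")

print(letter_digit.group())   # x7
print(punct)                  # ['+', ',', '/', '!']
print(non_vowels)             # ['r', 'g', 'l', 'r']
```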

(6) Use the closure operators (*, +, ?, {}) to match multiple occurrences or repetitions
The special symbols *, +, and ? match a pattern zero or more times, one or more times, or at most once. The asterisk (*) matches zero or more occurrences of the regular expression immediately to its left. The plus sign (+) matches one or more occurrences, and the question mark (?) matches zero or one occurrence.
The brace operator ({}) takes either a single value or a pair of values separated by a comma. A single value, as in {N}, matches exactly N occurrences; a pair, as in {M,N}, matches from M to N occurrences. Any of these symbols can be escaped with a backslash to strip it of its special meaning, so \* matches a literal asterisk. By default, the repetition operators (*, +, ?, {M,N}) are "greedy": the regular expression engine tries to absorb as many characters as possible while still allowing the pattern to match. A question mark placed directly after a repetition operator makes it non-greedy ("lazy"): the engine consumes as few characters as possible for the current match and leaves as many as possible for the rest of the pattern.
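The repetition operators and the greedy/lazy distinction can be illustrated with a sketch like this one (all sample strings are hypothetical):

```python
import re

zero_or_more = re.match(r"ab*", "a").group()          # 'a' (zero b's)
one_or_more = re.match(r"ab+", "abbb").group()        # 'abbb'
optional = re.match(r"colou?r", "color").group()      # 'color'
exactly_three = re.match(r"\d{3}", "12345").group()   # '123'
two_to_four = re.match(r"\d{2,4}", "12345").group()   # '1234'
literal_star = re.search(r"\*", "2*3").group()        # '*'

html = "<b>bold</b>"
greedy = re.match(r"<.+>", html).group()   # '<b>bold</b>': takes as much as possible
lazy = re.match(r"<.+?>", html).group()    # '<b>': takes as little as possible

print(greedy, lazy)
```

The greedy pattern runs to the last '>' in the string, while the lazy version stops at the first one.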

(7) Special characters that represent character sets
You can use \d instead of [0-9] to represent a decimal digit. Another special character, \w, represents the whole alphanumeric character set and is shorthand for [A-Za-z0-9_] (for ASCII text), while \s represents a whitespace character. The uppercase versions of these special characters negate the match: for example, \D matches any non-digit character (equivalent to [^0-9]).
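A brief sketch of these shorthand classes on a made-up ASCII sample string:

```python
import re

sample = "room 42_b, floor 7"  # hypothetical sample text

digits = re.findall(r"\d+", sample)       # runs of digits
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)        # individual whitespace characters
non_digits = re.findall(r"\D+", sample)   # runs of anything except digits

print(digits)       # ['42', '7']
print(words)        # ['room', '42_b', 'floor', '7']
print(len(spaces))  # 3
print(non_digits)   # ['room ', '_b, floor ']
```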

(8) Use parentheses () to form groups
Parentheses serve two purposes in a regular expression: they group parts of the expression, for example so that a repetition operator applies to the whole group, and they define subgroups to match.
A further benefit of parentheses is that the substring matched by each group is saved for later use. These subgroups can be referred to again within the same match or search, or extracted for further processing.
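Both uses of parentheses can be sketched as follows. The backreference \1 is one way a saved subgroup can be reused within the same search; the sample strings are invented for illustration:

```python
import re

# Grouping: the + applies to the whole group 'ab'.
repeated = re.match(r"(ab)+", "ababab")

# Reuse within the same search: \1 must match the same text
# that the first group matched, so this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(repeated.group())   # ababab
print(doubled.group())    # is is
print(doubled.group(1))   # is
```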

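2. Regular expressions and Python
The re engine was rewritten in Python 1.6, which improved its performance and added Unicode support. The interface did not change, so the module name stayed the same. Internally, the new engine is called sre.
(1) The re module: core functions and methods
The most common regular expression functions and methods, covered in the subsections that follow, are compile(), match(), search(), findall(), sub(), subn(), and split(), together with the group() and groups() methods of match objects.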

(2) Use compile() to compile regular expressions

Python code is compiled into bytecode before the interpreter executes it. Passing a code object to eval() or exec() is faster than passing a string, because a string must first be compiled into a code object every time, whereas a precompiled code object skips that step. The same idea applies to regular expressions: before any pattern matching can take place, the pattern must be compiled into a regex object. If a regular expression is used many times during execution, it is advisable to precompile it, and re.compile() provides this capability. In practice, the module-level functions cache compiled objects, so not every call to search() or match() with the same pattern triggers a new compilation. Even so, precompiling saves the cache lookups and avoids passing the same pattern string to a function again and again.
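A minimal sketch of precompiling a pattern and reusing it. The pattern, the sample lines, and the variable names are assumptions for illustration:

```python
import re

# Compile once, then reuse the regex object many times.
word_re = re.compile(r"\b\w+\b")

lines = ["hello world", "regular expressions in Python"]
counts = [len(word_re.findall(line)) for line in lines]

# The regex object offers the same operations as the module-level functions.
first_compiled = word_re.search("  spaced out").group()
first_module = re.search(r"\b\w+\b", "  spaced out").group()

print(counts)                          # [2, 4]
print(first_compiled, first_module)    # spaced spaced
```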
(3) Match objects and the group() and groups() methods
Besides the regex object, there is one other object type involved in regular expression processing: the match object. A match object is what a successful call to match() or search() returns. Match objects have two main methods, group() and groups().
group() returns either the entire match or a specific subgroup, as requested. groups() simply returns a tuple containing all of the subgroups. If the regular expression contains no subgroups, groups() returns an empty tuple, while group() still returns the entire match.
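The difference between a pattern with and without subgroups can be seen in a short sketch (the date string is just sample data):

```python
import re

# No subgroups: group() is the entire match, groups() is an empty tuple.
no_groups = re.match(r"\d+", "2015-03-27")
whole_a = no_groups.group()
subs_a = no_groups.groups()

# Three subgroups: group(n) picks one, groups() returns them all.
with_groups = re.match(r"(\d+)-(\d+)-(\d+)", "2015-03-27")
whole_b = with_groups.group()
month = with_groups.group(2)
subs_b = with_groups.groups()

print(whole_a, subs_a)          # 2015 ()
print(whole_b, month, subs_b)   # 2015-03-27 03 ('2015', '03', '27')
```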
(4) Match strings with match()
The match() function tries to match the pattern at the start of the string. If the match succeeds, a match object is returned; if it fails, None is returned. The group() method of the match object shows the text that was matched.

  >>> import re
  >>> m = re.match('foo', 'foo')   # pattern matches string
  >>> if m is not None:            # show match if successful
  ...     m.group()
  ...
  'foo'
  >>> m = re.match('foo', 'bar')   # pattern does not match string
  >>> if m is not None:
  ...     m.group()
  ...

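Here m is a match object. A match can succeed even when the string is longer than the pattern, because match() only requires the pattern to match at the start of the string. If the match fails, match() returns None, so calling group() directly on the result raises an AttributeError.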

  >>> re.match('foo', 'food on the table').group()
  'foo'
  >>> re.match('foo', 'fod on the table').group()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  AttributeError: 'NoneType' object has no attribute 'group'

(5) Search for a pattern in a string with search()
search() works like match(), except that it looks for the pattern anywhere in the string rather than only at the start. If a match is found, a match object is returned; otherwise, None is returned. In other words, match() attempts the match only at the beginning of the string, whereas search() scans the string from left to right and returns the first occurrence of the pattern.

  >>> m = re.search('foo', 'seafood')
  >>> if m is not None: m.group()
  ...
  'foo'
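To make the contrast with match() explicit, here is a small sketch on the same string. The variable names are illustrative only:

```python
import re

# match() only tries at the start: 'seafood' does not start with 'foo'.
from_start = re.match(r"foo", "seafood")

# search() scans left to right and finds 'foo' inside the string.
anywhere = re.search(r"foo", "seafood")
position = anywhere.start()

# Only the first occurrence is reported, even if there are several.
first_o = re.search(r"o", "seafood").start()

print(from_start)   # None
print(position)     # 3
print(first_o)      # 4
```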

The following subsections use match() and search() together with the group() and groups() methods of match objects to demonstrate most of the special characters and symbols in regular expression syntax.
(6) Match multiple strings (|)

  >>> bt = 'bat|bet|bit'                 # RE pattern: bat, bet, bit
  >>> m = re.match(bt, 'bat')
  >>> if m is not None: m.group()
  ...
  'bat'
  >>> m = re.search(bt, 'He bit me!')
  >>> if m is not None: m.group()
  ...
  'bit'

(7) Match any single character (.)
The dot cannot match a newline or a non-character (that is, the empty string). To match a literal dot in a regular expression, escape it with a backslash so that it loses its special meaning.

  >>> anyend = '.end'
  >>> m = re.match(anyend, 'bend')        # dot matches 'b'
  >>> if m is not None: m.group()
  ...
  'bend'
  >>> m = re.match(anyend, 'end')         # no char to match
  >>> if m is not None: m.group()
  ...
  >>> m = re.match(anyend, '\nend')       # any char except \n
  >>> if m is not None: m.group()
  ...
  >>> m = re.search('.end', 'The end.')   # matches ' ' in search
  >>> if m is not None: m.group()
  ...
  ' end'
  >>> pi_patt = r'3\.14'                  # escaped dot: literal decimal point
  >>> m = re.match(pi_patt, '3.14')
  >>> if m is not None: m.group()
  ...
  '3.14'
  >>> m = re.match(pi_patt, '3014')       # escaped dot does not match '0'
  >>> if m is not None: m.group()
  ...

(8) Create character sets ([])

  >>> m = re.match('[cr][23][dp][o2]', 'c3po')   # matches 'c3po'
  >>> if m is not None: m.group()
  ...
  'c3po'
  >>> m = re.match('r2d2|c3po', 'c2do')          # does not match 'c2do'
  >>> if m is not None: m.group()
  ...
  >>> m = re.match('r2d2|c3po', 'r2d2')          # matches 'r2d2'
  >>> if m is not None: m.group()
  ...
  'r2d2'

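(9) Repetition, special characters, and subgroups
The simple email address pattern \w+@\w+\.com is quite restrictive, and you may want to match more addresses than it allows. The patterns below add an optional host name in front of the domain, first with ? (at most one host name) and then with * (any number of host names).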

  >>> patt = r'\w+@(\w+\.)?\w+\.com'
  >>> re.match(patt, 'nobody@xxx.com').group()
  'nobody@xxx.com'
  >>> re.match(patt, 'nobody@www.xxx.com').group()
  'nobody@www.xxx.com'
  >>> patt = r'\w+@(\w+\.)*\w+\.com'
  >>> re.match(patt, 'nobody@www.xxx.yyy.zzz.com').group()
  'nobody@www.xxx.yyy.zzz.com'

The following example uses the group() method to access each subgroup individually and the groups() method to obtain a tuple containing all matched subgroups:

  >>> m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')
  >>> m.group()      # entire match
  'abc-123'
  >>> m.groups()     # all subgroups
  ('abc', '123')
  >>> m.group(1)     # subgroup 1
  'abc'
  >>> m.group(2)     # subgroup 2
  '123'
  >>> m = re.match('(a)(b)', 'ab')
  >>> m.groups()
  ('a', 'b')

(10) Match at the beginning or end of a string and at word boundaries

  >>> m = re.search('^The', 'end. The')          # not at beginning
  >>> if m is not None: m.group()
  ...
  >>> m = re.search(r'\bthe', 'bite the dog')    # at a boundary
  >>> if m is not None: m.group()
  ...
  'the'
  >>> m = re.search(r'\bthe', 'bitethe dog')     # no boundary
  >>> if m is not None: m.group()
  ...
  >>> m = re.search(r'\Bthe', 'bitethe dog')     # no boundary
  >>> if m is not None: m.group()
  ...
  'the'

(11) Use findall() to find every occurrence
findall() finds all non-overlapping occurrences of a regular expression pattern in a string. Like search(), it performs a search over the string; unlike search(), it always returns a list. If no occurrences are found, the list is empty; otherwise, it contains every match, in left-to-right order.

  >>> re.findall('car', 'carry the barcardi to the car')
  ['car', 'car', 'car']
  >>> re.findall('car', 'scary')
  ['car']
  >>> re.findall('car', 'ssary')
  []

When the regular expression has exactly one subgroup, findall() returns a list of the strings matched by that subgroup. When it has several subgroups, findall() returns a list of tuples: each tuple corresponds to one successful match, and each element of the tuple is the text matched by one subgroup.
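How the return value of findall() changes with the number of subgroups can be sketched as follows. The addresses in the sample text are made up:

```python
import re

text = "alice@example.com, bob@test.com"  # hypothetical sample data

# No subgroups: a list of whole matches.
whole = re.findall(r"\w+@\w+\.com", text)

# One subgroup: a list of the strings matched by that subgroup.
one_group = re.findall(r"(\w+)@\w+\.com", text)

# Several subgroups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)@(\w+)\.com", text)

print(whole)        # ['alice@example.com', 'bob@test.com']
print(one_group)    # ['alice', 'bob']
print(two_groups)   # [('alice', 'example'), ('bob', 'test')]
```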
(12) Search and replace with sub() and subn()
Both sub() and subn() replace every part of a string that matches the regular expression pattern. The replacement is usually a string, but it can also be a function that returns the replacement string. subn() does the same as sub() but also reports how many replacements were made: it returns a tuple containing the new string and the replacement count.

  >>> re.sub('[ae]', 'X', 'abcdef')
  'XbcdXf'
  >>> re.subn('[ae]', 'X', 'abcdef')
  ('XbcdXf', 2)
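The replacement can also be a function, as noted above. A minimal sketch, in which the helper name double and the sample sentence are hypothetical:

```python
import re

def double(match):
    """Return the matched number doubled, as a string."""
    return str(int(match.group()) * 2)

sentence = "3 apples and 12 pears"  # hypothetical sample text

# The function is called once per match and receives the match object.
replaced = re.sub(r"\d+", double, sentence)

# subn() also reports how many replacements were made.
replaced_n, count = re.subn(r"\d+", double, sentence)

print(replaced)   # 6 apples and 24 pears
print(count)      # 2
```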

(13) Split on a delimiter pattern with split()
If the delimiter is a plain string with no special symbols, that is, not a regular expression that matches multiple patterns, re.split() behaves the same way as str.split().

  >>> re.split(':', 'str1:str2:str3')
  ['str1', 'str2', 'str3']

The following script splits each line of output from the Linux who command into fields:

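  #!/usr/bin/env python

  from os import popen
  from re import split

  # Run the who command and split each line of its output into fields.
  f = popen('who', 'r')
  for eachLine in f:
      # Fields are separated by two or more whitespace characters or a single TAB.
      print(split(r'\s\s+|\t', eachLine.strip()))
  f.close()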

Sample output:

  ['aoyang', 'tty1', '2015-03-27 09:06 (:0)']
  ['aoyang', 'pts/0', '2015-03-27 09:09 (:0.0)']
  ['aoyang', 'pts/1', '2015-03-27 11:41 (:0.0)']
  ['aoyang', 'pts/2', '2015-03-27 14:37 (:0.0)']

It would be hard to extract the login information with str.split(), because the amount of whitespace separating the fields is irregular and unpredictable (and the timestamp field itself contains single spaces). A regular expression handles this easily. The script runs the who command with os.popen(), strips the trailing newline from each line, and splits on two or more whitespace characters, with a single TAB as an alternative delimiter for split().
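The same splitting step can be tried without running who at all. The sketch below uses a hard-coded sample line; its exact spacing is an assumption about what who prints, not captured output:

```python
import re

# Hypothetical line in the style of `who` output (spacing is assumed).
sample = "aoyang   pts/0        2015-03-27 09:09 (:0.0)\n"

# Split on two or more whitespace characters, or on a single TAB.
fields = re.split(r"\s\s+|\t", sample.strip())

# For comparison, str.split() with no argument splits on every run of
# whitespace and so breaks the timestamp field apart.
naive = sample.split()

print(fields)   # ['aoyang', 'pts/0', '2015-03-27 09:09 (:0.0)']
print(naive)    # ['aoyang', 'pts/0', '2015-03-27', '09:09', '(:0.0)']
```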
One last point is the conflict between ASCII characters and regular expression special characters. For example, in a Python string "\b" represents the ASCII backspace character, but \b is also the regular expression special symbol that matches a word boundary. For the regular expression compiler to see the two characters \b rather than a backspace, the backslash must be escaped with another backslash, giving "\\b". Alternatively, use a raw string such as r'\b', as in the word boundary examples above.
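The difference between the three spellings can be checked directly. A small sketch, with illustrative variable names:

```python
import re

backspace = "\b"    # one character: ASCII backspace (0x08)
escaped = "\\b"     # two characters: a backslash and 'b'
raw = r"\b"         # the same two characters, written as a raw string

# With a plain string, the pattern is backspace + 'the', which is not found.
wrong = re.search("\bthe", "bite the dog")

# With a raw string, \b reaches the regex engine as a word boundary.
right = re.search(r"\bthe", "bite the dog")

print(len(backspace), len(escaped), escaped == raw)   # 1 2 True
print(wrong)            # None
print(right.group())    # the
```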
III. Summary
(1) Regular expressions are a powerful tool for processing strings. They have their own syntax and a separate processing engine; they may be less efficient than the built-in str methods, but they are very capable, supporting pattern matching, extraction, and search-and-replace.
(2) Only the most common symbols are covered here; fluency with regular expressions comes with practice.
(3) If you find any shortcomings, please leave a comment. Thanks in advance!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
