Getting started with Python regular expressions
I. Regular Expression Basics
1.1. Brief introduction
Regular expressions are not part of Python itself. They are a powerful tool for working with strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more capable. Because of this independence, regular expression syntax is essentially the same in every language that provides it; what differs is how much of the syntax each language supports. Don't worry, though: the unsupported parts are usually the less common ones. If you have already used regular expressions in another language, a quick skim is enough to get started.
Figure 1 (re_simple) shows the process of matching with a regular expression.
Roughly, matching works like this: the engine walks through the expression and the text together, comparing one against the other. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. Quantifiers and boundaries make the process slightly different, but it is still easy to follow: look at the examples in the figure and try it yourself a few times.
Figure 2 lists the regular expression metacharacters and syntax supported by Python.
1.2. Greedy and non-greedy quantifiers
Regular expressions are typically used to find matching strings in text. Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible.
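The difference is easiest to see on a short string. The following is a minimal sketch; the sample text "abbbc" and the variable names are illustrative choices, not taken from the figures.

```python
import re

text = "abbbc"

# Greedy: b* takes as many 'b' characters as it can
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? takes as few 'b' characters as it can (here, none)
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a
```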
1.3. The backslash plague
As in most programming languages, "\" is the escape character in regular expressions, and this leads to the so-called backslash plague. Suppose you need to match a literal "\" in the text. The regular expression itself must escape it, giving "\\"; in a string literal of the programming language each of those two backslashes must be escaped again, giving "\\\\". This is the case in C++, and in Python as well. In other words, the programming language turns the first pair and the last pair into one backslash each, leaving two backslashes, which the regular expression engine then reads as one literal backslash. Python's raw strings solve this problem nicely: the regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about a missing backslash, and the expressions are easier to read.
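A small check of the two spellings can make this concrete. This is only a sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "C:\\temp"  # the seven characters C:\temp

# Ordinary literal: four backslashes in source -> regex \\ -> one literal backslash
escaped = re.search("\\\\", text)

# Raw literal: two backslashes in source -> regex \\ -> one literal backslash
raw = re.search(r"\\", text)

print(escaped.span() == raw.span())  # True

# The same holds for \d: "\\d" and r"\d" are the same pattern
same_digits = re.findall("\\d", "a1b2") == re.findall(r"\d", "a1b2")
print(same_digits)  # True
```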
1.4. Matching mode
Regular expressions offer several matching modes, such as ignoring case or multi-line matching. They are described together with the factory method of the Pattern class, re.compile(pattern[, flags]).
II. The regular expression module (re) in Python
2.1 Getting started with re
Python supports regular expressions through the re module.
The general procedure for using re is: first compile the string form of the regular expression into a Pattern instance; then use the Pattern instance to process the text and obtain a match result (a Match instance); finally, use the Match instance to retrieve information and carry out any further work.
Consider the code below:
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Match the text against the pattern; match() returns None when nothing matches
match = pattern.match('hello world!')

if match:
    # Use the Match object to get the matched text
    print(match.group())

### output ###
# hello
```
2.2 re.compile(strPattern[, flag]):
This method is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Several modes can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. Alternatively, the modes can be specified inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
The available values are:
re.I (re.IGNORECASE): Ignore case (the full name is given in parentheses, here and below)
re.M (re.MULTILINE): Multi-line mode; changes the behavior of '^' and '$' (see Figure 2)
re.S (re.DOTALL): Dot-matches-all mode; changes the behavior of '.'
re.L (re.LOCALE): Makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting
re.U (re.UNICODE): Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
re.X (re.VERBOSE): Verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.
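The equivalence between the flag argument and inline flags can be checked directly. This is a minimal sketch; the two-line sample text and the variable names are illustrative only.

```python
import re

text = "first line\nSecond LINE"

# Flags passed as the second argument, combined with |
by_flags = re.compile(r"^second line$", re.I | re.M)

# The same flags written inline at the start of the expression
inline = re.compile(r"(?im)^second line$")

found_by_flags = by_flags.search(text).group()
found_inline = inline.search(text).group()
without_flags = re.search(r"^second line$", text)

print(found_by_flags)  # Second LINE
print(found_inline)    # Second LINE
print(without_flags)   # None
```

Without re.M, '^' only matches at the very start of the text, and without re.I the capital letters would not match, which is why the last search finds nothing.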
The following two regular expressions are equivalent:
```python
import re

# Multi-line form: whitespace is ignored and comments are allowed
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

# The same expression written compactly
b = re.compile(r"\d+\.\d*")
```
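One way to convince yourself that the two forms behave alike is to run both over a few inputs. The sample strings below are arbitrary, and the names verbose and compact are chosen only for this sketch.

```python
import re

verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

samples = ["3.14", "10.", "abc", ".5"]

# For each sample, record whether each pattern matches at the start
results = [(bool(verbose.match(s)), bool(compact.match(s))) for s in samples]

print(results)  # [(True, True), (True, True), (False, False), (False, False)]
```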
re also provides a number of module-level functions that perform the same regular expression operations. Each of them can be replaced by the corresponding method of a Pattern instance. Their only advantage is that you write one less line of re.compile() code, but the compiled Pattern object cannot then be reused. The hello example from 2.1 can be shortened to:

```python
m = re.match(r'hello', 'hello world!')
print(m.group())
```
The re module also provides escape(string), which returns string with an escape character inserted before regular expression metacharacters such as *, + and ?. This is useful when you need to match text that contains many metacharacters.
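A short sketch of re.escape() in use follows. The literal text is invented for the example, and the escaped pattern is deliberately not printed because its exact form can differ between Python versions.

```python
import re

literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"

# Escaped: every metacharacter in the literal is matched as plain text
escaped_hit = re.search(re.escape(literal), haystack)

# Unescaped: +, ( ) and ? are read as regex syntax, so the literal text is not found
unescaped_hit = re.search(literal, haystack)

print(escaped_hit.group())  # 1+1=2 (really?)
print(unescaped_hit)        # None
```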
2.3 Match
A Match object is the result of one match. It holds a lot of information about that match, which can be read through the read-only attributes and the methods that Match provides.
2.3.1 Attributes
string: The text used for matching.
re: The Pattern object used for matching.
pos: The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
endpos: The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
lastindex: The number of the last captured group. None if no group was captured.
lastgroup: The name of the last captured group. None if that group has no name or if no group was captured.
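The attributes can be inspected on any Match object. The pattern and text below are illustrative; the group name "sign" is made up for this sketch.

```python
import re

m = re.match(r"(\w+) (\w+)(?P<sign>.*)", "hello world!")

print(m.string)      # hello world!
print(m.re.pattern)  # (\w+) (\w+)(?P<sign>.*)
print(m.pos)         # 0
print(m.endpos)      # 12
print(m.lastindex)   # 3
print(m.lastgroup)   # sign

# With no capturing groups, lastindex and lastgroup are both None
plain = re.match(r"hello", "hello world!")
print(plain.lastindex)  # None
```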
2.3.2 Methods
group([group1, ...]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or a name. The number 0 stands for the whole matched substring, and group(0) is returned when no argument is given. A group that captured nothing returns None; a group that captured several times returns the last substring it captured.
groups([default]):
Returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, ..., last). default is the value substituted for any group that captured nothing; it defaults to None.
groupdict([default]):
Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; groups without a name are not included. default has the same meaning as above.
start([group]):
Returns the start index in string of the substring captured by the specified group (the index of the first character of the substring). group defaults to 0.
end([group]):
Returns the end index in string of the substring captured by the specified group (the index of the last character of the substring + 1). group defaults to 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. In template, groups can be referenced with \id, \g<id> or \g<name>; \0 is not a reference to the whole match (use \g<0> for that). \id and \g<id> are equivalent, but \10 is taken to mean group 10, so to express \1 followed by the character '0' you must write \g<1>0.
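The methods above can be tried on a single match. This is a sketch with an invented pattern and text; the group name "sign" and the default value "n/a" are illustrative.

```python
import re

m = re.match(r"(\w+) (\w+)(?P<sign>.*)", "hello world!")

whole = m.group()              # same as m.group(0)
first_two = m.group(1, 2)      # several arguments -> tuple
all_groups = m.groups()
named = m.groupdict()
second_span = (m.start(2), m.end(2), m.span(2))
swapped = m.expand(r"\2 \1\3")

print(whole)        # hello world!
print(first_two)    # ('hello', 'world')
print(all_groups)   # ('hello', 'world', '!')
print(named)        # {'sign': '!'}
print(second_span)  # (6, 11, (6, 11))
print(swapped)      # world hello!

# A group that did not take part in the match is replaced by the default
with_default = re.match(r"(a)(b)?", "a").groups("n/a")
print(with_default)  # ('a', 'n/a')
```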
2.4 Pattern (compiled regular expression)
A Pattern object is a compiled regular expression; text can be matched and searched through the methods that Pattern provides.
Pattern cannot be instantiated directly; it must be constructed with re.compile().
Pattern provides several read-only attributes for getting information about the expression.
pattern: The expression string used at compile time.
flags: The matching modes used at compile time, as a number.
groups: The number of groups in the expression.
groupindex: A dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers; groups without a name are not included.
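A quick look at these attributes is sketched below. The expression is illustrative, and the numeric value of flags is tested with a bit mask rather than printed, since the exact number can vary between Python versions.

```python
import re

p = re.compile(r"(\w+) (\w+)(?P<sign>.*)", re.DOTALL)

source = p.pattern
group_count = p.groups
name_to_number = dict(p.groupindex)
dotall_set = bool(p.flags & re.DOTALL)

print(source)          # (\w+) (\w+)(?P<sign>.*)
print(group_count)     # 3
print(name_to_number)  # {'sign': 3}
print(dotall_set)      # True
```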
2.5 Methods of the re module
2.5.1 match(string[, pos[, endpos]]) or re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string.
If pattern has been matched completely when it ends, a Match object is returned.
None is returned if pattern fails to match along the way, or if endpos is reached before the match is complete.
pos and endpos default to 0 and len(string) respectively.
re.match() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
Note: This method does not require a full match. If string still has characters left when pattern ends, the match is still considered successful. For a full match, add the boundary matcher '$' at the end of the expression.
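The behavior of pos and of the '$' boundary can be seen in a few calls. This is a minimal sketch with invented sample strings.

```python
import re

p = re.compile(r"hello")

at_start = p.match("hello world!")     # matches: pattern found at index 0
not_at_start = p.match("say hello")    # None: index 0 holds 's'
from_pos = p.match("say hello", 4)     # matches: matching starts at index 4

# Trailing text is allowed unless the expression ends with '$'
strict_fail = re.match(r"hello$", "hello world!")
strict_ok = re.match(r"hello$", "hello")

print(at_start.group())   # hello
print(not_at_start)       # None
print(from_pos.span())    # (4, 9)
print(strict_fail)        # None
print(strict_ok.group())  # hello
```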
2.5.2 search(string[, pos[, endpos]]) or re.search(pattern, string[, flags]):
This method looks for a substring of string that matches.
It tries to match pattern starting at index pos of string, and returns a Match object if pattern has been matched completely when it ends.
If the match fails, pos is increased by 1 and the match is attempted again; None is returned if nothing has matched by the time pos reaches endpos.
pos and endpos default to 0 and len(string) respectively.
re.search() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')

# search() looks for a matching substring and returns None if there is none.
# match() would not succeed on this text:
# match = pattern.match('hello world!')  # no match
# match = pattern.match('world! hello')  # match
match = pattern.search('hello world!')   # match

if match:
    # Use the Match object to get the matched text
    print(match.group())

### output ###
# world
```
Personal understanding: One difference between match and search is that match succeeds only when the pattern matches at the start of the string, whereas search succeeds when the pattern matches anywhere in the string.
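The effect of pos and endpos on search() can be checked with a compiled pattern. The indices below are chosen for this sketch only.

```python
import re

p = re.compile(r"world")
text = "hello world!"

anywhere = p.search(text)           # found at index 6
late_start = p.search(text, 7)      # starts inside "world", so nothing is found
early_end = p.search(text, 0, 10)   # only "hello worl" is searched

print(anywhere.span())  # (6, 11)
print(late_start)       # None
print(early_end)        # None
```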
2.5.3 split(string[, maxsplit]) or re.split(pattern, string[, maxsplit]):
Splits string at every substring that matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

p = re.compile(r'\d+')

# Split at every run of digits; the trailing '4' leaves an empty last piece
print(p.split('one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']
```
Personal understanding:
Expression: \d+
Matches in one1two2three3four4: 1, 2, 3, 4
So the list returned here holds the pieces the string was split into, not the substrings that were matched.
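The maxsplit parameter described above limits how many cuts are made. A short sketch, reusing the same sample string:

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

# At most two splits: the rest of the string is returned untouched
limited = p.split(text, 2)

print(limited)  # ['one', 'two', 'three3four4']
```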
2.5.4 findall(string[, pos[, endpos]]) or re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

p = re.compile(r'\d+')

# Collect every run of digits
print(p.findall('one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']
```
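The pos and endpos parameters of the Pattern method restrict which part of the string is searched. A minimal sketch, with indices picked for illustration:

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

everything = p.findall(text)
from_index_4 = p.findall(text, 4)    # searches "two2three3four4"
first_eight = p.findall(text, 0, 8)  # searches "one1two2"

print(everything)    # ['1', '2', '3', '4']
print(from_index_4)  # ['2', '3', '4']
print(first_eight)   # ['1', '2']
```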
2.5.5 finditer(string[, pos[, endpos]]) or re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (a Match object) in order.
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

p = re.compile(r'\d+')

# Each item produced by the iterator is a Match object
for m in p.finditer('one1two2three3four5'):
    print(m.group())

### output ###
# 1
# 2
# 3
# 5
```
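Because finditer() yields Match objects rather than plain strings, positions are available as well as text. The sample string here is invented for the sketch.

```python
import re

p = re.compile(r"\d+")
text = "one1two22three333"

# Pair each matched substring with its (start, end) position
found = [(m.group(), m.span()) for m in p.finditer(text)]

print(found)  # [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```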
2.5.6 sub(repl, string[, count]) or re.sub(pattern, repl, string[, count]):
Replaces each matching substring in string with repl and returns the resulting string.
When repl is a string, groups can be referenced with \id, \g<id> or \g<name>; \0 is not a reference to the whole match (use \g<0> for that).
When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references inside the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

# String replacement: swap the two words and join them with '-'
print(p.sub(r'\2-\1', s))

# Function replacement: capitalize both words
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))

### output ###
# say-i, world-hello!
# I Say, Hello World!
```
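The remaining reference forms and the count parameter can be sketched as follows. The group names first and second are made up for this example.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

by_name = p.sub(r"\g<second> \g<first>", s)  # named references
only_once = p.sub(r"\2 \1", s, count=1)      # replace the first match only
bracketed = p.sub(r"[\g<0>]", s)             # \g<0> is the whole match

# \g<1>0 means group 1 followed by a literal '0'; \10 would mean group 10
tens = re.sub(r"(\d)", r"\g<1>0", "a1b2")

print(by_name)    # say i, world hello!
print(only_once)  # say i, hello world!
print(bracketed)  # [i say], [hello world]!
print(tens)       # a10b20
```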
2.5.7 subn(repl, string[, count]) or re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).
```python
#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

# Same replacements as with sub(), but the result also carries the count
print(p.subn(r'\2-\1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.subn(func, s))

### output ###
# ('say-i, world-hello!', 2)
# ('I Say, Hello World!', 2)
```
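The count in the returned tuple is handy for checking whether anything changed. A minimal sketch with an invented sample string:

```python
import re

text = "one1two2three3"

replaced_all = re.subn(r"\d", "#", text)
replaced_two = re.subn(r"\d", "#", text, count=2)
nothing_found = re.subn(r"x", "#", text)

print(replaced_all)   # ('one#two#three#', 3)
print(replaced_two)   # ('one#two#three3', 2)
print(nothing_found)  # ('one1two2three3', 0)
```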