An article about Python regular expressions

Last Update:2017-09-22 Source: Internet

Author: User

Tags character classes

1. Regular expression syntax

　　1.1 Characters and character classes
1 Special characters: \ . ^ $ ? + * { } [ ] ( ) |
To match any of these characters literally, escape it with a backslash (\).
2 Character classes
1. One or more characters enclosed in [] form a character class. Without a quantifier, a character class matches exactly one of the characters it contains.
2. A range can be given inside a character class: [a-zA-Z0-9] matches any single character from a to z, A to Z, or 0 to 9.
3. A ^ immediately after the opening bracket negates the class: [^0-9] matches any single character that is not a digit.
4. Inside a character class, special characters other than \ lose their special meaning and stand for themselves. The meaning of ^ and - depends on position: ^ negates the class only in the first position and is a literal ^ anywhere else; - denotes a range when placed between two characters and is a literal - when it is the first character of the class.
5. The shorthands \d, \s and \w can be used inside a character class.
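The character class rules above can be checked with a short sketch that uses the standard re module. The sample strings are made up for illustration.

```python
import re

# Without a quantifier, a class matches exactly one character at a time.
alnum = re.findall(r"[a-zA-Z0-9]", "a-Z_9")

# A leading ^ negates the class: any single non-digit character.
non_digits = re.findall(r"[^0-9]", "a1b2")

# A leading - and a non-leading ^ are ordinary characters inside a class.
literals = re.findall(r"[-+^]", "2^3-1+4")

# Shorthands such as \d and \s can be combined inside a class.
digit_or_space = re.findall(r"[\d\s]", "a 1\tb")

print(alnum)           # ['a', 'Z', '9']
print(non_digits)      # ['a', 'b']
print(literals)        # ['^', '-', '+']
print(digit_or_space)  # [' ', '1', '\t']
```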
3 Shorthands
. matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
\d matches a Unicode digit; with re.ASCII it matches only 0-9
\D matches any character that is not a Unicode digit
\s matches Unicode whitespace; with re.ASCII it matches one of space, \t, \n, \r, \f, \v
\S matches any character that is not Unicode whitespace
\w matches a Unicode word character; with re.ASCII it matches one of [a-zA-Z0-9_]
\W matches any character that is not a Unicode word character
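A minimal sketch of how re.ASCII and re.DOTALL change the shorthands, assuming str (not bytes) patterns. The sample strings are illustrative only.

```python
import re

text = "x\u0663y7"  # U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode digit

# str patterns are Unicode-aware by default; re.ASCII restricts \d to 0-9.
unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, re.ASCII)

# The same applies to \w: the accented letter is a word character
# only in Unicode mode.
unicode_words = re.findall(r"\w+", "café au lait")
ascii_words = re.findall(r"\w+", "café au lait", re.ASCII)

# . stops at a newline unless re.DOTALL is set.
two_lines = "line1\nline2"
dot_default = re.match(r".+", two_lines).group()
dot_all = re.match(r".+", two_lines, re.DOTALL).group()

print(len(unicode_digits), ascii_digits)  # 2 ['7']
print(unicode_words, ascii_words)
print(repr(dot_default), repr(dot_all))   # 'line1' 'line1\nline2'
```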

　　1.2 quantifier
1. ? matches the preceding expression 0 or 1 times
2. * matches the preceding expression 0 or more times
3. + matches the preceding expression 1 or more times
4. {m} matches the preceding expression exactly m times
5. {m,} matches the preceding expression at least m times
6. {,n} matches the preceding expression at most n times
7. {m,n} matches the preceding expression at least m and at most n times
Note:
All of the quantifiers above are greedy and match as much text as possible. To make a quantifier non-greedy, append a ? to it, as in *?, +?, ?? or {m,n}?.
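The difference between greedy and non-greedy quantifiers is easiest to see side by side. This is a small sketch with made-up input strings.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last > it can reach.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first > that lets the match succeed.
lazy = re.findall(r"<.+?>", html)

# Counted quantifiers are greedy too: {2,3} takes 3 digits when it can.
counted = re.findall(r"\d{2,3}", "1 22 333 4444")
counted_lazy = re.findall(r"\d{2,3}?", "1 22 333 4444")

print(greedy)        # ['<b>bold</b> and <i>italic</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
print(counted)       # ['22', '333', '444']
print(counted_lazy)  # ['22', '33', '44', '44']
```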

　　1.3 Groups and captures
1 The role of ():
1. Captures the text matched by the expression inside () so that it can be processed further. Writing ?: immediately after the opening parenthesis turns off capturing for that group.
2. Groups part of a regular expression so that a quantifier or | can be applied to it.
2 Backreferences to what an earlier () captured:
1. Backreference by group number
Every pair of parentheses that does not use ?: is assigned a group number, starting from 1 and counting opening parentheses from left to right. \i refers to the text captured by group i.
2. Backreference by group name
Writing ?P<name> immediately after the opening parenthesis gives the group a name (the name goes inside the angle brackets), and (?P=name) later in the pattern refers to the text that group captured. For example, (?P<word>\w+)\s+(?P=word) matches a repeated word.
3 Note:
A backreference cannot be used inside a character class [].
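As a sketch of both kinds of backreference, the snippet below finds doubled words in an illustrative sentence. The \b anchors are an addition here, used so that a match cannot start in the middle of a word.

```python
import re

text = "the the cat sat sat"

# Backreference by number: \1 must repeat whatever group 1 captured.
by_number = re.findall(r"\b(\w+)\s+\1\b", text)

# Backreference by name: (?P=word) repeats the group named "word".
named = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
by_name = [m.group("word") for m in named.finditer(text)]

# (?:...) groups without capturing, so only (\d) gets a group number.
non_capturing = re.compile(r"(?:ab)+(\d)")
group_count = non_capturing.groups
captured = non_capturing.findall("abab1 ab2")

print(by_number)    # ['the', 'sat']
print(by_name)      # ['the', 'sat']
print(group_count)  # 1
print(captured)     # ['1', '2']
```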

　　1.4 Assertions and tokens
An assertion does not consume any text; it only constrains the position at which it appears.
1 Common assertions:
1. \b matches at a word boundary; inside a character class [] it means the backspace character
2. \B matches where there is no word boundary; like \b, it is affected by the re.ASCII flag
3. \A matches at the start of the string
4. ^ matches at the start of the string; with the re.MULTILINE flag it also matches after each newline
5. \Z matches at the end of the string
6. $ matches at the end of the string (or just before a trailing newline); with the re.MULTILINE flag it also matches before each newline
7. (?=e) positive lookahead
8. (?!e) negative lookahead
9. (?<=e) positive lookbehind
10. (?<!e) negative lookbehind
2 Lookahead and lookbehind explained
Positive lookahead: exp1(?=exp2) matches exp1 only if the text after it matches exp2
Negative lookahead: exp1(?!exp2) matches exp1 only if the text after it does not match exp2
Positive lookbehind: (?<=exp2)exp1 matches exp1 only if the text before it matches exp2
Negative lookbehind: (?<!exp2)exp1 matches exp1 only if the text before it does not match exp2
For example, to find hello only where it is followed by world, the regular expression can be written as "(hello)\s+(?=world)". Applied to "hello wangqing" and "hello world", it matches only the hello (and the whitespace after it) in the latter; world itself is not part of the match.
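The four lookaround forms can be tried out as follows. This is a minimal sketch; the hello pattern comes from the example above, while the price strings are invented for illustration.

```python
import re

# Positive lookahead: hello must be followed by world,
# but world is not consumed by the match.
lookahead = re.compile(r"(hello)\s+(?=world)")
miss = lookahead.search("hello wangqing")
hit = lookahead.search("hello world")
hit_text = hit.group()   # 'hello ' (the match ends before world)
hit_end = hit.end()      # 6

# Negative lookahead: hello NOT followed by world.
neg_ahead = [m.start() for m in
             re.finditer(r"hello(?!\s+world)", "hello world, hello wangqing")]

# Positive / negative lookbehind: digits that are / are not preceded by $.
priced = re.findall(r"(?<=\$)\d+", "cost $30 or 40")
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $30 or 40")

print(miss)       # None
print(neg_ahead)  # [13]
print(priced)     # ['30']
print(unpriced)   # ['40']
```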

　　1.5 Conditional matching
(?(id)yes_exp|no_exp): if the group with number or name id matched something, yes_exp must match at this point; otherwise no_exp must match. The |no_exp part is optional.
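A conditional pattern is easier to follow with a concrete case. The sketch below uses a deliberately simplified e-mail pattern (not a real address validator): an address may be wrapped in angle brackets, but only if both brackets are present.

```python
import re

# Group 1 is the optional "<". If it matched, ">" is required (yes branch);
# otherwise the address must run to the end of the string (no branch).
address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",   # both brackets: accepted
    "user@host.com",     # no brackets: accepted
    "<user@host.com",    # missing closing bracket: rejected
    "user@host.com>",    # stray closing bracket: rejected
]
accepted = [c for c in candidates if address.match(c)]

print(accepted)  # ['<user@host.com>', 'user@host.com']
```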

　　1.6 Flags for regular expressions
1. There are two ways to set regular expression flags
1. Pass them as the flags argument of re.compile, combining multiple flags with |, as in re.compile(r"#[\da-f]{6}\b", re.IGNORECASE|re.MULTILINE)
2. Put (?flags) at the very start of the regular expression itself, as in (?ms)#[\da-z]{6}\b
2. Common flags
re.A or re.ASCII: \b, \B, \s, \S, \w, \W, \d and \D match on the assumption that the text is ASCII
re.I or re.IGNORECASE: ignore case when matching
re.M or re.MULTILINE: ^ also matches after each newline and $ also matches before each newline
re.S or re.DOTALL: . matches any character, including a newline
re.X or re.VERBOSE: the regular expression may span several lines and contain comments. Whitespace in the pattern is ignored in this mode, so literal whitespace has to be written as \s or inside a character class such as [ ]. For example:
re.compile(r"""
    [^>]*?                        # attributes that precede src
    src=                          # start of the src attribute
    (?:
        (?P<quote>["'])           # opening quote
        (?P<image_name>[^"'>]+?)  # image name; \1 is not a backreference inside []
        (?P=quote)                # closing quote, the same character as the opening one
    )
    """, re.VERBOSE|re.IGNORECASE)

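The flags described in section 1.6 can be exercised with a short sketch. The input strings and variable names are illustrative, and the verbose pattern is a cut-down src-attribute matcher rather than a full HTML parser.

```python
import re

text = "fg: #FFcc00\nbg: #00ff7f"

# Flags passed to re.compile, combined with |
by_argument = re.compile(r"#[\da-f]{6}\b", re.IGNORECASE | re.MULTILINE)
# The same flags written inline at the start of the pattern
inline = re.compile(r"(?im)#[\da-f]{6}\b")
colors = by_argument.findall(text)
same = colors == inline.findall(text)

# re.MULTILINE: ^ matches at the start of every line, not only the first.
first_words = re.findall(r"^\w+", text, re.MULTILINE)
first_word_only = re.findall(r"^\w+", text)

# re.VERBOSE: whitespace and # comments inside the pattern are ignored.
img_src = re.compile(r"""
    src=                          # start of the src attribute
    (?P<quote>["'])               # opening quote
    (?P<image_name>[^"'>]+?)      # image file name
    (?P=quote)                    # closing quote, same as the opening one
    """, re.VERBOSE | re.IGNORECASE)
image_name = img_src.search('<img alt="logo" SRC="logo.png">').group("image_name")

print(colors, same)                  # ['#FFcc00', '#00ff7f'] True
print(first_words, first_word_only)  # ['fg', 'bg'] ['fg']
print(image_name)                    # logo.png
```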
2. Python regular expression module
　　2.1 The four main uses of regular expressions on strings
1. Match: check whether a string conforms to a regular expression; the result is usually treated as true or false
2. Extract: pull the text that matches a regular expression out of a string
3. Replace: find the text in a string that matches a regular expression and replace it with another string
4. Split: split a string at the places where a regular expression matches
　　2.2 Two ways to use regular expressions with the re module
1. Call re.compile(r, f) to create a regular expression object, then call methods on that object. The benefit is that the compiled object can be reused many times.
2. Call the module-level functions. For each method of a regular expression object, the re module has a function of the same name whose first argument is the regular expression string. This is convenient for a regular expression that is used only once.
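The two styles give the same results. The following sketch runs one pattern over a few illustrative lines both ways, and also shows the usual way to turn a search result into a true/false answer.

```python
import re

lines = ["alpha 1", "Beta 2", "42"]

# Way 1: compile once, then reuse the pattern object for every line.
word = re.compile(r"[a-z]+", re.IGNORECASE)
compiled_hits = [word.findall(line) for line in lines]

# Way 2: module-level function; the pattern string is the first argument.
module_hits = [re.findall(r"[a-z]+", line, re.IGNORECASE) for line in lines]

# search returns a match object or None, so compare the result with None
# to get a boolean.
has_word = [word.search(line) is not None for line in lines]

print(compiled_hits)  # [['alpha'], ['Beta'], []]
print(module_hits == compiled_hits)  # True
print(has_word)       # [True, True, False]
```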
　　 2.3 Common methods for regular expression objects
1. rx.findall(s, start, end):
Returns a list of all matches. If the regular expression has no groups, each element is the text of a whole match.
If the regular expression has one group, each element is the text that group captured; if it has several groups, each element is a tuple of the captured texts. In both cases the text of the whole match is not returned.
2. rx.finditer(s, start, end):
Returns an iterator.
Each iteration yields a match object. Call its group() method to see what a given group matched; group 0 is the whole match.
3. rx.search(s, start, end):
Returns a match object for the first match found anywhere in the string, or None if there is no match.
search stops at the first match and does not look for further ones.
4. rx.match(s, start, end):
Returns a match object if the regular expression matches at the start of the string, otherwise None.
5. rx.sub(x, s, m):
Returns a new string in which every match is replaced with x; if m is given, at most m replacements are made. Within x, \i or \g<id> refers to captured content, where id is a group number or group name.
x can also be a function, both here and in the module function re.sub(r, x, s, m). The function is called with the match object for each match and returns the replacement text, so captured content can be processed before it is substituted.
6. rx.subn(x, s, m):
Same as sub(), except that it returns a two-tuple: the resulting string and the number of replacements made.
7. rx.split(s, m):
Returns a list.
Splits the string at each place the regular expression matches; if m is given, at most m splits are made.
If the regular expression contains groups, the text captured by the groups is inserted into the list between the pieces, for example:
rx = re.compile(r"(\d)[a-z]+(\d)")
s = "ab12dk3klj8jk9jks5"
result = rx.split(s)
# result is ['ab1', '2', '3', 'klj', '8', '9', 'jks5']
8. rx.flags: the flags that were set when the regular expression was compiled (an attribute, not a method)
9. rx.pattern: the pattern string the regular expression was compiled from (an attribute, not a method)
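The methods listed above can be compared on one small input. This is a sketch with an invented key=value string; the helper function double is hypothetical and exists only to show a function used as the replacement.

```python
import re

rx = re.compile(r"(\w+)=(\d+)")
s = "a=1, b=22, c=333"

pairs = rx.findall(s)                         # tuples, because there are 2 groups
whole = [m.group(0) for m in rx.finditer(s)]  # whole matches via match objects
first = rx.search(s).group()                  # first match only
anchored = rx.match(s, 5).group()             # must match exactly at index 5

swapped = rx.sub(r"\2=\1", s)                 # backreferences in the replacement
only_one = rx.sub(r"\g<1>=0", s, count=1)     # at most one replacement

def double(m):
    # Hypothetical helper: receives the match object, returns the new text.
    return f"{m.group(1)}={int(m.group(2)) * 2}"

doubled = rx.sub(double, s)
replaced, n = rx.subn("X", s)                 # (new string, replacement count)

print(pairs)     # [('a', '1'), ('b', '22'), ('c', '333')]
print(whole)     # ['a=1', 'b=22', 'c=333']
print(swapped)   # 1=a, 22=b, 333=c
print(doubled)   # a=2, b=44, c=666
print(replaced, n)  # X, X, X 3
```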
　 2.4 Properties and methods of matching objects
m.group(g, ...)
Returns the text matched by the group with the given number or name. With no argument, or with 0, it returns the whole match; with several arguments it returns a tuple.
m.groupdict(default)
Returns a dictionary whose keys are the names of all named groups and whose values are the texts those groups captured.
If default is given, it is used as the value for groups that did not take part in the match.
m.groups(default)
Returns a tuple of the texts captured by all groups, from group 1 onward. If default is given, it is used for groups that did not capture anything.
m.lastgroup
The name of the last matched capturing group, or None if that group has no name or no group matched (rarely used). This is an attribute, not a method.
m.lastindex
The number of the last matched capturing group, or None if no group matched. This is an attribute, not a method.
m.start(g)
The index in the string at which the match of group g starts; returns -1 if the group did not take part in the match.
m.end(g)
The index in the string at which the match of group g ends; returns -1 if the group did not take part in the match.
m.span(g)
Returns the two-tuple (m.start(g), m.end(g))
m.re
The regular expression object that produced this match object (attribute)
m.string
The string that was passed to match or search (attribute)
m.pos
The position at which the search started: the start of the string, or the position given by start (attribute, rarely used)
m.endpos
The position at which the search ended: the end of the string, or the position given by end (attribute, rarely used)
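A short sketch of the match object members, using an invented input in which the optional value group does not take part in the match.

```python
import re

m = re.search(r"(?P<key>[a-z]+)=(?P<value>\d+)?", "x  size=;")

whole = m.group()                    # 'size=' (same as m.group(0))
both = m.group("key", "value")       # ('size', None): value did not match
filled = m.groups("0")               # ('size', '0'): default fills the gap
as_dict = m.groupdict(default="0")   # {'key': 'size', 'value': '0'}

span = m.span()                      # (3, 8)
key_span = (m.start("key"), m.end("key"))  # (3, 7)
missing = m.start("value")           # -1: the group did not participate

last = (m.lastindex, m.lastgroup)    # (1, 'key'): attributes, not methods
window = (m.pos, m.endpos)           # (0, 9): whole string was searched

print(whole, both, filled)
print(as_dict, span, key_span, missing)
print(last, window, m.re.pattern, repr(m.string))
```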
　　 2.5 Summary
1. Python has no regular expression method that returns True or False directly; instead, test whether the value returned by match or search is None.
2. To find text with a regular expression: when only the first match is needed, use the match object returned by search or match; when all matches are needed, iterate over the iterator returned by finditer.
3. To replace text, use the sub or subn method of a regular expression object, or the module functions re.sub and re.subn. In both cases the replacement can be a string or a function that generates the replacement text.
4. To split text, use the split method of a regular expression object; note that if the regular expression contains groups, the text captured by the groups is also placed in the returned list.

Original blog post; if you reproduce it, please credit the source.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
