Regular expressions and the Python re module

Last Update:2016-10-19 Source: Internet

Author: User


1. Regular expression syntax

Character	Description
\	Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a line break. The sequence "\\" matches "\", and "\(" matches "(".
^	Matches the starting position of the input string. If multiline mode is enabled (the Multiline property of a RegExp object; re.MULTILINE in Python), ^ also matches the position immediately after a line break.
$	Matches the end position of the input string. If multiline mode is enabled (the Multiline property of a RegExp object; re.MULTILINE in Python), $ also matches the position immediately before a line break.
^ and $	Used as a pair, they require the entire string to match the pattern, rather than just a substring of it. Example: ^\s*$ matches a blank line.
*	Matches the preceding character or subexpression zero or more times. For example, "zo*" matches "z" and "zoo". * is equivalent to {0,}.
+	Matches the preceding character or subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but does not match "z". + is equivalent to {1,}.
?	Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches "do" and "does". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two "o"s in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the "o"s in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
{n,m}	m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three "o"s in "fooooood". "o{0,1}" is equivalent to "o?". Note: you cannot insert a space between the comma and the numbers.
?	When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is "non-greedy". A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, in the string "oooo", "o+?" matches only a single "o", while "o+" matches all the "o"s.
.	Matches any single character except "\n". To match any character including "\n", use a pattern such as "[\s\S]".
(pattern)	Matches pattern and captures the matched subexpression. Captured matches can be retrieved from the resulting match collection with the $0...$9 properties (in Python, with the match object's group() method). To match parenthesis characters, use "\(" or "\)".
(?:pattern)	A subexpression that matches pattern but does not capture the match; that is, it is a non-capturing match that is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, "industr(?:y|ies)" is a more economical expression than "industry|industries".
(?=pattern)	A subexpression that performs a positive lookahead: it matches the search string at any position where a string matching pattern begins. It is a non-capturing match, that is, the match is not stored for later use. For example, "Windows (?=95|98|NT|2000)" matches "Windows" in "Windows 2000", but does not match "Windows" in "Windows 3.1". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.
(?!pattern)	A subexpression that performs a negative lookahead: it matches the search string at any position where a string matching pattern does not begin. It is a non-capturing match, that is, the match is not stored for later use. For example, "Windows (?!95|98|NT|2000)" matches "Windows" in "Windows 3.1", but does not match "Windows" in "Windows 2000". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.
x|y	Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]	A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]	A negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches the "p" in "plain".
[a-z]	A character range. Matches any character within the specified range. For example, "[a-z]" matches any lowercase letter in the range "a" to "z".
[^a-z]	A negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".
\b	Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never", but does not match the "er" in "verb".
\B	Matches a non-word boundary. "er\B" matches the "er" in "verb", but does not match the "er" in "never".
\cx	Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z. If it is not, c is assumed to be the literal "c" character. (Python's re module does not support \cx; it applies to other regex flavors.)
\d	Matches a digit character. Equivalent to [0-9].
\D	Matches a non-digit character. Equivalent to [^0-9].
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a line feed. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.
\w	Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]".
\W	Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\xn	Matches n, where n is a hexadecimal escape code. The hexadecimal escape code must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num	Matches num, where num is a positive integer. This is a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n	Identifies either an octal escape code or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape code.
\nm	Identifies either an octal escape code or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal character m. If neither condition holds, \nm matches the octal value nm, where n and m are octal digits (0-7).
\nml	When n is an octal digit (0-3) and m and l are octal digits (0-7), matches the octal escape code nml.
\un	Matches n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
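
A few of the table entries can be tried directly in Python. The following is a minimal sketch using the table's own sample strings; the variable names are only illustrative.

```python
import re

# Greedy vs. non-greedy quantifiers on the string "oooo".
greedy = re.search(r"o+", "oooo").group()    # takes as many "o"s as possible
lazy = re.search(r"o+?", "oooo").group()     # takes as few "o"s as possible

# Non-capturing group combined with alternation.
words = re.findall(r"industr(?:y|ies)", "industry and industries")

# Lookahead checks what follows without consuming it.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")

# Backreference: two consecutive identical characters.
doubled = re.search(r"(.)\1", "food").group()
```

Here positive matches only the text "Windows " (the version number is checked but not included), while negative is None because the lookahead rejects "2000".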

2. The difference between match and search

Python provides two different primitive operations: match and search. match checks for a match only at the beginning of the string, while search (the Perl default) checks for a match anywhere in the string.

Note: When the regular expression starts with '^', search behaves the same as match (unless the MULTILINE flag is set, in which case search can also match at the start of any line). match succeeds only if the pattern matches at the start of the string, or at the position given by the pos parameter of a compiled pattern's match() method.
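
The difference is easiest to see side by side. This is a small sketch with a made-up sample string:

```python
import re

text = "say hello"

# match() only tries at the start of the string, so this fails.
at_start = re.match(r"hello", text)

# search() scans forward and finds the first occurrence anywhere.
anywhere = re.search(r"hello", text)

# A compiled pattern's match() accepts a pos argument: matching starts there.
from_pos = re.compile(r"hello").match(text, 4)

# With a leading "^", search() is anchored to the start as well.
anchored = re.search(r"^hello", text)
```

at_start and anchored are both None, while anywhere and from_pos both find "hello" at index 4.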

3. Module contents

re.compile(pattern, flags=0)

Compiles a regular expression and returns a RegexObject, whose match() and search() methods can then be called.

prog = re.compile(pattern)

result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

The first form lets the compiled regular expression be reused.
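
As a sketch of that reuse, a pattern can be compiled once and applied to many inputs. The pattern and the sample lines here are invented for illustration.

```python
import re

# Compile once, then reuse the pattern object for every input line.
number = re.compile(r"\d+")

lines = ["order 66", "no digits here", "route 101"]
found = [m.group() for m in map(number.search, lines) if m]

# The compiled form and the module-level function give the same result.
same = number.match("42abc").group() == re.match(r"\d+", "42abc").group()
```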

re.search(pattern, string, flags=0)

Scans through the string looking for a location where the regular expression matches. Returns a match object (_sre.SRE_Match), or None if there is no match.

re.match(pattern, string, flags=0)

Checks whether the beginning of the string matches the regular expression. Returns a match object (_sre.SRE_Match), or None if there is no match.

re.split(pattern, string, maxsplit=0, flags=0)

Splits the string by occurrences of the regular expression. If the regular expression is enclosed in capturing parentheses, the matched separators are also returned in the list. maxsplit is the maximum number of splits: with maxsplit=1 the string is split at most once, and the default of 0 means there is no limit.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

Note: I was using Python 2.6, and reading the source code showed that split() has no flags parameter there; it was only added in 2.7. I have run into this kind of mismatch between the official documentation and the source more than once. If you see unexpected behavior, check the source code to find the reason.

If the separator matches at the beginning or end of the string, the returned list will begin or end with an empty string.

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

If the pattern does not match, a list containing the entire string is returned.

>>> re.split("a", "bbb")
['bbb']

re.findall(pattern, string, flags=0)

Finds all substrings matched by the RE and returns them as a list. Matches are returned in the order found, scanning from left to right. If there is no match, an empty list is returned.

>>> re.findall("a", "bcdef")
[]

>>> re.findall(r"\d+", "12a32bc43jf3")
['12', '32', '43', '3']

re.finditer(pattern, string, flags=0)

Finds all substrings matched by the RE and returns them as an iterator of match objects. Matches are returned in the order found, scanning from left to right. If there is no match, the iterator yields nothing.

>>> it = re.finditer(r"\d+", "12a32bc43jf3")
>>> for match in it:
...     print(match.group())
...
12
32
43
3

re.sub(pattern, repl, string, count=0, flags=0)

Finds all substrings matched by the RE and replaces them with a different string. The optional count argument is the maximum number of pattern occurrences to replace; it must be a non-negative integer. The default of 0 replaces all occurrences. If there is no match, the string is returned unchanged.

re.subn(pattern, repl, string, count=0, flags=0)

Same as re.sub(), but returns a 2-tuple containing the new string and the number of substitutions made.
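
A short sketch of both functions, using an arbitrary date string as sample input:

```python
import re

text = "2016-10-19"

all_replaced = re.sub(r"-", "/", text)           # every "-" is replaced
first_only = re.sub(r"-", "/", text, count=1)    # at most one replacement
new_text, n = re.subn(r"-", "/", text)           # also reports the count
untouched = re.sub(r"x", "/", text)              # no match: string unchanged
```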

re.escape(string)

Escapes the non-alphanumeric characters in a string, so that the result can be used as a literal inside a pattern.

re.purge()

Clears the regular expression cache.
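
The following sketch shows why escaping matters when the text to find contains metacharacters. The sample strings are invented, and the exact escaped form that re.escape() produces can differ between Python versions, so only the matching behavior is relied on here.

```python
import re

literal = "1+1=2?"                 # contains the metacharacters "+" and "?"
haystack = "is 1+1=2? yes"

# Escaped: the metacharacters are treated as ordinary characters.
hit = re.search(re.escape(literal), haystack)

# Unescaped: "1+1=2?" means one or more "1", then "1=", then an optional "2".
miss = re.search(literal, haystack)
```

hit finds the literal text, while miss is None because the unescaped pattern needs two consecutive "1" characters before "=".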

4. Regular expression objects

re.RegexObject

re.compile() returns a RegexObject.

re.MatchObject

group() returns the string matched by the RE

start() returns the position where the match starts

end() returns the position where the match ends

span() returns a tuple containing the (start, end) positions of the match
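
These four methods can be seen together in a minimal sketch; the sample string is arbitrary.

```python
import re

m = re.search(r"\d+", "abc123def")

matched = m.group()    # the matched text
begin = m.start()      # index of the first matched character
finish = m.end()       # index just past the last matched character
bounds = m.span()      # (start, end) as a tuple
```

Note that end() is exclusive, so "abc123def"[begin:finish] gives back the matched text.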

5. Compile flags

Compile flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE, and a short, one-letter form such as I. (If you are familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M, for example, sets both the I and M flags:

I
IGNORECASE

Performs case-insensitive matching; character classes and literal strings match letters regardless of case. For example, [A-Z] will also match lowercase letters, and Spam will match "Spam", "spam", or "spAM". This lowercasing does not take the current locale into account.

L
LOCALE

Makes \w, \W, \b, and \B dependent on the current locale.

Locales are a feature of the C library intended to help in writing programs that take language differences into account. For example, if you are processing French text, you would want to write \w+ to match words, but \w only matches the character class [A-Za-z]; it does not match "é" or "ç". If your system is configured properly and a French locale is selected, certain C functions tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression causes the resulting compiled object to use these C functions for \w; this is slower, but it also lets \w+ match French words as you would expect.

M
MULTILINE

(^ and $ are described in the syntax table in section 1.)

Usually "^" matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the line break (if any) at the end of the string. When this flag is specified, "^" matches at the beginning of the string and at the beginning of each line within the string. Similarly, the $ metacharacter matches at the end of the string and at the end of each line (immediately before each line break).
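
The effect of the flag can be sketched with a three-line sample string (invented for illustration):

```python
import re

text = "first\nsecond\nthird"

# Without the flag, ^ and $ anchor only to the whole string.
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# With re.MULTILINE, they anchor to every line.
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)
ends_multi = re.findall(r"\w+$", text, re.M)
```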

S
DOTALL

Makes the "." special character match any character at all, including a line break; without this flag, "." matches any character except a line break.

X
VERBOSE

This flag lets you write regular expressions that are easier to read by giving you more flexibility in how you format them. When this flag is specified, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. It also lets you put comments within the RE that are ignored by the engine; a comment is marked by a "#" that is neither in a character class nor preceded by an unescaped backslash.
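
As a rough sketch of how the flags combine, the example below OR-s VERBOSE with IGNORECASE and then shows DOTALL separately. The "timeout = value" pattern and the sample inputs are hypothetical, chosen only to illustrate the flags.

```python
import re

# Hypothetical pattern for a "timeout = <number>" setting, written with
# re.VERBOSE so it can be laid out over several lines with comments.
setting = re.compile(r"""
    timeout     # the literal key; re.I makes it case-insensitive
    \s*=\s*     # equals sign with optional surrounding whitespace
    (\d+)       # the value: one or more digits
""", re.VERBOSE | re.IGNORECASE)

m = setting.search("Timeout = 30")
value = m.group(1)

# DOTALL lets "." match a line break as well.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
```

The spaces and comments inside the triple-quoted pattern are ignored by the engine, so only "timeout", the "=" part, and the digit group take part in the match.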


