Python re Module

Last Update:2018-12-07 Source: Internet

Author: User

Tags: character classes, uppercase letter

Regular expressions (also called REs, regexes, or regex patterns) are essentially a small, highly specialized programming language embedded inside Python and made available through the re module. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. The following describes the regular expression syntax.
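To make the compile step concrete, here is a minimal sketch using re.compile. The pattern and the sample string are made up for illustration; the point is only that a compiled pattern object can be reused and gives the same results as the module-level functions.

```python
import re

# Compile once, then reuse the pattern object (pattern and text are illustrative).
word_pattern = re.compile(r"[a-z]+")

first = word_pattern.search("123 abc def")
all_words = word_pattern.findall("123 abc def")
print(first.group())  # abc
print(all_words)      # ['abc', 'def']

# The module-level function compiles the pattern behind the scenes.
same = re.findall(r"[a-z]+", "123 abc def")
print(same == all_words)  # True
```

Compiling explicitly is mostly a convenience when the same pattern is used many times.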

Regular expressions use the following metacharacters: . ^ $ * + ? { } [ ] \ | ( )

1. The metacharacters [ and ] specify a character class, which is the set of characters you want to match. Characters can be listed individually, or a range can be given by separating two characters with "-". For example, [abc] matches any of the characters a, b, or c, and can also be written as the range [a-c]. To match a single uppercase letter, use [A-Z].

Metacharacters are not active inside a character class. For example, [akm$] matches any of the characters "a", "k", "m", or "$". "$" is normally a metacharacter, but inside a character class it is an ordinary character.
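The character class rules above can be checked with a short sketch. The sample strings are invented for illustration.

```python
import re

# Listing characters one by one and using a range give the same result.
explicit = re.findall(r"[abc]", "a1b2c3d4")
ranged = re.findall(r"[a-c]", "a1b2c3d4")
print(explicit)  # ['a', 'b', 'c']
print(ranged)    # ['a', 'b', 'c']

# [A-Z] matches a single uppercase letter.
upper = re.findall(r"[A-Z]", "Hello World")
print(upper)     # ['H', 'W']

# Inside a class, "$" is an ordinary character.
dollar = re.findall(r"[akm$]", "make $5")
print(dollar)    # ['m', 'a', 'k', '$']
```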

2. The metacharacters [^]. You can match the characters not listed in a class by complementing the set: put "^" as the first character of the class. Anywhere else inside the class, "^" simply matches the character "^" itself. For example, [^5] matches any character except "5". Outside [], the metacharacter ^ matches the start of a string. For example, "^ab+" matches a string that starts with "a" followed by one or more "b" characters.

For example:

>>> import re
>>> m = re.search("^ab+", "asdfabbbb")
>>> print(m)
None
>>> m = re.search("ab+", "asdfabbbb")
>>> print(m)
<_sre.SRE_Match object at 0x011b1988>
>>> print(m.group())
abbbb

In the example above, re.match cannot be used for this test. Because match always anchors at the start of the string, it cannot show whether it is the metacharacter "^" that ties the pattern to the start position.

>>> m = re.match("^ab+", "asdfabbbb")
>>> print(m)
None
>>> m = re.match("ab+", "asdfabbbb")
>>> print(m)
None

# Check what "^" means at different positions inside [].
>>> re.search("[^abc]", "abcd")  # "^" as the first character negates the class: any character other than a, b, or c.
<_sre.SRE_Match object at 0x011b19f8>
>>> m = re.search("[^abc]", "abcd")
>>> m.group()
'd'
>>> m = re.search("[abc^]", "^")  # When "^" is not the first character in [], it is an ordinary character.
>>> m.group()
'^'

However, I have a question about the metacharacter "^". The official documentation (http://docs.python.org/library/re.html) says: "'^' (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline."

My understanding is that "^" matches the start of the string and, in multiline mode, also matches right after each line break.

>>> m = re.search(r"^a\w+", "abcdfa\na1b2c3")
>>> m.group()
'abcdfa'
>>> m = re.search(r"^a\w+", "abcdfa\na1b2c3", re.MULTILINE)
>>> m.group()
'abcdfa'

Since the flag is re.MULTILINE, I expected from the documentation above that the second line would also be matched, so that m.group() would include "a1b2c3", but it does not. Trying findall does return both lines. So my understanding is that group() lacks the second line because search and match return as soon as the first match is found, rather than finding all matches.

>>> m = re.findall(r"^a\w+", "abcdfa\na1b2c3", re.MULTILINE)
>>> m
['abcdfa', 'a1b2c3']
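A small sketch may help separate the two effects seen here. It assumes the same sample string as above and uses re.finditer to list the positions where "^" can match. In MULTILINE mode "^" matches at a position just after a newline, not the newline character itself, and search still reports only the first match.

```python
import re

text = "abcdfa\na1b2c3"

# Positions where "^" matches: only index 0 by default ...
default_starts = [m.start() for m in re.finditer(r"^", text)]
# ... and also the position right after the newline in MULTILINE mode.
multiline_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
print(default_starts)    # [0]
print(multiline_starts)  # [0, 7]

# search() stops at the first match; findall() collects every match.
first = re.search(r"^a\w+", text, re.MULTILINE).group()
every = re.findall(r"^a\w+", text, re.MULTILINE)
print(first)  # abcdfa
print(every)  # ['abcdfa', 'a1b2c3']
```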

3. The metacharacter \ (backslash). As in Python string literals, the backslash can be followed by various characters to signal special sequences.

It is also used to escape metacharacters so that you can match them literally in a pattern. For example, to match the character "[" or "\", put a backslash in front of it to remove its special meaning: \[ or \\
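As a rough illustration of escaping, the sketch below matches a literal "[" and a literal backslash, and uses re.escape to escape a whole string at once. The sample strings are assumptions chosen for the demonstration. Raw strings (r"...") keep Python from interpreting the backslashes before the re module sees them.

```python
import re

# "\[" matches a literal "[".
bracket = re.search(r"\[", "a[b")

# r"\\" is the two-character pattern \\ and matches one literal backslash.
backslash = re.search(r"\\", "a\\b")

# re.escape() escapes every metacharacter in a string for you.
literal = "1+1=2 (yes)"
found = re.search(re.escape(literal), "so 1+1=2 (yes) indeed")

print(bracket.group() == "[")      # True
print(backslash.group() == "\\")   # True
print(found.group() == literal)    # True
```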

4. The metacharacter $ matches at the end of a string, or just before the newline at the end of a string. (In MULTILINE mode, "$" also matches before each newline.)

The regular expression "foo" matches both "foo" and "foobar", while "foo$" matches only "foo".

>>> re.findall("foo.$", "foo1\nfoo2\n")  # "$" matches just before the newline at the end of the string.
['foo2']

>>> re.findall("foo.$", "foo1\nfoo2\n", re.MULTILINE)
['foo1', 'foo2']

>>> m = re.search("foo.$", "foo1\nfoo2\n")
>>> m
<_sre.SRE_Match object at 0x00a27170>
>>> m.group()
'foo2'
>>> m = re.search("foo.$", "foo1\nfoo2\n", re.MULTILINE)
>>> m.group()
'foo1'

So re.MULTILINE has a significant effect on how "$" matches.

5. The metacharacter * matches zero or more repetitions of the preceding expression.

6. The metacharacter ? matches zero or one repetition of the preceding expression.

7. The metacharacter + matches one or more repetitions of the preceding expression.

8. The metacharacter | means "or". For example, in A|B, where A and B are regular expressions, the pattern matches either A or B.
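Items 5 to 8 have no examples of their own, so here is a short sketch comparing them on one made-up string.

```python
import re

text = "a ab abbb"

star = re.findall(r"ab*", text)      # "a" followed by zero or more "b"
plus = re.findall(r"ab+", text)      # "a" followed by one or more "b"
optional = re.findall(r"ab?", text)  # "a" followed by zero or one "b"
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab', 'ab']

# "|" matches either alternative.
either = re.findall(r"cat|dog", "cat, dog, bird")
print(either)    # ['cat', 'dog']
```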

9. The metacharacters {}

{m} specifies exactly m copies of the preceding regular expression. For example, "a{5}" matches exactly five "a" characters, that is, "aaaaa".

>>> re.findall("a{5}", "aaaaaaaaaa")
['aaaaa', 'aaaaa']
>>> re.findall("a{5}", "aaaaaaaaa")
['aaaaa']

{m,n} matches from m to n copies of the preceding regular expression, and tries to match as many copies as possible. Note that there must be no space after the comma.

>>> re.findall("a{2,4}", "aaaaaaaa")
['aaaa', 'aaaa']

The example above shows that {m,n} tries n copies first rather than m: otherwise the result would be ['aa', 'aa', 'aa', 'aa'].

>>> re.findall("a{2}", "aaaaaaaa")
['aa', 'aa', 'aa', 'aa']

{m,n}? matches from m to n copies of the preceding regular expression, and tries to match as few copies as possible.

>>> re.findall("a{2,4}?", "aaaaaaaa")
['aa', 'aa', 'aa', 'aa']

10. The metacharacters ( and ) mark the start and end of a group.

The commonly used forms are (res) and (?P<name>res), where res is a regular expression: the first is an unnamed group and the second is a named group. A named group can be retrieved with matchobject.group(name). An unnamed group can be retrieved by its group number, counting from 1, for example matchobject.group(1). Both are illustrated in the description of the group() method below.

11. The metacharacter .

In the default mode, "." matches any character except a newline. In DOTALL mode, it matches any character, including a newline.

>>> import re
>>> re.match(".", "\n")
>>> m = re.match(".", "\n")
>>> print(m)
None
>>> m = re.match(".", "\n", re.DOTALL)
>>> print(m)
<_sre.SRE_Match object at 0x00c2ce20>
>>> m.group()
'\n'

Next, let's look at the methods of match objects. Below is a brief introduction to several common ones.

1. group([group1, ...])

Returns one or more subgroups of the match. With a single argument, the result is a single string; with multiple arguments, the result is a tuple with one item per argument. group1 defaults to 0, which returns the whole match. If a groupN argument is 0, the corresponding return value is the entire matching string; if it is in the range [1..99], it is the string matched by the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned. Groups are numbered from left to right by their opening parenthesis.

>>> m = re.match(r"(\w+) (\w+)", "abcd efgh, chaj")
>>> m.group()  # the whole match
'abcd efgh'
>>> m.group(1)  # subgroup of the first pair of parentheses
'abcd'
>>> m.group(2)
'efgh'
>>> m.group(1, 2)  # multiple arguments return a tuple
('abcd', 'efgh')

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "sam lee")
>>> m.group("first_name")  # use group() to get a named subgroup
'sam'
>>> m.group("last_name")
'lee'

Now remove the parentheses:

>>> m = re.match(r"\w+ \w+", "abcd efgh, chaj")
>>> m.group()
'abcd efgh'
>>> m.group(1)
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    m.group(1)
IndexError: no such group

If a group matches multiple times, only the last match is accessible:


>>> m = re.match(r"(..)+", "a1b2c3")
>>> m.group(1)
'c3'
>>> m.group()
'a1b2c3'

The argument of group() defaults to 0, which returns the whole string matched by the pattern.

>>> s = "afkak1aafal12345adadsfa"
>>> pattern = r"(\d)\w+(\d{2})\w"
>>> m = re.match(pattern, s)
>>> print(m)
None
>>> m = re.search(pattern, s)
>>> m
<_sre.SRE_Match object at 0x00c2fda0>
>>> m.group()
'1aafal12345a'
>>> m.group(1)
'1'
>>> m.group(2)
'45'
>>> m.group(1, 2, 0)
('1', '45', '1aafal12345a')

2. groups([default])

Returns a tuple containing all the subgroups of the match. The default argument is used for groups that did not participate in the match; it defaults to None.

>>> m = re.match(r"(\d+)\.(\d+)", "23.123")
>>> m.groups()
('23', '123')
>>> m = re.match(r"(\d+)\.?(\d+)?", "24")  # The second (\d+) does not match, so the default value None is used.
>>> m.groups()
('24', None)
>>> m.groups("0")
('24', '0')

3. groupdict([default])

Returns a dictionary containing all the named subgroups of the match, keyed by subgroup name, with the matched text as the value. The default argument is used for groups that did not participate in the match; as with groups(), it defaults to None.

>>> m = re.match(r"(\w+) (\w+)", "hello world")
>>> m.groupdict()
{}
>>> m = re.match(r"(?P<first>\w+) (?P<second>\w+)", "hello world")
>>> m.groupdict()
{'first': 'hello', 'second': 'world'}

As the example shows, groupdict() ignores subgroups that have no name.

Regular expression object

search(string[, pos[, endpos]])

This is a method of a compiled regular expression object (the result of re.compile). It scans through the string looking for a location where the regular expression matches. If a match is found, a match object is returned (for the first match only, not for all of them). If nothing matches, None is returned.

The second parameter, pos, is the index in the string where the search starts. The default value is 0.

The third parameter, endpos, limits how far the string is searched. The default value is the length of the string.

The module-level function re.search(pattern, string), used in the example below, does not take pos or endpos.

>>> m = re.search("abcd", '1abcd2abcd')
>>> m.group()  # search returns a match object for the first match; its methods give the details.
'abcd'
>>> m.start()
1
>>> m.end()
5

>>> re.findall("abcd", "1abcd2abcd")
['abcd', 'abcd']
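The pos and endpos parameters belong to the search method of a compiled pattern, so a sketch of them needs re.compile. This one reuses the sample string from the example above.

```python
import re

pattern = re.compile("abcd")
text = "1abcd2abcd"

first = pattern.search(text)         # search the whole string
second = pattern.search(text, 5)     # start searching at index 5
cut_off = pattern.search(text, 0, 4) # only "1abc" is visible, so no match

print(first.span())   # (1, 5)
print(second.span())  # (6, 10)
print(cut_off)        # None
```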

re.split(pattern, string[, maxsplit=0, flags=0])

Splits string by the occurrences of pattern. If pattern contains capturing parentheses, the text of all groups in the pattern is also returned as part of the resulting list.

>>> re.split(r"\W+", "words, words, works", maxsplit=1)
['words', 'words, works']
>>> # Pass flags by keyword: the third positional argument of re.split is maxsplit, not flags.
>>> re.split("[a-z]", "0A3b9z", flags=re.IGNORECASE)
['0', '3', '9', '']
>>> re.split("[a-z]+", "0A3b9z", flags=re.IGNORECASE)
['0', '3', '9', '']
>>> re.split("[a-zA-Z]+", "0A3b9z")
['0', '3', '9', '']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)  # re.IGNORECASE makes the pattern case-insensitive.
['0', '3', '9']

If the separator contains a capturing group and it matches at the start of the string, the result starts with an empty string. The same holds for the end of the string.

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
>>> re.split(r'(\W+)', 'words, words...')
['words', ', ', 'words', '...', '']
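One pitfall with re.split is worth a quick check: the third positional parameter is maxsplit, not flags, so a flag passed positionally is silently treated as a split limit. The sketch below passes both by keyword to make the difference visible. The variable names are only illustrative.

```python
import re

# flags passed by keyword: matching is case-insensitive.
by_flag = re.split(r"[a-f]+", "0a3B9", flags=re.IGNORECASE)
print(by_flag)     # ['0', '3', '9']

# re.IGNORECASE is an integer flag with the value 2 ...
flag_value = int(re.IGNORECASE)
print(flag_value)  # 2

# ... so using it as maxsplit just means "at most 2 splits", still case-sensitive.
as_limit = re.split(r"[a-f]+", "0a3B9", maxsplit=flag_value)
print(as_limit)    # ['0', '3B9']
```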

re.findall(pattern, string[, flags])

Returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If one or more groups are present in the pattern, a list of groups is returned; this is a list of tuples if the pattern has more than one group. Empty matches are included in the result.

>>> re.findall(r'(\W+)', 'words, words...')
[', ', '...']
>>> re.findall(r'(\W+)d', 'words, words...d')
['...']
>>> re.findall(r'(\W+)d', '...dwords, words...d')
['...', '...']

re.finditer(pattern, string[, flags])

Similar to findall, but instead of a list it returns an iterator that yields match objects.
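A brief sketch of finditer, reusing the "1abcd2abcd" string from the search example. Because each item is a match object, positions are available as well as the matched text.

```python
import re

# Collect the text and the span of every match.
matches = [(m.group(), m.start(), m.end())
           for m in re.finditer("abcd", "1abcd2abcd")]
print(matches)  # [('abcd', 1, 5), ('abcd', 6, 10)]
```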

Let's look at an example of sub and subn.

>>> re.sub(r"\d", "abc1def2hijk", "re")  # Arguments in the wrong order: the string searched is "re", which contains no digit.
're'
>>> x = re.sub(r"\d", "abc1def2hijk", "re")
>>> x
're'
>>> re.sub(r"\d", "re", "abc1def2hijk")  # Correct order: pattern, repl, string.
'abcredefrehijk'
>>> re.subn(r"\d", "re", "abc1def2hijk")
('abcredefrehijk', 2)

The example shows the difference between sub and subn: sub returns the new string, while subn returns a tuple of the new string and the number of replacements made.

re.sub(pattern, repl, string[, count, flags])

Returns the string obtained by replacing the occurrences of pattern in string with repl. If the pattern is not found, string is returned unchanged. repl can be a string or a function. If it is a function, it is called for every match of pattern; it takes a single match object as its argument and returns the replacement string. The following example is from the official documentation:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
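To round this off, here is a small sketch of the two sub options mentioned in the signature: a function as repl, and the count argument. The helper name double and the sample sentence are made up for illustration.

```python
import re

def double(matchobj):
    # Called once per match; returns the text that replaces it.
    return str(int(matchobj.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 12 pears")
print(doubled)     # 6 apples and 24 pears

# count limits how many occurrences are replaced.
first_only = re.sub(r"\d", "re", "abc1def2hijk", count=1)
print(first_only)  # abcredef2hijk
```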

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
