A Detailed Look at Python's re Module

Last Update:2015-04-13 Source: Internet

Author: User

The metacharacters of regular expressions are . ^ $ * + ? { } [ ] \ | ( ). The metacharacter "." matches any character. "[]" specifies a character class, that is, the set of characters you want to match; it matches any one character from that set. If "^" is the first character inside the brackets, it negates the class: [^5] matches any character other than 5. If "^" is not the first character inside the brackets, it stands for itself.
Metacharacters that repeat: "*" repeats the preceding character 0 or more times; "+" repeats the preceding character 1 or more times; "?" repeats the preceding character 0 or 1 times; {m,n} repeats the preceding character between m and n times, where {0,} = *, {1,} = +, and {0,1} = ?; {m} repeats the preceding character exactly m times.
\d matches any decimal digit; it is equivalent to the class [0-9]. \D matches any non-digit character; it is equivalent to the class [^0-9]. \s matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v]. \S matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v]. \w matches any alphanumeric character; it is equivalent to the class [a-zA-Z0-9_]. \W matches any non-alphanumeric character; it is equivalent to the class [^a-zA-Z0-9_].

Regular expressions (also called REs, regexes, or regex patterns) are a small, highly specialized programming language embedded in Python and made available through the re module. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C. The following is a brief introduction to regular expression syntax.
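As a rough illustration of the compile-then-match flow described above, the sketch below compiles a pattern once with re.compile and reuses the resulting object. It uses Python 3 syntax; the pattern, the variable names, and the sample text are invented for illustration.

```python
import re

# Compile the pattern once; the resulting object can be reused many times.
word_then_digits = re.compile(r"[a-z]+\d+")

# The compiled object offers the same operations as the module-level functions.
first = word_then_digits.search("id: abc123, def456")
everything = word_then_digits.findall("id: abc123, def456")

print(first.group())  # abc123
print(everything)     # ['abc123', 'def456']
```

Compiling explicitly is mostly a convenience when the same pattern is applied many times; the module-level functions such as re.search accept the pattern string directly.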

Regular expressions have a set of metacharacters, listed here: . ^ $ * + ? { } [ ] \ | ( )

1. Metacharacter ([]), which is used to specify a character class. A character class is the set of characters that you want to match. Characters can be listed individually, or a range can be given by separating two characters with "-". For example, [abc] matches any of the characters a, b, or c; the same set can also be written as the range [a-c]. To match a single uppercase letter, use [A-Z].

Metacharacters are not active inside a character class. For example, [akm$] matches any of the characters "a", "k", "m", or "$"; here the metacharacter "$" is an ordinary character.
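The character-class rules above can be checked with a short sketch. It uses Python 3 syntax, and the sample strings and variable names are made up for illustration.

```python
import re

# Inside a character class "$" is an ordinary character, not an end anchor.
literal_dollar = re.findall(r"[akm$]", "make $5")

# A range and an explicit list describe the same set of characters.
same_set = re.findall(r"[a-c]", "abcd") == re.findall(r"[abc]", "abcd")

# [A-Z] matches exactly one uppercase letter at a time.
uppercase = re.findall(r"[A-Z]", "Hello World")

print(literal_dollar)  # ['m', 'a', 'k', '$']
print(same_set)        # True
print(uppercase)       # ['H', 'W']
```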

2. Metacharacter [^]. You can match the characters that are not listed in the class by complementing the set. This is done by putting "^" as the first character of the class; "^" anywhere else in the class simply matches the "^" character itself. For example, [^5] matches any character except "5". Outside a character class, the metacharacter "^" matches at the beginning of the string, as in "^ab+", which matches a string that begins with "a" followed by one or more "b".

Example:

>>> m = re.search("^ab+", "asdfabbbb")
>>> print(m)
None
>>> m = re.search("ab+", "asdfabbbb")
>>> print(m)
<_sre.SRE_Match object at 0x011b1988>
>>> print(m.group())
abbbb

The previous example cannot use re.match, because match only matches at the beginning of the string, so it cannot be used to verify that the metacharacter "^" stands for the start of the string: the result is None with or without "^".

    >>> m = re.match("^ab+", "asdfabbbb")
    >>> print(m)
    None
    >>> m = re.match("ab+", "asdfabbbb")
    >>> print(m)
    None

# Verify what "^" means at different positions inside the metacharacter [].
>>> re.search("[^abc]", "abcd")  # "^" as the first character negates the class: any character other than a, b, or c
<_sre.SRE_Match object at 0x011b19f8>
>>> m = re.search("[^abc]", "abcd")
>>> m.group()
'd'
>>> m = re.search("[abc^]", "^")  # if "^" is not the first character in [], it is an ordinary character
>>> m.group()
'^'

However, there is one question about the metacharacter "^". The official documentation (http://docs.python.org/library/re.html) says this about it: "Matches the start of the string, and in MULTILINE mode also matches immediately after each newline."

My understanding is that "^" matches at the beginning of the string and, in MULTILINE mode, also immediately after each newline.

>>> m = re.search("^a\w+", "abcdfa\na1b2c3")
>>> m.group()
'abcdfa'
>>> m = re.search("^a\w+", "abcdfa\na1b2c3", re.MULTILINE)
>>> m.group()
'abcdfa'

With the flag set to re.MULTILINE, I expected that "^" would, according to the passage above, also match after the newline, so that m.group() would include "a1b2c3". It does not. Trying findall, however, gives both results. My understanding is therefore that search and match return as soon as the first match is found rather than finding every match, which is why the second one does not show up in group().

>>> m = re.findall("^a\w+", "abcdfa\na1b2c3", re.MULTILINE)
>>> m
['abcdfa', 'a1b2c3']

3. Metacharacter (\), the backslash. As in Python string literals, the backslash can be followed by various characters to signal special sequences.

It is also used to escape all the metacharacters so that you can match them literally in a pattern. For example, if you need to match "[" or "\", you can precede them with a backslash to remove their special meaning: \[ or \\
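A minimal sketch of escaping in practice follows, written for Python 3. The sample text and variable names are made up; re.escape is a standard-library helper that inserts the backslashes for you.

```python
import re

text = r"path[0] = C:\temp"

# "\[" matches a literal "[" and "\\" matches a literal backslash.
# Raw strings (r"...") keep Python itself from interpreting the backslashes.
brackets = re.findall(r"\[", text)
backslashes = re.findall(r"\\", text)

# re.escape() adds the backslashes automatically for arbitrary literal text.
literal = re.search(re.escape("[0]"), text)

print(brackets)         # ['[']
print(backslashes)      # ['\\']
print(literal.group())  # [0]
```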

4. Metacharacter ($) matches at the end of the string, or just before the newline at the end of the string. (In MULTILINE mode, "$" also matches before every newline.)

The regular expression "foo" matches both "foo" and "foobar", while "foo$" matches only "foo".

>>> re.findall("foo.$", "foo1\nfoo2\n")  # matches just before the newline at the end of the string
['foo2']

>>> re.findall("foo.$", "foo1\nfoo2\n", re.MULTILINE)
['foo1', 'foo2']

>>> m = re.search("foo.$", "foo1\nfoo2\n")
>>> m
<_sre.SRE_Match object at 0x00a27170>
>>> m.group()
'foo2'
>>> m = re.search("foo.$", "foo1\nfoo2\n", re.MULTILINE)
>>> m.group()
'foo1'

It looks like re.MULTILINE has a considerable effect on "$".

5. Metacharacter (*) matches the preceding expression 0 or more times.

6. Metacharacter (?) matches the preceding expression 0 or 1 times.

7. Metacharacter (+) matches the preceding expression 1 or more times.

8. Metacharacter (|) means "or", as in A|B, where A and B are regular expressions; it matches either A or B.
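Items 5 to 8 can be compared side by side with a small sketch. It assumes Python 3.4 or later for re.fullmatch, and the sample strings and variable names are invented for illustration.

```python
import re

samples = ["ac", "abc", "abbbc"]

# "*" allows zero or more "b", "?" allows zero or one, "+" requires at least one.
star = [s for s in samples if re.fullmatch(r"ab*c", s)]
optional = [s for s in samples if re.fullmatch(r"ab?c", s)]
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]

# "|" matches either the expression on its left or the one on its right.
either = re.findall(r"cat|dog", "cat, dog, bird")

print(star)      # ['ac', 'abc', 'abbbc']
print(optional)  # ['ac', 'abc']
print(plus)      # ['abc', 'abbbc']
print(either)    # ['cat', 'dog']
```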

9. Metacharacter ({})

{m} specifies exactly m copies of the preceding regular expression. For example, "a{5}" matches exactly five "a" characters, that is, "aaaaa".

>>> re.findall("a{5}", "aaaaaaaaaa")
['aaaaa', 'aaaaa']
>>> re.findall("a{5}", "aaaaaaaaa")
['aaaaa']

{m,n} specifies between m and n copies of the preceding regular expression, trying to match as many copies as possible.

>>> re.findall("a{2,4}", "aaaaaaaa")
['aaaa', 'aaaa']
The example above shows that with {m,n} the regular expression tries n copies first, not m, because the result is not ['aa', 'aa', 'aa', 'aa'].

>>> re.findall("a{2}", "aaaaaaaa")
['aa', 'aa', 'aa', 'aa']

{m,n}? specifies between m and n copies of the preceding regular expression, trying to match as few copies as possible.

>>> re.findall("a{2,4}?", "aaaaaaaa")
['aa', 'aa', 'aa', 'aa']

10. Metacharacter ("()"), which marks the start and end of a group.

The most commonly used forms are (RE) and (?P<name>RE): the first is an unnamed group and the second is a named group. A named group can be retrieved with MatchObject.group(name), while an unnamed group is retrieved by its ordinal number, starting from 1, as in MatchObject.group(1). Concrete uses are explained with the group() method below.

11. Metacharacter (.)

In the default mode, the metacharacter "." matches any character except a newline. In DOTALL mode, it matches any character, including a newline.

>>> import re
>>> re.match(".", "\n")
>>> m = re.match(".", "\n")
>>> print(m)
None
>>> m = re.match(".", "\n", re.DOTALL)
>>> print(m)
<_sre.SRE_Match object at 0x00c2ce20>
>>> m.group()
'\n'

Next, let's look at the methods of the match object. Here is a brief introduction to several common ones.

1. group([group1, ...])

Returns one or more subgroups of the match. With a single argument, the result is a single string; with multiple arguments, the result is a tuple with one item per argument. group1 defaults to 0 (the whole match is returned). If a groupN argument is 0, the corresponding return value is the entire matching string; if it is in the range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned. Subgroups are numbered from left to right by their opening parentheses.

>>> m = re.match("(\w+) (\w+)", "abcd efgh, chaj")
>>> m.group()  # the whole match
'abcd efgh'
>>> m.group(1)  # the subgroup of the first pair of parentheses
'abcd'
>>> m.group(2)
'efgh'
>>> m.group(1, 2)  # multiple arguments return a tuple
('abcd', 'efgh')

>>> m = re.match("(?P<first_name>\w+) (?P<last_name>\w+)", "Sam Lee")
>>> m.group("first_name")  # use group() to get a named subgroup
'Sam'
>>> m.group("last_name")
'Lee'

Below, the parentheses are removed:

>>> m = re.match("\w+ \w+", "abcd efgh, chaj")
>>> m.group()
'abcd efgh'
>>> m.group(1)
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    m.group(1)
IndexError: no such group

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")
>>> m.group(1)
'c3'
>>> m.group()
'a1b2c3'

The default argument for group() is 0, which returns the whole string matched by the regular expression pattern.

>>> s = "afkak1aafal12345adadsfa"
>>> pattern = r"(\d)\w+(\d{2})\w"
>>> m = re.match(pattern, s)
>>> print(m)
None
>>> m = re.search(pattern, s)
>>> m
<_sre.SRE_Match object at 0x00c2fda0>
>>> m.group()
'1aafal12345a'
>>> m.group(1)
'1'
>>> m.group(2)
'45'
>>> m.group(1, 2, 0)
('1', '45', '1aafal12345a')

2. groups([default])

Returns a tuple containing all the subgroups of the match. The default argument is used for groups that did not take part in the match; it defaults to None.

>>> m = re.match("(\d+)\.(\d+)", "23.123")
>>> m.groups()
('23', '123')
>>> m = re.match("(\d+)\.?(\d+)?", "24")  # the second \d+ has no match here, so the default value None is used
>>> m.groups()
('24', None)
>>> m.groups("0")
('24', '0')

3. groupdict([default])

Returns a dictionary of all the named subgroups of the match. Each key is a group name and each value is the text that group matched. The default argument is used for groups that did not take part in the match, just as with the groups() method; it defaults to None.

>>> m = re.match("(\w+) (\w+)", "Hello World")
>>> m.groupdict()
{}
>>> m = re.match("(?P<first>\w+) (?P<secode>\w+)", "Hello World")
>>> m.groupdict()
{'secode': 'World', 'first': 'Hello'}

As the example above shows, groupdict() ignores subgroups that have no name.
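The default argument of groupdict() only matters when a named group takes no part in the match. A minimal sketch, assuming Python 3.7 or later (so that the dictionary keeps the groups in pattern order); the group names and the placeholder value "unknown" are invented for illustration:

```python
import re

# The "last" group is optional, so it may take no part in the match.
m = re.match(r"(?P<first>\w+)(?: (?P<last>\w+))?", "Hello")

without_default = m.groupdict()
with_default = m.groupdict("unknown")

print(without_default)  # {'first': 'Hello', 'last': None}
print(with_default)     # {'first': 'Hello', 'last': 'unknown'}
```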

Regular Expression Object

search(string[, pos[, endpos]])

This is a method of a compiled regular expression object, as returned by re.compile; the module-level re.search(pattern, string[, flags]) does not take pos and endpos. It scans through the string looking for a location where the regular expression matches. It returns a MatchObject if a match is found (it stops at the first match and does not find all matches). If no match is found, it returns None.

The second parameter, pos, is the index in the string where the search starts; it defaults to 0.

The third parameter, endpos, limits how far into the string the search goes. It defaults to the length of the string.

>>> m = re.search("abcd", '1abcd2abcd')
>>> m.group()  # search returns a match object as soon as a match is found; the result is then read through that object's methods
'abcd'
>>> m.start()
1
>>> m.end()
5

>>> re.findall("abcd", "1abcd2abcd")
['abcd', 'abcd']
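The pos and endpos parameters can be tried out on the same string with a compiled pattern object. This is a small sketch in Python 3 syntax; the variable names are made up.

```python
import re

pattern = re.compile("abcd")
text = "1abcd2abcd"

# Without pos the scan starts at index 0 and finds the first occurrence.
first = pattern.search(text)

# pos=2 starts the scan at index 2, so the first occurrence is skipped.
second = pattern.search(text, 2)

# endpos=9 stops the scan before index 9, which cuts off the second occurrence.
none_found = pattern.search(text, 2, 9)

print(first.span())   # (1, 5)
print(second.span())  # (6, 10)
print(none_found)     # None
```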

re.split(pattern, string[, maxsplit=0, flags=0])

Splits string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

>>> re.split("\W+", "words,words,works", 1)
['words', 'words,works']
>>> re.split("[a-z]", "0A3b9z", re.IGNORECASE)  # passed positionally, re.IGNORECASE (integer value 2) is taken as maxsplit, not as a flag
['0A3', '9', '']
>>> re.split("[a-z]+", "0A3b9z", re.IGNORECASE)
['0A3', '9', '']
>>> re.split("[a-zA-Z]+", "0A3b9z")
['0', '3', '9', '']
>>> re.split('[a-f]+', '0a3B9', re.IGNORECASE)  # same pitfall: "B" is not split on; to ignore case in the pattern, pass flags=re.IGNORECASE
['0', '3B9']
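Because the third positional argument of re.split is maxsplit, a flag such as re.IGNORECASE only takes effect when it is passed by keyword. The sketch below, in Python 3 syntax with made-up variable names, shows the difference.

```python
import re

# Without any flag only the lowercase "a" is a separator.
case_sensitive = re.split("[a-f]+", "0a3B9")

# With flags=re.IGNORECASE the uppercase "B" is a separator as well.
case_insensitive = re.split("[a-f]+", "0a3B9", flags=re.IGNORECASE)

# maxsplit and flags can be combined when both are given by keyword.
limited = re.split("[a-f]+", "0a3B9", maxsplit=1, flags=re.IGNORECASE)

print(case_sensitive)    # ['0', '3B9']
print(case_insensitive)  # ['0', '3', '9']
print(limited)           # ['0', '3B9']
```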

If the separator contains a capturing group and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.

>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
>>> re.split('(\W+)', 'words, words...')
['words', ', ', 'words', '...', '']

re.findall(pattern, string[, flags])

Returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If one or more groups are present in the pattern, a list of groups is returned; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

>>> re.findall('(\W+)', 'words, words...')
[', ', '...']
>>> re.findall('(\W+)d', 'words, words...d')
['...']
>>> re.findall('(\W+)d', '...dwords, words...d')
['...', '...']
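How the number of groups changes the shape of the findall result can be seen in a short sketch. It uses Python 3 syntax; the sample text and variable names are invented for illustration.

```python
import re

text = "width=10, height=20"

# No groups: each item is the whole match.
whole = re.findall(r"\w+=\d+", text)

# One group: each item is just that group's text.
values = re.findall(r"\w+=(\d+)", text)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)   # ['width=10', 'height=20']
print(values)  # ['10', '20']
print(pairs)   # [('width', '10'), ('height', '20')]
```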

re.finditer(pattern, string[, flags])

Similar to findall, except that instead of a list it returns an iterator over match objects.
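Since finditer yields match objects rather than strings, the position of every match is available as well. A minimal sketch in Python 3 syntax, with a made-up variable name:

```python
import re

# Each item produced by finditer() is a match object, so positions are available.
matches = [(m.group(), m.start(), m.end())
           for m in re.finditer("abcd", "1abcd2abcd")]

print(matches)  # [('abcd', 1, 5), ('abcd', 6, 10)]
```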

Let's look at an example of sub and subn.

>>> re.sub("\d", "abc1def2hijk", "RE")  # arguments in the wrong order: "RE" is the string being searched and contains no digit, so it is returned unchanged
'RE'
>>> x = re.sub("\d", "abc1def2hijk", "RE")
>>> x
'RE'
>>> re.sub("\d", "RE", "abc1def2hijk")
'abcREdefREhijk'
>>> re.subn("\d", "RE", "abc1def2hijk")
('abcREdefREhijk', 2)

The examples show the difference between sub and subn: sub returns the string after substitution, while subn returns a tuple consisting of the new string and the number of substitutions made.

re.sub(pattern, repl, string[, count, flags])

Replaces the occurrences of pattern in string with repl. If pattern is not found, the string is returned unchanged. repl can be a string or a function. If it is a string, any backslash escapes in it are processed. If repl is a function, it is called for every non-overlapping occurrence of pattern; the function takes a single match object as its argument and returns the replacement string. Here is an example from the official documentation:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: retu
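Independently of the official example, the sketch below shows a function used as repl. It uses Python 3 syntax; the function name double and the sample sentence are invented for illustration and do not come from the documentation.

```python
import re

def double(matchobj):
    # Called once per match; the return value replaces the matched text.
    return str(int(matchobj.group(0)) * 2)

# Every run of digits is replaced by twice its value.
result = re.sub(r"\d+", double, "3 apples and 12 pears")

# count=1 limits the substitution to the first match.
limited = re.sub(r"\d+", double, "3 apples and 12 pears", count=1)

print(result)   # 6 apples and 24 pears
print(limited)  # 6 apples and 12 pears
```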
