Regular expressions in Python (re module)

I. Introduction

A regular expression is itself a small, highly specialized programming language. In Python it is embedded through the re module, which a program can call directly to perform regular expression matching. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.

II. Common characters and their meanings in regular expressions

1. Ordinary characters and the 11 metacharacters:

Each entry gives the syntax, its meaning, and an example pattern with strings it matches.

Ordinary characters
    Match themselves.
    Example: abc matches abc

.
    Matches any character except the newline "\n" (in DOTALL mode it also matches the newline).
    Example: a.c matches abc

\
    Escape character: changes the meaning of the character that follows it.
    Example: a\.c matches a.c; a\\c matches a\c

*
    Matches the preceding character 0 or more times.
    Example: abc* matches ab, abccc

+
    Matches the preceding character 1 or more times.
    Example: abc+ matches abc, abccc

?
    Matches the preceding character 0 or 1 times.
    Example: abc? matches ab, abc

^
    Matches the beginning of the string. In multiline mode it matches the beginning of each line.
    Example: ^abc matches abc

$
    Matches the end of the string. In multiline mode it matches the end of each line.
    Example: abc$ matches abc

|
    Or. Matches either the expression on its left or the one on its right, trying them from left to right. If | is not enclosed in (), its scope is the entire regular expression.
    Example: abc|def matches abc, def

{}
    {m} matches the preceding character exactly m times; {m,n} matches it m to n times; if n is omitted ({m,}), it matches m or more times.
    Example: ab{1,2}c matches abc, abbc

[]
    Character set. The corresponding position can be any one character from the set. Characters can be listed individually or given as a range, such as [abc] or [a-c]. [^abc] negates the set, that is, it matches any character other than a, b or c.
    Special characters lose their special meaning inside a character set. Characters that are still significant there (], ^, - and \) can be included literally by escaping them with a backslash.
    Example: a[bcd]e matches abe, ace, ade

()
    Group. The enclosed expression is treated as one group. Groups are numbered from 1, and the number increases by one for each opening parenthesis "(" encountered from the left of the expression.
    A group, as a whole, can be followed by a quantifier. A | inside a group applies only within that group.
    Example: (abc){2} matches abcabc; a(123|456)c matches a456c
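The entries above can be checked with a short script. This is only a minimal sketch: the helper name matches_fully and the sample strings are made up for illustration, and re.fullmatch (not used elsewhere in this article) is chosen so that each pattern has to cover the whole sample string.

```python
import re

def matches_fully(pattern, text):
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# (pattern, string expected to match, string expected not to match)
samples = [
    (r"a.c", "abc", "a\nc"),             # . does not match a newline by default
    (r"a\.c", "a.c", "abc"),             # an escaped dot is a literal dot
    (r"abc*", "ab", "abcd"),             # * : zero or more of the preceding 'c'
    (r"abc+", "abccc", "ab"),            # + : one or more
    (r"abc?", "abc", "abcc"),            # ? : zero or one
    (r"ab{1,2}c", "abbc", "abbbc"),      # {m,n} : between m and n repetitions
    (r"a[bcd]e", "ace", "aee"),          # character set
    (r"a[^bcd]e", "aee", "ace"),         # negated character set
    (r"(abc){2}", "abcabc", "abc"),      # quantifier applied to a whole group
    (r"a(123|456)c", "a456c", "a789c"),  # alternation limited to the group
]

for pattern, good, bad in samples:
    # Each line should print the pattern followed by True and False.
    print(pattern, matches_fully(pattern, good), matches_fully(pattern, bad))
```

Running it should print True for the second column and False for the third on every line.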

The role of the backslash deserves emphasis:

A backslash followed by a metacharacter removes its special function (the special character is escaped into an ordinary character).
A backslash followed by certain ordinary characters gives them a special function (these are the predefined characters).
A backslash followed by a number refers to the string matched by the group with that number.

a = re.search(r'(tina)(fei)haha\2', 'tinafeihahafei tinafeihahatina').group()
print(a)
Result:
tinafeihahafei

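2. Predefined character classes (\d, \s, \w and their negations can also be used inside a character set [...])

\d
    Digit: [0-9]
    Example: a\dc matches a1c

\D
    Non-digit: [^\d]
    Example: a\Dc matches abc

\s
    Any whitespace character: [<space>\t\r\n\f\v]
    Example: a\sc matches a c

\S
    Non-whitespace character: [^\s]
    Example: a\Sc matches abc

\w
    Word character, that is, a letter, a digit or the underscore: [a-zA-Z0-9_]
    Example: a\wc matches abc

\W
    Non-word character, that is, anything other than a letter, digit or underscore: [^\w]
    Example: a\Wc matches a c

\A
    Matches only at the beginning of the string, like ^
    Example: \Aabc matches abc

\Z
    Matches only at the end of the string, like $
    Example: abc\Z matches abc

\b
    Word boundary: the position between a \w character and a \W character, for instance between a word and a space. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb'. Inside a character set, \b means the backspace character instead.
    Example: \babc\b matches the abc in "<space>abc<space>"; a\b!bc matches a!bc

\B
    Not a word boundary (the opposite of \b)
    Example: a\Bbc matches abc

In Python 3, with str patterns, \d, \s and \w also match the corresponding Unicode characters unless the re.ASCII flag is used; the bracketed equivalents above describe the ASCII case.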
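The word boundary \b needs some extra explanation:

w = re.findall('\btina', 'tian tinaaaa')
print(w)
s = re.findall(r'\btina', 'tian tinaaaa')
print(s)
v = re.findall(r'\btina', 'tian#tinaaaa')
print(v)
a = re.findall(r'\btina\b', 'tian#tina@aaa')
print(a)
The results are as follows:
[]
['tina']
['tina']
['tina']

The first call returns an empty list because, without the r prefix, Python itself interprets '\b' as the backspace character before the pattern ever reaches the re module. In the other calls the raw string passes \b through unchanged, and both a space and a non-word character such as # or @ form a word boundary.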
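The difference between a plain string and a raw string can be made visible by looking at the pattern itself. The following is a small sketch; the sample text and variable names are invented for illustration.

```python
import re

plain = '\btina'    # '\b' is a single backspace character here
raw = r'\btina'     # the backslash and the 'b' reach the regex engine unchanged

print(len(plain), len(raw))    # 5 6

text = 'tian tinaaaa tina'
print(re.findall(plain, text))          # [] : the text contains no backspace
print(re.findall(raw, text))            # 'tina' at the start of two words
print(re.findall(r'\btina\b', text))    # only the standalone word 'tina'
print(re.findall(r'\Baaa\b', text))     # 'aaa' that ends a word but does not start one
```

Writing every pattern as a raw string avoids this kind of surprise.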

3. Special group usage:

(?P<name>...)
    Named group: in addition to its number, the group gets an alias.
    Example: (?P<id>abc){2} matches abcabc

(?P=name)
    Refers to the string matched by the group whose alias is name.
    Example: (?P<id>\d)abc(?P=id) matches 1abc1, 5abc5

\<number>
    Refers to the string matched by the group with that number.
    Example: (\d)abc\1 matches 1abc1, 5abc5
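A short sketch of how a named group and its backreference behave in practice; the sample strings are the ones from the entries above plus one deliberately mismatching string added for illustration.

```python
import re

# A named group and a backreference to it by name: the same digit must
# appear before and after 'abc'.
pattern = re.compile(r'(?P<id>\d)abc(?P=id)')

for sample in ['1abc1', '5abc5', '1abc5']:
    m = pattern.fullmatch(sample)
    if m:
        # The group is reachable both by name and by number.
        print(sample, m.group('id'), m.group(1), m.groupdict())
    else:
        print(sample, 'no match')

# The numbered form \1 expresses the same constraint.
print(re.fullmatch(r'(\d)abc\1', '5abc5') is not None)
```

The third sample fails because the digit after abc differs from the one captured before it.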
III. Common functions in the re module

1. compile()

Compiles a regular expression pattern and returns a pattern object. (Frequently used regular expressions can be compiled into pattern objects, which is somewhat more efficient.)

Format:

re.compile(pattern, flags=0)

pattern: the expression string to compile.

flags: compilation flags that modify how the regular expression matches, such as case sensitivity and multiline matching. The commonly used flags are:

Flag: Meaning
re.S (DOTALL): makes . match every character, including the newline
re.I (IGNORECASE): makes matching case-insensitive
re.L (LOCALE): performs locale-aware matching (French, etc.); in Python 3 it can only be used with bytes patterns
re.M (MULTILINE): multiline matching, affects ^ and $
re.X (VERBOSE): allows a more flexible layout of the pattern so that the regular expression is easier to read
re.U (UNICODE): interprets characters according to the Unicode character set, affects \w, \W, \b, \B; in Python 3 this is already the default for str patterns

import re
tt = "Tina is a good girl, she's cool, clever, and"
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))   # find all words that contain 'oo'
The results are as follows:
['good', 'cool']
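The effect of the individual flags is easier to see side by side. The following is an illustrative sketch with an invented two-line string; flags are combined with the | operator.

```python
import re

text = "First line\nsecond LINE"

# re.S (DOTALL): '.' also matches the newline.
print(re.findall(r'line.second', text))          # []
print(re.findall(r'line.second', text, re.S))    # one match spanning the newline

# re.M (MULTILINE): '^' matches at the start of every line.
print(re.findall(r'^\w+', text))                 # ['First']
print(re.findall(r'^\w+', text, re.M))           # ['First', 'second']

# re.I (IGNORECASE), combined with re.M using '|'.
print(re.findall(r'line$', text, re.I | re.M))   # ['line', 'LINE']

# re.X (VERBOSE): whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
print(number.findall("pi is 3.14, e is 2.718"))  # ['3.14', '2.718']
```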

2. match()

Determines whether the RE matches at the very beginning of the string. Note: this is not a full match. If the pattern is exhausted while the string still has characters left, the match is still considered successful. To require a full match, add the boundary matcher '$' to the end of the expression.

Format:

re.match(pattern, string, flags=0)

print(re.match('com', 'comwww.runcomoob').group())
print(re.match('com', 'Comwww.runcomoob', re.I).group())
The results are as follows:
com
Com
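The difference between a prefix match and a full match can be sketched as follows. re.fullmatch is not mentioned in this article; it is included here only as an alternative to appending '$', and note that '$' also matches just before a trailing newline while re.fullmatch does not.

```python
import re

s = 'comwww.runcomoob'

# match() only anchors the start; trailing text is allowed.
print(re.match(r'com', s) is not None)        # True
# Adding '$' requires the pattern to reach the end of the string.
print(re.match(r'com$', s) is not None)       # False
print(re.match(r'com$', 'com') is not None)   # True
# re.fullmatch() applies the same requirement without changing the pattern.
print(re.fullmatch(r'com', s) is not None)    # False

# match() returns None on failure, so guard before calling group().
m = re.match(r'run', s)
print(m.group() if m else 'no match at the beginning')
```

Calling .group() directly on a failed match raises AttributeError, which is why the last line checks the result first.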

3. search()

Format:

re.search(pattern, string, flags=0)

re.search scans the string for a match of the pattern and returns as soon as the first match is found. If nothing in the string matches, it returns None.

print(re.search(r'\dcom', 'www.4comrunoob.5com').group())
The results are as follows:
4com

* Note: when match and search succeed they return a match object, which has the following methods:

group() returns the string matched by the RE
start() returns the position where the match starts
end() returns the position where the match ends
span() returns a tuple containing the (start, end) positions of the match

group() returns the string matched by the whole RE. It can also take several group numbers at once, in which case it returns the strings matched by those groups.

a. group() returns the string matched by the whole RE.
b. group(n, m) returns the strings matched by groups n and m; if a group number does not exist, it raises an IndexError exception.
c. groups() returns a tuple containing the strings matched by all groups in the regular expression, from group 1 to the last group. It is usually called without arguments.

import re
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))   # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))   # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))   # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))   # 456
# group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second, and group(3) the part matched by the third.
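The remaining match object methods can be tried on a similar string. This is a sketch with an invented sample; + is used instead of * so that the search cannot succeed with an empty match at the start of the string.

```python
import re

m = re.search(r'([0-9]+)([a-z]+)([0-9]+)', 'xx123abc456yy')

print(m.group())        # '123abc456', the whole match
print(m.group(1, 3))    # ('123', '456'), several groups at once
print(m.groups())       # ('123', 'abc', '456'), all groups from group 1 on
print(m.start(), m.end(), m.span())    # 2 11 (2, 11)
print(m.span(2))        # (5, 8), the position of group 2

try:
    m.group(4)          # there is no fourth group
except IndexError as exc:
    print('IndexError:', exc)
```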

4. findall()

re.findall scans the whole string and returns all matching substrings as a list.

Format:

re.findall(pattern, string, flags=0)

p = re.compile(r'\d+')
print(p.findall('o1n2m3k4'))
The results are as follows:
['1', '2', '3', '4']

import re
tt = "Tina is a good girl, she's cool, clever, and"
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))
print(re.findall(r'(\w)*oo(\w)', tt))   # () marks a subexpression (group)
The results are as follows:
['good', 'cool']
[('g', 'd'), ('c', 'l')]
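As the second result shows, the shape of the list returned by findall depends on how many groups the pattern contains. A minimal sketch of the three cases, using a shortened version of the sample sentence:

```python
import re

tt = "Tina is a good girl, she is cool"

# No group: each item is the whole match.
print(re.findall(r'\w*oo\w*', tt))        # ['good', 'cool']
# One group: each item is the text of that group only.
print(re.findall(r'(\w*)oo\w*', tt))      # ['g', 'c']
# Several groups: each item is a tuple with one entry per group.
print(re.findall(r'(\w*)oo(\w*)', tt))    # [('g', 'd'), ('c', 'l')]
```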

5. finditer()

Searches the string and returns an iterator that yields each match (a match object) in order. In other words, it finds all substrings that the RE matches and returns them through an iterator.

Format:

re.finditer(pattern, string, flags=0)

matches = re.finditer(r'\d+', '12 drumm44ers drumming, 11 ... 10 ...')
for i in matches:
    print(i)
    print(i.group())
    print(i.span())
The results are as follows:
<_sre.SRE_Match object; span=(0, 2), match='12'>
12
(0, 2)
<_sre.SRE_Match object; span=(8, 10), match='44'>
44
(8, 10)
<_sre.SRE_Match object; span=(24, 26), match='11'>
11
(24, 26)
<_sre.SRE_Match object; span=(31, 33), match='10'>
10
(31, 33)

(Python 3.7 and later print the match object as <re.Match object; ...>.)
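Because every item is a match object, finditer is convenient when the position of each match is needed as well as its text. A small sketch with an invented sentence:

```python
import re

text = '12 drummers drumming, 11 pipers piping'

# Each item yielded by finditer is a match object, so the span of every
# match is available in addition to the matched text.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(found)    # [('12', (0, 2)), ('11', (22, 24))]
```

findall on the same input would return only the strings '12' and '11'.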

6. split()

Splits the string at every substring that the pattern matches and returns the pieces as a list.

For example, re.split(r'\s+', text) splits a string into a list of words.

Format:

re.split(pattern, string[, maxsplit])

maxsplit specifies the maximum number of splits. If it is not given, the string is split at every match.

print(re.split(r'\d+', 'one1two2three3four4five5'))
The results are as follows:
['one', 'two', 'three', 'four', 'five', '']
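The maxsplit parameter and the trailing empty string can be illustrated with the same input. This is a sketch; filtering out empty strings is just one possible way to drop the last element.

```python
import re

s = 'one1two2three3four4five5'

print(re.split(r'\d+', s))               # split at every run of digits
print(re.split(r'\d+', s, maxsplit=2))   # ['one', 'two', 'three3four4five5']

# The trailing '' appears because the string ends with a separator.
print([part for part in re.split(r'\d+', s) if part])
```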

7. sub()

Replaces every substring of string that the RE matches with repl and returns the resulting string.

Format:

re.sub(pattern, repl, string, count)

import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', '-', text))
The results are as follows:
JGood-is-a-handsome-boy,-he-is-cool,-clever,-and-so-on...

The second argument is the replacement string, in this case '-'.

The fourth argument, count, is the maximum number of replacements. The default is 0, which means that every match is replaced.

re.sub also accepts a function as the replacement, which allows more complex processing of each match.

For example, re.sub(r'\s', lambda m: '[' + m.group(0) + ']', text, 0) wraps every whitespace character in the string in square brackets, so that ' ' becomes '[ ]'.

import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', lambda m: '[' + m.group(0) + ']', text, count=0))
The results are as follows:
JGood[ ]is[ ]a[ ]handsome[ ]boy,[ ]he[ ]is[ ]cool,[ ]clever,[ ]and[ ]so[ ]on...
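A few more variations on re.sub, as a hedged sketch: the function name shout and the shortened sample text are invented for illustration, and count is passed by keyword to keep it from being confused with flags.

```python
import re

text = "JGood is a handsome boy"

# count=1 replaces only the first match.
print(re.sub(r'\s+', '-', text, count=1))    # JGood-is a handsome boy

# A function receives each match object and returns the replacement text.
def shout(m):
    return m.group(0).upper()

# Only the first two all-lowercase words are replaced.
print(re.sub(r'\b[a-z]+\b', shout, text, count=2))    # JGood IS A handsome boy

# In a replacement string, \1 refers to group 1 of the match.
print(re.sub(r'(\w+) is', r'\1 was', text))    # JGood was a handsome boy
```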

8. subn()

Works like sub(), but returns a tuple of the new string and the number of replacements made.

Format:

re.subn(pattern, repl, string, count=0, flags=0)

print(re.subn('[1-2]', 'A', '123456abcdef'))
print(re.sub("g.t", "have", 'I get A,  I got B, I gut C'))
print(re.subn("g.t", "have", 'I get A,  I got B, I gut C'))
The results are as follows:
('AA3456abcdef', 2)
I have A,  I have B, I have C
('I have A,  I have B, I have C', 3)
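Since subn returns a tuple, the new string and the count are usually unpacked into two names. A minimal sketch (the variable names are arbitrary):

```python
import re

new_text, n = re.subn(r'g.t', 'have', 'I get A,  I got B, I gut C')
print(new_text)    # I have A,  I have B, I have C
print(n)           # 3

# When nothing matches, the string comes back unchanged with a count of 0.
print(re.subn(r'\d', '#', 'no digits here'))    # ('no digits here', 0)
```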
IV. Additional notes

1. The difference between re.match, re.search and re.findall:

re.match only matches at the beginning of the string: if the string does not start with something the regular expression matches, the match fails and the function returns None. re.search scans the whole string until it finds a match, and re.findall returns every match in the string.

a = re.search(r'[\d]', "abc33").group()
print(a)
p = re.match(r'[\d]', "abc33")
print(p)
b = re.findall(r'[\d]', "abc33")
print(b)
Results:
3
None
['3', '3']

2. Greedy and non-greedy matching

*?, +?, ??, {m,n}?: the quantifiers *, +, ? and {m,n} are greedy, that is, they match as much as possible. Appending a ? to a quantifier makes it non-greedy (lazy), so that it matches as little as possible.

a = re.findall(r"a(\d+?)", 'a23b')
print(a)
b = re.findall(r"a(\d+)", 'a23b')
print(b)
Results:
['2']
['23']

a = re.match('<(.*)>', '<H1>title<H1>').group()
print(a)
b = re.match('<(.*?)>', '<H1>title<H1>').group()
print(b)
Results:
<H1>title<H1>
<H1>

a = re.findall(r"a(\d+)b", 'a3333b')
print(a)
b = re.findall(r"a(\d+?)b", 'a3333b')
print(b)
The results are as follows:
['3333']
['3333']

Note that when the pattern is constrained on both sides, as with a...b here, greedy and non-greedy modes give the same result: the non-greedy quantifier still has to extend until the following b can match, so it has no visible effect.
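The difference between greedy and lazy quantifiers is most visible when several matches are possible. A short sketch with invented sample strings:

```python
import re

html = '<H1>title</H1>'

print(re.findall(r'<.*>', html))     # ['<H1>title</H1>'], first '<' to last '>'
print(re.findall(r'<.*?>', html))    # ['<H1>', '</H1>'], each tag separately

# The same applies to {m,n}: greedy takes up to n, lazy stops at m.
print(re.findall(r'\d{2,4}', '123456'))     # ['1234', '56']
print(re.findall(r'\d{2,4}?', '123456'))    # ['12', '34', '56']
```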

3. A small pitfall when using flags

print(re.split('a', '1A1a2A3', re.I))   # the split is not case-insensitive
This happens because the signature is re.split(pattern, string, maxsplit, flags), with four parameters. When only three arguments are passed positionally, re.I is taken as the third parameter, maxsplit, so it has no effect as a flag. To make re.I work here, write flags=re.I.
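Passing the flag by keyword avoids the problem. The sketch below uses a mixed-case sample string chosen for illustration and also shows why the mistake goes unnoticed: re.I is an integer flag, so it is accepted wherever a number such as maxsplit is expected.

```python
import re

s = '1A1a2A3'

# flags passed by keyword: both 'a' and 'A' act as separators.
print(re.split('a', s, flags=re.I))    # ['1', '1', '2', '3']

# Without the flag only the lowercase 'a' splits the string.
print(re.split('a', s))                # ['1A1', '2A3']

# re.I has the integer value 2, which is why it can silently be taken
# for maxsplit (or for count in re.sub).
print(int(re.I))                       # 2
```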
V. Small regex exercises

1. Matching a telephone number

p = re.compile(r'\d{3}-\d{6}')
print(p.findall('010-628888'))

2. Matching an IP address

re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", "192.168.1.1")
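One way to turn such a pattern into a reusable check is sketched below. The names OCTET, IPV4 and is_ipv4 are invented for illustration, and re.fullmatch is used (an assumption beyond the article, which uses re.search) so that extra characters before or after the address are rejected.

```python
import re

# One octet: 0-199 (with optional leading digits), 200-249, or 250-255.
OCTET = r'([01]?\d?\d|2[0-4]\d|25[0-5])'
# Three octets each followed by a dot, then a final octet without one.
IPV4 = re.compile(r'(' + OCTET + r'\.){3}' + OCTET)

def is_ipv4(text):
    """Return True if `text` is a dotted-quad IPv4 address (hypothetical helper)."""
    return IPV4.fullmatch(text) is not None

for candidate in ['192.168.1.1', '255.255.255.255', '256.1.1.1', '1.2.3']:
    print(candidate, is_ipv4(candidate))
```

The first two candidates should print True and the last two False.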


Reproduced from: http://www.cnblogs.com/tina-python/p/5508402.html
