I. Introduction
A regular expression is itself a small, highly specialized programming language. In Python it is available through the built-in re module, which can be called directly to perform regular expression matching. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.
II. Common character meanings in regular expressions
1. Ordinary characters and 11 metacharacters:
Syntax | Description | Example pattern | Matches
Ordinary characters | Match themselves | abc | abc
. | Matches any character except the newline character "\n" (in DOTALL mode it also matches the newline) | a.c | abc
\ | Escape character; changes the meaning of the character that follows it | a\.c; a\\c | a.c; a\c
* | Matches the previous character 0 or more times | abc* | ab; abccc
+ | Matches the previous character 1 or more times | abc+ | abc; abccc
? | Matches the previous character 0 or 1 times | abc? | ab; abc
^ | Matches the beginning of the string; in multiline mode, matches the beginning of each line | ^abc | abc
$ | Matches the end of the string; in multiline mode, matches the end of each line | abc$ | abc
| (vertical bar) | Or. Matches either the expression on its left or the one on its right, trying them from left to right. If the | is not enclosed in (), its scope is the entire regular expression | abc|def | abc; def
{} | {m} matches the previous character exactly m times; {m,n} matches it m to n times; if n is omitted, it matches m or more times | ab{1,2}c | abc; abbc
[] | Character set. The corresponding position can be any one character from the set. Characters can be listed individually or given as a range, such as [abc] or [a-c]. [^abc] negates the set, that is, any character other than a, b or c. Most metacharacters lose their special meaning inside a set; the characters that remain special there (], a leading ^, a - between two characters, and \ itself) can be matched literally by escaping them with a backslash | a[bcd]e | abe; ace; ade
() | The enclosed expression becomes a group. Groups are numbered from 1 by counting opening parentheses "(" from left to right. A group can be followed by a quantifier, which then applies to the group as a whole. A | inside a group applies only within that group | (abc){2}; a(123|456)c | abcabc; a456c
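The rows of the table above can be checked directly in Python. This is only a minimal sketch: it uses re.fullmatch, which this article does not otherwise cover, so that each pattern has to cover the whole sample string, and the helper name matches is made up for illustration.

```python
import re

def matches(pattern, text, flags=0):
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text, flags) is not None

# Quantifiers apply to the single preceding character.
print(matches(r'abc*', 'ab'), matches(r'abc*', 'abccc'))            # True True
print(matches(r'abc+', 'ab'))                                       # False
print(matches(r'ab{1,2}c', 'abbc'), matches(r'ab{1,2}c', 'abbbc'))  # True False

# . does not match a newline unless DOTALL (re.S) is set.
print(matches(r'a.c', 'a\nc'), matches(r'a.c', 'a\nc', re.S))       # False True

# A backslash turns a metacharacter into an ordinary character.
print(matches(r'a\.c', 'a.c'), matches(r'a\.c', 'abc'))             # True False

# A group is quantified as a whole, and | is scoped by the parentheses.
print(matches(r'(abc){2}', 'abcabc'), matches(r'a(123|456)c', 'a456c'))  # True True
```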
The role of the backslash deserves emphasis. A backslash followed by a metacharacter removes the special meaning (the special character is escaped into an ordinary character). A backslash followed by certain ordinary characters gives them a special meaning (these are the predefined character sets). A backslash followed by a number refers to the string matched by the group with that number.
a = re.search(r'(tina)(fei)haha\2', 'tinafeihahafei tinafeihahatina').group()
print(a)
Results:
tinafeihahafei
2. Predefined character sets (the classes \d, \D, \s, \S, \w and \W can also be written inside a character set [...])
Syntax | Description | Example pattern | Matches
\d | Digit: [0-9] | a\dc | a1c
\D | Non-digit: [^\d] | a\Dc | abc
\s | Any whitespace character: [<space>\t\r\n\f\v] | a\sc | a c
\S | Non-whitespace character: [^\s] | a\Sc | abc
\w | Any word character, including the underscore: [a-zA-Z0-9_] | a\wc | abc
\W | Any non-word character, that is, special characters: [^\w] | a\Wc | a c
\A | Matches only at the beginning of the string, like ^ | \Aabc | abc
\Z | Matches only at the end of the string, like $ | abc\Z | abc
\b | Matches between \w and \W, that is, at a word boundary: the position between a word and a space or other non-word character. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb' | \babc\b; a\b!bc | <space>abc<space>; a!bc
\B | Not a word boundary: [^\b] | a\Bbc | abc
The word-boundary behaviour of \b deserves a closer look:
w = re.findall('\btina', 'tian tinaaaa')  # not a raw string: '\b' is a backspace here
print(w)
s = re.findall(r'\btina', 'tian tinaaaa')
print(s)
v = re.findall(r'\btina', 'tian#tinaaaa')
print(v)
a = re.findall(r'\btina\b', 'tian#tina@aaa')
print(a)
The results are as follows:
[]
['tina']
['tina']
['tina']
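The empty first result comes from Python's string literals rather than from the regular expression engine. The sketch below (variable names are only for illustration) shows what the pattern string actually contains with and without the r prefix.

```python
import re

plain = '\btina'    # Python turns '\b' into a backspace character (0x08)
raw = r'\btina'     # the backslash reaches the regex engine unchanged

print(len(plain), len(raw))          # 5 6
print(plain[0] == '\x08')            # True

text = 'tian tinaaaa'
print(re.findall(plain, text))       # []
print(re.findall(raw, text))         # ['tina']
print(re.findall('\\btina', text))   # ['tina'], doubling the backslash also works
```

Writing every pattern as a raw string avoids having to remember which escapes Python interprets itself.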
3. Special grouping usage:
Syntax | Description | Example pattern | Matches
(?P<name>...) | Group that is given an alias in addition to its number | (?P<id>abc){2} | abcabc
(?P=name) | Refers to the string matched by the group with the alias <name> | (?P<id>\d)abc(?P=id) | 1abc1; 5abc5
\<number> | Refers to the string matched by the group with the number <number> | (\d)abc\1 | 1abc1; 5abc5
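As a small illustration of named groups and (?P=name), the sketch below looks for a doubled word. The sample sentence and the alias word are chosen only for this example.

```python
import re

# (?P<word>...) names the group; (?P=word) must match the same text again.
pattern = re.compile(r'(?P<word>\b\w+)\s+(?P=word)\b')
m = pattern.search('Paris in the the spring')

print(m.group())          # the the
print(m.group('word'))    # the
print(m.group(1))         # the (the alias exists in addition to the number)
print(m.groupdict())      # {'word': 'the'}
```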
III. Common functions in the re module
1. compile()
Compiles a regular expression pattern and returns a pattern object. (Frequently used regular expressions can be compiled into pattern objects, which is a little more efficient.)
Format:
re.compile(pattern, flags=0)
pattern: the expression string to compile.
flags: compilation flags that modify how the regular expression is matched, such as case sensitivity, multiline matching, and so on. The commonly used flags are:
Flag | Meaning
re.S (DOTALL) | Makes . match every character, including the newline
re.I (IGNORECASE) | Makes matching case-insensitive
re.L (LOCALE) | Performs locale-aware matching (for French, for example)
re.M (MULTILINE) | Multiline matching; affects ^ and $
re.X (VERBOSE) | Allows the regular expression to be laid out more flexibly so that it is easier to read
re.U (UNICODE) | Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
import re
tt = "Tina is a good girl, she's cool, clever, and"
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))  # find all words that contain 'oo'
The results are as follows:
['good', 'cool']
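The effect of the individual flags is easier to see on a concrete string. The following is only a sketch with a made-up two-line sample text; the flags themselves are the ones listed in the table above.

```python
import re

text = "First line\nsecond LINE"

# re.I: matching ignores case.
print(re.findall(r'line', text, re.I))        # ['line', 'LINE']

# re.M: ^ and $ also match at the start and end of each line.
print(re.findall(r'^\w+', text))              # ['First']
print(re.findall(r'^\w+', text, re.M))        # ['First', 'second']

# re.S: . also matches the newline character.
print(re.search(r'line.second', text))        # None
print(re.search(r'line.second', text, re.S) is not None)  # True

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
""", re.X)
print(number.fullmatch('3.14') is not None)   # True

# Several flags are combined with the | operator.
print(re.findall(r'^\w+\sline$', text, re.I | re.M))  # ['First line', 'second LINE']
```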
2. match()
Determines whether the RE matches at the very beginning of the string. Note: this is not a full match. If the pattern is exhausted while the string still has characters left, the match is still considered successful. To require a full match, add the boundary match '$' to the end of the expression.
Format:
re.match(pattern, string, flags=0)
print(re.match('com', 'comwww.runcomoob').group())
print(re.match('com', 'Comwww.runcomoob', re.I).group())
The results are as follows:
com
Com
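To make the "not a full match" remark concrete, here is a short sketch comparing a plain match, a pattern ending in '$', and re.fullmatch. re.fullmatch is not mentioned in this article; it is part of the standard re module (Python 3.4 and later) and is shown only as an alternative.

```python
import re

print(re.match(r'com', 'comwww.runcomoob').group())   # com (only a prefix has to match)
print(re.match(r'com$', 'comwww.runcomoob'))          # None
print(re.fullmatch(r'com', 'comwww.runcomoob'))       # None
print(re.fullmatch(r'com', 'com').group())            # com

# '$' also matches just before a trailing newline, fullmatch does not allow that.
print(re.match(r'com$', 'com\n') is not None)         # True
print(re.fullmatch(r'com', 'com\n'))                  # None
```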
3. search()
Format:
re.search(pattern, string, flags=0)
The re.search function looks for a pattern match anywhere in the string and returns as soon as the first match is found; if nothing in the string matches, it returns None.
print(re.search(r'\dcom', 'www.4comrunoob.5com').group())
The results are as follows:
4com
* Note: when match and search succeed, they return a match object, which has the following methods:
group() returns the string matched by the RE
start() returns the position where the match starts
end() returns the position where the match ends
span() returns a tuple containing the (start, end) positions of the match
group() returns the string matched by the RE as a whole; several group numbers can be passed at once, in which case the strings matched by those groups are returned.
a. group() returns the string matched by the RE as a whole.
b. group(n, m) returns the strings matched by group numbers n and m, and raises an IndexError exception if a group number does not exist.
c. groups() returns a tuple containing the strings of all groups in the regular expression, from group 1 up to the last group. It is usually called without arguments, and each element of the returned tuple is the text matched by one of the groups defined in the regular expression.
import re
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))  # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))  # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))  # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))  # 456
# group(1) returns the part matched by the first pair of parentheses, group(2) the second, and group(3) the third.
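The other match object methods listed above can be tried on the same kind of input. This is a sketch with a slightly different sample string (extra characters on both sides) so that the positions are not trivially 0.

```python
import re

m = re.search(r'([0-9]+)([a-z]+)([0-9]+)', 'xx123abc456yy')

print(m.group())                      # 123abc456
print(m.start(), m.end(), m.span())   # 2 11 (2, 11)
print(m.group(1, 3))                  # ('123', '456')
print(m.groups())                     # ('123', 'abc', '456')
print(m.span(2))                      # (5, 8), the position of group 2

# Asking for a group that does not exist raises IndexError.
try:
    m.group(4)
except IndexError as exc:
    print('IndexError:', exc)
```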
4. findall()
re.findall scans the whole string and returns all matching substrings as a list.
Format:
re.findall(pattern, string, flags=0)
p = re.compile(r'\d+')
print(p.findall('o1n2m3k4'))
The results are as follows:
['1', '2', '3', '4']
import re
tt = "Tina is a good girl, she's cool, clever, and"
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))
print(re.findall(r'(\w)*oo(\w)', tt))  # () marks a subexpression (group)
The results are as follows:
['good', 'cool']
[('g', 'd'), ('c', 'l')]
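The second result above shows that findall returns the groups instead of the whole match as soon as the pattern contains groups. The sketch below contrasts this with a non-capturing group (?:...), a standard re construct that this article does not cover.

```python
import re

tt = "Tina is a good girl, she's cool, clever, and"

print(re.findall(r'\w*oo\w*', tt))            # ['good', 'cool']
print(re.findall(r'(\w*)oo(\w*)', tt))        # [('g', 'd'), ('c', 'l')]
print(re.findall(r'\w*(oo)\w*', tt))          # ['oo', 'oo'], one group gives a list of strings
print(re.findall(r'(?:\w*)oo(?:\w*)', tt))    # ['good', 'cool'], non-capturing groups are ignored
```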
5. finditer()
Searches the string and returns an iterator that yields each match result (a match object) in order. It finds all the substrings that the RE matches and returns them as an iterator.
Format:
re.finditer(pattern, string, flags=0)
matches = re.finditer(r'\d+', '12 drumm44ers drumming, 11 ... 10 ...')
for i in matches:
    print(i)
    print(i.group())
    print(i.span())
The results are as follows:
<_sre.SRE_Match object; span=(0, 2), match='12'>
12
(0, 2)
<_sre.SRE_Match object; span=(8, 10), match='44'>
44
(8, 10)
<_sre.SRE_Match object; span=(24, 26), match='11'>
11
(24, 26)
<_sre.SRE_Match object; span=(31, 33), match='10'>
10
(31, 33)
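Because finditer yields match objects, it is the natural choice when the positions are needed as well as the text. A minimal sketch with a made-up sample sentence:

```python
import re

text = '12 drummers drumming, 11 pipers piping'

# Collect the matched text together with its (start, end) position.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(found)                        # [('12', (0, 2)), ('11', (22, 24))]

# findall gives only the strings, without positions.
print(re.findall(r'\d+', text))     # ['12', '11']
```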
6. split()
Splits the string at every substring that matches the pattern and returns the pieces as a list.
You can use re.split to split strings, for example re.split(r'\s+', text) splits the string into a list of words.
Format:
re.split(pattern, string[, maxsplit])
maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
print(re.split(r'\d+', 'one1two2three3four4five5'))
The results are as follows:
['one', 'two', 'three', 'four', 'five', '']
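A short sketch of the maxsplit parameter described above, together with two details the example result hints at: the trailing empty string, and what happens when the pattern contains a group.

```python
import re

s = 'one1two2three3four4five5'

print(re.split(r'\d+', s))                 # ['one', 'two', 'three', 'four', 'five', '']
print(re.split(r'\d+', s, maxsplit=2))     # ['one', 'two', 'three3four4five5']

# With a capturing group, the separators are kept in the result.
print(re.split(r'(\d+)', s, maxsplit=2))   # ['one', '1', 'two', '2', 'three3four4five5']

# The trailing '' appears because the string ends with a separator; filter it if unwanted.
words = [part for part in re.split(r'\d+', s) if part]
print(words)                               # ['one', 'two', 'three', 'four', 'five']
```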
7. sub()
Replaces every substring of string that matches the RE and returns the resulting string.
Format:
re.sub(pattern, repl, string, count)
import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', '-', text))
The results are as follows:
JGood-is-a-handsome-boy,-he-is-cool,-clever,-and-so-on...
The second argument is the replacement string, in this case '-'.
The fourth argument, count, is the maximum number of replacements. The default is 0, which means that every match is replaced.
re.sub also accepts a function as the replacement, which allows more complex processing of each match.
For example, re.sub(r'\s', lambda m: '[' + m.group(0) + ']', text, count=0) wraps every whitespace character in the string in '[' and ']'.
import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', lambda m: '[' + m.group(0) + ']', text, count=0))
The results are as follows:
JGood[ ]is[ ]a[ ]handsome[ ]boy,[ ]he[ ]is[ ]cool,[ ]clever,[ ]and[ ]so[ ]on...
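The replacement can also refer back to groups of the match. The sketch below uses made-up "key=value" input and a hypothetical helper function double to show group references, a replacement function, and the count parameter.

```python
import re

# \1 and \2 in the replacement refer to the numbered groups of the match.
print(re.sub(r'(\w+)=(\w+)', r'\2=\1', 'a=1, b=2'))                   # 1=a, 2=b

# Named groups are referenced with \g<name>.
print(re.sub(r'(?P<k>\w+)=(?P<v>\w+)', r'\g<v>:\g<k>', 'a=1, b=2'))   # 1:a, 2:b

def double(m):
    """Replacement function: receives a match object, returns the new text."""
    return str(int(m.group()) * 2)

print(re.sub(r'\d+', double, 'a=1, b=20'))            # a=2, b=40

# count limits how many matches are replaced.
print(re.sub(r'\d+', double, 'a=1, b=20', count=1))   # a=2, b=20
```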
8. subn()
Performs the same replacement as sub(), but returns a tuple of the new string and the number of replacements made.
Format:
re.subn(pattern, repl, string, count=0, flags=0)
print(re.subn('[1-2]', 'A', '123456abcdef'))
print(re.sub("g.t", "have", 'I get A, I got B, I gut C'))
print(re.subn("g.t", "have", 'I get A, I got B, I gut C'))
The results are as follows:
('AA3456abcdef', 2)
I have A, I have B, I have C
('I have A, I have B, I have C', 3)
IV. Some notes
1. The difference between re.match, re.search and re.findall:
re.match matches only at the beginning of the string; if the beginning of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match. re.findall returns every match in the string as a list.
a = re.search(r'[\d]', "abc33").group()
print(a)
p = re.match(r'[\d]', "abc33")
print(p)
b = re.findall(r'[\d]', "abc33")
print(b)
Results:
3
None
['3', '3']
2. Greedy and non-greedy matching
*?, +?, ??, {m,n}?: the quantifiers *, +, ? and {m,n} are greedy by default, that is, they match as much as possible; adding a ? after them turns them into lazy (non-greedy) matches.
a = re.findall(r"a(\d+?)", 'a23b')
print(a)
b = re.findall(r"a(\d+)", 'a23b')
print(b)
Results:
['2']
['23']
a = re.match('<(.*)>', '<H1>title<H1>').group()
print(a)
b = re.match('<(.*?)>', '<H1>title<H1>').group()
print(b)
Results:
<H1>title<H1>
<H1>
a = re.findall(r"a(\d+)b", 'a3333b')
print(a)
b = re.findall(r"a(\d+?)b", 'a3333b')
print(b)
The results are as follows:
['3333']
['3333']
Note that when the quantified part is constrained on both sides (here by a and b), greedy and non-greedy modes give the same result: the lazy quantifier still has to expand until the rest of the pattern can match.
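A few more cases of greedy versus lazy matching, as a sketch on made-up input. The negated character set in the third line is a common alternative to a lazy quantifier and is not taken from this article.

```python
import re

html = '<H1>title</H1>'

print(re.findall(r'<.*>', html))      # ['<H1>title</H1>'], greedy runs to the last '>'
print(re.findall(r'<.*?>', html))     # ['<H1>', '</H1>'], lazy stops at the first '>'
print(re.findall(r'<[^>]*>', html))   # ['<H1>', '</H1>'], same result without laziness

# {m,n} is greedy as well; {m,n}? takes as few repetitions as allowed.
print(re.findall(r'a(\d{2,4})', 'a12345'))    # ['1234']
print(re.findall(r'a(\d{2,4}?)', 'a12345'))   # ['12']
```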
3. A small pitfall when using flags
print(re.split('a', '1A1a2A3', re.I))  # the output is not case-insensitive
This is because the signature is re.split(pattern, string, maxsplit, flags), with four parameters. When we pass three positional arguments, re.I is taken as the third parameter, maxsplit, so it has no effect as a flag. To make re.I work here, write flags=re.I.
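The pitfall can be reproduced without relying on the misplaced argument. The sketch below passes maxsplit=2 explicitly, which is what re.I amounts to when it lands in the maxsplit position, and then shows two ways of passing the flag correctly.

```python
import re

s = '1A1a2A3'

print(int(re.I))                          # 2, so a positional re.I acts as maxsplit=2
print(re.split('a', s, maxsplit=2))       # ['1A1', '2A3'], still case-sensitive

# Passing the flag by keyword gives the intended behaviour.
print(re.split('a', s, flags=re.I))       # ['1', '1', '2', '3']

# Compiling the pattern with the flag avoids the positional confusion entirely.
print(re.compile('a', re.I).split(s))     # ['1', '1', '2', '3']
```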
V. Small regular expression exercises
1. Matching a telephone number
p = re.compile(r'\d{3}-\d{6}')
print(p.findall('010-628888'))
2. Matching an IP address
re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", "192.168.1.1")
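re.search only reports that an address-like substring occurs somewhere in the text. If the goal is to validate a whole string, a stricter variant can be sketched as follows. The names OCTET, IPV4 and is_ipv4 are made up for this example, and rejecting leading zeros is an assumption of this sketch, not something the pattern above does.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
IPV4 = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')

def is_ipv4(text):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4('192.168.1.1'))     # True
print(is_ipv4('255.255.255.0'))   # True
print(is_ipv4('256.1.1.1'))       # False, 256 is out of range
print(is_ipv4('1.2.3'))           # False, only three octets
print(is_ipv4('192.168.1.1x'))    # False, trailing character
```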
Reproduced from: http://www.cnblogs.com/tina-python/p/5508402.html