I. Introduction
A regular expression is itself a small, highly specialized programming language. In Python it is embedded in the language through the re module, which a program can call directly to perform regular expression matching. A regular expression pattern is compiled into a sequence of bytecodes, which is then executed by a matching engine written in C.
II. Meaning of the characters commonly used in regular expressions
1. Ordinary characters and 11 metacharacters:
Syntax | Description | Example pattern | Matches
ordinary characters | Match themselves | abc | abc
. | Matches any character except the newline "\n" (in DOTALL mode it also matches newlines) | a.c | abc
\ | Escape character: changes the meaning of the character that follows it | a\.c; a\\c | a.c; a\c
* | Matches the preceding character 0 or more times | abc* | ab; abccc
+ | Matches the preceding character 1 or more times | abc+ | abc; abccc
? | Matches the preceding character 0 or 1 times | abc? | ab; abc
^ | Matches the beginning of the string; in multiline mode, also matches the beginning of each line | ^abc | abc
$ | Matches the end of the string; in multiline mode, also matches the end of each line | abc$ | abc
"|" | Or: matches either the expression on its left or the one on its right, trying them from left to right. If "|" is not enclosed in (), its scope is the entire regular expression | "abc|def" | abc; def
{} | {m} matches the preceding character exactly m times; {m,n} matches it m to n times; if n is omitted, it matches m or more times | ab{1,2}c | abc; abbc
[] | Character set: the corresponding position can be any one character from the set. Characters can be listed individually or given as a range, such as [abc] or [a-c]. [^abc] negates the set, that is, any character other than a, b or c. Metacharacters such as . * + ? lose their special meaning inside a set; characters that remain special there (], ^ and -) can be matched literally by escaping them with a backslash | a[bcd]e | abe; ace; ade
() | The enclosed expression becomes a group. Groups are numbered from the left, starting at 1; each opening parenthesis "(" increases the number by 1. A group as a whole can be followed by a quantifier. A "|" inside a group applies only within that group | (abc){2}; a(123|456)c | abcabc; a456c
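The metacharacters in the table above can be tried out directly. The following is a minimal sketch, not part of the original article: the sample texts and the `examples` list are made up for illustration, and each entry pairs a pattern with the result that `re.findall` gives for it.

```python
import re

# Each entry: (pattern, sample text, expected result of re.findall).
# The sample texts are illustrative assumptions, not taken from the article.
examples = [
    (r"a.c", "abc a-c", ["abc", "a-c"]),          # . matches any character except a newline
    (r"a\.c", "abc a.c", ["a.c"]),                # \. matches a literal dot only
    (r"abc*", "ab abccc", ["ab", "abccc"]),       # * : 0 or more of the preceding character
    (r"abc+", "ab abccc", ["abccc"]),             # + : 1 or more
    (r"abc?", "ab abccc", ["ab", "abc"]),         # ? : 0 or 1
    (r"ab{1,2}c", "ac abc abbc abbbc", ["abc", "abbc"]),  # {m,n} : m to n repetitions
    (r"a[bcd]e", "abe ace aee", ["abe", "ace"]),  # [] : one character from the set
    (r"abc|def", "abcdef", ["abc", "def"]),       # | : either side
]

results = [re.findall(pattern, text) for pattern, text, _ in examples]
for (pattern, _, _), found in zip(examples, results):
    print(pattern, "->", found)
```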
The backslash \ has three distinct roles that are worth emphasizing:
- A backslash followed by a metacharacter removes its special function (the special character is escaped into an ordinary character).
- A backslash followed by certain ordinary characters gives them a special function (these are the predefined character sets).
- A backslash followed by a number refers to the string matched by the group with that number.
a = re.search(r'(tina)(fei)haha\2', 'tinafeihahafei tinafeihahatina').group()
print(a)
Result:
tinafeihahafei
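As a rough illustration of the three backslash roles listed above, here is a small sketch. The sample strings and variable names are assumptions chosen for the example.

```python
import re

# Role 1: a backslash removes the special meaning of a metacharacter,
# so \$ and \. match a literal dollar sign and a literal dot.
price = re.search(r"\$\d+\.\d\d", "total: $12.50").group()

# Role 2: a backslash gives an ordinary letter a special meaning (\d = digit).
digits = re.findall(r"\d", "a1b22")

# Role 3: a backslash plus a number refers back to what that group matched,
# here a word that is immediately repeated.
repeated_word = re.search(r"(\w+) \1", "it is is fine").group()

print(price, digits, repeated_word)
```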
2. Predefined character sets (these can also be written inside a character set [...])
Syntax | Description | Example pattern | Matches
\d | Digit: [0-9] | a\dc | a1c
\D | Non-digit: [^\d] | a\Dc | abc
\s | Any whitespace character: [<space>\t\r\n\f\v] | a\sc | a c
\S | Non-whitespace character: [^\s] | a\Sc | abc
\w | Word character, i.e. a letter, digit or underscore: [a-zA-Z0-9_] | a\wc | abc
\W | Non-word character, that is, a special character: [^\w] | a\Wc | a c
\A | Matches only at the beginning of the string, similar to ^ | \Aabc | abc
\Z | Matches only at the end of the string, similar to $ | abc\Z | abc
\b | Matches a word boundary, that is, the position between a \w and a \W character, such as the position between a word and a space. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb' | \babc\b; a\b!bc | abc surrounded by spaces; a!bc
\B | Matches a position that is not a word boundary (the opposite of \b) | a\Bbc | abc
The word boundary \b deserves a closer look. In the first call below the pattern is not a raw string, so Python reads '\b' as a backspace character and nothing matches:
w = re.findall('\btina', 'tian tinaaaa')
print(w)
s = re.findall(r'\btina', 'tian tinaaaa')
print(s)
v = re.findall(r'\btina', 'tian#tinaaaa')
print(v)
a = re.findall(r'\btina\b', 'tian#[email protected]')
print(a)
The results are as follows:
[]
['tina']
['tina']
['tina']
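The difference between the first two calls above comes down to how Python itself reads the string literal before the re module ever sees it. A minimal sketch (the variable names are assumptions for illustration):

```python
import re

plain = "\btina"    # Python turns \b into a single backspace character
raw = r"\btina"     # raw string: backslash and "b" reach the re module unchanged

lengths = (len(plain), len(raw))
plain_result = re.findall(plain, "tian tinaaaa")   # looks for a literal backspace
raw_result = re.findall(raw, "tian tinaaaa")       # looks for a word boundary

print(lengths, plain_result, raw_result)
```

For this reason it is usually safest to write every pattern as a raw string.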
3. Special group syntax:
Syntax | Description | Example pattern | Matches
(?P<name>...) | A group that is given an alias in addition to its number | (?P<id>abc){2} | abcabc
(?P=name) | References the string matched by the group whose alias is name | (?P<id>\d)abc(?P=id) | 1abc1; 5abc5
\<number> | References the string matched by the group whose number is <number> | (\d)abc\1 | 1abc1; 5abc5
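Named groups are easiest to understand when read back from a match object. The sketch below is illustrative only: the date pattern, the group names `year`, `month`, `day` and the sample strings are assumptions, not part of the original article.

```python
import re

# A group can be read by its alias or by its number.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_pattern.search("released on 2015-09-13")

by_name = m.group("year")
by_number = m.group(1)
as_dict = m.groupdict()    # all named groups as a dictionary

# (?P=id) must match exactly what the group named id matched earlier.
repeated = re.search(r"(?P<id>\d)abc(?P=id)", "1abc2 5abc5").group()

print(by_name, by_number, as_dict, repeated)
```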
III. Commonly used functions of the re module
1. compile()
Compiles a regular expression pattern and returns a pattern object. (Frequently used regular expressions can be compiled into pattern objects, which is somewhat more efficient.)
Format:
re.compile(pattern, flags=0)
pattern: the expression string to compile.
flags: compile flags that modify how the regular expression matches, such as case sensitivity and multiline matching. The commonly used flags are:
Flag | Meaning
re.S (DOTALL) | Makes . match every character, including newlines
re.I (IGNORECASE) | Makes the match case-insensitive
re.L (LOCALE) | Performs locale-aware matching (for example for French)
re.M (MULTILINE) | Multiline matching; affects ^ and $
re.X (VERBOSE) | Allows a more flexible layout so that the regular expression is easier to read
re.U (UNICODE) | Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
import re
tt = "Tina is a good girl, she's cool, clever, and so on..."
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))  # find all words that contain 'oo'
The execution result is as follows:
['good', 'cool']
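To make the effect of the flags listed above more concrete, here is a short sketch. The sample text and variable names are assumptions for illustration; the flags themselves are the standard ones from the re module.

```python
import re

text = "first line\nSecond line"

dot_default = re.findall(r"line.", text)        # . does not match the newline
dot_all = re.findall(r"line.", text, re.S)      # with DOTALL it does
line_starts = re.findall(r"^\w+", text, re.M)   # ^ matches at the start of each line
ignore_case = re.findall(r"second", text, re.I)

# VERBOSE allows whitespace and comments inside the pattern.
phone = re.compile(r"""
    \d{3}   # area code
    -       # separator
    \d{6}   # local number
""", re.X)
phone_result = phone.findall("010-628888")

print(dot_default, dot_all, line_starts, ignore_case, phone_result)
```

Several flags can be combined with the | operator, for example re.I | re.M.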
2. match()
Determines whether the RE matches at the beginning of the string. Note: this method does not require a full match. If the string still has characters left after the pattern ends, the match is still considered successful. If you want a full match, add the boundary anchor '$' at the end of the expression.
Format:
re.match(pattern, string, flags=0)
print(re.match('com', 'comwww.runcomoob').group())
print(re.match('com', 'Comwww.runcomoob', re.I).group())
The execution results are as follows:
com
Com
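The note about full matches can be checked with a small sketch. It uses the '$' anchor described above and, as an alternative not mentioned in the article, the standard `re.fullmatch` function; the variable names are assumptions.

```python
import re

# match() only requires the pattern to fit at the start of the string.
prefix_only = re.match(r"com", "comwww.runcomoob")

# With a trailing $, leftover characters make the match fail.
anchored = re.match(r"com$", "comwww.runcomoob")

# re.fullmatch() requires the whole string to match the pattern.
exact = re.fullmatch(r"com", "com")

print(prefix_only is not None, anchored, exact.group())
```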
3. search()
Format:
re.search(pattern, string, flags=0)
The re.search function scans the string for the pattern and returns as soon as the first match is found; if nothing in the string matches, it returns None.
print(re.search(r'\dcom', 'www.4comrunoob.5com').group())
The execution result is as follows:
4com
* Note: when match and search succeed, they return a match object, which has the following methods:
- group() returns the string matched by the RE
- start() returns the position where the match starts
- end() returns the position where the match ends
- span() returns a tuple (start, end) containing the positions of the match
- group() returns the string matched by the RE as a whole; it can also take several group numbers at once, in which case it returns the strings matched by those groups.
a. group() returns the whole string matched by the RE.
b. group(n, m) returns a tuple of the strings matched by groups n and m, and raises an IndexError exception if a group number does not exist.
c. groups() returns a tuple containing the strings of all groups in the regular expression, from group 1 to the last group. groups() usually takes no arguments.
import re
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))  # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))  # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))  # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))  # 456
# group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair.
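The match object methods listed above can be exercised together in one place. This is only a sketch: the sample string and the variable names are assumptions, and the last lines show one possible way to guard against search() returning None.

```python
import re

m = re.search(r"([0-9]+)([a-z]+)([0-9]+)", "__123abc456__")

whole = m.group()                  # same as m.group(0)
start, end = m.start(), m.end()    # positions of the whole match
span = m.span()                    # (start, end) as a tuple
first_and_third = m.group(1, 3)    # several groups at once give a tuple
all_groups = m.groups()            # every group, from 1 to the last

# Asking for a group that does not exist raises IndexError.
try:
    m.group(4)
    missing_group_error = None
except IndexError as exc:
    missing_group_error = exc

# search() returns None when nothing matches, so check before calling group().
no_match = re.search(r"\d+", "abc")
safe_text = no_match.group() if no_match else None

print(whole, span, first_and_third, all_groups, safe_text)
```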
4. findall()
re.findall scans the whole string, collects every substring that matches and returns them as a list.
Format:
re.findall(pattern, string, flags=0)
p = re.compile(r'\d+')
print(p.findall('o1n2m3k4'))
The execution result is as follows:
['1', '2', '3', '4']
import re
tt = "Tina is a good girl, she's cool, clever, and so on..."
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))
print(re.findall(r'(\w)*oo(\w)', tt))  # () denotes a subexpression (group)
The execution results are as follows:
['good', 'cool']
[('g', 'd'), ('c', 'l')]
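The second result above shows that groups change what findall returns. A small sketch of the cases, using a shortened sample sentence and variable names that are assumptions for illustration:

```python
import re

tt = "Tina is a good girl, she is cool"

whole = re.findall(r"\w*oo\w*", tt)                   # no groups: the full matches
one_group = re.findall(r"\w*(oo)\w*", tt)             # one group: only that group
two_groups = re.findall(r"(\w*)oo(\w*)", tt)          # several groups: tuples
non_capturing = re.findall(r"(?:\w*)oo(?:\w*)", tt)   # (?:...) groups are not captured

print(whole, one_group, two_groups, non_capturing)
```

When only grouping is needed and not capturing, the (?:...) form keeps findall returning the full matches.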
5. finditer()
Searches the string and returns an iterator that yields each match result (match object) in order, one for every substring that the RE matches.
Format:
re.finditer(pattern, string, flags=0)
matches = re.finditer(r'\d+', '12 drumm44ers drumming, 11 ... 10 ...')
for m in matches:
    print(m)
    print(m.group())
    print(m.span())
The execution results are as follows:
<_sre.SRE_Match object; span=(0, 2), match='12'>
12
(0, 2)
<_sre.SRE_Match object; span=(8, 10), match='44'>
44
(8, 10)
<_sre.SRE_Match object; span=(24, 26), match='11'>
11
(24, 26)
<_sre.SRE_Match object; span=(31, 33), match='10'>
10
(31, 33)
6. split()
Splits the string at every substring that the pattern matches and returns the pieces as a list.
For example, re.split(r'\s+', text) splits the string into a list of words at whitespace.
Format:
re.split(pattern, string[, maxsplit])
maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
print(re.split(r'\d+', 'one1two2three3four4five5'))
The execution result is as follows:
['one', 'two', 'three', 'four', 'five', '']
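The effect of maxsplit is easiest to see side by side with an unlimited split. The sketch below also shows that a capturing group in the pattern keeps the separators in the result; the sample string and variable names are assumptions.

```python
import re

s = "one1two2three3four"

all_parts = re.split(r"\d+", s)               # split at every match
limited = re.split(r"\d+", s, maxsplit=2)     # at most two splits
with_delims = re.split(r"(\d+)", s)           # a group keeps the separators

print(all_parts, limited, with_delims)
```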
7. sub()
Replaces every substring of the string that the RE matches and returns the resulting string.
Format:
re.sub(pattern, repl, string, count)
import re
text = "Jgood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', '-', text))
The execution result is as follows:
Jgood-is-a-handsome-boy,-he-is-cool,-clever,-and-so-on...
The second argument is the replacement string, in this case '-'.
The fourth argument is the maximum number of replacements. The default is 0, which means that every match is replaced.
re.sub also accepts a function as the replacement, which allows more complex processing of each match.
For example, re.sub(r'\s', lambda m: '[' + m.group(0) + ']', text, count=0) replaces each space ' ' in the string with '[ ]'.
import re
text = "Jgood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', lambda m: '[' + m.group(0) + ']', text, count=0))
The execution result is as follows:
Jgood[ ]is[ ]a[ ]handsome[ ]boy,[ ]he[ ]is[ ]cool,[ ]clever,[ ]and[ ]so[ ]on...
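A replacement function receives the match object and returns the replacement text, which makes computed replacements possible. In the sketch below the helper `double`, the sample strings and the variable names are hypothetical and only serve as illustration.

```python
import re

def double(match):
    """Hypothetical helper: return the matched number doubled, as text."""
    return str(int(match.group(0)) * 2)

# A function as repl: each match is replaced by the function's return value.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits how many matches are replaced.
first_only = re.sub(r"\s+", "-", "a b c d", count=1)

# A string repl can refer to groups with \1, \2, ...
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(doubled, first_only, swapped)
```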
8. subn()
Performs the same replacement as sub(), but returns a tuple of the new string and the number of replacements made.
Format:
re.subn(pattern, repl, string, count=0, flags=0)
print(re.subn('[1-2]', 'A', '123456abcdef'))
print(re.sub("g.t", "have", 'I get A, I got B, I gut C'))
print(re.subn("g.t", "have", 'I get A, I got B, I gut C'))
The execution results are as follows:
('AA3456abcdef', 2)
I have A, I have B, I have C
('I have A, I have B, I have C', 3)
IV. Some points to note
1. The difference between re.match, re.search and re.findall:
re.match only matches at the beginning of the string: if the beginning of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match. re.findall returns every match in the string as a list.
a = re.search(r'[\d]', 'abc33').group()
print(a)
p = re.match(r'[\d]', 'abc33')
print(p)
b = re.findall(r'[\d]', 'abc33')
print(b)
Execution results:
3
None
['3', '3']
2. Greedy and non-greedy matching
*?, +?, ??, {m,n}?: the plain quantifiers *, + and so on are greedy, that is, they match as much as possible; adding a ? after them makes them lazy (non-greedy).
a = re.findall(r"a(\d+?)", 'a23b')
print(a)
b = re.findall(r"a(\d+)", 'a23b')
print(b)
Execution results:
['2']
['23']
a = re.match('<(.*)>', '<H1>title<H1>').group()
print(a)
b = re.match('<(.*?)>', '<H1>title<H1>').group()
print(b)
Execution results:
<H1>title<H1>
<H1>
a = re.findall(r"a(\d+)b", 'a3333b')
print(a)
b = re.findall(r"a(\d+?)b", 'a3333b')
print(b)
The execution results are as follows:
['3333']
['3333']
# Note: when the pattern is constrained on both sides of the quantifier, there is no difference between the two modes; the non-greedy mode has no effect.
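The difference between greedy and lazy quantifiers shows up most clearly when several candidates for the closing delimiter exist. A minimal sketch with a made-up HTML fragment (the sample text and names are assumptions):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)    # .* runs to the last > in the string
lazy = re.findall(r"<.*?>", html)     # .*? stops at the first > it can

print(greedy)
print(lazy)
```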
3. A small pitfall with flags
print(re.split('a', '1A1a2A3', re.I))  # the output is not case-insensitive
This is because re.split(pattern, string, maxsplit, flags) takes four parameters. When we pass three positional arguments, re.I is taken as the third parameter, maxsplit, so it has no effect as a flag. To make re.I work here, write flags=re.I.
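The pitfall can be reproduced without relying on positional arguments. The sketch below assumes a mixed-case sample string; it uses the fact that re.I has the integer value 2, so passing it in the maxsplit position behaves like maxsplit=2.

```python
import re

s = "1A1a2A3"

flag_value = int(re.I)    # re.I is an integer flag with the value 2

# What the mistaken call effectively does: split on lowercase "a" only,
# with at most 2 splits.
wrong = re.split("a", s, maxsplit=flag_value)

# Passing the flag by keyword makes the split case-insensitive.
right = re.split("a", s, flags=re.I)

print(flag_value, wrong, right)
```

Passing flags by keyword avoids this class of mistake for split, sub and subn alike.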
V. Small regular expression exercises
1. Match a phone number
p = re.compile(r'\d{3}-\d{6}')
print(p.findall('010-628888'))
2. Match an IP address
re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", "192.168.1.1")
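For validation rather than searching, one possible variant of the IP exercise is sketched below. It is not the article's pattern: the names `OCTET`, `IPV4` and `is_ipv4` are hypothetical, the octet alternatives are reordered so the longest forms are tried first, leading zeros are rejected by assumption, and `re.fullmatch` is used so that the whole string must be an address.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
# (Rejecting leading zeros is an assumption made for this sketch.)
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets separated by literal dots.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def is_ipv4(text):
    """Hypothetical helper: True if the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.1.1"), is_ipv4("256.1.1.1"), is_ipv4("1.2.3"))
```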