Regular Expressions in Python (the re module)
I. Introduction
Regular expressions are a small, highly specialized programming language. In Python, programmers can use the re module directly to perform regular expression matching. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.
II. Meanings of common characters in regular expressions
1. Ordinary characters and 11 metacharacters:
Syntax | Description | Example pattern | Matches
Ordinary characters | Match themselves | abc | abc
. | Matches any character except the newline "\n" (in DOTALL mode it also matches the newline) | a.c | abc
\ | Escape character; changes the meaning of the character that follows it | a\.c; a\\c | a.c; a\c
* | Matches the preceding character zero or more times | abc* | ab; abccc
+ | Matches the preceding character one or more times | abc+ | abc; abccc
? | Matches the preceding character zero or one time | abc? | ab; abc
^ | Matches the start of the string; in multiline mode, matches the start of each line | ^abc | abc
$ | Matches the end of the string; in multiline mode, matches the end of each line | abc$ | abc
"|" | Alternation: matches either the expression on its left or the one on its right, trying them from left to right. If "|" is not enclosed in (), its scope is the entire regular expression | "abc|def" | abc; def
{} | {m} matches the preceding character exactly m times; {m,n} matches it m to n times. If n is omitted ({m,}), it matches m or more times | ab{1,2}c | abc; abbc
[] | Character set. The corresponding position can be any one character from the set. Characters can be listed one by one or given as a range, for example [abc] or [a-c]. [^abc] negates the set, that is, any character other than a, b, or c. Most metacharacters lose their special meaning inside a character set; to include ], ^, or - literally, escape them with a backslash | a[bcd]e | abe; ace; ade
() | The enclosed expression is treated as a group. Groups are numbered from the left of the expression: each opening parenthesis "(" increases the group number by 1. A group can be followed by a quantifier. A "|" inside a group applies only within that group | (abc){2}; "a(123|456)c" | abcabc; a456c
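A few of the metacharacters in the table above can be checked with a short script. This is only a sketch: the helper name matches and the sample strings are illustrative assumptions, and re.fullmatch is used so that each test string has to match the pattern in its entirety.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# * allows zero or more of the preceding character
star = [matches(r'abc*', s) for s in ('ab', 'abccc', 'a')]
# {1,2} allows one or two of the preceding character
braces = [matches(r'ab{1,2}c', s) for s in ('abc', 'abbc', 'abbbc')]
# [bcd] accepts exactly one character from the set
char_set = [matches(r'a[bcd]e', s) for s in ('abe', 'ace', 'afe')]
# | inside () only applies within that group
group = [matches(r'a(123|456)c', s) for s in ('a123c', 'a456c', 'a789c')]

print(star)      # [True, True, False]
print(braces)    # [True, True, False]
print(char_set)  # [True, True, False]
print(group)     # [True, True, False]
```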
The backslash has three roles that are worth emphasizing:
- A backslash followed by a metacharacter removes its special meaning (it turns a special character into an ordinary one).
- A backslash followed by an ordinary character gives it a special function (these are the predefined character classes).
- A backslash followed by a group number refers to the string matched by that group.
import re
# \2 must repeat whatever group 2, (fei), matched
a = re.search(r'(tina)(fei)haha\2', 'tinafeihahafei tinafeihahatina').group()
print(a)
Result:
tinafeihahafei
2. Predefined character classes (can be used inside a character set)
Syntax | Description | Example pattern | Matches
\d | Digit: [0-9] | a\dc | a1c
\D | Non-digit: [^\d] | a\Dc | abc
\s | Any whitespace character: [<space>\t\r\n\f\v] | a\sc | a c
\S | Non-whitespace character: [^\s] | a\Sc | abc
\w | Word character, including the underscore: [A-Za-z0-9_] | a\wc | abc
\W | Non-word character, that is, special characters: [^\w] | a\Wc | a c
\A | Matches only at the start of the string | \Aabc | abc
\Z | Matches only at the end of the string. It is similar to $, except that $ also matches just before a trailing newline and, in multiline mode, at the end of each line | abc\Z | abc
\b | Matches a word boundary, that is, the position between a \w character and a \W character (for example between a word and a space). For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb" | \babc\b; a\b!bc | <space>abc<space>; a!bc
\B | Matches a position that is not a word boundary (the opposite of \b) | a\Bbc | abc
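The predefined classes in the table above are easiest to understand by applying them to one string. A minimal sketch follows; the sample text and variable names are made up for illustration.

```python
import re

text = "order_42 shipped on 2024-01-15"

digits = re.findall(r'\d+', text)    # runs of digits
words = re.findall(r'\w+', text)     # runs of letters, digits, and underscores
pieces = re.split(r'\s+', text)      # split on runs of whitespace
non_word = re.findall(r'\W', text)   # every character that is not a word character

print(digits)    # ['42', '2024', '01', '15']
print(words)     # ['order_42', 'shipped', 'on', '2024', '01', '15']
print(pieces)    # ['order_42', 'shipped', 'on', '2024-01-15']
print(non_word)  # [' ', ' ', ' ', '-', '-']
```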
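The word boundary \b deserves particular attention. In an ordinary (non-raw) string literal, '\b' is the backspace character, so the pattern has to be written as a raw string:
import re
w = re.findall('\btina', 'tian tinaaaa')  # not a raw string: \b is a backspace here, so nothing matches
print(w)
s = re.findall(r'\btina', 'tian tinaaaa')
print(s)
v = re.findall(r'\btina', 'tian#tinaaaa')
print(v)
a = re.findall(r'\btina\b', 'tian#tina@aaa')
print(a)
The execution result is as follows:
[]
['tina']
['tina']
['tina']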
3. Special group syntax:
Syntax | Description | Example pattern | Matches
(?P<name>...) | A group that, in addition to its number, is given an alias | (?P<id>abc){2} | abcabc
(?P=name) | Matches the same string that the group with the alias name matched | (?P<id>\d)abc(?P=id) | 1abc1; 5abc5
\<number> | Matches the same string that the group with that number matched | (\d)abc\1 | 1abc1; 5abc5
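As a rough illustration of the named-group and back-reference syntax above, the sketch below finds a doubled word. The group name word and the sample sentences are assumptions chosen for the example.

```python
import re

# (?P<word>...) names the group; (?P=word) must match the same text again
repeated = re.search(r'(?P<word>\w+) (?P=word)', 'it was the the best')
print(repeated.group('word'))  # the
print(repeated.group(1))       # the  (a named group still has a number)
print(repeated.group())        # the the

# \1 is the numbered equivalent of (?P=name)
numbered = re.search(r'(\d)abc\1', 'x5abc5y')
print(numbered.group())        # 5abc5
```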
III. Common functions in the re module
1. compile()
Compiles a regular expression pattern and returns a pattern object. (Frequently used regular expressions can be compiled into pattern objects to improve efficiency.)
Format:
re.compile(pattern,flags=0)
pattern: the expression string to compile.
flags: compilation flags that modify how the regular expression matches, such as case sensitivity or multi-line matching. Common flags include:
Flag | Description
re.S (DOTALL) | Makes . match any character, including newlines
re.I (IGNORECASE) | Makes matching case-insensitive
re.L (LOCALE) | Locale-aware matching (for example, for French text)
re.M (MULTILINE) | Multi-line matching; affects ^ and $
re.X (VERBOSE) | Lets you write the regular expression in a more flexible, readable layout, with whitespace and comments
re.U (UNICODE) | Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
import re
tt = "Tina is a good girl, she is cool, clever, and so on..."
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))  # find all words containing 'oo'
The execution result is as follows:
['good', 'cool']
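The effect of the flags listed above can be sketched as follows. The sample text and variable names are illustrative assumptions; flags are combined with the | operator.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ only matches at the very start of the string
no_flag = re.findall(r'^\w+', text)
# With re.M, ^ also matches at the start of every line
multiline = re.findall(r'^\w+', text, re.M)

# Without re.S, . stops at the newline; with re.S it crosses it
dot_default = re.match(r'.+', text).group()
dot_all = re.match(r'.+', text, re.S).group()

# re.X lets the pattern contain whitespace and comments
phone = re.compile(r'''
    \d{3}   # area code
    -       # separator
    \d{6}   # local number
''', re.X)

# Several flags can be combined with |
combined = re.findall(r'^second', text.upper(), re.I | re.M)

print(no_flag)                      # ['first']
print(multiline)                    # ['first', 'second']
print(dot_default)                  # first line
print(dot_all == text)              # True
print(phone.findall('010-628888'))  # ['010-628888']
print(combined)                     # ['SECOND']
```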
2. match()
Determines whether the RE matches at the start of the string. Note: this method does not perform a full match. If the string still has characters left when the pattern ends, the match is still considered successful. To require a full match, add the boundary anchor '$' at the end of the expression.
Format:
re.match(pattern, string, flags=0)
import re
print(re.match('com', 'comwww.runcomoob').group())
print(re.match('com', 'Comwww.runcomoob', re.I).group())  # re.I ignores case
The execution result is as follows:
com
Com
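The difference between a prefix match and a full match can be sketched like this. re.fullmatch (available since Python 3.4) is shown as an alternative to appending '$'; the variable names are illustrative.

```python
import re

# match() succeeds as long as the pattern matches a prefix of the string
prefix = re.match(r'com', 'comwww.runcomoob')
# Adding $ forces the pattern to reach the end of the string
anchored = re.match(r'com$', 'comwww.runcomoob')
# re.fullmatch requires the whole string to match
full = re.fullmatch(r'com', 'com')

print(prefix is not None)  # True
print(anchored)            # None
print(full.group())        # com

# match() returns None on failure, so test the result before calling group()
m = re.match(r'com', 'www.com')
result = m.group() if m else 'no match'
print(result)              # no match
```

Calling .group() directly on a failed match raises AttributeError, which is why the last lines check the result first.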
3. search()
Format:
re.search(pattern, string, flags=0)
The re.search function scans the string for a match of the pattern and returns the first match it finds. If nothing in the string matches, it returns None.
import re
print(re.search(r'\dcom', 'www.4comrunoob.5com').group())
The execution result is as follows:
4com
* Note: when match() or search() succeeds, it returns a match object, which has the following methods:
- group() returns the string matched by the RE.
- start() returns the position where the match starts.
- end() returns the position where the match ends.
- span() returns a tuple containing the (start, end) positions of the match.
- group() also accepts group numbers. Several numbers can be passed at once, and the strings matched by those groups are returned:
a. group() returns the whole string matched by the RE;
b. group(n, m) returns the strings matched by groups n and m; if a group number does not exist, an IndexError is raised;
c. groups() returns a tuple containing the strings of all groups in the regular expression, from group 1 up to the last group. It is usually called without arguments, and the elements of the tuple are the groups defined in the regular expression.
import re
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))  # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))  # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))  # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))  # 456
# group(1) returns the part matched by the first pair of parentheses,
# group(2) the part matched by the second, and group(3) the part matched by the third.
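The remaining match object methods listed above (start, end, span, groups, and group with several numbers) can be tried on a single match. This is a small sketch with a made-up input string; the variable raised only records that the error occurred.

```python
import re

m = re.search(r'([0-9]+)([a-z]+)([0-9]+)', 'xx123abc456')

print(m.group())      # 123abc456
print(m.start())      # 2
print(m.end())        # 11
print(m.span())       # (2, 11)
print(m.group(1, 3))  # ('123', '456')
print(m.groups())     # ('123', 'abc', '456')

# Asking for a group that does not exist raises IndexError
raised = False
try:
    m.group(4)
except IndexError:
    raised = True
print(raised)         # True
```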
4. findall()
re.findall scans the string and returns all non-overlapping matches as a list.
Format:
re.findall(pattern, string, flags=0)
import re
p = re.compile(r'\d+')
print(p.findall('o1n2m3k4'))
The execution result is as follows:
['1', '2', '3', '4']
import re
tt = "Tina is a good girl, she is cool, clever, and so on..."
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))
print(re.findall(r'(\w)*oo(\w)', tt))  # () marks a subexpression (group); findall then returns the groups, not the whole match
The execution result is as follows:
['good', 'cool']
[('g', 'd'), ('c', 'l')]
5. finditer()
Finds all substrings that match the RE and returns an iterator that yields each result (a match object) in order.
Format:
re.finditer(pattern, string, flags=0)
import re
matches = re.finditer(r'\d+', '12 drumm44ers drumming, 11 ... 10 ...')
for i in matches:
    print(i)
    print(i.group())
    print(i.span())
The execution result is as follows (Python 3.7 and later print re.Match instead of _sre.SRE_Match):
<_sre.SRE_Match object; span=(0, 2), match='12'>
12
(0, 2)
<_sre.SRE_Match object; span=(8, 10), match='44'>
44
(8, 10)
<_sre.SRE_Match object; span=(24, 26), match='11'>
11
(24, 26)
<_sre.SRE_Match object; span=(31, 33), match='10'>
10
(31, 33)
6. split()
Splits the string at each substring that matches the pattern and returns the pieces as a list.
For example, re.split(r'\s+', text) splits a string into a list of words at whitespace.
Format:
re.split(pattern, string[, maxsplit])
maxsplit specifies the maximum number of splits. If it is not given, the string is split at every match.
import re
print(re.split(r'\d+', 'one1two2three3four4five5'))
The execution result is as follows:
['one', 'two', 'three', 'four', 'five', '']
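The maxsplit parameter described above can be sketched as follows. The input string and variable names are illustrative; the last line shows one possible way to drop the empty string that appears when the text ends with a separator.

```python
import re

text = 'one1two2three3four4'

# Split at every match
everything = re.split(r'\d+', text)
# Stop after two splits; the rest of the string is returned unsplit
first_two = re.split(r'\d+', text, maxsplit=2)
# Drop empty pieces, such as the trailing '' produced by the final separator
non_empty = [part for part in everything if part]

print(everything)  # ['one', 'two', 'three', 'four', '']
print(first_two)   # ['one', 'two', 'three3four4']
print(non_empty)   # ['one', 'two', 'three', 'four']
```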
7. sub()
Replaces every substring of the string that matches the pattern and returns the resulting string.
Format:
re.sub(pattern, repl, string, count)
import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', '-', text))
The execution result is as follows:
JGood-is-a-handsome-boy,-he-is-cool,-clever,-and-so-on...
The second argument is the replacement string; in this example it is '-'.
The fourth argument, count, is the maximum number of replacements. The default value is 0, which means that every match is replaced.
re.sub also accepts a function as the replacement, which allows more complex processing of each match.
For example, re.sub(r'\s', lambda m: '[' + m.group(0) + ']', text, 0) wraps every whitespace character in the string in square brackets.
import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', lambda m: '[' + m.group(0) + ']', text, count=0))
The execution result is as follows:
JGood[ ]is[ ]a[ ]handsome[ ]boy,[ ]he[ ]is[ ]cool,[ ]clever,[ ]and[ ]so[ ]on...
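The count parameter and a named replacement function can be sketched as follows. The function name shout and the shortened sample text are assumptions made for this example.

```python
import re

text = "JGood is a handsome boy"

# count=1 replaces only the first match
first_only = re.sub(r'\s+', '-', text, count=1)

def shout(match):
    """Replacement function: receives a match object and returns a string."""
    return match.group(0).upper()

# Upper-case every word of four or more characters
shouted = re.sub(r'\b\w{4,}\b', shout, text)

print(first_only)  # JGood-is a handsome boy
print(shouted)     # JGOOD is a HANDSOME boy
```

A named function does the same job as the lambda shown earlier and is easier to read once the replacement logic grows beyond one expression.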
8. subn()
Works like sub(), but returns a tuple containing the new string and the number of replacements made.
Format:
subn(pattern, repl, string, count=0, flags=0)
import re
print(re.subn('[1-2]', 'A', '123456abcdef'))
print(re.sub("g.t", "have", 'I get A, I got B ,I gut C'))
print(re.subn("g.t", "have", 'I get A, I got B ,I gut C'))
The execution result is as follows:
('AA3456abcdef', 2)
I have A, I have B ,I have C
('I have A, I have B ,I have C', 3)
IV. Notes
1. Differences between re.match, re.search, and re.findall:
re.match only matches at the start of the string; if the start of the string does not conform to the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match. re.findall returns every match in the string as a list.
import re
a = re.search(r'[\d]', "abc33").group()
print(a)
p = re.match(r'[\d]', "abc33")
print(p)
b = re.findall(r'[\d]', "abc33")
print(b)
Execution result:
3
None
['3', '3']
2. Greedy and non-greedy matching
*?, +?, ??, {m,n}?: the quantifiers *, +, ? and {m,n} are greedy, that is, they match as much as possible. Appending ? turns them into lazy (non-greedy) quantifiers that match as little as possible.
import re
a = re.findall(r"a(\d+?)", 'a23b')
print(a)
b = re.findall(r"a(\d+)", 'a23b')
print(b)
Execution result:
['2']
['23']
import re
a = re.match('<(.*)>', '<H1>title<H1>').group()
print(a)
b = re.match('<(.*?)>', '<H1>title<H1>').group()
print(b)
Execution result:
<H1>title<H1>
<H1>
import re
a = re.findall(r"a(\d+)b", 'a3333b')
print(a)
b = re.findall(r"a(\d+?)b", 'a3333b')
print(b)
The execution result is as follows:
['3333']
['3333']
Note that when the pattern requires specific text both before and after the quantified part, greedy and non-greedy matching give the same result, so the non-greedy modifier has no effect.
3. A small pitfall when using flags
import re
print(re.split('a', '1A1a2A3', re.I))  # the split is still case-sensitive
This is because the signature is re.split(pattern, string, maxsplit, flags), with four parameters. When three arguments are passed positionally, re.I is taken as the third parameter, maxsplit, so it does not act as a flag. For re.I to take effect, write it as flags=re.I.
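The pitfall and its fix can be sketched side by side. The sample string and variable names are illustrative; maxsplit=int(re.I) is written out explicitly to show what the positional call effectively does.

```python
import re

text = '1A1a2A3'

# re.I has the integer value 2, so passing it in the third position
# is the same as asking for maxsplit=2 with no flags at all
as_maxsplit = re.split('a', text, maxsplit=int(re.I))
# Passing it by keyword makes it a flag
as_flag = re.split('a', text, flags=re.I)
# A compiled pattern carries its flags with it, which avoids the problem
compiled = re.compile('a', re.I).split(text)

print(int(re.I))    # 2
print(as_maxsplit)  # ['1A1', '2A3']
print(as_flag)      # ['1', '1', '2', '3']
print(compiled)     # ['1', '1', '2', '3']
```

The same applies to re.sub and re.subn, whose fourth positional parameter is count rather than flags.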
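V. Small regular expression exercises
1. Matching a phone number
import re
p = re.compile(r'\d{3}-\d{6}')
print(p.findall('010-628888'))
2. Matching an IP address
import re
# the final octet must not be followed by a dot, so \. appears only inside the repeated group
print(re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", "192.168.1.1").group())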
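To check whether a whole string is an IP address, rather than merely containing one, the pattern can be combined with fullmatch. The following is a sketch under stated assumptions: the names OCTET, IPV4, and is_ipv4 are hypothetical, the octet pattern is the one used above with the stray \. removed from the last alternative, and non-capturing groups (?:...) are used because the group contents are not needed.

```python
import re

# One octet: 0-199 (optionally zero-padded), 200-249, or 250-255
OCTET = r'(?:[01]?\d?\d|2[0-4]\d|25[0-5])'
# Three "octet." parts followed by a final octet
IPV4 = re.compile(r'(?:' + OCTET + r'\.){3}' + OCTET)

def is_ipv4(text):
    """Return True if text is exactly one dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

candidates = ['192.168.1.1', '255.255.255.255', '256.1.1.1', '1.2.3', '10.0.0.1.']
results = {c: is_ipv4(c) for c in candidates}
for address, ok in results.items():
    print(address, ok)
# 192.168.1.1 True
# 255.255.255.255 True
# 256.1.1.1 False
# 1.2.3 False
# 10.0.0.1. False
```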
That is all for this article. I hope it is helpful for your learning.