1. Regular Expression basics
1.1. Brief Introduction
Regular expressions are not a part of Python itself. They are a powerful tool for processing strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more capable. Because of this independence, the syntax is essentially the same in every language that provides regular expressions; what differs is how much of the syntax each language supports, and the unsupported parts are usually the rarely used ones. If you have already used regular expressions in another language, a quick skim is all you need.
The following figure shows the matching process using a regular expression:
The general matching process is as follows: the expression is compared with the characters of the text one by one. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. If the expression contains quantifiers or boundary matchers, the process is slightly different, but it is still easy to understand: look at the examples and try them a few times yourself.
The following lists the regular expression metacharacters and syntax supported by Python:
1.2. Greedy and non-greedy modes of quantifiers
Regular expressions are usually used to search a text for matching strings. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", while the non-greedy quantifier in "ab*?" finds "a".
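The difference is easy to check for yourself. Here is a minimal sketch of the "abbbc" example above; the variable names text, greedy and lazy are only illustrative.

```python
import re

text = "abbbc"

# Greedy: b* takes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? takes as few as it can, which here is zero.
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a
```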
1.3. Backslash troubles
Like most programming languages, regular expressions use "\" as the escape character, which can lead to a plague of backslashes. To match a literal "\" in the text, the regular expression written in an ordinary string literal needs four backslashes, "\\\\": the first two and the last two are each converted into one backslash by the programming language, leaving two backslashes, which the regular expression engine then interprets as one escaped backslash. Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about dropping a backslash, and the expressions you write are more intuitive.
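As a rough illustration of the two spellings, the sketch below matches a single backslash both ways. The sample text "C:\temp" and the variable names are made up for this example.

```python
import re

text = "C:\\temp"   # the actual text is C:\temp, with one backslash

# Ordinary string literal: four backslashes in the source become two in the
# pattern, which the regex engine reads as one literal backslash.
plain = re.search("\\\\", text)

# Raw string literal: the two backslashes reach the regex engine unchanged.
raw = re.search(r"\\", text)

print(plain.span() == raw.span())   # True
print(len(raw.group()))             # 1
print(re.findall(r"\d", "a1b2"))    # ['1', '2']
```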
1.4. Matching modes
Regular expressions provide several matching modes, such as case-insensitive and multiline matching. They are covered together with re.compile(pattern[, flags]), the factory method of the Pattern class.
2. re module
2.1. Start Using re
Python supports regular expressions through the re module. The usual way to use re is to first compile the string form of the regular expression into a Pattern instance, then use the Pattern instance to process the text and obtain the matching result (a Match instance), and finally use the Match instance to retrieve information and perform further operations.
    # encoding: UTF-8
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'hello')

    # Use the Pattern to match the text; None is returned if the match fails
    match = pattern.match('hello world!')

    if match:
        # Use the Match object to obtain group information
        print(match.group())

    ### output ###
    # hello
re.compile(strPattern[, flag]):
This function is the factory method of the Pattern class. It compiles a regular expression in string form into a Pattern object. The second parameter, flag, is the matching mode; several modes can take effect at the same time by combining them with the bitwise OR operator '|', for example re.I | re.M. You can also specify the modes inside the regex string itself: re.compile('pattern', re.I | re.M) and re.compile('(?im)pattern') are equivalent.
Optional values:
- re.I (re.IGNORECASE): case-insensitive matching.
- re.M (re.MULTILINE): multiline mode, which changes the behavior of '^' and '$' so that they also match at the start and end of each line.
- re.S (re.DOTALL): dot-matches-all mode, which changes the behavior of '.' so that it also matches a newline.
- re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale settings (in Python 3 it can be used only with bytes patterns).
- re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (already the default for str patterns in Python 3).
- re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:
    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
    b = re.compile(r"\d+\.\d*")
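To see the matching modes in action, here is a small sketch of the I, M and S flags and of the inline (?im) form described above. The sample text and the names by_flag and inline are assumptions made for this example.

```python
import re

text = "first line\nHELLO again"

# Flags combined with the bitwise OR operator ...
by_flag = re.compile(r"^hello", re.I | re.M)
# ... or written inline at the start of the pattern.
inline = re.compile(r"(?im)^hello")

print(by_flag.search(text).group())   # HELLO
print(inline.search(text).group())    # HELLO

# Without re.M, '^' matches only at the very start of the text.
print(re.search(r"^hello", text, re.I))   # None

# Without re.S, '.' does not match the newline between the two lines.
print(re.search(r"line.HELLO", text))     # None
dotall = re.search(r"line.HELLO", text, re.S)
print(dotall.group() == "line\nHELLO")    # True
```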
The re module also provides many module-level functions that offer the same regular expression features. Each of them can be replaced by the corresponding method of a Pattern instance; their only advantage is that you write one less line of re.compile() code, at the cost of not being able to reuse the compiled Pattern object. These functions are introduced together with the instance methods of the Pattern class. The first example in this section can be shortened to:
    m = re.match(r'hello', 'hello world!')
    print(m.group())
The re module also provides the function escape(string), which returns the string with an escape character added before each regular expression metacharacter in it, such as *, + and ?. It is useful when you need to match a string that contains many metacharacters.
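A quick sketch of why escape() matters: the string "1+1=2?" (chosen here only for illustration) does not even match itself when used as a pattern, because '+' and '?' are read as quantifiers.

```python
import re

literal = "1+1=2?"

# Unescaped, '+' and '?' act as quantifiers, so the pattern
# does not match the text it was copied from.
print(re.search(literal, literal))            # None

# Escaped, every metacharacter is matched literally.
escaped = re.escape(literal)
print(re.search(escaped, literal).group())    # 1+1=2?
```

The exact escaped string is not printed here because which characters get a backslash differs between Python versions.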
2.2. Match
A Match object is the result of one match and contains a lot of information about it. You can use the read-only attributes and methods that Match provides to obtain this information.
Attributes:
- string: the text used for matching.
- re: the Pattern object used for matching.
- pos: the index in the text at which the regular expression started searching. Its value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
- endpos: the index in the text at which the regular expression stopped searching. Its value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
- lastindex: the index of the last captured group. None if no group was captured.
- lastgroup: the alias of the last captured group. None if that group has no alias or if no group was captured.
Methods:
- group([group1, ...]):
Returns the string captured by one or more groups; when several arguments are given, the results are returned as a tuple. group1 can be a number or an alias; number 0 stands for the entire matched substring, and group(0) is returned when no argument is given. A group that captured nothing returns None; a group that captured several times returns its last captured substring.
- groups([default]):
Returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that captured nothing; it defaults to None.
- groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings they captured. Groups without an alias are not included. default has the same meaning as in groups().
- start([group]):
Returns the start index, in string, of the substring captured by the specified group (the index of the first character of the substring). group defaults to 0.
- end([group]):
Returns the end index, in string, of the substring captured by the specified group (the index of the last character of the substring + 1). group defaults to 0.
- span([group]):
Returns (start(group), end(group)).
- expand(template):
Substitutes the matched groups into template and returns the result. In template you can reference groups with \id, \g<id> or \g<name>. Number 0 cannot be used in the \id form (\0 is not a group reference); \g<0> stands for the whole match. \id and \g<id> are equivalent, but \10 is treated as the 10th group; to express \1 followed by the character '0', you must use \g<1>0.
    import re
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

    print("m.string:", m.string)
    print("m.re:", m.re)
    print("m.pos:", m.pos)
    print("m.endpos:", m.endpos)
    print("m.lastindex:", m.lastindex)
    print("m.lastgroup:", m.lastgroup)

    print("m.group(1,2):", m.group(1, 2))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

    ### output ###
    # m.string: hello world!
    # m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
    # m.pos: 0
    # m.endpos: 12
    # m.lastindex: 3
    # m.lastgroup: sign
    # m.group(1,2): ('hello', 'world')
    # m.groups(): ('hello', 'world', '!')
    # m.groupdict(): {'sign': '!'}
    # m.start(2): 6
    # m.end(2): 11
    # m.span(2): (6, 11)
    # m.expand(r'\2 \1\3'): world hello!
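The example above does not touch a few of the corner cases mentioned in the method descriptions: groups that capture nothing, groups that capture several times, and \g<1>0 in a template. A minimal sketch of those, with patterns invented for illustration:

```python
import re

# The second group is optional and takes no part in this match.
m = re.match(r"(\d+)(\.\d+)?", "42")

print(m.group(2))      # None
print(m.groups())      # ('42', None)
print(m.groups(""))    # ('42', '')
print(m.lastindex)     # 1

# A group that captures several times keeps only its last capture.
repeated = re.match(r"(\w)+", "abc")
print(repeated.group(1))   # c

# \g<1>0 means group 1 followed by the character '0', not group 10.
print(m.expand(r"\g<1>0"))   # 420
```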
2.3. Pattern
The Pattern object is a compiled regular expression. You can use the methods it provides to match and search text.
A Pattern cannot be instantiated directly; it must be constructed with re.compile().
Pattern provides several read-only attributes for obtaining information about the expression:
- pattern: the expression string that was compiled.
- flags: the matching modes used at compile time, as a number.
- groups: the number of groups in the expression.
- groupindex: a dictionary whose keys are the aliases of the named groups in the expression and whose values are the corresponding group numbers. Groups without an alias are not included.
    import re
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

    print("p.pattern:", p.pattern)
    print("p.flags:", p.flags)
    print("p.groups:", p.groups)
    print("p.groupindex:", dict(p.groupindex))

    ### output ###
    # p.pattern: (\w+) (\w+)(?P<sign>.*)
    # p.flags: 48  (re.DOTALL is 16; Python 3 adds re.UNICODE, 32, for str patterns)
    # p.groups: 3
    # p.groupindex: {'sign': 3}
Instance methods [| re module functions]:
- match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string. If the end of pattern is reached and everything matched, a Match object is returned; if pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.
pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two parameters; its flags parameter specifies the matching mode used to compile pattern.
Note: this method is not a full match. If string still has characters left when pattern ends, the match is still considered successful. To require a full match, add the boundary matcher '$' at the end of the expression.
For an example, see section 2.1.
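The note about full matching and the pos and endpos parameters can be sketched as follows. The \d+ pattern and the sample strings are assumptions chosen for the demonstration.

```python
import re

pattern = re.compile(r"\d+")

# match() succeeds as soon as the pattern is satisfied; trailing text is ignored.
print(pattern.match("123abc").group())        # 123

# Adding '$' requires the match to reach the end of the string.
print(re.match(r"\d+$", "123abc"))            # None
print(re.match(r"\d+$", "123").group())       # 123

# match() only tries at index pos (0 by default); it does not scan forward.
print(pattern.match("abc123"))                # None
print(pattern.match("abc123", 3).group())     # 123

# endpos limits how far the match may extend.
print(pattern.match("abc123", 3, 5).group())  # 12
```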
- search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method searches string for a substring that matches. It tries to match pattern starting at index pos of string; if the end of pattern is reached and everything matched, a Match object is returned. If the match fails, pos is incremented by 1 and the match is tried again; if no match has been found by the time pos reaches endpos, None is returned.
pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two parameters; its flags parameter specifies the matching mode used to compile pattern.
    # encoding: UTF-8
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')

    # Use search() to find a matching substring; None is returned if there is none
    # match() would not succeed in this example
    match = pattern.search('hello world!')

    if match:
        # Use the Match object to obtain group information
        print(match.group())

    ### output ###
    # world
- split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at each matching substring and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
    import re

    p = re.compile(r'\d+')
    print(p.split('one1two2three3four4'))

    ### output ###
    # ['one', 'two', 'three', 'four', '']
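The effect of maxsplit is easiest to see next to the unlimited split. A small sketch using the same input string; passing maxsplit by keyword is a choice made here for readability.

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

# Stop after two splits; the rest of the string becomes the last element.
limited = p.split(text, maxsplit=2)
print(limited)   # ['one', 'two', 'three3four4']

# The module-level function accepts the same argument.
once = re.split(r"\d+", text, maxsplit=1)
print(once)      # ['one', 'two2three3four4']
```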
- findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
    import re

    p = re.compile(r'\d+')
    print(p.findall('one1two2three3four4'))

    ### output ###
    # ['1', '2', '3', '4']
- finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each matching result (a Match object) in order.
    import re

    p = re.compile(r'\d+')
    for m in p.finditer('one1two2three3four4'):
        print(m.group(), end=' ')

    ### output ###
    # 1 2 3 4
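Because finditer() yields Match objects rather than plain strings, the position of each match is available too. A short sketch, with a sample text made up for illustration:

```python
import re

p = re.compile(r"\d+")
text = "one1two22three333"

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)   # [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```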
- sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces each matching substring in string with repl and returns the resulting string.
When repl is a string, it can reference groups with \id, \g<id> or \g<name>. Number 0 cannot be used in the \id form (\0 is not a group reference); \g<0> stands for the whole match.
When repl is a function, it must accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.
    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(p.sub(r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.sub(func, s))

    ### output ###
    # say i, world hello!
    # I Say, Hello World!
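A few details of sub() that the example above leaves out are named references, the count argument, and the fact that a function's return value is inserted as-is. A minimal sketch; the group names first and second are invented for this example.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

# \g<name> references a named group in the replacement string.
swapped = p.sub(r"\g<second> \g<first>", s)
print(swapped)      # say i, world hello!

# count=1 replaces only the first match.
first_only = p.sub(r"\2 \1", s, count=1)
print(first_only)   # say i, hello world!

# A function's return value is inserted literally; r"\1" is not expanded.
literal = p.sub(lambda m: r"\1", s)
print(literal)      # \1, \1!
```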
- subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).
    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(p.subn(r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.subn(func, s))

    ### output ###
    # ('say i, world hello!', 2)
    # ('I Say, Hello World!', 2)
From: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html