Python System Study Notes (15th) --- Regular Expressions

Source: Internet
Author: User

1. Regular Expressions

1.1. A brief introduction

Regular expressions are not part of Python itself. They are a powerful tool for processing strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more powerful. Because of this independence, the syntax of regular expressions is the same in every language that provides them. What differs is how much of the syntax each language supports, and the unsupported parts are usually rarely used. If you have already used regular expressions in another language, a quick read-through is enough.

[Figure: the process of matching with a regular expression]

The matching process is roughly as follows: the expression is compared with the characters of the text one after another. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. If the expression contains quantifiers or boundaries, the process is slightly different, but it is still easy to understand. Look at the examples and try them yourself a few times.

[Table: the regular expression metacharacters and syntax supported by Python]

1.2. Greedy and non-greedy quantifiers

Regular expressions are usually used to search text for matching strings. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default) and always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", whereas the non-greedy form "ab*?" finds "a".

1.3. The backslash problem

As in most programming languages, "\" is the escape character in regular expressions, which can lead to a pile-up of backslashes.
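As a minimal sketch (written for Python 3, using only the standard re module), the greedy and non-greedy searches from section 1.2 and the backslash doubling just mentioned can be checked directly. The sample strings and variable names are chosen for illustration only.

```python
import re

text = "abbbc"

# Greedy: "b*" takes as many "b" characters as it can.
greedy = re.search("ab*", text).group()

# Non-greedy: "b*?" takes as few "b" characters as it can (here, none).
lazy = re.search("ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a

# A text containing one literal backslash (written "\\" in source code).
path = "a\\b"

# The regex engine needs two backslashes to mean one literal backslash,
# so an ordinary string literal needs four; a raw string needs only two.
plain_pattern = "\\\\"
raw_pattern = r"\\"

print(plain_pattern == raw_pattern)            # True
print(len(re.search(raw_pattern, path).group()))  # 1 (a single backslash)
```

Both spellings produce the same two-character pattern; the raw string simply avoids the first layer of escaping.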
To match a literal "\" in the text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the first pair and the last pair each become one backslash in the programming language, giving two backslashes, which the regular expression engine then reads as one escaped, literal backslash. Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\". Similarly, "\\d" for matching a digit can be written as r"\d". With raw strings you no longer have to worry about a missing backslash, and the expressions you write are easier to read.

1.4. Matching modes

Regular expressions provide a number of matching modes, such as case-insensitive matching and multi-line matching. They are described together with re.compile(pattern[, flags]), the factory method of the Pattern class.

2. The re module

2.1. Getting started with re

Python supports regular expressions through the re module. The usual steps are: first compile the string form of the regular expression into a Pattern instance, then use the Pattern instance to process the text and obtain a match result (a Match instance), and finally use the Match instance to obtain information and perform other operations.

[python]
# encoding: UTF-8
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Use the Pattern to match the text; match() returns None if there is no match
match = pattern.match('hello world!')

if match:
    # Use the Match object to obtain group information
    print(match.group())

### output ###
# hello

re.compile(strPattern[, flag]):
This is the factory method of the Pattern class. It compiles a regular expression in string form into a Pattern object. The second parameter, flag, is the matching mode. Several modes can take effect at the same time by combining them with the bitwise OR operator '|', for example re.I | re.M.
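A small sketch of combining modes with '|' (Python 3, standard re module; the sample text and variable names are made up for illustration). With re.I the case of the letters is ignored, and with re.M '^' also matches at the start of every line:

```python
import re

text = "Hello\nhello\nHELLO"

# No flags: "^" matches only at the very start, and case matters.
strict = re.compile(r"^hello")

# re.I ignores case; re.M lets "^" match at the start of every line.
relaxed = re.compile(r"^hello", re.I | re.M)

print(strict.findall(text))   # []
print(relaxed.findall(text))  # ['Hello', 'hello', 'HELLO']
```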
The mode can also be specified inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').

Possible values:
re.I (re.IGNORECASE): case-insensitive matching (the full name is given in parentheses, likewise below)
re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of every line
re.S (re.DOTALL): "dot matches all" mode; changes the behavior of '.' so that it also matches a newline
re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 it can only be used with bytes patterns)
re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (in Python 3 this is already the default for str patterns)
re.X (re.VERBOSE): verbose mode. In this mode the regular expression may span several lines, whitespace is ignored, and comments may be added. The following two regular expressions are equivalent:

[python]
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

The re module also provides a number of module-level functions that carry out regular expression operations. Each of them can be replaced by the corresponding method of a Pattern instance. Their only advantage is that one re.compile() call is saved, but the compiled Pattern object is then not available for reuse. These functions are introduced together with the instance methods of the Pattern class. For example, the first example in this section can be shortened to:

[python]
import re

m = re.match(r'hello', 'hello world!')
print(m.group())

The re module also provides escape(string), which returns string with an escape character added before the regular expression metacharacters in it, such as *, + and ?. It is useful when the text to match contains many metacharacters.

2.2. Match

A Match object is the result of one match and contains a great deal of information about it. This information is available through the read-only attributes and the methods that Match provides.

Attributes:
string: the text used for matching.
re: the Pattern object used for matching.
pos: the index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
endpos: the index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
lastindex: the number of the last captured group. None if no group was captured.
lastgroup: the name of the last captured group. None if that group has no name or if no group was captured.

Methods:
group([group1, ...]): returns the string captured by one or more groups. If several arguments are given, the result is a tuple. Each argument may be a group number or a group name. Number 0 stands for the whole matched substring, and group() without arguments returns group(0). A group that captured nothing yields None; a group that captured several times yields the last captured substring.
groups([default]): returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that captured nothing; it defaults to None.
groupdict([default]): returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured. Groups without a name are not included. default has the same meaning as for groups().
start([group]): returns the start index in string of the substring captured by the given group (the index of the first character of the substring). group defaults to 0.
end([group]): returns the end index in string of the substring captured by the given group (the index of the last character of the substring + 1). group defaults to 0.
span([group]): returns (start(group), end(group)).
expand(template): substitutes the captured groups into template and returns the result. In template, groups can be referenced with \id, \g<id> or \g<name>. The number 0 cannot be used in the \id form (\g<0> refers to the whole match). \id and \g<id> are equivalent, but \10 is taken to mean group 10; to express \1 followed by the character '0', only \g<1>0 can be used.

[python]
import re

m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)

print("m.group(1, 2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(1, 2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!

2.3. Pattern

A Pattern object is a compiled regular expression. The methods that Pattern provides can be used to match and search text.

Pattern cannot be instantiated directly; it must be constructed with re.compile().

Pattern provides several read-only attributes for obtaining information about the expression:
pattern: the expression string used for compilation.
flags: the matching modes used for compilation, as a number.
groups: the number of groups in the expression.
groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers. Groups without a name are not included.

Instance methods [| re module functions]:

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string. If the whole of pattern is matched, a Match object is returned. If pattern fails to match along the way, or the match reaches endpos before it is complete, None is returned.
The default values of pos and endpos are 0 and len(string) respectively. re.match() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
Note: this method does not perform a full match. If string still has characters left when pattern is exhausted, the match is still considered successful. To perform a full match, add the boundary matcher '$' at the end of the expression.
For an example, see section 2.1.

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that can be matched. It tries to match pattern starting at index pos of string.
If the whole of pattern is matched, a Match object is returned. If it cannot be matched, pos is increased by 1 and the match is tried again. If there is still no match when pos reaches endpos, None is returned.
The default values of pos and endpos are 0 and len(string) respectively. re.search() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling pattern.

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at the substrings that match and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, string is split at every match.

[python]
import re

p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.

[python]
import re

p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (Match object) in order.

[python]
import re

p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print(m.group(), end=' ')

### output ###
# 1 2 3 4

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, groups can be referenced with \id, \g<id> or \g<name>. The number 0 cannot be used in the \id form (\g<0> refers to the whole match).
When repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.

[python]
import re

p = re.compile(r'(\w+) (\w+)')
s = 'I say, hello world!'

print(p.sub(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))

### output ###
# say I, world hello!
# I Say, Hello World!

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).

[python]
import re

p = re.compile(r'(\w+) (\w+)')
s = 'I say, hello world!'

print(p.subn(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.subn(func, s))

### output ###
# ('say I, world hello!', 2)
# ('I Say, Hello World!', 2)
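The difference between match() and search() described above can be sketched as follows (Python 3, standard re module; the sample text is made up for illustration). The sketch also uses fullmatch(), which is available since Python 3.4 and is not covered in these notes, as an alternative to appending '$' for a full match.

```python
import re

p = re.compile(r"world")
text = "hello world!"

# match() only tries at the start position (index 0 by default).
print(p.match(text))                     # None

# search() moves through the text until the pattern matches somewhere.
print(p.search(text).group())            # world

# With pos given, match() starts at that index instead.
print(p.match(text, 6).group())          # world

# match() does not require the whole string to be consumed...
print(re.match(r"hello", text).group())  # hello

# ...whereas fullmatch() does.
print(re.fullmatch(r"hello", text))      # None
print(re.fullmatch(r"hello world!", text).group())  # hello world!
```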

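The read-only attributes of a Pattern object listed in section 2.3 can be inspected with a short sketch (Python 3, standard re module). The expression is the one used in the Match example; compiling it with re.DOTALL is an assumption made here purely so that flags has something to show.

```python
import re

p = re.compile(r"(\w+) (\w+)(?P<sign>.*)", re.DOTALL)

print(p.pattern)           # (\w+) (\w+)(?P<sign>.*)
print(p.groups)            # 3
print(dict(p.groupindex))  # {'sign': 3}

# flags is a number; individual modes can be tested with bitwise AND.
print(bool(p.flags & re.DOTALL))      # True
print(bool(p.flags & re.IGNORECASE))  # False
```

Testing single bits with '&' is used instead of printing flags directly, because the numeric value also includes modes that Python switches on by default.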