Introduction to Python Regular Expressions

Note: This article was written against Python 2.4. If you come across a term you do not understand, look it up on Baidu, Google, or a wiki.

1. Regular Expression Basics
1.1. Brief introduction

Regular expressions are not part of Python itself. They are a powerful tool for processing strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more capable. Because of this independence, the syntax is essentially the same in every language that provides regular expressions; what differs is how much of the syntax each language supports. Do not worry, though: the unsupported syntax is usually the rarely used part. If you have already used regular expressions in another language, a quick skim is enough.

The following figure shows the process of matching with a regular expression:



Roughly, matching works by comparing the expression with the text one character at a time. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. Quantifiers and boundaries in the expression change the process slightly, but it is still easy to follow: look at the examples in the figure below and try it yourself a few times.
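As a rough illustration of the process just described, the sketch below tries a literal pattern against two texts with re.match (introduced in section 2). It is written for Python 3, and the variable names are made up for the example.

```python
import re

# A literal pattern is compared with the text character by character.
ok = re.match(r'abc', 'abcdef')      # 'a', 'b', 'c' all match
fail = re.match(r'abc', 'abxdef')    # fails at the third character

print(ok.group())   # the matched text
print(fail)         # None, because one character did not match

# A quantifier changes how many characters one element may consume.
many = re.match(r'ab+c', 'abbbc')
print(many.group())
```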

The following figure lists the regular expression metacharacters and syntax that Python supports:



1.2. Greedy and non-greedy quantifiers

Regular expressions are typically used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, when the regular expression "ab*" is used to search "abbbc", it finds "abbb". With the non-greedy quantifier "ab*?", it finds "a".
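The difference can be checked with a short sketch, assuming Python 3 and the "abbbc" example from the paragraph above; the variable names are illustrative only.

```python
import re

text = 'abbbc'

greedy = re.search(r'ab*', text)    # b* takes as many 'b' as it can
lazy = re.search(r'ab*?', text)     # b*? takes as few 'b' as it can

print(greedy.group())
print(lazy.group())
```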

1.3. The backslash problem

As in most programming languages, "\" is the escape character in regular expressions, which leads to a pile-up of backslashes. To match a literal "\" in text, the regular expression written in the programming language needs four backslashes, "\\\\": the first pair and the last pair are each turned into one backslash by the programming language, giving two backslashes, which the regular expression engine then reads as one escaped backslash. Python's raw strings solve this neatly: the regular expression in this example can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer have to worry about a missing backslash, and the expressions are easier to read.
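A small sketch of the point above, assuming Python 3: the ordinary string and the raw string below spell the same two-character pattern. The sample text and variable names are invented for the example.

```python
import re

text = 'C:\\temp'   # the text contains exactly one backslash

# Same pattern written two ways: an ordinary string and a raw string.
plain = '\\\\'
raw = r'\\'
print(plain == raw)

print(len(re.findall(raw, text)))   # number of backslashes found

# '\\d' and r'\d' are the same two-character pattern.
print('\\d' == r'\d')
print(re.findall(r'\d', 'a1b2'))
```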

1.4. Matching modes

Regular expressions offer a number of matching modes, such as case-insensitive matching and multiline matching. These are covered together with the Pattern class's factory method re.compile(pattern[, flags]).

2. The re module

2.1. Getting started with re

Python supports regular expressions through the re module. The usual steps are: compile the string form of the regular expression into a Pattern instance, use the Pattern instance to process the text and obtain a matching result (a Match instance), and finally use the Match instance to get the information you need and carry on from there.
The code is as follows:

# encoding: UTF-8
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Match the text against the pattern; match() returns None if nothing matches
match = pattern.match('hello world!')

if match:
    # Use the Match object to get group information
    print match.group()

### Output ###
# hello

re.compile(strPattern[, flag]):
This is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. Alternatively, the modes can be specified inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
The possible values are:
re.I (re.IGNORECASE): Ignore case (the full name is in parentheses, likewise below)
M (MULTILINE): Multiline mode; changes the behavior of '^' and '$' (see the figure above)
S (DOTALL): Dot-matches-all mode; changes the behavior of '.'
L (LOCALE): Makes the predefined character classes \w \W \b \B \s \S depend on the current locale
U (UNICODE): Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
X (VERBOSE): Verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:
The code is as follows:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
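The equivalences stated above can be checked with a minimal sketch, assuming Python 3. The sample texts and the names by_arg, inline, verbose and compact are invented for illustration.

```python
import re

# Flags passed as an argument and flags written inline behave the same.
by_arg = re.compile(r'^hello', re.I | re.M)
inline = re.compile(r'(?im)^hello')

text = 'first line\nHELLO again'
print(by_arg.search(text).group())
print(inline.search(text).group())

# re.S lets '.' match a newline as well.
print(re.match(r'a.b', 'a\nb') is None)
print(re.match(r'a.b', 'a\nb', re.S) is not None)

# The verbose pattern and the compact one match the same text.
verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")
print(verbose.match('3.14').group() == compact.match('3.14').group())
```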

re also provides a number of module-level functions that do the same work. Each of them can be replaced by the corresponding method of a Pattern instance; their only advantage is that you write one line of re.compile() code less, at the cost of not being able to reuse the compiled Pattern object. These functions are described together with the instance methods of the Pattern class. The example above can be shortened to:
The code is as follows:

m = re.match(r'hello', 'hello world!')
print m.group()

The re module also provides the function escape(string), which returns string with an escape character inserted before regular expression metacharacters such as */+/?. This is useful when you need to match a large number of metacharacters literally.
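A brief sketch of why re.escape() helps, assuming Python 3; the literal '1+1=2?' and the sample sentence are made up. The exact escaped string is not shown because which characters get a backslash varies between Python versions.

```python
import re

literal = '1+1=2?'

escaped = re.escape(literal)

# Unescaped, '+' and '?' act as quantifiers, so the literal text is not found.
print(re.search(literal, 'is 1+1=2? yes'))

# Escaped, every character is matched literally.
print(re.search(escaped, 'is 1+1=2? yes').group())
```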
2.2. Match
A Match object is the result of one match. It holds a lot of information about that match, which you can read through the read-only attributes and the methods that Match provides.
Attributes:
string: The text used for matching.
re: The Pattern object used for matching.
pos: The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name in Pattern.match() and Pattern.search().
endpos: The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name in Pattern.match() and Pattern.search().
lastindex: The index (group number) of the last captured group. None if no group was captured.
lastgroup: The alias of the last captured group. None if that group has no alias or if no group was captured.
Methods:
group([group1, ...]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 may be a number or an alias. Number 0 stands for the entire matched substring, and calling group() with no arguments returns group(0). A group that captured nothing returns None, and a group that captured several times returns the last substring it captured.
groups([default]):
Returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, ... last). default is the value used for groups that captured nothing; it defaults to None.
groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured; groups without an alias are not included. default has the same meaning as above.
start([group]):
Returns the start index in string of the substring captured by the specified group (the index of the substring's first character). group defaults to 0.
end([group]):
Returns the end index in string of the substring captured by the specified group (the index of the substring's last character + 1). group defaults to 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. template can refer to groups with \id or \g<id>, \g<name>, but \0 cannot be used (the whole match is written \g<0>). \id is equivalent to \g<id>; however, \10 is taken to mean the 10th group, so to express \1 followed by the character '0' you must write \g<1>0.
The code is as follows:

import re

m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print "m.string:", m.string
print "m.re:", m.re
print "m.pos:", m.pos
print "m.endpos:", m.endpos
print "m.lastindex:", m.lastindex
print "m.lastgroup:", m.lastgroup
print "m.group(1,2):", m.group(1, 2)
print "m.groups():", m.groups()
print "m.groupdict():", m.groupdict()
print "m.start(2):", m.start(2)
print "m.end(2):", m.end(2)
print "m.span(2):", m.span(2)
print r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3')

### Output ###
# m.string: hello world!
# m.re: <_sre.SRE_Pattern object at 0x016e1a38>
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(1,2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
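The example above does not exercise the default argument or the \g<1>0 form, so here is a minimal sketch of both, assuming Python 3. The pattern with an optional second group and the alias "second" are invented for illustration.

```python
import re

# The second group is optional, so it may not take part in the match.
m = re.match(r'(\w+)(?: (?P<second>\w+))?', 'hello')

print(m.group(2))              # None: the group captured nothing
print(m.groups('n/a'))         # default replaces the missing group
print(m.groupdict('n/a'))

# \g<1>0 means group 1 followed by a literal '0'; \10 would mean group 10.
print(m.expand(r'\g<1>0'))
```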

2.3. Pattern
A Pattern object is a compiled regular expression; the methods it provides let you match and search text.
Pattern cannot be instantiated directly; it must be constructed with re.compile().
Pattern provides several read-only attributes for getting information about the expression:
pattern: The expression string used at compile time.
flags: The matching modes used at compile time, as a number.
groups: The number of groups in the expression.
groupindex: A dictionary whose keys are the aliases of the named groups in the expression and whose values are the corresponding group numbers; groups without an alias are not included.
The code is as follows:

import re

p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print "p.pattern:", p.pattern
print "p.flags:", p.flags
print "p.groups:", p.groups
print "p.groupindex:", p.groupindex

### Output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 16
# p.groups: 3
# p.groupindex: {'sign': 3}
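One way these attributes might be used is sketched below, assuming Python 3. The flag is tested with a bitwise AND instead of comparing the whole number, because the numeric value of flags can differ between Python versions.

```python
import re

p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

# groupindex maps each group alias to its number.
number = p.groupindex['sign']
print(number)
print(p.groups)

# The flag passed at compile time is set in p.flags.
print(bool(p.flags & re.DOTALL))

# The number and the alias refer to the same group.
m = p.match('hello world!')
print(m.group(number) == m.group('sign'))
```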

Instance methods [| re module functions]:
1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string. If pattern is matched all the way to its end, a Match object is returned; if pattern fails to match along the way, or endpos is reached before the match completes, None is returned.
pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two parameters; its flags parameter specifies the matching modes used when compiling pattern.
Note: this method is not an exact match. If string still has characters left when pattern ends, the match still counts as successful. For an exact match, append the boundary matcher '$' to the expression.
See section 2.1 for an example.
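The note about '$' and the pos and endpos parameters can be tried out with a short sketch, assuming Python 3. The sample strings are invented for the example.

```python
import re

p = re.compile(r'hello')

# match() succeeds even though text remains after the pattern ends.
print(p.match('hello world!').group())

# With '$' appended, the whole string has to match.
exact = re.compile(r'hello$')
print(exact.match('hello world!'))
print(exact.match('hello').group())

# pos moves the starting index; matching still begins exactly there.
print(p.match('say hello', 4).group())
print(p.match('say hello'))

# endpos cuts the text short, so the pattern cannot finish.
print(p.match('hello', 0, 3))
```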
2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that matches. It tries to match pattern starting at index pos of string; if pattern is matched all the way to its end, a Match object is returned. Otherwise pos is increased by 1 and the match is tried again; if there is still no match when pos = endpos, None is returned.
pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two parameters; its flags parameter specifies the matching modes used when compiling pattern.
The code is as follows:

# encoding: UTF-8
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')

# Use search() to look for a matching substring; it returns None if there is none
# match() would not succeed in this example
match = pattern.search('hello world!')

if match:
    # Use the Match object to get group information
    print match.group()

### Output ###
# world
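To make the contrast with match() concrete, here is a minimal sketch assuming Python 3; it also shows the effect of the pos parameter on search(). The variable names are illustrative.

```python
import re

p = re.compile(r'world')
text = 'hello world!'

print(p.match(text))            # None: the text does not start with 'world'
found = p.search(text)
print(found.group(), found.span())

# Starting the search after the only occurrence finds nothing.
print(p.search(text, 7))
```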

3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at every substring that matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is omitted, string is split at every match.
The code is as follows:

import re

p = re.compile(r'\d+')
print p.split('one1two2three3four4')

### Output ###
# ['one', 'two', 'three', 'four', '']
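The effect of maxsplit can be seen in a small sketch, assuming Python 3 and the same sample string as above.

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3four4'

print(p.split(text))                # split at every match
print(p.split(text, maxsplit=2))    # at most two splits; the rest stays together
```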

4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
The code is as follows:

import re

p = re.compile(r'\d+')
print p.findall('one1two2three3four4')

### Output ###
# ['1', '2', '3', '4']
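The pos and endpos parameters in the signature are not used in the example above; the sketch below, assuming Python 3, shows them limiting the search to part of the string.

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3four4'

print(p.findall(text))

# pos and endpos restrict the search to text[4:14], that is 'two2three3'.
print(p.findall(text, 4, 14))

# The module-level function gives the same result as the method.
print(re.findall(r'\d+', text) == p.findall(text))
```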

5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each matching result (a Match object) in order.
The code is as follows:

import re

p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print m.group(),

### Output ###
# 1 2 3 4
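Because finditer() yields Match objects and not plain strings, position information is available for every hit. A minimal sketch, assuming Python 3; the name positions is invented for the example.

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3four4'

# Each item is a Match object, so position information is available too.
positions = [(m.group(), m.start()) for m in p.finditer(text)]
print(positions)
```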

6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, it can refer to groups with \id or \g<id>, \g<name>, but \0 cannot be used (the whole match is written \g<0>).
When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is omitted, all matches are replaced.
The code is as follows:

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print p.sub(r'\2 \1', s)

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print p.sub(func, s)

### Output ###
# say i, world hello!
# I Say, Hello World!
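The \g<name> form and the count parameter are described above but not shown, so here is a minimal sketch assuming Python 3. The aliases "first" and "second" and the function shout are invented for illustration.

```python
import re

p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world!'

# Named references do the same job as \2 \1.
print(p.sub(r'\g<second> \g<first>', s))

# count=1 replaces only the first match.
print(p.sub(r'\2 \1', s, count=1))

# A function as repl: its return value is inserted as-is.
def shout(m):
    return m.group(0).upper()

print(p.sub(shout, s))
```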

7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of substitutions).
The code is as follows:

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print p.subn(r'\2 \1', s)

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print p.subn(func, s)

### Output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)
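Since subn() returns a tuple, the result can be unpacked directly. A short sketch, assuming Python 3; the sample strings are made up.

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3'

result, n = p.subn('-', text)
print(result, n)

# When nothing matches, the string comes back unchanged with a count of 0.
print(p.subn('-', 'no digits'))
```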

That covers Python's support for regular expressions. Mastering regular expressions is a skill every programmer should have; hardly any program these days avoids dealing with strings. The author is still a beginner too, so let us keep learning together ^_^
In addition, no examples are given for the special constructs listed in the figure, and the regular expressions that use them are somewhat difficult. If you are interested, think about how to match a word that does not start with abc ^_^
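One possible answer to the closing exercise, offered only as a sketch and assuming Python 3: a negative lookahead, one of the special constructs mentioned above, placed right after a word boundary. The sample sentence is invented, and other solutions exist.

```python
import re

# \b marks a word boundary, (?!abc) rejects a word that begins with 'abc',
# and \w+ then consumes the word itself.
p = re.compile(r'\b(?!abc)\w+')

print(p.findall('abc abcdef xabc hello ab'))
```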
End of full text