Introduction to the Python regular expression re module

Source: Internet
Author: User

Brief introduction

A regular expression is a pattern that can match fragments of text. The simplest regular expression is an ordinary string, which matches itself. For example, the regular expression 'hello' matches the string 'hello'.

Note that a regular expression is not a program but a pattern for working with strings. To process strings with it, you need a tool that supports regular expressions, such as awk, sed or grep on Linux, or a programming language such as Perl, Python or Java.

There are several different styles of regular expressions, and the following table lists some of the meta-characters and descriptions for programming languages such as Python or Perl:

The re module

In Python, we can use the built-in re module to use regular expressions.

Note that regular expressions use \ to escape special characters. For example, to match the string 'python.org' we need the regular expression 'python\.org'. Python string literals also use \ for escaping, so written as an ordinary Python string the expression becomes 'python\\.org', and it is easy to get lost among the backslashes. We therefore recommend Python raw strings: just add an r prefix, and the regular expression above can be written as:

r'python\.org'
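As a small check of the point about escaping, the sketch below compares the two spellings of the pattern. The sample strings 'www.python.org' and 'www.pythonXorg' are made up for illustration.

```python
import re

plain = 'python\\.org'   # ordinary string: the backslash itself has to be escaped
raw = r'python\.org'     # raw string: backslashes are kept exactly as written

# Both spellings produce the same pattern text
same_pattern = (plain == raw)

# The escaped dot matches only a literal '.', not an arbitrary character
matches_literal_dot = re.search(raw, 'www.python.org') is not None
matches_other_char = re.search(raw, 'www.pythonXorg') is not None

print(same_pattern)
print(matches_literal_dot)
print(matches_other_char)
```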

The re module provides a number of useful functions for matching strings, such as:

    • compile function

    • match function

    • search function

    • findall function

    • finditer function

    • split function

    • sub function

    • subn function

The general steps for using the re module are as follows:

    • Use the compile function to compile the string form of a regular expression into a Pattern object

    • Match the text with the methods provided by the Pattern object to obtain a matching result (a Match object)

    • Finally, use the properties and methods provided by the Match object to get information and perform other actions as needed
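The three steps above can be sketched as follows. The sample text 'order 66 shipped' is invented for illustration; the calls themselves (compile, search, group, span) are described later in this article.

```python
import re

# Step 1: compile the string form of the regular expression into a Pattern object
pattern = re.compile(r'\d+')

# Step 2: match the text with a Pattern method to obtain a Match object (or None)
m = pattern.search('order 66 shipped')

# Step 3: read information from the Match object
if m:
    found = m.group()     # the matched substring
    position = m.span()   # (start, end) indexes of the match
    print(found)
    print(position)
```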

Compile function

The compile function compiles a regular expression and generates a Pattern object. It is typically used in the following form:

re.compile(pattern[, flags])

Here pattern is a regular expression in string form, and flags is an optional parameter that selects a matching mode, such as ignoring case or multiline mode.
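As a brief aside on the optional flag parameter, here is a minimal sketch using re.I (ignore case) and re.M (multiline mode). The sample strings are made up for illustration.

```python
import re

# re.I (re.IGNORECASE): letters match regardless of case
word = re.compile(r'hello', re.I)
ignore_case_hit = word.match('HELLO world') is not None

# re.M (re.MULTILINE): ^ also matches right after each newline
line_start = re.compile(r'^\d+', re.M)
line_numbers = line_start.findall('10 apples\n20 pears\nno number')

# Several flags can be combined with |
combined = re.compile(r'^hello', re.I | re.M)
combined_hits = combined.findall('Hello\nHELLO\nbye')

print(ignore_case_hit)
print(line_numbers)
print(combined_hits)
```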

Let's take a look at the example below.

    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

Above, we compiled a regular expression into a Pattern object. We can then use the methods of that object to search text for matches. Some common methods of Pattern objects are:

    • match method

    • search method

    • findall method

    • finditer method

    • split method

    • sub method

    • subn method

Match method

The match method looks for a match at the head of a string (you can also specify a starting position). It performs a single match: it returns as soon as one matching result is found, rather than finding all matching results. Its general form is:

match(string[, pos[, endpos]])

Here string is the string to be matched, and pos and endpos are optional parameters specifying the start and end positions within the string; their default values are 0 and len(string) (the string length). Therefore, when pos and endpos are not specified, the match method matches at the head of the string.

When the match succeeds, a Match object is returned; if there is no match, None is returned.

Take a look at examples.

    >>> import re
    >>> pattern = re.compile(r'\d+')                    # matches at least one digit
    >>> m = pattern.match('one12twothree34four')        # look at the head: no match
    >>> print m
    None
    >>> m = pattern.match('one12twothree34four', 2, 10) # start matching at 'e': no match
    >>> print m
    None
    >>> m = pattern.match('one12twothree34four', 3, 10) # start matching at '1': it matches
    >>> print m                                         # a Match object is returned
    <_sre.SRE_Match object at 0x10a42aac0>
    >>> m.group(0)   # the 0 can be omitted
    '12'
    >>> m.start(0)   # the 0 can be omitted
    3
    >>> m.end(0)     # the 0 can be omitted
    5
    >>> m.span(0)    # the 0 can be omitted
    (3, 5)

Above, when the match succeeds, a Match object is returned, where:

    • The group([group1, …]) method returns the strings matched by one or more groups; to get the entire matched substring, use group() or group(0) directly;

    • The start([group]) method returns the starting position of the group's matched substring in the entire string (the index of the first character of the substring); the default value of the parameter is 0;

    • The end([group]) method returns the ending position of the group's matched substring in the entire string (the index of the last character of the substring + 1); the default value of the parameter is 0;

    • The span([group]) method returns (start(group), end(group)).

Take a look at one more example:

    >>> import re
    >>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means ignore case
    >>> m = pattern.match('Hello World Wide Web')
    >>> print m                               # match succeeded, a Match object is returned
    <_sre.SRE_Match object at 0x10bea83e8>
    >>> m.group(0)                            # the entire matched substring
    'Hello World'
    >>> m.span(0)                             # the index range of the entire matched substring
    (0, 11)
    >>> m.group(1)                            # the substring matched by the first group
    'Hello'
    >>> m.span(1)                             # the index range of the first group's match
    (0, 5)
    >>> m.group(2)                            # the substring matched by the second group
    'World'
    >>> m.span(2)                             # the index range of the second group's match
    (6, 11)
    >>> m.groups()                            # equivalent to (m.group(1), m.group(2), ...)
    ('Hello', 'World')
    >>> m.group(3)                            # there is no third group
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: no such group

Search method

The search method looks for a match at any position in a string. It also performs a single match: it returns as soon as one matching result is found, rather than finding all matching results. Its general form is:

search(string[, pos[, endpos]])

Here string is the string to be matched, and pos and endpos are optional parameters specifying the start and end positions within the string; their default values are 0 and len(string) (the string length).

When the match succeeds, a Match object is returned; if there is no match, None is returned.

Let's take a look at the example:

    >>> import re
    >>> pattern = re.compile(r'\d+')
    >>> m = pattern.search('one12twothree34four')      # the match method would not match here
    >>> m
    <_sre.SRE_Match object at 0x10cc03ac0>
    >>> m.group()
    '12'
    >>> m = pattern.search('one12twothree34four', 10)  # specify the string range
    >>> m
    <_sre.SRE_Match object at 0x10cc03b28>
    >>> m.group()
    '34'
    >>> m.span()
    (13, 15)

Let's look at one more example:

    # -*- coding: utf-8 -*-

    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

    # Use search() to find a matching substring; it returns None when there is none.
    # match() could not match successfully here.
    m = pattern.search('hello 123456 789')

    if m:
        # Use the Match object to get the group information
        print 'matching string:', m.group()
        print 'position:', m.span()

Execution Result:

    matching string: 123456
    position: (6, 12)

Findall method

The match and search methods above both perform a single match: they return as soon as one matching result is found. Most of the time, however, we need to search the entire string and get all the matching results.

The findall method is used in the following form:

findall(string[, pos[, endpos]])

Here string is the string to be matched, and pos and endpos are optional parameters specifying the start and end positions within the string; their default values are 0 and len(string) (the string length).

findall returns all matched substrings as a list, and returns an empty list if there is no match.

Take a look at the example:

    import re

    pattern = re.compile(r'\d+')   # find numbers
    result1 = pattern.findall('hello 123456 789')
    result2 = pattern.findall('one1two2three3four4', 0, 10)

    print result1
    print result2

Execution Result:

    ['123456', '789']
    ['1', '2']

Finditer method

The finditer method behaves like findall: it also searches the entire string and finds all matching results. However, it returns an iterator that yields each matching result (a Match object) in order.

Take a look at the example:

    # -*- coding: utf-8 -*-

    import re

    pattern = re.compile(r'\d+')

    result_iter1 = pattern.finditer('hello 123456 789')
    result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

    print type(result_iter1)
    print type(result_iter2)

    print 'result1...'
    for m1 in result_iter1:   # m1 is a Match object
        print 'matching string: {}, position: {}'.format(m1.group(), m1.span())

    print 'result2...'
    for m2 in result_iter2:
        print 'matching string: {}, position: {}'.format(m2.group(), m2.span())

Execution Result:

    <type 'callable-iterator'>
    <type 'callable-iterator'>
    result1...
    matching string: 123456, position: (6, 12)
    matching string: 789, position: (13, 16)
    result2...
    matching string: 1, position: (3, 4)
    matching string: 2, position: (7, 8)

Split method

The split method splits the string at every substring that the pattern matches and returns the pieces as a list. It is used in the following form:

split(string[, maxsplit])

Here maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

Take a look at the example:

    import re

    p = re.compile(r'[\s\,\;]+')
    print p.split('a,b;; c   d')

Execution Result:

    ['a', 'b', 'c', 'd']
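The example above does not exercise maxsplit. A minimal sketch of its effect, reusing the same input string (passing maxsplit by keyword is a choice made here for readability):

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a,b;; c   d'

all_parts = p.split(text)               # no limit: split at every match
two_splits = p.split(text, maxsplit=2)  # at most two splits; the rest stays intact

print(all_parts)
print(two_splits)
```

With a limit, the unsplit remainder of the string is returned as the last element of the list.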

Sub method

The sub method is used for substitution. It is used in the following form:

sub(repl, string[, count])

Here repl can be a string or a function:

    • If repl is a string, each matched substring of string is replaced with repl and the resulting string is returned. repl can also refer to groups in the form \id (for example \1), but the number 0 cannot be used;

    • If repl is a function, it should accept exactly one argument (the Match object) and return the string to use as the replacement (the returned string cannot contain group references).

count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

Take a look at the example:

    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'hello 123, hello 456'

    def func(m):
        return 'hi' + ' ' + m.group(2)

    print p.sub(r'hello world', s)  # replace 'hello 123' and 'hello 456' with 'hello world'
    print p.sub(r'\2 \1', s)        # reference the groups
    print p.sub(func, s)
    print p.sub(func, s, 1)         # replace at most once

Execution Result:

    hello world, hello world
    123 hello, 456 hello
    hi 123, hi 456
    hi 123, hello 456
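Although \0 is not available as a group reference, the re documentation also describes a \g<number> form, in which \g<0> stands for the entire match. The sketch below reuses the pattern and string from the example above; the square brackets in the replacement are only there to make the result visible.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

# \g<0> refers to the whole matched substring
wrapped = p.sub(r'[\g<0>]', s)

# \g<1> and \g<2> are equivalent to \1 and \2
swapped = p.sub(r'\g<2> \g<1>', s)

print(wrapped)
print(swapped)
```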

Subn method

The subn method behaves like the sub method and is also used for substitution. It is used in the following form:

subn(repl, string[, count])

It returns a tuple:

(sub(repl, string[, count]), number of replacements)

The tuple has two elements: the first is the result of the sub method, and the second is the number of replacements that were made.

Take a look at the example:

    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'hello 123, hello 456'

    def func(m):
        return 'hi' + ' ' + m.group(2)

    print p.subn(r'hello world', s)
    print p.subn(r'\2 \1', s)
    print p.subn(func, s)
    print p.subn(func, s, 1)

Execution Result:

    ('hello world, hello world', 2)
    ('123 hello, 456 hello', 2)
    ('hi 123, hi 456', 2)
    ('hi 123, hello 456', 1)

Other functions

In fact, most of the methods of a Pattern object generated by the compile function have a corresponding function in the re module, with subtle differences in usage.

Match function

The match function is used in the following form:

re.match(pattern, string[, flags])

Here pattern is the string form of a regular expression, for example \d+ or [a-z]+.

The match method of a Pattern object, by contrast, is used in the following form:

match(string[, pos[, endpos]])

As you can see, the match function cannot specify a range within the string; it can only match at the head. Look at the example:

    import re

    m1 = re.match(r'\d+', 'one12twothree34four')
    if m1:
        print 'matching string:', m1.group()
    else:
        print 'm1 is:', m1

    m2 = re.match(r'\d+', '12twothree34four')
    if m2:
        print 'matching string:', m2.group()
    else:
        print 'm2 is:', m2

Execution Result:

    m1 is: None
    matching string: 12

Search function

The search function is used in the following form:

re.search(pattern, string[, flags])

The search function cannot specify a search range within the string; otherwise its usage is similar to the search method of the Pattern object.
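A minimal sketch of the difference follows. Searching a slice of the string is a workaround assumed here for illustration, not something the re module provides as a parameter; note that the reported positions are then relative to the slice, whereas the Pattern method with pos reports positions in the original string.

```python
import re

text = 'one12twothree34four'

# The module-level function always scans the whole string
m = re.search(r'\d+', text)
first = (m.group(), m.span())

# Workaround: search a slice; the span is relative to the slice 'ree34four'
m2 = re.search(r'\d+', text[10:])
sliced = (m2.group(), m2.span())

# Pattern method with pos: the span is relative to the original string
m3 = re.compile(r'\d+').search(text, 10)
with_pos = (m3.group(), m3.span())

print(first)
print(sliced)
print(with_pos)
```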

Findall function

The findall function is used in the following form:

re.findall(pattern, string[, flags])

The findall function cannot specify a search range within the string; otherwise its usage is similar to the findall method of the Pattern object.

Take a look at the example:

    import re

    print re.findall(r'\d+', 'hello 12345 789')

    # Output
    ['12345', '789']

Finditer function

The finditer function is used in much the same way as the finditer method of the Pattern object, in the following form:

re.finditer(pattern, string[, flags])
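For illustration, a short sketch that collects the matched text and span of every Match object yielded by re.finditer, reusing the sample string from the finditer method section:

```python
import re

# Each item produced by the iterator is a Match object
hits = [(m.group(), m.span()) for m in re.finditer(r'\d+', 'hello 123456 789')]

print(hits)
```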

Split function

The split function is used in the following form:

re.split(pattern, string[, maxsplit])
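A minimal sketch of the module-level form, reusing the separator pattern and input from the split method section (maxsplit is passed by keyword here):

```python
import re

text = 'a,b;; c   d'

parts = re.split(r'[\s,;]+', text)                 # split at every match
limited = re.split(r'[\s,;]+', text, maxsplit=1)   # split only once

print(parts)
print(limited)
```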

Sub function

The sub function is used in the following form:

re.sub(pattern, repl, string[, count])
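As a rough illustration of the module-level form, the sketch below replaces numbers with a string and with the result of a function. The helper name double is hypothetical and exists only for this example.

```python
import re

s = 'hello 123, hello 456'

def double(m):
    # Hypothetical helper: m is the Match object, and the return value replaces the match
    return str(int(m.group()) * 2)

masked = re.sub(r'\d+', '#', s)               # repl is a string
doubled = re.sub(r'\d+', double, s)           # repl is a function
first_only = re.sub(r'\d+', '#', s, count=1)  # replace at most once

print(masked)
print(doubled)
print(first_only)
```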

Subn function

The subn function is used in the following form:

re.subn(pattern, repl, string[, count])
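A short sketch of the returned tuple, including the case where nothing matches (the sample strings are made up for illustration):

```python
import re

# The result is (new_string, number_of_replacements)
replaced, count = re.subn(r'\d+', '#', 'hello 123, hello 456')

# When nothing matches, the string is returned unchanged and the count is 0
unchanged, zero = re.subn(r'\d+', '#', 'no digits here')

print((replaced, count))
print((unchanged, zero))
```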

Which way to use?

As you can see from the above, there are two ways to use the re module:

    • Use the re.compile function to generate a Pattern object, and then use the methods of that object to search the text for matches;

    • Use the re.match, re.search, re.findall and similar functions directly to search the text for matches;

Below, we use an example to illustrate these two ways.

Let's look at the first usage:

    import re

    # First compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

    print pattern.match('123, 123')
    print pattern.search('234, 234')
    print pattern.findall('345, 345')

Let's look at the second usage:

    import re

    print re.match(r'\d+', '123, 123')
    print re.search(r'\d+', '234, 234')
    print re.findall(r'\d+', '345, 345')

If a regular expression is used many times (such as \d+ above) and in many places, then for efficiency we should compile it in advance into a Pattern object and use that object's methods on the text to be matched. If we call re.match, re.search and similar functions directly, the pattern string is passed in on every call and the module has to look it up in its cache of compiled patterns, or compile it again, each time, which adds overhead.

Therefore, we recommend the first usage.
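One common way to follow this recommendation is to compile the pattern once at module level and reuse it inside functions. The names DIGITS and extract_numbers below are hypothetical and chosen only for this sketch.

```python
import re

# Compiled once, when the module is loaded
DIGITS = re.compile(r'\d+')

def extract_numbers(lines):
    """Return the digit runs found in each line, reusing one Pattern object."""
    return [DIGITS.findall(line) for line in lines]

numbers = extract_numbers(['123, 123', '234, 234', 'no digits here'])
print(numbers)
```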

Match Chinese

In some cases we want to match the Chinese characters in a text. Note that the main Unicode range for Chinese is [\u4e00-\u9fa5]. We say "main" because this range is not complete; for example, it does not include full-width (Chinese) punctuation. In most cases, though, it should be sufficient.

Suppose you now want to extract the Chinese from the string title = u'你好,hello,世界'. You can do this:

    # -*- coding: utf-8 -*-

    import re

    title = u'你好,hello,世界'
    pattern = re.compile(ur'[\u4e00-\u9fa5]+')
    result = pattern.findall(title)

    print result

Notice that we added the two-letter prefix ur to the regular expression: r means a raw string is used, and u means it is a Unicode string.

Execution Result:

    [u'\u4f60\u597d', u'\u4e16\u754c']
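The ur prefix exists only in Python 2. As a sketch under the assumption that the code runs on Python 3.3 or later, where every str is Unicode and the re module itself interprets \uXXXX escapes in a pattern, the same extraction can be written with a plain raw string:

```python
import re

title = '你好,hello,世界'

# Python 3 only: no u prefix is needed, and re understands \u4e00 and \u9fa5 in the pattern
pattern = re.compile(r'[\u4e00-\u9fa5]+')
result = pattern.findall(title)

print(result)
```

On Python 3 the list is printed with the Chinese characters themselves rather than with \u escapes.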

Greedy match

In Python, regular expression matching is greedy by default (in a few languages the default may be non-greedy); that is, it matches as many characters as possible.

For example, suppose we want to find all the p blocks in a string:

    import re

    content = 'aa<p>test1</p>bb<p>test2</p>cc'
    pattern = re.compile(r'<p>.*</p>')
    result = pattern.findall(content)

    print result

Execution Result:

    ['<p>test1</p>bb<p>test2</p>']

Because matching is greedy, that is, it matches as much as possible, after successfully matching up to the first </p> it keeps trying to match further to the right, to see whether a longer substring can also be matched.

If we want a non-greedy match, we can add a ?, as follows:

    import re

    content = 'aa<p>test1</p>bb<p>test2</p>cc'
    pattern = re.compile(r'<p>.*?</p>')    # add a ?
    result = pattern.findall(content)

    print result

Results:

    ['<p>test1</p>', '<p>test2</p>']

Summary

The general steps for using the re module are as follows:

    • Use the compile function to compile the string form of a regular expression into a Pattern object;

    • Match the text with the methods provided by the Pattern object to obtain a matching result (a Match object);

    • Finally, use the properties and methods provided by the Match object to obtain information and perform other actions as needed;

Regular expression matching in Python is greedy by default.
