Python regular expression

Source: Internet
Author: User
Introduction

A regular expression is a pattern that matches a piece of text. The simplest regular expression is an ordinary string, which matches itself. For example, the regular expression 'hello' matches the string 'hello'.

Note that a regular expression is not a program; it is a pattern used to process strings. To process strings with it, you need a tool that supports regular expressions, such as awk, sed, and grep on Linux, or programming languages such as Perl, Python, and Java.

Regular expressions come in different flavors. The following table lists some metacharacters applicable to Python, Perl, and other programming languages:
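As a rough illustration, the sketch below exercises a handful of metacharacters that are common to Python, Perl, and similar flavors. The selection and the sample strings are assumptions chosen here for illustration, not taken from the article's table, and the code uses Python 3 syntax.

```python
import re

# Each entry: (pattern, sample text, what the metacharacters mean).
# The samples are hypothetical and only serve as illustration.
samples = [
    (r'h.t', 'hat hot hit', '. matches any single character except a newline'),
    (r'\d+', 'room 101', r'\d matches a digit, + means one or more'),
    (r'ab*c', 'ac abc abbc', '* means zero or more of the preceding item'),
    (r'colou?r', 'color colour', '? makes the preceding item optional'),
    (r'^\w+', 'hello world', r'^ anchors at the start, \w is a word character'),
    (r'[aeiou]', 'regex', '[...] matches any one character in the set'),
]

results = {}
for pattern, text, meaning in samples:
    results[pattern] = re.findall(pattern, text)
    print('%-10s %-15r -> %r  (%s)' % (pattern, text, results[pattern], meaning))
```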

The re module

In Python, we can use the built-in re module to work with regular expressions.

Note that regular expressions use the backslash to escape special characters. For example, to match the string 'python.org' we need the regular expression 'python\.org'. Python string literals also treat the backslash as an escape character, so in an ordinary string this regular expression has to be written as 'python\\.org', which is easy to get wrong. We therefore recommend Python raw strings, which take an r prefix. The regular expression above can then be written as follows:

r'python\.org'
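A minimal sketch of the difference, assuming hypothetical sample sentences and Python 3 syntax: the ordinary string with a doubled backslash and the raw string denote the same pattern, and the escaped dot matches only a literal '.'.

```python
import re

escaped = 'python\\.org'   # ordinary string: the backslash itself must be escaped
raw = r'python\.org'       # raw string: backslashes are kept as written

same_pattern = (escaped == raw)

# The escaped dot matches only a literal '.', so 'pythonXorg' is rejected.
hit = re.search(raw, 'visit python.org today')
miss = re.search(raw, 'visit pythonXorg today')

print(same_pattern)   # True
print(hit.group())    # python.org
print(miss)           # None
```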

The re module provides many useful functions for matching strings, such as:

  • compile function

  • match function

  • search function

  • findall function

  • finditer function

  • split function

  • sub function

  • subn function

The general steps for using the re module are as follows:

  • Compile the regular expression string into a Pattern object using the compile function.

  • Match the text using the methods provided by the Pattern object to obtain the match result (a Match object).

  • Finally, use the properties and methods provided by the Match object to obtain information and perform other operations as needed.
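The three steps above can be sketched as follows. The pattern and the sample sentence are hypothetical, chosen only to make each step visible, and the code uses Python 3 syntax.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r'(\d{4})-(\d{2})')

# Step 2: call a Pattern method to get a Match object (or None).
m = pattern.search('released 2016-05, updated later')

# Step 3: read information from the Match object.
if m:
    whole = m.group()          # the whole matched substring
    year, month = m.groups()   # the two captured groups
    position = m.span()        # (start, end) of the match
    print(whole, year, month, position)
```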

Compile function

The compile function compiles a regular expression and produces a Pattern object. Its general form is:

re.compile(pattern[, flags])

Here, pattern is a regular expression in string form, and flags is an optional parameter that selects the matching mode, such as case-insensitive or multiline mode.
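As a small sketch of the optional matching-mode argument, the example below compares a pattern compiled without flags against one compiled with re.I (ignore case) and re.M (multiline) combined. The sample text is an assumption made for illustration; Python 3 syntax.

```python
import re

text = 'Python\npython\nPYTHON'

# Without flags: case-sensitive, and ^ only matches at the very start of the text.
plain = re.compile(r'^python')

# re.I ignores case; re.M lets ^ match at the start of every line.
# Several flags are combined with the bitwise OR operator.
flagged = re.compile(r'^python', re.I | re.M)

plain_hits = plain.findall(text)
flagged_hits = flagged.findall(text)

print(plain_hits)     # []
print(flagged_hits)   # ['Python', 'python', 'PYTHON']
```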

Next, let's look at an example.

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

Above, we compiled a regular expression into a Pattern object. Next, we can use the methods of pattern to search text. Common methods of Pattern objects include:

  • match method

  • search method

  • findall method

  • finditer method

  • split method

  • sub method

  • subn method

Match method

The match method looks for a match at the beginning of a string (you can also specify a starting position). It performs a single match: as soon as one match is found it is returned, rather than all matches. Its general form is:

match(string[, pos[, endpos]])

Here, string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string. Their defaults are 0 and len(string) (the length of the string), respectively. Therefore, if you do not specify pos and endpos, the match method matches at the beginning of the string.

If the match succeeds, a Match object is returned. If there is no match, None is returned.

Let's look at the example.

>>> import re
>>> pattern = re.compile(r'\d+')                      # matches one or more digits
>>> m = pattern.match('one12twothree34four')          # match at the beginning: no match
>>> print m
None
>>> m = pattern.match('one12twothree34four', 2, 10)   # start at 'e': no match
>>> print m
None
>>> m = pattern.match('one12twothree34four', 3, 10)   # start at '1': matches
>>> print m                                           # a Match object is returned
<_sre.SRE_Match object at 0x10a42aac0>
>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted
3
>>> m.end(0)     # the 0 can be omitted
5
>>> m.span(0)    # the 0 can be omitted
(3, 5)

In the preceding example, a Match object is returned when the match succeeds, where:

  • The group([group1, ...]) method returns the string matched by one or more groups. To get the entire matched substring, use group() or group(0);

  • The start([group]) method returns the start position, within the whole string, of the substring matched by the group (the index of its first character). The parameter defaults to 0;

  • The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of its last character + 1). The parameter defaults to 0;

  • The span([group]) method returns (start(group), end(group)).

Let's look at an example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means case-insensitive
>>> m = pattern.match('Hello World Wide Web')
>>> print m                # a Match object is returned
<_sre.SRE_Match object at 0x10bea83e8>
>>> m.group(0)             # the entire matched substring
'Hello World'
>>> m.span(0)              # index range of the entire matched substring
(0, 11)
>>> m.group(1)             # substring matched by the first group
'Hello'
>>> m.span(1)              # index range of the first group
(0, 5)
>>> m.group(2)             # substring matched by the second group
'World'
>>> m.span(2)              # index range of the second group
(6, 11)
>>> m.groups()             # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)             # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

Search method

The search method looks for a match at any position in the string. It also performs a single match: as soon as one match is found it is returned, rather than all matches. Its general form is:

search(string[, pos[, endpos]])

Here, string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string. Their defaults are 0 and len(string) (the length of the string), respectively.

If the match succeeds, a Match object is returned. If there is no match, None is returned.

Let's take a look at the example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('one12twothree34four')           # the match method would not match here
>>> m
<_sre.SRE_Match object at 0x10cc03ac0>
>>> m.group()
'12'
>>> m = pattern.search('one12twothree34four', 10, 30)   # specify the string range
>>> m
<_sre.SRE_Match object at 0x10cc03b28>
>>> m.group()
'34'
>>> m.span()
(13, 15)

Let's look at another example:

# -*- coding: utf-8 -*-

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

# Use search() to find a matching substring; None is returned if there is none.
# match() would fail to match here.
m = pattern.search('hello 123456 789')

if m:
    # Use the Match object to get group information
    print 'matching string:', m.group()
    print 'position:', m.span()

Execution result:

matching string: 123456
position: (6, 12)

Findall method

The match and search methods above each perform a single match and return as soon as one match is found. In most cases, however, we need to search the entire string and obtain all matches.

The findall method has the following form:

findall(string[, pos[, endpos]])

Here, string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string. Their defaults are 0 and len(string) (the length of the string), respectively.

findall returns all matched substrings as a list. If there is no match, an empty list is returned.

Let's look at the example:

import re

pattern = re.compile(r'\d+')   # find numbers

result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)

print result1
print result2

Execution result:

['123456', '789']
['1', '2']

Finditer method

The finditer method behaves much like findall: it also searches the entire string and obtains all matches. However, it returns an iterator that yields each match (a Match object) in order.

Let's look at the example:

# -*- coding: utf-8 -*-

import re

pattern = re.compile(r'\d+')

result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

print type(result_iter1)
print type(result_iter2)

print 'result1...'
for m1 in result_iter1:   # m1 is a Match object
    print 'matching string: {}, position: {}'.format(m1.group(), m1.span())

print 'result2...'
for m2 in result_iter2:
    print 'matching string: {}, position: {}'.format(m2.group(), m2.span())

Execution result:

<type 'callable-iterator'>
<type 'callable-iterator'>
result1...
matching string: 123456, position: (6, 12)
matching string: 789, position: (13, 16)
result2...
matching string: 1, position: (3, 4)
matching string: 2, position: (7, 8)

Split method

The split method splits a string at each matching substring and returns the pieces as a list. Its form is:

split(string[, maxsplit])

maxsplit specifies the maximum number of splits. If it is not specified, the string is split at every match.
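A short sketch of the effect of maxsplit, using Python 3 syntax and passing maxsplit by keyword (an assumption made here for readability): with a limit of two splits, the remainder of the string is returned unsplit as the last element.

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a,b;; c   d'

all_parts = p.split(text)                # no limit: split at every match
two_splits = p.split(text, maxsplit=2)   # at most two splits; the rest stays intact

print(all_parts)    # ['a', 'b', 'c', 'd']
print(two_splits)   # ['a', 'b', 'c   d']
```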

Let's look at the example:

import re

p = re.compile(r'[\s\,\;]+')
print p.split('a,b;; c   d')

Execution result:

['a', 'b', 'c', 'd']

Sub method

The sub method performs replacement. Its form is:

sub(repl, string[, count])

Here, repl can be a string or a function:

  • If repl is a string, it replaces every matched substring of string, and the resulting string is returned. repl can also refer to groups by number in the form \id (for example \1), but number 0 cannot be used this way;

  • If repl is a function, it should accept a single argument (a Match object) and return the replacement string (the returned string cannot contain group references).

  • count specifies the maximum number of replacements. If it is not specified, all matches are replaced.

Let's look at the example:

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

def func(m):
    return 'hi' + ' ' + m.group(2)

print p.sub(r'hello world', s)   # replace 'hello 123' and 'hello 456' with 'hello world'
print p.sub(r'\2 \1', s)         # reference groups
print p.sub(func, s)
print p.sub(func, s, 1)          # at most one replacement

Execution result:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456

Subn method

The subn method is similar to the sub method and is also used for replacement. It is used as follows:

subn(repl, string[, count])

It returns a tuple:

(sub(repl, string[, count]), number of replacements)

The tuple has two elements. The first is the result that the sub method would return, and the second is the number of replacements made.

Let's look at the example:

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

def func(m):
    return 'hi' + ' ' + m.group(2)

print p.subn(r'hello world', s)
print p.subn(r'\2 \1', s)
print p.subn(func, s)
print p.subn(func, s, 1)

Execution result:

('hello world, hello world', 2)
('123 hello, 456 hello', 2)
('hi 123, hi 456', 2)
('hi 123, hello 456', 1)

Other functions

In fact, most methods of the Pattern object produced by the compile function have a corresponding function in the re module, with slight differences in usage.

Match function

The match function has the following form:

re.match(pattern, string[, flags])

Here, pattern is a regular expression in string form, such as \d+ or [a-z]+.

The match method of the Pattern object, by comparison, has this form:

match(string[, pos[, endpos]])

As you can see, the match function cannot specify a range within the string; it can only match at the beginning. Let's look at an example:

import re

m1 = re.match(r'\d+', 'One12twothree34four')
if m1:
    print 'matching string:', m1.group()
else:
    print 'm1 is:', m1

m2 = re.match(r'\d+', '12twothree34four')
if m2:
    print 'matching string:', m2.group()
else:
    print 'm2 is:', m2

Execution result:

m1 is: None
matching string: 12

Search function

The search function has the following form:

re.search(pattern, string[, flags])

The search function cannot restrict the search to a range of the string. Otherwise, its usage is similar to that of the search method of the Pattern object.
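A brief sketch of re.search in Python 3 syntax, reusing the sample string from the search method examples. Because the function takes no pos or endpos, the second call slices the string first; this is only one possible workaround, and the reported positions are then relative to the slice, not to the original string.

```python
import re

text = 'one12twothree34four'

m = re.search(r'\d+', text)
first = m.group() if m else None          # first run of digits in the whole string

# No pos/endpos here: search a slice instead (positions are relative to it).
m2 = re.search(r'\d+', text[10:])
second = m2.group() if m2 else None
second_span = m2.span() if m2 else None

print(first)         # 12
print(second)        # 34
print(second_span)   # (3, 5), relative to text[10:]
```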

Findall function

The findall function is used as follows:

re.findall(pattern, string[, flags])

The findall function cannot restrict the search to a range of the string. Otherwise, its usage is similar to that of the findall method of the Pattern object.

Let's look at the example:

import re

print re.findall(r'\d+', 'hello 12345 789')

# output
['12345', '789']

Finditer function

The usage of the finditer function is similar to that of the finditer method of Pattern. The form is as follows:

re.finditer(pattern, string[, flags])
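A minimal sketch of re.finditer in Python 3 syntax, reusing the sample string from the finditer method example: each item produced by the iterator is a Match object.

```python
import re

found = []
for m in re.finditer(r'\d+', 'hello 123456 789'):
    # m is a Match object; collect its text and position
    found.append((m.group(), m.span()))

print(found)   # [('123456', (6, 12)), ('789', (13, 16))]
```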

Split function

The use of the split function is as follows:

re.split(pattern, string[, maxsplit])

Sub function

The sub function is used as follows:

re.sub(pattern, repl, string[, count])

Subn function

The subn function is used as follows:

re.subn(pattern, repl, string[, count])
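The module-level split, sub, and subn functions can be sketched together as follows, reusing the sample strings from the corresponding Pattern method examples. The code uses Python 3 syntax and passes maxsplit and count by keyword, which is a choice made here for clarity.

```python
import re

s = 'hello 123, hello 456'

parts = re.split(r'[\s,;]+', 'a,b;; c   d', maxsplit=2)
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', s)
once, n = re.subn(r'(\w+) (\w+)', r'\2 \1', s, count=1)

print(parts)     # ['a', 'b', 'c   d']
print(swapped)   # 123 hello, 456 hello
print(once, n)   # 123 hello, hello 456 1
```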

Which approach should you use?

As we have seen, there are two ways to use the re module:

  • Use the re.compile function to produce a Pattern object, then use the methods of that object to search the text;

  • Use functions such as re.match, re.search, and re.findall to search the text directly.

The following examples show both approaches.

First, the 1st usage:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

print pattern.match('20140901')
print pattern.search('20140901')
print pattern.findall('20140901')

Then the 2nd usage:

import re

print re.match(r'\d+', '123, 123')
print re.search(r'\d+', '234, 234')
print re.findall(r'\d+', '345, 345')

If a regular expression is used many times (such as \d+ above) and in many places, then for efficiency we should compile it in advance into a Pattern object and use that object's methods to match the text. With functions such as re.match and re.search, the pattern string has to be processed again on every call. The re module does keep a cache of recently compiled patterns, so the cost is usually a cache lookup rather than a full recompilation, but in frequently executed code the overhead still adds up.

Therefore, we recommend the 1st usage.
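One way to check the difference on your own machine is a rough timing comparison, sketched below in Python 3 syntax. The sample text and the iteration count are arbitrary assumptions, and the absolute numbers depend on the machine and Python version, so no particular ratio should be expected.

```python
import re
import timeit

text = 'one12twothree34four'
pattern = re.compile(r'\d+')

# Both styles return the same result.
same = pattern.findall(text) == re.findall(r'\d+', text)

# Rough timing of 20000 calls for each style.
t_compiled = timeit.timeit(lambda: pattern.findall(text), number=20000)
t_module = timeit.timeit(lambda: re.findall(r'\d+', text), number=20000)

print(same)
print('compiled: %.4fs, module-level: %.4fs' % (t_compiled, t_module))
```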

Matching Chinese characters

In some cases, we want to match Chinese characters in text. One thing to note is that the Unicode range of Chinese characters is mainly [\u4e00-\u9fa5]. We say "mainly" because this range is not complete; for example, it does not include full-width (Chinese) punctuation. In most cases, however, it is sufficient.

Suppose we want to extract the Chinese characters from the string title = u'你好，hello，世界'. We can do this:

# -*- coding: utf-8 -*-

import re

title = u'你好，hello，世界'
pattern = re.compile(ur'[\u4e00-\u9fa5]+')
result = pattern.findall(title)

print result

Note that the regular expression carries two prefixes, ur: r marks a raw string and u marks a unicode string. The combined ur prefix is Python 2 syntax.

Execution result:

[u'\u4f60\u597d', u'\u4e16\u754c']
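For readers on Python 3, where str is already Unicode and the ur prefix is not accepted, a roughly equivalent sketch looks like this. It relies on the re module interpreting the \uXXXX escapes inside the raw pattern string; the sample title is the same kind of mixed Chinese and English text, with full-width commas.

```python
import re

title = '你好，hello，世界'

# In a raw string the \u escapes are passed through to re, which interprets them.
pattern = re.compile(r'[\u4e00-\u9fa5]+')
result = pattern.findall(title)

print(result)   # ['你好', '世界']
```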

Greedy match

In Python, regular expression matching is greedy by default (in a few languages it may be non-greedy by default), that is, it matches as many characters as possible.

For example, suppose we want to find all p blocks in a string:

import re

content = 'aa<p>test1</p>bb<p>test2</p>cc'
pattern = re.compile(r'<p>.*</p>')
result = pattern.findall(content)

print result

Execution result:

['<p>test1</p>bb<p>test2</p>']

Because regular expression matching is greedy, that is, it matches as much as possible, after successfully matching the first closing tag it keeps trying to match further to the right, to see whether a longer substring can also be matched.

If we want a non-greedy match, we can add a ? after the quantifier, as follows:

import re

content = 'aa<p>test1</p>bb<p>test2</p>cc'
pattern = re.compile(r'<p>.*?</p>')   # add ?
result = pattern.findall(content)

print result

Result:

['<p>test1</p>', '<p>test2</p>']
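The same ? suffix works on other quantifiers as well. The sketch below, in Python 3 syntax with hypothetical sample strings, compares greedy and non-greedy forms of +, {m,n}, and *.

```python
import re

text = 'price: 12345'

greedy = re.search(r'\d+', text).group()         # takes as many digits as possible
lazy = re.search(r'\d+?', text).group()          # takes as few digits as possible
bounded = re.search(r'\d{2,4}?', text).group()   # non-greedy: stops at the minimum of 2

quoted_lazy = re.findall(r'"(.*?)"', 'say "a" and "b"')
quoted_greedy = re.findall(r'"(.*)"', 'say "a" and "b"')

print(greedy)          # 12345
print(lazy)            # 1
print(bounded)         # 12
print(quoted_lazy)     # ['a', 'b']
print(quoted_greedy)   # ['a" and "b']
```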

Summary

  • The general steps for using the re module are as follows:

  1. Compile the regular expression string into a Pattern object using the compile function;

  2. Match the text using the methods provided by the Pattern object to obtain the match result (a Match object);

  3. Finally, use the properties and methods provided by the Match object to obtain information and perform other operations as needed;

  • Python regular expression matching is greedy by default.

