Detailed tutorial on regular expressions in Python

Source: Internet
Author: User

1. Understand Regular Expressions

A regular expression is a logical formula for operating on strings. It combines predefined characters into a "rule string", and this rule string expresses a filtering logic for strings.

Regular expressions are a very powerful tool for matching strings, and most programming languages support them. Python is no exception; here we want to use regular expressions to extract content from a returned web page.

The general matching process of a regular expression is as follows:
1. Compare the expression with the characters in the text, one after another.
2. If every character can be matched, the match succeeds. If any character cannot be matched, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.
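The first two steps above can be sketched for the simplest case, a pattern made only of literal characters. The helper name `literal_match_at` is hypothetical and is not how the re module is implemented; real engines also handle quantifiers, boundaries, and backtracking.

```python
def literal_match_at(pattern, text, start=0):
    """Return True if every character of `pattern` equals the text from `start` onward."""
    for offset, expected in enumerate(pattern):
        position = start + offset
        # Running off the end of the text, or a mismatching character, fails the match.
        if position >= len(text) or text[position] != expected:
            return False
    return True


matched = literal_match_at("hello", "hello world")  # all five characters agree
failed = literal_match_at("hello", "helo world")    # second 'l' expected, 'o' found
print(matched, failed)
```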

2. Regular expression syntax rules

Below are some matching rules of regular expressions in Python. [Picture: table of regular expression syntax rules, from CSDN]
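As a rough stand-in for a syntax table, the sketch below exercises a handful of commonly used rules. The sample strings are made up for illustration, and the selection is not a reproduction of the picture's contents.

```python
import re

digits = re.findall(r"\d+", "room 101, floor 7")     # \d+  one or more digits
words = re.findall(r"\w+", "hi, there")              # \w+  one or more word characters
any_char = re.findall(r"a.c", "abc a-c ac")          # .    any single character except newline
char_class = re.findall(r"[aeiou]", "regex")         # []   one character from the set
optional = re.findall(r"colou?r", "color colour")    # ?    zero or one of the preceding item
anchored = re.findall(r"^\w+", "first second")       # ^    start of the string

print(digits, words, any_char, char_class, optional, anchored)
```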

3. Notes on regular expressions
(1) Greedy and non-greedy modes of quantifiers

Regular expressions are usually used to search for matching strings in text. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", while the non-greedy quantifier "ab*?" finds "a".

Note: We generally use the non-greedy mode for extraction.
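The "ab*" example can be checked directly. The quoted-text sample after it is an assumed illustration of why the non-greedy form is usually preferred for extraction.

```python
import re

greedy = re.match(r"ab*", "abbbc").group()   # takes as many 'b' as possible
lazy = re.match(r"ab*?", "abbbc").group()    # takes as few 'b' as possible

# Hypothetical extraction task: pull out each quoted word.
text = 'say "one" and "two"'
greedy_quotes = re.findall(r'"(.*)"', text)   # runs from the first quote to the last
lazy_quotes = re.findall(r'"(.*?)"', text)    # stops at the nearest closing quote

print(greedy, lazy)
print(greedy_quotes, lazy_quotes)
```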
(2) Backslash

Like most programming languages, regular expressions use "\" as the escape character, which can cause backslash trouble. To match the character "\" in text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two backslashes, and the regular expression then reads those two as one escaped literal backslash.

Raw ("native") strings in Python solve this problem well. In this example the regular expression can be written as r"\\". Similarly, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about a missing backslash, and the expression is easier to read.
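A short check of the backslash rule, as a sketch. The sample path is an assumption made for the demonstration.

```python
import re

text = "C:\\temp"   # the actual text is C:\temp, with one backslash

plain = "\\\\"      # four backslashes in the source code, two in the string
raw = r"\\"         # raw string: the same two-character string
same_pattern = (plain == raw)

backslashes = re.findall(raw, text)        # finds the single backslash in the text
digit_equal = ("\\d" == r"\d")             # both spell the regex \d
found_digits = re.findall(r"\d", "a1b2")

print(same_pattern, backslashes, digit_equal, found_digits)
```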
4. Python re module

Python comes with the re module, which provides support for regular expressions. The main methods used are as follows:
 

# Return the pattern object
re.compile(string[, flag])
# The following are the matching functions
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])

Before introducing these methods, let's first introduce the concept of a pattern. A pattern can be understood as a matching pattern. How do we get one? It's easy: we use the re.compile method. For example:
 

pattern = re.compile(r'hello')

We pass in a raw string as the parameter, the compile method compiles it into a pattern object, and we then use this object for further matching.

You may also have noticed another parameter, flags. Here is what it means:

The flags parameter sets the matching mode. Several values can be combined with the bitwise OR operator '|', for example re.I | re.M.

Optional values:
 

  • re.I (full name: IGNORECASE): ignore case (the full name is given in brackets, the same below)
  • re.M (full name: MULTILINE): multi-line mode, which changes the behavior of '^' and '$'
  • re.S (full name: DOTALL): dot-matches-all mode, which changes the behavior of '.' so that it also matches a newline
  • re.L (full name: LOCALE): make the predefined character classes \w \W \b \B \s \S depend on the current locale
  • re.U (full name: UNICODE): make the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
  • re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.
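The effect of the most common flags can be seen in a small sketch. The sample text and the phone-number-like pattern are assumptions chosen for the demonstration.

```python
import re

text = "Hello\nhello world"

ignore_case = re.findall(r"hello", text, re.I)        # case-insensitive
line_starts = re.findall(r"^\w+", text, re.M)         # '^' matches at each line start
dot_default = re.search(r"Hello.hello", text)         # '.' does not match '\n' by default
dot_all = re.search(r"Hello.hello", text, re.S)       # with re.S it does
combined = re.findall(r"^hello", text, re.I | re.M)   # flags combined with '|'

# re.X lets the pattern span several lines and carry comments.
verbose = re.compile(r"""
    (\d{3})   # first group: three digits
    -         # a literal hyphen
    (\d{4})   # second group: four digits
""", re.X)
verbose_groups = verbose.match("555-1234").groups()

print(ignore_case, line_starts, dot_default, dot_all.group(), combined, verbose_groups)
```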

We use this pattern in the other methods just mentioned, such as re.match. We will introduce them one by one.

Note: the flags parameter of the methods below also sets the matching mode. If flags were already specified when the pattern was compiled, they do not need to be passed again.

(1) re.match(pattern, string[, flags])

This method starts at the beginning of string (the string we want to match) and tries to match pattern going forward. If it meets a character that cannot be matched, it returns None immediately. If it reaches the end of the string before the pattern has been fully matched, it also returns None. Both results mean the match failed. Otherwise the match succeeds and matching stops. The following is an example.
 

# -*- coding: utf-8 -*-
__author__ = 'CQC'

# Import the re module
import re

# Compile the regular expression into a Pattern object.
# The r in front of 'hello' marks a raw ("native") string.
pattern = re.compile(r'hello')

# Use re.match to match the text; None is returned if the match fails
result1 = re.match(pattern, 'hello')
result2 = re.match(pattern, 'helloo CQC!')
result3 = re.match(pattern, 'helo CQC!')
result4 = re.match(pattern, 'hello CQC!')

# If match 1 succeeded
if result1:
    # Use the Match object to get the group information
    print(result1.group())
else:
    print('1: match failed!')

# If match 2 succeeded
if result2:
    print(result2.group())
else:
    print('2: match failed!')

# If match 3 succeeded
if result3:
    print(result3.group())
else:
    print('3: match failed!')

# If match 4 succeeded
if result4:
    print(result4.group())
else:
    print('4: match failed!')

Running result
 

hello
hello
3: match failed!
hello

Matching analysis

1. First match: the pattern is 'hello' and the target string is also 'hello', so the match succeeds from beginning to end.

2. Second match: the string is 'helloo CQC!'. Matching from the start of the string, the whole pattern is matched, so matching ends there and the remaining 'o CQC!' is not examined. The match succeeds.

3. Third match: the string is 'helo CQC!'. Matching from the start of the string, the pattern expects a second 'l' but finds 'o', so the match cannot be completed and None is returned.

4. Fourth match: the principle is the same as for the second match; the space character does not affect the result.

Finally, result.group() is printed. What does this mean? The following describes the attributes and methods of the Match object.
A Match object is the result of one match and holds a lot of information about it. You can use the readable attributes and methods provided by Match to obtain this information.

Attributes:
1. string: the text used for matching.
2. re: the Pattern object used for matching.
3. pos: the index in the text at which the regular expression starts searching. Its value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
4. endpos: the index in the text at which the regular expression stops searching. Its value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
5. lastindex: the index of the last captured group. If no group was captured, the value is None.
6. lastgroup: the alias (name) of the last captured group. If that group has no alias, or no group was captured, the value is None.

Methods:
1. group([group1, ...]):
Returns the string captured by one or more groups. If several parameters are given, the strings are returned as a tuple. group1 can be a number or an alias; number 0 stands for the entire matched substring. With no parameter, group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last captured substring.
2. groups([default]):
Returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ..., last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings they captured. Groups without an alias are not included. default has the same meaning as above.
4. start([group]):
Returns the start index in string of the substring captured by the given group (the index of the substring's first character). group defaults to 0.
5. end([group]):
Returns the end index in string of the substring captured by the given group (the index of the substring's last character + 1). group defaults to 0.
6. span([group]):
Returns (start(group), end(group)).
7. expand(template):
Substitutes the matched groups into template and returns the result. In template you can reference a group with \id, \g<id>, or \g<name>. The whole match cannot be written as \0; use \g<0> for it. \id is equivalent to \g<id>, but \10 is read as group 10. To write \1 followed by the character '0', you have to use \g<1>0.

The following is an example.
 

# -*- coding: utf-8 -*-
# A simple match example

import re

# Matches the following content: word + space + word + any characters
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group():", m.group())
print("m.group(1, 2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: (representation of the compiled Pattern object)
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1, 2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
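The group-reference forms accepted by expand() can be tried out in a few lines. This is a sketch with an assumed two-group pattern; the handling of \10 shown here reflects how current Python versions behave when group 10 does not exist.

```python
import re

m = re.match(r"(\w+) (\w+)", "hello world")

by_number = m.expand(r"\2 \1")            # numeric references
followed_by_zero = m.expand(r"\g<1>0")    # group 1, then a literal '0'
whole_match = m.expand(r"\g<0>!")         # \g<0> is the entire match

# \10 is read as "group 10", which this pattern does not have.
try:
    m.expand(r"\10")
    ten_rejected = False
except re.error:
    ten_rejected = True

print(by_number, followed_by_zero, whole_match, ten_rejected)
```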

(2) re.search(pattern, string[, flags])

The search method is similar to the match method. The difference is that match() only checks whether the pattern matches at the start of the string, while search() scans the entire string for a match. match() returns a result only if the match succeeds at position 0; if the match at the start position fails, match() returns None. The object returned by search() has the same methods and attributes as the object returned by match(). Let's take an example.
 

# Import the re module
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')
# Use search() to find a matching substring; None is returned if there is none.
# match() would fail to match in this example.
match = re.search(pattern, 'hello world!')
if match:
    # Use the Match object to get the group information
    print(match.group())

### output ###
# world
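To make the difference concrete, the same pattern and text can be run through both functions. This is a minimal sketch reusing the strings from the example above.

```python
import re

pattern = re.compile(r"world")
text = "hello world!"

at_start = re.match(pattern, text)     # 'world' is not at position 0
anywhere = re.search(pattern, text)    # found further along in the string
found_span = anywhere.span()

print(at_start, anywhere.group(), found_span)
```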

(3) re.split(pattern, string[, maxsplit])

Splits string at each matching substring and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match. Let's take a look at the example below.

import re

pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']
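The maxsplit parameter is not exercised in the example, so here is a small sketch of it using the same string. Dropping the trailing empty string is an optional extra step shown only for convenience.

```python
import re

text = "one1two2three3four4"

parts_all = re.split(r"\d+", text)               # split at every run of digits
parts_two = re.split(r"\d+", text, maxsplit=2)   # stop after two splits
non_empty = [part for part in parts_all if part] # drop the trailing ''

print(parts_all, parts_two, non_empty)
```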

(4) re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list. Let's take a look at the example below.

import re

pattern = re.compile(r'\d+')
print(re.findall(pattern, 'one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']

(5) re.finditer(pattern, string[, flags])

Returns an iterator that yields each matching result (a Match object) in order. Let's take a look at the example below.

import re

pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    print(m.group(), end=' ')

### output ###
# 1 2 3 4
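Because finditer yields Match objects rather than plain strings, each result also carries its position. A short sketch with an assumed sample string:

```python
import re

text = "one1two22three333"

# Collect (matched text, start index, end index) for every run of digits.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(positions)
```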

(6) re.sub(pattern, repl, string[, count])

Replaces each matching substring in string with repl and returns the resulting string.
When repl is a string, you can reference groups with \id, \g<id>, or \g<name>. The whole match cannot be written as \0; use \g<0> for it.
When repl is a function, it should accept a single parameter (a Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements. If it is not given, all matches are replaced.

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.sub(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.sub(pattern, func, s))

### output ###
# say i, world hello!
# I Say, Hello World!
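The count parameter and the \g<name> reference form are not shown in the example, so here is a hedged sketch of both. The group names `first` and `second` are invented for the illustration.

```python
import re

pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

swapped_all = re.sub(pattern, r"\g<second> \g<first>", s)            # named references
swapped_once = re.sub(pattern, r"\g<second> \g<first>", s, count=1)  # first match only
bracketed = re.sub(pattern, r"[\g<0>]", s)                           # \g<0> is the whole match

print(swapped_all)
print(swapped_once)
print(bracketed)
```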

(7) re.subn(pattern, repl, string[, count])

Returns (sub(repl, string[, count]), number of replacements).

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.subn(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.subn(pattern, func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)
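The replacement count returned by subn is handy for checking whether anything changed at all. A minimal sketch with assumed sample strings:

```python
import re

replaced, n_replaced = re.subn(r"\d+", "#", "a1b22c333")   # three runs of digits
unchanged, n_unchanged = re.subn(r"\d+", "#", "abc")       # nothing to replace

print(replaced, n_replaced)
print(unchanged, n_unchanged)
```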

5. Another way of using the Python re module

We have introduced seven utility methods above, such as match and search, all called in the form re.match, re.search, and so on. There is another way to call them: pattern.match, pattern.search, and so on, so that you do not need to pass pattern as the first parameter.

Function API list
 

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags])
search(string[, pos[, endpos]]) | re.search(pattern, string[, flags])
split(string[, maxsplit]) | re.split(pattern, string[, maxsplit])
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags])
finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags])
sub(repl, string[, count]) | re.sub(pattern, repl, string[, count])
subn(repl, string[, count]) | re.subn(pattern, repl, string[, count])

We won't go through each call in detail. The principle is the same; only the parameters differ. Try it yourself!

Don't be discouraged if this section left you confused. Next we will use some practical examples to help you master regular expressions.
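As a quick sketch of the two calling styles side by side, the following also shows the pos and endpos parameters that only the pattern methods accept. The sample string is reused from the earlier examples.

```python
import re

pattern = re.compile(r"\d+")
text = "one1two2three3four4"

via_module = re.findall(pattern, text)   # pattern passed as the first argument
via_pattern = pattern.findall(text)      # same result, called on the pattern

from_pos = pattern.findall(text, 4)      # start searching at index 4
window = pattern.search(text, 4, 8)      # search only within indexes 4..7
at_three = pattern.match(text, 3)        # match() begins exactly at index 3

print(via_module, via_pattern, from_pos, window.span(), at_three.group())
```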
