Python Regular Expression Learning

Source: Internet
Author: User

1. Basic usage of Python Regular Expression

1.1 Basic Rules

1.2 Repetition

1.2.1 Minimal matching and exact matching

1.3 Lookahead and lookbehind assertions

1.4 Basic knowledge of groups

2. Basic functions of the re module

2.1 Using compile for speed

2.2 match and search

2.3 finditer

2.4 Modifying and replacing strings

3. A deeper look at re groups and objects

3.1 Compiled pattern objects

3.2 Groups and match objects

3.2.1 Group names and numbers

3.2.2 Methods of match objects

4. More information

As a Python beginner, I was deeply impressed by Python's text-processing capabilities. Beyond the methods built into the str object, the most powerful tool is the regular expression module. For a beginner, however, it is not easy to use, and it took me a long time to find my way in. Because my memory is poor and I forget things easily, I decided to write it all down, which also reinforces what I have learned and helps me sort out my ideas.

I am a beginner, so this text certainly contains mistakes. I hope readers will point them out.

1. Basic usage of Python Regular Expression

The Python regular expression module is 're'. Its basic syntax is to specify a character sequence. For example, to find the string 'abc' in the string s = '123abc456eabc789', write:

>>> import re

>>> s = '123abc456eabc789'

>>> re.findall(r'abc', s)

The result is:

['abc', 'abc']

The function "findall(rule, target[, flag])" used here is the most intuitive one: it finds every matching substring in the target string. The first parameter is the rule, the second is the target string, and an optional rule option can follow (the options are described in detail with the compile function). The result is a list of the strings that match the rule. If nothing matches, an empty list is returned.
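
As a small illustration of the behaviour described above, the following sketch uses a made-up sample string (not from the original example) to show the three cases: matches found, nothing found, and an option passed as the third argument.

```python
import re

s = '123abc456eABC789'   # sample text chosen for this sketch

# Every non-overlapping match is returned, in order of appearance.
found = re.findall(r'abc', s)

# When nothing matches, the result is an empty list, not None.
missing = re.findall(r'xyz', s)

# The optional third argument is an option; re.I ignores case.
found_any_case = re.findall(r'abc', s, re.I)

print(found, missing, found_any_case)
```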

Why use the r'..' string (raw string)? A regular expression rule is itself defined by a string, and regular expressions make heavy use of the escape character '\'. Without raw strings, every '\' in the rule must be written as '\\', and to match a literal '\' in the target string you have to write four of them: '\\\\'! This is troublesome and not intuitive, so r'' is generally used to define rule strings. In some cases, though, not using a raw string may be more convenient.
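
The difference can be seen in a minimal sketch like the following, where the sample path is an assumption made for illustration. Both rules match a single literal backslash; only the way they are written in source code differs.

```python
import re

path = 'C:\\temp'   # the actual text is C:\temp, with one backslash

# Ordinary string: four backslashes in source code give the rule \\
plain = re.findall('\\\\', path)

# Raw string: two backslashes are enough for the same rule
raw = re.findall(r'\\', path)

print(plain == raw, len(raw))
```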

The above is the simplest example, and such a simple use has little practical value. To express complex search rules, re defines a number of syntax elements, which fall into the following categories:

Function characters: '.' '*' '+' '|' '?' '^' '$' '\' and so on. They have special functional meanings. The '\' character in particular is the escape character; the character that follows it generally takes on a special meaning.

Rule delimiters: '[' ']' '(' ')' '{' '}', that is, the various brackets.

Predefined escape character sets: '\d' '\w' '\s' and so on. They start with '\' followed by a specific character and stand for a predefined meaning.

Other special characters: '#' '!' ':' '-' and so on. They have a special meaning only in particular contexts; for example, (?#...) denotes a comment whose content is ignored.

The following sections describe these rules one by one. The order is not the order listed above; instead, I have arranged them from simple to complex as I see it. To keep things intuitive, the descriptions include as many examples as possible.

1.1 Basic Rules

'[' ']' Character set delimiters

First, how to define a character set. Characters enclosed in square brackets form a character set, which matches any one of the characters it contains. For example, [abc123] means that any of the characters 'a' 'b' 'c' '1' '2' '3' satisfies the rule and can be matched.

Inside '[' ']' you can also use '-' to shorten the list of characters. For example, [a-zA-Z] specifies all uppercase and lowercase English letters, because the letters are ordered from small to large. The order cannot be reversed: you cannot write [z-a].

If '^' is written at the very start of '[' ']', the set is negated: it matches any character not listed in the brackets. For example, [^a-zA-Z] matches any character that is not an English letter. If '^' is not at the start, it no longer means negation but stands for itself; for example, [a-z^A-Z] matches all English letters and the character '^'.
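
The rules above can be tried out with a short sketch such as this one. The sample string and variable names are made up for illustration.

```python
import re

s = 'a1-B2^c3'   # sample text chosen for this sketch

listed = re.findall(r'[abc123]', s)        # only the listed characters
letters = re.findall(r'[a-zA-Z]', s)       # ranges written with '-'
non_letters = re.findall(r'[^a-zA-Z]', s)  # a leading '^' negates the set
with_caret = re.findall(r'[a-z^A-Z]', s)   # '^' elsewhere is a literal '^'

# A reversed range is rejected when the rule is compiled.
try:
    re.compile(r'[z-a]')
    reversed_ok = True
except re.error:
    reversed_ok = False

print(listed, letters, non_letters, with_caret, reversed_ok)
```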

'|' Or rule

Join two rules with '|' to indicate that a match succeeds if either rule is satisfied. For example,

[a-zA-Z]|[0-9] matches a letter or a digit, which is equivalent to [a-zA-Z0-9].

Note two points about '|':

First, inside '[' ']' it no longer means "or" but stands for the literal character itself. To match a literal '|' outside '[' ']', escape it with a backslash: '\|'.

Second, its scope is the entire rule on each side. For example, 'dog|cat' matches 'dog' or 'cat', not 'g' or 'c'. To limit its scope, use a non-capturing group '(?:)'. For example, to match 'I have a dog' or 'I have a cat', write r'I have a (?:dog|cat)', not r'I have a dog|cat'.

Example:

>>> s = 'I have a dog, I have a cat'

>>> re.findall(r'I have a (?:dog|cat)', s)

['I have a dog', 'I have a cat'] # as we want

Next, see what happens without the non-capturing group:

>>> re.findall(r'I have a dog|cat', s)

['I have a dog', 'cat'] # it treats 'I have a dog' and 'cat' as two separate rules

The non-capturing group is described in detail later, so it is not covered further here.

'.' Matches any character

Matches any character except the newline '\n'. If the 'S' option is used, it matches every character, including '\n'.

Example:

>>> s = '123\n456\n789'

>>> re.findall(r'.+', s)

['123', '456', '789']

>>> re.findall(r'.+', s, re.S)

['123\n456\n789']

'^' and '$' Match the start and end of the string

Note that '^' must not be placed inside '[]', where its meaning changes; see the description of '[' ']' above. In multiline mode they also match the start and end of each line; see the 'M' option in the description of the compile function.

'\d' Matches digits

This is an escape sequence starting with '\'. '\d' matches one digit and is equivalent to [0-9] for ASCII text.

'\D' Matches non-digits

This is the complement of the set above: it matches one non-digit character and is equivalent to [^0-9]. Note the difference in case. Many escape sequences in Python's regular expressions come in such lowercase/uppercase pairs, where the uppercase form is the complement of the lowercase one. This makes them easy to remember.

'\w' Matches letters and digits

Matches English letters, digits, and the underscore, that is, [a-zA-Z0-9_] for ASCII text. (In Python 3, for str rules, '\w' also matches letters and digits from other scripts unless the re.ASCII option is used.)

'\W' Matches characters other than letters and digits

The complement of '\w', equivalent to [^a-zA-Z0-9_].

'\s' Matches whitespace

Matches spaces, tabs, carriage returns, and other characters that act as separators. It is equivalent to [ \t\r\n\f\v]. (Note the space at the beginning.)

'\S' Matches non-whitespace

The complement of the whitespace set, equivalent to [^ \t\r\n\f\v].
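
The six predefined sets can be compared side by side in a small sketch. The sample string below is an assumption chosen so that it contains letters, digits, an underscore, spaces, and a tab.

```python
import re

s = 'id_42 = 7\tok'   # sample text chosen for this sketch

digits = re.findall(r'\d+', s)       # runs of digits
non_digits = re.findall(r'\D+', s)   # runs of everything else
words = re.findall(r'\w+', s)        # letters, digits, and '_'
non_words = re.findall(r'\W', s)     # single non-word characters
spaces = re.findall(r'\s', s)        # single whitespace characters
non_spaces = re.findall(r'\S+', s)   # runs of non-whitespace

print(digits, non_digits, words, non_words, spaces, non_spaces)
```

Note that 'id_42' comes back as one piece from '\w+', because the underscore counts as a word character.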

'\A' Matches the start of the string

Matches the start of the string. It differs from '^' in that '\A' matches only the start of the entire string. Even in 'M' mode, it does not match the start of any other line.

'\Z' Matches the end of the string

Matches the end of the string. It differs from '$' in that '\Z' matches only the end of the entire string. Even in 'M' mode, it does not match the end of any other line.

Example:

>>> s = '12 34\n56 78\n90'

>>> re.findall(r'^\d+', s, re.M) # match the number at the start of each line

['12', '56', '90']

>>> re.findall(r'\A\d+', s, re.M) # match the number at the start of the string

['12']

>>> re.findall(r'\d+$', s, re.M) # match the number at the end of each line

['34', '78', '90']

>>> re.findall(r'\d+\Z', s, re.M) # match the number at the end of the string

['90']

'\b' Matches a word boundary

It matches a word boundary, such as the position next to a space, but it is a zero-length match: the matched string does not include the separator. If '\s' is used instead, the matched string includes the separator.

Example:

>>> s = 'abc abcde bc bcd'

>>> re.findall(r'\bbc\b', s) # match the standalone word 'bc', but not 'bc' as part of another word

['bc'] # only the standalone 'bc' is found

>>> re.findall(r'\sbc\s', s) # match the standalone word 'bc'

[' bc '] # only the standalone 'bc' is found, but note the space before and after it, which may be hard to see

'\B' Matches a non-boundary

The opposite of '\b': it matches only at positions that are not word boundaries. It is also a zero-length match.

Example:

>>> re.findall(r'\Bbc\w+', s) # match text that contains 'bc' inside a word, not at its start

['bcde'] # matched 'bcde' in 'abcde' but not 'bcd'

'(?:)' Non-capturing group

When you want to treat part of a rule as a unit, for example to specify how many times it repeats, enclose it in '(?:' ')' rather than a plain pair of parentheses, which would produce a quite unexpected result.

For example, match the repeated 'ab' in a string:

>>> s = 'ababab abbabb aabaab'

>>> re.findall(r'\b(?:ab)+\b', s)

['ababab']

If you use only a plain pair of parentheses, see what the result is:

>>> re.findall(r'\b(ab)+\b', s)

['ab']

This is because a plain pair of parentheses forms a group. Groups are more complex to use and are explained in detail later.

'(?#)' Comment

Python allows comments inside a regular expression: anything between '(?#' and ')' is ignored.
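
A minimal sketch of an inline comment follows. The sample string is made up, and the point is only that the commented rule behaves exactly like the uncommented one.

```python
import re

s = 'order 66, unit 501'   # sample text chosen for this sketch

# The (?#...) part is ignored when the rule is matched.
with_comment = re.findall(r'\d+(?#one or more digits)', s)
without_comment = re.findall(r'\d+', s)

print(with_comment == without_comment)
```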

(?iLmsux) Compilation options

Python's regular expressions accept a number of options. An option can be passed as a parameter to findall or compile, or written inside the regular expression as part of the rule, which is sometimes more convenient. For details of the options, see the description of the compile function.

Here the option 'i' is equivalent to IGNORECASE, 'L' to LOCALE, 'm' to MULTILINE, 's' to DOTALL, 'u' to UNICODE, and 'x' to VERBOSE.

The letters are case sensitive. You can specify just some of them: for example, to ignore case only, write '(?i)'; to ignore case and use multiline mode together, write '(?im)'.

Also note that an option applies to the entire rule, not just to the part written after it. Recent Python 3 releases require such an option to appear at the very start of the rule.
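
As a sketch of the two ways to pass options (the sample string is an assumption made for illustration), writing '(?im)' at the start of the rule gives the same result as passing re.I | re.M as a parameter:

```python
import re

s = 'Cat\ncat\nCAT'   # three lines, each a differently cased 'cat'

# Options written inside the rule, at its very start
inline = re.findall(r'(?im)^cat$', s)

# The same options passed as a parameter
by_argument = re.findall(r'^cat$', s, re.I | re.M)

print(inline == by_argument)
```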

1.2 Repetition

Regular expressions need to match strings of indefinite length, so they need a way to express repetition. Python's regular expressions are rich and flexible in this respect. A repetition rule is written directly after a character rule and states how many times the preceding rule is to be repeated. The repetition rules are:

'*' Zero or more matches

Matches the preceding rule zero or more times.

'+' One or more matches

Matches the preceding rule at least once, and possibly many times.

For example, match the variable names in the following string that begin with letters and end with digits or with no digits.

>>> s = 'aaa bbb111 cc22cc 33dd'

>>> re.findall(r'\b[a-z]+\d*\b', s) # must start with at least one letter and end with a run of digits or no digits

['aaa', 'bbb111']

Note that in the example above the word-boundary marker '\b' is added before and after the rule. Without it, the result becomes:

>>> re.findall(r'[a-z]+\d*', s)

['aaa', 'bbb111', 'cc22', 'cc', 'dd'] # words are split apart

In most cases, this is not the result we want.

'?' Zero or one match

Matches the preceding rule zero times or once only.

For example, match a number that may be a plain integer or written in scientific notation; 123 and 10e3 are both valid numbers.

>>> s = '123 10e3 20e4e4 30ee5'

>>> re.findall(r'\b\d+[eE]?\d*\b', s)

['123', '10e3']

It matches 123 and 10e3 correctly, which is what we expect. Note the '\b' before and after the rule; without it, the results would be unexpected.
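
To compare the three repetition rules directly, the following sketch applies each of them to the same made-up sample string. Only the repetition character after 'b' changes between the rules.

```python
import re

s = 'ac abc abbc'   # sample text chosen for this sketch

star = re.findall(r'ab*c', s)      # 'b' zero or more times
plus = re.findall(r'ab+c', s)      # 'b' one or more times
optional = re.findall(r'ab?c', s)  # 'b' zero times or once

print(star, plus, optional)
```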

 
