Python 3: How to Use Regular Expressions Gracefully (Part 4)

Last Update:2015-01-12 Source: Internet

Author: User



More powerful features

So far, we have introduced only some of the features of regular expressions. In this article, we will learn some new metacharacters and then see how to use groups to retrieve parts of the matched text.

More metacharacters

There are some metacharacters we have not covered yet; Little Turtle will explain them one by one.

Some metacharacters do not match any characters; they simply indicate success or failure, and are therefore called zero-width assertions. For example, \b asserts that the current position is at a word boundary, but \b does not change the position. This means that zero-width assertions should never be repeated: because \b does not advance the current position, \b\b is no different from \b.

Little Turtle explains: many people may not understand what "changing position" and "zero-width assertion" mean, so let me try to explain. When the pattern abc matches the character a, the current position moves forward so that matching continues with b, and so on. With \babc, however, \b only asserts that the current position is at a word boundary (just before the first letter or just after the last letter of a word); the current position does not change, and a then matches the character at that same position.
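To make "zero width" concrete, here is a minimal sketch (the sample string is made up for illustration). It shows that \b on its own matches an empty span, and that doubling it does not change the result:

```python
import re

text = "abc def"

# \b on its own matches an empty string: the start and end of the match coincide.
boundary = re.search(r"\b", text)
print(boundary.span())             # (0, 0)

# Repeating the assertion changes nothing, because the position never moves.
once = re.search(r"\babc", text)
twice = re.search(r"\b\babc", text)
print(once.span(), twice.span())   # (0, 3) (0, 3)
```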

|

Alternation, or the "or" operator, combines two regular expressions. If A and B are regular expressions, A|B will match any string that matches either A or B. To make it work sensibly when the alternatives are multi-character strings, | has very low precedence. For example, Fish|C matches either Fish or C, not Fis followed by an 'h' or a 'C'.

Similarly, use \| to match a literal '|' character, or put it inside a character class, like this: [|].
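The precedence rule can be checked with a short sketch. The pattern names below are illustrative only, and fullmatch() is used so that each test string must match in full:

```python
import re

# | has low precedence, so the alternatives are the whole strings "Fish" and "C".
whole = re.compile(r"Fish|C")
print(bool(whole.fullmatch("Fish")))    # True
print(bool(whole.fullmatch("C")))       # True
print(bool(whole.fullmatch("FisC")))    # False

# Parentheses limit the alternation to a single character position.
narrow = re.compile(r"Fis(h|C)")
print(bool(narrow.fullmatch("FisC")))   # True

# Two ways to match a literal | character.
print(re.findall(r"\|", "a|b|c"))       # ['|', '|']
print(re.findall(r"[|]", "a|b|c"))      # ['|', '|']
```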

^

Matches at the start of the string. If the MULTILINE flag is set, it also matches at the start of each line, that is, immediately after each newline character.

For example, if you only want to match the word From at the beginning of the string, your regular expression can be written as ^From:

>>> import re
>>> print(re.search('^From', 'From Here to Eternity'))
<_sre.SRE_Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None
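The effect of the MULTILINE flag on ^ can be seen in a small sketch like the following (the three-line sample text is invented for the demonstration):

```python
import re

text = "From here\nFrom there\nNot From here"

# Without MULTILINE, ^ matches only at the very start of the string.
default_hits = re.findall(r"^From", text)
print(len(default_hits))     # 1

# With MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r"^From", text, re.MULTILINE)
print(len(multiline_hits))   # 2
```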


$

Matches at the end of the string, or just before a newline at the end of the string. If the MULTILINE flag is set, it also matches just before every newline character.

>>> print(re.search('}$', '{block}'))
<_sre.SRE_Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<_sre.SRE_Match object; span=(6, 7), match='}'>

Similarly, use \$ to match a literal '$' character, or put it inside a character class, like this: [$].
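As a rough illustration of how MULTILINE changes $, consider this sketch (the sample text is made up):

```python
import re

text = "first}\nsecond}\nthird"

# Without MULTILINE, $ can match only at the end of the string
# (or just before a final newline), so neither } qualifies here.
default_hits = re.findall(r"}$", text)
print(default_hits)      # []

# With MULTILINE, $ also matches just before every newline.
multiline_hits = re.findall(r"}$", text, re.MULTILINE)
print(multiline_hits)    # ['}', '}']
```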

\A

Matches only at the start of the string. When the MULTILINE flag is not set, \A and ^ behave the same. When MULTILINE is set, they differ: \A still matches only at the start of the string, whereas ^ also matches at the start of each line within the string.

\Z

Matches only at the end of the string.
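The differences between these anchors and ^ / $ are easiest to see side by side. A minimal sketch, using an invented two-line string:

```python
import re

text = "one\ntwo\n"

# Under MULTILINE, ^ matches at the start of every line; \A only at the start of the string.
caret_hits = re.findall(r"^\w+", text, re.MULTILINE)
start_hits = re.findall(r"\A\w+", text, re.MULTILINE)
print(caret_hits)    # ['one', 'two']
print(start_hits)    # ['one']

# $ may match just before a trailing newline; \Z matches only at the very end.
dollar_match = re.search(r"two$", text)
end_match = re.search(r"two\Z", text)
print(dollar_match is not None)   # True
print(end_match is not None)      # False
```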

\b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A "word" is defined as a sequence of alphanumeric characters, so the end of a word is marked by whitespace or a non-alphanumeric character.

The following example matches class only when it appears as a complete word; it does not match when class appears inside another word.

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<_sre.SRE_Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None


There are two points to keep in mind when using this special sequence. The first is that Python string literals and regular expressions conflict on some characters (recall the earlier backslash examples). In a Python string literal, \b is the backspace character (ASCII value 8). So if you do not use a raw string, Python converts \b to a backspace, and the regular expression will not match as you expect.

In the following example, we deliberately omit the 'r' that marks a raw string, and the result is very different:

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<_sre.SRE_Match object; span=(0, 7), match='\x08class\x08'>

The second point is that this assertion cannot be used inside a character class. There, for compatibility with Python string literals, \b represents only the backspace character.
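A small sketch of the character-class behaviour (the test string, which embeds a backspace character, is made up for the demonstration):

```python
import re

# The test string contains a real backspace character (ASCII 8) at index 2.
text = "ab\x08c"

# Inside a character class, \b means the backspace character.
in_class = re.search(r"[\b]", text)
print(in_class.span())    # (2, 3)

# Outside a character class, the same \b is the zero-width word-boundary assertion.
outside = re.search(r"\b", text)
print(outside.span())     # (0, 0)
```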

\B

Another zero-width assertion, the opposite of \b: \B matches only when the current position is not at a word boundary.
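Reusing the strings from the \b example, a quick sketch shows \B giving the opposite result: it finds class only when the word is embedded in a longer word.

```python
import re

p = re.compile(r"\Bclass\B")

# "class" stands alone here, so there is a word boundary on both sides.
standalone = p.search("no class at all")
print(standalone)         # None

# Inside "declassified" there is no boundary on either side of "class".
embedded = p.search("the declassified algorithm")
print(embedded.span())    # (6, 11)
```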

Grouping

In real applications, we often need more information than just whether a regular expression matched. For more complex content, regular expressions typically use groups to match the different parts separately.

In the following example, we want to split each RFC-822 header line at the ":" into a name and a value:

From: [email protected]
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: [email protected]


In this case, we can first write a regular expression that matches an entire RFC-822 header line, and then use grouping: one group matches the header name, and another group matches the value that corresponds to that name.

Little Turtle explains: RFC-822 is the standard format for email messages. If you do not yet see how to divide this into groups, don't worry, just keep reading.

In regular expressions, groups are marked with the metacharacters ( and ). They have much the same meaning as parentheses in mathematical expressions: they group together the expressions contained inside them, so you can apply a repetition metacharacter such as *, +, ?, or {m,n} to the contents of a group.

For example, (ab)* will match zero or more repetitions of ab:

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)


Subgroups marked with ( ) are also indexed by number, and we can pass an index as an argument to the methods group(), start(), end(), and span(). Index 0 denotes the whole match (this default group is always present, so calling these methods with no argument is equivalent to passing 0):

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Little Turtle explains: the number of pairs of parentheses determines the number of subgroups. For example, (a)(b) and (a(b)) each consist of two subgroups.
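One way to check the subgroup count is the groups attribute of a compiled pattern, which reports how many capturing groups it contains. A minimal sketch (the variable names are illustrative):

```python
import re

side_by_side = re.compile(r"(a)(b)")
nested = re.compile(r"(a(b))")

# Both patterns contain two pairs of parentheses, hence two subgroups each.
print(side_by_side.groups, nested.groups)    # 2 2

# The text captured by each subgroup differs, though.
print(side_by_side.match("ab").groups())     # ('a', 'b')
print(nested.match("ab").groups())           # ('ab', 'b')
```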

Subgroups are numbered from left to right, and they can be nested, so to determine a subgroup's number, count the opening parentheses from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

The group() method can be passed several subgroup numbers at once, in which case it returns a tuple of the corresponding strings:

>>> m.group(2, 1, 2)
('b', 'abc', 'b')

Little Turtle explains: start() returns the starting position of the subgroup given as its argument, end() returns the ending position of that subgroup, and span() returns its range as a (start, end) pair.

There is also a special method, groups(), which returns the strings matched by all of the subgroups at once:

>>> m.groups()
('abc', 'b')
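For instance, using the same (a(b)c)d pattern as above, a short sketch of these three methods might look like this:

```python
import re

m = re.match(r"(a(b)c)d", "abcd")

# Subgroup 1 is "abc": it starts at index 0 and ends at index 3 (exclusive).
print(m.start(1), m.end(1))   # 0 3

# Subgroup 2 is "b": it starts at index 1 and ends at index 2.
print(m.start(2), m.end(2))   # 1 2
print(m.span(2))              # (1, 2)

# With no argument, the methods describe the whole match (group 0).
print(m.span())               # (0, 4)
```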
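Returning to the RFC-822 headers from earlier, here is one possible sketch of the name/value split using two groups. The pattern is an assumption made for illustration, not a complete RFC-822 parser: it ignores folded (continued) header lines, for example.

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is the value.
header = re.compile(r"([^:\s]+):\s*(.*)")

lines = [
    "User-Agent: Thunderbird 1.5.0.9 (X11/20061227)",
    "MIME-Version: 1.0",
]

parsed = []
for line in lines:
    m = header.match(line)
    name, value = m.groups()   # one string per subgroup
    parsed.append((name, value))
    print(name, "=>", value)
```

Because groups() returns one string per subgroup, the result can be unpacked directly into a name and a value.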

One more concept needs introducing: the backreference. A backreference lets you require that the text matched by an earlier group appear again at a later point in the pattern; it is written as a backslash followed by a number. For example, \1 refers to whatever subgroup 1 matched, and the match succeeds only if exactly that text is found at the current position.

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

If you are only searching a string, backreferences are rarely needed, because few text formats repeat data in this way. However, you will soon find that backreferences are very useful when performing string substitutions!

Little Turtle's note: in a Python string literal, a backslash followed by digits denotes the character with that numeric (octal) value, so for regular expressions that use backreferences we again stress using raw strings.
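As a brief preview of that use, the following sketch collapses a doubled word with the sub() method of a compiled pattern, reusing the pattern from the example above. In the replacement string, \1 stands for the text captured by subgroup 1:

```python
import re

doubled = re.compile(r"(\b\w+)\s+\1")

# Replace each "word word" pair with a single copy of the word.
fixed = doubled.sub(r"\1", "Paris in the the spring")
print(fixed)   # Paris in the spring
```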
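The difference is easy to demonstrate with a small sketch (the pattern and test string are made up for illustration):

```python
import re

# Without the r prefix, Python turns \1 into the single character with value 1.
print(len("\1"), ord("\1"))   # 1 1
# With the r prefix, the backslash and the digit are both kept.
print(len(r"\1"))             # 2

raw = re.compile(r"(\w+) \1")
# Here \\w is escaped by hand, but \1 has silently become chr(1).
not_raw = re.compile("(\\w+) \1")

print(raw.search("the the") is not None)       # True
print(not_raw.search("the the") is not None)   # False
```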


