Python3 Regular Expressions (4)

Last Update:2017-05-30 Source: Internet

Author: User

Tags: character classes

Previous Article: Explanation of Python3 Regular Expression (3)

https://docs.python.org/3.4/howto/regex.html

The blogger has added some comments and modifications to the original text ^_^

Note: many readers may not understand what "the current position moving" and "zero-width assertion" mean, so I will try to explain. When abc is matched, once a has matched, the current position advances and matching continues with b, and so on. With \babc, however, \b only asserts that the current position is at a word boundary (just before the first letter or just after the last letter of a word). The current position does not move; a is then matched against the character at that same position.
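To make the idea of a zero-width assertion concrete, here is a minimal sketch that compares match spans. The sample string 'abc def' is only an illustration.

```python
import re

# A plain pattern consumes characters: the match covers positions 0 to 3.
consumed = re.search(r'abc', 'abc def').span()

# \b on its own matches a position, not a character, so the span is empty.
zero_width = re.search(r'\b', 'abc def').span()

# With \b in front, the match still covers only the three letters.
with_boundary = re.search(r'\babc', 'abc def').span()

print(consumed, zero_width, with_boundary)  # (0, 3) (0, 0) (0, 3)
```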

1. |

The "or" operator, which combines two regular expressions. If A and B are regular expressions, A|B matches any string that matches either A or B. So that it behaves sensibly with multi-character alternatives, | has very low precedence. For example, Fa|n matches either 'Fa' or 'n'; it does not mean 'F' followed by either 'a' or 'n'.

Similarly, use \| to match a literal '|' character, or put it inside a character class, as in [|].
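The low precedence of | can be checked with a short sketch. It uses re.fullmatch so that only whole-string matches count; the test strings are made up for illustration.

```python
import re

# | has low precedence, so the alternatives are the whole strings 'Fa' and 'n'.
matches_fa = re.fullmatch(r'Fa|n', 'Fa') is not None
matches_n = re.fullmatch(r'Fa|n', 'n') is not None
# 'Fn' would only match if | applied to the single letters 'a' and 'n'.
matches_fn = re.fullmatch(r'Fa|n', 'Fn') is not None

# Two ways to match a literal '|'.
escaped = re.findall(r'\|', 'a|b|c')
in_class = re.findall(r'[|]', 'a|b|c')

print(matches_fa, matches_n, matches_fn)  # True True False
print(escaped, in_class)                  # ['|', '|'] ['|', '|']
```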

2. ^

Matches at the start of the string. If the MULTILINE flag is set, it also matches at the start of each line, that is, immediately after each newline character.

For example, if you only want to match the word From at the beginning of the string, your regular expression can be written as ^From.
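A short sketch of ^ with and without the MULTILINE flag follows. The sample strings are illustrative only.

```python
import re

# ^From matches only when From is at the very start of the string.
at_start = re.search(r'^From', 'From Here to Eternity')
not_at_start = re.search(r'^From', 'Reciting From Memory')

# With MULTILINE, ^ also matches right after each newline.
text = 'From: a\nFrom: b'
default_hits = re.findall(r'^From', text)
multiline_hits = re.findall(r'^From', text, re.MULTILINE)

print(at_start is not None, not_at_start)  # True None
print(default_hits, multiline_hits)        # ['From'] ['From', 'From']
```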

3. $

Matches at the end of the string, or just before a newline at the end of the string. If the MULTILINE flag is set, it also matches just before every newline.

Similarly, we use \$ to match the '$' character itself, or include it in a character class, like this: [$].
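The behaviour of $ can be seen in a small sketch; the strings below are examples chosen for illustration.

```python
import re

# '}' is the last character, so $ matches right after it.
end_of_string = re.search(r'}$', '{block}')
# A trailing space means '}' is no longer at the end.
trailing_space = re.search(r'}$', '{block} ')
# A single trailing newline is allowed: $ matches just before it.
before_newline = re.search(r'}$', '{block}\n')

# An escaped \$ matches the dollar sign itself.
dollar_signs = re.findall(r'\$', 'price: $5 or $6')

print(end_of_string.span(), trailing_space, before_newline.span())  # (6, 7) None (6, 7)
print(dollar_signs)  # ['$', '$']
```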

4. \A

Matches only at the start of the string. If the MULTILINE flag is not set, \A and ^ behave the same. If the MULTILINE flag is set, they differ: \A still matches only at the start of the string, whereas ^ matches at the start of every line.
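The difference between ^ and \A under MULTILINE can be sketched as follows; the two-line sample text is an assumption made for illustration.

```python
import re

text = 'first\nsecond'

# With MULTILINE, ^ matches at the start of every line.
caret_hits = re.findall(r'^\w+', text, re.MULTILINE)
# \A ignores MULTILINE and matches only at the start of the string.
a_hits = re.findall(r'\A\w+', text, re.MULTILINE)

print(caret_hits)  # ['first', 'second']
print(a_hits)      # ['first']
```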

5. \Z

Only matches the end position of the string.
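As a rough comparison with $, the sketch below shows that \Z does not accept a trailing newline. The sample strings are illustrative.

```python
import re

# $ tolerates one trailing newline at the end of the string.
dollar_match = re.search(r'end$', 'the end\n')
# \Z matches only at the true end of the string.
z_with_newline = re.search(r'end\Z', 'the end\n')
z_without_newline = re.search(r'end\Z', 'the end')

print(dollar_match is not None)       # True
print(z_with_newline)                 # None
print(z_without_newline is not None)  # True
```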

6. \b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

For example, \bclass\b matches only when class appears as a complete word. It does not match when class appears inside another word.

When using these special sequences, keep two points in mind. First, Python strings and regular expressions conflict with each other on some characters (recall the earlier backslash example). In a Python string literal, \b is the backspace character (ASCII value 8). Therefore, if you do not use a raw string, Python converts \b to a backspace, which is certainly not what you expected.

If we intentionally omit the 'r' prefix that marks a raw string, the result is very different: the pattern '\bclass\b' looks for the word class with a backspace character on each side, so it no longer finds class in ordinary text.

Second, this assertion cannot be used inside a character class. There, as in Python string literals, \b only represents the backspace character.
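The whole-word behaviour can be sketched like this, with sample sentences chosen for illustration:

```python
import re

whole_word = re.compile(r'\bclass\b')

# 'class' stands alone as a word here, so it matches.
standalone = whole_word.search('no class at all')
# Here 'class' is buried inside longer words, so there is no match.
inside_word = whole_word.search('the declassified algorithm')
word_suffix = whole_word.search('one subclass is')

print(standalone.span())         # (3, 8)
print(inside_word, word_suffix)  # None None
```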
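A minimal sketch of what happens without the raw-string prefix:

```python
import re

# No r prefix: Python converts each \b to a backspace (ASCII 8) before
# the regex engine ever sees the pattern.
not_raw = re.compile('\bclass\b')

# Ordinary text contains no backspaces, so nothing matches.
no_match = not_raw.search('no class at all')
# The pattern matches only when 'class' really is wrapped in backspaces.
with_backspaces = not_raw.search('\b' + 'class' + '\b')

print(no_match)                # None
print(with_backspaces.span())  # (0, 7)
```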
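The character-class case can be checked with a short sketch. The input string, which contains one real backspace, is made up for illustration.

```python
import re

# Inside [...], \b means the backspace character, not a word boundary.
backspaces = re.findall(r'[\b]', 'a\bb')

# Outside a character class, \b is the zero-width word boundary again.
boundaries = re.findall(r'\b', 'ab')

print(backspaces)  # ['\x08']
print(boundaries)  # ['', '']
```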

7. \B

Another zero-width assertion, the opposite of \b: it matches only at positions that are not word boundaries.
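As a rough illustration, \B on both sides of class finds the text only when it is embedded in a longer word. The sample strings are assumptions.

```python
import re

embedded = re.compile(r'\Bclass\B')

# 'class' sits inside 'declassified', away from any word boundary.
inside = embedded.search('declassified class')
# A standalone 'class' has boundaries on both sides, so \B fails.
alone = embedded.search('no class at all')

print(inside.span())  # (2, 7)
print(alone)          # None
```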

Grouping

In practice, we often need more information than just whether a regular expression matched. For complex content, regular expressions are often written with several groups, each of which matches a different part of interest.

For example, each RFC-822 header line is divided into a name and a value, separated by a ':', like this:

From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com

In this case, we can write a regular expression that matches an entire RFC-822 header line, using one group to match the header name and another group to match the corresponding value.

Note: RFC-822 is the standard format for e-mail messages. If you do not yet know how groups are written, don't worry; please keep reading.
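One possible way to do this is sketched below. The pattern header_re is a hypothetical choice made for this illustration, not one given in the original article, and it is far simpler than a full RFC-822 parser would need to be.

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is its value.
header_re = re.compile(r'^([\w-]+):\s*(.*)$', re.MULTILINE)

message = (
    'From: author@example.com\n'
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)\n'
    'MIME-Version: 1.0\n'
    'To: editor@example.com\n'
)

# Collect (name, value) pairs, one per header line.
headers = [(m.group(1), m.group(2)) for m in header_re.finditer(message)]
for name, value in headers:
    print(name, '->', value)
```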

In regular expressions, groups are marked with the metacharacters ( and ). They have much the same meaning as parentheses in mathematical expressions: they group together the expressions inside them, so you can apply a repetition metacharacter such as *, +, ?, or {m,n} to the contents of a group.

For example, (ab)* matches zero or more repetitions of ab.

Groups marked with ( and ) also record the start and end index of the text they match. You can retrieve this by passing the group's index as an argument to these methods: group(), start(), end() and span(). Numbering starts at 0. Group 0 always exists and stands for the entire match, so calling these methods with no argument is equivalent to passing 0.

Annotation: the number of pairs of parentheses determines the number of subgroups. For example, (a)(b) and (a(b)) each contain two subgroups.

Subgroups are numbered from left to right, and they can be nested. To determine a subgroup's number, count the opening parentheses from left to right.

The group() method also accepts several subgroup numbers at once, in which case it returns a tuple of the corresponding values.

Annotation: start() returns the starting position of the given subgroup, end() returns its end position, and span() returns both as a (start, end) tuple.

We can also use the groups() method to return, in one call, a tuple of the strings matched by all subgroups, from subgroup 1 onward.
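A short sketch of repeating a group:

```python
import re

repeated = re.compile(r'(ab)*')

# Five repetitions of 'ab' are consumed in one match.
five_times = repeated.match('ababababab').span()
# Zero repetitions is also a successful (empty) match.
zero_times = repeated.match('xyz').span()

print(five_times)  # (0, 10)
print(zero_times)  # (0, 0)
```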
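The default group 0 can be seen in a minimal sketch:

```python
import re

m = re.compile(r'(a)b').match('ab')

# group() with no argument is the same as group(0): the whole match.
whole_default = m.group()
whole_explicit = m.group(0)
# group(1) is the text captured by the first pair of parentheses.
first_group = m.group(1)

print(whole_default, whole_explicit, first_group)  # ab ab a
```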
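Counting opening parentheses from left to right, a nested pattern numbers its subgroups as in this sketch:

```python
import re

# The first '(' opens group 1; the second '(' opens group 2, nested inside it.
m = re.compile(r'(a(b)c)d').match('abcd')

whole = m.group(0)
outer = m.group(1)
inner = m.group(2)

print(whole, outer, inner)  # abcd abc b
```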
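For instance, passing several numbers to group() returns the values in the order requested:

```python
import re

m = re.compile(r'(a(b)c)d').match('abcd')

# Numbers may be given in any order and may repeat.
selected = m.group(2, 1, 2)

print(selected)  # ('b', 'abc', 'b')
```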
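A brief sketch of these three methods applied to individual subgroups:

```python
import re

m = re.compile(r'(a(b)c)d').match('abcd')

# Positions of subgroup 2, which matched 'b'.
inner_start = m.start(2)
inner_end = m.end(2)
# span() gives (start, end) in one call; with no argument it covers group 0.
outer_span = m.span(1)
whole_span = m.span()

print(inner_start, inner_end, outer_span, whole_span)  # 1 2 (0, 3) (0, 4)
```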
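For example:

```python
import re

m = re.compile(r'(a(b)c)d').match('abcd')

# groups() returns every subgroup, starting from group 1 (not group 0).
all_groups = m.groups()

print(all_groups)  # ('abc', 'b')
```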

There is one more concept to introduce: the backreference. A backreference lets you require that the text matched by an earlier group appear again at the current position. It is written as a backslash followed by a number. For example, \1 succeeds only if the exact text matched by subgroup 1 is found again at the current position.

If you only search strings, backreferences are rarely needed, because few text formats repeat data in this way. However, you will soon find that backreferences are very useful when replacing strings!

Note: in a Python string literal, a backslash followed by a number denotes the character whose (octal) value is that number. Therefore, for regular expressions that contain backreferences, we again emphasize using raw strings.
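As a small sketch, the pattern below uses \1 to find a doubled word. The sample sentence is only an illustration.

```python
import re

# Raw string: \1 reaches the regex engine as a backreference to group 1.
doubled = re.compile(r'\b(\w+)\s+\1\b')
found = doubled.search('Paris in the the spring').group()

# Without the r prefix, Python turns '\1' into the character with code 1.
is_control_char = '\1' == '\x01'

print(found)            # the the
print(is_control_char)  # True
```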

(This article is complete)

Next article: Explanation of Python3 Regular Expressions (5)
