Python -- Regular Expression (1)
Summary
This document is a guide to using regular expressions in Python through the re module. It provides a more detailed introduction than the corresponding section of the Python Library Reference.
========================================================== ========================================================== ==========
1. Introduction
Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded in Python and made available through the re module. Using this language, you specify the rules for the set of strings you want to match. That set might contain English sentences, email addresses, TeX commands, or anything else you like. You can then ask questions such as "Does this string match the pattern?" or "Is there a match for the pattern anywhere in this string?" You can also use regular expressions to modify a string or to split it apart in various ways.
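As a rough illustration of the three kinds of operation just described (searching, modifying, and splitting), here is a minimal sketch. The sample strings are made up for this example, and the patterns are plain literals so that no metacharacters are needed yet.

```python
import re

# "Is there a match for the pattern anywhere in this string?"
found = re.search("cat", "concatenate") is not None

# Modify a string: replace every occurrence of the pattern.
replaced = re.sub("cat", "dog", "cat and cat")

# Split a string wherever the pattern matches.
pieces = re.split(",", "a,b,c")

print(found)     # True
print(replaced)  # dog and dog
print(pieces)    # ['a', 'b', 'c']
```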
Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, you may need to pay careful attention to how the engine executes a given RE and write the RE in a way that makes the bytecode run faster. This document does not cover optimization, because that requires a good understanding of the matching engine's internals.
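The compile step can also be performed explicitly with re.compile, which returns a pattern object that can be reused. The following is only a small sketch with made-up sample strings; it does not show the bytecode itself.

```python
import re

# Compile the pattern once, then reuse the compiled object.
pattern = re.compile("ab")

# Hypothetical sample strings, chosen only for illustration.
results = [pattern.search(s) is not None for s in ("cab", "abc", "xyz")]

print(results)  # [True, True, False]
```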
The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. Some tasks can be done with them, but the resulting expressions are very complicated. In these cases, you are better off writing ordinary Python code. Although Python code may run slower than a well-written regular expression, it is usually easier to understand.
========================================================== ========================================================== ==========
2. Simple Patterns
We start by learning about the simplest regular expressions. Since regular expressions are used to operate on strings, we begin with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular expressions, refer to a textbook on writing compilers.
------------------------------------------------------------------------------------------------------------------------------------------------------------
2.1. Character matching
Most letters and characters simply match themselves. For example, the regular expression test matches the string test exactly. (You can enable a case-insensitive mode that lets this RE match Test or TEST as well; more on this later.)
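A quick way to check the claim above is re.fullmatch, which succeeds only when the whole string matches. This is a minimal sketch; the re.IGNORECASE flag is the case-insensitive mode mentioned in the text.

```python
import re

exact = re.fullmatch("test", "test") is not None
wrong_case = re.fullmatch("test", "TEST") is not None
ignore_case = re.fullmatch("test", "TEST", re.IGNORECASE) is not None

print(exact)        # True
print(wrong_case)   # False
print(ignore_case)  # True
```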
There are exceptions to this rule. Some characters are special metacharacters and do not match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing the various metacharacters and what they do.
Here is the complete list of metacharacters; their meanings are discussed later:
. ^ $ * + ? { } [ ] \ | ( )
The first metacharacters we look at are the square brackets, [ and ]. They specify a character class, which is the set of characters you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving the first and last characters separated by a hyphen '-'. For example, [abc] matches any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set. If you want to match only lowercase letters, your RE would be [a-z].
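The character classes described above can be tried out with re.findall, which returns every non-overlapping match. The sample strings are made up for this sketch.

```python
import re

listed = re.findall("[abc]", "abcdef")
ranged = re.findall("[a-c]", "abcdef")
lowercase = re.findall("[a-z]", "Hello World")

print(listed)     # ['a', 'b', 'c']
print(ranged)     # ['a', 'b', 'c']
print(lowercase)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
```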
Note that metacharacters (except \) are not active inside character classes. For example, [akm$] matches any of the characters 'a', 'k', 'm', or '$'. '$' is usually a metacharacter, but inside a character class it is stripped of its special nature and matches only itself.
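To see '$' behaving as an ordinary character inside a class, consider this small sketch (the sample string is an assumption made for illustration):

```python
import re

# Inside the brackets, '$' matches a literal dollar sign.
hits = re.findall("[akm$]", "make $5")

print(hits)  # ['m', 'a', 'k', '$']
```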
You can also match the characters not listed within the class by complementing the set. This is indicated by placing a '^' as the first character of the class. For example, [^5] matches any character except '5'. If the caret appears anywhere else in a character class, it has no special meaning and simply matches a literal '^'; for example, [5^] matches either '5' or '^'.
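A short sketch of a complemented class follows. It also shows that a caret placed anywhere other than first inside the brackets is treated as a literal character; the sample strings are made up.

```python
import re

# '^' first in the class: match anything except '5'.
not_five = re.findall("[^5]", "1525")

# '^' elsewhere in the class: it is just a literal caret.
five_or_caret = re.findall("[5^]", "5^6")

print(not_five)       # ['1', '2']
print(five_or_caret)  # ['5', '^']
```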
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal special sequences. It is also used to escape metacharacters so that they can be matched literally. For example, if you need to match a [ or a \, you can precede them with a backslash to remove their special meaning: \[ or \\.
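The escaping rule can be sketched as follows. Raw string literals (the r prefix) are used here so that Python itself does not interpret the backslashes before the regex engine sees them; the sample strings are made up.

```python
import re

# r"\[" matches a literal '[' instead of opening a character class.
bracket = re.search(r"\[", "a[b")

# r"\\" matches a single literal backslash; "a\\b" is the 3-character string a\b.
backslash = re.search(r"\\", "a\\b")

print(bracket.group())  # [
print(backslash.span())  # (1, 2)
```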
Some of the special sequences beginning with a backslash represent predefined sets of characters that are often useful, such as the set of decimal digits or the set of characters that are not whitespace.
Let's take an example: \w matches any alphanumeric character. If the regex pattern is expressed in bytes, this is equivalent to the class [a-zA-Z0-9_]. If the regex pattern is a string, \w matches all the characters marked as letters in the Unicode database. You can use the more restricted definition of \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression.
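The difference between string patterns, bytes patterns, and the re.ASCII flag can be seen in a small sketch. The sample text (an accented letter, an underscore, and a digit) is an assumption chosen for illustration.

```python
import re

sample = "\u00e9_1"  # the characters: e-acute, underscore, digit one

# String pattern: Unicode letters count as word characters.
unicode_hits = re.findall(r"\w", sample)

# Same string pattern, restricted to ASCII word characters.
ascii_hits = re.findall(r"\w", sample, re.ASCII)

# Bytes pattern: only ASCII letters, digits, and underscore match.
bytes_hits = re.findall(rb"\w", sample.encode("utf-8"))

print(unicode_hits)  # ['é', '_', '1']
print(ascii_hits)    # ['_', '1']
print(bytes_hits)    # [b'_', b'1']
```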
The following is an incomplete list of the special sequences formed by a backslash and a character. The equivalent classes shown apply to bytes patterns, or to string patterns when re.ASCII is set; Unicode string patterns match additional characters.
\d: matches any decimal digit; equivalent to the class [0-9]
\D: matches any non-digit character; equivalent to the class [^0-9]
\s: matches any whitespace character (including tabs and line breaks); equivalent to the class [ \t\n\r\f\v]
\S: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
\w: matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_]
\W: matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_]
These sequences can be included inside a character class. For example, [\s,.] is a character class that matches any whitespace character, ',', or '.'.
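The sequences listed above can be exercised with re.findall. This is a minimal sketch over made-up sample strings; raw strings keep the backslashes intact.

```python
import re

sample = "Room 42,\tfloor 3.\n"  # hypothetical sample text

digits = re.findall(r"\d", sample)
whitespace = re.findall(r"\s", sample)
non_digits = re.findall(r"\D", "a1b2")
non_word = re.findall(r"\W", "a_b-c")

# A special sequence used inside a character class.
separators = re.findall(r"[\s,.]", sample)

print(digits)      # ['4', '2', '3']
print(whitespace)  # [' ', '\t', ' ', '\n']
print(non_digits)  # ['a', 'b']
print(non_word)    # ['-']
print(separators)  # [' ', ',', '\t', ' ', '.', '\n']
```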
The final metacharacter in this section is '.'. It matches any character except a newline. If the re.DOTALL flag is set, it matches any character, including a newline.
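The effect of re.DOTALL on '.' can be sketched like this, using a made-up three-character string that contains a newline:

```python
import re

sample = "a\nb"

default_dot = re.findall(".", sample)
dotall_dot = re.findall(".", sample, re.DOTALL)

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```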
------------------------------------------------------------------------------------------------------------------------------------------------------------
2.2. Repeating things
Being able to match varying sets of characters is the first thing regular expressions can do that is not already possible with the methods available on Python strings. However, if that were their only advantage, regular expressions would not be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.
The first metacharacter for repeating things is *. It does not match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times, rather than exactly once.
For example, ca*t matches ct (0 a characters), cat (1 a character), caaat (3 a characters), and so on. The RE engine has internal limitations, stemming from the size of C's int type, that prevent it from matching more than about 2 billion a characters. In practice, strings that large are rarely encountered.
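The ca*t example can be checked directly. This sketch uses re.fullmatch so that the whole string must match; the string cbt is an extra, made-up negative case.

```python
import re

candidates = ("ct", "cat", "caaat", "cbt")
star_results = [re.fullmatch("ca*t", s) is not None for s in candidates]

print(star_results)  # [True, True, True, False]
```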
Repetitions such as * are greedy. When repeating part of an RE, the matching engine tries to repeat it as many times as possible. If the later portions of the pattern then fail to match, the engine backs up one character and tries again with fewer repetitions.
A step-by-step example makes this clearer. Consider the regular expression a[bcd]*b. It must first match the letter a, then zero or more characters from the class [bcd], and finally end with the letter b. Now suppose it is matched against the string 'abcbd':
Step | Matched | Explanation
1 | a | The a in the RE matches.
2 | abcbd | The engine matches [bcd]* as far as it can, which is to the end of the string.
3 | Failure | The engine tries to match b, but the current position is at the end of the string, so it fails.
4 | abcb | Back up one character, so that [bcd]* matches one character fewer.
5 | Failure | Try b again, but the character at the current position is d, so it fails.
6 | abc | Back up again, so that [bcd]* matches only bc.
7 | abcb | Try b again. This time the character at the current position is b, so the match succeeds.
The end of the RE has now been reached, so the final match is abcb. This example shows that the matching engine first goes as far as it can; if no match is found, it progressively backs up and retries the rest of the RE. It backs up until it has tried zero repetitions of the repeated item, and if that still fails, the engine concludes that the string does not match the RE at all.
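The outcome of the walk-through above can be confirmed in code. The sketch only shows the final result; the intermediate backtracking steps happen inside the engine and are not visible through the re module.

```python
import re

m = re.match("a[bcd]*b", "abcbd")

print(m.group())  # abcb
print(m.span())   # (0, 4)
```

The trailing d is left unmatched because re.match only requires the pattern to match at the start of the string, not to consume all of it.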
Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +: * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. For example, ca+t matches cat (1 a character) and caaat (3 a characters), but it does not match ct.
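The difference between * and + shows up on the string ct, as this small sketch illustrates:

```python
import re

candidates = ("ct", "cat", "caaat")
plus_results = [re.fullmatch("ca+t", s) is not None for s in candidates]
star_on_ct = re.fullmatch("ca*t", "ct") is not None

print(plus_results)  # [False, True, True]
print(star_on_ct)    # True
```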
There are two more repeating qualifiers. The question mark, ?, matches either zero times or once; you can think of it as marking something as optional. For example, home-?brew matches either homebrew or home-brew.
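A short sketch of the optional hyphen follows; home--brew is an extra, made-up case showing that ? allows at most one occurrence.

```python
import re

candidates = ("homebrew", "home-brew", "home--brew")
optional_results = [re.fullmatch("home-?brew", s) is not None for s in candidates]

print(optional_results)  # [True, True, False]
```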
The most flexible repeating qualifier is {m,n}, where m and n are decimal integers. It means there must be at least m repetitions and at most n. For example, a/{1,3}b matches a/b, a//b, and a///b. It does not match ab, which has no slashes, or a////b, which has four.
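The bounds of {m,n} can be checked against strings with zero to four slashes, as in this minimal sketch:

```python
import re

candidates = ("ab", "a/b", "a//b", "a///b", "a////b")
bounded_results = [re.fullmatch("a/{1,3}b", s) is not None for s in candidates]

print(bounded_results)  # [False, True, True, True, False]
```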
You can omit either m or n; in that case, a reasonable value is assumed for the missing one. Omitting m is interpreted as a lower limit of 0, while omitting n results in no upper bound, that is, an upper bound of infinity (subject to the engine's internal limit mentioned above). (Note: you can also write {n} to specify that the preceding character is repeated exactly n times.)
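The omitted-bound forms and the exact-count form can be sketched as follows. The helper name matches is hypothetical and exists only to keep the example short.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# {,2}: lower bound omitted, so 0 to 2 repetitions.
at_most_two = [matches("a{,2}", s) for s in ("", "aa", "aaa")]

# {2,}: upper bound omitted, so 2 or more repetitions.
at_least_two = [matches("a{2,}", s) for s in ("a", "aa", "aaaaa")]

# {3}: exactly 3 repetitions.
exactly_three = [matches("a{3}", s) for s in ("aa", "aaa", "aaaa")]

print(at_most_two)    # [True, True, False]
print(at_least_two)   # [False, True, True]
print(exactly_three)  # [False, True, False]
```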
Attentive readers may have noticed that the three other qualifiers can all be expressed with this notation: {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. In practice, prefer *, +, and ? when you can, because they are shorter and easier to read.
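The equivalences stated above can be spot-checked by running each pair of patterns over the same inputs. This is only a sketch over a handful of made-up strings, not a proof of equivalence.

```python
import re

samples = ("ct", "cat", "caat", "caaat")
pairs = [("ca*t", "ca{0,}t"), ("ca+t", "ca{1,}t"), ("ca?t", "ca{0,1}t")]

def outcomes(pattern):
    """Return which samples the pattern matches in full."""
    return [re.fullmatch(pattern, s) is not None for s in samples]

all_equal = all(outcomes(short) == outcomes(braces) for short, braces in pairs)

print(all_equal)         # True
print(outcomes("ca?t"))  # [True, True, False, False]
```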