Using regular expressions
Now let's start writing some simple regular expressions. Python provides an interface to the regular expression engine through the re module, which lets you compile regular expressions into pattern objects and then use them for matching.
Little Turtle explains: The re module is written in C, so it is far more efficient than hand-rolling the same matching logic with ordinary string methods, and compiling a regular expression improves efficiency further. From here on we will often refer to a "pattern", which means the pattern object that a regular expression is compiled into.
Compiling regular expressions
A regular expression is compiled into a pattern object, which has methods for various operations such as searching for pattern matches or performing string substitutions.
- >>> import re
- >>> p = re.compile('ab*')
- >>> p
- <_sre.SRE_Pattern object at 0x...>
re.compile() also accepts an optional flags argument, which enables various special features and syntax variations. We'll cover these later.
Now let's look at a simple example:
- >>> p = re.compile('ab*', re.IGNORECASE)
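As a rough illustration of what the flags argument changes, the sketch below compiles the same pattern with and without re.IGNORECASE and tries both on an uppercase string. The sample string 'ABBB' and the variable names are made up for this example.

```python
import re

# The same pattern, compiled without and with the IGNORECASE flag.
case_sensitive = re.compile('ab*')
case_insensitive = re.compile('ab*', re.IGNORECASE)

# 'ABBB' is a made-up sample string.
print(case_sensitive.match('ABBB'))            # None: 'A' is not 'a'
print(case_insensitive.match('ABBB').group())  # ABBB
```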
The regular expression is passed to re.compile() as a string. Because regular expressions are not part of the core Python language, there is no special syntax for them, so they can only be written as strings. (Some applications never need regular expressions at all, so the Python community saw no need to build them into the language core.) Instead, the re module is simply a C extension module included with Python, just like the socket and zlib modules.
Representing regular expressions as strings keeps Python's style concise and consistent, but it has a downside, which we discuss next.
The troublesome backslash
As mentioned in the previous article, regular expressions use the backslash character '\' to give some ordinary characters a special meaning (for example, \d matches any decimal digit) or to strip special characters of their special meaning (for example, \[ matches a literal left square bracket '['). This conflicts with Python's use of the same character for the same purpose in string literals.
Little Turtle explains: That's a mouthful, but read on and it will become clear.
Suppose you need a regular expression that matches the string '\section' in a LaTeX file. Because the backslash is a special character in regular expressions, you must put another backslash in front of it to remove its special meaning. So the regular expression has to be written as '\\section'.
But don't forget that Python also gives backslashes a special meaning inside string literals. So, to pass '\\section' intact to re.compile(), we need to add two more backslashes:
Characters | Stage
\section | The string that needs to be matched
\\section | The regular expression uses '\\' to match the character '\'
"\\\\section" | Unfortunately, a Python string literal also uses '\\' to represent the character '\'
In short, to match a single literal backslash, we need to write four backslashes in an ordinary Python string. Frequent use of backslashes in regular expressions therefore causes a backslash storm, which makes your strings extremely hard to read.
The solution is to write regular expressions as Python raw strings (just put an r in front of the string, as you may remember):
Regular string | Raw string
"ab*" | r"ab*"
"\\\\section" | r"\\section"
"\\w+\\s+\\1" | r"\w+\s+\1"
Little Turtle explains: It is strongly recommended that you write regular expressions as raw strings.
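To make the equivalence concrete, here is a minimal sketch that checks that the regular and raw spellings produce the same string, and that the resulting pattern matches a literal '\section'. The sample text '\section{Introduction}' is an invented LaTeX fragment, and the variable names are illustrative only.

```python
import re

# A regular string literal needs four backslashes to spell the regex \\
# (an escaped backslash); a raw string literal needs only two.
regular = "\\\\section"
raw = r"\\section"

print(regular == raw)  # True: both literals produce the same string
print(len(raw))        # 9: two backslashes plus the 7 letters of "section"

# The compiled pattern matches one literal backslash followed by "section".
pattern = re.compile(raw)
match = pattern.search(r"\section{Introduction}")  # invented sample text
print(match.group())   # \section
```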
Performing matches
Once you have compiled a regular expression, you have a pattern object. What do you do with it? Pattern objects have many methods and attributes. The most important ones are listed below:
Method | Function
match() | Determines whether the regular expression matches at the beginning of the string
search() | Scans through the string to find the first position where the regular expression matches
findall() | Scans through the string, finds all substrings where the regular expression matches, and returns them as a list
finditer() | Scans through the string, finds all positions where the regular expression matches, and returns them as an iterator
If no match is found, match() and search() return None. If the match succeeds, they return a match object that holds all the information about the match: where it starts, where it ends, the substring that matched, and so on.
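As a quick side-by-side comparison of the four methods, the sketch below runs each of them on the same input. The pattern \d+ and the sample string are chosen purely for illustration.

```python
import re

p = re.compile(r'\d+')      # one or more digits
text = 'room 12, floor 3'   # made-up sample string

print(p.match(text))                         # None: the string does not start with a digit
print(p.search(text).group())                # 12
print(p.findall(text))                       # ['12', '3']
print([m.span() for m in p.finditer(text)])  # [(5, 7), (15, 16)]
```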
Let's walk through the following steps:
- >>> import re
- >>> p = re.compile('[a-z]+')
- >>> p
- re.compile('[a-z]+')
Now you can try matching various strings against the regular expression [a-z]+.
For example:
- >>> p.match("")
- >>> print(p.match(""))
- None
The empty string cannot match, because + means "one or more repetitions". So match() returns None.
Now let's try a string that does match:
- >>> m = p.match('fishc')
- >>> m
- <_sre.SRE_Match object; span=(0, 5), match='fishc'>
In this example, match() returns a match object, which we store in the variable m for later use.
Let's take a look at what's inside the match object. Match objects have many methods and attributes. The most important ones are:
Method | Function
group() | Returns the string that was matched
start() | Returns the starting position of the match
end() | Returns the ending position of the match
span() | Returns a tuple (start, end) giving the position of the match
Take a look:
- >>> m.group()
- 'fishc'
- >>> m.start()
- 0
- >>> m.end()
- 5
- >>> m.span()
- (0, 5)
With match(), start() always returns 0, because match() only checks whether the regular expression matches at the beginning of the string.
The search() method, however, scans through the whole string, so the match may start somewhere else:
- >>> print(p.match('^_^fishc'))
- None
- >>> m = p.search('^_^fishc')
- >>> print(m)
- <_sre.SRE_Match object; span=(3, 8), match='fishc'>
- >>> m.group()
- 'fishc'
- >>> m.span()
- (3, 8)
In real programs, the most common approach is to store the match object in a variable and then check whether it is None.
The form is usually as follows:
- p = re.compile(...)
- m = p.match('string goes here')
- if m:
-     print('Match found: ', m.group())
- else:
-     print('No match')
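For a version of this idiom that runs as written, here is a small sketch with a concrete pattern filled in. The helper name describe_match and the sample strings are hypothetical, chosen only for illustration.

```python
import re

def describe_match(pattern, text):
    """Hypothetical helper: report whether `pattern` matches at the start of `text`."""
    m = pattern.match(text)
    if m:  # a match object is always truthy, while None is falsy
        return 'Match found: ' + m.group()
    return 'No match'

p = re.compile('[a-z]+')
print(describe_match(p, 'fishc'))     # Match found: fishc
print(describe_match(p, '^_^fishc'))  # No match
```

Testing the variable directly with `if m:` works because match() returns either a match object or None, never an empty match object.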
Two methods return all of the matches: findall() and finditer().
findall() returns a list:
- >>> p = re.compile(r'\d+')
- >>> p.findall('3 little turtles, 15 legs, where are the other 3?')
- ['3', '15', '3']
findall() has to build the entire list before it returns, whereas finditer() returns the match objects as an iterator:
- >>> iterator = p.finditer('3 little turtles, 15 legs, where are the other 3?')
- >>> iterator
- <callable_iterator object at 0x10511b588>
- >>> for match in iterator:
- ...     print(match.span())
- ...
- (0, 1)
- (18, 20)
- (47, 48)
Little Turtle explains: If the list of results is large, returning an iterator is much more efficient. For more on iterators, see the "Zero-Basics Introduction to Learning Python" course, lesson 048 | Magic Methods: Iterators
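The difference can be sketched as follows: findall() produces every result at once, while finditer() lets you pull matches one at a time and stop early. The sample sentence and variable names are assumptions made for this example.

```python
import re

p = re.compile(r'\d+')
text = '3 little turtles, 15 legs, where are the other 3?'  # sample sentence

# findall() builds the complete list of matched strings up front.
all_numbers = p.findall(text)

# finditer() yields match objects one at a time; next() pulls only the first.
first = next(p.finditer(text))

print(all_numbers)                  # ['3', '15', '3']
print(first.group(), first.span())  # 3 (0, 1)
```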
Python 3: How to use regular expressions gracefully (Part 2)