This document is translated from the official documentation:
Regular Expression HOWTO
======================================================================================
6. Common Problems
Regular expressions are a powerful tool for some applications, but their behavior is not always intuitive, and at times they do not behave the way you might expect. This section points out some of the most common pitfalls.
--------------------------------------------------------------------------------------------------------------- ---------------------------------------
6.1. Using string methods
Sometimes using the re module is a mistake. If you are matching a fixed string or a single character class, and you are not using any re features such as the IGNORECASE flag, then you do not need regular expressions. Python strings have several methods for working with fixed strings, and they are usually much faster, because each is implemented as a single small C loop optimized for its purpose rather than going through the larger, more general regular expression engine.
One example is replacing one fixed string with another, such as replacing word with deed. re.sub() seems like the function to use, but consider the string's replace() method first. Note that replace() also replaces word inside other words, turning swordfish into sdeedfish, but the naive regular expression word would do the same. To avoid substituting in the middle of a word, the pattern has to be \bword\b, which requires a word boundary on both sides of word. That is beyond what replace() can do, so here a regular expression is justified.
Another common task is deleting every occurrence of a single character from a string, or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but the string's translate() method can do both tasks and usually runs faster than a regular expression.
In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
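The difference between the two approaches can be sketched as follows. The sample sentence and the variable names are made up for illustration; only str.replace() and re.sub() come from the text above.

```python
import re

# Hypothetical sample text containing "word" both alone and inside another word.
text = "word swordfish word."

# str.replace() substitutes every occurrence, including the one inside "swordfish".
replaced_everywhere = text.replace("word", "deed")

# The \b anchors restrict the substitution to whole words.
replaced_whole_words = re.sub(r"\bword\b", "deed", text)

print(replaced_everywhere)   # deed sdeedfish deed.
print(replaced_whole_words)  # deed swordfish deed.
```

If whole-word matching is not needed, the replace() call is the simpler choice.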
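As a rough illustration in Python 3, where translation tables are built with str.maketrans(), the same single-character work might look like this. The sample string and the variable names are assumptions for the sketch.

```python
import re

# Hypothetical sample string with embedded newlines.
s = "line one\nline two\nline three"

# Regular expression version: replace each newline with a space.
via_re = re.sub("\n", " ", s)

# str.translate() version: the table maps the newline character to a space.
via_translate = s.translate(str.maketrans("\n", " "))

# Deleting instead of replacing: the third argument to str.maketrans()
# lists the characters to drop.
without_newlines = s.translate(str.maketrans("", "", "\n"))

print(via_re == via_translate)  # True
print(without_newlines)         # line oneline twoline three
```

For a single replacement like this one, s.replace("\n", " ") would also work; translate() becomes more attractive when several characters are mapped or deleted in one pass.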
--------------------------------------------------------------------------------------------------------------- ---------------------------------------
6.2. match() versus search()
The match() function only checks whether the regular expression matches at the beginning of the string, while search() scans forward through the string looking for a match. It is important to keep this distinction in mind. Remember, match() only reports a successful match that starts at position 0; if the match would start anywhere else, match() does not report it.
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None
The search() function, on the other hand, scans forward through the string and reports the first match it finds:
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)
Sometimes you may be tempted to keep using re.match() and simply add .* to the front of your regular expression. Resist that temptation and use re.search() instead. The regular expression compiler analyzes each pattern in order to speed up matching. One such analysis works out what the first character of a match must be; for example, a pattern starting with Crow must match starting with the character C. This lets the engine scan quickly through the string looking for that starting character, and attempt the full match only when a C is found.
Adding .* defeats this optimization: the engine has to scan to the end of the string and then backtrack to find a match for the rest of the regular expression. Use re.search() instead.
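A small sketch of the two spellings side by side. The sample sentence is invented for illustration; the Crow pattern comes from the paragraph above.

```python
import re

# Hypothetical sample text containing "Crow" twice.
text = "A murder of Crows; one Crow flew off"

# Preferred: search() looks for the pattern itself and reports the first hit.
found = re.search("Crow", text)

# Discouraged: match() with a leading .* always starts at position 0, and the
# greedy .* runs to the end of the string before backtracking, so the match
# ends at the LAST "Crow" rather than the first.
prefixed = re.match(".*Crow", text)

print(found.span())     # span of the first "Crow"
print(prefixed.span())  # starts at 0 and ends after the last "Crow"
```

Besides losing the first-character optimization, the .* version also hides where the match begins, since its span always starts at 0.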
----------------------------------------------------------------------------------------------------------- -------------------------------------------
6.3. Greedy versus Non-Greedy
When a regular expression is repeated, as in a*, the repetition consumes as much of the string as possible. This often bites you when you are trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag does not work because of the greedy nature of .*.
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>
The regular expression first matches the opening angle bracket < in <html>, and the .* then consumes the rest of the string. There is still more of the pattern left, though, and the > cannot match at the end of the string, so the engine has to backtrack character by character until it finds a match for the >. The final match runs from the < in <html> to the > in </title>, which is not what you want.
In this case, the solution is to use the non-greedy qualifiers *?, +?, ?? or {m,n}?, which match as little text as possible. In the example above, the > is tried immediately after the first < matches; if it fails, the engine advances one character at a time, retrying the > at every step, until it succeeds or reaches the end of the string. This gives the result we want:
>>> print(re.match('<.*?>', s).group())
<html>
Note: parsing HTML or XML with regular expressions is painful. A quick pattern will handle the common cases, but HTML and XML have special cases that will break the obvious regular expression, and by the time you have written one that handles every possible case, it will be very complicated. For such tasks, use an HTML or XML parser module instead.
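The following sketch contrasts the greedy and non-greedy patterns using re.findall(), and then shows one possible parser-based alternative using html.parser from the standard library. The sample markup and the TagCollector class are made up for illustration and are not part of the original text.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup.
sample = "<p>Hello <b>world</b></p>"

# Greedy: .* runs to the last ">" in the string, so there is a single match.
greedy = re.findall(r"<.*>", sample)

# Non-greedy: .*? stops at the first ">" each time, yielding one match per tag.
lazy = re.findall(r"<.*?>", sample)


class TagCollector(HTMLParser):
    """Illustrative helper that records the name of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(sample)
collector.close()

print(greedy)          # ['<p>Hello <b>world</b></p>']
print(lazy)            # ['<p>', '<b>', '</b>', '</p>']
print(collector.tags)  # ['p', 'b']
```

The non-greedy pattern is fine for simple input like this, but the parser is the safer route once attributes, comments, or malformed markup are involved.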
--------------------------------------------------------------------------------------------------------------- ---------------------------------------
6.4. Using re.VERBOSE
By now you have probably noticed that regular expressions are a very compact notation, but they are not very readable. Moderately complex regular expressions can become long runs of backslashes, parentheses, and metacharacters, which makes them hard to read and understand.
For such regular expressions, specifying the re.VERBOSE flag when compiling can help, because it lets you lay out the pattern more clearly.
The re.VERBOSE flag has several effects. Whitespace in the regular expression that is not inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but the character class [a b] still matches the characters a, b, or a space. You can also put comments inside the regular expression; a comment runs from a # character to the end of the line. Combined with a triple-quoted string, this lets the regular expression be formatted much more neatly:
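pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)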
This is much clearer than what is written below:
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
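As a quick check that the verbose and compact spellings behave the same, here is a sketch that runs both against one header line. The sample line and the names verbose_pat, compact_pat, m_verbose and m_compact are assumptions for illustration.

```python
import re

# Verbose spelling: whitespace outside character classes is ignored and
# everything from "#" to the end of the line is a comment.
verbose_pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (non-greedy)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

# Compact spelling of the same pattern.
compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Hypothetical header line with trailing whitespace.
line = "Content-Type: text/plain   "

m_verbose = verbose_pat.match(line)
m_compact = compact_pat.match(line)

print(m_verbose.group("header"))          # Content-Type
# The pattern trims trailing whitespace only, so the space that follows
# the colon is still part of the value; strip() removes it here.
print(m_verbose.group("value").strip())   # text/plain
print(m_verbose.groupdict() == m_compact.groupdict())  # True
```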