Python 3: How to Use Regular Expressions Gracefully (Part 1)

Source: Internet
Author: User
Tags character classes repetition expression engine

Note: This article is translated from the Regular Expression HOWTO, with comments and modifications by Little Turtle.


Introduction to Regular Expressions

Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available to programmers through the re module. Using this little language, you specify rules that describe the set of strings you want to match. That set might contain English sentences, e-mail addresses, TeX commands, or anything else you like.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, you may need to pay careful attention to how the engine executes a given RE, and write the RE so that it produces bytecode that runs faster. This article does not cover optimization, because that requires a good understanding of the matching engine's internals. The examples in this article do, however, all use standard regular expression syntax.

Little Turtle's note: Python's regular expression engine is written in C, so it is very efficient. Also, the so-called regular expression (the RE here) is the set of "rules" mentioned above.

The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. Some tasks can be done with regular expressions, but the expressions turn out to be very complicated. In those cases you may be better off writing ordinary Python code: it may run more slowly than an elaborate regular expression, but it will probably be easier to understand.

Little Turtle's note: This is probably what people mean by "getting the unpleasant part out of the way first". Setting that aside, regular expressions are excellent. They can handle 98.3% of your text tasks, so be sure to learn them well!


Simple Patterns

We'll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we'll begin with the most common task: matching characters.


Character Matching

Most letters and characters simply match themselves. For example, the regular expression FishC will exactly match the string "FishC". (You can enable case-insensitive mode, which lets FishC also match "FISHC" or "fishc"; we'll come back to this later.)
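
The following is a minimal sketch of literal matching using Python's re module. The sample strings and variable names are illustrative only; re.fullmatch (match the whole string) and re.IGNORECASE (case-insensitive mode) are standard-library names.

```python
import re

# A pattern made of ordinary letters matches exactly that text.
exact = re.fullmatch("FishC", "FishC")

# Matching is case-sensitive by default, so this returns None.
wrong_case = re.fullmatch("FishC", "FISHC")

# re.IGNORECASE turns on case-insensitive mode.
ignore_case = re.fullmatch("FishC", "fishc", re.IGNORECASE)

print(exact is not None)        # True
print(wrong_case is None)       # True
print(ignore_case is not None)  # True
```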

There are exceptions to this rule, of course. A few characters are special metacharacters: they do not match themselves, but instead define character classes, subgroups, pattern repetition, and so on. Much of this article is devoted to the various metacharacters and what they do.

Below is a complete list of metacharacters (we'll walk through them):

.   ^   $   *   +   ?   { }   [ ]   \   | ( )

Little Turtle's note: Without these metacharacters, regular expressions would be as ordinary as the string find() method...


Let's look at the square brackets [ ] first. They specify a character class, which holds the set of characters you want to match. You can list the characters individually, or you can give two characters separated by a hyphen - to specify a range. For example, [abc] matches the character 'a', 'b', or 'c'; [a-c] does the same thing, using a range to express the same set of characters. If you want to match only lowercase letters, your RE can be written as [a-z].
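
A short sketch of character classes, assuming the standard re.findall function; the sample strings are made up for illustration.

```python
import re

text = "abcdxyz"

# Listing the characters one by one ...
listed = re.findall(r"[abc]", text)

# ... gives the same result as a range written with a hyphen.
ranged = re.findall(r"[a-c]", text)

# [a-z] matches any single lowercase ASCII letter.
lowercase = re.findall(r"[a-z]", "Hello World 42")

print(listed)     # ['a', 'b', 'c']
print(ranged)     # ['a', 'b', 'c']
print(lowercase)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
```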

Note that metacharacters (except the backslash) lose their usual special meaning inside square brackets; in a character class they simply match themselves. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'. '$' is normally a metacharacter, but inside square brackets it has no special meaning and matches only the '$' character itself.
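
As a quick illustration (the sample string is an assumption chosen for this sketch), '$' inside a class is found as a literal character:

```python
import re

# Inside a class, '$' is an ordinary character rather than an end-of-line anchor.
found = re.findall(r"[akm$]", "make $5")

print(found)  # ['m', 'a', 'k', '$']
```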

You can also match every character that is not listed in the square brackets. To do so, put a caret ^ as the first character of the class. For example, [^5] will match any character except '5'.
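
A small sketch of a negated class; the input strings are illustrative. The second call shows that a caret that is not the first character of the class is just a literal caret.

```python
import re

# A leading caret negates the class: match anything except '5'.
not_five = re.findall(r"[^5]", "1525a")

# A caret anywhere else in the class is an ordinary character.
five_or_caret = re.findall(r"[5^]", "5^6")

print(not_five)       # ['1', '2', 'a']
print(five_or_caret)  # ['5', '^']
```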


Perhaps the most important metacharacter is the backslash, \. As in Python string literals, a backslash placed in front of a metacharacter switches off that metacharacter's "special function". For example, if you need to match the symbol [ or \, you can put a backslash in front of it to remove its special meaning: \[ or \\.
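
A minimal sketch of escaping, with made-up sample strings. It assumes the patterns are written as raw strings (r"..."), so that Python passes the backslashes to the regular expression engine unchanged.

```python
import re

# \[ matches a literal opening square bracket.
open_brackets = re.findall(r"\[", "a[0] = b[1]")

# \\ matches a single literal backslash in the subject string.
backslashes = re.findall(r"\\", "C:\\temp\\log.txt")

print(len(open_brackets))  # 2
print(len(backslashes))    # 2
```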

A backslash followed by certain characters can also signal a special sequence, for example one that represents the set of decimal digits, the set of letters, or the set of everything that is not whitespace.

Little Turtle's explanation: The backslash is really handy. A backslash in front of a metacharacter removes its special function, and a backslash in front of certain ordinary characters gives them a special function.

Let's take an example: \w matches any alphanumeric character. If the regular expression pattern is a bytes object, this is equivalent to the character class [a-zA-Z0-9_]. If the pattern is a string, \w matches every character that is marked as a letter in the Unicode database (provided by the unicodedata module). You can restrict the definition of \w further by supplying the re.ASCII flag when compiling the regular expression.

Little Turtle's explanation: The re.ASCII flag makes \w match only ASCII characters. Don't forget that Python 3 strings are Unicode.
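
The difference between str patterns, bytes patterns, and re.ASCII can be sketched as follows. The sample text (with two accented letters written as escapes) is an assumption made for this illustration.

```python
import re

text = "caf\u00e9_42 na\u00efve"  # "café_42 naïve"

# str pattern: \w follows Unicode, so accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: \w only matches ASCII word characters as well.
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['café_42', 'naïve']
print(ascii_words)    # ['caf', '_42', 'na', 've']
print(byte_words)     # [b'caf', b'_42', b'na', b've']
```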

The following list shows some of the special sequences formed by a backslash plus a character:

Special sequence    Meaning
\d                  Matches any decimal digit; for ASCII text, equivalent to the class [0-9]
\D                  The opposite of \d: matches any character that is not a decimal digit; equivalent to the class [^0-9]
\s                  Matches any whitespace character (space, newline, tab, and so on); equivalent to the class [ \t\n\r\f\v]
\S                  The opposite of \s: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
\w                  Matches any alphanumeric character; see the explanation above
\W                  The opposite of \w
\b                  Matches the beginning or end of a word (a word boundary)
\B                  The opposite of \b: matches a position that is not a word boundary
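
A hedged sketch of the sequences listed above, run against made-up sample strings. Uppercase sequences are the complements of their lowercase counterparts.

```python
import re

text = "Room 42, floor 3"

digits = re.findall(r"\d+", text)      # runs of decimal digits
non_digits = re.findall(r"\D+", text)  # runs of everything else
spaces = re.findall(r"\s", text)       # individual whitespace characters
chunks = re.findall(r"\S+", text)      # runs of non-whitespace

# \b only matches at a word boundary, \B only where there is none.
whole_word = re.search(r"\bcat\b", "the cat sat")
no_whole_word = re.search(r"\bcat\b", "concatenate")
inside_word = re.search(r"\Bcat\B", "concatenate")

print(digits)      # ['42', '3']
print(non_digits)  # ['Room ', ', floor ']
print(spaces)      # [' ', ' ', ' ']
print(chunks)      # ['Room', '42,', 'floor', '3']
print(whole_word is not None, no_whole_word is None, inside_word is not None)
```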


These sequences can also be included inside a character class, where they keep their special meaning. For example, [\s,.] is a character class that matches any whitespace character (the special meaning of \s), ',', or '.'.
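
For instance, under the assumption of a small made-up input, the class [\s,.] picks out the comma, the space, and the period:

```python
import re

# \s keeps its meaning inside the class; ',' and '.' match themselves.
separators = re.findall(r"[\s,.]", "a, b.c")

print(separators)  # [',', ' ', '.']
```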

The last metacharacter in this section is ., which matches any character except a newline. If you set the re.DOTALL flag, . matches any character, including a newline.
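
A brief sketch of the dot with and without re.DOTALL; the two-line sample string is illustrative.

```python
import re

text = "a\nb"

# By default the dot skips the newline.
default_dot = re.findall(r".", text)

# With re.DOTALL the newline is matched too.
dotall_dot = re.findall(r".", text, re.DOTALL)

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```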


Repeating Things

Being able to match varying sets of characters is the first thing regular expressions can do that is not already possible with the methods available on Python strings. But if you think that is their only advantage, you are too young, too naive. Another powerful capability is that you can specify that portions of the RE must be repeated a certain number of times.


Let's look at the metacharacter * first. It does not match the literal character '*' (as noted, metacharacters have special abilities); instead, it specifies that the previous character can be matched zero or more times.

For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so on. Note that the regular expression engine caps the number of repetitions because of the size of its internal C integer types (roughly 2 billion 'a' characters, though the exact limit may vary between Python builds), but in practice we rarely work with that much data.
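
A minimal check of the examples above. The extra string 'cbt' and the dictionary name are assumptions added to show a non-match.

```python
import re

pattern = re.compile(r"ca*t")

# fullmatch requires the whole string to fit the pattern.
star_results = {s: pattern.fullmatch(s) is not None
                for s in ["ct", "cat", "caaat", "cbt"]}

print(star_results)
# {'ct': True, 'cat': True, 'caaat': True, 'cbt': False}
```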


Repetition in regular expressions is greedy by default: when repeating part of an RE, the matching engine tries to match as many repetitions as possible. If later portions of the pattern then fail to match, the engine backs up one character and tries again with fewer repetitions.

Let's see what "greedy" means through an example. Consider the expression a[bcd]*b: it matches the character 'a', then zero or more characters from the class [bcd], and finally ends with 'b'. Now imagine matching this RE against the string 'abcbd'. What happens?

Step    Matched    Explanation
1       a          The 'a' in the RE matches.
2       abcbd      The engine matches [bcd]* as far as it can, which is to the end of the string.
3       Failure    The engine tries to match the final 'b' of the RE, but the current position is already the end of the string, so it fails.
4       abcb       Back up, so that [bcd]* matches one character fewer.
5       Failure    Try the final 'b' of the RE again, but the character at the current position is the last one, 'd', so it fails.
6       abc        Back up again, so that [bcd]* now matches only 'bc'.
7       abcb       Try 'b' again. This time the character at the current position is 'b', so the match succeeds.


In the end, the RE matches 'abcb'.

Little Turtle's explanation: Regular expressions match greedily by default; a later part of this series shows how to match in a non-greedy way.
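
The outcome of the walk-through can be confirmed with a short sketch. The variable names are illustrative; re.match anchors the match at the start of the string.

```python
import re

# Without the trailing 'b', the greedy class swallows the rest of the string.
greedy_prefix = re.match(r"a[bcd]*", "abcbd").group()

# With the trailing 'b', the engine has to back up until a 'b' can match.
full_pattern = re.match(r"a[bcd]*b", "abcbd").group()

print(greedy_prefix)  # abcbd
print(full_pattern)   # abcb
```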


Another repetition metacharacter is +, which specifies that the previous character is matched one or more times.

Pay special attention to the difference between * and +: * matches zero or more times, so the repeated content may not appear at all, while + requires at least one occurrence. For example, ca+t will match 'cat' and 'caaat', but not 'ct'.
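
A side-by-side sketch of the two metacharacters on the same inputs; the helper names are assumptions for this illustration.

```python
import re

samples = ["ct", "cat", "caaat"]

# * allows zero repetitions, + requires at least one.
with_star = [s for s in samples if re.fullmatch(r"ca*t", s)]
with_plus = [s for s in samples if re.fullmatch(r"ca+t", s)]

print(with_star)  # ['ct', 'cat', 'caaat']
print(with_plus)  # ['cat', 'caaat']
```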


There are two more repetition metacharacters. One of them is the question mark, ?, which specifies that the previous character is matched zero or one time. You can think of it as marking something as optional. For example, Turtles? can match 'Turtle', and it can also match 'Turtles'.
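
A small sketch of an optional character. The pattern colou?r and the test words are not from the article; they are chosen here only to illustrate the question mark.

```python
import re

pattern = re.compile(r"colou?r")

# The 'u' may appear zero times or once, but not twice.
optional_results = {s: pattern.fullmatch(s) is not None
                    for s in ["color", "colour", "colouur"]}

print(optional_results)
# {'color': True, 'colour': True, 'colouur': False}
```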


The most flexible repetition metacharacter is {m,n}, where m and n are decimal integers. Every repetition metacharacter described above can be expressed with it. It means that the previous character must be matched at least m times and at most n times. Write it without spaces inside the braces. For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b', but it will not match 'ab' (no slash) or 'a////b' (more than three slashes).
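
The bounded repetition can be checked with a short sketch; the variable names are illustrative. The last line shows that Python's re treats the braces as ordinary text when a space is written inside them, so the quantifier should be written as {1,3}.

```python
import re

pattern = re.compile(r"a/{1,3}b")

candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]
bounded_results = [pattern.fullmatch(s) is not None for s in candidates]

# With a space after the comma, the braces are no longer a quantifier.
with_space = re.fullmatch(r"a/{1, 3}b", "a//b")

print(bounded_results)     # [False, True, True, True, False]
print(with_space is None)  # True
```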

You can omit either m or n, in which case the engine assumes a reasonable value for the missing one. Omitting m is interpreted as a lower bound of 0, and omitting n is interpreted as an upper bound of infinity (in practice, the limit of about 2 billion mentioned above).

Little Turtle's explanation: {,n} is equivalent to {0,n}; {m,} is equivalent to {m,+infinity}; and {n} repeats the previous character exactly n times.
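
The three shortened forms can be sketched as follows; the helper function matches and the sample strings are assumptions made for this illustration.

```python
import re

def matches(pattern, text):
    """Return True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# {,2}: lower bound defaults to 0, so the empty string is accepted.
at_most_two = [matches(r"a{,2}", s) for s in ["", "aa", "aaa"]]

# {2,}: no upper bound.
at_least_two = [matches(r"a{2,}", s) for s in ["a", "aa", "aaaa"]]

# {3}: exactly three repetitions.
exactly_three = [matches(r"a{3}", s) for s in ["aa", "aaa", "aaaa"]]

print(at_most_two)    # [True, True, False]
print(at_least_two)   # [False, True, True]
print(exactly_three)  # [False, True, False]
```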


Sharp-eyed readers will have noticed that *, +, and ? can all be replaced by {m,n}: {0,} is the same as *, {1,} is the same as +, and {0,1} is the same as ?. Still, we encourage you to remember and use *, +, and ?, because they are shorter and easier to read.
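
A quick sketch that compares each short form with its brace equivalent on a few made-up samples. It only checks that both forms accept the same strings; it says nothing about their relative speed.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

# Each pair holds a short form and its {m,n} spelling.
pairs = [
    (r"ca*t", r"ca{0,}t"),
    (r"ca+t", r"ca{1,}t"),
    (r"ca?t", r"ca{0,1}t"),
]

# True when both spellings accept exactly the same sample strings.
same_behaviour = [
    all(bool(re.fullmatch(short, s)) == bool(re.fullmatch(braces, s))
        for s in samples)
    for short, braces in pairs
]

print(same_behaviour)  # [True, True, True]
```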

Little Turtle's explanation: Another reason is that the matching engine is optimized for *, +, and ?, so they are more efficient.


