Python 3 Regular Expressions in Detail (I)

Source: Internet
Author: User
Tags character classes repetition expression engine

This article is translated from: https://docs.python.org/3.4/howto/regex.html

The blogger has added some annotations and changes along the way ^_^

Introduction to Regular Expressions

Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify rules that describe the set of strings you want to match. That set might contain English sentences, e-mail addresses, TeX commands, or anything else you like.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, you may need to pay close attention to how the engine executes a given RE, and write the RE so that it produces bytecode that runs faster. This article does not cover optimization, because that requires a good understanding of the matching engine's internals. The examples here do, however, all use standard regular expression syntax.

Annotation: Python's regular expression engine is written in C, so it is very efficient. Also, the so-called regular expression (the RE in this article) is the set of "rules" mentioned above.
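As a minimal sketch of the compile step described above: re.compile turns a pattern into a reusable pattern object. The sample lines and the variable names (tex, lines, hits) are made up for illustration.

```python
import re

# Compile the pattern once; the resulting object can be reused many times.
tex = re.compile(r"TeX")

lines = ["plain TeX commands", "no match here", "LaTeX builds on TeX"]

# search() looks for the pattern anywhere in the string.
hits = [line for line in lines if tex.search(line)]
print(hits)   # ['plain TeX commands', 'LaTeX builds on TeX']
```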

The regular expression language is relatively small and restricted, so not every string-processing task can be done with regular expressions. Some tasks can be done with them, but the resulting expressions turn out to be very complicated. In such cases you may be better off writing ordinary Python code: it may run more slowly than an elaborate regular expression, but it will probably be easier to understand.

Annotation: This is what we call "getting the unpleasant part out of the way first". Caveats aside, regular expressions are excellent and can handle 98.3% of your text-processing tasks, so be sure to learn them well ~ ~ ~

Simple Patterns

We'll start with the simplest regular expressions. Since regular expressions are most often used to operate on strings, we begin with the most common task: matching characters.

Matching Characters

Most letters and characters simply match themselves. For example, the regular expression fanfan will match the string "fanfan" exactly. (You can enable a case-insensitive mode that lets fanfan also match "Fanfan" or "FANFAN"; more on that later.)
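A small sketch of literal matching, using the lowercase pattern fanfan. The sample strings are invented; re.IGNORECASE is the standard flag for the case-insensitive mode mentioned above.

```python
import re

# A pattern without metacharacters matches exactly the same text.
exact = re.search(r"fanfan", "hello fanfan")
print(exact.group())        # fanfan

# By default matching is case-sensitive, so this finds nothing.
wrong_case = re.search(r"fanfan", "hello FanFan")
print(wrong_case)           # None

# re.IGNORECASE turns on case-insensitive matching.
relaxed = re.search(r"fanfan", "hello FanFan", re.IGNORECASE)
print(relaxed.group())      # FanFan
```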

There are exceptions to this rule: a few special characters, called metacharacters, do not match themselves. Instead, they define character classes, group subpatterns, repeat parts of the pattern, and so on. Much of this article is devoted to the various metacharacters and what they do.

Here is the complete list of metacharacters (we will go through them one by one):

. ^ $ * + ? {} [] \ | ()

Annotation: Without these metacharacters, regular expressions would be as ordinary as the string find() method ...

Let's look at the square brackets [ ] first. They specify a character class, which is the set of characters you want to match. Characters can be listed individually, or a range can be given as two characters separated by a hyphen. For example, [abc] matches any of the characters a, b, or c; [a-c] does the same thing, using a range to express the same set. To match only lowercase letters, your RE would be [a-z].
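The following sketch compares a listed class with the equivalent range. The sample strings are made up, and re.findall is used only because it conveniently lists every match.

```python
import re

text = "abcdxyz"

listed = re.findall(r"[abc]", text)   # characters listed one by one
ranged = re.findall(r"[a-c]", text)   # the same set written as a range
print(listed, ranged)                 # ['a', 'b', 'c'] ['a', 'b', 'c']

# [a-z] matches lowercase letters only: 'H', the digits and the spaces are skipped.
lower = re.findall(r"[a-z]", "Hi 42 there")
print(lower)                          # ['i', 't', 'h', 'e', 'r', 'e']
```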

Note that metacharacters, with the exception of the backslash, are not active inside square brackets; within a character class they only match themselves. For example, [akm$] matches any of the characters 'a', 'k', 'm', or '$'. '$' is normally a metacharacter, but inside a character class it loses its special meaning and matches only the '$' character itself.

You can also match every character not listed in the class by putting a caret ^ at the beginning of the class. For example, [^5] matches any character except '5'.
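Both points can be checked with a short sketch: a '$' inside a class is an ordinary character, and a leading ^ negates the class. The sample strings are invented for illustration.

```python
import re

# Inside [...] the '$' is literal, so it is found like any other listed character.
dollar = re.findall(r"[akm$]", "make $5")
print(dollar)      # ['m', 'a', 'k', '$']

# A leading ^ complements the class: everything except '5' matches.
not_five = re.findall(r"[^5]", "1525")
print(not_five)    # ['1', '2']
```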

Perhaps the most important metacharacter is the backslash \. As in Python string literals, when a metacharacter is preceded by a backslash, its special meaning is switched off. For example, to match a [ or a \, precede it with a backslash to remove its special meaning: \[ or \\.
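A minimal sketch of escaping, assuming a made-up sample string. Raw string literals (the r prefix) are used so that Python itself does not process the backslashes before the re module sees them.

```python
import re

text = r"a[b\c"          # five characters: a [ b \ c

brackets = re.findall(r"\[", text)      # \[ matches a literal '['
backslashes = re.findall(r"\\", text)   # \\ matches a literal backslash

print(brackets)      # ['[']
print(backslashes)   # ['\\']  (one backslash, shown in repr form)
```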

A backslash followed by certain characters also forms a special sequence, for example one that stands for the set of decimal digits, the set of letters, or the set of all non-whitespace characters.

Annotation: The backslash is really handy: put it before a metacharacter and the special meaning is removed; put it before certain ordinary characters and a special meaning is gained.

Here is an example: \w matches any word character. If the pattern is a bytes object, this is equivalent to the character class [a-zA-Z0-9_]. If the pattern is a string, \w matches every character marked as a letter in the Unicode database (provided by the unicodedata module), as well as digits and the underscore. You can restrict \w to the narrower definition by supplying the re.ASCII flag when compiling the regular expression.

Annotation: The re.ASCII flag makes \w match only ASCII characters. Don't forget that strings in Python 3 are Unicode.
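The difference can be seen in a short sketch. The sample text is invented, and the accented letters are written as \u escapes so the example does not depend on the file encoding.

```python
import re

text = "na\u00efve_1 caf\u00e9"   # "naïve_1 café"

# str pattern: \w follows the Unicode definition, so accented letters count.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)    # ['naïve_1', 'café']

# With re.ASCII, \w shrinks to [a-zA-Z0-9_] and the accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(ascii_words)      # ['na', 've_1', 'caf']

# bytes pattern: \w is always [a-zA-Z0-9_].
bytes_words = re.findall(rb"\w+", b"abc_1 def")
print(bytes_words)      # [b'abc_1', b'def']
```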

The following table lists some of the special sequences formed by a backslash plus a character:

Special sequence    Meaning
\d                  Matches any decimal digit; equivalent to the class [0-9]
\D                  The opposite of \d: matches any non-digit character; equivalent to the class [^0-9]
\s                  Matches any whitespace character (space, newline, tab, and so on); equivalent to the class [ \t\n\r\f\v]
\S                  The opposite of \s: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
\w                  Matches any word character; see the explanation above
\W                  The opposite of \w
\b                  Matches at the beginning or end of a word
\B                  The opposite of \b
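A few of these sequences in action, as a rough sketch with invented sample strings. Note the raw strings: in an ordinary Python string literal, "\b" would be a backspace character rather than a word boundary.

```python
import re

text = "Room 42, floor 7"

digits = re.findall(r"\d", text)             # every single digit
print(digits)                                # ['4', '2', '7']

chunks = re.findall(r"\S+", text)            # runs of non-whitespace
print(chunks)                                # ['Room', '42,', 'floor', '7']

parts = re.split(r"\s+", "a b\tc\nd")        # split on any whitespace
print(parts)                                 # ['a', 'b', 'c', 'd']

# \b requires a word boundary, \B requires that there is none.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = re.findall(r"\Bcat\b", "cat concat cat.")
print(len(whole_word), len(inside_word))     # 2 1
```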

These sequences can be used inside a character class, where they keep their special meaning. For example, [\s,.] is a character class that matches any whitespace character (the special meaning of \s), ',' or '.'.
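One possible use of such a class, sketched with a made-up sentence: splitting on any run of whitespace, commas, or periods. The trailing + (one or more) is an addition for this example so that ", " counts as a single separator.

```python
import re

# [\s,.] matches a whitespace character, a comma, or a period.
words = re.split(r"[\s,.]+", "one, two.three four")
print(words)   # ['one', 'two', 'three', 'four']
```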

The last metacharacter in this section is the dot (.), which matches any character except a newline. If you set the re.DOTALL flag, . matches any character, including a newline.
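A short sketch of the dot with and without re.DOTALL; the three-character sample string is invented.

```python
import re

text = "a\nb"   # 'a', a newline, 'b'

# By default '.' refuses to match the newline, so the search fails.
default = re.search(r"a.b", text)
print(default)              # None

# With re.DOTALL the newline is matched like any other character.
dotall = re.search(r"a.b", text, re.DOTALL)
print(repr(dotall.group())) # 'a\nb'
```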

Repeating Things

Being able to match varying sets of characters is the first thing regular expressions can do that the existing string methods cannot. But if you think that is their only advantage, you are too young, too naive. Regular expressions have another powerful capability: you can specify that part of the RE must be repeated a certain number of times.

Let's look at the metacharacter * first. Of course it does not match the '*' character itself (as we said, metacharacters have special powers); instead, it specifies that the preceding character can be matched zero or more times.

For example, ca*t will match ct (0 'a' characters), cat (1 'a'), caaat (3 'a' characters), and so on. Note that, because of the internal limit imposed by the size of C's int type, the regular expression engine caps the number of repetitions of 'a' at about 2 billion; in practice we rarely work with that much data.
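The pattern ca*t can be tried on a made-up sample string; "cot" is included as a hypothetical non-match.

```python
import re

# 'a*' allows zero or more 'a' characters between 'c' and 't'.
found = re.findall(r"ca*t", "ct cat caaat cot")
print(found)   # ['ct', 'cat', 'caaat']
```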

Repetition in regular expressions is greedy by default: when repeating an RE, the matching engine tries to match as many repetitions as possible. If the rest of the pattern then fails to match, the engine backs up one character and tries again with fewer repetitions.

Let's go through an example step by step to see what "greedy" means. Consider the expression a[bcd]*b: it first matches the character 'a', then zero or more characters from the class [bcd], and finally must end with 'b'. Now imagine matching this RE against the string abcbd.

Step    Matched    Explanation
1       a          The 'a' at the start of the RE matches
2       abcbd      The engine matches [bcd]* as far as it can, which is to the end of the string
3       Failure    The engine tries to match the final 'b' of the RE, but the current position is already the end of the string, so it fails
4       abcb       Back up, so that [bcd]* matches one character fewer
5       Failure    Try to match the final 'b' again, but the character at the current position is 'd', so it fails
6       abc        Back up again, so that [bcd]* now matches only 'bc'
7       abcb       Try to match 'b' again; this time the character at the current position is 'b', so the match succeeds

In the end, the RE matches abcb.
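The walkthrough can be reproduced for the pattern a[bcd]*b and the lowercase string abcbd. The first part asks the real engine for the result. The second part is only a hand-written imitation of the backtracking steps (it is not how the re module is implemented, and it assumes the text starts with 'a'); the names end, trace, and result are made up for this sketch.

```python
import re

text = "abcbd"

# What the real engine reports.
m = re.match(r"a[bcd]*b", text)
print(m.group(), m.span())   # abcb (0, 4)

# Hand-written imitation of the backtracking, for illustration only.
end = 1                                   # position just after the leading 'a'
while end < len(text) and text[end] in "bcd":
    end += 1                              # greedy: take as many [bcd] as possible

trace = []
result = None
while end >= 1:
    if end < len(text) and text[end] == "b":
        trace.append((text[:end], "b matches"))
        result = text[:end + 1]           # include the final 'b'
        break
    trace.append((text[:end], "b fails"))
    end -= 1                              # back up: [bcd]* gives up one character

print(trace)    # [('abcbd', 'b fails'), ('abcb', 'b fails'), ('abc', 'b matches')]
print(result)   # abcb
```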

Annotation: Regular expression matching is greedy by default; a later part will show how to match in a non-greedy way.

Another repeating metacharacter is +, which specifies that the preceding character is matched one or more times.

Pay careful attention to the difference between * and +: * matches zero or more times, so the repeated content may not appear at all, whereas + requires at least one occurrence. For example, ca+t matches cat and caaat, but does not match ct.
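A side-by-side sketch of the two qualifiers. re.fullmatch is used here so that each candidate string must match in its entirety.

```python
import re

candidates = ["ct", "cat", "caaat"]

star = [s for s in candidates if re.fullmatch(r"ca*t", s)]
plus = [s for s in candidates if re.fullmatch(r"ca+t", s)]

print(star)   # ['ct', 'cat', 'caaat']  (zero 'a' characters is fine for *)
print(plus)   # ['cat', 'caaat']        (+ needs at least one 'a')
```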

There are two more repeating metacharacters. One is the question mark ?, which specifies that the preceding character is matched zero or one time. You can think of it as marking something as optional.
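As an illustration of "optional", here is a sketch with a hypothetical pattern, colou?r, which is not taken from the original article: the 'u' may appear once or not at all.

```python
import re

spellings = ["color", "colour", "colouur"]

# 'u?' makes the 'u' optional: zero or one occurrence.
accepted = [s for s in spellings if re.fullmatch(r"colou?r", s)]
print(accepted)   # ['color', 'colour']
```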

The most flexible is {m,n} (where m and n are decimal integers); all of the repetition metacharacters above can be expressed with it. It means that the preceding character must be matched at least m and at most n times. For example, a/{1,3}b matches a/b, a//b, and a///b, but it does not match ab (no slash) or a////b (more than three slashes).
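The slash example can be checked with a small sketch; the list of candidates simply spells out the cases named above.

```python
import re

pattern = re.compile(r"a/{1,3}b")   # between one and three slashes

candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)   # ['a/b', 'a//b', 'a///b']
```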

You can omit either m or n, in which case the engine assumes a reasonable value. Omitting m is interpreted as a lower bound of 0, and omitting n as an upper bound of infinity (in practice, the 2 billion limit mentioned above).

Annotation: {,n} is equivalent to {0,n}, {m,} means from m repetitions up to infinity, and {n} repeats the preceding character exactly n times. A very common mistake is to write {m, n} with a space: it looks nicer, but spaces must not be added inside the braces, or the meaning changes.
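These variants, including the space pitfall, can be tried in a short sketch. The sample strings are invented. With a space inside the braces, Python's re module stops treating them as a repetition and matches the characters literally.

```python
import re

at_most_two = re.fullmatch(r"a{,2}", "aa")       # {,2} means 0 to 2 times
too_many = re.fullmatch(r"a{,2}", "aaa")         # three is over the limit
at_least_two = re.fullmatch(r"a{2,}", "aaaaa")   # {2,} means 2 or more
exactly_three = re.fullmatch(r"a{3}", "aaa")     # {3} means exactly 3

print(bool(at_most_two), bool(too_many))         # True False
print(bool(at_least_two), bool(exactly_three))   # True True

# Pitfall: with a space, "{1, 3}" is no longer a repetition count.
spaced = re.fullmatch(r"a{1, 3}", "aaa")
literal = re.fullmatch(r"a{1, 3}", "a{1, 3}")
print(bool(spaced), bool(literal))               # False True
```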

Finally, {m,n} can stand in for *, +, and ?: {0,} is the same as *, {1,} is the same as +, and {0,1} is the same as ?. Even so, you are encouraged to remember and use *, +, and ?, because they are shorter and easier to read.

Annotation: Another reason is that the matching engine is optimized for * and +, so they are more efficient.
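The equivalences can be spot-checked on a handful of sample strings. This is only a sanity check on invented inputs, not a proof, and the helper name matching is made up for this sketch.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

same_star = matching(r"ca*t") == matching(r"ca{0,}t")
same_plus = matching(r"ca+t") == matching(r"ca{1,}t")
same_optional = matching(r"ca?t") == matching(r"ca{0,1}t")

print(same_star, same_plus, same_optional)   # True True True
```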

(End of this article)

Next: Python 3 Regular Expressions in Detail (II)

