Python3 Regular Expressions (1)

Source: Internet
Author: User

https://docs.python.org/3.4/howto/regex.html

The blogger has added some annotations and modifications to this text ^_^

Annotation: Python's regular expression engine is written in C, so it is very efficient. In addition, the regular expression (RE) discussed here is the "rules" we mentioned above.

The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. Some tasks can be done with them, but the expressions turn out to be very complicated. In those cases you may be better off writing Python code to do the processing: although Python code runs more slowly than an elaborate regular expression, it will probably be easier to understand.

Annotation: this is just getting the bad news out of the way first, so don't worry about it. Regular expressions are excellent and can handle 98.3% of your text processing tasks, so be sure to study them well ~~~~

Simple Patterns

We will start with the simplest regular expressions. Because regular expressions are used to operate on strings, we begin with the most common task: matching characters.

Character matching

Most letters and characters simply match themselves. For example, the regular expression Fanfan matches the string "Fanfan" exactly (you can enable a case-insensitive mode that lets Fanfan also match "FANFAN" or "fanfan"; we will discuss this later).
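A minimal sketch of literal matching, using Python's re module. The case-insensitive mode mentioned above is assumed here to mean the re.IGNORECASE flag; the variable names are only for illustration.

```python
import re

# A pattern made of ordinary characters matches exactly that text.
pattern = re.compile("Fanfan")
exact = pattern.fullmatch("Fanfan") is not None        # True
upper = pattern.fullmatch("FANFAN") is not None        # False: case differs

# re.IGNORECASE turns on case-insensitive matching.
loose = re.compile("Fanfan", re.IGNORECASE)
upper_loose = loose.fullmatch("FANFAN") is not None    # True
lower_loose = loose.fullmatch("fanfan") is not None    # True

print(exact, upper, upper_loose, lower_loose)
```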

Of course, there are exceptions to this rule. A few special characters, called metacharacters, do not match themselves. Instead, they define character classes, subgroup matching, and the number of times a pattern repeats. Much of this article is devoted to the various metacharacters and what they do.

Below is the complete list of metacharacters (we will explain them one by one):

. ^ $ * + ? {} [] \ | ()

Annotation: without these metacharacters, regular expressions would be as mediocre as the string find() method...

Let's look at the square brackets [ ] first. They specify a character class, which is the set of characters you want to match. You can list the characters individually, or give two characters separated by a hyphen to specify a range. For example, [abc] matches any of the characters a, b, or c, and [a-c] does the same thing, using a range to express the same set. If you want to match only lowercase letters, your RE would be [a-z].
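As a small illustration of character classes, the sketch below uses re.findall to collect every character that a class matches. The sample strings are made up for the example.

```python
import re

text = "abcdxyz"

listed = re.findall(r"[abc]", text)   # characters listed one by one
ranged = re.findall(r"[a-c]", text)   # the same set written as a range

# [a-z] keeps only the lowercase letters.
lowercase = re.findall(r"[a-z]", "Python3 RE")

print(listed)     # ['a', 'b', 'c']
print(ranged)     # ['a', 'b', 'c']
print(lowercase)  # ['y', 't', 'h', 'o', 'n']
```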

Note that metacharacters lose their "special functions" inside square brackets; within a character class they only match themselves. For example, [akm$] matches any of the characters 'a', 'k', 'm', or '$'. Although '$' is normally a metacharacter, it has no special meaning inside square brackets and only matches the '$' character itself.

You can also match every character that is not listed in the class by complementing the set, which is done by placing a caret ^ as the first character of the class. For example, [^5] matches any character except '5'.
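The two rules above (metacharacters are literal inside a class, and a leading ^ complements the class) can be checked with a short sketch. The input strings are arbitrary examples.

```python
import re

# Inside a class, '$' is an ordinary character.
in_class = re.findall(r"[akm$]", "make $5")

# A caret as the FIRST character complements the class.
negated = re.findall(r"[^5]", "1565")

# A caret anywhere else in the class is just a literal '^'.
literal_caret = re.findall(r"[5^]", "5^6")

print(in_class)       # ['m', 'a', 'k', '$']
print(negated)        # ['1', '6']
print(literal_caret)  # ['5', '^']
```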

Perhaps the most important metacharacter is the backslash \. As in Python string literals, a backslash placed in front of a metacharacter removes that metacharacter's "special function". For example, to match a [ or a \, put a backslash before it: \[ or \\.
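A minimal sketch of escaping metacharacters. Raw string literals (the r"..." prefix) are used so that Python itself does not interpret the backslashes; re.escape is shown as one way to escape a whole literal string, and the sample inputs are invented.

```python
import re

# \[ matches a literal opening bracket.
brackets = re.findall(r"\[", "a[0]")

# \\ matches a single literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\new")

# re.escape() adds the backslashes needed to match a string literally.
escaped_ok = re.fullmatch(re.escape("a[0]"), "a[0]") is not None

print(brackets)          # ['[']
print(len(backslashes))  # 2
print(escaped_ok)        # True
```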

A backslash followed by certain characters can also stand for a special set, such as the set of decimal digits, the set of letters, or the set of all non-whitespace characters.

Annotation: the backslash is awesome. Put it in front of a metacharacter and the metacharacter loses its special function; put it in front of certain ordinary characters and they gain a special function.

Let's take an example: \w matches any word character. If the regular expression pattern is given as bytes, this is equivalent to the character class [a-zA-Z0-9_]. If the pattern is a string, \w matches every character marked as a letter in the Unicode database (provided by the unicodedata module). You can restrict \w to the narrower definition by supplying re.ASCII when compiling the regular expression.

Annotation: the re.ASCII flag makes \w match only ASCII characters. Do not forget that Python 3 strings are Unicode.
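The difference between string patterns, bytes patterns, and re.ASCII can be seen in a short sketch. The sample text is made up, and its accented letters are written as \u escapes only to keep the source ASCII-safe.

```python
import re

text = "na\u00efve_1 caf\u00e9"   # contains the letters i-diaeresis and e-acute

# str pattern: \w follows the Unicode definition of a word character.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: \w is [a-zA-Z0-9_] as well.
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # two words, accents included
print(ascii_words)    # ['na', 've_1', 'caf']
print(bytes_words)    # [b'na', b've_1', b'caf']
```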

The following table lists the special sequences formed by a backslash and a character:

Special sequence    Description
\d    Matches any decimal digit; equivalent to the class [0-9].
\D    The opposite of \d: matches any non-digit character; equivalent to [^0-9].
\s    Matches any whitespace character (including spaces, line breaks, and tabs); equivalent to the class [ \t\n\r\f\v].
\S    The opposite of \s: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w    Matches any word character, as described above.
\W    The opposite of \w.
\b    Matches the start or end of a word.
\B    The opposite of \b.
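The sequences in the table can be tried out as follows. This is only a sketch; the sample strings are invented for illustration.

```python
import re

sample = "Tel: 555 0199\tok"

digits = re.findall(r"\d+", sample)       # runs of digits
non_digits = re.findall(r"\D+", sample)   # runs of everything else
spaces = re.findall(r"\s", sample)        # each whitespace character
non_spaces = re.findall(r"\S+", sample)   # runs of non-whitespace

# \b only matches at a word boundary, \B only where there is none.
whole_word = re.findall(r"\bok\b", "ok okay ok")
inside_word = re.findall(r"\Bok\B", "token")

print(digits)       # ['555', '0199']
print(non_digits)   # ['Tel: ', ' ', '\tok']
print(spaces)       # [' ', ' ', '\t']
print(non_spaces)   # ['Tel:', '555', '0199', 'ok']
print(whole_word)   # ['ok', 'ok']
print(inside_word)  # ['ok']
```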

These sequences can be included inside a character class, where they keep their special meaning. For example, [\s,.] is a character class that matches any whitespace character (the special meaning of \s), a ',' or a '.'.
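One possible use of such a class is splitting text on whitespace, commas, and periods. The sketch below assumes re.split and an invented sample sentence; the trailing + (covered later in this article) merges adjacent separators.

```python
import re

# Split wherever one or more whitespace characters, commas, or periods occur.
parts = re.split(r"[\s,.]+", "one, two.three four")

print(parts)  # ['one', 'two', 'three', 'four']
```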

The final metacharacter in this section is the dot (.), which matches any character except a line break. If the re.DOTALL flag is set, it matches any character, including line breaks.
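A short sketch of the dot with and without re.DOTALL, using a two-line sample string chosen for the example:

```python
import re

text = "a\nb"

# By default the dot skips the line break.
default_dot = re.findall(r".", text)

# With re.DOTALL the line break is matched too.
dotall_dot = re.findall(r".", text, re.DOTALL)

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```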

Repeating things

Regular expressions can easily match varying sets of characters, which the existing methods of Python strings cannot do. However, if you think that is their only advantage, you are too young, too naive. Another powerful capability of regular expressions is that you can specify how many times a portion of the RE must be repeated.

Let's look at the * metacharacter first. Of course, it does not match the '*' character itself (as we have said, metacharacters have special capabilities). Instead, it specifies that the preceding character can be matched zero or more times.

For example, ca*t matches ct (0 'a' characters), cat (1 'a'), caaat (3 'a' characters), and so on. Note that, because of the internal size limit of the int type in C, the regular expression engine limits the number of repetitions of 'a' to a little over 2 billion. In practice we rarely need repetition counts anywhere near that large.
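The ca*t example can be verified with a small sketch; "cbt" is an extra, made-up input added to show a non-match.

```python
import re

star = re.compile(r"ca*t")

# fullmatch() succeeds only if the whole string fits the pattern.
star_results = {s: star.fullmatch(s) is not None
                for s in ["ct", "cat", "caaat", "cbt"]}

print(star_results)
# {'ct': True, 'cat': True, 'caaat': True, 'cbt': False}
```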

Repetition in regular expressions is greedy by default: when repeating an RE, the matching engine tries to match as many repetitions as possible. If the rest of the pattern then fails to match, the engine backs up one character at a time and tries again with fewer repetitions.

Let's walk through an example step by step to see what "greedy" means. Consider the expression a[bcd]*b. It first matches the character 'a', then zero or more characters from the class [bcd], and finally ends with 'b'. Now imagine what happens when this RE is matched against the string abcbd.

Step    Matched    Explanation
1       a          The 'a' in the RE matches.
2       abcbd      The engine matches [bcd]* as far as it can, which is to the end of the string.
3       Failure    The engine tries to match the final 'b' of the RE, but the current position is the end of the string, so it fails.
4       abcb       The engine backs up, so that [bcd]* matches one character fewer.
5       Failure    It tries to match the final 'b' again, but the character at the current position is 'd', so it fails.
6       abc        It backs up again, so that [bcd]* matches only 'bc'.
7       abcb       It tries to match 'b' again. This time the character at the current position is 'b', so the match succeeds.

Finally, the result of RE matching is abcb.
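The end result of the walkthrough can be confirmed in Python. The second pattern adds a capturing group (parentheses, not yet covered in this article) purely to show what [bcd]* ended up matching; treat it as an illustrative assumption.

```python
import re

m = re.match(r"a[bcd]*b", "abcbd")
matched = m.group()   # the text matched by the whole RE
span = m.span()       # start and end positions of the match

# Same RE, with [bcd]* wrapped in a group so its share can be inspected.
repeated_part = re.match(r"a([bcd]*)b", "abcbd").group(1)

print(matched)        # abcb
print(span)           # (0, 4)
print(repeated_part)  # bc
```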

Annotation: regular expressions match greedily by default. Later, we will show you how to match in a non-greedy way.

Another repetition metacharacter is +, which specifies that the preceding character is matched one or more times.

Pay special attention to the difference between * and +: * matches zero or more times, so the repeated content may not be present at all, while + requires at least one occurrence. For example, ca+t matches cat and caaat, but does not match ct.
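A side-by-side sketch of * and + on the same inputs:

```python
import re

inputs = ["ct", "cat", "caaat"]

star_hits = [s for s in inputs if re.fullmatch(r"ca*t", s)]
plus_hits = [s for s in inputs if re.fullmatch(r"ca+t", s)]

print(star_hits)  # ['ct', 'cat', 'caaat']
print(plus_hits)  # ['cat', 'caaat']
```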

There are two more repetition metacharacters. One is the question mark ?, which specifies that the preceding character is matched either zero times or once. You can think of it as marking something as optional.
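As an illustration of the optional marker, the sketch below uses a made-up pattern, colou?r, in which the 'u' may or may not be present. The pattern and inputs are not from the original text.

```python
import re

optional_u = re.compile(r"colou?r")

optional_results = {s: optional_u.fullmatch(s) is not None
                    for s in ["color", "colour", "colouur"]}

print(optional_results)
# {'color': True, 'colour': True, 'colouur': False}
```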

The most flexible one is {m,n} (where m and n are both decimal integers), which can express everything the metacharacters above can. It means that the preceding character must be matched at least m times and at most n times. For example, a/{1,3}b matches a/b, a//b, and a///b. It does not match ab (no slash), and it does not match a////b (more than three slashes).
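The a/{1,3}b example can be checked like this:

```python
import re

slashes = re.compile(r"a/{1,3}b")

slash_results = {s: slashes.fullmatch(s) is not None
                 for s in ["ab", "a/b", "a//b", "a///b", "a////b"]}

print(slash_results)
# {'ab': False, 'a/b': True, 'a//b': True, 'a///b': True, 'a////b': False}
```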

You can omit either m or n, in which case the engine assumes a reasonable value in its place. If m is omitted, the lower limit is taken to be 0; if n is omitted, the upper limit is taken to be infinity (in practice, the limit of about 2 billion mentioned above).

Annotation: {,n} is equivalent to {0,n}; {m,} is equivalent to {m, positive infinity}; and {n} repeats the preceding character exactly n times. A very common mistake is to write {m, n} with a space after the comma. It looks tidy, but spaces cannot be added to a regular expression at will, because they change its meaning.
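The sketch below tries the omitted-bound forms and the space mistake. The last two checks reflect how CPython's re module behaves as far as I can tell: a brace expression containing a space is not recognized as a repetition and is matched as literal text instead.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

at_most_two = [full(r"a{,2}", s) for s in ["", "aa", "aaa"]]    # {,2} == {0,2}
at_least_two = [full(r"a{2,}", s) for s in ["a", "aa", "aaaaa"]]
exactly_three = [full(r"a{3}", s) for s in ["aa", "aaa", "aaaa"]]

# With a space, the braces stop being a repetition.
spaced_repeats = full(r"a{1, 3}", "aa")
spaced_literal = full(r"a{1, 3}", "a{1, 3}")

print(at_most_two)     # [True, True, False]
print(at_least_two)    # [False, True, True]
print(exactly_three)   # [False, True, False]
print(spaced_repeats)  # False
print(spaced_literal)  # True
```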

Finally, *, +, and ? can all be replaced by {m,n}: {0,} is the same as *, {1,} is the same as +, and {0,1} is the same as ?. However, we encourage everyone to remember and use *, +, and ?, because they are shorter and easier to read.
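A quick way to convince yourself of these equivalences is to compare each short form with its {m,n} counterpart on a few inputs. This is a spot check on made-up samples, not a proof.

```python
import re

samples = ["", "a", "aa", "aaa"]
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]

# For every pair and every sample, both forms must agree on match / no match.
equivalent = all(
    (re.fullmatch(short, s) is None) == (re.fullmatch(braces, s) is None)
    for short, braces in pairs
    for s in samples
)

print(equivalent)  # True
```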

Annotation: another reason sometimes given is that the matching engine is optimized for *, +, and ?, which makes them more efficient.

(This article is complete)

Next article: Explanation of Python3 Regular Expressions (2)

If you like this article, please use the "Comments" below to encourage me. ^_^
