A Very Detailed Guide to Python Regular Expression Operations (Using re), Part One

Source: Internet
Author: User
Tags alphabetic character repetition expression engine

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns. Before version 1.5, Emacs-style patterns were provided through the regex module. Emacs-style patterns are less readable and less powerful, so avoid the regex module when writing new code, although you may occasionally find it in old code.

In essence, a regular expression (or RE) is a small, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, e-mail addresses, TeX commands, or anything you like. You can then ask questions such as "Does this string match the pattern?" or "Is there a match for the pattern anywhere in this string?". You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and to write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn't covered in this article, because it requires a good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be easier to understand.

Simple patterns

We'll start by learning about the simplest possible regular expressions. Since regular expressions are commonly used for string manipulation, we start with the most common task: matching characters.

For a detailed explanation of the underlying computer science in regular expressions (deterministic and nondeterministic finite automata), you can refer to any textbook that is related to writing compilers.

Character matching

Most letters and characters will simply match themselves. For example, the regular expression test will match the string "test" exactly. (You can enable a case-insensitive mode that would let this RE match "Test" or "TEST" as well; more about this later.)

There are exceptions to this rule; some characters are special metacharacters and don't match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other portions of the RE by repeating them. Much of this guide is devoted to discussing the various metacharacters and what they do.

Here is a complete list of the metacharacters; their meanings are discussed in the rest of this guide.

. ^ $ * + ? { } [ ] \ | ( )

The first metacharacters we'll look at are "[" and "]". They are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them with a "-". For example, [abc] will match any of the characters "a", "b", or "c"; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].
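The character classes described above can be tried out with a short sketch like the following. It uses re.fullmatch (available in Python 3.4 and later) so that the whole one-character string has to match; the variable names are only illustrative.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.compile(r"[abc]")
ranged = re.compile(r"[a-c]")
lowercase = re.compile(r"[a-z]")

for ch in "abc":
    assert listed.fullmatch(ch) and ranged.fullmatch(ch)

print(bool(listed.fullmatch("d")))     # False: "d" is not in the class
print(bool(lowercase.fullmatch("q")))  # True: any lowercase ASCII letter
print(bool(lowercase.fullmatch("Q")))  # False: uppercase is outside a-z
```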

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it is stripped of its special nature and treated as an ordinary character.

You can match the characters not listed within the class by complementing the set. This is indicated by including a "^" as the first character of the class; a "^" elsewhere in the class simply matches the "^" character itself. For example, [^5] will match any character except "5".
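As a small sketch of the two rules about classes (metacharacters such as "$" become ordinary inside a class, and a leading "^" complements it), the following uses re.findall to list which characters of a sample string each class accepts. The sample strings are made up for illustration.

```python
import re

dollar_class = re.compile(r"[akm$]")   # "$" is an ordinary character here
not_five = re.compile(r"[^5]")         # any character except "5"
caret_literal = re.compile(r"[5^]")    # "^" is not first, so it is literal

print(dollar_class.findall("a$z"))     # ['a', '$']
print(not_five.findall("565"))         # ['6']
print(caret_literal.findall("5^6"))    # ['5', '^']
```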

Perhaps the most important metacharacter is the backslash, "\". As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It is also used to escape all the metacharacters so that you can still match them in patterns. For example, if you need to match a "[" or a "\", you can precede them with a backslash to remove their special meaning: "\[" or "\\".
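A minimal sketch of escaping metacharacters follows. It assumes the patterns are written as raw string literals (the r"..." prefix) so that the backslashes reach the re module unchanged; the sample strings are invented for the example.

```python
import re

open_bracket = re.compile(r"\[")   # matches a literal "["
backslash = re.compile(r"\\")      # matches a literal "\"

bracket_hits = open_bracket.findall("a[0] + b[1]")
backslash_hits = backslash.findall("C:\\temp\\file")  # text with 2 backslashes

print(bracket_hits)         # ['[', '[']
print(len(backslash_hits))  # 2
```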

Some of the special sequences beginning with "\" represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn't whitespace. The following predefined special sequences are available:

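\d  Matches any decimal digit (for ASCII text, the same as [0-9]).
\D  Matches any non-digit character (for ASCII text, the same as [^0-9]).
\s  Matches any whitespace character (for ASCII text, the same as [ \t\n\r\f\v]).
\S  Matches any non-whitespace character (for ASCII text, the same as [^ \t\n\r\f\v]).
\w  Matches any alphanumeric character or underscore (for ASCII text, the same as [a-zA-Z0-9_]).
\W  Matches any character that is not alphanumeric or underscore (for ASCII text, the same as [^a-zA-Z0-9_]).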

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or "," or ".".
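To see the predefined sequences at work, here is a brief sketch that runs a few of them over an invented sample string. The string and variable names are assumptions made only for this example.

```python
import re

text = "Order 66, shipped.\tOK"

digits = re.findall(r"\d", text)           # individual digits
words = re.findall(r"\w+", text)           # runs of alphanumeric characters
separators = re.findall(r"[\s,.]", text)   # whitespace, "," or "."

print(digits)      # ['6', '6']
print(words)       # ['Order', '66', 'shipped', 'OK']
print(separators)  # [' ', ',', ' ', '.', '\t']
```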

The last metacharacter in this section is ".". It matches any character except a newline, and in an alternate mode (re.DOTALL) it matches even a newline. "." is often used where you want to match "any character".
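The difference between the default behaviour of "." and the re.DOTALL mode can be sketched as follows, using a three-character sample string with a newline in the middle.

```python
import re

text = "a\nb"

default_hits = re.findall(r".", text)             # newline is skipped
dotall_hits = re.findall(r".", text, re.DOTALL)   # newline is matched too

print(default_hits)  # ['a', 'b']
print(dotall_hits)   # ['a', '\n', 'b']
```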

Repetition

Being able to match varying sets of characters is the first thing regular expressions can do that isn't already possible with the methods available on strings. However, if that were the only additional capability of regular expressions, they wouldn't be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is *. * doesn't match the literal character "*"; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match "ct" (0 "a" characters), "cat" (1 "a"), "caaat" (3 "a" characters), and so on. The RE engine has various internal limits, stemming from the size of C's int type, that prevent it from matching more than about 2 billion "a" characters; you probably don't have enough memory to build a string that large, so you shouldn't run into that limit.
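The ca*t example can be checked with a short sketch like this one. The extra candidate "cot" is an assumption added to show a string that does not match.

```python
import re

pattern = re.compile(r"ca*t")

# fullmatch() succeeds only if the entire string fits the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in ["ct", "cat", "caaat", "cot"]}

print(results)
# {'ct': True, 'cat': True, 'caaat': True, 'cot': False}
```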

Repetitions such as * are greedy; when repeating an RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with fewer repetitions.

A step-by-step example will make this clearer. Let's consider the expression a[bcd]*b. This matches the letter "a", zero or more letters from the class [bcd], and finally ends with a "b". Now imagine matching this RE against the string "abcbd".

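1. Matched "a": the "a" in the RE matches.
2. Matched "abcbd": the engine matches [bcd]*, going as far as it can, which is to the end of the string.
3. Failure: the engine tries to match "b", but the current position is at the end of the string, so it fails.
4. Matched "abcb": the engine backs up, so that [bcd]* matches one character fewer.
5. Failure: the engine tries "b" again, but the character at the current position is "d".
6. Matched "abc": the engine backs up again, so that [bcd]* matches only "bc".
7. Matched "abcb": the engine tries "b" again; this time the character at the current position is "b", so it succeeds.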

The end of the RE has now been reached, and it has matched "abcb". This demonstrates how the matching engine goes as far as it can at first, and if no match is found it progressively backs up and retries the rest of the RE again and again. It will back up until it has tried zero matches for [bcd]*, and if that also fails, the engine concludes that the string doesn't match the RE at all.
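The outcome of this walk-through can be confirmed with a minimal sketch. The second call, on the made-up string "acd", is an assumption added to show the case where backing up all the way still fails.

```python
import re

match = re.match(r"a[bcd]*b", "abcbd")
print(match.group())   # abcb
print(match.span())    # (0, 4)

# Here no amount of backing up leaves a "b" for the final part of the RE.
no_match = re.match(r"a[bcd]*b", "acd")
print(no_match)        # None
```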

Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +: * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match "cat" (1 "a") and "caaat" (3 "a" characters), but won't match "ct".
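A side-by-side sketch of * and + on the same candidate strings makes the difference visible; only "ct" is treated differently.

```python
import re

star = re.compile(r"ca*t")   # zero or more "a"
plus = re.compile(r"ca+t")   # one or more "a"

for s in ["ct", "cat", "caaat"]:
    print(s, bool(star.fullmatch(s)), bool(plus.fullmatch(s)))
# ct True False
# cat True True
# caaat True True
```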

There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as optional. For example, home-?brew matches either "homebrew" or "home-brew".
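The optional hyphen can be sketched as follows. The third candidate, with two hyphens, is an assumption added to show that ? allows at most one occurrence.

```python
import re

optional_hyphen = re.compile(r"home-?brew")

candidates = ["homebrew", "home-brew", "home--brew"]
accepted = [s for s in candidates if optional_hyphen.fullmatch(s)]

print(accepted)  # ['homebrew', 'home-brew']
```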

The most complicated repeating qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match "a/b", "a//b", and "a///b". It won't match "ab", which has no slashes, or "a////b", which has four.
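The same five strings can be run through the pattern in a short sketch to confirm which ones fall inside the {1,3} bounds.

```python
import re

slashes = re.compile(r"a/{1,3}b")

candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]
accepted = [s for s in candidates if slashes.fullmatch(s)]

print(accepted)  # ['a/b', 'a//b', 'a///b']
```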

You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity. In practice the upper bound is the engine's internal limit of about 2 billion mentioned earlier, which might as well be infinity.

Careful readers may notice that the three other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It's better to use *, +, or ? when you can, simply because they're shorter and easier to read.
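The equivalences and the omitted bounds can be checked with a sketch along these lines. The helper function accepted() and the sample strings are hypothetical names introduced only for this example.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]
equivalents = [
    (r"ca{0,}t", r"ca*t"),
    (r"ca{1,}t", r"ca+t"),
    (r"ca{0,1}t", r"ca?t"),
]

def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Each brace form accepts exactly the same samples as its shorthand.
for braces, shorthand in equivalents:
    assert accepted(braces, samples) == accepted(shorthand, samples)

# Omitting m means "from zero up to n"; omitting n means "m or more".
print(accepted(r"ca{,2}t", samples))  # ['ct', 'cat', 'caat']
print(accepted(r"ca{2,}t", samples))  # ['caat', 'caaat']
```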

Using regular expressions

Now that we've looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

Compiling regular expressions

Regular expressions are compiled into RegexObject instances that can provide methods for different operations, such as pattern-matching searches or string substitutions.
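A minimal sketch of compiling an RE and calling a few methods on the result is shown below. Note that the current Python 3 documentation calls these compiled objects "pattern objects" rather than RegexObject instances; the sample strings are invented for the example.

```python
import re

p = re.compile(r"ab*")   # compile once, then reuse the object

print(p.pattern)                      # ab*
print(p.search("xxabbbyy").group())   # abbb
print(p.sub("-", "abb cab a"))        # - c- -
```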

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We'll go over the available settings later; for now it is enough to know that a flag such as re.IGNORECASE can be passed as the second argument.
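As a small sketch of the flags argument, the following compiles the same RE with and without re.IGNORECASE and tries both on an uppercase string chosen for illustration.

```python
import re

case_sensitive = re.compile(r"ab*")
ignore_case = re.compile(r"ab*", re.IGNORECASE)

print(case_sensitive.fullmatch("ABBB"))        # None
print(ignore_case.fullmatch("ABBB").group())   # ABBB
```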

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren't part of the core Python language, and no special syntax was created for expressing them. (There are applications that don't need REs at all, so there's no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.

Putting REs in strings keeps the Python language simpler, but it has one disadvantage, which is the topic of the next section.

The trouble with backslashes

As stated earlier, regular expressions use the backslash character ("\") to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's use of the same character for the same purpose in string literals.

Let's say you want to write an RE that matches the string "\section", which might be found in a LaTeX file. To figure out what to write in the program code, start with the string you want to match. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, which gives the RE \\section. Finally, to express this as a Python string literal, both backslashes must be escaped again, which gives "\\\\section".

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be "\\", and each backslash must be expressed as "\\" inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.
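The doubling of backslashes can be seen in a short sketch that uses only ordinary (non-raw) string literals. The sample text "\section{Intro}" is an assumption made for the example.

```python
import re

text = "\\section{Intro}"   # the text holds ONE backslash before "section"
pattern = "\\\\section"     # four backslashes in source, two in the RE

print(pattern)              # \\section  (what the re module receives)

match = re.search(pattern, text)
print(match.group())        # \section
```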

The solution is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with "r", so r"\n" is a two-character string containing "\" and "n", while "\n" is a one-character string containing a newline. Regular expressions are typically written in Python code using this raw string notation.
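As a sketch of how raw strings relate to ordinary ones, the following compares a few pairs of literals that denote the same RE text, then uses the raw form in a search. The pairs and the sample text are chosen for illustration; the strings are only compared, not all compiled.

```python
import re

print(len(r"\n"), len("\n"))   # 2 1

# Each pair spells the same string: ordinary literal vs. raw literal.
pairs = [
    ("ab*", r"ab*"),
    ("\\\\section", r"\\section"),
    ("\\w+\\s+\\1", r"\w+\s+\1"),
]
for regular, raw in pairs:
    assert regular == raw

match = re.search(r"\\section", "\\section{Intro}")
print(match.group())           # \section
```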
