Python Regular Expression Operations Guide

Original author: a.m. kuchling (amk@amk.ca)


License: Creative Commons license

Translator: Firehare

Proofreader: Leal


Applicable version: Python 1.5 and subsequent versions


http://wiki.ubuntu.org.cn/Python%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%93%8D%E4%BD%9C%E6%8C%87%E5%8D%97


Contents

  • 1 Introduction
  • 2 Simple patterns
    • 2.1 Character matching
    • 2.2 Repetition
  • 3 Using regular expressions
    • 3.1 Compiling regular expressions
    • 3.2 The trouble with the backslash
    • 3.3 Performing matches
    • 3.4 Module-level functions
    • 3.5 Compilation flags
  • 4 More pattern features
    • 4.1 More metacharacters
    • 4.2 Grouping
    • 4.3 Non-capturing and named groups
    • 4.4 Lookahead assertions
  • 5
    • 5.1
    • 5.2 Search and replace
  • 6 FAQ
    • 6.1 Using string methods
    • 6.2 match() vs search()
    • 6.3 Greedy vs non-greedy
    • 6.4 Not using re.VERBOSE
  • 7 Feedback
  • 8 About this document


Introduction

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns. Before Python 1.5, Emacs-style patterns were provided by the regex module. Emacs-style patterns are slightly less readable and less capable, so avoid the regex module in new code, although you may occasionally find traces of it in old code.

Regular expressions (REs) are essentially a small, highly specialized programming language embedded in Python and made available through the re module. With this little language, you specify rules for the set of strings you want to match; that set might contain English sentences, e-mail addresses, TeX commands, or anything else you like. You can then ask questions such as "Does this string match the pattern?" or "Is there a part of this string that matches the pattern?" You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, you may need to pay close attention to how the engine executes a given RE, and write the RE in a particular way so that the resulting bytecode runs faster. This guide does not cover optimization, because that requires a good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. Some tasks can be done with regular expressions, but the expressions turn out to be extraordinarily complicated. In those cases you may be better off writing Python code to do the processing; although Python code is slower than an elaborate regular expression, it will probably be easier to understand.



Simple patterns

We will start with the simplest possible regular expressions. Because regular expressions are used to operate on strings, we begin with the most common task: matching characters.

For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.



Character matching



Most letters and characters simply match themselves. For example, the regular expression test matches the string "test" exactly. (You can enable a case-insensitive mode that lets this RE match "Test" or "TEST" as well; more about this later.)

There are exceptions to this rule: some characters are special metacharacters and do not match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other portions of the RE by repeating them. Much of this guide is devoted to the various metacharacters and what they do.

Here is the complete list of metacharacters; their meanings are discussed in the rest of this guide.


. ^ $ * + ? { [ ] \ | ( )


The first metacharacters we will look at are "[" and "]". They specify a character class, which is a set of characters that you want to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a "-". For example, [abc] matches any of the characters "a", "b", or "c"; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] matches any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it is stripped of its special nature and treated as an ordinary character.

You can match the characters not listed within a class by complementing the set. This is indicated by including a "^" as the first character of the class; a "^" anywhere else in the class simply matches the "^" character itself. For example, [^5] matches any character except "5".
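
The character-class rules above can be checked with a minimal sketch. It assumes Python 3, uses re.findall() (introduced later in this guide), and the sample strings are made up for illustration:

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "a big cab")
ranged = re.findall(r"[a-c]", "a big cab")

# Inside a class, "$" is an ordinary character, not a metacharacter.
dollar = re.findall(r"[akm$]", "make $5")

# A leading "^" complements the class: anything except "5".
not_five = re.findall(r"[^5]", "5a55b")

print(listed, ranged, dollar, not_five)
```

Both spellings of the class return the same list of characters, and the "$" in the third pattern is matched literally.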



Perhaps the most important metacharacter is the backslash, "\". As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It is also used to escape all the metacharacters so that you can still match them in patterns; for example, if you need to match a "[" or "\", you can precede them with a backslash to remove their special meaning: \[ or \\.

Some of the special sequences beginning with "\" represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that is not whitespace. The following predefined special sequences are available:


\d Matches any decimal digit; this is equivalent to the class [0-9].
\D Matches any non-digit character; this is equivalent to the class [^0-9].
\s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

These sequences can be included inside a character class. For example, [\s,.] is a character class that matches any whitespace character, or "," or ".".
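
A small sketch of the predefined sequences, assuming Python 3 and a made-up sample string. Note that for Python 3 str patterns these sequences are Unicode-aware unless re.ASCII is given, so the equivalences listed above hold for plain ASCII text such as this:

```python
import re

text = "Room 42, floor 3."

digits = re.findall(r"\d", text)          # same as [0-9] for ASCII text
words = re.findall(r"\w+", text)          # runs of [a-zA-Z0-9_]
separators = re.findall(r"[\s,.]", text)  # whitespace, "," or "."

print(digits)
print(words)
print(separators)
```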



The final metacharacter in this section is ".". It matches any character except a newline, and in an alternate mode (re.DOTALL) it matches even a newline. "." is often used where you want to match "any character".



Repetition

Being able to match varying sets of characters is the first thing regular expressions can do that is not already possible with the methods available on strings. However, if that were their only additional capability, they would not be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we will look at is *. * does not match the literal character "*"; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match "ct" (0 "a" characters), "cat" (1 "a"), "caaat" (3 "a" characters), and so on. The RE engine has various internal limitations stemming from the size of C's int type that prevent it from matching more than 2 billion "a" characters; you probably do not have enough memory to build a string that large, so you should not run into that limit.

Repetitions such as * are greedy; when repeating a RE, the matching engine tries to repeat it as many times as possible. If later portions of the pattern do not match, the matching engine backs up and tries again with fewer repetitions.




A step-by-step example will make this more obvious. Let's consider the expression a[bcd]*b. This matches the letter "a", zero or more letters from the class [bcd], and finally ends with a "b". Now imagine matching this RE against the string "abcbd".

Step  Matched  Explanation
1     a        The a in the RE matches.
2     abcbd    The engine matches [bcd]*, going as far as it can, which is to the end of the string.
3     Failure  The engine tries to match b, but the current position is at the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one less character.
5     Failure  Try b again, but the current position is at the last character, which is a "d".
6     abc      Back up again, so that [bcd]* only matches "bc".
7     abcb     Try b again. This time the character at the current position is "b", so it succeeds.

The end of the RE has now been reached, and it has matched "abcb". This demonstrates how the matching engine goes as far as it can at first, and if no match is found it progressively backs up and retries the rest of the RE again and again. It backs up until it has tried zero matches for [bcd]*, and if that also fails, the engine concludes that the string does not match the RE at all.
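
The walk-through above can be reproduced with a minimal sketch, assuming Python 3; the second string is a made-up case where backing up never finds a final "b":

```python
import re

# Greedy [bcd]* first swallows "bcbd", then backs up until "b" can match.
m = re.match(r"a[bcd]*b", "abcbd")
matched = m.group()
span = m.span()

# No "b" to end on, so every backtracking step fails and the result is None.
failed = re.match(r"a[bcd]*b", "acd")

print(matched, span, failed)
```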




Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match "cat" (1 "a") and "caaat" (3 "a" characters), but will not match "ct".

There are two more repeating qualifiers. The question mark, ?, matches either once or zero times; you can think of it as marking something as optional. For example, home-?brew matches either "homebrew" or "home-brew".

The most complicated repeating qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match "a/b", "a//b", and "a///b". It will not match "ab", which has no slashes, or "a////b", which has four.

You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity (actually the 2 billion limit mentioned earlier, but that might as well be infinity).

Attentive readers may notice that the three other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It is better to use *, +, or ? when you can, simply because they are shorter and easier to read.
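
The qualifiers in this section can be compared side by side with a short sketch. It assumes Python 3.4 or later for re.fullmatch(), and matches() is a hypothetical helper written only for this illustration:

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matches(r"ca*t", ["ct", "cat", "caaat"])
plus = matches(r"ca+t", ["ct", "cat", "caaat"])
optional = matches(r"home-?brew", ["homebrew", "home-brew", "home--brew"])
counted = matches(r"a/{1,3}b", ["ab", "a/b", "a//b", "a///b", "a////b"])

# The brace notation can express the other qualifiers as well.
same_as_star = matches(r"ca{0,}t", ["ct", "cat", "caaat"])
same_as_plus = matches(r"ca{1,}t", ["ct", "cat", "caaat"])

print(star, plus, optional, counted)
```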



Using regular expressions

Now that we have looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

Compiling regular expressions

Regular expressions are compiled into RegexObject instances, which have methods for various operations such as searching for pattern matches or performing string substitutions.


#!python
>>> import re
>>> p = re.compile('ab*')
>>> print p
<re.RegexObject instance at 80b4150>


re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We will go over the available settings later, but for now a single example will do:

#!python
>>> p = re.compile('ab*', re.IGNORECASE)

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions are not part of the core Python language, and no special syntax was created for expressing them. (There are applications that do not need REs at all, so there is no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.

Putting REs in strings keeps the Python language simpler, but it has one disadvantage, which is the topic of the next section.



The trouble with the backslash

As stated earlier, regular expressions use the backslash character ("\") to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's use of the same character for the same purpose in string literals.

Let's say you want to write a RE that matches the string "\section", which might be found in a LaTeX file. To figure out what to write in the program code, start with the string you want to match. Next, escape every backslash and other metacharacter by preceding it with a backslash. Finally, to express the result as a Python string literal, each backslash must be escaped once more.

Characters      Stage
\section        Text string to be matched
\\section       Escaped backslash for re.compile()
"\\\\section"   Escaped backslashes for a string literal

In short, to match a literal backslash, you have to write '\\\\' as the RE string, because the regular expression must be "\\", and each backslash must be written as "\\" inside a regular Python string literal. In REs that use backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular expressions. Backslashes are not handled in any special way in a string literal prefixed with "r", so r"\n" is a two-character string containing "\" and "n", while "\n" is a one-character string containing a newline. Regular expressions are usually written in Python code using this raw string notation.

Regular string     Raw string
"ab*"              r"ab*"
"\\\\section"      r"\\section"
"\\w+\\s+\\1"      r"\w+\s+\1"

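The difference between regular and raw string literals can be verified with a minimal sketch, assuming Python 3; the LaTeX fragment is a made-up sample:

```python
import re

# A raw string keeps the backslash: two characters, "\" and "n".
raw_len = len(r"\n")
plain_len = len("\n")

# Both spellings produce the same pattern text.
same_pattern = ("\\\\section" == r"\\section")

# The pattern \\section matches a literal backslash followed by "section".
m = re.search(r"\\section", r"\section{Introduction}")
found = m.group()

print(raw_len, plain_len, same_pattern, found)
```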

Performing matches

Once you have an object representing a compiled regular expression, what do you do with it? RegexObject instances have several methods and attributes. Only the most significant ones are covered here; consult the Python Library Reference for a complete listing.

Method/Attribute  Purpose
match()           Determine if the RE matches at the beginning of the string
search()          Scan through a string, looking for any location where this RE matches
findall()         Find all substrings where the RE matches, and return them as a list
finditer()        Find all substrings where the RE matches, and return them as an iterator

match() and search() return None if no match can be found. If they are successful, a MatchObject instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.






You can learn about this by experimenting interactively with the re module. If you have Tkinter available, you may also want to look at Tools/scripts/redemo.py, a demonstration program included with the Python distribution.

First, run the Python interpreter, import the re module, and compile a RE:





#!python
Python 2.2.2 (#1, Feb 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>


Now you can try matching various strings against the RE [a-z]+. An empty string should not match at all, since + means "one or more repetitions". match() returns None in this case, which causes the interpreter to print no output. You can explicitly print the result of match() to make this clear.

#!python
>>> p.match("")
>>> print p.match("")
None

Now, let's try it on a string that it should match, such as "tempo". In this case, match() returns a MatchObject, so you should store the result in a variable for later use.

#!python
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>


Now you can query the MatchObject for information about the matching string. MatchObject instances also have several methods and attributes; the most important ones are:

Method/Attribute  Purpose
group()           Return the string matched by the RE
start()           Return the starting position of the match
end()             Return the ending position of the match
span()            Return a tuple containing the (start, end) positions of the match

Trying these methods out will soon clarify their meaning:

#!python
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)

group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks whether the RE matches at the start of a string, start() will always be zero. However, the search() method of RegexObject instances scans through the string, so the match may not start at zero in that case.


#!python
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)


In actual programs, the most common style is to store the MatchObject in a variable, and then check whether it is None. This usually looks like:

#!python
p = re.compile( ... )
m = p.match('string goes here')
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'


Two RegexObject methods return all of the matches for a pattern. findall() returns a list of matching strings:

#!python
>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']

findall() has to create the entire list before it can return it as the result. In Python 2.2 and later, you can also use the finditer() method, which returns the matches as an iterator.

#!python
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)


Module-level functions

You do not have to create a RegexObject and call its methods; the re module also provides top-level functions called match(), search(), sub(), and so forth. These functions take the RE string as their first argument, followed by the same arguments as the corresponding RegexObject method, and still return either None or a MatchObject instance.

#!python
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 19:12:10 1998')
<re.MatchObject instance at 80c5978>

Under the hood, these functions simply create a RegexObject for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE are faster.




Should you use these module-level functions, or should you get the RegexObject and call its methods yourself? That choice depends on how frequently the RE will be used, and on your personal coding style. If a RE is only used at one point in the code, then the module-level functions are probably more convenient. If a program contains a lot of regular expressions, or reuses the same ones in several locations, then it may be worthwhile to collect all the definitions in one place, in a section of code that compiles all the REs ahead of time. To take an example from the standard library, here is an extract from xmllib.py:

#!python
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )

I generally prefer to work with the compiled object, even for one-time uses, but few people will be as much of a purist about this as I am.
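
The two calling styles can be put side by side in a minimal sketch, assuming Python 3; from_re is a name chosen only for this illustration:

```python
import re

line = "From amk Thu May 19:12:10 1998"

# Module-level function: the pattern string is the first argument.
m1 = re.match(r"From\s+", line)

# Compiled object: the same match, with the pattern compiled once up front.
from_re = re.compile(r"From\s+")
m2 = from_re.match(line)

same = (m1.span() == m2.span())
no_match = re.match(r"From\s+", "Fromage amk")

print(m1.span(), same, no_match)
```

Either style gives the same result; the compiled form mainly pays off when the same RE is used in several places.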



Compilation flags

Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE, and a short, one-letter form such as I. (If you are familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.

Here is a table of the available flags, followed by a more detailed explanation of each one.

Flag           Meaning
DOTALL, S      Make . match any character, including newlines
IGNORECASE, I  Do case-insensitive matches
LOCALE, L      Do a locale-aware match
MULTILINE, M   Multi-line matching, affecting ^ and $
VERBOSE, X     Enable verbose REs, which can be organized more cleanly and understandably
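
As a minimal sketch of combining flags, assuming Python 3 and a made-up two-line string, the following compares re.I | re.M with re.I alone:

```python
import re

text = "first line\nSECOND line"

# re.I | re.M sets both flags at once.
both = re.findall(r"^[a-z]+", text, re.I | re.M)

# Only re.I: ^ matches just at the very start of the string.
only_i = re.findall(r"^[a-z]+", text, re.I)

# The short and long names refer to the same flag.
same_flag = (re.I == re.IGNORECASE)

print(both, only_i, same_flag)
```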


I
IGNORECASE

Performs case-insensitive matching; character classes and literal strings match letters by ignoring case. For example, [A-Z] will match lowercase letters too, and Spam will match "Spam", "spam", or "spAM". This lowercasing does not take the current locale into account.
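
A short sketch of this flag, assuming Python 3; the candidate strings are made up:

```python
import re

p = re.compile(r"spam", re.IGNORECASE)
hits = [s for s in ["Spam", "spam", "spAM", "ham"] if p.match(s)]

# With the flag set, [A-Z] also matches lowercase letters.
letters = re.findall(r"[A-Z]", "aB", re.IGNORECASE)

print(hits, letters)
```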



L
LOCALE

Makes \w, \W, \b, and \B dependent on the current locale.

Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you are processing French text, you would want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it will not match "é" or "ç". If your system is configured properly and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression causes the resulting compiled object to use these C functions for \w; this is slower, but it also lets \w+ match French words as you would expect.



M
MULTILINE

(^ and $ have not been explained yet; they are introduced in section 4.1.)

Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).
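
The effect of this flag on ^ and $ can be seen in a minimal sketch, assuming Python 3 and a made-up three-line string:

```python
import re

text = "one\ntwo\nthree\n"

starts_default = re.findall(r"^\w+", text)
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)

# Without the flag, $ still matches just before the final newline.
ends_default = re.findall(r"\w+$", text)
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)

print(starts_default, starts_multi, ends_default, ends_multi)
```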



S
DOTALL

Makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.
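
A minimal sketch of the difference, assuming Python 3:

```python
import re

text = "a\nb"

# By default "." refuses to match the newline between "a" and "b".
default = re.match(r"a.b", text)

# With DOTALL the newline is matched like any other character.
dotall = re.match(r"a.b", text, re.DOTALL)

print(default, dotall.group() == text)
```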



X
VERBOSE

This flag lets you write regular expressions that are more readable by giving you more flexibility in how you format them. When this flag is specified, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by a backslash; this lets you organize and indent the RE more clearly. It also lets you put comments within a RE that will be ignored by the engine; comments are marked by a "#" that is neither in a character class nor preceded by a backslash.

For example, here is a RE that uses re.VERBOSE; see how much easier it is to read?


#!python
charref = re.compile(r"""
 &[#]                 # Start of a numeric entity reference
 (
     [0-9]+[^0-9]     # Decimal form
   | 0[0-7]+[^0-7]    # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)


Without the verbose setting, the RE would look like this:

#!python
charref = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")

In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it is still more difficult to understand than the version using re.VERBOSE.



More pattern features

So far we have only covered part of the features of regular expressions. In this section, we will cover some new metacharacters, and how to use groups to retrieve portions of the text that was matched.

More metacharacters

There are some metacharacters that we have not covered yet. Most of them are covered in this section.

Some of the remaining metacharacters to be discussed are zero-width assertions. They do not cause the engine to advance through the string; instead, they consume no characters at all, and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position is not changed by \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.



|

Alternation, or the "or" operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence so that it works reasonably when you are alternating multi-character strings. Crow|Servo will match either "Crow" or "Servo", not "Cro", a "w" or an "S", and "ervo".

To match a literal "|", use \|, or enclose it inside a character class, as in [|].
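
The low precedence of | can be checked with a minimal sketch, assuming Python 3.4 or later for fullmatch(); the candidate strings are made up, and Cro[wS]ervo is shown only as the contrasting reading that | does not have:

```python
import re

candidates = ["Crow", "Servo", "CroServo", "Crowervo"]

# | separates the two whole words.
alternation = re.compile(r"Crow|Servo")
whole_words = [s for s in candidates if alternation.fullmatch(s)]

# The other reading would need a character class (or a group).
char_choice = re.compile(r"Cro[wS]ervo")
single_char = [s for s in candidates if char_choice.fullmatch(s)]

# Two ways to match a literal "|".
escaped = re.findall(r"\|", "a|b")
in_class = re.findall(r"[|]", "a|b")

print(whole_words, single_char, escaped, in_class)
```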



^

Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.

For example, if you wish to match the word "From" only at the beginning of a line, the RE to use is ^From.

#!python
>>> print re.search('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.search('^From', 'Reciting From Memory')
None


$

Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

#!python
>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>

To match a literal "$", use \$ or enclose it inside a character class, as in [$].



\A

Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, however, they are different; \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.

\Z

Matches only at the end of the string.
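
The difference between these anchors and ^ in MULTILINE mode can be sketched as follows, assuming Python 3 and a made-up two-line string:

```python
import re

text = "From a\nFrom b"

# In MULTILINE mode ^ matches at the start of every line...
caret = re.findall(r"^From", text, re.MULTILINE)
# ...while \A still matches only at the very start of the string.
start_only = re.findall(r"\AFrom", text, re.MULTILINE)

# \Z matches only at the very end of the string, whatever the flags.
end_z = re.search(r"b\Z", text)
not_end = re.search(r"a\Z", text, re.MULTILINE)

print(caret, start_only, end_z is not None, not_end)
```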



\b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

The following example matches "class" only when it is a complete word; it will not match when it is contained inside another word.

#!python
>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None


There are two subtleties you should remember when using this special sequence. First, this is the worst collision between Python's string literals and regular expression sequences. In Python's string literals, "\b" is the backspace character, ASCII value 8. If you are not using raw strings, Python will convert the "\b" to a backspace, and your RE will not match as you expect it to. The following example looks the same as our previous RE, but omits the "r" in front of the RE string.

#!python
>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')
<re.MatchObject instance at 80c3ee0>

Second, inside a character class, where there is no use for this assertion, \b represents the backspace character, for compatibility with Python's string literals.

\B

Another zero-width assertion, this is the opposite of \b, matching only when the current position is not at a word boundary.
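
As a minimal sketch of \B, assuming Python 3, the pattern below is the mirror image of the \bclass\b example and matches "class" only when it is buried inside a longer word:

```python
import re

p = re.compile(r"\Bclass\B")

inside = p.search("the declassified algorithm")
alone = p.search("no class at all")
at_end = p.search("one subclass is")  # "class" is followed by a boundary here

print(inside.span(), alone, at_end)
```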



Grouping

Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups that match different components of interest. For example, an RFC-822 header line is divided into a header name and a value. This can be handled by writing a regular expression that matches an entire header line, with one group that matches the header name and another group that matches the header's value.

Groups are marked by the "(" and ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them. You can repeat the contents of a group with a repeating qualifier such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".

#!python
>>> p = re.compile('(ab)*')
>>> print p.match('ababababab').span()
(0, 10)

Groups indicated with "(" and ")" also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it is the whole RE, so MatchObject methods all have group 0 as their default argument. Later we will see how to express groups that do not capture the span of text that they match.


#!python
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

#!python
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can be passed multiple group numbers at a time, in which case it returns a tuple containing the corresponding values for those groups.

#!python
>>> m.group(2,1,2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

#!python
>>> m.groups()
('abc', 'b')
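
The header example mentioned at the start of this section could be sketched as follows. This assumes Python 3; the pattern is a simplified, hypothetical one written for illustration, not a complete RFC-822 parser:

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is the value.
header = re.compile(r"([\w-]+):\s*(.*)")

m = header.match("Content-Type: text/plain")
name, value = m.group(1, 2)

print(name, value, m.groups())
```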


Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python's string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.

For example, the following RE detects doubled words in a string.

#!python
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this are not often useful for just searching through a string (there are few text formats that repeat data in this way), but you will soon find that they are very useful when performing string substitutions.



Non-capturing and named groups

Elaborate REs may use many groups, both to capture substrings of interest and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers. There are two features that help with this problem. Both of them use a common syntax for regular expression extensions, so we will look at that first.

Perl 5 added several additional features to standard regular expressions, and Python's re module supports most of them. It would have been difficult to choose new single-keystroke metacharacters or new special sequences beginning with "\" to represent the new features without making Perl's regular expressions confusingly different from standard REs. If you chose "&" as a new metacharacter, for example, old expressions would be assuming that "&" was a regular character and would not have escaped it by writing \& or [&].




The solution chosen by the Perl developers was to use (?...) as the extension syntax. A "?" immediately after a parenthesis was a syntax error, because the "?" would have nothing to repeat, so this did not introduce any compatibility problems. The characters immediately after the "?" indicate which extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).




Python has added an extended syntax to the Perl extension syntax. If the first character after the question mark is "P", you can see that it is an extension of Python. There are currently two such extensions: (? P<name> ...) Define a named group, (? P=name) is a reverse reference to a named group. If the future version of Perl 5 adds the same functionality with different syntax, the RE module will also change to support the new syntax, a Python-specific syntax that is maintained for compatibility purposes.




Now let's look at the normal extension syntax, and we'll go back and simplify the features that work with groups in complex REs. Because groups are numbered from left to right, and a complex expression may use many groups, it can make it difficult to track the current group number, and modifying such a complex re is cumbersome. When you start by inserting a new group, you can change each group number after it.




First, sometimes you'll want to use a group to collect part of a regular expression, but aren't interested in retrieving the group's contents. You can make this explicit by using a non-capturing group: (?:...), where any other regular expression can go inside the parentheses.


#!python
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()


Except for the fact that you can't retrieve the contents of what the group matched, a non-capturing group behaves exactly like a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as "*", and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing pattern, because you can add new groups without changing how all the other groups are numbered. There is no difference in search performance between capturing and non-capturing groups; neither form is faster than the other.
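
The numbering point can be checked with a small sketch. The date patterns and the sample string below are made up for illustration; they are not from the original text.

```python
import re

# Two capturing groups: group 1 is the year, group 2 is the day.
date = re.compile(r'(\d{4})-\d{2}-(\d{2})')

# The month now needs alternation, so it is wrapped in a group.
# Making that group non-capturing leaves the numbering unchanged.
date_nc = re.compile(r'(\d{4})-(?:0?\d|1[0-2])-(\d{2})')

m1 = date.match('2024-03-15')
m2 = date_nc.match('2024-03-15')
print(m1.groups())   # ('2024', '15')
print(m2.groups())   # ('2024', '15')
```

Code that reads group 2 as the day keeps working after the change, which would not be the case had the month been wrapped in ordinary parentheses.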




Second, and more significant, is the named group: instead of referring to a group by number, you can refer to it by name.




The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Apart from having a name, a named group behaves exactly like a capturing group. The 'MatchObject' methods that deal with capturing groups all accept either an integer, to refer to a group by number, or a string containing the group name. Named groups are still given numbers, so you can retrieve information about a group in two ways:


#!python
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search('(((( Lots of punctuation )))')
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
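
As a further sketch of the two access paths, the following uses a made-up two-group pattern (the names first and last are illustrative). groupdict() and groupindex are standard parts of the re module that expose the name-to-value and name-to-number mappings.

```python
import re

p = re.compile(r'(?P<first>\w+)\s+(?P<last>\w+)')
m = p.search('Jane Doe')

print(m.group('first'))      # Jane  (by name)
print(m.group(2))            # Doe   (by number)
print(m.groupdict())         # {'first': 'Jane', 'last': 'Doe'}
print(p.groupindex['last'])  # 2     (number assigned to the named group)
```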


Named groups are handy because they let you use easily remembered names instead of having to remember numbers. Here's an example RE from the imaplib module:


#!python
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')


It's obviously much easier to retrieve m.group('zonem') than to remember to retrieve group 9.
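
A short sketch of that comparison, using the imaplib-style pattern discussed above. The input string is a hypothetical sample written for this example, not real server output.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical sample input, made up for this sketch.
m = InternalDate.match('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

print(m.group('zonem'), m.group(9))     # 00 00  (same group, two spellings)
print(m.group('mon'), m.group('year'))  # Jul 1996
```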




The syntax for backreferences in an expression such as (...)\1 refers to the group by number, so there is naturally a variant that uses the group name instead. This is another Python extension: (?P=name) indicates that the contents of the group called name should be matched again at the current point. The regular expression for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):


#!python
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
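
The same idea can be wrapped in a small helper. This is only a sketch: the function name find_doubled_words is hypothetical, and the trailing \b is an addition not present in the pattern above, included so that text such as "the theater" is not reported as a doubled word.

```python
import re

# Named backreference, plus a trailing \b (an addition for this sketch).
doubled = re.compile(r'(?P<word>\b\w+)\s+(?P=word)\b')

def find_doubled_words(text):
    """Return each word that is immediately repeated in text."""
    return [m.group('word') for m in doubled.finditer(text)]

print(find_doubled_words('Paris in the the spring, and and so on'))
# ['the', 'and']
print(find_doubled_words('in the theater'))
# []
```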


Lookahead Assertions



Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:



(?=...)



Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., matches at the current location, and fails otherwise. But once the contained expression has been tried, the matching engine doesn't advance at all; the rest of the pattern is tried right where the assertion started.



(?!...)



Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression does not match at the current position in the string.
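
A minimal sketch of the zero-width behaviour, using made-up strings: the assertion looks at "bar" but the match itself ends before it.

```python
import re

# Positive lookahead: "bar" must follow, but it is not consumed.
m = re.match(r'foo(?=bar)', 'foobar')
print(m.group())   # foo
print(m.end())     # 3

# Negative lookahead: succeeds only when "bar" does not follow.
blocked = re.match(r'foo(?!bar)', 'foobar')
allowed = re.match(r'foo(?!bar)', 'foobaz')
print(blocked)           # None
print(allowed.group())   # foo
```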




To make this concrete, let's look at a case where a lookahead is useful. Consider a simple pattern to match a filename and split it at the "." into a base name and an extension. For example, in "news.rc", "news" is the base name and "rc" is the filename's extension.




The matching pattern is simple:


.*[.].*$


Notice that the "." needs special treatment because it is a metacharacter; I've put it inside a character class. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension. This regular expression matches "foo.bar", "autoexec.bat", "sendmail.cf", and "printers.conf".




Now, consider complicating the problem a bit: what if you want to match filenames whose extension is not "bat"? Some incorrect attempts:


.*[.][^b].*$


The first attempt above tries to exclude "bat" by requiring that the first character of the extension is not a "b". This is wrong, because the pattern also fails to match "foo.bar".


.*[.]([^b]..|.[^a].|..[^t])$


The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension isn't "b"; the second character isn't "a"; or the third character isn't "t". This accepts "foo.bar" and rejects "autoexec.bat", but it requires a three-letter extension and won't accept a filename with a two-letter extension such as "sendmail.cf". We'll complicate the pattern again in an effort to fix it.


.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$


In the third attempt, the second and third letters are all made optional in order to allow matching extensions shorter than three characters, such as "sendmail.cf".




The pattern is getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both "bat" and "exe" as extensions, the pattern would get even more complicated and confusing.




A negative lookahead cuts through all this confusion:


.*[.](?!bat$).*$


The negative lookahead means: if the expression bat$ doesn't match at this point, try the rest of the pattern; if bat$ does match, the whole pattern fails. The trailing $ is required to ensure that a name such as "sample.batch", whose extension merely starts with "bat", is still allowed.




Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either "bat" or "exe":


.*[.](?!bat$|exe$).*$
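
The behaviour of the attempts discussed above can be compared with a short sketch. The dictionary labels and the results variable are illustrative names; the filenames are the ones used in the discussion.

```python
import re

# Two of the failed attempts and the lookahead version.
patterns = {
    'first attempt':  r'.*[.][^b].*$',
    'second attempt': r'.*[.]([^b]..|.[^a].|..[^t])$',
    'lookahead':      r'.*[.](?!bat$).*$',
}
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch']

results = {}
for label, pattern in patterns.items():
    # Keep the filenames each pattern accepts.
    results[label] = [n for n in names if re.match(pattern, n)]
    print(label, results[label])
# first attempt ['sendmail.cf']
# second attempt ['foo.bar']
# lookahead ['foo.bar', 'sendmail.cf', 'sample.batch']
```

Only the lookahead version rejects "autoexec.bat" while accepting all three of the other names.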


Modifying Strings



Up to this point, we've simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following 'RegexObject' methods:


Method/Attribute    Purpose
split()             Split the string into a list, splitting it wherever the RE matches
sub()               Find all substrings where the RE matches, and replace them with a different string
subn()              Does the same thing as sub(), but returns the new string and the number of replacements


Splitting Strings



The split() method of a 'RegexObject' splits a string apart wherever the RE matches, returning a list of the pieces. It's similar to the split() method of strings but is much more general in the delimiters it can split on; the string split() only supports splitting on whitespace or on a fixed string. As you'd expect, there's a module-level re.split() function, too.


split(string [, maxsplit = 0])


Split string by the matches of the regular expression. If capturing parentheses are used in the RE, their contents are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.




You can limit the number of splits made by passing a value for maxsplit. When maxsplit is nonzero, at most maxsplit splits are made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters.


#!python
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']


Sometimes you're not only interested in the text between delimiters, but also need to know what the delimiter was. If capturing parentheses are used in the RE, their values are also returned as part of the list. Compare the following calls:


#!python
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']


The module-level function re.split() takes the RE as its first argument, but is otherwise the same.


#!python
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
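
To illustrate the difference from the string method, here is a brief sketch with a made-up input line that mixes several delimiters.

```python
import re

line = 'alpha, beta;gamma   delta'

# str.split() handles one fixed delimiter (or runs of whitespace).
by_string = line.split(', ')
print(by_string)    # ['alpha', 'beta;gamma   delta']

# re.split() accepts a pattern that describes every delimiter at once.
by_pattern = re.split(r'[,;\s]+', line)
print(by_pattern)   # ['alpha', 'beta', 'gamma', 'delta']
```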


Search and Replace



Another common task is to find all the matches for a pattern and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.


sub(replacement, string[, count = 0])


Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string with replacement. If the pattern isn't found, string is returned unchanged.




The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means that all occurrences are replaced.




Here's a simple example of using the sub() method. It replaces colour names with the word "colour":


#!python
>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'


The subn() method does the same work, but returns a 2-tuple containing the new string and the number of replacements that were performed:


#!python
>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)


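Empty matches are replaced only when they're not adjacent to a previous match. (Since Python 3.7, an empty match is skipped only when it is adjacent to a previous empty match, so newer versions return '-a-b--d-' for the example below.)


#!python
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'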


If replacement is a string, any backslash escapes in it are processed. That is, "\n" is converted to a single newline character, "\r" is converted to a carriage return, and so forth. Unknown escapes such as "\&" are left alone (since Python 3.7, an unknown escape made of "\" and an ASCII letter, such as "\j", is an error). Backreferences, such as "\6", are replaced with the substring matched by the corresponding group in the RE. This lets you incorporate portions of the original text in the resulting replacement string.




This example matches the word "section" followed by a string enclosed in "{" and "}", and changes "section" to "subsection":


#!python
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First} section{second}')
'subsection{First} subsection{second}'


There's also a syntax for referring to named groups defined with the (?P<name>...) syntax. "\g<name>" uses the substring matched by the group named "name", and "\g<number>" uses the corresponding group number. "\g<2>" is therefore equivalent to "\2", but isn't ambiguous in a replacement string such as "\g<2>0". ("\20" would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character "0".) The following substitutions are all equivalent, but use the three variations of the replacement string.


#!python
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}', 'section{First}')
'subsection{First}'
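
The ambiguity between "\g<2>0" and "\20" can be seen in a small sketch. The two-group pattern and the variable raised are made up for illustration.

```python
import re

p = re.compile(r'(\w)(\w)')

# \g<2>0 means "group 2, then a literal 0".
print(p.sub(r'\g<2>0', 'ab'))   # b0

# \20 is read as a reference to group 20, which this pattern lacks,
# so the re module rejects the replacement string.
raised = False
try:
    p.sub(r'\20', 'ab')
except re.error as exc:
    raised = True
    print('error:', exc)
```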


replacement can also be a function, which gives you even more control. If replacement is a function, the function is called for every non-overlapping occurrence of the pattern. On each call, the function is passed a 'MatchObject' for the match and can use this information to compute the desired replacement string and return it.




In the following example, the replacement function translates decimal numbers into hexadecimal:


#!python
>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
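
A replacement function can also combine several groups and reformat them, which a plain replacement string cannot do. The sketch below is illustrative only: the date pattern, the function name to_iso, and the sample sentence are all made up.

```python
import re

date = re.compile(r'(?P<d>\d{1,2})/(?P<m>\d{1,2})/(?P<y>\d{4})')

def to_iso(match):
    """Rewrite a d/m/yyyy match as zero-padded yyyy-mm-dd."""
    return '%04d-%02d-%02d' % (int(match.group('y')),
                               int(match.group('m')),
                               int(match.group('d')))

print(date.sub(to_iso, 'Due 5/3/2024, renewed 17/11/2025.'))
# Due 2024-03-05, renewed 2025-11-17.
```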


When using the module-level re.sub() function, the pattern is passed as the first argument. The pattern may be a string or a 'RegexObject'; if you need to specify regular expression flags, you must either use a 'RegexObject' as the first argument, or use embedded modifiers in the pattern. For example, sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.
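
Both ways of supplying the flag can be sketched side by side; the variable names are illustrative.

```python
import re

# Embedded modifier inside the pattern string.
inline = re.sub('(?i)b+', 'x', 'bbbb BBBB')
print(inline)     # x x

# Equivalent: compile with the flag, then pass the compiled object.
p = re.compile('b+', re.IGNORECASE)
compiled = re.sub(p, 'x', 'bbbb BBBB')
print(compiled)   # x x
```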



Common Problems



Regular expressions are a powerful tool for some applications, but in some ways their behaviour isn't intuitive, and at times they don't behave the way you expect. This section points out some of the most common pitfalls.



Use String Methods



Sometimes using the re module is a mistake. If you're matching a fixed string or a single character class, and you're not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be needed. Strings have several methods for performing operations with fixed strings, and they're usually much faster, because the implementation is a single small C loop optimized for the purpose, instead of the large, more general regular expression engine.




One example is replacing one fixed string with another; for instance, you might replace "word" with "deed". re.sub() seems like the function to use for this, but consider the replace() method. Note that replace() will also replace "word" inside words, turning "swordfish" into "sdeedfish", but the naive RE word would have done that, too. (To avoid performing the substitution on parts of words, the pattern would have to be \bword\b, in order to require a word boundary on either side of "word". This takes the job beyond replace()'s abilities.)
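
A short sketch of the three cases, using a made-up sentence:

```python
import re

s = 'a word about swordfish'

# The string method and the naive RE give the same result.
print(s.replace('word', 'deed'))        # a deed about sdeedfish
print(re.sub('word', 'deed', s))        # a deed about sdeedfish

# Only the word-boundary pattern leaves "swordfish" alone.
print(re.sub(r'\bword\b', 'deed', s))   # a deed about swordfish
```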




Another common task is deleting every occurrence of a single character from a string, or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but translate() can do both tasks and will be faster than any regular expression operation.
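
A minimal sketch of translate(), assuming Python 3, where str.maketrans() builds the translation table (older versions build the table differently). The sample string is made up.

```python
s = 'one\ntwo\tthree\n'

# Map newline to a space and delete tabs (None means "remove").
table = str.maketrans({'\n': ' ', '\t': None})
cleaned = s.translate(table)

print(repr(cleaned))   # 'one twothree '
```

One table handles both the replacement and the deletion in a single pass, where the RE approach would need two substitutions.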




In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.



match() versus search()



The match() function only checks whether the RE matches at the beginning of the string, while search() scans forward through the string for a match. It's important to keep this distinction in mind. Remember, match() only reports a successful match that starts at position 0; if the match wouldn't start at 0, match() will not report it.


#!python
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None


search(), on the other hand, scans forward through the string and reports the first match it finds.


#!python
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)


Sometimes you'll be tempted to keep using re.match() and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the search for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a "C". The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match once a "C" is found.



Adding .* defeats this optimization: the engine has to scan to the end of the string and then backtrack to find a match for the rest of the RE. Use re.search() instead.
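
Apart from speed, the two approaches also report different spans, as this small sketch shows (the variable names are illustrative):

```python
import re

s = 'insuperable'

m1 = re.match('.*super', s)   # the leading .* is part of the match
m2 = re.search('super', s)    # the match is just "super"

print(m1.span())   # (0, 7)
print(m2.span())   # (2, 7)
```

With search(), the span tells you where "super" actually begins; with the .* prefix, that information is lost unless you add a group.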



Greedy versus Non-Greedy



When repeating a regular expression, as in a*, the resulting action is to consume as much of the string as possible. This fact often bites you when you're trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag doesn't work because of the greedy nature of .*.


#!python
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>


The RE matches the "<" in "<html>", and the .* consumes the rest of the string. There's still more left in the RE, though, and the > can't match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the >. The final match extends from the "<" in "<html>" to the ">" in "</title>", which isn't what you want.



In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the above example, the ">" is tried immediately after the first "<" matches, and when it fails, the engine advances one character at a time, retrying the ">" at every step. This produces just the right result:


#!python
>>> print(re.match('<.*?>', s).group())
<html>


(Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you've written a regular expression that handles all of the possible cases, the pattern will be very complicated. Use an HTML or XML parser module for such tasks.)
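
As a rough sketch of the parser route, assuming Python 3 and its standard html.parser module; the class name TagCollector is made up for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once per opening tag.
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
print(collector.tags)   # ['html', 'head', 'title']
```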



Not Using re.VERBOSE



By now you've probably noticed that regular expressions are a very compact notation, but they're not terribly readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.




For such REs, specifying the re.VERBOSE flag when compiling the regular expression can be helpful, because it allows you to format the regular expression more clearly.




The re.VERBOSE flag has several effects. Whitespace in the regular expression that isn't inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] will still match the characters "a", "b", or a space. In addition, you can put comments inside a RE; a comment extends from a "#" character to the next newline. When used with triple-quoted strings, this lets REs be formatted more neatly:


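#!python
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)


This is far more readable than:


#!python
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")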
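
The whitespace rules described above can be checked with a short sketch. The variable names are illustrative.

```python
import re

# With re.VERBOSE, the spaces around "|" are ignored.
spaced = re.compile(r'dog | cat', re.VERBOSE)
print(spaced.match('cat').group())   # cat

# Without the flag, the spaces are literal: "dog " or " cat".
plain = re.match(r'dog | cat', 'cat')
print(plain)                         # None

# Inside a character class, the space is kept even with re.VERBOSE.
cls = re.compile(r'[a b]+', re.VERBOSE)
print(repr(cls.match('a b').group()))   # 'a b'
```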


Feedback



Regular expressions are a complicated topic. Did this document help you understand them? Were there parts that were unclear, or problems you encountered that weren't covered here? If so, please send suggestions for improvements to the author.



The most complete book on regular expressions is almost certainly Jeffrey Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately, it concentrates exclusively on the Perl and Java flavours of regular expressions and contains no Python material at all, so it is of limited use as a reference for programming in Python. (The first edition covered Python's now-obsolete regex module, which won't help you much.)



The third edition of Mastering Regular Expressions does include some material on using regular expressions in Python, and the PHP flavour gets a chapter of its own. --why




