Regular Expressions (regexp)
A regular expression is, in a sense, the highest level of string manipulation, not because its syntax is complex but because it is so flexible. To see why, it helps to understand what a regular expression is: however complicated it looks, it is itself just a string, and its purpose is to describe the rules that other strings follow. That may sound abstract, but most people have already used something similar: wildcards in DOS commands. To list all PDF documents in a directory you type dir *.pdf, where * is a wildcard that stands for any string, so the command lists every file whose name is any string followed by the ".pdf" suffix. Some find-and-replace and formatting rules in Word and Excel documents use the same kind of simple wildcard. A regular expression is the same idea taken further: a more expressive string-matching rule for cases where simple wildcards are not enough. On Linux, regular expressions are indispensable; many commands are built on them, and without such efficient string-matching logic users would have a miserable time.
Before discussing Python's regular expression module, regular expressions themselves need to be discussed clearly. With a solid understanding of regular expressions, using them from Python becomes much easier. Do not expect to understand them in one pass, though; even a genius needs hands-on practice to really learn them. I do not deny that the world has plenty of geniuses, but I still recommend going through regular expressions at least three times. The goal of this part is not to turn you into a regular expression master; mastery does not come from reading a textbook. The goal is to get you started: to know how to think about a problem and, when you need help, how to look up the relevant information. I look things up as I go every time I use regular expressions myself, since I rarely work on this kind of task. Fortunately there is a surprising amount of material online, enough to take a beginner all the way to expert. The introduction to regular expressions starts below.
In Python, the regular expression module is re, and the discussion below uses it for all tests. Online regular expression testers can also be used, such as the JavaScript regular expression test site http://regexpal.com/.
Character
Characters in a regular expression are the same as the characters in a string. Most characters match themselves; the exceptions are the metacharacters. Metacharacters can be thought of as the keywords of regular expression syntax: they have special meanings, some standing for special matches and others for grammatical rules.
Basic use of re is simple: specify a regular expression, compile it, and match it against a string. The re.match method returns a match object if the string matches, or None if it does not. For example, the regular expression 'abc' matches the string 'abc' itself. In fact, apart from the metacharacters, every character matches itself.
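A minimal sketch of the compile-then-match flow described above. The variable names and the test strings are illustrative choices, not taken from the original article.

```python
import re

# Compile the regular expression into a pattern object.
pattern = re.compile(r'abc')

# match() only succeeds if the pattern matches at the start of the string.
hit = pattern.match('abcdef')
miss = pattern.match('xyzabc')

print(hit.group())  # abc
print(miss)         # None
```

Because a failed match returns None, real code normally checks the result before calling group() on it.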
Metacharacters
Regular expressions have many metacharacters; a full list can be found at http://msdn.microsoft.com/zh-cn/library/ae5bf541(v=vs.80).aspx.
The special characters among the metacharacters are: . ^ $ * + ? { } [ ] \ | ( )
In regular expression syntax they carry other meanings and functions; some are used alone and some in combination. Let's look first at one of the most commonly used, [], which represents a set of characters.
Character sets
A character set is written inside []. It has two functions. The first is to match a range: the set matches any single character that belongs to it. The second is to hold metacharacters as literals: most metacharacters lose their special meaning inside [] and become ordinary characters. The exceptions are ^, ], \ and -, because they are part of the character set syntax.
When ^ is placed at the start of [], it negates the set: any character not in the set matches. Placed anywhere else inside [], it matches itself, like the other metacharacters. As a simple example, [^abcd] matches any character except a, b, c and d.
Listing every character one by one is tedious, so a range can be written with -: [0-9] matches any digit from 0 to 9, and [a-zA-Z] matches any letter.
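The character set rules above can be tried out with a short sketch like the following. The sample text is made up for illustration.

```python
import re

text = 'a1-B2^c3'

# A range matches any single character inside it.
digits = re.findall(r'[0-9]', text)       # ['1', '2', '3']
letters = re.findall(r'[a-zA-Z]', text)   # ['a', 'B', 'c']

# A leading ^ negates the set: everything except a, b, c and d.
not_abcd = re.findall(r'[^abcd]', text)   # ['1', '-', 'B', '2', '^', '3']

# A ^ that is not in first position is an ordinary character.
a_or_caret = re.findall(r'[a^]', text)    # ['a', '^']
```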
The backslash \ has two functions in regular expressions: the first is escaping, the second is introducing special matches.
Escape character
As discussed above, metacharacters do not match themselves, so how can they be matched? Put a \ in front of them to escape them. This is the same as the backslashes in file paths inside strings, which also need escaping. Every character with a special meaning in regular expressions must be escaped to match itself. As with paths, the string can be prefixed with r to make it a raw string. To match \ itself, an ordinary string needs the cumbersome '\\\\', while a raw string can express it directly as r'\\'.
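A small sketch of the escaping rules, assuming a made-up Windows-style path as the test string. re.escape is a standard helper in the re module that escapes a literal string for use in a pattern.

```python
import re

path = r'C:\temp\a.b'   # contains two backslashes and one dot

# Matching a single backslash: four backslashes in a normal string,
# two in a raw string. Both patterns are identical once parsed.
plain = re.findall('\\\\', path)
raw = re.findall(r'\\', path)

# An escaped dot matches only a literal dot, not "any character".
dots = re.findall(r'\.', path)

# re.escape adds the backslashes for you.
escaped = re.escape('a.b')

print(len(plain), len(raw), dots, escaped)  # 2 2 ['.'] a\.b
```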
Special matches
Special matches include sequences that start with \ and some other special characters. The commonly used ones are:
******************************************************************************
x|y    Matches x or y (alternation). For example, 'z|food' matches "z" or "food"; '(z|f)ood' matches "zood" or "food".
\b     Matches a word boundary (the start or end of a word). For example, 'er\b' matches the "er" in "never" but not the "er" in "verb"; '\ber\b' matches "er" only when it stands alone as a word.
\B     Matches a non-word boundary, the opposite of \b. 'er\B' matches the "er" in "verb" but not the "er" in "never".
\d     Matches a digit. Equivalent to [0-9] for ASCII text.
\D     Matches a non-digit. Equivalent to [^0-9] for ASCII text.
\n     Matches a line break. Equivalent to \x0a (written \cJ in some other flavors, such as JavaScript).
\s     Matches any whitespace character, including spaces, tabs, page breaks, and so on. Equivalent to [ \f\n\r\t\v].
\S     Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t     Matches a tab. Equivalent to \x09 (written \cI in some other flavors, such as JavaScript).
\w     Matches any word character, including the underscore. Equivalent to [a-zA-Z0-9_] for ASCII text.
\W     Matches any non-word character. Equivalent to [^a-zA-Z0-9_] for ASCII text.
.      Matches any character except the newline \n. Equivalent to [^\n].
^      Outside a character set, matches the start of the string (*).
$      Outside a character set, matches the end of the string (*).
(*) ^ and $ are used for exact matching: they anchor the match to the beginning and the end. To match a whole string exactly, use the form ^pattern$; for example, to match exactly 123, write ^123$.
******************************************************************************
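Several rows of the table above can be checked with a short sketch. The sample strings are invented for illustration.

```python
import re

# \b and \B: "never verb" has "er" at a word end (index 3) and inside a word (index 7).
boundary_span = re.search(r'er\b', 'never verb').span()      # (3, 5)
non_boundary_span = re.search(r'er\B', 'never verb').span()  # (7, 9)

# Alternation, digits, non-digits, word characters, whitespace.
alternation = re.findall(r'z|food', 'zoo food')       # ['z', 'food']
digit_runs = re.findall(r'\d+', 'tel 555 1234')       # ['555', '1234']
non_digit_runs = re.findall(r'\D+', 'a1b22c')         # ['a', 'b', 'c']
words = re.findall(r'\w+', 'foo_bar baz-1')           # ['foo_bar', 'baz', '1']
fields = re.split(r'\s+', 'a b\tc\nd')                # ['a', 'b', 'c', 'd']

# ^pattern$ demands that the whole string matches.
exact = re.match(r'^123$', '123') is not None         # True
too_long = re.match(r'^123$', '1234') is not None     # False
```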
Repeat
So far, the basic elements of regular expressions have mostly been covered. One important basic element remains: repetition. Since regular expressions describe the rules of strings, and repetition is everywhere in strings, this matters a great deal. A telephone number or QQ number is a repetition of digits, a word is a repetition of letters, and there are many more combinations. The repetition syntax in regular expressions is:
*        Repeat 0 or more times, with no upper limit.
+        Repeat 1 or more times.
?        Repeat 0 or 1 times.
{n}      Repeat exactly n times.
{n,}     Repeat n or more times.
{n,m}    Repeat n to m times.
?        When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes "non-greedy". A non-greedy match takes the shortest possible string, while the default greedy match takes the longest possible string.
One last point: by default regular expressions are greedy, that is, they match as long a string as possible. For example, against the string "aaaa", 'a+?' matches only a single "a", while 'a+' matches all of them, "aaaa". Greediness only arises with repetition; a quantifier followed by ? becomes non-greedy and matches the shortest possible string.
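The greedy and non-greedy behavior can be sketched as follows. The "aaaa" case is the one from the text; the tag-like string is an extra illustration added here.

```python
import re

text = 'aaaa'
greedy = re.match(r'a+', text).group()    # 'aaaa': as long as possible
lazy = re.match(r'a+?', text).group()     # 'a': as short as possible

# The difference matters most when a delimiter appears more than once.
markup = '<b>one</b><b>two</b>'
greedy_tags = re.findall(r'<b>.*</b>', markup)   # one match spanning both tags
lazy_tags = re.findall(r'<b>.*?</b>', markup)    # one match per tag

print(greedy, lazy)
print(greedy_tags)
print(lazy_tags)
```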
Basic regular expressions with the re module
The basic regular expression syntax has now been covered; used flexibly, it handles most everyday matching tasks. This section walks through the re module using that basic syntax, and the advanced features of regular expressions are discussed afterwards.
As outlined earlier, matching a string with the re module takes roughly three steps:
- Use re.compile to get a compiled pattern object. It takes two parameters: the regular expression, which is required, and flags, which defaults to 0. The supported flags include:
re.S DOTALL: makes . match any character, including \n
re.I IGNORECASE: ignores case
re.L LOCALE: makes \w, \W, \b and \B follow the current locale (in Python 3, only for bytes patterns)
re.M MULTILINE: multiline mode; affects only ^ and $
re.X VERBOSE: verbose mode, which makes regular expressions more readable
- Use the methods of the pattern object to operate on the target string;
- If the pattern method returns a match object, read the matched string information from it.
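The three steps and the effect of the most common flags can be sketched like this. The sample text and variable names are illustrative assumptions.

```python
import re

text = 'First line\nsecond LINE'

# re.I: case is ignored.
ignore_case = re.compile(r'line', re.I).findall(text)      # ['line', 'LINE']

# re.S: without it, . stops at the newline; with it, . crosses lines.
no_dotall = re.compile(r'First.*LINE').match(text)         # None
dotall = re.compile(r'First.*LINE', re.S).match(text)      # match object

# re.M: ^ also matches right after each newline.
first_words = re.compile(r'^\w+').findall(text)            # ['First']
line_starts = re.compile(r'^\w+', re.M).findall(text)      # ['First', 'second']

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r'''
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
''', re.X)
found = number.search('pi is 3.14').group()                # '3.14'
```

Several flags can be combined with the | operator, for example re.I | re.M.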
The commonly used methods of the pattern object are match, search, sub and findall (or finditer). match matches from the beginning of the string; search scans the whole string for the first match; findall finds all matches in the string and returns them as a list. Note that these methods belong to the pattern object. There is a second set of methods that belong to the object returned by a successful match, that is, the match object.
The common pattern methods are described in turn below.
match(string[, pos[, endpos]]): matches at the start of the string, or at pos when a match interval is given. Keep the greediness of quantifiers in mind when using it.
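A minimal sketch of match with and without a match interval; the pattern and string are made up for illustration.

```python
import re

pattern = re.compile(r'\d+')
text = 'ab123cd'

at_start = pattern.match(text)          # None: the string starts with 'ab'
from_pos = pattern.match(text, 2)       # starts matching at index 2
limited = pattern.match(text, 2, 4)     # only indexes 2 and 3 are visible

print(at_start)            # None
print(from_pos.group())    # 123 (greedy: takes every digit it can)
print(limited.group())     # 12  (endpos cuts the greedy match short)
```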
search(string[, pos[, endpos]]): returns the first match found within the given interval. It is used the same way as match, except that the match does not have to start at the first position.
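For comparison with match, a short sketch of search on an invented string:

```python
import re

pattern = re.compile(r'\d+')
text = 'ab123cd45'

first = pattern.search(text)        # scans forward until something matches
later = pattern.search(text, 5)     # same scan, but starting at index 5
anchored = pattern.match(text)      # match() would fail on this string

print(first.group(), first.span())  # 123 (2, 5)
print(later.group(), later.span())  # 45 (7, 9)
print(anchored)                     # None
```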
sub(repl, string[, count=0]) -> new string: find and replace; every matched substring is replaced with repl. subn does the same but returns the new string and the number of substitutions as a tuple.
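A small sketch of sub and subn, using a whitespace pattern chosen for illustration:

```python
import re

pattern = re.compile(r'\s+')
text = 'a  b   c'

replaced = pattern.sub('-', text)              # every run of whitespace
replaced_once = pattern.sub('-', text, count=1)  # only the first run
with_count = pattern.subn('-', text)           # (new string, substitutions)

print(replaced)        # a-b-c
print(replaced_once)   # a-b   c
print(with_count)      # ('a-b-c', 2)
```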
findall(string[, pos[, endpos]]) -> list: finds all matches within the given interval and returns them as a list.
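A sketch of findall on an invented string. The last call uses the module-level re.findall with two groups, which is where tuples appear in the result.

```python
import re

pattern = re.compile(r'\d+')
text = '10 apples, 20 pears, 30 plums'

everything = pattern.findall(text)       # ['10', '20', '30']
after_first = pattern.findall(text, 3)   # ['20', '30']: search starts at index 3

# With more than one group in the pattern, each list item is a tuple of groups.
pairs = re.findall(r'(\d+) (\w+)', text)
print(pairs)  # [('10', 'apples'), ('20', 'pears'), ('30', 'plums')]
```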
finditer returns an iterator that yields a match object for each match.
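Unlike findall, finditer gives access to positions as well as text. A minimal sketch on a made-up string:

```python
import re

pattern = re.compile(r'\d+')
text = 'a1bb22ccc333'

# Each item produced by finditer is a full match object.
found = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(found)  # [('1', (1, 2)), ('22', (4, 6)), ('333', (9, 12))]
```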
The match object has its own common methods and properties.
Common methods: start, end, span and group return the start position, the end position, the (start, end) interval and the matched string of the matched region.
Common properties: pos, endpos, string and re hold the start position, the end position and the string that were passed to the pattern method, that is, to match(string[, pos[, endpos]]), together with the pattern object that produced the match.
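The methods and properties of the match object can be inspected with a sketch like this one; the pattern, string and interval are illustrative choices.

```python
import re

pattern = re.compile(r'\d+')
text = 'ab123cd'
m = pattern.search(text, 1, 6)

# Methods describe the region that matched.
print(m.start(), m.end(), m.span(), m.group())  # 2 5 (2, 5) 123

# Properties echo what was passed in, plus the pattern object itself.
print(m.pos, m.endpos)       # 1 6
print(m.string is text)      # True
print(m.re is pattern)       # True
```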
Advanced regular expressions
Grouping
Groups were introduced to solve the problem of repetition. To repeat a single character, put a quantifier directly after it; to repeat several characters, they must first be grouped. A regular expression can contain several groups, which can be called sub-expressions. Groups are very useful in practice. A common example online is matching an IP address. The crudest version is (\d{1,3}\.){3}\d{1,3}, which matches four numbers of one to three digits separated by dots; this is the simplest use of grouping. A real IP address pattern cannot be this simple, because each number must be in the range 0 to 255 and a regular expression cannot do arithmetic on numbers, which makes the match a little more involved. A commonly used pattern is easy to find, though: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
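The two IP address patterns can be compared with a sketch like the following. The ^ and $ anchors and the helper name octet are additions made here so that the whole string has to match; they are not part of the patterns quoted above.

```python
import re

# Crude version: four groups of one to three digits.
loose = re.compile(r'^(\d{1,3}\.){3}\d{1,3}$')

# Stricter version: each number is limited to 0..255.
octet = r'(2[0-4]\d|25[0-5]|[01]?\d\d?)'
strict = re.compile(r'^(' + octet + r'\.){3}' + octet + r'$')

loose_accepts_999 = loose.match('999.1.1.1') is not None     # True
strict_accepts_999 = strict.match('999.1.1.1') is not None   # False
strict_accepts_256 = strict.match('256.1.1.1') is not None   # False
strict_accepts_ok = strict.match('192.168.0.255') is not None  # True
```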
Python handles groups through the group() method of the match object. A simple use is to pull individual fields out of a line of inventory information.
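The original inventory example is not available, so the sketch below assumes a hypothetical record format ('id:...;name:...;qty:...') purely to show how group() is used.

```python
import re

# Hypothetical inventory line; the field layout is an assumption.
record = 'id:1001;name:widget;qty:25'

m = re.match(r'id:(.*);name:(.*);qty:(.*)', record)

print(m.group(0))   # the whole match: id:1001;name:widget;qty:25
print(m.group(1))   # 1001
print(m.group(2))   # widget
print(m.group(3))   # 25
print(m.groups())   # ('1001', 'widget', '25')
```

group(0), or group() with no argument, is always the whole match; the numbered groups start at 1.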
With groups, group(), start(), end() and span() all accept a group index and return the value for that group, so the matched text and its start and end positions are available for each group independently. The same holds for nested sub-expressions: groups are numbered by their opening parenthesis and each is matched by the same rules.
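A sketch of per-group positions with one nested group. The phone-number-like string is invented for illustration.

```python
import re

# Group 1 is the outer group; groups 2 and 3 are nested inside it.
m = re.match(r'((\d+)-(\d+))-(\d+)', '010-1234-5678')

print(m.group(1), m.span(1))            # 010-1234 (0, 8)
print(m.group(2), m.span(2))            # 010 (0, 3)
print(m.group(3), m.start(3), m.end(3)) # 1234 4 8
print(m.group(4), m.span(4))            # 5678 (9, 13)
```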
Grouping is usually used to pull values in a specific format out of a string. When there are many elements, or their order is not fixed, a dictionary is a better fit than a list. Can a group be fetched by key instead of by index? It can. Regular expression groups can be named, which is convenient when reading the final values and even more useful when referring to a group from inside the regular expression. The inventory example can be rewritten with named groups.
Index mode: named groups are still numbered, so they can also be read by index.
In the named version, (?P<id>.*) replaces (.*), so that after matching, the group can be fetched by keyword. This names the group. A complex regular expression may need to repeat something that was matched earlier, and group names make that clear and convenient. In addition, the ?P notation is a Python extension; the uppercase P stands for Python, and the other standard extension notations do not have this letter.
This extension syntax derives from Perl's extensions to regular expressions, which most regular expression engines have supported since Perl 5.0. Extensions are generally introduced with (?, which is what precedes <id> here. Because a ? directly after ( had no meaning before, the sequence can be used without any escaping. Python takes the same approach and adds an uppercase P to mark its own extensions, which avoids the hassle of escaping.
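A sketch of named groups, again assuming a hypothetical inventory record format since the original example is not available. The group names id, name and qty are illustrative.

```python
import re

# Hypothetical inventory line; the field layout is an assumption.
record = 'id:1001;name:widget;qty:25'

m = re.match(r'id:(?P<id>.*);name:(?P<name>.*);qty:(?P<qty>.*)', record)

# Named mode: fetch by key, or get everything as a dictionary.
print(m.group('id'))    # 1001
print(m.groupdict())    # {'id': '1001', 'name': 'widget', 'qty': '25'}

# Index mode still works on the same match.
print(m.group(1))       # 1001
```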
Non-capturing groups
Compared with groups and group names, non-capturing groups are easy. They match in the same way and with the same efficiency as ordinary groups, and the sub-expression can still be repeated. A non-capturing group is used when the content matched by the group is of no interest. If nothing is captured, what is the point? Generally, the value of a non-capturing group lies not in itself but in its effect on the other groups, rather like a placeholder: it does not take up a group index, so when the expression changes, a group can be switched between capturing and non-capturing without renumbering the others. Python marks a non-capturing group with ?:, as in (?:...).
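A minimal sketch of the difference, using a date-like string chosen for illustration:

```python
import re

text = '2024-05-17'

# All three parts captured.
capturing = re.match(r'(\d{4})-(\d{2})-(\d{2})', text)
# The middle part is still matched, but no longer captured or numbered.
non_capturing = re.match(r'(\d{4})-(?:\d{2})-(\d{2})', text)

print(capturing.groups())       # ('2024', '05', '17')
print(non_capturing.groups())   # ('2024', '17')

# A non-capturing group can be repeated like any other group.
repeated = re.match(r'(?:ab)+', 'ababab')
print(repeated.group(), repeated.groups())  # ababab ()
```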
Backreferences
The earlier examples used groups to extract values from strings, but as mentioned, groups originally exist to handle repetition. A backreference matches again the content that an earlier group has already matched. Groups can be referenced by index or by name. Names are preferable: they take slightly more typing, but they are clear and less error-prone, especially in regular expressions with complicated logic. The classic example is matching a word that is repeated twice in a row.
That match can be written with either an index or a group name. Note the format of a named backreference: (?P=name).
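The doubled-word match can be sketched in both styles. The sample sentence and the group name word are illustrative choices.

```python
import re

text = 'this is is a test test'

# By index: \1 repeats whatever group 1 matched.
by_index = re.search(r'\b(\w+)\s+\1\b', text)
# By name: (?P=word) repeats whatever the group named "word" matched.
by_name = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', text)

print(by_index.group(), by_index.group(1))      # is is / is
print(by_name.group(), by_name.group('word'))   # is is / is

# findall returns the captured word for every doubled word.
doubled = re.findall(r'\b(\w+)\s+\1\b', text)
print(doubled)  # ['is', 'test']
```

The \b anchors keep the pattern from matching the "is" inside "this".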
Zero-width assertions
A zero-width assertion is, taken literally, an assertion. An assertion in a regular expression is a kind of marker; the ^, $ and \b seen earlier, which mark positions, are assertions too. A zero-width assertion matches a position in the string rather than any text, and essentially all assertions in regular expressions match positions. Put simply, an assertion looks for a position that satisfies certain conditions. Zero-width assertions fall into four cases, according to the direction of the match and whether it is positive or negative:
Positive lookahead:
(?=exp) asserts that the text to the right of the current position matches exp.
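A small sketch of positive lookahead on an invented string: it finds word stems that are followed by "ing" without including "ing" in the match.

```python
import re

stems = re.findall(r'\w+(?=ing\b)', 'reading writing code')
print(stems)  # ['read', 'writ']
```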
Negative lookahead:
(?!exp) asserts that the text to the right of the current position does not match exp.
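A small sketch of negative lookahead on an invented string: it keeps whole numbers that are not followed by a percent sign.

```python
import re

plain_numbers = re.findall(r'\b\d+\b(?!%)', '10% of 200 is 20')
print(plain_numbers)  # ['200', '20']
```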
Positive lookbehind:
(?<=exp) asserts that the text to the left of the current position matches exp. Note that the text matched by any lookbehind must have a fixed length!
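A sketch of positive lookbehind and of the fixed-length restriction in Python's re module. The sample string and variable names are illustrative.

```python
import re

# Numbers that are preceded by a dollar sign; the $ is not part of the match.
prices = re.findall(r'(?<=\$)\d+', 'cost $15, was $20, qty 3')
print(prices)  # ['15', '20']

# A lookbehind whose length can vary is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=a+)b')
    variable_width_rejected = False
except re.error:
    variable_width_rejected = True
print(variable_width_rejected)  # True
```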
Negative lookbehind:
(?<!exp) asserts that the text to the left of the current position does not match exp.
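A small sketch of negative lookbehind on an invented string: it keeps numbers that are not preceded by a dollar sign. The \b stops the match from starting in the middle of a number.

```python
import re

quantities = re.findall(r'(?<!\$)\b\d+', 'cost $15, was $20, qty 3')
print(quantities)  # ['3']
```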
Balancing groups and recursion
Balancing groups and recursion are among the more complicated features of regular expressions, and Python does not seem to support them. As far as I know, the .NET Framework provides this support; some languages do not support it at all, and others use a different syntax. I have not found it in Python. They are mostly used to match complex nested text such as HTML, and the problem can generally be solved by combining groups instead. This is slightly more advanced material and is not discussed here for now.
Python Basics: Windows Automation (9): Regular Expressions