Learning Regular Expressions (1): A Summary of the Basic Rules

I have been using regular expressions in a hit-and-run fashion for a long time, so it is time to write up a summary. Since my abilities are limited, I will keep it simple and cover only the matching of ASCII-encoded characters and what is relevant to PHP. Unicode and the other regex flavors can be studied when they come up; what is here should be enough for everyday use.

First, accept the concept of a regular expression: it is a rule used to search text. Simple text retrieval, such as the string lookup function strpos, only finds whether a fixed run of characters appears. As soon as the requirement changes slightly, for example finding 5 characters where the first two are digits and the last three are letters, ignoring case, plain string functions become very cumbersome, while a regular expression makes it extremely simple. Regular expressions are powerful and flexible, and, as usual with flexible things, they come with quite a few rules. In everyday work we already use things that resemble regular expressions without noticing: searching for all text files on the F: drive with *.txt; the regex-based search supported by editors such as Sublime Text; Linux commands such as find . -name '*.log' -print, which finds the files with the extension .log under the current directory, or cat 2015-10-18-log | grep 'show_list', which prints the log lines that record the show_list interface; and the Apache rewrite module that controls access URLs. Some of these (such as the *.txt wildcard) only resemble regular expressions, while others use them directly.
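As a minimal sketch of the requirement described above (two digits followed by three letters, case ignored), the following compares strpos with preg_match. The sample text and variable names are made up for illustration.

```php
<?php
// Hypothetical sample text, used only for illustration.
$text = 'order id: 42AbC shipped';

// strpos only answers "where does this exact substring start?"
$fixedPos = strpos($text, '42AbC');

// The pattern describes a shape instead: 2 digits, then 3 letters,
// with the i modifier making the letters case-insensitive.
$found = preg_match('/\d\d[a-z]{3}/i', $text, $m);

var_dump($fixedPos, $found, $m[0]);
```

The strpos call only works when the exact characters are known in advance; the pattern works for any text of that shape.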

In PHP there are three engines that support regular expressions: preg, ereg, and mb_ereg. An "engine" here simply means the underlying library and interface that PHP uses for regex matching; different libraries have different names. preg is the one in common use today, and it is stronger than the other two in both features and speed. The preg engine (or suite), short for "Perl Regular Expression", arose because an expert developer was dissatisfied with the performance of the ereg suite and wanted a better library, so he looked at the source code of Perl's regex handling. Before him, another expert had run into the same problem: he studied Perl's regex source, found it cumbersome and complex, and wrote his own Perl-compatible library, PCRE (Perl Compatible Regular Expressions), which was clearly written, efficient, and well documented. The first developer then ported it into PHP, where it became preg and has been improved steadily ever since. So preg is compatible with PCRE, PCRE is compatible with Perl's regular expressions, and preg is a close relative of Perl regex. So much for history.

An example of a simple regular expression: '/abc/' matches the three characters abc in a string and is exactly equivalent to an ordinary string lookup. A regular expression is itself a string, except that certain symbols in it carry special meaning; this string is generally called the pattern.

1. Delimiter

When defining a regular expression, there must first be delimiters, which mark where the expression starts and where it ends. The most common is /, as in '/abc/', which matches abc: the pattern starts at a and ends at c. In PHP you can also use others, such as #, !, or {} (opening with { and closing with }); it is all a matter of personal habit.
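A short sketch of why the choice of delimiter can matter: when the text to match contains /, a different delimiter avoids escaping. The subject string is an assumption chosen for illustration.

```php
<?php
$subject = 'path: /usr/local/bin';

// With / as the delimiter, every literal slash has to be escaped.
$withSlash = preg_match('/\/usr\/local/', $subject);

// With # or {} as delimiters, the slashes need no escaping.
$withHash  = preg_match('#/usr/local#', $subject);
$withBrace = preg_match('{/usr/local}', $subject);

var_dump($withSlash, $withHash, $withBrace);
```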

2. Atoms

The atom is the most basic unit of a regular expression. Atoms are subdivided into 5 classes, which I will record together with their functions.

First, the most ordinary characters can be used as atoms, such as a, b, c, 1, 2, _ and so on:

    '/9527/'
    '/mission failed/'
    '/php_version/'

Some special characters and metacharacters can also be atoms. Any symbol can be used in a pattern as long as it does not conflict with a symbol that has its own special meaning. As said before, regular expressions are flexible and powerful, and that power comes from these special symbols (what they do is covered below). If the pattern needs to match one of these symbols literally, it must be escaped with a backslash, which works just like escaping inside a double-quoted string. These special symbols include ., *, ?, +, ', ", \, / and so on. In fact single and double quotes do not always need escaping; it depends on whether you write the pattern as a single-quoted or a double-quoted string. In PHP, to avoid mistakes, patterns are generally written as single-quoted strings. The pattern also must not conflict with its own delimiter, or the engine will think the expression ends early and report an error.

    '/a\.b\?c\+/'
    '/ab\/123/'
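When the text to match comes from a variable, escaping by hand is error-prone. PHP's preg_quote() is one way to do it; the sketch below uses made-up sample strings to show the difference between active and escaped metacharacters.

```php
<?php
// A literal string that happens to contain metacharacters.
$literal = 'a.b?c+';

// preg_quote() backslash-escapes regex metacharacters; the second
// argument names the delimiter so that it is escaped as well.
$escaped = preg_quote($literal, '/');

// Unescaped: . ? + keep their special meaning, so 'axbc' matches.
$unescapedHit = preg_match('/a.b?c+/', 'axbc');

// Escaped: only the literal text a.b?c+ matches.
$escapedMiss = preg_match('/' . $escaped . '/', 'axbc');
$escapedHit  = preg_match('/' . $escaped . '/', 'see a.b?c+ here');

var_dump($escaped, $unescapedHit, $escapedMiss, $escapedHit);
```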

Non-printable characters are also atoms. Non-printable characters here means whitespace characters: space, horizontal tab \t, vertical tab \v, carriage return \r, newline \n, and so on. Note that bugs hide easily here; in PHP, use single-quoted strings to reduce mistakes.

    '/\r\n/'

So far each atom matches one specific character: 9527 is a 9 followed by a 5, then a 2, then a 7. Now suppose the match is not limited to specific digits: I want to match two digits, and any digits will do. For such classes of characters, regex provides generic character types. \d represents a decimal digit, 0 to 9, and \D (uppercase) represents any character that is not a decimal digit. \w matches a word character; in the PHP flavor a word character is an uppercase letter A-Z, a lowercase letter a-z, a digit 0-9, or the underscore _, and the uppercase form \W means the opposite. Likewise \s matches a whitespace character and \S a non-whitespace character.

    '/\d\d/'    // match two digits
    '/\D/'      // match any character except a digit
    '/\w/'      // match a word character
    '/\W/'      // match any character except a word character
    '/\s/'      // match a whitespace character
    '/\S/'      // match a non-whitespace character
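To see which characters each generic type picks up, here is a small sketch using preg_match_all on a made-up subject string.

```php
<?php
$subject = 'id_7 x';

preg_match_all('/\d/', $subject, $digits);     // decimal digits
preg_match_all('/\w/', $subject, $wordChars);  // letters, digits, underscore
preg_match_all('/\s/', $subject, $spaces);     // whitespace
preg_match_all('/\W/', $subject, $nonWord);    // everything \w does not match

var_dump($digits[0], $wordChars[0], $spaces[0], $nonWord[0]);
```

Note that the underscore counts as a word character, while the space is the only character here that \W matches.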

3. Meta-Characters

Metacharacters are characters with special meanings, such as *, +, ?, ., |, ^, $ and so on. Most of them cannot stand alone; they show their special meaning only when modifying the atom (or atoms) in front of them, so they make sense only in combination with the atoms described above. Because a metacharacter has a special meaning, matching it literally requires the escape character \, which turns it into an ordinary character.

?: quantifier, zero or one (present or absent)

Suppose we want to match the word for color, which has two spellings, colour and color: the u in the middle appears either once or not at all. The metacharacter ? fits exactly. Without parentheses, a quantifier metacharacter applies only to the single atom immediately in front of it; here ? acts only on u.

    '/colou?r/'

+: quantifier, one or more

For example, to match one or more digits. Note that the atom a metacharacter modifies may be a whole sequence, but without parentheses the quantifier applies only to the one atom in front of it.

    '/\d+/'

*: quantifier, zero or more, i.e. any number

    '/\d*/'

Interval: specifies the number of repetitions

The previous quantifiers are rather rigid; the interval is more flexible. An interval is written with a pair of braces {}: {n} means exactly n times, {n,} means at least n times, and {m,n} means at least m times and at most n times. m must not be greater than n; do not deliberately embarrass the engine ~

    '/auth{0,1}/'    // h appears 0 to 1 times, i.e. ?
    '/auth{1,}/'     // h appears at least 1 time, i.e. +
    '/auth{3}/'      // h appears exactly 3 times
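The quantifiers can be compared side by side by looking at the text each one actually matches. This is a small sketch; the subject strings are made up for illustration.

```php
<?php
// Each call stores the matched text in index 0 of the third argument.
preg_match('/auth{0,1}/', 'authhh', $optional);  // at most one h
preg_match('/auth?/', 'authhh', $question);      // ? is the same as {0,1}
preg_match('/auth{1,}/', 'authhh', $atLeastOne); // as many h as available
preg_match('/auth*/', 'aut', $star);             // zero h is acceptable

// {3} needs exactly three h, so a single h is not enough.
$exactlyThree = preg_match('/auth{3}/', 'auth');

var_dump($optional[0], $question[0], $atLeastOne[0], $star[0], $exactlyThree);
```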

Match any character

., the dot, matches any character. In PHP, by default, it matches any single character except a line break; in another mode it really matches any character, including the line break. More on that later.

Alternation (multi-select structure)

|, meaning "or", creates alternative branches. Note that | has the lowest precedence in a regular expression. In the example below it does not act on just the r and the c next to it, but on the entire subexpressions to its left and right. Even in an expression such as '/\d+\s*abc{2,5}|ack?\d/', it still acts on the whole subexpressions before and after it, \d+\s*abc{2,5} and ack?\d, as long as no parentheses limit it.

    '/color|colour/'    //  match color or colour
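The low precedence of | can be checked with the longer pattern mentioned above. This is a sketch with made-up subjects: each subject has to satisfy one complete branch, not just the pieces next to the | sign.

```php
<?php
$pattern = '/\d+\s*abc{2,5}|ack?\d/';

// Left branch: digits, optional whitespace, "ab", then 2 to 5 c.
preg_match($pattern, '12 abccc', $left);

// Right branch: "ac", an optional k, then one digit.
preg_match($pattern, 'ac7', $right);

// "abcc" satisfies neither branch as a whole: the left one needs
// leading digits and the right one needs "ac" followed by a digit.
$neither = preg_match($pattern, 'abcc');

var_dump($left[0], $right[0], $neither);
```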

Character groups

A character group matches one of several characters. To match any one of the characters a, b, c, write [abc], wrapped in a pair of square brackets. To match any character from uppercase A to uppercase Z, write [A-Z]; the hyphen in the middle is a metacharacter with a special meaning inside a character group, meaning "from ... to". Likewise all digits are [0-9], and you can take just part of the digits or letters: [2-5], [c-h]. A few points to note:

1. The hyphen - is treated as a metacharacter only inside a character group and between two characters; outside a character group it is an ordinary character;

2. If the character group really needs to match -, it is best to put it first in the group, such as [-a-z], which matches a lowercase letter or -. In PHP it can also be placed last, such as [c-k-] (surprisingly no error -_-#), but putting it first is the safest;

3. For a range whose endpoints are in an order the engine does not accept, such as [a-9] or [9-0], PHP reports a "Compilation failed" warning;

4. Ranges in a character group are written from small to large, not [9-0]. Upper and lower case are generally written separately, [a-zA-Z]. In PHP, at least in my version 5.5, they can be written mixed as [A-z] to match the letters from A to Z and a to z, but I do not recommend it, and other regex flavors may well not support it. Note also that [A-z] follows ASCII order, so it additionally matches the six characters that sit between Z and a ([, \, ], ^, _ and `).

    '/c[efimt]o/'
    '/[a-zA-Z0-9_]/'    // equivalent to \w, a word character
    '/[a-z]+/'          // match one or more lowercase letters
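The points about the hyphen and about mixed ranges can be checked with a few one-line matches. This is a sketch; the subjects are chosen only for illustration.

```php
<?php
// A hyphen placed first or last in the group is a literal hyphen.
$hyphenFirst = preg_match('/[-a-z]/', '-');
$hyphenLast  = preg_match('/[c-k-]/', '-');

// One of e, f, i, m, t between c and o.
$oneOf = preg_match('/c[efimt]o/', 'cio');

// [A-z] is an ASCII range, so it also covers ^ (which lies between
// Z and a), while [a-zA-Z] covers letters only.
$mixedRange  = preg_match('/[A-z]/', '^');
$lettersOnly = preg_match('/[a-zA-Z]/', '^');

var_dump($hyphenFirst, $hyphenLast, $oneOf, $mixedRange, $lettersOnly);
```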

Excluded character groups

The preceding groups match any one of the characters listed in []. Now the opposite: match any one character that is not listed in []. For example, [^a-z] matches any character that is not a lowercase letter; a ^ placed first inside the character group indicates negation. An interesting example:

    '/abc[^abc]/'
    'abck'    // string 1: match?
    'abc'     // string 2: match?

String 1 obviously matches, but does string 2? Note that a negated character group still has to match a character, namely one that is not in the list. Here abc must be followed by some character other than a, b, or c, but string 2 has nothing after abc, so it does not match. This is the pitfall of negated character groups.
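The pitfall can be verified directly. The third case in this sketch is an addition for illustration: a newline is also "a character that is not a, b, or c", so it satisfies the negated group.

```php
<?php
$pattern = '/abc[^abc]/';

$withOther = preg_match($pattern, 'abck');    // k is not a, b, or c
$withNone  = preg_match($pattern, 'abc');     // nothing left to match
$withLf    = preg_match($pattern, "abc\n");   // the newline is a character too

var_dump($withOther, $withNone, $withLf);
```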

Word boundaries

Sometimes we want to match a complete word. The usual impression of a word is that it has a space on each side, and there is a metacharacter for exactly this boundary: \b. Note that it does not match a space but a position, so a standalone 'abc' also counts as a word even though it has no spaces around it (more on matching positions later). In regex, at least for now, the definition of a word is not that sophisticated: it is just uppercase and lowercase letters, digits, and underscores. The opposite of the word boundary is \B, which matches any position that is not a word boundary.

    '/\b\w+\b/'      // match a word
    'Hello World'    // matches Hello (preg_match matches only once)
    'abc'            // matches abc
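A brief sketch of \b and \B in action; preg_match_all is used here to show every word rather than only the first one.

```php
<?php
// preg_match stops after the first word ...
preg_match('/\b\w+\b/', 'Hello World', $first);

// ... while preg_match_all collects all of them.
preg_match_all('/\b\w+\b/', 'Hello World', $all);

// "ell" sits inside "Hello": there is no boundary on either side of it,
// so \B succeeds there and \b does not.
$insideWord = preg_match('/\Bell\B/', 'Hello');
$wholeWord  = preg_match('/\bell\b/', 'Hello');

var_dump($first[0], $all[0], $insideWord, $wholeWord);
```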

Parentheses for scoping

The first powerful feature of parentheses is limiting the scope of metacharacters by dividing the expression into subexpressions. In '/col(ou)?r/', for example, the parentheses make the metacharacter act on ou and no longer just on u. '/abc|def/' matches abc or def, but '/a(bc|de)f/' matches abcf or adef. As with the parentheses operator in programming languages, what is inside parentheses is a small unit, equivalent to one large atom. Parentheses also have another important role, grouping, which is covered later.

The beginning and end of a line

Every line of a string has a beginning and an end, and the beginning and the end are positions. Although when describing a string we say it "ends with ...", in regex the end does not refer to a specific character; it matches a position. For example, '/^a.+/' matches a string starting with a. ^ represents the beginning and $ the end. Here are a few notable examples:

    '/^/'                 // matches the beginning of a line; every line has one, even an empty string, so this is meaningless on its own
    '/^$/'                // a line beginning immediately followed by the line end, i.e. an empty line
    '/^hello$/'           // a line that begins with hello and then ends, i.e. a line containing only hello
    '/^hello.*hello$/'    // a line that begins with hello and ends with hello, with any number of characters in between

Note: by default, the caret ^ and the dollar sign $ represent the beginning and end of the whole string. For "hello", the end is just to the right of the o. Then what about "hello\nabc" (in a PHP double-quoted string, \n is an escape sequence that represents a newline character, whereas in a single-quoted string \n is just two ordinary characters, a \ and an n): is it one line or two? The answer is that in preg, by default, it is still treated as one line, which ends just to the right of the c. This can be verified as follows:

    <?php
    $pattern = '/hello$/';      // ends with hello
    $subject = "hello\nabc";    // $ matches the end of the string, not the end of a logical line
    preg_match($pattern, $subject, $match);
    echo 'match=><pre>';
    var_dump($match);           // no match

We call "abc\n" one logical line and "abc\ndef\n" two logical lines, because logically it contains two lines. By default, however, preg does not match multiple logical lines (probably a pitfall): no matter how many newline characters there are, it treats the whole thing as a single line, and $ matches at the last position of the string. Can PHP regex match multiple logical lines? Of course; that requires a pattern modifier, which is covered later.

Besides ^ for the beginning and $ for the end, in preg \A also represents the beginning of the string, and \Z and \z both represent the end. The difference is that \Z can also match just before a final newline character, while \z matches only at the very end.
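The difference between $, \Z and \z shows up only when the string ends with a newline. A minimal sketch with made-up subjects:

```php
<?php
$subject = "hello\n";

// $ and \Z also accept the position just before a final newline.
$dollar = preg_match('/hello$/', $subject);
$bigZ   = preg_match('/hello\Z/', $subject);

// \z accepts only the very end of the string.
$smallZ = preg_match('/hello\z/', $subject);

// A newline in the middle of the string is not an "end" by default.
$middle = preg_match('/hello$/', "hello\nabc");

var_dump($dollar, $bigZ, $smallZ, $middle);
```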

3. Grouping and capturing

A simple example: match a string that begins with a single or double quotation mark and ends with the corresponding single or double quotation mark (assume there are no escaped quotes in between).

    '/["\'].*["\']/'

Written this way, if the first character group matches a double quotation mark, there is no guarantee that the second character group matches a double quotation mark too. The requirement is that whatever the first character group matched, the later one must match the same character; merely writing the same expression twice is not enough.

The first major function of parentheses is limiting the scope of metacharacters; the other is capturing groups, a distinctive feature of regex. For example, '/ab(cd)ef(gh)ij/' has two pairs of parentheses. The preg engine records the text matched inside each pair, numbering them by the position of the opening parenthesis from left to right: \1 is the content of the first opening parenthesis and \2 the content of the second. The nth opening parenthesis corresponds to \n (as far as I know, preg records up to 4,096 of them). Here \1 corresponds to cd and \2 to gh. Parentheses thus act as groups, and \1, \2 ... \n are called backreferences. Note: a backreference refers to the text that was matched, not to the regular expression.

    $pattern = '/(\w)(\d)(.*)/';    // a word character, a decimal digit, then any characters
    $subject = 'a57h';
    preg_match($pattern, $subject, $match);
    echo 'match=><pre>';
    var_dump($match);

Result: $match holds 'a57h' at index 0, 'a' at index 1, '5' at index 2, and '7h' at index 3.

preg_match places the captured text in the $match parameter: the elements at array indexes 1, 2, and 3 are the texts captured by groups \1, \2, and \3 (index 0 holds the text matched by the entire expression).

For nested parentheses, look only at the order of the opening parentheses. In '/(abc(def)g(hij))(k)/' there are 4 capturing opening parentheses: \1 corresponds to everything enclosed by the parenthesis that opens before abc (abcdefghij), including what the nested parentheses match; \2 corresponds to def, \3 to hij, and \4 to k.

So the quotation mark example above can be written as '/(["\']).*\1/': the closing position must hold the same character that the first group matched (note: the same matched text, not the same pattern).
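The two versions of the quotation-mark pattern can be compared on a subject with mismatched quotes. This is a sketch; the subjects are made up for illustration.

```php
<?php
$naive   = '/["\'].*["\']/';    // any quote ... any quote
$backref = '/(["\']).*\1/';     // a quote ... the same quote again

$mismatched = "say \"hi' now";  // opens with " but closes with '

$naiveHit   = preg_match($naive, $mismatched);    // accepts the mismatch
$backrefHit = preg_match($backref, $mismatched);  // rejects it

// With matching quotes the backreference version succeeds, and
// group 1 holds the quote character that was actually used.
preg_match($backref, 'say "hi" now', $m);

var_dump($naiveHit, $backrefHit, $m);
```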

Then a new variation arises: sometimes we want a group to capture, but sometimes we only want to group and limit scope without capturing. For that there is ?:, as in (?:...): add ?: right after the opening parenthesis and text capture is turned off for that group. So for '/(?:abc(de)fg(?:hij))/', \1 captures de and there is no \2.
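The numbering rules for nested and non-capturing groups can be confirmed by inspecting the match array. A minimal sketch, with subjects chosen so that every group matches:

```php
<?php
// Four capturing groups, numbered by their opening parenthesis.
preg_match('/(abc(def)g(hij))(k)/', 'abcdefghijk', $nested);

// Only (de) captures; the two (?:...) groups take no number.
preg_match('/(?:abc(de)fg(?:hij))/', 'abcdefghij', $nonCapturing);

var_dump($nested, $nonCapturing);
```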

In addition to numbered captures (\ followed by a number that identifies the captured content), you can also use named captures, which give the captured text a name. The syntax is (?P<name>...): add ?P<name> right after the opening parenthesis, where name is the name you choose. Some books also mention (?P=name), which PCRE documents as a backreference to the named group; in my own test on 5.5 it did not work.

    $pattern = '/\w(?P<key1>\w\w)\s+(?P<key2>\d+)/';
    $subject = 'abcd 233';
    preg_match($pattern, $subject, $match);
    echo 'match=><pre>';
    var_dump($match);

Result: $match holds 'bcd 233' at index 0, 'cd' under both key1 and index 1, and '233' under both key2 and index 2.

With named captures in PHP, the original numbered captures are recorded as well: key1 corresponds to \1 and key2 to \2. One extra copy does no harm.
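As a hedged sketch of the (?P=name) syntax: PCRE documents it as a backreference to a named group, so on a PCRE build that supports this Python-style syntax the quotation-mark pattern can be written with a name instead of \1. Whether it behaves this way on the PHP 5.5 build the article was tested with is not verified here; the group name quote is made up for illustration.

```php
<?php
// (?P<quote>...) names the group; (?P=quote) refers back to its text.
$pattern = '/(?P<quote>["\']).*(?P=quote)/';

$found = preg_match($pattern, 'say "hi" now', $m);

// The captured quote is available under both the name and the number.
var_dump($found, $m);
```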

Another great use of capturing is referencing the matched text in a substitution. Take the expression '/123(\d\d\d)ok([a-z][a-z])end/': the parentheses capture three digits and two lowercase letters, and we want to extract them. What to do? For preg, use the simple replacement function preg_replace. An example:

    $pattern = '/123(\d\d\d)ok([a-z][a-z])end/';
    $subject = '123233okhiend';
    $replacement = '$1---$2';
    $ret = preg_replace($pattern, $replacement, $subject);
    echo 'ret=><pre>';
    var_dump($ret);

The parentheses capture 233 and hi. As backreferences inside the pattern these would be \1 and \2; in the replacement string $replacement, use $1 and $2 to refer to them. Note that \1 and \2 are used within the regular expression itself, whereas here we are in the replacement string. The replacement, of course, operates on a copy of the string.

Result: $ret is '233---hi'.

If you use named captures, referencing them with a corresponding $name does not seem to work in my tests, presumably to avoid conflicts with variables in the surrounding context. But as said above, the numeric indexes remain valid with named captures, so we can still refer to the matched text in the replacement string with $1, $2.

One thing to note: suppose I want the text of group 1 followed directly by a digit, as in the replacement text '$15'. This is error-prone, because it reads as a reference to group 15. To make the reference unambiguous, enclose the number in {}, as in '${1}5'. This still does not work for named captures.
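A small sketch of the ambiguity just described. The pattern and subject are made up for illustration; the goal is to turn '7x' into '75'.

```php
<?php
$subject = '7x';

// ${1}5 is read as "group 1, then a literal 5".
$explicit = preg_replace('/(\d)x/', '${1}5', $subject);

// $15 is read as a reference to group 15, which does not exist
// in this pattern, so the digit 7 is lost.
$ambiguous = preg_replace('/(\d)x/', '$15', $subject);

// Numeric references keep working when the group is named.
$named = preg_replace('/(?P<num>\d)x/', '${1}5', $subject);

var_dump($explicit, $ambiguous, $named);
```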

4. Lookaround

The line beginning and end discussed earlier match positions, and lookaround also matches positions. A perhaps more familiar name for this kind of position matching is the zero-width assertion. ^, $, \A, \Z, \b, \B all count as zero-width assertions, as do the lookaround constructs; the line beginning and end are also called anchors. An example: in a small project I worked on, using a framework built by someone else, each model class handles one table, and the table name is determined by the model's class name. A class generally looks like ExternalLinks extends Model {...}; the subclass name is ExternalLinks and its table name is gw_external_links. Leaving the gw_ prefix aside, going from ExternalLinks to external_links requires inserting a _ at each place that has a lowercase letter on the left and an uppercase letter on the right, and then converting everything to lowercase. Class names are not all like this one: AbcDefGhi, for instance, has 3 capitalized words, and the number of words is not fixed. That makes plain string replacement troublesome, but it is easy to solve with regex. Whatever is on the left and on the right of the place, the requirement is clearly about a position, so lookaround is worth trying. There are four types of lookaround:

Positive lookahead: (?=...), matches a position whose right side is ...

Positive lookbehind: (?<=...), matches a position whose left side is ...

Negative lookahead: (?!...), matches a position whose right side is not ...

Negative lookbehind: (?<!...), matches a position whose left side is not ...

Here ... stands for the expression that the lookaround tests. The difference between lookahead and lookbehind is that one looks from left to right and the other from right to left. Lookaround is useful for checking positions. For example, to match abc only when there is a digit to its right:

    '/abc(?=\d)/'

Back to the problem above: to find the positions that have a lowercase letter on the left and an uppercase letter on the right:

    '/(?=[A-Z])(?<=[a-z])/'

When both the left and the right need checking, as here, the order of the lookahead and the lookbehind in the pattern does not matter; either one may be written first (the pattern above is equivalent to '/(?<=[a-z])(?=[A-Z])/'). In general, though, put the lookbehind on the left and the lookahead on the right. A good habit is to center on the position and imagine what should be on its left and what on its right. Replacing the positions found by the pattern above with _ (actually an insertion) basically solves the problem. Another example:

    $pattern = '/\w+(?=\d)/';
    $subject = 'abcd 233';
    preg_match($pattern, $subject, $match);
    echo 'match=><pre>';
    var_dump($match);

Result: the matched text is '23'.

Why? The pattern matches one or more word characters that must have a digit immediately to their right. Only '23' qualifies: to its right is exactly a digit, the character before the 2 is a space, which is outside the \w range, and there is nothing after '233'.

Note: lookaround matches only a position, not actual text; when matching multiple times this sometimes makes a big difference.

The word boundary \b can also be expressed with lookaround. The beginning of a word is a position with a non-word character on the left and a word character on the right: (?<!\w)(?=\w). The end of a word has a word character on the left and a non-word character on the right: (?<=\w)(?!\w). Together: (?<!\w)(?=\w)|(?<=\w)(?!\w) (| takes the whole expressions on its left and right, since it has the lowest precedence).
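Returning to the table-name problem from this section, a minimal sketch of the whole conversion might look like this. The function name classToTable is hypothetical; the gw_ prefix and the class names come from the example in the text.

```php
<?php
// Hypothetical helper, not part of any framework.
function classToTable($className)
{
    // Insert _ at every position that has a lowercase letter on the
    // left and an uppercase letter on the right. The pattern matches
    // positions only, so no characters are removed.
    $snake = preg_replace('/(?<=[a-z])(?=[A-Z])/', '_', $className);

    // Lowercase the result and add the table prefix.
    return 'gw_' . strtolower($snake);
}

$links = classToTable('ExternalLinks');
$three = classToTable('AbcDefGhi');

var_dump($links, $three);
```

Because the pattern describes positions rather than characters, the same call works no matter how many capitalized words the class name contains.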

5. Possessive quantifiers vs lazy quantifiers

Standard quantifiers are greedy (match-first): *, +, ?, and the interval quantifier {m,n} always match as many characters as they can within their range; for {m,n} (m<n), they do not stop at m. For example:

    $pattern = '/\w+(?=\d)/';
    $subject = 'hello123';
    preg_match($pattern, $subject, $match);
    echo 'match=><pre>';
    var_dump($match);

Result: the matched text is 'hello12'.

In this example \w+ matches as many word characters as possible, so it first runs all the way to the end of the string, past the 3. The lookahead then checks whether a digit follows and finds nothing, so the engine gives back one character and checks again: now a digit (the 3) is on the right, so the match succeeds with hello12. This giving back is the backtracking process (to be written up next time), which I will not go into here. Some books call this greedy mode: the standard quantifiers are greedy and match as much as possible. Correspondingly, there is a non-greedy mode.

Add a question mark after the quantifier to get non-greedy mode, i.e. lazy (ignore-first) matching: *?, +?, ??, {m,n}?. Non-greedy quantifiers always match as few characters as possible. Take the example above again: what if $pattern is changed to the following?

    $pattern = '/\w+?(?=\d)/';    // lazy (ignore-first)

The match result is 'hello'. Here +? makes \w match one word character at a time, checking after each one whether a digit is on the right. If so, matching stops right there and the result is returned, which keeps the matched text as short as possible; if not, it continues, stopping as soon as a digit appears on the right. Lazy matching can return a result early, which of course may not be the result you want, and it can improve matching speed (summarized next time).

Is there a stubborn kind that never gives anything back? Of course: this is possessive matching. Add a + after a standard quantifier, as in *+, ++, ?+, {m,n}+. Change the example above once more, deliberately, to see the result -_-#

    $pattern = '/\w++(?=\d)/';    // possessive

Remember the defining trait of possessive matching: it never gives anything back! \w++ matches all the way to the end of hello123, finds nothing after it, and, disappointed, stops and reports failure. Yes, this time the match fails! Possessive matching is still greedy and matches as much as possible; the difference is that when a later condition fails, it refuses to backtrack and stops immediately. Possessive matching can therefore report failure early and improve matching speed.

Leaving aside the stubborn temper of possessive matching, here is one more thing: the atomic group. It is written (?>...) (it arguably belongs under grouping and capturing, but I think it fits better here). Like possessive matching, it matches greedily and does not give anything back. So changing $pattern above as follows also makes the match fail.

    $pattern = '/(?>\w+)(?=\d)/';    // atomic group
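The four variants from this section can be run side by side on the same subject. A minimal sketch; the variable names are chosen for illustration.

```php
<?php
$subject = 'hello123';

// Greedy: takes everything, then backtracks one character.
preg_match('/\w+(?=\d)/', $subject, $greedy);

// Lazy: stops at the first position that has a digit on its right.
preg_match('/\w+?(?=\d)/', $subject, $lazy);

// Possessive and atomic: take everything and refuse to backtrack,
// so the lookahead at the end of the string can never succeed.
$possessive = preg_match('/\w++(?=\d)/', $subject);
$atomic     = preg_match('/(?>\w+)(?=\d)/', $subject);

var_dump($greedy[0], $lazy[0], $possessive, $atomic);
```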

6. Pattern modifiers

A pattern modifier is placed at the end of the pattern, after the closing delimiter, as in '/....../imx'; i, m, and x are different modifiers, and more than one can be used at a time. Pattern modifiers adjust the behavior of the regular expression as a whole.

i: matching is case-insensitive; '/abc/i' matches abc in any combination of upper and lower case;

m: treats the string as multiple logical lines. Remember ^ and $ matching the beginning and end of a line? By default preg treats the string as a single line even if it contains newline characters. With the m modifier, lines are divided at the newline characters into what we called logical lines: ^ matches the beginning of each logical line and $ matches the end of each logical line;

s: remember the metacharacter . (dot) that matches any character? By default in preg the dot matches any character except a line break; with the s modifier the dot matches the newline character as well;

x: this modifier is very useful for complex regular expressions, because it lets you add comments inside the pattern!!! It also makes the engine ignore whitespace in the pattern, including line breaks. An example:

    // using the x modifier
    // (comments inside the pattern must not contain ' or the / delimiter)
    $pattern = '/(\w+ \d)
        (?:                 # what to match here
            "\w" | "\d"
            |               # how this should be read
            [-a-z]+         # guess this one yourself
        )/x';
    $subject = 'wcwieu2832z28';
    preg_match($pattern, $subject, $match);

e: useful only in one particular place, the preg_replace function. The prototype of preg_replace is as follows:

    mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject [, int $limit = -1 [, int &$count ]] )

The second parameter, $replacement, is what the matched text is replaced with. With the e modifier, $replacement can be more than simple text and more than references to captured text inside a quoted string: it can contain PHP code, written as a string (it feels a bit like eval). Example:

    // the /e modifier is deprecated as of PHP 5.5 and was removed in PHP 7
    $ret = preg_replace('/\d([A-Z]+)/e', 'strtolower("$1")', '5BBC');
    echo 'ret=><pre>';
    var_dump($ret);

But when I went to look at the result, an error appeared: "The /e modifier is deprecated, use preg_replace_callback instead". What?! The e modifier is deprecated, so use preg_replace_callback instead, which takes a callback function to process each match. That is obviously much better than putting code directly in a string; for how to use it, refer to the manual.
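As a sketch of the replacement the error message suggests, the same transformation can be written with preg_replace_callback and an anonymous function. The pattern and subject are assumptions that mirror the /e example; the callback receives the match array, with the captured group at index 1.

```php
<?php
$ret = preg_replace_callback(
    '/\d([A-Z]+)/',
    function ($m) {
        // $m[0] is the whole match, $m[1] the captured letters.
        return strtolower($m[1]);
    },
    '5BBC'
);

var_dump($ret);
```

The whole match, including the leading digit, is replaced by the callback's return value.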

D: if this modifier is set, $ in the pattern matches only at the very end of the string (EOS, end of string); without it, $ also matches just before a final newline. If the m modifier is set, this option is ignored;

U: inverts greediness, swapping * and *?: quantifiers that were greedy become lazy, and those that were lazy become greedy. I do not see much use for it, other than making things more painful.
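The effect of several modifiers can be observed by running the same pattern with and without each of them. A minimal sketch; the subject strings are made up for illustration.

```php
<?php
$text = "Hello\nabc\n";

// i: case is ignored.
$caseless = preg_match('/hello/i', $text);

// m: ^ and $ apply to each logical line instead of the whole string.
$oneLine   = preg_match('/^abc$/', $text);
$multiLine = preg_match('/^abc$/m', $text);

// s: the dot also matches the newline between Hello and abc.
$dotPlain = preg_match('/Hello.abc/', $text);
$dotAll   = preg_match('/Hello.abc/s', $text);

// D: $ no longer accepts the position before the final newline.
$dollar     = preg_match('/abc$/', $text);
$dollarOnly = preg_match('/abc$/D', $text);

// U: greedy and lazy quantifiers swap roles.
preg_match('/\d+/U', '12345', $nowLazy);
preg_match('/\d+?/U', '12345', $nowGreedy);

var_dump($caseless, $oneLine, $multiLine, $dotPlain, $dotAll);
var_dump($dollar, $dollarOnly, $nowLazy[0], $nowGreedy[0]);
```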

Beyond the above there are more powerful features, such as recursive patterns, but good grief, let us leave those for now.

If I forget any of this, I will come back and read it again >>> (runs off)

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.