Explain what regular expressions are and how to use them

Source: Internet
Author: User
Tags php regular expression expression engine
1. What is a regular expression

A regular expression describes a string-matching pattern. It can be used to:

(1) check whether a string contains a substring that conforms to a rule, and retrieve that substring;

(2) flexibly replace parts of a string based on matching rules.

Regular expressions are actually simple to learn, and the few more abstract concepts are also easy to understand. Many people find regular expressions complicated for two reasons. On the one hand, most documentation does not explain things step by step from easy to hard and pays no attention to the order in which concepts are introduced, which makes them difficult to understand. On the other hand, the documentation shipped with each engine usually introduces that engine's unique features, yet those features are not what we need to understand first.


2. How to use regular expressions

2.1 Ordinary characters

Letters, digits, Chinese characters, underscores, and punctuation marks that are not given a special meaning in the following sections are all ordinary characters. When matching a string, an ordinary character in an expression matches the same character.

Example 1: The expression c, when matching the string abcdef, gives the result: success; matched content: c; matched position: starts at 2, ends at 3. (Note: whether indexes start at 0 or at 1 may vary depending on the programming language.)

Example 2: The expression bcd, when matching the string abcde, gives the result: success; matched content: bcd; matched position: starts at 1, ends at 4.
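The two examples above can be checked with a short sketch. Python's re module is used here only as one convenient engine (an assumption, since the text is engine-neutral); its indexes start at 0 and the end index is exclusive. The variable names are illustrative.

```python
import re

# Example 1: the ordinary character "c" matches itself in "abcdef".
m1 = re.search(r"c", "abcdef")
print(m1.group(), m1.span())   # c (2, 3)

# Example 2: a run of ordinary characters matches the same run.
m2 = re.search(r"bcd", "abcde")
print(m2.group(), m2.span())   # bcd (1, 4)
```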

2.2 Simple escape characters

Some characters that are inconvenient to write directly are written with a \ in front, for example \r, \n, and \t. These characters are in fact already familiar to us.

Other punctuation marks that have special uses in the following sections represent the symbol itself once \ is added in front of them. For example, ^ and $ have special meanings; to match the literal ^ and $ characters in a string, the regular expression must be written as \^ and \$.

These escaped characters are matched in the same way as ordinary characters: each one matches the one identical character.

Example: The expression \$d, when matching the string abc$de, gives the result: success; matched content: $d; matched position: starts at 3, ends at 5.
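As a rough check of the escaping rule, the sketch below (Python's re module, chosen here as an assumption) compares the escaped pattern \$d with the unescaped $d on the string abc$de.

```python
import re

# Escaped: \$ stands for a literal dollar sign.
escaped = re.search(r"\$d", "abc$de")
print(escaped.group(), escaped.span())   # $d (3, 5)

# Unescaped: $ means "end of string", so nothing can follow it and no match is found.
unescaped = re.search(r"$d", "abc$de")
print(unescaped)                         # None
```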

2.3 Expressions that can match 'any of several characters'

Some notations in a regular expression can match any one character out of a set of characters. For example, the expression \d can match any digit, and . can match any single character other than a newline. Although such a notation can match any character in its set, it matches only one character, not several. This is like the joker in a card game: it can stand in for any card, but only for one card.

Example 1: The expression \d\d, when matching the string abc123, gives the result: success; matched content: 12; matched position: starts at 3, ends at 5.

Example 2: The expression a.\d, when matching the string aaa100, gives the result: success; matched content: aa1; matched position: starts at 1, ends at 4.
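A minimal sketch of these two examples, again assuming Python's re module as the engine:

```python
import re

# \d\d needs two consecutive digits; the first such pair in "abc123" is "12".
two_digits = re.search(r"\d\d", "abc123")
print(two_digits.group(), two_digits.span())   # 12 (3, 5)

# a.\d: an "a", then any one character, then one digit.
# Starting at index 0 fails (index 2 is not a digit), so the match starts at 1.
a_any_digit = re.search(r"a.\d", "aaa100")
print(a_any_digit.group(), a_any_digit.span()) # aa1 (1, 4)
```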

2.4 Custom expressions that can match 'any of several characters'

An expression that encloses a series of characters in square brackets [ ] matches any one of those characters. An expression that encloses them in [^ ] matches any one character other than those listed. In the same way, although it can match any one of them, it matches only one character, not several.

Example 1: The expression [bcd][bcd], when matching the string abc123, gives the result: success; matched content: bc; matched position: starts at 1, ends at 3.

Example 2: The expression [^abc], when matching the string abc123, gives the result: success; matched content: 1; matched position: starts at 3, ends at 4.
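The bracket examples can be tried the same way; this is only a sketch using Python's re module (an assumed engine).

```python
import re

# Each [bcd] matches exactly one of b, c, or d.
in_set = re.search(r"[bcd][bcd]", "abc123")
print(in_set.group(), in_set.span())     # bc (1, 3)

# [^abc] matches one character that is NOT a, b, or c.
not_in_set = re.search(r"[^abc]", "abc123")
print(not_in_set.group(), not_in_set.span())  # 1 (3, 4)
```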

2.5 Special symbols for modifying the number of matches

The expressions described in the previous sections, whether they match only one specific character or any one of several characters, match only once. By adding a special symbol that modifies the number of matches (a quantifier) to an expression, you can match repeatedly without writing the expression again.

Usage: the quantifier is placed after the expression it modifies. For example, [bcd][bcd] can be written as [bcd]{2}.

Example 1: The expression \d+\.?\d*, when matching the string It costs $12.5, gives the result: success; matched content: 12.5; matched position: starts at 10, ends at 14.

Example 2: The expression go{2,8}gle, when matching the string Ads by goooooogle, gives the result: success; matched content: goooooogle; matched position: starts at 7, ends at 17.
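The quantifier examples, sketched with Python's re module (an assumption about the engine; the variable names are illustrative):

```python
import re

# \d+ one or more digits, \.? an optional dot, \d* zero or more digits.
cost = re.search(r"\d+\.?\d*", "It costs $12.5")
print(cost.group(), cost.span())      # 12.5 (10, 14)

# o{2,8}: between 2 and 8 consecutive "o" characters.
ads = re.search(r"go{2,8}gle", "Ads by goooooogle")
print(ads.group(), ads.span())        # goooooogle (7, 17)

# {2} repeats the bracket expression, so both patterns find the same text here.
pair_long = re.search(r"[bcd][bcd]", "abc123").group()
pair_short = re.search(r"[bcd]{2}", "abc123").group()
print(pair_long, pair_short)          # bc bc
```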

2.6 Other symbols that represent abstract meanings

Some symbols carry an abstract special meaning in an expression, such as ^, $, and \b.

A purely textual explanation is still rather abstract, so the following examples are given to aid understanding.

Example 1: The expression ^aaa, when matching the string xxx aaa xxx, gives the result: failed. Because ^ requires a match at the beginning of the string, ^aaa can match only when aaa is at the beginning of the string, for example: aaa xxx xxx.

Example 2: The expression aaa$, when matching the string xxx aaa xxx, gives the result: failed. Because $ requires a match at the end of the string, aaa$ can match only when aaa is at the end of the string, for example: xxx xxx aaa.

Example 3: The expression .\b., when matching the string @@@abc, gives the result: success; matched content: @a; matched position: starts at 2, ends at 4.

Further note: \b, like ^ and $, does not match any character itself. It requires that, at its position in the match, one side is in the \w range and the other side is not.

Example 4: The expression \bend\b, when matching the string weekend,endfor,end, gives the result: success; matched content: end; matched position: starts at 15, ends at 18.
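The anchor and word-boundary examples can be sketched as follows, assuming Python's re module. The subject string @@@abc used for .\b. is an assumption of this sketch; it is chosen to be consistent with the reported match @a at positions 2 to 4.

```python
import re

# ^ anchors at the start of the string, $ at the end.
start_fail = re.search(r"^aaa", "xxx aaa xxx")
start_ok = re.search(r"^aaa", "aaa xxx xxx")
end_fail = re.search(r"aaa$", "xxx aaa xxx")
end_ok = re.search(r"aaa$", "xxx xxx aaa")
print(start_fail, end_fail)              # None None
print(start_ok.span(), end_ok.span())    # (0, 3) (8, 11)

# \b matches the gap between a \w character and a non-\w character.
boundary = re.search(r".\b.", "@@@abc")
print(boundary.group(), boundary.span()) # @a (2, 4)

# Only the last "end" has a boundary on both sides.
whole_word = re.search(r"\bend\b", "weekend,endfor,end")
print(whole_word.span())                 # (15, 18)
```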

Some symbols affect the relationship between the sub-expressions inside an expression, such as | and ( ):

Example 5: The expression Tom|Jack, when matching the string I'm Tom, he is Jack, gives the result: success; matched content: Tom; matched position: starts at 4, ends at 7. The next match gives: Jack; matched position: starts at 15, ends at 19.

Example 6: The expression (go\s*)+, when matching the string Let's go go go!, gives the result: success; matched content: go go go; matched position: starts at 6, ends at 14.

Example 7: The expression ¥(\d+\.?\d), when matching the string $10.9,¥20.5, gives the result: success; matched content: ¥20.5; matched position: starts at 6, ends at 11. The content matched by the parenthesized range, retrieved separately, is: 20.5.
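A sketch of Examples 5 to 7 with Python's re module (an assumed engine). Python counts positions per character with an exclusive end index, so it reports the span of ¥20.5 as (6, 11); other engines or byte-based counting may report different numbers.

```python
import re

# Alternation: each successive match is either "Tom" or "Jack".
names = [(m.group(), m.span()) for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]
print(names)        # [('Tom', (4, 7)), ('Jack', (15, 19))]

# Parentheses make "go" plus optional whitespace a unit that + can repeat.
gos = re.search(r"(go\s*)+", "Let's go go go!")
print(gos.group(), gos.span())   # go go go (6, 14)

# Parentheses also record the text they matched, retrievable as group 1.
price = re.search(r"¥(\d+\.?\d)", "$10.9,¥20.5")
print(price.group(), price.span(), price.group(1))   # ¥20.5 (6, 11) 20.5
```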

3. Some advanced usages in regular expressions

3.1 Greedy and non-greedy match counts

Greedy mode:

When using the special symbols that modify the number of matches, several notations allow the same expression to match a varying number of times, such as {m,n}, {m,}, ?, *, and +; the actual number of matches depends on the string being matched. An expression repeated an indefinite number of times in this way always matches as many times as possible during matching. Consider, for example, expressions containing \w+ applied to the text dxxxdxxxd.

\w+ always matches as many qualifying characters as it can. When the expression requires another d after \w+, \w+ leaves the last d unmatched, but that too is only so that the whole expression can match successfully. Similarly, expressions with * and {m,n} match as many times as possible, and an expression with ? will also "match" whenever it could either match or not. This matching principle is called greedy mode.
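Greedy behaviour on the text dxxxdxxxd can be sketched as below. The two patterns, (d)(\w+) and (d)(\w+)(d), are assumptions made for this illustration, and Python's re module is an assumed engine.

```python
import re

text = "dxxxdxxxd"

# With nothing required after it, \w+ takes everything after the first d.
greedy_open = re.search(r"(d)(\w+)", text)
print(greedy_open.group(2))     # xxxdxxxd

# With a trailing (d) required, \w+ gives back only the last d.
greedy_closed = re.search(r"(d)(\w+)(d)", text)
print(greedy_closed.group(2))   # xxxdxxx
```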

Non-greedy mode:

Adding a ? after a special symbol that modifies the number of matches makes an expression with an indefinite match count match as few times as possible, so that an expression that could either match or not will "not match" whenever it can. This matching principle is called non-greedy mode, also known as reluctant mode. If matching fewer times would cause the whole expression to fail, then, as in greedy mode, non-greedy mode matches a little more so that the whole expression matches successfully. The same text, dxxxdxxxd, serves as an example.
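A matching sketch for non-greedy mode on the text dxxxdxxxd. The patterns (d)(\w+?) and (d)(\w+?)(d) are assumptions chosen for this illustration, and Python's re module is an assumed engine.

```python
import re

text = "dxxxdxxxd"

# \w+? stops as soon as the whole expression is satisfied: one character.
lazy_open = re.search(r"(d)(\w+?)", text)
print(lazy_open.group(2))     # x

# A trailing (d) forces \w+? to grow until the next d can be matched.
lazy_closed = re.search(r"(d)(\w+?)(d)", text)
print(lazy_closed.group(2))   # xxx
```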

More examples:

Example 1: The expression <td>(.*)</td>, when matching the string <td><p>aa</p></td> <td><p>bb</p></td>, gives the result: success; matched content: the entire string <td><p>aa</p></td> <td><p>bb</p></td>. The </td> in the expression matches the last </td> in the string.

Example 2: By contrast, the expression <td>(.*?)</td>, when matching the same string as in Example 1, yields only <td><p>aa</p></td>; matching again yields the second one, <td><p>bb</p></td>.
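The difference between .* and .*? in the <td> examples, as a sketch with Python's re module (an assumed engine):

```python
import re

html = "<td><p>aa</p></td> <td><p>bb</p></td>"

# Greedy: .* runs to the last </td>, so the whole string is one match.
greedy = re.search(r"<td>(.*)</td>", html)
print(greedy.group() == html)     # True

# Non-greedy: .*? stops at the nearest </td>, giving one match per cell.
cells = [m.group() for m in re.finditer(r"<td>(.*?)</td>", html)]
print(cells)   # ['<td><p>aa</p></td>', '<td><p>bb</p></td>']
```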

3.2 Backreferences \1, \2 ...

When an expression matches, the expression engine records the strings matched by the sub-expressions enclosed in parentheses ( ). When retrieving the match result, the string matched by each parenthesized sub-expression can be retrieved separately. This has already been shown several times in the earlier examples. In practice, when some boundary is used to locate content and the content you want does not include that boundary, you must use parentheses to mark the range you want, as in the earlier <td>(.*?)</td>.

In fact, the string matched by a parenthesized sub-expression can be used not only after the match has finished but also during matching. A later part of the expression can refer back to the substring already matched inside an earlier pair of parentheses. The reference is written as \ plus a number: \1 refers to the string matched inside the 1st pair of parentheses, \2 to the string matched inside the 2nd pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, the pair whose left parenthesis ( comes first gets the lower number.

Example 1: The expression ('|")(.*?)(\1), when matching the string 'Hello', "World", gives the result: success; matched content: 'Hello'. The next match gives "World".

Example 2: The expression (\w)\1{4,}, when matching the string aa bbbb abcdefg ccccc 111121111 999999999, gives the result: success; matched content: ccccc. The next match gives 999999999. This expression requires a character in the \w range to be repeated at least 5 times; note the difference from \w{5,}.

Example 3: The expression <(\w+)\s*(\w+(=('|").*?\4)\s*)*>.*?</\1>, when matching the string <td id='td1' style='bgcolor:white'></td>, gives the result: success. If <td> and </td> are not paired, the match fails; if the tag is changed to another matching pair, the match also succeeds.
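The three backreference examples can be sketched with Python's re module (an assumed engine). The mismatched string ending in </tr> is a hypothetical input added here to show the failure case.

```python
import re

# \1 must repeat whichever quote character group 1 matched.
quoted = [m.group() for m in re.finditer(r"""('|")(.*?)(\1)""", "'Hello', \"World\"")]
print(quoted)    # ["'Hello'", '"World"']

# (\w)\1{4,}: one character followed by at least 4 copies of that same character.
subject = "aa bbbb abcdefg ccccc 111121111 999999999"
repeats = [m.group() for m in re.finditer(r"(\w)\1{4,}", subject)]
print(repeats)   # ['ccccc', '999999999']

# \w{5,} only asks for 5 or more word characters, not the same one repeated.
any_five = re.findall(r"\w{5,}", subject)
print(any_five)  # ['abcdefg', 'ccccc', '111121111', '999999999']

# \4 closes each attribute with the quote that opened it; \1 closes the tag by name.
tag = re.compile(r"""<(\w+)\s*(\w+(=('|").*?\4)\s*)*>.*?</\1>""")
paired = tag.search("<td id='td1' style='bgcolor:white'></td>")
unpaired = tag.search("<td id='td1' style='bgcolor:white'></tr>")
print(paired is not None, unpaired is None)   # True True
```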

3.3 Forward pre-search (lookahead) and reverse pre-search (lookbehind), positive and negative

The earlier sections described several special symbols that carry an abstract meaning: ^, $, and \b. They have one thing in common: they do not match any character themselves but attach a condition to "the two ends of the string" or "a gap between characters". With that concept understood, this section introduces another, more flexible way of attaching conditions to the "two ends" or a "gap".

Forward pre-search (lookahead): (?=xxxxx), (?!xxxxx)

Format: (?=xxxxx). In the string being matched, the condition it attaches to the "gap" or "end" where it stands is: the text to the right of the gap must be able to match the xxxxx part of the expression. Because it is only a condition attached to the gap, it does not affect the later parts of the expression, which go on to match the characters after the gap. This is similar to \b, which does not match any character itself: \b only examines the characters before and after the gap and does not affect how the expression after it matches.

Example 1: The expression Windows (?=NT|XP), when matching the string Windows 98, Windows NT, Windows 2000, matches only the "Windows " in Windows NT; the other occurrences of Windows are not matched.

Example 2: The expression (\w)((?=\1\1\1)(\1))+, when matching the string aaa ffffff 999999999, matches the first 4 of the 6 f characters and the first 7 of the 9 9 characters. The expression can be read as: for a letter or digit repeated 4 or more times, match the part before its last 2 characters. Of course, the expression need not be written this way; it is written so here only for demonstration.
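Both (?=xxxxx) examples, as a sketch under the assumption that Python's re module is the engine:

```python
import re

# "Windows " is matched only where NT or XP follows; NT itself is not consumed.
subject = "Windows 98, Windows NT, Windows 2000"
win = [m.span() for m in re.finditer(r"Windows (?=NT|XP)", subject)]
print(win)                                  # [(12, 20)]
print(subject[20:22])                       # NT

# Each repetition consumes one more copy of group 1, but only while three
# further copies are still visible ahead, so the last 2 are left unmatched.
runs = [m.group() for m in re.finditer(r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(runs)                                 # ['ffff', '9999999']
```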

Format: (?!xxxxx). The text to the right of the gap must not match the xxxxx part of the expression.

Example 3: The expression ((?!\bstop\b).)+, when matching the string fdjka ljfdl stop fjdsla fdj, matches from the beginning up to the position just before stop; if the string contains no stop, it matches the entire string.

Example 4: The expression do(?!\w), when matching the string done,do,dog, matches only do. In this example, using (?!\w) after do has the same effect as using \b.
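The (?!xxxxx) examples can be checked with the following sketch (Python's re module is an assumed engine; the string without stop is a hypothetical input added for the second case).

```python
import re

# Consume characters one at a time, as long as the word "stop" does not start here.
before_stop = re.search(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj")
print(repr(before_stop.group()))     # 'fdjka ljfdl '

# Without "stop" in the string, the whole string is matched.
no_stop = re.search(r"((?!\bstop\b).)+", "fdjka ljfdl fjdsla")
print(no_stop.group())               # fdjka ljfdl fjdsla

# "do" counts only when no word character follows it.
do_spans = [m.span() for m in re.finditer(r"do(?!\w)", "done,do,dog")]
same_as_b = [m.span() for m in re.finditer(r"do\b", "done,do,dog")]
print(do_spans, same_as_b)           # [(5, 7)] [(5, 7)]
```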

Reverse pre-search (lookbehind): (?<=xxxxx), (?<!xxxxx)

These two formats are similar in concept to forward pre-search. What reverse pre-search examines is the "left side" of the gap: the two formats require, respectively, that the text there must match and must not match the specified expression, rather than examining the right side. As with forward pre-search, both are conditions attached to the gap where they stand and do not match any characters themselves.
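A small sketch of both reverse pre-search formats. The patterns and subject strings here are hypothetical, chosen only for illustration, and Python's re module is an assumed engine (it requires the expression inside a lookbehind to have a fixed width, which not every engine does).

```python
import re

# (?<=\d{4}) requires four digits to the left; (?=\d{4}) requires four to the right.
# Neither condition consumes characters, so only the middle digits are matched.
middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456")
print(middle.group(), middle.span())    # 56789012 (4, 12)

# (?<!\$) rejects a number whose left neighbour is a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", "$10 and 20")
print(plain_numbers)                    # ['20']
```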

4. Other general rules

4.1 Rule One

In an expression, \xXX and \uXXXX can be used to represent a character (X is a hexadecimal digit).
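As an illustration, Python's re module (an assumed engine; support for these escapes varies between engines) accepts both forms inside a pattern. The subject string is hypothetical.

```python
import re

# \x41 is "A" (hex 41); \u4e2d is the Chinese character U+4E2D.
hex_match = re.search(r"\x41\u4e2d", "xA\u4e2dy")
print(hex_match.span())       # (1, 3)
```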

4.2 Rule Two

In an expression, \s, \d, \w, and \b have special meanings; the corresponding uppercase letters \S, \D, \W, and \B have the opposite meanings.
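A quick sketch of the uppercase forms, assuming Python's re module; the subject strings are made up for the illustration.

```python
import re

non_digits = re.findall(r"\D+", "abc123def")   # runs of non-digit characters
non_space = re.findall(r"\S+", "a b\tc")       # runs of non-whitespace characters
non_word = re.findall(r"\W", "a-b_c!")         # characters outside the \w range
print(non_digits, non_space, non_word)
# ['abc', 'def'] ['a', 'b', 'c'] ['-', '!']

# \B is the opposite of \b: it holds only where there is NO word boundary,
# so "end" is found inside "weekend" but not as the separate word.
inner_end = [m.span() for m in re.finditer(r"\Bend", "weekend end")]
print(inner_end)                               # [(4, 7)]
```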

4.3 Rule Three

Summary of the characters that have a special meaning in an expression and need a \ in front to match the character itself: ^, $, (, ), [, ], {, }, ., ?, +, *, |, and \.
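The effect of escaping can be seen in the sketch below, which assumes Python's re module. re.escape is a Python helper that adds the backslashes automatically; other engines may offer a similar function under another name. The strings are hypothetical.

```python
import re

literal = "(1+2)*3"
subject = "x = (1+2)*3"

# Escaped by hand: every special character is preceded by a backslash.
by_hand = re.search(r"\(1\+2\)\*3", subject)
print(by_hand.group(), by_hand.span())   # (1+2)*3 (4, 11)

# Escaped by the library helper.
auto = re.search(re.escape(literal), subject)
print(auto.group())                      # (1+2)*3

# Unescaped, the same text is read as a group, quantifiers, and a "3",
# so it matches something quite different.
unescaped = re.search(literal, subject)
print(unescaped.group())                 # 3
```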

4.4 Rule Four

For a sub-expression inside parentheses ( ), if you do not want its match result recorded for later use, you can use the (?:xxxxx) format.

Example 1: The expression (?:(\w)\1)+, when matching the string a bbccdd efg, gives the result bbccdd. The match result of the (?:) range is not recorded, so (\w) is referenced with \1.
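A sketch of this example with Python's re module (an assumed engine), showing that the (?: ) pair takes no group number:

```python
import re

doubled = re.search(r"(?:(\w)\1)+", "a bbccdd efg")
print(doubled.group())       # bbccdd

# Only (\w) is a numbered group; the (?: ) wrapper is not counted.
print(doubled.re.groups)     # 1

# Group 1 holds the character captured in the last repetition.
print(doubled.group(1))      # d
```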

4.5 Rule Five

A brief note on commonly used expression option settings: Ignorecase, Singleline, Multiline, and Global.
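How these options are spelled depends on the engine. As a rough illustration only, the sketch below assumes the closest counterparts in Python's re module: re.IGNORECASE for Ignorecase, re.DOTALL for Singleline, and re.MULTILINE for Multiline. Python has no Global flag; finding every match is done with functions such as re.findall. This mapping is an assumption of the sketch, not something stated in the text.

```python
import re

# Ignorecase: letters match regardless of case.
ignore_case = re.search(r"abc", "xABC", re.IGNORECASE)
print(ignore_case.group())                         # ABC

# Singleline: with re.DOTALL, "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default, dot_all is not None)            # None True

# Multiline: ^ and $ also match at the start and end of each line.
first_only = re.findall(r"^\w+", "one\ntwo")
every_line = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
print(first_only, every_line)                      # ['one'] ['one', 'two']

# Global: all matches rather than just the first.
all_digits = re.findall(r"\d", "a1b2c3")
print(all_digits)                                  # ['1', '2', '3']
```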
