Regular Expression Reference Document - Regular Expression Syntax Reference
[Original article; when reprinting, please retain or cite the source: http://www.regexlab.com/zh/regref.htm]
Introduction
A regular expression uses one "string" to describe a pattern and then checks whether another "string" fits that pattern. For example, the expression "ab+" describes the pattern "one 'a' followed by one or more 'b'", so "ab", "abb", and "abbbbbbbbbb" all fit it.
Regular expressions can be used to: (1) verify that a string fits a given pattern, for example that it is a legitimate e-mail address; (2) search, since finding text that fits a pattern in a long document is more flexible and convenient than finding a fixed string; (3) replace, which is more powerful than ordinary replacement.
Regular expressions are actually quite simple to learn, and the few more abstract concepts are easy to understand as well. Many people find regular expressions complicated for two reasons. On the one hand, most documentation does not proceed from the simple to the complex and pays no attention to the order in which concepts are introduced, which makes it hard for readers to follow. On the other hand, the documentation shipped with each engine usually focuses on that engine's unique features, which are not the first thing a newcomer needs to understand.
In the original article, each example links to a test page where it can be tried out. Enough preamble; let's begin.
1. Regular expression rules
1.1 Ordinary characters
Letters, digits, Chinese characters, underscores, and any punctuation marks not given a special meaning in the following sections are "ordinary characters". An ordinary character in an expression matches the identical character in the string being matched.
Example 1: When the expression "c" is matched against the string "abcde", the match succeeds; the matched content is "c"; the matched position starts at 2 and ends at 3. (Note: whether indexes start at 0 or at 1 may vary depending on the programming language.)
Example 2: When the expression "bcd" is matched against the string "abcde", the match succeeds; the matched content is "bcd"; the matched position starts at 1 and ends at 4.
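The two examples above can be checked with a minimal sketch in Python's re module. Python is used here only for illustration (the article itself is engine-neutral); it counts positions from 0 and reports an exclusive end, which agrees with the positions quoted above.

```python
import re

# re.search scans the string and returns the first match (or None).
m1 = re.search(r"c", "abcde")
m2 = re.search(r"bcd", "abcde")

# span() is (start, end): 0-based start, end is one past the last character.
print(m1.group(), m1.span())  # c (2, 3)
print(m2.group(), m2.span())  # bcd (1, 4)
```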
1.2 Simple escape characters
Some characters that are inconvenient to write directly are written with a "\" in front. These are in fact already familiar to us.
Expression | Matches
\r, \n | Carriage return and line feed
\t | Tab
\\ | The "\" character itself
Some other punctuation marks have a special use described in the following sections; preceded by "\", they represent the symbol itself. For example, "^" and "$" have special meanings, so to match the "^" and "$" characters in a string, the expression must be written as "\^" and "\$".
Expression | Matches
\^ | The "^" symbol itself
\$ | The "$" symbol itself
\. | The decimal point (.) itself
These escaped characters are matched in the same way as "ordinary characters": each matches the identical character.
Example 1: When the expression "\$d" is matched against the string "abc$de", the match succeeds; the matched content is "$d"; the matched position starts at 3 and ends at 5.
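As a small illustration of the escaping rule, the following sketch (again assuming Python's re module as the engine) contrasts the escaped and unescaped forms of "$".

```python
import re

# The raw-string prefix r"..." keeps the backslash intact for the regex engine.
m = re.search(r"\$d", "abc$de")
print(m.group(), m.span())  # $d (3, 5)

# Without the escape, "$" is an end-of-string anchor, so "$d" cannot match here.
unescaped = re.search(r"$d", "abc$de")
print(unescaped)  # None
```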
1.3 Expressions that match 'multiple characters'
Some notations in a regular expression can match any one character from a set of 'multiple characters'. For example, the expression "\d" can match any digit. Although it can match any character in the set, it matches only one character, not several. This is like the jokers in a card game: a joker can stand in for any card, but only for one card.
Expression | Matches
\d | Any one digit, 0~9
\w | Any one letter, digit, or underscore, i.e. any of A~Z, a~z, 0~9, _
\s | Any one whitespace character, including space, tab, form feed, and so on
. | The decimal point matches any one character except the line feed (\n)
Example 1: When the expression "\d\d" is matched against "abc123", the match succeeds; the matched content is "12"; the matched position starts at 3 and ends at 5.
Example 2: When the expression "a.\d" is matched against "aaa100", the match succeeds; the matched content is "aa1"; the matched position starts at 1 and ends at 4.
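The examples above can be reproduced with a short sketch, assuming Python's re module. One engine-specific caveat: for Python 3 text patterns, "\d" and "\w" also accept non-ASCII Unicode digits and letters unless the re.ASCII flag is given.

```python
import re

m1 = re.search(r"\d\d", "abc123")
m2 = re.search(r"a.\d", "aaa100")
print(m1.group(), m1.span())  # 12 (3, 5)
print(m2.group(), m2.span())  # aa1 (1, 4)

# Each class consumes exactly one character, so "\d" alone finds one digit.
single = re.search(r"\d", "abc123").group()
print(single)  # 1

# "." does not match a line feed by default.
dot_newline = re.search(r"a.b", "a\nb")
print(dot_newline)  # None
```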
1.4 Custom expressions that match 'multiple characters'
Square brackets [ ] enclosing a series of characters match any one of those characters. [^ ] enclosing a series of characters matches any one character other than those. As before, although any one of them can be matched, only one character is matched, not several.
Expression | Matches
[ab5@] | "a" or "b" or "5" or "@"
[^abc] | Any one character other than "a", "b", "c"
[f-k] | Any one letter between "f" and "k"
[^A-F0-3] | Any one character other than "A"~"F" and "0"~"3"
Example 1: When the expression "[bcd][bcd]" is matched against "abc123", the match succeeds; the matched content is "bc"; the matched position starts at 1 and ends at 3.
Example 2: When the expression "[^abc]" is matched against "abc123", the match succeeds; the matched content is "1"; the matched position starts at 3 and ends at 4.
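A minimal sketch of the bracket notation, assuming Python's re module. The last test string, "AG03x4", is made up for illustration and does not come from the article.

```python
import re

m1 = re.search(r"[bcd][bcd]", "abc123")
m2 = re.search(r"[^abc]", "abc123")
print(m1.group(), m1.span())  # bc (1, 3)
print(m2.group(), m2.span())  # 1 (3, 4)

# A negated class with ranges: everything except A-F and 0-3.
# ("AG03x4" is a hypothetical test string.)
others = re.findall(r"[^A-F0-3]", "AG03x4")
print(others)  # ['G', 'x', '4']
```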
1.5 Special symbols for modifying the number of matches
The expressions described in the previous sections, whether they match exactly one specific character or any one of several characters, match only once. By adding a special symbol that modifies the number of matches, you can match repeatedly without writing the expression out repeatedly.
Usage: the quantifier is placed after the expression it modifies. For example, "[bcd][bcd]" can be written as "[bcd]{2}".
Expression | Role
{n} | The expression repeats n times. For example, "\w{2}" is equivalent to "\w\w"; "a{5}" is equivalent to "aaaaa"
{m,n} | The expression repeats at least m times and at most n times. For example, "ba{1,3}" can match "ba", "baa", or "baaa"
{m,} | The expression repeats at least m times. For example, "\w\d{2,}" can match "a12", "_456", "M12344" ...
? | The expression matches 0 or 1 times, equivalent to {0,1}. For example, "a[cd]?" can match "a", "ac", "ad"
+ | The expression appears at least 1 time, equivalent to {1,}. For example, "a+b" can match "ab", "aab", "aaab" ...
* | The expression does not appear or appears any number of times, equivalent to {0,}. For example, "\^*b" can match "b", "^^^b" ...
Example 1: When the expression "\d+\.?\d*" is matched against "It costs $12.5", the match succeeds; the matched content is "12.5"; the matched position starts at 10 and ends at 14.
Example 2: When the expression "go{2,8}gle" is matched against "Ads by goooooogle", the match succeeds; the matched content is "goooooogle"; the matched position starts at 7 and ends at 17.
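The quantifier examples can be verified with a sketch like the following, assuming Python's re module. The list of candidate strings for "ba{1,3}" is chosen here for illustration.

```python
import re

m1 = re.search(r"\d+\.?\d*", "It costs $12.5")
m2 = re.search(r"go{2,8}gle", "Ads by goooooogle")
print(m1.group(), m1.span())  # 12.5 (10, 14)
print(m2.group(), m2.span())  # goooooogle (7, 17)

# re.fullmatch requires the whole string to fit the expression, which makes
# it easy to see exactly which repeat counts "{1,3}" accepts.
candidates = ["b", "ba", "baa", "baaa", "baaaa"]
ba_forms = [s for s in candidates if re.fullmatch(r"ba{1,3}", s)]
print(ba_forms)  # ['ba', 'baa', 'baaa']
```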
1.6 Other special symbols that represent abstract meanings
Some symbols carry an abstract special meaning in an expression:
Expression | Role
^ | Matches the position where the string starts; does not match any character
$ | Matches the position where the string ends; does not match any character
\b | Matches a word boundary, i.e. the position between a word and a space; does not match any character
A purely textual explanation remains rather abstract, so here are some examples to aid understanding.
Example 1: When the expression "^aaa" is matched against "xxx aaa xxx", the match fails. Because "^" must match at the beginning of the string, "^aaa" can match only when "aaa" is at the beginning of the string, for example "aaa xxx xxx".
Example 2: When the expression "aaa$" is matched against "xxx aaa xxx", the match fails. Because "$" must match at the end of the string, "aaa$" can match only when "aaa" is at the end of the string, for example "xxx xxx aaa".
Example 3: When the expression ".\b." is matched against "@@@abc", the match succeeds; the matched content is "@a"; the matched position starts at 2 and ends at 4.
Further explanation: "\b" is similar to "^" and "$" in that it does not match any character itself, but it requires that, at its position in the match, one side is in the "\w" range and the other side is not.
Example 4: When the expression "\bend\b" is matched against "weekend,endfor,end", the match succeeds; the matched content is "end"; the matched position starts at 15 and ends at 18.
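The anchor examples can be checked with the sketch below, assuming Python's re module. For Example 3 the test string is taken to be "@@@abc", which is an assumption that is consistent with the stated result ("@a" at positions 2 to 4).

```python
import re

# "^" and "$" match positions, not characters.
start_fail = re.search(r"^aaa", "xxx aaa xxx")
start_ok = re.search(r"^aaa", "aaa xxx xxx")
end_fail = re.search(r"aaa$", "xxx aaa xxx")
end_ok = re.search(r"aaa$", "xxx xxx aaa")
print(start_fail, start_ok.span())  # None (0, 3)
print(end_fail, end_ok.span())      # None (8, 11)

# "\b" requires a \w character on one side of the gap and a non-\w
# character (or the string edge) on the other.
m3 = re.search(r".\b.", "@@@abc")  # assumed test string, see lead-in
m4 = re.search(r"\bend\b", "weekend,endfor,end")
print(m3.group(), m3.span())  # @a (2, 4)
print(m4.group(), m4.span())  # end (15, 18)
```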
Some symbols affect the relationship between sub-expressions within an expression:
Expression | Role
| | An "or" relationship between the expressions on its left and right: either the left side or the right side is matched
( ) | (1) When a quantifier is applied, the expression in parentheses is modified as a whole. (2) When retrieving the match result, the content matched by the expression in parentheses can be obtained separately
Example 5: When the expression "Tom|Jack" is matched against the string "I'm Tom, he is Jack", the match succeeds; the matched content is "Tom"; the matched position starts at 4 and ends at 7. Matching again gives "Jack"; the matched position starts at 15 and ends at 19.
Example 6: When the expression "(go\s*)+" is matched against "Let's go go go!", the match succeeds; the matched content is "go go go"; the matched position starts at 6 and ends at 14.
Example 7: When the expression "¥(\d+\.?\d*)" is matched against "$10.9,¥20.5", the match succeeds; the matched content is "¥20.5"; the matched position starts at 6 and ends at 11. The content matched by the parentheses, obtained separately, is "20.5".
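A sketch of alternation and grouping, assuming Python's re module. Note that for Example 7 Python reports the span (6, 11), because "¥20.5" is five characters long and the end index is exclusive.

```python
import re

# finditer yields successive matches, i.e. "match, then match the next".
names = [(m.group(), m.span()) for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]
print(names)  # [('Tom', (4, 7)), ('Jack', (15, 19))]

# The quantifier "+" applies to the whole parenthesized group.
m6 = re.search(r"(go\s*)+", "Let's go go go!")
print(m6.group(), m6.span())  # go go go (6, 14)

# group(1) returns only what the first pair of parentheses matched.
m7 = re.search(r"¥(\d+\.?\d*)", "$10.9,¥20.5")
print(m7.group(), m7.group(1), m7.span())  # ¥20.5 20.5 (6, 11)
```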
2. Some advanced rules in regular expressions
2.1 Greedy and non-greedy matching counts
When a quantifier is used, several forms allow the same expression to match a varying number of times, such as "{m,n}", "{m,}", "?", "*", "+"; the actual number of times depends on the string being matched. Such expressions with an indefinite repeat count always match as many times as possible during matching. For example, for the text "dxxxdxxxd":
Expression | Match result
(d)(\w+) | "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d) | "\w+" matches all the characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it "gives up" that last "d" so that the whole expression can match successfully
Thus "\w+" always matches as many characters that fit its rule as possible. In the second example it does not match the last "d", but only so that the whole expression can succeed. Likewise, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?", when it can either match or not match, matches if at all possible. This matching principle is called "greedy" mode.
Non-greedy mode:
Adding a "?" after a quantifier makes the indefinitely repeated expression match as few times as possible, so that an expression that can either match or not match will, where possible, "not match". This matching principle is called "non-greedy" mode, also known as "reluctant" mode. If matching fewer would cause the whole expression to fail, then, similar to greedy mode, non-greedy mode matches the minimum additional amount needed for the whole expression to succeed. For example, for the text "dxxxdxxxd":
Expression | Match result
(d)(\w+?) | "\w+?" matches as few characters after the first "d" as possible; the result is that "\w+?" matches only one "x"
(d)(\w+?)(d) | For the whole expression to match, "\w+?" has to match "xxx" so that the following "d" can match. So the result is: "\w+?" matches "xxx"
More examples:
Example 1: When the expression "<td>(.*)</td>" is matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", the match succeeds; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", because the "</td>" in the expression matches the last "</td>" in the string.
Example 2: By contrast, when the expression "<td>(.*?)</td>" is matched against the same string as in Example 1, only "<td><p>aa</p></td>" is obtained; matching again gives the second "<td><p>bb</p></td>".
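The difference between greedy and non-greedy mode can be seen side by side in the following sketch, which assumes Python's re module and uses the same texts as the tables and examples above.

```python
import re

text = "dxxxdxxxd"

# group(2) is what the "\w+" (or "\w+?") part matched.
greedy_open = re.search(r"(d)(\w+)", text).group(2)
greedy_closed = re.search(r"(d)(\w+)(d)", text).group(2)
lazy_open = re.search(r"(d)(\w+?)", text).group(2)
lazy_closed = re.search(r"(d)(\w+?)(d)", text).group(2)
print(greedy_open, greedy_closed)  # xxxdxxxd xxxdxxx
print(lazy_open, lazy_closed)      # x xxx

html = "<td><p>aa</p></td> <td><p>bb</p></td>"
greedy_cells = [m.group() for m in re.finditer(r"<td>(.*)</td>", html)]
lazy_cells = [m.group() for m in re.finditer(r"<td>(.*?)</td>", html)]
print(len(greedy_cells))  # 1 (the entire string)
print(lazy_cells)         # ['<td><p>aa</p></td>', '<td><p>bb</p></td>']
```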
2.2 Backreferences \1, \2 ...
When an expression is matched, the engine records the string matched by each sub-expression enclosed in parentheses "( )". When retrieving the match result, the strings matched by the parenthesized sub-expressions can be obtained separately. This has been shown several times in the earlier examples. In practice, when boundaries are used to locate content but the desired content does not include the boundaries, parentheses must be used to mark the wanted range, as in the earlier "<td>(.*?)</td>".
In fact, "the string matched by the sub-expression in parentheses" can be used not only after the match is over but also during matching. A later part of the expression can refer to an earlier "string matched by a parenthesized sub-match". The reference syntax is "\" plus a number: "\1" refers to the string matched by the 1st pair of parentheses, "\2" to the string matched by the 2nd pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, pairs are numbered in the order in which their left parenthesis "(" appears.
Examples:
Example 1: When the expression "('|")(.*?)(\1)" is matched against "'Hello', "World"", the match succeeds; the matched content is "'Hello'". Matching again gives ""World"".
Example 2: When the expression "(\w)\1{4,}" is matched against "aa bbbb abcdefg ccccc 111121111 999999999", the match succeeds; the matched content is "ccccc". Matching again gives "999999999". This expression requires a character in the "\w" range to repeat at least 5 times; note the difference from "\w{5,}".
Example 3: When the expression "<(\w+)\s*(\w+(=('|").*?\4)\s*)*>.*?</\1>" is matched against "<td id='td1' style="bgcolor:white"></td>", the match succeeds. If "<td>" and "</td>" are not paired, the match fails; if they are changed to another matching pair, the match succeeds again.
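The three backreference examples can be tried with the sketch below, assuming Python's re module. Triple-quoted raw strings are used only so that both quote characters can appear in the pattern; the unpaired string "<td id='td1'></tr>" is a made-up counterexample.

```python
import re

# Example 1: \1 must repeat whichever quote character group 1 matched.
quote_re = r"""('|")(.*?)(\1)"""
quoted = [m.group() for m in re.finditer(quote_re, "'Hello', \"World\"")]
print(quoted)  # ["'Hello'", '"World"']

# Example 2: the SAME character at least 5 times, unlike "\w{5,}".
digits = "aa bbbb abcdefg ccccc 111121111 999999999"
same_char = [m.group() for m in re.finditer(r"(\w)\1{4,}", digits)]
any_char = re.findall(r"\w{5,}", digits)
print(same_char)  # ['ccccc', '999999999']
print(any_char)   # ['abcdefg', 'ccccc', '111121111', '999999999']

# Example 3: \1 forces the closing tag name to equal the opening one.
tag_re = r"""<(\w+)\s*(\w+(=('|").*?\4)\s*)*>.*?</\1>"""
paired = re.search(tag_re, "<td id='td1' style=\"bgcolor:white\"></td>")
unpaired = re.search(tag_re, "<td id='td1'></tr>")  # hypothetical counterexample
print(paired is not None, unpaired)  # True None
```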
2.3 Lookahead and negative lookahead; lookbehind and negative lookbehind
In the previous sections, several special symbols with abstract meanings were introduced: "^", "$", "\b". They have one thing in common: they do not match any character themselves, but only attach a condition to "the two ends of the string" or "the gap between characters". With this concept understood, this section introduces another, more flexible way of attaching conditions to "the two ends" or "the gaps".
Lookahead: "(?=xxxxx)", "(?!xxxxx)"
Format "(?=xxxxx)": in the string being matched, the condition it attaches to a "gap" or to one of "the two ends" is that the text to the right of the gap must match the expression xxxxx. Because it is only a condition on the gap, it does not prevent the expressions that follow from actually matching the characters after the gap. This is similar to "\b", which matches no character itself; "\b" merely examines the characters before and after the gap and does not affect the actual matching done by the expressions after it.
Example 1: When the expression "Windows (?=NT|XP)" is matched against "Windows 98, Windows NT, Windows 2000", it matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.
Example 2: When the expression "(\w)((?=\1\1\1)(\1))+" is matched against the string "aaa ffffff 999999999", it matches the first 4 of the 6 "f" and the first 7 of the 9 "9". This expression can be read as: when a letter or digit repeats 4 or more times, match the part before the last 2 characters. Of course, this expression need not be written this way; it is used here only as a demonstration.
Format "(?!xxxxx)": the text to the right of the gap must not match the expression xxxxx.
Example 3: When the expression "((?!\bstop\b).)+" is matched against "fdjka ljfdl stop fjdsla fdj", it matches from the beginning up to the position before "stop"; if there is no "stop" in the string, the entire string is matched.
Example 4: When the expression "do(?!\w)" is matched against the string "done, do, dog", it matches only "do". In this example, using "(?!\w)" after "do" has the same effect as using "\b".
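The four lookahead examples can be reproduced with the sketch below, assuming Python's re module. The lookahead only tests the text to the right of the gap; it never becomes part of the matched content.

```python
import re

# Example 1: only the "Windows " that is followed by NT or XP is matched.
m1 = re.search(r"Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000")
print(repr(m1.group()), m1.span())  # 'Windows ' (12, 20)

# Example 2: stop two characters before the end of each long run.
runs = [m.group() for m in re.finditer(r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(runs)  # ['ffff', '9999999']

# Example 3: consume characters while the word "stop" does not start here.
text = "fdjka ljfdl stop fjdsla fdj"
m3 = re.search(r"((?!\bstop\b).)+", text)
print(repr(m3.group()))  # 'fdjka ljfdl '

# Example 4: "do" not followed by a word character.
m4 = re.search(r"do(?!\w)", "done, do, dog")
print(m4.group(), m4.span())  # do (6, 8)
```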
Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"
These two formats are conceptually similar to lookahead. The difference is that lookbehind places its requirement on the "left side" of the gap: the text there must match, or respectively must not match, the specified expression, instead of the right side being examined. Like lookahead, they are only a condition attached to the gap and match no characters themselves.
Example 5: When the expression "(?<=\d{4})\d+(?=\d{4})" is matched against "1234567890123456", it matches the middle 8 digits, excluding the first 4 and the last 4 digits. Because JScript.RegExp does not support lookbehind, this article provides no demonstration for this example. Many other engines do support lookbehind, such as the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, and the DEELX regular expression engine, which the original site recommends as the simplest to use.
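Python's re module is another engine that supports lookbehind, with the restriction that the lookbehind expression must have a fixed width (as "\d{4}" does). A minimal sketch of Example 5 follows; the negative-lookbehind string "$10 20" is a made-up illustration, not taken from the article.

```python
import re

# Example 5: digits preceded by 4 digits and followed by 4 digits.
m5 = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456")
print(m5.group(), m5.span())  # 56789012 (4, 12)

# Negative lookbehind: whole numbers that are not preceded by "$".
# ("$10 20" is a hypothetical test string.)
not_prices = re.findall(r"(?<!\$)\b\d+", "$10 20")
print(not_prices)  # ['20']
```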
3. Other general rules
There are also some rules that are common across regular expression engines but were not covered in the preceding sections.
3.1 In an expression, "\xXX" and "\uXXXX" can be used to represent a single character ("X" denotes a hexadecimal digit)
Form | Character range
\xXX | Characters numbered in the range 0 to 255; for example, a space can be written as "\x20"
\uXXXX | Any character can be written as "\u" followed by its 4-digit hexadecimal number; for example, "\u4e2d" (the character "中")
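Both escape forms are understood by Python's re module, which is assumed here for illustration; other engines may differ in the details.

```python
import re

# "\x20" is the space character (code 0x20 = 32).
space = re.search(r"\x20", "a b")
print(space.span())  # (1, 2)

# "\u4e2d" is the character with code point U+4E2D.
zhong = re.search(r"\u4e2d", "中文")
print(zhong.group(), zhong.span())  # 中 (0, 1)
```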
3.2 The expressions "\s", "\d", "\w", "\b" have special meanings; the corresponding uppercase letters have the opposite meanings
Expression | Matches
\S | Any non-whitespace character ("\s" matches any single whitespace character)
\D | Any non-digit character
\W | Any character other than letters, digits, and underscores
\B | A non-word boundary, i.e. a gap between characters where both sides are in the "\w" range or neither side is
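A short sketch of the uppercase forms, assuming Python's re module. Apart from "weekend,endfor,end", which is reused from section 1.6, the test strings are made up for illustration.

```python
import re

non_space = re.findall(r"\S+", "a b\tc")
non_digit = re.findall(r"\D+", "abc123def")
non_word = re.findall(r"\W", "a_b-c!")
print(non_space)  # ['a', 'b', 'c']
print(non_digit)  # ['abc', 'def']
print(non_word)   # ['-', '!']

# "\B" matches only inside a word: the "end" in "weekend" qualifies,
# the other two "end" occurrences start at a word boundary.
inner = re.search(r"\Bend", "weekend,endfor,end")
print(inner.span())  # (4, 7)
```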
3.3 Summary of characters that have a special meaning in expressions and must be preceded by "\" to match the character itself
Character | Description
^ | Matches the start position of the input string. To match the "^" character itself, use "\^"
$ | Matches the end position of the input string. To match the "$" character itself, use "\$"
( ) | Mark the beginning and end of a sub-expression. To match parentheses, use "\(" and "\)"
[ ] | Define custom expressions that match 'multiple characters'. To match brackets, use "\[" and "\]"
{ } | Quantifier symbols that modify the number of matches. To match curly braces, use "\{" and "\}"
. | Matches any character except a line feed (\n). To match the decimal point itself, use "\."
? | Quantifier: matches 0 or 1 times. To match the "?" character itself, use "\?"
+ | Quantifier: matches at least 1 time. To match the "+" character itself, use "\+"
* | Quantifier: matches 0 or any number of times. To match the "*" character itself, use "\*"
| | An "or" relationship between the expressions on its left and right. To match "|" itself, use "\|"
3.4 For a sub-expression in parentheses "( )", if the matched result should not be recorded for later use, the "(?:xxxxx)" format can be used
Example 1: When the expression "(?:(\w)\1)+" is matched against "a bbccdd efg", the result is "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is referenced with "\1".
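The effect on group numbering can be observed with the sketch below, assuming Python's re module: the compiled pattern reports a single capturing group, which is why "\1" refers to "(\w)".

```python
import re

pattern = re.compile(r"(?:(\w)\1)+")
m = pattern.search("a bbccdd efg")
print(m.group(), m.span())  # bbccdd (2, 8)

# Only "(\w)" is recorded; the "(?:...)" wrapper gets no number.
print(pattern.groups)  # 1
# A repeated group keeps the text from its last repetition.
print(m.group(1))      # d
```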
3.5 Introduction to commonly used expression properties: Ignorecase, Singleline, Multiline, Global
Expression property | Description
Ignorecase | By default, letters in an expression are case-sensitive. Setting Ignorecase makes matching case-insensitive. Some expression engines extend the notion of "case" to the whole Unicode range.
Singleline | By default, the decimal point "." matches any character except the line feed (\n). Setting Singleline makes the decimal point match all characters, including line feeds.
Multiline | By default, "^" and "$" match only the start ① and end ④ positions of the string, as in: ①xxxxxxxxx②\n③xxxxxxxxx④. Setting Multiline makes "^" match not only ① but also position ③, after a line feed and before the start of the next line; it makes "$" match not only ④ but also position ②, before a line feed at the end of a line.
Global | Mainly takes effect when the expression is used for replacement: setting Global means that all matches are replaced.
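Property names differ between engines. As one possible mapping, and assuming Python's re module, Ignorecase corresponds to re.IGNORECASE, Singleline to re.DOTALL, and Multiline to re.MULTILINE; there is no Global flag, because re.sub replaces all matches unless its count argument limits it. The test strings below are made up for illustration.

```python
import re

# Ignorecase
ci = re.search(r"abc", "xABCx", re.IGNORECASE)
print(ci.group())  # ABC

# Singleline: "." also matches the line feed.
plain_dot = re.search(r"a.b", "a\nb")
dotall = re.search(r"a.b", "a\nb", re.DOTALL)
print(plain_dot, dotall is not None)  # None True

# Multiline: "^" and "$" also match at line starts and line ends.
single = re.findall(r"^\w+$", "one\ntwo")
multi = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
print(single, multi)  # [] ['one', 'two']

# "Global" behaviour: replace all matches, or only the first one.
replace_all = re.sub(r"\d", "#", "a1b2c3")
replace_first = re.sub(r"\d", "#", "a1b2c3", count=1)
print(replace_all, replace_first)  # a#b#c# a#b2c3
```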
4. Other Tips
4.1 To learn about the more complex regular expression syntax that advanced engines also support, see the documentation for the DEELX regular expression engine on the original site.
4.2 To require that the expression match the entire string rather than find a part of it, use "^" and "$" at the beginning and end of the expression. For example, "^\d+$" requires the entire string to consist of digits only.
4.3 To require that the match be a complete word rather than part of a word, use "\b" at the beginning and end of the expression. For example, use "\b(if|while|else|void|int...)\b" to match keywords in a program.
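Tips 4.2 and 4.3 can be sketched as follows, assuming Python's re module. One Python-specific caveat: "$" also matches just before a final line feed, so re.fullmatch is the stricter whole-string check there. The keyword list is cut short to the names given above, and the code string is made up for illustration.

```python
import re

# 4.2: anchor both ends to require a whole-string match.
all_digits = re.search(r"^\d+$", "12345")
mixed = re.search(r"^\d+$", "12a45")
print(all_digits is not None, mixed)  # True None

# In Python, "$" tolerates one trailing line feed; fullmatch does not.
trailing = re.search(r"^\d+$", "123\n")
strict = re.fullmatch(r"\d+", "123\n")
print(trailing is not None, strict)  # True None

# 4.3: "\b" on both sides keeps "int" from matching inside other words.
code = "int x; if (print) while interval"  # hypothetical input
keywords = re.findall(r"\b(if|while|else|void|int)\b", code)
print(keywords)  # ['int', 'if', 'while']
```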
4.4 Do not let an expression match the empty string. Otherwise the match always succeeds while matching nothing. For example, when writing an expression to match numbers such as "123", "123.", "123.5", ".5", where the integer part, the decimal point, and the fractional part can each be omitted, do not write "\d*\.?\d*", because this expression also matches successfully when there is nothing at all. A better formulation is "\d+\.?\d*|\.\d+".
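The difference between the two formulations in tip 4.4 shows up as soon as they are run against text that contains no number. A minimal sketch, assuming Python's re module:

```python
import re

bad = r"\d*\.?\d*"
good = r"\d+\.?\d*|\.\d+"

# The first expression "succeeds" on text with no number, matching nothing.
empty_hit = re.search(bad, "abc")
print(repr(empty_hit.group()), empty_hit.span())  # '' (0, 0)
print(re.search(good, "abc"))                     # None

# Both accept the intended number formats ...
numbers = ["123", "123.", "123.5", ".5"]
accepted = [s for s in numbers if re.fullmatch(good, s)]
print(accepted)  # ['123', '123.', '123.5', '.5']

# ... but only the second rejects the empty string and a lone ".".
rejected = [s for s in ["", "."] if re.fullmatch(good, s) is None]
print(rejected)  # ['', '.']
```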
4.5 Do not repeat indefinitely a sub-match that can match the empty string. If every part of the sub-expression in parentheses can match 0 times, and the parentheses as a whole can repeat an unlimited number of times, the situation may be more serious than in the previous tip: the matching process may loop forever. Some regular expression engines, such as .NET's, already avoid the infinite loop in this situation, but we should still try to avoid it. If an expression we write runs into an infinite loop, this is one place to start looking for the cause.
4.6 Choose sensibly between greedy and non-greedy mode; see the topic discussion.
4.7 For the two sides of an or "|", it is best that any given character can be matched by only one side, so that the result does not change when the expressions on the two sides of "|" are swapped.
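Tip 4.7 can be illustrated with the sketch below, assuming Python's re module, which tries the alternatives from left to right and takes the first one that lets the match succeed. The expressions "a|ab" and "a|b" are made up for illustration.

```python
import re

# Overlapping alternatives: both sides can match the character "a",
# so swapping them changes the result.
first = re.search(r"a|ab", "ab").group()
swapped = re.search(r"ab|a", "ab").group()
print(first, swapped)  # a ab

# Disjoint alternatives: the order makes no difference.
left = re.search(r"a|b", "ab").group()
right = re.search(r"b|a", "ab").group()
print(left, right)  # a a
```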