Introduction
A regular expression describes a feature with one "string" and then checks whether another "string" has that feature. For example, the expression "ab+" describes the feature "one 'a' followed by one or more 'b'", so 'ab', 'abb', and 'abbbbbbbbbb' all have this feature.
Regular expressions can be used to: (1) verify that a string has a given feature, for example that it is a valid email address; (2) search a long text for strings that have a given feature, which is more flexible and convenient than searching for a fixed string; (3) perform replacements, which is more powerful than ordinary replacement.
Regular expressions are actually simple to learn, and the few abstract concepts involved are easy to understand. Many people nevertheless find them complicated. One reason is that most documentation does not progress from simple to advanced and does not introduce concepts in a sensible order, which makes it hard to follow. Another is that the documentation shipped with each engine tends to describe that engine's own special features, which are not what a reader needs to understand first.
1. Regular expression rules
1.1 Ordinary characters
Letters, digits, Chinese characters, underscores, and any punctuation marks that are not given a special meaning in later sections are all "ordinary characters". An ordinary character in an expression matches the same character in the string.
Example 1: When the expression "c" is matched against the string "abcde", the match succeeds; the matched content is "c"; the matched position starts at 2 and ends at 3. (Note: whether indexes start at 0 or 1 depends on the programming language. The positions in this article start at 0, and the end position is the index just after the last matched character.)
Example 2: When the expression "bcd" is matched against the string "abcde", the match succeeds; the matched content is "bcd"; the matched position starts at 1 and ends at 4.
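The two examples above can be reproduced with a short sketch. The article is engine-neutral; Python's built-in re module is used here only as an assumption for illustration, and its indexes happen to start at 0 with an exclusive end.

```python
import re

# re.search returns the first place where the pattern matches, or None.
single = re.search(r"c", "abcde")
triple = re.search(r"bcd", "abcde")

# span() is (start, end); end is the index just past the last matched character.
print(single.group(), single.span())  # c (2, 3)
print(triple.group(), triple.span())  # bcd (1, 4)
```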
1.2 Simple escape characters
Characters that are inconvenient to write directly are written with a "\" in front. These are all characters we already know.
Expression | Matching
\r, \n | Carriage return and line feed
\t | Tab
\\ | The "\" character itself
Some other punctuation marks have a special use in later sections. Preceded by "\", they stand for the symbol itself. For example, ^ and $ have special meanings, so to match the "^" and "$" characters in a string, the expressions must be written as "\^" and "\$".
Expression | Matching
\^ | The ^ symbol itself
\$ | The $ symbol itself
\. | The decimal point (.) itself
These escaped characters match in the same way as "ordinary characters": each matches the same character in the string.
Example 1: When the expression "\$d" is matched against the string "abc$de", the match succeeds; the matched content is "$d"; the matched position starts at 3 and ends at 5.
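As a minimal sketch of escaping, again assuming Python's re module: a raw string keeps the backslash intact for the regex engine, and re.escape can add the backslashes when a piece of text should be matched literally.

```python
import re

# A raw string (r"...") passes the backslash through to the regex engine.
dollar = re.search(r"\$d", "abc$de")
print(dollar.group(), dollar.span())  # $d (3, 5)

# re.escape inserts the backslashes needed to match a text literally.
literal = re.escape("1.5")
matches_literal = re.fullmatch(literal, "1.5") is not None
matches_other = re.fullmatch(literal, "1x5") is not None
print(matches_literal, matches_other)  # True False
```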
1.3 Expressions that can match any one of several characters
Some notations in regular expressions can match any one character out of a set of characters. For example, the expression "\d" can match any digit. Although it can match any character in the set, it matches only one character, not several. This is like the jokers in a card game: a joker can stand in for any card, but only for one card.
Expression | Matching
\d | Any one digit from 0 to 9
\w | Any one letter, digit, or underscore, that is, any one of A~Z, a~z, 0~9, _
\s | Any one whitespace character, such as a space, tab, or form feed
. | The decimal point matches any one character except the line feed (\n)
Example 1: When the expression "\d\d" is matched against "abc123", the match succeeds; the matched content is "12"; the matched position starts at 3 and ends at 5.
Example 2: When the expression "a.\d" is matched against "aaa100", the match succeeds; the matched content is "aa1"; the matched position starts at 1 and ends at 4.
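A small sketch of these character classes, assuming Python's re module and the expressions "\d\d" and "a.\d". One engine-specific detail worth knowing: for text strings Python's "\w" also accepts non-ASCII letters unless the re.ASCII flag is given, which is wider than the A~Z, a~z, 0~9, _ range listed in the table.

```python
import re

two_digits = re.search(r"\d\d", "abc123")
print(two_digits.group(), two_digits.span())  # 12 (3, 5)

# "." stands for exactly one character, here the second "a".
a_any_digit = re.search(r"a.\d", "aaa100")
print(a_any_digit.group(), a_any_digit.span())  # aa1 (1, 4)

# Python-specific: re.ASCII restricts \w to A-Z, a-z, 0-9 and _.
unicode_words = re.findall(r"\w", "é_1")
ascii_words = re.findall(r"\w", "é_1", re.ASCII)
print(unicode_words, ascii_words)  # ['é', '_', '1'] ['_', '1']
```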
1.4 Custom expressions that can match any one of several characters
Square brackets [ ] enclose a set of characters and match any one of them. [^ ] encloses a set of characters and matches any one character that is not in the set. As before, although any member of the set can be matched, only one character is matched, not several.
Expression | Matching
[ab5@] | Matches "a" or "b" or "5" or "@"
[^abc] | Matches any character other than "a", "b", "c"
[f-k] | Matches any letter from "f" to "k"
[^A-F0-3] | Matches any character other than "A"~"F" and "0"~"3"
Example 1: When the expression "[bcd][bcd]" is matched against "abc123", the match succeeds; the matched content is "bc"; the matched position starts at 1 and ends at 3.
Example 2: When the expression "[^abc]" is matched against "abc123", the match succeeds; the matched content is "1"; the matched position starts at 3 and ends at 4.
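The bracket forms can be tried out as follows. This is only a sketch assuming Python's re module; the alphabet string in the last call is an illustrative input, not one from the article.

```python
import re

pair = re.search(r"[bcd][bcd]", "abc123")
print(pair.group(), pair.span())  # bc (1, 3)

# [^abc] matches the first character that is not a, b, or c.
not_abc = re.search(r"[^abc]", "abc123")
print(not_abc.group(), not_abc.span())  # 1 (3, 4)

# A range such as f-k covers every letter from f to k inclusive.
in_range = re.findall(r"[f-k]", "abcdefghijklmn")
print(in_range)  # ['f', 'g', 'h', 'i', 'j', 'k']
```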
1.5 Special symbols that modify the number of matches
The expressions described so far match exactly one character, whether a fixed character or any one of several. Adding a special symbol that modifies the number of matches lets part of an expression match repeatedly without writing it out repeatedly.
The usage is: the "count modifier" is placed after the "expression being modified". For example, "[bcd][bcd]" can be written as "[bcd]{2}".
Expression | Function
{n} | The expression is repeated n times. For example, "\w{2}" is equivalent to "\w\w"; "a{5}" is equivalent to "aaaaa"
{m,n} | The expression is repeated at least m times and at most n times. For example, "ba{1,3}" can match "ba", "baa", or "baaa"
{m,} | The expression is repeated at least m times. For example, "\w\d{2,}" can match "a12", "_456", "M12344", ...
? | The expression is matched 0 or 1 times, equivalent to {0,1}. For example, "a[cd]?" can match "a", "ac", "ad"
+ | The expression appears at least once, equivalent to {1,}. For example, "a+b" can match "ab", "aab", "aaab", ...
* | The expression does not appear or appears any number of times, equivalent to {0,}. For example, "\^*b" can match "b", "^^^b", ...
Example 1: When the expression "\d+\.?\d*" is matched against "It costs $12.5", the match succeeds; the matched content is "12.5"; the matched position starts at 10 and ends at 14.
Example 2: When the expression "go{2,8}gle" is matched against "Ads by goooooogle", the match succeeds; the matched content is "goooooogle"; the matched position starts at 7 and ends at 17.
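A brief sketch of the count modifiers, assuming Python's re module. The bounds in "go{2,8}gle" and "ba{1,3}" are assumed values chosen to agree with the results described in the text; re.fullmatch is used to test whether a whole string fits a pattern.

```python
import re

price = re.search(r"\d+\.?\d*", "It costs $12.5")
print(price.group(), price.span())  # 12.5 (10, 14)

# {2,8} accepts between 2 and 8 repetitions of "o"; the text has 6.
logo = re.search(r"go{2,8}gle", "Ads by goooooogle")
print(logo.group(), logo.span())  # goooooogle (7, 17)

# "ba{1,3}" accepts one to three "a" characters after the "b".
candidates = ["ba", "baa", "baaa", "baaaa"]
accepted = [s for s in candidates if re.fullmatch(r"ba{1,3}", s)]
print(accepted)  # ['ba', 'baa', 'baaa']
```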
1.6 Other special symbols with abstract meanings
Some symbols have an abstract special meaning in an expression:
Expression | Function
^ | Matches the position where the string starts; matches no character
$ | Matches the position where the string ends; matches no character
\b | Matches a word boundary, that is, a position between a word character and a non-word character such as a space; matches no character
Further description in words would still be abstract, so the following examples should help.
Example 1: When the expression "^aaa" is matched against "xxx aaa xxx", the match fails. Because "^" must match the start of the string, "^aaa" can match only when "aaa" is at the beginning of the string, for example "aaa xxx xxx".
Example 2: When the expression "aaa$" is matched against "xxx aaa xxx", the match fails. Because "$" must match the end of the string, "aaa$" can match only when "aaa" is at the end of the string, for example "xxx xxx aaa".
Example 3: When the expression ".\b." is matched against "@@@abc", the match succeeds; the matched content is "@a"; the matched position starts at 2 and ends at 4.
Further explanation: "\b" is similar to "^" and "$". It matches no character itself, but it requires that, of the characters immediately to the left and right of its position, one is in the "\w" range and the other is not.
Example 4: When the expression "\bend\b" is matched against "weekend,endfor,end", the match succeeds; the matched content is "end"; the matched position starts at 15 and ends at 18.
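These position-only symbols can be checked with a short sketch, assuming Python's re module and the input strings "@@@abc" and "weekend,endfor,end".

```python
import re

# ^ and $ only succeed at the very start and end of the string.
at_start = re.search(r"^aaa", "xxx aaa xxx")
at_end = re.search(r"aaa$", "xxx aaa xxx")
print(at_start, at_end)  # None None

# \b consumes nothing: the match is just the characters on either side of it.
boundary = re.search(r".\b.", "@@@abc")
print(boundary.group(), boundary.span())  # @a (2, 4)

# Only the final "end" has a word boundary on both sides.
whole_word = re.search(r"\bend\b", "weekend,endfor,end")
print(whole_word.group(), whole_word.span())  # end (15, 18)
```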
Some symbols affect the relationship between the subexpressions inside an expression:
Expression | Function
| | An "or" relationship between the expressions on its left and right: matches either the left side or the right side
( ) | (1) When a count modifier is applied, the expression in parentheses is modified as a whole. (2) When the match result is retrieved, the content matched by the expression in parentheses can be obtained separately
Example 5: When the expression "Tom|Jack" is matched against the string "I'm Tom, he is Jack", the match succeeds; the matched content is "Tom"; the matched position starts at 4 and ends at 7. Matching again for the next occurrence also succeeds; the matched content is "Jack"; the matched position starts at 15 and ends at 19.
Example 6: When the expression "(go\s*)+" is matched against "Let's go go go!", the match succeeds; the matched content is "go go go"; the matched position starts at 6 and ends at 14.
Example 7: When the expression "¥(\d+\.?\d*)" is matched against "$10.9,¥20.5", the match succeeds; the matched content is "¥20.5"; the matched position starts at 6 and ends at 11. The content matched by the parentheses alone is "20.5".
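The following sketch reproduces Examples 5 to 7, assuming Python's re module, the input "Let's go go go!" for Example 6, and "$10.9,¥20.5" (no space after the comma) for Example 7. re.finditer walks through successive matches, and group(1) returns what the first pair of parentheses matched.

```python
import re

# finditer yields every successive match of the alternation.
names = [(m.group(), m.start(), m.end())
         for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]
print(names)  # [('Tom', 4, 7), ('Jack', 15, 19)]

# The + applies to the whole parenthesized group.
repeated = re.search(r"(go\s*)+", "Let's go go go!")
print(repeated.group(), repeated.span())  # go go go (6, 14)

# group() is the whole match; group(1) is the parenthesized part only.
yen = re.search(r"¥(\d+\.?\d*)", "$10.9,¥20.5")
print(yen.group(), yen.group(1), yen.span())  # ¥20.5 20.5 (6, 11)
```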
2. Some advanced rules in regular expressions
2.1 Greedy and non-greedy matching counts
Several of the special symbols that modify the number of matches allow the same expression to match a varying number of times, such as "{m,n}", "{m,}", "?", "*", and "+"; the actual number of matches depends on the string being matched. An expression with such an indefinite repeat count always matches as many times as possible. For example, for the text "dxxxdxxxd":
Expression | Matching result
(d)(\w+) | "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d) | "\w+" matches all the characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it "gives up" that "d" so that the entire expression can match
As this shows, "\w+" always matches as many characters that fit its rule as possible. In the second example it does not match the last "d", but only so that the entire expression can match. Expressions with "*" and "{m,n}" likewise match as much as possible, and an expression with "?" also chooses to match when it could either match or not match. This matching principle is called "greedy" mode.
Non-greedy mode:
Adding a "?" after a special symbol that modifies the number of matches makes an expression with an indefinite repeat count match as few times as possible, and makes an expression that could either match or not match choose "not to match" where it can. This matching principle is called "non-greedy" mode, also known as "reluctant" mode. When matching fewer times would make the entire expression fail, non-greedy mode, like greedy mode, matches just enough more for the entire expression to succeed. For example, for the text "dxxxdxxxd":
Expression | Matching result
(d)(\w+?) | "\w+?" matches as few characters as possible after the first "d", so it matches only one "x"
(d)(\w+?)(d) | For the entire expression to match, "\w+?" has to match "xxx" so that the trailing "d" can match. The result is that "\w+?" matches "xxx"
More examples:
Example 1: When the expression "<td>(.*)</td>" is matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", the match succeeds; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", because the "</td>" in the expression matches the last "</td>" in the string.
Example 2: By contrast, when the expression "<td>(.*?)</td>" is matched against the same string as in Example 1, only "<td><p>aa</p></td>" is obtained. Matching again for the next occurrence yields the second "<td><p>bb</p></td>".
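Greedy and non-greedy behaviour can be compared side by side in a sketch like the one below, assuming Python's re module and the two-cell string "<td><p>aa</p></td> <td><p>bb</p></td>". group(2) is the part matched by "\w+" or "\w+?".

```python
import re

text = "dxxxdxxxd"
greedy_open = re.search(r"(d)(\w+)", text).group(2)
greedy_closed = re.search(r"(d)(\w+)(d)", text).group(2)
lazy_open = re.search(r"(d)(\w+?)", text).group(2)
lazy_closed = re.search(r"(d)(\w+?)(d)", text).group(2)
print(greedy_open, greedy_closed, lazy_open, lazy_closed)
# xxxdxxxd xxxdxxx x xxx

cells = "<td><p>aa</p></td> <td><p>bb</p></td>"
# Greedy .* runs to the last </td>, so the whole string is one match.
greedy_cell = re.search(r"<td>(.*)</td>", cells).group()
# Non-greedy .*? stops at the nearest </td>, giving one match per cell.
lazy_cells = re.findall(r"<td>(.*?)</td>", cells)
print(greedy_cell == cells, lazy_cells)  # True ['<p>aa</p>', '<p>bb</p>']
```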
2.2 Backreferences \1, \2, ...
When an expression matches, the engine records the strings matched by the subexpressions enclosed in parentheses "( )". When the match result is retrieved, those strings can be obtained separately, as the earlier examples have shown. In practice, when certain boundaries are used to find content but the content wanted does not include the boundaries, parentheses must be used to mark the wanted range, as in the earlier "<td>(.*?)</td>".
In fact, "the string matched by the subexpression in parentheses" can be used not only after matching has finished but also during matching. A later part of the expression can refer back to the "string already matched by an earlier parenthesized submatch". This is written as "\" plus a number: "\1" refers to the string matched by the 1st pair of parentheses, "\2" to the string matched by the 2nd pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, the pair whose left parenthesis "(" comes first is numbered first.
Examples:
Example 1: When the expression "('|")(.*?)(\1)" is matched against "'Hello', "World"", the match succeeds; the matched content is "'Hello'". Matching again for the next occurrence yields ""World"".
Example 2: When the expression "(\w)\1{4,}" is matched against "aa bbbb abcdefg ccccc 111121111 999999999", the match succeeds; the matched content is "ccccc". Matching again for the next occurrence yields "999999999". This expression requires a character in the "\w" range to be repeated at least 5 times. Note the difference from "\w{5,}".
Example 3: When the expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" is matched against "<td id='td1' style="bgcolor:white"></td>", the match succeeds. If "<td>" and "</td>" are not paired, the match fails; if both are changed to another matching pair, the match also succeeds.
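The backreference examples can be sketched as follows, assuming Python's re module. The expressions are written as raw triple-quoted strings so that both quote characters can appear inside them; the unpaired tag string is an illustrative input, not one from the article.

```python
import re

# \1 must repeat whichever quote character the first group matched.
text = "'Hello', \"World\""
quoted = [m.group() for m in re.finditer(r"""('|")(.*?)(\1)""", text)]
print(quoted)  # ["'Hello'", '"World"']

# (\w)\1{4,}: one word character followed by at least 4 copies of itself.
runs = [m.group() for m in re.finditer(
    r"(\w)\1{4,}", "aa bbbb abcdefg ccccc 111121111 999999999")]
print(runs)  # ['ccccc', '999999999']

# </\1> forces the closing tag name to equal the opening tag name.
tag = re.compile(r"""<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>""")
paired = tag.search("<td id='td1' style=\"bgcolor:white\"></td>")
unpaired = tag.search("<td id='td1'></tr>")
print(paired is not None, unpaired is None)  # True True
```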
2.3 Lookahead (forward pre-search) and lookbehind (reverse pre-search): conditions that match no characters
The previous chapter described several special symbols with abstract meanings: "^", "$", and "\b". They have one thing in common: they match no character themselves and only attach a condition to "the two ends of the string" or "the gap between characters". With that concept in hand, this section introduces another, more flexible way of attaching a condition to "the two ends" or "a gap".
Lookahead (forward pre-search): "(?=xxxxx)", "(?!xxxxx)"
Format "(?=xxxxx)": in the string being matched, it attaches this condition to the "gap" or "end" where it stands: the text to the right of the gap must match the expression xxxxx. Because it only acts as a condition on the gap, it does not stop the expressions that follow from actually matching the characters after the gap. In this it resembles "\b", which matches no character itself: "\b" merely examines the characters before and after the gap and does not affect the actual matching by the following expressions.
Example 1: When the expression "Windows (?=NT|XP)" is matched against "Windows 98, Windows NT, Windows 2000", only the "Windows " in "Windows NT" is matched; the other occurrences of "Windows " are not.
Example 2: When the expression "(\w)((?=\1\1\1)(\1))+" is matched against the string "aaa ffffff 999999999", it matches the first 4 of the 6 "f" characters and the first 7 of the 9 "9" characters. This expression can be read as: when a letter or digit is repeated at least 4 times, match the part before its last 2 characters. Of course, the expression need not be written this way; it is used here only for demonstration.
Format "(?!xxxxx)": the text to the right of the gap must not match the expression xxxxx.
Example 3: When the expression "((?!\bstop\b).)+" is matched against "fdjka ljfdl stop fjdsla fdj", it matches from the beginning of the string up to the position just before "stop". If the string does not contain "stop", the entire string is matched.
Example 4: When the expression "do(?!\w)" is matched against the string "done, do, dog", only "do" is matched. In this example, "(?!\w)" after "do" has the same effect as "\b".
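A sketch of the four lookahead examples, assuming Python's re module, which supports both "(?=...)" and "(?!...)". The spans show that the text examined by the lookahead is not part of the match.

```python
import re

# Only the "Windows " that is followed by NT or XP is matched.
windows = [m.span() for m in re.finditer(
    r"Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000")]
print(windows)  # [(12, 20)]

# Each repetition requires three more copies to the right, then consumes one.
repeats = [m.group() for m in re.finditer(
    r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(repeats)  # ['ffff', '9999999']

# Consume characters one at a time while the word "stop" does not start here.
before_stop = re.search(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj")
print(repr(before_stop.group()))  # 'fdjka ljfdl '

# "do" is matched only where no word character follows it.
do_only = [m.span() for m in re.finditer(r"do(?!\w)", "done, do, dog")]
print(do_only)  # [(6, 8)]
```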
Lookbehind (reverse pre-search): "(?<=xxxxx)", "(?<!xxxxx)"
These two formats are conceptually similar to lookahead. What lookbehind requires is that the text on the "left" side of the gap must match, or must not match, the specified expression, rather than the text on the right side. Like lookahead, both are conditions attached to the gap where they stand and match no character themselves.
Example 5: When the expression "(?<=\d{4})\d+(?=\d{4})" is matched against "1234567890123456", it matches the middle 8 digits, excluding the first 4 and the last 4. Because JScript.RegExp does not support lookbehind, this example cannot be demonstrated there. Many other engines do support lookbehind, such as the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, and the DEELX regular expression engine recommended on this site for being simple and easy to use.
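Example 5 can be tried in an engine that supports lookbehind. The sketch below assumes Python's re module, which accepts lookbehind as long as the inner expression has a fixed width, as "\d{4}" does.

```python
import re

# The 4 digits on each side are required but not included in the match.
middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456")
print(middle.group(), middle.span())  # 56789012 (4, 12)
```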
3. Other general rules
Some further rules that are common to most regular expression engines were not covered in the previous sections.
3.1 In an expression, "\xXX" and "\uXXXX" can be used to represent a character ("X" stands for a hexadecimal digit)
Form | Character range
\xXX | Characters numbered in the range 0~255. For example, a space can be written as "\x20"
\uXXXX | Any character can be written as "\u" plus the 4-digit hexadecimal form of its number. For example: "\u4E2D"
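As a small sketch, assuming Python's re module (which understands both forms inside a text pattern), the two notations can be checked like this. The input strings are illustrative.

```python
import re

# \x20 is the space character (hexadecimal 20 = decimal 32).
space = re.search(r"\x20", "a b")
print(space.span())  # (1, 2)

# \u4E2D is the character numbered 4E2D in hexadecimal.
han = re.search(r"\u4E2D", "中文")
print(han.group(), han.span())  # 中 (0, 1)
```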
3.2 While the expressions "\s", "\d", "\w", and "\b" have special meanings, the corresponding uppercase letters have the opposite meanings
Expression | Matching
\S | Matches any non-whitespace character ("\s" matches any whitespace character)
\D | Matches any non-digit character
\W | Matches any character other than letters, digits, and underscores
\B | Matches a non-word boundary, that is, a gap where both sides are in the "\w" range or both sides are outside the "\w" range
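The uppercase forms can be illustrated with a short sketch, assuming Python's re module; the input strings are illustrative and not taken from the article.

```python
import re

non_space = re.findall(r"\S+", "a b\tc")
non_digit = re.findall(r"\D+", "ab12cd")
non_word = re.findall(r"\W", "a-b_c!")
print(non_space, non_digit, non_word)
# ['a', 'b', 'c'] ['ab', 'cd'] ['-', '!']

# \B requires that there is NO word boundary: "end" inside "weekend".
inside_word = re.search(r"\Bend", "weekend,end")
print(inside_word.span())  # (4, 7)
```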
3.3 Summary of characters that have a special meaning in expressions and must be preceded by "\" to match the character itself
Character | Description
^ | Matches the start position of the input string. To match the "^" character itself, use "\^"
$ | Matches the end position of the input string. To match the "$" character itself, use "\$"
( ) | Marks the start and end of a subexpression. To match parentheses, use "\(" and "\)"
[ ] | Defines a custom expression that can match any one of several characters. To match square brackets, use "\[" and "\]"
{ } | Symbols that modify the number of matches. To match braces, use "\{" and "\}"
. | Matches any one character except the line feed (\n). To match the decimal point itself, use "\."
? | Modifies the number of matches to 0 or 1. To match the "?" character itself, use "\?"
+ | Modifies the number of matches to at least 1. To match the "+" character itself, use "\+"
* | Modifies the number of matches to 0 or any number. To match the "*" character itself, use "\*"
| | An "or" relationship between the expressions on its left and right. To match "|" itself, use "\|"
3.4 If the match of a subexpression in parentheses "( )" should not be recorded for later use, the "(?:xxxxx)" form can be used
Example 1: When the expression "(?:(\w)\1)+" is matched against "a bbccdd efg", the result is "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is the group referenced by "\1".
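A minimal sketch of the non-recording form, assuming Python's re module. The compiled pattern reports how many groups are recorded, which shows that the "(?:...)" pair is not counted.

```python
import re

pairs = re.search(r"(?:(\w)\1)+", "a bbccdd efg")
print(pairs.group(), pairs.span())  # bbccdd (2, 8)

# Only (\w) is a recorded group; it holds the last repetition's character.
recorded_groups = pairs.re.groups
last_char = pairs.group(1)
print(recorded_groups, last_char)  # 1 d
```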
3.5 Introduction to common expression attribute settings: Ignorecase, Singleline, Multiline, Global
Expression attribute | Description
Ignorecase | By default, letters in an expression are case-sensitive. Setting Ignorecase makes matching case-insensitive. Some expression engines extend the notion of "case" to upper and lower case in the Unicode range
Singleline | By default, the decimal point "." matches any character except the line feed (\n). With Singleline set, the decimal point matches all characters, including line feeds
Multiline | By default, the expressions "^" and "$" match only the start ① and end ④ of the string, as in: ①xxxxxxxxx②\n③xxxxxxxxx④. With Multiline set, "^" matches ① and also the position ③ after a line feed, before the start of the next line; "$" matches ④ and also the position ② before a line feed, at the end of a line
Global | Mainly takes effect when the expression is used for replacement. With Global set, all matches are replaced
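Attribute names differ between engines. As a sketch under the assumption of Python's re module, Ignorecase, Singleline, and Multiline correspond to the flags re.IGNORECASE, re.DOTALL, and re.MULTILINE. Python has no Global flag: re.sub replaces every match by default, and its count argument limits the number of replacements.

```python
import re

ignore_case = re.search(r"abc", "xABC", re.IGNORECASE).group()
print(ignore_case)  # ABC

# Without DOTALL, "." refuses to match the line feed.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default is None, dot_all is not None)  # True True

# MULTILINE lets ^ and $ match at the start and end of every line.
lines_default = re.findall(r"^\w+$", "one\ntwo")
lines_multi = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
print(lines_default, lines_multi)  # [] ['one', 'two']

# "Global" behaviour is the default for re.sub; count=1 replaces only once.
replace_all = re.sub(r"a", "-", "banana")
replace_first = re.sub(r"a", "-", "banana", count=1)
print(replace_all, replace_first)  # b-n-n- b-nana
```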
4. Other tips
4.1 To learn which more complex regular expression syntax an advanced engine supports, see the DEELX regular expression engine documentation on this site.
4.2 To require that the expression match the entire string rather than find a part of it, use "^" and "$" at the beginning and end of the expression. For example, "^\d+$" requires the entire string to consist of digits only.
4.3 To require that the matched content be a complete word rather than part of a word, use "\b" at the beginning and end of the expression. For example, use "\b(if|while|else|void|int......)\b" to match keywords in a program.
4.4 Do not write an expression that can match the empty string. Otherwise the match always succeeds while nothing is actually matched. For example, when writing an expression to match forms such as "123", "123.", "123.5", and ".5", the integer part, the decimal point, and the fractional digits can each be omitted, but the expression should not be written as "\d*\.?\d*", because it also matches successfully when there is nothing at all. A better way to write it is "\d+\.?\d*|\.\d+".
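The difference between the two ways of writing the number expression in 4.4 can be seen in a sketch like this one, assuming Python's re module. The names loose and strict are illustrative.

```python
import re

loose = re.compile(r"\d*\.?\d*")         # every part is optional
strict = re.compile(r"\d+\.?\d*|\.\d+")  # at least one digit is required

# The loose form "succeeds" on text with no number by matching nothing.
empty_hit = loose.search("abc")
print(repr(empty_hit.group()), empty_hit.span())  # '' (0, 0)
print(strict.search("abc"))  # None

# The strict form still accepts all four intended shapes.
shapes = ["123", "123.", "123.5", ".5"]
all_accepted = all(strict.fullmatch(s) for s in shapes)
print(all_accepted)  # True
```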
4.5 Do not loop an unlimited number of times over a submatch that can match the empty string. If every part of the subexpression in parentheses can match 0 times and the parentheses as a whole can repeat without limit, the situation can be worse than the one in the previous tip: the matching process may enter an endless loop. Some regular expression engines, such as .NET regular expressions, already avoid the endless loop in this case, but we should still try to avoid writing such expressions. If an endless loop occurs when writing an expression, this is a good place to start checking for the cause.
4.6 Choose between greedy mode and non-greedy mode sensibly. For more information, see the topic discussion.
4.7 On the two sides of "|", it is best if a given character can be matched by only one side, so that swapping the expressions on the two sides of "|" does not change the result.
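Tip 4.7 can be illustrated with a small sketch, assuming Python's re module, which tries the alternatives from left to right and keeps the first one that succeeds. The patterns and inputs here are illustrative and not taken from the article.

```python
import re

# Both sides can match the leading "a", so the order changes the result.
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()
print(short_first, long_first)  # a ab

# When the sides cannot match the same character, the order does not matter.
disjoint_one = re.findall(r"cat|dog", "dog cat")
disjoint_two = re.findall(r"dog|cat", "dog cat")
print(disjoint_one == disjoint_two)  # True
```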