1. Period
Assume that you are playing an English spelling game and want to find three-letter words that start with the letter "t" and end with the letter "n". To construct this regular expression, you can use a wildcard, the period symbol ".". The complete expression is then "t.n". It matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any character, including spaces and tab characters (in Java it also matches line terminators, but only when DOTALL mode is enabled).
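The behavior of the wildcard can be checked with a minimal Java sketch. The class name DotDemo, the helper matchesTDotN, and the candidate strings are illustrative assumptions, not part of the original text; Pattern.matches requires the whole string to match.

```java
import java.util.regex.Pattern;

class DotDemo {
    // True when the whole candidate matches "t.n".
    static boolean matchesTDotN(String candidate) {
        return Pattern.matches("t.n", candidate);
    }

    public static void main(String[] args) {
        String[] candidates = {"tan", "ten", "t#n", "t n", "toon"};
        for (String s : candidates) {
            System.out.println("\"" + s + "\" -> " + matchesTDotN(s));
        }
    }
}
```

With Java's default settings the period does not match a line terminator, so "t\nn" is rejected unless DOTALL mode is enabled.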
2. Square brackets
To solve the problem that the period matches too much, you can list the meaningful characters in square brackets. Only the characters listed inside the brackets take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". However, "toon" does not match, because a pair of square brackets matches exactly one character.
In addition, "-" can be used inside square brackets to denote a range. For example, "[a-g]" matches the lowercase letters a to g, and "[1-9]" matches the digits 1 to 9.
You can use "&&" to take an intersection. For example, "[1-9&&[^456]]" matches the digits from 1 to 9 other than 4, 5, and 6.
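The bracket expressions above can be tried out with a short Java sketch. The class and method names (BracketDemo, isVowelWord, isAllowedDigit) are made up for illustration.

```java
import java.util.regex.Pattern;

class BracketDemo {
    // Exactly one of a, e, i, o between "t" and "n".
    static boolean isVowelWord(String s) {
        return Pattern.matches("t[aeio]n", s);
    }

    // Intersection: a digit from 1 to 9 that is not 4, 5, or 6.
    static boolean isAllowedDigit(String s) {
        return Pattern.matches("[1-9&&[^456]]", s);
    }

    public static void main(String[] args) {
        System.out.println("tin  -> " + isVowelWord("tin"));
        System.out.println("toon -> " + isVowelWord("toon"));
        System.out.println("3    -> " + isAllowedDigit("3"));
        System.out.println("5    -> " + isAllowedDigit("5"));
    }
}
```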
3. "Or" symbol
If you want to match "toon" in addition to all the words matched above, you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here, because they match only a single character; parentheses "()" must be used instead. Parentheses can also be used for grouping.
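As a small illustration of alternation, the following Java sketch checks a few words against the expression from this section. The class name AlternationDemo and the method isGameWord are hypothetical.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // "oo" is a two-character alternative, which a bracket expression cannot express.
    static boolean isGameWord(String s) {
        return Pattern.matches("t(a|e|i|o|oo)n", s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tan", "ton", "toon", "tooon"}) {
            System.out.println(s + " -> " + isGameWord(s));
        }
    }
}
```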
4. Symbols indicating the number of matches (quantifiers)
?      0 or 1 time
+      1 or more times, that is, at least once
*      0 or more times
{n}    Exactly n times
{n,m}  n to m times
Suppose we want to search for American Social Security numbers in text files. The number has the format 999-99-9999. A regular expression that matches it is "[0-9]{3}\-[0-9]{2}\-[0-9]{4}". Inside square brackets, a hyphen (-) has a special meaning: it denotes a range, for example from 0 to 9. The hyphens that separate the parts of a Social Security number are therefore written with the escape character "\" in front of them. Outside square brackets a hyphen is an ordinary character, so this escape is optional, but it is harmless.
Now assume that the hyphens should be optional, that is, both 999-99-9999 and 999999999 are valid formats. In this case, you can add the "?" quantifier after each hyphen: "[0-9]{3}\-?[0-9]{2}\-?[0-9]{4}".
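A minimal Java sketch of the Social Security number check with optional hyphens might look like this. The pattern is an assumed spelling of the expression described above, and the names SsnDemo and isSsn are illustrative. Note that in Java source code the backslash itself must be doubled inside a string literal.

```java
import java.util.regex.Pattern;

class SsnDemo {
    // Three digits, optional hyphen, two digits, optional hyphen, four digits.
    static final Pattern SSN = Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    static boolean isSsn(String s) {
        return SSN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"999-99-9999", "999999999", "99-999-9999"}) {
            System.out.println(s + " -> " + isSsn(s));
        }
    }
}
```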
5. "Not" symbol
The "^" symbol is called the "not" symbol. When it is the first character inside square brackets, "^" marks the characters that you do not want to match. For example, a regular expression such as "[^X][a-z]+" matches all words except those that start with the letter "X".
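The negated bracket expression can be sketched in Java as follows. The pattern "[^X][a-z]+" is an assumed example of "a word that does not start with X", and the names NegationDemo and doesNotStartWithX are hypothetical.

```java
import java.util.regex.Pattern;

class NegationDemo {
    // First character: anything except "X"; then one or more lowercase letters.
    static boolean doesNotStartWithX(String word) {
        return Pattern.matches("[^X][a-z]+", word);
    }

    public static void main(String[] args) {
        System.out.println("Apple -> " + doesNotStartWithX("Apple"));
        System.out.println("Xray  -> " + doesNotStartWithX("Xray"));
    }
}
```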
6. Parentheses
(1) Limit the scope of a quantifier
Example
(cat)?  matches 0 or 1 "cat"
(cat)+  matches one or more "cat"
(2) Limit the scope of an alternation
Example
(cat|dog)  matches "cat" or "dog"
(3) Capturing groups
The text matched by the pattern between parentheses is captured. When the pattern contains nested parentheses, the groups are numbered from left to right in the order in which their opening parentheses appear.
Example
([A-Za-z](\d{2}))((-)\d{2}) matches "A22-33" as follows:
The regular expression in the first pair of parentheses is: [A-Za-z](\d{2})
Matching result, group 1: A22
The regular expression in the second pair of parentheses is: \d{2}
Matching result, group 2: 22
The regular expression in the third pair of parentheses is: (-)\d{2}
Matching result, group 3: -33
The regular expression in the fourth pair of parentheses is: -
Matching result, group 4: -
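The group numbering can be verified with a short Java sketch. The pattern is a reconstruction of the example with balanced parentheses, and the names GroupDemo and groups are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group 0 (the whole match) followed by groups 1..groupCount,
    // or an empty array when the input does not match.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("([A-Za-z](\\d{2}))((-)\\d{2})").matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("A22-33");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + ": " + g[i]);
        }
    }
}
```

Group 0 always denotes the whole match, which is why the array has groupCount() + 1 entries.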
(4) Non-capturing groups
When parentheses are combined with "?:", they form a non-capturing group: the text matched inside the parentheses is not captured. When that text is not needed afterwards, non-capturing parentheses can improve matching efficiency.
Example
(\w(?:\d{2}))((?:-)\d{2}) matches "A22-33" as follows:
Group 1: A22
Group 2: -33
Note: the "22" matched by \d{2} is not captured on its own, and neither is the "-" matched by (?:-).
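To see that non-capturing parentheses do not receive a group number, here is a hedged Java sketch. The names NonCapturingDemo and groups are hypothetical; the pattern is the one from the example, written as a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Returns group 0 plus every capturing group; (?:...) groups are not counted.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("(\\w(?:\\d{2}))((?:-)\\d{2})").matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("A22-33");
        System.out.println("capturing groups: " + (g.length - 1));
        for (int i = 1; i < g.length; i++) {
            System.out.println("group " + i + ": " + g[i]);
        }
    }
}
```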
(5) Backreferences to captured text
Example
([ab])\1: the "[ab]" in parentheses can match "a" or "b", and the "\1" that follows refers back to the text matched by that group. If the capturing group matched "a", the backreference can only match "a"; similarly, if the capturing group matched "b", the backreference can only match "b". Because of the backreference "\1", the two characters must be identical, so only "aa" or "bb" matches successfully.
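A backreference can be exercised with a few lines of Java. This is only a sketch; the names BackrefDemo and isDoubled are made up, and the pattern is the lowercase form "([ab])\1" of the example.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // \1 must repeat exactly what group 1 matched.
    static boolean isDoubled(String s) {
        return Pattern.matches("([ab])\\1", s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aa", "bb", "ab", "ba"}) {
            System.out.println(s + " -> " + isDoubled(s));
        }
    }
}
```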
(6) Lookahead
When parentheses are combined with "?=", they form a lookahead: the expression inside is used to validate the match, but the text it matches does not appear in the match result.
Example
(John)(?= Resig):
John: no match, because it is not followed by " Resig"
John Backus: no match, because what follows is not " Resig"
John Resig: match, because "John" is followed by " Resig". However, the match result is "John", not "John Resig".
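The three lookahead cases can be reproduced with a small Java sketch. It assumes the pattern is written as "(John)(?= Resig)", with the space inside the lookahead; the names LookaheadDemo and firstMatch are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    static final Pattern JOHN = Pattern.compile("(John)(?= Resig)");

    // Returns the matched text, or null when there is no match.
    static String firstMatch(String input) {
        Matcher m = JOHN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"John", "John Backus", "John Resig"}) {
            System.out.println("\"" + s + "\" -> " + firstMatch(s));
        }
    }
}
```

The lookahead only checks what follows; the text " Resig" is not consumed and is not part of the returned match.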
7. Predefined character classes
\d  Digit: [0-9]
\D  Non-digit: [^0-9]
\s  Whitespace character: [ \t\n\x0B\f\r]
\S  Non-whitespace character: [^\s]
\w  Word character: [a-zA-Z_0-9]
\W  Non-word character: [^\w]
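A few typical uses of the predefined classes can be sketched in Java as follows. The helper names (digitsOnly, collapseWhitespace, isWord) and the sample inputs are assumptions for illustration only.

```java
import java.util.regex.Pattern;

class PredefinedDemo {
    // Removes every non-digit character (\D).
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Replaces each run of whitespace (\s+) with a single space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // True when the string consists only of word characters (\w).
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel: 555-1234"));
        System.out.println(collapseWhitespace("a \t b\n\nc"));
        System.out.println(isWord("snake_case1"));
    }
}
```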
8. Boundary matchers
^   Beginning of a line
$   End of a line
\b  Word boundary
\B  Non-word boundary
\A  Beginning of the input
\G  End of the previous match
\Z  End of the input, except for the final line terminator (if any)
\z  End of the input
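The difference between some of these boundary matchers is easiest to see in code. The following Java sketch uses hypothetical helper names and sample strings; it contrasts \b with a plain substring search, and \Z with \z on input that ends in a line terminator.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Finds word as a whole word only, using \b on both sides.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    // \Z tolerates one final line terminator after "abc".
    static boolean endsWithAbcBigZ(String s) {
        return Pattern.compile("abc\\Z").matcher(s).find();
    }

    // \z requires "abc" to be the very last characters of the input.
    static boolean endsWithAbcSmallZ(String s) {
        return Pattern.compile("abc\\z").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("concat cat", "cat"));
        System.out.println(containsWord("concatenate", "cat"));
        System.out.println(endsWithAbcBigZ("abc\n"));
        System.out.println(endsWithAbcSmallZ("abc\n"));
    }
}
```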
9. Matching modes
Quantifiers have three matching modes: greedy, reluctant, and possessive.
Greedy    Reluctant  Possessive  Meaning
X?        X??        X?+         Matches X zero times or once
X*        X*?        X*+         Matches X zero or more times
X+        X+?        X++         Matches X one or more times
X{n}      X{n}?      X{n}+       Matches X exactly n times
X{n,}     X{n,}?     X{n,}+      Matches X at least n times
X{n,m}    X{n,m}?    X{n,m}+     Matches X at least n times, but not more than m times
Example
Assume that the string to be analyzed is: xfooxxxxxxfoo
Pattern: .*foo (greedy mode)
The pattern consists of the sub-pattern p1 (.*) and the sub-pattern p2 (foo). The quantifier in p1 uses the default (greedy) mode. In the first round, p1 consumes the entire string xfooxxxxxxfoo. That succeeds, but nothing is left to match p2, so the round fails. In the second round, p1 gives back its last character, splitting the string into s1 = xfooxxxxxxfo and s2 = o. s1 matches p1, but s2 does not match p2, so this round fails too. In the third round, p1 gives back one more character, splitting the string into xfooxxxxxxf and oo, with the same result. In the fourth round, p1 gives back another character, splitting the string into xfooxxxxxx and foo. This time s1 and s2 match p1 and p2 respectively, so the attempts stop and a successful match is returned.
Pattern: .*?foo (reluctant mode)
This is the minimal matching method. In the first attempt, p1 matches nothing, since it may match zero times, and p2 is tried at the start of the string; this fails because the string starts with xfo. In the second attempt, p1 consumes the first character x, and the first three characters of the remainder fooxxxxxxfoo match p2, so the attempts stop and the match xfoo is returned. In this mode, if you continue searching the rest of the string, you will find a second match, xxxxxxfoo, which ends at the end of the string. In greedy mode, the first match already covers the whole string, so there is no second match.
Pattern: .*+foo (possessive mode)
When the match starts, p1 consumes the entire string. That succeeds, but nothing is left to match p2, and p1 never gives characters back, so the match fails.
To put it simply, greedy mode and possessive mode differ as follows: a greedy quantifier starts with as many characters as possible and, when the rest of the pattern fails, gives characters back one at a time for the other sub-patterns to match; a possessive quantifier keeps everything it has matched and never gives anything back.
Possessive quantifiers are not supported by every regular expression engine, and they are used less often than the other two modes.
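The three modes can be compared side by side on the string from the example. The following is a minimal Java sketch; the class name QuantifierModesDemo and the helper findAll (which formats each match as text[start,end)) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModesDemo {
    // Collects every match of regex in input as "text[start,end)".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy     .*foo  : " + findAll(".*foo", input));
        System.out.println("reluctant  .*?foo : " + findAll(".*?foo", input));
        System.out.println("possessive .*+foo : " + findAll(".*+foo", input));
    }
}
```

The greedy pattern yields one match covering the whole string, the reluctant pattern yields two shorter matches, and the possessive pattern yields none.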
Java Regular Expression API
In the JDK, the classes related to regular expressions are located in the java.util.regex package. The two main classes are Pattern and Matcher.
Pattern class
Method summary:
static Pattern compile(String regex)
Compiles the given regular expression into a pattern.
static Pattern compile(String regex, int flags)
Compiles the given regular expression into a pattern with the given flags.
Matcher matcher(CharSequence input)
Creates a matcher that will match the given input against this pattern. CharSequence is an interface; String, StringBuffer, and StringBuilder all implement it.
static boolean matches(String regex, CharSequence input)
Compiles the given regular expression and attempts to match the given input against it.
String[] split(CharSequence input)
Splits the given input sequence around matches of this pattern.
String[] split(CharSequence input, int limit)
Splits the given input sequence around matches of this pattern; when limit is positive, the pattern is applied at most limit - 1 times.
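The Pattern methods listed above can be combined as in the following Java sketch. The delimiter pattern, the sample inputs, and the names PatternApiDemo, splitAll, splitFirst, and isNumber are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternApiDemo {
    // Compiled once and reused: a comma followed by optional whitespace.
    static final Pattern COMMA = Pattern.compile(",\\s*");

    static String[] splitAll(String csv) {
        return COMMA.split(csv);
    }

    // With limit 2 the pattern is applied at most once.
    static String[] splitFirst(String csv) {
        return COMMA.split(csv, 2);
    }

    // The static matches method compiles the regex and matches the whole input.
    static boolean isNumber(String s) {
        return Pattern.matches("\\d+", s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAll("a, b,c")));
        System.out.println(Arrays.toString(splitFirst("a, b,c")));
        System.out.println(isNumber("2024"));
    }
}
```

Compiling the pattern once and keeping it in a constant avoids recompiling it on every call, which the static Pattern.matches convenience method does each time.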
The Pattern class defines an alternative compile method that accepts a set of flags affecting how the pattern is matched. The flags parameter is a bit mask that may combine any of the following public static fields:
Pattern.CANON_EQ
Enables canonical equivalence. When this flag is specified, two characters are considered to match if, and only if, their full canonical decompositions match. For example, the expression "a\u030A" matches the string "\u00E5" (the character å) when this flag is specified. By default, matching does not take canonical equivalence into account. Specifying this flag may impose a performance penalty.
Pattern.CASE_INSENSITIVE
Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag together with this flag. Case-insensitive matching can also be enabled with the embedded flag expression (?i). Specifying this flag may impose a slight performance penalty.
Pattern.COMMENTS
Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of the line. Comments mode can also be enabled with the embedded flag expression (?x).
Pattern.DOTALL
Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default, this expression does not match line terminators. Dotall mode can also be enabled with the embedded flag expression (?s). [The s is a mnemonic for "single-line" mode, which is what this mode is called in Perl.]
Pattern.LITERAL
Enables literal parsing of the pattern. When this flag is specified, the input string that specifies the pattern is treated as a sequence of literal characters; metacharacters and escape sequences in it have no special meaning. The CASE_INSENSITIVE and UNICODE_CASE flags retain their effect on matching when used together with this flag; the other flags become superfluous. There is no embedded flag expression for enabling literal parsing.
Pattern.MULTILINE
Enables multiline mode. In multiline mode, the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default, these expressions match only at the beginning and the end of the entire input sequence. Multiline mode can also be enabled with the embedded flag expression (?m).
Pattern.UNICODE_CASE
Enables Unicode-aware case folding. When this flag is specified, case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched. Unicode-aware case folding can also be enabled with the embedded flag expression (?u). Specifying this flag may impose a performance penalty.
Pattern.UNIX_LINES
Enables Unix lines mode. In this mode, only the "\n" line terminator is recognized in the behavior of ., ^, and $. Unix lines mode can also be enabled with the embedded flag expression (?d).
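Because the flags form a bit mask, several of them can be combined with the | operator. The following Java sketch shows this together with an embedded flag expression; the sample patterns, inputs, and names (FlagsDemo, countTenLines, dotMatchesNewline, matchesTenIgnoringCase) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Counts lines that consist of "ten" in any letter case.
    static int countTenLines(String text) {
        Pattern p = Pattern.compile("^ten$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Checks whether "a.b" matches "a\nb" with or without DOTALL.
    static boolean dotMatchesNewline(boolean dotall) {
        Pattern p = dotall ? Pattern.compile("a.b", Pattern.DOTALL) : Pattern.compile("a.b");
        return p.matcher("a\nb").matches();
    }

    // Same effect as CASE_INSENSITIVE, using the embedded flag (?i).
    static boolean matchesTenIgnoringCase(String s) {
        return Pattern.matches("(?i)ten", s);
    }

    public static void main(String[] args) {
        System.out.println(countTenLines("Ten\nTAN\nten"));
        System.out.println(dotMatchesNewline(false) + " / " + dotMatchesNewline(true));
        System.out.println(matchesTenIgnoringCase("TEN"));
    }
}
```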
Matcher class
Method summary:
int end()
Returns the offset after the last character matched.
int end(int group)
Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
boolean find()
Attempts to find the next subsequence of the input sequence that matches the pattern.
boolean find(int start)
Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
String group()
Returns the input subsequence matched by the previous match operation.
String group(int group)
Returns the input subsequence captured by the given group during the previous match operation.
int groupCount()
Returns the number of capturing groups in this matcher's pattern.
boolean matches()
Attempts to match the entire region against the pattern.
Pattern pattern()
Returns the pattern that is interpreted by this matcher.
String replaceAll(String replacement)
Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
String replaceFirst(String replacement)
Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
Matcher reset()
Resets this matcher.
Matcher reset(CharSequence input)
Resets this matcher with a new input sequence.
int start()
Returns the start index of the previous match.
int start(int group)
Returns the start index of the subsequence captured by the given group during the previous match operation.
MatchResult toMatchResult()
Returns the match state of this matcher as a MatchResult.
Matcher usePattern(Pattern newPattern)
Changes the Pattern that this Matcher uses to find matches with.
Example
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExample {
    public static void main(String[] args) {
        String regex = "((A+)(B(C)))";
        String input = "ABC-AABC-AAAAABCD";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        // The parentheses divide regex into four groups, numbered by the position
        // of their opening parentheses: ((A+)(B(C))), (A+), (B(C)), (C)
        int groupCount = matcher.groupCount();
        // The search itself is equivalent to A+BC; after each successful match,
        // the matched text is split into substrings according to the groups above.
        while (matcher.find()) {
            System.out.print("Matching result: ");
            // Group 0 is the whole match. The loop stops before groupCount,
            // so it prints groups 0 to 3 and leaves out group 4, (C).
            for (int i = 0; i < groupCount; i++) {
                System.out.print("group" + i + ": " + matcher.group(i) + "\t");
            }
            System.out.println();
        }
    }
}
Output:
Matching result: group0: ABC group1: ABC group2: A group3: BC
Matching result: group0: AABC group1: AABC group2: AA group3: BC
Matching result: group0: AAAAABC group1: AAAAABC group2: AAAAA group3: BC
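The remaining Matcher methods, such as start, end, replaceAll, and replaceFirst, can be illustrated on the same input string. This is a sketch only; the pattern "A+" and the names MatcherApiDemo, spans, replaceAllRuns, and replaceFirstRun are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherApiDemo {
    static final Pattern RUN_OF_A = Pattern.compile("A+");

    // Records each match as "start-end", where end is the offset after the match.
    static List<String> spans(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN_OF_A.matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    static String replaceAllRuns(String input) {
        return RUN_OF_A.matcher(input).replaceAll("a");
    }

    static String replaceFirstRun(String input) {
        return RUN_OF_A.matcher(input).replaceFirst("a");
    }

    public static void main(String[] args) {
        String input = "ABC-AABC-AAAAABCD";
        System.out.println(spans(input));
        System.out.println(replaceAllRuns(input));
        System.out.println(replaceFirstRun(input));
    }
}
```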