Syntax rules for Java regular expressions
Regular expressions are a powerful and flexible text-processing tool. With them you can build complex text patterns programmatically and search input strings for those patterns. Once you have found the parts that match, you can handle them however you wish. Regular expressions provide a completely general way to solve string-processing problems: matching, selecting, editing, and validating.
First, look at the constructs of Java regular expressions listed below, or refer to the API documentation for java.util.regex.Pattern.
Characters
Construct | Matches
x | The character x
\\ | The backslash character
\0n | The character with octal value 0n (0 <= n <= 7)
\0nn | The character with octal value 0nn (0 <= n <= 7)
\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh | The character with hexadecimal value 0xhh
\uhhhh | The character with hexadecimal value 0xhhhh
\t | The tab character ('\u0009')
\n | The newline (line feed) character ('\u000A')
\r | The carriage-return character ('\u000D')
\f | The form-feed character ('\u000C')
\a | The alert (bell) character ('\u0007')
\e | The escape character ('\u001B')
\cx | The control character corresponding to x
The constructs above match individual characters. For example, the regular expression a matches the character a, and the regular expression for a backslash is \\, so to match a literal \ from Java source code you write the string literal "\\\\". The backslash character ('\') introduces the escaped constructs defined in the table above, and it also quotes characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.
It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; such sequences are reserved for future extensions to the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.
As required by the Java Language Specification, backslashes in string literals in Java source code are interpreted as Unicode escapes or other character escapes. You therefore have to double the backslashes in a string literal that represents a regular expression, to protect them from interpretation by the Java compiler. For example, the string literal "\b", when interpreted as a regular expression, matches a single backspace character, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), use the string literal "\\(hello\\)".
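The escaping rules above can be checked with a short sketch. The class name EscapeDemo and its helper methods are hypothetical, chosen only for this illustration, and \y is used as a sample alphabetic escape that current Java versions do not define.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    // True if the regex finds a match anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True if the regex compiles without a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\\b" is the two-character regex \b: a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));      // true
        // "\b" is a single backspace character (U+0008), absent from the input.
        System.out.println(found("\b", "a cat sat"));             // false
        // "\\(hello\\)" is the regex \(hello\): literal parentheses.
        System.out.println(found("\\(hello\\)", "say (hello)"));  // true
        // "\\\\" is the regex \\: one literal backslash.
        System.out.println(found("\\\\", "C:\\temp"));            // true
        // A backslash before an undefined alphabetic escape is rejected.
        System.out.println(compiles("\\y"));                      // false
        // A backslash before a non-alphabetic character is always allowed.
        System.out.println(compiles("\\{"));                      // true
    }
}
```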
The following are the character-class constructs; their meaning is fairly self-explanatory.
Character classes
Construct | Matches
[abc] | a, b, or c (simple class)
[^abc] | Any character except a, b, or c (negation)
[a-zA-Z] | a through z or A through Z, inclusive (range)
[a-d[m-p]] | a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] | d, e, or f (intersection)
[a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction)
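As a minimal sketch of the union, intersection, and subtraction forms in the table, the following uses Pattern.matches to test single characters. The class name CharClassDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[^abc]", "a"));         // false: a is excluded
        System.out.println(matches("[a-zA-Z]", "Q"));       // true: inside the A-Z range
        System.out.println(matches("[a-d[m-p]]", "n"));     // true: union covers m-p
        System.out.println(matches("[a-z&&[def]]", "e"));   // true: e is in both sets
        System.out.println(matches("[a-z&&[^bc]]", "b"));   // false: b is subtracted
        System.out.println(matches("[a-z&&[^m-p]]", "q"));  // true: q lies outside m-p
    }
}
```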
Java also predefines some character classes that can be used directly in regular expressions as a convenient shorthand.
Predefined character classes
Construct | Matches
. | Any character (may or may not match line terminators)
\d | A digit: [0-9]
\D | A non-digit: [^0-9]
\s | A whitespace character: [ \t\n\x0B\f\r]
\S | A non-whitespace character: [^\s]
\w | A word character: [a-zA-Z_0-9]
\W | A non-word character: [^\w]
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
· A newline (line feed) character ('\n'),
· A carriage-return character followed immediately by a newline character ("\r\n"),
· A standalone carriage-return character ('\r'),
· A next-line character ('\u0085'),
· A line-separator character ('\u2028'), or
· A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, the newline character is the only line terminator that is recognized.
Unless the DOTALL flag is specified, the regular expression . matches any character except a line terminator.
By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator except at the end of the input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence. UNIX_LINES, DOTALL, and MULTILINE are constants defined in class Pattern; the mode is chosen through the flags argument of the method Pattern.compile(String regex, int flags).
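The effect of the three flags described above can be sketched by counting matches with and without each flag. The class name FlagsDemo and the helper count are hypothetical; the flags themselves are the Pattern constants named in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Counts how many times the regex, compiled with the given flags, is found.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Without MULTILINE, ^ and $ anchor only to the whole input.
        System.out.println(count("^\\w+$", 0, text));                  // 0
        // With MULTILINE, every line is matched on its own.
        System.out.println(count("^\\w+$", Pattern.MULTILINE, text));  // 3
        // Without DOTALL, the dot does not match the newline.
        System.out.println(count("one.two", 0, text));                 // 0
        System.out.println(count("one.two", Pattern.DOTALL, text));    // 1
        // With UNIX_LINES, '\r' is no longer a line terminator, so the dot matches it.
        System.out.println(count("a.b", 0, "a\rb"));                   // 0
        System.out.println(count("a.b", Pattern.UNIX_LINES, "a\rb"));  // 1
    }
}
```

Flags can be combined with the bitwise OR operator, for example Pattern.MULTILINE | Pattern.DOTALL.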
Boundary matchers
Construct | Matches
^ | The beginning of a line
$ | The end of a line
\b | A word boundary
\B | A non-word boundary
\A | The beginning of the input
\G | The end of the previous match
\Z | The end of the input, except for the final line terminator, if any
\z | The end of the input
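A few of these boundary matchers are easy to confuse, so here is a small sketch that counts matches for \b, \B, \Z, \z, and \G. The class name BoundaryDemo and the helper count are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts how many times the regex is found in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // \b: only the two standalone words count, not the "cat" inside "concat".
        System.out.println(count("\\bcat\\b", "cat concat cat"));  // 2
        // \B: only the "cat" preceded by another word character counts.
        System.out.println(count("\\Bcat", "cat concat cat"));     // 1
        // \Z tolerates a final line terminator; \z does not.
        System.out.println(count("cat\\Z", "cat\n"));              // 1
        System.out.println(count("cat\\z", "cat\n"));              // 0
        // \G: each match must start where the previous one ended,
        // so matching stops at the first non-digit.
        System.out.println(count("\\G\\d", "123a45"));             // 3
    }
}
```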
Quantifiers describe how a pattern absorbs input text. They come in three kinds: greedy, reluctant, and possessive. A greedy quantifier matches as much of the input as the pattern allows and gives characters back only if the rest of the pattern fails. A reluctant quantifier, written with a trailing question mark, matches the minimum number of characters needed to satisfy the pattern. A possessive quantifier, written with a trailing plus sign, also takes as much as it can but never gives anything back: when a regular expression is applied to a string, the engine normally keeps a large amount of state so that it can backtrack if the match fails, and a possessive quantifier discards that state. Quantifiers are greedy unless another kind is specified.
Quantifiers
Greedy | Reluctant | Possessive | Matches
X? | X?? | X?+ | X, once or not at all
X* | X*? | X*+ | X, zero or more times
X+ | X+? | X++ | X, one or more times
X{n} | X{n}? | X{n}+ | X, exactly n times
X{n,} | X{n,}? | X{n,}+ | X, at least n times
X{n,m} | X{n,m}? | X{n,m}+ | X, at least n but not more than m times
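The difference between the three kinds of quantifier shows up when the same input is searched with .*foo, .*?foo, and .*+foo. This is a minimal sketch; the class name QuantifierDemo, the helper firstMatch, and the sample input are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off just enough for the last "foo".
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: .*? takes as little as possible, stopping at the first "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: .*+ takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```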
Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))) there are four such groups:
1 | ((A)(B(C)))
2 | (A)
3 | (B(C))
4 | (C)
Group 0 always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence can be used later in the expression through a back reference, or it can be retrieved from the matcher once the match operation is complete. The captured input associated with a group is always the subsequence that the group most recently matched.
A group that begins with (?: is a pure, non-capturing group: it does not capture text and does not count toward the group total.
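Group numbering, back references, and non-capturing groups can be sketched as follows. The class name GroupDemo and the helper groups are hypothetical; the helper returns group 0 followed by each numbered group for a full match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns groups 0..groupCount() for a full match, or an empty array if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Four groups, numbered by their opening parentheses.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));  // [ABC, ABC, A, BC, C]
        // (?:A) does not capture, so (B) becomes group 1.
        System.out.println(Arrays.toString(groups("(?:A)(B)", "AB")));      // [AB, B]
        // \1 refers back to whatever group 1 captured.
        System.out.println(Arrays.toString(groups("(\\w)\\1", "aa")));      // [aa, a]
        // A repeated group keeps only its most recent capture.
        System.out.println(Arrays.toString(groups("(a|b)+", "ab")));        // [ab, b]
    }
}
```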
Logical operators
Construct | Matches
XY | X followed by Y
X|Y | Either X or Y
(X) | X, as a capturing group; \i refers back to the i-th capturing group within the expression