Meaning of expression:
1. Characters
x The character x. For example, a denotes the character a.
\\ The backslash character. Written as \\\\ in Java source. (Note: Java processes the string literal first, turning \\\\ into the regular expression \\; the regex engine then resolves \\ to a single \. The same doubling applies to every escape listed in this section: each backslash is written twice in Java source.)
\0n The character with octal value 0n (0 <= n <= 7)
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\uhhhh The character with hexadecimal value 0xhhhh
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
\a The alert (bell) character ('\u0007')
\e The escape character ('\u001B')
\cx The control character corresponding to x
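The double-escaping rule is easier to see in running code. The following is a minimal sketch, not taken from the original text; the class name EscapeDemo and its helper method are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class EscapeDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Four backslashes in source become a two-backslash regex,
        // which matches one literal backslash.
        System.out.println(matches("\\\\", "\\"));   // true
        // The regex escape for a tab, with its backslash doubled in source.
        System.out.println(matches("\\t", "\t"));    // true
        // Octal 0101 and hexadecimal 0x41 both denote the letter A.
        System.out.println(matches("\\0101", "A"));  // true
        System.out.println(matches("\\x41", "A"));   // true
        // Control-A is the character with code 1.
        System.out.println(matches("\\cA", String.valueOf((char) 1))); // true
    }
}
```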
2. Character classes
[abc] a, b, or c (simple class). For example, [egd] matches a single character that is e, g, or d.
[^abc] Any character except a, b, or c (negation). For example, [^egd] matches a single character that is not e, g, or d.
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
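Union, intersection, and subtraction can be checked one character at a time. This is a small sketch under the assumption that whole-input matching is enough for the illustration; CharClassDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class CharClassDemo {
    // True when the single-character input belongs to the class.
    static boolean inClass(String charClass, String ch) {
        return Pattern.matches(charClass, ch);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-d[m-p]]", "n"));    // true: union covers m through p
        System.out.println(inClass("[a-d[m-p]]", "f"));    // false: f is in neither range
        System.out.println(inClass("[a-z&&[def]]", "e"));  // true: intersection keeps d, e, f
        System.out.println(inClass("[a-z&&[def]]", "g"));  // false
        System.out.println(inClass("[a-z&&[^bc]]", "b"));  // false: b is subtracted
        System.out.println(inClass("[a-z&&[^m-p]]", "q")); // true: q is outside m through p
    }
}
```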
3. Predefined character classes (note that the backslash is written twice in Java source; for example, \d is written as \\d)
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
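The predefined classes are often used through String.replaceAll and String.split. The sketch below uses hypothetical helper names and sample inputs chosen only for illustration.

```java
import java.util.Arrays;

// Illustrative only: the class and method names are hypothetical.
class PredefinedClassDemo {
    // Removes every non-digit character.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // Removes every non-word character.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("tel: 555-0100"));               // 5550100
        System.out.println(Arrays.toString(words("one  two\tthree"))); // [one, two, three]
        System.out.println(stripNonWord("hello, world!"));             // helloworld
    }
}
```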
4. POSIX character classes (US-ASCII only) (note that the backslash is written twice in Java source; for example, \p{Lower} is written as \\p{Lower})
\p{Lower} A lowercase alphabetic character: [a-z]
\p{Upper} An uppercase alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
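A short sketch of the POSIX classes in use follows. It assumes the default compile flags, under which these classes cover US-ASCII only; PosixClassDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class PosixClassDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{Lower}+", "abc"));   // true
        System.out.println(matches("\\p{Upper}+", "ABC"));   // true
        System.out.println(matches("\\p{XDigit}+", "1aF"));  // true
        System.out.println(matches("\\p{XDigit}+", "1g"));   // false: g is not a hex digit
        System.out.println(matches("\\p{Punct}", "!"));      // true
        System.out.println(matches("\\p{Blank}", " "));      // true
        // With default flags the class is ASCII-only, so an accented e is not Alpha.
        System.out.println(matches("\\p{Alpha}", "\u00e9")); // false
    }
}
```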
5. java.lang.Character classes (simple Java character type)
\p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
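The equivalence can be spot-checked by comparing the regex class with the Character method for a few sample characters. This is only a sketch; JavaCharClassDemo and agrees are hypothetical names, and a handful of samples is not a proof of equivalence.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class JavaCharClassDemo {
    // True when the regex class and Character.isLowerCase give the same answer for c.
    static boolean agrees(char c) {
        boolean viaRegex = Pattern.matches("\\p{javaLowerCase}", String.valueOf(c));
        return viaRegex == Character.isLowerCase(c);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'a', 'A', '1', ' '}) {
            System.out.println("'" + c + "' agrees: " + agrees(c));
        }
    }
}
```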
6. Classes for Unicode blocks and categories
\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
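The block and category classes can be tried on single characters, as in the sketch below. The sample characters are my own choice and UnicodeClassDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class UnicodeClassDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha
        System.out.println(matches("\\p{InGreek}", alpha));           // true
        System.out.println(matches("\\P{InGreek}", "a"));             // true: Latin a is outside the block
        System.out.println(matches("\\p{Lu}", "A"));                  // true
        System.out.println(matches("\\p{Sc}", "$"));                  // true: dollar sign is a currency symbol
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));     // true: a letter, not uppercase
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));     // false
    }
}
```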
7. Boundary matchers
^ The beginning of a line; used at the beginning of the regular expression. For example, ^(abc) matches a string that begins with abc. Note that for ^ to match at the start of every line, rather than only at the start of the input, the pattern must be compiled with the MULTILINE flag, for example Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
$ The end of a line; used at the end of the regular expression. For example, (^bca).*(abc$) matches a line that begins with bca and ends with abc.
\b A word boundary. For example, \babc matches abc at the start of a word (abcjj matches) and abc\b matches abc at the end of a word (jjabc matches).
\B A non-word boundary. For example, \Babc\B matches abc only in the middle of a word (jjabcjj matches; jjabc and abcjj do not).
\A The beginning of the input
\G The end of the previous match (in the author's opinion, rarely useful). For example, \\Gdog looks for dog at the position where the previous match ended; if there was no previous match, it looks at the beginning of the input. Note that if the input does not begin with dog, nothing can match.
\Z The end of the input, except for the final line terminator, if any
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence.
The following are recognized as line terminators:
‐ a newline (line feed) character ('\n'),
‐ a carriage-return character followed immediately by a newline character ("\r\n"),
‐ a standalone carriage-return character ('\r'),
‐ a next-line character ('\u0085'),
‐ a line-separator character ('\u2028'), or
‐ a paragraph-separator character ('\u2029').
\z The end of the input
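The boundary matchers are easiest to compare side by side. The sketch below uses find() rather than whole-input matching; BoundaryDemo and its helpers are hypothetical names, and the inputs are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class BoundaryDemo {
    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Number of successive matches of the pattern in the input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(found("\\babc", "abcjj"));        // true: abc starts a word
        System.out.println(found("\\babc", "jjabc"));        // false: no boundary before abc
        System.out.println(found("\\Babc\\B", "jjabcjj"));   // true: abc is inside a word
        // Without MULTILINE, ^ matches only at the start of the input.
        System.out.println(count(Pattern.compile("^abc"), "abc\nabc"));                    // 1
        System.out.println(count(Pattern.compile("^abc", Pattern.MULTILINE), "abc\nabc")); // 2
        // Uppercase Z tolerates a final line terminator, lowercase z does not.
        System.out.println(found("abc\\Z", "abc\n"));        // true
        System.out.println(found("abc\\z", "abc\n"));        // false
        // With \G, each match must start where the previous one ended.
        System.out.println(count(Pattern.compile("\\Gdog"), "dogdog cat dog"));            // 2
    }
}
```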
When compiling a pattern, you can set one or more flags, for example:
Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
The following six flags are described here:
‐ CASE_INSENSITIVE: Matching ignores case; by default only US-ASCII characters are considered.
‐ UNICODE_CASE: When combined with CASE_INSENSITIVE, case-insensitive matching also covers Unicode letters
‐ MULTILINE: ^ and $ match the start and end of each line, not only of the entire input
‐ UNIX_LINES: Only '\n' is treated as a line terminator when matching ., ^, and $
‐ DOTALL: With this flag, the . symbol matches all characters, including line terminators
‐ CANON_EQ: Takes canonical equivalence of Unicode characters into account
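The effect of several of these flags can be observed directly. The following is a sketch with hypothetical names (FlagDemo, matches); CANON_EQ is left out to keep the example short.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class FlagDemo {
    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lowerE = "\u00e9"; // e with acute accent, lowercase
        String upperE = "\u00c9"; // e with acute accent, uppercase
        System.out.println(matches("abc", Pattern.CASE_INSENSITIVE, "ABC"));   // true
        // Outside US-ASCII, CASE_INSENSITIVE alone is not enough.
        System.out.println(matches(lowerE, Pattern.CASE_INSENSITIVE, upperE)); // false
        System.out.println(matches(lowerE,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upperE));     // true
        // The dot skips line terminators unless DOTALL is set.
        System.out.println(matches("a.b", 0, "a\nb"));                         // false
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));            // true
        // Under UNIX_LINES a carriage return is an ordinary character.
        System.out.println(matches("a.b", 0, "a\rb"));                         // false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb"));        // true
    }
}
```

Flags are bit masks, so combining them with | is the conventional form.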
8. Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n times, but not more than m times
9. Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n times, but not more than m times
10. Possessive quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n times, but not more than m times
The difference between greedy, reluctant, and possessive quantifiers is as follows. (It only shows up when the quantified part could match in more than one way.)
Greedy quantifiers are considered "greedy" because they first consume as much of the input string as they can. If the rest of the pattern then fails to match, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no more characters to back off from. The last thing it tries to match is 1 or 0 characters, depending on the quantifier used in the expression.
Reluctant quantifiers take the opposite approach: they start at the beginning of the input and consume one more character at a time, only as needed to find a match. The last thing they try to match is the entire input string.
Finally, possessive quantifiers always consume as much of the input as they can and try to match once (and only once). Unlike greedy quantifiers, possessive quantifiers never back off.
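The three behaviours can be compared on one input. The sketch below uses the input xfooxxxxxxfoo and the patterns .*foo, .*?foo, and .*+foo; QuantifierDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class QuantifierDemo {
    // Collects every successive match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: takes everything, then backs off just far enough to match foo.
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: grows one character at a time, so it stops at the first foo.
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: takes everything and never backs off, so foo cannot match.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```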
11. Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group. For example, (abc) captures abc as a whole.
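Alternation and capturing often appear together. The sketch below reads back what the group captured; GroupDemo, firstGroup, and the sample pattern are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class GroupDemo {
    // Returns what capturing group 1 matched, or null when the whole input does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group limits the alternation to cat or dog, then s must follow.
        System.out.println(firstGroup("(cat|dog)s", "dogs"));  // dog
        System.out.println(firstGroup("(cat|dog)s", "birds")); // null
    }
}
```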
12. Back references
\n Whatever the nth capturing group matched
Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))), there are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
In an expression, you can use \n to refer to the corresponding group. For example, (ab)34\1 matches ab34ab, and (ab)34(cd)\1\2 matches ab34cdabcd.
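Group numbering and back references can both be verified with a Matcher. This is a minimal sketch; BackReferenceDemo and groups are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class BackReferenceDemo {
    // Returns the text of groups 1..n, or an empty array when the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, A, BC, C]
        // A back reference must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("(ab)34\\1", "ab34ab"));        // true
        System.out.println(Pattern.matches("(ab)34\\1", "ab34cd"));        // false
        System.out.println(Pattern.matches("(ab)34(cd)\\1\\2", "ab34cdabcd")); // true
    }
}
```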
13. Quotation
\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E. The text between \Q and \E is used literally (the doubling of backslashes in Java source described in section 1 still applies). For example, the pattern written in Java source as ab\\Q{|}\\\\E
can match the string written in Java source as ab{|}\\ (that is, ab{|} followed by one backslash).
\E Nothing, but ends the quoting started by \Q
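Quoting matters whenever the text to match contains metacharacters. The sketch below compares an unquoted pattern with the quoted form and with Pattern.quote, which wraps its argument in the same quoting markers; QuoteDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class QuoteDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "1+1=2";
        // Unquoted, the plus sign is a quantifier, so the literal text does not match.
        System.out.println(matches("1+1=2", text));             // false
        // Between the quoting markers every character is literal.
        System.out.println(matches("\\Q1+1=2\\E", text));       // true
        // Pattern.quote builds the quoted form for an arbitrary string.
        System.out.println(matches(Pattern.quote(text), text)); // true
        // A single backslash quotes just the next character.
        System.out.println(matches("a\\.b", "a.b"));            // true
        System.out.println(matches("a\\.b", "axb"));            // false
    }
}
```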
14. Special constructs (non-capturing)
(?:X) X, as a non-capturing group
(?idmsux-idmsux) Nothing, but turns match flags on or off. For example, in the expression (?i)abc(?-i)def, (?i) turns on case-insensitive matching, so abc is matched without regard to case; (?-i) turns it off again, so def is matched case-sensitively.
The idmsux flags are as follows:
‐ i CASE_INSENSITIVE: US-ASCII characters are matched case-insensitively. (?i)
‐ d UNIX_LINES: Turns on Unix line mode, in which only \n is recognized as a line terminator. (?d)
‐ m MULTILINE: Multi-line mode. (?m)
Unix line breaks are \n
Windows line breaks are \r\n
‐ s DOTALL: The . symbol matches any character, including line terminators. (?s)
‐ u UNICODE_CASE: Unicode characters are matched case-insensitively (in combination with CASE_INSENSITIVE). (?u)
‐ x COMMENTS: Permits comments in the pattern: whitespace in the pattern is ignored, and everything from "#" to the end of the line is a comment. (?x) For example, (?x)abc#asfsdadsa can match the string abc.
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags turned on or off. Like the construct above; the expression above can be rewritten as (?i:abc)def or as (?i)abc(?-i:def).
(?=X) X, via zero-width positive lookahead. The match continues only if subexpression X matches to the right of this position. For example, \w+(?=\d) matches word characters that are followed by a digit, without including that digit in the match (the lookahead consumes no input).
(?!X) X, via zero-width negative lookahead. The match continues only if subexpression X does not match to the right of this position. For example, \w+(?!\d) matches word characters that are not followed by a digit, and no digit is captured by the lookahead.
(?<=X) X, via zero-width positive lookbehind. The match continues only if subexpression X matches to the left of this position. For example, (?<=19)99 matches a 99 that is preceded by 19, without including the preceding 19 in the match (the lookbehind consumes no input).
(?<!X) X, via zero-width negative lookbehind. The match continues only if subexpression X does not match to the left of this position.
(?>X) X, as an independent, non-capturing group (no backtracking into the group)
The difference between (?:X) and (?>X) is that (?>X) does not backtrack. For example, suppose the string being matched is abcm.
The expression a(?:b|bc)m matches, but a(?>b|bc)m does not: once the latter has matched b, it leaves the group and never goes back to try the other alternative inside it. This can speed up matching.
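The constructs in this section can be exercised as follows. This is a sketch with hypothetical names (SpecialConstructDemo, firstMatch, firstStart); the foo(?!bar) pattern and the sample inputs are my own and do not come from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class SpecialConstructDemo {
    // Text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Start index of the first match, or -1 when nothing matches.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Inline flags: only abc is case-insensitive.
        System.out.println(Pattern.matches("(?i)abc(?-i)def", "ABCdef")); // true
        System.out.println(Pattern.matches("(?i)abc(?-i)def", "ABCDEF")); // false
        System.out.println(Pattern.matches("(?i:abc)def", "AbCdef"));     // true
        // Comments mode: everything after # is ignored.
        System.out.println(Pattern.matches("(?x)abc#asfsdadsa", "abc"));  // true
        // Positive lookahead: the digit is required but not part of the match.
        System.out.println(firstMatch("\\w+(?=\\d)", "abc1"));            // abc
        // Negative lookahead: skips the foo that is followed by bar.
        System.out.println(firstStart("foo(?!bar)", "foobar foobaz"));    // 7
        // Positive lookbehind: only the 99 preceded by 19 qualifies.
        System.out.println(firstStart("(?<=19)99", "1999 2099"));         // 2
        // Independent group: after matching b it never retries bc.
        System.out.println(Pattern.matches("a(?:b|bc)m", "abcm"));        // true
        System.out.println(Pattern.matches("a(?>b|bc)m", "abcm"));        // false
    }
}
```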