Java Regular Expressions


Expression meaning:
1. Characters
x  The character x. For example, a matches the character a.
\\  The backslash character. In a Java string literal it is written "\\\\": the Java compiler first reduces "\\\\" to \\, and the regex engine then reads \\ as one literal backslash. For the same reason, every backslash in the constructs below must be doubled in Java source (for example, \d is written "\\d"). Escapes that Java string literals also understand, such as \t and \n, happen to work with a single backslash too, because the compiler inserts the actual character.
\0n  The character with octal value 0n (0 <= n <= 7)
\0nn  The character with octal value 0nn (0 <= n <= 7)
\0mnn  The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh  The character with hexadecimal value 0xhh
\uhhhh  The character with hexadecimal value 0xhhhh
\t  The tab character ('\u0009')
\n  The newline (line feed) character ('\u000A')
\r  The carriage-return character ('\u000D')
\f  The form-feed character ('\u000C')
\a  The alert (bell) character ('\u0007')
\e  The escape character ('\u001B')
\cx  The control character corresponding to x
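The double-escaping rule is easier to see in running code. The following is a minimal sketch; the class and method names are illustrative only, and the checks use the standard Pattern.matches method, which requires the whole input to match.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex for one literal backslash is two backslashes,
    // which takes four backslashes in Java source.
    static boolean matchesSingleBackslash(String input) {
        return Pattern.matches("\\\\", input);
    }

    // Octal, hexadecimal, and Unicode escapes can all denote the letter A (code 65).
    static boolean numericEscapesMatchA() {
        return Pattern.matches("\\0101", "A")
                && Pattern.matches("\\x41", "A")
                && Pattern.matches("\\u0041", "A");
    }

    // A real tab inserted by the compiler and the regex-level tab escape
    // both match a tab character.
    static boolean bothTabFormsMatch() {
        return Pattern.matches("\t", "\t") && Pattern.matches("\\t", "\t");
    }

    public static void main(String[] args) {
        System.out.println(matchesSingleBackslash("\\")); // input is one backslash
        System.out.println(numericEscapesMatchA());
        System.out.println(bothTabFormsMatch());
    }
}
```

Each line printed by main should be true.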
2. Character classes
[abc]  a, b, or c (simple class). For example, [egd] matches the character e, g, or d.
[^abc]  Any character except a, b, or c (negation). For example, [^egd] matches any character other than e, g, or d.
[a-zA-Z]  a through z or A through Z, inclusive (range)
[a-d[m-p]]  a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]  d, e, or f (intersection)
[a-z&&[^bc]]  a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
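Union, intersection, and subtraction can be checked one character at a time. This is a small sketch; CharClassDemo and its matches helper are names made up for the example.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the character class matches the whole (one-character) input.
    static boolean matches(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));    // true: n is in m-p (union)
        System.out.println(matches("[a-z&&[def]]", "e"));  // true: e is in both sets
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false: b is subtracted
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // true: q lies outside m-p
    }
}
```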
3. Predefined character classes (note that the backslash must be doubled in a Java string literal, for example \d is written "\\d")
.  Any character (may or may not match line terminators)
\d  A digit: [0-9]
\D  A non-digit: [^0-9]
\s  A whitespace character: [ \t\n\x0B\f\r]
\S  A non-whitespace character: [^\s]
\w  A word character: [a-zA-Z_0-9]
\W  A non-word character: [^\w]
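The predefined classes are most often used through String.replaceAll, String.split, and Pattern.matches. A minimal sketch, with helper names invented for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Keeps only the digits by deleting every non-digit character.
    static String digitsOnly(String input) {
        return input.replaceAll("\\D", "");
    }

    // Splits the trimmed input on runs of whitespace.
    static String[] words(String input) {
        return input.trim().split("\\s+");
    }

    // True if the input consists solely of word characters.
    static boolean isIdentifierLike(String input) {
        return Pattern.matches("\\w+", input);
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel: 555-0199"));             // 5550199
        System.out.println(Arrays.toString(words(" a\tb \n c ")));   // [a, b, c]
        System.out.println(isIdentifierLike("user_42"));             // true
        System.out.println(isIdentifierLike("user-42"));             // false
    }
}
```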
4. POSIX character classes (US-ASCII only) (note that the backslash must be doubled in a Java string literal, for example \p{Lower} is written "\\p{Lower}")
\p{Lower}  A lowercase letter: [a-z]
\p{Upper}  An uppercase letter: [A-Z]
\p{ASCII}  All ASCII: [\x00-\x7F]
\p{Alpha}  A letter: [\p{Lower}\p{Upper}]
\p{Digit}  A decimal digit: [0-9]
\p{Alnum}  An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}  Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}  A visible character: [\p{Alnum}\p{Punct}]
\p{Print}  A printable character: [\p{Graph}\x20]
\p{Blank}  A space or a tab: [ \t]
\p{Cntrl}  A control character: [\x00-\x1F\x7F]
\p{XDigit}  A hexadecimal digit: [0-9a-fA-F]
\p{Space}  A whitespace character: [ \t\n\x0B\f\r]
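A short sketch of a few POSIX classes in use follows. The helper names are illustrative, and the last check relies on the default behaviour in which these classes cover US-ASCII only.

```java
class PosixClassDemo {
    // Removes every ASCII punctuation character.
    static String stripPunctuation(String input) {
        return input.replaceAll("\\p{Punct}", "");
    }

    // True if the input is a non-empty run of hexadecimal digits.
    static boolean isHex(String input) {
        return input.matches("\\p{XDigit}+");
    }

    // True if the input is a non-empty run of ASCII uppercase letters.
    static boolean isAsciiUpper(String input) {
        return input.matches("\\p{Upper}+");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Hello, world!")); // Hello world
        System.out.println(isHex("CAFE01"));                   // true
        System.out.println(isHex("xyz"));                      // false
        System.out.println(isAsciiUpper("ABC"));               // true
        // An accented capital E is uppercase but not ASCII, so this prints false.
        System.out.println(isAsciiUpper("\u00C9"));
    }
}
```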
5. java.lang.Character classes (simple Java character types)
\p{javaLowerCase}  Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}  Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}  Equivalent to java.lang.Character.isMirrored()
6. Classes for Unicode blocks and categories
\p{InGreek}  A character in the Greek block (simple block)
\p{Lu}  An uppercase letter (simple category)
\p{Sc}  A currency symbol
\P{InGreek}  Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
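The java.lang.Character and Unicode classes from sections 5 and 6 can be exercised as follows. This is only a sketch: the helper names are hypothetical, and the non-ASCII test strings are written as Unicode escapes so the source file stays plain ASCII.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // True if every character is in the Greek block.
    static boolean isGreek(String s) {
        return s.matches("\\p{InGreek}+");
    }

    // True if every character is an uppercase letter (category Lu).
    static boolean isUppercaseLetters(String s) {
        return s.matches("\\p{Lu}+");
    }

    // True if the input contains at least one currency symbol (category Sc).
    static boolean hasCurrencySymbol(String s) {
        return Pattern.compile("\\p{Sc}").matcher(s).find();
    }

    // True if every character is a letter that is not uppercase.
    static boolean isNonUppercaseLetters(String s) {
        return s.matches("[\\p{L}&&[^\\p{Lu}]]+");
    }

    // True if every character passes Character.isLowerCase.
    static boolean isJavaLowerCase(String s) {
        return s.matches("\\p{javaLowerCase}+");
    }

    public static void main(String[] args) {
        System.out.println(isGreek("\u03B1\u03B2"));         // alpha, beta: true
        System.out.println(isUppercaseLetters("\u00C9"));    // accented capital E: true
        System.out.println(hasCurrencySymbol("Price: $5"));  // true
        System.out.println(isNonUppercaseLetters("aBc"));    // false
        System.out.println(isJavaLowerCase("abc"));          // true
    }
}
```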
7. Boundary matchers
^  The beginning of a line. Use ^ at the start of a regular expression. For example, ^(abc) matches a string that starts with abc. By default ^ matches only at the beginning of the entire input; to make it match at the start of every line, set the MULTILINE flag when compiling, for example, Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
$  The end of a line. Use it at the end of a regular expression. For example, (^bca).*(abc$) matches a line that starts with bca and ends with abc.
\b  A word boundary. For example, \b(abc) matches abc at the start of a word (abcjj matches, jjabc does not), and (abc)\b matches abc at the end of a word (jjabc matches, abcjj does not).
\B  A non-word boundary. For example, \B(abc)\B requires abc to sit in the middle of a word: jjabcjj matches, but jjabc and abcjj do not.
\A  The beginning of the input
\G  The end of the previous match (I personally find this one rarely useful). For example, \Gdog matches dog only where the previous match ended. On the first match attempt there is no previous match, so dog must appear at the very start of the input.
\Z  The end of the input but for the final line terminator, if any
A line terminator is a sequence of one or two characters that marks the end of a line in the input character sequence.
The following are recognized as line terminators:
- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').
\z  The end of the input
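The boundary matchers are easy to get wrong, so here is a small sketch that exercises each one. BoundaryDemo and its two helpers are illustrative names; find() is used because boundary matchers are normally applied to a substring search rather than a whole-input match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Counts the matches of the regex, compiled with the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(found("\\babc", "abcjj"));       // true: abc starts a word
        System.out.println(found("\\babc", "jjabc"));       // false
        System.out.println(found("\\Babc\\B", "jjabcjj"));  // true: abc is inside a word
        System.out.println(found("\\Babc\\B", "jjabc"));    // false
        System.out.println(count("^abc", 0, "abc\nabc"));                 // 1
        System.out.println(count("^abc", Pattern.MULTILINE, "abc\nabc")); // 2
        System.out.println(found("abc\\Z", "abc\n"));       // true: final terminator ignored
        System.out.println(found("abc\\z", "abc\n"));       // false: input ends after the newline
        System.out.println(count("\\Gdog", 0, "dogdog cat dog")); // 2: only the adjacent dogs
    }
}
```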
When compiling a pattern, you can set one or more flags, for example:
Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Six of the supported flags are:
- CASE_INSENSITIVE: matching is case-insensitive. By default, this flag only takes US-ASCII characters into account.
- UNICODE_CASE: when combined with CASE_INSENSITIVE, case-insensitive matching also covers Unicode letters.
- MULTILINE: ^ and $ match at the start and end of each line, rather than only at the start and end of the entire input.
- UNIX_LINES: only '\n' is recognized as a line terminator in the behavior of ., ^, and $.
- DOTALL: with this flag, the . symbol matches any character, including line terminators.
- CANON_EQ: takes the canonical equivalence of Unicode characters into account.
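A quick way to see what a flag changes is to run the same pattern with and without it. The sketch below does that for three of the flags; the helper name is made up, and the accented letters are written as Unicode escapes.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // True if the regex, compiled with the given flags, matches the whole input.
    static boolean matchesWithFlags(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(matchesWithFlags("abc", "ABC", 0));   // false
        System.out.println(matchesWithFlags("abc", "ABC", ci));  // true

        // Lowercase versus uppercase accented e: ASCII-only folding is not enough.
        System.out.println(matchesWithFlags("\u00E9", "\u00C9", ci));                        // false
        System.out.println(matchesWithFlags("\u00E9", "\u00C9", ci | Pattern.UNICODE_CASE)); // true

        // The dot skips line terminators unless DOTALL is set.
        System.out.println(matchesWithFlags("a.b", "a\nb", 0));              // false
        System.out.println(matchesWithFlags("a.b", "a\nb", Pattern.DOTALL)); // true
    }
}
```

Flags are bit masks, so they are combined with the bitwise OR operator.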
8. Greedy quantifiers
X?  X, once or not at all
X*  X, zero or more times
X+  X, one or more times
X{n}  X, exactly n times
X{n,}  X, at least n times
X{n,m}  X, at least n but not more than m times
9. Reluctant quantifiers
X??  X, once or not at all
X*?  X, zero or more times
X+?  X, one or more times
X{n}?  X, exactly n times
X{n,}?  X, at least n times
X{n,m}?  X, at least n but not more than m times
10. Possessive quantifiers
X?+  X, once or not at all
X*+  X, zero or more times
X++  X, one or more times
X{n}+  X, exactly n times
X{n,}+  X, at least n times
X{n,m}+  X, at least n but not more than m times
The difference between greedy, reluctant, and possessive quantifiers is as follows. (Note that it only shows up in fuzzy matching, that is, when the quantified expression can match a variable amount of text.)
Greedy quantifiers are called "greedy" because the matcher first reads the entire input string and attempts a match. If that first attempt (on the entire input string) fails, the matcher gives back the last character it consumed and tries again, repeating the process until a match is found or there are no more characters to give back. Depending on the quantifier used in the expression, the last thing it tries to match is 1 or 0 characters.
Reluctant quantifiers take the opposite approach: they start at the beginning of the input and reluctantly consume one character at a time while looking for a match. The last thing they try to match is the entire input string.
Finally, possessive quantifiers always consume the entire input string and try to match once (and only once). Unlike greedy quantifiers, possessive quantifiers never back off.
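The three behaviours can be compared on one input. The sketch below applies a greedy, a reluctant, and a possessive version of the same pattern to the string xfooxxxxxxfoo; QuantifierDemo and firstMatch are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: consumes everything, then backs off until foo fits at the end.
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: grows one character at a time and stops at the first foo.
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: consumes everything and never backs off, so foo cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```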
11. Logical operators
XY  X followed by Y
X|Y  Either X or Y
(X)  X, as a capturing group. For example, (abc) captures abc as a whole.
12. Back references
\n  Whatever the nth capturing group matched
Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))) there are four such groups:
1  ((A)(B(C)))
2  (A)
3  (B(C))
4  (C)
Within the expression, \n refers back to the corresponding group. For example, (ab)34\1 matches ab34ab, and (ab)34(cd)\1\2 matches ab34cdabcd.
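Group numbering and back references can be verified with Matcher.group. A minimal sketch follows; the groups helper is a name invented for this example, and it returns group 0 (the whole match) followed by the numbered groups.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Returns group 0 and every numbered group, or an empty array if the
    // regex does not match the whole input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        System.out.println("ab34ab".matches("(ab)34\\1"));                 // true
        System.out.println("ab34cdabcd".matches("(ab)34(cd)\\1\\2"));      // true
        System.out.println("ab34cd".matches("(ab)34\\1"));                 // false
    }
}
```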
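13. Quotation
\  Nothing, but quotes the following character
\Q  Nothing, but quotes all characters until \E. Everything between \Q and \E is taken literally, so metacharacters lose their special meaning (escapes that the Java compiler itself processes in a string literal, such as "\t", are still converted before the regex engine sees them). For example, the expression ab\Q{|}\\\E
matches the text ab{|}\\
\E  Nothing, but ends the quoting started by \Q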
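The quoting example is hard to read once the Java-level escaping is added, so the sketch below spells it out. It also uses Pattern.quote, a standard java.util.regex method that is not mentioned in the list above, which wraps a string in \Q and \E for you. The class and helper names are illustrative.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Matches "ab" followed by the given text taken literally.
    static boolean matchesAbThenLiteral(String literal, String input) {
        return Pattern.matches("ab" + Pattern.quote(literal), input);
    }

    // The explicit form: the regex is ab, then the quoted text {|} plus two
    // backslashes, then the end-of-quote marker. Every regex backslash is
    // doubled in Java source.
    static boolean explicitQuoteMatches(String input) {
        return Pattern.matches("ab\\Q{|}\\\\\\E", input);
    }

    public static void main(String[] args) {
        // The input is ab{|} followed by two backslashes.
        System.out.println(explicitQuoteMatches("ab{|}\\\\"));   // true
        System.out.println(matchesAbThenLiteral("{|}", "ab{|}")); // true
        System.out.println(matchesAbThenLiteral(".", "ab."));     // true
        System.out.println(matchesAbThenLiteral(".", "abx"));     // false: the dot is literal
    }
}
```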
14. Special constructs (non-capturing)
(?:X)  X, as a non-capturing group
(?idmsux-idmsux)  Nothing, but turns match flags on or off. For example, in the expression (?i)abc(?-i)def, (?i) turns case-insensitive matching on, so abc is matched without regard to case, and (?-i) turns it off again, so def must match exactly.
The idmsux flags are:
- i  CASE_INSENSITIVE: case-insensitive matching for the US-ASCII character set. (?i)
- d  UNIX_LINES: enables UNIX line mode, in which only '\n' is a line terminator. (?d)
- m  MULTILINE: multiline mode. (?m)
- s  DOTALL: . matches any character, including line terminators (\n on UNIX, \r\n on Windows). (?s)
- u  UNICODE_CASE: Unicode-aware case-insensitive matching, when combined with CASE_INSENSITIVE. (?u)
- x  COMMENTS: permits whitespace and comments in the pattern. Whitespace is ignored, and everything from # to the end of the line is treated as a comment. (?x) For example, (?x)abc#asfsdadsa matches the string abc.
(?idmsux-idmsux:X)  X, as a non-capturing group with the given flags on or off. The flag example above can be rewritten as (?i:abc)def, or as (?i)abc(?-i:def).
(?=X)  X, via zero-width positive lookahead. Matching continues only if the subexpression X matches to the right of the current position. For example, \w+(?=\d) matches word characters that are followed by a digit; the digit is checked but not consumed, so it is not part of the match.
(?!X)  X, via zero-width negative lookahead. Matching continues only if the subexpression X does not match to the right of the current position. For example, foo(?!\d) matches foo only when it is not followed by a digit; nothing after foo is consumed.
(?<=X)  X, via zero-width positive lookbehind. Matching continues only if the subexpression X matches to the left of the current position. For example, (?<=19)99 matches 99 only when it is preceded by 19; the 19 is not part of the match.
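(?<!X)  X, via zero-width negative lookbehind. Matching continues only if the subexpression X does not match to the left of the current position.
(?>X)  X, as an independent, non-capturing group (no backtracking into the group)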
The difference between (?:X) and (?>X) is that (?>X) does not backtrack. For example, take the input string abcm.
The expression a(?:b|bc)m matches it, while a(?>b|bc)m does not: once the independent group has matched b, the group is exited and is not re-entered to try bc when the following m fails to match. Giving up this backtracking can speed up matching.
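The special constructs above can be checked with a few one-line tests. This is a sketch under the same assumptions as the text; SpecialConstructDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialConstructDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Inline flags: only abc is matched case-insensitively.
        System.out.println("ABCdef".matches("(?i)abc(?-i)def")); // true
        System.out.println("ABCDEF".matches("(?i)abc(?-i)def")); // false
        System.out.println("ABCdef".matches("(?i:abc)def"));     // true

        // COMMENTS mode: everything from # onward is ignored.
        System.out.println("abc".matches("(?x)abc#asfsdadsa"));  // true

        // Lookahead and lookbehind check context without consuming it.
        System.out.println(firstMatch("\\w+(?=\\d)", "abc1"));   // abc
        System.out.println(firstMatch("foo(?!\\d)", "foo1"));    // null
        System.out.println(firstMatch("(?<=19)99", "1999"));     // 99

        // Independent group: after b matches, bc is never tried.
        System.out.println("abcm".matches("a(?:b|bc)m"));        // true
        System.out.println("abcm".matches("a(?>b|bc)m"));        // false
    }
}
```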

This article is an English version of an article originally written in Chinese on aliyun.com.
