Syntax rules for Java regular expressions

Last Update:2015-11-30 Source: Internet

Author: User

Tags: alphabetic character, character classes


Regular expressions are a powerful and flexible text-processing tool. With them you describe complex text patterns programmatically and search an input string for those patterns. Once you have found the parts that match, you can handle them however you wish. Regular expressions provide a fully general way to solve string-processing problems: matching, selecting, editing, and validating.

First, look at the set of constructs that Java regular expressions support; the full list is in the API documentation for java.util.regex.Pattern.

Characters
x	The character x
\\	The backslash character
\0n	The character with octal value 0n (0 <= n <= 7)
\0nn	The character with octal value 0nn (0 <= n <= 7)
\0mnn	The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh	The character with hexadecimal value 0xhh
\uhhhh	The character with hexadecimal value 0xhhhh
\t	The tab character ('\u0009')
\n	The newline (line feed) character ('\u000A')
\r	The carriage-return character ('\u000D')
\f	The form-feed character ('\u000C')
\a	The alert (bell) character ('\u0007')
\e	The escape character ('\u001B')
\cx	The control character corresponding to x

The table above lists the character constructs. For example, the regular expression a matches the character a, and the regular expression for a backslash is \\. To match a single literal backslash from Java source code, the string literal is therefore "\\\\". The backslash character ('\') introduces the escaped constructs defined in the table above, and it also quotes characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.

It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these combinations are reserved for future extensions to the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.

According to the Java Language Specification, backslashes in string literals in Java source code are interpreted as Unicode escapes or other character escapes. You must therefore double the backslashes in a string literal that represents a regular expression, to protect them from interpretation by the Java compiler. For example, the string literal "\b" matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), use the string literal "\\(hello\\)".
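The doubling rule is easier to see in running code. The following is a minimal sketch, not taken from the original article; the class and method names (BackslashEscapeDemo, matchesSingleBackslash, and so on) are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows how Java string literals map to regex text.
class BackslashEscapeDemo {
    // The literal "\\\\" is the two-character regex \\ , which matches one backslash.
    static boolean matchesSingleBackslash(String input) {
        return Pattern.matches("\\\\", input);
    }

    // The literal "\\(hello\\)" is the regex \(hello\), which matches the text (hello).
    static boolean matchesParenthesizedHello(String input) {
        return Pattern.matches("\\(hello\\)", input);
    }

    // The literal "\\bcat\\b" is the regex \bcat\b: "cat" delimited by word boundaries.
    static boolean containsWordCat(String input) {
        return Pattern.compile("\\bcat\\b").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesSingleBackslash("\\"));          // true
        System.out.println(matchesParenthesizedHello("(hello)"));  // true
        System.out.println(containsWordCat("a cat sat"));          // true
        System.out.println(containsWordCat("concatenate"));        // false
    }
}
```

Note that the input string "\\" in main is itself a Java literal for a single backslash character.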

The following table lists the character-class constructs, whose meaning is fairly self-explanatory.

Character classes
[abc]	a, b, or c (simple class)
[^abc]	Any character except a, b, or c (negation)
[a-zA-Z]	a through z or A through Z, inclusive (range)
[a-d[m-p]]	a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]	d, e, or f (intersection)
[a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z] (subtraction)
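The union, intersection, and subtraction forms in the table can be checked directly. This is a small sketch written for this purpose; the class name CharacterClassDemo and its helper method are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: tests single characters against character classes.
class CharacterClassDemo {
    // True when the whole input matches the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));     // true  (union)
        System.out.println(matches("[a-d[m-p]]", "e"));     // false
        System.out.println(matches("[a-z&&[def]]", "e"));   // true  (intersection)
        System.out.println(matches("[a-z&&[def]]", "a"));   // false
        System.out.println(matches("[a-z&&[^bc]]", "b"));   // false (subtraction)
        System.out.println(matches("[a-z&&[^m-p]]", "q"));  // true
    }
}
```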

Java also predefines some character classes that can be used directly in regular expressions as a convenient shorthand.

Predefined character classes
.	Any character (may or may not match line terminators)
\d	A digit: [0-9]
\D	A non-digit: [^0-9]
\s	A whitespace character: [ \t\n\x0B\f\r]
\S	A non-whitespace character: [^\s]
\w	A word character: [a-zA-Z_0-9]
\W	A non-word character: [^\w]
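As a quick illustration of the predefined classes, the sketch below uses them with String.replaceAll and String.split. The class PredefinedClassDemo and its method names are hypothetical, introduced only for this example.

```java
import java.util.Arrays;

// Hypothetical demo class: uses \D, \s and \W through the String convenience methods.
class PredefinedClassDemo {
    // Removes every non-digit (\D), keeping only the characters 0-9.
    static String digitsOnly(String input) {
        return input.replaceAll("\\D", "");
    }

    // Splits on runs of one or more whitespace characters (\s+).
    static String[] splitOnWhitespace(String input) {
        return input.split("\\s+");
    }

    // Removes every non-word character (\W), keeping letters, digits and underscore.
    static String wordCharsOnly(String input) {
        return input.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Order 66, item 7"));                // 667
        System.out.println(Arrays.toString(splitOnWhitespace("a b\tc")));  // [a, b, c]
        System.out.println(wordCharsOnly("hello, world_1!"));              // helloworld_1
    }
}
```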

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

· A newline (line feed) character ('\n'),

· A carriage-return character followed immediately by a newline character ("\r\n"),

· A standalone carriage-return character ('\r'),

· A next-line character ('\u0085'),

· A line-separator character ('\u2028'), or

· A paragraph-separator character ('\u2029').

If UNIX_LINES mode is activated, the newline character is the only line terminator recognized.

The regular expression . (dot) matches any character except a line terminator, unless the DOTALL flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator, except at the end of the input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence. UNIX_LINES, DOTALL, and MULTILINE are constants defined in class Pattern; the mode is specified through the method Pattern.compile(String regex, int flags).
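The effect of the three flags can be observed with a few short checks. The following is an illustrative sketch; the class MatchFlagDemo and its helpers are hypothetical, while Pattern.compile(String, int) and the flag constants are the ones named in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares matching with and without compile flags.
class MatchFlagDemo {
    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Counts successive non-overlapping matches found in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Dot does not match '\n' unless DOTALL is set.
        System.out.println(matches("a.b", 0, "a\nb"));               // false
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));  // true

        // With UNIX_LINES only '\n' is a line terminator, so dot matches '\r'.
        System.out.println(matches("a.b", 0, "a\rb"));                   // false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb"));  // true

        // ^ and $ match at every line only in MULTILINE mode.
        System.out.println(countMatches("^\\w+$", 0, "one\ntwo\nthree"));                  // 0
        System.out.println(countMatches("^\\w+$", Pattern.MULTILINE, "one\ntwo\nthree"));  // 3
    }
}
```

Several flags can be combined with the bitwise OR operator, for example Pattern.DOTALL | Pattern.MULTILINE.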

Boundary matchers
^	The beginning of a line
$	The end of a line
\b	A word boundary
\B	A non-word boundary
\A	The beginning of the input
\G	The end of the previous match
\Z	The end of the input, except for the final line terminator (if any)
\z	The end of the input
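The differences between these matchers, particularly \Z versus \z and ^ versus \A, are subtle, so a few concrete checks may help. This is a sketch only; BoundaryMatcherDemo and its helpers are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: exercises the boundary matchers from the table.
class BoundaryMatcherDemo {
    // Counts successive non-overlapping matches found in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // \b requires a word boundary, \B requires its absence.
        System.out.println(countMatches("\\bcat\\b", 0, "cat concat"));  // 1
        System.out.println(countMatches("\\Bcat", 0, "cat concat"));     // 1

        // In MULTILINE mode ^ matches at each line start; \A only at the input start.
        System.out.println(countMatches("^\\w+", Pattern.MULTILINE, "one\ntwo"));    // 2
        System.out.println(countMatches("\\A\\w+", Pattern.MULTILINE, "one\ntwo"));  // 1

        // \Z tolerates a final line terminator, \z does not.
        System.out.println(found("end\\Z", "the end\n"));  // true
        System.out.println(found("end\\z", "the end\n"));  // false

        // \G anchors each match to the end of the previous one.
        System.out.println(countMatches("\\G\\d", 0, "123a45"));  // 3
    }
}
```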

Quantifiers describe how a pattern absorbs input text. There are three kinds: greedy, reluctant, and possessive. A greedy quantifier matches as much of the input as it can while still allowing the overall pattern to match. A reluctant quantifier, written with a trailing question mark, matches the minimum number of characters needed to satisfy the pattern. A possessive quantifier, written with a trailing plus sign, is not supported by every regular-expression engine. When a regular expression is applied to a string, the engine normally keeps a large amount of state so that it can backtrack if a match fails; a possessive quantifier keeps none of that state, so it never gives characters back. Quantifiers are greedy unless another kind is specified.

Greedy	Reluctant	Possessive	Meaning
X?	X??	X?+	X, once or not at all
X*	X*?	X*+	X, zero or more times
X+	X+?	X++	X, one or more times
X{n}	X{n}?	X{n}+	X, exactly n times
X{n,}	X{n,}?	X{n,}+	X, at least n times
X{n,m}	X{n,m}?	X{n,m}+	X, at least n but not more than m times
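The three kinds of quantifier behave differently on the same input. The sketch below applies .*, .*? and .*+ followed by foo to one string; the class QuantifierDemo and the method firstMatch are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: contrasts greedy, reluctant and possessive quantifiers.
class QuantifierDemo {
    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: .* takes everything, then backs off just enough for "foo".
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo

        // Reluctant: .*? takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo

        // Possessive: .*+ takes everything and never gives any back, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```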

Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))) there are four such groups:

1	((A)(B(C)))
2	(A)
3	(B(C))
4	(C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input that matches such a group is saved. The captured subsequence can be used later in the expression via a back reference (\1, \2, and so on), or retrieved from the matcher once the match operation is complete. The captured input associated with a group is always the subsequence that the group most recently matched.

A group that begins with (? is a pure, non-capturing group: it does not capture text and does not count toward the group total. The exception, since Java 7, is the named capturing group (?<name>X), which does capture.
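Group numbering can be verified by asking a Matcher for each group in turn. This is a minimal sketch; the class CapturingGroupDemo and its groups method are hypothetical, while groupCount() and group(int) are methods of java.util.regex.Matcher.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: lists the text captured by every group.
class CapturingGroupDemo {
    // Returns groups 0..groupCount() when the whole input matches, else an empty array.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Group 0 is the whole match; groups 1-4 follow the opening parentheses.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));  // [ABC, ABC, A, BC, C]

        // (?:A) is non-capturing, so (B) becomes group 1.
        System.out.println(Arrays.toString(groups("(?:A)(B)", "AB")));      // [AB, B]
    }
}
```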

Logical operators
XY	X followed by Y
X|Y	Either X or Y
(X)	X, as a capturing group; the nth capturing group can be referred to later in the expression with the back reference \n
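Alternation and back references combine naturally in small text-cleanup tasks. The following sketch is one possible illustration; the class LogicalOperatorDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows alternation (X|Y) and a back reference (\1).
class LogicalOperatorDemo {
    // True when the whole input is either "cat" or "dog".
    static boolean isCatOrDog(String input) {
        return Pattern.matches("cat|dog", input);
    }

    // Collapses an immediately repeated word ("is is") into a single occurrence.
    // \1 in the regex refers back to group 1; $1 in the replacement inserts it.
    static String collapseRepeatedWords(String input) {
        return input.replaceAll("\\b(\\w+) \\1\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(isCatOrDog("dog"));                           // true
        System.out.println(isCatOrDog("bird"));                          // false
        System.out.println(collapseRepeatedWords("this is is a test"));  // this is a test
    }
}
```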


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
