Regular expressions: a recommended summary of the syntax

Source: Internet
Author: User

Summary of regular-expression constructs

Construct   Matches

Characters

x           The character x
\\          The backslash character
\0n         The character with octal value 0n (0 <= n <= 7)
\0nn        The character with octal value 0nn (0 <= n <= 7)
\0mnn       The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh        The character with hexadecimal value 0xhh
\uhhhh      The character with hexadecimal value 0xhhhh
\t          The tab character ('\u0009')
\n          The newline (line feed) character ('\u000A')
\r          The carriage-return character ('\u000D')
\f          The form-feed character ('\u000C')
\a          The alert (bell) character ('\u0007')
\e          The escape character ('\u001B')
\cx         The control character corresponding to x
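
As a minimal sketch of the character escapes listed above, the following Java snippet checks a few of them against single-character inputs. The class name EscapeDemo and the helper matches are illustrative names only; the backslashes are doubled because the patterns are written as Java string literals.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\x41", "A"));     // hexadecimal 0x41 -> true
        System.out.println(matches("\\0101", "A"));    // octal 0101 (decimal 65) -> true
        System.out.println(matches("\\t", "\t"));      // tab -> true
        System.out.println(matches("\\cA", "\u0001")); // control-A -> true
    }
}
```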

Character classes

[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)
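
The union, intersection, and subtraction forms above can be tried directly. This is a small sketch, with CharClassDemo and matches as illustrative names:

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));    // union -> true
        System.out.println(matches("[a-z&&[def]]", "e"));  // intersection -> true
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // b is subtracted -> false
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // q is outside m-p -> true
    }
}
```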

Predefined character classes

.           Any character (may or may not match line terminators)
\d          A digit: [0-9]
\D          A non-digit: [^0-9]
\s          A whitespace character: [ \t\n\x0B\f\r]
\S          A non-whitespace character: [^\s]
\w          A word character: [a-zA-Z_0-9]
\W          A non-word character: [^\w]
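
A short sketch of how the predefined classes are commonly used from Java code. The class and method names (PredefinedDemo, digitsOnly, words, isWordLike) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

class PredefinedDemo {
    // Removes every non-digit (\D) from the input.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static List<String> words(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    // True when the input consists only of word characters (\w).
    static boolean isWordLike(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("tel: 555-1234")); // 5551234
        System.out.println(words("a  b\tc\n"));          // [a, b, c]
        System.out.println(isWordLike("user_42"));       // true
        System.out.println(isWordLike("user-42"));       // false
    }
}
```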

POSIX character classes (US-ASCII only)

\p{Lower}   A lowercase alphabetic character: [a-z]
\p{Upper}   An uppercase alphabetic character: [A-Z]
\p{ASCII}   All ASCII: [\x00-\x7F]
\p{Alpha}   An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}   A decimal digit: [0-9]
\p{Alnum}   An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}   Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}   A visible character: [\p{Alnum}\p{Punct}]
\p{Print}   A printable character: [\p{Graph}\x20]
\p{Blank}   A space or a tab: [ \t]
\p{Cntrl}   A control character: [\x00-\x1F\x7F]
\p{XDigit}  A hexadecimal digit: [0-9a-fA-F]
\p{Space}   A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)

\p{javaLowerCase}    Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}    Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}   Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}     Equivalent to java.lang.Character.isMirrored()
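
The difference between the US-ASCII-only POSIX classes and the java.lang.Character classes shows up with non-ASCII letters. The sketch below assumes the default flags (no UNICODE_CHARACTER_CLASS); PosixDemo and matches are illustrative names.

```java
import java.util.regex.Pattern;

class PosixDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // lowercase e with acute accent, outside US-ASCII

        System.out.println(matches("\\p{XDigit}+", "CafeBabe"));       // true
        System.out.println(matches("\\p{Lower}", eAcute));             // false: ASCII only
        System.out.println(matches("\\p{javaLowerCase}", eAcute));     // true: Character.isLowerCase
        System.out.println("a,b;c".replaceAll("\\p{Punct}", " "));     // a b c
        System.out.println(matches("\\p{Blank}", "\n"));               // false: only space or tab
        System.out.println(matches("\\p{Space}", "\n"));               // true
    }
}
```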

Classes for Unicode blocks and categories

\p{InGreek}          A character in the Greek block (simple block)
\p{Lu}               An uppercase letter (simple category)
\p{Sc}               A currency symbol
\P{InGreek}          Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]   Any letter except an uppercase letter (subtraction)
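
A brief sketch exercising the block and category classes above. UnicodeDemo and matches are illustrative names; the Greek letter alpha is written as a Unicode escape so the example does not depend on source-file encoding.

```java
import java.util.regex.Pattern;

class UnicodeDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha

        System.out.println(matches("\\p{InGreek}", alpha));          // true
        System.out.println(matches("\\P{InGreek}", alpha));          // false (negation)
        System.out.println(matches("\\p{Lu}", "A"));                 // true
        System.out.println(matches("\\p{Sc}", "$"));                 // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));    // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));    // false
    }
}
```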

Boundary matchers

^           The beginning of a line
$           The end of a line
\b          A word boundary
\B          A non-word boundary
\A          The beginning of the input
\G          The end of the previous match
\Z          The end of the input but for the final terminator, if any
\z          The end of the input
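
The following sketch contrasts a few of the boundary matchers above. BoundaryDemo and count are illustrative names; count simply counts how many times find() succeeds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts the non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // \b keeps "cat" inside "concatenate" from matching.
        System.out.println(count("\\bcat\\b", "cat concatenate cat")); // 2

        // \Z tolerates a final line terminator; \z does not.
        System.out.println(count("end\\Z", "end\n")); // 1
        System.out.println(count("end\\z", "end\n")); // 0

        // In MULTILINE mode ^ matches at each line start, \A only at the input start.
        System.out.println(count("(?m)^foo", "foo\nfoo"));   // 2
        System.out.println(count("(?m)\\Afoo", "foo\nfoo")); // 1
    }
}
```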

Greedy quantifiers

X?          X, once or not at all
X*          X, zero or more times
X+          X, one or more times
X{n}        X, exactly n times
X{n,}       X, at least n times
X{n,m}      X, at least n but not more than m times

Reluctant quantifiers

X??         X, once or not at all
X*?         X, zero or more times
X+?         X, one or more times
X{n}?       X, exactly n times
X{n,}?      X, at least n times
X{n,m}?     X, at least n but not more than m times

Possessive quantifiers

X?+         X, once or not at all
X*+         X, zero or more times
X++         X, one or more times
X{n}+       X, exactly n times
X{n,}+      X, at least n times
X{n,m}+     X, at least n but not more than m times
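
The three quantifier families accept the same repetition counts and differ only in how they give characters back. As a sketch (QuantifierDemo and firstMatch are illustrative names), compare .* in its greedy, reluctant, and possessive forms on the same input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: takes everything, then backtracks to the last "foo".
        System.out.println(firstMatch(".*foo", input));  // xfooxxxxxxfoo

        // Reluctant: takes as little as possible, stops at the first "foo".
        System.out.println(firstMatch(".*?foo", input)); // xfoo

        // Possessive: takes everything and never backtracks, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input)); // null
    }
}
```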

Logical operators

XY          X followed by Y
X|Y         Either X or Y
(X)         X, as a capturing group

Back references

\n          Whatever the nth capturing group matched
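
A common use of a back reference is spotting a word that is immediately repeated. The sketch below uses hypothetical names (BackrefDemo, firstRepeatedWord); \1 refers to whatever the first capturing group matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // A word, some whitespace, then the same word again (\1).
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first immediately repeated word, or null if there is none.
    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // is
        System.out.println(firstRepeatedWord("no repeats here"));   // null
    }
}
```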

Quotation

\           Nothing, but quotes the following character
\Q          Nothing, but quotes all characters until \E
\E          Nothing, but ends quoting started by \Q
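
Quotation is useful when text that contains metacharacters must be matched literally. A minimal sketch, with QuoteDemo and matches as illustrative names; Pattern.quote is the standard library method that produces a quoted pattern for a literal string.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String literal = "1+1=2";

        // Unquoted, + is a quantifier, so the text does not match itself.
        System.out.println(matches(literal, literal));                // false

        // Quoted with \Q...\E, every character is taken literally.
        System.out.println(matches("\\Q1+1=2\\E", literal));          // true

        // Pattern.quote builds the quoted form for an arbitrary string.
        System.out.println(matches(Pattern.quote(literal), literal)); // true
    }
}
```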

Special constructs (non-capturing)

(?:X)                 X, as a non-capturing group
(?idmsux-idmsux)      Nothing, but turns match flags i d m s u x on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags i d m s u x on - off
(?=X)                 X, via zero-width positive lookahead
(?!X)                 X, via zero-width negative lookahead
(?<=X)                X, via zero-width positive lookbehind
(?<!X)                X, via zero-width negative lookbehind
(?>X)                 X, as an independent, non-capturing group
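
The special constructs are easiest to understand from small cases. This sketch (SpecialDemo and first are illustrative names) shows lookahead, lookbehind, inline flags, and an independent group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialDemo {
    // Returns the text of the first match, or null when nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lookahead and lookbehind check context without consuming it.
        System.out.println(first("\\d+(?= USD)", "price: 100 USD"));     // 100
        System.out.println(first("(?<=\\$)\\d+", "cost $42"));           // 42
        System.out.println(first("foo(?!bar)\\w*", "foobar foobaz"));    // foobaz

        // Inline flags: for the rest of the pattern, or for one group only.
        System.out.println(Pattern.matches("(?i)hello", "HeLLo"));       // true
        System.out.println(Pattern.matches("(?i:h)ello", "HELLO"));      // false

        // An independent group never gives back what it has matched.
        System.out.println(Pattern.matches("a+ab", "aaab"));             // true
        System.out.println(Pattern.matches("(?>a+)ab", "aaab"));         // false
    }
}
```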

-------------------------------------------------

Backslash, escape, and reference

The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.

It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.

Backslashes within string literals in Java source code are interpreted, as required by the Java Language Specification, as either Unicode escapes or other character escapes. It is therefore necessary to double the backslashes in string literals that represent regular expressions, to protect them from interpretation by the Java compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
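
The doubling rule can be checked with a few string literals. In this sketch (BackslashDemo is an illustrative name) each comment gives the regular expression that the regex engine actually receives.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "\b" is a one-character string holding a backspace; as a regex it matches a backspace.
        System.out.println(matches("\b", "\b"));                              // true

        // "\\b" reaches the engine as \b, a word boundary.
        System.out.println(Pattern.compile("\\bcat\\b").matcher("a cat sat").find()); // true

        // "\\(hello\\)" reaches the engine as \(hello\) and matches literal parentheses.
        System.out.println(matches("\\(hello\\)", "(hello)"));                // true

        // "\\\\" reaches the engine as \\ and matches one backslash.
        System.out.println(matches("\\\\", "\\"));                            // true
    }
}
```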

Character classes

Character classes may appear within other character classes, and may be composed with the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to lowest:

1 Literal escape   \x
2 Grouping         [...]
3 Range            a-z
4 Union            [a-e][i-u]
5 Intersection     [a-z&&[aeiou]]

Note that a different set of metacharacters is in effect inside a character class than outside it. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
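
The change in metacharacter meaning inside a class can be seen in a short sketch (MetaInClassDemo and matches are illustrative names):

```java
import java.util.regex.Pattern;

class MetaInClassDemo {
    // True when the whole input is matched by the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Outside a class . matches any character; inside it is a literal dot.
        System.out.println(matches(".", "a"));       // true
        System.out.println(matches("[.]", "a"));     // false
        System.out.println(matches("[.]", "."));     // true

        // Inside a class - forms a range unless it is escaped.
        System.out.println(matches("[a-c]", "b"));   // true
        System.out.println(matches("[a\\-c]", "b")); // false
        System.out.println(matches("[a\\-c]", "-")); // true
    }
}
```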

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, then ^ matches at the beginning of input and after any line terminator except at the end of input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
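
The effect of the DOTALL, MULTILINE, and UNIX_LINES flags described above can be sketched as follows. LineDemo, count, and wholeMatch are illustrative names; the flag constants are those of java.util.regex.Pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineDemo {
    // Counts the matches of regex in input under the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True when the whole input matches regex under the given flags.
    static boolean wholeMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // . skips line terminators unless DOTALL is set.
        System.out.println(wholeMatch(".", 0, "\n"));              // false
        System.out.println(wholeMatch(".", Pattern.DOTALL, "\n")); // true

        // ^ and $ match per line only in MULTILINE mode.
        String text = "one\ntwo\nthree";
        System.out.println(count("^\\w+$", 0, text));                 // 0
        System.out.println(count("^\\w+$", Pattern.MULTILINE, text)); // 3

        // With UNIX_LINES, a carriage return is no longer a line terminator.
        System.out.println(wholeMatch(".", 0, "\r"));                  // false
        System.out.println(wholeMatch(".", Pattern.UNIX_LINES, "\r")); // true
    }
}
```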

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)

Group zero always stands for the entire expression.
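
The numbering can be confirmed by asking a Matcher for each group in turn. A minimal sketch, with GroupNumberDemo and groups as illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns group 0 through groupCount() for a full match, or an empty list if none.
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Index 0 is the whole match; indexes 1 to 4 follow the opening parentheses.
        System.out.println(groups("((A)(B(C)))", "ABC")); // [ABC, ABC, A, BC, C]
    }
}
```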

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, then its previously captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
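
The "aba" case can be reproduced directly. In this sketch (RepeatedGroupDemo and capture are illustrative names), group one holds the most recent iteration while group two keeps the value from the earlier one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Returns the given group after a full match of (a(b)?)+, or null if there is no match.
    static String capture(String input, int group) {
        Matcher m = Pattern.compile("(a(b)?)+").matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Second iteration matches "a" alone, so group 1 is "a".
        System.out.println(capture("aba", 1)); // a

        // (b)? fails on the second iteration, so group 2 keeps "b" from the first.
        System.out.println(capture("aba", 2)); // b
    }
}
```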

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
