Regular Expressions

Source: Internet
Author: User

1. Period

Assume that you are playing an English spelling game and want to find three-letter words that start with "t" and end with "n". To construct this regular expression, you can use a wildcard, the period ".". The complete expression is "t.n". It matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any single character, including spaces and tabs. In Java it does not match line terminators by default; that requires the DOTALL mode described later.
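As a minimal sketch of the wildcard in Java (the class and method names are illustrative, not from the original article), the following checks a few candidates against "t.n":

```java
import java.util.regex.Pattern;

final class DotDemo {
    // True when the whole candidate string matches the pattern t.n
    static boolean matchesTDotN(String candidate) {
        return Pattern.matches("t.n", candidate);
    }

    public static void main(String[] args) {
        String[] candidates = {"tan", "t#n", "t n", "t\tn", "t\nn", "toon"};
        for (String s : candidates) {
            // Make whitespace visible in the printed output
            String shown = s.replace("\t", "\\t").replace("\n", "\\n");
            System.out.println(shown + " -> " + matchesTDotN(s));
        }
    }
}
```

With Java's default settings, the candidate containing a newline is rejected, while the ones containing a space or a tab are accepted.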


2. Square brackets

To solve the problem that the period matches too much, you can list the meaningful characters in square brackets. Only the characters listed in the brackets take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". "toon" does not match, because a bracket expression matches exactly one character.

Square brackets can also use "-" to express a range. For example, "[a-g]" matches the lowercase letters a through g, and "[1-9]" matches the digits 1 through 9.

In Java, you can use "&&" to take an intersection. For example, "[1-9&&[^456]]" matches the digits 1 through 9 except 4, 5, and 6.
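The bracket expressions above can be tried out with a short sketch like the one below. The class and method names are made up for illustration, and the intersection operator "&&" is specific to Java's regex syntax:

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // One of the listed vowels between t and n
    static boolean vowelWord(String s) {
        return Pattern.matches("t[aeio]n", s);
    }

    // A single digit from 1 to 9, excluding 4, 5, and 6 (class intersection)
    static boolean digitExcept456(String s) {
        return Pattern.matches("[1-9&&[^456]]", s);
    }

    public static void main(String[] args) {
        System.out.println("tin  -> " + vowelWord("tin"));
        System.out.println("toon -> " + vowelWord("toon"));
        System.out.println("3    -> " + digitExcept456("3"));
        System.out.println("5    -> " + digitExcept456("5"));
    }
}
```

Matching is case-sensitive by default, so "Ten" is not accepted by "t[aeio]n".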


3. "or" symbol

If you want to match "toon" in addition to all the words matched above, you can use the "|" operator, which means "or". To match "toon" as well, use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here because they match only a single character; parentheses "()" must be used instead. Parentheses are also used for grouping.
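A small sketch of the alternation, assuming the expression is meant to be "t(a|e|i|o|oo)n" (the class name is illustrative):

```java
import java.util.regex.Pattern;

final class AlternationDemo {
    private static final Pattern WORD = Pattern.compile("t(a|e|i|o|oo)n");

    // True when the whole string is t, one of the alternatives, then n
    static boolean matches(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tan", "toon", "tun", "tooon"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```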


4. Symbols indicating the number of matches

?  zero or one time

+  one or more times, that is, at least once

*  zero or more times

{n}  exactly n times

{n,m}  n to m times

Suppose we want to search a text file for American Social Security numbers. The number has the format 999-99-9999, and an expression that matches it is "[0-9]{3}-[0-9]{2}-[0-9]{4}". A hyphen has a special meaning only inside square brackets, where it denotes a range such as 0 to 9. Outside square brackets it is an ordinary character, so the hyphens in the number may be escaped as "\-" but do not have to be.

Suppose you want the hyphens to be optional when searching, so that both 999-99-9999 and 999999999 count as correctly formatted. In this case, add the "?" quantifier after each hyphen: "[0-9]{3}-?[0-9]{2}-?[0-9]{4}".
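The quantifiers can be combined as in the following sketch of a Social Security number check. The exact expression is an assumption, since the original figure is not available; the hyphens are written unescaped, which Java accepts outside square brackets:

```java
import java.util.regex.Pattern;

final class SsnDemo {
    // Three digits, optional hyphen, two digits, optional hyphen, four digits
    private static final Pattern SSN =
            Pattern.compile("[0-9]{3}-?[0-9]{2}-?[0-9]{4}");

    static boolean looksLikeSsn(String s) {
        return SSN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"999-99-9999", "999999999", "99-999-9999"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeSsn(s));
        }
    }
}
```

Because each hyphen is optional on its own, a mixed form such as "999-999999" is also accepted by this expression.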



5. "Not" symbol

The "^" symbol is called the "not" symbol. When it is the first character inside square brackets, "^" negates the set: the listed characters are the ones you do not want to match. For example, an expression such as "[^x][a-z]+" matches words made of lowercase letters, except words that start with the letter "x".
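A negated bracket expression might be used as in the sketch below. The pattern "[^x][a-z]+" is an assumed example (the expression in the original figure is not available), and the class name is illustrative:

```java
import java.util.regex.Pattern;

final class NegationDemo {
    // First character: anything except x; then one or more lowercase letters
    private static final Pattern NOT_X_WORD = Pattern.compile("[^x][a-z]+");

    static boolean accepted(String word) {
        return NOT_X_WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"ray", "xray", "ax"}) {
            System.out.println(w + " -> " + accepted(w));
        }
    }
}
```

Only the first position is negated here, so "ax" is accepted even though it contains an "x".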



6. Parentheses

(1) Define the scope of a quantifier

Example

(cat)? matches zero or one "cat"

(cat)+ matches one or more "cat"

(2) Limit the scope of an alternation

Example

(cat|dog) matches "cat" or "dog"

 

(3) Capturing groups

The text matched by the pattern between parentheses is captured. When the pattern contains nested parentheses, the groups are numbered in the order in which their opening parentheses appear, from left to right.

Example

([A-Za-z](\d{2}))((-)\d{2}) matches "A22-33" as follows:

The expression in the first pair of parentheses is [A-Za-z](\d{2})

Group 1: A22

The expression in the second pair of parentheses is \d{2}

Group 2: 22

The expression in the third pair of parentheses is (-)\d{2}

Group 3: -33

The expression in the fourth pair of parentheses is -

Group 4: -
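The group numbering can be checked with a sketch like this one, which assumes the expression is "([A-Za-z](\d{2}))((-)\d{2})". The class and method names are illustrative; in Java, group 0 is always the whole match:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaptureGroupDemo {
    private static final Pattern CODE =
            Pattern.compile("([A-Za-z](\\d{2}))((-)\\d{2})");

    // Returns group 0 (whole match) followed by groups 1..n, or an empty array
    static String[] groups(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(groups("A22-33")));
    }
}
```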

 

(4) Non-capturing groups

When parentheses are combined with "?:", they form a non-capturing group: the content of the parentheses is grouped but not captured. When the content does not need to be captured, non-capturing parentheses can improve matching efficiency.

Example

(\w(?:\d{2}))((?:-)\d{2}) matches "A22-33" as follows:

Group 1: A22

Group 2: -33

Note: the "22" matched by (?:\d{2}) is not captured, and the "-" matched by (?:-) is not captured.
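To see the effect on group numbering, the following sketch compares a capturing and a non-capturing version of the expression. The helper name is made up for illustration, and the expressions are reconstructions of the ones in the text:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingDemo {
    // Returns group 0 followed by all capturing groups, or an empty array
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String capturing = "(\\w(\\d{2}))((-)\\d{2})";
        String nonCapturing = "(\\w(?:\\d{2}))((?:-)\\d{2})";
        System.out.println(Arrays.toString(groups(capturing, "A22-33")));
        System.out.println(Arrays.toString(groups(nonCapturing, "A22-33")));
    }
}
```

The non-capturing version reports only two groups besides the whole match, because the "(?:...)" parts are not numbered.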

 

(5) Backreferences to captured text

Example

In ([ab])\1, the "[ab]" in parentheses can match "a" or "b", and the "\1" that follows refers back to the text that the first group actually matched. If the capturing group matched "a", the backreference can match only "a"; likewise, if the group matched "b", the backreference can match only "b". Because of the backreference "\1", the two characters must be identical, so only "aa" or "bb" matches.
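A brief sketch of the backreference, assuming the lowercase form "([ab])\1" (the class name is illustrative):

```java
import java.util.regex.Pattern;

final class BackreferenceDemo {
    // \1 must repeat exactly what group 1 matched
    private static final Pattern DOUBLED = Pattern.compile("([ab])\\1");

    static boolean isDoubled(String s) {
        return DOUBLED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aa", "bb", "ab", "ba"}) {
            System.out.println(s + " -> " + isDoubled(s));
        }
    }
}
```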

 

(6) Lookahead

When parentheses are combined with "?=", they form a lookahead: the expression inside is used to validate the match, but the text it matches does not appear in the match result.

Example

(John)(?= Resig):

John: no match, because it is not followed by " Resig"

John Backus: no match, because it is followed by something other than " Resig"

John Resig: match, because "John" is followed by " Resig". However, the match result is "John", not "John Resig".
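The lookahead behavior can be sketched as follows. The pattern "John(?= Resig)", with the space placed inside the lookahead, is an assumption about the intended expression, and the names are illustrative:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Matches "John" only when " Resig" follows; the lookahead consumes nothing
    private static final Pattern JOHN = Pattern.compile("John(?= Resig)");

    // Returns the matched text, or null when there is no match
    static String firstMatch(String input) {
        Matcher m = JOHN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"John", "John Backus", "John Resig"}) {
            System.out.println(s + " -> " + firstMatch(s));
        }
    }
}
```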


7. Predefined character classes

\d  digit: [0-9]

\D  non-digit: [^0-9]

\s  whitespace character: [ \t\n\x0B\f\r]

\S  non-whitespace character: [^\s]

\w  word character: [a-zA-Z_0-9]

\W  non-word character: [^\w]
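As a rough illustration of the predefined classes, the sketch below uses them through the standard String methods replaceAll, split, and matches. The helper names are hypothetical:

```java
import java.util.Arrays;

final class PredefinedClassDemo {
    // Remove every non-digit character (\D)
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Split on runs of whitespace (\s+)
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // True when the string consists only of word characters (\w)
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("tel: 555-0199"));
        System.out.println(Arrays.toString(words("a  b\tc")));
        System.out.println(isWord("user_42") + " " + isWord("user-42"));
    }
}
```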


8. Boundary matchers

^  beginning of a line

$  end of a line

\b  word boundary

\B  non-word boundary

\A  beginning of the input

\G  end of the previous match

\Z  end of the input, except for the final line terminator (if any)

\z  end of the input
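Two of the boundary matchers are illustrated in the sketch below: "\b" for whole-word search, and the difference between "\Z" and "\z" when the input ends with a line terminator. The class and method names are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Counts occurrences of word that are delimited by word boundaries
    static int countWholeWord(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Digits up to the very end of the input (\z)
    static boolean digitsToAbsoluteEnd(String s) {
        return Pattern.compile("\\A\\d+\\z").matcher(s).find();
    }

    // Digits up to the end, ignoring one final line terminator (\Z)
    static boolean digitsToEnd(String s) {
        return Pattern.compile("\\A\\d+\\Z").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("cat concat category cat", "cat"));
        System.out.println(digitsToAbsoluteEnd("123\n") + " " + digitsToEnd("123\n"));
    }
}
```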


9. Matching modes

Quantifiers have three matching modes: greedy, reluctant, and possessive.

Greedy    Reluctant   Possessive   Meaning
X?        X??         X?+          X, zero times or once
X*        X*?         X*+          X, zero or more times
X+        X+?         X++          X, one or more times
X{n}      X{n}?       X{n}+        X, exactly n times
X{n,}     X{n,}?      X{n,}+       X, at least n times
X{n,m}    X{n,m}?     X{n,m}+      X, at least n but not more than m times

Example

Assume that the string to be analyzed is: xfooxxxxxxfoo

Pattern: .*foo (greedy mode):

The pattern consists of the subpattern p1 (.*) and the subpattern p2 (foo). The quantifier in p1 uses the default, greedy mode. When matching starts, p1 consumes all the characters of xfooxxxxxxfoo. That succeeds, but nothing is left to match p2, so this round fails. In the second round, p1 gives back its last character, splitting the string into s1 = xfooxxxxxxfo and s2 = o. s1 matches p1, but s2 does not match p2, so this round fails too. In the third round, p1 gives back another character, splitting the string into xfooxxxxxxf and oo, with the same result. In the fourth round, p1 gives back one more character, splitting the string into xfooxxxxxx and foo. This time s1 and s2 match p1 and p2 respectively, so the attempt stops and the match succeeds.

Pattern: .*?foo (reluctant mode): this is the minimal-matching method. In the first attempt, p1 matches nothing, because it allows zero repetitions, and p2 fails to match at the start of the string. In the second attempt, p1 consumes the first character, x, and the next three characters of the remaining string fooxxxxxxfoo match p2, so the attempt stops and the match "xfoo" succeeds. In this mode, if you continue searching the rest of the string, you find a second match, "xxxxxxfoo", which ends at the end of the string. In greedy mode the first match already covers the whole string, so there is no second match.

Pattern: .*+foo (possessive mode): when matching starts, p1 consumes the whole string. That succeeds, but nothing is left to match p2, and p1 never gives characters back, so the match fails.

To put it simply, the difference between greedy mode and possessive mode is this: a greedy quantifier gives back characters it has matched, from more to fewer, so that the rest of the pattern can match; a possessive quantifier keeps everything it has matched and never gives anything back.

Possessive quantifiers are not supported by every regular expression engine. Java supports them, as do some others such as PCRE, and in practice they are used less often than the other two modes.
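The three modes can be compared on the example string with a sketch like the following, which collects every match that repeated find() calls return. The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierModeDemo {
    // Collects the text of every successive match of regex in input
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy:     " + allMatches(".*foo", input));
        System.out.println("reluctant:  " + allMatches(".*?foo", input));
        System.out.println("possessive: " + allMatches(".*+foo", input));
    }
}
```

The greedy pattern yields one match covering the whole string, the reluctant pattern yields two shorter matches, and the possessive pattern yields none.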


Java Regular Expression API

In the JDK, the classes related to regular expressions are in the java.util.regex package. The two main classes are Pattern and Matcher.

Pattern class

Method summary:

static Pattern compile(String regex)

Compiles the given regular expression into a pattern.

static Pattern compile(String regex, int flags)

Compiles the given regular expression into a pattern with the given flags.

Matcher matcher(CharSequence input)

Creates a matcher that will match the given input against this pattern. CharSequence is an interface; String, StringBuffer, and StringBuilder all implement it.

static boolean matches(String regex, CharSequence input)

Compiles the given regular expression and attempts to match the given input against it.

String[] split(CharSequence input)

Splits the given input sequence around matches of this pattern.

String[] split(CharSequence input, int limit)

Splits the given input sequence around matches of this pattern; the limit controls how many times the pattern is applied and therefore the length of the resulting array.
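A short sketch of these Pattern methods follows. The sample patterns and helper names are chosen for illustration only:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class PatternApiDemo {
    // Split on commas, ignoring whitespace around each comma
    static String[] splitOnCommas(String csv) {
        return Pattern.compile("\\s*,\\s*").split(csv);
    }

    // Split on commas, producing at most limit pieces
    static String[] splitLimited(String csv, int limit) {
        return Pattern.compile(",").split(csv, limit);
    }

    // One-off whole-input match without keeping the compiled pattern
    static boolean isNumber(String s) {
        return Pattern.matches("\\d+", s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("a , b,c")));
        System.out.println(Arrays.toString(splitLimited("a,b,c,d", 2)));
        System.out.println(isNumber("2024") + " " + isNumber("20x4"));
    }
}
```

When the same expression is used many times, compiling it once with Pattern.compile and reusing the Pattern avoids recompiling it on every call, which the static matches method does.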

 

The Pattern class defines an alternative compile method that accepts a set of flags affecting how the pattern is matched. The flags parameter is a bit mask that may combine any of the following public static fields:

Pattern.CANON_EQ

Enables canonical equivalence. When this flag is specified, two characters are considered to match if, and only if, their full canonical decompositions match. For example, the expression "a\u030A" matches the string "\u00E5" (the character å) when this flag is specified. By default, matching does not take canonical equivalence into account. Specifying this flag may impose a performance penalty.

Pattern.CASE_INSENSITIVE

Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag together with this flag. Case-insensitive matching can also be enabled with the embedded flag expression (?i). Specifying this flag may impose a slight performance penalty.

Pattern.COMMENTS

Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of the line. Comments mode can also be enabled with the embedded flag expression (?x).

Pattern.DOTALL

Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default, this expression does not match line terminators. Dotall mode can also be enabled with the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)

Pattern.LITERAL

Enables literal parsing of the pattern. When this flag is specified, the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters and escape sequences in the input sequence have no special meaning. The CASE_INSENSITIVE and UNICODE_CASE flags retain their effect on matching when used together with this flag; the other flags become superfluous. There is no embedded flag expression for enabling literal parsing.

Pattern.MULTILINE

Enables multiline mode. In multiline mode, the expressions ^ and $ match just after and just before, respectively, a line terminator or the end of the input sequence. By default, these expressions match only at the beginning and the end of the entire input sequence. Multiline mode can also be enabled with the embedded flag expression (?m).

Pattern.UNICODE_CASE

Enables Unicode-aware case folding. When this flag is specified together with the CASE_INSENSITIVE flag, case-insensitive matching is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case folding can also be enabled with the embedded flag expression (?u). Specifying this flag may impose a performance penalty.

Pattern.UNIX_LINES

Enables Unix lines mode. In this mode, only the "\n" line terminator is recognized in the behavior of ., ^, and $. Unix lines mode can also be enabled with the embedded flag expression (?d).
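The effect of a few of these flags can be observed with the sketch below. The helper names and sample inputs are illustrative; passing 0 as the flags value means that no flag is set, and several flags are combined with the bitwise OR operator:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlagDemo {
    // Whole-input match with the given flags
    static boolean matchesWith(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Number of successive matches found with the given flags
    static int countWith(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String lines = "one\nTWO\nthree";
        System.out.println(matchesWith("hello", Pattern.CASE_INSENSITIVE, "HeLLo"));
        System.out.println(matchesWith("(?i)hello", 0, "HELLO"));
        System.out.println(matchesWith("a.b", 0, "a\nb"));
        System.out.println(matchesWith("a.b", Pattern.DOTALL, "a\nb"));
        System.out.println(countWith("^\\w+$", 0, lines));
        System.out.println(countWith("^\\w+$", Pattern.MULTILINE, lines));
        System.out.println(countWith("^two$",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, lines));
    }
}
```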

Matcher class

Method summary:

int end()

Returns the offset after the last character matched.

int end(int group)

Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.

boolean find()

Attempts to find the next subsequence of the input sequence that matches the pattern.

boolean find(int start)

Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.

String group()

Returns the input subsequence matched by the previous match.

String group(int group)

Returns the input subsequence captured by the given group during the previous match operation.

int groupCount()

Returns the number of capturing groups in this matcher's pattern.

boolean matches()

Attempts to match the entire region against the pattern.

Pattern pattern()

Returns the pattern that is interpreted by this matcher.

String replaceAll(String replacement)

Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.

String replaceFirst(String replacement)

Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.

Matcher reset()

Resets this matcher.

Matcher reset(CharSequence input)

Resets this matcher with a new input sequence.

int start()

Returns the start index of the previous match.

int start(int group)

Returns the start index of the subsequence captured by the given group during the previous match operation.

MatchResult toMatchResult()

Returns the match state of this matcher as a MatchResult.

Matcher usePattern(Pattern newPattern)

Changes the Pattern that this Matcher uses to find matches with.
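Several of the Matcher methods listed above appear in the following sketch, which reports where each match starts and ends and then performs replacements. The class name, method names, and sample input are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherApiDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Describes each match as text@start-end (end is exclusive)
    static List<String> spans(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    // Replaces every run of digits with #
    static String maskAll(String input) {
        return DIGITS.matcher(input).replaceAll("#");
    }

    // Replaces only the first run of digits with #
    static String maskFirst(String input) {
        return DIGITS.matcher(input).replaceFirst("#");
    }

    public static void main(String[] args) {
        String input = "a12b345";
        System.out.println(spans(input));
        System.out.println(maskAll(input));
        System.out.println(maskFirst(input));
    }
}
```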

Example

    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String regex = "((A+)(B(C)))";
    String input = "ABC-AABC-AAAAABCD";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    // groupCount() returns 4; group 0 (the whole match) is not included in the count.
    int groupCount = matcher.groupCount();
    // The parentheses divide the regex into four groups, numbered by the position
    // of each opening parenthesis: ((A+)(B(C))), (A+), (B(C)), (C).
    // Each find() searches for A+BC; after a successful match, the matched text
    // is available as group 0 and its parts as groups 1 to 4.
    while (matcher.find()) {
        System.out.print("matching result: ");
        // Use <= so that the last group, group 4, is printed as well.
        for (int i = 0; i <= groupCount; i++) {
            System.out.print("group" + i + ": " + matcher.group(i) + "\t");
        }
        System.out.println();
    }

Output (tabs shown as spaces):

matching result: group0: ABC  group1: ABC  group2: A  group3: BC  group4: C

matching result: group0: AABC  group1: AABC  group2: AA  group3: BC  group4: C

matching result: group0: AAAAABC  group1: AAAAABC  group2: AAAAA  group3: BC  group4: C

