Getting started with Java regular expressions

Source: Internet
Author: User
Introduction

A regular expression describes a string-matching pattern. It can be used to (1) check whether a string contains a substring that follows a given rule and extract that substring, and (2) flexibly replace parts of a string according to matching rules.

Regular expressions are actually simple to learn, and the few abstract concepts involved are easy to understand. Many people find them complex for two reasons. First, most documentation does not build up from simple to advanced and pays little attention to the order in which concepts are introduced, which makes things hard for readers. Second, the documentation that ships with each engine generally focuses on that engine's unique features, which are not what a beginner needs to understand first.

1. Regular expression rules

1.1 common characters

Letters, digits, Chinese characters, underscores, and any punctuation marks that are not given a special meaning in the following sections are all "common characters". A common character in an expression matches the same character in the string being searched.

Example 1: the expression "c", matched against the string "abcde", succeeds; the matched content is "c", and the match starts at 2 and ends at 3. (Note: whether indexes start from 0 or 1 depends on the programming language; in Java they start from 0.)

Example 2: the expression "bcd", matched against the string "abcde", succeeds; the matched content is "bcd", and the match starts at 1 and ends at 4.
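The two examples above can be reproduced with java.util.regex. This is a minimal sketch; the class name and the firstMatch helper are made up for illustration, and the lowercase inputs are assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommonCharDemo {
    // Hypothetical helper: formats the first match as "text [start,end)".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("c", "abcde"));   // c [2,3)
        System.out.println(firstMatch("bcd", "abcde")); // bcd [1,4)
    }
}
```

Matcher.start() is the index of the first matched character and Matcher.end() is one past the last, which is why "c" is reported as 2 to 3.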

1.2 simple escape characters

Characters that are inconvenient to write directly are written with a leading "\". Most of these are already familiar.

Expression    Matches
\r, \n        Carriage return and line feed
\t            Tab
\\            The "\" character itself

Other punctuation marks have special uses that are described in later sections. Adding "\" in front of them makes them stand for the symbol itself. For example, ^ and $ have special meanings, so to match the literal "^" and "$" characters in a string, the expression must be written as "\^" and "\$".

Expression    Matches
\^            The ^ symbol itself
\$            The $ symbol itself
\.            The decimal point (.) itself

These escaped characters match in the same way as "common characters": each matches exactly one identical character.

Example 1: the expression "\$d", matched against the string "abc$de", succeeds; the matched content is "$d", and the match starts at 3 and ends at 5.
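In Java source code, the backslash is also the escape character of string literals, so each backslash of a regular expression has to be doubled. The sketch below (class and helper names are hypothetical) shows the escaped-dollar example and, as an alternative, Pattern.quote, which escapes a whole literal at once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: formats the first match as "text [start,end)".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        // A backslash-dollar in the regex is written with two backslashes in Java source.
        System.out.println(firstMatch("\\$d", "abc$de"));            // $d [3,5)
        // Pattern.quote turns any text into a regex that matches it literally.
        System.out.println(firstMatch(Pattern.quote("^$"), "a^$b")); // ^$ [1,3)
    }
}
```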

1.3 expressions that can match any one of "multiple characters"

Some notations in a regular expression can match any one character out of a set of characters. For example, the expression "\d" can match any digit. Although it can match any one of them, it matches only one character, not several. It is like the joker in a deck of cards: it can stand in for any card, but only for one card at a time.

Expression    Matches
\d            Any single digit, 0 to 9
\w            Any single letter, digit, or underscore: A to Z, a to z, 0 to 9, _
\s            Any single whitespace character, such as a space, tab, or form feed
.             Any single character except the line break (\n)

Example 1: the expression "\d\d", matched against "abc123", succeeds; the matched content is "12", and the match starts at 3 and ends at 5.

Example 2: the expression "a.\d", matched against "aaa100", succeeds; the matched content is "aa1", and the match starts at 1 and ends at 4.
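The "one character at a time" rule is easy to see by collecting every match. A small sketch follows; the class and the allMatches helper are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleCharDemo {
    // Hypothetical helper: collects the text of every successive match.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each match of a lone digit class is exactly one character long.
        System.out.println(allMatches("\\d", "abc123"));    // [1, 2, 3]
        System.out.println(allMatches("\\d\\d", "abc123")); // [12]
        System.out.println(allMatches("a.\\d", "aaa100"));  // [aa1]
    }
}
```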

1.4 custom expressions that can match multiple characters

Square brackets [ ] enclose a set of characters and match any one of them. [^ ] encloses a set of characters and matches any one character that is not in the set. As before, although any one of the characters can match, only a single character is matched, not several.

Expression    Matches
[ab5@]        "a", "b", "5", or "@"
[^abc]        Any character other than "a", "b", and "c"
[f-k]         Any letter from "f" to "k"
[^A-F0-3]     Any character other than "A" to "F" and "0" to "3"

Example 1: the expression "[bcd][bcd]", matched against "abc123", succeeds; the matched content is "bc", and the match starts at 1 and ends at 3.

Example 2: the expression "[^abc]", matched against "abc123", succeeds; the matched content is "1", and the match starts at 3 and ends at 4.
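A short sketch of the bracket expressions in Java. The names are hypothetical, and the replaceAll calls use sample inputs chosen only to make the ranges visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: formats the first match as "text [start,end)".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("[bcd][bcd]", "abc123")); // bc [1,3)
        System.out.println(firstMatch("[^abc]", "abc123"));     // 1 [3,4)
        // Every letter in the range f to k is replaced; "l" is outside the range.
        System.out.println("fghijkl".replaceAll("[f-k]", "-"));      // ------l
        // Everything outside A-F and 0-3 is replaced.
        System.out.println("AF03z9".replaceAll("[^A-F0-3]", "?"));   // AF03??
    }
}
```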

1.5 special symbols that modify the number of matches

The expressions described so far match exactly once, whether they match one fixed character or any one of several characters. Adding a special symbol that modifies the number of matches lets an expression repeat without being written out again.

The "count modifier" is placed after the "expression being modified". For example, "[bcd][bcd]" can be written as "[bcd]{2}".

Expression    Function
{n}           Repeats the expression n times. For example, "\w{2}" is equivalent to "\w\w", and "a{5}" is equivalent to "aaaaa"
{m,n}         Repeats the expression at least m times and at most n times. For example, "ba{1,3}" can match "ba", "baa", or "baaa"
{m,}          Repeats the expression at least m times. For example, "\w\d{2,}" can match "a12", "_456", "M12344", ...
?             Matches the expression 0 or 1 times; equivalent to {0,1}. For example, "a[cd]?" can match "a", "ac", or "ad"
+             Matches the expression at least once; equivalent to {1,}. For example, "a+b" can match "ab", "aab", "aaab", ...
*             Matches the expression zero or more times; equivalent to {0,}. For example, "\^*b" can match "b", "^b", ...

Example 1: the expression "\d+\.?\d*", matched against "It costs $12.5", succeeds; the matched content is "12.5", and the match starts at 10 and ends at 14.

Example 2: the expression "go{2,8}gle", matched against "Ads by goooooogle", succeeds; the matched content is "goooooogle", and the match starts at 7 and ends at 17.
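The count modifiers can be checked quickly in Java. In this sketch the class and helper names are hypothetical, and the bounds 2 and 8 in the "gle" example are an assumption, since any range that admits six "o" characters behaves the same way on this input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True when the regex matches the entire input.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: formats the first match as "text [start,end)".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(whole("ba{1,3}", "baaa"));  // true
        System.out.println(whole("ba{1,3}", "baaaa")); // false, four "a" exceeds the maximum
        System.out.println(whole("a[cd]?", "a"));      // true
        System.out.println(whole("\\^*b", "^^^b"));    // true
        System.out.println(firstMatch("\\d+\\.?\\d*", "It costs $12.5")); // 12.5 [10,14)
        System.out.println(firstMatch("go{2,8}gle", "Ads by goooooogle")); // goooooogle [7,17)
    }
}
```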

1.6 other special symbols representing abstract meanings

Some symbols carry a special abstract meaning in an expression:

Expression    Function
^             Matches the position at the start of the string; does not match any character
$             Matches the position at the end of the string; does not match any character
\b            Matches a word boundary, for example the position between a word and a space; does not match any character

Descriptions in words are still rather abstract, so the following examples may help.

Example 1: the expression "^aaa", matched against "xxx aaa xxx", fails. Because "^" must match the start of the string, "^aaa" can match only when "aaa" is at the beginning of the string, for example "aaa xxx xxx".

Example 2: the expression "aaa$", matched against "xxx aaa xxx", fails. Because "$" must match the end of the string, "aaa$" can match only when "aaa" is at the end of the string, for example "xxx xxx aaa".

Example 3: the expression ".\b.", matched against "@@@abc", succeeds; the matched content is "@a", and the match starts at 2 and ends at 4.

Further note: "\b" is similar to "^" and "$". It does not match any character itself, but it requires that, of the characters to the left and right of its position, one is in the "\w" range and the other is not.

Example 4: the expression "\bend\b", matched against "weekend,endfor,end", succeeds; the matched content is "end", and the match starts at 15 and ends at 18.
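The position-only symbols can be observed through the reported match range. A minimal sketch with hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorDemo {
    // Hypothetical helper: formats the first match as "text [start,end)".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("^aaa", "xxx aaa xxx")); // no match
        System.out.println(firstMatch("^aaa", "aaa xxx"));     // aaa [0,3)
        System.out.println(firstMatch("aaa$", "xxx aaa"));     // aaa [4,7)
        // The boundary sits between the last "@" and "a" and consumes nothing.
        System.out.println(firstMatch(".\\b.", "@@@abc"));     // @a [2,4)
        // Only the standalone word qualifies; "weekend" and "endfor" do not.
        System.out.println(firstMatch("\\bend\\b", "weekend,endfor,end")); // end [15,18)
    }
}
```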

Some symbols affect the relationship between the subexpressions of an expression:

Expression    Function
|             "Or" relationship between the expressions on its left and right: matches either the left side or the right side
( )           (1) When a count modifier is applied, the expression in parentheses is modified as a whole
              (2) When the result is retrieved, the content matched by the expression in parentheses can be obtained separately

Example 5: the expression "Tom|Jack", matched against the string "I'm Tom, he is Jack", succeeds; the matched content is "Tom", and the match starts at 4 and ends at 7. Matching again for the next occurrence succeeds; the matched content is "Jack", and the match starts at 15 and ends at 19.

Example 6: the expression "(go\s*)+", matched against "Let's go go go!", succeeds; the matched content is "go go go", and the match starts at 6 and ends at 14.

Example 7: the expression "¥(\d+\.?\d*)", matched against "$10.9,¥20.5", succeeds; the matched content is "¥20.5", and the match starts at 6 and ends at 11. The content matched by the parentheses can be obtained separately: "20.5".
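In Java, successive matches come from repeated calls to Matcher.find(), and the content of a pair of parentheses is read with Matcher.group(n). The sketch below uses hypothetical names and assumes the currency sign is the yen sign U+00A5, written as a Unicode escape to avoid source-encoding issues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: every match as "text [start,end)".
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + " [" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    // Hypothetical helper: capture group 1 of the first match, or null.
    static String firstGroup1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("Tom|Jack", "I'm Tom, he is Jack")); // [Tom [4,7), Jack [15,19)]
        // The parentheses are repeated as a whole by the plus sign.
        System.out.println(allMatches("(go\\s*)+", "Let's go go go!"));    // [go go go [6,14)]
        // group(1) returns only the part inside the parentheses.
        System.out.println(firstGroup1("\u00A5(\\d+\\.?\\d*)", "$10.9,\u00A520.5")); // 20.5
    }
}
```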

2. Some advanced rules in regular expressions

2.1 greedy and non-greedy matching counts

Several of the special symbols that modify the number of matches allow the same expression to match a varying number of times: "{m,n}", "{m,}", "?", "*", and "+". The actual count depends on the string being matched. Expressions with such an indefinite repeat count always match as many times as possible. For example, for the text "dxxxdxxxd":

Expression      Matching result
(d)(\w+)        "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d)     "\w+" matches all the characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it gives up that "d" so that the entire expression can match successfully

As this shows, "\w+" always matches as many characters that fit its rule as possible. In the second example it does not match the last "d", but only so that the whole expression can succeed. Likewise, expressions with "*" and "{m,n}" match as many times as possible, and an expression with "?" prefers to match when it has the choice of matching or not. This matching principle is called "greedy" mode.

Non-greedy mode:

Adding a "?" after a symbol that modifies the number of matches makes the indefinite-count expression match as few times as possible, and makes an expression that may or may not match prefer not to match. This matching principle is called "non-greedy" mode, also known as "reluctant" mode. If matching fewer times would cause the entire expression to fail, then, as in greedy mode, non-greedy mode matches the minimum additional amount needed for the entire expression to succeed. For example, for the text "dxxxdxxxd":

Expression       Matching result
(d)(\w+?)        "\w+?" matches as few characters as possible after the first "d", so it matches only one "x"
(d)(\w+?)(d)     For the entire expression to succeed, "\w+?" has to match "xxx" so that the following "d" can match. The result is that "\w+?" matches "xxx"

More examples:

Example 1: the expression "<td>(.*)</td>", matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", succeeds; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", because the "</td>" in the expression matches the last "</td>" in the string.

Example 2: in contrast, the expression "<td>(.*?)</td>", matched against the same string as in Example 1, yields only "<td><p>aa</p></td>". Matching again for the next occurrence yields the second "<td><p>bb</p></td>".
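The difference between the two modes shows up directly in what a capture group holds. A minimal sketch, with a hypothetical helper that returns group n of the first match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: group n of the first match (0 is the whole match), or null.
    static String group(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String text = "dxxxdxxxd";
        System.out.println(group("(d)(\\w+)", text, 2));      // xxxdxxxd
        System.out.println(group("(d)(\\w+)(d)", text, 2));   // xxxdxxx
        System.out.println(group("(d)(\\w+?)", text, 2));     // x
        System.out.println(group("(d)(\\w+?)(d)", text, 2));  // xxx

        String html = "<td><p>aa</p></td> <td><p>bb</p></td>";
        // Greedy: runs to the last closing tag, so the whole string is matched.
        System.out.println(group("<td>(.*)</td>", html, 0).equals(html)); // true
        // Non-greedy: stops at the first closing tag.
        System.out.println(group("<td>(.*?)</td>", html, 0)); // <td><p>aa</p></td>
    }
}
```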

2.2 backreferences \1, \2, ...

While an expression is matching, the expression engine records the strings matched by the subexpressions in parentheses "( )". When the result is retrieved, these strings can be obtained separately, as the earlier examples have shown. In practice, when some boundary is used to locate the text but the content to be extracted does not include that boundary, parentheses must be used to mark the desired range, as in the earlier "<td>(.*?)</td>".

In fact, "the string matched by a subexpression in parentheses" can be used not only after matching has finished but also during matching. A later part of the expression can refer to the string already matched by an earlier pair of parentheses. The reference is written as "\" plus a number: "\1" refers to the string matched by the 1st pair of parentheses, "\2" to the string matched by the 2nd pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, pairs are numbered in the order in which their left parenthesis "(" appears.

Examples:

Example 1: the expression "('|")(.*?)(\1)", matched against the text 'Hello', "World", succeeds; the matched content is 'Hello', including its single quotes. Matching again for the next occurrence yields "World", including its double quotes.

Example 2: the expression "(\w)\1{4,}", matched against "aa bbbb abcdefg ccccc 111121111 999999999", succeeds; the matched content is "ccccc". Matching again for the next occurrence yields "999999999". This expression requires the same character in the "\w" range to be repeated at least 5 times. Note the difference from "\w{5,}".

Example 3: the expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>", matched against "<td id='td1' style="bgcolor:white"></td>", succeeds. If "<td>" and "</td>" are not paired, the match fails; if they are changed to another matching pair of tags, the match succeeds.
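Backreferences work the same way in java.util.regex, with the usual doubling of backslashes in string literals. The following sketch uses hypothetical class, constant, and helper names; the unpaired input in the last call is made up for contrast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefDemo {
    // The tag expression from Example 3, written as a Java string literal.
    static final String TAG = "<(\\w+)\\s*(\\w+(=('|\").*?\\4)?\\s*)*>.*?</\\1>";

    // Hypothetical helper: collects the text of every successive match.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The closing quote must be the same kind as the opening quote.
        System.out.println(allMatches("('|\")(.*?)(\\1)", "'Hello', \"World\"")); // ['Hello', "World"]
        // The same character repeated at least five times.
        System.out.println(allMatches("(\\w)\\1{4,}",
                "aa bbbb abcdefg ccccc 111121111 999999999")); // [ccccc, 999999999]
        System.out.println(found(TAG, "<td id='td1' style=\"bgcolor:white\"></td>")); // true
        System.out.println(found(TAG, "<td id='td1'></tr>")); // false, tags are not paired
    }
}
```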

2.3 forward pre-search (lookahead) and reverse pre-search (lookbehind), positive and negative

The previous chapter described several special symbols with abstract meanings: "^", "$", and "\b". They have one thing in common: they do not match any character themselves; they only attach a condition to "the two ends of the string" or "the gap between characters". With that concept in mind, this section introduces a more flexible way of attaching conditions to the "two ends" or a "gap".

Forward pre-search: "(?=xxxxx)", "(?!xxxxx)"

Format "(?=xxxxx)": the condition it attaches to the "gap" or "end" it occupies in the string is that the text to the right of the gap must match the expression xxxxx. Because it only acts as a condition on the gap, it does not consume anything: the expression that follows it still matches the characters after the gap. This is like "\b", which does not match any character itself; "\b" only examines the characters before and after the gap and does not affect how the following expression matches.

Example 1: the expression "Windows (?=NT|XP)", matched against "Windows 98, Windows NT, Windows 2000", matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.

Example 2: the expression "(\w)((?=\1\1\1)(\1))+", matched against the string "aaa ffffff 999999999", matches the first 4 of the 6 "f" characters and the first 7 of the 9 "9" characters. The expression can be read as: for a letter or digit repeated 4 or more times, match everything except the last 2 repetitions. Of course, the expression need not be written this way; it is used here only for demonstration.

Format "(?!xxxxx)": the text to the right of the gap must not match the expression xxxxx.

Example 3: the expression "((?!\bstop\b).)+", matched against "fdjka ljfdl stop fjdsla fdj", matches from the beginning up to the position just before "stop". If the string does not contain "stop", the entire string is matched.

Example 4: the expression "do(?!\w)", matched against the string "done, do, dog", matches only the standalone "do". In this example, "(?!\w)" after "do" has the same effect as "\b".
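The four forward pre-search examples can be run as follows. This is a sketch with hypothetical helper names; note that the condition itself never appears in the matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Hypothetical helper: collects the text of every successive match.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Hypothetical helper: text of the first match, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // One match only, and "NT" is checked but not consumed.
        System.out.println(allMatches("Windows (?=NT|XP)",
                "Windows 98, Windows NT, Windows 2000").size());            // 1
        System.out.println(allMatches("(\\w)((?=\\1\\1\\1)(\\1))+",
                "aaa ffffff 999999999"));                                   // [ffff, 9999999]
        // Stops right before the standalone word "stop" (trailing space included).
        System.out.println("[" + firstMatch("((?!\\bstop\\b).)+",
                "fdjka ljfdl stop fjdsla fdj") + "]");                      // [fdjka ljfdl ]
        System.out.println(allMatches("do(?!\\w)", "done, do, dog"));       // [do]
    }
}
```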

Reverse pre-search: "(?<=xxxxx)", "(?<!xxxxx)"

These two formats are conceptually similar to forward pre-search. The difference is that reverse pre-search places its condition on the "left side" of the gap: the text to the left must match, or must not match, the specified expression, instead of the text to the right. Like forward pre-search, both are conditions attached to a gap and do not match any characters themselves.

Example 5: the expression "(?<=\d{4})\d+(?=\d{4})", matched against "1234567890123456", matches the middle 8 digits, excluding the first 4 and the last 4. JScript.RegExp does not support reverse pre-search, so this example cannot be demonstrated there. Many other engines do support it, such as the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, and the DEELX regular expression engine recommended on this site as the simplest and easiest to use.
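Since java.util.regex supports reverse pre-search, Example 5 can be tried directly. The sketch below uses hypothetical names, and the second call is an extra illustration of the negative form with a made-up input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // Hypothetical helper: every match as "text [start,end)".
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + " [" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Four digits must precede and four must follow; neither side is consumed.
        System.out.println(allMatches("(?<=\\d{4})\\d+(?=\\d{4})",
                "1234567890123456"));                          // [56789012 [4,12)]
        // Negative form: a "b" that is not directly preceded by "a".
        System.out.println(allMatches("(?<!a)b", "ab cb b"));  // [b [4,5), b [6,7)]
    }
}
```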

3. Other general rules

Some further rules are common to most regular expression engines but were not mentioned in the previous sections.

3.1 in an expression, "\xXX" and "\uXXXX" can be used to represent a single character ("X" stands for a hexadecimal digit)

Form       Character range
\xXX       Characters numbered 0 to 255. For example, a space can be written as "\x20"
\uXXXX     Any character, written as "\u" followed by the 4 hexadecimal digits of its number. For example, "\u4E2D"
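Both numeric forms are accepted by java.util.regex. In this sketch (hypothetical names) the regex strings keep their backslash by doubling it, while the input strings use Java's own Unicode escapes so that the source file stays pure ASCII.

```java
import java.util.regex.Pattern;

public class HexEscapeDemo {
    // True when the regex matches the entire input.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hex 20 is the space character.
        System.out.println(whole("a\\x20b", "a b"));        // true
        // Code unit 4E2D on both sides: the regex escape matches that character.
        System.out.println(whole("\\u4E2D", "\u4E2D"));     // true
        System.out.println(whole("\\u4E2D", "\u6587"));     // false, a different character
    }
}
```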

3.2 the expressions "\s", "\d", "\w", and "\b" have special meanings; the corresponding uppercase letters have the opposite meanings

Expression    Matches
\S            Any non-whitespace character ("\s" matches any single whitespace character)
\D            Any non-digit character
\W            Any character other than letters, digits, and underscores
\B            A non-word boundary, that is, a gap where both sides are in the "\w" range or both sides are outside the "\w" range
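A common use of the uppercase forms is stripping everything that is not of a given kind. A small sketch with hypothetical names and made-up sample inputs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OppositeClassDemo {
    // Removes every match of the regex from the input.
    static String remove(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    // Start index of the first match, or -1 when there is none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(remove("\\D", "tel: 555-1234")); // 5551234
        System.out.println(remove("\\W", "a_b-c d"));       // a_bcd
        System.out.println(remove("\\s", " a b\tc "));      // abc
        // The non-boundary form only accepts "end" inside a word, here in "weekend".
        System.out.println(firstStart("\\Bend", "weekend end")); // 4
    }
}
```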

3.3 summary of characters that have a special meaning in expressions and must be prefixed with "\" to match the character itself

Character    Description
^            Matches the start position of the input string. To match the "^" character itself, use "\^"
$            Matches the end position of the input string. To match the "$" character itself, use "\$"
( )          Marks the start and end of a subexpression. To match parentheses, use "\(" and "\)"
[ ]          Defines a custom expression that can match any one of several characters. To match square brackets, use "\[" and "\]"
{ }          Symbols that modify the number of matches. To match braces, use "\{" and "\}"
.            Matches any character except the line break (\n). To match the decimal point itself, use "\."
?            Modifies the number of matches to 0 or 1. To match the "?" character itself, use "\?"
+            Modifies the number of matches to at least 1. To match the "+" character itself, use "\+"
*            Modifies the number of matches to 0 or more. To match the "*" character itself, use "\*"
|            "Or" relationship between the expressions on its left and right. To match "|" itself, use "\|"

3.4 for a subexpression in parentheses "( )", if the matching result should not be recorded for later use, the "(?:xxxxx)" format can be used

Example 1: the expression "(?:(\w)\1)+", matched against "a bbccdd efg", gives the result "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is still referenced with "\1".
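Matcher.groupCount() reports how many recorded groups a pattern has, which makes the effect of the non-recording format visible. A sketch with hypothetical names; the second pattern is a made-up recording variant for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {
    // Number of recorded (capturing) groups in the pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text of the first match, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("(?:(\\w)\\1)+", "a bbccdd efg")); // bbccdd
        // Only the inner word-character group is recorded, so it is group 1.
        System.out.println(groupCount("(?:(\\w)\\1)+")); // 1
        // With ordinary outer parentheses the inner group becomes group 2.
        System.out.println(groupCount("((\\w)\\2)+"));   // 2
    }
}
```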

3.5 introduction to common expression attribute settings: Ignorecase, Singleline, Multiline, and Global

Expression attribute    Description
Ignorecase    By default, letters in an expression are case-sensitive. Setting Ignorecase makes matching case-insensitive. Some expression engines extend the notion of "case" to Unicode letters.
Singleline    By default, the decimal point "." matches any character except the line break (\n). Setting Singleline makes the decimal point match all characters, including line breaks.
Multiline     By default, "^" and "$" match only the start ① and the end ④ of the string. For example:

              ①xxxxxxxxx②\n
              ③xxxxxxxxx④

              With Multiline set, "^" matches ① and also position ③, after the line break and before the start of the next line; "$" matches ④ and also position ②, before the line break at the end of a line.
Global        Mainly takes effect when the expression is used for replacement. When Global is set, all matches are replaced.
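In Java these attributes are passed as flags to Pattern.compile. The mapping used in the sketch below is an assumption based on the descriptions above: Ignorecase as Pattern.CASE_INSENSITIVE, Singleline as Pattern.DOTALL, and Multiline as Pattern.MULTILINE. Java has no Global flag; the choice between replaceFirst and replaceAll plays that role. The class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsDemo {
    // Counts the matches of the regex in the input under the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("abc", 0, "xABCx"));                        // 0
        System.out.println(count("abc", Pattern.CASE_INSENSITIVE, "xABCx")); // 1
        // By default the dot does not match a line break.
        System.out.println(count("a.b", 0, "a\nb"));                         // 0
        System.out.println(count("a.b", Pattern.DOTALL, "a\nb"));            // 1
        // With MULTILINE the anchors also match at line starts and ends.
        System.out.println(count("^x+$", 0, "xxx\nxxx"));                    // 0
        System.out.println(count("^x+$", Pattern.MULTILINE, "xxx\nxxx"));    // 2
        // Replace only the first match, or all matches.
        System.out.println("a-b-c".replaceFirst("-", "+"));                  // a+b-c
        System.out.println("a-b-c".replaceAll("-", "+"));                    // a+b+c
    }
}
```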

 

 

4. Other tips

4.1 to learn about the more complex regular expression syntax supported by advanced engines, refer to the DEELX regular expression engine documentation on this site.

4.2 if the expression should match the entire string rather than find a part of it, use "^" and "$" at the beginning and end of the expression. For example, "^\d+$" requires that the entire string consist only of digits.

4.3 if the matched content must be a complete word rather than part of a word, use "\b" at the beginning and end of the expression. For example, use "\b(if|while|else|void|int......)\b" to match keywords in a program.
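Tips 4.2 and 4.3 can be sketched as two small checks. The names are hypothetical, the keyword list is deliberately limited to the five keywords spelled out above, and the sample source line is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeAndWordDemo {
    // Tip 4.2: anchors at both ends force the expression to cover the whole string.
    static boolean onlyDigits(String s) {
        return Pattern.compile("^\\d+$").matcher(s).find();
    }

    // Tip 4.3: word boundaries keep keywords from matching inside longer words.
    static List<String> keywords(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\b(if|while|else|void|int)\\b").matcher(code);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(onlyDigits("12345")); // true
        System.out.println(onlyDigits("12a45")); // false
        // The "int" inside "print" is not reported.
        System.out.println(keywords("int width = 1; if (x) print(y);")); // [int, if]
    }
}
```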

4.4 do not write an expression that can match the empty string. Otherwise it always matches successfully while matching nothing. For example, when writing an expression to match forms such as "123", "123.", "123.5", and ".5", the integer part, the decimal point, and the fractional digits can each be omitted, but the expression should not be written as "\d*\.?\d*", because it also matches successfully when nothing is there. A better form is "\d+\.?\d*|\.\d+".
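The effect described in 4.4 is easy to observe: on text that contains no number at all, the all-optional expression still reports matches, each of them empty. A sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // Hypothetical helper: collects the text of every successive match.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String bad = "\\d*\\.?\\d*";
        String better = "\\d+\\.?\\d*|\\.\\d+";
        // Four empty matches, one at each position of "abc" including the end.
        System.out.println(allMatches(bad, "abc").size());    // 4
        System.out.println(allMatches(better, "abc").size()); // 0
        System.out.println(allMatches(better, "123 123. 123.5 .5")); // [123, 123., 123.5, .5]
    }
}
```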

4.5 do not let a submatch that can match the empty string repeat an unlimited number of times. If every part of the subexpression in parentheses can match 0 times, and the parentheses as a whole can repeat without limit, the situation can be worse than in the previous item: the matching process may enter an endless loop. Some regular expression engines, such as the .NET one, already take steps to avoid the endless loop, but this situation is still best avoided. If an endless loop occurs while writing an expression, this is a good place to start looking for the cause.

4.6 choose between greedy mode and non-greedy mode with care. For more information, see the topic discussion.

4.7 on the two sides of the or symbol "|", it is best that a given character can be matched by only one side, so that the result does not change when the expressions on the two sides of "|" are swapped.
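The reason behind 4.7 is that java.util.regex, like many engines, tries the alternatives from left to right and keeps the first one that lets the whole expression succeed. A minimal sketch with made-up alternatives that overlap on the character "a":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Text of the first match, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both sides can start with "a", so swapping them changes the result.
        System.out.println(firstMatch("a|ab", "ab")); // a
        System.out.println(firstMatch("ab|a", "ab")); // ab
    }
}
```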

 
