Regular expressions (using the Java language as an example)

The concept of regular expressions:
A regular expression is a string used to match and process text. The word "pattern" is often used to refer to an actual regular expression. Regular expressions are written in the regular expression language, a "mini" language built into other languages or software products; it is not a complete programming language. The syntax and functionality of regular expressions differ between programming languages and applications.

The purpose of a regular expression:
(1) Search (match): search a string for one or more substrings that match the regular expression. Searching is divided into matching and substring search. A match tests whether the entire string supplied by the user fits the regular expression, for example when validating an e-mail address. A substring search is "search" in the usual sense: finding all the substrings that match a regular expression, such as finding all the words in a paragraph of English text.
(2) Replace (match and replace): find the substrings of a string that match a regular expression and replace them, such as replacing every cos in a string with sin.
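
The three uses above map onto the standard java.util.regex API roughly as follows. This is a minimal sketch: the class name, the method names and the deliberately simplified e-mail pattern are assumptions made for illustration, not a production-grade validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPurposes {
    // Whole-string match: true only if the entire input fits the pattern.
    static boolean looksLikeEmail(String input) {
        // Simplified pattern, for illustration only.
        return Pattern.matches("[\\w.]+@[\\w.]+\\.\\w+", input);
    }

    // Substring search: collect every run of letters found in the text.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z]+").matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Replace: swap every occurrence of "cos" for "sin".
    static String cosToSin(String text) {
        return text.replaceAll("cos", "sin");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));
        System.out.println(findWords("Hello, my name is Ben."));
        System.out.println(cosToSin("cos(x) + cos(y)"));
    }
}
```

Pattern.matches tests the whole input, while Matcher.find walks through the input one match at a time.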

Match a single character
(1) Matching plain text
Regular expressions can contain plain text (or even consist of nothing but plain text).
Plain text may match in several places, and most regular expression implementations provide a mechanism for finding all of the matches (usually returned as an array).
Regular expressions in Java are case-sensitive.

Text:
Hello,my name is Ben.
Regular expression:
Ben
Results:
Hello,my name is Ben. (the word Ben is matched)
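
A short Java sketch of case-sensitive matching of plain text. The extended sample sentence and the class and method names are assumptions for illustration; Pattern.CASE_INSENSITIVE is the standard flag for switching case sensitivity off.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseSensitivity {
    // Counts how many times the pattern is found in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello,my name is Ben. Ask ben about BEN.";
        // Default behavior: only the exact spelling "Ben" is found.
        System.out.println(countMatches(Pattern.compile("Ben"), text));
        // With the flag, "Ben", "ben" and "BEN" are all found.
        System.out.println(countMatches(Pattern.compile("Ben", Pattern.CASE_INSENSITIVE), text));
    }
}
```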

(2) Match any character
The . character matches any single character (a letter, a digit, or even the . character itself). Whether it also matches a line terminator depends on the implementation and its flags; in Java it does not by default.
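
A minimal Java sketch of how . treats line terminators. The class and method names are illustrative; Pattern.DOTALL is the standard flag that makes . match line terminators as well.

```java
import java.util.regex.Pattern;

public class DotMatching {
    // True if the single-character pattern "." matches the whole input under the given flags.
    static boolean dotMatches(String input, int flags) {
        return Pattern.compile(".", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("a", 0));               // a letter
        System.out.println(dotMatches(".", 0));               // the dot itself
        System.out.println(dotMatches("\n", 0));              // a line terminator, default mode
        System.out.println(dotMatches("\n", Pattern.DOTALL)); // a line terminator, DOTALL mode
    }
}
```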

(3) Matching special characters
The special characters here are the special characters of the regular expression language; do not confuse them with the escape characters of Java string literals.
A special character in a regular expression is a character that has a special meaning within the pattern and is not treated as plain text.
To match a special character itself, put a \ in front of it. How many \ characters are needed in Java source code is a common headache, because the backslash must itself be escaped inside a Java string literal; for example, the regular expression \. is written as "\\." in Java. For a detailed explanation see http://zhidao.baidu.com/question/1637942975153285780.html.
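
The double layer of escaping can be sketched as follows. The class and method names are assumptions for illustration; Pattern.quote is the standard helper for treating a whole string as literal text.

```java
import java.util.regex.Pattern;

public class EscapingDemo {
    // Regex a\.b : the Java string literal needs "\\." to produce the regex \.
    static boolean isLiteralADotB(String input) {
        return input.matches("a\\.b");
    }

    // Regex \\ (one literal backslash) needs four backslashes in Java source.
    static boolean isSingleBackslash(String input) {
        return input.matches("\\\\");
    }

    // Pattern.quote escapes every metacharacter in the given text.
    static boolean equalsLiterally(String literal, String input) {
        return input.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        System.out.println(isLiteralADotB("a.b"));             // escaped dot matches a real dot
        System.out.println(isLiteralADotB("axb"));             // but not an arbitrary character
        System.out.println(isSingleBackslash("\\"));           // a string holding one backslash
        System.out.println(equalsLiterally("1+1=2", "1+1=2")); // + is treated as plain text
    }
}
```

Without quoting, the pattern 1+1=2 would treat + as a quantifier, so it would match 11=2 rather than the text 1+1=2.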

Special characters that need to be escaped in a regular expression:

$    Matches the end position of the input string.
()   Mark the start and end of a subexpression.
*    Matches the preceding subexpression 0 or more times.
+    Matches the preceding subexpression 1 or more times.
.    Matches any single character except the newline character \n.
[    Marks the beginning of a bracket expression (character set).
?    Matches the preceding subexpression 0 or 1 times, or marks a quantifier as non-greedy.
\    Escapes the next character (marks it as a special character or as a literal).
^    Matches the start position of the input string.
{    Marks the beginning of a quantifier expression.
|    Indicates a choice between two alternatives.

Escaped literal characters in Java regular expressions:

1. Octal escape sequences:
\0n     the character with octal value 0n (0 <= n <= 7)
\0nn    the character with octal value 0nn (0 <= n <= 7)
\0mnn   the character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
2. Hexadecimal escape sequences:
\xhh    the character with hexadecimal value 0xhh
\uhhhh  the character with hexadecimal value 0xhhhh
3. Special characters:
\"   double quote
\'   single quote
\\   backslash
\a   alert (bell) character
\e   escape character
\cx  the control character corresponding to x
4. Control characters:
\t   tab ('\u0009')
\n   line feed ('\u000A')
\r   carriage return ('\u000D')
\f   form feed ('\u000C')

Match a set of characters
(1) Matching one of several characters
Use the metacharacters [ and ] to define a character set. Every character between them is a member of the set, and the set matches a single character that is any one of its members.
Text:
The phrase "regular expression" is often abbreviated as RegEx or regex.
Regular expression:
[Rr]eg[Ee]x
Results:
The phrase "regular expression" is often abbreviated as RegEx or regex. (both RegEx and regex are matched)
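
A minimal Java sketch of this example. The capitalized spelling RegEx in the sample sentence and the class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharacterSetDemo {
    // Returns every substring of the text that matches the regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "The phrase \"regular expression\" is often abbreviated as RegEx or regex.";
        // [Rr] accepts R or r, [Ee] accepts E or e; "regular" is not matched.
        System.out.println(findAll("[Rr]eg[Ee]x", text));
    }
}
```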

(2) Using character set ranges
To simplify the definition of character sets, regular expressions provide a special metacharacter: a character range can be defined with - (hyphen).
The first and last characters of a range can be any characters in the ASCII table, but in practice ranges of digits and ranges of letters are the most common.
When defining a character range, make sure that the last character of the range is not less than its first character.
The same character set can contain several ranges.
The - character is a special metacharacter: it acts as a metacharacter only between [ and ]. Outside a character set, - is an ordinary character that matches only - itself, so it does not need to be escaped there.

(3) Excluding characters
Use the metacharacter ^ to negate a character set: the set then matches any character that is not one of the listed characters or ranges.
The ^ applies to all the characters and ranges in the given character set, not only to the character or range that immediately follows the ^.
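
Ranges and negation can be sketched in Java as follows. The class name, method names and sample strings are assumptions for illustration. The sketch also shows that Java rejects a range whose last character is less than its first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RangeDemo {
    // Returns every substring of the text that matches the regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    // True if the regular expression is syntactically valid.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Several ranges in one set: hexadecimal digits.
        System.out.println("1aF9".matches("[0-9A-Fa-f]+"));
        // ^ negates the whole set: runs of characters that are not digits.
        System.out.println(findAll("[^0-9]+", "abc123def"));
        // A reversed range is a syntax error.
        System.out.println(compiles("[z-a]"));
    }
}
```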

Using metacharacters
Metacharacters fall roughly into two kinds: those used to match text (for example, .) and those that are part of regular expression syntax (for example, [ and ]).
Metacharacters also differ in when they are special: . and [ are metacharacters only when they are not escaped, while t and n act as metacharacters only when they are escaped (\t and \n).
(1) Escaping special characters
Any metacharacter can be escaped by prefixing it with a \ character.

(2) Matching whitespace characters
\r\n is the end-of-line marker used by the Windows operating system;
\n is the end-of-line marker used by UNIX and Linux operating systems.
Whitespace characters:

[\b]  backspace (moves back one character and deletes it)
\f    form feed
\n    line feed (newline)
\r    carriage return
\t    tab
\v    vertical tab (\x0B)

(3) Matching a particular character category

\d   any digit (equivalent to [0-9])
\D   any non-digit character (equivalent to [^0-9])
\w   any letter, digit, or underscore (equivalent to [a-zA-Z0-9_])
\W   any character that is not a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_])
\s   any whitespace character (in Java, equivalent to [ \t\n\x0B\f\r])
\S   any non-whitespace character (in Java, equivalent to [^ \t\n\x0B\f\r])
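
A minimal Java sketch of the character categories above. The class name, method names and sample strings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryDemo {
    // Returns every substring of the text that matches the regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    // Collapses every run of whitespace into a single space.
    static String normalizeSpaces(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "Order 66 shipped in 2 days")); // runs of digits
        System.out.println(findAll("\\D+", "a1b22c"));                     // runs of non-digits
        System.out.println(findAll("\\w+", "id_1, id_2"));                 // word characters
        System.out.println(normalizeSpaces("a  b\tc"));
    }
}
```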

(4) Using POSIX character classes in Java

\p{Lower}    lowercase alphabetic characters: [a-z]
\p{Upper}    uppercase alphabetic characters: [A-Z]
\p{ASCII}    all ASCII characters: [\x00-\x7F]
\p{Alpha}    alphabetic characters: [\p{Lower}\p{Upper}]
\p{Digit}    decimal digits: [0-9]
\p{Alnum}    alphanumeric characters: [\p{Alpha}\p{Digit}]
\p{Punct}    punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    visible characters: [\p{Alnum}\p{Punct}]
\p{Print}    printable characters: [\p{Graph}\x20]
\p{Blank}    space or tab: [ \t]
\p{Cntrl}    control characters: [\x00-\x1F\x7F]
\p{XDigit}   hexadecimal digits: [0-9a-fA-F]
\p{Space}    whitespace characters: [ \t\n\x0B\f\r]
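
A short Java sketch using a few of these classes, written with the capitalization documented for java.util.regex.Pattern (\p{Upper}, \p{Punct}, and so on). The class name, method names and sample strings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PosixClassDemo {
    // Returns every substring of the text that matches the regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\p{Punct}", "Hi, Ben!"));   // punctuation characters
        System.out.println(findAll("\\p{Upper}+", "JDK and JVM")); // runs of uppercase letters
        System.out.println("CAFE01".matches("\\p{XDigit}+"));     // hexadecimal digits only
        System.out.println("abc1".matches("\\p{Alpha}+"));        // contains a digit
    }
}
```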

Repeated matching
(1) How many matches
When you add a + suffix to a character set, the + must be placed outside the set. For example, [0-9]+ is correct and [0-9+] is not. [0-9+] is in fact a legal regular expression, but it does not match one or more digits; it defines a character set made up of the digits 0 through 9 and the + character. The same applies to the other quantifiers.
+ is a metacharacter. To match the + character itself, use its escape sequence \+. The same applies to the other metacharacters.
Inside a character set, metacharacters such as . and + are interpreted as ordinary characters and do not need to be escaped, but escaping them does no harm. That is, [.] and [\.] have the same effect.
1. Match one or more characters: +
2. Match zero or more characters: *
3. Match zero or one character: ?

(2) Matching a number of repetitions
1. Set an exact number of repetitions, such as [0-9]{3}.
2. Set a range for the number of repetitions, such as [0-9]{3,5}; note that the separator is a comma, not a hyphen.
3. Match "at least this many times", such as [0-9]{3,}; note that the comma must not be left out.
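
The quantifiers above can be checked with a small Java sketch. The class name, method name and sample strings are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True if the entire input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[0-9]{3}", "123"));      // exactly three digits
        System.out.println(fullMatch("[0-9]{3,5}", "123456")); // six digits exceed the range
        System.out.println(fullMatch("[0-9]{3,}", "123456"));  // at least three digits
        System.out.println(fullMatch("[0-9]+", "1+2"));        // + outside the set: digits only
        System.out.println(fullMatch("[0-9+]+", "1+2"));       // + inside the set: digits or +
        System.out.println(fullMatch("colou?r", "color"));     // ? makes the u optional
    }
}
```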

(3) Preventing over-matching
Text:
This offer is not available to customers living in <B>AK</B> and <B>HI</B>.
Regular expression:
<[Bb]>.*</[Bb]>
Results:
<B>AK</B> and <B>HI</B> (a single match running from the first <B> to the last </B>)
Explanation:
* and + are so-called "greedy" metacharacters: they match as much as they can. A greedy match extends from the start of the match to the last possible ending in the text, instead of stopping at the first possible ending.
The opposite of greedy matching is lazy matching. To make a greedy metacharacter lazy, add a ? suffix to it.

"Greedy" metacharacter    "Lazy" metacharacter
     *                  *?
     +                  +?
     {n,}               {n,}?

Text:
This offer is not available to customers living in <B>AK</B> and <B>HI</B>.
Regular expression:
<[Bb]>.*?</[Bb]>
Results:
<B>AK</B> and <B>HI</B> are matched separately (two matches)
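
A minimal Java sketch comparing greedy and lazy matching. The sample sentence with its <B> tags, the class name and the helper method are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyLazyDemo {
    // Returns every substring of the text that matches the regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "living in <B>AK</B> and <B>HI</B>";
        // Greedy: .* runs on to the last closing tag, giving one long match.
        System.out.println(findAll("<[Bb]>.*</[Bb]>", text));
        // Lazy: .*? stops at the first closing tag, giving two short matches.
        System.out.println(findAll("<[Bb]>.*?</[Bb]>", text));
    }
}
```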

Position matching
If you only need to match at a specific position in a piece of text, use position matching, that is, boundary qualifiers. Regular expressions provide special metacharacters that indicate where a match is allowed to occur (a boundary). Boundaries are divided into word boundaries, non-word boundaries, line boundaries, and string boundaries.

^   the beginning of the string (in (?m) mode, the beginning of a line)
$   the end of the string (in (?m) mode, the end of a line)
\b  a word boundary (the beginning or end of a word)
\B  a non-word boundary
\A  the beginning of the string
\z  the end of the string

(1) Word boundaries
A word boundary is specified with the qualifier \b (b stands for boundary); \b matches the beginning or the end of a word.
\b matches the position between a character that can be part of a word (a letter, digit, or underscore) and a character that cannot.
\b matches only a position and does not match any character. The string matched by \bcat\b has length 3 (c, a, t), not length 5.
To match a complete word, add the \b qualifier both before and after the text you want to match.

Text:
The cat scattered his food all over the room
Regular expression:
\bcat\b
Results:
The cat scattered his food all over the room (only the standalone word cat is matched; the cat inside scattered is not)
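
A minimal Java sketch of this example, reporting the start index of each match. The class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Returns the start index of every match of the regular expression in the text.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "The cat scattered his food all over the room";
        // Without boundaries, the "cat" inside "scattered" is found too.
        System.out.println(matchStarts("cat", text));
        // With \b on both sides, only the standalone word is found.
        System.out.println(matchStarts("\\bcat\\b", text));
    }
}
```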

(2) Non-word boundaries
To match a position that is not a word boundary, use \B.

Text:
Please enter the nine-digit ID as it appears on your color - coded pass-key
Regular expression:
\B-\B
Results:
Please enter the nine-digit ID as it appears on your color - coded pass-key (only the hyphen surrounded by spaces in "color - coded" is matched)
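
A minimal Java sketch of \B. The sample string, the class name and the method name are assumptions for illustration: a hyphen surrounded by spaces sits between two non-word characters, so \B matches on both sides of it, while a hyphen inside a word such as pass-key has word boundaries on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonWordBoundaryDemo {
    // Returns the start index of every match of the regular expression in the text.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "color - coded pass-key";
        // Every hyphen in the text.
        System.out.println(matchStarts("-", text));
        // Only hyphens that are not adjacent to a word character.
        System.out.println(matchStarts("\\B-\\B", text));
    }
}
```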

(3) String boundaries
Two metacharacters define string boundaries: ^ marks the beginning of the string and $ marks the end of the string.
^ is one of several metacharacters that have more than one use. It acts as "negation" only when it appears inside a character set (between [ and ]) and immediately follows the opening bracket [. Outside a character set and at the beginning of a pattern, ^ matches the beginning of the string.
String boundaries can also be defined with \A, which marks the beginning of the string, and \z, which marks the end of the string.

The following example demonstrates only the ^ metacharacter and not $, \A, or \z; they are used in a similar way.

Text:
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetnamespace=http://tips.cf>
Regular expression:
^\s*<\?xml.*?\?>
Results:
The first line of the text is matched.
Explanation:
Here ^\s* matches the beginning of the string followed by zero or more whitespace characters. Note that a lazy quantifier (.*?) is also used.
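
A minimal Java sketch of this pattern. The class name, the method name and the shortened sample inputs are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class StringStartDemo {
    // The regex ^\s*<\?xml.*?\?> written as a Java string literal.
    static final Pattern XML_DECLARATION = Pattern.compile("^\\s*<\\?xml.*?\\?>");

    // True if the text begins (after optional whitespace) with an XML declaration.
    static boolean startsWithXmlDeclaration(String text) {
        // find() is used because the pattern only has to match a prefix of the input.
        return XML_DECLARATION.matcher(text).find();
    }

    public static void main(String[] args) {
        String good = "  <?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<wsdl:definitions>";
        String bad = "junk <?xml version=\"1.0\"?>";
        System.out.println(startsWithXmlDeclaration(good)); // leading whitespace is allowed
        System.out.println(startsWithXmlDeclaration(bad));  // other leading text is not
    }
}
```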

(4) Line boundaries
Regular expressions allow certain special metacharacter sequences to change the behavior of other metacharacters. The (?m) notation, which enables multiline mode, is one such sequence.
By default, ^ and $ ignore line terminators and match only at the beginning and end of the entire input sequence (that is, the whole string). Multiline mode can be enabled with the embedded flag expression (?m); ^ and $ then also match at the beginning and end of each line.
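
A minimal Java sketch of the effect of (?m). The class name, the method name and the three-line sample string are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineDemo {
    // Returns every substring of the text that matches the regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "first\nsecond\nthird";
        // Default mode: ^ and $ refer to the whole input, so nothing matches.
        System.out.println(findAll("^\\w+$", text));
        // Multiline mode: ^ and $ also match at each line start and end.
        System.out.println(findAll("(?m)^\\w+$", text));
    }
}
```

The same mode can also be requested with the Pattern.MULTILINE flag passed to Pattern.compile.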

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
