JavaSE-31 Java Regular Expressions

Source: Internet
Author: User

Overview

A regular expression is a powerful string processing tool that enables you to find, extract, split, and replace strings.

Several methods of the String class rely on regular expressions.

Method                                                  Description
boolean matches(String regex)                           Tells whether the string matches the given regular expression
String replaceAll(String regex, String replacement)     Replaces every substring that matches regex with replacement
String replaceFirst(String regex, String replacement)   Replaces the first substring that matches regex with replacement
String[] split(String regex)                            Splits the string into substrings, using matches of regex as delimiters
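
As a rough illustration of the four String methods listed above, the following sketch wraps each one in a small helper. The class name, helper names, and sample inputs are made up for this example and do not come from the original text.

```java
import java.util.Arrays;

class StringRegexMethodsDemo {
    // True only if the whole string consists of one or more digits.
    static boolean isAllDigits(String s) {
        return s.matches("\\d+");
    }

    // Replaces every run of whitespace with a single space.
    static String collapseSpaces(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Replaces only the first run of digits with "#".
    static String maskFirstNumber(String s) {
        return s.replaceFirst("\\d+", "#");
    }

    // Splits on commas, ignoring whitespace around each comma.
    static String[] splitOnCommas(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));                      // true
        System.out.println(isAllDigits("123a5"));                      // false
        System.out.println(collapseSpaces("a   b \t c"));              // a b c
        System.out.println(maskFirstNumber("id 42, code 7"));          // id #, code 7
        System.out.println(Arrays.toString(splitOnCommas("x , y,z"))); // [x, y, z]
    }
}
```

Note that matches() succeeds only when the entire string matches, whereas replaceAll(), replaceFirst(), and split() look for the pattern anywhere in the string.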

Java provides the Pattern and Matcher classes (in the java.util.regex package) to support regular expressions.

Creating a regular expression

Regular expression syntax

Character

Description

\

Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a line break. The sequence "\\" matches "\", and "\(" matches "(". In a Java string literal each backslash must itself be escaped, so the regular expression \d is written "\\d".

^

Matches the start of the input string. If the Pattern.MULTILINE flag is set, ^ also matches the position after a line terminator such as "\n" or "\r".

$

Matches the end of the input string. If the Pattern.MULTILINE flag is set, $ also matches the position before a line terminator such as "\n" or "\r".

*

Matches the preceding character or subexpression zero or more times. For example, "zo*" matches "z" and "zoo". * is equivalent to {0,}.

+

Matches the preceding character or subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but does not match "z". + is equivalent to {1,}.

?

Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches "do" and "does". ? is equivalent to {0,1}.

{n}

n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two "o" characters in "food".

{n,}

n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the "o" characters in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".

{n,m}

m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three "o" characters in "fooooood". "o{0,1}" is equivalent to "o?". Note: you cannot insert a space between the comma and the numbers.

?

When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is "non-greedy". A non-greedy pattern matches the shortest possible string, while the default greedy pattern matches the longest possible string. For example, in the string "oooo", "o+?" matches a single "o", while "o+" matches all of the "o" characters.

.

Matches any single character except a line terminator such as "\n" or "\r". To match any character including line terminators, use a pattern such as "[\s\S]" or set the Pattern.DOTALL flag.

(pattern)

Matches pattern and captures the matched subexpression. In Java, the captured text is retrieved with Matcher.group(n) and can be referenced as $n in a replacement string. To match literal parentheses, use "\(" or "\)".

(?:pattern)

Matches pattern but does not capture the match; it is a non-capturing group whose match is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, "industr(?:y|ies)" is a more economical expression than "industry|industries".

(?=pattern)

Positive lookahead. Matches at any position where the text that follows matches pattern. It is a non-capturing match, so the result is not stored for later use. For example, "Windows(?=95|98|NT|2000)" matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.

(?!pattern)

Negative lookahead. Matches at any position where the text that follows does not match pattern. It is a non-capturing match, so the result is not stored for later use. For example, "Windows(?!95|98|NT|2000)" matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000". Like positive lookahead, it does not consume characters.

x|y

Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".

[xyz]

A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".

[^xyz]

A negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches the "p", "l", "i", and "n" in "plain".

[a-z]

A character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z".

[^a-z]

A negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".

\b

Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the input). For example, "er\b" matches the "er" in "never", but not the "er" in "verb".

\B

Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never".

\cx

Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x should be a letter in the range A-Z.

\d

Matches a digit. Equivalent to [0-9].

\D

Matches a non-digit character. Equivalent to [^0-9].

\f

Matches a form feed. Equivalent to \x0c and \cL.

\n

Matches a line feed (newline). Equivalent to \x0a and \cJ.

\r

Matches a carriage return. Equivalent to \x0d and \cM.

\s

Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \t\n\x0B\f\r].

\S

Matches any non-whitespace character. Equivalent to [^\s].

\t

Matches a tab. Equivalent to \x09 and \cI.

\v

In many regular expression flavors, matches a vertical tab (\x0b, \cK). In Java 8 and later, \v matches any vertical whitespace character, [\n\x0B\f\r\x85\u2028\u2029]; use \x0B for the vertical tab alone.

\w

Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]".

\W

Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".

\xn

Matches n, where n is a hexadecimal escape code exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.

\num

Matches num, where num is a positive integer: a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.

\n

A backreference to capturing group n. In Java, \1 through \9 are always interpreted as backreferences, never as octal escapes.

\nm

A backreference to capturing group nm, if at least nm capturing groups exist at that point in the expression. Otherwise Java drops digits from the right until the number refers to an existing group or only one digit remains, so \nm is read as the backreference \n followed by the literal character m.

\0n, \0nn, \0mnn

An octal escape. In Java an octal escape must begin with a zero: n is an octal digit (0-7) and, in the three-digit form, m is an octal digit from 0 to 3.

\un

Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
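
A few of the constructs in the table are easier to follow when run. The sketch below is only an illustration: the class name SyntaxDemo, the helper findAll, and all sample strings are assumptions made for this example. It exercises lookahead, word boundaries, a backreference, capture groups in a replacement, and the Pattern.MULTILINE flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxDemo {
    // Collects every substring of input that the regex finds, in order.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String versions = "Windows2000 Windows3.1";
        // Positive lookahead: only the "Windows" followed by a listed version is replaced.
        System.out.println(versions.replaceAll("Windows(?=95|98|NT|2000)", "Win")); // Win2000 Windows3.1
        // Negative lookahead: only the "Windows" NOT followed by a listed version is replaced.
        System.out.println(versions.replaceAll("Windows(?!95|98|NT|2000)", "Win")); // Windows2000 Win3.1

        // \b requires a word boundary after "er"; \B requires that there is none.
        System.out.println("never verb".replaceAll("er\\b", "ER")); // nevER verb
        System.out.println("never verb".replaceAll("er\\B", "ER")); // never vERb

        // Backreference: \1 repeats whatever group 1 captured.
        System.out.println(findAll("(.)\\1", 0, "food book")); // [oo, oo]

        // Capture groups referenced as $1 and $2 in the replacement string.
        System.out.println("2024-05".replaceAll("(\\d{4})-(\\d{2})", "$2/$1")); // 05/2024

        // Without MULTILINE, ^ matches only at the start of the input.
        System.out.println(findAll("^\\w+", 0, "one two\nthree four"));                 // [one]
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, "one two\nthree four")); // [one, three]
    }
}
```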

Quantifiers

Suppose you need to match a postal code of the form 000-000. The regular expression \\d\\d\\d-\\d\\d\\d works but is cumbersome. Using a quantifier, it can be shortened to:

\\d{3}-\\d{3}
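
To check that the long and the short form accept the same strings, a minimal sketch such as the following can be used. The class name, constant names, and test inputs are illustrative assumptions, not part of the original article.

```java
import java.util.regex.Pattern;

class PostalCodeDemo {
    // The cumbersome form and the quantifier form of the same expression.
    static final Pattern LONG_FORM = Pattern.compile("\\d\\d\\d-\\d\\d\\d");
    static final Pattern SHORT_FORM = Pattern.compile("\\d{3}-\\d{3}");

    public static void main(String[] args) {
        for (String s : new String[] {"123-456", "12-3456", "abc-def"}) {
            boolean longResult = LONG_FORM.matcher(s).matches();
            boolean shortResult = SHORT_FORM.matcher(s).matches();
            // Both forms should agree for every input.
            System.out.println(s + " -> " + longResult + " / " + shortResult);
        }
    }
}
```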

Quantifiers in a regular expression support three modes:

    1. Greedy: the default mode. A greedy quantifier matches as much as possible and gives characters back only when the rest of the pattern fails to match.
    2. Reluctant: indicated by a question mark "?" suffix. A reluctant quantifier matches as few characters as possible. Also known as minimal matching.
    3. Possessive: indicated by a plus sign "+" suffix. A possessive quantifier matches as much as possible and never gives characters back. It is used less often, and not all programming languages support it.

Quantifiers in the three modes:

Greedy    Reluctant   Possessive   Description
X?        X??         X?+          X appears zero or one time
X*        X*?         X*+          X appears zero or more times
X+        X+?         X++          X appears one or more times
X{n}      X{n}?       X{n}+        X appears exactly n times
X{n,}     X{n,}?      X{n,}+       X appears at least n times
X{n,m}    X{n,m}?     X{n,m}+      X appears at least n times and at most m times

    • Greedy mode example

In greedy mode, the quantifier first matches as much as possible, up to the end of the content. If the overall match then fails, it backs off one character at a time until the match succeeds.

Example code:

    String test = "List<a href='http://www.baidu.com'>Baidu</a>list";
    String reg = "<.+>"; // .+ means any character, one or more times
    System.out.println(test.replaceAll(reg, "***"));

Description: we might expect the match to be <a href='http://www.baidu.com'>, but in greedy mode it is <a href='http://www.baidu.com'>Baidu</a>, so the output is:

List***list

    • Reluctant mode example

In reluctant mode, matching stops as soon as a match succeeds; no attempt is made to match a larger range.

Example code:

    String test = "List<a href='http://www.baidu.com'>Baidu</a>list";
    String reg = "<.+?>"; // .+? means any character, one or more times, matched reluctantly
    System.out.println(test.replaceAll(reg, "***"));

Output:

List***Baidu***list

  

    • Possessive mode example

Possessive mode is similar to greedy mode in that it matches as much as possible, up to the end of the content. Unlike greedy mode, however, it never backs off to try a smaller range, so the overall match can fail.

Example code:

    String test = "List<a href='http://www.baidu.com'>Baidu</a>list";
    String reg = "<.++>"; // .++ means any character, one or more times, matched possessively
    System.out.println(test.replaceAll(reg, "***"));

Output:

List<a href='http://www.baidu.com'>Baidu</a>list
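
The three modes can also be compared by looking at the matched text directly instead of at the replacement result. The following is a small sketch under assumed names (QuantifierModesDemo, firstMatch) with a shorter, made-up input string. It also shows that a possessive quantifier does match when no backtracking is required.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModesDemo {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: runs to the end, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));     // <b>bold</b>
        // Reluctant: stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input));    // <b>
        // Possessive: .++ consumes the closing '>' too and never gives it back.
        System.out.println(firstMatch("<.++>", input));    // null
        // Possessive works when the quantified part cannot swallow the '>'.
        System.out.println(firstMatch("<[^>]++>", input)); // <b>
    }
}
```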

Common regular expressions
Rule                                                 Regular expression
One or more Chinese characters                       ^[\u0391-\uffe5]+$
Zip code                                             ^[1-9]\d{5}$
QQ number                                            ^[1-9]\d{4,11}$
Email address                                        ^[a-zA-Z_]{1,}[0-9]{0,}@(([a-zA-Z0-9]-*){1,}\.){1,3}[a-zA-Z\-]{1,}$
User name (letter first, then letters/digits/_/-)    ^[a-zA-Z][a-zA-Z0-9_-]+$
Mobile phone number                                  ^1[3458][0-9]\d{8}$
URL                                                  ^((http|https)://)?([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?$
18-digit ID number                                   ^(\d{6})(18|19|20)?(\d{2})([01]\d)([0123]\d)(\d{3})(\d|X|x)?$
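
As a hedged sketch of how such rules might be applied in Java, the example below compiles three of them once and checks a few inputs. The class name, constant names, helper, and sample values are assumptions for illustration. The mobile phone character class is written as [3458], because a | inside square brackets is a literal character rather than an alternation.

```java
import java.util.regex.Pattern;

class CommonPatternsDemo {
    // Backslashes are doubled because these are Java string literals.
    static final Pattern ZIP_CODE = Pattern.compile("^[1-9]\\d{5}$");
    static final Pattern QQ_NUMBER = Pattern.compile("^[1-9]\\d{4,11}$");
    static final Pattern MOBILE = Pattern.compile("^1[3458][0-9]\\d{8}$");

    // True if the whole input matches the given rule.
    static boolean isValid(Pattern rule, String input) {
        return rule.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(ZIP_CODE, "100080"));    // true
        System.out.println(isValid(ZIP_CODE, "010080"));    // false: leading zero
        System.out.println(isValid(QQ_NUMBER, "12345"));    // true
        System.out.println(isValid(QQ_NUMBER, "1234"));     // false: too short
        System.out.println(isValid(MOBILE, "13800001234")); // true
        System.out.println(isValid(MOBILE, "12800001234")); // false: second digit not allowed
    }
}
```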

Using regular expressions

The Pattern and Matcher classes are used to process regular expressions. The process is:

    1. Compile the regular expression string into a Pattern object.
    2. Use the Pattern object to create a corresponding Matcher object.
    3. Perform the match. The state involved in matching is kept in the Matcher object, and multiple Matcher objects can share the same Pattern object.

The typical application code is as follows:

    // Needs java.util.regex.Pattern and java.util.regex.Matcher
    // Compile the regular expression string into a Pattern object
    Pattern p = Pattern.compile("a*b");
    // Use the Pattern object to create a Matcher object
    Matcher m = p.matcher("aaaaab");
    // Get the match result
    boolean result = m.matches();
    // Print the match result
    System.out.println(result); // prints true

  

A Pattern object can be reused many times. If a regular expression is needed only once, you can use the static method matches() of the Pattern class instead.

The sample code is as follows:

    boolean b = Pattern.matches("a*b", "aaaab");
    System.out.println(b); // prints true

  

    • Pattern class

A Pattern object is the compiled, in-memory representation of a regular expression. Pattern is an immutable class, so its instances are thread-safe.
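
Because a Pattern is immutable, one common arrangement is to compile it once into a static constant and create a fresh Matcher for each use, since Matcher objects carry matching state and are not safe to share between threads. The sketch below illustrates this idea; the class name, the WORD pattern, and the use of a parallel stream are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SharedPatternDemo {
    // Compiled once and shared by all threads.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Each call creates its own Matcher, so no matching state is shared.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    // Checks the inputs on several threads; results keep the input order.
    static List<Boolean> checkAll(List<String> inputs) {
        return inputs.parallelStream()
                .map(SharedPatternDemo::isWord)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(checkAll(Arrays.asList("alpha", "b2", "gamma"))); // [true, false, true]
    }
}
```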

    • Matcher class

The Matcher class provides the following commonly used methods:

Method        Description
find()        Tells whether the target string contains another substring that matches the pattern; each call continues after the previous match.
group()       Returns the substring matched by the previous match.
start()       Returns the start index of the previous match in the target string.
end()         Returns the index just past the last character of the previous match (end index + 1).
lookingAt()   Tells whether the beginning of the target string matches the pattern.
matches()     Tells whether the entire target string matches the pattern.
reset()       Resets the matcher; reset(CharSequence) applies the existing Matcher object to a new character sequence.

Example 1: use the find() and group() methods to extract specific substrings from the target string (the information extraction model of a web crawler).

    // Define the target string
    String content = "My phone number is 13800001234, buy Mac computer phone number 13112340001, save single Dog 172456789078";
    // Regular expression that extracts phone numbers
    String reg = "((13\\d)|(17\\d))\\d{8}";
    // Compile the Pattern object
    Pattern p = Pattern.compile(reg);
    // Get the Matcher object
    Matcher m = p.matcher(content);
    // Print each phone number
    while (m.find()) {
        System.out.println(m.group());
    }

  

Example 2: use the find() method together with the start() and end() methods to locate specific substrings in the target string.

    // Define the target string
    String content = "My phone number is 13800001234, buy Mac computer phone number 13112340001, save single Dog 172456789078";
    // Regular expression that extracts phone numbers
    String reg = "((13\\d)|(17\\d))\\d{8}";
    // Compile the Pattern object
    Pattern p = Pattern.compile(reg);
    // Get the Matcher object
    Matcher m = p.matcher(content);
    // Print each phone number with its position
    while (m.find()) {
        System.out.println(m.group() + ", the starting position of the substring: " + m.start()
                + ", the end position of the substring: " + m.end());
    }

Output:

13800001234, the starting position of the substring: 19, the end position of the substring: 30
13112340001, the starting position of the substring: 62, the end position of the substring: 73
17245678907, the starting position of the substring: 91, the end position of the substring: 102

  

Example 3: Use of other methods.

The matches() method returns true only if the entire string matches the pattern.

The lookingAt() method returns true as long as the beginning of the string matches the pattern.

The reset() method applies an existing Matcher object to a new character sequence.

Example code:

    // Define an array of mailbox strings
    String[] mails = {"[email protected]", "[email protected]", "[email protected]", "abc.com"};
    // Regular expression for a mailbox
    String mailReg = "\\w{3,20}@\\w+\\.(com|org|cn|net|gov)";
    // Compile the Pattern object
    Pattern mp = Pattern.compile(mailReg);
    // Declare the Matcher object
    Matcher m = null;
    for (String mail : mails) {
        if (m == null) {
            // No Matcher yet: create one
            m = mp.matcher(mail);
        } else {
            // Reuse the existing Matcher for the new string
            m.reset(mail);
        }
        String result = mail + (m.matches() ? " is" : " is not") + " a valid email address!";
        System.out.println(result);
    }

Output Result:

[email protected] is a valid email address!
[email protected] is a valid email address!
[email protected] is a valid email address!
abc.com is not a valid email address!
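
The difference between matches(), lookingAt(), and find() can be seen side by side in a short sketch. The class name, the describe helper, the digit pattern, and the inputs below are all assumptions made for illustration; a single Matcher is reused through reset(), as in Example 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKindsDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Reports matches(), lookingAt(), and find() for one input, reusing the given Matcher.
    static String describe(Matcher m, String input) {
        m.reset(input);                 // point the existing Matcher at the new input
        boolean whole = m.matches();    // the entire input must match
        boolean prefix = m.lookingAt(); // only the beginning must match
        m.reset();                      // restart so find() scans from the start
        boolean anywhere = m.find();    // any substring may match
        return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
    }

    public static void main(String[] args) {
        Matcher m = DIGITS.matcher("");
        for (String s : new String[] {"123", "123abc", "abc123", "abc"}) {
            System.out.println(s + ": " + describe(m, s));
        }
        // 123: matches=true lookingAt=true find=true
        // 123abc: matches=false lookingAt=true find=true
        // abc123: matches=false lookingAt=false find=true
        // abc: matches=false lookingAt=false find=false
    }
}
```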

  
