Learn Java Regular Expressions

Source: Internet
Author: User
Regular expressions are used to specify string patterns. You can use a regular expression whenever you need to locate strings that match a particular pattern. For example, one of the sample programs in this article locates all hyperlinks in an HTML file by looking for strings of the pattern <a href="...">.

Of course, for specifying a pattern, the ... notation is not precise enough. You need to specify exactly which sequences of characters are legal matches. Describing a pattern requires a special syntax.

Here is a simple example. The regular expression
[Jj]ava.+
matches any string of the following form:

  • The first letter is J or j.
  • The next three letters are ava.
  • The remainder of the string is composed of one or more arbitrary characters.

For example, the string "javaness" matches this particular regular expression, but the string "Core Java" does not.
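As a quick check of the example above, here is a minimal sketch that tests both strings with Pattern.matches, which requires the whole input to match. The class and method names (JavaPatternDemo, matchesWhole) are illustrative only.

```java
import java.util.regex.Pattern;

public class JavaPatternDemo {
    // The pattern from the text: J or j, then "ava", then one or more characters.
    static final String PATTERN = "[Jj]ava.+";

    // Returns true if the entire input matches the pattern.
    static boolean matchesWhole(String input) {
        return Pattern.matches(PATTERN, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("javaness"));  // true
        System.out.println(matchesWhole("Core Java")); // false: does not start with J or j
        System.out.println(matchesWhole("Java"));      // false: .+ needs at least one more character
    }
}
```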

As you can see, you need to know a bit of syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of simple constructs is sufficient.

  • A character class is a set of character alternatives, enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Here the - denotes a range (all characters whose Unicode values fall between the two bounds), and ^ denotes the complement (all characters except those specified).
  • There are many predefined character classes, such as \d (digits) or \p{Sc} (Unicode currency symbols), as shown in Tables 12-8 and 12-9.
  • Most characters match themselves, such as the ava characters in the example above.
  • The . symbol matches any character (except possibly line terminators, depending on flag settings).
  • \ is used as an escape character. For example, \. matches a period and \\ matches a backslash.
  • ^ and $ match the beginning and end of the input, respectively (or of each line in multiline mode).
  • If X and Y are regular expressions, then XY means "any match for X followed by a match for Y", and X|Y means "any match for X or Y".
  • You can apply quantifiers to an expression: X+ means X repeated one or more times, X* means X repeated zero or more times, and X? means X repeated zero times or once.
  • By default, a quantifier matches the largest possible repetition that still lets the overall match succeed (a greedy match). You can change that behavior with the suffix ? (reluctant or stingy match: match the smallest repetition count) or + (possessive match: match the largest count even if that makes the overall match fail).

For example, the string cab matches [a-z]*ab but does not match [a-z]*+ab. In the first case, [a-z]* matches only the character c, so that the characters ab match the remainder of the pattern. The possessive version [a-z]*+, however, consumes all of cab and never gives any characters back, so the remainder of the pattern, ab, is left unmatched and the overall match fails.
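The difference between the three quantifier modes can be checked with a short sketch. The cab test comes from the text; the HTML-like sample string and the helper name firstMatch are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Whole-input matching, as in the cab example.
        System.out.println("cab".matches("[a-z]*ab"));  // true: greedy * backtracks
        System.out.println("cab".matches("[a-z]*+ab")); // false: possessive *+ keeps everything

        // Greedy versus reluctant when searching for a substring.
        String html = "<b>one</b><b>two</b>";
        System.out.println(firstMatch("<b>.*</b>", html));  // <b>one</b><b>two</b>
        System.out.println(firstMatch("<b>.*?</b>", html)); // <b>one</b>
    }
}
```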

  • You can use groups to define subexpressions. Enclose groups in ( ), for example ([+-]?)([0-9]+). You can then ask the pattern matcher to return the match of each group, or refer back to a group with \n, where n is the group number (starting with \1).
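A minimal sketch of both uses of groups follows: extracting the sign and the digits with the two-group pattern from the text, and using the back reference \1 to detect a word that is immediately repeated. The repeated-word pattern and all names are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Splits a signed integer into {sign, digits} using the two groups from the text.
    // Returns null if the input is not a signed integer.
    static String[] signAndDigits(String input) {
        Matcher m = Pattern.compile("([+-]?)([0-9]+)").matcher(input);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2) };
    }

    // \1 must match the same text that group 1 matched earlier in the pattern.
    static boolean hasRepeatedWord(String input) {
        return Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(input).find();
    }

    public static void main(String[] args) {
        String[] parts = signAndDigits("-42");
        System.out.println(parts[0] + " / " + parts[1]);        // - / 42
        System.out.println(hasRepeatedWord("this is is a test")); // true
        System.out.println(hasRepeatedWord("this is a test"));    // false
    }
}
```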

Here is a somewhat complex but potentially useful regular expression. It describes decimal or hexadecimal integers.
[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+
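The integer pattern can be tried out with a few inputs. This is a sketch only; the class name and the sample inputs are assumptions. Note that, as written, the optional sign belongs to the decimal alternative only, because | has lower precedence than concatenation.

```java
import java.util.regex.Pattern;

public class IntegerLiteralDemo {
    // Decimal integers with an optional sign, or hexadecimal integers with a 0x prefix.
    static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

    // Returns true if the whole string is a decimal or hexadecimal integer.
    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("-12"));   // true
        System.out.println(isInteger("0x1F"));  // true
        System.out.println(isInteger("12a"));   // false
        System.out.println(isInteger("+0x1F")); // false: the sign applies to the decimal form only
    }
}
```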

Unfortunately, the regular expression syntax is not completely standardized among the various programs and libraries that use regular expressions. There is consensus on the basic constructs, but there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 12-8 shows all constructs of the Java syntax. For more information on the regular expression syntax, consult the API documentation for the Pattern class or the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly and Associates, 1997). (Translator's note: Southeast University Press has published a photocopied reprint of the second edition.)

Table 12-8. Regular expression syntax

Syntax                            Explanation

Characters
c                                 The character c
\unnnn, \xnn, \0n, \0nn, \0mnn    The code unit with the given hexadecimal or octal value
  \0n                             The character with octal value 0n (0 <= n <= 7)
  \0nn                            The character with octal value 0nn (0 <= n <= 7)
  \0mnn                           The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
  \xnn                            The character with hexadecimal value 0xnn
  \uhhhh                          The character with hexadecimal value 0xhhhh
\t, \n, \r, \f, \a, \e            The control characters tab, newline, carriage return, form feed, alert, and escape
\cc                               The control character corresponding to the character c

Character classes
[C1C2...]                         Any of the characters represented by C1, C2, ... Each Ci is a character, a character range (c1-c2), or a character class
[^...]                            Complement of a character class
[...&&...]                        Intersection of two character classes

Predefined character classes
.                                 Any character except line terminators (or any character if the DOTALL flag is set)
\d                                A digit [0-9]
\D                                A nondigit [^0-9]
\s                                A whitespace character [ \t\n\r\f\x0B]
\S                                A non-whitespace character
\w                                A word character [a-zA-Z0-9_]
\W                                A nonword character
\p{name}                          A named character class; see Table 12-9
\P{name}                          The complement of a named character class

Boundary matchers
^ $                               Beginning and end of the input (or beginning and end of a line in multiline mode)
\b                                A word boundary
\B                                A nonword boundary
\A                                Beginning of the input
\z                                End of the input
\Z                                End of the input except for the final line terminator
\G                                End of the previous match

Quantifiers
X?                                Optional X (X may or may not appear)
X*                                X, zero or more times
X+                                X, one or more times
X{n}  X{n,}  X{n,m}               X exactly n times, at least n times, between n and m times

Quantifier suffixes
?                                 Turn the default (greedy) match into a reluctant match
+                                 Turn the default (greedy) match into a possessive match

Set operations
XY                                Any match for X followed by any match for Y
X|Y                               Any match for X or Y

Grouping
(X)                               Match X and capture it in an automatically numbered group
\n                                The match of the nth group

Escapes
\c                                The character c (must not be an alphabetic character)
\Q...\E                           Quote ... verbatim
(?...)                            Special construct; see the API documentation of the Pattern class
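A few of the less familiar constructs from the syntax table can be exercised as follows. This is a sketch under assumptions: the sample strings and helper names are made up for illustration, and Pattern.quote is used as a convenience for building a quoted literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyntaxTableDemo {
    // [a-z&&[^aeiou]]: intersection of "lowercase ASCII letter" with "not a vowel".
    static boolean isLowercaseConsonant(char c) {
        return Pattern.matches("[a-z&&[^aeiou]]", String.valueOf(c));
    }

    // \Q...\E quotes its content, so "." and "+" are taken literally.
    static boolean containsQuoted(String input) {
        return Pattern.compile("\\Q1.5+2\\E").matcher(input).find();
    }

    // \b on both sides restricts matches to whole words.
    static int countWholeWord(String word, String input) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(input);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isLowercaseConsonant('b'));    // true
        System.out.println(isLowercaseConsonant('e'));    // false
        System.out.println(containsQuoted("x = 1.5+2;")); // true
        System.out.println(containsQuoted("x = 1x5+2;")); // false
        System.out.println(countWholeWord("cat", "cat concat cat's category")); // 2
    }
}
```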
The simplest use for a regular expression is to test whether a particular string matches it. Here is how you program that test in Java. First construct a Pattern object from the string that denotes the regular expression. Then get a Matcher object from the pattern and call its matches() method:

   Pattern pattern = Pattern.compile(patternString);
   Matcher matcher = pattern.matcher(input);
   if (matcher.matches()) ...

Table 12-9. Predefined character class names

Name                     Explanation

Lower                    ASCII lowercase letter [a-z]
Upper                    ASCII uppercase letter [A-Z]
Alpha                    ASCII letter [A-Za-z]
Digit                    ASCII digit [0-9]
Alnum                    ASCII letter or digit [A-Za-z0-9]
XDigit                   Hexadecimal digit [0-9A-Fa-f]
Print or Graph           Printable ASCII character [\x21-\x7E]
Punct                    ASCII character that is neither a letter nor a digit [\p{Print}&&\P{Alnum}]
ASCII                    Any ASCII character [\x00-\x7F]
Cntrl                    ASCII control character [\x00-\x1F]
Blank                    Space or tab [ \t]
Space                    Whitespace character [ \t\n\r\f\x0B]
javaLowerCase            Lowercase character, as determined by Character.isLowerCase()
javaUpperCase            Uppercase character, as determined by Character.isUpperCase()
javaWhitespace           Whitespace character, as determined by Character.isWhitespace()
javaMirrored             Mirrored character, as determined by Character.isMirrored()
InBlock                  Block is the name of a Unicode character block, with spaces removed, such as BasicLatin or Mongolian. For the list of block names, see http://www.unicode.org
Category or InCategory   Category is the name of a Unicode character category, such as L (letter) or Sc (currency symbol). For the list of category names, see http://www.unicode.org
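The named classes are written as \p{name} inside a pattern. The following sketch tries a few of them; the sample strings and method names are assumptions for illustration, and non-ASCII characters are written as \u escapes so that the source file encoding does not matter.

```java
public class NamedClassDemo {
    // \p{Punct}: removes ASCII punctuation.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // \p{Sc}: a single Unicode currency symbol.
    static boolean isCurrencySymbol(String s) {
        return s.matches("\\p{Sc}");
    }

    // \p{javaLowerCase}: characters for which Character.isLowerCase() returns true.
    static boolean isAllLowerCase(String s) {
        return s.matches("\\p{javaLowerCase}+");
    }

    // \p{InBasicLatin}: a single character from the Basic Latin block.
    static boolean isBasicLatin(String s) {
        return s.matches("\\p{InBasicLatin}");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Hello, world!")); // Hello world
        System.out.println(isCurrencySymbol("\u20AC"));        // true (euro sign)
        System.out.println(isAllLowerCase("aBc"));             // false
        System.out.println(isBasicLatin("\u00E9"));            // false (Latin-1 Supplement)
    }
}
```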
The input of the matcher can be an object of any class that implements the CharSequence interface, such as String, StringBuilder, or CharBuffer.

When compiling the pattern, you can set one or more flags, for example

   Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CASE);

The following six flags are covered here:

  • CASE_INSENSITIVE: Match characters independently of letter case. By default, this flag takes only US-ASCII characters into account.
  • UNICODE_CASE: When used in combination with CASE_INSENSITIVE, use Unicode letter case for matching.
  • MULTILINE: ^ and $ match the beginning and end of a line, not of the entire input.
  • UNIX_LINES: Only '\n' is recognized as a line terminator when matching ^ and $ in multiline mode.
  • DOTALL: The . symbol matches all characters, including line terminators.
  • CANON_EQ: Take canonical equivalence of Unicode characters into account.
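The effect of some of these flags can be observed with a small sketch. Flags are bit masks, so they are combined here with the bitwise OR operator, which is the conventional way to combine them. The helper names and sample inputs are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // Counts matches of "^\w+", a word at the start of the input (or of a line).
    static int countLineStarts(String input, int flags) {
        Matcher m = Pattern.compile("^\\w+", flags).matcher(input);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    // Tests whether the whole input matches the regex under the given flags.
    static boolean matchesWithFlags(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int unicodeIgnoreCase = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matchesWithFlags("java", "JAVA", Pattern.CASE_INSENSITIVE)); // true
        // Accented letters are written as \u escapes; both flags are needed for them.
        System.out.println(matchesWithFlags("\u00E9t\u00E9", "\u00C9T\u00C9", unicodeIgnoreCase)); // true

        String lines = "one\ntwo\nthree";
        System.out.println(countLineStarts(lines, 0));                 // 1
        System.out.println(countLineStarts(lines, Pattern.MULTILINE)); // 3

        System.out.println(matchesWithFlags("a.b", "a\nb", 0));              // false
        System.out.println(matchesWithFlags("a.b", "a\nb", Pattern.DOTALL)); // true
    }
}
```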

If a regular expression contains groups, the Matcher object can reveal the group boundaries. The methods

   int start(int groupIndex)
   int end(int groupIndex)

yield the starting index and the past-the-end index of a particular group.
You can simply extract the matched string by calling String group(int groupIndex).
Group 0 is the entire match; the index of the first actual group is 1. Call the groupCount method to get the total number of groups.

Nested groups are ordered by their opening parentheses. For example, given the pattern
((1?[0-9]):([0-5][0-9]))[ap]m
and the input
11:59am
the matcher reports the following groups:

Group index   Start   End   String
0             0       7     11:59am
1             0       5     11:59
2             0       2     11
3             3       5     59
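The group table can be reproduced with a short sketch that prints the index, start, end, and text of every group. It assumes the time pattern with nested groups, ((1?[0-9]):([0-5][0-9]))[ap]m; the class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupBoundaryDemo {
    // Returns one line per group: "index start end text".
    // Assumes every group takes part in the match, as is the case for this pattern.
    static String describeGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) return "no match";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append(i).append(' ')
              .append(m.start(i)).append(' ')
              .append(m.end(i)).append(' ')
              .append(m.group(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints:
        // 0 0 7 11:59am
        // 1 0 5 11:59
        // 2 0 2 11
        // 3 3 5 59
        System.out.print(describeGroups("((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am"));
    }
}
```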
 
Example 12-9 prompts for a pattern, then for strings to match. It prints whether or not each input matches the pattern. If the input matches and the pattern contains groups, the program prints the group boundaries as parentheses, such as ((11):(59))am

Example 12-9. RegExTest.java
import java.util.*;
import java.util.regex.*;

/**
   This program tests regular expression matching.
   Enter a pattern and strings to match, or enter an
   empty line to exit. If the pattern contains groups,
   the group boundaries are displayed in the match.
*/
public class RegExTest
{
   public static void main(String[] args)
   {
      Scanner in = new Scanner(System.in);
      System.out.println("Enter pattern: ");
      String patternString = in.nextLine();

      Pattern pattern = null;
      try
      {
         pattern = Pattern.compile(patternString);
      }
      catch (PatternSyntaxException e)
      {
         System.out.println("Pattern syntax error");
         System.exit(1);
      }

      while (true)
      {
         System.out.println("Enter string to match: ");
         // stop at end of input or on an empty line
         if (!in.hasNextLine()) return;
         String input = in.nextLine();
         if (input.equals("")) return;
         Matcher matcher = pattern.matcher(input);
         if (matcher.matches())
         {
            System.out.println("Match");
            int g = matcher.groupCount();
            if (g > 0)
            {
               for (int i = 0; i < input.length(); i++)
               {
                  // print an opening parenthesis for each group that starts here
                  for (int j = 1; j <= g; j++)
                     if (i == matcher.start(j))
                        System.out.print('(');
                  System.out.print(input.charAt(i));
                  // print a closing parenthesis for each group that ends here
                  for (int j = 1; j <= g; j++)
                     if (i + 1 == matcher.end(j))
                        System.out.print(')');
               }
               System.out.println();
            }
         }
         else
            System.out.println("No match");
      }
   }
}

Usually, you don't want to match the entire input against a regular expression. Instead, you want to find one or more matching substrings in the input. Use the find method of the Matcher class to find the next match. If it returns true, use the start and end methods to find the extent of the match.

   while (matcher.find())
   {
      int start = matcher.start();
      int end = matcher.end();
      String match = input.substring(start, end);
      ...
   }
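A self-contained version of the find loop might look as follows. It is a sketch: the helper name findAll and the sample inputs are assumptions. The input parameter is declared as CharSequence, so a String or a StringBuilder can be passed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllDemo {
    // Collects every substring of input that matches regex, in order of occurrence.
    static List<String> findAll(String regex, CharSequence input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            int start = matcher.start();
            int end = matcher.end();
            result.add(input.subSequence(start, end).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[0-9]+", "a1 b22 c333")); // [1, 22, 333]

        StringBuilder builder = new StringBuilder("x=10, y=20");
        System.out.println(findAll("[a-z]=[0-9]+", builder)); // [x=10, y=20]
    }
}
```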

Example 12-10 puts this mechanism to work. It locates all hypertext references in a web page and prints them. To run the program, supply a URL on the command line, such as
java HrefMatch http://www.horstmann.com

Example 12-10. HrefMatch.java
import java.io.*;
import java.net.*;
import java.util.regex.*;

/**
   This program displays all URLs in a web page by
   matching a regular expression that describes the
   <a href=...> HTML tag. Start the program as
   java HrefMatch URL
*/
public class HrefMatch
{
   public static void main(String[] args)
   {
      try
      {
         // get URL string from command line or use default
         String urlString;
         if (args.length > 0) urlString = args[0];
         else urlString = "http://java.sun.com";

         // open reader for URL
         InputStreamReader in = new InputStreamReader(new URL(urlString).openStream());

         // read contents into string builder
         StringBuilder input = new StringBuilder();
         int ch;
         while ((ch = in.read()) != -1) input.append((char) ch);

         // search for all occurrences of pattern:
         // the href value is either a quoted string or a run of characters without whitespace or >
         String patternString = "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>";
         Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
         Matcher matcher = pattern.matcher(input);

         while (matcher.find())
         {
            int start = matcher.start();
            int end = matcher.end();
            String match = input.substring(start, end);
            System.out.println(match);
         }
      }
      catch (IOException e)
      {
         e.printStackTrace();
      }
      catch (PatternSyntaxException e)
      {
         e.printStackTrace();
      }
   }
}

The replaceAll method of the Matcher class replaces all occurrences of a regular expression with a replacement string. For example, the following instructions replace all sequences of digits with a # character.

   Pattern pattern = Pattern.compile("[0-9]+");
   Matcher matcher = pattern.matcher(input);
   String output = matcher.replaceAll("#");

The replacement string can contain references to groups in the pattern: $n is replaced with the nth group. Use \$ to include a literal $ character in the replacement text.
The replaceFirst method replaces only the first occurrence of the pattern.
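The replacement rules can be sketched as follows. The sample patterns and inputs are assumptions chosen for illustration. Matcher.quoteReplacement is a standard helper that escapes $ and \ in a replacement string that should be taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Replaces every run of digits with #.
    static String maskDigits(String input) {
        return Pattern.compile("[0-9]+").matcher(input).replaceAll("#");
    }

    // Replaces only the first run of digits with #.
    static String maskFirstDigits(String input) {
        return Pattern.compile("[0-9]+").matcher(input).replaceFirst("#");
    }

    // $2 and $1 refer to the second and first group of the pattern.
    static String swapNames(String input) {
        return Pattern.compile("(\\w+) (\\w+)").matcher(input).replaceAll("$2, $1");
    }

    // \$ inserts a literal dollar sign; $1 inserts the first group.
    static String toDollars(String input) {
        return Pattern.compile("([0-9]+) USD").matcher(input).replaceAll("\\$$1");
    }

    // quoteReplacement makes an arbitrary string safe to use as a literal replacement.
    static String fillPlaceholder(String input, String literal) {
        return Pattern.compile("PRICE").matcher(input)
                      .replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("room 12, floor 3"));         // room #, floor #
        System.out.println(maskFirstDigits("1 2 3"));               // # 2 3
        System.out.println(swapNames("John Smith"));                // Smith, John
        System.out.println(toDollars("15 USD"));                    // $15
        System.out.println(fillPlaceholder("cost: PRICE", "$5"));   // cost: $5
    }
}
```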

Finally, the Pattern class has a split method that works like a more powerful string tokenizer. It splits the input into an array of strings, using the regular expression matches as boundaries. For example, the following instructions split the input into tokens, where the delimiters are punctuation marks surrounded by optional whitespace.

   Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
   String[] tokens = pattern.split(input);
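Applied to a concrete input, the split call behaves as in this sketch. The sample sentence and the names SplitDemo and tokenize are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // Delimiter: one punctuation mark with optional whitespace on either side.
    static final Pattern DELIMITER = Pattern.compile("\\s*\\p{Punct}\\s*");

    // Splits the input at every delimiter; the delimiters are not part of the tokens.
    static String[] tokenize(CharSequence input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("one, two ; three.four");
        System.out.println(Arrays.toString(tokens)); // [one, two, three, four]
    }
}
```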

Class

java.util.regex.Pattern 1.4
Methods
static Pattern compile(String expression)
static Pattern compile(String expression, int flags)
Compiles the regular expression string into a pattern object for fast processing of matches.
Parameters:
expression   The regular expression
flags        One or more of the flags CASE_INSENSITIVE, UNICODE_CASE, MULTILINE, UNIX_LINES, DOTALL, and CANON_EQ

Matcher matcher(CharSequence input)
Returns a matcher object that you can use to locate the matches of the pattern in the input.

String[] split(CharSequence input)
String[] split(CharSequence input, int limit)
Splits the input string into tokens, where the pattern specifies the form of the delimiters. Returns an array of tokens. The delimiters are not part of the tokens.
Parameters:
input   The string to be split into tokens
limit   The maximum number of strings to produce
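The effect of the limit parameter is easiest to see on a small input. This sketch assumes a plain comma as the delimiter; the class and method names are illustrative. A positive limit caps the number of strings, zero (the default of the one-argument form) drops trailing empty strings, and a negative limit keeps them.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLimitDemo {
    // Splits at commas with the given limit.
    static String[] splitAtCommas(String input, int limit) {
        return Pattern.compile(",").split(input, limit);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAtCommas("a,b,c,d", 2))); // [a, b,c,d]
        System.out.println(Arrays.toString(splitAtCommas("a,b,,", 0)));   // [a, b]
        System.out.println(Arrays.toString(splitAtCommas("a,b,,", -1)));  // [a, b, , ]
    }
}
```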

Class
java.util.regex.Matcher 1.4
Methods
boolean matches()
Returns true if the input matches the pattern.

boolean lookingAt()
Returns true if the beginning of the input matches the pattern.

boolean find()
boolean find(int start)
Attempts to find the next match and returns true if a match is found.
Parameters:
start   The index at which to start searching

int start()
int end()
Return the start position and the past-the-end position of the current match.

String group()
Returns the current match.

int groupCount()
Returns the number of groups in the input pattern.

int start(int groupIndex)
int end(int groupIndex)
Return the start position and the past-the-end position of a given group in the current match.
Parameters:
groupIndex   The group index (starting with 1), or 0 to indicate the entire match

String group(int groupIndex)
Returns the string that matches a given group.
Parameters:
groupIndex   The group index (starting with 1), or 0 to indicate the entire match

String replaceAll(String replacement)
String replaceFirst(String replacement)
Return a string obtained from the matcher input by replacing all matches, or the first match, with the replacement string.
Parameters:
replacement   The replacement string

Matcher reset()
Matcher reset(CharSequence input)
Reset the matcher state. The second method makes the matcher work on a different input. Both methods return this matcher.
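The differences between matches, lookingAt, and find, and the use of find(int) and reset, can be sketched as follows. All names and sample inputs are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsDemo {
    // Returns {matches, lookingAt, find} for the same pattern and input.
    static boolean[] compare(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean whole = m.matches();    // the entire input must match
        boolean prefix = m.lookingAt(); // a prefix of the input must match
        m.reset();                      // discard state before searching
        boolean anywhere = m.find();    // any substring may match
        return new boolean[] { whole, prefix, anywhere };
    }

    // find(int) resets the matcher and searches from the given index.
    static String firstMatchFrom(String regex, String input, int start) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(start) ? m.group() : null;
    }

    // Reuses one matcher for several inputs by calling reset(CharSequence).
    static int[] countInEach(String regex, String... inputs) {
        Matcher m = Pattern.compile(regex).matcher("");
        int[] counts = new int[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            m.reset(inputs[i]);
            while (m.find()) counts[i]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        boolean[] r = compare("[0-9]+", "123abc");
        System.out.println(r[0] + " " + r[1] + " " + r[2]);        // false true true
        System.out.println(firstMatchFrom("[0-9]+", "12 34 56", 3)); // 34
        int[] counts = countInEach("[0-9]+", "1 22 333", "none");
        System.out.println(counts[0] + " " + counts[1]);           // 3 0
    }
}
```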

(Transferred from DotJox's Blog)
