Regular expression syntax:
Syntax | Explanation
Characters:
c | The character c
\unnnn, \xnn, \0n, \0nn, \0nnn | The code unit with the given hexadecimal or octal value
\t, \n, \r, \f, \a, \e | Control characters: tab, line feed, carriage return, form feed, alert (bell), escape
\cc | The control character corresponding to the character c
Character classes:
[c1c2...] | Any of the characters represented by c1, c2, ...; each ci can be a single character, a range c1-c2, or a character class
[^...] | The complement of a character class
[...&&...] | The intersection of two character classes
Predefined character classes:
. | Any character except line terminators (any character at all when the DOTALL flag is set)
\d | A digit [0-9]
\D | A non-digit [^0-9]
\s | A whitespace character [ \t\n\r\f\x0B]
\S | A non-whitespace character
\w | A word character [a-zA-Z0-9_]
\W | A non-word character
\p{name} | A named character class
\P{name} | The complement of a named character class
Boundary matchers:
^ $ | The beginning and end of the input
\b | A word boundary
\B | A non-word boundary
\A | The beginning of the input
\z | The end of the input
\Z | The end of the input except for the final line terminator
\G | The end of the previous match
Quantifiers:
X? | Optional X
X* | X repeated 0 or more times
X+ | X repeated 1 or more times
X{n}, X{n,}, X{n,m} | X repeated exactly n times, at least n times, between n and m times
Quantifier suffixes:
? | Turns the default (greedy) match into a reluctant match
+ | Turns the default (greedy) match into a possessive match
Set operations:
XY | Any string from X followed by any string from Y
X|Y | Any string from X or Y
Grouping:
(X) | Captures the string that matches X as a group
\n | The match of the nth group
Escapes:
\c | The character c (must not be an alphabetic character)
\Q...\E | Quotes ... verbatim
(?...) | Special construct
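The quantifier suffixes are easiest to understand by comparing them on the same input. The following is a minimal sketch; the class name QuantifierModes, the helper firstMatch, and the sample input are illustrative assumptions, not part of the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>x</b>";
        // Greedy: .+ takes as much as possible, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));
        // Reluctant: .+? takes as little as possible, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", input));
        // Possessive: .++ takes everything and never gives it back, so '>' cannot match.
        System.out.println(firstMatch("<.++>", input));
    }
}
```

The possessive form finds no match at all here, which is why it is mainly useful when you know backtracking cannot help.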
1. Designing a simple expression that matches any phone number can be complicated, because phone numbers come in many different formats, so we must choose a pattern that covers them efficiently. For example, some people would consider (212) 555-1212, 212-555-1212 and 212 555 1212 all equivalent.
First let's build a regular expression. For simplicity, we start with one that recognizes phone numbers in the following format: (nnn) nnn-nnnn.
The first step is to create a Pattern object that matches the substring above. Once the program is running, you can generalize the object if needed. A regular expression that matches the format above can be written as (\d{3})\s\d{3}-\d{4}, where the \d single-character type matches any digit from 0 to 9, and the {3} repetition symbol is a convenient notation for 3 consecutive digits, equivalent to \d\d\d. \s is another useful single-character type; it matches whitespace such as a space, a tab, or a line break.
Simple, isn't it? However, to use this regular expression in a Java program, there are two things to take care of. First, the Java compiler gives the character after a backslash (\) in a string literal a special meaning, so a single backslash does not reach the regex package intact. To avoid this, write a double backslash (\\) so that one backslash is passed through to the Pattern object. Second, parentheses have a special meaning in a regular expression (grouping); if you want them to be interpreted literally, as actual parentheses, you need to precede them with a double backslash (\\) as well. The pattern then becomes:
\\(\\d{3}\\)\\s\\d{3}-\\d{4}
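The escaping rule can be checked directly: the string literal in the source contains double backslashes, while the compiled pattern contains single ones. This is a small sketch; the class name EscapeDemo and the sample inputs are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // In Java source, "\\(" is the two-character regex \( , a literal opening parenthesis.
    static final Pattern PHONE = Pattern.compile("\\(\\d{3}\\)\\s\\d{3}-\\d{4}");

    // True only if the whole string has the form (nnn) nnn-nnnn.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Prints the regex as the regex package sees it, with single backslashes.
        System.out.println(PHONE.pattern());
        System.out.println(isPhone("(212) 555-1212"));
        System.out.println(isPhone("212-555-1212"));
    }
}
```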
The following code reads a text file line by line, searches each line for a phone number, and prints every match it finds to the console.
    package perl;

    import java.io.*;
    import java.util.regex.*;

    public class Test_number {
        public static void main(String[] args) throws IOException {
            // Matches phone numbers of the form (nnn) nnn-nnnn
            Pattern pattern = Pattern.compile("\\(\\d{3}\\)\\s\\d{3}-\\d{4}");
            BufferedReader in = new BufferedReader(
                    new FileReader("c:/users/liuzhongfeng/desktop/file/java/phone.txt"));
            String s;
            // Read the file line by line and print the first match on each line
            while ((s = in.readLine()) != null) {
                Matcher matcher = pattern.matcher(s);
                if (matcher.find()) {
                    System.out.println(matcher.group());
                }
            }
            in.close();
        }
    }
The result is:
(121) 525-1111(545-5555) 545-5555 : (121) 525-1111 ( -545-5555) 545-55555(000)-545-5555
The find() method searches the target string for the next subsequence that matches the regular expression, and the group() method returns a string containing the matched text. Note that the code above works only when each line contains at most one matching phone number. Java's regular expression package can, of course, also be used when a line contains several matches. Pretty neat! Unfortunately, this is only a minimal phone number matcher.
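To handle a line with several phone numbers, find() can be called in a loop until it returns false. The following is a sketch under that assumption; the class name AllPhonesOnLine, the helper findAll, and the sample line are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllPhonesOnLine {
    static final Pattern PHONE = Pattern.compile("\\(\\d{3}\\)\\s\\d{3}-\\d{4}");

    // Collects every match on the line by calling find() until it returns false.
    static List<String> findAll(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PHONE.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("home (212) 555-1212, work (646) 555-0100"));
    }
}
```

Replacing the if in the file-reading program with such a while loop would print every number on a line instead of only the first.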
Obviously, there are several points that can be improved. First, there may or may not be a space between the area code at the beginning of the phone number and the local number. We can match both cases by using \s? in the regular expression, where the ? metacharacter indicates that the pattern may contain 0 or 1 whitespace characters at that point.
Second, the first three digits and the last four digits of the local number may be separated by a space rather than a hyphen, or by nothing at all, so that the 7 digits run together. We can handle these cases with (-| )?. This construct is an alternation, and it matches all of the cases just mentioned: the pipe character | inside the parentheses lets it match either a hyphen or a space, and the trailing ? metacharacter allows the separator to be absent altogether. Finally, the area code may not be enclosed in parentheses. This could be handled by simply appending ? to each parenthesis, but that is not a good solution, because it would also accept unpaired parentheses such as "(555" or "555)". Instead, we can use another alternation to force the parentheses to appear either as a pair or not at all: (\(\d{3}\)|\d{3}). If we replace the regular expression in the code above with these improvements, the code becomes a much more useful phone number matcher: Pattern pattern = Pattern.compile("(\\(\\d{3}\\)|\\d{3})\\s?\\d{3}(-| )?\\d{4}");
Of course, you can try to refine the code above further yourself.
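The improved pattern can be tried against a few sample inputs. This sketch uses matches() rather than find() so that the whole string must fit the pattern; the class name ImprovedPhonePattern and the sample strings are assumptions for illustration.

```java
import java.util.regex.Pattern;

class ImprovedPhonePattern {
    // Area code with or without a matched pair of parentheses, optional whitespace,
    // then three digits, an optional hyphen or space, and four digits.
    static final Pattern PHONE =
            Pattern.compile("(\\(\\d{3}\\)|\\d{3})\\s?\\d{3}(-| )?\\d{4}");

    // True only if the entire string is a phone number in one of the accepted forms.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "(212) 555-1212", "(212)5551212", "212 555 1212",
            "2125551212", "(212 555-1212", "212-555-1212"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isPhone(s));
        }
    }
}
```

The unpaired "(212 555-1212" is rejected as intended. Note that 212-555-1212 is also rejected, because only whitespace is allowed after the area code; allowing a hyphen there is one possible further refinement.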
2. The second example is adapted from Friedl. It checks whether a text contains duplicated words, a problem that often comes up in typesetting and is also a task for grammar checkers.
There are several regular expressions that match words. The most direct is probably \b\w+\b, which has the advantage of needing only a few regex metacharacters. The \w metacharacter matches any word character: a letter from a to z or A to Z, a digit, or an underscore. The + metacharacter means one or more of the preceding character, and the \b metacharacter marks a word boundary, the position between a word character and a non-word character such as a space or punctuation (commas, periods, and so on).
Parentheses have several uses in regular expressions. One is to form a group, which saves the text of a match or partial match so that it can be used later, even within the same pattern. A single regular expression can (and often does) contain more than one group. The text matched by the nth group can be retrieved with a back reference. Back references make searching for duplicate words very simple: \b(\w+)\s+\1\b. The back reference \1 refers to whatever word was matched by \w+. Our regular expression therefore matches a word, followed by one or more whitespace characters, followed by the same word again.
The final modification is to make the match case-insensitive, so that it handles a case such as "The the theme of this article is the Java's regex package." This is very simple to do by passing the predefined static flag CASE_INSENSITIVE of the Pattern class to compile():
Pattern pattern = Pattern.compile("\\b(\\w+)\\s+\\1\\b", Pattern.CASE_INSENSITIVE);
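Put together, the duplicate-word check might look like the sketch below. The class name DuplicateWords and the helper findDuplicates are hypothetical; the pattern and the sample sentence are the ones from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWords {
    // \1 refers back to the word captured by (\w+); the flag makes "The the" count as a repeat.
    static final Pattern DUPLICATE =
            Pattern.compile("\\b(\\w+)\\s+\\1\\b", Pattern.CASE_INSENSITIVE);

    // Returns the first word of every immediately repeated pair found in the text.
    static List<String> findDuplicates(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = DUPLICATE.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(
                "The the theme of this article is the Java's regex package."));
    }
}
```

The trailing \b is what keeps "the theme" from being reported as a repeat of "the".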
2. CharSequence
JDK 1.4 defines a new interface, CharSequence, which provides an abstraction of a sequence of characters over the String and StringBuffer classes.
interface CharSequence {
    char charAt(int i);
    int length();
    CharSequence subSequence(int start, int end);
    String toString();
}
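Because String and StringBuffer both implement CharSequence, a method written against the interface accepts either, and so does Pattern.matcher(). A minimal sketch follows; the class name CharSequenceDemo and the helper countDigits are assumptions for illustration.

```java
import java.util.regex.Pattern;

class CharSequenceDemo {
    // Accepts any CharSequence, so String and StringBuffer can both be passed in.
    static int countDigits(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isDigit(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "call 555-1212";
        StringBuffer sb = new StringBuffer("room 42");
        System.out.println(countDigits(s));
        System.out.println(countDigits(sb));
        // Pattern.matcher() takes a CharSequence too, so no toString() is needed.
        System.out.println(Pattern.compile("\\d+").matcher(sb).find());
    }
}
```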
3. Pattern and Matcher
Let us start with an example: a program that tests whether a regular expression matches a string. The first argument is the string to match, followed by one or more regular expressions. In a Unix/Linux environment, regular expressions given on the command line must also be quoted.
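The program itself is not reproduced in this text. The following is only a sketch of what such a test program might look like, written from the description above; the class name TestRegularExpression, the helper describeMatches, and the output format are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TestRegularExpression {
    // Builds one line per match: the matched text and its start and end positions.
    static String describeMatches(String input, String regex) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.append("Match \"").append(m.group())
               .append("\" at positions ").append(m.start())
               .append("-").append(m.end() - 1).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java TestRegularExpression input regex [regex...]");
            return;
        }
        System.out.println("Input: \"" + args[0] + "\"");
        // Every argument after the first is treated as a regular expression.
        for (int i = 1; i < args.length; i++) {
            System.out.println("Regular expression: \"" + args[i] + "\"");
            System.out.print(describeMatches(args[0], args[i]));
        }
    }
}
```

It could be run as, for example, java TestRegularExpression abcabcabcdefabc "abc+" "(abc)+", with the quotes protecting the expressions from the shell.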
Java's regular expressions are implemented by the Pattern and Matcher classes in java.util.regex. A Pattern object represents a compiled regular expression. The static compile() method compiles a string representing a regular expression into a Pattern object. As the examples above show, passing a string to the Pattern's matcher() method yields a Matcher object, which can then be used to query the result of the match with the following methods:
boolean matches()
boolean lookingAt()
boolean find()
boolean find(int start)
matches() succeeds only if the pattern matches the entire string, while lookingAt() succeeds if the pattern matches the beginning of the string. Matcher.find() discovers successive character sequences in the CharSequence that match the pattern, and find(int start) resets the matcher and begins the search at the given index. For example:
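The sketch below contrasts the four methods on one input. The class name MatchKinds, its helper methods, and the sample string are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKinds {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the pattern must cover the entire input.
    static boolean wholeMatch(String s) {
        return DIGITS.matcher(s).matches();
    }

    // lookingAt(): the pattern must match at the start, but need not reach the end.
    static boolean prefixMatch(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // find(): each call moves on to the next match anywhere in the input.
    static List<String> allMatches(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // find(int start): resets the matcher and searches from the given index.
    static String firstMatchFrom(String s, int start) {
        Matcher m = DIGITS.matcher(s);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "123 abc 456";
        System.out.println(wholeMatch(input));
        System.out.println(prefixMatch(input));
        System.out.println(allMatches(input));
        System.out.println(firstMatchFrom(input, 4));
    }
}
```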