Regular expression: an expression that follows specific rules.
Role: dedicated to manipulating strings.
Characteristics: specific symbols stand for certain code operations, which simplifies what has to be written. Learning regular expressions therefore means learning how to use these special symbols.
Benefit: simplifies complex operations on strings.
Drawback: the more symbols a regular expression defines, the longer it gets and the harder it is to read.
Specific functions:
1. Matching: String's matches method. It matches the entire string against the rule; as soon as one position fails to match, matching stops and false is returned.
2. Cutting: String's split method. To cut on repeated characters, the result of part of the rule has to be reused, so that part is encapsulated in a group with (). Groups are numbered automatically, starting from 1, and an existing group is referenced with \n (n is the group number). For example, "(.)\\1+" cuts on any character that is repeated two or more times.
3. Replacing: String's replaceAll method. For example, a run of repeated characters can be replaced with a single character; in the replacement string, $n gets the content of group n.
4. Getting: take out the substrings that match the rule.
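The first three functions can be sketched as follows. This is a minimal illustration, not the original author's code: the class name RegexFunctionsSketch, the method names, and the sample strings are assumptions made for the example.

```java
import java.util.Arrays;

class RegexFunctionsSketch {
    // 1. Matching: the whole string must fit the rule.
    static boolean isQqNumber(String s) {
        return s.matches("[1-9][0-9]{4,14}");
    }

    // 2. Cutting: split wherever one character is repeated two or more times.
    static String[] cutOnRepeats(String s) {
        return s.split("(.)\\1+");
    }

    // 3. Replacing: collapse each run of a repeated character into one.
    //    In the replacement string, $1 is the content of group 1.
    static String collapseRepeats(String s) {
        return s.replaceAll("(.)\\1+", "$1");
    }

    public static void main(String[] args) {
        String sample = "erkktyqqquizzzzzo"; // hypothetical sample input
        System.out.println(isQqNumber("12345"));                  // true
        System.out.println(Arrays.toString(cutOnRepeats(sample))); // [er, ty, ui, o]
        System.out.println(collapseRepeats(sample));               // erktyquizo
    }
}
```

The fourth function, getting, needs Pattern and Matcher rather than a String method.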
Steps for getting:
A. Encapsulate the regular expression as an object (Pattern).
B. Associate the regex object with the string to operate on.
C. After association, obtain the regex matching engine (Matcher).
D. Use the engine to operate on the substrings that match the rule, for example to take them out.
Getting example:
import java.util.regex.*;

class RegexDemo2
{
    public static void main(String[] args)
    {
        getDemo();
    }

    public static void getDemo()
    {
        String str = "ni hao ma ye xu CEng jing de ni yao Hao yi Xie!";
        System.out.println(str);
        // Words of exactly three letters, delimited by word boundaries.
        String regex = "\\b[a-zA-Z]{3}\\b";
        Pattern p = Pattern.compile(regex); // encapsulate the regular expression as an object
        Matcher m = p.matcher(str); // associate the regex object with the string to operate on
        // find() moves to the next matching substring; group() returns it.
        while (m.find())
        {
            System.out.println(m.group());
            // start() is the inclusive start index, end() the exclusive end index
            System.out.println(m.start() + "----" + m.end());
        }
    }
}
Summary of regular-expression constructs
Construct       Matches

Characters
x               The character x
\\              The backslash character
\0n             The character with octal value 0n (0 <= n <= 7)
\0nn            The character with octal value 0nn (0 <= n <= 7)
\0mnn           The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh            The character with hexadecimal value 0xhh
\uhhhh          The character with hexadecimal value 0xhhhh
\t              The tab character ('\u0009')
\n              The newline (line feed) character ('\u000A')
\r              The carriage-return character ('\u000D')
\f              The form-feed character ('\u000C')
\a              The alert (bell) character ('\u0007')
\e              The escape character ('\u001B')
\cx             The control character corresponding to x
Character classes
[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)
Predefined character classes
.               Any character (may or may not match line terminators)
\d              A digit: [0-9]
\D              A non-digit: [^0-9]
\s              A whitespace character: [ \t\n\x0B\f\r]
\S              A non-whitespace character: [^\s]
\w              A word character: [a-zA-Z_0-9]
\W              A non-word character: [^\w]
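A short sketch of the predefined classes in use with String methods. The class name PredefinedClassSketch, the helper names, and the sample strings are assumptions made for illustration.

```java
import java.util.Arrays;

class PredefinedClassSketch {
    // \d+ : one or more digits, and nothing else.
    static boolean isAllDigits(String s) {
        return s.matches("\\d+");
    }

    // \s+ : cut on any run of whitespace (spaces, tabs, line breaks).
    static String[] splitOnWhitespace(String s) {
        return s.split("\\s+");
    }

    // \D : remove every character that is not a digit.
    static String keepDigits(String s) {
        return s.replaceAll("\\D", "");
    }

    // \w+ : letters, digits, and underscore only.
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2024"));                         // true
        System.out.println(Arrays.toString(splitOnWhitespace("a b\tc"))); // [a, b, c]
        System.out.println(keepDigits("a1b22c"));                        // 122
        System.out.println(isWord("hello_world1"));                      // true
        System.out.println(isWord("hello-world"));                       // false
    }
}
```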
POSIX character classes (US-ASCII only)
\p{Lower}       A lowercase alphabetic character: [a-z]
\p{Upper}       An uppercase alphabetic character: [A-Z]
\p{ASCII}       All ASCII: [\x00-\x7F]
\p{Alpha}       An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}       A decimal digit: [0-9]
\p{Alnum}       An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}       Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}       A visible character: [\p{Alnum}\p{Punct}]
\p{Print}       A printable character: [\p{Graph}\x20]
\p{Blank}       A space or a tab: [ \t]
\p{Cntrl}       A control character: [\x00-\x1F\x7F]
\p{XDigit}      A hexadecimal digit: [0-9a-fA-F]
\p{Space}       A whitespace character: [ \t\n\x0B\f\r]
java.lang.Character classes (simple Java character type)
\p{javaLowerCase}   Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}    Equivalent to java.lang.Character.isMirrored()
Classes for Unicode blocks and categories
\p{InGreek}         A character in the Greek block (simple block)
\p{Lu}              An uppercase letter (simple category)
\p{Sc}              A currency symbol
\P{InGreek}         Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
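The difference between the POSIX classes (US-ASCII only), the java.lang.Character classes, and the Unicode classes can be seen on a few single characters. This is a sketch; the class name PropertyClassSketch and the helper is() are hypothetical, and non-ASCII characters are written as Unicode escapes to avoid source-encoding problems.

```java
class PropertyClassSketch {
    // True if the single character c belongs to the given class.
    static boolean is(String charClass, String c) {
        return c.matches(charClass);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // lowercase e with acute accent
        String alpha = "\u03b1";  // Greek small letter alpha

        System.out.println(is("\\p{Lower}", "a"));            // true: ASCII lowercase
        System.out.println(is("\\p{Lower}", eAcute));         // false: not US-ASCII
        System.out.println(is("\\p{javaLowerCase}", eAcute)); // true: Character.isLowerCase
        System.out.println(is("\\p{InGreek}", alpha));        // true: Greek block
        System.out.println(is("\\P{InGreek}", "a"));          // true: negation
        System.out.println(is("\\p{Lu}", "A"));               // true: uppercase letter
        System.out.println(is("\\p{Sc}", "$"));               // true: currency symbol
        System.out.println(is("[\\p{L}&&[^\\p{Lu}]]", "A"));  // false: uppercase is subtracted
        System.out.println(is("\\p{XDigit}+", "ff0A"));       // true: hexadecimal digits
    }
}
```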
Boundary matchers
^               The beginning of a line
$               The end of a line
\b              A word boundary
\B              A non-word boundary
\A              The beginning of the input
\G              The end of the previous match
\Z              The end of the input but for the final terminator, if any
\z              The end of the input
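A few of the boundary matchers can be compared side by side. The sketch below assumes a hypothetical helper findAll() that collects every match; the class name and the sample inputs are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
    // Collects every substring of input that the regex finds.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \b: "cat" only as a whole word, so the one inside "concat" is skipped.
        System.out.println(findAll("\\bcat\\b", "cat concat cat")); // [cat, cat]
        // \B: "cat" only when it does not start at a word boundary.
        System.out.println(findAll("\\Bcat", "cat concat"));        // [cat]
        // \G: each match must begin where the previous one ended.
        System.out.println(findAll("\\G\\d", "123a45"));            // [1, 2, 3]
        // \Z tolerates a final line terminator, \z does not.
        System.out.println(findAll("end\\Z", "end\n").size());      // 1
        System.out.println(findAll("end\\z", "end\n").size());      // 0
    }
}
```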
Greedy quantifiers
X?              X, once or not at all
X*              X, zero or more times
X+              X, one or more times
X{n}            X, exactly n times
X{n,}           X, at least n times
X{n,m}          X, at least n but not more than m times

Reluctant quantifiers
X??             X, once or not at all
X*?             X, zero or more times
X+?             X, one or more times
X{n}?           X, exactly n times
X{n,}?          X, at least n times
X{n,m}?         X, at least n but not more than m times

Possessive quantifiers
X?+             X, once or not at all
X*+             X, zero or more times
X++             X, one or more times
X{n}+           X, exactly n times
X{n,}+          X, at least n times
X{n,m}+         X, at least n but not more than m times
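The three quantifier families accept the same repetition counts; they differ in how much input they take and whether they give any back. A minimal sketch, assuming a hypothetical helper firstMatch() and an illustrative input string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierSketch {
    // Returns the first substring the regex finds, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off just enough for "foo".
        System.out.println(firstMatch(".*foo", input));  // xfooxxxxxxfoo
        // Reluctant: .*? takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input)); // xfoo
        // Possessive: .*+ takes everything and never gives it back, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input)); // null
    }
}
```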
Logical operators
XY              X followed by Y
X|Y             Either X or Y
(X)             X, as a capturing group

Back references
\n              Whatever the nth capturing group matched

Quotation
\               Nothing, but quotes the following character
\Q              Nothing, but quotes all characters until \E
\E              Nothing, but ends quoting started by \Q
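Quotation matters when the text to match contains regex metacharacters. The sketch below uses \Q...\E directly and through Pattern.quote, which wraps a literal for this purpose; the class name QuoteSketch, the helper name, and the sample strings are assumptions.

```java
import java.util.regex.Pattern;

class QuoteSketch {
    // Matches input against literal taken character for character.
    static boolean matchesLiterally(String literal, String input) {
        return input.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // Unquoted, + is a quantifier: the rule means "one or more 1s, then 1=2".
        System.out.println("1+1=2".matches("1+1=2"));           // false
        System.out.println("11=2".matches("1+1=2"));            // true
        // Quoted with \Q...\E, every character stands for itself.
        System.out.println("1+1=2".matches("\\Q1+1=2\\E"));     // true
        System.out.println(matchesLiterally("1+1=2", "1+1=2")); // true
        // A single backslash quotes just the next character.
        System.out.println("a.b".matches("a\\.b"));             // true
        System.out.println("axb".matches("a\\.b"));             // false
    }
}
```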
Special constructs (non-capturing)
(?:X)                 X, as a non-capturing group
(?idmsux-idmsux)      Nothing, but turns match flags i d m s u x on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags i d m s u x on - off
(?=X)                 X, via zero-width positive lookahead
(?!X)                 X, via zero-width negative lookahead
(?<=X)                X, via zero-width positive lookbehind
(?<!X)                X, via zero-width negative lookbehind
(?>X)                 X, as an independent, non-capturing group
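Lookahead and lookbehind check the surrounding text without making it part of the match. A small sketch, assuming a hypothetical helper firstMatch() and reusing a sample address of the kind validated elsewhere in these notes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialConstructSketch {
    // Returns the first substring the regex finds, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Number of capturing groups declared in the regex.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String mail = "abc12@sina.com";
        // Positive lookahead: word characters that are followed by '@'; '@' is not consumed.
        System.out.println(firstMatch("\\w+(?=@)", mail));   // abc12
        // Positive lookbehind: word characters that are preceded by '@'.
        System.out.println(firstMatch("(?<=@)\\w+", mail));  // sina
        // Embedded flag i: case-insensitive matching.
        System.out.println("HeLLo".matches("(?i)hello"));    // true
        // (?:...) groups without capturing, so only (c) is counted.
        System.out.println(capturingGroups("(?:ab)+(c)"));   // 1
    }
}
```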
Backslashes, escapes, and quoting
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, and to quote characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.
It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions of the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.
Backslashes within string literals in Java source code are interpreted, as required by the Java Language Specification, as either Unicode escapes or other character escapes. Backslashes in string literals that represent regular expressions must therefore be doubled to protect them from interpretation by the Java bytecode compiler. For example, the string literal "\b" matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
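The two layers of escaping, first by the Java compiler and then by the regex engine, can be checked directly. This is a sketch; the class name BackslashSketch and the constant names are hypothetical.

```java
class BackslashSketch {
    // Java literal "\\b" is the two characters \ and b: the regex for a word boundary.
    static final String WORD_BOUNDARY = "\\b";
    // Java literal "\b" is the single backspace character (U+0008).
    static final String BACKSPACE = "\b";
    // Regex \(hello\): matches the literal text (hello).
    static final String PAREN_HELLO = "\\(hello\\)";
    // Regex \\: matches one backslash. Four backslashes are needed in the source.
    static final String ONE_BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(WORD_BOUNDARY.length());         // 2
        System.out.println(BACKSPACE.length());             // 1
        System.out.println("(hello)".matches(PAREN_HELLO)); // true
        System.out.println("\\".matches(ONE_BACKSLASH));    // true
        System.out.println("\b".matches(BACKSPACE));        // true: backspace matches itself
        System.out.println("{".matches("\\{"));             // true: \{ is a literal left brace
    }
}
```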
Character classes
Character classes may appear within other character classes, and may be composed with the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of the character-class operators is as follows, from highest to lowest:
1   Literal escape      \x
2   Grouping            [...]
3   Range               a-z
4   Union               [a-e][i-u]
5   Intersection        [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside one. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
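Union, intersection, subtraction, and the changed meaning of . and - inside a class can be tried on single characters. A minimal sketch; the class name CharClassSketch and the helper inClass() are assumptions.

```java
class CharClassSketch {
    // True if the single character c belongs to the given character class.
    static boolean inClass(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        // Union: a-d or m-p.
        System.out.println(inClass("[a-d[m-p]]", 'n'));   // true
        System.out.println(inClass("[a-d[m-p]]", 'e'));   // false
        // Intersection: only d, e, f survive.
        System.out.println(inClass("[a-z&&[def]]", 'd')); // true
        System.out.println(inClass("[a-z&&[def]]", 'a')); // false
        // Subtraction: a-z without b and c.
        System.out.println(inClass("[a-z&&[^bc]]", 'b')); // false
        // Inside a class, . is a literal dot, not "any character".
        System.out.println(inClass("[.]", 'x'));          // false
        System.out.println(inClass(".", 'x'));            // true
        // Inside a class, - forms a range, so a literal '-' is not in [a-c].
        System.out.println(inClass("[a-c]", '-'));        // false
    }
}
```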
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators: a newline (line feed) character ('\n'), a carriage-return character followed immediately by a newline character ("\r\n"), a standalone carriage-return character ('\r'), a next-line character ('\u0085'), a line-separator character ('\u2028'), or a paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, the newline character is the only line terminator recognized.
The regular expression . matches any character except a line terminator, unless the DOTALL flag is specified.
By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator except at the end of the input. In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
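The effect of the MULTILINE, DOTALL, and UNIX_LINES flags can be observed on a short multi-line input. The sketch below assumes a hypothetical helper findAll() that takes the flags as an int (0 means no flags); the class name and inputs are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineModeSketch {
    // Collects every match of regex in input, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // True if the whole input matches regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Default: ^ and $ apply to the whole input only.
        System.out.println(findAll("^\\w+", 0, text));                 // [one]
        System.out.println(findAll("\\w+$", 0, text));                 // [three]
        // MULTILINE: ^ and $ apply to every line.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, text)); // [one, two, three]
        // DOTALL: . also matches line terminators.
        System.out.println(matches("one.two", 0, "one\ntwo"));              // false
        System.out.println(matches("one.two", Pattern.DOTALL, "one\ntwo")); // true
        // UNIX_LINES: only '\n' is a line terminator, so . matches '\r'.
        System.out.println(matches("a.b", 0, "a\rb"));                  // false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb")); // true
    }
}
```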
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1   ((A)(B(C)))
2   (A)
3   (B(C))
4   (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression through a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, its previously captured value, if any, is retained if the second evaluation fails. For example, matching the string "aba" against the expression (a(b)?)+ leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
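The numbering rule and the "most recently matched" rule can both be checked by printing every group after a match. A minimal sketch, assuming a hypothetical helper groupsOf(); the class name is illustrative.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
    // Returns groups 0..n after matching the whole input, or null if it does not match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groupsOf("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // Group 2 keeps "b" from the first pass, because (b)? matches nothing on the second.
        System.out.println(Arrays.toString(groupsOf("(a(b)?)+", "aba")));
        // prints [aba, a, b]
    }
}
```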
Example: checking a QQ number
class RegexDemo
{
    public static void main(String[] args)
    {
        checkQQ();
    }

    public static void checkQQ()
    {
        String qq = "534946o910"; // contains the letter 'o', so it is rejected
        // First digit 1-9, followed by 4 to 14 more digits (5 to 15 digits in total).
        String regex = "[1-9][0-9]{4,14}";
        boolean flag = qq.matches(regex);
        if (flag)
        {
            System.out.println("qq=" + qq + " is ok!");
        }
        else
        {
            System.out.println("The QQ number you entered, " + qq + ", is illegal");
        }
    }
}
/*
 * Requirement:
 * Turn the following string into: 我要学编程 ("I want to learn programming").
 * Which of the four functions should be used? One, or several?
 * Way of thinking:
 * 1. To know only whether a string is right or wrong, use matching.
 * 2. To change an existing string into another string, use replacing.
 * 3. To turn a string into several strings in a custom way, use cutting; it gets the substrings other than the rule.
 * 4. To get the substrings that meet a requirement, use getting; it gets the substrings that match the rule.
 */
class RegexTest1
{
    public static void main(String[] args)
    {
        test_1();
    }

    public static void test_1()
    {
        // Stuttered input: repeated characters separated by runs of dots.
        String str = "我我...我我...我要..要要...要要...学学学....学学...编编编...编程..程.程程...程...程";
        // Step 1: remove every run of dots.
        String regex = "\\.+";
        str = str.replaceAll(regex, "");
        // Step 2: collapse each run of a repeated character into one; $1 is group 1.
        str = str.replaceAll("(.)\\1+", "$1");
        System.out.println(str);
    }
}
/*
Requirement: verify an e-mail address.
*/
public static void checkMail()
{
    String mail = "abc12@sina.com";
    mail = "1@1.1";
    String reg = "[a-zA-Z0-9_]+@[a-zA-Z0-9]+(\\.[a-zA-Z]+)+"; // a more exact match
    reg = "\\w+@\\w+(\\.\\w+)+"; // a relatively loose match
    // mail.indexOf("@") != -1 would only check that an '@' is present.
    System.out.println(mail.matches(reg));
}
Web crawler (spider):
import java.io.*;
import java.util.regex.*;
import java.net.*;
import java.util.*;

class RegexTest2
{
    public static void main(String[] args) throws Exception
    {
        getMails_1();
    }

    // Reads a web page line by line and prints every e-mail address found in it.
    public static void getMails_1() throws Exception
    {
        URL url = new URL("http://192.168.1.254:8080/myweb/mail.html");
        URLConnection conn = url.openConnection();
        BufferedReader bufIn = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line = null;
        String mailReg = "\\w+@\\w+(\\.\\w+)+";
        Pattern p = Pattern.compile(mailReg);
        while ((line = bufIn.readLine()) != null)
        {
            Matcher m = p.matcher(line);
            while (m.find())
            {
                System.out.println(m.group());
            }
        }
        bufIn.close();
    }

    /*
    Gets the e-mail addresses in the specified document.
    Uses the getting function: Pattern and Matcher.
    */
    public static void getMails() throws Exception
    {
        BufferedReader bufr =
            new BufferedReader(new FileReader("mail.txt"));
        String line = null;
        String mailReg = "\\w+@\\w+(\\.\\w+)+";
        Pattern p = Pattern.compile(mailReg);
        while ((line = bufr.readLine()) != null)
        {
            Matcher m = p.matcher(line);
            while (m.find())
            {
                System.out.println(m.group());
            }
        }
        bufr.close();
    }
}