PS: In all examples, the text matched by the regular expression is enclosed in [ and ] in the source text. Some examples use Java; where a regular expression is used in Java, this is noted in the corresponding place. All Java examples were tested under JDK 1.6.0_13.
1. Escape special characters
Metacharacters are characters that have a special meaning in regular expressions, so they cannot be used to represent themselves directly. To match a metacharacter literally, escape it by putting a backslash in front of it; the escape sequence then matches the character itself rather than acting on its special meaning. For example, to match [ and ], you must escape them: \[ and \].
Escaping uses the backslash (\), which means that \ is itself a metacharacter. To match a literal \, it must be escaped as \\, for example when matching a Windows file path.
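As a minimal sketch of the Windows path case in Java: the regular expression \\ matches one literal backslash, and because a Java string literal also uses the backslash as its escape character, each of the two has to be doubled again in source code. The class name, method names, and sample path below are illustrative assumptions, not part of the original examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex \\ matches a single literal backslash; in Java source
    // it is written as "\\\\" because the string literal escapes it again.
    static int countBackslashes(String path) {
        Matcher m = Pattern.compile("\\\\").matcher(path);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Replaces every backslash in the path with a forward slash.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        String path = "C:\\Windows\\System32"; // hypothetical sample path
        System.out.println(countBackslashes(path));  // 2
        System.out.println(toForwardSlashes(path));  // C:/Windows/System32
    }
}
```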
2. Match whitespace characters
Metacharacters can be roughly divided into two types: those used to match text (such as .), and those required by the regular expression syntax (such as [ and ]).
When searching with regular expressions, we often need to match non-printable whitespace characters in the original text. For example, we may need to find all the tabs, or all the line breaks. Such characters are difficult to type directly into a regular expression, so we can use the special metacharacters listed below instead:
[\b] backspace (the Backspace key); outside a character set, \b is a word boundary in most engines
\f form feed
\n line feed
\r carriage return
\t tab (the Tab key)
\v vertical tab
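A short Java sketch of two of these escapes in use, under the assumption that the input is a tab-separated record or a multi-line string. The class name, method names, and sample data are hypothetical; note that in Java source the backslash of the regex escape is doubled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceEscapeDemo {
    // Splits a tab-separated record into its fields using the regex \t.
    static String[] splitOnTabs(String line) {
        return line.split("\\t");
    }

    // Counts the line feed characters matched by the regex \n.
    static int countLineFeeds(String text) {
        Matcher m = Pattern.compile("\\n").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String record = "id\tname\tcity"; // hypothetical sample data
        System.out.println(splitOnTabs(record).length);          // 3
        System.out.println(countLineFeeds("one\ntwo\nthree\n")); // 3
    }
}
```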
Let's look at an example that removes the blank lines from a file:
Text:
8 5 4 1 6 3 2 7 9
7 6 2 9 5 8 3 4 1
9 3 1 4 2 7 8 5 6
6 9 3 8 7 5 1 2 4
5 1 8 3 4 2 6 9 7
2 4 7 6 1 9 5 3 8
3 2 6 7 8 4 9 1 5
4 8 9 5 3 1 7 6 2
1 7 5 2 9 6 4 8 3
Regular Expression: \r\n\r\n
Analysis: \r\n matches a carriage return followed by a line feed, which Windows uses as the end-of-line marker. Searching with the regular expression \r\n\r\n therefore matches two consecutive end-of-line markers, which is exactly where a blank line occurs.
Note: Unix and Linux use only a line feed to end a line of text. In other words, on Unix or Linux only \n is needed to match blank lines; \r is not required. A regular expression that works on both Windows and Unix/Linux should contain an optional \r and a mandatory \n, that is, \r?\n\r?\n, which will be discussed in later articles.
The Java code is as follows:
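import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public static void matchBlankLine() throws Exception {
    // Read the whole file into a StringBuilder.
    BufferedReader br = new BufferedReader(new FileReader(new File("E:/ .txt ")));
    StringBuilder sb = new StringBuilder();
    try {
        char[] cbuf = new char[1024];
        int len;
        // read() returns -1 at end of stream; append only the chars actually read.
        while ((len = br.read(cbuf)) > 0) {
            sb.append(cbuf, 0, len);
        }
    } finally {
        br.close();
    }
    // Two consecutive CR LF pairs, i.e. a blank line.
    String reg = "\r\n\r\n";
    System.out.println("Original content:\n" + sb.toString());
    System.out.println("After processing:-----------------------------");
    // Collapse each pair of end-of-line markers into a single one.
    System.out.println(sb.toString().replaceAll(reg, "\r\n"));
}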
The running result is as follows:
Original content:
8 5 4 1 6 3 2 7 9
7 6 2 9 5 8 3 4 1
9 3 1 4 2 7 8 5 6
6 9 3 8 7 5 1 2 4
5 1 8 3 4 2 6 9 7
2 4 7 6 1 9 5 3 8
3 2 6 7 8 4 9 1 5
4 8 9 5 3 1 7 6 2
1 7 5 2 9 6 4 8 3
After processing:-----------------------------
8 5 4 1 6 3 2 7 9
7 6 2 9 5 8 3 4 1
9 3 1 4 2 7 8 5 6
6 9 3 8 7 5 1 2 4
5 1 8 3 4 2 6 9 7
2 4 7 6 1 9 5 3 8
3 2 6 7 8 4 9 1 5
4 8 9 5 3 1 7 6 2
1 7 5 2 9 6 4 8 3
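The cross-platform pattern \r?\n\r?\n mentioned in the note above could be applied roughly as follows. This is a sketch working on an in-memory string rather than a file; the class and method names are hypothetical. It captures the first line ending in a group and puts it back with $1, so the text keeps whichever end-of-line style it already had. Like the original pattern, it removes one blank line per match, so runs of several blank lines would need a repeated pass.

```java
class BlankLineDemo {
    // Matches two consecutive line endings (CR optional, LF mandatory)
    // and keeps only the first one, which removes the blank line between them.
    static String removeBlankLines(String text) {
        return text.replaceAll("(\\r?\\n)\\r?\\n", "$1");
    }

    public static void main(String[] args) {
        String windowsText = "8 5 4\r\n\r\n7 6 2"; // hypothetical sample, Windows line endings
        String unixText = "8 5 4\n\n7 6 2\n\n9 3 1"; // hypothetical sample, Unix line endings
        System.out.println(removeBlankLines(windowsText));
        System.out.println(removeBlankLines(unixText));
    }
}
```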
3. Match a specific character class
Character sets (matching one of several characters) are the most common form of matching, and some frequently used sets can be replaced by special metacharacters, each of which matches a whole class of characters. These class metacharacters are not essential: you can always list the relevant characters one by one or define a range to match the same class. However, the expressions built with them are shorter and easier to read, so they are used often in practice.
1. Match digits and non-digits
\d any digit, equivalent to [0-9] or [0123456789]
\D any non-digit, equivalent to [^0-9] or [^0123456789]
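For illustration, a small Java sketch using both classes: \D to strip everything that is not a digit, and \d to count digits. The class name, method names, and sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitDemo {
    // Removes every non-digit character (\D) from the input.
    static String stripNonDigits(String input) {
        return input.replaceAll("\\D", "");
    }

    // Counts the characters matched by \d.
    static int countDigits(String input) {
        Matcher m = Pattern.compile("\\d").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(stripNonDigits("Tel: 555-0199")); // 5550199
        System.out.println(countDigits("a1b22"));            // 3
    }
}
```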
2. Match alphanumeric and non-alphanumeric characters
Letters (A-Z, in either case), digits, and the underscore form another common character set, for which the following metacharacters are available:
\w any letter (upper or lower case), digit, or underscore, equivalent to [0-9a-zA-Z_]
\W any character that is not a letter, digit, or underscore, equivalent to [^0-9a-zA-Z_]
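One possible use of \W in Java is turning arbitrary text into an identifier-like string by replacing every character outside [0-9a-zA-Z_] with an underscore. The class name, method names, and sample inputs below are hypothetical.

```java
class WordCharDemo {
    // Replaces each non-word character (\W) with an underscore.
    static String toIdentifier(String input) {
        return input.replaceAll("\\W", "_");
    }

    // Keeps only the word characters (\w) by deleting everything else.
    static String keepWordChars(String input) {
        return input.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("my file-name.txt")); // my_file_name_txt
        System.out.println(keepWordChars("user_01 (admin)")); // user_01admin
    }
}
```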
3. Match whitespace and non-whitespace characters
\s any whitespace character, equivalent to [\f\n\r\t\v] (Java's \s also includes the space character)
\S any non-whitespace character, equivalent to [^\f\n\r\t\v]
Note: the backspace metacharacter [\b] is not included in \s.
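A brief Java sketch of \s and \S, assuming the goal is to strip whitespace or count the visible characters in a string. In Java, \s also matches the ordinary space character. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpaceDemo {
    // Deletes every whitespace character matched by \s.
    static String removeWhitespace(String input) {
        return input.replaceAll("\\s", "");
    }

    // Counts the non-whitespace characters matched by \S.
    static int countNonWhitespace(String input) {
        Matcher m = Pattern.compile("\\S").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(removeWhitespace(" a\tb\r\nc ")); // abc
        System.out.println(countNonWhitespace("8 5\t4\n"));  // 3
    }
}
```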
4. Match hexadecimal or octal values
Hexadecimal: given with the prefix \x. For example, \x0A corresponds to ASCII character 10 (line feed) and has the same effect as \n.
Octal: given with the prefix \0, followed by a value of two or three digits. For example, \011 corresponds to ASCII character 9 (tab) and has the same effect as \t.
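The two equivalences above can be checked in Java with a sketch like the following. The class and method names are hypothetical; String.matches requires the whole input to match the pattern, so each call tests a single character.

```java
class NumericEscapeDemo {
    // True if the whole input is the single character with hex code 0A (line feed).
    static boolean isLineFeed(String s) {
        return s.matches("\\x0A");
    }

    // True if the whole input is the single character with octal code 011 (tab).
    static boolean isTab(String s) {
        return s.matches("\\011");
    }

    public static void main(String[] args) {
        System.out.println(isLineFeed("\n")); // true
        System.out.println(isTab("\t"));      // true
        System.out.println(isTab(" "));       // false
    }
}
```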
4. Use POSIX character classes
POSIX character classes are a shorthand supported by many regular expression implementations. Java supports them (with its own syntax, described below), but JavaScript does not. The POSIX character classes are as follows:
[:alnum:] any letter or digit, equivalent to [a-zA-Z0-9]
[:alpha:] any letter, equivalent to [a-zA-Z]
[:blank:] space or tab, equivalent to [ \t]
[:cntrl:] ASCII control character (ASCII 0 to 31, plus ASCII 127)
[:digit:] any digit, equivalent to [0-9]
[:graph:] any printable character, excluding the space
[:lower:] any lowercase letter, equivalent to [a-z]
[:print:] any printable character
[:punct:] any printable character that is neither a space nor in [:alnum:]
[:space:] any whitespace character, including the space, equivalent to [ \f\n\r\t\v]
[:upper:] any uppercase letter, equivalent to [A-Z]
[:xdigit:] any hexadecimal digit, equivalent to [a-fA-F0-9]
The POSIX syntax differs from the metacharacters we have seen so far. Let's look at an example that uses a regular expression to match a color value in a web page:
Text: <span style="background-color:#3636FF;height:30px;width:60px;">test</span>
Regular Expression: #[[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]]
Result: <span style="background-color:[#3636FF];height:30px;width:60px;">test</span>
Note: each class in the pattern used here starts with [[ and ends with ]]. This is required when using POSIX character classes. A POSIX class must be enclosed between [: and :]; the outer [ and ] define the set, while the inner [ and ] are part of the POSIX class itself.
POSIX classes are written differently in Java: instead of being enclosed between [: and :], they start with \p and are enclosed between { and }. They are case-sensitive, and \p{ASCII} is added. The list is as follows:
\p{Alnum} alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Alpha} letter: [\p{Lower}\p{Upper}]
\p{ASCII} any ASCII character: [\x00-\x7F]
\p{Blank} space or tab: [ \t]
\p{Cntrl} control character: [\x00-\x1F\x7F]
\p{Digit} decimal digit: [0-9]
\p{Graph} visible character: [\p{Alnum}\p{Punct}]
\p{Lower} lowercase letter: [a-z]
\p{Print} printable character: [\p{Graph}\x20]
\p{Punct} punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Space} whitespace character: [ \t\n\x0B\f\r]
\p{Upper} uppercase letter: [A-Z]
\p{XDigit} hexadecimal digit: [0-9a-fA-F]
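The web color example can be rewritten with the Java form of the class roughly as follows. This is a sketch, not the original author's code: the class name and method name are assumptions, and the sample text is the span element from the example above with its spacing normalized.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorDemo {
    // A '#' followed by exactly six hexadecimal digits, written with \p{XDigit}.
    private static final Pattern COLOR = Pattern.compile(
            "#\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}");

    // Returns the first color value found in the text, or null if there is none.
    static String findColor(String text) {
        Matcher m = COLOR.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<span style=\"background-color:#3636FF;height:30px;width:60px;\">test</span>";
        System.out.println(findColor(html)); // #3636FF
    }
}
```

Because the class names are case-sensitive, \p{XDigit} must be spelled exactly this way; a different capitalization is not guaranteed to compile as a pattern.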
Author: zhanghu198901