Regular Expressions
1. What is a regular expression?
When writing a program or web page that processes strings, you often need to find strings that satisfy some fairly complex rules. Regular expressions are the tool for describing such rules. In other words, a regular expression is a piece of code that records a text rule.
You have probably used the wildcards * and ? for file searches in Windows/DOS. To find all the Word documents in a directory, you would search for *.doc, where * is interpreted as any string. Like wildcards, regular expressions are a tool for text matching, but they describe what you want far more precisely, at the cost of being more complex. For example, you can write a regular expression that finds every string that starts with 0, continues with 2 or 3 more digits, then a hyphen "-", and ends with 7 or 8 digits (such as 010-12345678 or 0376-7654321).
2. Getting Started
The best way to learn regular expressions is to start from examples: understand an example, then modify it and experiment with it. Below are some simple examples with detailed explanations.
If you search for hi in an English novel, you can use the regular expression hi.
This is almost the simplest regular expression. It matches exactly the two-character string whose first character is h and whose second is i. Regular expression tools usually provide a case-insensitive option; with it enabled, the expression matches any of hi, HI, Hi, and hI.
Unfortunately, many words contain the two consecutive characters hi, such as him, history, and high. A search for hi finds the hi inside these words too. To find exactly the word hi, use \bhi\b.
\b is a special code defined by regular expressions (some people call it a metacharacter). It stands for the start or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation, or line breaks, \b does not match any of these separator characters. It matches only a position.
To be more precise, \b matches a position where the character before and the character after are not both \w: one is and the other is not, or the other does not exist.
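The difference between hi and \bhi\b can be checked with a small sketch. It assumes Python's standard re module as the regex tool (the tutorial itself does not prescribe one), and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text, chosen so that "hi" appears inside other words.
text = "Hi, this is his history; he said hi to him."

# Plain "hi" also hits the "hi" inside this, his, history, and him.
plain = re.findall(r"hi", text)

# \b pins both ends to word boundaries, so only the standalone word matches.
whole_word = re.findall(r"\bhi\b", text)

# re.IGNORECASE is Python's case-insensitive option.
any_case = re.findall(r"\bhi\b", text, re.IGNORECASE)

print(len(plain))   # 5
print(whole_word)   # ['hi']
print(any_case)     # ['Hi', 'hi']
```

Note that \b consumes no characters: the matches contain only the letters of the word, never the surrounding spaces or punctuation.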
If you are looking for hi followed, not far behind, by Lucy, use \bhi\b.*\bLucy\b.
Here . is another metacharacter; it matches any character except a line break. * is also a metacharacter, but it stands for neither a character nor a position. It stands for a count: it specifies that whatever precedes * may be repeated any number of times in a row for the whole expression to match. So .* together means any number of characters that are not line breaks. The meaning of \bhi\b.*\bLucy\b is now clear: first the word hi, then any number of arbitrary characters (but no line breaks), and finally the word Lucy.
The line break is '\n', the character whose ASCII code is 10 (hexadecimal 0x0A).
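A short sketch of how the line-break rule affects .* in practice, again assuming Python's re module and invented sample strings. The re.DOTALL flag shown at the end is Python's option for letting . match line breaks as well; other tools name this option differently.

```python
import re

pattern = re.compile(r"\bhi\b.*\bLucy\b")

same_line = "hi there, is Lucy home?"
split_lines = "hi there,\nis Lucy home?"

# Both words on one line: .* bridges the gap between them.
m1 = pattern.search(same_line)

# A line break between the words: . refuses to cross it, so there is no match.
m2 = pattern.search(split_lines)

# With re.DOTALL, . also matches '\n', so the match succeeds again.
m3 = re.search(r"\bhi\b.*\bLucy\b", split_lines, re.DOTALL)

print(m1.group())      # hi there, is Lucy
print(m2)              # None
print(m3 is not None)  # True
```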
If other metacharacters are used at the same time, we can construct a more powerful regular expression. For example:
0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, followed by two digits, then a hyphen "-", and finally eight digits (that is, a Chinese phone number; of course, this example only matches a three-digit area code).
Here \d is a new metacharacter that matches one digit (0, or 1, or 2, and so on). The - is not a metacharacter; it matches only itself, the hyphen.
To avoid so many annoying repetitions, we can also write this expression as 0\d{2}-\d{8}. Here the {2} ({8}) after \d means that the preceding \d must be repeated exactly 2 (8) times in a row.
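As a quick check that the long and short forms behave the same, here is a minimal sketch using Python's re module (an assumption; any engine with \d and {n} should agree). fullmatch is used so that the whole sample string has to fit the pattern.

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

# Sample strings: only the first has a three-digit area code and eight digits.
samples = ["010-12345678", "0376-7654321", "110-12345678", "010-1234567"]

long_results = [bool(long_form.fullmatch(s)) for s in samples]
short_results = [bool(short_form.fullmatch(s)) for s in samples]

print(long_results)   # [True, False, False, False]
print(short_results)  # [True, False, False, False]
```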
3. Metacharacters
Now you know several useful metacharacters, such as \b, ., *, and \d. Regular expressions have more of them. For example, \s matches any whitespace character, including spaces, tabs, line breaks, and Chinese full-width spaces, and \w matches letters, digits, underscores, and Chinese characters.
The special handling of Chinese characters is a feature of the regular expression engine provided by .NET. For other environments, see the relevant documentation.
Here are more examples:
\ba\w*\b matches a word that starts with the letter a: first the start of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).
Now is a good time to say what "word" means in regular expressions: at least one consecutive \w. Yes, it has little to do with the thousands of things of the same name that you memorize when learning English :)
\d+ matches one or more consecutive digits. The + here is a metacharacter similar to *. The difference is that * matches any number of repetitions (possibly zero), while + matches one or more.
\b\w{6}\b matches words of exactly 6 characters.
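The three examples above can be tried on one sentence. This is only an illustrative sketch with Python's re module and a made-up sentence.

```python
import re

# Hypothetical sample sentence.
text = "an apple a day costs 25 cents, almost 1000 in a decade"

# Words that start with the letter a.
a_words = re.findall(r"\ba\w*\b", text)

# Runs of one or more digits.
numbers = re.findall(r"\d+", text)

# Words of exactly six \w characters.
six_chars = re.findall(r"\b\w{6}\b", text)

print(a_words)    # ['an', 'apple', 'a', 'almost', 'a']
print(numbers)    # ['25', '1000']
print(six_chars)  # ['almost', 'decade']
```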
Character escape
If you want to find the metacharacters themselves, for example . or *, you run into a problem: you cannot write them directly, because they would be interpreted with their special meaning. In that case, use \ to cancel the special meaning, so you write \. and \*. And to find \ itself, you write \\.
For example, deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
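The effect of escaping can be seen in a small sketch, assuming Python's re module. Raw strings (r"...") are used so that Python itself does not interpret the backslashes; re.escape is a Python helper that produces the escaped form of a literal string and is not part of regular expression syntax itself.

```python
import re

# Unescaped, . matches any character, so "deerchaoXnet" is accepted too.
loose = re.compile(r"deerchao.net")
# Escaped, \. matches only a literal dot.
strict = re.compile(r"deerchao\.net")

loose_hits = [bool(loose.fullmatch(s)) for s in ("deerchao.net", "deerchaoXnet")]
strict_hits = [bool(strict.fullmatch(s)) for s in ("deerchao.net", "deerchaoXnet")]

# \\ in the pattern stands for one literal backslash in the text.
path_ok = bool(re.fullmatch(r"C:\\Windows", "C:\\Windows"))

# Let the library escape a literal string for us.
escaped = re.escape("deerchao.net")
escaped_ok = bool(re.fullmatch(escaped, "deerchao.net"))

print(loose_hits)   # [True, True]
print(strict_hits)  # [True, False]
print(path_ok, escaped_ok)
```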
Repetition
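You have already seen the repetition forms *, +, {2}, and {5,12}. The following are all the quantifiers in regular expressions (codes that specify a count, such as * or {5,12}):
*       repeat zero or more times
+       repeat one or more times
?       repeat zero or one time
{n}     repeat exactly n times
{n,}    repeat n or more times
{n,m}   repeat n to m times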
Here are some examples that use repetition:
Windows\d+ matches Windows followed by one or more digits.
^\w+ matches the first word of a line (or the first word of the whole string; which of the two depends on the option settings).
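A sketch of these repetition examples in Python's re module (an assumption; option names vary between tools). In Python, the option that switches ^ from "start of the string" to "start of each line" is re.MULTILINE. The sample strings and the colou?r pattern are made up for illustration.

```python
import re

versions = re.findall(r"Windows\d+", "Windows98, Windows2000, Windows XP")

text = "first line\nsecond line"
# Without options, ^ matches only at the start of the whole string.
first_of_string = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches at the start of every line.
first_of_each_line = re.findall(r"^\w+", text, re.MULTILINE)

# ? makes the preceding item optional (zero or one time).
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# {5,12} accepts between 5 and 12 repetitions.
ranged = re.findall(r"\b\d{5,12}\b", "1234 12345 123456789012 1234567890123")

print(versions)            # ['Windows98', 'Windows2000']
print(first_of_string)     # ['first']
print(first_of_each_line)  # ['first', 'second']
print(optional)            # [True, True, False]
print(ranged)              # ['12345', '123456789012']
```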
Character class
Finding digits, letters or digits, or whitespace is simple, because there are already metacharacters for those character sets. But what if you want to match a set of characters with no predefined metacharacter (such as the vowels a, e, i, o, u)?
You just list them inside square brackets. For example, [aeiou] matches any one English vowel, and [.?!] matches a punctuation mark (. or ? or !).
We can also easily specify a character range. For example, [0-9] means exactly the same as \d: one digit. Similarly, [a-z0-9A-Z_] is fully equivalent to \w (if only English is considered).
Here is a more complex expression: \(?0\d{2}[) -]?\d{8}.
"(" and ")" are also metacharacters; they are covered later in the grouping section. That is why they must be escaped here.
This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it. First comes an escaped \(, which may appear 0 or 1 times (?). Then a 0 followed by two digits (\d{2}). Then one of ), -, or a space, which appears once or not at all (?). Finally come eight digits (\d{8}).
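The character classes and the phone expression can be exercised with a minimal sketch, assuming Python's re module. The sample words and sentences are invented; the phone numbers are the ones quoted in the text.

```python
import re

# A character class matches exactly one character from the listed set.
vowels = re.findall(r"[aeiou]", "regular expression")
sentence_ends = re.findall(r"[.?!]", "Really? Yes! Fine.")

# [0-9] and \d pick out the same digits in ASCII text.
same_digits = re.findall(r"[0-9]", "a1b22c") == re.findall(r"\d", "a1b22c")

# The phone expression from the text: optional "(", area code, optional separator.
phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")
samples = ["(010)88886666", "022-22334455", "02912345678"]
matched = [s for s in samples if phone.fullmatch(s)]

print(vowels)         # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(sentence_ends)  # ['?', '!', '.']
print(matched == samples)  # True
```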
Branch Condition
Unfortunately, the expression above also matches "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branch conditions. A branch condition in a regular expression means there are several rules, and satisfying any one of them counts as a match. The way to write it is to separate the rules with |. Not clear yet? Never mind, look at the examples:
0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphenated phone numbers: a three-digit area code with an eight-digit local number (such as 010-12345678), or a four-digit area code with a seven-digit local number (such as 0376-2233445).
\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a three-digit area code. The area code may or may not be enclosed in parentheses, and it may be separated from the local number by a hyphen or a space, or by nothing at all. You can try extending this expression with branch conditions so that it also supports four-digit area codes.
\d{5}-\d{4}|\d{5} matches United States ZIP codes, which are either five digits or nine digits with a hyphen after the fifth. This example is here because it illustrates a point: when using branch conditions, pay attention to the order of the conditions. If you change the expression to \d{5}|\d{5}-\d{4}, it only matches five-digit ZIP codes (and the first five digits of nine-digit ones). The reason is that the conditions are tested from left to right, and once one condition is satisfied the others are not tried.
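The ordering effect is easy to reproduce. The sketch below assumes Python's re module, which also tries alternatives from left to right; the variable names are illustrative only.

```python
import re

zip_good = re.compile(r"\d{5}-\d{4}|\d{5}")
zip_bad = re.compile(r"\d{5}|\d{5}-\d{4}")

text = "12345-6789"
# The longer alternative is listed first, so the whole nine-digit code is found.
good = zip_good.search(text).group()
# The five-digit alternative wins first, and the rest of the code is left over.
bad = zip_bad.search(text).group()

# The stricter phone expression rejects the unbalanced parentheses.
strict_phone = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")
samples = ["(010)88886666", "022-22334455", "02912345678",
           "010)12345678", "(022-87654321"]
accepted = [s for s in samples if strict_phone.fullmatch(s)]

print(good)      # 12345-6789
print(bad)       # 12345
print(accepted)  # ['(010)88886666', '022-22334455', '02912345678']
```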
Grouping
We have already covered how to repeat a single character (just put a quantifier after it). But what if you want to repeat several characters? You can use parentheses to mark a subexpression (also called a group). You can then specify a repetition count for the subexpression, or perform other operations on it (introduced later).
(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a number of up to three digits followed by a period (together these form the group), repeated 3 times; and finally comes one more number of 1 to 3 digits (\d{1,3}).
No number in an IP address can be greater than 255. Don't let the writers of season 3 of "24" fool you...
Unfortunately, it also matches impossible IP addresses such as 256.300.888.999. If arithmetic comparison were available, the problem would be easy to solve, but regular expressions provide no arithmetic. So you have to describe a correct IP address with lengthy grouping, alternatives, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I will not go through it here; you should be able to work out its meaning yourself.
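To compare the simple and the strict IP expressions side by side, here is a sketch assuming Python's re module. The name octet is introduced here only to avoid typing the sub-pattern twice; fullmatch is used so that the entire string has to be a valid address.

```python
import re

simple_ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One number of the address: 200-249, 250-255, or 0-199.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict_ip = re.compile(r"(" + octet + r"\.){3}" + octet)

samples = ["192.168.0.1", "255.255.255.255", "256.300.888.999"]

simple_ok = [s for s in samples if simple_ip.fullmatch(s)]
strict_ok = [s for s in samples if strict_ip.fullmatch(s)]

print(simple_ok)  # all three, including the impossible address
print(strict_ok)  # ['192.168.0.1', '255.255.255.255']
```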
Negation
Sometimes you need to find characters that are not in some easily defined character class. For example, to find any character except a digit, you need negation:
Examples: \S+ matches a string that contains no whitespace characters.
<a[^>]+> matches a string enclosed in angle brackets that starts with a.
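A few negated forms in action, as a sketch with Python's re module. Besides the two expressions from the text, it uses \D (anything that is not a digit) and [^aeiou] (any character that is not one of the listed vowels); the sample strings are made up.

```python
import re

# \S+ : runs of characters that are not whitespace.
non_space_runs = re.findall(r"\S+", "no  blanks\there")

# \D+ : runs of characters that are not digits.
non_digits = re.findall(r"\D+", "abc123def")

# [^aeiou] : any single character other than these vowels.
not_vowels = re.findall(r"[^aeiou]", "queue")

# <a[^>]+> : "<a", then one or more characters that are not ">", then ">".
a_tags = re.findall(r"<a[^>]+>", '<p><a href="x.html">link</a></p>')

print(non_space_runs)  # ['no', 'blanks', 'here']
print(non_digits)      # ['abc', 'def']
print(not_vowels)      # ['q']
print(a_tags)          # ['<a href="x.html">']
```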
Backreferences
A backreference searches again for the text matched by an earlier group. For example, \1 stands for the text matched by group 1. Hard to follow? See the example:
\b(\w+)\b\s+\1\b matches repeated words, such as go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between a word start and a word end (\b(\w+)\b); this word is captured into the group numbered 1. Then come one or more whitespace characters (\s+), and finally the content captured in group 1, that is, the word matched a moment ago (\1).
You can also give a subexpression a group name of your own. Use the syntax (?<Word>\w+) (the angle brackets can be replaced with single quotes: (?'Word'\w+)), which names the \w+ group Word. To backreference what this group captured, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
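The repeated-word expression can be tried as follows. This sketch assumes Python's re module, whose named-group syntax differs from the .NET syntax shown above: Python writes (?P<Word>...) to name a group and (?P=Word) to refer back to it, and does not accept \k<Word>. The numbered form \1 is the same in both.

```python
import re

# Hypothetical sample text with two repeated words.
text = "go go home, kitty kitty, not gone"

# Numbered backreference: \1 repeats whatever group 1 captured.
numbered = re.compile(r"\b(\w+)\b\s+\1\b")

# Named backreference, written in Python's syntax.
named = re.compile(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b")

numbered_hits = [m.group(0) for m in numbered.finditer(text)]
named_hits = [m.group("Word") for m in named.finditer(text)]

print(numbered_hits)  # ['go go', 'kitty kitty']
print(named_hits)     # ['go', 'kitty']
```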
Parentheses are also used in many special-purpose syntaxes. The most common ones are listed below:
Comments
Another use of parentheses is to include comments with the syntax (?#comment). For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199).
If you want to include comments, it is best to enable the "ignore whitespace in the pattern" option. You can then add spaces, tabs, and line breaks freely while writing the expression, and they are ignored when it is used. With this option enabled, everything from # to the end of the line is also ignored as a comment. For example, the expression (?<=<(\w+)>).*(?=<\/\1>) can be written like this:
(?<=    # Assert the prefix of the text to be matched
<(\w+)> # Look for letters or digits enclosed in angle brackets (that is, an HTML/XML tag)
)       # End of prefix
.*      # Match any text
(?=     # Assert the suffix of the text to be matched
<\/\1>  # Look for content enclosed in angle brackets: a "/" followed by the tag captured earlier
)       # End of suffix
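Both comment styles can be tried in Python's re module, where the whitespace-ignoring option is called re.VERBOSE. This is a sketch under that assumption. It uses the 0-255 number expression rather than the tag expression, because Python's re only accepts fixed-width lookbehind and would reject (?<=<(\w+)>).

```python
import re

# Whitespace-ignoring mode: layout and # comments are not part of the pattern.
octet = re.compile(r"""
    2[0-4]\d      # 200-249
  | 25[0-5]       # 250-255
  | [01]?\d\d?    # 0-199
""", re.VERBOSE)

# The same expression with inline (?#...) comments and no special option.
inline = re.compile(r"2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)")

samples = ["0", "199", "249", "255", "256"]
verbose_ok = [s for s in samples if octet.fullmatch(s)]
inline_ok = [s for s in samples if inline.fullmatch(s)]

print(verbose_ok)  # ['0', '199', '249', '255']
print(inline_ok)   # ['0', '199', '249', '255']
```

In this mode a literal space or # in the pattern has to be escaped or placed inside a character class, otherwise it is dropped along with the layout.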