What is a regular expression (RE)?
You have probably used the wildcard character "*" when searching for files. For example, to find all Word documents in the Windows directory, you might search for "*.doc", because "*" stands for any sequence of characters. An RE does something similar, but is far more powerful.
When writing a program, you often need to check whether a string matches a particular pattern. The main purpose of an RE is to describe such a pattern, so you can think of an RE as a pattern specification. For example, "\w+" represents any non-empty string of letters and digits. The .NET Framework provides a powerful class library that makes it easy to use REs to find and replace text, decode complex headers, and validate text.
Next, let's try some examples.
A few simple examples
Suppose you want to find "elvis" followed by "alive" in a piece of text. You might arrive at the RE in the following steps; the text in parentheses explains each RE:
1. elvis (find elvis)
This RE searches for the characters e, l, v, i, s in exactly that order. In .NET you can choose whether case is ignored; with case ignored, "Elvis", "ELVIS", and "eLvIs" all match RE 1. But because the RE only looks for these characters in sequence, "pelvis" matches RE 1 as well. RE 2 improves on this.
2. \belvis\b (find elvis as a whole word; with case ignored this also matches Elvis, ELVIS, and so on)
"\b" has a special meaning in an RE: it matches a word boundary. Placing \b on both sides of elvis in \belvis\b therefore restricts the match to the whole word elvis.
Now suppose you want to find "elvis" followed by "alive" later on the same line. This takes two more special characters, "." and "*". "." matches any character except a newline, and "*" repeats the item before it any number of times, as many as are needed for the whole RE to match. So ".*" means any number of characters other than newlines. RE 3 finds "elvis" followed by "alive" on the same line.
3. \belvis\b.*\balive\b (find elvis followed by alive, such as "Elvis is alive")
Simple special characters can be combined into powerful REs, but the more special characters an RE uses, the harder it becomes to read.
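The three steps above can be tried out in a few lines. This is only a sketch: it uses Python's re module as a stand-in for the .NET classes the article is about (that substitution, the sample sentence, and the variable names are assumptions of the example), and re.IGNORECASE plays the role of the ignore-case setting.

```python
import re

# Made-up sample text for illustration.
text = "Elvis is alive. His pelvis is fine."

# RE 1: plain characters, case ignored; also hits the "elvis" inside "pelvis".
plain = re.findall(r"elvis", text, flags=re.IGNORECASE)

# RE 2: \b restricts the match to the whole word.
whole_word = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

# RE 3: "elvis" followed by "alive" later on the same line.
followed = re.search(r"\belvis\b.*\balive\b", text, flags=re.IGNORECASE)

print(plain)             # ['Elvis', 'elvis']
print(whole_word)        # ['Elvis']
print(followed.group())  # Elvis is alive
```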
Let's look at another example.
Finding valid phone numbers
Suppose you want to collect customers' seven-digit phone numbers from a web page, in the format xxx-xxxx where each x is a digit. The RE might be written like this:
4. \b\d\d\d-\d\d\d\d (find a seven-digit phone number, such as 123-1234)
Each \d represents one digit, and "-" is an ordinary hyphen. To avoid repeating \d so many times, the RE can be rewritten as in 5.
5. \b\d{3}-\d{4} (a better way to find seven-digit phone numbers, such as 123-1234)
The {3} after \d repeats the preceding item three times, so \d{3} is equivalent to \d\d\d.
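To see that REs 4 and 5 are equivalent, here is a small sketch using Python's re module (an assumption of this example; the article itself targets .NET). The sample text is invented.

```python
import re

# Made-up sample text for illustration.
page = "Call 555-1234 or 123-4567; order no. 12-34567 is not a phone number."

long_form = re.findall(r"\b\d\d\d-\d\d\d\d", page)  # RE 4
short_form = re.findall(r"\b\d{3}-\d{4}", page)     # RE 5

print(long_form)   # ['555-1234', '123-4567']
print(short_form)  # ['555-1234', '123-4567']
```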
Expresso, a tool for learning and testing REs
Because REs are hard to read and easy to get wrong, Jim developed a tool called Expresso to help users learn and test them. Besides the URL mentioned above, it is also available from the Ultrapico site. After you install Expresso, you will find that Jim has built the article's examples into its Expression Library, so you can test them as you read. You can also modify the example REs and see the results immediately. I found it very useful and recommend giving it a try.
Basic concepts of REs in .NET
Special characters
Some characters have special meanings, such as "\b", ".", "*", and "\d", which we have already seen. "\s" matches any whitespace character, such as a space, tab, or newline. "\w" matches any letter or digit.
Let's look at some more examples.
6. \ba\w*\b (find words beginning with a, such as able)
This RE describes the start of a word (\b), followed by the letter "a", then any number of alphanumeric characters (\w*), and finally the end of the word (\b).
7. \d+ (find a string of digits)
"+" is very similar to "*", except that "+" requires the preceding item to occur at least once. So \d+ means at least one digit.
8. \b\w{6}\b (find six-character alphanumeric words, such as ab123c)
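REs 6 to 8 can be checked against a short sample. This is a sketch in Python's re module, used here as a stand-in for .NET; the sample sentence is invented.

```python
import re

# Made-up sample text for illustration.
text = "able was I, agent 007 saw ab123c in 1999"

a_words = re.findall(r"\ba\w*\b", text)    # RE 6: words beginning with "a"
numbers = re.findall(r"\d+", text)         # RE 7: runs of one or more digits
six_long = re.findall(r"\b\w{6}\b", text)  # RE 8: words of exactly six characters

print(a_words)   # ['able', 'agent', 'ab123c']
print(numbers)   # ['007', '123', '1999']
print(six_long)  # ['ab123c']
```

Note that \d+ also finds the digits inside ab123c, because it is not tied to word boundaries.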
The following table lists the special characters most commonly used in REs:
.   Any character except a newline
\w  Any alphanumeric character
\s  Any whitespace character
\d  Any digit
\b  A word boundary
^   The beginning of the text; for example, "^The" matches "The" only at the beginning of the text
$   The end of the text; for example, "end$" matches "end" only at the end of the text
The special characters "^" and "$" are used when the text you are looking for must appear at the beginning or end of the input. This is especially useful for validating that input conforms to a pattern. For example, to validate a seven-digit phone number, you might use RE 9.
9. ^\d{3}-\d{4}$ (validate a seven-digit phone number)
This is the same as RE 5, except that no other characters may appear before or after the number; the entire string must consist of just the seven-digit phone number. In .NET, if you set the Multiline option, "^" and "$" are tested against each line, so the RE matches as long as some line conforms to it, rather than the whole text.
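The effect of the anchors, and of a multiline setting, can be sketched as follows. Python's re module and its re.MULTILINE flag are used as an analogue of the .NET Multiline option (an assumption of this example); is_phone is a hypothetical helper name.

```python
import re

PHONE = r"^\d{3}-\d{4}$"  # RE 9

def is_phone(s):
    """Hypothetical helper: True if the whole string is a seven-digit phone number."""
    return re.search(PHONE, s) is not None

lines = "555-1234\nnot a number\n123-4567"

whole_only = re.findall(PHONE, lines)                      # anchors apply to the whole text
per_line = re.findall(PHONE, lines, flags=re.MULTILINE)    # anchors apply to each line

print(is_phone("555-1234"))       # True
print(is_phone("call 555-1234"))  # False
print(whole_only)                 # []
print(per_line)                   # ['555-1234', '123-4567']
```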
Escaped characters
Sometimes you need characters such as "^" or "$" in their literal meaning rather than as special characters. In that case, put the "\" character in front to remove the special meaning: "\^", "\.", and "\\" stand for the literal characters "^", ".", and "\".
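A quick sketch of why escaping matters, again using Python's re module as a stand-in for .NET (the sample strings are invented): an unescaped "." matches any character, while "\." matches only a literal period.

```python
import re

# Unescaped "." accepts any character between the digits.
loose = re.findall(r"\d+.\d{2}", "5x00 5.00")

# Escaped "\." accepts only a literal period.
strict = re.findall(r"\d+\.\d{2}", "5x00 5.00")

# "\$" is a literal dollar sign, not the end-of-text anchor.
prices = re.findall(r"\$\d+\.\d{2}", "Total: $5.00 (was $6.00)")

print(loose)   # ['5x00', '5.00']
print(strict)  # ['5.00']
print(prices)  # ['$5.00', '$6.00']
```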
Repetitions
We have already seen that "{3}" and "*" can be used to repeat the preceding character. Later we will see how the same syntax can repeat a whole subexpression. The following table lists the ways to repeat the preceding item.
*      Repeat any number of times
+      Repeat at least once
?      Repeat zero or one time
{n}    Repeat exactly n times
{n,m}  Repeat at least n times, but no more than m times
{n,}   Repeat at least n times
Let's try some more examples.
10. \b\w{5,6}\b (find words of five or six alphanumeric characters, such as as25d or D58SDF)
11. \b\d{3}\s\d{3}-\d{4} (find ten-digit phone numbers, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (find Social Security numbers, such as 123-45-6789)
13. ^\w* (the first word of each line or of the whole text)
In Expresso, try this with and without the Multiline option to see the difference.
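The repetition examples above can be exercised with a short sketch. Python's re module stands in for .NET and for Expresso here, and re.MULTILINE stands in for the Multiline option (both assumptions of this example); the sample text is invented.

```python
import re

# Made-up sample text for illustration.
text = "as25d D58SDF toolongword 800 123-1234 123-45-6789"

words = re.findall(r"\b\w{5,6}\b", text)            # five or six word characters
phones = re.findall(r"\b\d{3}\s\d{3}-\d{4}", text)  # ten-digit phone number
ssns = re.findall(r"\d{3}-\d{2}-\d{4}", text)       # Social Security number format

two_lines = "first line\nsecond line"
first_of_text = re.findall(r"^\w*", two_lines)
first_of_lines = re.findall(r"^\w*", two_lines, flags=re.MULTILINE)

print(words)           # ['as25d', 'D58SDF']
print(phones)          # ['800 123-1234']
print(ssns)            # ['123-45-6789']
print(first_of_text)   # ['first']
print(first_of_lines)  # ['first', 'second']
```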
Matching a range of characters
What if you need to find one of a specific set of characters? This is where square brackets "[]" come in handy. [aeiou] matches any one of the vowels "a", "e", "i", "o", "u", and [.?!] matches any one of the punctuation marks ".", "?", "!". Inside brackets, special characters such as these lose their special meaning and are interpreted literally. You can also specify a range of characters: "[a-z0-9]" matches any lowercase letter or any digit.
Now let's look at a more complex RE for finding phone numbers.
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (find a ten-digit phone number, such as (080) 333-1234)
This RE finds phone numbers in more than one format, such as (080) 123-4567 and 511 254 6654. "\(?" matches zero or one left parenthesis "(", "[) ]" matches either a right parenthesis ")" or a space, and "\s?" matches zero or one whitespace character. Unfortunately, it also matches numbers such as "(080 333-1234" or "080) 333-1234", in which the parentheses are not balanced. Alternatives, covered later, solve this problem.
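The bracket syntax and the unbalanced-parenthesis weakness can both be seen in a small sketch. Python's re module is used as a stand-in for .NET, and the sample strings are invented for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
marks = re.findall(r"[.?!]", "Really? Yes. Wow!")

# The phone RE above: optional "(", three digits, ")" or space, and so on.
phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")
samples = ["(080) 123-4567", "511 254 6654", "(080 333-1234", "080) 333-1234"]
accepted = [s for s in samples if phone.fullmatch(s)]

print(vowels)    # ['e', 'u', 'a']
print(marks)     # ['?', '.', '!']
print(accepted)  # all four samples, including the two with unbalanced parentheses
```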
Excluding a character group (negation)
Sometimes you need to find characters that are NOT in a particular character group. The following table shows how to describe them.
\W        Any character that is not alphanumeric
\S        Any character that is not whitespace
\D        Any character that is not a digit
\B        A position that is not a word boundary
[^x]      Any character other than x
[^aeiou]  Any character other than a, e, i, o, u
15. \S+ (a string containing no whitespace)
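A short sketch of the negated forms, using Python's re module as a stand-in for .NET (the sample strings are invented):

```python
import re

# \S+ : runs of non-whitespace characters.
chunks = re.findall(r"\S+", "tab\tand  spaces\n")

# [^aeiou\s] : any character that is neither a vowel nor whitespace.
consonants = re.findall(r"[^aeiou\s]", "regex fun")

# \D : any non-digit; removing them leaves only the digits.
digits_only = re.sub(r"\D", "", "(080) 333-1234")

print(chunks)       # ['tab', 'and', 'spaces']
print(consonants)   # ['r', 'g', 'x', 'f', 'n']
print(digits_only)  # 0803331234
```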
Alternatives
Sometimes you need to match one of several specific choices. This is where the special character "|" comes in handy. For example, you may want to find both five-digit ZIP codes and nine-digit ZIP codes (which contain a "-").
16. \b\d{5}-\d{4}\b|\b\d{5}\b (find five-digit ZIP codes and nine-digit ZIP codes containing "-")
When using alternatives, pay attention to their order, because the RE tries the alternatives from left to right and takes the first one that matches. In 16, if the five-digit alternative came first, the RE would only ever find five-digit ZIP codes. With alternatives, RE 14 can be improved as follows.
17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (ten-digit phone number)
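The ordering effect is easy to reproduce. The sketch below uses Python's re module as a stand-in for .NET; the sample text is invented, and fullmatch is used so that each phone sample must match in its entirety.

```python
import re

text = "ZIP 12345-6789 and 54321"

# Nine-digit alternative first: both forms are found.
good_order = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)

# Five-digit alternative first: the nine-digit ZIP is cut short.
bad_order = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

# Phone RE with alternatives: parentheses must now be balanced.
phone = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
samples = ["(080) 333-1234", "080 333-1234", "(080 333-1234", "080) 333-1234"]
accepted = [s for s in samples if phone.fullmatch(s)]

print(good_order)  # ['12345-6789', '54321']
print(bad_order)   # ['12345', '54321']
print(accepted)    # ['(080) 333-1234', '080 333-1234']
```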
Grouping
Parentheses can be used to delimit a subexpression. Once a subexpression is grouped, it can be repeated or otherwise treated as a unit.
18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding IP addresses)
The first part of this RE, (\d{1,3}\.){3}, means a number of one to three digits followed by a "." character, repeated three times in all. It is followed by one more number of one to three digits, giving an address such as 192.72.28.1.
This RE has a drawback, however. Each number in an IP address can be at most 255, but the RE above accepts any number of one to three digits. The numbers would have to be compared against 256, and an RE alone cannot do that kind of arithmetic comparison. RE 19 uses alternatives to restrict each number to the required range of 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (find IP addresses)
Have you noticed that REs look more and more like an alien language? Even for something as simple as an IP address, the RE is already hard to understand at a glance.
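The difference between the simple and the range-restricted address patterns can be sketched as follows, using Python's re module as a stand-in for .NET. The names simple and strict, and the test addresses, are made up for this example.

```python
import re

# Simple form: any one-to-three-digit numbers separated by dots.
simple = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# Range-restricted form: each number must lie between 0 and 255.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict = re.compile(r"(" + octet + r"\.){3}" + octet)

addresses = ["192.72.28.1", "255.255.255.255", "256.1.1.1", "999.999.999.999"]

simple_ok = [a for a in addresses if simple.fullmatch(a)]
strict_ok = [a for a in addresses if strict.fullmatch(a)]

print(simple_ok)  # all four addresses
print(strict_ok)  # ['192.72.28.1', '255.255.255.255']
```

Building the pattern from a named piece such as octet is one way to keep a long RE readable; it produces exactly the same pattern text as writing it out in full.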
Expresso Analyzer View
Expresso provides a feature that turns the RE you enter into a tree-like description, explaining each group separately, and it offers a good debugging environment. Other functions, such as partial match (match only the highlighted part of the RE) and exclude match (leave out the highlighted part of the RE), are left for you to try.
When a subexpression is grouped with parentheses, the text it matches can be used later in the program or within the RE itself. By default, groups are named by number, starting from 1 and counting from left to right. This automatic naming can be seen in the skeleton view or the result view in Expresso.
A backreference is used to find another occurrence of the text captured by a group. For example, "\1" refers to the text captured by group 1.
20. \b(\w+)\b\s*\1\b (find repeated words, that is, the same word twice separated by whitespace, such as "dog dog")
(\w+) captures a word of at least one letter or digit and names it group 1. The RE then looks for any number of whitespace characters, followed by the same text as group 1.
If you do not like the automatic group name 1, you can name the group yourself. In the example above, (\w+) can be rewritten as (?<word>\w+), which names the captured group word, and the backreference is then written as \k<word>.
21. \b(?<word>\w+)\b\s*\k<word>\b (use a named group to capture repeated words)
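Numbered and named backreferences can be sketched as below. Python's re module is used as a stand-in for .NET, and its named-group syntax differs: Python writes (?P<word>...) and (?P=word) where .NET writes (?<word>...) and \k<word>. The sample sentence is invented.

```python
import re

# Made-up sample text for illustration.
text = "the the dog dog barked at at the cat"

# Numbered group: \1 must repeat whatever group 1 captured.
numbered = re.findall(r"\b(\w+)\b\s*\1\b", text)

# Named group, in Python's spelling of the same idea.
named_re = re.compile(r"\b(?P<word>\w+)\b\s*(?P=word)\b")
named = [m.group("word") for m in named_re.finditer(text)]

print(numbered)  # ['the', 'dog', 'at']
print(named)     # ['the', 'dog', 'at']
```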
Parentheses are used in a number of special syntax elements. The most common ones are listed below:
Captures
(exp)         Match exp and capture it in an automatically numbered group
(?<name>exp)  Match exp and capture it in a group named name
(?:exp)       Match exp, but do not capture it
Lookarounds
(?=exp)       Match any position that is followed by exp
(?<=exp)      Match any position that is preceded by exp
(?!exp)       Match any position that is not followed by exp
(?<!exp)      Match any position that is not preceded by exp
Comment
(?#comment)   Comment
Positive Lookaround
Next come the lookahead and lookbehind assertions. They look at the text before or after the current match without including that text in the match itself. Like "^" and "\b", they do not match any text of their own but only define a position, which is why they are called zero-width assertions. Some examples should make this clearer.
(?=exp) is a "zero-width positive lookahead assertion". It matches a position that is followed by exp, without including exp itself.
22. \b\w+(?=ing\b) (the beginning of words ending in ing; for example, in filling it matches fill)
(?<=exp) is a "zero-width positive lookbehind assertion". It matches a position that is preceded by exp, without including exp itself.
23. (?<=\bre)\w+\b (the end of words starting with re; for example, in repeated it matches peated)
24. (?<=\d)\d{3}\b (three digits at the end of a word, preceded by another digit)
25. (?<=\s)\w+(?=\s) (alphanumeric strings bounded by whitespace)
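The positive lookaround examples can be run as a sketch in Python's re module, used here as a stand-in for .NET. The four patterns happen to use fixed-width lookbehinds, which Python's re accepts; the sample strings are invented.

```python
import re

# Lookahead: the stem of words ending in "ing" (the "ing" itself is not consumed).
stems = re.findall(r"\b\w+(?=ing\b)", "filling and singing")

# Lookbehind: the rest of words starting with "re".
tails = re.findall(r"(?<=\bre)\w+\b", "repeated reading bread")

# Three digits at the end of a word, preceded by another digit.
last_three = re.findall(r"(?<=\d)\d{3}\b", "12345 678 90")

# Words with whitespace on both sides; the whitespace is not part of the match.
inner = re.findall(r"(?<=\s)\w+(?=\s)", "one two three four")

print(stems)       # ['fill', 'sing']
print(tails)       # ['peated', 'ading']
print(last_three)  # ['345']
print(inner)       # ['two', 'three']
```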
Negative Lookaround
Earlier we saw how to find a character that is not a specific character or not in a specific group. But what if you only want to verify that a character is absent, without matching anything in its place? For example, suppose you want to find words that contain a q not followed by a u. You could try the following RE.
26. \b\w*q[^u]\w*\b (a word containing a q whose next letter is not u)
This RE has a problem. [^u] must match some character, so if q is the last letter of a word, [^u] matches the whitespace after it, and the RE may end up matching two words, as in the text "Iraq haha". Negative lookaround solves this problem.
27. \b\w*q(?!u)\w*\b (a word containing a q whose next letter is not u)
This is a "zero-width negative lookahead assertion".
28. \d{3}(?!\d) (three digits not followed by another digit)
Similarly, you can use (?<!exp), the "zero-width negative lookbehind assertion", to match a position that is not preceded by exp.
29. (?<![a-z ])\w{7} (a seven-character alphanumeric string not preceded by a letter or a space)
30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
This uses lookbehind and lookahead assertions to extract the text between HTML tags, excluding the tags themselves.
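The negative lookaround examples can be sketched in Python's re module, used as a stand-in for .NET. One caveat: Python's re only accepts fixed-width lookbehinds, so the HTML pattern with a (\w+) inside the lookbehind does not compile there; the last lines below use capture groups instead to get the same text. That substitution and the sample strings are assumptions of this example.

```python
import re

# [^u] consumes the space after "q", so two words are matched together.
with_class = re.findall(r"\b\w*q[^u]\w*\b", "Iraq haha")

# (?!u) only checks the next character and consumes nothing.
with_lookahead = re.findall(r"\b\w*q(?!u)\w*\b", "Iraq haha quit")

# Three digits not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "1234 567")

# Seven word characters not preceded by a lowercase letter or a space.
seven = re.findall(r"(?<![a-z ])\w{7}", "#1234567 abcdefg")

# Capture-group alternative for the text between matching HTML tags.
inner_text = re.search(r"<(\w+)>(.*)</\1>", "<b>bold text</b>").group(2)

print(with_class)      # ['Iraq haha']
print(with_lookahead)  # ['Iraq']
print(three_digits)    # ['234', '567']
print(seven)           # ['1234567']
print(inner_text)      # bold text
```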
Comments
Another special use of parentheses is to enclose comments, with the syntax "(?#comment)". If you set the "Ignore Pattern Whitespace" option, whitespace in the RE is ignored when the RE is used. With this option set, everything from "#" to the end of the line is ignored as well.
31. Text between HTML tags, with comments
(?<=     # look for the prefix, but do not include it
<(\w+)>  # the HTML tag
)        # end of the prefix
.*       # match any text
(?=      # look for the suffix, but do not include it
<\/\1>   # match the string captured by group 1, i.e. the tag name from the opening tag
)        # end of the suffix
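A commented pattern can be sketched in Python's re module, where the re.VERBOSE flag plays a role similar to the "Ignore Pattern Whitespace" option (Python as a stand-in for .NET is an assumption of this example). Because Python's re does not allow a variable-width lookbehind, the sketch captures the inner text in group 2 rather than using the lookbehind form.

```python
import re

tag_text = re.compile(r"""
    <(\w+)>   # opening HTML tag; the tag name is captured as group 1
    (.*)      # any text; captured as group 2
    </\1>     # closing tag whose name equals the text of group 1
""", re.VERBOSE)

inner = tag_text.search("<p>Hello</p>").group(2)

# Inline (?#...) comments work without any option set.
number = re.search(r"\d{3}(?#first part)-\d{4}(?#second part)", "555-1234").group()

print(inner)   # Hello
print(number)  # 555-1234
```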
Matching the most and the fewest characters (greedy and lazy)
When an RE contains a repetition that can match a range of lengths (such as ".*"), it normally matches as many characters as possible. This is called greedy matching. For example:
32. a.*b (the longest match that starts with a and ends with b)
Given the string "aabab", this RE matches "aabab", because that is the longest possible match. Sometimes you want the match with the fewest characters instead, which is lazy matching. Adding a question mark (?) to any entry in the repetition table turns it into a lazy quantifier. So "*?" means repeat any number of times, but as few times as possible. For example:
33. a.*?b (the shortest match that starts with a and ends with b)
Given the string "aabab", this RE first matches "aab" and then "ab", because it looks for the fewest characters each time.
*?      Repeat any number of times, but as few as possible
+?      Repeat at least once, but as few times as possible
??      Repeat zero or one time, but as few times as possible
{n,m}?  Repeat at least n times but no more than m times, as few times as possible
{n,}?   Repeat at least n times, but as few times as possible
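Greedy and lazy matching on the string "aabab" can be checked with a short sketch in Python's re module (a stand-in for .NET; the HTML-like string is an extra, invented illustration).

```python
import re

greedy = re.findall(r"a.*b", "aabab")   # as many characters as possible
lazy = re.findall(r"a.*?b", "aabab")    # as few characters as possible

# The same contrast on tag-like text.
greedy_tags = re.findall(r"<.+>", "<b>x</b>")
lazy_tags = re.findall(r"<.+?>", "<b>x</b>")

print(greedy)       # ['aabab']
print(lazy)         # ['aab', 'ab']
print(greedy_tags)  # ['<b>x</b>']
print(lazy_tags)    # ['<b>', '</b>']
```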
What was left out?
Many elements of REs have been covered so far, but many others have not. The following table summarizes some of them. The number in the leftmost column refers to the example in Expresso that illustrates the element.
# Syntax Description
\a  Bell character
\b  Normally a word boundary, but inside a character group it means backspace
\t  Tab
\r  Carriage return
\v  Vertical tab
\f  Form feed
\n  New line
\e  Escape
\nnn  The character whose ASCII octal code is nnn
\xnn  The character whose hexadecimal code is nn
\unnnn  The character whose Unicode code is nnnn
\cN  The Control-N character; for example, Ctrl-M is \cM
\A  Beginning of the string (like ^, but not affected by the Multiline option)
\Z  End of the string (or just before a final newline)
\z  End of the string
\G  Beginning of the current search
\p{name}  Any character in the Unicode character class named name; for example, \p{Ll} matches lowercase letters
(?>exp)  Greedy subexpression, also called a non-backtracking subexpression. It is matched only once and then takes no part in backtracking.
(?<x-y>exp) or (?<-y>exp)  Balancing groups. Complicated but powerful: they let named capture groups be manipulated as a stack. (I do not fully understand these myself.)
(?im-nsx:exp)  Changes the RE options for the subexpression exp; for example, (?-i:Elvis) turns off the ignore-case option for Elvis
(?im-nsx)  Changes the RE options for the rest of the enclosing group
(?(exp)yes|no)  The subexpression exp is treated as a zero-width positive lookahead. If it matches at this point, the yes subexpression is matched next; otherwise the no subexpression is matched next.
(?(exp)yes)  Same as above, but without a no subexpression
(?(name)yes|no)  If the group named name has a valid capture, the yes subexpression is matched next; otherwise the no subexpression is matched next
47 (?(name)yes)  Same as above, but without a no subexpression
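A few of the elements in this table also exist in Python's re module, so they can be sketched there as a stand-in for .NET (an assumption of this example; other entries, such as balancing groups, have no counterpart in Python's re and are not shown). The sample strings and variable names are invented.

```python
import re

# \A anchors at the very start of the string even when a multiline flag is set.
caret = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)
start = re.findall(r"\A\w+", "one\ntwo", flags=re.MULTILINE)

# \xnn and \unnnn name characters by their hexadecimal code.
escapes_ok = re.fullmatch(r"\x41\u0042", "AB") is not None

# Scoped option: ignore case only inside the group.
scoped = re.findall(r"(?i:elvis) lives", "ELVIS lives, elvis LIVES")

# Conditional on a group: require ")" only if "(" was captured as group 1.
area = re.compile(r"(\()?\d{3}(?(1)\))")
balanced = [s for s in ["(080)", "080", "(080", "080)"] if area.fullmatch(s)]

print(caret)       # ['one', 'two']
print(start)       # ['one']
print(escapes_ok)  # True
print(scoped)      # ['ELVIS lives']
print(balanced)    # ['(080)', '080']
```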