Objective
Regular expressions (abbreviated RE below) have always been a bewildering area for me. I have seen experts on the Internet settle text-processing problems with a simple RE, and that made me want to learn RE myself. I am lazy by nature, though, and always hope for a quick way to learn, so I turned to Google, which led me to an article by Jim Hollenhorst. After reading it I thought it was really good, so I wrote up this short report to share with friends at move-to.net, hoping it will help you a little in learning RE. The address of Jim Hollenhorst's article is given below for anyone who wants to go straight to it.
The 30 Minute Regex Tutorial by Jim Hollenhorst
http://www.codeproject.com/useritems/RegexTutorial.asp
What is RE?
You have surely used the "*" wildcard when searching for files. To find all Word files in the Windows directory, for example, you might search for "*.doc", because "*" stands for any sequence of characters. RE does the same kind of thing, but it is far more powerful.
When writing programs, you often need to check whether a string matches a particular pattern. The most important job of an RE is to describe such a pattern, so you can think of an RE as the description of a pattern. For example, "\w+" stands for any non-empty string (non-null string) of letters and digits. The .NET Framework provides a very powerful class library that makes it easy to use RE to find and replace text, to decode complex headers, and to validate text.
The best way to learn RE is to work through examples yourself. Jim Hollenhorst also provides a tool, Expresso (think of a cup of coffee), to help us learn RE. Download URL: http://www.codeproject.com/useritems/RegexTutorial/ExpressoSetup2_1C.zip
Next, let's experience some examples.
A few simple examples
Suppose you want to find text in an article where "elvis" is followed by "alive". The RE might be developed in the following steps; the text in parentheses explains each RE:
1. elvis (find elvis)
The characters to look for, in order, are e, l, v, i, s. In .NET you can set an option to ignore case, so "Elvis", "ELVIS", and "eLvIs" all match RE 1. But because the RE only asks for these characters in this order, "pelvis" also matches RE 1. RE 2 improves on this.
2. \belvis\b (elvis as a whole word, such as Elvis or ELVIS when case is ignored)
"\b" has a special meaning in an RE. In the example above it stands for a word boundary, so \belvis\b uses \b to mark the boundaries before and after elvis; that is, it matches elvis only as a whole word.
Suppose you want to find elvis followed by alive on the same line. This takes two more special characters, "." and "*". "." stands for any character other than a newline, and "*" repeats the preceding item any number of times. So ".*" stands for any number of characters other than newlines. The RE that finds elvis followed by alive on the same line can therefore be written as RE 3.
3. \belvis\b.*\balive\b (find text where elvis is followed by alive, such as Elvis is alive)
A powerful RE can be built from simple special characters, but you will also notice that the more special characters are used, the harder the RE becomes to read.
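The three REs above can be tried out in a few lines of code. This is a minimal sketch that assumes Python's re module as a stand-in for the .NET classes the article talks about; the sample sentence is made up for illustration, and re.IGNORECASE plays the role of the ignore-case option.

```python
import re

# Made-up sample text for illustration.
text = "Elvis hurt his pelvis. Elvis is alive, they say."

# RE 1: a plain character sequence; it also hits the "elvis" inside "pelvis".
plain = re.findall(r"elvis", text, flags=re.IGNORECASE)

# RE 2: \b on both sides restricts the match to the whole word.
whole_word = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

# RE 3: "elvis" followed later on the same line by "alive".
followed = re.search(r"\belvis\b.*\balive\b", text, flags=re.IGNORECASE)

print(plain)              # ['Elvis', 'elvis', 'Elvis']
print(whole_word)         # ['Elvis', 'Elvis']
print(followed.group(0))  # Elvis hurt his pelvis. Elvis is alive
```

Note that RE 3 starts at the first "Elvis" it finds, so the match runs from there to "alive".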
Let's look at another example.
Matching a valid phone number
Suppose you want to collect seven-digit customer phone numbers from a Web page, in the format xxx-xxxx where x is a digit. The RE might be written like this.
4. \b\d\d\d-\d\d\d\d (find a seven-digit phone number, such as 123-1234)
Each \d stands for one digit, and "-" is an ordinary hyphen. To avoid repeating \d so many times, the RE can be rewritten as RE 5.
5. \b\d{3}-\d{4} (a better way to find a seven-digit phone number, such as 123-1234)
The {3} after \d means that the preceding item is repeated three times, so \d{3} is equivalent to \d\d\d.
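To see that RE 4 and RE 5 behave identically, here is a small sketch, again assuming Python's re module rather than .NET; the sample text is invented.

```python
import re

# Invented sample text: two well-formed numbers and one malformed one.
page = "Call 123-1234 or 555-0199, not 12-34567."

long_form = re.findall(r"\b\d\d\d-\d\d\d\d", page)   # RE 4
short_form = re.findall(r"\b\d{3}-\d{4}", page)      # RE 5

print(long_form)                 # ['123-1234', '555-0199']
print(long_form == short_form)   # True
```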
Expresso, a tool for learning and testing RE
Because REs are hard to read and easy to get wrong, Jim developed a tool, Expresso, to help users learn and test them. Besides the URL given above, it is also available from the Ultrapico website (http://www.Ultrapico.com). After installing Expresso, you will find the examples from Jim's article built into its Expression Library, so you can test them as you read. You can also modify the RE of any example and see the result immediately. I find this very handy, and you may want to give it a try.
Basic concepts of RE in .NET
Special characters
Some characters have a special meaning, such as the "\b", ".", "*", and "\d" seen earlier. "\s" stands for any whitespace character, such as a space, tab, or newline. "\w" stands for any letter or digit character.
Let's look at some more examples.
6. \ba\w*\b (find words beginning with a, such as able)
This RE describes the starting boundary of a word (\b), then the letter "a", then any number of alphanumeric characters (\w*), and finally the ending boundary of the word (\b).
7. \d+ (find a string of digits)
"+" is very similar to "*", except that "+" requires the preceding item to be repeated at least once. In other words, there must be at least one digit.
8. \b\w{6}\b (find words of six alphanumeric characters, such as ab123c)
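REs 6 to 8 can be checked against a sample sentence. The sketch below assumes Python's re module (not the .NET library), and the sentence is made up.

```python
import re

# Made-up sample sentence.
text = "an able archer scored ab123c with 42 arrows in 2024"

words_with_a = re.findall(r"\ba\w*\b", text)     # RE 6: words beginning with a
digit_runs = re.findall(r"\d+", text)            # RE 7: runs of one or more digits
six_chars = re.findall(r"\b\w{6}\b", text)       # RE 8: words of exactly six characters

print(words_with_a)  # ['an', 'able', 'archer', 'ab123c', 'arrows']
print(digit_runs)    # ['123', '42', '2024']
print(six_chars)     # ['archer', 'scored', 'ab123c', 'arrows']
```

RE 7 has no \b, so it also picks up the digits inside "ab123c".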
The following table lists the common special characters of RE.
.     Any character other than a newline
\w    Any alphanumeric character
\s    Any whitespace character
\d    Any digit
\b    A word boundary
^     The beginning of the text; for example, "^the" matches "the" only at the beginning of the text
$     The end of the text; for example, "end$" matches "end" only at the end of the text
The special characters "^" and "$" are used when the text being sought must be at the beginning or the end of the text. This is especially useful for verifying that input conforms to a certain pattern. For example, to validate a seven-digit phone number, you might use RE 9.
9. ^\d{3}-\d{4}$ (validate a seven-digit phone number)
This is the same as RE 5, except that no other characters may come before or after it; that is, the entire string must consist of the seven-digit phone number and nothing else. If you set the Multiline option in .NET, "^" and "$" are tested against each line, so a match only needs to cover one whole line rather than the entire text.
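The effect of the anchors and of the Multiline option can be sketched as follows. This assumes Python's re module, where re.MULTILINE is the counterpart of the .NET Multiline option; the helper name is_seven_digit_phone is hypothetical.

```python
import re

SEVEN_DIGIT = re.compile(r"^\d{3}-\d{4}$")  # RE 9

def is_seven_digit_phone(value: str) -> bool:
    """Return True only if the whole string is a number like 123-1234."""
    return SEVEN_DIGIT.search(value) is not None

print(is_seven_digit_phone("123-1234"))        # True
print(is_seven_digit_phone("call 123-1234"))   # False, text before the number
print(is_seven_digit_phone("123-12345"))       # False, one digit too many

# With the multiline option, ^ and $ are tested against every line.
lines = "123-1234\nabc\n555-0199"
per_line = re.findall(r"^\d{3}-\d{4}$", lines, flags=re.MULTILINE)
whole_text = re.findall(r"^\d{3}-\d{4}$", lines)

print(per_line)    # ['123-1234', '555-0199']
print(whole_text)  # []
```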
Escaped characters
Sometimes you need "^", "$", and the like in their plain literal meaning rather than as special characters. The "\" character removes the special meaning of the special character that follows it, so "\^", "\.", and "\\" stand for the literal "^", ".", and "\".
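A short sketch of escaping, assuming Python's re module; the sample strings are invented. The last line uses re.escape, a Python helper that adds the backslashes for you (it is not something the article itself covers).

```python
import re

# An unescaped "." matches any character; "\." matches only a literal dot.
any_char = re.findall(r"5.00", "5.00 5x00")
literal_dot = re.findall(r"5\.00", "5.00 5x00")

# Raw string, so the backslash in the path is a single literal character.
text = r"Price: $5.00 (was $6.50) ^_^ C:\temp"
prices = re.findall(r"\$\d+\.\d+", text)   # literal "$" and "."
carets = re.findall(r"\^", text)           # literal "^"
backslashes = re.findall(r"\\", text)      # literal "\"

print(any_char)      # ['5.00', '5x00']
print(literal_dot)   # ['5.00']
print(prices)        # ['$5.00', '$6.50']
print(len(carets), len(backslashes))  # 2 1

escaped = re.escape("1.5^2")  # builds the escaped pattern automatically
```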
Repetitions
"{3}" and "*" have been used to repeat the preceding character; later we will see how the same syntax repeats an entire subexpression. The following table lists the ways of repeating the preceding item.
*       Repeat any number of times
+       Repeat at least once
?       Repeat zero times or once
{n}     Repeat n times
{n,m}   Repeat at least n times, but not more than m times
{n,}    Repeat at least n times
Let's try some more examples.
10. \b\w{5,6}\b (find words of five or six alphanumeric characters, such as as25d, D58SDF, etc.)
11. \b\d{3}\s\d{3}-\d{4} (find a 10-digit phone number, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (find a social security number, such as 123-45-6789)
13. ^\w* (the first word of each line or of the whole text)
In Expresso you can try this RE with and without the Multiline option to see the difference.
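The four repetition examples above can be run against a small sample. This is a sketch assuming Python's re module in place of .NET and Expresso; the two-line sample text is made up.

```python
import re

# Made-up two-line sample text.
text = "as25d D58SDF toolongword\n800 123-1234 and 123-45-6789"

five_or_six = re.findall(r"\b\w{5,6}\b", text)
ten_digit = re.findall(r"\b\d{3}\s\d{3}-\d{4}", text)
ssn = re.findall(r"\d{3}-\d{2}-\d{4}", text)

# ^\w* with and without the multiline option.
first_words_multiline = re.findall(r"^\w*", text, flags=re.MULTILINE)
first_word_only = re.findall(r"^\w*", text)

print(five_or_six)            # ['as25d', 'D58SDF']
print(ten_digit)              # ['800 123-1234']
print(ssn)                    # ['123-45-6789']
print(first_words_multiline)  # ['as25d', '800']
print(first_word_only)        # ['as25d']
```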
Match a range of characters
Sometimes you need to find one of a specific set of characters. This is where square brackets "[]" come in handy. [aeiou] looks for any of the vowels "a", "e", "i", "o", "u", and [.?!] looks for the symbols ".", "?", "!". Special characters such as "." and "?" lose their special meaning inside the brackets and are interpreted literally. You can also specify a range of characters, such as [a-z0-9], which stands for any lowercase letter or any digit.
Next, look at the first fairly complex RE, one that searches for phone numbers.
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (find a 10-digit phone number, such as (080) 333-1234)
This RE finds phone numbers in more than one format, such as (080) 123-4567 or 511 254 6654. "\(?" stands for zero or one left parenthesis "(", "[) ]" stands for a right parenthesis ")" or a space, and "\s?" stands for zero or one whitespace character. However, this RE also finds a number such as "800) 123-4567", where the parentheses are unbalanced. Alternatives, covered below, solve this problem.
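The character classes and the weakness of the phone-number RE can be demonstrated with a short sketch. It assumes Python's re module, and the sample strings are invented for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
punctuation = re.findall(r"[.?!]", "Hi! Really? Yes.")

# The 10-digit phone RE from the text above.
phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")
samples = ["(080) 333-1234", "511 254 6654", "800) 123-4567", "080-333-1234"]
accepted = [s for s in samples if phone.fullmatch(s)]

print(vowels)       # ['e', 'u', 'a']
print(punctuation)  # ['!', '?', '.']
print(accepted)     # ['(080) 333-1234', '511 254 6654', '800) 123-4567']
```

The third sample is accepted even though its parentheses are unbalanced, which is the problem described above.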
Characters not in a particular set (negation)
Sometimes you need to find characters that are not in a particular set. The following table shows how to describe this.
\W         Any character that is not alphanumeric
\S         Any character that is not whitespace
\D         Any character that is not a digit
\B         A position that is not a word boundary
[^x]       Any character that is not x
[^aeiou]   Any character that is not a, e, i, o, or u
15. \S+ (a string that contains no whitespace)
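A brief sketch of the negated forms, assuming Python's re module; the sample strings are made up.

```python
import re

# \S+ : runs of characters that contain no whitespace (tab and spaces split the text).
tokens = re.findall(r"\S+", "tab\tseparated and spaced words 42")

# \D+ : runs of non-digits; \W : single non-alphanumeric characters.
non_digits = re.findall(r"\D+", "a1b22c")
non_word = re.findall(r"\W", "a-b c!")

# [^aeiou] : delete every character that is not a vowel.
only_vowels = re.sub(r"[^aeiou]", "", "regular")

print(tokens)       # ['tab', 'separated', 'and', 'spaced', 'words', '42']
print(non_digits)   # ['a', 'b', 'c']
print(non_word)     # ['-', ' ', '!']
print(only_vowels)  # eua
```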
Alternatives
Sometimes you need to look for one of several specific alternatives. This is where the special character "|" comes in handy, for example to find both five-digit and nine-digit (with a "-") ZIP codes.
16. \b\d{5}-\d{4}\b|\b\d{5}\b (find five-digit and nine-digit (with "-") ZIP codes)
When using alternatives, pay attention to their order, because the RE tries the alternatives from left to right and takes the first one that matches. If the five-digit alternative came first in RE 16, the RE would only ever find five-digit ZIP codes. With alternatives, RE 14 (the 10-digit phone number) can be improved.
17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (10-digit phone number)
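The effect of the order of alternatives, and the improved phone-number RE, can be sketched like this. Python's re module is assumed in place of .NET, and the sample strings are invented.

```python
import re

text = "ZIP 12345-6789 or 54321"

# Nine-digit alternative first: both forms are found.
long_first = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)
# Five-digit alternative first: the nine-digit ZIP is cut short.
short_first = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

print(long_first)   # ['12345-6789', '54321']
print(short_first)  # ['12345', '54321']

# Improved phone RE: the area code is either "(ddd)" or "ddd", never half of each.
phone = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
samples = ["(080) 333-1234", "080 333-1234", "800) 123-4567"]
accepted = [s for s in samples if phone.fullmatch(s)]

print(accepted)     # ['(080) 333-1234', '080 333-1234']
```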
Grouping
Parentheses can be used to delimit a subexpression, so that the subexpression can be repeated or otherwise handled as a unit.
18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
The first part of this RE, (\d{1,3}\.){3}, means a number of at least one and at most three digits followed by a "." symbol, with this combination occurring three times. It is followed by one more number of one to three digits, giving something like 192.72.28.1.
This has a drawback. Each number in an IP address can be at most 255, but the RE above accepts any number of up to three digits. What is needed is a check that each number is less than 256, and RE alone cannot do arithmetic comparisons. RE 19 uses alternatives instead to limit each number to the required range of 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (find an IP address)
Have you noticed that REs look more and more like an alien language? Even for something as simple as finding an IP address, the RE is hard to understand just by reading it.
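The difference between the simple and the range-limited IP address REs shows up as soon as they are given out-of-range numbers. A minimal sketch, assuming Python's re module; the names loose and strict and the sample addresses are made up.

```python
import re

# Simple form: any one- to three-digit numbers separated by dots.
loose = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
# Range-limited form: each number must lie between 0 and 255.
strict = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

addresses = ["192.72.28.1", "255.255.255.0", "256.1.1.1", "999.999.999.999"]

loose_ok = [a for a in addresses if loose.fullmatch(a)]
strict_ok = [a for a in addresses if strict.fullmatch(a)]

print(loose_ok)   # all four addresses
print(strict_ok)  # ['192.72.28.1', '255.255.255.0']
```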
Expresso Analyzer View
Expresso can break an RE down into a tree-structured description, with each group described separately, which provides a good debugging environment. Other features, such as Partial Match (match only the highlighted part of the RE) and Exclude Match (match without the highlighted part of the RE), are left for you to try.
When a subexpression is grouped with parentheses, the text that matches it can be used in later processing by the program or within the RE itself. By default, the captured groups are named with numbers, starting from 1 and running from left to right. These automatic group names can be seen in Expresso's skeleton view or results view.
A backreference is used to find a repeat of the text captured by a group. For example, "\1" refers to the text captured by group 1.
20. \b(\w+)\b\s*\1\b (find repeated words, that is, the same word twice with whitespace in between, such as dog dog)
(\w+) captures a word of at least one letter or digit and names it group 1. The RE then looks for any amount of whitespace, followed by the same text as group 1.
If you do not like the automatic name 1, you can name the group yourself. In the example above, (\w+) is rewritten as (?<Word>\w+), which names the captured group Word, and the backreference is rewritten as \k<Word>.
21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a named group to capture repeated words)
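Backreferences can be tried with the sketch below. It assumes Python's re module, which differs from .NET on one point: named groups are written (?P<Word>...) and referenced with (?P=Word), and the .NET forms (?<Word>...) and \k<Word> are not accepted. The sample sentence is made up.

```python
import re

text = "the dog dog barked at the the cat"

# Numbered backreference: \1 repeats whatever group 1 captured.
repeated = re.findall(r"\b(\w+)\b\s*\1\b", text)

# Named group, written in Python's syntax rather than the .NET syntax.
named = re.search(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b", text)

print(repeated)             # ['dog', 'the']
print(named.group("Word"))  # dog
print(named.group(0))       # dog dog
```

findall returns the content of group 1 here, so each repeated word appears once in the list.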
Parentheses are used in a number of special syntax elements. The more common ones are listed below:
Captures
(exp)          Match exp and capture it in an automatically numbered group
(?<name>exp)   Match exp and capture it in a group named name
(?:exp)        Match exp, but do not capture it
Lookarounds
(?=exp)        Match a position that is followed by the suffix exp
(?<=exp)       Match a position that is preceded by the prefix exp
(?!exp)        Match a position that is not followed by the suffix exp
(?<!exp)       Match a position that is not preceded by the prefix exp
Comment
(?#comment)    Comment
Positive Lookaround
Next come the lookahead and lookbehind assertions. They look for text that comes before or after the current match, without including it in the match itself. Like the special characters "^" and "\b", they do not match any text themselves (they define a position), and are therefore called zero-width assertions. Some examples will make this clearer.
(?=exp) is a "zero-width positive lookahead assertion". It matches text that is followed by the suffix exp, without including exp itself.
22. \b\w+(?=ing\b) (the beginning of a word that ends with ing; in "filling", for example, it matches "fill")
(?<=exp) is a "zero-width positive lookbehind assertion". It matches text that is preceded by the prefix exp, without including exp itself.
23. (?<=\bre)\w+\b (the rest of a word that starts with re; in "repeated", for example, it matches "peated")
24. (?<=\d)\d{3}\b (three digits at the end of a word, preceded by a digit)
25. (?<=\s)\w+(?=\s) (an alphanumeric string with whitespace on both sides)
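The four positive lookaround examples above can be run as written. The sketch assumes Python's re module; it accepts these particular REs because each lookbehind has a fixed width, which Python requires. The sample strings are invented.

```python
import re

# Lookahead: the part of a word in front of a final "ing".
stems = re.findall(r"\b\w+(?=ing\b)", "filling and singing")

# Lookbehind: the part of a word after a leading "re".
tails = re.findall(r"(?<=\bre)\w+\b", "repeated reads")

# Three digits at the end of a word, preceded by another digit.
last_three = re.findall(r"(?<=\d)\d{3}\b", "1234 567 89012")

# A word with whitespace on both sides; the whitespace is not part of the match.
inner = re.findall(r"(?<=\s)\w+(?=\s)", "one two three")

print(stems)       # ['fill', 'sing']
print(tails)       # ['peated', 'ads']
print(last_three)  # ['234', '012']
print(inner)       # ['two']
```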
Negative Lookaround
Earlier we saw how to find a character that is not a specific character or is not in a particular set. But what if you only want to verify that a character is absent, without matching anything in its place? For example, suppose you want to find words that contain a q that is not followed by u. You might try the following RE.
26. \b\w*q[^u]\w*\b (a word containing a q that is not followed by u)
This RE has a problem. [^u] must match a character, so if q is the last letter of a word, [^u] matches the whitespace after it, and the RE can end up matching two words, as in the text "Iraq haha". A negative lookaround solves this problem.
27. \b\w*q(?!u)\w*\b (a word containing a q that is not followed by u)
This is a "zero-width negative lookahead assertion".
28. \d{3}(?!\d) (three digits that are not followed by another digit)
Similarly, you can use (?<!exp), the "zero-width negative lookbehind assertion", to match text that is not preceded by the prefix exp.
29. (?<![a-z ])\w{7} (seven alphanumeric characters that are not preceded by a letter or a space)
30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
This uses lookbehind and lookahead assertions to extract the text between HTML tags without including the tags themselves.
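The negative lookaround examples can be compared side by side in a short sketch. Python's re module is assumed here; the HTML-tag RE is left out because Python does not allow a variable-width lookbehind such as (?<=<(\w+)>). The sample strings are made up.

```python
import re

text = "Iraq haha quit"

# With [^u], the class consumes the space and the match runs into the next word.
with_class = re.findall(r"\b\w*q[^u]\w*\b", text)
# With (?!u), nothing is consumed, so only the word itself is matched.
with_lookahead = re.findall(r"\b\w*q(?!u)\w*\b", text)

# Three digits not followed by another digit.
not_followed = re.findall(r"\d{3}(?!\d)", "1234 567")

# Seven alphanumerics not preceded by a lowercase letter or a space.
not_preceded = re.findall(r"(?<![a-z ])\w{7}", "1234567 abcdefgh -ABCDEFG")

print(with_class)      # ['Iraq haha']
print(with_lookahead)  # ['Iraq']
print(not_followed)    # ['234', '567']
print(not_preceded)    # ['1234567', 'ABCDEFG']
```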
Comments
Another special use of parentheses is to enclose a comment, with the syntax "(?#comment)". If you set the "Ignore Pattern Whitespace" option, whitespace in the RE is ignored when the RE is used. With this option set, the text from "#" to the end of the line is also ignored.
31. Text between HTML tags, with comments
(?<=      # look for a prefix, but do not include it
<(\w+)>   # an HTML tag
)         # end of the prefix
.*        # match any text
(?=       # look for a suffix, but do not include it
<\/\1>    # match the string captured in group 1, that is, the tag name from the opening tag
)         # end of the suffix
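A commented RE can be sketched in code as well. This assumes Python's re module, where re.VERBOSE corresponds to the "Ignore Pattern Whitespace" option. Because Python rejects the variable-width lookbehind used above, the sketch is an adaptation that captures the text in a second group instead; it is not the author's RE. The sample HTML is invented.

```python
import re

# Adapted from the commented RE above: capturing groups replace the lookarounds.
tag_text = re.compile(
    r"""
    <(\w+)>    # an opening HTML tag; the tag name becomes group 1
    (.*)       # any text, captured as group 2
    </\1>      # the closing tag, whose name must repeat group 1
    """,
    re.VERBOSE,
)

match = tag_text.search("<b>bold text</b> and <i>italic</i>")
print(match.group(1))  # b
print(match.group(2))  # bold text

# Inline (?#comment) groups work without the verbose option.
inline = re.fullmatch(r"\d{3}(?#prefix)-\d{4}(?#line number)", "123-1234")
print(inline is not None)  # True
```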
Greedy and Lazy
When an RE contains a repetition that accepts a range of counts (such as ".*"), it normally matches as many characters as possible. This is called greedy matching. For example:
32. a.*b (the longest string that starts with a and ends with b)
For the string "aabab", this RE matches "aabab", because it looks for the longest possible match. Sometimes you want the shortest possible match instead, which is called lazy matching. Any item in the repetition table above can be made lazy by adding a question mark (?) after it. So "*?" means repeat any number of times, but as few times as possible. For example:
33. a.*?b (the shortest string that starts with a and ends with b)
For the string "aabab", this RE first matches "aab" and then "ab", because it looks for the shortest possible match.
*?       Repeat any number of times, but as few times as possible
+?       Repeat at least once, but as few times as possible
??       Repeat zero times or once, but as few times as possible
{n,m}?   Repeat at least n times, but not more than m times, and as few times as possible
{n,}?    Repeat at least n times, but as few times as possible
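Greedy and lazy matching on the string "aabab" can be checked directly. A minimal sketch, assuming Python's re module; the digit example at the end is an extra illustration that is not in the article.

```python
import re

greedy = re.findall(r"a.*b", "aabab")   # as many characters as possible
lazy = re.findall(r"a.*?b", "aabab")    # as few characters as possible

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']

# The same applies to counted repetitions (extra illustration).
greedy_digits = re.match(r"\d{2,4}", "12345").group(0)
lazy_digits = re.match(r"\d{2,4}?", "12345").group(0)

print(greedy_digits)  # 1234
print(lazy_digits)    # 12
```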
What else has not been covered?
Many elements for building an RE have been covered so far, and of course many others have not. The following table collects some of them. Where a number appears in the leftmost column, it is the number of the corresponding example in Expresso.
#    Syntax                       Description
     \a                           Bell character
     \b                           Normally a word boundary, but inside a character class it means backspace
     \t                           Tab
     \r                           Carriage return
     \v                           Vertical tab
     \f                           Form feed
     \n                           New line
     \e                           Escape
     \nnn                         The character whose ASCII octal code is nnn
     \xnn                         The character whose hexadecimal code is nn
     \unnnn                       The character whose Unicode code is nnnn
     \cN                          The Control-N character; for example, Ctrl-M is \cM
     \A                           Beginning of the string (like ^ but not affected by the Multiline option)
     \Z                           End of the string, or just before a newline at the end of the string
     \z                           End of the string
     \G                           Beginning of the current search
     \p{name}                     Any character in the Unicode class name; for example, \p{Ll} stands for lowercase letters
     (?>exp)                      Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and does not take part in backtracking.
     (?<x-y>exp) or (?<-y>exp)    Balancing group. Complicated but useful. It allows named capture groups to be manipulated on a stack. (I do not fully understand this one myself.)
     (?im-nsx:exp)                Change the RE options for the subexpression exp; for example, (?-i:Elvis) turns off the ignore-case option for Elvis
     (?im-nsx)                    Change the RE options for the rest of the enclosing group
     (?(exp)yes|no)               The subexpression exp is treated as a zero-width positive lookahead. If it matches at this point, yes is the next subexpression to match; otherwise no is.
     (?(exp)yes)                  Same as above, but with no "no" subexpression
     (?(name)yes|no)              If the group called name has captured a match, yes is the next subexpression to match; otherwise no is.
47   (?(name)yes)                 Same as above, but with no "no" subexpression
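A few of these elements can be sketched in code. This assumes Python's re module, which shares the scoped-option syntax and \A with .NET, and which writes the conditional on a named group with Python's own (?P<name>...) form. The group name open_paren and the sample strings are hypothetical, and elements whose behaviour differs between the two libraries (such as \Z and balancing groups) are left out.

```python
import re

# Scoped option: (?-i:...) turns off ignore-case for one subexpression only.
scoped = re.findall(
    r"(?-i:elvis) lives", "ELVIS lives, elvis LIVES", flags=re.IGNORECASE
)
print(scoped)  # ['elvis LIVES']

# \A matches only at the start of the string, even with the multiline option.
starts = re.findall(r"\A\w+", "one\ntwo", flags=re.MULTILINE)
line_starts = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)
print(starts)       # ['one']
print(line_starts)  # ['one', 'two']

# Conditional on a group: require ")" only if "(" was captured earlier.
area_code = re.compile(r"(?P<open_paren>\()?\d{3}(?(open_paren)\))")
samples = ["(080)", "080", "(080", "080)"]
balanced = [s for s in samples if area_code.fullmatch(s)]
print(balanced)  # ['(080)', '080']
```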
Conclusion
After this series of examples, and with the help of Expresso, I believe you now have a basic understanding of RE. There are of course many articles on the Web; if you are interested, http://www.codeproject.com has many articles about RE. If you prefer books, Jeffrey Friedl's Mastering Regular Expressions is recommended by many people (I have not read it myself). I hope this report shortens the learning curve for those interested in RE. This is my first contact with RE, so if anything in the article is wrong or poorly explained, please bear with me and e-mail me the corrections. I will be very grateful.