Preface
Regular expressions (abbreviated "RE" below) have always been a mystery to me. I kept seeing experts online solve text-processing problems with nothing more than an RE, which made me want to learn them. Being naturally a little lazy, though, I first looked for a quick way in. A Google search turned up an article by Jim Hollenhorst. After reading it I thought it was excellent, so I wrote up these notes to share with friends at Move-to.Net, in the hope that they help you a little in learning RE. The title and URL of Jim Hollenhorst's article are given below, so you can go straight to the original if you prefer.
The 30 Minute Regex Tutorial, by Jim Hollenhorst
http://www.codeproject.com/useritems/RegexTutorial.asp
What is an RE?
You have probably used the wildcard character "*" when searching for files. For example, to find all the Word files in the Windows directory you might search for "*.doc", because "*" stands for any sequence of characters. An RE does the same kind of thing, but is far more powerful.
When writing programs, an RE is mainly used to describe a specific pattern of text, so you can think of an RE as a pattern description. For example, "\w+" represents a non-empty string consisting of letters or digits. The .NET Framework provides a very powerful class library that makes it easy to use REs to search and replace text, decode complex headers, and validate text.
The best way to learn RE is to work through examples. Jim Hollenhorst also provides a tool, Expresso (as in a cup of coffee), to help us learn RE. The download URL is http://www.codeproject.com/useritems/regextutorial/expressosetup2_1c.zip.
Next, let's try some examples.
Some simple examples
Suppose you want to search an article for "elvis" followed by "alive". Using RE, you might go through the following steps. The text in parentheses explains what each RE means:
1. elvis (find elvis)
This says that the sequence of characters to search for is "elvis". In .NET you can set an option to ignore case, so "Elvis", "ELVIS", and "eLvIs" all match RE 1. But because the RE only requires the characters to appear in the order e-l-v-i-s, "pelvis" also matches RE 1. RE 2 improves on this.
2. \belvis\b (find elvis as a whole word, such as elvis or ELVIS when case is ignored)
"\b" has a special meaning in RE: it matches a word boundary. In \belvis\b, the \b at each end marks the start and end boundary of the word, so only the whole word elvis matches.
Now suppose that elvis must be followed by alive on the same line. This takes two more special characters, "." and "*". "." matches any character except a line break, and "*" means that the preceding item is repeated any number of times (including zero). So ".*" matches any number of characters other than line breaks. Searching for elvis followed by alive on the same line can be written as RE 3.
3. \belvis\b.*\balive\b (find elvis followed by alive, such as "elvis is alive")
Simple special characters can be combined into a powerful RE, but you will also find that the more special characters you use, the harder the RE becomes to read.
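REs 1 to 3 can be tried outside Expresso as well. The following is a minimal sketch that assumes Python's built-in re module as a stand-in for the .NET classes (the syntax of these basic elements is the same); the sample sentence is made up for illustration, and re.IGNORECASE plays the role of the .NET ignore-case option.

```python
import re

# Hypothetical sample text, chosen so that "pelvis" is present.
text = "Elvis has left the pelvis exam. elvis is alive and well."

# RE 1: a plain character sequence; it also matches inside "pelvis".
plain = re.findall(r"elvis", text, flags=re.IGNORECASE)

# RE 2: \b on both sides restricts the match to the whole word.
whole_word = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

# RE 3: elvis followed, later on the same line, by alive.
followed = re.search(r"\belvis\b.*\balive\b", text, flags=re.IGNORECASE)

print(plain)       # three hits, one of them inside "pelvis"
print(whole_word)  # two hits, whole words only
print(followed.group(0) if followed else None)
```

Note how RE 3 starts at the first whole-word "Elvis" and runs up to "alive", because ".*" takes everything in between.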
Let's look at another example.
Finding valid phone numbers
Suppose you want to collect seven-digit phone numbers in the format xxx-xxxx from a web page, where x is a digit. The RE might look like this.
4. \b\d\d\d-\d\d\d\d (find a seven-digit phone number, such as 123-1234)
Each \d represents one digit, and "-" is an ordinary hyphen. To avoid repeating \d so many times, the RE can be rewritten as RE 5.
5. \b\d{3}-\d{4} (a better way to find a seven-digit phone number, such as 123-1234)
The {3} after \d means that the preceding item is repeated three times, so \d{3} is equivalent to \d\d\d.
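To check that REs 4 and 5 really are equivalent, here is a small sketch, again assuming Python's re module in place of .NET. The sample text is invented for the example.

```python
import re

# Hypothetical page text containing two phone numbers and one near miss.
page = "Call 555-1234 or 123-4567; the code 12-34567 is not a phone number."

# RE 4: every digit spelled out with \d.
long_form = re.findall(r"\b\d\d\d-\d\d\d\d", page)

# RE 5: the same pattern written with repetition counts.
short_form = re.findall(r"\b\d{3}-\d{4}", page)

print(long_form)
print(short_form)
```

Both calls return the same two numbers, and "12-34567" is skipped because it does not have three digits before the hyphen.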
Expresso, a tool for learning and testing REs
Because REs are hard to read and easy to get wrong, Jim developed a tool called Expresso to help users learn and test them. Besides the URL given above, you can also get it from the Ultrapico website (http://www.Ultrapico.com). After installing Expresso, you will find that Jim has built many of the article's examples into the Expression Library. You can read the article and test as you go, or modify an example's RE and see the result immediately. I find it very useful, and you may want to try it.
Basic concepts of RE in .NET
Special characters
Some characters have special meanings, such as the "\b", ".", "*", and "\d" seen above. "\s" represents any whitespace character, such as a space, tab, or newline. "\w" represents any letter or digit.
Let's look at some examples.
6. \ba\w*\b (find words starting with a, such as able)
This RE looks for the start boundary of a word (\b), then the letter "a", then any number of letters and digits (\w*), and finally the end boundary of the word (\b).
7. \d+ (find strings of digits)
"+" is very similar to "*", except that "+" requires the preceding item to be repeated at least once. In other words, there must be at least one digit.
8. \b\w{6}\b (find six-character words of letters and digits, such as ab123c)
The following table lists the special characters most commonly used in REs.
.     Any character except a line break
\w    Any letter or digit
\s    Any whitespace character
\d    Any digit
\b    A word boundary
^     The start of the text; for example, "^The" matches "The" only at the start of the text
$     The end of the text; for example, "End$" matches "End" only at the end of the text
The special characters "^" and "$" are used when the text you are looking for must be at the beginning or end of the text. They are especially useful for validating that input matches a given pattern. For example, to validate a seven-digit phone number, you might use RE 9.
9. ^\d{3}-\d{4}$ (validate a seven-digit phone number)
This is the same as RE 5, except that no other characters may come before or after it; the entire string must consist of just the seven-digit phone number. If the Multiline option is set in .NET, "^" and "$" match at the start and end of each line rather than of the whole text, so a match only has to begin and end on a single line.
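The difference between validating a whole string and matching line by line can be sketched as follows. This assumes Python's re module, where the re.MULTILINE flag corresponds to the .NET Multiline option; the helper name is_phone and the sample inputs are hypothetical.

```python
import re

# RE 9: the whole string must be a seven-digit phone number.
validator = re.compile(r"^\d{3}-\d{4}$")

def is_phone(s: str) -> bool:
    """Return True when s consists of a seven-digit phone number only."""
    return validator.search(s) is not None

print(is_phone("123-4567"))       # a bare phone number
print(is_phone("call 123-4567"))  # extra text in front
print(is_phone("123-45678"))      # one digit too many

# With the multiline flag, ^ and $ are tested against every line.
lines = "123-4567\nnot a number\n765-4321"
per_line = re.findall(r"^\d{3}-\d{4}$", lines, flags=re.MULTILINE)
whole_text = re.findall(r"^\d{3}-\d{4}$", lines)
print(per_line)
print(whole_text)
```

Without the flag the three-line text as a whole is not a phone number, so nothing is found; with it, the two lines that are phone numbers are returned.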
Escaped characters
Sometimes you need the literal character rather than its special meaning. In that case, put a "\" in front of the character to remove its special meaning, so "\^", "\.", and "\\" match the literal characters "^", ".", and "\".
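A short sketch of what escaping changes, assuming Python's re module. The sample string is made up, and re.escape is a Python helper (not something the original article mentions) that adds the backslashes for you.

```python
import re

# Hypothetical text with a real decimal point and a look-alike.
price = "Total: 3.50 (was 3x50)"

# Unescaped "." matches any character, so "3x50" is found too.
any_char = re.findall(r"3.50", price)

# Escaped "\." matches only a literal dot.
literal_dot = re.findall(r"3\.50", price)

# re.escape puts a backslash in front of each special character.
escaped = re.escape("^.\\")

print(any_char)
print(literal_dot)
print(escaped)
```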
Repetitions
You have already seen "{3}" and "*" used to repeat the preceding character. Later you will see how to repeat an entire subexpression with the same syntax. The following table lists the ways to repeat the preceding item.
*       Repeat any number of times
+       Repeat one or more times
?       Repeat zero or one time
{n}     Repeat exactly n times
{n,m}   Repeat at least n times, but no more than m times
{n,}    Repeat at least n times
Let's try some examples.
10. \b\w{5,6}\b (find words of five or six letters and digits, such as as25d and d58sdf)
11. \b\d{3}\s\d{3}-\d{4} (find a ten-digit phone number, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (find a social security number, such as 123-45-6789)
13. ^\w* (the first word of each line or of the entire text)
In Expresso, try RE 13 with and without the Multiline option to see the difference.
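If Expresso is not at hand, REs 10 to 13 can be tried with the sketch below. It assumes Python's re module, whose repetition syntax matches the table above, and uses invented sample strings.

```python
import re

# Hypothetical sample containing one example for each RE.
sample = "as25d d58sdf toolongword 800 123-1234 and 123-45-6789"

five_or_six = re.findall(r"\b\w{5,6}\b", sample)        # RE 10
ten_digit = re.findall(r"\b\d{3}\s\d{3}-\d{4}", sample)  # RE 11
ssn = re.findall(r"\d{3}-\d{2}-\d{4}", sample)          # RE 12

print(five_or_six)
print(ten_digit)
print(ssn)

# RE 13 with and without the multiline flag.
poem = "first line\nsecond line"
every_line = re.findall(r"^\w*", poem, flags=re.MULTILINE)
text_start = re.findall(r"^\w*", poem)
print(every_line)
print(text_start)
```

With the multiline flag RE 13 returns the first word of each line; without it, only the first word of the whole text.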
Matching characters in a specified set
How do you match one of a specific set of characters? This is where square brackets "[]" come in handy. [aeiou] matches any one of the vowels "a", "e", "i", "o", and "u", and [.?!] matches ".", "?", or "!". Inside brackets, special characters lose their special meaning and are read literally. You can also specify a range of characters: "[a-z0-9]" matches any lowercase letter or any digit.
Next, let's look at a more complicated RE for finding phone numbers.
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (find a ten-digit phone number, for example (080) 333-1234)
This RE finds phone numbers in several formats, such as (080) 123-4567 and 511 254 6654. "\(?" matches zero or one left parenthesis "(", "[) ]" matches a right parenthesis ")" or a space, and "\s?" matches zero or one whitespace character. However, this RE also matches a phone number such as "800) 345-3321", in which the parentheses are not balanced. You will learn later how to use alternatives to solve this problem.
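The behaviour of RE 14, including its weakness with unbalanced parentheses, can be seen in this sketch. It assumes Python's re module; the candidate strings are hypothetical, and fullmatch is used so that each candidate must match in its entirety.

```python
import re

# A character class matches any single character listed in the brackets.
vowels = re.findall(r"[aeiou]", "regular")
print(vowels)

# RE 14: optional "(", three digits, ")" or space, optional whitespace,
# three digits, "-" or space, four digits.
phone_14 = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")

candidates = [
    "(080) 123-4567",
    "511 254 6654",
    "800) 345-3321",  # unbalanced parenthesis, yet still accepted
    "080-123-4567",   # rejected: "-" is not ")" or a space
]
matched = [c for c in candidates if phone_14.fullmatch(c)]
print(matched)
```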
Negation
Sometimes you need to find characters that are NOT in a specific character class. The following table shows how to describe this.
\W         Any character that is not a letter or digit
\S         Any character that is not whitespace
\D         Any character that is not a digit
\B         A position that is not a word boundary
[^x]       Any character other than x
[^aeiou]   Any character other than a, e, i, o, u
15. \S+ (a string that contains no whitespace characters)
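Each row of the negation table can be tried with a one-line call. This is a sketch assuming Python's re module, with made-up input strings.

```python
import re

# RE 15: runs of non-whitespace characters.
words = re.findall(r"\S+", "one  two\tthree")

# \D+: runs of characters that are not digits.
letters = re.findall(r"\D+", "a1b22c")

# [^aeiou]: single characters other than the listed vowels.
consonants = re.findall(r"[^aeiou]", "regex")

# \B: "cat" only where it does NOT start at a word boundary.
inner_cat = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]

print(words)
print(letters)
print(consonants)
print(inner_cat)
```

The last call finds only the "cat" inside "concat" (at index 7), because the standalone word "cat" begins at a word boundary.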
Alternatives
Sometimes you need to match one of several alternatives. This is where the special character "|" comes in handy. For example, to find zip codes of either five digits or nine digits (with a hyphen):
16. \b\d{5}-\d{4}\b|\b\d{5}\b (find five-digit and nine-digit (with "-") zip codes)
When using alternatives, pay attention to their order, because the RE tries the alternatives from left to right and takes the first one that matches. In RE 16, if you put the five-digit alternative first, the RE will only ever find five-digit zip codes. Now that you know about alternatives, you can improve RE 14.
17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (ten-digit phone number)
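The effect of the order of alternatives, and the improvement RE 17 makes over RE 14, can be sketched like this. Python's re module is assumed, and the sample text and candidate list are invented.

```python
import re

text = "Send it to 12345-6789 or 54321."

# RE 16: the nine-digit alternative comes first.
nine_first = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)

# Same alternatives in the opposite order: the five-digit one always wins.
five_first = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

print(nine_first)
print(five_first)

# RE 17: either "(ddd)" or "ddd", so parentheses must be balanced.
phone_17 = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
candidates = ["(080) 123-4567", "511 254 6654", "800) 345-3321"]
balanced = [c for c in candidates if phone_17.fullmatch(c)]
print(balanced)
```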
Grouping
Parentheses can be used to delimit a subexpression. Once delimited, the subexpression can be repeated or otherwise treated as a unit.
18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
The first part of this RE, (\d{1,3}\.){3}, means a number of one to three digits followed by the "." character, repeated three times. It is followed by one more number of one to three digits, so the RE matches addresses such as 192.72.28.1.
This has a drawback, however: each number in an IP address can be at most 255, but the RE above matches any number of one to three digits. We need the numbers to be less than 256, yet an RE by itself cannot do a numeric comparison. RE 19 uses alternatives to limit each number to the required range, 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (find an IP address)
Have you noticed that REs look more and more like an alien language? Even for something as simple as finding an IP address, the RE on its own is quite hard to make sense of.
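To compare REs 18 and 19 on concrete input, here is a small sketch assuming Python's re module. The address list is hypothetical, and fullmatch is used so that the whole string must be an address.

```python
import re

# RE 18: any four groups of one to three digits.
simple_ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# RE 19: each group limited to 0..255 by alternatives.
strict_ip = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

addresses = ["192.72.28.1", "999.1.1.1", "255.255.255.255", "256.0.0.1"]

simple_ok = [a for a in addresses if simple_ip.fullmatch(a)]
strict_ok = [a for a in addresses if strict_ip.fullmatch(a)]

print(simple_ok)  # all four pass the simple RE
print(strict_ok)  # only the addresses whose parts are at most 255
```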
Expresso Analyzer View
Expresso provides a feature that converts the RE you enter into a tree-structured description, broken down group by group, which makes a good debugging environment. Other features, such as Partial Match (search using only the highlighted part of the RE) and Exclude Match (search using everything except the highlighted part of the RE), are left for you to try.
When a subexpression is grouped in parentheses, the text it matches can be used in later program processing or within the RE itself. By default, each group is automatically named with a number, starting from 1 and counting from left to right. You can see these automatic group names in the Skeleton View or Results View in Expresso.
A backreference is used to find text identical to that captured by a group. For example, "\1" refers to the text captured by group 1.
20. \b(\w+)\b\s*\1\b (find repeated words, meaning the same word twice with whitespace in between, such as "dog dog")
(\w+) captures a word of at least one letter or digit and names it group 1; the RE then matches any whitespace, followed by the same text as group 1.
If you do not like the automatic group name 1, you can name the group yourself. In the example above, (\w+) is rewritten as (?<Word>\w+), which names the captured group Word, and the backreference is rewritten as \k<Word>.
21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a named group to capture repeated words)
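The repeated-word REs can be sketched in Python as well, with one caveat: Python's re module writes named groups as (?P<Word>...) and named backreferences as (?P=Word), not the .NET forms (?<Word>...) and \k<Word> used in RE 21. The sample sentence is made up.

```python
import re

text = "Paris in the the spring, and a dog dog."

# RE 20: numbered group and the backreference \1.
# findall returns the contents of group 1 for each match.
numbered = re.findall(r"\b(\w+)\b\s*\1\b", text)

# RE 21 in Python spelling: (?P<Word>...) and (?P=Word).
named_re = re.compile(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b")
named = [m.group("Word") for m in named_re.finditer(text)]

print(numbered)
print(named)
```

Both versions report the same repeated words; the named form only changes how the group is referred to.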
Parentheses are used in many special syntax elements. The most common are listed below:
Captures
(exp)           Match exp and capture it in an automatically numbered group
(?<name>exp)    Match exp and capture it in a group named name
(?:exp)         Match exp but do not capture it
Lookarounds
(?=exp)         Match a position that is followed by exp
(?<=exp)        Match a position that is preceded by exp
(?!exp)         Match a position that is not followed by exp
(?<!exp)        Match a position that is not preceded by exp
Comment
(?#comment)     Comment
Positive lookaround
Next come the lookahead and lookbehind assertions. They check the text that comes after or before the current position without including it in the match. Like the special characters "^" and "\b", they do not consume any text themselves (they only define a position), which is why they are called zero-width assertions. Some examples should make this clear.
(?=exp) is the "zero-width positive lookahead assertion". It matches text that is followed by exp, without including exp itself.
22. \b\w+(?=ing\b) (the beginning of words ending in "ing"; for example, in "filling" it matches "fill")
(?<=exp) is the "zero-width positive lookbehind assertion". It matches text that is preceded by exp, without including exp itself.
23. (?<=\bre)\w+\b (the end of words starting with "re"; for example, in "repeated" it matches "peated")
24. (?<=\d)\d{3}\b (three digits at the end of a word, preceded by a digit)
25. (?<=\s)\w+(?=\s) (a string of letters and digits with whitespace on both sides)
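REs 22 to 25 can be checked with the following sketch. It assumes Python's re module, which accepts these four patterns unchanged because each lookbehind here has a fixed width; the input strings are invented.

```python
import re

# RE 22: the part of a word before a final "ing".
stems = re.findall(r"\b\w+(?=ing\b)", "filling and falling")

# RE 23: the part of a word after a leading "re".
tails = re.findall(r"(?<=\bre)\w+\b", "repeated reading")

# RE 24: the last three digits of a number with more than three digits.
last_three = re.findall(r"(?<=\d)\d{3}\b", "12345 678")

# RE 25: words with whitespace on both sides.
middle = re.findall(r"(?<=\s)\w+(?=\s)", "one two three")

print(stems)
print(tails)
print(last_three)
print(middle)
```

In each case the text tested by the assertion ("ing", "re", the extra digit, the whitespace) is left out of the result.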
Negative lookaround
Earlier we saw how to find a character that is not a specific character or not in a specific character class. But what if we only want to verify that a character is absent, without matching anything in its place? For example, to find words containing a q that is not followed by a u, you could try the following RE.
26. \b\w*q[^u]\w*\b (words containing a q not followed by u)
This RE has a problem: [^u] must match one character, so if q is the last letter of a word, [^u] matches the whitespace that follows, and the match can end up spanning two words, as in "Iraq haha". Negative lookaround solves this.
27. \b\w*q(?!u)\w*\b (words containing a q not followed by u)
This is the "zero-width negative lookahead assertion".
28. \d{3}(?!\d) (three digits not followed by another digit)
Similarly, you can use (?<!exp), the "zero-width negative lookbehind assertion", to match text that is not preceded by exp.
29. (?<![a-z ])\w{7} (a string of seven letters and digits not preceded by a letter or a space)
30. (?<=<(\w+)>).*(?=<\/\1>) (the text between HTML tags)
RE 30 uses lookbehind and lookahead assertions to extract the text between a pair of HTML tags, excluding the tags themselves.
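The "Iraq haha" problem and its fix can be reproduced with the sketch below, assuming Python's re module and made-up sample text. One difference from .NET matters here: Python's re accepts only fixed-width lookbehind, so RE 30 (whose lookbehind contains \w+) does not compile there. The last part of the sketch therefore swaps in ordinary capture groups to get the same text; that substitution is mine, not the article's.

```python
import re

text = "Iraq haha quit"

# RE 26: [^u] consumes the space after "Iraq", so the match spans two words.
with_class = re.findall(r"\b\w*q[^u]\w*\b", text)

# RE 27: (?!u) only checks the next character and consumes nothing.
with_lookahead = re.findall(r"\b\w*q(?!u)\w*\b", text)

# RE 28: three digits not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "1234 567")

print(with_class)
print(with_lookahead)
print(three_digits)

# Substitute for RE 30 in Python: capture the inner text as group 2.
m = re.search(r"<(\w+)>(.*)</\1>", "<p>Hello</p>")
inner = m.group(2) if m else None
print(inner)
```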
Comments
Another special use of parentheses is to enclose comments, with the syntax "(?#comment)". If the "Ignore Pattern Whitespace" option is set, whitespace characters in the RE are ignored when the RE is used. With this option set, any text from "#" to the end of the line is also ignored.
31. The text between HTML tags, with comments added
(?<=       # Search for a prefix, but exclude it from the match
<(\w+)>    # An HTML tag
)          # End of the prefix
.*         # Match any text
(?=        # Search for a suffix, but exclude it from the match
<\/\1>     # The text captured by group 1, that is, the closing HTML tag
)          # End of the suffix
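Commented REs can be sketched in Python too, where the re.VERBOSE flag plays the role of the "Ignore Pattern Whitespace" option. Because Python's re does not accept the variable-width lookbehind of RE 31, this sketch uses capture groups instead of lookarounds; the sample inputs are hypothetical.

```python
import re

# A commented pattern; whitespace and "#" comments are ignored.
tag_text = re.compile(
    r"""
    <(\w+)>   # an opening HTML tag; its name is captured as group 1
    (.*)      # any text, captured as group 2
    </\1>     # the closing tag that matches group 1
    """,
    re.VERBOSE,
)

m = tag_text.search("<em>hello</em>")
inner = m.group(2) if m else None
print(inner)

# Inline (?#comment) groups work without any option.
phones = re.findall(r"\d{3}(?#prefix)-\d{4}(?#line number)", "555-1234")
print(phones)
```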
Greedy and lazy matching
When an RE contains a repetition that can match a range of lengths (for example, ".*"), it normally matches as many characters as possible. This is called greedy matching. For example:
32. a.*b (the longest string that starts with a and ends with b)
If the string is "aabab", the RE above matches "aabab", because that is the longest match. Sometimes you want to match as few characters as possible instead, which is called lazy matching. Any of the repetitions listed earlier can be made lazy by adding a question mark (?) after it. So "*?" means repeat any number of times, but use as few repetitions as possible. For example:
33. a.*?b (the shortest string that starts with a and ends with b)
If the string is "aabab", the first match the RE above finds is "aab", and the next is "ab", because each is the shortest match from its starting position.
*?       Repeat any number of times, but as few as possible
+?       Repeat one or more times, but as few as possible
??       Repeat zero or one time, but as few as possible
{n,m}?   Repeat at least n times, but no more than m times, and as few as possible
{n,}?    Repeat at least n times, but as few as possible
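The "aabab" example can be reproduced directly. This is a minimal sketch assuming Python's re module, where greedy and lazy repetitions are written the same way as in .NET.

```python
import re

s = "aabab"

# RE 32: greedy, takes as many characters as possible.
greedy = re.findall(r"a.*b", s)

# RE 33: lazy, stops at the first "b" it can.
lazy = re.findall(r"a.*?b", s)

print(greedy)  # one match covering the whole string
print(lazy)    # two shorter matches
```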
What did we leave out?
So far we have covered many of the elements used to build REs, but there are of course many we have not mentioned. The following table lists some of them. The number in the leftmost column is the number of the corresponding example in Expresso.
#    Syntax            Description
     \a                Bell character
     \b                Usually a word boundary; inside a character class it means backspace
     \t                Tab
34   \r                Carriage return
     \v                Vertical tab
     \f                Form feed
35   \n                New line
     \e                Escape
36   \nnn              The character whose ASCII octal code is nnn
37   \xnn              The character whose hexadecimal code is nn
38   \unnnn            The character whose Unicode code is nnnn
39   \cN               The Control-N character; for example, Ctrl-M is \cM
40   \A                Start of the string (like ^, but not affected by the Multiline option)
41   \Z                End of the string, or just before a line break at the end of the string
     \z                End of the string
42   \G                Start of the current search
     \p{name}          Any character in the Unicode character class named name, for example \p{Lowercase_Letter}
     (?>exp)           Greedy subexpression, also called a non-backtracking subexpression. It is matched only once and does not take part in backtracking.
44   (?<x>-<y>exp) or (?-<y>exp)
                       Balancing group. Complicated but powerful. It lets named capture groups be manipulated as a stack. (I do not understand this one very well myself.)
45   (?im-nsx:exp)     Change the RE options for the subexpression exp; for example, (?-i:Elvis) turns off the ignore-case option for Elvis.
46   (?im-nsx)         Change the RE options for the rest of the enclosing group
     (?(exp)yes|no)    exp is treated as a zero-width positive lookahead. If it matches, the yes subexpression is matched next; if not, the no subexpression is matched next.
     (?(exp)yes)       Same as above, but with an empty no subexpression
     (?(name)yes|no)   If name is the name of a valid group, the yes subexpression is matched next; if not, the no subexpression is matched next.
47   (?(name)yes)      Same as above, but with an empty no subexpression
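As one illustration from this table, the (?-i:Elvis) example of entry 45 can be sketched in Python. This assumes Python 3.6 or later, where the re module accepts inline option groups of the form (?-i:...); several other entries in the table (for instance balancing groups) are .NET-specific and have no counterpart there. The sample text is made up.

```python
import re

text = "Elvis LIVES, elvis lives"

# Ignore case for the whole pattern, except inside (?-i:...).
pattern = re.compile(r"(?-i:Elvis) lives", re.IGNORECASE)
hits = pattern.findall(text)

print(hits)
```

Only the occurrence with a capital "E" is found, because case is significant again inside the group, while " lives" is still matched without regard to case.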
Conclusion
After this series of examples, and with the help of Expresso, I believe you now have a basic understanding of RE. There are of course many other articles about RE on the Internet; if you are interested, http://www.codeproject.com has many of them. If you prefer books, many people recommend Mastering Regular Expressions by Jeffrey Friedl (I have not read it yet). I hope these notes shorten your RE learning curve considerably. This is my first real contact with RE, so if there are errors or poor explanations in the article, please bear with me and email me whatever needs correcting.