Getting started with regular expressions in 30 minutes

(Reprinted) Regular Expressions: 30-Minute Getting Started Tutorial. Author: deerchao. Source: unibetter.

You have probably used wildcards for file searches in Windows/DOS, that is, * and ?. To find all the Word documents in a directory, you search for *.doc; here * is interpreted as any string. Like wildcards, regular expressions are a tool for matching text, but they describe what you want far more precisely than wildcards can. The price, of course, is greater complexity. For example, you can write a regular expression that finds every string made of a 0, then 2 or 3 digits, then a hyphen "-", then 7 or 8 digits (such as 010-12345678 or 0376-7654321).

When writing a program or web page that processes strings, you often need to find strings that satisfy some complex rule. Regular expressions are the tool for describing such rules. In other words, a regular expression is code that records a text rule. For example, \d+ is a concise piece of code for the rule "one or more digits": 2008 fits the rule, whereas a3 does not (it contains a character that is not a digit).

The best way to learn regular expressions is to start from examples: understand an example, then modify it and experiment. The rest of this tutorial gives simple examples with detailed explanations.

Suppose you are searching for hi in an English novel. You can use the regular expression hi. This is about the simplest regular expression there is: it matches exactly the two-character string whose first character is h and whose second is i. Regular expression tools usually offer a case-insensitive option; with it selected, the expression matches hi, HI, Hi, and hI.

Unfortunately, many words contain the consecutive characters hi, such as him, history, and high. A search for hi finds the hi inside those words too. To find exactly the word hi, use \bhi\b.
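
As a quick check of that difference, here is a minimal sketch in Python. The tutorial itself targets .NET; Python's re module is used here only for illustration, and the sample sentence is invented. It compares plain hi with \bhi\b and then switches on the case-insensitive option, which Python spells re.IGNORECASE.

```python
import re

# Invented sample sentence for this sketch.
sample = "hi Lucy, him and history are high; Hi again"

# Plain "hi" also matches inside longer words such as "him" and "history".
plain = re.findall(r"hi", sample)

# \b pins the match to word boundaries, so only the standalone word is found.
whole_words = re.findall(r"\bhi\b", sample)

# The case-insensitive option lets the same pattern find "Hi" as well.
any_case = re.findall(r"\bhi\b", sample, flags=re.IGNORECASE)

print(len(plain), whole_words, any_case)
```

The raw-string prefix r"..." keeps Python from interpreting the backslashes before the regular expression engine sees them.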

\b is a special code defined by regular expressions (a metacharacter) that stands for the beginning or end of a word. Although English words are usually separated by spaces, punctuation, or line breaks, \b does not match any of these separator characters; it matches only a position.

If you are looking for a Lucy that appears somewhere after hi, use \bhi\b.*\bLucy\b. Here . is another metacharacter, matching any character except a line break. * is a metacharacter too, but it stands for a quantity rather than a character or a position: it says that whatever precedes the * may be repeated any number of times for the whole expression to match. Together, .* therefore means any number of characters that are not line breaks. The meaning of \bhi\b.*\bLucy\b is now clear: the word hi, then any number of arbitrary characters (but no line break), and finally the word Lucy.

Combined with other metacharacters, we can build more powerful regular expressions. For example, 0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, followed by two digits, a hyphen "-", and finally eight digits (that is, a Chinese phone number; this example only handles three-digit area codes, and a version that also handles four-digit area codes appears later in the tutorial). Here \d is a new metacharacter that matches any digit (0, or 1, or 2, ...). The - is not a metacharacter; it stands only for itself, the hyphen.

To avoid all that tedious repetition, the same expression can be written as 0\d{2}-\d{8}. The {2} ({8}) after \d means that the preceding \d must occur exactly 2 (8) times in a row.

Testing Regular Expressions

If you do not find regular expressions hard to read and write, you are either a genius or not from Earth. The syntax is a headache even for people who use it regularly.
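
Short of installing a dedicated tool, one lightweight way to try out the expressions seen so far is a throwaway script. The sketch below uses Python's re module, which is an assumption made for illustration (the tutorial targets .NET, though these particular patterns are expected to behave the same way in both). The sample strings are invented.

```python
import re

# (pattern, sample text) pairs. The patterns come from the text above;
# the sample strings are invented for this sketch.
cases = [
    (r"\bhi\b.*\bLucy\b", "hi, have you seen Lucy today?"),
    (r"0\d\d-\d\d\d\d\d\d\d\d", "call 010-12345678 now"),
    (r"0\d{2}-\d{8}", "call 010-12345678 now"),
    (r"0\d{2}-\d{8}", "call 0376-7654321 now"),  # four-digit area code
]

results = []
for pattern, sample in cases:
    found = re.search(pattern, sample)  # first match anywhere in the sample
    results.append(found.group(0) if found else None)
    print(f"{pattern!r} on {sample!r} -> {results[-1]!r}")
```

The last case finds nothing, because the pattern allows only a three-digit area code, which is the limitation the text points out.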

Because regular expressions are hard to read and write and easy to get wrong, a tool for testing them is essential. The details of regular expressions differ from one environment to another; this tutorial describes regular expressions under Microsoft .NET, so it introduces a .NET tool, The Regulator. First make sure that .NET Framework 1.1 is installed, then download The Regulator, unpack the download, and run setup.exe to install it.

Metacharacters

You now know several codes with special meanings (metacharacters): \b, ., *, and \d. There are more: \s matches any whitespace character, including spaces, tabs, and line breaks, and \w matches a letter, a digit, or an underscore.

Let's try some more examples:

\ba\w*\b matches a word that starts with the letter a: first the start of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).

\d+ matches one or more consecutive digits. Here + is a metacharacter similar to *; the difference is that * means repeated any number of times (possibly zero), while + means repeated one or more times.

\b\w{6}\b matches a word of exactly 6 letters/digits.

Table 1. Common metacharacters

Code    Description
.       Matches any character except a line break
\w      Matches a letter, digit, or underscore
\s      Matches any whitespace character
\d      Matches a digit
\b      Matches the start or end of a word
^       Matches the start of the string
$       Matches the end of the string

The metacharacters ^ and $ resemble \b in that each matches a position: ^ matches the start of the string being searched and $ matches its end. They are very useful for validating input. For example, if a website requires a QQ number to be 5 to 12 digits, you can use ^\d{5,12}$. The {5,12} here is similar to the {2} seen earlier, but {2} means exactly 2 repetitions, no more and no fewer, whereas {5,12} means at least 5 and at most 12 repetitions; anything else does not match.
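
The validation idea can be sketched in a few lines. This is an illustrative Python version, not the tutorial's own .NET code; the helper name is_valid_qq and the sample inputs are hypothetical.

```python
import re

# ^ and $ tie the pattern to the whole input, so stray characters make it fail.
QQ_NUMBER = re.compile(r"^\d{5,12}$")

def is_valid_qq(text: str) -> bool:
    """Return True if text consists of 5 to 12 digits and nothing else."""
    return QQ_NUMBER.search(text) is not None

for candidate in ["12345", "1234", "1234567890123", "12345abc"]:
    print(candidate, is_valid_qq(candidate))

# Two more patterns from this section, tried on invented sentences.
words_starting_with_a = re.findall(r"\ba\w*\b", "an apple a day keeps amy away")
six_character_words = re.findall(r"\b\w{6}\b", "regexp syntax is tricky stuff")
print(words_starting_with_a, six_character_words)
```

In Python specifically, $ also matches just before a trailing line break, so re.fullmatch(r"\d{5,12}", text) is a stricter alternative when that matters.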

Because ^ and $ are used, the entire input string must match \d{5,12}; that is, the whole input must consist of 5 to 12 digits. An input that matches this expression therefore meets the requirement.

Like the case-insensitive option, some regular expression tools have a multiline option. With it selected, ^ and $ match the start and end of each line instead.

Character Escaping

If you want to find a metacharacter itself, say . or *, you have a problem: you cannot write it directly, because it would be interpreted as something else. In that case, use \ to cancel the character's special meaning, writing \. and \*. To find \ itself, write \\.

Examples: www\.unibetter\.com matches www.unibetter.com, c:\\windows matches c:\windows, and 2\^8 matches 2^8 (usually read as 2 to the 8th power).

Repetition

You have already seen several ways of matching repetition: *, +, {2}, and {5,12}. Here are all the quantifiers in regular expressions:

Table 2. Common quantifiers

Code     Description
*        Repeat zero or more times
+        Repeat one or more times
?        Repeat zero or one time
{n}      Repeat exactly n times
{n,}     Repeat n or more times
{n,m}    Repeat n to m times

Some examples that use repetition:

Windows\d+ matches Windows followed by one or more digits.

13\d{9} matches 13 followed by 9 digits (a Chinese mobile phone number).

^\w+ matches the first word of a line (or the first word of the whole string; which one depends on the option settings).

Character Classes

Finding digits, letters or digits, or whitespace is easy, because there are metacharacters for those character sets. But what if you want to match a set that has no predefined metacharacter, such as the vowels (a, e, i, o, u)?

Simply list them inside square brackets. For example, [aeiou] matches any one vowel, and [.?!] matches a punctuation mark (., ?, or !), the three marks that usually end an English sentence. Note that . and ? are not interpreted as metacharacters inside the brackets, so there is no need to write [\.\?!]; the escapes are unnecessary there.

You can also specify a character range easily. [0-9] means exactly the same as \d: one digit. Similarly, [a-z0-9A-Z_] is equivalent to \w (for English text).

Here is a more complex expression: \(?0\d{2}[) -]?\d{8}.

It matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it: first comes the escaped character \(, which may appear 0 or 1 times (?); then a 0 followed by two digits (\d{2}); then one of ), -, or a space, which appears once or not at all (?); and finally eight digits (\d{8}). Unfortunately, it also matches "incorrect" formats such as 010)12345678 or (022-87654321. A fix for this appears later in the tutorial.

Negation

Sometimes you need to find characters that do not belong to some easily defined character class. For example, to match any character other than a digit, use negation:

Table 3. Common negation codes

Code        Description
\W          Matches any character that is not a letter, digit, or underscore
\S          Matches any character that is not whitespace
\D          Matches any character that is not a digit
\B          Matches a position that is not the start or end of a word
[^x]        Matches any character other than x
[^aeiou]    Matches any character other than a, e, i, o, u

Examples: \S+ matches a string that contains no whitespace characters. <a[^>]+> matches a string enclosed in angle brackets that starts with a.

Alternation

It is finally time to solve the problem of three-digit versus four-digit area codes. Alternation in a regular expression means supplying several rules; if any one of them is satisfied, the text counts as a match. The way to do it is to separate the rules with |. Does that sound confusing?
Never mind; let's look at some examples.

0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: a three-digit area code with an eight-digit local number (such as 010-12345678), or a four-digit area code with a seven-digit local number (such as 0376-2233445).

\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a three-digit area code. The area code may or may not be enclosed in parentheses, and it may be separated from the local number by a hyphen or a space, or not separated at all. As an exercise, try using alternation (|) to extend this expression to four-digit area codes as well.

\d{5}-\d{4}|\d{5} matches United States ZIP codes, which are either five digits or nine digits with a hyphen after the fifth. This example is here because it illustrates a point: order matters in alternation. If you change it to \d{5}|\d{5}-\d{4}, it matches only five-digit ZIP codes (and the first five digits of nine-digit ones). The reason is that the alternatives are tested from left to right; once one of them matches, the rest are not tried.

Windows98|Windows2000|WindowsXP shows that alternation is not limited to two rules; it works with any number of them.

Grouping

We have covered how to repeat a single character, but what if you want to repeat a string of characters? Use parentheses to mark a subexpression (also called a group); you can then specify how many times the subexpression repeats, and you can also do other things with it (covered later in the tutorial).

(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses.
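
The alternation examples can be checked with a short script. This is a minimal Python sketch (the tutorial targets .NET, and the variable names are hypothetical). It compares the character-class phone pattern introduced before the alternation section with the alternation version, then shows the effect of ordering on the ZIP code pattern.

```python
import re

# Character-class version: the optional "(" and the optional ")" are
# independent of each other, so unbalanced parentheses slip through.
loose = re.compile(r"\(?0\d{2}[) -]?\d{8}")

# Alternation version: one branch with both parentheses, one with neither.
strict = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")

samples = ["(010)88886666", "022-22334455", "02912345678",
           "010)12345678", "(022-87654321"]
for s in samples:
    # fullmatch requires the whole string to match, as in input validation.
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))

# Order matters: alternatives are tried from left to right.
zip_long_first = re.search(r"\d{5}-\d{4}|\d{5}", "12345-6789").group(0)
zip_short_first = re.search(r"\d{5}|\d{5}-\d{4}", "12345-6789").group(0)
print(zip_long_first, zip_short_first)
```

Whole-string matching is used for the phone numbers on purpose: with a plain search, the second branch of the strict pattern would still find 022-87654321 inside (022-87654321.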

To understand (\d{1,3}\.){3}\d{1,3}, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a number of 1 to 3 digits followed by a dot (the group is treated as a unit), repeated 3 times; and finally comes one more number of 1 to 3 digits (\d{1,3}).

Unfortunately, it also matches impossible IP addresses such as 256.300.888.999 (no number in an IP address can exceed 255). If arithmetic comparison were available, the problem would be easy to solve, but regular expressions provide no mathematical facilities, so the only option is a lengthy combination of grouping, alternation, and character classes to describe a correct IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).

The key to this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?, which is not spelled out here; you should be able to work out its meaning yourself.

Backreferences

Once a subexpression has been marked with parentheses, the text that matches it can be processed further, either in the expression itself or in other programs. By default, every group automatically gets a group number. The rule: counting left parentheses from left to right, the first group is number 1, the second is number 2, and so on.

A backreference searches again for text that an earlier group matched. For example, \1 stands for the text matched by group 1. Hard to follow? Look at an example:

\b(\w+)\b\s+\1\b matches repeated words, such as go go or kitty kitty. First comes a word, that is, one or more letters or digits between the start and end of a word (\b(\w+)\b); then one or more whitespace characters (\s+); and finally the word that was just matched (\1).

You can also assign a group number or group name to a subexpression yourself. To assign a group name, use the syntax (?<Word>\w+), which gives the group \w+ the name Word. To refer back to what this group captured, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s*\k<Word>\b.

Parentheses have many other special-purpose syntaxes. The most common ones are listed below:

Table 4. Grouping syntax

Capturing
(exp)           Matches exp and captures the text into an automatically numbered group
(?<name>exp)    Matches exp and captures the text into a group named name
(?:exp)         Matches exp without capturing the matched text

Position assertions
(?=exp)         Matches a position that is followed by exp
(?<=exp)        Matches a position that is preceded by exp
(?!exp)         Matches a position that is not followed by exp
(?<!exp)        Matches a position that is not preceded by exp

Comments
(?#comment)     Has no effect on how the expression is processed; it only provides a comment for human readers

The first two syntaxes have already been discussed. The third, (?:exp), does not change how the expression is processed; it just means that what the group matches is not captured into a group, unlike the first two.

Zero-Width Assertions

The next four syntaxes find things that come before or after certain content (without including that content). In other words, they specify a position, just as \b, ^, and $ do, which is why they are called zero-width assertions. Examples explain them best:

(?=exp), the zero-width positive lookahead assertion, matches positions in the text that are followed by the given suffix exp. For example, \b\w+(?=ing\b) matches the front part of words that end in ing (everything except the ing); searching "I'm singing while you're dancing." it matches sing and danc.

(?<=exp), the zero-width positive lookbehind assertion, matches positions in the text that are preceded by the given prefix exp. For example, (?<=\bre)\w+\b matches the back part of words that start with re (everything except the re).
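
Backreferences and the two positive assertions can be tried out as follows. This is a hedged sketch in Python rather than .NET, and one syntax difference matters: Python spells named groups (?P<Word>...) and refers back to them with (?P=Word), where the .NET syntax shown in this tutorial is (?<Word>...) and \k<Word>. The first sample sentence is invented.

```python
import re

sentence = "I think that that kitty kitty is here"

# Numbered backreference: \1 must repeat whatever group 1 matched.
repeated = [m.group(0) for m in re.finditer(r"\b(\w+)\b\s+\1\b", sentence)]

# Named group, Python spelling: (?P<Word>...) defines it, (?P=Word) refers back.
repeated_named = [
    m.group("Word")
    for m in re.finditer(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b", sentence)
]

# Positive lookahead: "ing" must follow, but it is not part of the match.
stems = re.findall(r"\b\w+(?=ing\b)", "I'm singing while you're dancing.")

# Positive lookbehind: "re" must come first, but it is not part of the match.
tails = re.findall(r"(?<=\bre)\w+\b", "reading a book")

print(repeated, repeated_named, stems, tails)
```

Python's re module only accepts lookbehind patterns of fixed width; (?<=\bre) qualifies because \b has zero width and re is always two characters.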

For example, when searching "reading a book", (?<=\bre)\w+\b matches ading.

Suppose you want to insert a comma after every three digits in a long number (counting from the right, of course). You can find the parts that need a comma with ((?<=\d)\d{3})*\b. Analyze this expression carefully; it may not be as simple as it looks at first glance.

The next example uses both kinds of assertion: (?<=\s)\d+(?=\s) matches numbers that are set off by whitespace (once again, the whitespace itself is not included in the match).

Negative Zero-Width Assertions

Earlier we saw how to find a character that is not a particular character or not in a particular character class (negation). But what if we only want to make sure that some character does not appear, without matching anything in its place? Say we want to find words that contain the letter q where the q is not followed by the letter u. We might try:

\b\w*q[^u]\w*\b, which matches words that contain the letter q followed by something other than u. With more testing (or a sharp eye) you will find that this goes wrong when q is the last letter of a word, as in Iraq or Benq. [^u] always matches one character, so when q ends the word, [^u] matches the word separator after it (a space, a period, or whatever it is), and the following \w*\b then matches the next word. As a result, \b\w*q[^u]\w*\b can match the whole of "Iraq fighting".

A negative zero-width assertion solves the problem, because it matches only a position and consumes no characters. The fix is \b\w*q(?!u)\w*\b.

(?!exp), the zero-width negative lookahead assertion, matches only positions that are not followed by exp. For example, \d{3}(?!\d) matches three digits that are not followed by another digit.

In the same way, (?<!exp), the zero-width negative lookbehind assertion, matches positions that are not preceded by exp: (?<![a-z])\d{7} matches seven digits that are not preceded by a lowercase letter. (If your experiments disagree, first check whether the case-insensitive option is switched on.)

A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag that has no attributes. (?<=<(\w+)>) specifies the prefix, a word enclosed in angle brackets (for example, <b>); then comes .* (any string); and last is the suffix (?=<\/\1>). Note the \/ in the suffix, which uses the character escaping described earlier, and the \1, a backreference to the first captured group, the (\w+) in the prefix. So if the prefix is actually <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (once again, without the prefix and suffix themselves).

Comments

Another use of parentheses is to include comments with the syntax (?#comment). If you want to include comments, it is best to enable the "ignore pattern whitespace" option. Then you can add spaces, tabs, and line breaks freely when writing the expression, and they are ignored when it is used. With this option on, everything from a # to the end of the line is also ignored as a comment.

For example, the previous expression can be written like this:

    (?<=      # look for a prefix, but do not include it
    <(\w+)>   # letters or digits in angle brackets (that is, an HTML tag)
    )         # end of the prefix
    .*        # match any text
    (?=       # look for a suffix, but do not include it
    <\/\1>    # in angle brackets: a "/" followed by the previously captured tag
    )         # end of the suffix

Greedy and Lazy Matching

When a regular expression contains a quantifier that accepts a variable number of repetitions (such as * or {5,12}), the normal behavior is to match as many characters as possible. Consider a.*b: it matches the longest string that starts with a and ends with b.
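
A two-line experiment makes the greedy behavior concrete. The sketch below is in Python (an assumption for illustration, since the tutorial targets .NET). It runs a.*b on the string aabab and, for contrast, the variant a.*?b, where the extra ? after the quantifier asks for as few repetitions as possible.

```python
import re

text = "aabab"

# Greedy: .* takes as many characters as it can while still ending in b.
greedy = re.findall(r"a.*b", text)

# Lazy: .*? takes as few characters as it can, so shorter matches are found.
lazy = re.findall(r"a.*?b", text)

print(greedy)
print(lazy)
```

findall returns every non-overlapping match from left to right, which is why the lazy pattern yields two separate results.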

Applied to aabab, a.*b matches the entire string aabab. This is called greedy matching.

Sometimes we want lazy matching instead, that is, matching as few characters as possible. Every quantifier shown earlier can be turned into a lazy one by adding a question mark ? after it. Thus .*? means match any number of repetitions, but use the fewest repetitions that still let the whole match succeed. Now for a lazy example:

a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab and ab.

Table 5. Lazy quantifiers

Code      Description
*?        Repeat any number of times, but as few as possible
+?        Repeat one or more times, but as few as possible
??        Repeat zero or one time, but as few as possible
{n,m}?    Repeat n to m times, but as few as possible
{n,}?     Repeat n or more times, but as few as possible

What Has Not Been Covered

I have described a large number of the elements used to build regular expressions, but there are some I have not mentioned. The table below lists them with their syntax and a brief description. You can find more detailed references online when you need them. If you have the MSDN Library installed, it also contains detailed documentation on regular expressions under .NET.

Table 6. Syntax not yet discussed

Code               Description
\a                 Alarm character (printing it makes the computer beep)
\b                 Usually a word boundary, but inside a character class it means backspace
\t                 Tab
\r                 Carriage return
\v                 Vertical tab
\f                 Form feed
\n                 Line feed
\e                 Escape
\0nn               The ASCII character whose octal code is nn
\xnn               The ASCII character whose hexadecimal code is nn
\unnnn             The Unicode character whose hexadecimal code is nnnn
\cN                An ASCII control character; for example, \cC is Ctrl+C
\A                 Start of the string (like ^, but not affected by the multiline option)
\Z                 End of the string, or just before a line break at the very end (not affected by the multiline option)
\z                 End of the string (like $, but not affected by the multiline option)
\G                 Start of the current search
\p{name}           The Unicode character class named name, for example \p{IsGreek}
(?>exp)            Greedy (non-backtracking) subexpression
(?<x-y>exp)        Balancing group
(?<-y>exp)         Balancing group
(?im-nsx:exp)      Changes the processing options within the subexpression exp
(?im-nsx)          Changes the processing options for the rest of the expression
(?(exp)yes|no)     Treats exp as a zero-width positive lookahead assertion; if it matches at this position, yes is used as the expression for this group, otherwise no
(?(exp)yes)        Same as above, with an empty expression as no
(?(name)yes|no)    If the group named name has captured something, yes is used as the expression, otherwise no
(?(name)yes)       Same as above, with an empty expression as no

Some Terms

I assume you already know these:

Character: the most basic unit a program handles when processing text; it may be a letter, digit, punctuation mark, space, line break, Chinese character, and so on.
String: a sequence of 0 or more characters.
Text: written words; a string.
Match: to conform to a rule; to check whether something conforms to a rule.
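
Table 6 lists \A as a variant of ^ that ignores the multiline option. A minimal sketch of that difference, again in Python rather than .NET and with an invented two-line sample, is shown below. Only \A is demonstrated, because Python's \Z means "end of the string only", which corresponds to .NET's \z rather than .NET's \Z.

```python
import re

text = "first line\nsecond line"

# With the multiline option, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \A still matches only at the very start of the string.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# Without the option, ^ behaves like \A.
default_start = re.findall(r"^\w+", text)

print(line_starts, string_start, default_start)
```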
