Regular
by Jim Hollenhorst Cold fish
Have you ever wondered what regular expressions are and how to get a basic understanding of them quickly? My goal is to get you up and running with the basics in 30 minutes. Regular expressions are not as complex as they look, and the best way to learn them is to start writing them and keep practicing. After the first 30 minutes, you should know the basic constructs and be able to design and use regular expressions in your programs or Web pages. For those who want to dig deeper, plenty of very good resources are available.
What exactly is a regular expression?
You are probably familiar with the wildcard characters used for pattern matching on computers. For example, to find all Microsoft Word files in a Windows folder, you search for "*.doc", knowing that the asterisk is interpreted as a wildcard that matches any sequence of characters. Regular expressions are a much more detailed extension of this idea.
Programs and Web pages that handle text often need to locate strings that match complex patterns. Regular expressions are used to describe such patterns. A regular expression is thus a shorthand code for a pattern. For example, the pattern "\w+" is a concise way of saying "match any non-empty string of alphanumeric characters." The .NET Framework provides a powerful class library that makes it easy to include regular expressions in your applications. With this library, you can readily search and replace text, decode complex headers, parse languages, or validate text.
A good way to learn the arcane syntax of regular expressions is to start from examples and then practice creating your own.
Let's get started!
A few simple examples
Searching for Elvis
Suppose you spend all your free time scanning documents for evidence that Elvis is still alive. You could search with the following regular expression:
1. elvis--Find elvis
This is a perfectly valid regular expression that searches for an exact sequence of characters. In .NET, you can easily set an option to ignore the case of characters, so this expression will match "Elvis", "ELVIS", or "eLvIs". Unfortunately, it will also match the last five letters of the word "pelvis". We can improve the expression as follows:
2. \belvis\b--Find Elvis as a whole word
Now things have become more interesting. "\b" is a special code that represents "a position that matches the beginning or end of any word." This expression will only match the full word "Elvis", whether in lowercase or uppercase.
Suppose you want to find all lines in which the word "elvis" is followed by the word "alive". The period or dot "." is a special code that matches any character other than a line break. The asterisk "*" means repeat the previous term as many times as necessary to guarantee a match. Thus, ".*" means "match any number of characters other than line breaks." It is now a simple matter to build an expression that means "search for the word 'elvis' followed on the same line by the word 'alive'."
3. \belvis\b.*\balive\b--Find text with "Elvis" followed by "Alive"
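If you don't have Expresso at hand, the first three examples can be tried in a few lines of code. The sketch below uses Python's re module rather than .NET, on the assumption that it treats these particular constructs the same way; the sample text and variable names are made up for illustration.

```python
import re

text = "Elvis has left the building. My pelvis hurts. ELVIS is alive!"

# Example 1: the bare pattern also matches the tail of "pelvis".
plain = re.findall(r"elvis", text, re.IGNORECASE)

# Example 2: \b restricts the match to whole words.
whole_word = re.findall(r"\belvis\b", text, re.IGNORECASE)

# Example 3: "elvis" followed later on the same line by "alive".
followed = re.search(r"\belvis\b.*\balive\b", text, re.IGNORECASE)

print(plain)
print(whole_word)
print(followed.group() if followed else None)
```

The re.IGNORECASE flag plays the role of the ignore-case option mentioned above.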
With just a few special characters we are beginning to build powerful regular expressions, and they are already becoming hard for us humans to read.
Let's look at another example.
Validating phone numbers
Suppose your Web page collects a customer's seven-digit phone number, and you want to verify that the number entered is in the correct format, "xxx-xxxx", where each "x" is a digit. The following expression searches through text for such a string:
4. \b\d\d\d-\d\d\d\d--Find Seven-digit phone number
Each "\d" means "match any single digit". The "-" has no special meaning and is interpreted literally, matching a hyphen. To avoid the tedious repetition, we can use a shorthand notation that means the same thing:
5. \b\d{3}-\d{4}--Find Seven-digit phone number a better way
"{3}" after "\d" means "repeat previous character three times".
Basics of .NET regular expressions
Let's explore some of the basics of regular expressions in .NET.
Special characters
You should know a few characters that have special meaning. You've seen "\b", ".", "*", and "\d". To match any white space characters, such as spaces, tabs, and line breaks, use "\s". Similarly, "\w" matches any alphanumeric character.
Let's try more examples:
6. \ba\w*\b--Find words that start with the letter a
This searches for the beginning of a word (\b), then a letter "a", followed by any number of repeated alphanumeric characters (\w*), and finally the end of a word (\b).
7. \d+--find repeated strings of digits
Here, "+" is similar to "*", except that it requires at least one repetition.
8. \b\w{6}\b--Find six letter words
Test these expressions in Expresso, and then practice creating your own expressions. Here is a table that shows characters that have special meaning:
.  | Match any character except a newline
\w | Match any alphanumeric character
\s | Match any whitespace character
\d | Match any digit
\b | Match the beginning or end of a word
^  | Match the beginning of the string
$  | Match the end of the string
Table 1 Common special characters for regular expressions
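Examples 6 to 8 can also be checked outside Expresso. This is a small sketch in Python's re module (not .NET), with an invented sample sentence. Note that "\w" also accepts digits and the underscore, so "six letter words" really means six word characters.

```python
import re

text = "An apple a day keeps 12 doctor visits away in 2024"

# Example 6: words that start with a lowercase "a".
starts_with_a = re.findall(r"\ba\w*\b", text)

# Example 7: runs of one or more digits.
digit_runs = re.findall(r"\d+", text)

# Example 8: words of exactly six word characters.
six_chars = re.findall(r"\b\w{6}\b", text)

print(starts_with_a)
print(digit_runs)
print(six_chars)
```

"An" is skipped by example 6 because no ignore-case option is set here.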
Beginning and end
The special characters "^" and "$" are used when searching for text that must start at the beginning of a text and/or end at the end of a text. This is especially useful for validating input, where the entire text entered must match a pattern. For example, to validate a seven-digit phone number, you might use:
9. ^\d{3}-\d{4}$--Validate a seven-digit phone number
This is the same as example 5, but it forces the pattern to fit the whole text string, with nothing else before or after the matched text. If you set the "Multiline" option in .NET, "^" and "$" change their meaning to match the beginning and end of a line of text rather than the entire text string. The Expresso examples use this option.
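The difference between finding a phone number somewhere in a text and validating a whole input is easy to see in code. Here is a rough sketch using Python's re module, where re.MULTILINE is assumed to correspond to the .NET "Multiline" option; the sample strings are hypothetical.

```python
import re

finder = re.compile(r"\b\d{3}-\d{4}")      # example 5: find anywhere
validator = re.compile(r"^\d{3}-\d{4}$")   # example 9: whole text must fit

found = finder.findall("Call 555-1212 or 867-5309 today.")
valid_input = validator.search("555-1212") is not None
invalid_input = validator.search("Call 555-1212") is not None

# Without the multiline option, ^ and $ anchor to the entire string.
lines = "555-1212\nnot a number\n867-5309"
whole_string = validator.findall(lines)
per_line = re.findall(r"^\d{3}-\d{4}$", lines, re.MULTILINE)

print(found, valid_input, invalid_input)
print(whole_string, per_line)
```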
Escaped characters
A problem occurs if you actually want to match one of the special characters, like "^" or "$". Use the backslash to remove their special meaning. Thus, "\^", "\.", and "\\" match the literal characters "^", ".", and "\", respectively.
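A quick way to convince yourself that the backslash really removes the special meaning is to compare an escaped and an unescaped pattern. The sketch below is in Python, not .NET; re.escape is Python's helper for escaping a literal string, and I believe .NET offers Regex.Escape for the same purpose.

```python
import re

caret = re.findall(r"\^", "2^10")           # literal caret
any_char = re.findall(r".", "a.b")          # unescaped dot: any character
literal_dot = re.findall(r"\.", "a.b")      # escaped dot: only the period
backslash = re.findall(r"\\", "C:\\temp")   # literal backslash

# Let the library escape a literal string instead of doing it by hand.
escaped = re.escape("3.5$")
matches_literal = re.fullmatch(escaped, "3.5$") is not None
matches_other = re.fullmatch(escaped, "3x5$") is not None

print(caret, any_char, literal_dot, backslash)
print(matches_literal, matches_other)
```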
Repetitions
You've seen that "{3}" and "*" can be used to specify the repetition of a single character. Later, you will see how the same syntax is used to repeat entire subexpressions. There are several other ways to specify a repetition, as shown in the following table:
*     | Repeat any number of times
+     | Repeat one or more times
?     | Repeat zero or one time
{n}   | Repeat n times
{n,m} | Repeat at least n times, but no more than m times
{n,}  | Repeat at least n times
Table 2 Common quantifiers
Let's try a few examples:
\b\w{5,6}\b--Find all five and six letter words
\b\d{3}\s\d{3}-\d{4}--Find ten digit phone numbers
\d{3}-\d{2}-\d{4}--Social security number
^\w*--The first word in the line or in the text
Try the last example with and without the "Multiline" option set, which changes the meaning of "^".
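The quantifier examples above behave as follows on a few made-up strings. As before, this is only a sketch in Python's re module, which I assume matches .NET for these quantifiers.

```python
import re

sentence = "The quick brown foxes jumped over seven lazy dogs"

# {5,6}: words with five or six word characters.
five_or_six = re.findall(r"\b\w{5,6}\b", sentence)

# Ten digit phone number written as "xxx xxx-xxxx".
phones = re.findall(r"\b\d{3}\s\d{3}-\d{4}", "Office: 650 555-1212")

# Social security number layout: three, two, then four digits.
ssns = re.findall(r"\d{3}-\d{2}-\d{4}", "SSN 078-05-1120 on file")

print(five_or_six)
print(phones)
print(ssns)
```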
Character sets
Searching for alphanumerics, digits, and whitespace is easy, but what if you need to search for any character from some other set of characters? This is easily handled by listing the desired characters within square brackets. Thus, "[aeiou]" matches any vowel, and "[.?!]" matches the punctuation at the end of a sentence. Note that in this case "." and "?" lose their special meaning within square brackets and are interpreted literally. We can also specify a range of characters, so "[a-z0-9]" means "match any lowercase letter or any digit".
Let's try a more complex expression that searches for a phone number:
\(?\d{3}[) ]\s?\d{3}[- ]\d{4}--A ten digit phone number
This expression will find phone numbers in several formats, such as "(800) 325-3535" or "650 555 1212". The "\(?" searches for zero or one left parenthesis, "[) ]" searches for a right parenthesis or a space, and "\s?" searches for zero or one whitespace character. Unfortunately, it will also find cases like "650) 555-1212" in which the parentheses are not balanced. Below, you will see how to use alternatives to solve this problem.
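To see both the strength and the weakness of this expression, it helps to run it against a handful of inputs. The following is a minimal sketch in Python (not .NET); the sample inputs and names are invented, and fullmatch is used so that each input has to fit the pattern as a whole.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
sentence_ends = re.findall(r"[.?!]", "Really? Yes. Wow!")

phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")
samples = ["(800) 325-3535", "650 555 1212", "650) 555-1212", "800-325-3535"]
results = {s: phone.fullmatch(s) is not None for s in samples}

print(vowels, sentence_ends)
for sample, ok in results.items():
    print(sample, ok)
```

The unbalanced "650) 555-1212" is accepted, which is exactly the flaw described above, while "800-325-3535" is rejected because the character after the area code must be ")" or a space.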
Negation
Sometimes we need to search for a character that is NOT a member of an easily defined set of characters. The following table shows how this can be specified:
\W       | Match any non-alphanumeric character
\S       | Match any non-whitespace character
\D       | Match any non-digit character
\B       | Match a position that is not the beginning or end of a word
[^x]     | Match any character that is not x
[^aeiou] | Match any character that is not one of a, e, i, o, u
Table 3 How to specify what you don't want
\S+--All strings that do not contain whitespace characters
Later, we'll see how to use "lookahead" and "lookbehind" to search for the absence of more complex patterns.
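The negated codes in Table 3 are the uppercase counterparts of the codes in Table 1, which a short experiment makes concrete. This sketch uses Python's re module in place of .NET, with hypothetical sample strings.

```python
import re

non_space = re.findall(r"\S+", "no  white\tspace")   # runs without whitespace
non_digits = re.findall(r"\D+", "tel: 555-1212")     # runs without digits
non_word = re.findall(r"\W", "a-b c")                # single non-alphanumerics
not_vowels = re.findall(r"[^aeiou]", "queue")        # anything but a vowel
inside_word = re.findall(r"\Bing", "ring ingot")     # "ing" not at a word start

print(non_space)
print(non_digits)
print(non_word)
print(not_vowels)
print(inside_word)
```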
Alternatives
To choose between several alternatives, any one of which is allowed to match, use the vertical bar "|" to separate them. For example, ZIP codes come in two kinds, one with five digits and the other with nine digits and a hyphen. We can find either kind with the following expression:
\b\d{5}-\d{4}\b|\b\d{5}\b--Five and nine digit ZIP codes
When using alternatives, the order is important because the matching algorithm will attempt to match the leftmost alternative first. If the order in this example is reversed, the expression will only find the five-digit ZIP codes and will fail to find the nine-digit ones. We can use alternatives to improve the expression for ten-digit phone numbers, allowing the area code to appear with or without enclosing parentheses:
(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}--Ten digit phone numbers, a better way
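The effect of ordering is easiest to believe once you see it. Below is a small Python sketch (again standing in for .NET) that runs the ZIP code expression in both orders and then tries the improved phone expression; the sample text is made up.

```python
import re

zip_good = r"\b\d{5}-\d{4}\b|\b\d{5}\b"       # nine-digit form listed first
zip_reversed = r"\b\d{5}\b|\b\d{5}-\d{4}\b"   # five-digit form listed first

text = "Send it to 12345-6789 or 54321"
in_order = re.findall(zip_good, text)
reversed_order = re.findall(zip_reversed, text)

phone = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
balanced = phone.fullmatch("(800) 325-3535") is not None
bare = phone.fullmatch("650 555 1212") is not None
unbalanced = phone.fullmatch("650) 555-1212") is not None

print(in_order)
print(reversed_order)
print(balanced, bare, unbalanced)
```

With the order reversed, the nine-digit code is cut down to its first five digits, and the improved phone expression now rejects the unbalanced parenthesis.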
Grouping
Parentheses can be used to delimit a subexpression to allow repetition or other special treatment, such as:
(\d{1,3}\.){3}\d{1,3}--A simple IP address finder
The first part of the expression searches for a one to three digit number followed by a literal period "\.". This is enclosed in parentheses and repeated three times using the modifier "{3}", followed by the same expression without the trailing period.
Unfortunately, this example allows each part of an IP address to be any one, two, or three digit number, although a valid IP address cannot contain a number larger than 255. It would be nice to be able to arithmetically compare a captured number N and require N<256, but this is not possible with regular expressions alone. The next example tests several alternatives based on the leading digits to guarantee the limited range of numbers by pattern matching. This shows that an expression can become cumbersome even when the description of the pattern is simple.
((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)--IP finder
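To compare the two IP expressions, one can feed both the same candidates. This is only a sketch in Python's re module, with hypothetical names; I use fullmatch so that the whole candidate must be an address, since a plain search could still find a valid-looking address inside a longer invalid one.

```python
import re

simple_ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
strict_ip = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

candidates = ["192.168.0.1", "255.255.255.255", "256.1.1.1", "999.999.999.999"]
simple = {c: simple_ip.fullmatch(c) is not None for c in candidates}
strict = {c: strict_ip.fullmatch(c) is not None for c in candidates}

for c in candidates:
    print(c, simple[c], strict[c])
```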
A "backreference" is used to search for a recurrence of text previously matched and captured by a group. For example, "\1" means "match the text that was captured by group 1". Here is an example:
\b(\w+)\b\s*\1\b--Find repeated words
It works by capturing a string of at least one alphanumeric character in group 1 with "(\w+)", but only if it begins and ends at a word boundary. It then searches for any amount of whitespace "\s*" followed by the captured text "\1", again ending at a word boundary.
In the above example, instead of writing the group as "(\w+)", we can write it as "(?<word>\w+)" to name this group "word". A backreference to this group is then written "\k<word>". Try the following example:
\b(?<word>\w+)\b\s*\k<word>\b--Capture repeated word in a named group
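Here is how the repeated-word search might look in code. Be aware that this sketch is Python, not .NET, and the named-group syntax differs: Python's re module writes "(?P<word>...)" for the group and "(?P=word)" for the backreference instead of "(?<word>...)" and "\k<word>". The sample sentence is invented.

```python
import re

text = "Paris in the the spring"

# Numbered group and backreference, same syntax as in the article.
numbered = re.search(r"\b(\w+)\b\s*\1\b", text)

# Named group; note the Python-specific (?P<name>...) and (?P=name) forms.
named = re.search(r"\b(?P<word>\w+)\b\s*(?P=word)\b", text)

print(numbered.group() if numbered else None)
print(named.group("word") if named else None)
```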
Parentheses introduce a number of special-purpose syntax elements. Some of the most common are summarized below:
Captures |
(exp)        | Match exp and capture it in an automatically numbered group
(?<name>exp) | Match exp and capture it in a group named name
(?:exp)      | Match exp, but do not capture it
Lookarounds |
(?=exp)      | Match any position preceding a suffix exp
(?<=exp)     | Match any position following a prefix exp
(?!exp)      | Match any position after which the suffix exp is not found
(?<!exp)     | Match any position before which the prefix exp is not found
Comment |
(?#comment)  | Comment
Table 4 Common grouping constructs
We have already discussed the first two. The third, "(?:exp)", does not alter the matching behavior; it simply does not capture the match in a named or numbered group as the first two do.
Positive lookaround
The next four are the so-called lookahead and lookbehind assertions. They look for things that go before or after the current match without including them in the match. It is important to understand that these expressions match a position, like "^" or "\b", and never match any text. For this reason, they are known as "zero-width assertions." They are best illustrated by example:
"(?=exp)" is the "zero-width positive lookahead assertion." It matches a position in the text that precedes a given suffix, but does not include the suffix in the match:
\b\w+(?=ing\b)--The beginning of words ending with "ing"
"(?<=exp)" is the "zero-width positive lookbehind assertion." It matches the position following a given prefix, but does not include the prefix in the match:
(?<=\bre)\w+\b--The end of words starting with "re"
Here is an example that could be used repeatedly to insert commas into numbers in groups of three digits:
(?<=\d)\d{3}\b--Three digits at the end of a word, preceded by a digit
Here is an example that looks for both a prefix and a suffix:
(?<=\s)\w+(?=\s)--Alphanumeric strings bounded by whitespace
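Because lookarounds match positions rather than text, the easiest way to understand them is to look at what ends up in the match. The sketch below uses Python's re module as a stand-in for .NET; these three patterns only need fixed-width lookbehind, which Python supports. The sample strings are made up.

```python
import re

# Lookahead: the "ing" suffix is required but not part of the match.
stems = re.findall(r"\b\w+(?=ing\b)", "singing and dancing")

# Lookbehind: the "re" prefix is required but not part of the match.
tails = re.findall(r"(?<=\bre)\w+\b", "return to the present reader")

# Both: words with whitespace on each side, whitespace excluded.
bounded = re.findall(r"(?<=\s)\w+(?=\s)", "one two three")

print(stems)
print(tails)
print(bounded)
```

"present" contributes nothing to the second result because its "re" does not start at a word boundary.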
Negative lookaround
Earlier, I showed how to search for a character that is not a specific character or a member of a character set. What if we simply want to verify that a character is not present, without matching anything? For example, what if we are searching for words in which "q" is not followed by "u"? We could try:
\b\w*q[^u]\w*\b--Words with "q" followed by not "u"
If you run the example, you will see that it fails when "q" is the last letter of a word, as in "Iraq". This is because "[^u]" always matches a character. If "q" is the last character of the word, it will match the whitespace character that follows, so in this example the expression ends up matching two whole words. Negative lookaround solves this problem because it matches a position and does not consume any text. As with positive lookaround, it can also be used to match the position of an arbitrarily complex subexpression, rather than just a single character. We can now do a better job:
\b\w*q(?!u)\w*\b--Search for words with "q" not followed by "u"
We used the "zero-width negative lookahead assertion", "(?!exp)", which succeeds only if the suffix "exp" is not present. Here is another example:
\d{3}(?!\d)--Three digits not followed by another digit
Similarly, we can use "(?<!exp)", the "zero-width negative lookbehind assertion", to search for a position in the text at which the prefix "exp" is not present:
(?<![a-z ])\w{7}--Strings of 7 alphanumerics not preceded by a letter or space
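The "Iraq" problem described above can be reproduced directly. This is a rough sketch in Python's re module (assumed to agree with .NET on negative lookahead), using an invented three-word sample.

```python
import re

text = "Iraq quit qatar"

# Character class version: [^u] consumes the space after "Iraq".
with_class = re.findall(r"\b\w*q[^u]\w*\b", text)

# Lookahead version: (?!u) checks the next character without consuming it.
with_lookahead = re.findall(r"\b\w*q(?!u)\w*\b", text)

# Three digits not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "1234 567")

print(with_class)
print(with_lookahead)
print(three_digits)
```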
Here is one more example using lookaround:
(?<=<(\w+)>).*(?=<\/\1>)--Text between HTML tags
This searches for an HTML tag using lookbehind and for the corresponding closing tag using lookahead, thus capturing the intervening text while excluding both tags.
Comments
Another use of parentheses is to include comments using the "(?#comment)" syntax. A better method is to set the "Ignore Pattern Whitespace" option, which allows whitespace to be inserted in the expression and then ignored when the expression is used. With this option set, anything following a number sign "#" up to the end of the line is ignored. For example, we can format the preceding example like this:
Text between HTML tags, with comments
(?<=      # Search for a prefix, but exclude it
<(\w+)>   # Match a tag of alphanumerics within angle brackets
)         # End the prefix
.*        # Match any text
(?=       # Search for a suffix, but exclude it
<\/\1>    # Match the previously captured tag preceded by "/"
)         # End the suffix
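Commented patterns work in other engines too. In Python, the re.VERBOSE flag plays the role that I assume "Ignore Pattern Whitespace" plays in .NET. One caveat: Python's re module only allows fixed-width lookbehind, so the lookbehind with a variable-length tag name above will not compile there. The sketch below therefore swaps in ordinary capturing groups and reads the text from group 2; that is a substitution of mine, not the expression shown above.

```python
import re

tag_text = re.compile(r"""
    <(\w+)>     # Match an opening tag and capture its name in group 1
    (.*)        # Capture any text in group 2
    </\1>       # Match the closing tag for the captured name
""", re.VERBOSE)

match = tag_text.search("<b>bold text</b>")
inner = match.group(2) if match else None

print(inner)
```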
Greedy and lazy
When a regular expression has a quantifier that can accept a range of repetitions (like ".*"), the normal behavior is to match as many characters as possible. Consider the following regular expression:
a.*b--The longest string starting with a and ending with b
If this is used to search the string "aabab", it will match the entire string "aabab". This is called "greedy" matching. Sometimes we prefer "lazy" matching, in which a match using the minimum number of repetitions is found. All the quantifiers in Table 2 can be turned into "lazy" quantifiers by adding a question mark "?". Thus "*?" means "match any number of repetitions, but use the smallest number of repetitions that still leads to a successful match." Now let's try the lazy version of the previous example:
a.*?b--The shortest string starting with a and ending with b
If we apply this to the same string "aabab", it will first match "aab" and then "ab".
*?     | Repeat any number of times, but as few as possible
+?     | Repeat one or more times, but as few as possible
??     | Repeat zero or one time, but as few as possible
{n,m}? | Repeat at least n times, but no more than m times, and as few as possible
{n,}?  | Repeat at least n times, but as few as possible
Table 5 Lazy quantifiers
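The "aabab" example is small enough to check by hand, but it is reassuring to see the engine agree. Here is a minimal sketch in Python's re module, which I assume handles greedy and lazy quantifiers the same way as .NET.

```python
import re

text = "aabab"

# Greedy: .* grabs as much as it can, so one long match is found.
greedy = re.findall(r"a.*b", text)

# Lazy: .*? grabs as little as it can, so two shorter matches are found.
lazy = re.findall(r"a.*?b", text)

print(greedy)
print(lazy)
```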
What did we leave out?
I've described a lot of elements with which to begin building regular expressions, but I've also left out a few things, which are summarized in the following table. Many of these are illustrated with additional examples in the project file. The example number is shown in the left-hand column of this table.
Example | Syntax | Description
   | \a | Bell (alarm) character
   | \b | Normally a word boundary, but within a character set it means backspace
   | \t | Tab
34 | \r | Carriage return
   | \v | Vertical tab
   | \f | Form feed
35 | \n | New line
   | \e | Escape
36 | \nnn | Character whose ASCII code is nnn in octal
37 | \xnn | Character whose hexadecimal code is nn
38 | \unnnn | Character whose Unicode code is nnnn
39 | \cN | Control-N character; for example, carriage return (Ctrl-M) is \cM
40 | \A | Beginning of the string (like ^ but does not depend on the multiline option)
41 | \Z | End of the string, or before \n at the end of the string (ignores multiline)
   | \z | End of the string (ignores multiline)
42 | \G | Beginning of the current search
43 | \p{name} | Any character in the Unicode class named name, for example \p{IsGreek}
   | (?>exp) | Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and then does not participate in backtracking.
44 | (?<x>-<y>exp) or (?-<y>exp) | Balancing group. This is complicated but powerful. It allows named capture groups to be manipulated on a push down/pop up stack and can be used, for example, to search for matching parentheses, which is otherwise not possible with regular expressions. See the example in the project file.
45 | (?im-nsx:exp) | Change the regular expression options for the subexpression exp
46 | (?im-nsx) | Change the regular expression options for the rest of the enclosing group
   | (?(exp)yes|no) | The subexpression exp is treated as a zero-width positive lookahead. If it matches at this point, the subexpression yes becomes the next match; otherwise no is used.
   | (?(exp)yes) | Same as above but with an empty no expression
   | (?(name)yes|no) | This is the same syntax as the preceding case. If name is a valid group name, the yes expression is matched if the named group had a successful match; otherwise the no expression is matched.
47 | (?(name)yes) | Same as above but with an empty no expression
Table 6 What we left out. The left column shows the number of the example in the project file that illustrates the construct
Conclusion
We've given many examples to illustrate the essential features of .NET regular expressions, emphasizing the use of a tool (like Expresso) to test, practice, and then learn by example. If you want to study further, there are many online resources that will help you go deeper. You can start by visiting the Ultrapico website. If you want to read a book, I suggest the latest edition of "Mastering Regular Expressions" by Jeffrey Friedl.
There are also a number of good articles on The Code Project, including the following tutorials:
· An Introduction to Regular Expressions by Uwe Keim
· Microsoft Visual C# .NET Developer's Cookbook: Chapter on Strings and Regular Expressions
Note: The examples can be tested with Expresso, which can be downloaded from the Ultrapico website.