Regular Expressions in ASP.NET
Published: 8/17/2004 | Updated: 8/17/2004
Crash Course
Steven A. Smith
Applies to:
Microsoft® .NET Framework
Microsoft® ASP.NET
Regular Expression APIs
Summary: Regular expressions are a powerful tool for working with text. Whether you need to validate user input, search for patterns within a string, or reformat text, regular expressions can help.
Download the source code for this article.
Contents
Introduction
History of Regular Expressions
Simple Expressions
Quantifiers
Metacharacters
Character Classes
Predefined Set Metacharacters
Expression Examples
Validation in ASP.NET
Regular Expression API
Free Tools
Advanced Topics
Conclusion
Resources
About the Author
Introduction
The Microsoft® .NET Framework's support for regular expressions is first-rate, and even Microsoft® ASP.NET includes controls that rely on the regular expression language. This article covers the basics of regular expressions and recommends material for further study.
This article is aimed mainly at beginners who have little or no experience with regular expressions but who are familiar with ASP.NET and can program in .NET. In addition, it is hoped that this article, together with the regular expression cheat sheet, will serve as a handy reference or refresher for developers who already have regular expression experience. This article discusses the following topics:
1. History of regular expressions
2. Simple expressions
3. Quantifiers
4. Metacharacters
5. Character classes
6. Predefined set metacharacters
7. Sample expressions in detail
8. ASP.NET validation
9. Regular expression API
10. Free tools
11. Advanced topics overview
12. Summary and other resources
If you have questions about this article or about regular expressions in general, visit http://www.aspadvice.com/ and ask via the Regex mailing list. The list had more than 350 subscribers at the time this article was written.
History of Regular Expressions
Regular expressions date back to the 1950s and are still in use today. They were originally used to describe "regular sets", which were models studied by neurophysiologists. Regular expressions were first proposed by the mathematician Stephen Kleene and were eventually used by Ken Thompson in two very popular text utilities, QED and grep. Jeffrey Friedl covers this background in more detail in his book "Mastering Regular Expressions (2nd edition)", which is recommended for anyone who wants to learn more about the theory and history of regular expressions.
Over the last 50 years, regular expressions have evolved from an obscure and esoteric mathematical concept into a mainstream feature of all kinds of tools and software packages. Although many UNIX tools have supported regular expressions for decades, they have found their way into most Windows developers' toolkits only in roughly the last decade. In Microsoft® Visual Basic® 6 or Microsoft® VBScript, regular expressions remained awkward to use at best. With the arrival of the .NET Framework, however, regular expression support has matured to the point where it is available to all Microsoft developers and all .NET languages.
So, what exactly is a regular expression? A regular expression is a language that can precisely describe patterns within text strings. Beyond simply describing these patterns, a regular expression engine can typically be used to iterate through matches, to parse a string into substrings using a pattern as a delimiter, or to intelligently replace or reformat text. Regular expressions provide an efficient and straightforward way to solve many common text-processing tasks.
Regular expressions are usually discussed in terms of the text they match (or fail to match). This article (and the System.Text.RegularExpressions classes) refers to three participants in a regular expression interaction: the regular expression pattern, the input string, and any matches the pattern makes within the string.
Simple Expressions
The simplest regular expression is one you already know: a literal string. A particular string can be described by the literal itself; a regular expression pattern such as foo matches the input string foo exactly. It also matches input that merely contains foo, such as "The food was quite tasty", which may not be what you want if you are looking for an exact match.
Of course, matching a string against an exact copy of itself is hardly worth the trouble and does not show the real power of regular expressions. Suppose that instead of foo you wanted to find all words that start with the letter f, or all three-letter words. That is beyond what literal strings can reasonably do, so we need to dig deeper into regular expressions. Here is an example of a literal expression and some inputs it matches.
Pattern | Input (matches)
foo | foo, food, foot, "There's evil afoot."
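The difference between "contains the pattern" and "is exactly the pattern" can be sketched in a few lines. This is a minimal illustration that uses Python's re module as a stand-in for the .NET Regex class (an assumption made only so the example is easy to run; literal patterns behave the same way in both). The sample inputs are taken from the table above, plus one non-matching string added for contrast.

```python
import re

# Python's re module stands in for .NET's Regex class here (an assumption);
# a literal pattern behaves the same way in both engines.
pattern = re.compile(r"foo")

inputs = ["foo", "food", "foot", "There's evil afoot.", "bar"]

# search() succeeds if the pattern occurs anywhere in the input.
contains = [s for s in inputs if pattern.search(s)]

# fullmatch() succeeds only if the entire input is exactly the pattern.
exact = [s for s in inputs if pattern.fullmatch(s)]

print(contains)
print(exact)
```

Only the first list includes food, foot, and the sentence containing "afoot"; the second list is limited to the input that equals the pattern.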
Quantifiers
Quantifiers provide a simple way to specify how many times a particular character or set of characters may repeat in a pattern. There are three non-explicit quantifiers:
1. *, which means "zero or more occurrences".
2. +, which means "one or more occurrences".
3. ?, which means "zero or one occurrence".
A quantifier always applies to the pattern immediately preceding it (to its left), which is usually a single character unless parentheses are used to create a pattern group. Here are some sample patterns and inputs they match.
Pattern | Input (matches)
fo* | foo, foe, food, fooot, "forget it", funny, puffy
fo+ | foo, foe, food, foot, "forget it"
fo? | foo, foe, food, foot, "forget it", funny, puffy
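To see what each quantifier actually consumes, it helps to look at the matched text rather than just whether a match occurred. The sketch below again uses Python's re module as a stand-in for the .NET Regex class (an assumption; these three quantifiers behave identically in both). The helper name matched_text is hypothetical.

```python
import re

inputs = ["foo", "foe", "food", "fooot", "forget it", "funny", "puffy"]

def matched_text(pattern, text):
    """Return the first substring the pattern matches, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# * : an 'f' followed by zero or more 'o' characters
star = {s: matched_text(r"fo*", s) for s in inputs}

# + : an 'f' followed by one or more 'o' characters
plus = {s: matched_text(r"fo+", s) for s in inputs}

# ? : an 'f' followed by zero or one 'o' character
opt = {s: matched_text(r"fo?", s) for s in inputs}

for s in inputs:
    print(f"{s!r:12} fo*={star[s]!r:8} fo+={plus[s]!r:8} fo?={opt[s]!r}")
```

Note that fo+ finds nothing in "funny" or "puffy" because no f there is followed by an o, whereas fo* and fo? are satisfied by the bare f.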
In addition to specifying zero or one occurrence of a given pattern, the ? character can also force a pattern or subpattern to match the fewest characters possible when it could match several in the input string. Placed directly after another quantifier, as in .*?, it makes that quantifier lazy rather than greedy.
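A short sketch of greedy versus lazy matching, using Python's re module as a stand-in for the .NET Regex class (an assumption; the lazy ? modifier works the same way in both). The input string is a made-up sample.

```python
import re

text = "fooood"

# Greedy: + takes as many 'o' characters as it can.
greedy = re.search(r"fo+", text).group(0)

# Lazy: +? takes as few 'o' characters as it can (at least one).
lazy_plus = re.search(r"fo+?", text).group(0)

# Lazy: *? takes as few as it can, which may be none at all.
lazy_star = re.search(r"fo*?", text).group(0)

print(greedy, lazy_plus, lazy_star)
```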
In addition to the non-explicit quantifiers (usually just called quantifiers, but distinguished here from the next group), there are also explicit quantifiers. Where ordinary quantifiers are fairly vague about how many occurrences of a pattern there are, explicit quantifiers let you specify an exact number or a range. An explicit quantifier follows the pattern it applies to, just like a regular quantifier. It uses curly braces {} containing numeric values for the lower and upper limits on the number of occurrences. For example, x{5} matches exactly five x characters (xxxxx). A single number means exactly that many occurrences; if the number is followed by a comma, as in x{5,}, the pattern matches any number of x characters greater than 4. Two numbers, as in x{2,3}, give a lower and an upper limit. Here are some sample patterns and inputs they match.
Pattern | Input (matches)
ab{2}c | abbc, aaabbccc
ab{0,2}c | ac, abc, abbc, aabbcc
ab{2,3}c | abbc, abbbc, aabbcc, aabbbcc
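The explicit quantifier forms can be checked against a handful of inputs. This sketch uses Python's re module as a stand-in for the .NET Regex class (an assumption). The {n}, {n,}, and {n,m} forms used here are common to both engines; the lower bound is always written out because not every engine accepts a form with the lower bound omitted. The helper name matches is hypothetical.

```python
import re

def matches(pattern, inputs):
    """Return the inputs in which the pattern occurs somewhere."""
    return [s for s in inputs if re.search(pattern, s)]

inputs = ["ac", "abc", "abbc", "abbbc", "abbbbc", "aabbcc"]

exactly_two = matches(r"ab{2}c", inputs)      # 'a', exactly two 'b', then 'c'
up_to_two = matches(r"ab{0,2}c", inputs)      # zero to two 'b'
two_or_three = matches(r"ab{2,3}c", inputs)   # two or three 'b'
at_least_three = matches(r"ab{3,}c", inputs)  # three or more 'b'

print(exactly_two)
print(up_to_two)
print(two_or_three)
print(at_least_three)
```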
Metacharacters
Constructs that have a special meaning within regular expressions are called metacharacters. You already know several of them: the *, ?, +, and {} characters. Several other characters also have special meaning in the regular expression language. These include: $ ^ . [ ( | ) ] and \.
The . (period or dot) metacharacter is one of the simplest and most commonly used. It matches any single character (by default, any character except a newline). A period is useful when part of a pattern may contain any combination of characters but must fall within a certain length range. We have also seen that an expression matches any instance of its pattern within a longer string; what if you want only an exact match? This comes up often in validation scenarios, for example when ensuring that a user has entered a postal code or phone number in the correct format. The ^ metacharacter specifies the start of the string (or line), and the $ metacharacter specifies the end of the string (or line). By adding these characters to the beginning and end of a pattern, you force it to match only input strings that match it exactly. The ^ metacharacter also has a special meaning when it is the first character of a character class, specified with square brackets []. That is covered under character classes.
The \ (backslash) metacharacter is used both to "escape" characters from their special meaning and to introduce the predefined set metacharacters, which are covered in a later section. To include a metacharacter literally in a regular expression, you must escape it with a backslash. For example, to match strings that begin with "c:\", you could use ^c:\\. Here the ^ metacharacter indicates that the string must start with this pattern, and the first backslash escapes the literal backslash that follows it.
The | (pipe) metacharacter is used for alternation, that is, to specify "this or that" within a pattern. For example, a|b matches any input that contains "a" or "b", much like the character class [ab].
Finally, parentheses () are used to group patterns. Grouping lets a complete pattern be repeated with a quantifier. It also improves readability and lets specific parts of the input be matched separately, which is useful for parsing or reformatting.
Some examples of using metacharacters are listed below.
Pattern | Input (matches)
. | a, b, c, 1, 2, 3
.* | abc, 123, any string; matches even when there are no characters
^c:\\ | c:\windows, c:\\\\\, c:\foo.txt, c:\ followed by anything else
abc$ | abc, 123abc, any string ending with abc
(abc){2,3} | abcabc, abcabcabc
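The following sketch exercises anchors, escaping, alternation, and grouping in turn. It uses Python's re module as a stand-in for the .NET Regex class (an assumption; these metacharacters behave the same way in both). The sample inputs are made up for illustration. Python raw strings (r"...") are used so that each backslash reaches the regular expression engine unchanged.

```python
import re

# Anchors: without them the pattern may match inside a longer string.
unanchored = bool(re.search(r"abc", "123abc456"))
anchored = bool(re.search(r"^abc$", "123abc456"))

# Escaping: \\ in the pattern matches one literal backslash.
path_pattern = re.compile(r"^c:\\")
paths = ["c:\\windows", "c:\\foo.txt", "d:\\windows", "xc:\\"]
path_matches = [p for p in paths if path_pattern.search(p)]

# Alternation: "this or that".
alternation = [s for s in ["a", "b", "c"] if re.search(r"a|b", s)]

# Grouping: the quantifier applies to the whole group (abc).
candidates = ["abc", "abcabc", "abcabcabc"]
grouped = [s for s in candidates if re.fullmatch(r"(abc){2,3}", s)]

print(unanchored, anchored)
print(path_matches)
print(alternation)
print(grouped)
```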
Character Classes
A character class is a "mini" language within regular expressions, defined within square brackets []. The simplest character class is just a list of characters inside the brackets, such as [aeiou]. When you use a character class in an expression, any one of its characters may appear at that position in the pattern (but only one character, unless a quantifier is used). Note that a character class cannot define a word or a pattern; it defines only a single character.
To specify any digit, you could use the character class [0123456789]. Because listing every character quickly becomes cumbersome, you can define a range of characters by using a hyphen inside the brackets. The hyphen has a special meaning within a character class (but not elsewhere in a regular expression, so it is not strictly a regular expression metacharacter), and it has that special meaning only when it is not the first character of the class. To specify any digit with a hyphen, use [0-9]. The same works for lowercase letters with [a-z] and for uppercase letters with [A-Z]. The range a hyphen defines depends on the character set in use, so the order in which characters appear in (for example) the ASCII or Unicode table determines which characters the range includes. To include a literal hyphen in a class, specify it as the first character. For example, [-.? ] matches any one of four characters (note that the last character is a space). Also note that regular expression metacharacters are not treated specially inside character classes, so they do not need to be escaped there. Character classes have their own rules and syntax; think of them as a language separate from the rest of the regular expression language.
You can also match any character that is not a member of a character class by negating the class with the ^ character as its first character. To match any non-vowel character, for example, you could use the character class [^aAeEiIoOuU]. Note that to negate a hyphen, the hyphen should be the second character of the character class, as in [^-]. Remember that the role of ^ inside a character class is completely different from its role elsewhere in a regular expression pattern.
Some character classes in action are listed below.
Pattern | Input (matches)
^b[aeiou]t$ | bat, bet, bit, bot, but
^[0-9]{5}$ | 11111, 12345, 99999
^[^-][0-9]$ | 10, 25, a7, or any other non-hyphen character followed by one digit (does not match -0, -1, -2, etc.)
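A brief sketch of character classes in use: a simple set, a range with a quantifier, a negated class, and a class whose first character is a literal hyphen. Python's re module stands in for the .NET Regex class (an assumption; these character class features are common to both). The sample inputs are made up for illustration.

```python
import re

# A simple set: exactly one vowel between 'b' and 't'.
words = ["bat", "bet", "bit", "bot", "but", "bad", "boot"]
vowel_words = [w for w in words if re.search(r"^b[aeiou]t$", w)]

# A range with an explicit quantifier: exactly five digits.
codes = ["11111", "12345", "1234", "123456", "1234a"]
five_digits = [c for c in codes if re.search(r"^[0-9]{5}$", c)]

# A negated class: every character that is not a lowercase vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")

# A leading hyphen is a literal member of the class, not a range.
separators = re.findall(r"[-.? ]", "a-b.c?d e")

print(vowel_words)
print(five_digits)
print(non_vowels)
print(separators)
```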
In the next release of the .NET Framework, code-named "Whidbey", a new feature called character class subtraction is being added to character classes. It allows one character class to be subtracted from another, which gives a more readable way to describe some patterns. The specification is available at the following address: http://www.gotdotnet.com/team/clr/bcl/TechArticles/techarticles/Specs/Regex/CharacterClassSubtraction.doc. The syntax looks like [a-z-[aeiou]], which matches all lowercase consonants.
Predefined Set Metacharacters
Much can be accomplished with the tools covered so far. However, it is tedious to write [0-9] for every digit in a pattern or, worse, [0-9a-zA-Z] for every alphanumeric character. To ease the pain of these common but lengthy patterns, a set of predefined metacharacters exists. Different regular expression implementations define different sets; the ones described here are supported by the System.Text.RegularExpressions API in the .NET Framework. The standard syntax is a backslash \ followed by one or more characters. Most are only one character long, which makes them easy to use and an ideal replacement for lengthy character classes. Two examples: \d matches any digit, and \w matches any word character (alphanumerics plus the underscore). The exceptions are specific character code matches, where the code of the character to match must be given, as in \u000D, which matches the Unicode carriage return. Some of the most common character classes and their metacharacter equivalents are listed below.
Metacharacter | Equivalent character class
\a | Matches a bell (alarm); \u0007
\b | Matches a word boundary, except inside a character class, where it matches a backspace character, \u0008
\t | Matches a tab, \u0009
\r | Matches a carriage return, \u000D
\v | Matches a vertical tab, \u000B
\f | Matches a form feed, \u000C
\n | Matches a new line, \u000A
\e | Matches an escape character, \u001B
\040 | Matches an ASCII character given as a three-digit octal number. \040 represents a space (decimal 32).
\x20 | Matches an ASCII character given as a two-digit hexadecimal number. In this case, \x20 represents a space.
\cC | Matches an ASCII control character, in this case Ctrl-C.
\u0020 | Matches a Unicode character given as a four-digit hexadecimal number. In this case, \u0020 is a space.
\* | Any character that does not represent a predefined character class is treated simply as that character. Therefore \* is equivalent to \x2A (a literal *, not the * metacharacter).
\p{name} | Matches any character in the named character class 'name'. Supported names are Unicode groups and block ranges, for example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency).
\P{name} | Matches text that is not included in the named character class 'name'.
\w | Matches any word character. For non-Unicode and ECMAScript implementations, this is equivalent to [a-zA-Z_0-9]. In Unicode categories, this is equivalent to [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
\W | The negation of \w, equivalent to the ECMAScript-compatible set [^a-zA-Z_0-9] or the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
\s | Matches any white-space character. Equivalent to the Unicode character classes [\f\n\r\t\v\x85\p{Z}]. If ECMAScript compatibility is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v] (note the leading space).
\S | Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript compatibility is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v] (note the space after ^).
\d | Matches any decimal digit. Equivalent to [\p{Nd}] for Unicode and to [0-9] for non-Unicode, ECMAScript behavior.
\D | Matches any non-digit character. Equivalent to [\P{Nd}] for Unicode and to [^0-9] for non-Unicode, ECMAScript behavior.
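A few of these metacharacters can be tried out as follows. This sketch uses Python's re module as a stand-in for the .NET Regex class (an assumption). Only \d, \w, \s, \b, and the numeric character escapes are shown, because Python's re does not support some entries in the table, such as \p{name}, \e, and \cC. The sample text is made up for illustration.

```python
import re

text = "Order 66 shipped\tto dock_7 on 2004-08-17"

# \d+ : runs of decimal digits
digits = re.findall(r"\d+", text)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)

# \s+ : runs of white space (here a space or a tab) used as a delimiter
fields = re.split(r"\s+", text)

# \b : word boundary, so only the standalone word "to" is found
standalone = re.findall(r"\bto\b", "to tomato to")

# Three ways to write a space by character code: octal, hex, Unicode
space_forms = all(re.fullmatch(p, " ") for p in (r"\040", r"\x20", r"\u0020"))

print(digits)
print(words)
print(fields)
print(len(standalone), space_forms)
```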
Expression Examples
Many people learn best by example, so here are some sample expressions. For more examples, visit the online regular expression library at http://www.regexlib.com/.
Pattern | Description
^\d{5}$ | Five digits, such as a U.S. ZIP code.
^((\d{5})|(\d{5}-\d{4}))$ | Five digits, or five digits, a dash, and four digits. Matches a U.S. ZIP code in either the 5-digit format or the 5-digit + 4-digit format. The outer parentheses make both anchors apply to both alternatives.
^(\d{5})(-\d{4})?$ | Same as the previous one, but more efficient. The ? makes the dash and 4-digit portion of the pattern optional, instead of requiring two separate patterns to be compared through alternation.
^[+-]?\d+(\.\d+)?$ | Matches any real number with an optional sign.
^[+-]?\d*\.?\d*$ | Same as the previous one, but also matches an empty string.
^(20|21|22|23|[01]\d)[0-5]\d$ | Matches a 24-hour time value.
/\*.*\*/ | Matches a C-style comment /* ... */
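Several of the sample expressions can be checked against valid and invalid inputs. This is a sketch that uses Python's re module as a stand-in for the .NET Regex class (an assumption; these particular patterns use only constructs common to both). The names ZIP, REAL, TIME, COMMENT, and is_match are hypothetical, and the sample inputs are made up.

```python
import re

ZIP = re.compile(r"^(\d{5})(-\d{4})?$")              # ZIP or ZIP+4
REAL = re.compile(r"^[+-]?\d+(\.\d+)?$")             # signed real number
TIME = re.compile(r"^(20|21|22|23|[01]\d)[0-5]\d$")  # 24-hour time, HHMM
COMMENT = re.compile(r"/\*.*\*/")                    # C-style comment

def is_match(regex, text):
    """Return True if the compiled pattern matches the text."""
    return regex.search(text) is not None

zip_results = [is_match(ZIP, s) for s in ["12345", "12345-6789", "1234", "12345-678"]]
real_results = [is_match(REAL, s) for s in ["3.14", "-42", "+0.5", ".", "", "1."]]
time_results = [is_match(TIME, s) for s in ["0000", "2359", "2400", "1260"]]
comment = COMMENT.search("x = 1; /* set x */ y = 2;").group(0)

print(zip_results)
print(real_results)
print(time_results)
print(comment)
```

Because .* is greedy, the comment pattern runs to the last */ on the line; a lazy .*? would stop at the first one.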
Validation in ASP.NET
ASP.NET provides a set of validation controls that make validating input on a Web form far easier than the same task was in classic ASP. One of the most powerful validators is the RegularExpressionValidator, which, as you might expect, lets you supply a regular expression that the input must match in order to be valid. To set the validatio of a control