A detailed explanation of regular expressions in C#

Source: Internet
Author: User

Many programming languages and tools support regular expressions, and C# is no exception: the C# base class library provides a namespace (System.Text.RegularExpressions) and a set of classes (Regex, Match, Group, and so on) that give full access to the power of regular expressions. So what is a regular expression, and how do you define one?

I. The basics of regular expressions

What is a regular expression
When writing code that processes strings, you often need to find strings that satisfy some complex rule. Regular expressions are the tool for describing such rules. In other words, a regular expression is code that records a text rule.
You have probably used the wildcards * and ? when searching for files in Windows. To find all the Word documents in a directory, you search for *.doc, where * stands for any string. Like wildcards, regular expressions are a tool for matching text, but they describe what you want far more precisely, at the cost of being more complex.
A simple example: validating a phone number
The best way to learn regular expressions is to start from an example, so we will validate a phone number and build up an understanding of regular expressions step by step.
In China, a landline number such as 0379-65624150 usually consists of a 3- or 4-digit area code starting with 0 and a 7- or 8-digit local number, separated by a hyphen (-). This example introduces the metacharacter \d, which matches a single digit from 0 to 9. The regular expression can be written as: ^0\d{2,3}-\d{7,8}$
Let's analyze it: ^ anchors the match at the start of the string, 0 matches the digit "0", \d matches a digit, {2,3} repeats it 2 to 3 times, - matches only the hyphen itself, the next \d matches a digit, {7,8} repeats it 7 to 8 times, and $ anchors the match at the end of the string. A phone number can also be written as (0379)65624150; matching that form is left as an exercise for the reader.
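As a quick check of the pattern above, the following sketch calls the static Regex.IsMatch method. The class and method names (HyphenatedPhoneSketch, IsValid) are made up for illustration and are not part of the original article.

```csharp
using System.Text.RegularExpressions;

public static class HyphenatedPhoneSketch
{
    // The pattern from the text: "0" plus 2-3 digits, a hyphen, then 7-8 digits.
    private const string Pattern = @"^0\d{2,3}-\d{7,8}$";

    // Returns true when the whole input has the area-code, hyphen, number shape.
    public static bool IsValid(string input) => Regex.IsMatch(input, Pattern);
}
```

With this helper, HyphenatedPhoneSketch.IsValid("0379-65624150") is true, while a number that does not start with 0 or that has too few digits is rejected.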
Metacharacters
The example above used the metacharacter \d. As you might guess, regular expressions have many metacharacters like it. The following table lists some common ones:

Metacharacter   Description
.               Matches any character except a line feed
\b              Matches the start or end of a word
\d              Matches a digit
\s              Matches any whitespace character
\w              Matches a letter, digit, underscore, or Chinese character
^               Matches the start of the string
$               Matches the end of the string

Table 1. Commonly used metacharacters
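A few of these metacharacters can be tried out directly from C#. The sketch below is illustrative only; the class MetacharacterSketch and its method names are assumptions, not something the article defines.

```csharp
using System.Text.RegularExpressions;

public static class MetacharacterSketch
{
    // \b\w+\b : a run of word characters delimited by word boundaries.
    public static int CountWords(string text) => Regex.Matches(text, @"\b\w+\b").Count;

    // \d+ : the first run of digits, or an empty string when there is none.
    public static string FirstNumber(string text) => Regex.Match(text, @"\d+").Value;

    // ^\s*$ : nothing but whitespace between the start and the end of the string.
    public static bool IsBlank(string text) => Regex.IsMatch(text, @"^\s*$");
}
```

For example, CountWords("hi there 42") counts three words and FirstNumber("room 42, floor 7") returns "42".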
Escape characters
If you want to match a metacharacter itself, such as . or *, there is a problem: you cannot write it directly, because it would be interpreted as something else. Prefix it with \ to remove its special meaning, so write \. and \*. To match \ itself, write \\.
For example, unibetter\.com matches unibetter.com, and C:\\Windows matches C:\Windows.
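In C#, the escaping described above does not have to be done by hand: the Regex.Escape method adds the backslashes. The sketch below is a minimal illustration; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Turns arbitrary text into a pattern that matches that text literally.
    public static string ToLiteralPattern(string text) => Regex.Escape(text);

    // Searches for 'literal' inside 'input' with every metacharacter escaped.
    public static bool ContainsLiteral(string input, string literal) =>
        Regex.IsMatch(input, Regex.Escape(literal));
}
```

ToLiteralPattern("unibetter.com") yields unibetter\.com, so ContainsLiteral("unibetterXcom", "unibetter.com") is false, whereas the unescaped pattern unibetter.com would match that input because . matches any character.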
Quantifiers
A quantifier (also called a qualifier or repetition specifier) states how many times the preceding element may appear. For example, the {2,3} used in the phone number pattern means 2 to 3 occurrences. The most commonly used quantifiers are:

Quantifier   Description
*            Repeats zero or more times
+            Repeats one or more times
?            Repeats zero times or once
{n}          Repeats exactly n times
{n,}         Repeats n or more times
{n,m}        Repeats n to m times

Table 2. Commonly used quantifiers
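The effect of each quantifier is easiest to see when the pattern has to account for the whole input. The helper below is a sketch: it wraps a pattern in ^(?:...)$, and both the wrapping and the name QuantifierSketch.MatchesWhole are assumptions made for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierSketch
{
    // Wraps the pattern in ^(?:...)$ so that it must cover the entire input.
    public static bool MatchesWhole(string pattern, string input) =>
        Regex.IsMatch(input, "^(?:" + pattern + ")$");
}
```

For instance, MatchesWhole("ab*", "a") is true but MatchesWhole("ab+", "a") is false, and MatchesWhole(@"\d{2,3}", "1234") is false because four digits exceed the upper bound.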

II. .NET regular expression support

The System.Text.RegularExpressions namespace contains the classes that provide access to the .NET Framework regular expression engine. The functionality it provides can be used from any platform or language that runs on the Microsoft .NET Framework.

1. Using regular expressions in C#
Now that we know which classes support regular expressions in C#, let's put the phone number pattern from above into C# code that validates a phone number.
Step 1: create a Windows project named SimpleCheckPhoneNumber.
Step 2: import the System.Text.RegularExpressions namespace.
Step 3: write the regular expression. It is the phone number pattern from above, but since that pattern only accepts numbers whose area code and local number are joined by a hyphen, we modify it: 0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8}. The part before | is the pattern discussed above, and the part after it validates numbers of the form (0379)65624150. Because ( and ) are metacharacters, they must be escaped. The | denotes alternation: the expression matches either the part before it or the part after it.
Step 4: construct a Regex object from the regular expression.
Step 5: call the IsMatch method of the Regex class to validate the input. IsMatch() returns a bool: true if the input matches, false otherwise.
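The steps above can be sketched as follows. This is not the article's original project code: the class layout is an assumption, and the pattern is additionally wrapped in ^(?:...)$ (my addition) so that the whole input must be a phone number instead of merely containing one.

```csharp
using System.Text.RegularExpressions;   // Step 2: import the namespace

public static class SimpleCheckPhoneNumber
{
    // Step 3: the pattern from the text, anchored with ^(?:...)$ (an addition).
    private const string PhonePattern =
        @"^(?:0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8})$";

    // Step 4: construct a Regex object from the pattern.
    private static readonly Regex PhoneRegex = new Regex(PhonePattern);

    // Step 5: IsMatch returns true when the input matches, false otherwise.
    public static bool Check(string number) => PhoneRegex.IsMatch(number);
}
```

Both SimpleCheckPhoneNumber.Check("0379-65624150") and SimpleCheckPhoneNumber.Check("(0379)65624150") return true; an input with extra leading text returns false because of the anchors.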

III. Advanced regular expressions

Grouping
When matching the phone number we repeated single characters. Now let's see how to use grouping to match an IP address.
As we all know, an IP address is written as four dot-separated decimal fields. We can therefore match it by grouping its fields. First, match a single field: 2[0-4]\d|25[0-5]|[01]?\d\d? Here 2[0-4]\d matches the three-digit fields from 200 to 249 (hundreds digit 2, tens digit 0 to 4, any units digit); 25[0-5] matches the three-digit fields from 250 to 255; and [01]?\d\d? matches the fields from 0 to 199 (an optional leading 0 or 1, one digit, and an optional second digit). The ? means zero or one occurrence, so [01] and the last \d may be absent. Because | has the lowest precedence, the alternatives must be wrapped in parentheses before anything is appended to them: (2[0-4]\d|25[0-5]|[01]?\d\d?). Appending \. to match the dot gives one field plus its separator, which we wrap in a group of its own: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.). Repeat this group three times and finish with one more field. The complete regular expression is: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
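A sketch of the grouped IP address pattern in C# follows. Each field alternative is parenthesized so that the trailing \. and the {3} repetition apply to the whole field; the ^ and $ anchors and the names IpAddressSketch, Field, and IsValid are assumptions added for the example.

```csharp
using System.Text.RegularExpressions;

public static class IpAddressSketch
{
    // One field: 200-249, 250-255, or 0-199.
    private const string Field = @"(2[0-4]\d|25[0-5]|[01]?\d\d?)";

    // Three "field plus dot" groups followed by a final field, anchored at both ends.
    private static readonly Regex Address =
        new Regex("^(" + Field + @"\.){3}" + Field + "$");

    public static bool IsValid(string text) => Address.IsMatch(text);
}
```

IpAddressSketch.IsValid("192.168.0.1") returns true, while "256.1.1.1" and "1.2.3" are rejected.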

Backreferences
Once we understand grouping, we can use backreferences. A backreference uses the text captured by an earlier group to match text that follows, which makes it suitable for matching repeated text. For example, to match the repeated word in go go, we can use (go) \1; more generally, \b(\w+)\b\s+\1\b matches any word that appears twice in a row.
By default, every group automatically receives a number. Groups are numbered from left to right by their opening parenthesis: the first group is 1, the second is 2, and so on. You can also name a subexpression yourself with the syntax (?<word>\w+) (the angle brackets can be replaced by single quotes: (?'word'\w+)), which gives the \w+ group the name word. To refer back to what this group captured, use \k<word>, so the previous example can be written as: \b(?<word>\w+)\b\s+\k<word>\b.
Named groups have another benefit: in a C# program, the value of a group can be retrieved by the name we gave it rather than by index, which makes the code much clearer.
When we do not need a backreference, the group does not have to remember anything. In that case, use the (?:nocapture) syntax to tell the regular expression engine not to treat the parenthesized content as a capturing group, which improves efficiency.
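The following sketch shows a named group, its backreference, and the effect of a non-capturing group from C#. The class and method names are hypothetical; Regex.GetGroupNumbers is used only to count how many capturing groups a pattern defines.

```csharp
using System.Text.RegularExpressions;

public static class RepeatedWordSketch
{
    // A word, whitespace, then the same word again via the named backreference.
    private static readonly Regex Repeated =
        new Regex(@"\b(?<word>\w+)\b\s+\k<word>\b");

    // Returns the first doubled word, read from the group by name, or "" if none.
    public static string FirstRepeatedWord(string text)
    {
        Match m = Repeated.Match(text);
        return m.Success ? m.Groups["word"].Value : string.Empty;
    }

    // Number of capturing groups in a pattern, not counting group 0 (the whole match).
    public static int CapturingGroupCount(string pattern) =>
        new Regex(pattern).GetGroupNumbers().Length - 1;
}
```

FirstRepeatedWord("this is is a test") returns "is". CapturingGroupCount(@"(go) \1") is 1, while CapturingGroupCount("(?:go) go") is 0 because (?:...) does not capture.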
Zero-width assertions
In the earlier introduction to metacharacters we saw characters that match the start or end of a string (^ and $) or the start or end of a word (\b). These metacharacters match only a position: they require the position to satisfy some condition rather than matching any characters, so they are called zero-width. Zero-width means that they match a position and consume no characters; assertion means a test. The regular expression continues matching only if the assertion holds.
Sometimes we need to match a position precisely that is not simply the edge of a string or a word, and then we have to write our own assertion. The syntax is as follows:

Assertion syntax   Description
(?=pattern)        Positive lookahead: matches a position that is followed by pattern
(?!pattern)        Negative lookahead: matches a position that is not followed by pattern
(?<=pattern)       Positive lookbehind: matches a position that is preceded by pattern
(?<!pattern)       Negative lookbehind: matches a position that is not preceded by pattern

Table 3. Assertion syntax and descriptions
Hard to understand? Let's look at an example.
Given the tag <book>, we want to obtain its tag name (book), and assertions can do this. Consider the expression (?<=\<)(?<tag>\w*)(?=\>). It matches the characters between < and >, which here is book. Assertions can also be used to write more complex expressions, but we will not give further examples here.
Note also that the parentheses used by assertion syntax do not form a capturing group, so their content cannot be referenced by number or name.
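The tag name example can be tried from C# as sketched below. The < and > are written unescaped here, which the .NET engine accepts; the class and method names are assumptions.

```csharp
using System.Text.RegularExpressions;

public static class TagNameSketch
{
    // Lookbehind for "<", a named group for the tag name, lookahead for ">".
    private static readonly Regex Tag = new Regex(@"(?<=<)(?<tag>\w*)(?=>)");

    // Returns the text between < and >, or "" when there is no such tag.
    public static string Extract(string markup)
    {
        Match m = Tag.Match(markup);
        return m.Success ? m.Groups["tag"].Value : string.Empty;
    }

    // Length of the overall match; the assertions themselves contribute nothing to it.
    public static int MatchLength(string markup) => Tag.Match(markup).Length;
}
```

Extract("<book>") returns "book", and MatchLength("<book>") is 4: the angle brackets are checked by the assertions but are not part of the match.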

Greedy and lazy matching

When a regular expression contains a quantifier that allows repetition, the default behavior is to match as many characters as possible (provided the whole expression can still match). Consider the expression a\w*b: applied to the string aabab, it matches aabab. This is called greedy matching.
Sometimes we want to match as few characters as possible instead, so that the example above yields aab. This calls for lazy matching, which is requested by adding a ? after the quantifier. The expression above becomes a\w*?b; applied to the string aabab, it matches aab and then ab.
You may now ask: ab contains fewer repetitions than aab, so why is ab not matched first? Regular expressions have a rule that takes priority over greedy/lazy: the match that begins earliest wins.
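The difference between the two modes can be observed by listing every match of each pattern. The helper below is a sketch; GreedyLazySketch and AllMatches are names invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyLazySketch
{
    // Collects the text of every match of 'pattern' in 'input', in order.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

AllMatches("aabab", @"a\w*b") yields the single match aabab, while AllMatches("aabab", @"a\w*?b") yields aab followed by ab: the lazy match still starts at the earliest possible position.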

Comments

Syntax: (?#comment)
For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)
Note: when using comments, take care not to leave stray spaces or line breaks around the parentheses, because they are part of the pattern. If you want such whitespace to be ignored, enable the "ignore whitespace in pattern" option, which in C# is the IgnorePatternWhitespace member of the RegexOptions enumeration (RegexOptions is described below).
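The sketch below puts the same field pattern into C# twice: once with inline (?#...) comments and once laid out over several lines with RegexOptions.IgnorePatternWhitespace, where # starts a comment that runs to the end of the line. The ^(?:...)$ wrapper and all names are assumptions added for the illustration.

```csharp
using System.Text.RegularExpressions;

public static class CommentedPatternSketch
{
    // Inline comments: no option needed, but no stray whitespace allowed either.
    private const string Inline =
        @"^(?:2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))$";

    // Free-spacing layout: whitespace is ignored and # begins an end-of-line comment.
    private const string Commented = @"^(?:
          2[0-4]\d      # 200-249
        | 25[0-5]       # 250-255
        | [01]?\d\d?    # 0-199
    )$";

    public static bool IsFieldInline(string s) => Regex.IsMatch(s, Inline);

    public static bool IsFieldCommented(string s) =>
        Regex.IsMatch(s, Commented, RegexOptions.IgnorePatternWhitespace);
}
```

Both methods accept "255" and reject "256"; the second form is usually easier to maintain once a pattern grows.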

Processing options in C#

In C#, the RegexOptions enumeration selects how a regular expression is processed. Its members are documented on MSDN; commonly used ones include IgnoreCase, Multiline, Singleline, ExplicitCapture, IgnorePatternWhitespace, and Compiled.
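As a rough illustration of how options change matching, the sketch below exercises three RegexOptions members: IgnoreCase, Multiline, and Singleline. The class and method names are hypothetical, and line breaks are assumed to be plain "\n".

```csharp
using System.Text.RegularExpressions;

public static class OptionsSketch
{
    // IgnoreCase: letters match regardless of case.
    public static bool EqualsIgnoringCase(string input, string word) =>
        Regex.IsMatch(input, "^" + Regex.Escape(word) + "$", RegexOptions.IgnoreCase);

    // Multiline: ^ and $ also match at the start and end of each line.
    public static int CountLinesMatching(string text, string linePattern) =>
        Regex.Matches(text, "^(?:" + linePattern + ")$", RegexOptions.Multiline).Count;

    // Singleline: the dot also matches a line feed.
    public static bool DotCrossesLines(string text, bool singleline) =>
        Regex.IsMatch(text, "^a.b$", singleline ? RegexOptions.Singleline : RegexOptions.None);
}
```

EqualsIgnoringCase("HELLO", "hello") is true, CountLinesMatching("a\nbb\nc", @"\w") counts the two one-character lines, and DotCrossesLines("a\nb", true) is true whereas the same call with false is not. Options can be combined with the | operator.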


The Capture, Group, and Match classes in C#

Capture class: represents the result of a single subexpression capture, that is, one successfully captured substring. The class has no public constructor; a collection of Capture objects can be obtained from the Group class or the Match class. Capture has three commonly used properties: Index, Length, and Value. Index is the position of the first character of the captured substring, Length is the length of the captured substring, and Value is the captured substring itself.
Group class: represents the information of one group in a regular expression and provides support for regular expressions that use groups. The class has no public constructor; a collection of Group objects can be obtained from the Match class. A named group can be accessed by its name, an unnamed group by its index. Note: element 0 of each Match's Groups collection (Groups[0]) is the string captured by the whole match, and is equal to the Value of the Match itself.
Match class: represents the result of a single regular expression match. This class has no public constructor either; an instance can be obtained from the Match() method of the Regex class, and a collection of instances from the Matches() method of the Regex class.
All three classes can represent the result of a single regular expression match, but Match carries the most detail, including the capture and group information, so it is the most commonly used of the three.
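The relationship between the three classes can be sketched as follows. The key=value pattern, the class MatchWalkSketch, and its methods are made up for the example; only the Regex, Match, Group, and Capture members come from the framework.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchWalkSketch
{
    // Walks every Match and reads Index, Length, Value and two named groups.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<key>\w+)=(?<value>\d+)"))
        {
            string key = m.Groups["key"].Value;
            string value = m.Groups["value"].Value;
            lines.Add("match '" + m.Value + "' at " + m.Index + ", length " + m.Length);
            lines.Add("  key=" + key + ", value=" + value);
        }
        return lines;
    }

    // A group inside a repetition records one Capture per iteration.
    public static int CaptureCount(string input) =>
        Regex.Match(input, @"(\w)+").Groups[1].Captures.Count;

    // Groups[0] holds the same text as the Match itself.
    public static bool GroupZeroIsWholeMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Groups[0].Value == m.Value;
    }
}
```

Describe("a=1, bb=22") reports two matches (a=1 at index 0 and bb=22 at index 5), and CaptureCount("abc") is 3 because the repeated group captured a, b, and c in turn.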

A collection of common C# regular expressions

Only digits can be entered: "^[0-9]*$".
Only n digits can be entered: "^\d{n}$".
At least n digits must be entered: "^\d{n,}$".
Only m to n digits can be entered: "^\d{m,n}$".
Only zero or a number that does not start with 0 can be entered: "^(0|[1-9][0-9]*)$".
Only positive real numbers with two decimal places can be entered: "^[0-9]+(\.[0-9]{2})?$".
Only positive real numbers with 1 to 3 decimal places can be entered: "^[0-9]+(\.[0-9]{1,3})?$".
Only a non-zero positive integer can be entered: "^\+?[1-9][0-9]*$".
Only a non-zero negative integer can be entered: "^\-[1-9][0-9]*$".
Only 3 characters can be entered: "^.{3}$".
Only a string of the 26 English letters can be entered: "^[A-Za-z]+$".
Only a string of the 26 uppercase English letters can be entered: "^[A-Z]+$".
Only a string of the 26 lowercase English letters can be entered: "^[a-z]+$".
Only a string of digits and the 26 English letters can be entered: "^[A-Za-z0-9]+$".
Only a string of digits, the 26 English letters, and underscores can be entered: "^\w+$".
Verify a user password: "^[a-zA-Z]\w{5,17}$". The correct format starts with a letter, is 6 to 18 characters long, and contains only letters, digits, and underscores.
Check whether the input contains characters such as ^%&',;=?$\": "[^%&',;=?$\x22]+".
Only Chinese characters can be entered: "^[\u4e00-\u9fa5]{0,}$".
Verify an email address: "^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$".
Verify an Internet URL: "^http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?$".
Verify a phone number: "^(\(\d{3,4}\)|\d{3,4}-)?\d{7,8}$". The correct formats are "XXX-XXXXXXX", "XXXX-XXXXXXXX", "XXX-XXXXXXXX", "XXXXXXX", and "XXXXXXXX".
Verify an ID number (15 or 18 digits): "^(\d{15}|\d{18})$".
Verify the 12 months of a year: "^(0?[1-9]|1[0-2])$". The correct formats are "01" to "09" and "1" to "12".
Verify the 31 days of a month: "^((0?[1-9])|((1|2)[0-9])|30|31)$". The correct formats are "01" to "09" and "1" to "31".
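Any of the patterns in this list can be passed straight to Regex.IsMatch. The sketch below wires up three of them (month, password, and non-zero positive integer); the class CommonPatternsSketch and its method names are assumptions for the example.

```csharp
using System.Text.RegularExpressions;

public static class CommonPatternsSketch
{
    // Months "01"-"09" and "1"-"12".
    public static bool IsMonth(string s) =>
        Regex.IsMatch(s, @"^(0?[1-9]|1[0-2])$");

    // A letter followed by 5 to 17 word characters (6 to 18 characters in total).
    public static bool IsPassword(string s) =>
        Regex.IsMatch(s, @"^[a-zA-Z]\w{5,17}$");

    // An optional plus sign, then digits that do not start with 0.
    public static bool IsNonZeroPositiveInteger(string s) =>
        Regex.IsMatch(s, @"^\+?[1-9][0-9]*$");
}
```

For example, IsMonth("07") and IsPassword("a12345") are true, while IsMonth("13") and IsNonZeroPositiveInteger("042") are false. Note that in .NET \w and \d also accept non-ASCII letters and digits unless RegexOptions.ECMAScript is specified.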
Regular expressions can also restrict what may be typed into the text boxes of a web page form:
Restrict input to Chinese characters only:
Restrict input to full-width characters only:
Restrict input to digits only:
Restrict input to digits and English letters only:
Extracting the file name from a URL with a regular expression in JavaScript; for the URL below the result is accpandsvse:
s = "http://hi.baidu.com/accpandsvse"
s = s.replace(/(.*\/){0,}([^\.]+).*/ig, "$2")
alert(s)
Matching double-byte characters (including Chinese characters): [^\x00-\xff]
Application: computing the length of a string, counting a double-byte character as 2 and an ASCII character as 1:
String.prototype.len = function () { return this.replace(/[^\x00-\xff]/g, "aa").length; }
Regular expression matching a blank line: \n[\s| ]*\r
Regular expression matching an HTML tag: /<(.*)>.*<\/\1>|<(.*) \/>/
Regular expression matching leading and trailing whitespace: (^\s*)|(\s*$)
String.prototype.trim = function ()
{
    return this.replace(/(^\s*)|(\s*$)/g, "");
}
Decomposing and converting an IP address with a regular expression:
The following JavaScript program uses a regular expression to match an IP address and converts the address to the corresponding numeric value:
function IP2V(ip)
{
    var re = /(\d+)\.(\d+)\.(\d+)\.(\d+)/; // regular expression matching an IP address
    if (re.test(ip))
    {
        return RegExp.$1*Math.pow(256,3) + RegExp.$2*Math.pow(256,2) + RegExp.$3*256 + RegExp.$4*1;
    }
    else
    {
        throw new Error("Not a valid IP address!");
    }
}
However, splitting the address directly with the split function, without a regular expression, may be simpler:
var ip = "10.100.20.168";
ip = ip.split(".");
alert("IP value is: " + (ip[0]*256*256*256 + ip[1]*256*256 + ip[2]*256 + ip[3]*1));
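For comparison, the same conversion can be sketched in C# with capture groups. This is not from the original article: the class and method names are hypothetical, the fields are weighted by powers of 256 (the usual convention for dotted-quad addresses, assumed here), and the range of each field is not checked.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpToNumberSketch
{
    // Four groups of 1-3 ASCII digits separated by dots, anchored at both ends.
    private static readonly Regex Quad =
        new Regex(@"^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$");

    public static long ToNumber(string ip)
    {
        Match m = Quad.Match(ip);
        if (!m.Success)
        {
            throw new ArgumentException("Not a valid IP address!", nameof(ip));
        }

        long value = 0;
        for (int i = 1; i <= 4; i++)
        {
            // Groups[1]..Groups[4] hold the four fields in order.
            value = value * 256 + long.Parse(m.Groups[i].Value);
        }
        return value;
    }
}
```

IpToNumberSketch.ToNumber("10.100.20.168") returns 174331048, and an input that does not consist of four dotted fields throws an ArgumentException.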
Symbol reference:
Character   Description
\ Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^ Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$ Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero times or once. For example, 'do(es)?' matches "do" and "does". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the "o" in "Bob" but matches the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
? When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
. Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern) Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern) Matches pattern but does not capture the match; that is, it is a non-capturing match that is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern) Positive lookahead: matches the search string at any position where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows(?=95|98|NT|2000)' matches "Windows" in "Windows2000" but not "Windows" in "Windows3.1". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern) Negative lookahead: matches the search string at any position where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows(?!95|98|NT|2000)' matches "Windows" in "Windows3.1" but not "Windows" in "Windows2000". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz] A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the "a" in "plain".
[^xyz] A negated character set. Matches any character that is not enclosed. For example, '[^abc]' matches the "p" in "plain".
[a-z] A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character from "a" to "z".
[^a-z] A negated character range. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range "a" to "z".
\b Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the "er" in "never" but not the "er" in "verb".
\B Matches a non-word boundary. 'er\B' matches the "er" in "verb" but not the "er" in "never".
\cx Matches the control character indicated by x. For example, \cM matches Control-M, a carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal "c" character.
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\f Matches a form feed character. Equivalent to \x0c and \cL.
\n Matches a newline character. Equivalent to \x0a and \cJ.
\r Matches a carriage return character. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab character. Equivalent to \x09 and \cI.
\v Matches a vertical tab character. Equivalent to \x0b and \cK.
\w Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num Matches num, where num is a positive integer: a backreference to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml If n is an octal digit (0-3) and both m and l are octal digits (0-7), matches the octal escape value nml.
\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
