Regular Expressions for C#

Last Update:2017-08-06 Source: Internet

Author: User

Tags expression engine


Many programming languages and tools now support regular expressions, and C# is no exception. The C# base class library contains a namespace (System.Text.RegularExpressions) and a set of classes (Regex, Match, Group, and so on) that bring the full power of regular expressions to C#. So what is a regular expression, and how do you define one?

One. Regular expression basics

What is a regular expression

When you write code that processes strings, you often need to find strings that match some complex rule. Regular expressions are the tool for describing such rules. In other words, a regular expression is code that records a text rule.

You have probably used wildcards (* and ?) to find files in Windows. To find all the Word documents in a directory, you search for *.doc, where * stands for any string. Like wildcards, regular expressions are a tool for matching text, but they describe what you need far more precisely than wildcards can. The price, of course, is greater complexity.

A simple example: verifying a phone number

The best way to learn regular expressions is to start with an example, so let's verify a phone number and work through the regular expression step by step.

In China, a telephone number (such as 0379-65624150) usually consists of a 3- or 4-digit area code starting with 0 and a 7- or 8-digit number, separated by a hyphen '-'. This example introduces our first metacharacter, \d, which matches a single digit from 0 to 9. The regular expression can be written as: ^0\d{2,3}-\d{7,8}$

Let's analyze it. ^ matches the start of the string, 0 matches the digit "0", \d matches one digit, and {2,3} repeats it 2 to 3 times. The - matches only "-" itself, the next \d again matches one digit, {7,8} repeats it 7 to 8 times, and $ matches the end of the string. The phone number can also be written as (0379) 65624150; handling that form is left to the reader.

Metacharacters

In the example above we met the metacharacter \d. As you might expect, regular expressions have many metacharacters like \d; the following table lists some common ones:

Metacharacter	Description
.	Matches any character except a line break
\b	Matches the start or end of a word
\d	Matches a digit
\s	Matches any whitespace character
\w	Matches a letter, digit, underscore, or Chinese character
^	Matches the start of the string
$	Matches the end of the string

Table 1. Commonly used metacharacters
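A few of these metacharacters can be tried out with the static Regex.IsMatch method. The following is only a sketch: the class name, method names, and sample strings are made up for illustration and are not part of the original article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class MetacharacterDemo
{
    // \b...\b: true when the input contains "hi" as a whole word.
    public static bool ContainsWordHi(string input)
    {
        return Regex.IsMatch(input, @"\bhi\b");
    }

    // ^ and $: true when the whole input is exactly five digits.
    public static bool IsFiveDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d\d\d\d\d$");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsWordHi("say hi to him")); // True
        Console.WriteLine(ContainsWordHi("this is high"));  // False: "hi" is never a whole word
        Console.WriteLine(IsFiveDigits("12345"));           // True
        Console.WriteLine(IsFiveDigits("1234a"));           // False
    }
}
```

The patterns are written as verbatim strings (@"...") so that the backslashes reach the regular expression engine unchanged.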

Escape characters

If you want to find a metacharacter itself, such as . or *, there is a problem: you cannot write it directly, because it will be interpreted as something else. You have to use \ to cancel the special meaning of these characters, so you write \. and \*. To find \ itself, you likewise write \\.

For example, unibetter\.com matches unibetter.com, and c:\\windows matches c:\windows.
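The two escaped patterns above can be checked with a short sketch like the one below. The class and method names are hypothetical. Regex.Escape is a real static method of the Regex class that adds the backslashes for you; it is shown here as an extra that the article itself does not mention.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class EscapeDemo
{
    // \. matches a literal dot instead of "any character".
    public static bool MatchesDomain(string input)
    {
        return Regex.IsMatch(input, @"unibetter\.com");
    }

    // \\ matches a single literal backslash.
    public static bool MatchesPath(string input)
    {
        return Regex.IsMatch(input, @"c:\\windows");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesDomain("unibetter.com")); // True
        Console.WriteLine(MatchesDomain("unibetterXcom")); // False: X is not a literal dot
        Console.WriteLine(MatchesPath(@"c:\windows"));     // True
        Console.WriteLine(Regex.Escape("1+1=2"));          // 1\+1=2
    }
}
```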

Qualifiers

A qualifier, also called a repetition specifier, indicates how many times the preceding character occurs. For example, the {2,3} we used for the area code of the phone number means 2 to 3 occurrences. The commonly used qualifiers are:

Qualifier	Description
*	Repeat 0 or more times
+	Repeat 1 or more times
?	Repeat 0 or 1 times
{n}	Repeat n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times

Table 2. Commonly used qualifiers
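Each qualifier in the table can be exercised with a small helper. This is a sketch with made-up names; the helper wraps the pattern in ^(?:...)$ so that the whole input has to match, which is an assumption of the example rather than something the table requires.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class QualifierDemo
{
    // True when the entire input matches the pattern.
    public static bool MatchesWhole(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesWhole(@"ab*", "a"));         // True:  b repeated 0 times
        Console.WriteLine(MatchesWhole(@"ab+", "a"));         // False: + needs at least one b
        Console.WriteLine(MatchesWhole(@"ab?", "abb"));       // False: ? allows at most one b
        Console.WriteLine(MatchesWhole(@"\d{3}", "123"));     // True:  exactly 3 digits
        Console.WriteLine(MatchesWhole(@"\d{2,}", "1"));      // False: fewer than 2 digits
        Console.WriteLine(MatchesWhole(@"\d{2,4}", "12345")); // False: more than 4 digits
    }
}
```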

Two. Regular expression support in .NET

The System.Text.RegularExpressions namespace contains classes that provide access to the .NET Framework regular expression engine. Its regular expression functionality can be used from any platform or language that runs on the Microsoft .NET Framework.

1. Using regular expressions in C#

Now that we know which classes in C# support regular expressions, let's put the phone number expression from above into C# code to verify a phone number.

Step 1: create a Windows project named SimpleCheckPhoneNumber.

Step 2: import the System.Text.RegularExpressions namespace.

Step 3: write the regular expression. It is the phone number expression from above, but because that expression only verifies numbers that join the area code and the number with a hyphen, we modify it: 0\d{2,3}-\d{7,8}|\(0\d{2,3}\) \d{7,8}. In this expression, the part before | is the one we wrote above, and the part after it verifies telephone numbers of the form (0379) 65624150. Because ( and ) are also metacharacters, they must be escaped. The | denotes a branch (alternation): the expression matches either the part before it or the part after it.

Step 4: construct a Regex object from the regular expression.

Step 5: use the IsMatch method of the Regex class to validate the input. IsMatch() returns a bool: true if there is a match, otherwise false.
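Steps 2 to 5 might look like the console sketch below (a console program stands in for the Windows project). The class and method names are hypothetical. The sketch also makes two assumptions of its own: it wraps the expression in ^(?:...)$ so that the entire input must be a phone number rather than merely contain one, and it makes the space after the closing parenthesis optional.

```csharp
using System;
using System.Text.RegularExpressions; // Step 2: import the namespace

// Hypothetical class name; not taken from the original project.
public static class PhoneNumberCheck
{
    // Steps 3 and 4: write the expression and construct a Regex from it.
    // Assumptions: ^(?:...)$ anchors and the optional space " ?" are additions.
    private static readonly Regex PhonePattern =
        new Regex(@"^(?:0\d{2,3}-\d{7,8}|\(0\d{2,3}\) ?\d{7,8})$");

    // Step 5: IsMatch returns true when the input matches.
    public static bool IsPhoneNumber(string input)
    {
        return PhonePattern.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsPhoneNumber("0379-65624150"));   // True
        Console.WriteLine(IsPhoneNumber("(0379) 65624150")); // True
        Console.WriteLine(IsPhoneNumber("1379-65624150"));   // False: area code must start with 0
        Console.WriteLine(IsPhoneNumber("0379 65624150"));   // False: neither form
    }
}
```

Keeping the Regex in a static readonly field means the pattern is parsed once and reused for every call.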

Three. Advanced regular expressions

Grouping

So far we have only repeated individual characters, as in the phone number. Now let's use grouping to match an IP address.

An IP address is written as four dotted decimal fields, so we can match it by grouping. First, match a single field: 2[0-4]\d|25[0-5]|[01]?\d\d?. Here 2[0-4]\d matches a three-digit field that starts with 2, has a tens digit from 0 to 4, and ends in any digit; 25[0-5] matches a three-digit field that starts with 25 and ends in a digit from 0 to 5; and [01]?\d\d? matches any field with an optional leading 0 or 1 followed by one or two digits. The ? means zero or one occurrence, so [01] and the last \d may be absent.

To match a field together with its trailing dot, we append \. to it. Because | has the lowest precedence, the alternatives need their own parentheses first; otherwise the \. would attach only to the last alternative. The grouping is therefore ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.).

Next we put the group to use: repeat it three times, then match one more field with (2[0-4]\d|25[0-5]|[01]?\d\d?). The complete regular expression is: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
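The grouping idea can be sketched in C# as follows. The names are hypothetical, and the sketch makes some choices of its own: the three alternatives for one field are wrapped in their own parentheses so that the | cannot leak outside the field, the groups are written as non-capturing (?:...) because nothing needs to be captured, and ^ and $ are added so that the whole input must be an address.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class IpAddressDemo
{
    // One field, 0-255. The surrounding (?:...) keeps the alternation contained.
    private const string Field = @"(?:2[0-4]\d|25[0-5]|[01]?\d\d?)";

    // Three "field plus dot" groups followed by a final field.
    private static readonly Regex IpPattern =
        new Regex("^(?:" + Field + @"\.){3}" + Field + "$");

    public static bool IsIpAddress(string input)
    {
        return IpPattern.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsIpAddress("192.168.0.1"));     // True
        Console.WriteLine(IsIpAddress("255.255.255.255")); // True
        Console.WriteLine(IsIpAddress("256.1.1.1"));       // False: 256 is out of range
        Console.WriteLine(IsIpAddress("1.2.3"));           // False: only three fields
    }
}
```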

Back references

Once we understand grouping, we can use back references. A back reference reuses the text captured by an earlier group to match the text that follows, and it is mostly used to match repeated text. For example, to match the repeated word in "go go", we can use (go) \1.

By default, every group automatically gets a group number. The rule is: scanning from left to right and counting each group's left parenthesis, the first group is number 1, the second is number 2, and so on. You can also specify a group name for a subexpression yourself, with this syntax: (?<word>\w+) (the angle brackets may be replaced with single quotes: (?'word'\w+)). This names the \w+ group "word". To refer back to what this group captured, use \k<word>. A more general version of the previous example, one that matches any repeated word, can then be written as: \b(?<word>\w+)\b\s+\k<word>\b.

Named groups have another benefit: when a C# program needs the value of a group, it can use the name we defined instead of a numeric index, which is clearer.
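A short sketch of both forms, the numbered back reference \1 and the named back reference \k<word>, with the captured value read through Groups["word"]. The class name, method name, and sample sentences are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class BackReferenceDemo
{
    // Returns the first word that is immediately repeated, or "" if there is none.
    public static string FirstRepeatedWord(string input)
    {
        Match m = Regex.Match(input, @"\b(?<word>\w+)\b\s+\k<word>\b");
        // The group is read by its name instead of a numeric index.
        return m.Success ? m.Groups["word"].Value : "";
    }

    public static void Main()
    {
        // Numbered back reference: \1 repeats whatever group 1 captured.
        Console.WriteLine(Regex.IsMatch("go go", @"(go) \1"));      // True
        Console.WriteLine(FirstRepeatedWord("this is is a test"));  // is
        Console.WriteLine(FirstRepeatedWord("nothing doubled") == ""); // True
    }
}
```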

When we do not need a back reference, the group does not have to remember anything. In that case, the (?:nocapture) syntax tells the regular expression engine not to treat the contents of the parentheses as a capturing group, which improves efficiency.
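The difference is visible in the Groups collection of a match. In this sketch (hypothetical names, made-up sample input), Groups.Count includes group 0, which always holds the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class NonCapturingDemo
{
    // Number of groups in the first match, including group 0 (the whole match).
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        Console.WriteLine(GroupCount("abab", @"(ab)+"));   // 2: group 0 plus capturing group 1
        Console.WriteLine(GroupCount("abab", @"(?:ab)+")); // 1: only group 0
        // Both patterns match the same text.
        Console.WriteLine(Regex.Match("abab", @"(?:ab)+").Value); // abab
    }
}
```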

Zero-width assertions

In the earlier introduction to metacharacters, we saw a class of metacharacters that match the start or end of the string (^ and $) or the start or end of a word (\b). These metacharacters match only a position: they specify that the position satisfies certain conditions rather than matching any characters, so they are called zero-width assertions. "Zero-width" means that they consume no characters and match only a position; "assertion" means a test. In a regular expression, matching continues only if the assertion is true.

Sometimes we need to match a position precisely, and it is not simply the boundary of a string or a word. We then have to write our own assertion. The syntax is as follows:

Assertion syntax	Description
(?=pattern)	Positive lookahead: matches a position that is followed by pattern
(?!pattern)	Negative lookahead: matches a position that is not followed by pattern
(?<=pattern)	Positive lookbehind: matches a position that is preceded by pattern
(?<!pattern)	Negative lookbehind: matches a position that is not preceded by pattern

Table 3. Assertion syntax and descriptions

Is this hard to understand? Let's look at an example.

Given the tag <book>, we want to get its tag name (book), and an assertion can do that. Look at this expression: (?<=\<)(?<tag>\w*)(?=\>). It matches the characters between < and >, which here is book. Assertions can be used to write much more complex expressions, but we will not give further examples here.

It is also important to note that the parentheses used by the assertion syntax do not form a capturing group, so you cannot refer to them by number or name.
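The tag example can be sketched as below. The names are hypothetical, and the sketch writes < and > without backslashes, since neither needs escaping in this position. Note that the match value contains only the tag name: the assertions test for the brackets without consuming them.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class AssertionDemo
{
    // Returns the name inside the first <...> tag, or "" if none is found.
    public static string TagName(string input)
    {
        // (?<=<) requires a "<" just before the match; (?=>) requires a ">" just after it.
        Match m = Regex.Match(input, @"(?<=<)(?<tag>\w*)(?=>)");
        return m.Success ? m.Groups["tag"].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(TagName("<book>")); // book

        // Negative lookahead: "q" that is not followed by "u".
        Console.WriteLine(Regex.IsMatch("Iraq", @"q(?!u)"));  // True
        Console.WriteLine(Regex.IsMatch("quit", @"q(?!u)"));  // False
    }
}
```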

Greedy and lazy matching

When a regular expression contains a qualifier that allows repetition, the default behavior is to match as many characters as possible while still letting the whole expression match. Take the expression a\w*b: applied to the string aabab, it matches aabab. This is called greedy matching.

Sometimes we want to match as few characters as possible, so that the match in the example above is aab. This calls for lazy matching, which is written by adding a ? after the repetition qualifier. The expression above becomes a\w*?b, and applied to the string aabab it produces the matches aab and ab.

You may ask: ab repeats fewer times than aab, so why isn't ab matched first? Regular expressions have a rule that takes precedence over greedy/lazy: the match that begins earliest wins.
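The two results described above can be reproduced with Regex.Matches, which returns every match in the input. The helper and class names in this sketch are made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class GreedyLazyDemo
{
    // Collects the text of every match of the pattern in the input.
    public static string[] AllMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Greedy: \w* takes as much as it can.
        Console.WriteLine(string.Join(", ", AllMatches("aabab", @"a\w*b")));  // aabab
        // Lazy: \w*? takes as little as it can, so two shorter matches are found.
        Console.WriteLine(string.Join(", ", AllMatches("aabab", @"a\w*?b"))); // aab, ab
    }
}
```

The first lazy match is aab rather than ab because the search starts at the first a, and the earliest starting position wins.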

Comments

Syntax: (?#comment)

Example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)

Note: when you use comments, be careful not to put spaces, line breaks, or similar characters before the comment's opening parenthesis, because they would be treated as part of the pattern. If you want such characters to be ignored, it is best to enable the "ignore whitespace in the pattern" option, which is the IgnorePatternWhitespace member of the RegexOptions enumeration in C# (the RegexOptions enumeration is discussed below).
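Both styles are sketched below with hypothetical names: the inline (?#comment) form, and a multi-line pattern compiled with RegexOptions.IgnorePatternWhitespace, where unescaped whitespace is ignored and # starts a comment that runs to the end of the line. The ^(?:...)$ wrapper is an addition so that the whole input must be one field.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class CommentDemo
{
    // Inline comments: no whitespace is allowed to sneak into the pattern.
    private static readonly Regex Inline = new Regex(
        @"^(?:2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))$");

    // With IgnorePatternWhitespace the pattern can be laid out freely.
    private static readonly Regex Spaced = new Regex(
        @"^(?:
              2[0-4]\d      # 200-249
            | 25[0-5]       # 250-255
            | [01]?\d\d?    # 0-199
          )$",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsFieldInline(string input) { return Inline.IsMatch(input); }
    public static bool IsFieldSpaced(string input) { return Spaced.IsMatch(input); }

    public static void Main()
    {
        Console.WriteLine(IsFieldInline("249")); // True
        Console.WriteLine(IsFieldSpaced("255")); // True
        Console.WriteLine(IsFieldSpaced("256")); // False
    }
}
```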

Processing options in C#

In C#, you can use the RegexOptions enumeration to choose how regular expressions are handled. Its members are described in the MSDN documentation for the RegexOptions enumeration.
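As an illustration, the sketch below tries three members of the enumeration: IgnoreCase, Multiline, and Singleline. The class name, method name, and sample strings are made up; options are passed as the last argument and can be combined with the | operator.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class OptionsDemo
{
    // Counts the lines of the text that consist of a single word character.
    public static int SingleCharLines(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w$", options).Count;
    }

    public static void Main()
    {
        // IgnoreCase: letters match regardless of case.
        Console.WriteLine(Regex.IsMatch("HELLO", "hello"));                          // False
        Console.WriteLine(Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase)); // True

        // Multiline: ^ and $ match at the start and end of every line.
        Console.WriteLine(SingleCharLines("a\nb", RegexOptions.None));      // 0
        Console.WriteLine(SingleCharLines("a\nb", RegexOptions.Multiline)); // 2

        // Singleline: . also matches a line break.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                          // False
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline)); // True
    }
}
```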

The Capture, Group, and Match classes in C#

Capture class: represents the result of a single subexpression capture, that is, one substring from a single successful capture. The class has no public constructor; you obtain a collection of Capture objects from the Group class or the Match class. Capture has three commonly used properties: Index, Length, and Value. Index is the position of the first character of the captured substring, Length is the length of the captured substring, and Value is the captured substring itself.

Group class: represents the information of one group in a regular expression and provides support for expressions that use grouping. The class has no public constructor; you obtain a collection of Group objects from the Match class. A named group can be accessed by its name, and an unnamed group is accessed by its index. Note: element 0 (Groups[0]) in each Match's Groups collection is the entire string captured by that match, which is also the match's Value.

Match class: represents the result of a single regular expression match. This class has no public constructor either; you get an instance from the Match() method of the Regex class, or a collection of instances from the Matches() method of the Regex class.

All three classes can represent the result of a single regular expression match, but the Match class carries the most detail, including the capture and group information. It is therefore the most commonly used of the three.
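The relationship between the three classes can be seen in a sketch like the following. The class name, method name, group names (area, number), and sample text are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class MatchGroupCaptureDemo
{
    // Finds the first hyphenated phone number in the text.
    public static Match FindPhone(string text)
    {
        return Regex.Match(text, @"(?<area>0\d{2,3})-(?<number>\d{7,8})");
    }

    public static void Main()
    {
        Match m = FindPhone("call 0379-65624150 now");

        // Match: the whole result, with the Index, Length, and Value properties.
        Console.WriteLine(m.Value);   // 0379-65624150
        Console.WriteLine(m.Index);   // 5
        Console.WriteLine(m.Length);  // 13

        // Group: Groups[0] is the whole match; named groups are read by name.
        Console.WriteLine(m.Groups[0].Value == m.Value); // True
        Console.WriteLine(m.Groups["area"].Value);       // 0379
        Console.WriteLine(m.Groups["number"].Value);     // 65624150

        // Capture: a repeated group keeps one Capture per repetition,
        // while the group's Value is only the last one.
        Group g = Regex.Match("abc", @"(\w)+").Groups[1];
        Console.WriteLine(g.Value);          // c
        Console.WriteLine(g.Captures.Count); // 3
    }
}
```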


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
