A Tutorial on Using Regular Expressions in C#

Source: Internet
Author: User

Many programming languages and tools now support regular expressions, and C# is no exception. The C# base class library contains a namespace (System.Text.RegularExpressions) and a set of classes (Regex, Match, Group, and so on) that give full access to the power of regular expressions. So what is a regular expression, and how do you define one?

I. Regular expression basics

What is a regular expression

When writing code that processes strings, we often need to find strings that follow some complex rule. Regular expressions are the tool for describing such rules. In other words, a regular expression is code that records a text rule.

You have probably used the wildcard characters (* and ?) when searching for files in Windows. To find all the Word documents in a directory, you can search for *.doc, where * stands for any string. Like wildcards, regular expressions are a tool for matching text, but they describe what you want far more precisely. The price, of course, is that they are more complex.

A simple example: validating a phone number

The best way to learn regular expressions is to start from an example. We will begin by validating a phone number and build up an understanding of regular expressions step by step.

In China, landline numbers (for example, 0379-65624150) usually consist of a 3- or 4-digit area code that starts with 0, followed by a 7- or 8-digit number, with the two parts usually separated by a hyphen '-'. For this example we first introduce the metacharacter \d, which matches a digit from 0 to 9. The regular expression can be written as: ^0\d{2,3}-\d{7,8}$

Let's analyze it. ^ and $ anchor the match to the start and end of the string, 0 matches the digit "0", \d matches a digit, {2,3} means repeat 2 to 3 times, - matches only "-" itself, the next \d again matches a digit, and {7,8} means repeat 7 to 8 times. Of course, the phone number can also be written as (0379)65624150; handling that form is left to the reader for now.
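As a minimal sketch, the pattern above can be tried out with the static Regex.IsMatch method from System.Text.RegularExpressions. The class name PhonePatternDemo and the helper LooksLikePhone are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhonePatternDemo
{
    // The pattern from the text. The verbatim string (@"...") keeps the
    // backslashes literal, so \d reaches the regex engine unchanged.
    private const string Pattern = @"^0\d{2,3}-\d{7,8}$";

    // Hypothetical helper: true when the whole input has the hyphenated form.
    public static bool LooksLikePhone(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikePhone("0379-65624150")); // True
        Console.WriteLine(LooksLikePhone("379-65624150"));  // False: no leading 0
        Console.WriteLine(LooksLikePhone("0379-656"));      // False: number too short
    }
}
```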

Metacharacters

In the example above we met the metacharacter \d. As you might expect, regular expressions have many metacharacters like \d. The following table lists some of the common ones:

Metacharacter    Description
.                Matches any character except a newline
\b               Matches the start or end of a word
\d               Matches a digit
\s               Matches any whitespace character
\w               Matches a letter, digit, underscore, or Chinese character
^                Matches the start of the string
$                Matches the end of the string

Table 1. Commonly used metacharacters
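The metacharacters in Table 1 can be explored with a small sketch like the following. The sample sentence and the helper CountMatches are made up for illustration; Regex.Matches returns every non-overlapping match of a pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    // Hypothetical helper: how many times does the pattern match in the input?
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string text = "Room 42 is open";

        Console.WriteLine(CountMatches(text, @"\d"));     // 2: the digits 4 and 2
        Console.WriteLine(CountMatches(text, @"\s"));     // 3: the three spaces
        Console.WriteLine(CountMatches(text, @"\w+"));    // 4: Room, 42, is, open
        Console.WriteLine(CountMatches(text, @"\bis\b")); // 1: the whole word "is"
        Console.WriteLine(CountMatches(text, @"^Room"));  // 1: "Room" at the start
        Console.WriteLine(CountMatches(text, @"open$"));  // 1: "open" at the end
    }
}
```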

Escape characters

If you want to find the metacharacters themselves, such as . or *, there is a problem: you cannot write them directly, because they will be interpreted as something else. You have to use \ to remove their special meaning, so you should write \. and \*. Likewise, to find \ itself, you have to write \\.

For example: unibetter\.com matches unibetter.com, and c:\\windows matches c:\windows.
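In C# source code it helps to write patterns as verbatim strings (@"..."), so that a regex backslash does not also need escaping for the C# compiler. The sketch below assumes that convention; it also uses Regex.Escape, which adds the backslashes for you when a piece of text should be matched literally. The helper name ContainsLiteral is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: search for 'literal' as plain text, not as a pattern.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    public static void Main()
    {
        // \. matches only a real dot.
        Console.WriteLine(Regex.IsMatch("unibetter.com", @"unibetter\.com")); // True
        Console.WriteLine(Regex.IsMatch("unibetterXcom", @"unibetter\.com")); // False

        // \\ in the pattern matches a single backslash in the input.
        Console.WriteLine(Regex.IsMatch(@"c:\windows", @"c:\\windows"));      // True

        // Regex.Escape produces the escaped form of a literal string.
        Console.WriteLine(Regex.Escape("1.5*2"));                             // 1\.5\*2
        Console.WriteLine(ContainsLiteral("price is 1.5*2", "1.5*2"));        // True
    }
}
```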

Quantifiers

A quantifier, also called a repetition specifier, states how many times the preceding element may occur. For example, the {2,3} we used when matching the area code of a phone number means 2 to 3 occurrences. The most commonly used quantifiers are:

Quantifier    Description
*             Repeat zero or more times
+             Repeat one or more times
?             Repeat zero times or once
{n}           Repeat exactly n times
{n,}          Repeat n or more times
{n,m}         Repeat n to m times

Table 2. Commonly used quantifiers
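Each row of Table 2 can be checked with a short sketch. The helper FullMatch is a hypothetical name; it wraps the pattern in ^(?: ... )$ so that the whole input, not just part of it, has to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: true only if the entire input matches the pattern.
    public static bool FullMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        Console.WriteLine(FullMatch("a", "ab*"));       // True:  * allows zero b's
        Console.WriteLine(FullMatch("a", "ab+"));       // False: + needs at least one b
        Console.WriteLine(FullMatch("abb", "ab?"));     // False: ? allows at most one b
        Console.WriteLine(FullMatch("aaa", "a{3}"));    // True:  exactly three
        Console.WriteLine(FullMatch("aaaa", "a{2,}"));  // True:  two or more
        Console.WriteLine(FullMatch("aaaa", "a{2,3}")); // False: more than three
    }
}
```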

II. Regular expression support in .NET

The System.Text.RegularExpressions namespace contains the classes that provide access to the .NET Framework regular expression engine. The functionality in this namespace can be used from any platform or language that runs on the Microsoft .NET Framework.

1. Using regular expressions in C#

Now that we know which C# classes support regular expressions, let's put the phone number pattern from above into C# code and use it to validate phone numbers.

Step 1: create a Windows project named SimpleCheckPhoneNumber.

Step 2: import the System.Text.RegularExpressions namespace.

Step 3: write the regular expression. We start from the validation pattern above. Because that pattern only accepts numbers whose area code and number are joined by a hyphen, we modify it: 0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8}. The part before | is the pattern discussed above, and the part after it validates numbers written like (0379)65624150. Because ( and ) are themselves metacharacters, they must be escaped. The | denotes alternation: the expression matches either the part before it or the part after it.

Step 4: construct a Regex object from the regular expression.

Step 5: use the IsMatch method of the Regex class to test for a match. IsMatch() returns a bool: true if there is a match, false otherwise. Note that the modified pattern has no ^ and $ anchors, so IsMatch also returns true when a phone number appears anywhere inside a longer string. If the entire input must be a phone number, wrap the alternation as ^(?: ... )$.
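The five steps can be sketched as a small console program rather than the Windows project the text describes. The method name IsValid is hypothetical, and the ^(?: ... )$ wrapper is an addition made here, on the assumption that the whole input should be a phone number.

```csharp
// Step 2: import the namespace.
using System;
using System.Text.RegularExpressions;

public static class SimpleCheckPhoneNumber
{
    // Steps 3 and 4: write the pattern and construct a Regex from it.
    // The ^(?: ... )$ wrapper is an assumption added for whole-string validation.
    private static readonly Regex PhoneRegex =
        new Regex(@"^(?:0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8})$");

    // Step 5: IsMatch returns true when the input matches the pattern.
    public static bool IsValid(string input)
    {
        return PhoneRegex.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("0379-65624150"));     // True
        Console.WriteLine(IsValid("(0379)65624150"));    // True
        Console.WriteLine(IsValid("(0379-65624150"));    // False: mixed forms
        Console.WriteLine(IsValid("call 0379-65624150")); // False: extra text
    }
}
```

A Regex instance is created once and reused here; for a one-off check, the static Regex.IsMatch(input, pattern) overload works as well.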

III. Advanced regular expressions

Grouping

When matching the phone number, we only repeated single characters. Now let's learn how to use grouping by matching an IP address.

As we all know, an IP address is written as a dotted-decimal string with four segments, so we can match it by grouping. First, we match a single segment: 2[0-4]\d|25[0-5]|[01]?\d\d?. Here 2[0-4]\d matches the three-digit values 200 to 249: a hundreds digit of 2, a tens digit from 0 to 4, and any units digit. 25[0-5] matches the three-digit values 250 to 255. [01]?\d\d? matches 0 to 199: an optional leading 0 or 1 followed by one or two digits. The ? means zero or one occurrence, so [01] and the final \d may both be absent.

Each of the first three segments is followed by a dot, which we match by adding \. after the segment. Because | has the lowest precedence of any regular expression operator, the segment alternatives need their own parentheses; otherwise the \. would attach only to the last alternative. The group is therefore written ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.). We repeat this group three times and then match the final segment, again in parentheses. The complete regular expression is: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
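A minimal sketch of the IP address match in C# follows. The helper name IsIPv4 is hypothetical, and the ^ and $ anchors are an addition, assuming the whole input should be an address. The segment pattern is written once and reused, with its own parentheses so that the \. and the {3} apply to the whole segment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpAddressDemo
{
    // One segment, 0 to 255, parenthesized so alternation stays inside it.
    private const string Segment = @"(2[0-4]\d|25[0-5]|[01]?\d\d?)";

    // Three "segment + dot" groups followed by a final segment.
    private static readonly Regex IpRegex =
        new Regex("^(" + Segment + @"\.){3}" + Segment + "$");

    // Hypothetical helper: true when the input is a dotted-decimal address.
    public static bool IsIPv4(string input)
    {
        return IpRegex.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsIPv4("192.168.1.1"));     // True
        Console.WriteLine(IsIPv4("255.255.255.255")); // True
        Console.WriteLine(IsIPv4("256.1.1.1"));       // False: 256 is out of range
        Console.WriteLine(IsIPv4("1.2.3"));           // False: only three segments
    }
}
```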

Backreferences

Once we understand grouping, we can use backreferences. A backreference uses the text captured by an earlier group to match the characters that follow, which makes it useful for matching repeated text. For example, to match the repeated text "go go", we can use (go) \1, where \1 refers back to whatever the first group captured.

By default, each group automatically receives a group number. Numbers are assigned from left to right in the order of the groups' opening parentheses: the first group is 1, the second is 2, and so on. You can also give a subexpression a group name of your own. To do so, use the syntax (?<word>\w+) (the angle brackets may be replaced with single quotes: (?'word'\w+)), which names the \w+ group "word". To refer back to the content captured by this group, use \k<word>. A generalized version of the previous example, matching any repeated word, can therefore be written: \b(?<word>\w+)\b\s+\k<word>\b.

Custom group names have another benefit: when a C# program needs the value of a group, it can fetch it by the name we defined, which is much clearer than using a numeric index.
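Here is a short sketch of both ideas: a numbered backreference, and a named group read back from C# through Match.Groups. The helper FirstRepeatedWord is a hypothetical name, and the sample sentences are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Hypothetical helper: returns the first word that is immediately
    // repeated, or an empty string if there is none.
    public static string FirstRepeatedWord(string input)
    {
        Match m = Regex.Match(input, @"\b(?<word>\w+)\b\s+\k<word>\b");
        // The named group is read by name instead of by index.
        return m.Success ? m.Groups["word"].Value : string.Empty;
    }

    public static void Main()
    {
        // Numbered backreference: \1 repeats what group 1 captured.
        Console.WriteLine(Regex.IsMatch("go go", @"(go) \1"));   // True

        // Named backreference: \k<word> repeats the "word" group.
        Console.WriteLine(FirstRepeatedWord("let us go go now")); // go
        Console.WriteLine(FirstRepeatedWord("no repeats here").Length); // 0
    }
}
```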

When we do not need a backreference, there is no reason for a group to remember what it captured. In that case, the (?:nocapture) syntax tells the regular expression engine not to treat the contents of the parentheses as a capturing group, which improves efficiency.
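The effect of (?: ... ) is visible in the Groups collection of a match, as this small sketch shows. The helper GroupCount and the sample pattern are made up for illustration; Groups always includes group 0, the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonCapturingDemo
{
    // Hypothetical helper: number of groups in the match, including group 0.
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        // Two capturing groups plus group 0.
        Console.WriteLine(GroupCount("ababc", @"(ab)+(c)"));   // 3

        // (?:ab) still groups for the + quantifier but captures nothing.
        Console.WriteLine(GroupCount("ababc", @"(?:ab)+(c)")); // 2

        // The remaining capturing group is now number 1.
        Console.WriteLine(Regex.Match("ababc", @"(?:ab)+(c)").Groups[1].Value); // c
    }
}
```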

Zero-width assertions

From the earlier introduction to metacharacters, we already know a class of characters that match the start and end of the string (^ and $) or the start and end of a word (\b). These metacharacters match only a position: they require the position to satisfy some condition rather than matching any characters, and so they are called zero-width assertions. "Zero-width" means that they match a position without consuming any characters; "assertion" means a test. The regular expression continues matching only if the assertion is true.

In some cases we need to match a position precisely, and not just the start or end of the string or of a word. This requires writing our own assertion. The assertion syntax is as follows:

Assertion syntax    Description
(?=pattern)         Positive lookahead: matches a position that is immediately followed by pattern
(?!pattern)         Negative lookahead: matches a position that is not immediately followed by pattern
(?<=pattern)        Positive lookbehind: matches a position that is immediately preceded by pattern
(?<!pattern)        Negative lookbehind: matches a position that is not immediately preceded by pattern

Table 3. Assertion syntax and descriptions

Is it hard to understand? Let's take a look at an example.

Suppose there is a tag, <book>, and we want to get its tag name (book). Assertions can handle this. Look at the following expression: (?<=\<)(?<tag>\w*)(?=\>). It matches the characters between < and >, which here is book. Assertions can also be used to write more complex expressions, but we will not give further examples here.

Note also that the parentheses used in assertion syntax do not form a capturing group, so what they match cannot be referenced by number or by name.
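The tag example can be sketched in C# as follows. The helper TagName is a hypothetical name. The sketch writes < and > without a backslash, since they are not metacharacters on their own; the second part, with made-up sample text, shows a negative lookahead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: returns the name between < and >, or "" if none.
    public static string TagName(string input)
    {
        // The lookbehind and lookahead test positions but consume nothing,
        // so the match itself is only the tag name.
        Match m = Regex.Match(input, @"(?<=<)(?<tag>\w*)(?=>)");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(TagName("<book>")); // book

        // Negative lookahead: whole numbers that are not followed by %.
        MatchCollection amounts = Regex.Matches("100 dollars, 50% off", @"\b\d+\b(?!%)");
        Console.WriteLine(amounts.Count);     // 1
        Console.WriteLine(amounts[0].Value);  // 100
    }
}
```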

Greed and laziness

When a regular expression contains a quantifier that allows repetition, the default behavior is to match as many characters as possible (provided the expression as a whole can still match). Consider the expression a\w*b: when it is applied to the string aabab, the resulting match is aabab. This is called greedy matching.

Sometimes we want to match as few characters as possible instead; in the example above, we want the match to be aab. This calls for lazy matching, which is requested by adding a ? after the quantifier. The expression above becomes a\w*?b, and when we apply it to the string aabab, the resulting matches are aab and ab.

You may now be wondering: ab involves fewer repetitions than aab, so why isn't ab matched first? Regular expressions have a rule that takes precedence over greedy and lazy matching: the match that begins earliest wins.
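The difference is easy to observe with a short sketch. The helper AllMatches is a hypothetical name; it joins every match of a pattern with commas.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Hypothetical helper: all matches of the pattern, joined with commas.
    public static string AllMatches(string input, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return string.Join(",", values);
    }

    public static void Main()
    {
        // Greedy: \w* takes as much as it can.
        Console.WriteLine(AllMatches("aabab", @"a\w*b"));  // aabab

        // Lazy: \w*? takes as little as it can, but the first match
        // still starts at the earliest possible position.
        Console.WriteLine(AllMatches("aabab", @"a\w*?b")); // aab,ab
    }
}
```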

Comments

Syntax: (?#comment)

For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)

Note: when you use comments, be careful not to leave spaces, line breaks, and the like in front of the comment parentheses, because they become part of the pattern. If you want such whitespace to be ignored, it is best to enable the "ignore whitespace in pattern" option, which in C# is the IgnorePatternWhitespace member of the RegexOptions enumeration (RegexOptions is introduced below).
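The following sketch shows both styles of comment on the segment pattern: inline (?#...) comments, and a pattern laid out over several lines with RegexOptions.IgnorePatternWhitespace, where # starts a comment that runs to the end of the line. The helper names and the ^(?: ... )$ wrapper are additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentDemo
{
    // Inline comments: no whitespace is allowed to sneak in around them.
    private const string Inline =
        @"^(?:2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))$";

    // Free layout: whitespace is ignored and # starts an end-of-line comment.
    private const string Spaced = @"
        ^(?:
            2[0-4]\d      # 200-249
          | 25[0-5]       # 250-255
          | [01]?\d\d?    # 0-199
        )$";

    public static bool IsSegmentInline(string input)
    {
        return Regex.IsMatch(input, Inline);
    }

    public static bool IsSegmentSpaced(string input)
    {
        return Regex.IsMatch(input, Spaced, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        Console.WriteLine(IsSegmentInline("249")); // True
        Console.WriteLine(IsSegmentSpaced("249")); // True
        Console.WriteLine(IsSegmentInline("256")); // False
        Console.WriteLine(IsSegmentSpaced("256")); // False
    }
}
```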

Processing options in C#

In C#, you can use the RegexOptions enumeration to choose how regular expressions are processed. The members of the enumeration, such as IgnoreCase, Multiline, and IgnorePatternWhitespace, are described in the MSDN documentation.
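A few of the RegexOptions members can be tried with a sketch like this one. It covers only IgnoreCase, Multiline, and Singleline; the helper CountLines and the sample strings are made up. Options are combined with the | operator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Hypothetical helper: counts lines that consist only of word characters.
    public static int CountLines(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+$", options).Count;
    }

    public static void Main()
    {
        // IgnoreCase: letters match regardless of case.
        Console.WriteLine(Regex.IsMatch("HELLO", "hello"));                         // False
        Console.WriteLine(Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase)); // True

        // Multiline: ^ and $ also match at the start and end of each line.
        string text = "one\ntwo\nthree";
        Console.WriteLine(CountLines(text, RegexOptions.None));      // 0
        Console.WriteLine(CountLines(text, RegexOptions.Multiline)); // 3

        // Singleline: . also matches a newline. Options combine with |.
        Console.WriteLine(Regex.IsMatch("A\nb", "a.b",
            RegexOptions.IgnoreCase | RegexOptions.Singleline));     // True
    }
}
```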

The Capture, Group, and Match classes in C#

Capture class: represents the result of a single subexpression capture, that is, one successfully captured substring. The class has no public constructor; you obtain a collection of Capture objects from the Group class or the Match class. Capture has three commonly used properties: Index, Length, and Value. Index is the position of the first character of the captured substring, Length is the length of the captured substring, and Value is the captured substring itself.

Group class: represents the information for one group in a regular expression, and provides the support for expressions that match by group. The class has no public constructor; you obtain a collection of Group objects from the Match class. If a group in the regular expression is named, it can be accessed by name; if it is not named, it is accessed by index. Note: element 0 of each Match's Groups collection (Groups[0]) is the entire matched string, which is also the Value of the Match itself.

Match class: represents the result of a single regular expression match. This class also has no public constructor; you obtain an instance from the Match() method of the Regex class, or a collection of instances from the Matches() method of the Regex class.

All three classes can represent the result of a single regular expression match, but Match provides the most detail, because it includes both the capture and the group information. For this reason, Match is the most commonly used of the three.
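The relationship between the three classes can be seen in a short sketch that reuses the phone number pattern. The group names area and number, the helper names, and the sample input are chosen here for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchGroupCaptureDemo
{
    // Hypothetical helper: the area code of the first phone number found.
    public static string AreaCode(string input)
    {
        Match m = Regex.Match(input, @"(?<area>0\d{2,3})-(?<number>\d{7,8})");
        return m.Success ? m.Groups["area"].Value : string.Empty;
    }

    // Hypothetical helper: how many times a repeated group captured.
    public static int CaptureCount(string input)
    {
        return Regex.Match(input, @"(\w)+").Groups[1].Captures.Count;
    }

    public static void Main()
    {
        Match m = Regex.Match("Tel: 0379-65624150",
                              @"(?<area>0\d{2,3})-(?<number>\d{7,8})");

        // Match: the whole match, with Index, Length, and Value.
        Console.WriteLine(m.Value);                    // 0379-65624150
        Console.WriteLine(m.Index);                    // 5
        Console.WriteLine(m.Length);                   // 13

        // Group: Groups[0] is the whole match; named groups are read by name.
        Console.WriteLine(m.Groups[0].Value == m.Value); // True
        Console.WriteLine(m.Groups["area"].Value);     // 0379
        Console.WriteLine(m.Groups["number"].Value);   // 65624150

        // Capture: a repeated group keeps every capture; Value is the last one.
        Group g = Regex.Match("abc", @"(\w)+").Groups[1];
        Console.WriteLine(g.Value);                    // c
        Console.WriteLine(g.Captures.Count);           // 3
    }
}
```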

That is all for this article. I hope it helps with your learning.
