Interpreting Regular Expressions in C#

For many years now, many programming languages and tools have included support for regular expressions. The .NET Base Class Library contains a namespace and a set of classes that bring the full power of regular expressions to .NET programs, and they are designed to be compatible with Perl 5 regular expressions wherever possible.

In addition, the regexp classes provide some extra functionality, such as right-to-left matching and expression compilation.

In this article, I'll briefly describe the classes and methods in System.Text.RegularExpressions, show examples of string matching and replacement, go into the details of grouping constructs, and finally list some common expressions you might use.

Basic knowledge you should have

Regular expressions are one of those topics that many programmers learn and then "often forget". In this article, we assume that you already know how to use regular expressions, especially Perl 5 expressions. The .NET regexp classes are a superset of Perl 5 expressions, so in theory that knowledge is a good starting point. We also assume that you know the basics of C# syntax and of the .NET architecture.

If you have no knowledge of regular expressions, I suggest you start with the Perl 5 syntax. The authoritative book on regular expressions is Mastering Regular Expressions by Jeffrey Friedl, and we strongly recommend it to readers who want to understand regular expressions in depth.

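The RegularExpressions assembly

The regexp classes are contained in the System.Text.RegularExpressions.dll file, and you must reference this assembly when compiling your application, for example:

csc /r:System.Text.RegularExpressions.dll Foo.cs

This command creates the Foo.exe file, which references the System.Text.RegularExpressions assembly. In released versions of the .NET Framework these classes are part of System.dll, which csc references by default, so the /r switch is usually unnecessary.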

Namespace overview

The namespace contains only six classes and one delegate definition:

Capture: contains the result of a single match;
CaptureCollection: a sequence of Capture objects;
Group: the result of a single group capture, inherits from Capture;
Match: the result of a single expression match, inherits from Group;
MatchCollection: a sequence of Match objects;
MatchEvaluator: the delegate used when performing replacement operations;
Regex: an instance of a compiled regular expression.

The Regex class also contains several static methods:

Escape: escapes the regex metacharacters in a string;
IsMatch: returns a Boolean indicating whether the expression matches within a string;
Match: returns a Match instance;
Matches: returns all matches as a MatchCollection;
Replace: replaces the matched text with a replacement string;
Split: returns an array of strings determined by the expression;
Unescape: unescapes any escaped characters in a string.
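
The static methods listed above can be tried out with a short program. The following is only a minimal sketch: the class name StaticMethodsDemo and its helper methods are hypothetical names chosen for illustration, and it assumes a current C# compiler (expression-bodied members).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; none of these names come from the article.
public static class StaticMethodsDemo
{
    // IsMatch: true when the input contains at least one run of digits.
    public static bool HasDigits(string input) => Regex.IsMatch(input, @"\d+");

    // Split: breaks the input on commas surrounded by optional whitespace.
    public static string[] SplitList(string input) => Regex.Split(input, @"\s*,\s*");

    // Escape: makes metacharacters such as + match literally.
    public static string Literal(string text) => Regex.Escape(text);

    public static void Main()
    {
        Console.WriteLine(HasDigits("abc123"));                    // True
        Console.WriteLine(string.Join("|", SplitList("a , b,c"))); // a|b|c
        Console.WriteLine(Literal("1+1=2"));                       // 1\+1=2
        Console.WriteLine(Regex.Unescape(@"1\+1=2"));              // 1+1=2
    }
}
```

Escape is mainly useful when user-supplied text has to be embedded in a larger pattern without being interpreted as regex syntax.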

Simple match

Let's start with a simple expression that uses the Regex and Match classes.

Match m = Regex.Match("abracadabra", "(a|b|r)+");

We now have an instance of the Match class that can be tested, for example: if (m.Success) ...
If you want to use the matched string, you can convert it to a string:

Console.WriteLine("Match=" + m.ToString());

This example produces the following output: Match=abra. This is the matched string.
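
A Match object also records where the match was found, and NextMatch continues the search after it. The sketch below collects every match of the same expression together with its index; the class SimpleMatchDemo and the method AllMatches are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are not part of the article.
public static class SimpleMatchDemo
{
    // Collects every successive match as "value@index".
    public static List<string> AllMatches(string input, string pattern)
    {
        var found = new List<string>();
        for (Match m = Regex.Match(input, pattern); m.Success; m = m.NextMatch())
        {
            found.Add(m.Value + "@" + m.Index);
        }
        return found;
    }

    public static void Main()
    {
        // Prints: abra@0, a@5, abra@7
        Console.WriteLine(string.Join(", ", AllMatches("abracadabra", "(a|b|r)+")));
    }
}
```

The characters c and d are not part of the alternation, so they end each match and the search resumes after them.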

Substitution of strings

Simple string replacement is very intuitive. For example, take the following statement:

string s = Regex.Replace("abracadabra", "abra", "zzzz");

It returns the string zzzzcadzzzz, in which every match has been replaced with zzzz.

Now let's look at a more complex example of string replacement:

string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");

This statement returns the string abra, with the leading and trailing spaces removed.

The above pattern is useful for removing leading and trailing spaces from any string. In C#, we often use verbatim string literals; in a verbatim string, the compiler does not treat the character "\" as an escape character. Consequently, @"..." is very useful with regular expressions, where metacharacters are escaped with "\". Also worth noting is the use of $1 as the replacement string: the only special constructs allowed in a replacement string are substitutions, which refer to the capture groups of the expression.
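
Substitutions can refer to groups by number ($1) or by name (${name}). The following sketch shows both forms; SubstitutionDemo, TrimSpaces and IsoToUs are hypothetical names, and the date pattern is an assumption added for illustration, not something taken from the article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class illustrating numbered and named substitutions.
public static class SubstitutionDemo
{
    // $1 refers to the first capture group: the text without surrounding whitespace.
    public static string TrimSpaces(string input) =>
        Regex.Replace(input, @"^\s*(.*?)\s*$", "$1");

    // ${name} refers to a named group; here the date parts are reordered.
    public static string IsoToUs(string isoDate) =>
        Regex.Replace(isoDate, @"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})", "${m}/${d}/${y}");

    public static void Main()
    {
        Console.WriteLine("[" + TrimSpaces("  abra  ") + "]"); // [abra]
        Console.WriteLine(IsoToUs("2001-01-31"));              // 01/31/2001
    }
}
```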

Match engine details

Now let's work through a slightly more complex example that uses grouping constructs. Look at the following example:

string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
  (        # start of the first group
    abra   # match the literal abra
    (      # start of the second group
      cad  # match the literal cad
    )?     # end of the second group (optional)
  )        # end of the first group
  +        # match one or more times
";
// Ignore comments and whitespace in the pattern (the x modifier)
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// Get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// Get the first match
Match m = r.Match(text);
while (m.Success)
{
    // Start at group 1
    for (int i = 1; i < gnums.Length; i++)
    {
        // Get this group for the current match
        Group g = m.Groups[gnums[i]];
        Console.WriteLine("Group" + gnums[i] + "=[" + g.ToString() + "]");
        // Get the captures of this group, with their start index and length
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("  Capture" + j + "=[" + c.ToString()
                + "] Index=" + c.Index + " Length=" + c.Length);
        }
    }
    // Get the next match
    m = m.NextMatch();
}

The output of this example is shown below:

Group1=[abra]
  Capture0=[abracad] Index=0 Length=7
  Capture1=[abra] Index=7 Length=4
Group2=[cad]
  Capture0=[cad] Index=4 Length=3
Group1=[abra]
  Capture0=[abracad] Index=12 Length=7
  Capture1=[abra] Index=19 Length=4
Group2=[cad]
  Capture0=[cad] Index=16 Length=3
Group1=[abra]
  Capture0=[abracad] Index=24 Length=7
  Capture1=[abra] Index=31 Length=4
Group2=[cad]
  Capture0=[cad] Index=28 Length=3

We start with the string pat, which contains the expression. The first capture group begins at the first parenthesis, and the expression then matches abra. The second capture group begins at the second parenthesis, before the first group has ended, which means that the first group can match abracad while the second group matches only cad. Because the ? symbol makes cad optional, the first group can match either abra or abracad. The first group then ends, and the + symbol requires the expression to match one or more occurrences.

Now let's take a look at what happens during matching. First, we create an instance of the expression by calling the Regex constructor, which is also where options are specified. In this example the x option (RegexOptions.IgnorePatternWhitespace in current .NET versions) is used, because the expression contains comments and some whitespace for formatting. With the x option on, the engine ignores the comments and any whitespace that is not escaped.

Next, we get the list of group numbers defined in the expression. You could of course use these numbers explicitly; here they are obtained programmatically. This approach is also useful as a way to build a quick index if you use named groups.

The next step is to get the first match. A loop tests whether the current match succeeded and then iterates over the group list, starting at group 1. Group 0 is not used in this example because group 0 is the entire matched string; use group 0 if you want the whole match as a single string.

We then walk the CaptureCollection of each group. Typically there is only one capture per group per match, but in this case Group1 has two captures: Capture0 and Capture1. If you ask only for the ToString of Group1, you get just abra, although the group also matched abracad. The value of a group's ToString is the value of the last capture in its CaptureCollection, which is exactly what we need here. If you want matching to stop after the first abra, remove the + symbol from the expression so that the regex engine matches the group only once.
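
The relationship between a group's value and its captures can be checked directly. This is a small sketch under the assumption that the same expression is written on one line without comments; CaptureDemo and its methods are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class showing that a group's value is its last capture.
public static class CaptureDemo
{
    private const string Pattern = "(abra(cad)?)+";

    // Returns every capture recorded for group 1 in the first match.
    public static string[] Group1Captures(string input)
    {
        Group g = Regex.Match(input, Pattern).Groups[1];
        var values = new string[g.Captures.Count];
        for (int i = 0; i < g.Captures.Count; i++)
        {
            values[i] = g.Captures[i].Value;
        }
        return values;
    }

    // Returns the value of group 1, which is the last capture only.
    public static string Group1Value(string input) =>
        Regex.Match(input, Pattern).Groups[1].Value;

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Group1Captures("abracadabra"))); // abracad,abra
        Console.WriteLine(Group1Value("abracadabra"));                      // abra
        Console.WriteLine(Regex.Match("abracadabra", Pattern).Value);       // abracadabra
    }
}
```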

Procedural versus expression-based approaches

Users of regular expressions generally fall into two groups. The first group uses regular expressions only for basic matching and writes procedural code to perform the repetitive work, while the second group uses as little procedural code as possible and relies on the functionality and power of the regular expression engine.

For most of us, the best solution is a bit of both. I hope this article shows the role of the regexp classes in .NET languages and the trade-offs between performance and complexity.

Procedural patterns

A task that comes up often in programming is matching part of a string or doing some other string processing. Here is an example that matches the words in a string:

string text = "the quick red fox jumped over the lazy brown dog.";
System.Console.WriteLine("text=[" + text + "]");
string result = "";
string pattern = @"\w+|\W+";
foreach (Match m in Regex.Matches(text, pattern))
{
    // Get the matched string
    string x = m.ToString();
    // If the first character is lowercase
    if (char.IsLower(x[0]))
        // convert it to uppercase
        x = char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
    // Collect all the text
    result += x;
}
System.Console.WriteLine("result=[" + result + "]");

As the example shows, we use the C# foreach statement to process each match and build a new result string along the way. The output of this example is as follows:

text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]

Expression-based patterns

Another way to implement the previous example is with a MatchEvaluator. The new code looks like this:

class Test
{
    static string CapText(Match m)
    {
        // Get the matched string
        string x = m.ToString();
        // If the first character is lowercase
        if (char.IsLower(x[0]))
            // convert it to uppercase
            return char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
        return x;
    }

    static void Main()
    {
        string text = "the quick red fox jumped over the lazy brown dog.";
        System.Console.WriteLine("text=[" + text + "]");
        string pattern = @"\w+";
        string result = Regex.Replace(text, pattern,
            new MatchEvaluator(Test.CapText));
        System.Console.WriteLine("result=[" + result + "]");
    }
}

Also note that the pattern is simpler here, because only the words need to be modified and the non-word characters can be left alone.
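
In later versions of C#, the evaluator can also be written inline as a lambda expression, which the compiler converts to a MatchEvaluator delegate. The sketch below assumes such a compiler; TitleCaseDemo and Capitalize are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class: the same capitalization done with a lambda evaluator.
public static class TitleCaseDemo
{
    // Uppercases the first character of every word; non-word text is untouched.
    public static string Capitalize(string text) =>
        Regex.Replace(text, @"\w+",
            m => char.ToUpper(m.Value[0]) + m.Value.Substring(1));

    public static void Main()
    {
        // Prints: The Quick Red Fox Jumped Over The Lazy Brown Dog.
        Console.WriteLine(Capitalize("the quick red fox jumped over the lazy brown dog."));
    }
}
```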

Common expressions

To help you better understand how to use regular expressions in C#, here are some expressions that may be useful to you. They have been used in other environments as well, and I hope they help.

Roman numerals

string p1 = "^M*(D?C{0,3}|C[DM])" + "(L?X{0,3}|X[LC])(V?I{0,3}|I[VX])$";
string t1 = "VII";
Match m1 = Regex.Match(t1, p1);

Swap the first two words

string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
// Replace only the first occurrence
string r2 = x2.Replace(t2, "$3$2$1", 1);

Keyword = value

string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*$";
Match m3 = Regex.Match(t3, p3);

Lines of at least 80 characters

string t4 = "********************"
    + "******************************"
    + "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);

Date and time in month/day/year hours:minutes:seconds format

string t5 = "01/01/01 16:10:01";
string p5 = @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);

Change directory (Windows platform only)

string t6 = @"C:\Documents and Settings\user1\Desktop\";
// Backslashes are literal in a replacement string, so they are not doubled there
string r6 = Regex.Replace(t6, @"\\user1\\", @"\user2\");

Expand %XX hexadecimal escapes

string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// HexConvert is a MatchEvaluator method that is not shown here
string r7 = Regex.Replace(t7, p7, HexConvert);
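
The article does not show HexConvert. One possible implementation is sketched below as an assumption: it parses the two captured hexadecimal digits and returns the corresponding character. The class HexEscapeDemo and the method Expand are hypothetical names as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; HexConvert is an assumed implementation,
// since the article uses the name without defining it.
public static class HexEscapeDemo
{
    // Converts the two hex digits captured in group 1 to a character.
    public static string HexConvert(Match m)
    {
        int code = Convert.ToInt32(m.Groups[1].Value, 16);
        return ((char)code).ToString();
    }

    // Replaces every %XX escape in the input with its character.
    public static string Expand(string input) =>
        Regex.Replace(input, "%([0-9A-Fa-f][0-9A-Fa-f])", HexConvert);

    public static void Main()
    {
        Console.WriteLine(Expand("%41"));     // A
        Console.WriteLine(Expand("%41%62c")); // Abc
    }
}
```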

Delete C-style comments (imperfect)

string t8 = @"
/*
 * Traditional style comment
 */
";
string p8 = @"
  /\*   # match the opening delimiter of the comment
  .*?   # match as few characters as possible
  \*/   # match the closing delimiter of the comment
";
// x: ignore whitespace and comments in the pattern; s: let . match newlines
string r8 = Regex.Replace(t8, p8, "",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

Remove spaces at the beginning and end of a string

string t9a = "     leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");

string t9b = "trailing     ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");

Turn the character \ followed by the character n into a real newline

string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");

Match an IP address

string t11 = "55.54.53.52";
// \d\d? also accepts single-digit octets such as the 1 in 10.0.0.1
string p11 = "^" +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])" +
    "$";
Match m11 = Regex.Match(t11, p11);
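
The IP address expression can be wrapped in a small helper to see which inputs it accepts. This is a sketch, assuming the octet alternative [01]?\d\d? so that one-digit octets are accepted; IpAddressDemo and IsIp are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class that validates dotted-quad addresses.
public static class IpAddressDemo
{
    // One octet: 0-199 (with optional leading 0 or 1), 200-249, or 250-255.
    private const string Octet = @"([01]?\d\d?|2[0-4]\d|25[0-5])";

    private static readonly Regex Ip = new Regex(
        "^" + Octet + @"\." + Octet + @"\." + Octet + @"\." + Octet + "$");

    public static bool IsIp(string candidate) => Ip.IsMatch(candidate);

    public static void Main()
    {
        Console.WriteLine(IsIp("55.54.53.52")); // True
        Console.WriteLine(IsIp("1.2.3.4"));     // True
        Console.WriteLine(IsIp("256.1.1.1"));   // False
        Console.WriteLine(IsIp("1.2.3"));       // False
    }
}
```

Building the pattern from a single Octet constant keeps the four parts identical, which is easy to get wrong when the alternative is copied by hand.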

Remove the leading path from a file name

string t12 = @"C:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");

Join the lines of a multi-line string

string t13 = @"this is
a split line";
string p13 = @"\s*\r?\n\s*";
// Replace each line break with a single space so the words do not run together
string r13 = Regex.Replace(t13, p13, " ");

Extract all numbers in a string

string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
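
A MatchCollection is usually consumed by iterating over it. The following sketch turns the matches of the number expression into an array of strings; NumberExtractDemo and Numbers are hypothetical names, and the use of LINQ is an assumption about the target framework.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class that extracts the numbers found in a text.
public static class NumberExtractDemo
{
    // Returns the text of every match of the number pattern, in order.
    public static string[] Numbers(string text) =>
        Regex.Matches(text, @"(\d+\.?\d*|\.\d+)")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // Prints: 1 | 2.3 | 47
        Console.WriteLine(string.Join(" | ", Numbers("test 1\ntest 2.3\ntest 47")));
    }
}
```

Note that the pattern also accepts a trailing decimal point, so a number followed by a period at the end of a sentence is returned with the period attached.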

Find all-caps words

string t15 = "This IS a Test OF ALL Caps";
// [^\Wa-z0-9_] matches word characters that are not lowercase letters, digits or underscores
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);

Find lowercase words

string t16 = "This is A Test of lowercase";
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);

Find words with an initial capital letter

string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);

Find links in simple HTML

string t18 = @"
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
";
string p18 = @"<a[^>]*?href\s*=\s*[""']?" + @"([^'"" >]+?)[ '""]?>";
// s: let . match newlines; i: ignore case
MatchCollection mc18 = Regex.Matches(t18, p18,
    RegexOptions.Singleline | RegexOptions.IgnoreCase);
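
To read the link targets out of the matches, take the value of group 1 from each Match. This is a minimal sketch using the same expression; LinkExtractDemo and Hrefs are hypothetical names, and the approach is only suitable for simple, well-formed HTML like the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class that lists the href values of anchor tags.
public static class LinkExtractDemo
{
    public static List<string> Hrefs(string html)
    {
        string pattern = @"<a[^>]*?href\s*=\s*[""']?([^'"" >]+?)[ '""]?>";
        var hrefs = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern,
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            // Group 1 holds the link target without its quotes.
            hrefs.Add(m.Groups[1].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        string html = "<a href=\"first.htm\">first tag text</a>\n"
                    + "<A HREF='next.htm'>next tag text</A>";
        // Prints: first.htm, next.htm
        Console.WriteLine(string.Join(", ", Hrefs(html)));
    }
}
```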




