Many programming languages and tools have supported regular expressions for years. The .NET base class library contains a namespace and a set of classes that expose the full power of regular expressions, and they are compatible with the regular expressions of Perl 5.
In addition, the Regex classes provide some extra functionality, such as right-to-left matching and expression compilation.
In this article I briefly describe the classes and methods in System.Text.RegularExpressions, give examples of string matching and replacement, explain groups in detail, and finish with some common expressions you might use.
Prerequisites
Regular expressions are one of those subjects that many programmers "often forget". In this article we assume that you already know how to use regular expressions, especially those of Perl 5. The .NET Regex classes support a superset of Perl 5 expressions, so in theory that is a good starting point. We also assume that you know the basics of C# syntax and the .NET architecture.
If you have no knowledge of regular expressions, I suggest you start with the Perl 5 syntax. The authoritative book on the subject is "Mastering Regular Expressions" by Jeffrey Friedl, and we strongly recommend it to readers who want to understand expressions in depth.
The RegularExpressions assembly
The regular expression classes are contained in the System.Text.RegularExpressions.dll file, which you must reference when compiling your application, for example:
csc /r:System.Text.RegularExpressions.dll Foo.cs
This command creates the Foo.exe file, which references the System.Text.RegularExpressions assembly. (In released versions of the .NET Framework these classes ship in System.dll, which the compiler references by default.)
Introduction to namespaces
The namespace contains only six classes and one delegate:
Capture: contains the result of a single match;
CaptureCollection: a sequence of Capture objects;
Group: the result of a group capture; it inherits from Capture;
Match: the result of matching an expression; it inherits from Group;
MatchCollection: a sequence of Match objects;
MatchEvaluator: the delegate used when performing a replace operation;
Regex: an instance of a compiled expression.
The Regex class also contains some static methods:
Escape: escapes the regex metacharacters in a string;
IsMatch: returns a Boolean value that indicates whether the expression matches anywhere in a string;
Match: returns a Match instance;
Matches: returns a collection of Match objects;
Replace: replaces the text matched by the expression with a replacement string;
Split: returns an array of strings obtained by splitting the input at each match of the expression;
Unescape: removes the escaping from the escaped characters in a string.
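To make the list above concrete, here is a minimal sketch that calls each static method once. The class name and the sample input are my own choices for illustration, and the code assumes a current C# compiler rather than the one available when this article was written.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class: one call to each static method listed above.
public static class StaticMethodsDemo
{
    public static void Main()
    {
        string input = "one, two,three";

        // IsMatch: does the pattern occur anywhere in the input?
        Console.WriteLine(Regex.IsMatch(input, @"\btwo\b"));      // True

        // Match: the first occurrence only
        Console.WriteLine(Regex.Match(input, @"\w+").Value);      // one

        // Matches: every occurrence
        Console.WriteLine(Regex.Matches(input, @"\w+").Count);    // 3

        // Replace: normalize the separators
        Console.WriteLine(Regex.Replace(input, @"\s*,\s*", ";")); // one;two;three

        // Split: cut the input at each separator
        string[] parts = Regex.Split(input, @"\s*,\s*");
        Console.WriteLine(parts.Length);                          // 3

        // Escape and Unescape are inverses for regex metacharacters
        string escaped = Regex.Escape("1+1=2?");
        Console.WriteLine(escaped);                               // 1\+1=2\?
        Console.WriteLine(Regex.Unescape(escaped));               // 1+1=2?
    }
}
```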
Simple match
Let's start with a simple expression that uses the Regex and Match classes.
Match m = Regex.Match("abracadabra", "(a|b|r)+");
We now have an instance of the Match class that can be tested, for example: if (m.Success) ...
If you want to use the matched text, you can convert it to a string:
Console.WriteLine("Match=" + m.ToString());
This example produces the output Match=abra, which is the matched string.
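Put together as a complete program, the simple match might look like the sketch below. The FirstMatch helper and the class name are hypothetical; only Regex.Match, Success, Value, Index and Length come from the library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleMatchDemo
{
    // Hypothetical helper: returns the first match, or null if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Match m = Regex.Match("abracadabra", "(a|b|r)+");
        if (m.Success)
        {
            // The match stops at the first character that is not a, b or r.
            Console.WriteLine("Match=" + m.Value);              // Match=abra
            Console.WriteLine(m.Index + "," + m.Length);        // 0,4
        }

        // No match at all yields null from the helper.
        Console.WriteLine(FirstMatch("xyz", "(a|b|r)+") == null); // True
    }
}
```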
Substitution of strings
Simple string replacement is straightforward. For example, the following statement:
string s = Regex.Replace("abracadabra", "abra", "zzzz");
returns the string zzzzcadzzzz, in which every match has been replaced with zzzz.
Now let's look at a more complex example of string replacement:
string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");
This statement returns the string abra, with the leading and trailing spaces removed.
The pattern above is useful for removing leading and trailing whitespace from any string. In C# we also often use verbatim string literals, written @"...", in which the compiler does not treat the character "\" as an escape character. Because regular expressions use "\" for their own escapes, the @"..." form is very convenient. Also worth noting is the use of $1 in the replacement string: it stands for the text captured by the first group. A replacement string can contain only substitutions of this kind, not other regular expression constructs.
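As a small sketch of the two points just made, the following program wraps the trimming pattern in a helper and checks that a verbatim literal and a regular literal describe the same pattern. The Trim helper and the class name are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimDemo
{
    // Hypothetical helper built on the pattern from the text.
    public static string Trim(string s)
    {
        // $1 inserts whatever the first group (.*?) captured.
        return Regex.Replace(s, @"^\s*(.*?)\s*$", "$1");
    }

    public static void Main()
    {
        Console.WriteLine("[" + Trim("  abra  ") + "]");   // [abra]
        Console.WriteLine("[" + Trim("  a b  ") + "]");    // [a b]

        // A verbatim literal keeps the backslash; a regular literal must double it.
        Console.WriteLine(@"^\s*" == "^\\s*");             // True
    }
}
```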
Details of the matching engine
Now let's work through a slightly more complex example that uses groups:
string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
    (       # start of the first group
      abra  # match the string abra
      (     # start of the second group
        cad # match the string cad
      )?    # end of the second group (optional)
    )       # end of the first group
    +       # match one or more times
    ";
// Ignore pattern comments and whitespace: the x option (RegexOptions.IgnorePatternWhitespace)
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// Get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// First match
Match m = r.Match(text);
while (m.Success)
{
    // Start from group 1
    for (int i = 1; i < gnums.Length; i++)
    {
        Group g = m.Groups[gnums[i]];
        // Print the value of this group
        Console.WriteLine("Group" + gnums[i] + "=[" + g.ToString() + "]");
        // Print the position and length of each capture in this group
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("Capture" + j + "=[" + c.ToString()
                + "] Index=" + c.Index + " Length=" + c.Length);
        }
    }
    // Next match
    m = m.NextMatch();
}
The output of this example is as follows:
Group1=[abra]
Capture0=[abracad] Index=0 Length=7
Capture1=[abra] Index=7 Length=4
Group2=[cad]
Capture0=[cad] Index=4 Length=3
Group1=[abra]
Capture0=[abracad] Index=12 Length=7
Capture1=[abra] Index=19 Length=4
Group2=[cad]
Capture0=[cad] Index=16 Length=3
Group1=[abra]
Capture0=[abracad] Index=24 Length=7
Capture1=[abra] Index=31 Length=4
Group2=[cad]
Capture0=[cad] Index=28 Length=3
We begin by examining the string pat, which contains the expression. The first capture group starts at the first parenthesis, and the expression then matches abra. The second capture group starts at the second parenthesis, but the first group has not yet ended, so the first group matches abracad while the second group matches only cad. Because the ? symbol makes cad optional, the first group can match either abra or abracad. The first group then ends, and the + symbol requires the group to match one or more times.
Now let's look at what happens during matching. First, an instance of the expression is created by calling the Regex constructor and specifying the options. In this example the expression contains comments and extra whitespace, so the x option (RegexOptions.IgnorePatternWhitespace) is selected. When this option is on, the engine ignores comments and any whitespace that is not escaped.
Next, we get the list of group numbers defined in the expression. You could of course hard-code these numbers; here they are obtained programmatically. If named groups are used, this approach is also a useful way to build a quick index.
The next step is to perform the first match. A loop tests whether the current match succeeded and then walks through the list of groups, starting from group 1. Group 0 is not used in this example because group 0 is the entire matched string; if you want the whole match as a single string, use group 0.
We then walk the CaptureCollection of each group. Typically there is only one capture per group, but Group1 in this example has two: Capture0 and Capture1. If you only ask for the ToString of Group1 you get abra, even though the group also matched abracad. The value of a group's ToString is the last capture in its CaptureCollection, which is exactly what we need here. If you want each match to end after a single pass through the group, remove the + symbol from the expression so that the Regex engine matches the group only once.
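The rule that a group's value is its last capture can be checked with a short sketch. The input is a shortened version of the text above, and the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastCaptureDemo
{
    public static void Main()
    {
        // The group repeats twice: first as "abracad", then as "abra".
        Match m = Regex.Match("abracadabra1", "(abra(cad)?)+");
        Group g1 = m.Groups[1];

        Console.WriteLine(m.Groups[0].Value);     // abracadabra (the whole match)
        Console.WriteLine(g1.Captures.Count);     // 2
        Console.WriteLine(g1.Captures[0].Value);  // abracad
        Console.WriteLine(g1.Captures[1].Value);  // abra
        Console.WriteLine(g1.Value);              // abra (same as the last capture)
    }
}
```

If the earlier captures matter to you, read them from Captures; Value alone would hide them.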
Comparison of procedural and expression-based approaches
In general, users of regular expressions fall into two categories. The first group uses regular expressions as little as possible and relies on procedural code for the repetitive work; the second group exploits the functionality and power of the regular expression engine and writes as little procedural code as possible.
For most of us, the best solution is to use both. I hope this article helps explain the role of the Regex classes in the .NET languages and the trade-offs between performance and complexity.
Procedural approach
One thing we often need to do in programming is to match part of a string or perform some other string processing. Here is an example that processes each word in a string:
string text = "the quick red fox jumped over the lazy brown dog.";
System.Console.WriteLine("text=[" + text + "]");
string result = "";
string pattern = @"\w+|\W+";
foreach (Match m in Regex.Matches(text, pattern))
{
    // Get the matched string
    string x = m.ToString();
    // If the first character is lowercase
    if (char.IsLower(x[0]))
        // convert it to uppercase
        x = char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
    // Append the piece to the result
    result += x;
}
System.Console.WriteLine("result=[" + result + "]");
As the example shows, we use the C# foreach statement to visit each match and process it; in this case a new result string is built up. The output of this example is shown below:
text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]
Expression-based approach
Another way to achieve the same result as the previous example is to use a MatchEvaluator. The new code looks like this:
static string CapText(Match m)
{
    // Get the matched string
    string x = m.ToString();
    // If the first character is lowercase
    if (char.IsLower(x[0]))
        // convert it to uppercase
        return char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
    return x;
}
static void Main()
{
    string text = "the quick red fox jumped over the lazy brown dog.";
    System.Console.WriteLine("text=[" + text + "]");
    string pattern = @"\w+";
    // Test is the enclosing class (not shown)
    string result = Regex.Replace(text, pattern,
        new MatchEvaluator(Test.CapText));
    System.Console.WriteLine("result=[" + result + "]");
}
Note also that the pattern is simpler here, because only the words need to be matched and modified; the non-word text is left untouched.
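With a more recent C# compiler the evaluator can also be written inline as a lambda, which the compiler converts to a MatchEvaluator. The sketch below is one way to do that; the Capitalize helper and the class name are my own, and the article itself only uses the named-method form.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorDemo
{
    // Hypothetical helper: capitalizes the first letter of every word.
    public static string Capitalize(string text)
    {
        // The lambda is called once per match and returns the replacement text.
        return Regex.Replace(text, @"\w+",
            m => char.ToUpper(m.Value[0]) + m.Value.Substring(1));
    }

    public static void Main()
    {
        string text = "the quick red fox jumped over the lazy brown dog.";
        Console.WriteLine("result=[" + Capitalize(text) + "]");
        // result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]
    }
}
```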
Common expressions
To help you see how regular expressions are used in a C# environment, I have collected some expressions that may be useful to you. They have been used in other contexts, and I hope they help.
Roman numerals
string p1 = "^M*(D?C{0,3}|C[DM])" + "(L?X{0,3}|X[LC])(V?I{0,3}|I[VX])$";
string t1 = "VII";
Match m1 = Regex.Match(t1, p1);
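A quick way to try the Roman numeral pattern is to wrap it in a test function, as in this sketch. IsRoman and the class name are hypothetical names. Note that every part of the pattern is optional, so the empty string also matches; the helper rejects it explicitly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RomanDemo
{
    const string Pattern =
        "^M*(D?C{0,3}|C[DM])(L?X{0,3}|X[LC])(V?I{0,3}|I[VX])$";

    // Hypothetical helper: true for a well-formed, non-empty Roman numeral.
    public static bool IsRoman(string s)
    {
        return s.Length > 0 && Regex.IsMatch(s, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsRoman("VII"));      // True
        Console.WriteLine(IsRoman("MCMXCIV"));  // True
        Console.WriteLine(IsRoman("IIII"));     // False
        Console.WriteLine(IsRoman(""));         // False (rejected by the length check)
    }
}
```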
Swap the first two words
string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
// The last argument limits the replacement to the first match
string r2 = x2.Replace(t2, "$3$2$1", 1);
Keyword = value
string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*$";
Match m3 = Regex.Match(t3, p3);
Lines of at least 80 characters
string t4 = "********************"
    + "******************************"
    + "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);
Date and time in month/day/year hours:minutes:seconds format
string t5 = "01/01/01 16:10:01";
string p5 = @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);
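Once the date pattern has matched, the six fields are available as groups 1 to 6. The sketch below reads them into integers; the Parse helper and the class name are assumptions made for illustration, and no range checking is attempted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateTimeDemo
{
    // Hypothetical helper: returns month, day, year, hour, minute, second,
    // or null when the input does not match.
    public static int[] Parse(string s)
    {
        Match m = Regex.Match(s, @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)");
        if (!m.Success)
            return null;

        int[] parts = new int[6];
        for (int i = 0; i < 6; i++)
            parts[i] = int.Parse(m.Groups[i + 1].Value); // groups are numbered from 1
        return parts;
    }

    public static void Main()
    {
        int[] p = Parse("01/01/01 16:10:01");
        Console.WriteLine("month=" + p[0] + " hour=" + p[3]); // month=1 hour=16
        Console.WriteLine(Parse("not a date") == null);       // True
    }
}
```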
Change a directory in a path (Windows only)
string t6 = @"C:\Documents and Settings\user1\Desktop\";
// In the replacement string a backslash is literal, so it is not doubled
string r6 = Regex.Replace(t6, @"\\user1\\", @"\user2\");
Expand hexadecimal escape characters
string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// HexConvert is a MatchEvaluator method (not shown here)
string r7 = Regex.Replace(t7, p7, HexConvert);
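The article does not show HexConvert, so the following is only one possible sketch of such an evaluator: it parses the two captured hex digits and returns the corresponding character. The body of HexConvert and the class name are my assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexDemo
{
    // Assumed implementation: turns the two hex digits in group 1 into a character.
    public static string HexConvert(Match m)
    {
        int code = Convert.ToInt32(m.Groups[1].Value, 16);
        return ((char)code).ToString();
    }

    public static string Expand(string s)
    {
        return Regex.Replace(s, "%([0-9A-Fa-f][0-9A-Fa-f])",
            new MatchEvaluator(HexConvert));
    }

    public static void Main()
    {
        Console.WriteLine(Expand("%41"));       // A
        Console.WriteLine(Expand("a%20b%2Fc")); // a b/c
    }
}
```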
Delete comments in C code (needs refinement)
string t8 = @"
/*
 * Traditional style comment
 */
";
string p8 = @"
    /\*   # match the comment start delimiter
    .*?   # match the comment body
    \*/   # match the comment end delimiter
";
// x ignores pattern whitespace and comments; s lets . match newlines
string r8 = Regex.Replace(t8, p8, "",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
Delete whitespace at the beginning and end of a string
string t9a = "     leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");
string t9b = "trailing     ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");
Turn the character \ followed by the character n into a real newline
string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");
Match an IP address
string t11 = "55.54.53.52";
// \d\d? lets each octet have one, two or three digits
string p11 = "^" +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
    @"([01]?\d\d?|2[0-4]\d|25[0-5])" +
    "$";
Match m11 = Regex.Match(t11, p11);
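The IP address pattern can be exercised with a small validator, sketched below. It assumes each octet is written as `[01]?\d\d?|2[0-4]\d|25[0-5]`, where the optional second digit lets single-digit octets such as 1.2.3.4 through. IsIPv4 and the class name are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpDemo
{
    // One octet: 0-199 (with optional leading digits), 200-249, or 250-255.
    const string Octet = @"([01]?\d\d?|2[0-4]\d|25[0-5])";

    static readonly Regex Pattern =
        new Regex("^" + Octet + @"\." + Octet + @"\." + Octet + @"\." + Octet + "$");

    // Hypothetical helper: true when s is a dotted-quad address.
    public static bool IsIPv4(string s)
    {
        return Pattern.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsIPv4("55.54.53.52")); // True
        Console.WriteLine(IsIPv4("1.2.3.4"));     // True
        Console.WriteLine(IsIPv4("256.1.1.1"));   // False
        Console.WriteLine(IsIPv4("1.2.3"));       // False
    }
}
```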
Remove the path from a file name
string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");
Join the lines of a multi-line string
string t13 = @"this is
a split line";
string p13 = @"\s*\r?\n\s*";
// Each line break, with its surrounding whitespace, becomes a single space
string r13 = Regex.Replace(t13, p13, " ");
Extract all the numbers in a string
string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
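A MatchCollection is usually consumed with foreach. The sketch below collects the numbers found by the pattern above into a list; the Numbers helper and the class name are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumbersDemo
{
    // Hypothetical helper: every number in the text, in order of appearance.
    public static List<string> Numbers(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, @"(\d+\.?\d*|\.\d+)"))
            found.Add(m.Value);
        return found;
    }

    public static void Main()
    {
        List<string> nums = Numbers("test 1\ntest 2.3\ntest 47");
        Console.WriteLine(string.Join(", ", nums)); // 1, 2.3, 47
    }
}
```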
Find the words written entirely in capitals
string t15 = "This IS a Test OF ALL Caps";
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);
Find the lowercase words
string t16 = "This is A Test of lowercase";
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);
Find the words with an initial capital letter
string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);
Find the links in simple HTML
string t18 = @"
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
";
string p18 = @"<a[^>]*?href\s*=\s*[""']?" + @"([^'"" >]+?)[ '""]?>";
// s lets . match newlines; i makes the match case-insensitive
MatchCollection mc18 = Regex.Matches(t18, p18,
    RegexOptions.Singleline | RegexOptions.IgnoreCase);
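To use the result of the link search, read group 1 of each match, which holds the href value. The sketch below does this for the two sample tags; the Hrefs helper and the class name are hypothetical, and the pattern is only meant for simple, well-formed markup like the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinksDemo
{
    const string Pattern =
        @"<a[^>]*?href\s*=\s*[""']?" + @"([^'"" >]+?)[ '""]?>";

    // Hypothetical helper: the href value of every simple <a> tag.
    public static List<string> Hrefs(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern,
                 RegexOptions.Singleline | RegexOptions.IgnoreCase))
            links.Add(m.Groups[1].Value); // group 1 is the link target
        return links;
    }

    public static void Main()
    {
        string html = @"
<a href=""first.htm"">first tag text</a>
<A HREF=""next.htm"">next tag text</A>
";
        foreach (string link in Hrefs(html))
            Console.WriteLine(link);
        // first.htm
        // next.htm
    }
}
```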
Reprinted from: http://www.aspnetjia.com