Programming languages and tools have supported regular expressions for years. The .NET base class library contains a namespace and a set of classes that make full use of the power of regular expressions, and they are compatible with Perl 5 regular expressions.
In addition, the regexp classes provide some extra functionality, such as right-to-left matching and expression compilation.
This article introduces the classes and methods in System.Text.RegularExpressions, gives examples of string matching and replacement, and describes group constructs in detail. Finally, it introduces some common expressions that you may find useful.
Basic knowledge to be mastered
Knowledge of regular expressions may be one of the things that many programmers often forget. This article assumes that you already know how to use regular expressions, particularly Perl 5 expressions. The .NET regexp classes are a superset of Perl 5 expressions, so in theory Perl 5 is a good starting point. We also assume basic knowledge of C# syntax and the .NET architecture.
If you have no knowledge of regular expressions, I suggest you start with the Perl 5 syntax. The authoritative book on regular expressions is Mastering Regular Expressions by Jeffrey Friedl. We strongly recommend it to readers who want a deep understanding of expressions.
The RegularExpressions assembly
The regexp classes are contained in the System.Text.RegularExpressions.dll assembly. You must reference this assembly when compiling your application. For example:
csc /r:System.Text.RegularExpressions.dll foo.cs
This command creates the foo.exe file, which references the System.Text.RegularExpressions assembly. (In released versions of the .NET Framework these classes are part of System.dll, which the compiler normally references by default.)
Namespace Introduction
The namespace contains only six classes and one delegate. They are:
Capture: contains the result of a single match;
CaptureCollection: a sequence of Capture objects;
Group: the result of a group capture; derives from Capture;
Match: the result of matching an expression; derives from Group;
MatchCollection: a sequence of Match objects;
MatchEvaluator: the delegate used in replace operations;
Regex: an instance of a compiled expression.
The Regex class also contains some static methods:
Escape: escapes the regex metacharacters in a string;
IsMatch: returns a Boolean value indicating whether the expression matches a string;
Match: returns a Match instance;
Matches: returns a collection of Match objects;
Replace: replaces the text matched by the expression with a replacement string;
Split: returns an array of strings, split at the positions determined by the expression;
Unescape: unescapes the escaped characters in a string.
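The static methods listed above can be tried out with a short program. The following is only a sketch: the class name StaticMethodsDemo and the helpers IsAllDigits and SplitList are made up for illustration, and the sample inputs are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticMethodsDemo
{
    // Hypothetical helper: true when the input consists only of digits.
    public static bool IsAllDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d+$");
    }

    // Hypothetical helper: splits a comma-separated list,
    // dropping any whitespace around the commas.
    public static string[] SplitList(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    static void Main()
    {
        Console.WriteLine(Regex.Escape("1+1=2"));                     // 1\+1=2
        Console.WriteLine(Regex.Unescape(@"1\+1=2"));                 // 1+1=2
        Console.WriteLine(IsAllDigits("12345"));                      // True
        Console.WriteLine(Regex.Match("abc123", @"\d+").Value);       // 123
        Console.WriteLine(Regex.Matches("a1b22c333", @"\d+").Count);  // 3
        Console.WriteLine(Regex.Replace("a1b22", @"\d+", "#"));       // a#b#
        Console.WriteLine(string.Join("|", SplitList("x , y,z")));    // x|y|z
    }
}
```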
Simple Matching
First, we start with simple expressions using the Regex and Match classes.
Match m = Regex.Match("abracadabra", "(a|b|r)+");
We now have an instance of the Match class that can be tested, for example: if (m.Success) ...
If you want to use the matched string, you can convert it to a string:
Console.WriteLine("Match=" + m.ToString());
In this example, the output is: Match=abra. This is the matched string.
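Put together as a complete program, the match above might look like the following sketch. The class name SimpleMatchDemo and the helper FirstRun are illustrative; Value, Index and Length are properties that Match inherits from Capture.

```csharp
using System;
using System.Text.RegularExpressions;

static class SimpleMatchDemo
{
    // Hypothetical helper: returns the first run of a, b or r characters,
    // or null when the input contains none.
    public static string FirstRun(string input)
    {
        Match m = Regex.Match(input, "(a|b|r)+");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        Match m = Regex.Match("abracadabra", "(a|b|r)+");
        if (m.Success)
        {
            // Prints: Match=abra Index=0 Length=4
            Console.WriteLine("Match=" + m.Value + " Index=" + m.Index + " Length=" + m.Length);
        }

        // A failed match still returns a Match object; only Success is false.
        Console.WriteLine(Regex.Match("xyz", "(a|b|r)+").Success);  // False
    }
}
```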
String replacement
Simple string replacement is very intuitive. For example, the following statement:
string s = Regex.Replace("abracadabra", "abra", "zzzz");
It returns the string zzzzcadzzzz, in which every matched substring has been replaced with zzzz.
Now let's look at a more complicated string replacement example:
string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");
This statement returns the string abra, with the leading and trailing spaces removed.
The preceding pattern is useful for removing leading and trailing whitespace from any string. In C#, we often use verbatim string literals. In a verbatim string, the compiler does not treat the character "\" as an escape character, so @"..." is very useful when "\" is needed to write regex escapes. It is also worth noting the use of $1 in the replacement string: it is a substitution that refers to the text captured by group 1, and substitutions are the only special constructs recognized in a replacement string.
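The trimming pattern, the verbatim string syntax and the $1 substitution can be checked with a small sketch such as the one below. The class name ReplaceDemo and the helper Trim are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplaceDemo
{
    // Hypothetical helper: strips leading and trailing whitespace
    // by keeping only the text captured by group 1.
    public static string Trim(string input)
    {
        return Regex.Replace(input, @"^\s*(.*?)\s*$", "$1");
    }

    static void Main()
    {
        Console.WriteLine("[" + Trim("  abra  ") + "]");  // [abra]

        // The same pattern written as a regular string needs doubled backslashes.
        string regular = "^\\s*(.*?)\\s*$";
        Console.WriteLine(regular == @"^\s*(.*?)\s*$");   // True

        // Literal text may surround a substitution in the replacement string.
        Console.WriteLine(Regex.Replace("abracadabra", "(cad)", "<$1>"));  // abra<cad>abra
    }
}
```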
Matching engine details
Now let's use group constructs to work through a slightly more complex example:
string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
    (        # start of the first group
      abra   # match the string abra
      (      # start of the second group
        cad  # match the string cad
      )?     # end of the second group (optional)
    )        # end of the first group
    +        # match one or more times
";
// Ignore whitespace and comments in the pattern (the x modifier)
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// Get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// Perform the first match
Match m = r.Match(text);
while (m.Success)
{
    // Start with group 1
    for (int i = 1; i < gnums.Length; i++)
    {
        Group g = m.Groups[gnums[i]];
        // Print the text matched by the group
        Console.WriteLine("Group" + gnums[i] + "=[" + g.ToString() + "]");
        // Print the start position and length of each capture in the group
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("    Capture" + j + "=[" + c.ToString()
                + "] Index=" + c.Index + " Length=" + c.Length);
        }
    }
    // Move to the next match
    m = m.NextMatch();
}
The output of this example is as follows:
Group1=[abra]
    Capture0=[abracad] Index=0 Length=7
    Capture1=[abra] Index=7 Length=4
Group2=[cad]
    Capture0=[cad] Index=4 Length=3
Group1=[abra]
    Capture0=[abracad] Index=12 Length=7
    Capture1=[abra] Index=19 Length=4
Group2=[cad]
    Capture0=[cad] Index=16 Length=3
Group1=[abra]
    Capture0=[abracad] Index=24 Length=7
    Capture1=[abra] Index=31 Length=4
Group2=[cad]
    Capture0=[cad] Index=28 Length=3
Let's start by examining the string pat, which contains the expression. The first capture group starts at the first parenthesis, and then the expression matches abra. The second capture group starts at the second parenthesis, before the first capture group has ended. This means that the first group can match abracad while the second group matches only cad. Because ? makes cad optional, the first group can match either abra or abracad. The first group then ends, and the + symbol requires the group to be matched one or more times.
Now let's look at what happens during matching. First, the Regex constructor is called to create an instance of the expression with the desired options. In this example the expression contains comments and extra whitespace, so the x option (RegexOptions.IgnorePatternWhitespace in the released API) is selected. When this option is enabled, comments and unescaped whitespace in the expression are ignored.
Next, the list of group numbers defined in the expression is retrieved. You could of course use these numbers explicitly; here they are obtained programmatically. If named groups are used, this approach is also an effective way to build a quick index.
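As a rough illustration of how group numbers and named groups relate, the sketch below lists each group number with its name. The group names word and suffix, the class name GroupNamesDemo and the helper NumberOf are invented for this example; GetGroupNumbers, GroupNameFromNumber and GroupNumberFromName are members of the Regex class.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupNamesDemo
{
    // Pattern with two named groups (names chosen only for this example).
    static readonly Regex R = new Regex(@"(?<word>abra)(?<suffix>cad)?");

    // Hypothetical helper: looks up the number assigned to a named group.
    public static int NumberOf(string name)
    {
        return R.GroupNumberFromName(name);
    }

    static void Main()
    {
        // Prints "0 -> 0", "1 -> word", "2 -> suffix".
        foreach (int n in R.GetGroupNumbers())
            Console.WriteLine(n + " -> " + R.GroupNameFromNumber(n));

        Match m = R.Match("abracadabra");
        Console.WriteLine(m.Groups["word"].Value);    // abra
        Console.WriteLine(m.Groups["suffix"].Value);  // cad
    }
}
```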
The next step performs the first match. A loop tests whether the current match succeeded and then iterates over the group list, starting from group 1. Group 0 is not used in this example because group 0 is the entire matched string. Use group 0 when you want the whole match as a single string.
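The role of group 0 can be seen in a minimal sketch like this one, which uses the same expression without the comments. The class name GroupZeroDemo and the helper WholeMatch are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupZeroDemo
{
    // Hypothetical helper: returns the text of group 0 for the first match.
    public static string WholeMatch(string input)
    {
        return Regex.Match(input, "(abra(cad)?)+").Groups[0].Value;
    }

    static void Main()
    {
        Match m = Regex.Match("abracadabra1", "(abra(cad)?)+");
        Console.WriteLine(m.Groups[0].Value);             // abracadabra
        Console.WriteLine(m.Groups[0].Value == m.Value);  // True
        Console.WriteLine(m.Groups[1].Value);             // abra
        Console.WriteLine(m.Groups[2].Value);             // cad
    }
}
```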
We then walk the CaptureCollection of each group. Normally each group has only one capture, but Group1 in this example has two: Capture0 and Capture1. If you ask only for the ToString of Group1, you get just abra, even though the group also matched abracad. The ToString value of a group is the value of the last capture in its CaptureCollection, which is exactly what we need here. If you want matching to stop after the first abra, remove the + symbol from the expression so that the regex engine matches the group only once.
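The effect of the + symbol on the capture list can be compared side by side. This is a sketch with invented names (CaptureCountDemo, CountGroup1Captures); the two patterns differ only in the trailing +.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureCountDemo
{
    // Hypothetical helper: number of captures recorded for group 1
    // in the first match of the given pattern.
    public static int CountGroup1Captures(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Captures.Count;
    }

    static void Main()
    {
        string text = "abracadabra1";

        // With +, group 1 is matched twice; its value is the last capture.
        Match many = Regex.Match(text, "(abra(cad)?)+");
        Console.WriteLine(many.Groups[1].Captures.Count);  // 2
        Console.WriteLine(many.Groups[1].Value);           // abra

        // Without +, group 1 is matched once and the match stops earlier.
        Match once = Regex.Match(text, "(abra(cad)?)");
        Console.WriteLine(once.Groups[1].Captures.Count);  // 1
        Console.WriteLine(once.Value);                     // abracad
    }
}
```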
Comparison between procedural and expression-based approaches
Generally, users of regular expressions fall into two groups: the first group uses regular expressions as little as possible and relies on procedural code for operations that need to be repeated; the second group makes full use of the power of the regular expression engine and uses as little procedural code as possible.
For most of us, the best solution is a mix of both. I hope this article illustrates the role of the regexp classes in .NET and the trade-offs between performance and complexity.
Procedural approach
In programming we often need a function that matches parts of a string and processes them. Below is an example that matches the words in a string:
string text = "the quick red fox jumped over the lazy brown dog.";
System.Console.WriteLine("text=[" + text + "]");
string result = "";
string pattern = @"\w+|\W+";
foreach (Match m in Regex.Matches(text, pattern))
{
    // Get the matched string
    string x = m.ToString();
    // If the first character is lowercase
    if (char.IsLower(x[0]))
        // Convert it to uppercase
        x = char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
    // Append the piece to the result
    result += x;
}
System.Console.WriteLine("result=[" + result + "]");
As the preceding example shows, we use the C# foreach statement to visit each match and process it. In this example, a new result string is built up piece by piece. The output of this example is as follows:
text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]
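When the input is long, one possible variation on the loop above is to collect the pieces in a StringBuilder instead of concatenating strings. The sketch below assumes that variation; the class name ProceduralDemo and the method Capitalize are illustrative, not part of the original example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class ProceduralDemo
{
    // Hypothetical helper: capitalizes the first letter of every word,
    // copying the non-word runs (spaces, punctuation) through unchanged.
    public static string Capitalize(string text)
    {
        StringBuilder result = new StringBuilder(text.Length);
        foreach (Match m in Regex.Matches(text, @"\w+|\W+"))
        {
            string x = m.Value;
            if (char.IsLower(x[0]))
                x = char.ToUpper(x[0]) + x.Substring(1);
            result.Append(x);
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Capitalize("the lazy brown dog."));  // The Lazy Brown Dog.
    }
}
```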
Expression-based approach
Another way to accomplish the same thing is with a MatchEvaluator. The new code is as follows:
static string CapText(Match m)
{
    // Get the matched string
    string x = m.ToString();
    // If the first character is lowercase
    if (char.IsLower(x[0]))
        // Convert it to uppercase
        return char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
    return x;
}
static void Main()
{
    string text = "the quick red fox jumped over the lazy brown dog.";
    System.Console.WriteLine("text=[" + text + "]");
    string pattern = @"\w+";
    // CapText is a static method of the enclosing class Test
    string result = Regex.Replace(text, pattern,
        new MatchEvaluator(Test.CapText));
    System.Console.WriteLine("result=[" + result + "]");
}
Note also that the pattern is much simpler here, because only the words need to be modified and the non-words can be left alone.
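In later versions of C#, the evaluator can also be written inline as a lambda expression, which the compiler converts to a MatchEvaluator. The following sketch assumes such a compiler; the class name EvaluatorDemo and the method Capitalize are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class EvaluatorDemo
{
    // Hypothetical helper: capitalizes each word using an inline evaluator.
    // Non-word text is never matched, so Replace copies it through as is.
    public static string Capitalize(string text)
    {
        return Regex.Replace(text, @"\w+", m =>
            char.IsLower(m.Value[0])
                ? char.ToUpper(m.Value[0]) + m.Value.Substring(1)
                : m.Value);
    }

    static void Main()
    {
        Console.WriteLine(Capitalize("the quick red fox"));  // The Quick Red Fox
    }
}
```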