Using regular expressions for string search in C#

Source: Internet
Author: User
Tags: expression engine
1. Introduction to Regular Expressions

Regular expressions provide a powerful, flexible, and efficient way to process text. The extensive pattern-matching notation of regular expressions lets you quickly parse large amounts of text to find specific character patterns; to extract, edit, replace, or delete substrings; or to add the extracted strings to a collection in order to generate a report. For many applications that deal with strings (such as HTML processing, log file analysis, and HTTP header parsing), regular expressions are an indispensable tool.

.NET Framework regular expressions incorporate the most popular features of other regular expression implementations and are designed to be compatible with Perl 5 regular expressions. They also include some features that other implementations do not provide. The .NET Framework regular expression classes are part of the base class library and can be used with any language or tool that targets the common language runtime.

2. String search

The regular expression language consists of two basic character types: literal (normal) text characters and metacharacters. It is the metacharacters that give regular expressions their processing power. Nearly every text editor has some search facility: you open a dialog box and type the string to be located into a text box, and if you want to replace the string at the same time, you also type a replacement string. Notepad in Windows and the document editors of the Office suite work this way. This is the simplest kind of search, and it is easily handled with the String.Replace() method of the String class. But what if you need to identify repeated words in a document? Writing a routine that picks repeated words out of a string with the String class alone is complicated. This is where a regular expression language fits.
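The repeated-word problem mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name RepeatedWordDemo, the helper FindRepeatedWords, and the sample sentence are hypothetical, and the pattern relies on the backreference \1, a regular expression feature that this article does not otherwise cover.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RepeatedWordDemo
{
    // \b(\w+)   a whole word, remembered as group 1
    // \s+       the white space between the two words
    // \1\b      the same word again, ending at a word boundary
    private const string Pattern = @"\b(\w+)\s+\1\b";

    // Returns each word that is immediately repeated in the input.
    public static List<string> FindRepeatedWords(string input)
    {
        var repeated = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.IgnoreCase))
        {
            repeated.Add(m.Groups[1].Value);
        }
        return repeated;
    }

    public static void Main()
    {
        string input = "This is is a test of the the regex engine";
        Console.WriteLine(string.Join(", ", FindRepeatedWords(input))); // prints: is, the
    }
}
```

The leading \b matters here: without it, the "is" at the end of "This" would pair with the following "is".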

A regular expression language is a language designed for writing search expressions. In it you can combine the text to be searched for with escape sequences and other characters that have a special meaning. For example, the sequence \b marks the beginning or end of a word (a word boundary). To find words that start with the characters th, write the regular expression \bth (that is, the sequence word boundary, t, h). To find all words that end with th, write th\b (the sequence t, h, word boundary). Regular expressions can be far more sophisticated than this, however. For example, there is a facility for storing portions of the text found by a search operation.
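As a minimal sketch of the word-boundary idea, the following program applies \bth and th\b to a sample sentence. The sentence, the class name WordBoundaryDemo, and the helper FindWords are illustrative assumptions; \w* is added to each pattern so that the whole word is returned rather than only the two letters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Returns every substring of input that matches pattern.
    public static List<string> FindWords(string input, string pattern)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string input = "the fourth path leads north of this bath";

        // \bth\w* : a word boundary, then "th", then the rest of the word.
        Console.WriteLine(string.Join(", ", FindWords(input, @"\bth\w*"))); // the, this

        // \w*th\b : any word characters, then "th", then a word boundary.
        Console.WriteLine(string.Join(", ", FindWords(input, @"\w*th\b"))); // fourth, path, north, bath
    }
}
```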

3. Regular expression classes of the .NET Framework

The following sections introduce the regular expression classes of the .NET Framework and show how they are used.

3.1 The Regex class represents a read-only regular expression

The Regex class also contains various static methods that allow the other regular expression classes to be used without explicitly instantiating objects of those classes. The code example below creates an instance of the Regex class and defines a simple regular expression when the object is initialized. Note the additional backslash used as an escape character: it tells the C# compiler to treat the backslash of the \s character class as a literal character.

Regex r; // declare a variable of the Regex class
r = new Regex("\\s2000"); // define the expression: a white-space character followed by 2000
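The two ways of writing the backslash, and the difference between the instance and static methods, can be compared side by side. This is a small sketch; the class name RegexEscapeDemo, the helper ContainsYear, and the sample inputs are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexEscapeDemo
{
    // Regular string literal: the backslash is doubled for the C# compiler.
    public const string Escaped = "\\s2000";
    // Verbatim string literal: the backslash is passed through unchanged.
    public const string Verbatim = @"\s2000";

    // Uses the static IsMatch method, so no Regex object is created explicitly.
    public static bool ContainsYear(string input)
    {
        return Regex.IsMatch(input, Verbatim);
    }

    public static void Main()
    {
        Console.WriteLine(Escaped == Verbatim);          // True: both hold the text \s2000
        Regex r = new Regex(Escaped);
        Console.WriteLine(r.IsMatch("Windows 2000"));    // True: instance method
        Console.WriteLine(ContainsYear("Windows2000"));  // False: no white space before 2000
    }
}
```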

3.2 The Match class represents the result of a regular expression matching operation

The following example uses the Match method of the Regex class to return a Match object for the first match found in the input string. It then uses the Match.Success property of the Match class to check whether a match was found.

Regex r = new Regex("abc"); // define a Regex object instance
Match m = r.Match("123abc456"); // find the first match in the input string
if (m.Success)
{
    Console.WriteLine("Found match at position " + m.Index); // print the position of the match
}

3.3 The MatchCollection class represents a sequence of non-overlapping matches

This collection is read-only and has no public constructor. Instances of MatchCollection are returned by the Regex.Matches method, which fills the collection with all the matches found in the input string. The following code example shows how to copy the collection into a string array (to hold each match) and an integer array (to hold the position of each match).

MatchCollection mc;
string[] results = new string[20];
int[] matchPosition = new int[20];

Regex r = new Regex("abc"); // define a Regex object instance
mc = r.Matches("123abc4abcd");
for (int i = 0; i < mc.Count; i++) // loop through all matches found in the input string
{
    results[i] = mc[i].Value; // add the matched string to the string array
    matchPosition[i] = mc[i].Index; // record the position of the match
}
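The fixed array size of 20 in the listing above limits how many matches can be stored. One possible variation, sketched here with a hypothetical class MatchCollectionDemo and helper Collect, gathers the matches into a list that grows as needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchCollectionDemo
{
    // Collects a (value, index) pair for every match of pattern in input.
    public static List<KeyValuePair<string, int>> Collect(string input, string pattern)
    {
        var found = new List<KeyValuePair<string, int>>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(new KeyValuePair<string, int>(m.Value, m.Index));
        }
        return found;
    }

    public static void Main()
    {
        // Prints "abc at 3" and then "abc at 7".
        foreach (KeyValuePair<string, int> pair in Collect("123abc4abcd", "abc"))
        {
            Console.WriteLine("{0} at {1}", pair.Key, pair.Value);
        }
    }
}
```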

3.4 The GroupCollection class represents a collection of captured groups

This collection is read-only and has no public constructor. Instances of GroupCollection are returned by the Match.Groups property. The following console application finds and prints the number of groups captured by a regular expression.

using System;
using System.Text.RegularExpressions;

public class RegexTest
{
    public static void RunTest()
    {
        Regex r = new Regex("(a(b))c"); // define two nested groups
        Match m = r.Match("abdabc");
        Console.WriteLine("Number of groups found = " + m.Groups.Count);
    }

    public static void Main()
    {
        RunTest();
    }
}

This example generates the following output:

Number of groups found = 3
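The count of 3 is easier to interpret when the content of each group is printed as well. The sketch below uses the same pattern and input; the class name GroupCollectionDemo and the helper GroupValues are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupCollectionDemo
{
    // Returns the text captured by each group of the first match.
    // Group 0 is always the whole match.
    public static string[] GroupValues(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        var values = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            values[i] = m.Groups[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string[] values = GroupValues("abdabc", "(a(b))c");
        for (int i = 0; i < values.Length; i++)
        {
            // Prints: group 0 = abc, group 1 = ab, group 2 = b
            Console.WriteLine("group {0} = {1}", i, values[i]);
        }
    }
}
```

Groups are numbered by the position of their opening parenthesis, which is why the outer (a(b)) is group 1 and the inner (b) is group 2.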

3.5 The CaptureCollection class represents a sequence of captured substrings

Because of quantifiers, a capturing group can capture more than one string in a single match. The Captures property (an object of the CaptureCollection class) is provided as a member of both the Match and Group classes to give access to the set of captured substrings. For example, if you use the regular expression ((a(b))c)+ (where the + quantifier specifies one or more matches) to capture matches from the string "abcabcabc", the CaptureCollection of each matching group of substrings contains three members.

The following program uses the regular expression (abc)+ to find one or more matches in the string "xyzabcabcabcxyzabcab". It illustrates how to use the Captures property to return multiple groups of captured substrings.

using System;
using System.Text.RegularExpressions;

public class RegexTest
{
    public static void RunTest()
    {
        int counter;
        Match m;
        CaptureCollection cc;
        GroupCollection gc;

        Regex r = new Regex("(abc)+"); // look for one or more repetitions of "abc"
        m = r.Match("xyzabcabcabcxyzabcab"); // the string to search
        gc = m.Groups;

        // Print the number of groups.
        Console.WriteLine("Captured groups = " + gc.Count.ToString());

        // Loop through each group.
        for (int i = 0; i < gc.Count; i++)
        {
            cc = gc[i].Captures;
            counter = cc.Count;
            Console.WriteLine("Captures count = " + counter.ToString());
            for (int ii = 0; ii < counter; ii++)
            {
                // Print each capture and its position.
                Console.WriteLine(cc[ii] + " starts at character " + cc[ii].Index);
            }
        }
    }

    public static void Main()
    {
        RunTest();
    }
}

This example produces the following output:

Captured groups = 2
Captures count = 1
abcabcabc starts at character 3
Captures count = 3
abc starts at character 3
abc starts at character 6
abc starts at character 9
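The earlier statement about ((a(b))c)+ and "abcabcabc" can be checked in the same way. This is a sketch only; CaptureCountDemo and CaptureCounts are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureCountDemo
{
    // Returns the number of captures recorded for each group of the first match.
    public static int[] CaptureCounts(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        var counts = new int[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            counts[i] = m.Groups[i].Captures.Count;
        }
        return counts;
    }

    public static void Main()
    {
        // Group 0 is the whole match; groups 1 to 3 are ((a(b))c), (a(b)) and (b).
        int[] counts = CaptureCounts("abcabcabc", "((a(b))c)+");
        Console.WriteLine(string.Join(", ", counts)); // prints: 1, 3, 3, 3
    }
}
```

The whole match is captured once, while each of the three parenthesized groups is captured once per repetition.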

3.6 The Capture class contains the result of a single subexpression capture

The following code loops through the group collection, extracts the capture collection from each group, and assigns to the variables posn and length the character position of each captured string in the original string and the length of that string, respectively.

Regex r;
Match m;
CaptureCollection cc;
int posn, length;

r = new Regex("(abc)*");
m = r.Match("bcabcabc");
// Iterate by group count: with (abc)* the first match is the empty string at
// position 0, so a test such as m.Groups[i].Value != "" would skip the loop body.
for (int i = 0; i < m.Groups.Count; i++)
{
    cc = m.Groups[i].Captures;
    for (int j = 0; j < cc.Count; j++)
    {
        posn = cc[j].Index; // position of the capture
        length = cc[j].Length; // length of the capture
    }
}


Figure 1: object relationship

Returning a Group object every time characters are grouped may not be what you want, because building the groups adds noticeable overhead when the grouping is used only as part of the search pattern. You can turn capturing off for an individual group by starting it with the character sequence ?:, as is done for the port in the URI example in section 4.4. To turn it off for all unnamed groups, specify the RegexOptions.ExplicitCapture flag in the call to Regex.Matches().
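The effect of ?: and of RegexOptions.ExplicitCapture on the number of groups can be observed directly. The following sketch assumes a hypothetical class NonCapturingDemo, a helper GroupCount, and the sample input "bananas".

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonCapturingDemo
{
    // Returns the number of groups (including group 0, the whole match)
    // reported for the first match of pattern in the word "bananas".
    public static int GroupCount(string pattern, RegexOptions options)
    {
        return Regex.Match("bananas", pattern, options).Groups.Count;
    }

    public static void Main()
    {
        Console.WriteLine(GroupCount("(an)+", RegexOptions.None));            // 2: whole match + (an)
        Console.WriteLine(GroupCount("(?:an)+", RegexOptions.None));          // 1: group is not stored
        Console.WriteLine(GroupCount("(an)+", RegexOptions.ExplicitCapture)); // 1: unnamed groups are not stored
    }
}
```

In all three cases the text that is matched is the same ("anan"); only the bookkeeping of groups differs.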

4. Using regular expressions for string search

4.1 Using the .NET regular expression engine in C#

The following example develops, runs, and displays the results of a few searches in order to illustrate some features of regular expressions and to show how to use the .NET regular expression engine in C#. Note the @ symbol placed in front of the string, which makes it a verbatim string literal.

string text = @"I can not find my position in Beijing";

This text is called the input string. To get started with the .NET regular expression classes, we first perform a plain-text search that contains no escape sequences or regular expression commands. Suppose you want to find all occurrences of the string ion; this search string is called the pattern. Using regular expressions and the variable text declared above, the code looks like this:

string pattern = "ion";
// RegexOptions.None requests the default matching behavior.
MatchCollection matches = Regex.Matches(text, pattern, RegexOptions.None);
foreach (Match nextMatch in matches)
{
    Console.WriteLine(nextMatch.Index);
}

This code uses the static method Matches() of the Regex class in the System.Text.RegularExpressions namespace. Its parameters are the input text, a pattern, and an optional set of flags taken from the RegexOptions enumeration. Matches() returns a MatchCollection in which each match is represented by a Match object. The code above simply iterates over the collection and uses the Index property of the Match class, which returns the index in the input text at which the match was found. Running this code finds one match.

The power of regular expressions comes mainly from the pattern string, because the pattern string does not have to contain only plain text. As mentioned above, it can also contain metacharacters, which are special characters that give commands, and escape sequences, which work much like escape sequences in C#: they start with a backslash (\) and have a special meaning. For example, to find words beginning with n, you can use the escape sequence \b, which represents a word boundary (a position where a letter or digit is adjacent to white space, punctuation, or the start or end of the text), and write the following code:

string pattern = @"\bn";
MatchCollection matches = Regex.Matches(text, pattern, RegexOptions.IgnoreCase |
    RegexOptions.ExplicitCapture);

Note the @ in front of the string: \b must reach the .NET regular expression engine at run time, so the backslash must not be interpreted by the C# compiler as the start of an escape sequence. To find words ending with the sequence ion, use the following pattern:

string pattern = @"ion\b";

To find all words that start with n and end with the sequence ion, you need a pattern that begins with \bn and ends with ion\b. But what goes in the middle? You need to tell the engine that between the n and the ion there can be any number of characters, as long as none of them is white space. The correct pattern is:

string pattern = @"\bn\S*ion\b";
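To see what each of these patterns finds, they can be run against the input string in one small program. This is a sketch under stated assumptions: the class PatternSearchDemo, the helper Indexes, and the second sample text "a nation of notions" are not part of the original example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternSearchDemo
{
    // Returns the start index of every match of pattern in text.
    public static List<int> Indexes(string text, string pattern)
    {
        var indexes = new List<int>();
        MatchCollection matches = Regex.Matches(text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
        foreach (Match m in matches)
        {
            indexes.Add(m.Index);
        }
        return indexes;
    }

    public static void Main()
    {
        string text = @"I can not find my position in Beijing";
        foreach (string pattern in new[] { "ion", @"\bn", @"ion\b", @"\bn\S*ion\b" })
        {
            Console.WriteLine("{0} -> [{1}]", pattern, string.Join(", ", Indexes(text, pattern)));
        }

        // The article's text has no word that starts with n and ends with ion,
        // so a second, made-up text is used to show a hit for the last pattern.
        Console.WriteLine(string.Join(", ", Indexes("a nation of notions", @"\bn\S*ion\b")));
    }
}
```

On the article's text, ion and ion\b are both found at index 23 (inside "position"), \bn is found at index 6 ("not"), and the last pattern finds nothing. On the second text only "nation" (index 2) matches, because "notions" does not end with ion.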

4.2 Special characters and escape sequences

Most of the important regular expression language operators are unescaped single characters. The escape character \ (a single backslash) tells the regular expression parser that the character following the backslash is not an operator. For example, the parser treats an asterisk (*) as a repetition quantifier, and a backslash followed by an asterisk (\*) as the Unicode character 002A.

One thing to get used to when working with regular expressions is seeing odd-looking character sequences like the one above, but the sequence works quite logically. The escape sequence \S represents any character that is not white space. The * is called a quantifier; it means that the preceding character can be repeated any number of times, including zero. The sequence \S* therefore means any number of non-white-space characters, so the pattern above matches any single word that starts with n and ends with ion. Table 1 lists the main special characters and escape sequences.

Table 1: Special characters and escape sequences

| Symbol | Description | Example | Matches |
|---|---|---|---|
| ^ | Beginning of the input text | ^B | B, but only if it is the first character in the text |
| $ | End of the input text | X$ | X, but only if it is the last character in the text |
| . | Any single character except the newline character (\n) | i.ation | isation, ization |
| * | Preceding character may be repeated zero or more times | ra*t | rt, rat, raat, raaat, etc. |
| + | Preceding character may be repeated one or more times | ra+t | rat, raat, raaat, etc. (but not rt) |
| ? | Preceding character may be repeated zero times or once | ra?t | rt and rat only |
| \s | Any white-space character | \sa | [space]a, \ta, \na (\t and \n have the same meaning as in C#) |
| \S | Any character that is not white space | \SF | aF, rF, cF, but not \tF |
| \b | Word boundary | ion\b | Any word ending in ion |
| \B | Any position that is not a word boundary | \BX\B | Any X in the middle of a word |

To search for a character that is itself a metacharacter, escape it with a backslash. For example, "." means any single character other than the newline character, whereas "\." means a literal period.
You can request a match on one of several alternative characters by enclosing them in square brackets. For example, "[1c]" means that the character can be either 1 or c. To search for either map or man, use the sequence "ma[np]" (the quotation marks are not part of the pattern, here and below). Inside square brackets the alternatives are simply listed one after another; a | placed there would be taken literally. You can also specify a range within square brackets: "[a-z]" denotes any lowercase letter (the hyphen specifies a range of consecutive characters), "[B-F]" denotes any uppercase letter from B to F, and "[0-9]" denotes a digit. To search for an integer (a sequence containing only the characters 0 to 9), write "[0-9]+". The + indicates that at least one digit is required but that there may be more, so 9, 83, and 3443 all match.
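A short sketch of character classes and of the escaped period follows. The class CharacterClassDemo, the helper Find, and the sample inputs are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Returns the text of every match of pattern in input.
    public static List<string> Find(string input, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // [np] accepts either n or p after "ma"; "mat" is not matched.
        Console.WriteLine(string.Join(", ", Find("a man read the map on a mat", "ma[np]"))); // man, map

        // [0-9]+ matches runs of one or more digits.
        Console.WriteLine(string.Join(", ", Find("9, 83 and 3443", "[0-9]+"))); // 9, 83, 3443

        // An unescaped period matches any character; \. matches only a period.
        Console.WriteLine(Find("a.b", ".").Count);   // 3
        Console.WriteLine(Find("a.b", @"\.").Count); // 1
    }
}
```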
To see the results of some regular expressions, let's write a sample named RegularExpressionsWorkflow that sets up a few regular expressions and displays their results, so that you can see how they work.

The core of the sample is the WriteMatches() method, which displays the details of every match in a MatchCollection. For each match it displays the index at which the match was found in the input string, the matched string, and a slightly longer string consisting of the match plus up to ten surrounding characters from the input text: at most five characters before the match and at most five after it. (Fewer than five are shown on a side when the match lies within five characters of the beginning or end of the input text.) For example, a match on the word messaging near the end of an input text would be displayed as "and messaging of d", with five characters on each side, whereas a match on the final word data would be displayed with "g of " before it and only one character after it, because that is where the string ends. The longer string makes it easy to see where in the text the regular expression matched:

static void WriteMatches(string text, MatchCollection matches)
{
    Console.WriteLine("Original text was: \n" + text + "\n");
    Console.WriteLine("No. of matches: " + matches.Count);
    foreach (Match nextMatch in matches)
    {
        int index = nextMatch.Index;
        string result = nextMatch.ToString();
        // Show at most five characters of context on each side of the match.
        int charsBefore = (index < 5) ? index : 5;
        int fromEnd = text.Length - index - result.Length;
        int charsAfter = (fromEnd < 5) ? fromEnd : 5;
        int charsToDisplay = charsBefore + charsAfter + result.Length;
        Console.WriteLine("Index: {0}, \tString: {1}, \t{2}", index, result,
            text.Substring(index - charsBefore, charsToDisplay));
    }
}

Most of the work in this method goes into determining how many characters of the longer string can be displayed without running past the beginning or end of the input text. Note that the matched string itself is also read from the Match object (its Value property holds the same text that ToString() returns). In addition, the sample contains methods named Find_po and Find_n, which perform some of the searches described in this article.

4.3 Regular expression options

You can modify a regular expression pattern with options that affect matching behavior. Regular expression options can be set in one of two basic ways. They can be specified in the options parameter of the Regex(pattern, options) constructor, where options is a bitwise OR combination of RegexOptions enumeration values. Alternatively, they can be set inside the regular expression pattern by using the inline (?imnsx-imnsx:) grouping construct or the (?imnsx-imnsx) miscellaneous construct.

In the inline option constructs, a minus sign (-) before an option or a set of options turns those options off. For example, the inline construct (?ix-ms) turns on the IgnoreCase and IgnorePatternWhitespace options and turns off the Multiline and Singleline options.
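The two ways of setting options can be compared on a trivial pattern. This is a minimal sketch; the class InlineOptionsDemo, the method Run, and the inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionsDemo
{
    // Evaluates the same simple pattern with options set in different ways.
    public static bool[] Run()
    {
        return new[]
        {
            Regex.IsMatch("ABC", "abc"),                          // no options: case-sensitive
            Regex.IsMatch("ABC", "abc", RegexOptions.IgnoreCase), // option passed as a parameter
            Regex.IsMatch("ABC", "(?i)abc"),                      // inline option for the rest of the pattern
            Regex.IsMatch("Abc", "(?i:a)bc"),                     // inline option for the group only
            Regex.IsMatch("ABC", "(?i:a)bc"),                     // "bc" is still case-sensitive here
            Regex.IsMatch("ABC", "(?ix-ms) a b c"),               // i and x on: white space in the pattern is ignored
        };
    }

    public static void Main()
    {
        // Prints False, True, True, True, False, True (one per line).
        foreach (bool result in Run())
        {
            Console.WriteLine(result);
        }
    }
}
```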

Table 2: RegexOptions enumeration members and equivalent inline option characters

| RegexOptions member | Inline character | Description |
|---|---|---|
| None | N/A | Specifies that no options are set. |
| IgnoreCase | i | Specifies case-insensitive matching. |
| Multiline | m | Specifies multiline mode. Changes the meaning of ^ and $ so that they match at the beginning and end, respectively, of any line, not just the beginning and end of the whole string. |
| ExplicitCapture | n | Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...). This allows plain parentheses to act as non-capturing groups without the syntactic clumsiness of (?:...). |
| Compiled | N/A | Specifies that the regular expression is compiled to an assembly. Generates Microsoft intermediate language (MSIL) code for the regular expression, which gives faster execution at the cost of a longer startup time. |
| Singleline | s | Specifies single-line mode. Changes the meaning of the period (.) so that it matches every character (instead of every character except \n). |
| IgnorePatternWhitespace | x | Specifies that unescaped white space is excluded from the pattern, and enables comments following a number sign (#). Note that white space is never removed from within a character class. |
| RightToLeft | N/A | Specifies that the search moves from right to left instead of from left to right. A regular expression with this option moves to the left of the starting position rather than to the right. (The starting position should therefore be specified as the end of the string rather than the beginning.) To avoid the possibility of constructing a regular expression with an infinite loop, this option cannot be specified in midstream. However, the (?<) lookbehind constructs provide a similar alternative that can be used as a subexpression. |
| ECMAScript | N/A | Specifies that ECMAScript-compliant behavior is enabled for the expression. This option can be used only together with the IgnoreCase, Multiline, and Compiled flags. Using ECMAScript with any other flag causes an exception. |

For example, Find_po searches for words that begin with po and end with ion:

static void Find_po()
{
    string text = @"I can not find my position in Beijing";
    string pattern = @"\bpo\S*ion\b";
    MatchCollection matches = Regex.Matches(text, pattern, RegexOptions.IgnoreCase
        | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
    WriteMatches(text, matches);
}

This code also requires the following using directives:

using System;
using System.Text.RegularExpressions;
4.4 Matches, groups, and captures

A nice feature of regular expressions is that characters can be grouped, much as statements are grouped into a compound statement in C#. In C# you can group any number of statements by putting them in braces, and the result is treated as a single compound statement. In a regular expression pattern you can group any characters (including metacharacters and escape sequences) and treat the result as if it were a single character. The only difference is that parentheses rather than braces are used to form a group.

For example, the pattern (an)+ locates any repeated occurrence of the sequence an. The + quantifier normally applies only to the one character before it, but because the characters have been grouped, it now applies to repetitions of an treated as a unit. Applied to the input text "bananas came to Europe late in the annals of history", (an)+ picks out anan from bananas. By contrast, the pattern an+ picks out ann from annals and two separate occurrences of an from bananas. Why does (an)+ pick out anan rather than a single an? Matches are not allowed to overlap, and where more than one match is possible the quantifier is greedy, so the longest one is chosen by default.
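The difference between (an)+ and an+ can be reproduced with the sentence from the text. The class GroupQuantifierDemo and the helper Find are hypothetical names used only for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupQuantifierDemo
{
    // Returns the text of every match of pattern in input.
    public static List<string> Find(string input, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string text = "bananas came to Europe late in the annals of history";

        // + applies to the whole group "an".
        Console.WriteLine(string.Join(", ", Find(text, "(an)+"))); // anan, an

        // + applies only to the letter n.
        Console.WriteLine(string.Join(", ", Find(text, "an+")));   // an, an, ann
    }
}
```

With (an)+ the second match is the single an at the start of annals; with an+ the same position yields ann because the n may repeat.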

Groups are more powerful than this, however. By default, when part of a pattern is formed into a group, the regular expression engine is asked to remember what was matched by that group as well as by the whole pattern. In other words, a group is treated as a pattern to be matched and returned in its own right. This is very useful when a string has to be broken up into its component parts.
For example, a URI has the format <protocol>://<address>:<port>, where the port is optional. An example is http://www.comprg.com.cn:8080. Suppose you want to extract the protocol, the address, and the port from a URI that may be surrounded by white space (but not punctuation). You can do this with the following expression: \b(\S+)://([^:\s]+)(?::(\S+))?\b. The address group is written as [^:\s]+ (one or more characters that are neither a colon nor white space) so that it stops at the colon that introduces the port.

The expression works as follows. The \b sequences at the start and end ensure that only portions of text that form complete words are considered. Within that portion of text, the first group, (\S+)://, picks out one or more non-white-space characters followed by ://. For an HTTP URI this selects http at the start, and the parentheses cause http to be stored as a group. The second group picks out the address, www.comprg.com.cn in the URI above. This group ends either at the end of the word or at the colon (:) that begins the next group.

The next group picks out the port (8080 in this example). The ? that follows it indicates that the group is optional: if there is no :xxxx, the match is not prevented.

This matters because a port is usually not specified in a URI; in fact, most URIs have no port number. Things are slightly more complicated, however: the colon may or may not appear as well, yet you do not want to store the colon in the group. This is achieved with two nested groups. The inner (\S+) group picks out whatever follows the colon (8080 in this example). The outer group contains the colon followed by the inner group, and it starts with the sequence ?:, which indicates that the group should not be saved (only 8080 needs to be saved, not :8080). Do not confuse the two colons: the first is part of the sequence ?:, which marks the group as non-capturing, and the second is the literal text to be searched for.

Running this pattern on the string "I always visit http://www.comprg.com.cn" yields the match http://www.comprg.com.cn. Within this match there are the three groups just described plus a fourth group that represents the match itself. In principle, each group can itself be matched zero times, once, or several times; each individual match of a group is called a capture. The first group, (\S+), has one capture, http. The second group also has one capture, www.comprg.com.cn, but the third group has none, because this URI contains no port number. Note also that if the input string contained a second, isolated http://, that text would match the first group, but it would not be returned by the search, because the expression as a whole does not match at that point.
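The group behavior described above can be sketched as follows. The class UriPartsDemo and the helper Split are hypothetical, and the address group is written here as [^:\s]+, an assumption made so that the address stops at the colon instead of greedily absorbing the port.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriPartsDemo
{
    // Group 1: protocol, group 2: address, group 3: port (optional).
    // (?: ... ) around the colon keeps ":8080" from being stored as a group.
    private const string Pattern = @"\b(\S+)://([^:\s]+)(?::(\S+))?\b";

    // Returns { protocol, address, port }, or null when no URI is found.
    // The port is an empty string when the URI has none.
    public static string[] Split(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
        {
            return null;
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        string[] inputs =
        {
            "http://www.comprg.com.cn:8080",
            "I always visit http://www.comprg.com.cn"
        };
        foreach (string input in inputs)
        {
            string[] parts = Split(input);
            Console.WriteLine("protocol={0} address={1} port={2}", parts[0], parts[1], parts[2]);
        }
    }
}
```

For the first input the port group captures 8080; for the second it has no capture, so its Value is the empty string.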
The following example uses Match.Result to extract the protocol and port number from a URL. For example, "http://www.yahoo.com.cn:8080/index.html" returns "http:8080".

string Extension(string url)
{
    // proto: the word characters before ://, port: an optional colon plus digits
    Regex r = new Regex(@"^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/",
        RegexOptions.Compiled);
    return r.Match(url).Result("${proto}${port}");
}

5. Summary

The .NET Framework regular expression classes are part of the base class library and can be used with any language or tool that targets the common language runtime (including ASP.NET and Visual Studio .NET). This article has shown how to use regular expressions for string search in C# with the .NET Framework and has summarized the syntax and options of regular expressions for future reference.
