C# Getting Started (VII): Regular Expressions and String Search

Last Update:2015-05-03 Source: Internet

Author: User

Tags expression engine


1 Regular Expressions

Regular expressions provide a powerful, flexible, and efficient way to work with text. The full pattern-matching notation of regular expressions lets you quickly parse large amounts of text to find specific character patterns; extract, edit, replace, or delete text substrings; or add extracted strings to a collection to generate reports. Regular expressions are an indispensable tool for many applications that handle strings, such as HTML processing, log file parsing, and HTTP header parsing.

The .NET Framework regular expressions incorporate the most popular features of other regular expression implementations and are designed to be compatible with Perl 5 regular expressions. They also include features not yet seen in other implementations. The .NET Framework regular expression classes are part of the base class library and can be used with any language or tool that targets the common language runtime.

2 String Search

The regular expression language consists of two basic character types: literal (normal) text characters and metacharacters. It is the set of metacharacters that gives regular expressions their processing power. Almost all text editors have some search capability: you can usually open a dialog box, type the string you want to locate into a text box, and, if you also want to replace it, type a replacement string. Notepad in the Windows operating system and the document editors in the Office family both offer this functionality. The simplest way to solve this kind of problem in code is the String.Replace() method of the String class. But what if you need to identify repeated words in a document? Writing a routine that picks out repeated words using only the String class is more complex, and this is where the regular expression language is appropriate.

The regular expression language is a language for writing search expressions. In it you can combine literal text, escape sequences, and other characters that have a special meaning. For example, the sequence \b represents the beginning or end of a word (a word boundary). To find words that start with the characters th, you can write the regular expression \bth (that is, the sequence word boundary, t, h). To find all words ending in th, you can write th\b (the sequence t, h, word boundary). Regular expressions can be much more sophisticated than this; for example, there is a facility for storing part of the text found in a search operation.
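The two word-boundary patterns can be tried out with a minimal sketch like the one below. The sample sentence, the class name WordBoundaryDemo, and the helper FindAll are assumptions made for illustration, and \w* is added to each pattern only so that the whole word is returned instead of just the two letters th.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Returns the text of every match of pattern in input, in order.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample sentence, chosen only for illustration.
        string sample = "the thin moth sat with them";

        // \bth\w* : word boundary, t, h, then the rest of the word.
        Console.WriteLine(string.Join(", ", FindAll(sample, @"\bth\w*"))); // the, thin, them

        // \w*th\b : any word characters, then t, h, word boundary.
        Console.WriteLine(string.Join(", ", FindAll(sample, @"\w*th\b"))); // moth, with
    }
}
```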

3 .NET Framework Regular Expression Classes

Here is a look at the regular expression classes that the .NET Framework provides and how to use them.

(1) Using the .NET regular expression engine in C#

The following sample performs some searches and displays the results, illustrating some characteristics of regular expressions and how to use the .NET regular expression engine in C#. The input string is declared as a verbatim string, which is indicated by the @ symbol in front of it.

string text = @"I can not find my position in Beijing";

This text is called the input string. To illustrate the .NET regular expression classes, we start with a plain text search that contains no escape sequences or regular expression commands. The string being searched for is called the pattern. Using regular expressions and the text variable declared above, the code is written as follows:

This code uses the static Matches() method of the Regex class in the System.Text.RegularExpressions namespace. Its parameters are the input text, a pattern, and an optional set of flags taken from the RegexOptions enumeration. Matches() returns a MatchCollection, in which each match is represented by a Match object. The code simply iterates over the collection and uses the Index property of the Match class, which returns the index in the input text at which the match was found. Running this code finds 1 match.
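The listing described above does not appear in this copy of the article. The following is a minimal sketch that fits the description, not the original code: the pattern "ion" is an assumption (it occurs exactly once in the input string, which agrees with the single match reported), and the names PlainSearchDemo and FindPlain are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainSearchDemo
{
    // Plain text search: the pattern holds no escape sequences or metacharacters.
    public static MatchCollection FindPlain()
    {
        string text = @"I can not find my position in Beijing";
        string pattern = "ion"; // assumed pattern, not taken from the article
        return Regex.Matches(text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        // Iterate over the collection and print where each match starts.
        foreach (Match nextMatch in FindPlain())
        {
            Console.WriteLine(nextMatch.Index); // prints 23
        }
    }
}
```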
The power of regular expressions comes mainly from the pattern string, because it can contain more than plain text: it can also contain metacharacters and escape sequences. Metacharacters are special characters that give commands, while escape sequences work in much the same way as C# escape sequences: they are characters preceded by a backslash that have a special meaning. For example, suppose you want to find words that begin with n. You can use the escape sequence \b, which represents a word boundary (a position where an alphanumeric character is preceded or followed by a white-space character or punctuation mark), and write the following code:

string pattern = @"\bn";
MatchCollection matches = Regex.Matches(text, pattern, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

Note the @ character in front of the string. The \b must be passed to the .NET regular expression engine at run time, so the backslash must not be interpreted as an escape sequence by the C# compiler. If you are looking for words that end with the sequence ion, you can use the following code:

string pattern = @"ion\b";

What if you want to find all the words that start with the letter n and end with the sequence ion? You need a pattern that starts with \bn and ends with ion\b, and you need to tell the engine that between the n and the ion there can be any number of characters, as long as none of them is white space. The correct pattern is as follows:

string pattern = @"\bn\S*ion\b";
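As a rough check of the three kinds of pattern discussed above, the sketch below runs them as \bn, ion\b, and \bn\S*ion\b. The second sample sentence and the names PatternStepsDemo and FindAll are assumptions made for illustration; the first sentence is the article's input string, which happens to contain no word that both starts with n and ends with ion.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternStepsDemo
{
    // Returns the matched substrings, ignoring case as in the article's calls.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        string text = @"I can not find my position in Beijing";
        // Hypothetical second input containing words of the form n...ion.
        string other = "a nation with a notion";

        Console.WriteLine(FindAll(text, @"\bn").Length);         // 1 (the n of "not")
        Console.WriteLine(FindAll(text, @"ion\b").Length);       // 1 (the end of "position")
        Console.WriteLine(FindAll(text, @"\bn\S*ion\b").Length); // 0
        Console.WriteLine(string.Join(", ", FindAll(other, @"\bn\S*ion\b"))); // nation, notion
    }
}
```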

(2) Special characters and escape sequences

Most of the important regular expression language operators are unescaped single characters. The escape character \ (a single backslash) tells the regular expression parser that the character following the backslash is not an operator. For example, the parser treats an asterisk (*) as a repeating quantifier, and a backslash followed by an asterisk (\*) as the Unicode character 002A.
One thing to get used to with regular expressions is seeing odd sequences of characters like this, but the way the sequence works is quite logical. The escape sequence \S represents any character that is not white space. The * is called a quantifier and means that the preceding character can be repeated any number of times, including zero. The sequence \S* therefore represents any number of non-white-space characters, so the pattern above matches any single word that begins with n and ends with ion. The character escapes listed in the following table are recognized in both regular expressions and replacement patterns.

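The following table lists commonly used special characters and escape sequences:

Symbol   Meaning
^        Beginning of the input text
$        End of the input text
.        Any single character except the newline character (\n)
*        The preceding character may be repeated zero or more times
+        The preceding character may be repeated one or more times
?        The preceding character may be repeated zero or one time
\s       Any white-space character
\S       Any character that is not white space
\b       Word boundary
\B       Any position that is not a word boundary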

If you want to search for a metacharacter itself, escape it with a backslash. For example, . (a single period) represents any character other than the newline character, whereas \. represents a literal period.
Alternative characters can be placed in square brackets, and the pattern then matches any one of the enclosed characters. For example, [1c] means that the character can be 1 or c. (No | is needed between the alternatives; inside square brackets a | is just another literal character.) To search for map or man, you can use the sequence "ma[np]" (only the characters inside the quotation marks are part of the pattern; the same applies below). Inside square brackets you can also specify a range, using a hyphen (-) to denote a contiguous range of characters: "[a-z]" denotes any lowercase letter, "[B-F]" any uppercase letter between B and F, and "[0-9]" any digit. To search for an integer, that is, a sequence containing only the characters 0 to 9, you can write "[0-9]+". The + means at least one digit, but there may be more, so 9, 83, and 3443 all match.
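The character-class rules above can be checked with a short sketch. The sample strings and the names CharClassDemo and FindAll are assumptions made for illustration; the last line shows that a | placed inside square brackets is matched literally rather than acting as "or".

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Returns the text of every match of pattern in input, in order.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        // ma[np] matches "man" and "map" but not "mat".
        Console.WriteLine(string.Join(", ", FindAll("the man read the map on the mat", "ma[np]"))); // man, map

        // [0-9]+ matches each run of one or more digits.
        Console.WriteLine(string.Join(", ", FindAll("rooms 9, 83 and 3443", "[0-9]+"))); // 9, 83, 3443

        // Inside brackets, | is a literal character, so "ma|" matches ma[n|p].
        Console.WriteLine(Regex.IsMatch("ma|", "ma[n|p]")); // True
    }
}
```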
Let's look at the results of some regular expressions by writing a sample named Regularexpressionszzy. It sets up several regular expressions and displays their results, so that the user can see how each expression works.

The core of the example is a method WriteMatches(), which displays all the matches in a MatchCollection in more detail. For each match, it displays the index at which the match was found in the input string, the matched string, and a slightly longer string consisting of the match plus up to 10 surrounding characters from the input text: up to 5 characters before the match and up to 5 after it (fewer than 5 if the match lies within 5 characters of the beginning or end of the input text). For example, in an input text that ends with "and messaging of data.", a match on the word messaging would be displayed as "and messaging of d", with 5 characters before and after the match, but a match on the final word data would be displayed as "g of data.", with only one character after the matched word, because that character is followed by the end of the string. This longer string shows more clearly where the regular expression found the match:

In this method, most of the processing goes into working out how many characters of the longer string can be displayed without running past the beginning or end of the input text. Note that another property of the Match object is used, Value, which contains the string identified by the match. Apart from that, Regularexpressionszzy contains only methods with names such as Find_po and Find_n, which perform the search operations described in this article.
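The WriteMatches() listing is not reproduced in this copy of the article. The sketch below is one possible implementation of the behaviour described, not the original code: the helper name Surround, the class name, the output format, and the pattern "ion" are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WriteMatchesSketch
{
    // Returns the match plus up to 5 characters on each side,
    // clipped so that it never runs past either end of the input text.
    public static string Surround(string text, Match match)
    {
        int before = Math.Min(5, match.Index);
        int after = Math.Min(5, text.Length - (match.Index + match.Length));
        return text.Substring(match.Index - before, before + match.Length + after);
    }

    // Displays the index, the matched string (Value), and the surrounding text.
    public static void WriteMatches(string text, MatchCollection matches)
    {
        Console.WriteLine("Number of matches: " + matches.Count);
        foreach (Match nextMatch in matches)
        {
            Console.WriteLine("Index: {0}, String: {1}, Context: {2}",
                nextMatch.Index, nextMatch.Value, Surround(text, nextMatch));
        }
    }

    public static void Main()
    {
        string text = @"I can not find my position in Beijing";
        WriteMatches(text, Regex.Matches(text, "ion"));
        // Index: 23, String: ion, Context: position in B
    }
}
```

Computing the two clip lengths separately keeps the Substring call simple: the start is moved back by at most 5 characters and the length grows by whatever is available on each side.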

(3) Regular expression options

You can modify a regular expression pattern with options that affect matching behavior. Regular expression options can be set in two basic ways: they can be specified in the options parameter of the Regex(pattern, options) constructor, where options is a bitwise OR combination of RegexOptions enumeration values, or they can be set inside the regular expression pattern itself with the inline (?imnsx-imnsx:) grouping construct or the (?imnsx-imnsx) miscellaneous construct.

In an inline option construct, a minus sign (-) in front of an option or set of options turns those options off. For example, the inline construct (?ix-ms) turns on the IgnoreCase and IgnorePatternWhitespace options and turns off the Multiline and Singleline options.
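The two ways of setting options can be compared with a small sketch. The class name, method names, and sample words are assumptions made for illustration; the constructs themselves, (?i), (?-i), (?i:...) and (?x), are the standard .NET inline options.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionsDemo
{
    // Option passed through the RegexOptions parameter.
    public static bool WithEnumOption(string input)
    {
        return Regex.IsMatch(input, "beijing", RegexOptions.IgnoreCase);
    }

    // The same option switched on inside the pattern.
    public static bool WithInlineOption(string input)
    {
        return Regex.IsMatch(input, "(?i)beijing");
    }

    // A minus sign switches the option off again, even though it was passed in.
    public static bool InlineOff(string input)
    {
        return Regex.IsMatch(input, "(?-i)beijing", RegexOptions.IgnoreCase);
    }

    // The grouping form limits the option to the group: only "bei" ignores case.
    public static bool Scoped(string input)
    {
        return Regex.IsMatch(input, "(?i:bei)jing");
    }

    public static void Main()
    {
        Console.WriteLine(WithEnumOption("BEIJING"));   // True
        Console.WriteLine(WithInlineOption("BEIJING")); // True
        Console.WriteLine(InlineOff("BEIJING"));        // False
        Console.WriteLine(Scoped("BEIjing"));           // True
        Console.WriteLine(Scoped("BEIJING"));           // False

        // (?x) ignores unescaped white space in the pattern, so it can be spaced out.
        Console.WriteLine(Regex.IsMatch("nation", @"(?x) \b n \S* ion \b")); // True
    }
}
```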

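The following table lists the members of the RegexOptions enumeration that have an equivalent inline option character:

RegexOptions member        Inline character
IgnoreCase                 i
Multiline                  m
ExplicitCapture            n
Singleline                 s
IgnorePatternWhitespace    x

The remaining members, such as None, Compiled, and RightToLeft, have no inline equivalent.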

For example, Find_po looks for strings that begin with "po" at the start of a word:

This code also uses the System.Text.RegularExpressions namespace:
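The Find_po listing is missing from this copy of the article, so the following is only a sketch of what such a method might look like. The pattern \bpo, the return type, and the class name are assumptions; the using directive at the top is what brings the regular expression classes into scope.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FindPoSketch
{
    // Looks for "po" at the start of a word, ignoring case.
    public static MatchCollection Find_po(string text)
    {
        string pattern = @"\bpo"; // assumed pattern: word boundary, p, o
        return Regex.Matches(text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        string text = @"I can not find my position in Beijing";
        foreach (Match nextMatch in Find_po(text))
        {
            Console.WriteLine("{0} at index {1}", nextMatch.Value, nextMatch.Index);
            // po at index 18
        }
    }
}
```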

(4) Matches, groups, and captures

A useful feature of regular expressions is the ability to group characters, in much the same way as compound statements in C#. In C#, you can group any number of statements by putting them in curly braces, and the result is treated as one compound statement. In a regular expression pattern, you can likewise group any characters (including metacharacters and escape sequences) and have them treated as if they were a single character. The only difference is that you use parentheses instead of curly braces, and the resulting sequence is called a group.

For example, the pattern "(an)+" locates any repetition of the sequence an. The + quantifier normally applies only to the single character in front of it, but because the characters have been grouped, it now applies to repeats of an treated as a unit. Applied to the input text "bananas came to Europe late in the annals of history", "(an)+" picks out anan from bananas (and a single an from annals). On the other hand, if you use an+, ann is selected from annals, and two separate an sequences are selected from bananas. Why does (an)+ select anan rather than each an as a separate match? Matches cannot overlap, and where more than one match is possible at a position, the longer match is chosen by default.
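The difference between the grouped and ungrouped quantifier can be seen by running both patterns over the sentence from the paragraph above, written here in lower case. The class name GroupQuantifierDemo and the helper FindAll are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupQuantifierDemo
{
    // Returns the text of every match of pattern in input, in order.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        string text = "bananas came to Europe late in the annals of history";

        // + applies to the whole group: one long match in "bananas".
        Console.WriteLine(string.Join(", ", FindAll(text, "(an)+"))); // anan, an

        // + applies only to the n: two short matches in "bananas", "ann" in "annals".
        Console.WriteLine(string.Join(", ", FindAll(text, "an+")));   // an, an, ann
    }
}
```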

However, groups are much more powerful than this. By default, when part of a pattern is combined into a group, the regular expression engine remembers what was matched by that group as well as what was matched by the entire pattern. In other words, you can treat a group as a pattern to be matched and returned in its own right, which is very useful if you want to break a string into its component parts.
For example, a URI has the format "<protocol>://<address>:<port>", where the port is optional. An example is http://www.comprg.com.cn:8080. Suppose you want to extract the protocol, address, and port from a URI that may be followed by white space (but not by punctuation). You can use the following expression: "\b(\S+)://([^:\s]+)(?::(\S+))?\b"

The expression works as follows. First, the leading and trailing \b sequences ensure that only whole words are considered. Within them, the first group, "(\S+)://", selects one or more non-white-space characters followed by "://". At the start of an HTTP URI this selects http://. The parentheses cause http to be stored as a group. The next group, "([^:\s]+)", selects www.comprg.com.cn in the URI above; it ends at white space or at the colon (:) that introduces the next group. (A plain \S+ would not work for the address, because it is greedy and would also swallow the colon and the port.)

The next group selects the port (in this case, 8080). The ? that follows it indicates that the group is optional in the match: if there is no :xxxx, this does not prevent the match from being found.

This matters because a port is not usually specified in a URI; in fact, most URIs have no port number. But things are slightly more complicated than that. You want to say that the colon may or may not appear, but you do not want the colon itself to be stored in a group. To achieve this, nest two groups: the inner "(\S+)" group selects whatever follows the colon (in this case, 8080), and the outer group contains the colon followed by the inner group. The outer group also begins with the sequence "?:", which indicates that the group should not be saved (only "8080" needs to be saved, not ":8080"). Do not confuse the two colons: the first is part of the sequence "?:", which says that the group is not to be saved, and the second is literal text to search for.

Run the pattern on the string "I always visit http://www.comprg.com.cn" and the match found is http://www.comprg.com.cn. Three groups have been mentioned for this match, and there is in effect a fourth that represents the match itself. In theory, each group can select zero, one, or several pieces of text, and each individual piece is called a capture. The first group, "(\S+)", has one capture, http. The second group also has one capture, www.comprg.com.cn, but the third group has no capture because this URI has no port number. Note also that if the input contained a stand-alone http:// with nothing after it, that text would match the first group but would not be returned by the search, because the expression as a whole does not match it.
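The groups can be inspected through the Groups property of the Match object. The sketch below is a minimal illustration under one assumption: the address group is written as ([^:\s]+) so that it stops at the colon before the port, since a greedy \S+ in that position would take the port as well. The class and method names are hypothetical. Note that RegexOptions.ExplicitCapture is deliberately not used here, because it would stop the unnamed groups from capturing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriGroupsDemo
{
    // Protocol, address, and optional port; the outer port group is non-capturing.
    public const string Pattern = @"\b(\S+)://([^:\s]+)(?::(\S+))?\b";

    public static Match Find(string text)
    {
        return Regex.Match(text, Pattern);
    }

    public static void Main()
    {
        Match m = Find("I always visit http://www.comprg.com.cn");
        Console.WriteLine(m.Groups[0].Value);   // http://www.comprg.com.cn (the whole match)
        Console.WriteLine(m.Groups[1].Value);   // http
        Console.WriteLine(m.Groups[2].Value);   // www.comprg.com.cn
        Console.WriteLine(m.Groups[3].Success); // False: no port, so no capture

        Match withPort = Find("http://www.comprg.com.cn:8080");
        Console.WriteLine(withPort.Groups[3].Value); // 8080
    }
}
```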
For example, the following code example uses Match.Result to extract the protocol and port number from a URL: "http://www.yahoo.com.cn:8080/index.html" returns "http:8080".
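The code for this example is not included in this copy of the article. A minimal sketch that produces the stated result is shown below; the pattern with the named groups proto and port, the method name Extension, and the class name are assumptions. The check on Success is there because Match.Result throws a NotSupportedException when it is called on a failed match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchResultDemo
{
    // Returns the protocol followed by the port (if any), or an empty string
    // when the URL does not match the pattern.
    public static string Extension(string url)
    {
        Regex r = new Regex(@"^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/");
        Match m = r.Match(url);
        // ${proto} and ${port} are replaced by the text captured by the named groups.
        return m.Success ? m.Result("${proto}${port}") : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(Extension("http://www.yahoo.com.cn:8080/index.html")); // http:8080
    }
}
```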

That concludes this first introduction to regular expressions in .NET. They have a wide range of applications in C# programming, and I hope you will gradually master them through practice. In the next post, we introduce LINQ in C#.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
