C# Regular Expression Programming (4): Regular Expressions

Last Update:2018-12-03 Source: Internet

Author: User

Tags iis access logs


Regular expressions provide a powerful, flexible, and efficient way to process text. Their pattern-matching notation lets you quickly scan large amounts of text to find specific character patterns; extract, edit, replace, or delete substrings; or add extracted strings to a collection to generate a report. Regular expressions are an indispensable tool for many applications that process strings, such as HTML processing, log file analysis, and HTTP header analysis. Some people have even called them one of the top ten technologies that can keep programmers from losing their jobs, which gives an idea of how important they are.

Readers familiar with DOS or the command line may have used something similar. For example, to find all Word files from versions before Word 2007 on drive D (files from versions before Word 2007 have the suffix .doc, while Word 2007 files have the suffix .docx), we can run this command at the command line:

dir D:\*.doc

To also search every subdirectory of drive D, run dir /s D:\*.doc instead.

Note that regular expressions are not exclusive to C#. They were implemented in other languages long ago, such as Perl (many people may never have heard of this language; I learned a little of it a decade ago), and other programming languages such as Java, PHP, and JavaScript also support them. Regular expressions are almost a standard in the way SQL is: just as database vendors differ in how completely they support the SQL standard, regular expression implementations differ too. Most regular expressions can be used across languages, but there are minor differences between languages that we need to watch for.

Regular Expression metacharacters

The regular expression language consists of two basic character types: literal (normal) text characters and metacharacters. Metacharacters are what give regular expressions their processing power. A metacharacter construct can stand for any single character listed in [] (for example, [a] matches the single lowercase character a), or for a range or class of characters (for example, [a-d] matches any one of the characters a, b, c, and d, and \w matches any English letter, digit, or underscore). Some common metacharacters are listed below:

Metacharacter	Description
.	Matches any single character except \n (note that this metacharacter is a period).
[ABCDE]	Matches any one of the characters A, B, C, D, E
[A-H]	Matches any one character in the range A to H
[^fgh]	Matches any one character that is not f, g, or h
\w	Matches any one upper- or lowercase English letter, digit 0 to 9, or underscore, equivalent to [a-zA-Z0-9_] (by default .NET also matches other Unicode letters and digits; RegexOptions.ECMAScript restricts \w to this ASCII set)
\W	Matches any one character that \w does not match, equivalent to [^a-zA-Z0-9_]
\s	Matches any white-space character, equivalent to [ \f\n\r\t\v]
\S	Matches any non-white-space character, equivalent to [^\s]
\d	Matches a single digit from 0 to 9, equivalent to [0-9]
\D	Matches any single character that is not a digit from 0 to 9, equivalent to [^0-9]
[\u4e00-\u9fa5]	Matches any single Chinese character (the range is written with Unicode code points)
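
The character classes in the table can be tried out with a few lines of C#. The following is only a minimal sketch: the class name MetacharacterDemo, the helper FindAll, and the sample string are assumptions made for illustration and do not come from the original article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    // Hypothetical helper: returns every non-overlapping match, joined with '|'.
    public static string FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return string.Join("|", values);
    }

    public static void Main()
    {
        // \u4e2d\u6587 are two Chinese characters, written as escapes
        // so the source file stays plain ASCII.
        string sample = "Deck_7 is 42 m2, \u4e2d\u6587 ok";

        Console.WriteLine(FindAll(sample, @"\d"));              // 7|4|2|2
        Console.WriteLine(FindAll(sample, @"[A-H]"));           // D
        Console.WriteLine(FindAll(sample, @"[^\w\s]"));         // ,
        Console.WriteLine(FindAll(sample, @"[\u4e00-\u9fa5]")); // the two Chinese characters
    }
}
```

Note that the underscore and the Chinese characters count as \w characters here, which is why [^\w\s] finds only the comma.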

Regular Expression quantifiers

Each of the metacharacters above matches a single character. To match several characters at once, you also need quantifiers. Some common quantifiers are listed below (n and m in the table are integers with 0 < n < m). The last three rows, \b, ^, and $, are anchors rather than quantifiers: they match a position instead of repeating an element.

Quantifier	Description
*	Matches the preceding element zero or more times, equivalent to {0,}
?	Matches the preceding element zero or one time, equivalent to {0,1}
{n}	Matches the preceding element exactly n times
{n,}	Matches the preceding element at least n times
{n,m}	Matches the preceding element from n to m times
+	Matches the preceding element one or more times, equivalent to {1,}
\b	Matches a word boundary
^	Matches at the start of the string; what follows must begin the string
$	Matches at the end of the string; what precedes must end the string

Note:

(1) Because the characters \, ?, *, ^, $, +, (, ), |, {, and [ have special meanings in regular expressions, you must escape them with a backslash to use them literally. For example, to require at least one "\" in the string, write the regular expression as \\+.

(2) You can enclose several metacharacters or literal characters in parentheses to form a group. For example, ^(13)[4-9]\d{8}$ matches an 11-digit mobile phone number that starts with 13 and whose third digit is 4 to 9.

(3) Chinese characters are matched by their Unicode code points. A single character can be written as an escape: \u4e00 is the Chinese character "一" (one) and \u9fa5 is the character "龥". These are the first and last code points of the basic Chinese character range in Unicode, which contains 20,902 characters (later Unicode versions added further Chinese characters outside this range).

(4) \b marks the start or end of a word. Taking "123a 345b 456 789d" as the sample string, the regular expression \b\d{3}\b matches only 456.

(5) "|" expresses alternation (or). For example, (z|j|q) matches any one of the letters z, j, and q. Inside a character class the | is a literal character, so the equivalent class is written [zjq].
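
The notes above can be checked with a short program. This is a rough sketch rather than production code: the class name QuantifierDemo, its helpers, and the sample inputs (apart from the string in note (4)) are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Pattern from note (2): "13", one digit in 4-9, then exactly eight digits.
    public const string MobilePattern = @"^(13)[4-9]\d{8}$";

    public static bool IsMobile(string s)
    {
        return Regex.IsMatch(s, MobilePattern);
    }

    // Hypothetical helper: number of non-overlapping matches.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(IsMobile("13512345678")); // True
        Console.WriteLine(IsMobile("13112345678")); // False: third digit is not 4-9

        // Note (4): only the standalone three-digit word matches.
        Console.WriteLine(Regex.Match("123a 345b 456 789d", @"\b\d{3}\b").Value); // 456

        // Note (1): \\+ matches a run of one or more backslashes.
        Console.WriteLine(Regex.Match(@"C:\\\temp", @"\\+").Value.Length); // 3

        // Note (5): alternation between the letters z, j, and q.
        Console.WriteLine(CountMatches("jazz quiz", @"z|j|q")); // 5
    }
}
```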

Regular Expression grouping

Enclosing part of a regular expression in () forms a group, also known as a subexpression or capture group. For example, for a time in the format "08:14:27", we can write the following regular expression:

((0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})

Using this expression, we can extract the access time from the following IIS access log (of course, the best tool for analyzing IIS logs is Log Parser, a Microsoft tool):

00:41:23 GET /admin_save.asp 202.108.212.39 404 1468 176
01:04:36 GET /userbuding.asp 202.108.212.39 404 1468 176
10:00:59 GET /upfile_flash.asp 202.108.212.39 404 1468 178
12:59:00 GET /CP.php 202.108.212.39 404 1468 168
19:23:04 GET /sqldata.php 202.108.212.39 404 1468 173
23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404 1468 176
23:59:59 GET /bil.html 202.108.212.39 404 1468 170

To analyze the IIS log above and extract the access time, the requested page, the client IP address, and the server response code (corresponding to HttpStatusCode in C#) from each entry, we can use groups.

The code is as follows:

// Requires: using System; using System.Text.RegularExpressions;
private string text = @"00:41:23 GET /admin_save.asp 202.108.212.39 404 1468 176
01:04:36 GET /userbuding.asp 202.108.212.39 404 1468 176
10:00:59 GET /upfile_flash.asp 202.108.212.39 404 1468 178
12:59:00 GET /CP.php 202.108.212.39 404 1468 168
19:23:04 GET /sqldata.php 202.108.212.39 404 1468 173
23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404 1468 176
23:59:59 GET /bil.html 202.108.212.39 404 1468 170";
/// <summary>
/// Analyze IIS logs and extract the client access time, URL, IP address, and server response code
/// </summary>
public void AnalyzeIISLog()
{
    // Regular expression that extracts the access time, URL, IP address, and server response code.
    // The subexpression for the time is complex because the time is matched strictly.
    // For simplicity, the client IP format is not strictly verified, because an IIS access log
    // does not contain malformed IP addresses.
    Regex regex = new Regex(@"((0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(GET)\s([^\s]+)\s(\d{1,3}(\.\d{1,3}){3})\s(\d{3})", RegexOptions.None);
    MatchCollection matchCollection = regex.Matches(text);
    for (int i = 0; i < matchCollection.Count; i++)
    {
        Match match = matchCollection[i];
        Console.WriteLine("Match[{0}]========================", i);
        // Groups[0] is the whole match; the remaining groups follow the order of the opening parentheses.
        for (int j = 0; j < match.Groups.Count; j++)
        {
            Console.WriteLine("Groups[{0}]={1}", j, match.Groups[j].Value);
        }
    }
}


The output of this code is as follows:

Match[0]========================
Groups[0]=00:41:23 GET /admin_save.asp 202.108.212.39 404
Groups[1]=00:41:23
Groups[2]=00
Groups[3]=:23
Groups[4]=GET
Groups[5]=/admin_save.asp
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404
Match[1]========================
Groups[0]=01:04:36 GET /userbuding.asp 202.108.212.39 404
Groups[1]=01:04:36
Groups[2]=01
Groups[3]=:36
Groups[4]=GET
Groups[5]=/userbuding.asp
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404
Match[2]========================
Groups[0]=10:00:59 GET /upfile_flash.asp 202.108.212.39 404
Groups[1]=10:00:59
Groups[2]=10
Groups[3]=:59
Groups[4]=GET
Groups[5]=/upfile_flash.asp
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404
Match[3]========================
Groups[0]=12:59:00 GET /CP.php 202.108.212.39 404
Groups[1]=12:59:00
Groups[2]=12
Groups[3]=:00
Groups[4]=GET
Groups[5]=/CP.php
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404
Match[4]========================
Groups[0]=19:23:04 GET /sqldata.php 202.108.212.39 404
Groups[1]=19:23:04
Groups[2]=19
Groups[3]=:04
Groups[4]=GET
Groups[5]=/sqldata.php
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404
Match[5]========================
Groups[0]=23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404
Groups[1]=23:00:00
Groups[2]=23
Groups[3]=:00
Groups[4]=GET
Groups[5]=/Evil-Skwiz.htm
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404
Match[6]========================
Groups[0]=23:59:59 GET /bil.html 202.108.212.39 404
Groups[1]=23:59:59
Groups[2]=23
Groups[3]=:59
Groups[4]=GET
Groups[5]=/bil.html
Groups[6]=202.108.212.39
Groups[7]=.39
Groups[8]=404

From the output above, we can see that in each match the 2nd group is the client access time (index 1, because indexes start from 0; the same applies below), the 6th group is the requested URL (index 5), the 7th group is the client IP address (index 6), and the 9th group is the server response code (index 8). To extract these elements, we can read the values directly by index, which is more convenient than working without regular expressions.

Named capture groups

Although the above approach is convenient, it has a drawback: if you need to extract more information and add or remove capture groups, the index of each capture group changes and the code has to be modified again, because the indexes are hard-coded. Is there a better way? Yes: name the capture groups.

This is similar to reading data through a DataReader or from a DataTable. We can access columns by index (also starting from 0), but if the number or order of fields in the SELECT statement changes, the code that reads the data has to change too. To cope with such changes, the field name can be used as the index instead: as long as the field exists in the data source, the correct value is returned regardless of order. Named capture groups play the same role in regular expressions.

Ordinary capture group: (regular expression), for example (\d{8,11}).

Named capture group: (?<group name>regular expression), for example (?<phone>\d{8,11}).

The value of an ordinary capture group can only be retrieved by index, but the value of a named capture group can also be retrieved by name. For example, with (?<phone>\d{8,11}) the code can use match.Groups["phone"], which makes the code more intuitive and more flexible. For the IIS log analysis above, the code using named capture groups is as follows:

// Requires: using System; using System.Text.RegularExpressions;
private string text = @"00:41:23 GET /admin_save.asp 202.108.212.39 404 1468 176
01:04:36 GET /userbuding.asp 202.108.212.39 404 1468 176
10:00:59 GET /upfile_flash.asp 202.108.212.39 404 1468 178
12:59:00 GET /CP.php 202.108.212.39 404 1468 168
19:23:04 GET /sqldata.php 202.108.212.39 404 1468 173
23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404 1468 176
23:59:59 GET /bil.html 202.108.212.39 404 1468 170";
/// <summary>
/// Use named capture groups to extract information from IIS logs
/// </summary>
public void AnalyzeIISLog2()
{
    // Group names are case-sensitive: the names used in Groups[...] must match the pattern exactly.
    Regex regex = new Regex(@"(?<time>(0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(GET)\s(?<url>[^\s]+)\s(?<ip>\d{1,3}(\.\d{1,3}){3})\s(?<httpcode>\d{3})", RegexOptions.None);
    MatchCollection matchCollection = regex.Matches(text);
    for (int i = 0; i < matchCollection.Count; i++)
    {
        Match match = matchCollection[i];
        Console.WriteLine("Match[{0}]========================", i);
        Console.WriteLine("Time: {0}", match.Groups["time"]);
        Console.WriteLine("url: {0}", match.Groups["url"]);
        Console.WriteLine("IP: {0}", match.Groups["ip"]);
        Console.WriteLine("httpcode: {0}", match.Groups["httpcode"]);
    }
}


The code execution result is as follows:

Match[0]========================
Time: 00:41:23
url: /admin_save.asp
IP: 202.108.212.39
httpcode: 404
Match[1]========================
Time: 01:04:36
url: /userbuding.asp
IP: 202.108.212.39
httpcode: 404
Match[2]========================
Time: 10:00:59
url: /upfile_flash.asp
IP: 202.108.212.39
httpcode: 404
Match[3]========================
Time: 12:59:00
url: /CP.php
IP: 202.108.212.39
httpcode: 404
Match[4]========================
Time: 19:23:04
url: /sqldata.php
IP: 202.108.212.39
httpcode: 404
Match[5]========================
Time: 23:00:00
url: /Evil-Skwiz.htm
IP: 202.108.212.39
httpcode: 404
Match[6]========================
Time: 23:59:59
url: /bil.html
IP: 202.108.212.39
httpcode: 404

With named capture groups, accessing the captured values is more intuitive. As long as the names of the capture groups do not change, other changes to the pattern do not affect the existing code.
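
To make this point concrete, here is a small sketch that compares two versions of a simplified pattern. The class name NamedGroupDemo, the two simplified patterns, and the extra group named method are assumptions for illustration; only the log line comes from the example above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Simplified pattern with two named groups.
    public static readonly Regex Original =
        new Regex(@"(?<time>\d{2}:\d{2}:\d{2})\sGET\s(?<url>\S+)");

    // Same pattern after adding a (hypothetical) named group for the request method.
    public static readonly Regex Extended =
        new Regex(@"(?<time>\d{2}:\d{2}:\d{2})\s(?<method>GET)\s(?<url>\S+)");

    public static void Main()
    {
        string line = "00:41:23 GET /admin_save.asp 202.108.212.39 404";
        foreach (Regex regex in new[] { Original, Extended })
        {
            Match match = regex.Match(line);
            // The numeric index of "url" shifts from 2 to 3, but the name still works.
            Console.WriteLine("url index = {0}, url = {1}",
                regex.GroupNumberFromName("url"), match.Groups["url"].Value);
        }
    }
}
```

Code that reads match.Groups["url"] keeps working after the pattern is extended, while code that reads match.Groups[2] would silently start returning the request method.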

Non-capturing Group

If you often read other people's regular expressions, you may have seen the form (?:subexpression). This is a non-capturing group. With a capture group, later code can access the matched value by index or by name (for a named capture group), because the value is saved in memory during matching. If we do not need the matched value later, we can tell the engine not to save it, which improves efficiency and reduces memory consumption; this is what a non-capturing group is for. For example, when analyzing the IIS log we do not care about the client's request method, so we can use a non-capturing group there, as shown below:

Regex regex = new Regex(@"(?<time>(0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(?:GET)\s(?<url>[^\s]+)\s(?<ip>\d{1,3}(\.\d{1,3}){3})\s(?<httpcode>\d{3})");
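
The effect of a non-capturing group is easy to observe by counting groups. The sketch below uses a shortened pattern and a hypothetical helper named GroupCount, both invented here for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonCapturingDemo
{
    // Number of groups in a successful match, counting group 0 (the whole match).
    public static int GroupCount(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups.Count : 0;
    }

    public static void Main()
    {
        string line = "00:41:23 GET /admin_save.asp";

        Console.WriteLine(GroupCount(line, @"(GET)\s(\S+)"));   // 3
        Console.WriteLine(GroupCount(line, @"(?:GET)\s(\S+)")); // 2

        // With the non-capturing group, the URL moves up to index 1.
        Console.WriteLine(Regex.Match(line, @"(?:GET)\s(\S+)").Groups[1].Value); // /admin_save.asp
    }
}
```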


Zero-width assertion

Zero-width assertions go by several names, including lookaround and pre-search; here I use the MSDN terminology. There are four kinds of zero-width assertion:

(?=subexpression): zero-width positive lookahead assertion. Matching continues only if the subexpression matches to the right of the current position. For example, 19(?=99) matches a "19" that is followed by "99".

(?!subexpression): zero-width negative lookahead assertion. Matching continues only if the subexpression does not match to the right of the current position. For example, 19(?!99) matches a "19" that is not followed by "99", so it does not match the "19" in "1999".

(?<=subexpression): zero-width positive lookbehind assertion. Matching continues only if the subexpression matches to the left of the current position. For example, (?<=19)99 matches a "99" that follows "19". This construct does not backtrack.

(?<!subexpression): zero-width negative lookbehind assertion. Matching continues only if the subexpression does not match to the left of the current position. For example, (?<!19)99 matches a "99" that does not directly follow "19", so it matches the "99" in "2099" but not the final "99" of "1999".
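
The four assertions can be compared side by side by printing where each pattern matches. This is a minimal sketch; the class name LookaroundDemo, the helper MatchIndexes, and the sample string "1999 1984 2099" are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: start index of every match, joined with ','.
    public static string MatchIndexes(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] indexes = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            indexes[i] = matches[i].Index.ToString();
        }
        return string.Join(",", indexes);
    }

    public static void Main()
    {
        // Index:        0123456789...
        string input = "1999 1984 2099";

        Console.WriteLine(MatchIndexes(input, @"19(?=99)"));  // 0    ("19" in 1999)
        Console.WriteLine(MatchIndexes(input, @"19(?!99)"));  // 5    ("19" in 1984)
        Console.WriteLine(MatchIndexes(input, @"(?<=19)99")); // 2    (last "99" of 1999)
        // Index 1 also matches: the "99" starting there has only "1" before it,
        // so the two characters to its left are not "19".
        Console.WriteLine(MatchIndexes(input, @"(?<!19)99")); // 1,12
    }
}
```

The last line shows why lookbehind results are worth testing: an assertion is checked at each candidate position, not once per word.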

Regular Expression options

When using a regular expression, besides passing a RegexOptions value to set additional options, you can also specify these options inside the expression itself, for example:

Regex regex = new Regex("(?i)def");

This is equivalent to the following statement:

Regex regex = new Regex("def", RegexOptions.IgnoreCase);

The (?i) form is called inline mode: as the name suggests, the option is written into the regular expression itself. The inline characters correspond to RegexOptions values as follows:

IgnoreCase: inline character i. Specifies case-insensitive matching.

Multiline: inline character m. Specifies multiline mode, which changes the meaning of ^ and $ so that they match at the beginning and end of any line, not just the beginning and end of the entire string.

ExplicitCapture: inline character n. Specifies that the only valid captures are groups explicitly named or numbered in the form (?<name>...). This lets unnamed parentheses act as non-capturing groups without the (?:...) syntax.

Singleline: inline character s. Specifies single-line mode, which changes the meaning of the period (.) so that it matches every character (instead of every character except \n).

IgnorePatternWhitespace: inline character x. Specifies that unescaped white space is excluded from the pattern and enables comments that start with the number sign (#). Note that white space is never removed from inside a character class.

Example:

RegexOptions option = RegexOptions.IgnoreCase | RegexOptions.Singleline;
Regex regex = new Regex("def", option);

Inline representation:

Regex regex = new Regex("(?is)def");
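
A quick way to convince yourself that the two notations behave the same is to run them against the same input. The sketch below assumes a two-line sample string and a small wrapper named IsMatch; both are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionDemo
{
    // Hypothetical wrapper so that both notations go through the same call.
    public static bool IsMatch(string input, string pattern, RegexOptions options)
    {
        return new Regex(pattern, options).IsMatch(input);
    }

    public static void Main()
    {
        string input = "ABC\nDEF";

        // RegexOptions versus inline (?i): both ignore case.
        Console.WriteLine(IsMatch(input, "def", RegexOptions.IgnoreCase)); // True
        Console.WriteLine(IsMatch(input, "(?i)def", RegexOptions.None));   // True
        Console.WriteLine(IsMatch(input, "def", RegexOptions.None));       // False

        // Singleline lets '.' match the \n between C and D.
        Console.WriteLine(IsMatch(input, "c.d", RegexOptions.IgnoreCase | RegexOptions.Singleline)); // True
        Console.WriteLine(IsMatch(input, "(?is)c.d", RegexOptions.None)); // True
        Console.WriteLine(IsMatch(input, "(?i)c.d", RegexOptions.None));  // False
    }
}
```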


Note: regular expressions involve many more details, such as backreferences, matching order, and the differences and relationships between the matching modes. However, these are not used very often in daily development (unless you do text analysis or processing), so I will not cover them for now. Although the four articles in this series are not long (I dare not stay up late, because I have to get up at 5 o'clock every day), this foundation is enough to grasp the essentials of regular expressions. How to apply them in development depends on the actual situation.

In my personal experience, a regular expression used to validate input should be written strictly, while one used to extract data from text in a known format can be written more loosely. For example, to validate a time you must write (?<time>(0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2}), so that an input such as 26:99:99 fails validation; but to extract the time from the IIS log above, (?<time>\d{2}(:\d{2}){2}) is enough. Of course, if a strict pattern is too troublesome to write, you can also write a looser pattern and then validate by other means. There is a date-validating regular expression on the Internet whose author took into account the different number of days in each month, and even the number of days in February in leap years, and wrote a very complex pattern. I personally think it is better to convert the text into a date value as part of the validation, which is easier to understand and maintain.
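
One possible way to combine a loose pattern with validation by conversion is sketched below. It extracts a time-like token with the loose pattern from the text and then lets DateTime.TryParseExact decide whether it is a real time. The class name LooseThenValidateDemo and the method TryExtractTime are assumptions for illustration, not part of the original article.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LooseThenValidateDemo
{
    // Loose pattern from the text: two digits followed by two ":dd" parts.
    private static readonly Regex LooseTime = new Regex(@"(?<time>\d{2}(:\d{2}){2})");

    // Extracts the first time-like token, then validates it by parsing instead of by regex.
    public static bool TryExtractTime(string line, out TimeSpan time)
    {
        time = TimeSpan.Zero;
        Match match = LooseTime.Match(line);
        if (!match.Success)
        {
            return false;
        }

        DateTime parsed;
        if (!DateTime.TryParseExact(match.Groups["time"].Value, "HH:mm:ss",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed))
        {
            return false; // for example 26:99:99
        }

        time = parsed.TimeOfDay;
        return true;
    }

    public static void Main()
    {
        TimeSpan time;
        Console.WriteLine(TryExtractTime("23:59:59 GET /bil.html", out time)); // True
        Console.WriteLine(time);                                               // 23:59:59
        Console.WriteLine(TryExtractTime("26:99:99 GET /bil.html", out time)); // False
    }
}
```

The regular expression stays short and readable, and the range rules for hours, minutes, and seconds are left to the framework's date parsing.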

That is all for regular expressions for now. Some less frequently used features remain, and I will summarize them later when I have time. Next I may write a comparison of ADO.NET and ORM.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
