This article collects regular expression techniques commonly used in ASP.NET; refer to it whenever you need them.
Regular expressions will be familiar to most developers: we use them all the time, for example to validate email addresses or mobile phone numbers. .NET provides a powerful set of regular expression classes, the most important of which is the Regex class. With it, matching against a regular expression is straightforward:
The code is as follows:
    string matchText = "this|is|test";
    // The vertical bar is escaped so that it is matched literally
    Regex reg = new Regex(@"[a-z]+\|");
    MatchCollection mc = reg.Matches(matchText);
    foreach (Match myValue in mc)
    {
        MessageBox.Show(myValue.Value);
    }
This is a very simple example that matches a run of lowercase English letters followed by a vertical bar. The matches here are "this|" and "is|". Note that the following namespace must be imported:
    using System.Text.RegularExpressions;
The following introduces regular expression syntax.
To start with, there are three things to remember, which can be memorized by the letters B, C and D:
B refers to the three kinds of brackets: square brackets [], which list the characters you want to match; braces {}, which specify how many times to match; and parentheses (), which are used for grouping.
C refers to the caret (^), which specifies the start of the match.
D refers to the dollar sign ($), which specifies the end of the match.
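As a minimal sketch of how these pieces combine, the pattern below uses square brackets for a character set, braces for a length, parentheses for a group, and ^ and $ to anchor the whole string. The class name BcdDemo, the method name and the sample pattern are illustrative assumptions, not part of the original article.

```csharp
using System.Text.RegularExpressions;

public static class BcdDemo
{
    // ^ and $ anchor the match to the whole string,
    // [a-z] is a character set, {2,5} is a length range,
    // and (com|org) is a group holding two alternatives.
    private static readonly Regex Pattern =
        new Regex(@"^[a-z]{2,5}\.(com|org)$");

    // Returns true when the input is 2 to 5 lowercase letters
    // followed by ".com" or ".org" and nothing else.
    public static bool IsMatch(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

Under these assumptions, BcdDemo.IsMatch("qq.com") returns true, while "abcdef.com" (six letters) and "qq.net" (a suffix outside the group) return false.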
The following is an example:
First, I have a set of mail addresses:
"User 1" <491204829@qq.com>,
"User 2" <12340352@qq.com>,
"User 3" <962390304@qq.com>,
User 4 <xylw2y2011@163.com>,
User 5 <443225735@qq.com>,
User 6 <519733331@qq.com>
We only want to extract the actual email address at the end of each entry. How can we do this?
First, observe the pattern: each address is a run of digits or letters, followed by the @ symbol, followed by another run of digits or letters, and finally ends with .com. Once we know the general rule, we can write the match.
After the opening < comes the run of digits or letters: [a-zA-Z0-9] matches one character from a to z, A to Z or 0 to 9. The length is at most about 10 characters, so we match 1 to 10 of them: {1,10}. Then comes the @ symbol. The part before the dot is again a run of letters or digits, [a-zA-Z0-9], which we likewise limit to 1 to 10 characters with {1,10}. The dot itself is a metacharacter, so it is escaped as \. and the suffix can be matched with (com|org). The resulting expression is:
    <[a-zA-Z0-9]{1,10}@[a-zA-Z0-9]{1,10}\.(com|org)>
The following code puts it to use:
    string myMailInfo = this.txtUserMail.Text;
    Regex reg = new Regex(@"<[a-zA-Z0-9]{1,10}@[a-zA-Z0-9]{1,10}\.(com|org)>");
    // Regex reg = new Regex(@"<\S{1,20}@\S{6}>");
    MatchCollection mc = reg.Matches(myMailInfo);
    this.txtUserMail.Text = string.Empty;
    for (int i = 0; i < mc.Count; i++)
    {
        this.txtUserMail.AppendText(mc[i].Value + ",");
    }
The results are easy to extract this way:
<491204829@qq.com>,
<12340352@qq.com>,
<962390304@qq.com>,
<xylw2y2011@163.com>,
<443225735@qq.com>,
<519733331@qq.com>,
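The matches above still include the surrounding angle brackets. If only the address itself is wanted, one option is to wrap the address part of the pattern in an extra pair of parentheses and read that group instead of the whole match. The sketch below assumes this approach; the names MailExtractor and Extract are hypothetical and do not appear in the original article.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MailExtractor
{
    // Same idea as the pattern described above, with the address wrapped in
    // parentheses so that it becomes capture group 1 (the suffix is group 2).
    private static readonly Regex Pattern = new Regex(
        @"<([a-zA-Z0-9]{1,10}@[a-zA-Z0-9]{1,10}\.(com|org))>");

    // Returns every address found between angle brackets, without the brackets.
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Called on a string containing `"User 1" <491204829@qq.com>, User 4 <xylw2y2011@163.com>,` this returns the two addresses 491204829@qq.com and xylw2y2011@163.com.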
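So once you have found the rule, the rest is easy. The following lists what each part of the pattern matches:
Pattern              Description
-----------------------------------------------
[a-zA-Z0-9]          Matches a single letter or digit
[a-zA-Z0-9]{1,10}    Matches a run of 1 to 10 letters or digits
(com|org)            Matches the fixed string com or org
-----------------------------------------------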
What if we want to match everything except letters, digits or some other set of characters? This is done with ^ inside square brackets, which negates the set: [^\w] matches any character that is not a word character (letter, digit or underscore), [^\d] matches any character that is not a digit, and [^\s] matches any character that is not white space. There is also a simpler spelling: [^\w] corresponds to \W, [^\d] to \D, and [^\s] to \S.
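To illustrate that the bracketed negation and the uppercase shorthand are two spellings of the same class, here is a small sketch that strips everything except digits from a string. The class and method names are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class NegationDemo
{
    // Removes every character that is not a digit, using the bracketed form.
    public static string DigitsOnlyBracket(string input)
    {
        return Regex.Replace(input, @"[^\d]", "");
    }

    // Same operation using the shorthand \D.
    public static string DigitsOnlyShorthand(string input)
    {
        return Regex.Replace(input, @"\D", "");
    }
}
```

Both methods turn "Tel: 138-0013 8000" into "13800138000". Note that ^ negates only when it is the first character inside square brackets; outside brackets it is the start-of-match anchor described earlier.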
Below are some common metacharacters:
Metacharacter      Description
-----------------------------------------------
.                  Matches any character except \n (note that the metacharacter is a period)
[abcde]            Matches any one of the characters a, b, c, d and e
[a-h]              Matches any one character between a and h
[^fgh]             Matches any one character other than f, g and h
\w                 Matches any upper- or lowercase English letter, digit 0 to 9 or underscore; equivalent to [a-zA-Z0-9_]
\W                 Matches any character that is not an English letter, digit 0 to 9 or underscore; equivalent to [^a-zA-Z0-9_]
\s                 Matches any white-space character; equivalent to [ \f\n\r\t\v]
\S                 Matches any non-white-space character; equivalent to [^\s]
\d                 Matches a single digit from 0 to 9; equivalent to [0-9]
\D                 Matches any character that is not a digit from 0 to 9; equivalent to [^0-9]
[\u4e00-\u9fa5]    Matches any single Chinese character
-----------------------------------------------
Note: in .NET the ASCII equivalents given for \w, \d and \s (and their negations) are exact only when RegexOptions.ECMAScript is set; by default these classes also cover Unicode letters, digits and separators.
The following part is taken from Zhou Gong's blog; I personally think the explanation is very good:
The metacharacters above each match a single character. To match several characters at once you also need a quantifier. Below are some common quantifiers, together with the position anchors \b, ^ and $ (n and m in the table are integers with 0 < n < m):
Quantifier    Description
-----------------------------------------------
*             Matches the preceding element 0 or more times; equivalent to {0,}
?             Matches the preceding element 0 or 1 time; equivalent to {0,1}
{n}           Matches the preceding element exactly n times
{n,}          Matches the preceding element at least n times
{n,m}         Matches the preceding element n to m times
+             Matches the preceding element 1 or more times; equivalent to {1,}
\b            Matches a word boundary
^             The string must start with the pattern that follows
$             The string must end with the pattern that precedes
-----------------------------------------------
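The quantifiers can be tried out with a small helper that returns the first match of a pattern. This is only a sketch; QuantifierDemo and FirstMatch are hypothetical names introduced for illustration.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match, or null when nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }
}
```

For instance, FirstMatch(@"a{2,3}", "aaaa") returns "aaa" because quantifiers are greedy and take as many repetitions as allowed, FirstMatch(@"a{5,}", "aaaa") returns null, and FirstMatch(@"\d*", "abc") returns an empty string because * is satisfied by zero repetitions.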
Note:
(1) The characters "\", ".", "?", "*", "^", "$", "+", "(", ")", "|", "{" and "[" have special meanings in regular expressions. To use one of them literally, escape it with a backslash. For example, to require at least one "\" in the string, write the regular expression as \\+.
(2) Several metacharacters or literal characters can be enclosed in parentheses to form a group. For example, ^(13)[4-9]\d{8}$ matches an 11-digit mobile phone number that starts with 13 and whose third digit is 4 to 9.
(3) Chinese characters are matched by their Unicode code points, written as \u followed by four hexadecimal digits for a single character. For example, \u4e00 is the Chinese character "一" and \u9fa5 is the Chinese character "龥". These are the first and last code points of the range, which covers 20902 Chinese characters in total.
(4) \b marks the start or end of a word. Take "123a 345b 456 789d" as the sample string: the regular expression \b\d{3}\b matches only 456.
(5) "|" expresses "or". For example, (z|j|q) matches any one of the letters z, j and q. Inside square brackets "|" is an ordinary character, so the character-class spelling of the same set is [zjq].
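Notes (2), (3) and (4) can be checked with a few lines of code. The following is a sketch only; the class NotesDemo and its method names are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NotesDemo
{
    // Note (4): finds runs of exactly three digits that stand alone as a word.
    public static List<string> StandaloneThreeDigits(string input)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\b\d{3}\b"))
        {
            found.Add(m.Value);
        }
        return found;
    }

    // Note (2): 11 digits, starting with 13, third digit 4 to 9.
    public static bool IsMobile13(string input)
    {
        return Regex.IsMatch(input, @"^(13)[4-9]\d{8}$");
    }

    // Note (3): true when the input contains a character in U+4E00..U+9FA5.
    public static bool ContainsChinese(string input)
    {
        return Regex.IsMatch(input, @"[\u4e00-\u9fa5]");
    }
}
```

With the sample string from note (4), StandaloneThreeDigits("123a 345b 456 789d") returns a list containing only "456". IsMobile13 accepts "13812345678" and rejects "13012345678", whose third digit is outside 4 to 9.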
Regular expression grouping (this part and the following are from Zhou Gong's blog)
Enclosing part of a regular expression in () forms a group, also known as a subexpression or capture group. For example, for a time in the format "08:14:27" we can write the following regular expression:
    (0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2}
With this expression we can extract the access time from the following IIS access log (although the best tool for analyzing IIS logs is Log Parser, a Microsoft tool):
00:41:23 GET /admin_save.asp 202.108.212.39 404 1468 176
01:04:36 GET /userbuding.asp 202.108.212.39 404 1468 176
10:00:59 GET /upfile_flash.asp 202.108.212.39 404 1468 178
12:59:00 GET /cp.php 202.108.212.39 404 1468
19:23:04 GET /sqldata.php 202.108.212.39 404 1468 173
23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404 1468
23:59:59 GET /bil.html 202.108.212.39 404 1468
To analyze the IIS log above and extract the access time, the requested page, the client IP address and the server response code (which corresponds to HttpStatusCode in C#) from each entry, we can use groups.
The code is as follows:
    private string text = @"00:41:23 GET /admin_save.asp 202.108.212.39 404 1468 176
01:04:36 GET /userbuding.asp 202.108.212.39 404 1468 176
10:00:59 GET /upfile_flash.asp 202.108.212.39 404 1468 178
12:59:00 GET /cp.php 202.108.212.39 404 1468
19:23:04 GET /sqldata.php 202.108.212.39 404 1468 173
23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404 1468
23:59:59 GET /bil.html 202.108.212.39 404 1468";
    /// <summary>
    /// Analyzes the IIS log and extracts the client access time, URL, IP address and server response code
    /// </summary>
    public void AnalyzeIISLog()
    {
        // Regular expression that extracts the access time, URL, IP address and server response code.
        // The subexpression for the time is fairly complex because the time is validated strictly.
        // For simplicity the client IP format is not strictly validated, because an IIS access log
        // does not contain malformed IP addresses.
        Regex regex = new Regex(@"((0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(GET)\s([^\s]+)\s(\d{1,3}(\.\d{1,3}){3})\s(\d{3})", RegexOptions.None);
        MatchCollection matchCollection = regex.Matches(text);
        for (int i = 0; i < matchCollection.Count; i++)
        {
            Match match = matchCollection[i];
            Console.WriteLine("Match[{0}] ===========================", i);
            for (int j = 0; j < match.Groups.Count; j++)
            {
                Console.WriteLine("Groups[{0}] = {1}", j, match.Groups[j].Value);
            }
        }
    }
The output of this code is as follows:
Match[0] ===========================
Groups[0] = 00:41:23 GET /admin_save.asp 202.108.212.39 404
Groups[1] = 00:41:23
Groups[2] = 00
Groups[3] = :23
Groups[4] = GET
Groups[5] = /admin_save.asp
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
Match[1] ===========================
Groups[0] = 01:04:36 GET /userbuding.asp 202.108.212.39 404
Groups[1] = 01:04:36
Groups[2] = 01
Groups[3] = :36
Groups[4] = GET
Groups[5] = /userbuding.asp
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
Match[2] ===========================
Groups[0] = 10:00:59 GET /upfile_flash.asp 202.108.212.39 404
Groups[1] = 10:00:59
Groups[2] = 10
Groups[3] = :59
Groups[4] = GET
Groups[5] = /upfile_flash.asp
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
Match[3] ===========================
Groups[0] = 12:59:00 GET /cp.php 202.108.212.39 404
Groups[1] = 12:59:00
Groups[2] = 12
Groups[3] = :00
Groups[4] = GET
Groups[5] = /cp.php
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
Match[4] ===========================
Groups[0] = 19:23:04 GET /sqldata.php 202.108.212.39 404
Groups[1] = 19:23:04
Groups[2] = 19
Groups[3] = :04
Groups[4] = GET
Groups[5] = /sqldata.php
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
Match[5] ===========================
Groups[0] = 23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404
Groups[1] = 23:00:00
Groups[2] = 23
Groups[3] = :00
Groups[4] = GET
Groups[5] = /Evil-Skwiz.htm
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
Match[6] ===========================
Groups[0] = 23:59:59 GET /bil.html 202.108.212.39 404
Groups[1] = 23:59:59
Groups[2] = 23
Groups[3] = :59
Groups[4] = GET
Groups[5] = /bil.html
Groups[6] = 202.108.212.39
Groups[7] = .39
Groups[8] = 404
From the output above we can see that in each match, the 2nd group is the client access time (indexes start from 0, so its index is 1; the same applies below), the 6th group is the requested URL (index 5), the 7th group is the client IP address (index 6), and the 9th group is the server response code (index 8). To extract these elements we can access the values directly by index, which is much more convenient than doing the work without regular expressions.
Named capture groups
Convenient as the method above is, it has a drawback: if we later need to extract more information and add or remove capture groups, the indexes of the existing capture groups change and the code has to be modified again, which is also a form of hard-coding. Is there a better way? Yes: name the capture groups.
This is just like reading data with a DataReader or from a DataTable. We can access fields by index (again starting from 0), but if the number or order of fields in the select statement changes, code that reads data this way has to change as well. To cope with such changes, the field name can be used as the index instead; as long as the field exists in the data source, the correct value is returned regardless of order. Named capture groups play the same role in regular expressions.
Ordinary capture group: (regular expression), for example (\d{8,11})
Named capture group: (?<name>regular expression), for example (?<phone>\d{8,11})
The value of an ordinary capture group can only be obtained by index, whereas a named capture group can also be accessed by name. For (?<phone>\d{8,11}), for example, the code can use match.Groups["phone"], which makes the code more readable and more flexible. For the IIS log analysis above, the code using named capture groups is as follows:
    private string text = @"00:41:23 GET /admin_save.asp 202.108.212.39 404 1468 176
01:04:36 GET /userbuding.asp 202.108.212.39 404 1468 176
10:00:59 GET /upfile_flash.asp 202.108.212.39 404 1468 178
12:59:00 GET /cp.php 202.108.212.39 404 1468
19:23:04 GET /sqldata.php 202.108.212.39 404 1468 173
23:00:00 GET /Evil-Skwiz.htm 202.108.212.39 404 1468
23:59:59 GET /bil.html 202.108.212.39 404 1468";
    /// <summary>
    /// Uses named capture groups to extract information from the IIS log
    /// </summary>
    public void AnalyzeIISLog2()
    {
        Regex regex = new Regex(@"(?<time>(0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(GET)\s(?<url>[^\s]+)\s(?<ip>\d{1,3}(\.\d{1,3}){3})\s(?<httpCode>\d{3})", RegexOptions.None);
        MatchCollection matchCollection = regex.Matches(text);
        for (int i = 0; i < matchCollection.Count; i++)
        {
            Match match = matchCollection[i];
            Console.WriteLine("Match[{0}] ===========================", i);
            Console.WriteLine("time: {0}", match.Groups["time"]);
            Console.WriteLine("url: {0}", match.Groups["url"]);
            Console.WriteLine("ip: {0}", match.Groups["ip"]);
            Console.WriteLine("httpCode: {0}", match.Groups["httpCode"]);
        }
    }
The output of this code is as follows:
Match[0] ===========================
time: 00:41:23
url: /admin_save.asp
ip: 202.108.212.39
httpCode: 404
Match[1] ===========================
time: 01:04:36
url: /userbuding.asp
ip: 202.108.212.39
httpCode: 404
Match[2] ===========================
time: 10:00:59
url: /upfile_flash.asp
ip: 202.108.212.39
httpCode: 404
Match[3] ===========================
time: 12:59:00
url: /cp.php
ip: 202.108.212.39
httpCode: 404
Match[4] ===========================
time: 19:23:04
url: /sqldata.php
ip: 202.108.212.39
httpCode: 404
Match[5] ===========================
time: 23:00:00
url: /Evil-Skwiz.htm
ip: 202.108.212.39
httpCode: 404
Match[6] ===========================
time: 23:59:59
url: /bil.html
ip: 202.108.212.39
httpCode: 404
With named capture groups, accessing the captured values is more intuitive. As long as the group names do not change, other changes to the expression do not affect existing code.
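Named groups also make it straightforward to copy the captured values into a typed object. The following is a minimal sketch of that idea for a single log line; the types LogEntry and IisLogLine and the method Parse are hypothetical names introduced here, not part of the original article.

```csharp
using System.Text.RegularExpressions;

// Hypothetical container for the four values extracted from one log line.
public sealed class LogEntry
{
    public string Time;
    public string Url;
    public string Ip;
    public int HttpCode;
}

public static class IisLogLine
{
    private static readonly Regex Line = new Regex(
        @"(?<time>(0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(GET)\s" +
        @"(?<url>[^\s]+)\s(?<ip>\d{1,3}(\.\d{1,3}){3})\s(?<httpCode>\d{3})");

    // Returns null when the line does not look like a log entry.
    public static LogEntry Parse(string line)
    {
        Match m = Line.Match(line);
        if (!m.Success)
        {
            return null;
        }
        return new LogEntry
        {
            Time = m.Groups["time"].Value,
            Url = m.Groups["url"].Value,
            Ip = m.Groups["ip"].Value,
            // The group only matches three digits, so int.Parse is safe for ASCII input.
            HttpCode = int.Parse(m.Groups["httpCode"].Value)
        };
    }
}
```

Because every value is looked up by name, adding or reordering groups in the pattern would not require any change to Parse.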
Non-capturing groups
If you often read other people's regular expressions, you may have seen the form (?:subexpression). This is a non-capturing group. A capture group lets later code access the matched value by index or by name (for a named capture group), because the value is kept in memory during matching. If we do not need the matched value later, we can tell the engine not to keep it, which improves efficiency and reduces memory consumption; this is what a non-capturing group is for. For example, when analyzing the IIS log we do not care about the client's request method, so a non-capturing group can be used there:
    Regex regex = new Regex(@"(?<time>(0[0-9]|1[0-9]|2[0-3])(:[0-5][0-9]){2})\s(?:GET)\s(?<url>[^\s]+)\s(?<ip>\d{1,3}(\.\d{1,3}){3})\s(?<httpCode>\d{3})");
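One way to see the effect of a non-capturing group is to count the groups a match exposes. The sketch below compares a capturing and a non-capturing spelling of a shortened pattern; NonCapturingDemo and its methods are illustrative names only.

```csharp
using System.Text.RegularExpressions;

public static class NonCapturingDemo
{
    // Number of groups in the match, including group 0 (the whole match).
    public static int GroupCount(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    // With (?:GET) the URL becomes group 1 instead of group 2.
    public static string UrlOf(string input)
    {
        return Regex.Match(input, @"(?:GET)\s(\S+)").Groups[1].Value;
    }
}
```

For the input "GET /cp.php", GroupCount returns 3 for the pattern (GET)\s(\S+) and 2 for (?:GET)\s(\S+), and UrlOf returns "/cp.php". Keep in mind that switching a group to non-capturing shifts the indexes of the unnamed groups after it, which is another reason to prefer names.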
Zero-width assertions
Zero-width assertions go by many names, including lookaround and lookahead; here I use the MSDN terminology. There are four kinds:
(?=subexpression): zero-width positive lookahead assertion. Matching continues only if the subexpression matches to the right of the current position. For example, 19(?=99) matches a 19 that is followed by 99.
(?!subexpression): zero-width negative lookahead assertion. Matching continues only if the subexpression does not match to the right of the current position. For example, 19(?!99) matches a 19 that is not followed by 99, so it does not match in 1999.
(?<=subexpression): zero-width positive lookbehind assertion. Matching continues only if the subexpression matches to the left of the current position. For example, (?<=19)99 matches a 99 that follows 19. This construct does not backtrack.
(?<!subexpression): zero-width negative lookbehind assertion. Matching continues only if the subexpression does not match to the left of the current position. For example, (?<!19)99 matches a 99 that is not immediately preceded by 19, such as the 99 in 2099.
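The four assertions can be compared side by side by looking at where the first match starts. This is a sketch under the assumption that a helper returning the match index is enough to show the behavior; LookaroundDemo and IndexOf are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the index of the first match, or -1 when there is none.
    public static int IndexOf(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Index : -1;
    }
}
```

With the patterns from the list: 19(?=99) is found at index 0 in "1999" and not at all in "1984"; 19(?!99) is found at index 0 in "1984" and not at all in "1999"; (?<=19)99 is found at index 2 in "1999" and not at all in "2099"; (?<!19)99 is found at index 2 in "2099". One subtlety worth noting: (?<!19)99 is also found in "1999", at index 1, because the 99 that starts there is preceded only by "1". The assertion checks just the characters immediately before the candidate position.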
Regular expression options
Besides passing a RegexOptions value to give a regular expression additional options, you can also specify these options inside the expression itself, for example:
    Regex regex = new Regex("(?i)def");
This is equivalent to the following statement:
    Regex regex = new Regex("def", RegexOptions.IgnoreCase);
The (?i) form is called inline mode; as the name suggests, the option is written into the regular expression itself. The inline characters correspond to RegexOptions values as follows:
IgnoreCase: inline character i. Specifies case-insensitive matching.
Multiline: inline character m. Specifies multiline mode, which changes the meaning of ^ and $ so that they match at the beginning and end of any line, not just the beginning and end of the entire string.
ExplicitCapture: inline character n. Specifies that the only valid captures are groups that are explicitly named or numbered in the form (?<name>...). This lets unnamed parentheses act as non-capturing groups without the clumsy (?:...) syntax.
Singleline: inline character s. Specifies single-line mode, which changes the meaning of the period (.) so that it matches every character (instead of every character except \n).
IgnorePatternWhitespace: inline character x. Specifies that unescaped white space is excluded from the pattern and enables comments introduced by the number sign (#). Note that white space is never removed from within a character class.
Example:
    RegexOptions option = RegexOptions.IgnoreCase | RegexOptions.Singleline;
    Regex regex = new Regex("def", option);
Inline form:
    Regex regex = new Regex("(?is)def");
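The effect of a few of these options can be checked with short helpers. The sketch below is illustrative only; OptionsDemo and its method names are assumptions, and the multiline example assumes lines separated by \n.

```csharp
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Inline spelling of IgnoreCase.
    public static bool InlineIgnoreCase(string input)
    {
        return new Regex("(?i)def").IsMatch(input);
    }

    // RegexOptions spelling of the same option.
    public static bool OptionIgnoreCase(string input)
    {
        return new Regex("def", RegexOptions.IgnoreCase).IsMatch(input);
    }

    // Counts lines made up only of digits; (?m) lets ^ and $ match
    // at the start and end of each \n-separated line.
    public static int CountDigitLines(string input)
    {
        return Regex.Matches(input, @"(?m)^\d+$").Count;
    }

    // With (?s) the period also matches \n.
    public static bool DotMatchesNewline(string input)
    {
        return Regex.IsMatch(input, "(?s)a.b");
    }
}
```

Both InlineIgnoreCase("ABCDEF") and OptionIgnoreCase("ABCDEF") return true, whereas the plain pattern def would not match that input. CountDigitLines("12\nab\n34") returns 2, and DotMatchesNewline("a\nb") returns true.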