Magical escaping of Regular Expressions

Source: Internet
Author: User

1. Overview
This can be a confusing, even bewildering, topic, but it is one that needs to be discussed.
In a regular expression, some special characters or character sequences are called metacharacters. For example, "?" indicates that the preceding subexpression matches zero or one time, and "(?i)" turns on case-insensitive matching. When these metacharacters need to match themselves, they must be escaped.
In different languages or application scenarios, regular expressions are declared differently and metacharacters appear in different places, so the ways of escaping them vary as well.
2. Character escaping in .NET regular expressions
2.1 The escape character in .NET regular expressions
In most languages, "\" is the escape character used to escape special characters or character sequences. For example, "\n" is a line break and "\t" is a horizontal tab. When this string-level escaping is combined with regular expressions, some unexpected things happen.
This discussion was prompted by a regular expression question in C#.
The code is as follows:
// Requires: using System.Text.RegularExpressions;
string[] test = new string[] { "\\", "\\\\" };
Regex reg = new Regex("^\\\\$");
foreach (string s in test)
{
    richTextBox2.Text += "Source string: " + s.PadRight(5, ' ') + "Match result: " + reg.IsMatch(s) + "\n";
}
/*--------Output--------
Source string: \    Match result: True
Source string: \\   Match result: False
*/

Some readers may be puzzled by this result. Doesn't "\\" in a string literal represent a single escaped "\" character, and "\\\\" two of them? Then shouldn't the first match result be False and the second True?
This is not easy to see head-on, so let's approach it from another angle.
For example, suppose the string to be matched is
string test = "(";
How should the regular expression be written? Because "(" has a special meaning in a regular expression, it must be escaped there as "\(". And in a string literal, "\\" is needed to represent the "\" itself, so the declaration is
Regex reg = new Regex("^\\($");
If that makes sense, now replace "(" with "\". The regular expression must escape it as "\\", and the string literal needs "\\" for each of those two "\" characters, so the declaration is
Regex reg = new Regex("^\\\\$");
This analysis shows that, in a regular expression declared as a string, "\\\\" actually matches a single "\" character. To summarize the relationships:
String output to the console or interface: \
String declared in the program: string test = "\\";
Regular expression declared in the program: Regex reg = new Regex("^\\\\$");
Does this explanation make sense? Does it feel clumsy? It is: regular expressions declared as strings in a program really are this clumsy once escape characters are involved.
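The three layers can also be checked directly. The following is a minimal sketch in JavaScript, whose string literals treat "\\" the same way as normal C# string literals; the variable names are made up for illustration. It prints the length of each string so the number of real characters at each layer is visible.

```javascript
// Three layers: the character itself, the string literal, the regex pattern.
const backslash = "\\";        // string literal for ONE backslash character
const patternText = "^\\\\$";  // string literal for the 4-character pattern ^\\$
const re = new RegExp(patternText);

console.log(backslash.length);   // 1
console.log(patternText.length); // 4
console.log(re.test("\\"));      // true  (one backslash)
console.log(re.test("\\\\"));    // false (two backslashes)
```

The string layer halves the backslashes once (eight typed characters become four), and the regex engine halves them again (the two backslashes in the pattern match one).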
For this reason, C# provides another way to declare strings: put "@" before the string and backslash escape sequences are no longer processed.
The code is as follows:
string[] test = new string[] { @"\", @"\\" };
Regex reg = new Regex(@"^\\$");
foreach (string s in test)
{
    richTextBox2.Text += "Source string: " + s.PadRight(5, ' ') + "Match result: " + reg.IsMatch(s) + "\n";
}
/*--------Output--------
Source string: \    Match result: True
Source string: \\   Match result: False
*/

This is much more concise and matches what most people expect.
It does bring another problem, though: escaping double quotation marks. In a normal string declaration, a double quotation mark is escaped as \".
string test = "<a href=\"www.test.com\">only a test</a>";
However, once "@" is placed before the string, "\" is treated as the "\" character itself, so \" no longer escapes a double quotation mark; it must be written as two double quotation marks ("") instead.
string test = @"<a href=""www.test.com"">only a test</a>";
In VB.NET, regular expressions can be declared in only one form, which behaves like the C# declaration prefixed with "@".
The code is as follows:
Dim test As String() = New String() {"\", "\\"}
Dim reg As Regex = New Regex("^\\$")
For Each s As String In test
    RichTextBox2.Text += "Source string: " & s.PadRight(5, " "c) & "Match result: " & reg.IsMatch(s) & vbCrLf
Next
'--------Output--------
'Source string: \    Match result: True
'Source string: \\   Match result: False
'--------------------

2.2 Metacharacters that need escaping in .NET regular expressions
According to MSDN, the following characters are metacharacters in regular expressions and must be escaped to match themselves.
. $ ^ { [ ( | ) * + ? \
In practice, however, it depends on the context: these characters do not always need to be escaped, and characters outside this list sometimes do.
When a regular expression is written by hand, its author can usually handle the escaping of these characters correctly. Special care is needed when a regular expression is generated dynamically; otherwise, when a variable contains metacharacters, the generated regular expression may throw an exception when it is compiled. Fortunately, .NET provides the Regex.Escape method to handle this. For example, to extract the content of the div tag that has a dynamically obtained id:
string id = Regex.Escape(textBox1.Text);
Regex reg = new Regex(@"(?is)<div(?:(?!id=).)*id=(['""]?)" + id + @"\1[^>]*>(?><div[^>]*>(?<o>)|</div>(?<-o>)|(?:(?!</?div\b).)*)*(?(o)(?!))</div>");
Without the escaping step, a dynamically obtained id such as "abc(def" causes an exception to be thrown at run time.
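The effect of escaping dynamic input can be sketched in JavaScript as well. The helper below, escapeRegExp, is a hypothetical stand-in written for this illustration (it is not Regex.Escape and not part of the original article); the pattern is also much simpler than the div-matching expression above.

```javascript
// Hypothetical helper standing in for .NET's Regex.Escape: it puts a
// backslash in front of every character that is special in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const id = "abc(def"; // assumed user input containing a metacharacter

// Unescaped, the "(" opens a group that is never closed.
let unescapedError = null;
try {
  new RegExp("id=(['\"]?)" + id + "\\1");
} catch (e) {
  unescapedError = e;
}
console.log(unescapedError instanceof SyntaxError); // true

// Escaped, the id is matched literally.
const safe = new RegExp("id=(['\"]?)" + escapeRegExp(id) + "\\1");
console.log(safe.test('<div id="abc(def">')); // true
```

Escaping at the point where the variable is concatenated keeps the rest of the pattern untouched, so only the user-supplied part is forced to be literal.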
2.3 Escaping in character groups in .NET regular expressions
In a character group [], metacharacters generally do not need to be escaped; even "[" does not need to be escaped.
The code is as follows:
string test = @"the test string: .$^{[(|)*+?\";
Regex reg = new Regex(@"[.$^{[(|)*+?\\]");
MatchCollection mc = reg.Matches(test);
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n";
}
/*--------Output--------
.
$
^
{
[
(
|
)
*
+
?
\
*/
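JavaScript regular expressions (without the "u" or "v" flag) treat these characters the same way inside a character group, so the behaviour is easy to try out in a browser console or Node.js. This is only a sketch that mirrors the C# sample above, with the backslash as the one escaped character.

```javascript
const text = "the test string: .$^{[(|)*+?\\";
// Inside the group only the backslash is escaped; "^" is not first, so it is literal.
const re = /[.$^{[(|)*+?\\]/g;
const found = text.match(re);
console.log(found.length);    // 12
console.log(found.join(" ")); // . $ ^ { [ ( | ) * + ? \
```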

Even so, when writing regular expressions it is recommended to escape it as "\[" inside a character group. Regular expressions are already abstract and hard to read, and an unescaped "[" inside a character group makes them harder still. In addition, incorrect nesting can make the regular expression fail to compile. The following regular expression throws an exception when compiled.
Regex reg = new Regex(@"[.$^{[(]|)*+?\\]");
That said, .NET character groups do support set subtraction, and in that legitimate syntax character groups are nested.
The code is as follows:
string test = @"abcdefghijklmnopqrstuvwxyz";
Regex reg = new Regex(@"[a-z-[aeiou]]+");
MatchCollection mc = reg.Matches(test);
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n";
}
/*--------Output--------
bcd
fgh
jklmn
pqrst
vwxyz
*/

This syntax is hard to read and rarely used. Even when such a requirement comes up, it can be met in other ways, so it is enough to know that it exists without studying it further.
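As one possible "other way", the same consonant runs can be matched without set subtraction, either with a negative lookahead in front of the range or by listing the remaining ranges explicitly. The sketch below uses JavaScript, which has no subtraction syntax in its classic regex mode; both patterns are illustrations, not the article's code.

```javascript
const alphabet = "abcdefghijklmnopqrstuvwxyz";

// (?![aeiou]) rejects a vowel before [a-z] consumes one character.
const withLookahead = alphabet.match(/(?:(?![aeiou])[a-z])+/g);
// The same set written out as explicit ranges.
const withRanges = alphabet.match(/[b-df-hj-np-tv-z]+/g);

console.log(withLookahead.join(",")); // bcd,fgh,jklmn,pqrst,vwxyz
console.log(withRanges.join(","));    // bcd,fgh,jklmn,pqrst,vwxyz
```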
Returning to the topic of escaping: the only character that must always be escaped in a character group is "\", and it is strongly recommended to escape "[" and "]" whenever they appear in one. Two more characters, "^" and "-", also need to be escaped when they appear at particular positions in a character group and are meant to match themselves.
When "^" appears at the start of a character group, it denotes a negated character group: "[^Char]" matches any character other than those listed in the group. For example, "[^0-9]" matches any character except a digit. Therefore, to match the character "^" itself in a character group, either do not place it at the start of the group or escape it as "\^".
Regex reg1 = new Regex(@"[0-9^]");
Regex reg2 = new Regex(@"[\^0-9]");
Both expressions match any digit or the ordinary character "^".
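The position rule for "^" works the same way in JavaScript, so it can be checked with a short sketch. The anchors "^" and "$" outside the brackets are added here only so that each test covers the whole one-character string; the third pattern is included for contrast.

```javascript
const notFirst = /^[0-9^]$/; // "^" is not first inside the group, so it is literal
const escaped = /^[\^0-9]$/; // "^" is first but escaped
const negated = /^[^0-9]$/;  // unescaped and first: negates the group

for (const ch of ["7", "^", "a"]) {
  console.log(ch, notFirst.test(ch), escaped.test(ch), negated.test(ch));
}
// 7 true true false
// ^ true true true
// a false false true
```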
As for the special role of "-" in character groups, consider an example.
The code is as follows:
string test = @"$";
Regex reg = new Regex(@"[#-*%&]");
richTextBox2.Text = "Match result: " + reg.IsMatch(test);
/*--------Output--------
Match result: True
*/

There is no "$" in the regular expression, so why is the match result True?
Inside [], a hyphen "-" that connects two characters denotes a character range. Note that the two characters around "-" are ordered: within the same encoding, the code point of the character after the hyphen must be greater than or equal to that of the character before it.
The code is as follows:
for (int i = '#'; i <= '*'; i++)
{
    richTextBox2.Text += (char)i + "\n";
}
/*--------Output--------
#
$
%
&
'
(
)
*
*/

Because "#" and "*" satisfy this requirement, "[#-*]" denotes a character range, and that range contains "$", so the regular expression above matches "$". To have "-" treated as an ordinary character, either move it to a different position or escape it.
Regex reg1 = new Regex(@"[#*%&-]");
Regex reg2 = new Regex(@"[#\-*%&]");
Both forms match any one of the characters listed in the character group.
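The accidental range and its two fixes can be compared side by side. This is a small JavaScript sketch (the hyphen rules inside a character group are the same there); the variable names are only for illustration.

```javascript
const range = /[#-*%&]/;          // "#-*" is a range covering # $ % & ' ( ) *
const hyphenLast = /[#*%&-]/;     // "-" at the end is literal
const hyphenEscaped = /[#\-*%&]/; // "-" escaped is literal

console.log(range.test("$"));         // true
console.log(hyphenLast.test("$"));    // false
console.log(hyphenEscaped.test("$")); // false
console.log(range.test("-"));         // false: the range does not include "-"
console.log(hyphenLast.test("-"), hyphenEscaped.test("-")); // true true
```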
There is one more special escape in character groups: "\b". At an ordinary position in a regular expression, "\b" denotes a word boundary, that is, a position with a word character on one side and none on the other. Inside a character group, "\b" denotes the backspace character, the same as "\b" in an ordinary string.
Similarly, an easily overlooked character is "|". At an ordinary position in a regular expression, "|" means "or" between the expressions on its left and right. Inside a character group it is just the character "|" with no special meaning, so trying to use "|" inside a character group to express "or" is a mistake. For example, the regular expression "[a|b]" matches any one of "a", "b", and "|", not "a" or "b".
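Both points can be verified with a few one-line tests. The sketch below is in JavaScript, where "\b" and "|" change meaning inside a character group in the same way; the sample strings are made up for illustration.

```javascript
const text = "one line.";
console.log(/line\b/.test(text));       // true: word boundary after "line"
console.log(/line[\b]/.test(text));     // false: needs a backspace after "line"
console.log(/line[\b]/.test("line\b")); // true: this string ends with a backspace (U+0008)

console.log(/^[a|b]$/.test("|"));       // true: "|" is literal inside the group
console.log(/^(?:a|b)$/.test("|"));     // false: here "|" means alternation
```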
2.4 Escaping invisible characters in .NET regular expressions
Some invisible characters have to be written in strings with escape sequences; common ones are "\r", "\n", and "\t". Using these characters in regular expressions gets a little magical. Let's look at some code first.
The code is as follows:
string test = "one line. \n another line.";
List<Regex> list = new List<Regex>();
list.Add(new Regex("\n"));
list.Add(new Regex("\\n"));
list.Add(new Regex(@"\n"));
list.Add(new Regex(@"\\n"));
foreach (Regex reg in list)
{
    richTextBox2.Text += "Regular expression: " + reg.ToString();
    MatchCollection mc = reg.Matches(test);
    foreach (Match m in mc)
    {
        richTextBox2.Text += "  Matched content: " + m.Value + "  Match start position: " + m.Index + "  Match length: " + m.Length;
    }
    richTextBox2.Text += "  Total matches: " + reg.Matches(test).Count + "\n----------------\n";
}
/*--------Output--------
Regular expression:
  Matched content:
  Match start position: 10  Match length: 1  Total matches: 1
----------------
Regular expression: \n  Matched content:
  Match start position: 10  Match length: 1  Total matches: 1
----------------
Regular expression: \n  Matched content:
  Match start position: 10  Match length: 1  Total matches: 1
----------------
Regular expression: \\n  Total matches: 0
----------------
*/

As you can see, although the first three regular expressions print differently, they produce identical results. Only the last one finds no match.
Regular expression 1, Regex("\n"), is really just declared as an ordinary string, on the same principle as Regex("a") matching the character "a". The newline is produced by the string literal and is not an escape handled by the regex engine.
Regular expression 2, Regex("\\n"), declares the regex escape in string form. Just as "\\\\" in a string-declared regular expression is equivalent to "\\" in a string, "\\n" in the regular expression is equivalent to "\n" in a string; here the escape is handled by the regex engine.
Regular expression 3, Regex(@"\n"), is equivalent to regular expression 2; it is the form written with "@" before the string.
Regular expression 4, Regex(@"\\n"), actually represents the character "\" followed by the character "n". These are two literal characters, which naturally cannot be found in the source string.
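The four declarations have close JavaScript counterparts: a RegExp built from a string goes through the string layer first, while a regex literal does not, much like a C# string prefixed with "@". The sketch below assumes that correspondence and reports where each pattern first matches.

```javascript
const text = "one line. \n another line.";
const patterns = [
  new RegExp("\n"),  // 1: the pattern text is a real newline character
  new RegExp("\\n"), // 2: the pattern text is backslash + n, a regex escape
  /\n/,              // 3: same pattern as 2, without the string layer
  /\\n/,             // 4: an escaped backslash followed by "n"
];
// search() returns the index of the first match, or -1 when there is none.
const indexes = patterns.map((re) => text.search(re));
console.log(indexes.join(", ")); // 10, 10, 10, -1
```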
"\b" needs special attention here, because its meaning changes with the way it is declared.
The code is as follows:
string test = "one line. \n another line.";
List<Regex> list = new List<Regex>();
list.Add(new Regex("line\b"));
list.Add(new Regex("line\\b"));
list.Add(new Regex(@"line\b"));
list.Add(new Regex(@"line\\b"));
foreach (Regex reg in list)
{
    richTextBox2.Text += "Regular expression: " + reg.ToString() + "\n";
    MatchCollection mc = reg.Matches(test);
    foreach (Match m in mc)
    {
        richTextBox2.Text += "Matched content: " + m.Value + "  Match start position: " + m.Index + "  Match length: " + m.Length + "\n";
    }
    richTextBox2.Text += "Total matches: " + reg.Matches(test).Count + "\n----------------\n";
}
/*--------Output--------
Regular expression: line_
Total matches: 0
----------------
Regular expression: line\b
Matched content: line  Match start position: 4  Match length: 4
Matched content: line  Match start position: 20  Match length: 4
Total matches: 2
----------------
Regular expression: line\b
Matched content: line  Match start position: 4  Match length: 4
Matched content: line  Match start position: 20  Match length: 4
Total matches: 2
----------------
Regular expression: line\\b
Total matches: 0
----------------
*/

Regular expression 1, Regex("line\b"): here "\b" is the backspace character produced by the string literal, not an escape handled by the regex engine. The source string contains no backspace, so there are 0 matches.
Regular expression 2, Regex("line\\b"), declares the regex escape in string form. Here "\b" reaches the regex engine, which interprets it as a word boundary.
Regular expression 3, Regex(@"line\b"), is equivalent to regular expression 2 and denotes a word boundary.
Regular expression 4, Regex(@"line\\b"), actually represents the character "\" followed by the character "b". These are two literal characters after "line", which cannot be found in the source string.
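The same four cases can be reproduced in JavaScript, again treating a regex literal as the counterpart of a C# string prefixed with "@". This is a sketch under that assumption; the count helper is made up for the example.

```javascript
const text = "one line. \n another line.";
// Hypothetical helper: number of matches of a global regex in the text.
const count = (re) => (text.match(re) || []).length;

console.log(count(new RegExp("line\b", "g")));  // 0: "\b" in the string is a backspace character
console.log(count(new RegExp("line\\b", "g"))); // 2: the engine sees \b, a word boundary
console.log(count(/line\b/g));                  // 2
console.log(count(/line\\b/g));                 // 0: literal backslash followed by "b"
```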
2.5 Other escape handling in .NET regular expressions
There are a few other kinds of escaping in .NET regular expressions. They are not used often, but they are worth a brief mention.
Requirement: add "$" before the number between "<" and ">" in the string.
The code is as follows:
string test = "one test <123>, another test <321>";
Regex reg = new Regex(@"<(\d+)>");
string result = reg.Replace(test, "<$$1>");
richTextBox2.Text = result;
/*--------Output--------
one test <$1>, another test <$1>
*/
You may be surprised to find that the result does not add "$" before each number; instead, every number is replaced with "$1".
Why? In a replacement string, "$" has a special meaning: followed by a number, it refers to the text matched by the capture group with that number. Sometimes the replacement result must contain a literal "$" followed by a number, and in that case "$$" is used to escape it. The example above gives the unexpected result precisely because of this escaping. One way to avoid the problem is to write the replacement so that it contains no reference to a capture group.
string test = "one test <123>, another test <321>";
Regex reg = new Regex(@"(?<=<)(?=\d+>)");
string result = reg.Replace(test, "$");
richTextBox2.Text = result;
/*--------Output--------
one test <$123>, another test <$321>
*/
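JavaScript replacement strings use the same "$1" and "$$" notation, so the pitfall and the workaround can be sketched there too. The last call is an extra illustration of keeping the group reference by writing "$$" and then "$1"; it is shown for JavaScript only and is not taken from the article.

```javascript
const text = "one test <123>, another test <321>";

// "$$" is an escaped "$", so "<$$1>" inserts the literal text "<$1>".
const escapedDollar = text.replace(/<(\d+)>/g, "<$$1>");
console.log(escapedDollar); // one test <$1>, another test <$1>

// Zero-width lookarounds: nothing is captured, and "$$" inserts one "$".
const viaLookaround = text.replace(/(?<=<)(?=\d+>)/g, "$$");
console.log(viaLookaround); // one test <$123>, another test <$321>

// Keeping the group reference: "$$" for the dollar sign, then "$1" for the digits.
const viaGroup = text.replace(/<(\d+)>/g, "<$$$1>");
console.log(viaGroup); // one test <$123>, another test <$321>
```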

3. Escape characters in JavaScript and Java
When regular expressions are declared as strings, escape characters in JavaScript and Java are handled in basically the same way as in .NET.
In JavaScript, a regular expression declared as a string behaves as it does in C# and is just as clumsy.
The code is as follows:
<script type="text/javascript">
    var data = ["\\", "\\\\"];
    var reg = new RegExp("^\\\\$", "");
    for (var i = 0; i < data.length; i++)
    {
        document.write("Source string: " + data[i] + "  Match result: " + reg.test(data[i]) + "<br />");
    }
</script>
/*--------Output--------
Source string: \  Match result: true
Source string: \\  Match result: false
*/

Although JavaScript does not provide the "@" string declaration of C#, it does provide a declaration form dedicated to regular expressions.
The code is as follows:
<script type="text/javascript">
    var data = ["\\", "\\\\"];
    var reg = /^\\$/;
    for (var i = 0; i < data.length; i++)
    {
        document.write("Source string: " + data[i] + "  Match result: " + reg.test(data[i]) + "<br />");
    }
</script>
/*--------Output--------
Source string: \  Match result: true
Source string: \\  Match result: false
*/

In JavaScript, the declaration
var reg = /Expression/igm;
also simplifies regular expressions that contain escape characters.
Of course, when a regular expression is declared in this form, "/" naturally becomes a metacharacter and must be escaped when it appears in the expression. For example, a regular expression that matches the domain name in a link:
var reg = /http:\/\/([^\/]+)/ig;
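A short usage sketch may help here. It compares the literal form with the string form of the same pattern; the sample link text is an assumption made for the example, and the "g" flag is left out because only the first match is read.

```javascript
const link = '<a href="http://www.test.com/index.html">only a test</a>';

// Literal form: every "/" inside the pattern is escaped.
const literal = /http:\/\/([^\/]+)/i;
// String form: "/" is not a delimiter here, so it needs no escaping.
const fromString = new RegExp("http://([^/]+)", "i");

console.log(literal.exec(link)[1]);    // www.test.com
console.log(fromString.exec(link)[1]); // www.test.com
```

The trade-off runs in both directions: the literal form saves the doubled backslashes, while the string form saves the escaped slashes.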
Unfortunately, Java currently provides only one way to declare a regular expression: as a string.
The code is as follows:
// Requires: import java.util.regex.Pattern;
String test[] = new String[] { "\\", "\\\\" };
String reg = "^\\\\$";
for (int i = 0; i < test.length; i++)
{
    System.out.println("Source string: " + test[i] + "  Match result: " + Pattern.compile(reg).matcher(test[i]).find());
}
/*--------Output--------
Source string: \  Match result: true
Source string: \\  Match result: false
*/

We can only hope that later Java versions improve on this.
