Regular Expression Basics: The Magic of Escaping

1 Overview
This may be a tangled, even confusing topic, but that is exactly why it needs to be discussed.
In a regular expression, characters or character sequences that carry special meaning are called metacharacters. For example, "?" means that the subexpression it modifies matches 0 or 1 times, and "(?i)" turns on case-insensitive matching. When such metacharacters are required to match themselves, they must be escaped.
How a regular expression is defined and where metacharacters appear vary between languages and application scenarios, so the ways of escaping vary as well.
2 Character escaping in .NET regular expressions
2.1 The escape character in .NET regular expressions
In most languages, "\" is used as the escape character for characters or character sequences with special meaning: "\n" represents a line break, "\t" a horizontal tab, and so on. When this escaping is applied to regular expressions, some unexpected things happen.
The topic grew out of a regular expression question in C#:

string[] test = new string[] { "\\", "\\\\" };
Regex reg = new Regex("^\\\\$");
foreach (string s in test)
{
    richTextBox2.Text += "Source string: " + s.PadRight(5, ' ') + "Match result: " + reg.IsMatch(s) + "\n";
}
/*--------Output--------
Source string: \    Match result: True
Source string: \\   Match result: False
*/

Some readers may find this result puzzling. Doesn't "\\" in a string represent one escaped "\" character, and "\\\\" two escaped "\" characters? Shouldn't the regular expression above therefore give False for the first string and True for the second?
A direct explanation of this question is hard to follow, so let us approach it from a different angle.
Suppose the character to match is this:
string test = "(";
How should the regular expression be written? Because "(" has special meaning in a regular expression, it must be escaped in the pattern as "\(". In a string literal, in turn, "\\" is needed to represent "\" itself, so the declaration is:
Regex reg = new Regex("^\\($");
Once this is understood, change "(" back to "\". In the same way, the pattern needs "\\" to match a literal "\", and in a string literal each of those two "\" characters must itself be written as "\\", that is:
Regex reg = new Regex("^\\\\$");
This analysis shows that, in a regular expression declared as an ordinary string, "\\\\" actually matches a single "\" character. To summarize the relationship:
String output to the console or UI: \
String declared in the program: string test = "\\";
Regular expression declared in the program: Regex reg = new Regex("^\\\\$");
Is this explanation understandable, and does it still feel awkward? It does: a regular expression declared as an ordinary string in a program is clumsy when it contains escape characters.
For this reason C# offers another form of string declaration: prefix the string with "@" and escape sequences in it are ignored.

string[] test = new string[] { @"\", @"\\" };
Regex reg = new Regex(@"^\\$");
foreach (string s in test)
{
    richTextBox2.Text += "Source string: " + s.PadRight(5, ' ') + "Match result: " + reg.IsMatch(s) + "\n";
}
/*--------Output--------
Source string: \    Match result: True
Source string: \\   Match result: False
*/

This is much simpler and is in line with the usual understanding.
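One way to convince yourself that the two declarations describe the same pattern is to compare the strings the compiler actually produces. The following is only a sketch: it is a console program rather than the WinForms snippet used in this article, and the names LiteralDemo and SamePattern are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class LiteralDemo
{
    // Pattern written as a regular string literal: every backslash is doubled.
    public const string Regular = "^\\\\$";
    // The same pattern written as a verbatim literal: backslashes are taken as-is.
    public const string Verbatim = @"^\\$";

    // True when both literals produce the same four characters: ^ \ \ $
    public static bool SamePattern()
    {
        return Regular == Verbatim && Regular.Length == 4;
    }

    public static void Main()
    {
        Console.WriteLine(SamePattern());                   // True
        Console.WriteLine(Regex.IsMatch("\\", Verbatim));   // True: a single backslash
        Console.WriteLine(Regex.IsMatch(@"\\", Verbatim));  // False: two backslashes
    }
}
```

Both literals end up as the four-character pattern ^\\$, which is what the regular expression engine actually sees.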
However, this raises another problem: escaping double quotes. In an ordinary string declaration, a double quote can be escaped with "\":
string test = "<a href=\"www.test.com\">only a test</a>";
But once "@" is added in front of the string, "\" is treated as the "\" character itself, so it can no longer escape a double quote. Instead, a double quote is escaped by doubling it ("").
string test = @"<a href=""www.test.com"">only a test</a>";
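As a quick check, the two quoting styles can be compared directly. This is a minimal console sketch; the class name QuoteDemo is hypothetical.

```csharp
using System;

// Hypothetical demo class; the name is illustrative only.
public static class QuoteDemo
{
    // Regular literal: the double quote is escaped with a backslash.
    public const string Regular = "<a href=\"www.test.com\">only a test</a>";
    // Verbatim literal: the double quote is escaped by doubling it.
    public const string Verbatim = @"<a href=""www.test.com"">only a test</a>";

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim); // True
        Console.WriteLine(Verbatim);            // <a href="www.test.com">only a test</a>
    }
}
```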
In VB.NET there is only one form of regular expression declaration, and it behaves like the C# form prefixed with "@":

Dim test As String() = New String() {"\", "\\"}
Dim reg As Regex = New Regex("^\\$")
For Each s As String In test
    RichTextBox2.Text += "Source string: " & s.PadRight(5, " "c) & "Match result: " & reg.IsMatch(s) & vbCrLf
Next
'--------Output--------
'Source string: \    Match result: True
'Source string: \\   Match result: False
'--------------------

2.2 Metacharacters that need escaping in .NET regular expressions
According to MSDN, the following characters act as metacharacters in a regular expression and must be escaped to match themselves:
. $ ^ { [ ( | ) * + ? \
In practice, however, this depends on the context: sometimes the characters above do not need escaping, and sometimes characters beyond this list do.
When a regular expression is written by hand, the author normally escapes these characters as a matter of course. A dynamically generated regular expression needs special care, though: if a variable contains metacharacters, the generated pattern may throw an exception when it is compiled. Fortunately, .NET provides the Regex.Escape method for this problem. For example, to extract the content of the div tag whose id is obtained dynamically:
string id = Regex.Escape(textBox1.Text);
Regex reg = new Regex(@"(?is)<div(?:(?!id=).)*id=(['""]?)" + id + @"\1[^>]*>(?><div[^>]*>(?<o>)|</div>(?<-o>)|(?:(?!</?div\b).)*)*(?(o)(?!))</div>");
Without the escaping, a dynamically obtained id such as "abc(def" causes an exception to be thrown while the program is running.
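The effect of Regex.Escape can be seen with a much smaller pattern than the div-matching one. The sketch below assumes a console program and a simplified pattern of the form id="..."; EscapeDemo, BuildSafe and RawIdThrows are hypothetical names, and the sample id is an invented input with an unbalanced parenthesis.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class EscapeDemo
{
    // Builds a pattern around a dynamically obtained id, escaping it first.
    public static Regex BuildSafe(string id)
    {
        return new Regex("id=\"" + Regex.Escape(id) + "\"");
    }

    // Returns true when using the raw id inside a pattern is rejected by the parser.
    public static bool RawIdThrows(string id)
    {
        try
        {
            new Regex("id=\"" + id + "\"");
            return false;
        }
        catch (ArgumentException) // regex parse errors are (or derive from) ArgumentException
        {
            return true;
        }
    }

    public static void Main()
    {
        string id = "abc(def"; // assumed sample input
        Console.WriteLine(Regex.Escape(id));  // abc\(def
        Console.WriteLine(RawIdThrows(id));   // True
        Console.WriteLine(BuildSafe(id).IsMatch("<div id=\"abc(def\">")); // True
    }
}
```

Note that an id with balanced parentheses would not throw; it would silently turn into a capturing group and fail to match the literal text, which is arguably harder to notice.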
2.3 Escaping in character groups in .NET regular expressions
Inside a character group [], metacharacters usually do not need to be escaped; even "[" does not.

string test = @"The test string: . $ ^ { [ ( | ) * + ? \";
Regex reg = new Regex(@"[.$^{[(|)*+?\\]");
MatchCollection mc = reg.Matches(test);
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n";
}
/*--------Output--------
.
$
^
{
[
(
|
)
*
+
?
\
*/

Even so, it is still recommended to escape "[" inside a character group as "\[" when writing a regular expression. Regular expressions are already abstract and hard to read; mixing an unescaped "[" into a character group makes readability even worse. Moreover, incorrect nesting may prevent the regular expression from compiling. The following pattern, for example, throws an exception when it is compiled:
Regex reg = new Regex(@"[.$^{[(]|)*+?\\]");
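The three variants discussed here can be compared side by side. This is a console sketch under the same assumptions as the snippets in this section; BracketDemo, Count and Throws are hypothetical helper names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class BracketDemo
{
    // The twelve metacharacters, separated by spaces.
    public const string Input = @". $ ^ { [ ( | ) * + ? \";

    // Counts how many characters of Input the pattern matches.
    public static int Count(string pattern)
    {
        return Regex.Matches(Input, pattern).Count;
    }

    // Returns true when the pattern fails to compile.
    public static bool Throws(string pattern)
    {
        try { new Regex(pattern); return false; }
        catch (ArgumentException) { return true; }
    }

    public static void Main()
    {
        Console.WriteLine(Count(@"[.$^{[(|)*+?\\]"));    // 12: unescaped "[" inside the group
        Console.WriteLine(Count(@"[.$^{\[(|)*+?\\]"));   // 12: escaped "[", same meaning
        Console.WriteLine(Throws(@"[.$^{[(]|)*+?\\]"));  // True: the stray "]" closes the group early
    }
}
```

In the failing pattern the group ends at the first "]", so the ")" that follows is an unmatched closing parenthesis.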
.NET does, however, support character class subtraction, and in that legitimate syntax nested character groups are allowed:

string test = @"abcdefghijklmnopqrstuvwxyz";
Regex reg = new Regex(@"[a-z-[aeiou]]+");
MatchCollection mc = reg.Matches(test);
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n";
}
/*--------Output--------
bcd
fgh
jklmn
pqrst
vwxyz
*/

This syntax is hard to read and rarely seen in practice; even where such a requirement exists, it can be met by other means. It is enough to be aware of it without studying it in depth.
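As an illustration of "other means", the consonant example can also be written with explicit ranges or with a negative lookahead. These two alternatives are suggestions of this sketch, not patterns from the original discussion, and SubtractionDemo and Find are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class SubtractionDemo
{
    public const string Input = "abcdefghijklmnopqrstuvwxyz";

    // Returns every match of the pattern in Input as a string array.
    public static string[] Find(string pattern)
    {
        return Regex.Matches(Input, pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string[] subtraction = Find(@"[a-z-[aeiou]]+");         // .NET character class subtraction
        string[] explicitRanges = Find(@"[b-df-hj-np-tv-z]+");  // consonant ranges spelled out
        string[] lookahead = Find(@"(?:(?![aeiou])[a-z])+");    // vowels excluded by a lookahead

        Console.WriteLine(string.Join(",", subtraction));              // bcd,fgh,jklmn,pqrst,vwxyz
        Console.WriteLine(subtraction.SequenceEqual(explicitRanges));  // True
        Console.WriteLine(subtraction.SequenceEqual(lookahead));       // True
    }
}
```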
Returning to escaping: the only character that must be escaped inside a character group is "\", while escaping "[" and "]" inside a character group is recommended. Two more characters, "^" and "-", have special meaning at particular positions in a character group and must be escaped there if they are to match themselves.
When "^" appears at the beginning of a character group it denotes a negated character group: "[^char]" matches any character not contained in the group, for example "[^0-9]" matches any character except a digit. To match the "^" character itself inside a character group, either place it anywhere but the beginning or escape it as "\^".
Regex reg1 = new Regex(@"[0-9^]");
Regex reg2 = new Regex(@"[\^0-9]");
Both forms match any digit or the ordinary character "^".
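The difference between a leading "^" and a "^" elsewhere in the group shows up clearly when the same inputs are tested against all three forms. A minimal console sketch follows; CaretDemo and M are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class CaretDemo
{
    // Shorthand for Regex.IsMatch with the arguments in (input, pattern) order.
    public static bool M(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // "^" placed last or escaped: a digit or a literal "^".
        Console.WriteLine(M("^", @"[0-9^]"));   // True
        Console.WriteLine(M("^", @"[\^0-9]"));  // True
        Console.WriteLine(M("5", @"[0-9^]"));   // True
        Console.WriteLine(M("a", @"[0-9^]"));   // False

        // "^" placed first and unescaped: anything except a digit.
        Console.WriteLine(M("5", @"[^0-9]"));   // False
        Console.WriteLine(M("a", @"[^0-9]"));   // True
    }
}
```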
The special role of "-" in a character group is best shown with an example:

string test = @"$";
Regex reg = new Regex(@"[#-*%&]");
richTextBox2.Text = "Match result: " + reg.IsMatch(test);
/*--------Output--------
Match result: True
*/

The regular expression contains no "$", so why is the match result "True"?
Inside [], a hyphen "-" joining two characters denotes a range of characters. Note that the two characters around "-" are ordered: within the same encoding, the code point of the second character must be greater than or equal to that of the first.

for (int i = '#'; i <= '*'; i++)
{
    richTextBox2.Text += (char)i + "\n";
}
/*--------Output--------
#
$
%
&
'
(
)
*
*/

Because "#" and "*" satisfy this requirement, "#-*" denotes a range of characters, and that range contains "$", so the regular expression above matches "$". To have "-" treated as an ordinary character, either move it to a different position or escape it.
Regex reg1 = new Regex(@"[#*%&-]");
Regex reg2 = new Regex(@"[#\-*%&]");
Both forms match any one of the characters listed in the character group.
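Testing "$" and "-" against the three groups makes the difference concrete. This is a console sketch; HyphenDemo and M are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class HyphenDemo
{
    // Shorthand for Regex.IsMatch with the arguments in (input, pattern) order.
    public static bool M(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // "#-*" is a range (U+0023 to U+002A), so "$" matches and "-" does not.
        Console.WriteLine(M("$", @"[#-*%&]"));   // True
        Console.WriteLine(M("-", @"[#-*%&]"));   // False

        // "-" moved to the end: a literal hyphen, no range.
        Console.WriteLine(M("$", @"[#*%&-]"));   // False
        Console.WriteLine(M("-", @"[#*%&-]"));   // True

        // "-" escaped in place: also a literal hyphen.
        Console.WriteLine(M("$", @"[#\-*%&]"));  // False
        Console.WriteLine(M("-", @"[#\-*%&]"));  // True
    }
}
```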
There is also one special escape sequence inside character groups. In the ordinary positions of a regular expression, "\b" denotes a word boundary: a position with a word character on one side and a non-word character on the other. Inside a character group, however, "\b" denotes the backspace character, the same meaning "\b" has in an ordinary string.
Similarly, the character "|" is easily and often overlooked. In the ordinary positions of a regular expression, "|" expresses an "or" relationship between its left and right sides; inside a character group it represents only the "|" character itself, with no special meaning. Using "|" inside a character group for any purpose other than matching "|" itself is therefore a mistake. For example, the regular expression "[a|b]" matches any one of "a", "b" and "|", not "a" or "b".
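Both points can be checked with a few IsMatch calls. The following console sketch uses hypothetical names (GroupMeaningDemo, M); the anchored alternation "^(?:a|b)$" is added here only as a contrast to the character group.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class GroupMeaningDemo
{
    // Shorthand for Regex.IsMatch with the arguments in (input, pattern) order.
    public static bool M(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // Outside a group, \b is a word boundary; inside, it is backspace (U+0008).
        Console.WriteLine(M("word", @"\b"));     // True: boundary before "w"
        Console.WriteLine(M("word", @"[\b]"));   // False: no backspace in the input
        Console.WriteLine(M("\b", @"[\b]"));     // True: the input is a backspace

        // Inside a group, "|" is just a character.
        Console.WriteLine(M("|", @"[a|b]"));       // True
        Console.WriteLine(M("|", @"^(?:a|b)$"));   // False: real alternation
        Console.WriteLine(M("a", @"^(?:a|b)$"));   // True
    }
}
```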
2.4 Escaping invisible characters in .NET regular expression applications
Some invisible characters must be written as escape sequences in a string; the most common are "\r", "\n" and "\t". When these characters are used in a regular expression, things become a little magical. Look at a piece of code first:

string test = "one line. \n another line.";
List<Regex> list = new List<Regex>();
list.Add(new Regex("\n"));
list.Add(new Regex("\\n"));
list.Add(new Regex(@"\n"));
list.Add(new Regex(@"\\n"));
foreach (Regex reg in list)
{
    richTextBox2.Text += "Regular expression: " + reg.ToString();
    MatchCollection mc = reg.Matches(test);
    foreach (Match m in mc)
    {
        richTextBox2.Text += "   Match content: " + m.Value + "   Match start position: " + m.Index + "   Match length: " + m.Length;
    }
    richTextBox2.Text += "   Total matches: " + reg.Matches(test).Count + "\n----------------\n";
}
/*--------Output--------
Regular expression: 
   Match content: 
   Match start position: 10   Match length: 1   Total matches: 1
----------------
Regular expression: \n   Match content: 
   Match start position: 10   Match length: 1   Total matches: 1
----------------
Regular expression: \n   Match content: 
   Match start position: 10   Match length: 1   Total matches: 1
----------------
Regular expression: \\n   Total matches: 0
----------------
*/

As you can see, the output of the first three declarations differs, but the match results are exactly the same; only the last one finds no match.
Regular expression one, Regex("\n"), declares the regular expression as an ordinary string. It is no different from matching the character "a" with Regex("a"): the newline is produced by the string literal, and no escaping is done by the regular expression engine.
Regular expression two, Regex("\\n"), declares the regular expression in regular expression form. Just as "\\\\" in a regular expression declaration matches what "\\" denotes in a string, "\\n" in a regular expression declaration matches what "\n" denotes in a string; here the escape is resolved by the regular expression engine.
Regular expression three, Regex(@"\n"), is equivalent to regular expression two; it is simply the verbatim form with "@" in front of the string.
Regular expression four, Regex(@"\\n"), actually represents the character "\" followed by the character "n", two characters in all, which naturally cannot be found in the source string.
What needs special attention here is "\b": its meaning differs depending on how the regular expression is declared.

string test = "one line. \n another line.";
List<Regex> list = new List<Regex>();
list.Add(new Regex("line\b"));
list.Add(new Regex("line\\b"));
list.Add(new Regex(@"line\b"));
list.Add(new Regex(@"line\\b"));
foreach (Regex reg in list)
{
    richTextBox2.Text += "Regular expression: " + reg.ToString() + "\n";
    MatchCollection mc = reg.Matches(test);
    foreach (Match m in mc)
    {
        richTextBox2.Text += "Match content: " + m.Value + "   Match start position: " + m.Index + "   Match length: " + m.Length + "\n";
    }
    richTextBox2.Text += "Total matches: " + reg.Matches(test).Count + "\n----------------\n";
}
/*--------Output--------
Regular expression: line (followed by an unprintable backspace character)
Total matches: 0
----------------
Regular expression: line\b
Match content: line   Match start position: 4   Match length: 4
Match content: line   Match start position: 20   Match length: 4
Total matches: 2
----------------
Regular expression: line\b
Match content: line   Match start position: 4   Match length: 4
Match content: line   Match start position: 20   Match length: 4
Total matches: 2
----------------
Regular expression: line\\b
Total matches: 0
----------------
*/

Regular expression one, Regex("line\b"): here "\b" is the backspace character, produced by the string literal without any escaping by the regular expression engine. The source string contains no backspace, so there are 0 matches.
Regular expression two, Regex("line\\b"), declares the regular expression in regular expression form; here "\\b" is a word boundary, resolved by the regular expression engine.
Regular expression three, Regex(@"line\b"), is equivalent to regular expression two and also denotes a word boundary.
Regular expression four, Regex(@"line\\b"), actually represents "line" followed by the character "\" and the character "b", which naturally cannot be found in the source string.
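Looking at the pattern strings themselves, before any regular expression is built, may make the four cases easier to tell apart. This is a console sketch with hypothetical names (BoundaryDemo and its fields A to D); the input containing a literal backslash followed by "b" is an invented example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class BoundaryDemo
{
    public const string A = "line\b";    // 5 chars: l i n e + backspace (U+0008)
    public const string B = "line\\b";   // 6 chars: l i n e \ b
    public const string C = @"line\b";   // the same six characters as B
    public const string D = @"line\\b";  // 7 chars: l i n e \ \ b

    public static void Main()
    {
        Console.WriteLine(A.Length + " " + (A[4] == '\u0008'));  // 5 True
        Console.WriteLine(B == C);                               // True
        Console.WriteLine(D.Length);                             // 7

        Console.WriteLine(Regex.IsMatch("one line.", A));        // False: no backspace in the input
        Console.WriteLine(Regex.IsMatch("one line.", C));        // True: word boundary after "line"
        Console.WriteLine(Regex.IsMatch(@"text line\b", D));     // True: literal backslash + "b"
    }
}
```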
2.5 Other escaping in .NET regular expression applications
There are a few other kinds of escaping in .NET regular expression applications. They do not come up often, but they are worth mentioning in passing.
Requirement: add "$" in front of the number between "<" and ">" in a string.

string test = "one test <123> another test <321>";
Regex reg = new Regex(@"<(\d+)>");
string result = reg.Replace(test, "<$$1>");
richTextBox2.Text = result;
/*--------Output--------
one test <$1> another test <$1>
*/
You may be surprised to find that, instead of "$" being added in front of each number, every number has been replaced with "$1".
Why? In a replacement pattern "$" has special meaning: followed by a number, it refers to the text matched by the capturing group with that number. When a literal "$" is needed in the replacement result, "$$" is used to escape it. The example above produced the unexpected result precisely because of this escaping: "$$" became a literal "$" and the "1" after it was left as plain text. One way around the problem is to avoid referencing a capturing group in the replacement at all:
string test = "one test <123> another test <321>";
Regex reg = new Regex(@"(?<=<)(?=\d+>)");
string result = reg.Replace(test, "$");
richTextBox2.Text = result;
/*--------Output--------
one test <$123> another test <$321>
*/
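Another option, not used in the snippets above, is to keep the capturing group and write the literal dollar sign as "$$" immediately before the "$1" reference. The console sketch below compares the two replacement patterns; DollarDemo and Replace are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class DollarDemo
{
    public const string Input = "one test <123> another test <321>";

    // Applies a replacement pattern to every "<digits>" occurrence in Input.
    public static string Replace(string replacement)
    {
        return Regex.Replace(Input, @"<(\d+)>", replacement);
    }

    public static void Main()
    {
        // "$$" is a literal "$"; the "1" after it is plain text.
        Console.WriteLine(Replace("<$$1>"));    // one test <$1> another test <$1>

        // "$$" is a literal "$", then "$1" refers to the captured number.
        Console.WriteLine(Replace("<$$$1>"));   // one test <$123> another test <$321>
    }
}
```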

3 Escape characters in JavaScript and Java
When a regular expression is declared as a string, JavaScript and Java handle escape characters in essentially the same way as .NET, so only a brief introduction is given here.
In JavaScript, declaring a regular expression as a string is just as awkward as in C#:

<script type="text/javascript">
var data = ["\\", "\\\\"];
var reg = new RegExp("^\\\\$", "");
for (var i = 0; i < data.length; i++)
{
    document.write("Source string: " + data[i] + "  Match result: " + reg.test(data[i]) + "<br />");
}
</script>
/*--------Output--------
Source string: \  Match result: true
Source string: \\  Match result: false
*/

JavaScript has no equivalent of the C# "@" string declaration, but it does provide another declaration form specific to regular expressions:

<script type="text/javascript">
var data = ["\\", "\\\\"];
var reg = /^\\$/;
for (var i = 0; i < data.length; i++)
{
    document.write("Source string: " + data[i] + "  Match result: " + reg.test(data[i]) + "<br />");
}
</script>
/*--------Output--------
Source string: \  Match result: true
Source string: \\  Match result: false
*/

In JavaScript, the declaration form
var reg = /expression/igm;
likewise simplifies regular expressions that contain escape characters.
Of course, when a regular expression is declared in this form, "/" naturally becomes a metacharacter and must be escaped whenever it appears in the pattern, as in this regular expression that matches the domain name in a link:
var reg = /http:\/\/([^\/]+)/ig;
Unfortunately, Java currently offers only one way to declare a regular expression, namely as a string:

String test[] = new String[] { "\\", "\\\\" };
String reg = "^\\\\$";
for (int i = 0; i < test.length; i++)
{
    System.out.println("Source string: " + test[i] + "  Match result: " + Pattern.compile(reg).matcher(test[i]).find());
}
/*--------Output--------
Source string: \  Match result: true
Source string: \\  Match result: false
*/

One can only hope that future versions of Java will offer an improvement in this area.
