Regular expressions in JavaScript differ from those in other languages

I have worked with many languages, and I care a lot about how strong a language's regular expressions are and how tightly they are integrated with its syntax. JavaScript does well here, at least in that it has regex literals. Perl, of course, is the most powerful. Recently I found that regular expressions in JavaScript differ in a few places from those in other languages and tools. You will almost never write or use the patterns discussed below, but they are good to understand.

The sample code in this article was run in an ES5-compatible JavaScript environment. Versions of IE earlier than 9 and Firefox versions around 4 may behave differently from what is described below.

1. Empty character class

A character class that contains no characters, [], is called an empty char class. You have probably never heard of it, because in other languages this syntax is invalid, and documentation and tutorials do not cover invalid syntax. Here is how other languages and tools report the error:

    $ echo | grep '[]'
    grep: Unmatched [ or [^
    $ echo | sed '/[]/'
    sed: -e expression #1, char 4: unterminated address regex
    $ echo | awk '/[]/'
    awk: cmd. line:1: /[]/
    awk: cmd. line:1:  ^ unterminated regexp
    awk: cmd. line:1: error: Unmatched [ or [^: /[]//
    $ echo | perl -ne '/[]/'
    Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ]/ at -e line 1.
    $ echo | ruby -ne '/[]/'
    -e:1: empty char-class: /[]/
    $ python -c 'import re;re.match("[]","")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "E:\Python\lib\re.py", line 137, in match
        return _compile(pattern, flags).match(string)
      File "E:\Python\lib\re.py", line 244, in _compile
        raise error, v # invalid expression
    sre_constants.error: unexpected end of regular expression

In JavaScript, the empty character class is a valid regular expression, but it never matches: every match attempt fails. It behaves like an empty negative lookahead, (?!):

    js> "whatever\n".match(/[]/g)    // empty character class, never matches
    null
    js> "whatever\n".match(/(?!)/g)  // empty negative lookahead, never matches
    null

Obviously, this is of little use in JavaScript.

2. Negated empty character class

A negated character class that contains no characters, [^], can be called a negated empty char class (or an empty negated char class). Either name will do, because I made the term up myself. Like the empty character class above, this syntax is invalid in other languages:

    $ echo | grep '[^]'
    grep: Unmatched [ or [^
    $ echo | sed '/[^]/'
    sed: -e expression #1, char 5: unterminated address regex
    $ echo | awk '/[^]/'
    awk: cmd. line:1: /[^]/
    awk: cmd. line:1:  ^ unterminated regexp
    awk: cmd. line:1: error: Unmatched [ or [^: /[^]//
    $ echo | perl -ne '/[^]/'
    Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ^]/ at -e line 1.
    $ echo | ruby -ne '/[^]/'
    -e:1: empty char-class: /[^]/
    $ python -c 'import re;re.match("[^]","")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "E:\Python\lib\re.py", line 137, in match
        return _compile(pattern, flags).match(string)
      File "E:\Python\lib\re.py", line 244, in _compile
        raise error, v # invalid expression
    sre_constants.error: unexpected end of regular expression
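
For contrast with the error messages above, here is a minimal sketch that checks whether a JavaScript engine accepts the same two patterns. The `compiles` function is a hypothetical helper written for this illustration, not a built-in; it relies only on the standard behaviour that `new RegExp` throws a `SyntaxError` for an invalid pattern.

```javascript
// Hypothetical helper: reports whether a pattern source compiles in this engine.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    // An invalid pattern raises SyntaxError; anything else is unexpected.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(compiles("[]"));   // true: empty character class is accepted
console.log(compiles("[^]"));  // true: negated empty character class is accepted
console.log(compiles("["));    // false: an unterminated class is still an error
```

The last line is there only to show that the helper does report genuinely invalid patterns.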

In JavaScript, the negated empty character class is a valid regular expression, and its effect is exactly the opposite of the empty character class: it matches any character, including the line break "\n". It is equivalent to the familiar [\s\S] and [\w\W]:

    js> "whatever\n".match(/[^]/g)     // negated empty character class, matches any character
    ["w", "h", "a", "t", "e", "v", "e", "r", "\n"]
    js> "whatever\n".match(/[\s\S]/g)  // complementary character classes, match any character
    ["w", "h", "a", "t", "e", "v", "e", "r", "\n"]

Note that it cannot be called an "always-matching regex", because a character class still needs one character to match. If the target string is empty, or has already been consumed by the part of the regex to its left, the match fails. For example:

    js> /abc[^]/.test("abc")  // there is no character after c, so the match fails
    false

If you want to know about real "always-matching" regexes, see an article I translated earlier: "Empty" regular expressions.

3. []] and [^]]

In Perl and in the regexes of some other Linux commands, if a right bracket ] immediately follows the opening left bracket of a character class, the right bracket is treated as an ordinary character; that is, []] matches only "]". In JavaScript, the same pattern is read as an empty character class followed by a right bracket, and the empty character class matches nothing. [^]] is similar: in JavaScript it matches any one character (the negated empty character class) followed by a right bracket, such as "a]" or "b]", whereas in other languages it matches any character other than ].

    $ perl -e 'print "]" =~ /[]]/'
    1
    $ js -e 'print(/[]]/.test("]"))'
    false
    $ perl -e 'print "x" =~ /[^]]/'
    1
    $ js -e 'print(/[^]]/.test("x"))'
    false

4. The $ anchor

Some beginners think that $ matches the line break "\n". That is a big mistake: $ is a zero-width assertion. It cannot match an actual character; it can only match a position.
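
A small sketch of what "matches a position, not a character" means in JavaScript, assuming non-multiline mode (no `m` flag). The sample strings are arbitrary.

```javascript
// $ succeeds at the end of the string but consumes nothing.
const end = /$/.exec("abc");
console.log(end[0].length);  // 0: the match is an empty string
console.log(end.index);      // 3: the position after the last character

// $ does not stand in for "\n": the line break has to be matched explicitly.
console.log(/a$/.test("a\n"));    // false
console.log(/a\n$/.test("a\n"));  // true
```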

The difference I want to discuss shows up in non-multiline mode. You might assume that in non-multiline mode $ matches only the position after the last character. In fact, in most other languages, if the last character of the target string is a line break "\n", $ also matches the position before that line break; that is, it matches the positions on both sides of a trailing line break. Many languages have the \Z and \z notations. If you know the difference between them, you will see that in other languages (Perl, Python, PHP, Java, C#, ...) $ in non-multiline mode is equivalent to \Z, whereas in JavaScript $ in non-multiline mode is equivalent to \z (it matches only the position at the very end, whether or not the last character is a line break). Ruby is a special case because it is in multiline mode by default, and in multiline mode $ matches the position before every line break, including, of course, one that appears at the end. These points are also covered in Yu Sheng's book "Regular Expression Guide".

    $ perl -e 'print "whatever\n" =~ s/$/replacement/rg'          // global replace
    whateverreplacement    // the position before the line break is replaced
    replacement            // the position after the line break is replaced
    $ js -e 'print("whatever\n".replace(/$/g, "replacement"))'    // global replace
    whatever
    replacement            // only the position after the line break is replaced

5. Forward references

We all know that regular expressions have back references: a backslash followed by a number refers to the string matched by an earlier capture group, either for matching again in the pattern or for use in a replacement (where \ becomes $). There is a special case, though: the back reference may be used before the capture group it refers to has even started (counting from the group's left parenthesis).
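
Before the special case, the ordinary back reference described above can be sketched as follows. The sample strings ("hello", "John Smith") are made up for this illustration.

```javascript
// In the pattern, \1 must repeat exactly what group 1 captured.
const doubled = /(\w)\1/;
const hit = doubled.exec("hello");
console.log(hit[0], hit.index);  // "ll" 2

// In the replacement string, the same groups are written as $1 and $2.
const swapped = "John Smith".replace(/(\w+) (\w+)/, "$2, $1");
console.log(swapped);  // "Smith, John"
```

In both cases the referenced group lies to the left of the reference and has finished matching before the reference is evaluated.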

Take the regex /(\2(a)){2}/: (a) is the second capture group, yet \2, which refers to what that group matched, appears to its left. Regexes are matched from left to right, which is where this section's title, forward reference, comes from. It is not a strict concept. Now think about what the following JavaScript code returns:

    js> /(\2(a)){2}/.exec("aaa")
    ???

Before answering, let's look at how other languages behave. Again, this pattern is mostly invalid elsewhere:

    $ echo aaa | grep '(\2(a)){2}'
    grep: Invalid back reference
    $ echo aaa | sed -r '/(\2(a)){2}/'
    sed: -e expression #1, char 12: Invalid back reference
    $ echo aaa | awk '/(\2(a)){2}/'
    $ echo aaa | perl -ne 'print /(\2(a)){2}/'
    $ echo aaa | ruby -ne 'print $_ =~ /(\2(a)){2}/'
    $ python -c 'import re;print re.match("(\2(a)){2}","aaa")'
    None

awk reports no error because awk does not support back references; there, \2 is interpreted as the character with ASCII code 2. Perl, Ruby, and Python report no error either. I don't know why they are designed this way (presumably they all follow Perl), but the effect is the same in all of them: in this situation the match cannot succeed.

In JavaScript, not only is no error reported, the match also succeeds. See whether the answer is what you expected:

    js> /(\2(a)){2}/.exec("aaa")
    ["aa", "a", "a"]

In case you have forgotten what the exec method returns: the first element is the complete matched string, that is, RegExp["$&"], followed by the content matched by each capture group, that is, RegExp.$1 and RegExp.$2.

Why does the match succeed, and how does it proceed? My understanding is as follows.

First, matching enters the first capture group (the leftmost left parenthesis). The first thing to match is \2, but at this point the second capture group (a) has not had its turn yet, so the value of RegExp.$2 is still undefined. \2 therefore matches an empty string, or a "position", to the left of the first a in the target string, just like ^ and other zero-width assertions. The point is that it matches successfully. Next, the second capture group (a) matches the first a in the target string, and RegExp.$2 is set to "a". Then the first capture group ends (the rightmost right parenthesis), and RegExp.$1 is also set to "a".

Then comes the quantifier {2}: a new round of matching (\2(a)) starts after the first a in the target string. The key question is whether RegExp.$2, that is, the value \2 matches, is still the "a" assigned at the end of the first round. The answer is no: the values of RegExp.$1 and RegExp.$2 are cleared to undefined, and \1 and \2 again match an empty string, just as in the first round. The second a in the target string is matched, RegExp.$1 and RegExp.$2 become "a" again, and RegExp["$&"] becomes the complete match, the first two a's: "aa".

In earlier versions of Firefox (3.6), the capture group values are not cleared after each round of a quantifier. That is, in the second round \2 matches the second a, so:

    js> /(\2(a)){2}/.exec("aaa")
    ["aaa", "aa", "a"]

In addition, a capture group ends only when its right parenthesis is reached. For example, in /(a\1){3}/, the first capture group has already started matching by the time \1 is used, but it has not ended yet. This is also a forward reference, so \1 still matches an empty string:

    js> /(a\1){3}/.exec("aaa")
    ["aaa", "a"]

Another example:

    js> /(?:(f)(o)(o)|(b)(a)(r))*/.exec("foobar")
    ["foobar", undefined, undefined, undefined, "b", "a", "r"]

* is a quantifier. After the first round of matching: $1 is "f", $2 is "o", $3 is "o", $4 is undefined, $5 is undefined, and $6 is undefined.

When the second round of matching starts, all captured values are reset to undefined. After the second round: $1 is undefined, $2 is undefined, $3 is undefined, $4 is "b", $5 is "a", and $6 is "r". $& is set to "foobar", and matching ends.

One last question:

    js> /(?:^(a)|\1(a)|(ab)){2}/.exec("aab")
    ???
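
The capture resetting described in this section can be checked with a short script. This is a minimal sketch that assumes an ES5-compliant engine, as the article does; the variable names are chosen for this illustration. It reads the groups from the array returned by exec rather than from RegExp.$1 and friends, and it leaves the last question above for the reader.

```javascript
// exec returns [whole match, group 1, group 2, ...].
const forward = /(\2(a)){2}/.exec("aaa");
const selfRef = /(a\1){3}/.exec("aaa");
const alternation = /(?:(f)(o)(o)|(b)(a)(r))*/.exec("foobar");

console.log(Array.from(forward));      // ["aa", "a", "a"]
console.log(Array.from(selfRef));      // ["aaa", "a"]

// Groups 1 to 3 were set in the first round and cleared when the second began.
console.log(Array.from(alternation));  // ["foobar", undefined, undefined, undefined, "b", "a", "r"]
```

An engine that kept captures across rounds, as the article reports for Firefox 3.6, would print different values for the first and third results.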