An in-depth look at lookahead and lookbehind assertions in JS regular expressions

Source: Internet
Author: User

This is a condensed translation of the article Mastering Lookahead and Lookbehind. I came across it while searching for answers on StackOverflow, where someone recommended it, and after reading it I found it very well written. The translation leaves out material that does not apply to JS, and because the original is long it also drops passages with little substance, while adding a good deal of my own understanding. If you want to understand how assertions work in JS, I recommend reading the basics on MDN first and then the original article (http://www.rexegg.com/regex-lookarounds.html); it will sink in better.



The original begins with a brief introduction to the concept of zero-width assertions, which is omitted here.

Lookahead example: simple password validation

The password must meet four conditions:

    1. 6 to 10 word characters \w
    2. At least one lowercase letter [a-z]
    3. At least three uppercase letters [A-Z]
    4. At least one digit \d

The basic idea is to stand at the beginning of the string and check the conditions one at a time, with one lookahead per condition, four checks in all.

Condition One

The original article uses \A to match the beginning of the string and \z to match the end. JS does not have these, so the patterns here use ^ and $ instead.
The first condition is simple: ^\w{6,10}$ . Wrapped in a lookahead it becomes (?=^\w{6,10}$) , which asserts that what follows the current position is the beginning of the string, 6 to 10 word characters, and the end of the string.

(At the current position in the string, what follows is the beginning of the string, six to ten word characters, and the very end of the string.)

We want to make this assertion at the beginning of the string, so we anchor the position with ^ . The beginning of the string then no longer needs to be stated inside the lookahead, so the ^ is moved out of it:

^(?=\w{6,10}$)

Note that although the lookahead has examined the entire string, our position has not changed: the engine still stands at the beginning of the string and has only made the first check. This means we can go on to examine the whole string again.
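
A small check of this in JS may help. It is only a sketch, and the sample strings and variable names are made up for illustration: the lookahead inspects the whole string, yet the match itself is empty and sits at index 0.

```javascript
// The lookahead checks the whole string but consumes nothing.
const lengthCheck = /^(?=\w{6,10}$)/;

// Six word characters: the assertion holds, and the match is zero-length.
const okMatch = lengthCheck.exec("abc123");
console.log(okMatch[0] === "", okMatch.index); // true 0

// Only three word characters: the assertion fails, so there is no match.
const tooShort = lengthCheck.exec("abc");
console.log(tooShort); // null
```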

Condition Two

The first pattern that comes to mind for detecting a lowercase letter is .*[a-z] , but here .* first runs to the end of the string and then has to backtrack. The next idea is .*?[a-z] , but this causes even more backtracking. The recommended form is [^a-z]*[a-z] (a general pattern you can reuse whenever a string must contain a certain kind of character). Putting the condition in a lookahead gives (?=[^a-z]*[a-z]) , so the regex becomes:

^(?=\w{6,10}$)(?=[^a-z]*[a-z])

The lookaheads still consume no characters, and the two of them can be swapped freely.

Condition Three

Similar to condition two: (?=(?:[^A-Z]*[A-Z]){3})
The regex becomes:

^(?=\w{6,10}$)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})

Condition Four

Similarly: (?=\D*\d)
The regex becomes:

^(?=\w{6,10}$)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)

At this point we are still at the beginning of the string. The four lookaheads have checked the four conditions without consuming a single character, yet the password has been validated.

Matching the valid string

After the checks, the engine is still positioned at the beginning of the string, so a plain .* is enough to match the entire string: whatever .* matches has already been validated. So:

^(?=\w{6,10}$)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*

Fine-tuning: removing a condition

Look at the lookaheads in this regex: the expression \w{6,10}$ examines every character of the string, so it can be used to match the whole string in place of .* . That removes one lookahead and simplifies the regex:

^(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}$
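
To see that the two versions agree, here is a minimal sketch that runs both regexes from the text over a few sample passwords. The samples and variable names are hypothetical, each chosen to break a single condition.

```javascript
// Both versions of the password regex from the text.
const withDotStar = /^(?=\w{6,10}$)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*/;
const simplified = /^(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}$/;

// Hypothetical sample passwords.
const passwordSamples = [
  "ABCd1e",      // satisfies all four conditions
  "ABcd1e",      // only two uppercase letters
  "ABCDEF1",     // no lowercase letter
  "ABCdef",      // no digit
  "ABCd1",       // too short (5 characters)
  "ABCd1efghij", // too long (11 characters)
];

// Run both regexes on every sample and collect the verdicts.
const passwordResults = passwordSamples.map((s) => ({
  sample: s,
  withDotStar: withDotStar.test(s),
  simplified: simplified.test(s),
}));

for (const r of passwordResults) {
  console.log(r.sample, r.withDotStar, r.simplified);
}
```

Only the first sample passes, and the two regexes give the same verdict on every sample.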

To sum up: if you check n conditions, the regex needs at most n-1 lookaheads. Several lookaheads can sometimes even be combined.
In fact, \w{6,10}$ is not the only one that can match the whole string. The other lookaheads can be rewritten to do so as well. For example, (?=\D*\d) can be taken out of its lookahead and followed by a simple .*$ that matches to the end of the string:

^(?=\w{6,10}$)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})\D*\d.*$

Also, why does .* need a $ after it? Is .* not enough to reach the end of the string? Because the dot does not match line breaks (except in DOTALL mode, where the dot matches everything), .* only reaches the end of the first line and stops if there is a line break. The $ guarantees that we have reached not just the end of a line but the end of the string.

In this particular regex the leading (?=\w{6,10}$) has already checked all the way to the end of the string, so the trailing $ is not strictly necessary.
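
The effect of the trailing $ can be checked with a string that contains a line break. This is a sketch that isolates the \D*\d.* part of the pattern, with a made-up input. It assumes the regex has no m flag, so $ matches only at the very end of the input.

```javascript
// A two-line input: the digit is on the first line.
const multiLine = "abc1\ndef";

// Without $, .* stops at the line break and the match still succeeds.
const noDollar = /^\D*\d.*/;
// With $, the match must reach the end of the string, which .* cannot do here.
const withDollar = /^\D*\d.*$/;

console.log(noDollar.test(multiLine));   // true
console.log(withDollar.test(multiLine)); // false
console.log(withDollar.test("abc1def")); // true
```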

The order of the lookaheads matters little

In this example the three lookaheads do not change the position, so they can be swapped freely. The result is unaffected, but performance can be: the lookahead most likely to fail should come first.
In fact, putting ^ first deserves the same consideration. Because ^ matches no characters and does not move the engine's position, it could also swap places with the lookaheads, but doing so causes problems.
First, note that in DOTALL mode the negative lookbehind (?<!.) matches the beginning of the string, that is, a position with no character before it. Without DOTALL mode, (?<![\D\d]) matches the beginning.
Now suppose ^ is placed fourth, after the three lookaheads. If the third lookahead fails, the engine moves to the second position and starts again from the first lookahead, and it keeps advancing like this until every position has failed. Although ^ would rule out every position except the first, the engine never reaches ^ because the lookaheads fail before it.
With ^ in first place, every position other than the initial one fails at ^ right away, which is more efficient.

Zero-width assertions do not change position

Here are some mistakes beginners often make.
For example, A(?=5) does not match in AB25 . What beginners miss is that the 5 in the lookahead must come immediately after A . To allow the 5 to appear further along, use A(?=[^5]*5) .
Likewise, A(?=5)(?=[A-Z]) does not match in A5B , because both lookaheads look at the same position. Use A(?=5[A-Z]) instead.
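
These four cases are easy to verify directly. A minimal sketch, with variable names chosen only for illustration:

```javascript
// The 5 must come immediately after A for A(?=5) to match.
const immediateOnly = /A(?=5)/.test("AB25");
// Allowing non-5 characters first lets the lookahead find the 5 further along.
const furtherAlong = /A(?=[^5]*5)/.test("AB25");

// Two lookaheads in a row both inspect the same position (right after A).
const samePosition = /A(?=5)(?=[A-Z])/.test("A5B");
// Putting both tokens in one lookahead checks them one after the other.
const inSequence = /A(?=5[A-Z])/.test("A5B");

console.log(immediateOnly, furtherAlong, samePosition, inSequence);
// false true false true
```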

Uses of zero-width assertions: validation

This is the password validation example above, where a string must satisfy several conditions and each condition examines the entire string.

Restricting a character range

For example, match one word character \w other than Q . There are several ways to do this:

    1. Character class subtraction, [\w-[Q]] (not supported in JS)
    2. [_0-9a-zA-PR-Z]
    3. [^\WQ]

Lookahead notation: (?!Q)\w
The lookahead asserts that the character after the current position is not Q , and then \w matches one character. This is not only easy to understand but also easy to extend. For example, to exclude both Q and K :

(?![QK])\w

Lookbehind notation:

\w(?<!Q)
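
The three JS-compatible notations can be compared side by side. This is a sketch with made-up input strings. The lookbehind variant assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical input: Q is the only character that should be skipped.
const rangeInput = "aQb_9";

const byLookahead = rangeInput.match(/(?!Q)\w/g);    // lookahead, then \w
const byNegatedClass = rangeInput.match(/[^\WQ]/g);  // "not a non-word char, not Q"
const byLookbehind = rangeInput.match(/\w(?<!Q)/g);  // \w, then look back at it

console.log(byLookahead, byNegatedClass, byLookbehind);
// each is ["a", "b", "_", "9"]

// Extending the lookahead version to exclude K as well.
const withoutQK = "QuicK".match(/(?![QK])\w/g);
console.log(withoutQK); // ["u", "i", "c"]
```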

Tempering the scope of a token

This technique restricts what a token is allowed to match.
For example, to match any run of characters up to, but not including, {END} , you can use:

(?:(?!{END}).)*

Each . token is tempered by (?!{END}) , which asserts that the dot cannot be the beginning of {END} . This technique is called a tempered greedy token.
The original gives another approach, which is too complex and is omitted here.
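
A minimal sketch of the tempered greedy token in JS. The input strings are made up, and the braces are escaped here as a precaution so that the pattern would also be accepted with the u flag.

```javascript
// Each dot is allowed to match only if {END} does not start at that position.
const tempered = /(?:(?!\{END\}).)*/;

// The match runs up to, but does not include, the {END} marker.
const temperedMatch = "abc{END}def".match(tempered)[0];
console.log(temperedMatch); // "abc"

// With no {END} in the input, the pattern simply matches everything.
const temperedAll = "abcdef".match(tempered)[0];
console.log(temperedAll); // "abcdef"
```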

Delimiters

Match all the characters that follow the first occurrence of #START# :

(?<=#START#).*

Or match all the characters in the string up to, but not including, #END# :

.*?(?=#END#)

The two assertions can be combined:

(?<=#START#).*?(?=#END#)
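
The three delimiter patterns can be tried on one sample string. This is a sketch: the input is hypothetical, and the lookbehind assumes an engine with ES2018 support.

```javascript
// Hypothetical input with text before, between and after the delimiters.
const delimited = "xx#START#hello#END#yy";

// Everything after the first #START#.
const afterStart = delimited.match(/(?<=#START#).*/)[0];
// Everything before the first #END#.
const beforeEnd = delimited.match(/.*?(?=#END#)/)[0];
// Only the text between the two delimiters.
const between = delimited.match(/(?<=#START#).*?(?=#END#)/)[0];

console.log(afterStart); // "hello#END#yy"
console.log(beforeEnd);  // "xx#START#hello"
console.log(between);    // "hello"
```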

Inserting text at a position

Suppose you have a file of movie titles written in camel case, such as HaroldAndKumarGoToWhiteCastle . To make them easier to read, you need to insert a space wherever a lowercase letter is followed by an uppercase letter. The following regex matches those positions:

(?<=[a-z])(?=[A-Z])

In an editor's regex search and replace, you can use this to match those positions and replace them with a space. (You might think /[a-z][A-Z]/g would find the same places, but it matches the characters rather than the position between them, so the replacement is less convenient.)
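
In JS the same replacement can be done with String.prototype.replace. The sketch below also shows the character-matching alternative for comparison. Variable names are illustrative, and the lookbehind assumes ES2018 support.

```javascript
const movieTitle = "HaroldAndKumarGoToWhiteCastle";

// Zero-width match: nothing is consumed, so the replacement is just the space.
const spacedTitle = movieTitle.replace(/(?<=[a-z])(?=[A-Z])/g, " ");
console.log(spacedTitle); // "Harold And Kumar Go To White Castle"

// Matching the characters instead means they must be captured and put back.
const spacedTitleAlt = movieTitle.replace(/([a-z])([A-Z])/g, "$1 $2");
console.log(spacedTitleAlt); // "Harold And Kumar Go To White Castle"
```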

Splitting a string at a position

As in the example above, you can split at the positions between lowercase and uppercase letters. In many languages, a split function combined with this regex returns an array of words.
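
In JS this is String.prototype.split with the same zero-width pattern. A short sketch, again assuming lookbehind support (ES2018):

```javascript
// Split at every position between a lowercase and an uppercase letter.
const titleWords = "HaroldAndKumarGoToWhiteCastle".split(/(?<=[a-z])(?=[A-Z])/);

console.log(titleWords);
// ["Harold", "And", "Kumar", "Go", "To", "White", "Castle"]
```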

Finding overlapping matches

Sometimes you need several overlapping matches within the same word. For example, to match ABCD , BCD , CD and D in ABCD , you can use:

(?=(\w+))

This is easy to understand: it matches at four positions, the empty strings before A , B , C and D . But as for how to extract the four parts, I have not yet found a suitable method.
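
One approach that seems to work in current JS engines is to read capture group 1 from each zero-length match, for example with String.prototype.matchAll (ES2020). This is a sketch and not something taken from the original article.

```javascript
// Each match is empty, but group 1 inside the lookahead holds the text we want.
const overlapping = Array.from("ABCD".matchAll(/(?=(\w+))/g), (m) => m[1]);

console.log(overlapping); // ["ABCD", "BCD", "CD", "D"]
```

matchAll requires the g flag and steps past each empty match on its own, so no manual handling of lastIndex is needed here.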

Zero-width matches

Zero-width assertions, anchors and boundaries can be used in a regex that also contains tokens, so that the engine returns only the part of the string you want. For example, with (?<=start_)\d+ the engine returns the digits but not the start_ prefix.
Used on their own, they produce zero-width matches. Here are some applications:

Validation

Like the password validation example.

Insertion

Like the example of inserting spaces.

Splitting

Like the example of splitting at the positions between lowercase and uppercase letters.

Overlapping matches

Like the example of multiple matches within the same word.

Positioning the lookaround

A zero-width assertion can usually be placed in one of two positions, before or after the text to match. In general, one of the two performs better.

Lookahead

\d+(?= dollars) and (?=\d+ dollars)\d+ both match 100 in 100 dollars , but the first performs better because it only matches \d+ once. (The second first asserts that the current position is followed by \d+ dollars , and then matches the \d+ a second time outside the assertion.)

Negative lookahead

\d+(?! dollars) and (?!\d+ dollars)\d+ both match 100 in 100 pesos , but the first performs better, for the same reason.
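
The following sketch only confirms that both placements return the same match on the sample strings from the text. It does not measure the performance difference. Variable names are illustrative.

```javascript
const dollars = "100 dollars";
const pesos = "100 pesos";

// Lookahead after the digits versus lookahead before the digits.
const lookaheadAfter = dollars.match(/\d+(?= dollars)/)[0];
const lookaheadBefore = dollars.match(/(?=\d+ dollars)\d+/)[0];

// The same two placements with a negative lookahead.
const negativeAfter = pesos.match(/\d+(?! dollars)/)[0];
const negativeBefore = pesos.match(/(?!\d+ dollars)\d+/)[0];

console.log(lookaheadAfter, lookaheadBefore, negativeAfter, negativeBefore);
// 100 100 100 100
```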

The original gives two more examples with lookbehind. They are not listed here because JS did not support lookbehind when this was written (it was added in ES2018).
The difference between these examples lies in whether the assertion comes before or after the match. The point is not to dwell on placement but to develop a feel for how efficient a regex is. With practice you gradually become familiar with these differences and write better-performing regexes.

Lookarounds that look on both sides: back to the future

This section is about nesting zero-width assertions. Only the example from the original is explained here. Because JS did not support lookbehind at the time of writing, this was of limited use there.
Match the number between underscores in _12_ . There are many ways to do this; the new one is:

(?<=_(?=\d{2}_))\d+

That is, the lookbehind asserts that the current position is preceded by an underscore, and the lookahead nested inside it asserts that this underscore is followed by \d{2}_ . Taken together, the lookbehind checks the whole of _\d{2}_ , with the current position between the _ and the \d{2} . The \d+ that follows then matches the digits.
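
In an engine with lookbehind support (ES2018 or later) the nested pattern can be tried as follows. This is a sketch, and the extra inputs are made up to show when it fails.

```javascript
// A lookahead nested inside a lookbehind.
const nestedLookaround = /(?<=_(?=\d{2}_))\d+/;

// Preceded by _ and followed by exactly two digits and another _ .
const nestedHit = "_12_".match(nestedLookaround);
console.log(nestedHit[0]); // "12"

// Three digits: the inner lookahead \d{2}_ fails.
const nestedThreeDigits = "_123_".match(nestedLookaround);
// No leading underscore: the lookbehind fails.
const nestedNoPrefix = "12_".match(nestedLookaround);
console.log(nestedThreeDigits, nestedNoPrefix); // null null
```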

Compound lookahead and compound lookbehind

At most one character after the token

Match a number that is followed by one underscore, but not more:

\d+(?=_(?!_))

There is also a less elegant way to do this: \d+(?=(?!__)_)

At most one character before the token

Match a number that is preceded by one underscore, but not more:

(?<=(?<!_)_)\d+

There is also a less elegant way to do this: (?<=_(?<!__))\d+
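
A sketch that runs the four compound patterns against inputs with one and with two underscores. The inputs and the helper name firstMatch are hypothetical, and the lookbehind patterns assume ES2018 support.

```javascript
// Hypothetical helper: the matched text, or null when there is no match.
const firstMatch = (re, s) => {
  const m = s.match(re);
  return m === null ? null : m[0];
};

const oneAfter = /\d+(?=_(?!_))/;
const oneAfterAlt = /\d+(?=(?!__)_)/;
const oneBefore = /(?<=(?<!_)_)\d+/;
const oneBeforeAlt = /(?<=_(?<!__))\d+/;

// One trailing underscore matches, two do not.
const afterResults = [oneAfter, oneAfterAlt].map((re) => [firstMatch(re, "12_"), firstMatch(re, "12__")]);
// One leading underscore matches, two do not.
const beforeResults = [oneBefore, oneBeforeAlt].map((re) => [firstMatch(re, "_12"), firstMatch(re, "__12")]);

console.log(afterResults);  // [["12", null], ["12", null]]
console.log(beforeResults); // [["12", null], ["12", null]]
```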

Multiple compounding

This means nesting more than one level deep, so that several conditions are checked together, and it gets somewhat complex. The original's examples are not listed here, but you can look at this one:

(?<=(?<!(?<!X)_)_)\d+

It says that the digits must be preceded by one underscore and not several, except in the case X__ .
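
The rule can be checked against a few inputs. This is a sketch with made-up strings, assuming an engine with lookbehind support (ES2018 or later).

```javascript
// Three nested lookbehinds: one underscore, not two, unless X comes first.
const multiCompound = /(?<=(?<!(?<!X)_)_)\d+/;

const multiResults = ["_12", "__12", "X__12", "Y__12"].map((s) => {
  const m = s.match(multiCompound);
  return m === null ? null : m[0];
});

console.log(multiResults); // ["12", null, "12", null]
```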

The engine doesn't backtrack into lookarounds... because they're atomic

_rabbit _dog _mouse DIC:cat:dog:mouse
In this string, DIC: is followed by the allowed animal names. Among the _tokens at the front, we want to match those whose animal name is allowed.

_(\w+)\b(?=.*:\1\b)

This matches _dog and _mouse .
Now flip it around:

_(?=.*:(\w+)\b)\1\b

This only matches _mouse .
This is quite surprising and takes a moment to sink in. The first regex is easy to understand: at each _token , the lookahead uses the \1 just captured to check the text that follows, so scanning from left to right produces the two results. The second regex is different because the capture is inside the lookahead. The greedy .* makes the lookahead capture mouse , the last name in the list, whichever underscore the engine is at. The engine then leaves the lookahead and tries to match \1 , which succeeds only at _mouse , and the match ends there. The key point is that the engine does not backtrack into a lookahead: once it has left the lookahead, it never goes back in to try another capture. So the lookahead here only ever captures mouse . If you make it lazy instead, as I first thought of doing, it only ever captures cat .
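
The three variants can be run side by side in JS. This is a sketch: the variable names are illustrative, and the lazy pattern is my reading of the "non-greedy" variant mentioned above.

```javascript
const animals = "_rabbit _dog _mouse DIC:cat:dog:mouse";

// Capture outside the lookahead: each token is checked against the list.
const captureOutside = animals.match(/_(\w+)\b(?=.*:\1\b)/g);
console.log(captureOutside); // ["_dog", "_mouse"]

// Capture inside a greedy lookahead: it always captures mouse and is never retried.
const captureInside = animals.match(/_(?=.*:(\w+)\b)\1\b/g);
console.log(captureInside); // ["_mouse"]

// Lazy lookahead: it always captures cat, and no token is named _cat.
const lazyInside = animals.match(/_(?=.*?:(\w+)\b)\1\b/g);
console.log(lazyInside); // null
```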

Fixed-width, constrained-width and infinite-width lookbehind

This section is about lookbehind and is omitted.

Lookarounds (usually) want to be anchored

Match a string that consists of a single word containing at least one digit:

^(?=\D*\d)\w+$

The question to consider here is whether the ^ anchor is necessary.
The point of ^ is to cut down on failed attempts. Without ^ , the engine tries to match at every position and only reports failure after all of them have failed. With ^ , the engine stops as soon as the attempt at the first position fails. When the match succeeds the two versions return the same result, but the difference in performance can be large.

One exception: overlapping matches

But sometimes we do want the engine to match at several positions, as in the earlier example (?=(\w+)) . It matches four times in ABCD and gives the four results we want.

Postscript

The postscript of the original notes that the [^a-z]*[a-z] used above can be optimized to [^a-z]*+[a-z] , but a quick look shows that JS does not support this possessive quantifier. The optimization matters when the match fails: some less intelligent engines will give back the non-lowercase characters one at a time and retry the lowercase letter at each, which is obviously useless backtracking.

That covers the main content of the article. The next step is to learn about how the regex engine works.

Source of the translated article:
http://www.rexegg.com/regex-lookarounds.html


This article source: Jufofu

This address: http://www.cnblogs.com/JuFoFu/p/7719916.html

My knowledge is limited, so corrections are welcome. Please credit the source when reposting.
