Regular Expression backtracking

Source: Internet
Author: User

This is also my first encounter with "backtracking", and I am not very familiar with it. What follows is a record of what I understand so far, kept for future reference.

The regular expression matching we use rests on roughly two rules: the leftmost match wins (the match that starts earliest takes priority), and the standard quantifiers (*, +, ?, and {m,n}) are match-first, that is, they try to match before they try to skip.

As the name suggests, "the leftmost match wins" means that matching is attempted from the start of the string and moves on toward its end. Regular expression engines are also divided into two types. One is the "nondeterministic finite automaton (NFA)", which can also be called "expression-dominated". The other is the "deterministic finite automaton (DFA)", or "text-dominated". The regular expressions we currently use in JavaScript are "expression-dominated". The difference between expression-dominated and text-dominated is hard to explain in the abstract, so an example may make it clearer.

// use a regular expression to match text
var reg = /to(nite|knight|night)/;
var str = 'doing tonight';
reg.test(str);

In the above example, the first element [t] is tried at each position until a 't' is found in the target string. Then the engine checks whether the following character can be matched by [o]. If so, it moves on to the next element, (nite|knight|night), which means "nite", "knight", or "night". The engine tries these three alternatives in order. To try [nite], it first tries [n], then [i], then [t], and finally [e]. If this attempt fails, the engine tries the next alternative, and so on until a match succeeds or it reports failure. Control passes from one element of the expression to the next, which is why this is called "expression-dominated".
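As a quick check of the walk-through above, the snippet below runs the same expression with exec instead of test, so that the matched text and the successful alternative are visible. The variable names are only illustrative.

```javascript
// Regex and input taken from the example above.
const reg = /to(nite|knight|night)/;
const str = 'doing tonight';

const match = reg.exec(str);

console.log(match[0]);    // whole match: "tonight"
console.log(match[1]);    // alternative that succeeded: "night"
console.log(match.index); // start of the match in the string: 6
```

The engine tried [nite] and [knight] first and gave up on both before [night] succeeded, which is why the captured group is "night".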

With a "text-dominated" engine, the same example works differently: while scanning the string, the engine keeps track of every match that is still possible. When it reaches t, it adds a potential match to the set of possibilities currently being processed:

Position in the string      Position in the regular expression
... doing t↑onight          Possible matches: /t↑o(nite|knight|night)/

Each character scanned after that updates the set of matches that are still possible. After two more characters have been scanned, the situation is as follows:

Position in the string      Position in the regular expression
... doing ton↑ight          Possible matches: /to(n↑ite|knight|n↑ight)/

The possible matches are now down to two (knight has been eliminated). When g is scanned, only one possible match is left. After h and t have been matched, the engine finds that the match is complete and reports success. This is called "text-dominated" because every character scanned in the string controls the engine.
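The text-dominated scan can be imitated with a few lines of JavaScript. This is only a rough sketch and not how a real DFA engine is built: it assumes the expression has been expanded by hand into its three literal candidates, and the names candidates and scanTextDirected are made up for this illustration. Unlike the diagrams above, it tracks the three candidates separately even while they share the prefix "to".

```javascript
// Assumption: /to(nite|knight|night)/ expanded by hand into literal strings.
const candidates = ['tonite', 'toknight', 'tonight'];

function scanTextDirected(text) {
  // Each live entry records a candidate, how many of its characters have
  // matched so far, and where in the text the attempt started.
  let live = [];
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const next = [];
    // Advance every live candidate by the current character, or drop it.
    for (const { word, matched, start } of live) {
      if (word[matched] === ch) {
        if (matched + 1 === word.length) return { word, start };
        next.push({ word, matched: matched + 1, start });
      }
    }
    // A new potential match may also begin at this character.
    for (const word of candidates) {
      if (word[0] === ch) next.push({ word, matched: 1, start: i });
    }
    live = next;
    // Show the surviving possibilities, with ^ marking the progress in each.
    console.log(ch, live.map(s => s.word.slice(0, s.matched) + '^' + s.word.slice(s.matched)));
  }
  return null;
}

const result = scanTextDirected('doing tonight');
console.log(result);
```

Each character of the text is looked at exactly once, and the list of live candidates shrinks from three to two to one, as in the description above. Nothing is ever saved for later and nothing is retried.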

To understand how "expression-dominated" matching works, we need to look at today's topic, "backtracking". Backtracking is like walking a road full of forks. Each time you reach a fork, you leave a mark at it before choosing a path. If the path turns out to be a dead end, you walk back the way you came until you reach a mark you left earlier, and take a path from it that you have not tried yet. If that path also leads nowhere, you go back again, find the next mark, and repeat, until you either find the way out or have exhausted every untried path.

In many cases the regular expression engine has to choose between two (or more) options. When it encounters /...x?.../, the engine must decide whether or not to try to match x. For /...x+.../, there is no doubt that x must be tried at least once, because the plus sign requires at least one match. After the first x has matched, this requirement is met and the engine has to decide whether to try the next x. If it does, it then has to decide whether to match a third x, a fourth x, and so on. Each such choice is in effect a mark: it records that another option exists and keeps it in reserve. Two key questions arise in the process of backtracking: which branch should be chosen first, and which of the previously saved branches should be used when backtracking?

The first question is settled by the following important principle:

If the choice is between "try" and "skip", the engine chooses "try" first for match-first (greedy) quantifiers and "skip" first for ignore-first (lazy) quantifiers.
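The difference between the two kinds of quantifier is easy to observe in JavaScript, where ? is the match-first form and ?? is the ignore-first form of the same quantifier. The variable names below are only illustrative.

```javascript
// Match-first: the engine tries [b] before considering skipping it.
const greedy = /ab?/.exec('abc')[0];

// Ignore-first: the engine skips [b] first, and nothing forces it to go back.
const lazy = /ab??/.exec('abc')[0];

// Ignore-first again, but now [c] fails against "b" after the skip,
// so the engine backtracks and matches [b] after all.
const lazyForced = /ab??c/.exec('abc')[0];

console.log(greedy);     // "ab"
console.log(lazy);       // "a"
console.log(lazyForced); // "abc"
```

The quantifier only decides which option is tried first. The other option is still saved and is used if the first one leads to a failure.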

The second question is settled by this principle:

When a local failure forces the engine to backtrack, it returns to the most recently saved option. The principle is LIFO (last in, first out).

Let's look at several examples of leaving marks along the road:

1. A match without backtracking

Use [ab?c] to match "abc". After [a] has matched, the current state is as follows (↑ marks the current position in the string and in the regular expression):

"a↑bc"      a↑b?c

Now it is the turn of [b?]. The engine has to decide whether to try [b] or skip it. Because [?] is match-first, it will try to match. However, so that it can recover if the attempt fails, the engine first adds:

"a↑bc"      ab?↑c

to its list of standby states. In other words, the engine may later resume matching from this position: in the regular expression, just after [b?]; in the string, just before b (that is, the current position). This amounts to skipping the match of [b], which the question mark allows. Having left this mark, the engine goes on to check [b]. In this example it matches, so the new current state becomes:

"ab↑c"      ab?↑c

The final [c] also matches, so the entire match is complete. The standby state is no longer needed and is simply discarded.

2. A match with backtracking

The text to be matched this time is "ac". Everything is the same as before up to the attempt at [b]. Obviously, [b] cannot match here, so the attempt at [b?] fails. Because a standby state exists, this local failure does not make the overall match fail. The engine backtracks, that is, it switches the current state to the most recently saved state:

"a↑c"      ab?↑c

This is the untried option that was saved before [b] was attempted. This time [c] can match c, so the entire match is declared complete.

3. Unsuccessful matching

The text to be matched is "abx". Before [b] is tried, a standby state is saved because of the question mark:

"a↑bx"      ab?↑c

[b] can match, but this path leads nowhere, because [c] cannot match x. The engine therefore goes back to the saved state, which hands b back to be matched against [c]. Obviously, this attempt fails as well. If other saved states existed, backtracking would continue, but there are none at this point, so the entire match attempt starting at the current position in the string fails.
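The three cases above can be replayed with a very small backtracking matcher. This is a sketch under simplifying assumptions and not how any real engine is implemented: the pattern is written by hand as an array of literal characters, each optionally followed by a match-first question mark, and the names pattern and matchAt are invented for this example. The standby states are kept on a stack, so the most recently saved one is always used first (LIFO).

```javascript
// Assumption: this hand-written array stands for the expression ab?c.
const pattern = [
  { ch: 'a', optional: false },
  { ch: 'b', optional: true },
  { ch: 'c', optional: false },
];

// Tries to match the pattern at position `start` of `text` only.
function matchAt(pattern, text, start) {
  const saved = [];   // standby states, used as a LIFO stack
  let backtracks = 0; // how many times a saved state was restored
  let p = 0;          // position in the pattern
  let t = start;      // position in the text

  while (true) {
    if (p === pattern.length) {
      return { matched: true, end: t, backtracks };
    }
    const { ch, optional } = pattern[p];
    if (optional) {
      // Match-first: save the "skip" option, then try the element.
      saved.push({ p: p + 1, t });
    }
    if (text[t] === ch) {
      p += 1;
      t += 1;
      continue;
    }
    // Local failure: with no saved state left, the whole attempt fails.
    if (saved.length === 0) {
      return { matched: false, end: -1, backtracks };
    }
    // Otherwise resume from the most recently saved state.
    ({ p, t } = saved.pop());
    backtracks += 1;
  }
}

for (const text of ['abc', 'ac', 'abx']) {
  console.log(text, matchAt(pattern, text, 0));
}
```

For "abc" the saved state is never used, for "ac" it is restored once and the match succeeds, and for "abx" it is restored once and the attempt still fails because no further states remain.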

For now, this is as far as my understanding of regular expression backtracking goes. I will come back to it later!
