Matching Principles of the Regular Expression NFA Engine

Source: Internet
Author: User

1. Why should we understand how the engine matches?

Notes thrown together at random are likely to produce noise, yet in the hands of a composer the same notes become music. A performer can play beautiful music from a score, but may not know how to rearrange the notes to make the music more beautiful still.

The same applies to regular expressions. Without understanding how the regular expression engine works, you can still write expressions that meet your needs. However, it is hard to write expressions that are efficient and free of hidden risks. Anyone who uses regular expressions often, or who wants to learn them in depth, therefore needs to understand how the engine matches.

2. Regular Expression Engine

Regular expression engines fall into two types, DFA and NFA, and NFA engines are further divided into traditional NFA and POSIX NFA.

DFA: Deterministic Finite Automaton

NFA: Non-deterministic Finite Automaton

Traditional NFA

POSIX NFA

Because a DFA engine needs no backtracking, it matches quickly. However, it does not support capture groups, so backreferences and $n references are not available either. DFA engines are mainly used in tools and languages such as awk, egrep, and lex.

POSIX NFA refers to an NFA engine that follows the POSIX standard. It must return the leftmost-longest match, so it keeps backtracking until it is certain it has found the longest match at the leftmost position. As with DFA, lazy (non-greedy) quantifiers are meaningless for a POSIX NFA.

Most languages and tools use a traditional NFA engine, which has some features that DFA does not support:

Capture groups, backreferences, and $n references;

Lookaround: (?<=...), (?<!...), (?=...), (?!...), which some articles call "pre-search";

Lazy quantifiers (??, *?, +?, {m,n}?, {m,}?), which some articles call non-greedy mode;

Possessive quantifiers (?+, *+, ++, {m,n}+, {m,}+; supported by Java and PCRE, among others) and atomic grouping (?>...).
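As a rough illustration of the first three features, here is a small sketch using Python's re module, which is assumed here to stand in for a typical traditional NFA engine. The sample strings are made up for the example, and Python writes $n references as \1 or \g<1>.

```python
import re

# Capture group plus backreference: \1 must repeat what group 1 captured.
m = re.search(r"(\w+) \1", "say hello hello world")
repeated = m.group(1)

# Numbered reference in a replacement (other flavors write $2@$1).
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# Lookaround: a lookahead and a lookbehind, both zero-width.
before_px = re.findall(r"\d+(?=px)", "10px 20em 30px")
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15, tax $2")

# A greedy quantifier next to its lazy counterpart.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

print(repeated, swapped, before_px, after_dollar, greedy, lazy)
```

Possessive quantifiers and atomic groups are left out of the sketch because support for them varies between flavors and versions.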

The differences between engines are not the focus of this article, so they are only introduced briefly here. If you are interested, refer to the relevant literature.

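3. Preliminaries

3.1 String composition

The string "abc" consists of three characters and four positions: position 0 before "a", position 1 between "a" and "b", position 2 between "b" and "c", and position 3 after "c".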
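One way to see the four positions, sketched with Python's re module (an assumption of this example, not something the article prescribes): an empty pattern matches at every position, so collecting the match offsets lists them all.

```python
import re

text = "abc"

# An empty pattern is zero-width and succeeds at every position in the string.
positions = [m.start() for m in re.finditer(r"", text)]

print(positions)                    # one entry per position
print(len(text), len(positions))    # characters versus positions
```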

3.2 Consuming characters and zero width

During matching, if a subexpression matches character content rather than a position, and that content is saved into the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved into the final result, the subexpression is zero-width.

Consuming characters is mutually exclusive, whereas zero-width matching is not: a character can be matched by only one subexpression at a time, while a position can be matched by several zero-width subexpressions at once.
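The distinction can be observed through match spans. The following is a minimal sketch in Python's re module; the patterns are illustrative choices, not taken from the article.

```python
import re

text = "abc"

# A consuming subexpression: the span covers one character.
consuming = re.match(r"a", text).span()

# A zero-width subexpression: start and end of the span coincide.
zero_width = re.match(r"(?=a)", text).span()

# Several zero-width subexpressions can all match at position 0.
stacked = re.match(r"^(?=a)(?=ab)(?=abc)", text).span()

# Consuming is exclusive: the single "a" cannot be taken by two subexpressions.
exclusive = re.match(r"(a)(a)", text)

print(consuming, zero_width, stacked, exclusive)
```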

3.3 Control and its transfer

Regular expression matching usually proceeds as follows: a subexpression (an ordinary character, a metacharacter, or a sequence of metacharacters) takes control and attempts a match at some position in the string. Each subexpression begins its attempt at the position where the previous subexpression's successful match ended. Take the regular expression:

(subexpression 1)(subexpression 2)

Suppose (subexpression 1) is zero-width. Its match starts and ends at the same position, say position 0, so (subexpression 2) begins its attempt at position 0.

Suppose (subexpression 1) consumes characters. Its match starts and ends at different positions; if it succeeds starting at position 0 and ends at position 2, then (subexpression 2) begins its attempt at position 2.
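Both cases can be checked by comparing where group 1 ends with where group 2 starts. This is a small sketch in Python's re module; the concrete subexpressions "ab", "c", and "^" are assumptions chosen for the example.

```python
import re

# Consuming case: group 1 ends at position 2, so group 2 starts there.
m = re.match(r"(ab)(c)", "abc")
consuming_handoff = (m.span(1), m.span(2))

# Zero-width case: group 1 starts and ends at 0, so group 2 also starts at 0.
z = re.match(r"(^)(ab)", "abc")
zero_width_handoff = (z.span(1), z.span(2))

print(consuming_handoff, zero_width_handoff)
```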

For the whole expression, matching usually starts at position 0 of the string. If the attempt that starts at position 0 fails at some point in the string, the engine advances the regular expression by one position, and the whole expression is tried again from position 1. This repeats until a match is reported, or until the attempt at the last position fails and a failure is reported.
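The advance-and-retry loop can be sketched as follows. The helper name bump_along is hypothetical, and the sketch leans on Python's Pattern.match(text, pos), which attempts a match only at the given position. Real engines typically add optimizations on top of this loop.

```python
import re

def bump_along(pattern, text):
    """Retry the whole pattern at each start position, left to right.

    Returns (positions tried, match object or None).
    """
    regex = re.compile(pattern)
    tried = []
    for start in range(len(text) + 1):      # positions 0 .. len(text)
        tried.append(start)
        m = regex.match(text, start)        # attempt anchored at `start`
        if m is not None:
            return tried, m                 # report the first success
    return tried, None                      # every position failed

tried_c, match_c = bump_along(r"c", "abc")
tried_x, match_x = bump_along(r"x", "abc")
print(tried_c, match_c.span(), tried_x, match_x)
```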

4. Simple matching processes

4.1 Basic matching process

Source string: abc

Regular expression: abc

Matching process:

First the character "a" takes control and attempts a match at position 0: "a" matches "a". The match succeeds and control passes to the character "b". Because "a" has already been matched by "a", "b" begins its attempt at position 1: "b" matches "b". The match succeeds and control passes to "c". "c" matches "c", and the match succeeds.

At this point the regular expression has been matched completely, and success is reported. The match result is "abc", the start position is 0, and the end position is 3.

4.2 Matching with a greedy quantifier: successful match (1)

Source string: abc

Regular expression: ab?c

The quantifier "?" is a greedy (match-first) quantifier. When it has the choice between matching and not matching, it tries to match first. Only if that choice prevents the whole expression from matching does it try not matching. Here "?" modifies the character "b", so "b?" forms a single unit.

Matching process:

First the character "a" takes control and attempts a match at position 0: "a" matches "a". The match succeeds and control passes to "b?". Because "?" is greedy, it tries to match first: "b?" matches "b". The match succeeds, a backtracking state is saved, and control passes to "c". "c" matches "c", and the match succeeds. The saved backtracking state is discarded.

At this point the regular expression has been matched completely, and success is reported. The match result is "abc", the start position is 0, and the end position is 3.

4.3 Matching with a greedy quantifier: successful match (2)

Source string: ac

Regular expression: ab?c

Matching process:

First the character "a" takes control and attempts a match at position 0: "a" matches "a". The match succeeds and control passes to "b?". It tries to match first, saving a backtracking state: "b?" attempts to match "c" and fails. The engine backtracks to the saved state, where "b?" skips matching and hands control to "c". "c" matches "c", and the match succeeds.

At this point the regular expression has been matched completely, and success is reported. The match result is "ac", the start position is 0, and the end position is 2. "b?" matches nothing.
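The two successful cases can be compared in Python's re module. In this sketch the parentheses around "b?" are an addition made only so that the text matched by "b?" can be inspected; they are not part of the article's expression.

```python
import re

# Same expression as in the text, with "b?" wrapped in a capture group.
pattern = re.compile(r"a(b?)c")

with_b = pattern.search("abc")      # "b?" takes the "b"
without_b = pattern.search("ac")    # "b?" falls back to matching nothing

print(with_b.span(), repr(with_b.group(1)))
print(without_b.span(), repr(without_b.group(1)))
```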

4.4 Matching with a greedy quantifier: failed match

Source string: abd

Regular expression: ab?c

Matching process:

First the character "a" takes control and attempts a match at position 0: "a" matches "a". The match succeeds and control passes to "b?". It tries to match first, saving a backtracking state: "b?" matches "b". The match succeeds and control passes to "c". "c" attempts to match "d" and fails. The engine backtracks to the saved state, where "b?" skips matching, that is, "b?" does not match "b", and hands control to "c". "c" attempts to match "b" and fails. The first round of attempts has now failed.

The engine advances the regular expression and starts a new attempt at position 1: "a" attempts to match "b" and fails. There is no saved state to return to, so the second round of attempts fails.

The engine keeps advancing until the attempt at position 3 fails, and matching ends. The whole expression is then reported as failing to match.
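The walkthroughs above can be mimicked with a toy backtracking matcher. This is only a sketch under stated assumptions, not how any real engine is implemented: it handles just literal characters optionally followed by "?" or "??", and it keeps saved states implicitly on the call stack. The names parse, match_here, and search are hypothetical.

```python
def parse(pattern):
    """Split e.g. 'ab??c' into (char, mode) items: 'one', 'greedy' or 'lazy'."""
    items, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("??", i + 1):
            items.append((ch, "lazy"))
            i += 3
        elif pattern.startswith("?", i + 1):
            items.append((ch, "greedy"))
            i += 2
        else:
            items.append((ch, "one"))
            i += 1
    return items

def match_here(items, text, k, pos, trace):
    """Match items[k:] starting at text[pos]; return the end position or None."""
    if k == len(items):
        return pos                                  # nothing left to match
    ch, mode = items[k]
    hit = pos < len(text) and text[pos] == ch

    def take():                                     # consume one character
        return match_here(items, text, k + 1, pos + 1, trace) if hit else None

    def skip():                                     # match nothing, stay at pos
        return match_here(items, text, k + 1, pos, trace)

    if mode == "one":
        return take()
    # Greedy tries to match first; lazy tries to skip first.
    first, second = (take, skip) if mode == "greedy" else (skip, take)
    end = first()
    if end is None:                                 # first choice failed
        trace.append((ch, mode, pos))               # record the backtrack
        end = second()                              # try the other choice
    return end

def search(pattern, text):
    """Try each start position in turn; return ((start, end) or None, trace)."""
    items, trace = parse(pattern), []
    for start in range(len(text) + 1):
        end = match_here(items, text, 0, start, trace)
        if end is not None:
            return (start, end), trace
    return None, trace

for subject in ("abc", "ac", "abd"):
    print(subject, search("ab?c", subject))
```

Each trace entry records which quantified character was backtracked into and at which position, so "ac" and "abd" both show one backtrack at position 1 while "abc" shows none.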

4.5 Matching with a lazy quantifier: successful match

Source string: abc

Regular expression: ab??c

The quantifier "??" is a lazy (ignore-first) quantifier. When it has the choice between matching and not matching, it chooses not to match first. Only if that choice prevents the whole expression from matching does it try to match. Here "??" modifies the character "b", so "b??" forms a single unit.

Matching process:

First the character "a" takes control and attempts a match at position 0: "a" matches "a". The match succeeds and control passes to "b??". It first tries skipping the match, that is, "b??" matches nothing, saves a backtracking state, and hands control to "c". "c" attempts to match "b" and fails. The engine backtracks to the saved state, where "b??" now tries to match: "b??" matches "b". The match succeeds and control passes to "c". "c" matches "c", and the match succeeds.

At this point the regular expression has been matched completely, and success is reported. The match result is "abc", the start position is 0, and the end position is 3. "b??" matches the character "b".
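A brief check in Python's re module, offered as a sketch: the capture group around "b??" and the variants without the trailing "c" are additions for illustration. With the trailing "c" the lazy quantifier is forced to take "b"; without it, the lazy and greedy forms diverge.

```python
import re

# With the trailing "c", "b??" ends up matching "b" after one backtrack.
forced = re.search(r"a(b??)c", "abc")

# Without anything after the quantifier, the first choice stands.
greedy_alone = re.search(r"ab?", "abc").group()
lazy_alone = re.search(r"ab??", "abc").group()

print(forced.span(), repr(forced.group(1)), greedy_alone, lazy_alone)
```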

4.6 Zero-width matching process

Source string: a12

Regular expression: ^(?=[a-z])[a-z0-9]+$

The metacharacters "^" and "$" match only positions. The lookahead "(?=[a-z])" only tests for a match: it consumes no characters and does not save what it matched into the final result. All three are therefore zero-width.

This regular expression matches a string made up of letters and digits whose first character is a letter.

Matching process:

First the metacharacter "^" takes control and attempts a match at position 0. "^" matches the start of the string, which is position 0. The match succeeds and control passes to the lookahead "(?=[a-z])".

"(?=[a-z])" requires that the character to the right of its position be a letter. Zero-width subexpressions are not mutually exclusive, that is, the same position can be matched by several zero-width subexpressions at once, so it also attempts its match at position 0. The character to the right of position 0 is "a", which meets the requirement. The match succeeds and control passes to "[a-z0-9]+".

Because "(?=[a-z])" only tests for a match and does not save the matched content into the final result, and the position it matched is 0, "[a-z0-9]+" also begins its attempt at position 0. "[a-z0-9]+" first matches "a" successfully, then keeps trying and successfully matches "1" and "2". It has now matched up to position 3, and there are no characters to the right of position 3, so control passes to "$".

The metacharacter "$" attempts a match at position 3. It matches the end of the string, which is position 3, and the match succeeds.

At this point the regular expression has been matched completely, and success is reported. The match result is "a12", the start position is 0, and the end position is 3. "^" matches position 0, "(?=[a-z])" matches position 0, "[a-z0-9]+" matches the string "a12", and "$" matches position 3.
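The position each part ends up matching can be inspected by wrapping each part in its own capture group. This is a sketch in Python's re module; the extra parentheses and the rejected sample "1a2" are assumptions added for the example.

```python
import re

# Each part of ^(?=[a-z])[a-z0-9]+$ gets its own group so its span is visible.
m = re.match(r"(^)((?=[a-z]))([a-z0-9]+)($)", "a12")
spans = [m.span(i) for i in range(1, 5)]

# A string starting with a digit fails at the lookahead.
rejected = re.match(r"^(?=[a-z])[a-z0-9]+$", "1a2")

print(spans, m.group(3), rejected)
```

The three zero-width parts report spans whose start and end coincide, while "[a-z0-9]+" is the only part that covers characters.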
