Regular expression basics: NFA engine matching principles

Source: Internet
Author: User
Tags: posix, expression, engine

1 Why understand the engine's matching principle

A pile of random notes may be nothing but noise, yet in a composer's hands the same notes become beautiful music. A performer can play that melody well and still not know how to rearrange the notes to make the music more beautiful.

The same is true of regex users. Without understanding how the engine works you can still write regular expressions that meet the requirement, but it is hard to write ones that are efficient and free of hidden pitfalls. For anyone who uses regular expressions often, or who wants to study them in depth, it is therefore worth understanding how the regex engine matches.

2 Regular expression engines

Regex engines fall into two broad categories, DFA and NFA, and NFA engines are further divided into traditional NFA and POSIX NFA.

DFA: deterministic finite automaton

NFA: non-deterministic finite automaton

Traditional NFA

POSIX NFA

A DFA engine does not need to backtrack, so it matches quickly. However, it does not support capturing groups, and therefore supports neither back-references nor $number references. The languages and tools that currently use a DFA engine are mainly awk, egrep, and lex.

A POSIX NFA is an NFA engine that conforms to the POSIX standard. Its defining feature is leftmost-longest matching: it keeps backtracking until it is sure it has found the longest match at the leftmost starting position. As with a DFA, lazy (non-greedy) quantifiers are therefore meaningless to a POSIX NFA.
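The difference between "first alternative that works" and "leftmost-longest" can be sketched in Python. This is only an illustration under stated assumptions: Python's re module is a backtracking engine of the traditional NFA kind, and the leftmost-longest result is emulated by hand here (the `alternatives` list and the `max` step are not part of any POSIX engine).

```python
import re

# Python's re module backtracks like a traditional NFA: alternatives are
# tried left to right, and the first one that lets the match succeed wins,
# even when a longer match exists at the same starting position.
first_wins = re.match(r"a|ab", "ab").group()

# Hand-rolled emulation of leftmost-longest: try every alternative at the
# same starting position and keep the longest successful match.
alternatives = ["a", "ab"]
candidates = [re.match(alt, "ab") for alt in alternatives]
longest = max((c.group() for c in candidates if c), key=len)

print(first_wins, longest)  # a ab
```

A POSIX-compliant engine is required to report "ab" for this input, which is why it cannot stop at the first alternative that succeeds.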

Most languages and tools use a traditional NFA engine, which offers features that a DFA does not support:

Capturing groups, back-references, and $number references;

Lookaround ((?<=...), (?<!...), (?=...), (?!...)), which some articles call pre-search;

Lazy quantifiers (??, *?, +?, {m,n}?, {m,}?), which some articles call non-greedy mode;

Possessive quantifiers (?+, *+, ++, {m,n}+, {m,}+; at the time of writing supported only by Java and PCRE) and atomic groups ((?>...)).
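The first three features in this list can be tried out with a short Python sketch. Python is an assumption made for illustration (the article does not name a language), and the sample strings are invented. Possessive quantifiers and atomic groups are left out because their availability depends on the engine and version.

```python
import re

# Capturing group plus back-reference: \1 must repeat what group 1 captured.
doubled = re.search(r"(\w)\1", "hello").group()

# Numbered reference in a replacement (Python writes \1 where others use $1).
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# Lookaround: digits preceded by "$" and not followed by "%".
price = re.search(r"(?<=\$)\d+(?!%)", "cost: $42").group()

# Greedy versus lazy quantifier on the same input.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

print(doubled, swapped, price, greedy, lazy)
# ll host@user 42 <a><b> <a>
```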

The differences between engines are not the focus of this article, so only a brief introduction is given here. Interested readers can consult the relevant literature.

3 Preliminary knowledge

3.1 String composition

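The string "abc" consists of three characters and four positions: position 0 before "a", position 1 between "a" and "b", position 2 between "b" and "c", and position 3 after "c".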
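One way to see the four positions is to match an empty pattern, which consumes nothing and therefore succeeds once at every position. A minimal sketch, assuming Python's re module:

```python
import re

text = "abc"

# An empty pattern matches at each position without consuming a character.
positions = [m.start() for m in re.finditer(r"", text)]

print(len(text), positions)  # 3 [0, 1, 2, 3]
```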

3.2 Consuming characters and zero width

During matching, if a subexpression matches character content (rather than a position) and that content is saved into the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or what it matches is not saved into the final match result, the subexpression is said to be zero-width.

Consuming characters is mutually exclusive; zero-width matching is not. A character can be matched by only one subexpression at a time, whereas a position can be matched by several zero-width subexpressions at once.
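The distinction can be checked with a few one-line patterns. This is a sketch assuming Python's re module; the patterns are illustrative and do not come from the article.

```python
import re

text = "abc"

# Zero-width: "^", "(?=a)" and "(?=\w)" all succeed at position 0, and the
# overall match is empty because none of them consumes a character.
zero_width = re.match(r"^(?=a)(?=\w)", text)

# Consuming: once the first "a" has consumed the character at position 0,
# the second "a" must match the next character ("b"), so the match fails.
consumed_twice = re.match(r"aa", text)

# Mixed: the lookahead inspects "a" without consuming it, so the literal
# "a" that follows can still consume that same character.
mixed = re.match(r"(?=a)a", text)

print(zero_width.span(), consumed_twice, mixed.span())
# (0, 0) None (0, 1)
```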

3.3 Control and transmission

Matching usually proceeds as follows: one subexpression (a literal character, a metacharacter, or a sequence of metacharacters) takes control and attempts a match starting at some position in the string. The position where a subexpression starts its attempt is the position where the previous subexpression's successful match ended. Consider a regular expression of the form:

(subexpression one)(subexpression two)

Suppose (subexpression one) is zero-width. Its match starts and ends at the same position, say position 0, so (subexpression two) also starts its attempt at position 0.

Suppose instead that (subexpression one) consumes characters. Its match starts and ends at different positions; if it succeeds starting at position 0 and ends at position 2, then (subexpression two) starts its attempt at position 2.

The expression as a whole usually starts its first attempt at position 0 of the string. If the attempt that started at position 0 fails somewhere in the string, the engine drives the regex forward (transmission) and the whole expression tries again from position 1, and so on, until either a match is reported or the attempt at the last position fails and a match failure is reported.
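The forward drive of the whole expression can be sketched as an explicit loop. The helper name `search_with_transmission` is hypothetical; it leans on Python's `Pattern.match(string, pos)`, which attempts a match only at the given position, to stand in for one attempt of the engine.

```python
import re


def search_with_transmission(pattern, text):
    """Retry the whole expression at positions 0, 1, 2, ... until it matches."""
    compiled = re.compile(pattern)
    attempts = []  # starting positions that were tried, in order
    for start in range(len(text) + 1):
        attempts.append(start)
        m = compiled.match(text, start)  # one attempt anchored at `start`
        if m:
            return m.span(), attempts
    return None, attempts


print(search_with_transmission(r"bc", "abc"))    # ((1, 3), [0, 1])
print(search_with_transmission(r"ab?c", "abd"))  # (None, [0, 1, 2, 3])
```

The second call shows that a failure is reported only after every position, including the one after the last character, has been tried.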

4 Simple regular expression matching processes

4.1 Basic matching process

Source string: abc

Regular expression: abc

Matching process:

First the character "a" takes control and starts matching at position 0. "a" matches "a" successfully, and control passes to "b". Because the character "a" has already been matched by "a", "b" starts its attempt at position 1. "b" matches "b" successfully, and control passes to "c". "c" matches "c" successfully.

At this point the whole expression has matched and success is reported. The match result is "abc", starting at position 0 and ending at position 3.
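The handover of control in this walkthrough can be traced with a toy matcher for patterns made only of literal characters. The function `match_literals` is a hypothetical helper written for this illustration, not how a real engine is implemented.

```python
def match_literals(pattern, text, start=0):
    """Match a pattern of literal characters, beginning at position `start`."""
    pos = start
    trace = []  # (pattern character, position where it tried, outcome)
    for ch in pattern:
        if pos < len(text) and text[pos] == ch:
            trace.append((ch, pos, "ok"))
            pos += 1  # the next subexpression starts where this one ended
        else:
            trace.append((ch, pos, "fail"))
            return None, trace
    return (start, pos), trace


span, trace = match_literals("abc", "abc")
print(span)   # (0, 3)
print(trace)  # [('a', 0, 'ok'), ('b', 1, 'ok'), ('c', 2, 'ok')]
```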

4.2 Matching process with a greedy quantifier: match succeeds (I)

Source string: abc

Regular expression: ab?c

The quantifier "?" is a greedy (match-first) quantifier: when it can either match or not match, it tries to match first, and gives back what it matched only if the expression as a whole cannot otherwise succeed. Here "?" modifies the character "b", so "b?" is a single unit.

Matching process:

First the character "a" takes control and starts matching at position 0. "a" matches "a" successfully, and control passes to "b?". Because "?" is a greedy quantifier, the match is attempted first: "b?" matches "b" successfully and records a backup state (the option of not matching). Control passes to "c", and "c" matches "c" successfully. The recorded backup state is discarded.

At this point the whole expression has matched and success is reported. The match result is "abc", starting at position 0 and ending at position 3.

4.3 Matching process with a greedy quantifier: match succeeds (II)

Source string: ac

Regular expression: ab?c

Matching process:

First the character "a" takes control and starts matching at position 0. "a" matches "a" successfully, and control passes to "b?". The match is attempted first: "b?" records a backup state and tries to match "c", which fails. The engine backtracks to the backup state: "b?" skips the match and hands control to "c". "c" matches "c" successfully.

At this point the whole expression has matched and success is reported. The match result is "ac", starting at position 0 and ending at position 2. Here "b?" matches nothing.

4.4 Matching process with a greedy quantifier: match fails

Source string: abd

Regular expression: ab?c

Matching process:

First the character "a" takes control and starts matching at position 0. "a" matches "a" successfully, and control passes to "b?". The match is attempted first: "b?" records a backup state and matches "b" successfully. Control passes to "c", which tries to match "d" and fails. The engine backtracks to the recorded backup state: "b?" skips the match, that is, "b?" does not match "b", and hands control to "c". "c" now tries to match "b" and fails. The first round of attempts has failed.

The engine drives the regex forward and starts a new attempt at position 1. "a" tries to match "b" and fails. There is no backup state, so the second round of attempts fails as well.

The engine keeps driving forward until the attempt at position 3 also fails. At that point the whole expression is reported as failing to match.
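The outcomes of the three walkthroughs for "ab?c" can be checked against a real backtracking engine. The sketch below assumes Python's re module; the capturing group around "b?" is an addition made only to expose what "b?" ended up matching.

```python
import re

# The parentheses are not in the article's regex; they only capture "b?".
pattern = re.compile(r"a(b?)c")

results = {}
for text in ("abc", "ac", "abd"):
    m = pattern.search(text)
    results[text] = (m.span(), m.group(1)) if m else None

print(results)
# {'abc': ((0, 3), 'b'), 'ac': ((0, 2), ''), 'abd': None}
```

The empty string captured for "ac" corresponds to "b?" giving up its match after backtracking.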

4.5 Matching process with a lazy quantifier: match succeeds

Source string: abc

Regular expression: ab??c

The quantifier "??" is a lazy (ignore-first) quantifier: when it can either match or not match, it chooses not to match first, and tries to match only if that choice prevents the expression as a whole from succeeding. Here "??" modifies the character "b", so "b??" is a single unit.

Matching process:

First the character "a" takes control and starts matching at position 0. "a" matches "a" successfully, and control passes to "b??". The match is skipped first: "b??" matches nothing and records a backup state. Control passes to "c", which tries to match "b" and fails. The engine backtracks to the recorded backup state: "b??" now attempts the match and matches "b" successfully. Control passes to "c", and "c" matches "c" successfully.

At this point the whole expression has matched and success is reported. The match result is "abc", starting at position 0 and ending at position 3. Here "b??" matches the character "b".
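The backup states and backtracking described in sections 4.2 to 4.5 can be made visible with a toy backtracking matcher. This is a minimal sketch, not how any production engine is built: the names `parse`, `match_at` and `search` are hypothetical, and the toy syntax understands only literal characters optionally followed by "?" or "??".

```python
def parse(pattern):
    """Split a toy pattern into (character, quantifier) tokens."""
    tokens, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        i += 1
        quant = ""
        if pattern.startswith("??", i):
            quant, i = "??", i + 2
        elif pattern.startswith("?", i):
            quant, i = "?", i + 1
        tokens.append((ch, quant))
    return tokens


def match_at(tokens, text, start, log):
    """One attempt of the whole expression, beginning at position `start`."""
    saved = []  # stack of backup states: (token index, text position)
    ti, pos = 0, start
    while ti < len(tokens):
        ch, quant = tokens[ti]
        can_take = pos < len(text) and text[pos] == ch
        if quant == "?":
            # Greedy: record "skip it" as the backup, then try to match.
            saved.append((ti + 1, pos))
        elif quant == "??":
            # Lazy: skip first; matching the character is the backup.
            if can_take:
                saved.append((ti + 1, pos + 1))
            ti += 1
            continue
        if can_take:
            ti, pos = ti + 1, pos + 1
            continue
        # Local failure: return to the most recent backup state, if any.
        if not saved:
            return None
        ti, pos = saved.pop()
        log.append(f"backtrack to token {ti} at position {pos}")
    return (start, pos)  # leftover backup states are simply discarded


def search(pattern, text):
    """Drive the expression forward through every starting position."""
    tokens, log = parse(pattern), []
    for start in range(len(text) + 1):
        span = match_at(tokens, text, start, log)
        if span is not None:
            return span, log
    return None, log


print(search("ab?c", "abc"))   # ((0, 3), [])
print(search("ab?c", "ac"))    # ((0, 2), ['backtrack to token 2 at position 1'])
print(search("ab??c", "abc"))  # ((0, 3), ['backtrack to token 2 at position 2'])
```

On "abc" the greedy and lazy versions report the same span; only the number of backtracks differs, which is the point of the two walkthroughs. One simplification in the sketch: the lazy branch records a backup state only when the character could actually be matched.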

4.6 Zero-width matching process

Source string: a12

Regular expression: ^(?=[a-z])[a-z0-9]+$

The metacharacters "^" and "$" match only positions. The lookahead "(?=[a-z])" only tests whether a match is possible; it consumes no characters and saves nothing into the final match result. All three are therefore zero-width.

This regex matches a string that consists of letters and digits and whose first character is a letter.

Matching process:

First the metacharacter "^" takes control and starts matching at position 0. "^" matches the start position, which is position 0, so the match succeeds and control passes to the lookahead "(?=[a-z])".

"(?=[a-z])" requires the character to the right of its position to be a letter. Zero-width subexpressions are not mutually exclusive, that is, the same position can be matched by several zero-width subexpressions at once, so the lookahead also attempts its match at position 0. The character to the right of position 0 is "a", which meets the requirement, so the match succeeds and control passes to "[a-z0-9]+".

Because "(?=[a-z])" only tests and saves nothing into the final result, and because it succeeded at position 0, "[a-z0-9]+" also starts its attempt at position 0. "[a-z0-9]+" first tries to match "a" and succeeds, then continues and successfully matches "1" and "2". It has now reached position 3, and there is no character to the right of position 3, so control passes to "$".

The metacharacter "$" attempts its match at position 3. It matches the end position, which is position 3, so the match succeeds.

At this point the whole expression has matched and success is reported. The match result is "a12", starting at position 0 and ending at position 3. Here "^" matches position 0, "(?=[a-z])" matches position 0, "[a-z0-9]+" matches the string "a12", and "$" matches position 3.
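The result of this walkthrough can be confirmed with a short check. The sketch assumes Python's re module; the failing input "1a2" is an extra example that does not appear in the article.

```python
import re

pattern = re.compile(r"^(?=[a-z])[a-z0-9]+$")

# Full expression: the match covers the whole string.
m = pattern.search("a12")
print(m.span(), m.group())  # (0, 3) a12

# "^" and the lookahead alone both succeed at position 0, consuming nothing.
look = re.match(r"^(?=[a-z])", "a12")
print(look.span())          # (0, 0)

# A leading digit fails the lookahead, so the whole expression fails.
rejected = pattern.search("1a2")
print(rejected)             # None
```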
