1 Why you should understand how the regex engine matches
A jumble of notes played at random may be nothing but noise, yet the same notes in a composer's hands can become beautiful music. A performer can play that music beautifully too, but may not know how to rearrange the notes to make it more beautiful still.
The same goes for users of regular expressions. Without understanding how the regex engine works, you can still write expressions that meet the requirements, but it is hard to write ones that are efficient and free of hidden pitfalls. So for anyone who uses regular expressions regularly, or who wants to study them in depth, it is well worth understanding how the engine matches.
2 Regular expression engines
Regex engines fall broadly into two types, DFA and NFA, and NFA engines can be further divided into traditional NFA and POSIX NFA.
DFA (Deterministic Finite Automaton)
NFA (Non-deterministic Finite Automaton)
    Traditional NFA
    POSIX NFA
A DFA engine never needs to backtrack, so matching is fast. However, it does not support capturing groups, and therefore supports neither backreferences nor any other way of referring to captured text. Tools that use a DFA engine include awk, egrep, and lex.
POSIX NFA refers to NFA engines that conform to the POSIX standard. Their defining feature is leftmost-longest matching: the engine keeps backtracking until it is sure it has found the longest match at the leftmost position. As with a DFA, lazy (non-greedy) quantifiers are therefore meaningless for a POSIX NFA.
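The difference between a traditional NFA and leftmost-longest matching is easiest to see with alternation. The following is a minimal sketch, assuming Python's re module as a representative traditional NFA engine (Python is not mentioned in the article and is used here only for illustration):

```python
import re

# Python's re is a backtracking (traditional NFA) engine: alternatives are
# tried from left to right, and the first one that lets the overall match
# succeed wins, even if a longer match exists.
traditional = re.search(r"a|ab", "ab").group()

# A leftmost-longest (POSIX) engine would report "ab" for the same pattern and
# input, because both candidates start at position 0 and "ab" is longer.
# In a traditional NFA, reordering the alternatives gives that result.
reordered = re.search(r"ab|a", "ab").group()

print(traditional)  # a
print(reordered)    # ab
```

In a traditional NFA the order in which alternatives are written affects the result, which is one reason why knowing the engine type matters when writing an expression.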
Most languages and tools use a traditional NFA engine, which offers several features that a DFA does not support:
Capturing groups, backreferences, and other ways of referring to captured text;
Lookaround: (?<=...), (?<!...), (?=...), (?!...), which some articles call pre-search;
Lazy quantifiers: ??, *?, +?, {m,n}?, {m,}?, which some articles call non-greedy mode;
Possessive quantifiers: ?+, *+, ++, {m,n}+, {m,}+ (supported only by Java and PCRE at the time of writing), and atomic groups (?>...).
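Three of the features listed above can be tried out directly. This is a small sketch, assuming Python's re module as the traditional NFA engine; the sample strings are made up for illustration:

```python
import re

# Capturing group plus backreference: \1 must repeat what group 1 captured.
doubled = re.search(r"(\w)\1", "hello").group()

# Lookahead: match digits only if " USD" follows. The lookahead itself
# consumes nothing, so " USD" is not part of the result.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR")

# Lazy versus greedy quantifier on the same input.
lazy_tag = re.search(r"<.+?>", "<a><b>").group()
greedy_tag = re.search(r"<.+>", "<a><b>").group()

print(doubled)      # ll
print(usd_amounts)  # ['10']
print(lazy_tag)     # <a>
print(greedy_tag)   # <a><b>
```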
The differences between engines are not the focus of this article, so they are only introduced briefly here; interested readers can consult the relevant literature.
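3 Preliminary knowledge
3.1 String composition
The string "abc" consists of three characters and four positions: position 0 before "a", position 1 between "a" and "b", position 2 between "b" and "c", and position 3 after "c".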
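The character and position counts can be checked with a short sketch. It assumes Python's re module and relies on the fact that an empty pattern matches at every position without consuming anything:

```python
import re

text = "abc"

# An empty pattern matches at every position of the string, so the start
# offsets of those matches enumerate the positions.
positions = [m.start() for m in re.finditer(r"", text)]

print(len(text))   # 3
print(positions)   # [0, 1, 2, 3]
```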
3.2 Consuming characters and zero width
During matching, if a subexpression matches character content rather than a position, and that content is saved into the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved into the final match result, the subexpression is said to be zero-width.
Consuming characters is mutually exclusive; zero width is not. That is, a character can be matched by only one subexpression at a time, whereas a position can be matched by several zero-width subexpressions at once.
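A brief sketch of this distinction, assuming Python's re module; the patterns are chosen only for illustration:

```python
import re

text = "abc"

# Three zero-width subexpressions (^, \b and a lookahead) all match at
# position 0, and the overall match is empty.
zero_width = re.search(r"^\b(?=a)", text)

# Consuming subexpressions are mutually exclusive: once "a" has matched the
# first character, the next subexpression must continue from position 1.
consuming = re.search(r"ab", text)
shared_char = re.search(r"aa", text)  # two subexpressions cannot share one "a"

print(zero_width.span())  # (0, 0)
print(consuming.span())   # (0, 2)
print(shared_char)        # None
```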
3.3 Control and its transfer
Matching normally proceeds by one subexpression (an ordinary character, a metacharacter, or a sequence of metacharacters) taking control and attempting to match from some position in the string. The position where a subexpression starts its attempt is the position where the previous subexpression's successful match ended. Take a regular expression of the form:
(subexpression one)(subexpression two)
Suppose (subexpression one) is zero-width. Its match starts and ends at the same position, say position 0, so (subexpression two) starts its attempt at position 0.
Suppose (subexpression one) consumes characters. Its match starts and ends at different positions, say it starts at position 0 and ends at position 2, so (subexpression two) starts its attempt at position 2.
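Both cases can be observed by looking at the spans of two capturing groups. This is a minimal sketch assuming Python's re module; the concrete patterns stand in for (subexpression one) and (subexpression two):

```python
import re

# Case 1: group 1 is zero-width (^). It starts and ends at position 0,
# so group 2 also starts at position 0.
m1 = re.match(r"(^)(ab)", "abc")

# Case 2: group 1 consumes "ab" (position 0 to position 2),
# so group 2 starts at position 2.
m2 = re.match(r"(ab)(c)", "abc")

print(m1.span(1), m1.span(2))  # (0, 0) (0, 2)
print(m2.span(1), m2.span(2))  # (0, 2) (2, 3)
```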
For the expression as a whole, the first attempt usually starts at position 0 of the string. If that attempt fails partway through the string, the engine advances the starting point and the whole expression is tried again from position 1, and so on, until a match is reported or the attempt from the last position has also failed.
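The retry-from-the-next-position loop can be imitated by hand. The sketch below assumes Python's re module; first_match_by_advancing is a hypothetical helper written for this illustration, and a real engine performs this loop internally, possibly with optimizations:

```python
import re


def first_match_by_advancing(pattern, text):
    """Try the pattern at position 0, then 1, and so on.

    Returns (start_position_or_None, list_of_positions_tried).
    """
    compiled = re.compile(pattern)
    tried = []
    for pos in range(len(text) + 1):  # the position after the last character is tried too
        tried.append(pos)
        # Pattern.match(text, pos) only attempts a match starting exactly at pos.
        if compiled.match(text, pos):
            return pos, tried
    return None, tried


found = first_match_by_advancing(r"c", "abc")
not_found = first_match_by_advancing(r"d", "abc")

print(found)      # (2, [0, 1, 2])
print(not_found)  # (None, [0, 1, 2, 3])
```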
4 Simple matching processes
4.1 Basic matching process
Source string: abc
Regular expression: abc
Matching process:
First, the character "a" takes control and attempts to match from position 0. "a" matches "a" successfully and hands control to "b". Because "a" has already been matched by "a", "b" attempts to match from position 1; "b" matches "b" successfully and hands control to "c". "c" matches "c" successfully.
At this point the regular expression has finished matching and reports success. The match result is "abc", starting at position 0 and ending at position 3.
4.2 Matching with a greedy quantifier: successful match (1)
Source string: abc
Regular expression: ab?c
The quantifier "?" is greedy: when it may either match or not match, it tries to match first, and gives the match up only if that choice causes the whole expression to fail. Here "?" modifies the character "b", so "b?" is a single unit.
Matching process:
First, "a" takes control and attempts to match from position 0. "a" matches "a" successfully and hands control to "b?". Because "?" is greedy, it tries to match first: "b?" matches "b" successfully, recording an alternate state to backtrack to as it does so, and hands control to "c". "c" matches "c" successfully, and the recorded alternate state is discarded.
At this point the regular expression has finished matching and reports success. The match result is "abc", starting at position 0 and ending at position 3.
4.3 Matching with a greedy quantifier: successful match (2)
Source string: ac
Regular expression: ab?c
Matching process:
First, "a" takes control and attempts to match from position 0. "a" matches "a" successfully and hands control to "b?". "b?" tries to match first and records an alternate state as it does so; it attempts to match "c" and fails. The engine backtracks to the recorded alternate state: "b?" gives up matching and hands control to "c". "c" matches "c" successfully.
At this point the regular expression has finished matching and reports success. The match result is "ac", starting at position 0 and ending at position 2. Here "b?" matches nothing.
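The two successful cases of "ab?c" can be compared side by side. This is a small sketch assuming Python's re module; the capturing group around "b?" is added only to expose what the quantified part matched:

```python
import re

pattern = re.compile(r"a(b?)c")  # the group is there only to inspect "b?"

with_b = pattern.search("abc")
without_b = pattern.search("ac")

print(with_b.span(), repr(with_b.group(1)))        # (0, 3) 'b'
print(without_b.span(), repr(without_b.group(1)))  # (0, 2) ''
print(without_b.span(1))                           # (1, 1)
```

In the second case the group takes part in the match but is empty: it starts and ends at position 1, which corresponds to "b?" giving up the match after backtracking.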
4.4 Matching with a greedy quantifier: failed match
Source string: abd
Regular expression: ab?c
Matching process:
First, "a" takes control and attempts to match from position 0. "a" matches "a" successfully and hands control to "b?". "b?" tries to match first and records an alternate state as it does so; "b?" matches "b" successfully and hands control to "c". "c" attempts to match "d" and fails. The engine backtracks to the recorded alternate state: "b?" gives up its match, that is, it no longer matches "b", and hands control to "c". "c" attempts to match "b" and fails. The first round of matching has now failed.
The engine advances the starting point and tries again from position 1. "a" attempts to match "b" and fails; there is no alternate state, so the second round also fails.
The engine keeps advancing until the attempt at position 3 has failed, and matching ends. Only then is failure reported for the whole expression.
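The walk-through above can be reproduced with a toy backtracking matcher. This is only a sketch under strong assumptions: it handles literal characters, "?" and "??" and nothing else, it uses recursion in place of an explicit stack of alternate states, and the names parse, match_here and search are made up for this illustration. It is not how any particular engine is implemented.

```python
def parse(pattern):
    """Split a pattern of literals, '?' and '??' into (char, kind) tokens."""
    tokens, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern[i + 1:i + 3] == "??":
            tokens.append((ch, "lazy"))
            i += 3
        elif pattern[i + 1:i + 2] == "?":
            tokens.append((ch, "greedy"))
            i += 2
        else:
            tokens.append((ch, "once"))
            i += 1
    return tokens


def match_here(tokens, text, pos, trace):
    """Return the end position of a match starting at pos, or None."""
    if not tokens:
        return pos
    (ch, kind), rest = tokens[0], tokens[1:]
    can_take = pos < len(text) and text[pos] == ch
    if kind == "once":
        choices = ["take"]
    elif kind == "greedy":
        choices = ["take", "skip"]  # match first; "skip" is the alternate state
    else:
        choices = ["skip", "take"]  # skip first; "take" is the alternate state
    for choice in choices:
        if choice == "take":
            trace.append(f"{ch} at {pos}: {'ok' if can_take else 'fail'}")
            if not can_take:
                continue
            end = match_here(rest, text, pos + 1, trace)
        else:
            trace.append(f"{ch} at {pos}: skipped")
            end = match_here(rest, text, pos, trace)
        if end is not None:
            return end
    return None  # every choice failed: the caller backtracks


def search(pattern, text):
    """Try each start position in turn; return ((start, end) or None, trace)."""
    tokens, trace = parse(pattern), []
    for start in range(len(text) + 1):
        end = match_here(tokens, text, start, trace)
        if end is not None:
            return (start, end), trace
    return None, trace


result, trace = search("ab?c", "abd")
print(result)  # None
for step in trace:
    print(step)
```

For "ab?c" against "abd" the trace lists, in order: a at 0 ok, b at 1 ok, c at 2 fail, b at 1 skipped, c at 1 fail, and then a failing at positions 1, 2 and 3, which is the sequence described in the text.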
4.5 Matching with a lazy quantifier: successful match
Source string: abc
Regular expression: ab??c
The quantifier "??" is lazy: when it may either match or not match, it chooses not to match first, and tries to match only if that choice causes the whole expression to fail. Here "??" modifies the character "b", so "b??" is a single unit.
Matching process:
First, "a" takes control and attempts to match from position 0. "a" matches "a" successfully and hands control to "b??". "b??" first tries skipping the match, that is, it matches nothing, records an alternate state, and hands control to "c". "c" attempts to match "b" and fails. The engine backtracks to the recorded alternate state: "b??" now tries to match, and matches "b" successfully, then hands control to "c". "c" matches "c" successfully.
At this point the regular expression has finished matching and reports success. The match result is "abc", starting at position 0 and ending at position 3. Here "b??" matches the character "b".
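The difference between the greedy and the lazy quantifier only becomes visible when nothing forces a backtrack. This is a short sketch assuming Python's re module; the patterns without the trailing "c" are added for contrast and do not appear in the article:

```python
import re

# Without a trailing "c" nothing forces the engine to backtrack,
# so each quantifier keeps its first choice.
greedy = re.match(r"ab?", "abc").group()
lazy = re.match(r"ab??", "abc").group()

# With the trailing "c", the lazy version must backtrack and take "b" after all.
lazy_forced = re.match(r"a(b??)c", "abc")

print(greedy)                # ab
print(lazy)                  # a
print(lazy_forced.group(1))  # b
print(lazy_forced.span())    # (0, 3)
```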
4.6 Zero-width matching process
Source string: a12
Regular expression: ^(?=[a-z])[a-z0-9]+$
The metacharacters "^" and "$" match only positions. The lookahead "(?=[a-z])" only checks what follows: it consumes no characters and does not save the matched content into the final match result. All three are therefore zero-width.
This regular expression matches a string made up of letters or digits whose first character is a letter.
Matching process:
First, the metacharacter "^" takes control and attempts to match from position 0. "^" matches the start of the string, position 0, successfully, and hands control to the lookahead "(?=[a-z])".
"(?=[a-z])" requires that the character to the right of its position be a letter. Zero-width subexpressions are not mutually exclusive, so the same position can be matched by several of them at once; the lookahead therefore also attempts to match from position 0. The character to the right of position 0 is "a", which meets the requirement, so the match succeeds and control passes to "[a-z0-9]+".
Because "(?=[a-z])" only checks and does not save the matched content into the final result, and because it succeeded at position 0, "[a-z0-9]+" also attempts to match from position 0. "[a-z0-9]+" first matches "a" successfully, then continues and successfully matches "1" and "2". It has now reached position 3, and there is no character to the right of position 3, so it hands control to "$".
The metacharacter "$" attempts to match from position 3. It matches the end of the string, position 3, successfully.
At this point the regular expression has finished matching and reports success. The match result is "a12", starting at position 0 and ending at position 3. Here "^" matches position 0, "(?=[a-z])" matches position 0, "[a-z0-9]+" matches the string "a12", and "$" matches position 3.
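The position of each subexpression can be inspected by wrapping each one in a capturing group. This is a minimal sketch assuming Python's re module; the groups and the second test string "1a2" are additions for illustration only:

```python
import re

# Each subexpression is wrapped in a group only so that its span can be read.
pattern = re.compile(r"(^)((?=[a-z]))([a-z0-9]+)($)")

m = pattern.search("a12")
spans = [m.span(i) for i in range(1, 5)]

# The first character is a digit, so the lookahead fails at position 0,
# and "^" cannot match at any later position.
rejected = pattern.search("1a2")

print(m.span())  # (0, 3)
print(spans)     # [(0, 0), (0, 0), (0, 3), (3, 3)]
print(rejected)  # None
```

The first two groups both report the empty span at position 0, which shows two zero-width subexpressions matching the same position at once.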
NFA Engine Matching principle Source: http://blog.csdn.net/lxcnn/article/details/4304651