Learning Regular Expressions (2): A Simple Look at How Matching Works
These are my notes on the basics of how regex matching works, again based on PHP.
First of all, regex engines fall into two main categories: DFA and NFA. The names sound exotic, but the simple way to understand them is as two different ways of matching, much like searching an array: some algorithms scan in order from the start, others begin from the middle. The NFA has the longer history and more tools and languages use it, although some tools mix the two engines. One book's analogy is very apt: an NFA is like a gasoline engine and a DFA is like an electric motor; both can make a car run, but by different mechanisms. Because NFA and DFA engines developed over many years, a standard called POSIX was introduced. It specifies the metacharacters and features a regex engine should support, and the exact results that users should get.
With an NFA, the match is driven by the (regular) expression; with a DFA, it is driven by the text being matched. PHP uses a traditional NFA engine, and so does Perl. Whatever the engine, two general rules apply: 1. The match that begins earliest (leftmost) wins; 2. The standard quantifiers (+, ?, *, {m,n}) are greedy (match-first).
Suppose we have the following expression and want to match it against the string 'tonight':
'/to(nite|knite|night)/'
NFA: the expression is in control. The engine starts with the first part of the expression and checks whether the current text matches; if it does, it moves on to the next part of the expression, and so on until every part has matched and the overall match succeeds. The first character of the expression is t, so the engine scans the string from left to right until it finds a t (and fails if there is none). Once it is found, the engine takes the next character of the expression, o, and checks it against the next character of the string. With the first two characters matched, it enters the alternation grouped by the parentheses, which can match nite, knite, or night, and it tries these three alternatives in turn. The first alternative matches n and i but fails when its t meets the g of the string; the second fails immediately, because its k does not match the n of the string; the third matches the rest of the string exactly. With a regex-directed engine, the expression has to be worked through in this way before the final result is known.
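The walkthrough above can be checked directly in PHP. This is a minimal sketch that only confirms the outcome (which alternative ends up captured); it does not show the engine's internal steps.

```php
<?php
// The article's pattern and subject: the third alternative is the one that succeeds.
$pattern = '/to(nite|knite|night)/';
$subject = 'tonight';

$found = preg_match($pattern, $subject, $match);

var_dump($found);    // int(1)
var_dump($match[0]); // string(7) "tonight"
var_dump($match[1]); // string(5) "night"
```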
DFA: unlike an NFA, a DFA keeps track of every match that is still possible as it scans the string. Scanning starts at t, which opens one possible match, and o keeps it alive. When the scan reaches the n in the string, two possibilities are recorded, the n of nite and the n of night (the expression is being matched from the string's point of view), and knite is ruled out. At i both nite and night are still possible; at g only night remains; and once h and t have matched, the engine finds that the string has been scanned to its end and reports success. (This feels a bit like depth-first versus breadth-first search.)
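PHP has no DFA engine, so the text-directed idea can only be imitated by hand. The sketch below is an illustration under that assumption: aliveAlternatives() is a hypothetical helper, not a PHP or PCRE function, and it only handles plain literal alternatives. It scans the text once and, for each character, keeps the alternatives that are still possible.

```php
<?php
// Hypothetical helper (not part of PHP): scan $text once and record, per character,
// which literal alternatives can still match. Assumes single-byte (ASCII) text.
function aliveAlternatives(string $text, array $alternatives): array
{
    $alive = $alternatives;
    $trace = [];
    for ($i = 0; $i < strlen($text); $i++) {
        $ch = $text[$i];
        // Keep only the alternatives whose character at this position agrees with the text.
        $alive = array_values(array_filter(
            $alive,
            fn(string $alt): bool => $i < strlen($alt) && $alt[$i] === $ch
        ));
        $trace[] = $ch . ' => ' . implode(',', $alive);
    }
    return $trace;
}

// The part of 'tonight' that follows 'to', against the three alternatives.
$trace = aliveAlternatives('night', ['nite', 'knite', 'night']);
print_r($trace);
// n => nite,night
// i => nite,night
// g => night
// h => night
// t => night
```

Each character of the text is looked at once, and no position is ever revisited, which is the property the paragraph above attributes to a DFA.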
As a result, the text-directed DFA engine is generally faster. The regex-directed NFA may have to try every way the pattern could apply: until it reaches the end of the pattern it does not know whether the match will succeed, and even if the earlier parts of the expression match, a later part may still fail, so a lot of time can be wasted. The DFA, being driven by the string, keeps all the live possibilities together in one place, so each character of the target text is examined at most once.
Even so, the order of the alternatives makes a big difference for an NFA, depending on the target string: if the alternative that actually matches comes first, the match is of course found sooner.
Because an NFA is driven by the expression, different ways of writing the same expression can have a large effect on it. That gives us more flexible control, but also more ways for things to vary. An important feature of the NFA (this article is PHP-based, hence NFA) is backtracking: when the engine has to choose between two options that might both lead to a match, it picks one and remembers the other so that it can come back to it later. This happens mainly with the standard quantifiers and with alternation (|).
A borrowed figure (not reproduced here; the letters used below refer to positions marked in it):
Start from the expression '/".*"!/'. The engine first finds the double quote at A. Then, because the dot matches any character (by default it does not match a newline, which we ignore here) and the metacharacter * allows any number of repetitions, the greedy behavior of the standard quantifiers carries the match all the way to the end of the string at B. Since * allows zero, one, or more repetitions, each of these could lead to a successful match, so at every position the engine remembers two states: the position may be matched, or it may be left unmatched. A state is recorded at every position the * passes through, from M to the end.
At the end of the string no " is found, so the engine backtracks. Note that the engine always returns to the most recently recorded state (like a stack), stepping backwards one position at a time until it reaches a double quote (C). It matches that double quote and moves on to the next position (D), which is not an exclamation mark, so the attempt fails and the engine backtracks again (the record of saved states is not yet empty). At E it finds another double quote and the same thing happens: it continues to F, finds no exclamation mark, and fails. It backtracks to G, and because the next position (H) is not an exclamation mark either, it has to backtrack again, to I. At that point no recorded states remain and the engine cannot backtrack any further, so this first attempt fails. But that is not the end: the engine's transmission moves on from the position after the double quote at A and looks for the next double quote to start from, reaching J, and the same process as in the previous round unfolds again. In the end no string of the form "..."! is found, and the process is a very tortuous one.
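To see the outcome of this process, here is a small sketch. The two subject strings are made up for illustration (they are not the string from the figure): one has no "! anywhere, so every round of backtracking ends in failure; the other has one, so backtracking stops at the last double quote that is followed by an exclamation mark.

```php
<?php
$pattern = '/".*"!/';

// Illustrative subjects (assumptions, not taken from the original figure).
$withoutBang = 'say "a" and "b" here';
$withBang    = 'say "a" and "b"! here';

// No "! in the text: .* runs to the end, backtracks through every saved state, and fails.
$failed = preg_match($pattern, $withoutBang);

// Here .* also runs to the end first, then backs up until a " followed by ! is found.
$ok = preg_match($pattern, $withBang, $match);

var_dump($failed); // int(0)
var_dump($ok);     // int(1)
echo $match[0], "\n"; // "a" and "b"!
```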
Two things can be seen from this example. First, the .* form is very inefficient, especially when the match fails (for our few lines of text the cost is negligible, of course), and it is also error-prone. For example, if we use '/".*"/' to match a double-quoted string in ab "CDE" FGH "" Ijk "LMN, the result is "CDE" FGH "" Ijk ", that is, everything from the first double quote to the last one. Second, with nested quantifiers such as ((...)*)* or ((...)+)*, the outer quantifier records states on top of those of the inner one, so the number of backtracking steps grows exponentially and can in effect become an endless loop, which costs even more time. Improved engines can detect this situation early and report an error, much as a browser reports an error when a page keeps redirecting to itself.
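In PHP, runaway backtracking of this kind is cut off by the pcre.backtrack_limit setting, and preg_last_error() reports what happened. The sketch below uses an illustrative nested-quantifier pattern that is not from this article. What it prints depends on the PHP version and on settings such as pcre.jit, so no fixed output is claimed; the only certainty is that the pattern cannot match, because the subject contains neither ! nor ?.

```php
<?php
// Nested repetition: \D+ inside (...)* lets the same text be split up in many ways.
// The subject has no ! or ?, so the match can never succeed.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

if ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    // The engine stopped early instead of exploring every combination.
    echo "The engine gave up: backtrack limit exhausted\n";
} elseif ($result === false) {
    echo 'The engine gave up with error code ', preg_last_error(), "\n";
} else {
    echo "No match found (result: $result)\n";
}
```

Checking for a false return value matters here: preg_match() returns 0 for "no match" and false for "error", and the two are easy to confuse in a loose comparison.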
So when a quantifier such as + is used, it can match one or more times. Everything beyond the first repetition is optional, so at each step there are two choices: match or do not match. The choice that is not taken is saved because it may be needed later; this is called a saved (backup) state. The same applies to ? and *.
To match the text between a pair of double quotes, where the text in between contains no escaped double quotes, we can use a lazy (ignore-priority) quantifier: '/".*?"/'. A lazy quantifier matches as little as it can. The minimum for * is nothing at all, so the engine first tries the empty string: before consuming a character it immediately checks for the closing double quote. That check is made at every step, so the match ends successfully as soon as a double quote is found. It matches "McDonald" and also saves a lot of backtracking.
There is also another way to match the characters between two delimiters: a negated character class, as in '/"[^"]*"/'. It meets the same requirement, but it is not always applicable.
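The three ways of writing the quoted-string pattern can be compared side by side. This is a minimal sketch using the sample string from the greedy example earlier in the article.

```php
<?php
$subject = 'ab "CDE" FGH "" Ijk "LMN';

preg_match('/".*"/', $subject, $greedy);     // greedy: runs to the last double quote
preg_match('/".*?"/', $subject, $lazy);      // lazy: stops at the first closing quote
preg_match('/"[^"]*"/', $subject, $negated); // negated class: cannot cross a quote at all

echo $greedy[0], "\n";  // "CDE" FGH "" Ijk "
echo $lazy[0], "\n";    // "CDE"
echo $negated[0], "\n"; // "CDE"
```

The lazy and negated-class versions return the same text here. They differ in how they get there: the lazy quantifier tests for the closing quote before each character, while the negated class simply consumes characters until it meets a quote.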
For example, to match the characters between <b> and </b>, could we use '/<b>[^<\/b>]*<\/b>/'? Note that a character class matches only one character at a time. Here the class does not mean "anything other than </b>" between <b> and </b>; it means any one character other than <, /, b, and >. A character class cannot treat the characters inside it as a unit. To test several characters as a whole, use lookaround. Lookaround concerns positions and exists precisely to constrain the surrounding text, whether that is one character or several. Here we need a negative lookahead, because the text ahead must not be </b>. (Spaces are added inside the pattern for readability, which requires the x modifier.)
'/<b>( (?!<\/b>). )*<\/b>/x' // negative lookahead
But this can still match a string such as "<b>abc<b>def</b>" in full, so <b> must also be excluded in between. In the lookahead, making the / optional covers both tags:
'/<b>( (?!<\/?b>). )*<\/b>/x'
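The difference between the character class and the two lookahead versions shows up on small inputs. The sketch below assumes the tags are <b> and </b>, and both sample strings are made up for illustration. The patterns are written without spaces so that no x modifier is needed.

```php
<?php
// Character class: b itself is excluded, so even simple bold text fails.
$classResult = preg_match('/<b>[^<\/b>]*<\/b>/', '<b>bold</b>');

$nested = '<b>abc<b>def</b>';

// Lookahead that only forbids </b>: the match still swallows the inner <b>.
preg_match('/<b>((?!<\/b>).)*<\/b>/', $nested, $loose);

// Lookahead that forbids both <b> and </b>: only the innermost pair matches.
preg_match('/<b>((?!<\/?b>).)*<\/b>/', $nested, $strict);

var_dump($classResult); // int(0)
echo $loose[0], "\n";   // <b>abc<b>def</b>
echo $strict[0], "\n";  // <b>def</b>
```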
The previous article mentioned "giving back". Backtracking goes hand in hand with giving back. In '/".*"!/' above, when the engine finds that the character after a double quote is not an exclamation mark, it has to backtrack and take the other saved state, in which the character just matched is not matched after all. That retreat is a clear case of giving characters back. Example:
$pattern_1 = '/(\w+)(\d?)/';
$pattern_2 = '/(\w+)(\d+)/';
$pattern_3 = '/(\w+)(\d*)/';
$subject = 'abc12345';
preg_match($pattern_1, $subject, $match_1);
preg_match($pattern_2, $subject, $match_2);
preg_match($pattern_3, $subject, $match_3);
echo 'Match=>';
var_dump($match_1, $match_2, $match_3);
With capturing parentheses used for grouping, the engine remembers the text matched by each of the two pairs of parentheses. The results are as follows:
From top to bottom these are $match_1, $match_2, and $match_3. \w and \d overlap (both match digits) and both quantifiers are greedy. The results show that the earlier quantifier, +, matches the most (key 1 of each array): in pattern_1, \d? matches nothing; in pattern_2, \d+ matches only one character; in pattern_3, \d* matches nothing. The process is similar in each case. First \w+ matches to the end of the string, and then the engine looks at the rest of the pattern. \d? is optional, so it is left empty: the overall match succeeds without it, and nothing is given back. \d+ must match at least one character or the whole match fails, so here \w+ gives back one character, because it must yield for the overall match to succeed. \d* behaves like \d?: it may match nothing, so nothing is given back.
If the pattern were '/(\w+)(\d\d)/', \w+ would have to give back two digits; if it could not give them back, or the string had nothing to match them, the engine could only report failure. So there are two rules:
1. First come, first served: the quantifier that comes first matches first and tries to satisfy itself as fully as it can;
2. The overall match takes precedence: if the match would otherwise fail, the engine forces the greedy quantifier to give characters back, and if that does not help the entire match fails.
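The two rules can be observed by printing what each group captures. This sketch reuses the subject 'abc12345' and the patterns discussed above, including the (\d\d) variant.

```php
<?php
$subject  = 'abc12345';
$patterns = ['/(\w+)(\d?)/', '/(\w+)(\d+)/', '/(\w+)(\d*)/', '/(\w+)(\d\d)/'];

$captures = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, $subject, $m);
    // $m[1] is what \w+ kept; $m[2] is what it had to give back to the digit part.
    $captures[$pattern] = [$m[1], $m[2]];
    printf("%-16s \\w+ => '%s', digits => '%s'\n", $pattern, $m[1], $m[2]);
}
// /(\w+)(\d?)/     \w+ => 'abc12345', digits => ''
// /(\w+)(\d+)/     \w+ => 'abc1234', digits => '5'
// /(\w+)(\d*)/     \w+ => 'abc12345', digits => ''
// /(\w+)(\d\d)/    \w+ => 'abc123', digits => '45'
```

In every case \w+ gives back exactly as many characters as the rest of the pattern requires and no more.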
What if we do not want anything given back? Then the failure is reported early. This brings us to possessive quantifiers and atomic groups; for + these are written \w++ and (?>\w+).
$pattern = '/\w+:/';
$subject = 'Abcdefghijk';
Take this example. We can see at a glance that a string with no colon at the end is bound to fail, but the program cannot: it goes round after round of matching and backtracking, because it has remembered a number of optional states. We already know those states are useless. Backtracking is wasted effort, since the match was doomed before the backtracking began. So we can write \w++: or (?>\w+): instead. With a possessive quantifier or an atomic group the optional states are thrown away. \w+ still matches to the end of the string, where the word characters run out; the engine then checks for a colon and finds none, and since there are no saved states to fall back on, the attempt fails at once. Over hundreds of thousands of lines of text this saves a lot of time.
Note that atomic grouping and possessive quantifiers also affect quantifiers nested inside them, which is different from the way the non-capturing group (?:...) works:
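With the colon example, the greedy and possessive versions both fail and differ only in how much work is done, which is hard to see from the return value. The sketch below therefore adds an illustrative pattern that is not from the article, \w+\d against 'abc12345', where refusing to give characters back changes the result itself.

```php
<?php
$subject = 'abc12345';

// Greedy: \w+ takes everything, then gives back one digit so that \d can match.
$greedy = preg_match('/\w+\d/', $subject, $m);

// Possessive and atomic: the saved states are discarded, so nothing can be given back.
$possessive = preg_match('/\w++\d/', $subject);
$atomic     = preg_match('/(?>\w+)\d/', $subject);

// The article's colon example, written possessively: no match, and no backtracking into \w++.
$noColon = preg_match('/\w++:/', 'Abcdefghijk');

var_dump($greedy);     // int(1)
var_dump($m[0]);       // string(8) "abc12345"
var_dump($possessive); // int(0)
var_dump($atomic);     // int(0)
var_dump($noColon);    // int(0)
```

This also shows the risk: a possessive quantifier is only safe when nothing it consumes could ever be needed by the rest of the pattern.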
(?>(\d+)+) // the saved states of the inner \d+ are discarded as well
(?:ABC(\d\d)) // the outer parentheses do not capture, but the inner ones are unaffected: \1 still holds the two digits
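Both points can be checked with short inputs. The subjects 'ABC42' and '12345', and the trailing literal 5 added to the patterns, are illustrative assumptions chosen so that the effect is visible in the return value.

```php
<?php
// Non-capturing outer group: the inner parentheses still capture as group 1.
preg_match('/(?:ABC(\d\d))/', 'ABC42', $inner);

// Ordinary group: the inner \d+ can give back the final 5, so the match succeeds.
$plain = preg_match('/(?:(\d+)+)5/', '12345', $p);

// Atomic group: the inner \d+ loses its saved states too, so the final 5 is never released.
$atomic = preg_match('/(?>(\d+)+)5/', '12345');

var_dump($inner[1]); // string(2) "42"
var_dump($plain);    // int(1)
var_dump($p[0]);     // string(5) "12345"
var_dump($atomic);   // int(0)
```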
Finally, a note on a pitfall with alternation. Example:
$pattern = '/cat/';
$subject = 'indicate big cat';
The pattern matches the cat in the middle of indicate, because that one comes first in the string. Now look at this one:
$pattern = '/fat|cat|belly|your/';
$subject = 'The dragging belly indicates that your cat is too fat';
An NFA is driven by the expression and looks at the string from the expression's point of view, so fat is the first alternative it tests. Is the result fat, then? No, and it is not cat either: the result is belly. At each starting position the engine tries every alternative before moving on, so it always returns the leftmost match in the string, even when that match comes from an alternative that appears late in the alternation.
This also shows how an NFA engine handles alternation: whenever an attempt fails and there are alternatives in the expression that have not yet been tried, the engine backtracks to the alternation and tries the next one, repeating this until the overall match is decided. If it did not, the fat in the previous example would have been returned as the result. Alternation is neither greedy nor lazy; the alternatives are simply tried from left to right. For a DFA and for a POSIX NFA, the result is not the first alternative that matches at the leftmost position, but the longest match that starts there.
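The alternation examples can be checked with PREG_OFFSET_CAPTURE, which reports where each match starts. The last pattern, one|oneself, is an illustrative addition that is not from the article; it shows that PHP's traditional NFA takes the first alternative that works rather than the longest one.

```php
<?php
// Leftmost wins: the cat inside "indicate" comes before the word "cat".
preg_match('/cat/', 'indicate big cat', $a, PREG_OFFSET_CAPTURE);

// Every alternative is tried at each position, so the earliest match in the text wins.
$subject = 'The dragging belly indicates that your cat is too fat';
preg_match('/fat|cat|belly|your/', $subject, $b, PREG_OFFSET_CAPTURE);

// Ordered alternation: 'one' is tried first and succeeds, so 'oneself' is never tried.
preg_match('/one|oneself/', 'oneself', $c);

print_r($a[0]);     // [0] => cat, [1] => 4
print_r($b[0]);     // [0] => belly, [1] => 13
echo $c[0], "\n";   // one
```

A POSIX leftmost-longest engine would return oneself for the last pattern, which is the difference the paragraph above describes.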
Regular expressions involve too many details to cover clearly here, so it is still worth reading a book, especially for regex tuning and for commonly used techniques. Matching a string enclosed in "" that may contain escape sequences such as \" is still quite troublesome, for example. I recommend "Mastering Regular Expressions"; the Chinese translation is very good and reads naturally, and the book contains many tricks, such as "unrolling the loop", which is a powerful weapon. End