Regular Expression Learning (2): simple matching principles
These notes record my understanding of the basic principles behind regex matching, with a focus on PHP.
First, regex engines fall into two main categories: DFA and NFA. It is not surprising that there is more than one kind of engine. Put simply, they are different ways of matching, much as a search through an array can run sequentially from the start or begin from the middle. NFA has the longer history, and more tools and languages use it, although some combine both engines. One book offers an apt analogy: an NFA is like a gasoline engine and a DFA is like an electric motor. Both make the car run, but by different mechanisms. Because NFA and DFA engines developed separately over many years, the POSIX standard was introduced. It specifies the metacharacters and features a regex engine should support, and the exact results the end user should get.
An NFA matches by working through the (regular) expression, while a DFA matches by working through the string to be matched. PHP uses a traditional NFA engine, as does Perl. Whichever engine is used, two general rules apply: 1. the leftmost match wins; 2. the standard quantifiers (+, ?, *, {m,n}) are greedy, so they match as much as they can first.
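The two rules can be checked directly with preg_match. This is a small sketch; the sample strings and patterns are made up for illustration and do not come from the text above.

```php
<?php
// Rule 1: the leftmost match wins, whatever the order of the alternatives.
// 'fat' is listed first in the pattern, but 'your' starts earlier in the text.
preg_match('/fat|cat|your/', 'is your cat fat?', $m, PREG_OFFSET_CAPTURE);
[$leftmost, $offset] = $m[0];
echo "$leftmost at offset $offset\n";   // your at offset 3

// Rule 2: standard quantifiers are greedy and take as much as they can.
preg_match('/a{2,4}/', 'aaaaaa', $m);
$run = $m[0];
echo $run, "\n";                        // aaaa

preg_match('/\d+/', 'order 12345 shipped', $m);
$digits = $m[0];
echo $digits, "\n";                     // 12345
```
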
Suppose we have the following expression and want to match the string 'tonight' (lowercase, because the pattern is case-sensitive):
'/to(nite|knite|night)/'
NFA: directed by the expression. The engine starts from the first part of the expression and checks whether the current text matches it. If so, it moves on to the next part, and so on until every part has matched, at which point the whole match succeeds. The first character of the expression is t, so the engine scans the string from left to right until it finds a t; if none is found, the match fails. Once t is found, the next part of the expression, o, is compared with the next character of the string. With these two matched, the engine enters the alternation grouped by parentheses, which can match nite, knite, or night, and tries the three alternatives in order. The first alternative fails at its t, which is compared against the g of night; the second fails at once, because k does not match n; the third matches exactly. An expression-directed engine must work through the expression to reach its final conclusion.
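The walk above can be replayed in a few lines. The first call uses the real engine. traceAlternation is a hypothetical helper written only for this illustration: it handles a literal prefix followed by literal alternatives at offset 0, and it is not how PCRE is implemented.

```php
<?php
// The real engine. The pattern is case-sensitive, so the subject is lowercase.
preg_match('/to(nite|knite|night)/', 'tonight', $m);
// $m[0] holds the whole match, $m[1] the alternative that succeeded.

// Hypothetical helper: try each literal alternative in the order it is
// written, the way an expression-directed engine would, and log every attempt.
function traceAlternation(string $text, string $prefix, array $alternatives): array
{
    $log = [];
    if (strncmp($text, $prefix, strlen($prefix)) !== 0) {
        return ['match' => null, 'log' => ["prefix '$prefix' fails"]];
    }
    $pos = strlen($prefix);
    foreach ($alternatives as $alt) {
        // Count how many characters of this alternative agree with the text.
        $i = 0;
        while ($i < strlen($alt) && $pos + $i < strlen($text) && $text[$pos + $i] === $alt[$i]) {
            $i++;
        }
        if ($i === strlen($alt)) {
            $log[] = "'$alt' matches";
            return ['match' => $prefix . $alt, 'log' => $log];
        }
        $log[] = "'$alt' fails after $i matching character(s)";
    }
    return ['match' => null, 'log' => $log];
}

$trace = traceAlternation('tonight', 'to', ['nite', 'knite', 'night']);
echo implode("\n", $trace['log']), "\n";
// 'nite' fails after 2 matching character(s)
// 'knite' fails after 0 matching character(s)
// 'night' matches
```

Against 'Tonight' with a capital T the same pattern finds nothing unless the i modifier is added, since the only lowercase t is the last character.
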
DFA: unlike an NFA, a DFA keeps track of every match possibility that is still valid as it scans the string. From the very first t it adds a possible match to the current set. When it reaches n in the string it holds two possibilities, the n of nite and the n of night (these are found by working from the string to the expression). At i both nite and night are still possible; at g only night remains. Once h and t have matched, the engine finds that the whole string has been scanned and reports success (the contrast with the NFA feels a bit like depth-first versus breadth-first search).
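PHP has no DFA engine to call, so the following is only a sketch of the text-directed idea. scanTextDirected is a hypothetical helper: the pattern is expanded by hand into its three literal spellings and the match is assumed to start at offset 0. A real DFA would compile the pattern into a state table instead.

```php
<?php
// Hypothetical helper: scan the text once and keep the set of spellings that
// are still consistent with everything read so far.
function scanTextDirected(string $text, array $candidates): array
{
    $live = $candidates;
    $history = [];
    $n = strlen($text);
    for ($i = 0; $i < $n; $i++) {          // each character is read only once
        $ch = $text[$i];
        $live = array_values(array_filter(
            $live,
            fn(string $c): bool => $i < strlen($c) && $c[$i] === $ch
        ));
        $history[] = $ch . ': ' . implode(' ', $live);
        foreach ($live as $c) {
            if (strlen($c) === $i + 1) {   // a spelling has been read in full
                return ['match' => $c, 'history' => $history];
            }
        }
        if ($live === []) {
            break;                         // nothing can match any more
        }
    }
    return ['match' => null, 'history' => $history];
}

$scan = scanTextDirected('tonight', ['tonite', 'toknite', 'tonight']);
echo implode("\n", $scan['history']), "\n";
// t: tonite toknite tonight
// o: tonite toknite tonight
// n: tonite tonight
// i: tonite tonight
// g: tonight
// h: tonight
// t: tonight
```
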
Therefore the text-directed DFA engine is generally faster. The expression-directed NFA has to try every part of the pattern and does not know whether the match succeeds until it reaches the end of the pattern; even when an early part of the expression matches, it may spend a long time on what follows. A DFA is driven by the string and keeps only a handful of possibilities at any position, so each character of the target text is examined at most once.
With an NFA, however, the order in which the alternatives are written has a large effect, and that effect depends on the target string: a match is found quickly when the alternative that fits is tried early.
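In a traditional NFA such as PHP's, the order of alternatives affects the result as well as the speed, because the engine accepts the first alternative that lets the overall match succeed. A short sketch with made-up patterns:

```php
<?php
// Same alternatives, different order, same subject.
preg_match('/to|tonight/', 'tonight', $m);
$shortFirst = $m[0];
echo $shortFirst, "\n";   // to

preg_match('/tonight|to/', 'tonight', $m);
$longFirst = $m[0];
echo $longFirst, "\n";    // tonight
```
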
Because an NFA is directed by the expression, differences in how the expression is written have a large effect, which lets us control the match more flexibly. An important feature of the NFA (the engine PHP is built on) is backtracking: when there are two possible ways to proceed, the engine chooses one and remembers the other in case it is needed later. This arises mainly with the standard quantifiers and with alternation (|).
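(Borrowed figure, not reproduced here: the backtracking path through the string, with positions labelled by letters.)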
Take the expression '/".*"!/'. First the engine locates the opening double quote (A). Next comes .* : the dot matches any character (line breaks excluded by default) and * allows zero or more of them. Because standard quantifiers are greedy, the match runs all the way to the end of the string (B). Since * can stand for zero, one, or many characters, at every position it passes the engine has two options that might both succeed: match one more character, or stop. The engine therefore saves a state at each of these positions, that is, everywhere .* passes, from M to the end of the string.
At the end of the string no " is found, so the engine returns to the most recently saved state (much like popping a stack) and backs up one character at a time until it meets a double quote (C). The quote matches, but the next character (D) is not an exclamation mark, so the attempt fails. The saved states are not yet exhausted, so the engine backtracks again and finds a double quote at E; as before, the following character (F) is not an exclamation mark, so it fails and continues back to G. Likewise (H) is not an exclamation mark, so it backtracks further, to I. At this point no saved states remain and it cannot backtrack any further: the first round of matching has failed. The engine is not finished, though. It moves the starting point forward and looks for the next double quote after position A, finding it at J, and then repeats a process similar to the previous round. In the end no string of the form "..."! is found, and the route to that answer is a tortuous one.
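The process can be replayed by hand. traceQuoteBang is a hypothetical helper written for this illustration; it mimics the backtracking for this one pattern only, assumes the text has no line breaks, and counts how many closing quotes are tried and rejected. The sample strings are made up, since the string from the figure is not available.

```php
<?php
// Hypothetical helper: replay the backtracking of ".*"! by hand.
function traceQuoteBang(string $text): array
{
    $n = strlen($text);
    $retries = 0;                       // closing quotes tried and rejected
    for ($open = 0; $open < $n; $open++) {
        if ($text[$open] !== '"') {
            continue;                   // move on to the next opening quote
        }
        // .* first takes everything up to the end of the text, then gives
        // characters back one at a time; $close is where a quote is tried.
        for ($close = $n - 1; $close > $open; $close--) {
            if ($text[$close] !== '"') {
                continue;
            }
            if ($close + 1 < $n && $text[$close + 1] === '!') {
                $match = substr($text, $open, $close + 2 - $open);
                return ['match' => $match, 'retries' => $retries];
            }
            $retries++;                 // a quote, but no '!' after it
        }
        // No saved states are left for this opening quote.
    }
    return ['match' => null, 'retries' => $retries];
}

$fail = traceQuoteBang('x "a" b "c" d');
$ok   = traceQuoteBang('x "a" b "c"! d');
var_dump($fail['match'], $fail['retries']);   // NULL, int(6)
var_dump($ok['match']);                       // string(10) ""a" b "c"!"

// The real engine agrees on both strings.
$realFail = preg_match('/".*"!/', 'x "a" b "c" d');
preg_match('/".*"!/', 'x "a" b "c"! d', $m);
```

Four quotes with no exclamation mark already cost six rejected attempts, and the count grows quickly with the number of quotes.
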
From the example above we can see two things. First, the .* form is very inefficient, especially when the match fails (although for a few lines of code we usually ignore this), and it is easy to get wrong. For example, using /".*"/ to match a string enclosed in a pair of double quotes against AB"cde"fgh""ijk"lmn yields "cde"fgh""ijk", that is, everything between the first double quote and the last one. Second, with nested quantifiers of indeterminate length such as ((...)*)* or ((...)+)*, the number of backtracking steps grows exponentially and the match can turn into what amounts to an endless loop, which costs far more time. More advanced engines check for this situation in advance and report an error, much as a browser does when a page keeps redirecting to itself.
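Both points can be reproduced in PHP. This is a sketch under stated assumptions: the sample string follows the example in the text, with its spacing assumed; the pattern /^(a+)*$/ stands in for the ((...)+)* form; and the two ini_set calls are choices made here so the demonstration stops quickly and behaves the same with or without the PCRE JIT. PHP does not inspect the pattern beforehand. It stops the match once pcre.backtrack_limit is exceeded and reports that through preg_last_error().

```php
<?php
// Point 1: greedy .* runs to the last double quote, not the nearest one.
preg_match('/".*"/', 'AB"cde"fgh""ijk"lmn', $m);
$greedy = $m[0];
echo $greedy, "\n";                      // "cde"fgh""ijk"

// Point 2: nested quantifiers on a subject that cannot match.
ini_set('pcre.jit', '0');                // assumption: use the interpreter
ini_set('pcre.backtrack_limit', '100000');
$subject = str_repeat('a', 22) . 'b';    // the trailing b forces a failure
$result = preg_match('/^(a+)*$/', $subject);
$error  = preg_last_error();             // read before any other preg_* call

// On a typical setup preg_match gives up with false, not 0.
var_dump($result);
var_dump($error === PREG_BACKTRACK_LIMIT_ERROR);
```
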