Analysis of the Regular Expression Matching Process (Regular Expression Matching Principles)
Many articles have been written about regular expressions. As we use regular expressions more and more, we want to optimize their performance and write fewer matching bugs, so we need to understand how a regular expression is actually executed. This article studies and analyzes that execution process, using the RegexBuddy test tool to break it down step by step. For details about the tool, see: Recommendation of the regular expression performance testing and optimization tool (RegexBuddy). To understand how a regular expression is processed, we should first become familiar with several concepts.
Common Regular Expression Engines
The engine determines how a regular expression is matched and how the search proceeds internally, so it is essential to understand it. The two main engine types in use today are DFA and NFA.
Engine | Differences |
DFA (Deterministic Finite Automaton) | DFA engines do not require backtracking (and therefore never test the same character twice), so matching is fast. A DFA engine also matches the longest possible string. However, because a DFA engine contains only a finite number of states, it cannot match a pattern with backreferences, and it cannot capture subexpressions. Representative tools: awk, egrep, flex, lex, MySQL, Procmail |
NFA (Non-deterministic Finite Automaton), divided into traditional NFA and POSIX NFA | A traditional NFA engine runs a so-called "greedy" backtracking algorithm: it tests all possible expansions of the regular expression in a specific order and accepts the first match it finds, which is not necessarily the longest one. Because backtracking can visit the same state many times, execution can be very slow in the worst case, but the engine supports submatches (captured subexpressions). Representative tools: GNU Emacs, Java, grep, less, more, .NET languages, PCRE library, Perl, PHP, Python, Ruby, sed, vi, and others. This is the engine type generally used in high-level languages. |
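A DFA engine is driven by the text: it walks through the string one character at a time and checks each character against the regular expression. An NFA engine is driven by the expression: it walks through the regular expression one element at a time and tries each element against the string. The NFA approach is slower, but it is easier to work with for the person writing the expression, so it is more widely used. The rest of this article uses the NFA engine to illustrate the matching process.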
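The difference between "first match in a specified order" and "longest match" can be observed directly. The sketch below assumes Python's built-in re module, which the table above lists among the traditional NFA implementations; the sample patterns are chosen only for illustration.

```python
import re

# Alternatives are tried left to right, and the first one that lets the
# overall match succeed wins, even if a longer match exists.
first = re.match(r"D|DEF", "DEF")
print(first.group())   # D

# Reordering the alternatives changes the result. An engine that always
# returns the longest match would report "DEF" in both cases.
second = re.match(r"DEF|D", "DEF")
print(second.group())  # DEF

# Backreferences and captured subexpressions are available.
pair = re.match(r"(\w)\1", "EEF")
print(pair.group(0), pair.group(1))  # EE E
```

The backreference \1 in the last pattern is exactly the kind of construct that the table says a pure DFA engine cannot handle.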
How the engine sees a string
The string "DEF" contains three characters (D, E, and F) and four positions (0, 1, 2, and 3), which can be written as 0D1E2F3. To the regular expression engine, every source string consists of both characters and positions. Matching starts at position 0 and proceeds one position at a time.
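As a small illustration of characters versus positions, the following sketch (assuming Python's re module) lists every position in "DEF" and rebuilds the 0D1E2F3 picture. The variable names are only for this example.

```python
import re

text = "DEF"

# An empty pattern matches at every position, so it enumerates the places
# the engine can stand: before D, between characters, and after F.
positions = [m.start() for m in re.finditer(r"", text)]
print(positions)  # [0, 1, 2, 3]

# Interleave positions and characters to reproduce the 0D1E2F3 notation.
labelled = "".join(f"{i}{ch}" for i, ch in enumerate(text)) + str(len(text))
print(labelled)   # 0D1E2F3
```

A string of n characters therefore has n + 1 positions.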
Consuming characters and zero width
During matching, if a subexpression matches character content (rather than a position) and that content is saved to the final match result, the subexpression is said to consume characters. If the subexpression matches only a position, or the content it matches is not saved to the final match result, the subexpression is zero-width. Consuming characters is mutually exclusive; zero-width matching is not. That is, a character can be matched by only one subexpression at a time, while a position can be matched by several zero-width subexpressions at the same time. Common zero-width constructs include ^, $, and the lookahead (?=...).
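The distinction can be checked by looking at match spans. This is a minimal sketch assuming Python's re module; a zero-width match starts and ends at the same position.

```python
import re

text = "DEF"

# ^ and (?=D) both succeed at position 0 and consume nothing:
# the match is empty, and its start and end positions are equal.
zero = re.match(r"^(?=D)", text)
print(zero.span(), repr(zero.group()))  # (0, 0) ''

# A character-consuming subexpression advances the position.
one = re.match(r"D", text)
print(one.span(), repr(one.group()))    # (0, 1) 'D'

# Because the lookahead consumed nothing, [D-F] can still match the same "D".
both = re.match(r"(?=D)[D-F]", text)
print(both.span(), repr(both.group()))  # (0, 1) 'D'
```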
Detailed examples of Regular Expression matching process
With these concepts in hand, we can analyze several common matching processes, using RegexBuddy for the analysis.
Demo 1: source string DEF, positions 0D1E2F3, regular expression /DEF/
The process can be understood as follows. First, the regular expression element /D/ gets control and tries to match from position 0. /D/ matches "D", so the match succeeds and control passes to /E/. Because "D" has already been matched by /D/, /E/ tries to match from position 1. /E/ matches "E", so the match succeeds and control passes to /F/. /F/ matches "F" from position 2, and the whole match succeeds.
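The hand-off of control in Demo 1 can be sketched with a tiny hand-written matcher. This is an illustration only, not how RegexBuddy or any real engine is implemented: the hypothetical function match_literal handles literal characters only and, unlike a real engine, does not retry from later start positions after a failure.

```python
def match_literal(pattern, text, start=0):
    """Match a pattern of literal characters only; return (start, end) or None."""
    pos = start
    for token in pattern:
        if pos < len(text) and text[pos] == token:
            print(f"/{token}/ matches {text[pos]!r} at position {pos}, control passes on")
            pos += 1
        else:
            print(f"/{token}/ fails at position {pos}")
            return None
    return (start, pos)

span = match_literal("DEF", "DEF")
print(span)  # (0, 3)
```

Each loop iteration corresponds to one element of the expression getting control at the current position.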
Demo 2: source string DEF, positions 0D1E2F3, regular expression /D\w+F/
The process can be understood as follows. First, /D/ gets control, tries to match from position 0, and matches "D". The match succeeds and control passes to /\w+/. Because "D" has already been matched by /D/, /\w+/ tries to match from position 1. /\w+/ is greedy: it matches as many characters as possible and records a saved state for each character it takes, so that it can backtrack later. It matches "EF", which moves the current position to 3, and control passes to /F/. There is no character left at position 3, so /F/ fails. The engine backtracks: /\w+/ gives back one character, the current position becomes 2, and control passes to /F/ again. This time /F/ matches "F". So /\w+/ ends up matching "E", and the whole match, "DEF", is complete.
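The backtracking in Demo 2 can be simulated by hand. The function match_d_w_f below is a hypothetical, hard-coded sketch of /D\w+F/ anchored at position 0, with \w approximated as "alphanumeric or underscore"; it is meant to mirror the steps described above, not to reproduce a real engine's internals. The last lines compare the outcome with Python's re module.

```python
import re

def match_d_w_f(text):
    """Simulate /D\\w+F/ at position 0; return (trace, span), span None on failure."""
    trace = []
    if not text.startswith("D"):
        return trace + ["/D/ fails at position 0"], None
    pos = 1
    trace.append("/D/ matches 'D', position -> 1")

    # /\w+/ is greedy: take as many word characters as possible.
    end = pos
    while end < len(text) and (text[end].isalnum() or text[end] == "_"):
        end += 1
    if end == pos:
        return trace + ["/\\w+/ fails at position 1"], None
    trace.append(f"/\\w+/ takes {text[pos:end]!r}, position -> {end}")

    # Try /F/; on failure, /\w+/ gives back one character and /F/ is retried.
    while True:
        if end < len(text) and text[end] == "F":
            trace.append(f"/F/ matches 'F' at position {end}")
            return trace, (0, end + 1)
        trace.append(f"/F/ fails at position {end}")
        if end - 1 == pos:  # \w+ must keep at least one character
            return trace + ["no saved states left, match fails"], None
        end -= 1
        trace.append(f"/\\w+/ backtracks to {text[pos:end]!r}, position -> {end}")

trace, span = match_d_w_f("DEF")
print("\n".join(trace))
print(span)  # (0, 3)

# Cross-check with the real engine: the group shows what \w+ finally kept.
real = re.match(r"D(\w+)F", "DEF")
print(real.span(), real.group(1))  # (0, 3) E
```

On longer inputs the same loop gives back one character per failed attempt, which is why patterns that force a lot of backtracking can become slow.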
Demo 3: source string DEF, positions 0D1E2F3, regular expression /^(?=D)[D-F]+$/
The process can be understood as follows. The metacharacters /^/ and /$/ match only positions. The positive lookahead /(?=D)/ checks whether the character "D" appears to the right of the current position; it only tests, consumes no characters, and does not save what it matched to the final result. All three are therefore zero-width. First, /^/ gets control and tries to match from position 0. /^/ matches the start of the string, position 0, so the match succeeds and control passes to /(?=D)/. /(?=D)/ requires the character to the right of its position to be "D". Zero-width subexpressions are not mutually exclusive, that is, the same position can be matched by several zero-width subexpressions, so /(?=D)/ also tries to match from position 0. The character to the right of position 0 is "D", so the match succeeds and control passes to /[D-F]+/.
Because /(?=D)/ only tests and does not save its content to the final result, and the position where it succeeded is position 0, /[D-F]+/ also tries to match from position 0. /[D-F]+/ first matches "D", then continues through "E" and "F" until it reaches position 3, where there are no more characters to the right. Control then passes to /$/, which tries to match from position 3. /$/ matches the end of the string, position 3, so the match succeeds. At this point the whole regular expression has matched and the engine reports success: the match result is "DEF", with start position 0 and end position 3. In summary, /^/ matches position 0, /(?=D)/ matches position 0, /[D-F]+/ matches the string "DEF", and /$/ matches position 3.
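The positions reported in Demo 3 can be checked by wrapping each part of the expression in a capturing group and printing its span. This is a sketch assuming Python's re module; the extra parentheses are added only to expose each part and are not in the original expression.

```python
import re

# Groups: 1 = ^, 2 = (?=D), 3 = [D-F]+, 4 = $
m = re.match(r"(^)((?=D))([D-F]+)($)", "DEF")

print(m.span())                  # (0, 3)
print(m.span(1), m.span(2))      # (0, 0) (0, 0)
print(m.span(3), m.group(3))     # (0, 3) DEF
print(m.span(4))                 # (3, 3)

# If the lookahead fails at position 0, the whole match fails,
# even though every character is in the class [D-F].
failed = re.match(r"^(?=D)[D-F]+$", "EDF")
print(failed)                    # None
```

The three zero-width parts report empty spans, while /[D-F]+/ alone accounts for all three characters of the result.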
Note: The examples above covered ordinary matching, the backtracking process, and zero-width matching. These examples are relatively simple, and the regular expressions you meet in practice will be longer and more complex. The idea is the same, though: apply the principles described here and break the expression down one step at a time. That is all for now. Thank you!