Analysis of the regular expression matching process (how regular expression matching works)

Last Update:2015-10-15 Source: Internet

Author: User

Many articles have already been written about regular expressions. As we use regular expressions more and more, we want to optimize their performance and write fewer matching bugs, and to do that we have to understand how a regular expression is actually executed. This article studies and analyzes that execution process, using the RegexBuddy testing tool to break it down step by step. For details about the tool, see: Recommended regular expression performance testing and optimization tool (RegexBuddy). To understand how a regular expression is processed, we should first become familiar with a few concepts.

Common regular expression engines
The engine determines how a regular expression is matched and how the search proceeds internally, so understanding it is essential. The two main engine types in use today are DFA and NFA.

Engine	Differences
DFA (deterministic finite automaton)	A DFA engine never backtracks, so it never tests the same character twice and matching is fast. It also finds the longest possible match. However, because a DFA engine only has a finite number of states, it cannot match patterns that contain backreferences, and it cannot capture subexpressions. Representative tools: awk, egrep, flex, lex, MySQL, Procmail.
NFA (non-deterministic finite automaton), divided into traditional NFA and POSIX NFA	A traditional NFA engine runs a backtracking algorithm: it tries the possible expansions of the regular expression in a specific order and accepts the first match it finds (a POSIX NFA keeps searching so that it can report the longest leftmost match). Because backtracking can visit the same state many times, a traditional NFA can be very slow in the worst case, but it supports submatches (capture groups). Representative tools: GNU Emacs, Java, grep, less, more, .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, vi. This is the engine type generally used in high-level languages.

A DFA engine is driven by the text: it reads the string one character at a time and checks each character against the regular expression. An NFA engine is driven by the regular expression: it walks through the pattern one element at a time and checks each element against the string. Although an NFA engine is slower, it is simpler for the person writing the expression, so it is more widely used. The rest of this article uses the NFA engine to illustrate the matching process.
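As a small illustration of the traditional NFA behavior described above, here is a sketch using Python's built-in `re` module, which is a backtracking engine. The sample strings and variable names are chosen only for this example and are not from the original article.

```python
import re

# Alternatives are tried left to right and the first one that succeeds wins,
# even though a longer match ("ab") would also be possible.
first_match = re.match(r"a|ab", "ab").group()

# Capture groups and backreferences are supported:
# (\w)\1 finds a word character immediately followed by itself.
doubled = re.search(r"(\w)\1", "hello")

print(first_match)       # a
print(doubled.group())   # ll
print(doubled.group(1))  # l
```

An engine that always reports the longest leftmost match would return "ab" for the first pattern, which is one visible difference between the engine types.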

How the engine sees a string
The string "DEF" contains three characters, D, E, and F, and four positions, 0, 1, 2, and 3, laid out as 0D1E2F3. To a regular expression, every source string consists of both characters and positions. Matching starts at position 0 and proceeds from there.
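The position model can be sketched in a few lines of Python. The helper `tag_positions` is a hypothetical name introduced for this illustration; the second part relies on the fact that an empty pattern consumes nothing and therefore matches at every position.

```python
import re

def tag_positions(text):
    """Interleave position numbers with characters: 'DEF' -> '0D1E2F3'."""
    parts = []
    for index, char in enumerate(text):
        parts.append(f"{index}{char}")
    parts.append(str(len(text)))  # the position after the last character
    return "".join(parts)

tagged = tag_positions("DEF")

# An empty pattern matches at each position without consuming a character.
positions = [m.start() for m in re.finditer(r"", "DEF")]

print(tagged)     # 0D1E2F3
print(positions)  # [0, 1, 2, 3]
```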

Character consumption and zero width
During matching, if a subexpression matches character content rather than a position, and that content is saved in the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved in the final match result, it is zero-width. Consuming characters is mutually exclusive, while zero-width matching is not: a character can be matched by only one subexpression at a time, but a position can be matched by several zero-width subexpressions at the same time. Common zero-width constructs include ^ and (?=...).
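The difference between consuming and zero-width subexpressions can be checked with Python's `re` module. This is only a sketch; the patterns below are illustrative and not taken from the original article.

```python
import re

text = "DEF"

# ^, (?=D) and (?=DE) are all zero-width: they all succeed at position 0
# and nothing is added to the match result.
zero_width = re.match(r"^(?=D)(?=DE)", text)
print(zero_width.span())         # (0, 0)
print(repr(zero_width.group()))  # ''

# A lookahead and a consuming D can both examine the "D" at position 0,
# because only the second one consumes it.
mixed = re.match(r"(?=D)D", text)
print(mixed.span())              # (0, 1)

# Two consuming subexpressions cannot share one character:
# the second D has to match at position 1, where the text has "E".
consuming = re.match(r"DD", text)
print(consuming)                 # None
```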

Detailed examples of the regular expression matching process
With these concepts in place, we can analyze a few common matching processes, using RegexBuddy for the analysis.

Demo1: source string DEF, tagged with positions as 0D1E2F3, regular expression: /DEF/

The process can be understood as follows. First, /D/ gets control and tries to match from position 0. /D/ matches "D", so the match succeeds and control passes to /E/. Because "D" has already been matched by /D/, /E/ tries to match from position 1. /E/ matches "E", so the match succeeds and control passes to /F/. /F/ matches "F" from position 2, and the overall match succeeds.
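The hand-off of control in Demo1 can be sketched with a tiny matcher for patterns made only of literal characters. The function `match_literal` is a hypothetical helper written for this illustration; it is not how RegexBuddy or any real engine is implemented.

```python
def match_literal(pattern, text, start=0):
    """Match a pattern of literal characters against text, beginning at start.

    Returns (end_position, steps) on success or (None, steps) on failure.
    """
    steps = []
    pos = start
    for token in pattern:
        if pos < len(text) and text[pos] == token:
            steps.append(f"/{token}/ matches {text[pos]!r} at position {pos}")
            pos += 1  # control passes to the next token at the next position
        else:
            steps.append(f"/{token}/ fails at position {pos}")
            return None, steps
    return pos, steps

end, steps = match_literal("DEF", "DEF")
for step in steps:
    print(step)
print(end)  # 3
```

Each pattern character consumes exactly one character of the text, which is why every successful step moves the current position forward by one.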

Demo2: source string DEF, tagged with positions as 0D1E2F3, regular expression: /D\w+F/

The process can be understood as follows. First, /D/ gets control and tries to match from position 0. /D/ matches "D", so the match succeeds and control passes to /\w+/. Because "D" has already been matched by /D/, /\w+/ tries to match from position 1. \w+ is greedy: it records the alternative states it can fall back to and, by default, matches as many characters as possible. It matches "EF" successfully, which moves the current position to 3, and control passes to /F/. There is no character left at position 3, so /F/ fails. \w+ then backtracks by one character, the current position becomes 2, and control passes to /F/ again. This time /F/ matches the character "F". As a result, \w+ matches only "E", and the match is complete.
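The greedy-then-backtrack behavior of Demo2 can be sketched by hand. The function `match_d_word_f` below is a hypothetical helper that hard-codes the pattern D\w+F and only tries to match from position 0; it is meant to mirror the steps described above, not to reproduce a real engine.

```python
import re

def match_d_word_f(text):
    """Backtracking sketch for the pattern D\\w+F, attempted at position 0 only."""
    steps = []
    if not text.startswith("D"):
        return None, steps
    steps.append("/D/ matches at position 0")

    # Greedy: \w+ first takes as many word characters as it can.
    end = 1
    while end < len(text) and re.fullmatch(r"\w", text[end]):
        end += 1

    # Backtrack: give characters back one at a time, keeping at least one.
    while end > 1:
        steps.append(f"\\w+ holds {text[1:end]!r}, /F/ tries position {end}")
        if end < len(text) and text[end] == "F":
            steps.append(f"/F/ matches at position {end}")
            return (text[1:end], end + 1), steps
        steps.append("/F/ fails, \\w+ gives back one character")
        end -= 1
    return None, steps

result, steps = match_d_word_f("DEF")
for step in steps:
    print(step)
print(result)  # ('E', 3)

# Python's own backtracking engine agrees on what \w+ ends up matching.
print(re.match(r"D(\w+)F", "DEF").group(1))  # E
```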

Demo3: source string DEF, tagged with positions as 0D1E2F3, regular expression: /^(?=D)[D-F]+$/

The process can be understood as follows. The metacharacters /^/ and /$/ match only positions. The lookahead /(?=D)/, which checks whether the character "D" appears to the right of the current position, only tests; it consumes no characters and does not save what it matched in the final match result. All three are therefore zero-width.

First, /^/ gets control and tries to match from position 0. /^/ matches the start position 0, so the match succeeds and control passes to the lookahead /(?=D)/. /(?=D)/ succeeds only if the character to the right of its position is "D". Zero-width subexpressions are not mutually exclusive, that is, the same position can be matched by several zero-width subexpressions at the same time, so /(?=D)/ also tries to match from position 0. The character to the right of position 0 is "D", which meets the requirement, so the match succeeds and control passes to /[D-F]+/.

Because /(?=D)/ only tests and saves nothing in the final result, and it succeeded at position 0, /[D-F]+/ also tries to match from position 0. /[D-F]+/ first matches "D", then continues until it has also matched "EF". It has now reached position 3, and there are no characters to the right of position 3, so control passes to /$/. /$/ tries to match from position 3 and matches the end position, that is, position 3.

At this point the whole regular expression has matched and success is reported. The match result is "DEF", the start position is 0, and the end position is 3. /^/ matches position 0, /(?=D)/ matches position 0, /[D-F]+/ matches the string "DEF", and /$/ matches position 3.
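The positions reported for Demo3 can be checked in Python by wrapping each part of the expression in a capture group. The extra parentheses are added only for this illustration and are not part of the original expression.

```python
import re

# Group 1: ^    Group 2: (?=D)    Group 3: [D-F]+    Group 4: $
m = re.match(r"(^)((?=D))([D-F]+)($)", "DEF")

print(m.group(0), m.span(0))  # DEF (0, 3)
print(m.span(1))              # (0, 0)  ^ matches position 0
print(m.span(2))              # (0, 0)  (?=D) also matches position 0
print(m.group(3), m.span(3))  # DEF (0, 3)
print(m.span(4))              # (3, 3)  $ matches position 3

spans = [m.span(i) for i in range(5)]
```

The zero-width parts report spans whose start and end are equal, which is what "matches only a position" means in practice.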

Note: The examples above covered ordinary matching, the backtracking process, and matching with zero-width subexpressions. These examples are relatively simple, and the regular expressions you meet in practice will be longer and more complex. The idea is the same, though: apply the principles described here and break the expression down piece by piece. That is all for now. Thank you!

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.