Matching Principles of the Regular Expression NFA Engine

Source: Internet
Author: User
Tags: expression, engine

Original article address: Http://sysenter.blog.163.com/blog/static/1408453882011110115434795/

1. Why should we understand the engine's matching principles?

Notes combined at random may produce nothing but noise, yet the same notes, arranged by a composer, become beautiful music. A performer can play that music from the score, but may not know how to rearrange the notes to make it even more beautiful.

The same applies to regular expressions. Without understanding how the regular expression engine works, you can still write expressions that meet your needs. It is hard, however, to write expressions that are efficient and free of hidden risks. Anyone who uses regular expressions often, or wants to study them in depth, therefore needs to understand how the engine performs a match.

2. Regular Expression Engine

Regular expression engines fall into two broad types, DFA and NFA, and NFA engines are further divided into traditional NFA and POSIX NFA:

DFA: Deterministic Finite Automaton

NFA: Non-deterministic Finite Automaton

Traditional NFA

POSIX NFA

Because a DFA engine never backtracks, it matches quickly. However, it does not support capture groups, so backreferences and $number references are unavailable as well. DFA engines are currently used mainly in tools and languages such as awk, egrep, and lex.

POSIX NFA refers to NFA engines that follow the POSIX standard. Such an engine returns the leftmost-longest match: it keeps backtracking until it is certain that it has found the longest match at the leftmost position. As with a DFA, non-greedy mode (lazy quantifiers) has no meaning for a POSIX NFA.
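The difference between a traditional NFA and the leftmost-longest rule shows up with alternation. The following is a minimal sketch that assumes Python's built-in `re` module, which is a backtracking engine of the traditional NFA kind. The POSIX result is only stated in a comment, not computed, because `re` does not offer POSIX semantics.

```python
import re

# Traditional NFA: alternatives are tried in order, and the first one that
# lets the overall match succeed wins, even if a longer match exists.
first_wins = re.search(r"A|AB", "AB").group()

# Putting the longer alternative first changes the result.
longer_first = re.search(r"AB|A", "AB").group()

print(first_wins)    # A  (a leftmost-longest engine would report AB)
print(longer_first)  # AB
```

The order of alternatives therefore matters in a traditional NFA, whereas it would not under the leftmost-longest rule.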

Most languages and tools use the traditional NFA engine, which has some features not supported by DFA:

Capture groups, backreferences, and $number references;

Lookaround, that is (?<=...), (?<!...), (?=...), and (?!...), which some articles call pre-search;

Lazy (ignore-first) quantifiers, that is ??, *?, +?, {m,n}?, and {m,}?, which some articles call non-greedy mode;

Possessive quantifiers, that is ?+, *+, ++, {m,n}+, and {m,}+ (when this article was written, supported only by Java and PCRE), and atomic grouping (?>...).
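Several of the features listed above can be tried directly. The sketch below assumes Python's `re` module as a stand-in for a traditional NFA engine; the sample strings are made up for illustration. Possessive quantifiers and atomic groups are left out because their availability depends on the engine and its version.

```python
import re

# Capture group and backreference: \1 must repeat what group 1 captured.
doubled = re.search(r"(\w)\1", "hello").group()

# Numbered reference in a replacement (Python spells $1 as \1 or \g<1>).
swapped = re.sub(r"(\w+)-(\w+)", r"\2-\1", "left-right")

# Lookaround: digits preceded by "$" and followed by "."; neither is consumed.
amount = re.search(r"(?<=\$)\d+(?=\.)", "cost: $42.50").group()

# Greedy versus lazy quantifier on the same text.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

print(doubled, swapped, amount, greedy, lazy)
# ll right-left 42 <a><b> <a>
```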

The differences between engines are not the focus of this article, so they are only introduced briefly here. Interested readers can consult the relevant literature.

3. Background knowledge

3.1 Composition of a string

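The string "ABC" contains three characters and four positions: position 0 lies before "A", position 1 between "A" and "B", position 2 between "B" and "C", and position 3 after "C".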
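The character and position counts can be checked with a short sketch. It assumes Python's `re` module and relies on the fact that an empty pattern matches at every position of a string.

```python
import re

text = "ABC"
characters = list(text)

# An empty pattern matches at every position, so it enumerates the positions.
positions = [m.start() for m in re.finditer(r"", text)]

print(characters)  # ['A', 'B', 'C']
print(positions)   # [0, 1, 2, 3]
```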

3.2 Consuming characters and zero width

During matching, if a subexpression matches character content rather than a position, and that content is saved in the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved in the final match result, the subexpression is zero-width.

Consuming characters is mutually exclusive, while zero width is not. A character can be matched by only one subexpression at a time, whereas a position can be matched by several zero-width subexpressions at once.
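The distinction can be observed through match spans. This is a small sketch that assumes Python's `re` module; `\b` (word boundary) is used here only as one more example of a zero-width subexpression.

```python
import re

text = "ABC"

# A consuming subexpression: start and end positions differ.
consuming = re.search(r"A", text).span()

# Three zero-width subexpressions all succeed at the same position 0.
zero_width = re.search(r"^\b(?=A)", text).span()

# Characters are exclusive: the single "A" cannot satisfy two consuming "A"s.
exclusive = re.search(r"AA", text)

print(consuming)   # (0, 1)
print(zero_width)  # (0, 0)
print(exclusive)   # None
```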

3.3 Control and transmission

Matching usually proceeds with one subexpression (which may be an ordinary character, a metacharacter, or a sequence of metacharacters) taking control and attempting a match at some position in the string. A subexpression begins its attempt at the position where the previous subexpression's successful match ended. Consider a regular expression of the form:

(subexpression 1)(subexpression 2)

Suppose (subexpression 1) is a zero-width expression. The start and end positions of its match are then the same, say position 0, so (subexpression 2) attempts its match starting from position 0.

Suppose instead that (subexpression 1) consumes characters. The start and end positions of its match then differ. If, for example, its successful match starts at position 0 and ends at position 2, then (subexpression 2) attempts its match starting from position 2.
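Both cases can be made visible by wrapping each subexpression in a capture group and reading the group spans. The sketch assumes Python's `re` module; the concrete subexpressions `^`, `A`, `AB`, and `C` are chosen only for illustration.

```python
import re

# Case 1: subexpression 1 is zero-width, so subexpression 2 starts at position 0.
m1 = re.match(r"(^)(A)", "ABC")
zero_width_spans = (m1.span(1), m1.span(2))

# Case 2: subexpression 1 consumes "AB", so subexpression 2 starts at position 2.
m2 = re.match(r"(AB)(C)", "ABC")
consuming_spans = (m2.span(1), m2.span(2))

print(zero_width_spans)  # ((0, 0), (0, 1))
print(consuming_spans)   # ((0, 2), (2, 3))
```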

For the whole expression, matching normally begins at position 0 of the string. If the attempt that starts at position 0 fails to match the whole expression, the engine moves the regular expression forward (this is the transmission), and the whole expression is tried again from position 1. This continues until either a match is found and success is reported, or the attempt at the last position has also failed and failure is reported.
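The position-by-position retry can be simulated from the outside. The sketch below assumes Python's `re` module and uses `Pattern.match(text, pos)` to make one anchored attempt per start position; `search_with_transmission` is a hypothetical helper name, and a real engine performs this loop internally, usually with optimizations.

```python
import re


def search_with_transmission(pattern, text):
    """Try the whole pattern at position 0, then 1, and so on, logging each attempt."""
    compiled = re.compile(pattern)
    attempts = []
    for start in range(len(text) + 1):
        m = compiled.match(text, start)  # one attempt, beginning exactly at `start`
        attempts.append((start, m is not None))
        if m:
            return m.span(), attempts
    return None, attempts


span, attempts = search_with_transmission(r"BC", "ABC")
print(span)      # (1, 3)
print(attempts)  # [(0, False), (1, True)]
```

Here the attempt from position 0 fails because "B" cannot match "A", so the expression is moved forward and succeeds from position 1.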

4. Simple matching processes of regular expressions

4.1 Basic matching process

Source string: ABC

Regular Expression: ABC

Matching Process:

First, the character "A" takes control and attempts a match starting at position 0. "A" matches "A", so the match succeeds and control passes to "B". Because "A" has already been matched by "A", "B" attempts its match from position 1. "B" matches "B", the match succeeds, and control passes to "C". "C" matches "C", and the match succeeds.

At this point the whole regular expression has been matched, and a successful match is reported. The match result is "ABC", starting at position 0 and ending at position 3.
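The hand-off of control from one literal character to the next can be written out as a loop. This is an illustrative sketch for patterns made only of literal characters; `match_literal` is a hypothetical helper, not the way a real engine is implemented.

```python
def match_literal(pattern, text, start=0):
    """Match a pattern made only of literal characters, beginning at `start`."""
    pos = start
    steps = []
    for ch in pattern:
        if pos >= len(text) or text[pos] != ch:
            steps.append(f"{ch!r} fails at position {pos}")
            return None, steps
        steps.append(f"{ch!r} matches at position {pos}")
        pos += 1  # control passes to the next character of the pattern
    return (start, pos), steps


span, steps = match_literal("ABC", "ABC")
print(span)  # (0, 3)
for step in steps:
    print(step)
```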

4.2 Matching process with a greedy (match-first) quantifier: successful match (1)

Source string: ABC

Regular Expression: AB?C

The quantifier "?" is a greedy (match-first) quantifier. When it may either match or not match, it first chooses to attempt the match. Only if that choice prevents the whole expression from matching does it fall back to not matching. Here "?" modifies the character "B", so "B?" is a single unit.

Matching Process:

First, the character "A" takes control and, starting at position 0, matches "A". The match succeeds, and control passes to "B?". Because "?" is greedy, it first attempts to match: "B?" matches "B" and records a saved state (the alternative of not matching). The match succeeds, and control passes to "C". "C" matches "C", the match succeeds, and the saved state is discarded.

At this point the whole regular expression has been matched, and a successful match is reported. The match result is "ABC", starting at position 0 and ending at position 3.

4.3 Matching process with a greedy (match-first) quantifier: successful match (2)

Source string: AC

Regular Expression: AB?C

Matching Process:

First, the character "A" takes control and, starting at position 0, matches "A". The match succeeds, and control passes to "B?". It first attempts to match, recording a saved state: "B?" tries to match "C", and the attempt fails. The engine then backtracks to the saved state, where "B?" skips the match and hands control to "C". "C" matches "C", and the match succeeds.

At this point the whole regular expression has been matched, and a successful match is reported. The match result is "AC", starting at position 0 and ending at position 2. Here "B?" matches nothing.
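The two successful cases can be compared side by side. The sketch assumes Python's `re` module; the parentheses around "B?" are added only so that what the optional item matched can be read back, and they do not change which strings match.

```python
import re

# Parentheses added only to observe what "B?" matched.
pattern = re.compile(r"A(B?)C")

with_b = pattern.search("ABC")
without_b = pattern.search("AC")

print(with_b.span(), repr(with_b.group(1)))        # (0, 3) 'B'
print(without_b.span(), repr(without_b.group(1)))  # (0, 2) ''
```

In the second case the group is empty, which corresponds to the optional item falling back to not matching after its first attempt failed.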

4.4 Matching process with a greedy (match-first) quantifier: match failure

Source string: ABD

Regular Expression: AB?C

Matching Process:

First, the character "A" takes control and, starting at position 0, matches "A". The match succeeds, and control passes to "B?". It first attempts to match: "B?" matches "B" and records a saved state. The match succeeds, and control passes to "C". "C" tries to match "D" and fails, so the engine backtracks to the saved state. There "B?" skips the match, that is, "B?" does not match "B", and hands control to "C". "C" tries to match "B" and fails. No saved states remain, so the first round of matching, the attempt from position 0, fails.

The engine then moves the regular expression forward and attempts a match from position 1. "A" tries to match "B" and fails. There is no saved state, so the second round fails as well.

The engine keeps moving forward until the attempt at position 3 also fails, at which point matching ends and the whole expression is reported as failing to match.
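The interplay of saved states, backtracking, and moving the expression forward can be sketched as a toy matcher. It handles only literal characters optionally followed by "?" or "??", and it uses recursion in place of an explicit stack of saved states. The names `tokenize`, `match_here`, and `search` are hypothetical, and this is not how a production engine is built.

```python
def tokenize(pattern):
    """Split literals with an optional ? or ?? suffix into (char, quantifier) pairs."""
    tokens, i = [], 0
    while i < len(pattern):
        ch, i = pattern[i], i + 1
        quant = ""
        while i < len(pattern) and pattern[i] == "?" and len(quant) < 2:
            quant, i = quant + "?", i + 1
        tokens.append((ch, quant))
    return tokens


def match_here(tokens, text, pos, log):
    """Return the end position if tokens match at pos, else None after backtracking."""
    if not tokens:
        return pos
    (ch, quant), rest = tokens[0], tokens[1:]
    fits = pos < len(text) and text[pos] == ch
    if quant == "":  # plain literal: no alternative to fall back on
        if fits:
            return match_here(rest, text, pos + 1, log)
        log.append(f"{ch!r} fails at {pos}")
        return None
    # Optional item: "?" tries to match first, "??" tries to skip first.
    order = ("take", "skip") if quant == "?" else ("skip", "take")
    for choice in order:
        if choice == "take" and not fits:
            log.append(f"{ch + quant!r} cannot take at {pos}")
            continue
        next_pos = pos + 1 if choice == "take" else pos
        end = match_here(rest, text, next_pos, log)
        if end is not None:
            return end
        log.append(f"backtrack: {ch + quant!r} {choice} at {pos} led to failure")
    return None


def search(pattern, text):
    """Attempt a match at each start position in turn; return (span, log)."""
    tokens, log = tokenize(pattern), []
    for start in range(len(text) + 1):
        log.append(f"attempt at position {start}")
        end = match_here(tokens, text, start, log)
        if end is not None:
            return (start, end), log
    return None, log


span, log = search("AB?C", "ABD")
print(span)  # None
print("\n".join(log))
```

The log for this failing case shows one attempt for each of the four positions, with both choices for "B?" tried and abandoned during the attempt from position 0.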

4.5 Matching process with a lazy (ignore-first) quantifier: successful match

Source string: ABC

Regular Expression: AB??C

The quantifier "??" is a lazy (ignore-first) quantifier. When it may either match or not match, it first chooses not to match. Only if that choice prevents the whole expression from matching does it attempt the match. Here "??" modifies the character "B", so "B??" is a single unit.

Matching Process:

First, the character "A" takes control and, starting at position 0, matches "A". The match succeeds, and control passes to "B??". It first skips the match, that is, "B??" matches nothing, records a saved state, and hands control to "C". "C" tries to match "B" and fails. The engine backtracks to the saved state, where "B??" now attempts to match: "B??" matches "B", the match succeeds, and control passes to "C". "C" matches "C", and the match succeeds.

At this point the whole regular expression has been matched, and a successful match is reported. The match result is "ABC", starting at position 0 and ending at position 3. Here "B??" matches the character "B".
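Greedy and lazy quantifiers differ in which choice is tried first, not in which strings can match. A short sketch, assuming Python's `re` module, shows that the overall result is the same when a following "C" forces the outcome, and differs when nothing follows the quantifier.

```python
import re

# With "C" after the quantifier, both forms end up matching "ABC".
greedy_full = re.search(r"AB?C", "ABC").group()
lazy_full = re.search(r"AB??C", "ABC").group()

# With nothing after the quantifier, the first choice is never revisited.
greedy_open = re.search(r"AB?", "ABC").group()
lazy_open = re.search(r"AB??", "ABC").group()

print(greedy_full, lazy_full)  # ABC ABC
print(greedy_open, lazy_open)  # AB A
```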

4.6 Matching process with zero-width subexpressions

Source string: a12

Regular Expression: ^(?=[a-z])[a-z0-9]+$

The metacharacters "^" and "$" match only positions. The positive lookahead "(?=[a-z])" only tests for a match: it consumes no characters, and the content it matches is not saved in the final match result. All three are therefore zero-width.

This regular expression matches a string that is made up of lowercase letters and digits and whose first character is a letter.

Matching Process:

First, the metacharacter "^" takes control and attempts a match starting at position 0. "^" matches the start of the string, position 0. The match succeeds, and control passes to the lookahead "(?=[a-z])".

"(?=[a-z])" requires that the character immediately to the right of its position be a letter. Zero-width subexpressions are not mutually exclusive, that is, the same position can be matched by several zero-width subexpressions at once, so the lookahead also attempts its match at position 0. The character to the right of position 0 is "a", which meets the requirement. The match succeeds, and control passes to "[a-z0-9]+".

Because "(?=[a-z])" only tests for a match and does not save the matched content in the final result, and because the position at which it succeeded is position 0, "[a-z0-9]+" also attempts its match from position 0. "[a-z0-9]+" first matches "a", then continues and matches "1" and "2". It has now reached position 3, where no characters remain to the right, so it hands control to "$".

The metacharacter "$" attempts its match at position 3. It matches the end of the string, which is position 3, and the match succeeds.

At this point the whole regular expression has been matched, and a successful match is reported. The match result is "a12", starting at position 0 and ending at position 3. Within it, "^" matches position 0, "(?=[a-z])" matches position 0, "[a-z0-9]+" matches the string "a12", and "$" matches position 3.
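The zero-width walk-through can be checked with a short sketch. It assumes Python's `re` module and takes the source string as the lowercase "a12" with the pattern `^(?=[a-z])[a-z0-9]+$`; the second test string "1a2" is added here only as a counterexample.

```python
import re

pattern = re.compile(r"^(?=[a-z])[a-z0-9]+$")

# The whole expression matches "a12" from position 0 to position 3.
full = pattern.search("a12")
print(full.span(), full.group())  # (0, 3) a12

# The zero-width part on its own matches an empty span at position 0.
prefix_span = re.search(r"^(?=[a-z])", "a12").span()
print(prefix_span)  # (0, 0)

# A leading digit fails the lookahead, although [a-z0-9]+ alone would accept it.
rejected = pattern.search("1a2")
print(rejected)  # None
```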

 
