Analysis of the regular expression matching process (regular expression matching principles)

Source: Internet
Author: User

We have published several articles about regular expressions. As we use regular expressions more and more, we want to improve their performance and reduce the matching bugs we write, so we have to look more closely at how a regular expression is executed. In this article we analyze the execution process step by step, using the RegexBuddy test tool to break the matching down. For details on the tool, see: Regular Expression Performance test tool recommendations, optimization tools recommended (Regexbuddy recommended). To understand the matching process, we first need to be familiar with a few concepts.

Common regular expression engines
The engine determines how a regular expression is matched and how the internal search proceeds, so it is important to understand it. The two main kinds of engine in use today are DFA and NFA. Their differences are compared below.

Engine differences

DFA (deterministic finite automaton)
A DFA engine does not need to backtrack, so it never tests the same character twice and matching is fast. A DFA engine also finds the longest possible match. However, because it tracks only a finite set of states and keeps no record of how a match was reached, it cannot match a pattern that contains backreferences, and it cannot capture subexpressions. Representative tools: awk, egrep, flex, lex, MySQL, procmail.

NFA (nondeterministic finite automaton)
NFA engines are divided into traditional NFA and POSIX NFA. Because they backtrack, they can be slow, but they support capturing submatches. Representative tools: GNU Emacs, Java, grep, less, more, .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, vi, and others. Most high-level languages use this kind of engine.

A DFA engine is driven by the text: it scans the string one character at a time and checks each character against the regular expression. An NFA engine is driven by the regular expression: it steps through the expression one subexpression at a time and looks for matching text in the string. Although an NFA is slower, it is easier for the pattern author to work with, so it is used more widely. All of the walkthroughs below assume an NFA engine.
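The behaviour of a backtracking NFA engine can be observed directly. The following is a minimal sketch using Python's built-in re module, which is a backtracking engine; the sample strings "topology" and "letter" are illustrative assumptions and do not come from the original article.

```python
import re

# Ordered alternation: a backtracking (traditional NFA) engine tries the
# alternatives left to right and keeps the first one that lets the overall
# match succeed. A DFA or POSIX engine would report the longest match, "top".
first_alt = re.search(r"to|top", "topology").group()

# Backreferences require the engine to remember what a group captured,
# which a pure DFA cannot do.
doubled = re.search(r"(\w)\1", "letter").group()

print(first_alt)  # to
print(doubled)    # tt
```

The first result shows why the order of alternatives matters in an NFA engine, and the second shows a feature that is only available because the engine keeps track of captured text.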

How the engine sees a string
The string "def" consists of three characters, d, e, and f, and four positions, 0, 1, 2, and 3, laid out as 0d1e2f3. To a regular expression, every source string is made up of both characters and positions. Matching starts at position 0 and proceeds one step at a time.
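The character and position model can be checked with a short sketch. It assumes Python's re module and a lowercase sample string "def" (chosen to agree with the lowercase patterns used in the demos); an empty pattern is used only as a trick to enumerate positions.

```python
import re

text = "def"

# An empty pattern matches at every position, so it enumerates them:
# there is one more position than there are characters.
positions = [m.start() for m in re.finditer(r"", text)]

# Each character sits between two neighbouring positions.
char_spans = [(m.group(), m.span()) for m in re.finditer(r".", text)]

print(positions)   # [0, 1, 2, 3]
print(char_spans)  # [('d', (0, 1)), ('e', (1, 2)), ('f', (2, 3))]
```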

Consuming characters and zero width
During matching, if a subexpression matches character content rather than a position, and that content is saved to the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved to the final match result, the subexpression is zero-width. Consumed characters are mutually exclusive, while zero-width matches are not: a character can be matched by only one subexpression at a time, but a position can be matched by several zero-width subexpressions at once. Common zero-width constructs include ^ and (?=...).
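The difference shows up in the span that each subexpression reports. This is a small sketch assuming Python's re module and the sample string "def"; the pattern that stacks two lookaheads is an illustrative assumption.

```python
import re

text = "def"

# A subexpression that takes a character has a span of width 1.
consuming = re.search(r"d", text).span()

# Zero-width subexpressions match a position only: start equals end.
caret = re.search(r"^", text).span()
lookahead = re.search(r"(?=d)", text).span()

# Several zero-width subexpressions can all match at position 0, and the
# character "d" is still available for the subexpression that follows them.
stacked = re.match(r"^(?=d)(?=de)d", text)

print(consuming, caret, lookahead)  # (0, 1) (0, 0) (0, 0)
print(stacked.span())               # (0, 1)
```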

Detailed examples of the matching process
With these concepts in hand, we can now walk through several common matching processes, using RegexBuddy to analyze each one.

Demo1: the source string is def, its position markup is 0d1e2f3, and the regular expression is /def/

The process can be understood as follows. The regular expression character /d/ takes control first and starts matching at position 0. /d/ matches "d" successfully, and control passes to /e/. Because "d" has already been matched by /d/, /e/ tries to match from position 1. /e/ matches "e" successfully, and control passes to /f/. /f/ then matches "f" from position 2, and the overall match succeeds.
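The hand-over of control in Demo1 can be imitated with a few lines of Python. The helper trace_literal below is a hypothetical teaching function written for this illustration; it handles literal characters only and is not how a real engine is implemented.

```python
def trace_literal(pattern: str, text: str, start: int = 0):
    """Match a pattern made only of literal characters, anchored at `start`.

    Returns (matched, steps); each step is (pattern_char, position, ok).
    """
    steps = []
    pos = start
    for ch in pattern:
        ok = pos < len(text) and text[pos] == ch
        steps.append((ch, pos, ok))
        if not ok:
            return False, steps
        pos += 1  # the character is consumed; control passes to the next one
    return True, steps


matched, steps = trace_literal("def", "def")
for ch, pos, ok in steps:
    print(f"/{ch}/ tries position {pos}: {'match' if ok else 'fail'}")
```

Each pattern character starts where the previous one stopped, which is the "control passes on" behaviour described above.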

Demo2: the source string is def, its position markup is 0d1e2f3, and the regular expression is /d\w+f/

The process can be understood as follows. The regular expression character /d/ takes control first and starts matching at position 0. /d/ matches "d" successfully, and control passes to /\w+/. Because "d" has already been matched by /d/, /\w+/ tries to match from position 1. /\w+/ is greedy, so it records a backtracking state and by default matches as many characters as possible, taking "ef" in one go. That match succeeds and the current position is 3. Control passes to /f/, which fails because there is no character after position 3. /\w+/ then backtracks by one character and the current position becomes 2. Control passes to /f/ again, which now matches "f" successfully. In the end /\w+/ matches only "e", and the match is complete.
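The greedy attempt and the single backtracking step in Demo2 can be traced with a hand-written sketch. The function trace_greedy is a hypothetical helper that hard-codes this one pattern and only tries a match anchored at position 0; real engines are far more general.

```python
import re


def trace_greedy(text: str):
    """Trace the pattern 'd, one or more word characters, f' from position 0.

    Returns (span or None, log of steps).
    """
    log = []
    if not text.startswith("d"):
        log.append("d fails at position 0")
        return None, log
    log.append("d matches at 0, position is now 1")

    # Greedy: the quantifier first takes as many word characters as it can.
    end = 1
    while end < len(text) and re.fullmatch(r"\w", text[end]):
        end += 1
    if end == 1:
        log.append("word-character run fails at position 1")
        return None, log
    log.append(f"word-character run takes {text[1:end]!r}, position is now {end}")

    # Backtrack: give back one character at a time until f can match,
    # but the run must always keep at least one character.
    while True:
        if text.startswith("f", end):
            log.append(f"f matches at {end}")
            return (0, end + 1), log
        if end == 2:
            log.append(f"f fails at {end}, nothing left to give back")
            return None, log
        log.append(f"f fails at {end}, the run gives back one character")
        end -= 1


span, log = trace_greedy("def")
for line in log:
    print(line)
print(span == re.match(r"d\w+f", "def").span())
```

For the input "def" the log has four entries: the match of d, the greedy run over "ef", the failure of f at position 3, and the success of f at position 2 after one character is given back.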

Demo3: the source string is def, its position markup is 0d1e2f3, and the regular expression is /^(?=d)[d-f]+$/

The process can be understood as follows. The metacharacters /^/ and /$/ match positions only. The positive lookahead /(?=d)/ requires that the character "d" appear to the right of the current position; it only tests, does not consume the character, and does not save any content to the final match result. All three are therefore zero-width.

/^/ takes control first and starts matching at position 0. /^/ matches the start position, position 0, successfully, and control passes to the lookahead /(?=d)/. /(?=d)/ requires that the character to the right of its position be the letter "d". Zero-width subexpressions are not mutually exclusive, that is, the same position can be matched by several zero-width subexpressions at once, so /(?=d)/ also tries to match from position 0. The character to the right of position 0 is "d", so the requirement is met, the match succeeds, and control passes to /[d-f]+/.

Because /(?=d)/ only tests and does not save any content to the final result, and the position where it succeeded is position 0, /[d-f]+/ also starts matching from position 0. /[d-f]+/ first tries "d", succeeds, and keeps going until it has also matched "ef". The match has now reached position 3, and there are no characters to the right of position 3, so control passes to /$/. /$/ tries to match from position 3; it matches the end position, position 3, successfully.

At this point the regular expression has been matched completely and the engine reports success. The result is "def", with start position 0 and end position 3: /^/ matches position 0, /(?=d)/ matches position 0, /[d-f]+/ matches the string "def", and /$/ matches position 3.
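The positions reported in Demo3 can be confirmed with Python's re module. In this sketch each subexpression is wrapped in a capture group so that its span can be inspected; the groups are an addition for illustration and are not part of the original pattern.

```python
import re

# Same pattern as Demo3, with each subexpression wrapped in a capture group
# (the groups are added only so each part's span can be read back).
m = re.match(r"(^)((?=d))([d-f]+)($)", "def")

print(m.group(0), m.span(0))  # def (0, 3)
print(m.span(1))  # (0, 0)  ^ matches position 0
print(m.span(2))  # (0, 0)  (?=d) also matches position 0
print(m.span(3))  # (0, 3)  [d-f]+ consumes "def"
print(m.span(4))  # (3, 3)  $ matches position 3
```

The two zero-width groups share position 0, while only the character class contributes text to the final result.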

Postscript: In the examples above we analyzed an ordinary match, a match with backtracking, and a match involving zero-width subexpressions. These examples are simple, and real work involves longer and more complex regular expressions, but the idea is the same: by applying the principles above, any expression can be broken down step by step. That is all for now. Comments and discussion are welcome!
