Getting started with regular expressions (Java): matching principles, part 1: engine classification and universal principles

Source: Internet
Author: User
Tags egrep

Engine types and the programs that use them:

DFA: awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFA: GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP (all three regex libraries), Python, Ruby, sed (most versions), vi
POSIX NFA: mawk, Mortice Kern Systems' utilities, GNU Emacs (when requested)
Hybrid DFA/NFA: GNU awk, GNU grep/egrep, Tcl

NFA (expression-directed):

Regular expression: "to(nite|knight|night)". Target text: "tonight".

Matching starts from the "t" of the regular expression. The engine examines one part of the expression at a time and checks whether the current text matches that part. If it does, the engine moves on to the next part, and so on until every part of the expression has matched. In this example the first element is "t", so the engine keeps trying until it locates a "t" in the target string, and then checks "o" in the same way. When it reaches "(nite|knight|night)", it tries the alternatives one after another until it can declare the match a success or a failure. Because control passes from one element of the expression to the next, the book's author calls this "expression-directed". One consequence: with the regular expression "nfa|nfa not" and the target string "nfa not", the engine matches only "nfa" and not the whole string, because the first alternative succeeds and the second is never tried.

DFA (text-directed):

While scanning the string, a DFA engine records every match possibility that is "currently valid". In the same example, when the engine reaches "t" it adds a potential match to the set it is tracking, and then scans each following character and updates that set. After two more characters have been scanned, the "knight" branch has been ruled out. The author calls this "text-directed" because each character scanned in the target text controls the engine.

Two quick tests for telling the engine types apart:

1. If lazy quantifiers are supported, the engine is almost certainly a traditional NFA. A DFA does not support lazy quantifiers, and they are meaningless in a POSIX NFA.
2. A DFA does not support capturing parentheses or backreferences. A hybrid engine uses its DFA when no capturing parentheses are present.

PS: RegexBuddy appears to implement only a traditional NFA; at least, that is what test 1 indicates. DFA and hybrid engines are therefore not verified here.

This article focuses on Java, so the rest concentrates on the two universal principles that apply to Java (from "Mastering Regular Expressions", 3rd edition):

1. The match that begins earliest (leftmost) wins.

Note: this principle says nothing about the length of the match. It only states that, among all possible matches, the one that starts furthest to the left is chosen.

The author explains the principle as follows. A match is first attempted at the very beginning of the string being searched. "Attempted" here means that the entire regular expression is tested at the current position. If every possibility at the current position has been tried and no match is found, the whole expression is tried again starting just before the second character, and so on. Failure is reported only when no match has been found from any starting position, all the way to the end of the string (just after the last character).

An example: the target string is "This is a cat." and I want to match "is". The regular expression is "is", and it finds two matches (figure 1). By principle 1, the first match is the "is" inside "This", not the standalone word "is". This is easy to understand.

Next, look at the debug trace in RegexBuddy. Why does it show so many characters? The target string has only 14 characters, counting the final period. Where do the extra ones come from? In my opinion, RegexBuddy counts each position between characters as a character as well. Look at figure 1 again: I put every character in its own table cell to make this easy to see. Every vertical line (which does not actually exist in the string) is also treated as a character. I think this makes sense.
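The two behaviours described so far, ordered alternation in a traditional NFA and the leftmost-match rule, can be checked directly with java.util.regex. The following is a minimal sketch rather than anything from the original post; the class name LeftmostMatchDemo and its helper methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it relies only on the standard java.util.regex API.
public class LeftmostMatchDemo {

    // Returns the text of the first match the engine finds, or null if there is none.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    // Returns the start index of every match, in the order the engine finds them.
    public static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The alternatives are tried in order until one lets the whole expression match.
        System.out.println(firstMatch("to(nite|knight|night)", "tonight")); // tonight

        // The first alternative succeeds, so the longer "nfa not" is never tried.
        System.out.println(firstMatch("nfa|nfa not", "nfa not")); // nfa

        // The first match is the "is" inside "This" (index 2), then the word "is" (index 5).
        System.out.println(matchStarts("is", "This is a cat.")); // [2, 5]
    }
}
```

A text-directed engine that prefers the longest match would report "nfa not" for the second call, so this check is one quick way to see that Java behaves as a traditional NFA.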
A zero-width assertion, for example, matches at one of these between-character positions (the vertical lines in figure 1) rather than at a character. Let's test this with "^" and look at the debug result. When a regular expression begins with the start-of-string anchor, the engine knows that any match must begin at the start of the string, so no further attempts are needed.

PS: this seems to conflict with the RegexBuddy debug result above, and it does. I don't know whether that is a bug. In addition, RegexBuddy (v3.5.4 at least) does not yet support set operations on character classes; I don't know whether the feature is missing or whether that is a bug too.

2. The standard quantifiers (*, +, ?, {m,n}) are greedy.

Target string: "copyright 2003".

Regular expression ".*": the match is the entire string.

Regular expression ".*[0-9]*": because quantifiers are greedy, ".*" matches the entire string and the "[0-9]*" that follows it matches nothing. This does not affect the overall result, because "*" also allows zero occurrences. Adding a pair of parentheses around each part verifies this.

Now change the regular expression to ".*[0-9]+". Again ".*" first matches the whole string. Then "[0-9]+" finds that it must match at least one digit, so it forces ".*" to give back some of what it matched. Once ".*" has given back the "3", "[0-9]+" matches successfully. At that point the match is complete and no further attempts are made. The same process can be followed step by step in the RegexBuddy debugger.
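The parentheses check mentioned above can be reproduced in Java by capturing each part of the expression and printing what it matched. This is a minimal sketch under that assumption; the class name GreedyQuantifierDemo and the helper groups are hypothetical, and only the standard java.util.regex API is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing what each greedy part ends up matching.
public class GreedyQuantifierDemo {

    // Finds the first match and returns the text captured by groups 1 and 2.
    public static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no match for " + regex);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String text = "copyright 2003";

        // ".*" keeps everything; "[0-9]*" is satisfied with zero digits.
        String[] star = groups("(.*)([0-9]*)", text);
        System.out.println("[" + star[0] + "] [" + star[1] + "]"); // [copyright 2003] []

        // "[0-9]+" needs one digit, so ".*" gives back exactly one character.
        String[] plus = groups("(.*)([0-9]+)", text);
        System.out.println("[" + plus[0] + "] [" + plus[1] + "]"); // [copyright 200] [3]
    }
}
```

The second result shows that ".*" gives back only as much as is needed for the overall match to succeed: one digit, not all four.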
