Pumpkin does not speak (M01) - The principles of regular expressions

Source: Internet
Author: User

        1. Grammar
                A grammar can be defined as a four-tuple, G = (Vt, Vn, S, P):

    1. Vt: a non-empty, finite set of symbols, each of which is called a terminal;
    2. Vn: a non-empty, finite set of symbols, each of which is called a nonterminal, with Vt ∩ Vn = ∅;
    3. S ∈ Vn is called the start symbol of the grammar G;
    4. P is a non-empty, finite set whose elements are called productions;
    5. A production has the form α → β, where α is called the left side of the production, β is called the right side, and the symbol "→" means "is defined as". Here α, β ∈ (Vt ∪ Vn)* and α ≠ ε, that is, α and β are strings of terminals and nonterminals;
    6. The start symbol S must appear on the left side of at least one production.

                The language that can be derived from a grammar G is denoted L(G).

                Grammars are divided into types 0, 1, 2, and 3, depending on the restrictions imposed on their productions.

                  1. A type 0 grammar only requires that the left side of each production contain at least one nonterminal; otherwise there are almost no restrictions. A very important theoretical result is that type 0 grammars are equivalent in power to Turing machines;
                  2. A type 1 grammar is also called a context-sensitive grammar and corresponds to a linear bounded automaton. It requires every production α → β to satisfy |β| >= |α|, where |β| denotes the length of β;
                  3. A type 2 grammar is also called a context-free grammar and corresponds to a pushdown automaton. On top of the type 1 restrictions, it requires that in every production α → β, α is a single nonterminal;
                  4. A type 3 grammar is also called a regular grammar and corresponds to a finite state automaton. On top of the type 2 restrictions, every production has the form A → α | αB (right linear) or A → α | Bα (left linear), where A and B are nonterminals and α is a string of terminals;
                  5. Type 1 grammars are a subset of type 0 grammars, type 2 grammars are a subset of type 1 grammars, and type 3 grammars are a subset of type 2 grammars.

                Type 3 grammars (regular grammars) are equivalent to regular expressions: any regular grammar can be converted into an equivalent regular expression. Regular expressions are in turn equivalent to finite automata.

                A language that can be recognized by a finite automaton can always be described by a regular expression, and conversely a language described by a regular expression can always be recognized by a finite automaton.
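As a rough illustration of this equivalence, the sketch below runs a small right-linear grammar as if it were a finite automaton and compares the result with a regular expression. The grammar, the names GRAMMAR, START, derives and SAMPLES, and the "ACCEPT" marker are all assumptions made up for this example; they do not come from the original article.

```python
import re

# Hypothetical right-linear grammar for the language a*b+ :
#   S -> aS | bA | b
#   A -> bA | b
# Each production is stored as (terminal, next_nonterminal); None means
# the derivation ends after emitting the terminal.
GRAMMAR = {
    "S": [("a", "S"), ("b", "A"), ("b", None)],
    "A": [("b", "A"), ("b", None)],
}
START = "S"


def derives(text: str) -> bool:
    """Return True if the grammar can derive `text` from START."""
    # Treat each nonterminal as an automaton state; "ACCEPT" is the
    # extra state reached by productions with no trailing nonterminal.
    current = {START}
    for ch in text:
        following = set()
        for nonterminal in current:
            for terminal, nxt in GRAMMAR.get(nonterminal, []):
                if terminal == ch:
                    following.add(nxt if nxt is not None else "ACCEPT")
        current = following
    return "ACCEPT" in current


# Compare the grammar with the regular expression a*b+ on a few samples.
SAMPLES = ["", "a", "b", "ab", "aabbb", "ba", "abab"]
agreement = all(
    derives(s) == bool(re.fullmatch(r"a*b+", s)) for s in SAMPLES
)
```

On these samples the grammar and the expression a*b+ accept exactly the same strings, which is what the equivalence predicts for this particular pair.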

                Finite state automata

                The two most common kinds of finite automata are deterministic finite automata (DFA) and non-deterministic finite automata (NFA).

                A DFA is described by a five-tuple M = (S, Σ, f, s0, Z), and an NFA by a five-tuple of the same form, M = (S, Σ, f, S0, Z):
                S: a non-empty, finite set of states (the same for both);
                Σ: the finite set of input symbols that cause state changes (the same for both);
                f: the transition mapping; for a DFA it maps a state and an input symbol to a single state, for an NFA to a set of states;
                s0: the initial state of a DFA; for an NFA, S0 is a set of initial states;
                Z: the set of terminating (accepting) states (the same for both).

                f(s, a) = g means: when the automaton is in state s and reads the input a, it moves to state g.
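The five-tuple can be written down almost literally as data. The following is a minimal sketch, assuming a made-up DFA that accepts binary strings with an even number of 1s; the state names and the accepts function are illustrative only.

```python
# Hypothetical DFA M = (S, SIGMA, F, S0, Z) accepting binary strings
# that contain an even number of 1s.
S = {"even", "odd"}            # states
SIGMA = {"0", "1"}             # input symbols
F = {                          # transition mapping f(s, a) = g
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
S0 = "even"                    # initial state
Z = {"even"}                   # terminating (accepting) states


def accepts(text: str) -> bool:
    """Run the DFA over text; one table lookup per input symbol."""
    state = S0
    for ch in text:
        if (state, ch) not in F:
            return False       # symbol outside SIGMA: reject
        state = F[(state, ch)]
    return state in Z
```

Each input symbol triggers exactly one lookup in F, which is why the state after reading any prefix is unique.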

                Every NFA can be converted into an equivalent DFA.
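One standard way to perform this conversion is the subset construction, in which every DFA state is a set of NFA states. The sketch below assumes a small made-up NFA without ε-transitions (strings over {a, b} whose second-to-last symbol is a); all names are hypothetical.

```python
from collections import deque

# Hypothetical NFA: transitions map (state, symbol) to a set of states.
NFA_F = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
}
NFA_START = {0}
NFA_ACCEPT = {2}
ALPHABET = ("a", "b")


def nfa_to_dfa():
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset(NFA_START)
    transitions = {}
    seen = {start}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for symbol in ALPHABET:
            # Union of the NFA moves from every state in the subset.
            target = frozenset(
                nxt
                for state in subset
                for nxt in NFA_F.get((state, symbol), ())
            )
            transitions[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {subset for subset in seen if subset & NFA_ACCEPT}
    return start, transitions, accepting


def dfa_accepts(text: str) -> bool:
    """Run the constructed DFA; text is assumed to use only ALPHABET."""
    start, transitions, accepting = nfa_to_dfa()
    state = start
    for ch in text:
        state = transitions[(state, ch)]
    return state in accepting
```

For this three-state NFA the construction reaches four DFA states, which hints at how the number of states can grow during conversion.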

                The efficiency difference between DFA and NFA

                It is easy to see that constructing a DFA costs much more than constructing an NFA. If the NFA has k states, the equivalent DFA can in theory have up to 2^k states, although such extreme cases almost never occur in practice. What is certain is that constructing a DFA consumes more time and memory.

                However, once the DFA has been constructed, matching is very efficient: for a string of length n, the matching algorithm runs in O(n). An NFA, by contrast, goes through a large amount of branching and backtracking during matching. If the NFA has s states, each input character can lead to at most s states, so the complexity of the matching algorithm is the length of the input string multiplied by the number of states, O(ns).
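The O(ns) bound can be made concrete by simulating an NFA with a set of current states and counting the work done. This is only a sketch under assumptions: the NFA (strings over {a, b} ending in ab), the step counter, and all names are invented for illustration.

```python
# Hypothetical NFA over {a, b} accepting strings that end in "ab".
# Transitions map (state, symbol) to a set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
START_STATES = {0}
ACCEPT_STATES = {2}
STATE_COUNT = 3


def simulate(text: str):
    """Return (accepted, steps), where steps counts state visits."""
    current = set(START_STATES)
    steps = 0
    for ch in text:
        following = set()
        for state in current:          # at most s states per character
            steps += 1
            following |= NFA.get((state, ch), set())
        current = following
    return bool(current & ACCEPT_STATES), steps
```

Because the set of current states never holds more than s states, the step count for an input of length n cannot exceed n times s.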

                There is a whole body of theory and methods for constructing NFAs and DFAs, converting between them, and simplifying regular expressions. It is far more complex than what is shown here; this article only illustrates the principles through a simple example.

                NFA: expression-directed

                An NFA engine starts at the first part of the expression and checks it against the current position in the text. If that part matches, the engine moves on to the next part of the expression, and so on, until every part of the expression has matched, at which point the entire expression has matched successfully.

                Let's look at how to(nite|knight|night) matches the text ...tonight... . The first part of the expression is t, so the engine scans forward until it finds a t in the string. It then checks whether the next character is o, and if that matches it goes on to the following element. In this example the following element is (nite|knight|night), meaning nite or knight or night, and the engine tries these three possibilities in turn.
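The walk-through above can be imitated with a hand-written matcher for this one expression. It is a simplified sketch, not how a real engine is implemented: it compares whole literals instead of single characters, and the function name and log format are assumptions.

```python
def match_tonight(text: str):
    """Expression-directed match of to(nite|knight|night) against text.

    Returns (matched_substring_or_None, log_of_attempts).
    """
    alternatives = ["nite", "knight", "night"]
    log = []
    for start in range(len(text)):
        # First part of the expression: the literal "to".
        if not text.startswith("to", start):
            continue
        pos = start + 2
        # Next part: try each alternative in order, returning to the
        # same text position after every failed attempt.
        for alt in alternatives:
            ok = text.startswith(alt, pos)
            log.append((pos, alt, ok))
            if ok:
                return text[start:pos + len(alt)], log
    return None, log
```

For the text ...tonight... the log records two failed attempts (nite, knight) at the same text position before night succeeds, which is the backtracking described above.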

                Throughout the process, control passes between the elements of the expression, which is why this approach is called "expression-directed". Its characteristic is that each sub-expression is matched independently, with no intrinsic connection to the others. The hierarchical relationships between sub-expressions and the control structures of the regular expression (alternation, quantifiers) steer the entire matching process.

                DFA: text-directed

                As a DFA reads the text, it records all the positions in the expression that are currently still valid matches (each such set of positions corresponds to one state of the DFA). Taking the matching process above as an example:

                  1. When the engine reads t from the text, the recorded match position is the t in to(nite|knight|night);
                  2. It then reads o, and the match position becomes the o in to(nite|knight|night);
                  3. It reads n, and the match positions are the leading n of nite and the leading n of night in to(nite|knight|night): two positions, with knight eliminated;
                  4. ...
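The first three steps can be reproduced with a small sketch that advances all alternatives in parallel, one text character at a time. It assumes the expression has been expanded by hand into its three complete alternatives, which a real DFA construction would not do literally; the function name is hypothetical.

```python
def live_positions(text: str):
    """Text-directed scan: track which alternatives are still alive."""
    # Hypothetical simplification: expand to(nite|knight|night) into
    # its three complete alternatives and advance all of them at once.
    alternatives = ["tonite", "toknight", "tonight"]
    alive = list(alternatives)
    history = []
    for i, ch in enumerate(text):
        # Keep only the alternatives whose next character equals ch.
        alive = [alt for alt in alive if i < len(alt) and alt[i] == ch]
        history.append((ch, list(alive)))
    return history
```

Each text character is examined exactly once, and after reading n only tonite and tonight remain, matching step 3 of the list.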

                This approach is called "text-directed" because the scanned string controls the execution of the engine.

                Difference one: the NFA expression affects the engine

                Because the NFA is expression-directed, the engine's behavior can be influenced by changing the regular expression. The following three expressions match the same text, but the engine executes them differently:

                  1. to(nite|knight|night)
                  2. tonite|toknight|tonight
                  3. to(k?night|nite)
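That the three expressions accept the same text can be checked quickly. The sketch below uses Python's re module purely as a convenient backtracking engine; the word list and variable names are assumptions for this example.

```python
import re

PATTERNS = [
    r"to(nite|knight|night)",
    r"tonite|toknight|tonight",
    r"to(k?night|nite)",
]
WORDS = ["tonite", "toknight", "tonight", "tonigh", "night"]

# For each pattern, collect the set of words it matches in full.
accepted = [
    {w for w in WORDS if re.fullmatch(p, w)} for p in PATTERNS
]
```

All three sets come out identical, even though the engine takes a different path through each expression to get there.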

                For a DFA, however, there is no difference.

                Difference two: the DFA guarantees the longest match

                For an expression that contains alternatives, an NFA may report success as soon as one option matches, without knowing whether a later option would also succeed or would produce a longer match.

                Suppose one(self)?(selfsufficient)? is matched against oneselfsufficient. The NFA first matches one and then matches self. At that point it finds that selfsufficient does not match the remaining substring, but since this subexpression is optional, the engine can immediately report success, and the matched string is oneself.
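This behavior can be observed with Python's re module, which is a backtracking engine. The longest_match helper is a hypothetical stand-in for a longest-match engine: it simply tries ever shorter prefixes and is meant only to show what the longest result would be.

```python
import re

PATTERN = r"one(self)?(selfsufficient)?"
TEXT = "oneselfsufficient"

# Backtracking engine: stops at the first overall success.
first_match = re.match(PATTERN, TEXT).group(0)


def longest_match(pattern: str, text: str):
    """Hypothetical helper: emulate a longest-match result by trying
    ever shorter prefixes until one matches in full."""
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None


longest = longest_match(PATTERN, TEXT)
```

Here first_match is oneself, while longest is oneselfsufficient, so the same expression and text give two different answers depending on the matching strategy.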

                The result returned by an NFA engine actually depends on the specific implementation, whereas a DFA is guaranteed to match oneselfsufficient.

                Difference three: the NFA supports more features

                An NFA can support advanced features such as capture groups, lookaround, possessive quantifiers, and atomic grouping, all of which rely on the fact that sub-expressions are matched independently.

                A DFA cannot record the match history or the relationships between sub-expressions, and therefore cannot implement these features.
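Two of these features, capture groups and lookaround, can be tried out briefly. The sketch uses Python's re module as an example of a backtracking engine; the sample strings and variable names are assumptions.

```python
import re

# A capture group records which alternative actually matched.
m = re.search(r"to(nite|knight|night)", "see you tonight")
captured = m.group(1)

# A lookahead checks what follows without consuming it: here it finds
# word stems that are immediately followed by "ing".
stems = re.findall(r"\b\w+(?=ing\b)", "reading writing code")
```

Both results depend on the engine remembering how each sub-expression matched, which is the information the paragraph above says a DFA does not keep.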

                As can be seen, the NFA engine has greater practical value, so the regular expression libraries of the programming languages we use are generally NFA-based. Java's Pattern is NFA-based, and the Pattern.compile() method evidently constructs the NFA state graph.

      1. Reference: http://www.cnblogs.com/longhuihu/p/4128203.html
