Constructing a DFA from a Regular Expression


Chen Zixian vczh@163.com http://www.cppblog.com/vczh/

1. Overview of the Problem

As the structure of computer languages has grown more and more complex, people building compilers have gradually realized the importance of studying lexical analysis as an independent step. The role of a lexical analyzer, however, is not limited to compilers. Recall that when our teacher first taught us programming, there was always an exercise of this kind: given a string containing an arithmetic formula, write a program that computes its result. Likewise, when we work with more complex configuration files such as XML, the parser must first break the whole string into relatively short tokens (strings carrying certain attributes) before analyzing the structure. And when implementing a console application, the program needs to analyze the command the user types on the screen; if the command is complex enough, we must first tokenize it, after which the remaining work becomes much easier.

Of course, most of these problems have already been solved, and historically a variety of specialized or general-purpose tools (Lex, regular expression engines, and so on) have been built to deal with them. Even when we use such a tool, we need to understand the principles behind lexical analysis in order to write its configuration efficiently, or to build a similar tool ourselves in special situations. This article presents the principles needed to construct a general lexical analysis tool. The implementation is not included here because the code is rather long.

What exactly does "breaking a string into tokens" mean? Let us start with the arithmetic formula. An arithmetic formula is a sequence of characters, but the objects we actually care about are operators, parentheses, and numbers. The purpose of lexical analysis is therefore to break a string into tokens carrying the attributes we care about. For example, (11+22)*(33+44) is a legal arithmetic formula, and its tokenized form is: (left parenthesis, "(") (number, "11") (first-level operator, "+") (number, "22") (right parenthesis, ")") (second-level operator, "*") (left parenthesis, "(") (number, "33") (first-level operator, "+") (number, "44") (right parenthesis, ")"). When we later examine the structure of the formula, we only need to care about the attribute of each token (left parenthesis, right parenthesis, number, operator, and so on). If the formula contains spaces, we simply treat the space as another token type and discard all tokens with the space attribute after lexical analysis; the subsequent steps need not change.
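
To make this concrete, here is a minimal tokenizer sketch written with Python's built-in re module (an illustration only, not the hand-built engine developed later in this article); the token type names are invented for the example.

import re

# Hypothetical token types for the arithmetic-formula example.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SPACE",  r"[ \t]+"),   # recognized as a token, then discarded
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    for m in PATTERN.finditer(text):
        if m.lastgroup != "SPACE":      # throw away the space tokens
            yield (m.lastgroup, m.group())

print(list(tokenize("(11+22) * (33+44)")))
# [('LPAREN', '('), ('NUMBER', '11'), ('PLUS', '+'), ('NUMBER', '22'), ('RPAREN', ')'),
#  ('TIMES', '*'), ('LPAREN', '('), ('NUMBER', '33'), ('PLUS', '+'), ('NUMBER', '44'), ('RPAREN', ')')]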

However, it is important to note that the result of lexical analysis is not hierarchical: all tokens are equivalent objects. When we evaluate an expression we treat + and * as operators of different precedence, and such structures can be nested. Lexical analysis cannot extract information about nested hierarchies; at most it can capture information about repeated structures.

2. Regular Expressions

We now need a tool for describing token types; before introducing it, let us examine the structure of common tokens. To represent a collection of strings that share some common property, we write rules that describe the collection. Every member of the collection is then considered a token of that particular type.

First, a rule can take a specific character, or the empty string, as the complete content of a token type. In the arithmetic formula mentioned above, the "left parenthesis" token type corresponds only to the character "("; no other character or string can be a "left parenthesis".

Second, rules can be concatenated. Concatenation means that a prefix of the string matches the first rule, a prefix of the remainder matches the second rule, a prefix of what is then left matches the third rule, and so on, until the final part matches the last rule. If we treat the string "function" as a token type, we can replace it with the concatenation of 8 rules: "f", "u", "n", "c", "t", "i", "o", "n". The prefix "f" of "function" matches the rule "f", the prefix "u" of the remainder "unction" matches the rule "u", and so on, until the last part "n" matches the rule "n".

Third, rules can be placed in parallel. Parallel means that if a string matches any one of a set of rules, we say the string matches the parallel of those rules; the parallel of the rules thus forms a new rule. A typical example is deciding whether a string is a keyword. The keyword may be "if", or "else", or "while", and so on. Of course, a string cannot match all of these rules at the same time, but as long as it matches one of them, we say the string is a keyword. The "keyword" rule is therefore the parallel of the rules "if", "else", "while", and so on.

Fourth, a rule can be optional. An optional rule is actually a special form of parallel. Suppose we need the parallel of the rules "abc" and "abcde". We find that the two rules share the prefix "abc", and this prefix is exactly one of the two rules. We can then rewrite the combination as "abc" concatenated with the parallel of "" and "de". The rule "" matches only the empty string, so the parallel of "" and "de" can be regarded as an optional rule "de".

Fifth, rules can be repeated. A finite number of repetitions can be expressed with concatenation, but if we do not want to limit the number of repetitions, concatenation cannot express the rule, so we introduce "repetition". A typical example is the identifier of a programming language. An identifier can be the name of a variable or of something else, and a language usually does not limit the length of a variable name. To express this rule, we need to put the 52 letters in parallel and then repeat the resulting rule.

Of the above 5 ways of constructing rules, the last 4 combine rules into larger rules. In order to write such rules formally, we introduce a notation with the following syntax:

1: Characters are surrounded by double quotes, and the empty string is written ε.

2: Writing two rules one after the other represents their concatenation.

3: Two rules separated by | represent the parallel of the rules.

4: A rule enclosed in [] is optional, and a rule enclosed in {} is repeated.

5: A rule enclosed in () is treated as a whole; this is usually used to change the precedence of the | operator.

For example, the rule for a real number is written as follows:

{"0" | " 1 "|" 2 "|" 3 "|" 4 "|" 5 "|" 6 "|" 7 "|" 8 "|" 9 "}". " [{"0" | " 1 "|" 2 "|" 3 "|" 4 "|" 5 "|" 6 "|" 7 "|" 8 "|" 9 "}].

But how do we express "any character other than a digit"? The number of characters is finite, so in principle we could use a parallel of rules, but there are really too many characters (the ASCII character set has 128 characters; UTF-16 has 65,536 code points in its basic plane), so people later invented various abbreviated notations for rules. Among the better known is the BNF notation. BNF is often used in theoretical work; for practical purposes the regular expression is more common.

In a regular expression, characters need not be enclosed in double quotes, but to represent a character that already has a defined meaning (such as "|") we use an escape character (such as "\|"). Furthermore, x? represents [x], x+ represents {x}, and x* represents [{x}]. A character set can be expressed as an interval: [0-9] represents "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9", and [^0-9] means "any character other than a digit". Regular expressions have many other rules to simplify writing, but since this article is not "Mastering Regular Expressions", we keep only the more essential operations needed to describe the principles of lexical analysis.

Regular expressions are extremely expressive: the rule for a decimal number can be written [0-9]+\.[0-9]*, and a C-language comment can be expressed as /\*([^*]|\*+[^*/])*\*+/.
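
Both example patterns can be checked quickly against an off-the-shelf engine (a sketch assuming Python's re module, which uses \ for escaping):

import re

decimal = re.compile(r"[0-9]+\.[0-9]*")
c_comment = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")

print(bool(decimal.fullmatch("3.14")))                       # True
print(bool(decimal.fullmatch("abc")))                        # False
print(bool(c_comment.fullmatch("/* a comment * here **/")))  # True
print(bool(c_comment.fullmatch("/* unterminated *")))        # False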

3. Finite State Automata

A human can read a regular expression fairly easily, but for a machine, reading a regular expression directly is difficult. Moreover, matching strings directly against a regular expression is not only laborious but also slow. So we need another form of representation designed for machines. A later chapter gives an algorithm that converts a regular expression into this machine-friendly form; this chapter describes the form itself: the finite state automaton.

Finite state automaton: the name sounds scary, but in fact it is not as complicated as it seems. The concept of a state machine is widely used in many fields: the Unified Modeling Language (UML) of software engineering has state diagrams, and digital logic has state transition diagrams. Those diagrams and the state machines discussed here are at bottom the same idea. I will use an example to explain the actual meaning of a state.

Suppose we need to check whether the number of a's and the number of b's in a string are both even. Of course we could describe this with a regular expression, but for this problem a regular expression is far less convenient than directly constructing a state machine. We design a set of states and designate one element of the set as the "starting state". The starting state is the state of the analyzer before the work begins; the analyzer resets to the starting state each time it starts a new job. Every time the analyzer reads a character it modifies its state, and the way the state is modified can be specified. After reading all the characters, the analyzer must be in some definite state. If this state is one we expected, we say the analyzer accepts the string; otherwise we say the analyzer rejects the string.

How do we implement such an analyzer by designing states and their transitions? If a string contains only a's and b's, the analyzer needs only four states: "odd a, odd b", "odd a, even b", "even a, odd b", and "even a, even b". We name these states ab, aB, Ab, and AB in turn, where an uppercase letter means the count is even and a lowercase letter means it is odd. Before the work starts, the analyzer has read only the empty string, so the starting state should of course be AB. After the analyzer has read all the characters, we expect the numbers of a's and b's both to be even, so the ending state should also be AB. This gives the following state diagram:

Figure 3.1

State diagram for checking whether a string consists of an even number of a's and an even number of b's

In this state diagram, a short arrow points to AB, indicating that AB is the starting state. The AB node is drawn with a thick border, indicating that AB is an accepting (ending) state. A state diagram may have one or more ending states; in this example the starting state and the ending state happen to be the same state. The arrow labeled "a" from AB to aB means that if the analyzer is in state AB and reads the character a, it transfers to state aB.

We now apply this state diagram to two strings, "abaabbba" and "aababbaba". The first string should be accepted; the second should be rejected (it contains 5 a's and 4 b's).

When parsing the first string, the state machine passes through the following states:

AB [a] aB [b] ab [a] Ab [a] ab [b] aB [b] ab [b] aB [a] AB

When parsing the second string, the state machine passes through the following states:

AB [a] aB [a] AB [b] Ab [a] ab [b] aB [b] ab [a] Ab [b] AB [a] aB

The first string, "abaabbba", leaves the state machine in state AB, so the string is accepted. The second string, "aababbaba", leaves the state machine in state aB, so the string is rejected.

To represent such a state diagram inside a machine, we can use a fairly simple method: simply record the starting state, the set of ending states, and the set of arrows between states. For this state diagram, the representation takes the following form (a code sketch follows the list):

Starting state: AB

Set of ending states: {AB}

(AB, a, aB)

(AB, b, Ab)

(aB, a, AB)

(aB, b, ab)

(Ab, a, ab)

(Ab, b, AB)

(ab, a, Ab)

(ab, b, aB)
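
This tabular representation translates directly into code. Below is a small table-driven sketch of the same state machine (state names as in the text: an uppercase letter means the count is even, a lowercase letter means it is odd):

TRANSITIONS = {
    ("AB", "a"): "aB", ("AB", "b"): "Ab",
    ("aB", "a"): "AB", ("aB", "b"): "ab",
    ("Ab", "a"): "ab", ("Ab", "b"): "AB",
    ("ab", "a"): "Ab", ("ab", "b"): "aB",
}
START, ACCEPTING = "AB", {"AB"}

def accepts(text):
    state = START
    for ch in text:
        state = TRANSITIONS[(state, ch)]   # every (state, character) pair is defined
    return state in ACCEPTING

print(accepts("abaabbba"))    # True:  4 a's and 4 b's
print(accepts("aababbaba"))   # False: 5 a's and 4 b's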

When a state diagram is used to represent a state machine, there is sometimes the question of determinism versus non-determinism. Determinism means that for any state, reading a character leads to exactly one determined next state. The distinction has an intuitive description: any state in a state diagram can have any number of outgoing edges to other states; if two of those edges carry the same character, then reading that character in this state could jump to either of two states, and the state machine is non-deterministic. As shown in the figures:

Figure 3.2

A deterministic state machine representation of the regular expression ba*b

Figure 3.3

A non-deterministic state machine representation of the regular expression ba*b

The starting state of the state machine in Figure 3.3 can jump to either of two intermediate states after reading the character b, so this state machine is non-deterministic. The state machine in Figure 3.2, although functionally equivalent to the one in Figure 3.3, is deterministic. We can also use a special kind of edge for state transitions: the ε edge, which indicates that a state can jump to another state without reading any character. The following figure shows a non-deterministic state machine with ε edges that is equivalent in function to Figure 3.3:

Figure 3.4

A non-deterministic state machine with ε edges for the regular expression ba*b

In textbooks, a deterministic finite automaton (the kind of state machine discussed so far) is called a DFA, a non-deterministic one is called an NFA, and a non-deterministic state machine with ε edges is called an ε-NFA. These terms are used below for the various kinds of finite state machines.

A common question about the ε edge is why it exists at all. If people draw state machines by hand, they can often draw a deterministic one directly; for more complex cases they can draw a non-deterministic one; only in some extreme cases do we need the ε edge to express our intent more concisely. But the biggest reason for the ε edge's existence is that it allows a concise algorithm for converting a regular expression into an ε-NFA.

4. From Regular Expression to ε-NFA

As described in Section 2, the basic element of a regular expression is the character set, and more complex regular expressions are built by concatenation, parallel, repetition, and optional operations on rules. If, when constructing a state machine from a regular expression, we can pair each of these operations with an operation on state diagrams, the method becomes very simple. Next we discuss the construction for each of these 5 cases one by one. Every ε-NFA constructed by the algorithm described below has one and only one ending state.

1: Character Set

The character set is the most basic element of a regular expression, so on the state diagram side it is also the basic element from which state diagrams are built. For a character set C, the rule that accepts only C corresponds to a state diagram of the following form:

Figure 4.1

The starting state of this diagram is start and the ending state is end. The start state jumps to the end state on reading a character from the set C, and accepts no other character.

2: Concatenation

If we write A⊙B for the concatenation of rule A and rule B, it is easy to see that concatenation is associative, that is, (A⊙B)⊙C = A⊙(B⊙C). So to concatenate n rules, we only need to concatenate the first n-1 rules, treat the result as a whole, and concatenate it with the last rule. Therefore, if we know how to concatenate two rules, we know how to concatenate n rules.

To convert two concatenated rules into a state diagram, we first convert each rule into its own state diagram, and then let the ending state of the first diagram jump to the starting state of the second. This jump must happen without reading a character, which means the two states are effectively equivalent: when the first diagram reaches its ending state, that state can act as the starting state of the second diagram and continue checking the second rule. So we connect the two diagrams with an ε edge:

Figure 4.2

3: Parallel

The construction for parallel is similar to that for concatenation. So that the machine, upon reading a character in the starting state, can tell which branches that character may belong to and jump accordingly, we first build the state diagrams of all the branches and then connect the starting state to the starting states of all branches. Likewise, after some branch has successfully accepted a string, this must be reflected in the ending state of the whole diagram, so we also connect the ending states of all branches to the ending state of the combined rule. As shown below:

Figure 4.3

4: Repeat

For repetition we can set up two states, the first being the starting state and the second the ending state. After the diagram has reached the ending state, if it then encounters another string accepted by the rule, it returns to the ending state once again. In this way a single state diagram can represent repetition. So for repetition we construct the state diagram as follows:

Figure 4.4

5: Optional

Building a state diagram for an optional rule is even easier. For an optional rule, a prefix of the input may be accepted by the rule, in which case we walk through the rule's state diagram; or the rule may be skipped entirely, in which case we go straight on; if both are possible, either path may be taken. To achieve this, we connect the starting state of the rule's state diagram to its ending state with an ε edge, so that the whole rule can be skipped:

Figure 4.5

If we want zero or more repetitions, that is, the original repetition combined with the optional construction, we can simply remove the starting state of Figure 4.4 so that its ending state plays both the role of starting state and that of ending state, while [start] and [end] remain intact.
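
Taken together, the five constructions can be written as small combinators over ε-NFA fragments. The sketch below is illustrative only (the exact wiring of the repetition and optional cases may differ in detail from Figures 4.4 and 4.5); each fragment is a (start, end, transitions) triple, and None stands for ε.

from itertools import count

_ids = count()

def new_state():
    return next(_ids)

def charset(c):
    # Figure 4.1: a rule that accepts exactly one character (set)
    s, e = new_state(), new_state()
    return (s, e, [(s, c, e)])

def concat(a, b):
    # Figure 4.2: join the end of A to the start of B with an epsilon edge
    (s1, e1, t1), (s2, e2, t2) = a, b
    return (s1, e2, t1 + t2 + [(e1, None, s2)])

def union(a, b):
    # Figure 4.3: fresh start/end states connected to both branches by epsilon edges
    (s1, e1, t1), (s2, e2, t2) = a, b
    s, e = new_state(), new_state()
    return (s, e, t1 + t2 + [(s, None, s1), (s, None, s2),
                             (e1, None, e), (e2, None, e)])

def repeat(a):
    # Figure 4.4 (one or more): an epsilon edge from the end back to the start allows looping
    s1, e1, t1 = a
    s, e = new_state(), new_state()
    return (s, e, t1 + [(s, None, s1), (e1, None, e), (e1, None, s1)])

def optional(a):
    # Figure 4.5 (zero or one): an epsilon edge lets the whole fragment be skipped
    s1, e1, t1 = a
    return (s1, e1, t1 + [(s1, None, e1)])

# Zero or more repetitions are repetition made optional, e.g. (ab|cd)*:
ab_or_cd = union(concat(charset("a"), charset("b")),
                 concat(charset("c"), charset("d")))
zero_or_more = optional(repeat(ab_or_cd))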

At this point, we have mapped the 5 ways of constructing rules to 5 ways of constructing state diagrams. For any regular expression, we only need to decompose the expression into a nesting of these 5 constructions and map each construction step to the corresponding state diagram construction, and we obtain an ε-NFA. As an example, let us express "a string containing only an even number of a's and an even number of b's" as a regular expression and then convert it to an ε-NFA.

Let us first analyze the problem. If a string contains only an even number of a's and an even number of b's, its length must be even, so we can split it into two-character segments. There are only four possible segments: aa, bb, ab, and ba. Segments aa and bb, no matter how many times they occur, have no effect on the parity of the numbers of a's and b's in the string (the reason: in addition modulo 2, 0 is the identity element, that is, x+0 = 0+x = x for any x). As for ab and ba: if a string starts with ab or ba, ends with ab or ba, and its middle part is any combination of aa and bb, the string also has an even number of a's and an even number of b's. We now have two building blocks for strings with an even number of a's and an even number of b's, and applying concatenation, parallel, optional, and repetition to such strings still yields strings with an even number of a's and an even number of b's. So we can write the regular expression in the following form:

((aa|bb)|((ab|ba)(aa|bb)*(ab|ba)))*
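
The expression can be sanity-checked against an existing regular expression engine (a sketch assuming Python's re module and lowercase input):

import re

even_ab = re.compile(r"((aa|bb)|((ab|ba)(aa|bb)*(ab|ba)))*")

print(bool(even_ab.fullmatch("abaabbba")))    # True:  4 a's and 4 b's
print(bool(even_ab.fullmatch("aababbaba")))   # False: 5 a's and 4 b's
print(bool(even_ab.fullmatch("")))            # True:  zero is also even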

According to the method described above, we can convert this regular expression into the following state machine:

Figure 4.6

So far we have a way to convert a regular expression into an ε-NFA. But an ε-NFA alone is not enough: its non-determinism is too great, and running a match directly on an ε-NFA produces a large set of temporary states at every step, which greatly reduces efficiency. We therefore also need a way to eliminate the non-determinism of a state machine.

5. Eliminating Non-Determinism

The algorithm for eliminating ε edges

We have now seen the three kinds of finite state automata: ε-NFA, NFA, and DFA. Our goal is to convert an ε-NFA into a DFA. A DFA cannot contain ε edges, so first we must eliminate them. The ε-edge elimination algorithm is based on a very simple idea: if state A reaches state B through an ε edge, then A can go to B without reading any character; if B reads the character x to reach state C, then A reading x can also reach C, because the path from A to C is A, B, C, where the step from A to B requires no input.

This suggests a natural approach: to eliminate the ε edges leaving state A, we look for all states reachable from A through ε edges alone and copy the non-ε edges leaving those states onto A. What remains is to remove all ε edges, together with any states that have become unreachable because of the removal. To illustrate the algorithm more vividly, we construct an ε-NFA from the regular expression (ab|cd)* and apply the ε-edge elimination algorithm to it.

The state diagram for the regular expression (ab|cd)* is as follows:

Figure 5.1

1: Find all valid states.

A valid state is a state that still exists after the ε-edge elimination algorithm has run. We can compute all the valid states before starting the algorithm. A valid state is characterized by having at least one incoming non-ε edge; in addition, the starting state is always valid. An ending state is not necessarily valid, but if some valid state can reach an ending state through ε edges alone, that valid state must be marked as an ending state. Consequently, the NFA produced by applying ε-edge elimination to an ε-NFA may have several ending states, but it still has only one starting state.

Applying the criterion "has an incoming non-ε edge, or is the starting state" to each state in Figure 5.1, we can compute all the valid states of the diagram. The result is shown in the following figure.

Figure 5.2

All invalid states are marked for deletion

If a state has both ε and non-ε incoming edges, the state is still valid, because in the next step of the algorithm every valid state will receive new outgoing and incoming edges.

2: Add all the necessary edges

Next we apply an algorithm to each valid state. The algorithm has two steps: first, find the ε-closure of the state; second, treat the ε-closure as a whole and copy every non-ε edge leaving the closure onto the current state. From the result of marking valid states we know that the set of valid states of the diagram in Figure 5.1 is {s/e, 3, 5, 7, 9}. We apply the algorithm to these states in turn. The first step is to compute the ε-closure of state s/e. The ε-closure of a state is the set of all states reachable from it through ε edges alone. The ε-closure of state s/e is marked in the following figure:

Figure 5.3

Now we exclude state s/e itself from its ε-closure: the non-ε edges leaving a state A already belong to the non-ε edges leaving A's ε-closure, so copying those particular edges gains nothing. The next step is to find the non-ε edges leaving the ε-closure of s/e. In Figure 5.3 we easily find that state 1 and state 6 (using the state labels of Figure 5.1) have outgoing non-ε edges leading to state 3 and state 7 respectively; these are the edges we are looking for. We copy these edges onto state s/e, keeping the target state of each edge unchanged, which gives the following diagram:

Figure 5.4

At this point the algorithm is finished for s/e. We then apply it to each of the remaining valid states {3, 5, 7, 9}, and obtain the following diagram:

Figure 5.5

The red edges are the newly added edges

3: Remove all ε edges and invalid states

This is the last step of the ε-edge elimination algorithm: we simply remove all ε edges and all invalid states. Applying this step to the state machine of Figure 5.5 gives the following state diagram:

Figure 5.6

Note that it is not only the newly added edges that remain: by definition, every non-ε edge leaving a valid state is an edge that cannot be deleted.

We obtained the DFA of Figure 5.6 by applying the ε-edge elimination algorithm to the state machine of Figure 5.1. However, ε-edge elimination does not always take an ε-NFA directly to a DFA; whether it does depends on the regular expression itself, and we will not delve into exactly which regular expressions have this property. Because the result may still be an NFA, we also need an algorithm that converts an NFA into a DFA.
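
The two key steps of the algorithm, computing an ε-closure and copying the non-ε edges of the closure onto each valid state, can be sketched as follows (same illustrative representation as before: transitions are (state, label, state) triples, and None stands for ε):

def eps_closure(state, transitions):
    # all states reachable from `state` through epsilon edges alone
    seen, stack = set(), [state]
    while stack:
        s = stack.pop()
        for (src, label, dst) in transitions:
            if src == s and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def eliminate_eps(start, ends, transitions):
    # valid states: the starting state plus every target of a non-epsilon edge
    valid = {start} | {dst for (_, label, dst) in transitions if label is not None}
    new_edges, new_ends = set(), set()
    for s in valid:
        closure = {s} | eps_closure(s, transitions)
        if closure & set(ends):                      # an ending state is reachable via epsilon edges
            new_ends.add(s)
        for (src, label, dst) in transitions:
            if src in closure and label is not None:
                new_edges.add((s, label, dst))       # copy the edge onto s
    return start, new_ends, new_edges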

From NFA to DFA

An NFA is a non-deterministic state machine and a DFA is a deterministic one. The essential difference between them is that, on reading a character from a state, a deterministic machine reaches a single state, while a non-deterministic machine reaches a set of states. If we regard the NFA's starting state as the set {S}, then for any state set S', given an input character, we can use the NFA to compute the corresponding target state set T'. So when constructing the DFA we only need to make {S} the starting state, find all the state sets the NFA can possibly be in, and turn each such set into one state of the DFA; then the task is done. Because the NFA has finitely many states, the power set of its state set is also finite, so this method is guaranteed to construct the DFA.

To visualize the process, we construct a regular expression, give the result of converting it to an NFA, and then apply the DFA construction algorithm to that NFA. Suppose a string contains only the three characters a, b, and c; the regular expression for "starts with abc and ends with cba" is as follows:

abc(a|b|c)*cba

Using the algorithm above, we obtain the NFA shown in the following figure:

Figure 5.7

Now we construct the DFA with the following algorithm:

1: Put {S} into the queue L and the set D, where S is the starting state of the NFA. The queue L holds DFA states that have been created but not yet processed, and the set D holds all DFA states that already exist. According to the discussion above, each state of the DFA corresponds to a set of states of the NFA.

2: Take a state out of the queue L, compute the set of characters accepted by all edges leaving that state, and then, for each character in that set, find the edges accepting it and compute the set T of the target states of those edges. If T ∈ D, the current character leads to an already known DFA state; otherwise it leads to a DFA state that has not been created yet, in which case T is put into both L and D. This step contains a two-level loop: the outer loop iterates over the set of accepted characters, and the inner loop, by traversing all outgoing edges, computes the set of NFA states contained in the target DFA state for each character. (A code sketch of the whole procedure is given after this list.)

3: If L is not empty, go back to step 2.
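
A code sketch of these three steps, assuming the ε-free NFA is again given as (state, character, state) triples and each DFA state is represented as a frozenset of NFA states:

from collections import deque

def nfa_to_dfa(start, ends, transitions):
    start_set = frozenset([start])
    L = deque([start_set])          # step 1: created but not yet processed DFA states
    D = {start_set}                 # step 1: all DFA states created so far
    dfa_edges = {}
    while L:                        # step 3: repeat while L is non-empty
        current = L.popleft()       # step 2: take a state out of L
        chars = {c for (src, c, _) in transitions if src in current}
        for c in chars:
            target = frozenset(dst for (src, ch, dst) in transitions
                               if src in current and ch == c)
            dfa_edges[(current, c)] = target
            if target not in D:     # a DFA state that has not been created yet
                D.add(target)
                L.append(target)
    accepting = {s for s in D if s & set(ends)}
    return start_set, accepting, dfa_edges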

Now we apply this DFA construction algorithm to the state machine in Figure 5.7.

After the first step, we have:

Figure 5.8

From top to bottom the figure shows the current contents of the queue L, the set D, and the DFA. The algorithm then executes until state {3} enters the set D, at which point we have:

Figure 5.9

Now {3} is removed from the queue L, and the set of characters accepted by this state set is computed to be {a, b, c}. Reading a from {3} reaches {4}, reading b reaches {5}, and reading c reaches {6 7}. Since {4}, {5}, and {6 7} do not belong to D, all of them are put into the queue L and the set D:

Figure 5.10

Taking {4} out of the queue and processing it, we get:

Figure 5.11

Clearly, processing state {4} does not discover any new DFA state. We then process {5} and {6 7} and obtain:

Figure 5.12

Processing state {5} finds no new DFA state; processing {6 7} on the inputs a and c also finds nothing new, but on input b the state {6 7} leads to a new DFA state, {5 8}. After the algorithm has run to completion, we obtain the final DFA.
