Developing a compiler in Java: Thompson's construction, converting regular expressions to finite state automata
Blog readers can watch the code being debugged and executed on video in my NetEase Cloud Classroom course:
In the previous section, we implemented a finite state automaton in code and used it to recognize integers and floating-point numbers. That whole process is the essence of lexical analysis.
The state machine developed in the previous section is based on the following model:
That model was written into the program by hand. In fact, it corresponds to a group of regular expressions:
D [0-9] denotes the character class of the digits 0-9.
{D}+ denotes an integer made up of the digits 0-9.
({D}+ | {D}*\.{D}+ | {D}+\.{D}*)(e{D}+)? denotes a floating-point number, optionally in scientific notation.
{D}+ corresponds to the transition from state 0 to state 1 in the state machine, followed by looping on state 1. The last regular expression corresponds to the paths from state 0 to state 2 or state 4 in the figure.
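As a quick sanity check of what these expressions accept, here is a minimal sketch using the standard java.util.regex package. The class name NumberPatterns and its two helper methods are made up for illustration, and the {D} macro is expanded by hand to [0-9] because java.util.regex has no such macro syntax.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the series' code.
class NumberPatterns {
    // {D}+ with D expanded to [0-9]
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    // ({D}+ | {D}*\.{D}+ | {D}+\.{D}*)(e{D}+)? with D expanded to [0-9]
    static final Pattern FLOAT =
        Pattern.compile("([0-9]+|[0-9]*\\.[0-9]+|[0-9]+\\.[0-9]*)(e[0-9]+)?");

    // matches() requires the whole string to match, not just a prefix.
    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    static boolean isFloat(String s) {
        return FLOAT.matcher(s).matches();
    }
}
```

Note that the first alternative of the floating-point expression is {D}+ itself, so a plain integer such as 123 also matches it; a lexer has to decide between the two by rule order or by checking which path the state machine took.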
So the question is: given a regular expression, can a finite state automaton be generated from it directly? The answer is yes. Most regular expression matchers first convert the expression to an automaton and then recognize input by driving that automaton. Converting a regular expression to a finite state automaton is the focus of the next few sections.
Classification of Finite State Automata
Finite state automata fall into two types. The first type is the one shown above, the deterministic finite automaton, or DFA for short. The defining property of a deterministic state machine is that, given the current state and the input character, the next state is uniquely determined. For example, when state 1 receives a character from 0 to 9, the next state must be 1; when it receives ., the next state must be 2. More strictly, a DFA is an automaton in which every edge leaving a state is labeled with a definite character, and any two edges leaving the same state are labeled with different characters.
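The "unique next state" property is what lets a DFA be stored as a plain lookup table. The sketch below is a hypothetical table-driven DFA for {D}+ | {D}*\.{D}+ | {D}+\.{D}* (no exponent part). Only the transitions mentioned in the text (0 to 1 on a digit, 1 to 1 on a digit, 1 to 2 on .) come from the article; state 3 and the class name NumberDfa are assumptions and do not reproduce the figure.

```java
// Hypothetical table-driven DFA; state numbering beyond 0, 1, 2 is assumed.
class NumberDfa {
    static final int DIGIT = 0, DOT = 1; // input character classes (table columns)
    static final int ERROR = -1;         // no edge for this input

    // NEXT[state][inputClass] holds the single next state, or ERROR.
    static final int[][] NEXT = {
        {1, 3},      // state 0: start
        {1, 2},      // state 1: digits seen so far (an integer)
        {2, ERROR},  // state 2: a '.' has been seen and the text so far is a valid float
        {2, ERROR},  // state 3: leading '.', at least one digit must follow
    };
    static final boolean[] ACCEPTING = {false, true, true, false};

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            int inputClass;
            if (c >= '0' && c <= '9') {
                inputClass = DIGIT;
            } else if (c == '.') {
                inputClass = DOT;
            } else {
                return false; // character outside the alphabet
            }
            state = NEXT[state][inputClass]; // exactly one candidate, never a set
            if (state == ERROR) {
                return false;
            }
        }
        return ACCEPTING[state];
    }
}
```

Because each table cell holds at most one state, recognition is a single pass over the input with no backtracking.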
The counterpart of the DFA is the nondeterministic finite automaton, or NFA. In practice, an NFA is the intermediate form that makes it straightforward to convert a regular expression to an automaton. In an NFA, two edges leaving the same state may be labeled with the same character. An edge may also be labeled with a special "empty" character, written ε. Such an edge means the machine can move from the current state to the next state without consuming any input.
For example, the expression (and | any) corresponds to the following DFA:
The corresponding NFA is as follows:
Two edges leave the initial state, and both are labeled with the same character.
Or:
Two ε edges leave the initial state, each leading into the state machine for one of the alternatives.
The second form of NFA is easier to implement in code, so the code in the next section uses the second form.
An obvious weakness of an NFA is that it is awkward to represent with data structures and hard to use directly for recognition. In particular, when one input character can lead to several states, the program cannot tell which state to follow. Programs based on NFAs therefore generally work in two steps: convert the regular expression to an NFA, then convert the NFA to a DFA. Later sections demonstrate both conversions in code.
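To make the difficulty concrete, here is a minimal sketch of the second form of the (and | any) NFA, built by hand. It is not the code from the next section. The names EpsilonNfaDemo, State, word and matches are made up, and the assumption that every state has at most two outgoing edges (one character edge, or up to two ε edges) is a simplification chosen for this sketch. Since the machine cannot know which branch to follow, the matcher tracks a set of possible states instead of a single state.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical hand-built NFA for (and | any), using ε edges from the start state.
class EpsilonNfaDemo {
    static final int EPSILON = -1; // label of an edge that consumes no input

    static class State {
        int label = EPSILON; // character on the outgoing edge, or EPSILON
        State out1, out2;    // out2 is only used by ε states that fork
    }

    // Builds a chain of states that consumes w and then reaches end.
    static State word(String w, State end) {
        State first = new State();
        State cur = first;
        for (int i = 0; i < w.length(); i++) {
            cur.label = w.charAt(i);
            cur.out1 = (i + 1 < w.length()) ? new State() : end;
            cur = cur.out1;
        }
        return first;
    }

    static final State ACCEPT = new State();
    static final State START = new State();
    static {
        START.out1 = word("and", ACCEPT); // first ε edge
        START.out2 = word("any", ACCEPT); // second ε edge
    }

    // Adds s and everything reachable from it through ε edges only.
    static void closure(State s, Set<State> set) {
        if (s == null || !set.add(s)) {
            return;
        }
        if (s.label == EPSILON) {
            closure(s.out1, set);
            closure(s.out2, set);
        }
    }

    static boolean matches(String input) {
        Set<State> current = new HashSet<>();
        closure(START, current);
        for (char c : input.toCharArray()) {
            Set<State> next = new HashSet<>();
            for (State s : current) {
                if (s.label == c) {
                    closure(s.out1, next);
                }
            }
            current = next;
        }
        return current.contains(ACCEPT);
    }
}
```

After reading "an", the set holds two states at once, one in each branch. Carrying such sets around at run time is the extra cost that converting the NFA to a DFA removes.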
Thompson's Construction
The algorithm for converting a regular expression to an NFA is due to Ken Thompson of Bell Labs. He developed Unix together with Dennis Ritchie, and he also created B, the predecessor of the C language.
The algorithm is as follows:
The simplest regular expression matches a single character. For example, the expression a matches the input character "a", and its NFA looks like this:
The concatenation ab of two such regular expressions can then be represented as follows:
That is, first construct the NFAs of the two expressions, then use an ε edge to connect the tail of the first NFA to the head of the second.
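The two rules so far can be sketched in a few lines of Java. This is an illustration under assumptions, not the implementation from the next section: the names ThompsonBasics, State, Fragment, literal and concat are invented here, and each state is assumed to carry at most two outgoing edges.

```java
// Hypothetical sketch of the single-character and concatenation rules.
class ThompsonBasics {
    static final int EPSILON = -1; // label of an ε edge

    static class State {
        int label = EPSILON; // character on the outgoing edge, or EPSILON
        State out1, out2;
    }

    // An NFA under construction, identified by its head and tail states.
    static class Fragment {
        final State start, end;
        Fragment(State start, State end) {
            this.start = start;
            this.end = end;
        }
    }

    // Single character: start --c--> end
    static Fragment literal(char c) {
        State start = new State();
        State end = new State();
        start.label = c;
        start.out1 = end;
        return new Fragment(start, end);
    }

    // Concatenation: tail of the first NFA --ε--> head of the second NFA
    static Fragment concat(Fragment first, Fragment second) {
        first.end.out1 = second.start; // first.end keeps its EPSILON label
        return new Fragment(first.start, second.end);
    }
}
```

Calling concat(literal('a'), literal('b')) yields four states in a row: an a edge, an ε edge, then a b edge.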
Next, let's look at how to construct the NFA when two expressions are combined with the OR operator. The structure is as follows:
To construct the OR of two expressions, exp1 | exp2, first build the NFAs of exp1 and exp2 separately: NFA1 (the upper dotted box) and NFA2 (the lower dotted box). Then create two new states, an initial state (the circle at the start) and an end state (the circle at the end). Two ε edges leave the initial state, pointing to the heads of NFA1 and NFA2, and the tails of NFA1 and NFA2 each get an ε edge pointing to the end state.
Let's look at the NFA diagram of a | b:
The principle is the same as described above. The upper dotted box is the NFA of the expression a, and the lower dotted box is the NFA of the expression b. The two NFAs are connected exactly as described above.
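The OR rule translates almost word for word into code. The sketch below is a standalone illustration with invented names (ThompsonUnion, State, Fragment, literal, union), again assuming at most two outgoing edges per state; it is not taken from the series' code.

```java
// Hypothetical sketch of the OR rule: exp1 | exp2.
class ThompsonUnion {
    static final int EPSILON = -1; // label of an ε edge

    static class State {
        int label = EPSILON; // character on the outgoing edge, or EPSILON
        State out1, out2;
    }

    static class Fragment {
        final State start, end;
        Fragment(State start, State end) {
            this.start = start;
            this.end = end;
        }
    }

    // Single character: start --c--> end
    static Fragment literal(char c) {
        State start = new State();
        State end = new State();
        start.label = c;
        start.out1 = end;
        return new Fragment(start, end);
    }

    static Fragment union(Fragment nfa1, Fragment nfa2) {
        State start = new State(); // new initial state
        State end = new State();   // new end state
        start.out1 = nfa1.start;   // ε edge to the head of NFA1
        start.out2 = nfa2.start;   // ε edge to the head of NFA2
        nfa1.end.out1 = end;       // ε edge from the tail of NFA1
        nfa2.end.out1 = end;       // ε edge from the tail of NFA2
        return new Fragment(start, end);
    }
}
```

Because union takes and returns a Fragment, its arguments can themselves be the result of earlier constructions, which is what makes nested expressions work.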
For the expression ((a | b) | cd), the algorithm is the same. First construct the NFA of a | b, then construct the NFA of cd, and finally connect the two NFAs in the same way:
The upper dotted box is the NFA of (a | b), and the longer dotted box below it is the NFA of cd. The two are then joined at the front and back by two new state nodes and ε edges.
As you can see, Thompson's construction is inherently recursive.
Now let's look at how the closure operators are constructed:
NFA of exp*:
If exp repeats 0 times, the machine goes directly along the bottom edge to the end node.
NFA of exp+ (repeated at least once):
NFA of exp? (repeated 0 or 1 times):
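The three closure operators differ only in which ε edges are present. The sketch below uses the usual textbook shapes, on the assumption that they match the diagrams: a bypass edge from the new start to the new end allows zero repetitions, and a loop edge from the tail of exp back to its head allows more than one. All names (ThompsonClosures, State, Fragment, literal, star, plus, optional) are invented for illustration.

```java
// Hypothetical sketch of the closure rules: exp*, exp+ and exp?.
class ThompsonClosures {
    static final int EPSILON = -1; // label of an ε edge

    static class State {
        int label = EPSILON; // character on the outgoing edge, or EPSILON
        State out1, out2;
    }

    static class Fragment {
        final State start, end;
        Fragment(State start, State end) {
            this.start = start;
            this.end = end;
        }
    }

    // Single character: start --c--> end
    static Fragment literal(char c) {
        State start = new State();
        State end = new State();
        start.label = c;
        start.out1 = end;
        return new Fragment(start, end);
    }

    // exp*: loop edge plus bypass edge (0 or more repetitions)
    static Fragment star(Fragment exp) {
        State start = new State();
        State end = new State();
        start.out1 = exp.start;
        start.out2 = end;         // bypass: 0 repetitions
        exp.end.out1 = exp.start; // loop: repeat again
        exp.end.out2 = end;
        return new Fragment(start, end);
    }

    // exp+: loop edge but no bypass edge (at least 1 repetition)
    static Fragment plus(Fragment exp) {
        State start = new State();
        State end = new State();
        start.out1 = exp.start;
        exp.end.out1 = exp.start; // loop: repeat again
        exp.end.out2 = end;
        return new Fragment(start, end);
    }

    // exp?: bypass edge but no loop edge (0 or 1 repetition)
    static Fragment optional(Fragment exp) {
        State start = new State();
        State end = new State();
        start.out1 = exp.start;
        start.out2 = end;         // bypass: 0 repetitions
        exp.end.out1 = end;
        return new Fragment(start, end);
    }
}
```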
The NFA of any complex regular expression is built by combining the structures above. Take, for example, the expression
(D*\.D | D\.D*)
It is constructed as follows:
1. Construct the NFA of D:
2. Construct the NFA of D*:
3. Construct the NFA of D*\.D (because . is a special character in regular expressions, a backslash is placed in front of it to escape it when the literal character is meant):
The part before the . is the NFA of D*, and the part after it is the NFA of D.
4. Construct the NFA of D\.D*. It is the previous NFA with the two parts around the . swapped: D comes first and D* comes last.
5. Construct the NFA of the entire expression (D*\.D | D\.D*) using the OR construction:
The top is the NFA of D*\.D, and the bottom is the NFA of D\.D*.
The NFA of a complex expression is simply a combination of a few basic structures.
This section introduced the concepts and algorithms. As usual, the next section will be the code.
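The five steps above can be sketched as nested calls, one per step. This is a minimal illustration under assumptions, not the code of the next section: every name (ThompsonExample, State, Fragment, atom, build, matches) is invented, D is modeled with a special DIGIT label rather than ten separate edges, and the small set-based matches helper exists only to check the result.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical end-to-end sketch for (D*\.D | D\.D*).
class ThompsonExample {
    static final int EPSILON = -1; // ε edge
    static final int DIGIT = -2;   // edge labeled with the class D = [0-9]

    static class State { int label = EPSILON; State out1, out2; }

    static class Fragment {
        final State start, end;
        Fragment(State start, State end) { this.start = start; this.end = end; }
    }

    // One edge labeled with a character or with DIGIT.
    static Fragment atom(int label) {
        State s = new State(), e = new State();
        s.label = label;
        s.out1 = e;
        return new Fragment(s, e);
    }

    static Fragment concat(Fragment a, Fragment b) {
        a.end.out1 = b.start;
        return new Fragment(a.start, b.end);
    }

    static Fragment union(Fragment a, Fragment b) {
        State s = new State(), e = new State();
        s.out1 = a.start; s.out2 = b.start;
        a.end.out1 = e;   b.end.out1 = e;
        return new Fragment(s, e);
    }

    static Fragment star(Fragment a) {
        State s = new State(), e = new State();
        s.out1 = a.start;     s.out2 = e;
        a.end.out1 = a.start; a.end.out2 = e;
        return new Fragment(s, e);
    }

    static Fragment build() {
        // Steps 1-3: D, D*, then D*\.D (each atom call creates fresh states)
        Fragment top = concat(concat(star(atom(DIGIT)), atom('.')), atom(DIGIT));
        // Step 4: D\.D*
        Fragment bottom = concat(concat(atom(DIGIT), atom('.')), star(atom(DIGIT)));
        // Step 5: OR of the two
        return union(top, bottom);
    }

    // Adds s and everything reachable from it through ε edges only.
    static void closure(State s, Set<State> set) {
        if (s == null || !set.add(s)) return;
        if (s.label == EPSILON) { closure(s.out1, set); closure(s.out2, set); }
    }

    static boolean edgeAccepts(State s, char c) {
        return s.label == c || (s.label == DIGIT && c >= '0' && c <= '9');
    }

    static boolean matches(Fragment nfa, String input) {
        Set<State> current = new HashSet<>();
        closure(nfa.start, current);
        for (char c : input.toCharArray()) {
            Set<State> next = new HashSet<>();
            for (State s : current) {
                if (edgeAccepts(s, c)) closure(s.out1, next);
            }
            current = next;
        }
        return current.contains(nfa.end);
    }
}
```

Read literally, this expression allows only a single digit on one side of the dot, so 12.5 and 1.25 match but 12.34 does not.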