Developing compilers with Java: Thompson's construction, converting regular expressions to finite state automata

Source: Internet
Author: User

Readers of this blog can visit my NetEase Cloud Classroom course and watch videos of the code being debugged and executed:

http://study.163.com/course/courseMain.htm?courseId=1002830012


In the previous section, we implemented a finite state automaton in code and used it to recognize integers and floating-point numbers. Constructing a finite state automaton and driving it to recognize an input string is the essence of lexical analysis.

The state machine developed in the previous section is based on the following model:


We wrote this model into the program by hand. In fact, it corresponds to a set of regular expressions:

d [0-9] is the character class for the digits 0-9

{d}+ represents an integer composed of the digits 0-9

({d}+ | {d}*\.{d}+ | {d}+\.{d}*)(e{d}+)? represents a floating-point number, optionally in scientific notation

Here {d}+ corresponds to the transition from state 0 to state 1 in the state machine, followed by the loop on state 1. The last regular expression corresponds to the paths in the diagram from state 0 to state 2 or state 4.
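As a quick sanity check of what these expressions accept, they can be transcribed into java.util.regex syntax. This is only a sketch: the class and method names are made up for illustration, and the series itself builds its own automata instead of relying on the library.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the author's code.
class NumberPatterns {
    // {d}+ with d = [0-9]
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    // ({d}+ | {d}*\.{d}+ | {d}+\.{d}*)(e{d}+)?
    static final Pattern FLOAT =
        Pattern.compile("([0-9]+|[0-9]*\\.[0-9]+|[0-9]+\\.[0-9]*)(e[0-9]+)?");

    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches(); // the whole string must match
    }

    static boolean isFloat(String s) {
        return FLOAT.matcher(s).matches();
    }
}
```

Note that the last expression also accepts plain digit strings, because {d}+ is one of its alternatives.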

So the question is: given a regular expression, can a finite state automaton be generated from it directly? The answer is yes. Most regular expression recognizers first convert the expression to an automaton and then drive the automaton to recognize the input. Converting regular expressions to finite state automata will be the focus of the next few sections.

Classification of finite state automata

Finite state automata fall into two categories. The first is the kind shown above, the deterministic finite automaton, abbreviated DFA. A deterministic state machine has one defining feature: given the current state and the input character, the next state is uniquely determined. For example, in state 1, if the character received is 0-9, the next state can only be 1; if the character received is ".", the next state can only be 2. More strictly, a DFA is an automaton in which every edge leaving a state is labeled with a definite character, and any two edges leaving the same state are labeled with different characters.
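The "one state plus one character gives exactly one next state" property can be sketched as a plain function. The class below is hypothetical. It covers only states 0, 1 and 2, it assumes that the second transition in the example is taken on ".", and the behavior of state 2 and the DEAD state are assumptions added so that the sketch is complete.

```java
// Hypothetical three-state DFA for strings such as "12" and "12.5".
class NumberDfa {
    static final int DEAD = -1; // assumption: marks "no edge for this character"

    // Deterministic step: each (state, character) pair yields exactly one result.
    static int next(int state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case 0:  return digit ? 1 : DEAD;
            case 1:  return digit ? 1 : (c == '.' ? 2 : DEAD);
            case 2:  return digit ? 2 : DEAD; // assumed: digits after the dot loop on 2
            default: return DEAD;
        }
    }

    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length() && state != DEAD; i++) {
            state = next(state, input.charAt(i));
        }
        return state == 1 || state == 2; // assumed accepting states
    }
}
```

Because next never has to choose between alternatives, driving the automaton is a single loop over the input.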

The counterpart of the DFA is the nondeterministic finite automaton, or NFA. In practice, converting a regular expression to an automaton requires the help of an NFA. In an NFA, two edges leaving the same state may be labeled with the same character, or an edge may be labeled with a special "empty" character, written ε. An ε edge means that the automaton can move from the current state to the next state without consuming any input.

For example, the DFA corresponding to the expression (and | any) is as follows:


The corresponding NFA is as follows:

Two edges leave the initial state, and both are labeled with the same character.


Or:

Two ε edges leave the initial state, each leading into the state machine for one of the two words.


The second form of NFA is easier to implement in a program, so the code in the next section adopts the second form.

One obvious weakness of the NFA is that it is harder to represent with data structures in code. In particular, when a single input character can take the NFA to several states, using the NFA to recognize an input string becomes more difficult. In general, a program that uses an NFA works in two steps: convert the regular expression to an NFA, then convert the NFA to a DFA. Later sections will show both transformations in code.
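To make the difficulty concrete, here is a minimal sketch that recognizes input directly with the ε-edge NFA for (and | any). It is not the NFA-to-DFA conversion itself. It simply tracks the set of states the NFA could be in, which is the idea that conversion builds on. The state numbers, class name and map-based representation are all assumptions chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical hand-built NFA for (and | any); state 0 is the initial state.
class AndAnyNfa {
    static final Map<Integer, List<Integer>> EPSILON = new HashMap<>();
    static final Map<Integer, Map<Character, Integer>> EDGES = new HashMap<>();
    static final Set<Integer> ACCEPTING = new HashSet<>(Arrays.asList(4, 8));

    static {
        EPSILON.put(0, Arrays.asList(1, 5)); // two ε edges leave the initial state
        addWord(1, "and");                   // 1 -a-> 2 -n-> 3 -d-> 4
        addWord(5, "any");                   // 5 -a-> 6 -n-> 7 -y-> 8
    }

    static void addWord(int first, String word) {
        for (int i = 0; i < word.length(); i++) {
            EDGES.computeIfAbsent(first + i, k -> new HashMap<>())
                 .put(word.charAt(i), first + i + 1);
        }
    }

    // Every state reachable from `states` through ε edges alone.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> pending = new ArrayDeque<>(states);
        while (!pending.isEmpty()) {
            for (int target : EPSILON.getOrDefault(pending.pop(), Collections.emptyList())) {
                if (result.add(target)) pending.push(target);
            }
        }
        return result;
    }

    static boolean matches(String input) {
        Set<Integer> current = closure(Collections.singleton(0));
        for (char c : input.toCharArray()) {
            Set<Integer> moved = new HashSet<>();
            for (int state : current) {
                Integer target = EDGES.getOrDefault(state, Collections.emptyMap()).get(c);
                if (target != null) moved.add(target);
            }
            current = closure(moved);
        }
        for (int state : current) {
            if (ACCEPTING.contains(state)) return true;
        }
        return false;
    }
}
```

After reading "a" the simulation is in states 2 and 6 at the same time, which is exactly the situation a DFA never has to handle.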

Thompson Construction Method

The algorithm for converting a regular expression to an NFA was devised by Ken Thompson of Bell Labs, who developed Unix together with Dennis Ritchie and also created B, the predecessor of the C language.

His algorithm is as follows:

The simplest regular expression matches a single character. For example, a matches the input character "a", and the NFA for this expression is as follows:


The concatenation ab of two such expressions can then be represented as follows:


In other words, first construct the NFA of each expression, then connect the two NFAs end to end with a single edge.
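The two basic constructions can be sketched in a few lines of Java. All names here (State, Fragment, ThompsonBasics) are hypothetical, and the sketch assumes that the single connecting edge is an ε edge.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: at most one labeled edge plus any number of ε edges.
class State {
    Character symbol;                              // label of the edge to `next`, null if none
    State next;                                    // target of the labeled edge
    final List<State> epsilon = new ArrayList<>(); // targets of ε edges
}

// An NFA under construction, identified by its start and end states.
class Fragment {
    final State start, end;
    Fragment(State start, State end) { this.start = start; this.end = end; }
}

class ThompsonBasics {
    // NFA for one character: start --c--> end
    static Fragment single(char c) {
        State start = new State(), end = new State();
        start.symbol = c;
        start.next = end;
        return new Fragment(start, end);
    }

    // NFA for "left right": left's end is linked to right's start by one edge
    // (assumed here to be an ε edge).
    static Fragment concat(Fragment left, Fragment right) {
        left.end.epsilon.add(right.start);
        return new Fragment(left.start, right.end);
    }
}
```

For example, ThompsonBasics.concat(ThompsonBasics.single('a'), ThompsonBasics.single('b')) yields a four-state NFA for ab.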

Next, let's look at how the NFA is constructed when two expressions are combined with the OR operator |. The structure is as follows:

To construct the OR of two expressions, exp1 | exp2, first construct the NFA of each expression as in the diagram: NFA1 (the top dashed box) and NFA2 (the bottom dashed box). Then create two new states, an initial state (the circle at the start) and an end state (the circle at the end). Two ε edges leave the initial state and point to the start of NFA1 and the start of NFA2. The end of NFA1 and the end of NFA2 each get one ε edge, and both edges point to the end state.

Let's look at the NFA diagram for a | b:


The principle is the same as described above. The top dashed box is the NFA of expression a, and the bottom dashed box is the NFA of expression b. The two NFAs are connected exactly as described above.
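A minimal sketch of the OR construction, applied to a | b, might look as follows. The State and Fragment types are hypothetical and are defined here so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: at most one labeled edge plus any number of ε edges.
class State {
    Character symbol;                              // label of the edge to `next`, null if none
    State next;
    final List<State> epsilon = new ArrayList<>(); // targets of ε edges
}

class Fragment {
    final State start, end;
    Fragment(State start, State end) { this.start = start; this.end = end; }
}

class ThompsonOr {
    static Fragment single(char c) {
        State start = new State(), end = new State();
        start.symbol = c;
        start.next = end;
        return new Fragment(start, end);
    }

    // exp1 | exp2: a new initial state fans out to both NFAs with ε edges,
    // and both NFAs rejoin at a new end state with ε edges.
    static Fragment or(Fragment top, Fragment bottom) {
        State start = new State(), end = new State();
        start.epsilon.add(top.start);
        start.epsilon.add(bottom.start);
        top.end.epsilon.add(end);
        bottom.end.epsilon.add(end);
        return new Fragment(start, end);
    }

    // The a | b example from the text.
    static Fragment aOrB() {
        return or(single('a'), single('b'));
    }
}
```

Because or takes and returns a Fragment, its result can be passed straight back into or or into any other construction.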

If the expression is ((a|b) | cd), the algorithm is the same: first construct the NFA of a | b, then construct the NFA of cd, and finally connect the two NFAs using the same method:

The large dashed box at the top is the NFA of (a|b), and the long, flat dashed box below it is the NFA of cd. The two are then connected through two new state nodes and ε edges.

As you can see, Thompson's construction is really a recursive process.

Now let's look at the constructions for the closure operators:

NFA of exp*:

If exp is repeated zero times, the bottom edge leads straight to the end node.


NFA of exp+ (one or more repetitions):


NFA of exp? (zero or one repetition):
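The three closure constructions can be sketched together, since they differ only in which ε edges are present. The sketch assumes the common textbook layout, in which a new start state and a new end state are wrapped around the NFA of exp; the author's figures may differ in detail. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: at most one labeled edge plus any number of ε edges.
class State {
    Character symbol;                              // label of the edge to `next`, null if none
    State next;
    final List<State> epsilon = new ArrayList<>(); // targets of ε edges
}

class Fragment {
    final State start, end;
    Fragment(State start, State end) { this.start = start; this.end = end; }
}

class ThompsonClosures {
    static Fragment single(char c) {
        State start = new State(), end = new State();
        start.symbol = c;
        start.next = end;
        return new Fragment(start, end);
    }

    // Wraps exp in a new start and end state.
    // allowZero adds the bypass edge start -> end (zero repetitions).
    // allowRepeat adds the loop edge exp.end -> exp.start (more than one repetition).
    private static Fragment wrap(Fragment exp, boolean allowZero, boolean allowRepeat) {
        State start = new State(), end = new State();
        start.epsilon.add(exp.start);
        exp.end.epsilon.add(end);
        if (allowZero) start.epsilon.add(end);
        if (allowRepeat) exp.end.epsilon.add(exp.start);
        return new Fragment(start, end);
    }

    static Fragment star(Fragment exp)     { return wrap(exp, true, true); }   // exp*
    static Fragment plus(Fragment exp)     { return wrap(exp, false, true); }  // exp+
    static Fragment optional(Fragment exp) { return wrap(exp, true, false); }  // exp?
}
```

Each call mutates the fragment it is given, so a fragment should be wrapped only once; build a fresh one for every occurrence in the expression.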


The NFA of any complex regular expression is a combination of the constructions above. Take, for example, the expression

(d*\.d | d\.d*)

It is constructed in the following steps:

1. Construct the NFA of d:


2. Construct d*:


3. Construct d*\.d (because . is a special character in regular expressions, it must be preceded by a backslash when it is meant literally):

The part before the dot is the NFA of d*, and the part after it is the NFA of d.


4. Construct d\.d*. The NFA of this expression is obtained by moving the part after the dot to the front.

5. Construct the NFA of the entire expression (d*\.d | d\.d*) using the OR construction:

The top part is the NFA of d*\.d, and the bottom part is the NFA of d\.d*.


The NFA of a complex expression is thus a repeated combination of a few basic structures.
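To see the pieces working together, the sketch below builds the NFA for (d*\.d | d\.d*) from the basic constructions and then runs it by tracking sets of states. Everything in it is an assumption made for illustration: the class names, the use of a string to hold the characters accepted by an edge (so that d can stand for [0-9]), the ε edge used for concatenation, and the state-set simulation, which is not the NFA-to-DFA conversion the series goes on to implement.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class State {
    String chars;                                  // characters accepted by the edge to `next`, null if none
    State next;
    final List<State> epsilon = new ArrayList<>(); // targets of ε edges
}

class Fragment {
    final State start, end;
    Fragment(State start, State end) { this.start = start; this.end = end; }
}

class DecimalNfa {
    static Fragment chars(String set) {
        State start = new State(), end = new State();
        start.chars = set;
        start.next = end;
        return new Fragment(start, end);
    }
    static Fragment digit() { return chars("0123456789"); } // d
    static Fragment dot()   { return chars("."); }          // \.

    static Fragment concat(Fragment left, Fragment right) {
        left.end.epsilon.add(right.start);
        return new Fragment(left.start, right.end);
    }
    static Fragment or(Fragment top, Fragment bottom) {
        State start = new State(), end = new State();
        start.epsilon.add(top.start);
        start.epsilon.add(bottom.start);
        top.end.epsilon.add(end);
        bottom.end.epsilon.add(end);
        return new Fragment(start, end);
    }
    static Fragment star(Fragment exp) {
        State start = new State(), end = new State();
        start.epsilon.add(exp.start);   // enter exp
        start.epsilon.add(end);         // zero repetitions
        exp.end.epsilon.add(exp.start); // repeat
        exp.end.epsilon.add(end);       // leave
        return new Fragment(start, end);
    }

    // Steps 1 to 5 from the text; every occurrence of d gets a fresh fragment.
    static Fragment build() {
        Fragment top = concat(concat(star(digit()), dot()), digit());    // d*\.d
        Fragment bottom = concat(concat(digit(), dot()), star(digit())); // d\.d*
        return or(top, bottom);
    }

    static Set<State> closure(Set<State> states) {
        Set<State> result = new HashSet<>(states);
        Deque<State> pending = new ArrayDeque<>(states);
        while (!pending.isEmpty()) {
            for (State target : pending.pop().epsilon) {
                if (result.add(target)) pending.push(target);
            }
        }
        return result;
    }

    static boolean matches(Fragment nfa, String input) {
        Set<State> current = closure(Collections.singleton(nfa.start));
        for (char c : input.toCharArray()) {
            Set<State> moved = new HashSet<>();
            for (State s : current) {
                if (s.chars != null && s.chars.indexOf(c) >= 0) moved.add(s.next);
            }
            current = closure(moved);
        }
        return current.contains(nfa.end);
    }
}
```

With this NFA, "12.5" is accepted through the top branch and "1.25" through the bottom branch, while "12.34" is rejected because neither branch allows several digits on both sides of the dot.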

That concludes this section's introduction to the concepts and algorithms. As usual, the next section will be all code.
