Developing a configurable parser (3): generating a pushdown automaton

Last Update:2018-12-06 Source: Internet

Author: User

Tags: expression engine


The previous post covered building the symbol table. Once the symbol table is built, we move on to the next stage of semantic analysis: constructing a state machine. As in my two earlier articles on implementing a regular expression engine, the automaton starts out as an ε-NFA (a nondeterministic automaton with epsilon edges) and is then turned into a deterministic automaton step by step. Parsing is very different from regular expressions, though. So what does this automaton look like?

(If you are interested in the theory, look up "pushdown automaton" on Wikipedia.)

The difference between a pushdown automaton and a finite automaton is that if you expand a pushdown automaton into an ordinary automaton, it has infinitely many states (obviously). But something infinite cannot be expressed in a program. What should we do? Attach a "state description" of unbounded length to the machine. Here is a simple grammar:

ID = NAME
IDLIST = ID | IDLIST "," ID

This is a simple grammar for parsing a comma-separated list of names. Its state machine can be written as follows:

ID0 = ● NAME
ID1 = NAME ●
IDLIST0 = ● (ID | IDLIST "," ID)
IDLIST1 = (ID | IDLIST "," ID) ●
IDLIST2 = (ID | IDLIST ● "," ID)
IDLIST3 = (ID | IDLIST "," ● ID)

ID0 -> NAME -> ID1
IDLIST0 -> ID -> IDLIST1
IDLIST0 -> IDLIST -> IDLIST2
IDLIST2 -> "," -> IDLIST3
IDLIST3 -> ID -> IDLIST1
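
The states and transitions above are easy to hold as plain data. Here is a minimal Python sketch of that idea. The names `TRANSITIONS`, `rule_of`, and `initial_and_final` are my own and not part of the original project, and the "no incoming edge means initial, no outgoing edge means final" rule is an assumption that happens to hold for this small grammar:

```python
import re

# Edges copied from the list above: (source state, input symbol, target state).
TRANSITIONS = [
    ("ID0", "NAME", "ID1"),
    ("IDLIST0", "ID", "IDLIST1"),
    ("IDLIST0", "IDLIST", "IDLIST2"),
    ("IDLIST2", ",", "IDLIST3"),
    ("IDLIST3", "ID", "IDLIST1"),
]

def rule_of(state):
    """Strip the trailing index, e.g. 'IDLIST2' -> 'IDLIST'."""
    return re.sub(r"\d+$", "", state)

def initial_and_final(transitions):
    """Group states by rule and pick out those without incoming / outgoing edges."""
    sources = {src for src, _, _ in transitions}
    targets = {dst for _, _, dst in transitions}
    result = {}
    for state in sorted(sources | targets):
        entry = result.setdefault(rule_of(state), {"initial": [], "final": []})
        if state not in targets:      # nothing leads here: a starting point
            entry["initial"].append(state)
        if state not in sources:      # nothing leaves here: an end point
            entry["final"].append(state)
    return result
```

Calling `initial_and_final(TRANSITIONS)` groups the six states by rule, which is a convenient starting point for the later steps.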

It is easy to see that ID0 and IDLIST0 are the initial states of their grammar rules, while ID1 and IDLIST1 are the final states, as shown in the following figure:

(Copying a PowerPoint drawing into Live Writer is really convenient.)

But this is not complete yet. The jump from IDLIST0 to IDLIST2 takes "IDLIST" as its input, which is not good enough, because the only tokens that actually appear in the input are NAME and ",". The next step is to show how to turn this state machine into a real pushdown state machine.

Here I need to introduce two concepts first: shift and reduce. (The Chinese names I use for them are the ones found in the rather mediocre compiler-principles textbook from Tsinghua University Press.) So what is a shift? Going from IDLIST0 to IDLIST2 requires shifting into IDLIST. Going from IDLIST3 to IDLIST1 requires shifting into ID, and so does going from IDLIST0 to IDLIST1. In other words, when a state transition passes through a nonterminal edge, it shifts into the state machine of another grammar rule. ID1 and IDLIST1 are the final states of ID and IDLIST. From there, following the rule "go back to wherever you shifted in from", they jump to IDLIST2 or IDLIST1. In other words, once you reach the final state of a rule's state machine, you reduce and jump back to a state one level up.

Someone will ask: how do I know where to jump when reducing? Good question. Think back to my earlier article on hand-writing a parser. What did it say? When you write a recursive-descent parser, each grammar rule is really a function. How does the program know which instruction to run next when a function returns? Because the compiler has pushed "the address of the instruction following the call" onto the call stack for us. Now we are not hand-writing the parser but using a pushdown state machine, and the principle is the same. When shifting, we first push the current state onto the stack. When reducing, we look at which states are on top of the stack and combine that with a look-ahead (this is the LA in LALR; LALR(1) peeks at one token when looking ahead) to decide where the reduce goes. I will leave the deeper meaning of look-ahead for the next article, because I have not yet implemented the nondeterministic-to-deterministic step; there are a lot of dark arts in it and I want to treat it on its own.
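
The call-stack analogy can be made concrete with a few lines of Python. This is only a sketch of the two stack operations as I read them from the description above. The function names `shift` and `reduce_to` and the `RETURN_TO` table are illustrative assumptions, and the look-ahead part is deliberately left out:

```python
# Where to continue after an ID has been completed, keyed by the state that was
# pushed when we shifted into ID. Taken from the edges IDLIST0 -> ID -> IDLIST1
# and IDLIST3 -> ID -> IDLIST1.
RETURN_TO = {"IDLIST0": "IDLIST1", "IDLIST3": "IDLIST1"}

def shift(stack, current_state, rule_start):
    """Remember where we came from, then enter another rule's state machine."""
    stack.append(current_state)
    return rule_start

def reduce_to(stack, return_to):
    """Pop the remembered state and continue at the state it leads to."""
    caller = stack.pop()
    return return_to[caller]

stack = []
state = shift(stack, "IDLIST0", "ID0")   # like calling the function for ID
state = "ID1"                            # ID0 -> NAME -> ID1 consumes a NAME
state = reduce_to(stack, RETURN_TO)      # like returning from that function
```

The pushed state plays the role of the return address: it is the only thing the reduce step needs in order to know which edge was being followed.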

Now let's connect the two state machines in the figure above to produce a pushdown automaton. I will only do the first step here: the jump from IDLIST0 to IDLIST2 is left-recursive, so we ignore it for now.

The orange edges are transitions whose input is a nonterminal, so they do not exist in the pushdown state machine. In this figure we handle the two ID edges. IDLIST0 shifts (pushes itself onto the stack) and jumps to ID0, so when ID1 sees IDLIST0 on top of the stack, it knows that the path IDLIST0 -> ID -> IDLIST1 is being taken, and it reduces and jumps to IDLIST1. The same applies to IDLIST3.
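
This first step is mechanical enough to sketch in Python. The code below is my own illustration rather than the author's implementation: `shift_reduce_edges` and the other helper names are hypothetical, it assumes every rule's start state is named with the suffix `0`, and it treats an edge that leaves a rule's start state on the rule's own nonterminal as the left-recursive case to skip.

```python
import re

TRANSITIONS = [
    ("ID0", "NAME", "ID1"),
    ("IDLIST0", "ID", "IDLIST1"),
    ("IDLIST0", "IDLIST", "IDLIST2"),
    ("IDLIST2", ",", "IDLIST3"),
    ("IDLIST3", "ID", "IDLIST1"),
]
NONTERMINALS = {"ID", "IDLIST"}

def rule_of(state):
    return re.sub(r"\d+$", "", state)

def final_states(rule, transitions):
    """States of `rule` that have incoming edges but no outgoing ones."""
    sources = {src for src, _, _ in transitions}
    targets = {dst for _, _, dst in transitions}
    return sorted(s for s in targets - sources if rule_of(s) == rule)

def shift_reduce_edges(transitions):
    """Replace each nonterminal edge by a shift edge and matching reduce edges."""
    shifts, reduces = [], []
    for src, symbol, dst in transitions:
        if symbol not in NONTERMINALS:
            continue                      # terminal edges stay as they are
        if src == symbol + "0":
            continue                      # left recursion: ignored in this step
        shifts.append((src, "push " + src, symbol + "0"))
        for final in final_states(symbol, transitions):
            reduces.append((final, "pop " + src, dst))
    return shifts, reduces
```

For this grammar the result contains two shift edges into ID0 (from IDLIST0 and IDLIST3) and two reduce edges from ID1 to IDLIST1, one for each possible stack top.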

However, a shift edge does not consume any input, so we should change the machine to the following.

This way the shift edges have input. The ID0 -> ID1 edge is absorbed into them, and ID0 itself can be discarded as well. Two problems remain unsolved: left recursion, and the fact that reduce edges do not consume input. These are actually the same problem. Let's first consider why we cannot give reduce edges an input the same way we did for shift edges. The reason is that you do not know which input a reduce has to read in order to make its jump, especially when the tokens have run out and a complete IDLIST has been parsed. When you read Parsing Techniques or the Dragon Book, were you puzzled about why a $ character is appended to the end of the input string? It is in fact very useful, so let's add it now. The goal of this grammar is to produce an IDLIST, so the $ should be added after IDLIST's final state, IDLIST1.
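
Both adjustments in this step, folding the input-less shift edges into the NAME edge and appending the end marker, fit in a short Python sketch. It is an illustration under my own assumptions: `fold_shifts` and `with_end_marker` are hypothetical helper names, and the edge lists are written out by hand from the earlier steps.

```python
# Shift edges from the previous step: (source, state to push, target).
SHIFT_EDGES = [("IDLIST0", "IDLIST0", "ID0"), ("IDLIST3", "IDLIST3", "ID0")]
# Terminal edges of the original machine: (source, token, target).
TERMINAL_EDGES = [("ID0", "NAME", "ID1"), ("IDLIST2", ",", "IDLIST3")]

def fold_shifts(shift_edges, terminal_edges):
    """Turn each input-less shift edge into edges that also read the next token."""
    folded = []
    for src, pushed, middle in shift_edges:
        for start, token, dst in terminal_edges:
            if start == middle:
                # src reads `token`, pushes `pushed`, and lands directly on dst
                folded.append((src, token, pushed, dst))
    return folded

def with_end_marker(tokens):
    """Append the end-of-input token so the last reduce has something to read."""
    return list(tokens) + ["$"]
```

After folding, no edge leads into ID0 any more, which is why that state can be dropped.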

Next comes reduce. Where does ID1 reduce to? First of all it reduces to IDLIST1. And where does IDLIST1 reduce to? When an IDLIST ends, we jump either to IDLIST2 or to FINISH. IDLIST2 arises from left recursion, so set it aside for the moment. What are the conditions for jumping to FINISH? First, the input is $. Second, the stack is empty once the state has been popped. So we can first modify the reduce edge between ID1 and IDLIST1:

The last step is left recursion. Handling left recursion is a bit of a hack, because you cannot know in advance how many times left recursion will happen (that would mean counting the commas in the token stream) and shift that many IDLIST0 states up front before going on. So all we can do is insert an IDLIST0 after the fact, at the moment the jump conditions are met. What are those conditions? The end of the left-recursive IDLIST, that is, the jump from IDLIST0 to IDLIST2, can only be followed by the input ",". And every edge pointing to IDLIST1 has ID as its input. So this left-recursion edge should run from ID1 (the end of ID) to IDLIST2, with a "fake shift IDLIST0" added along the way:

The orange states are the start and end states of the whole parsing process. Now we remove all the useless edges and states, which gives:

Isn't this just the state machine for the regular expression NAME ( "," NAME )* ? That is because this grammar happens to be expressible as a regular grammar. If we added parentheses to change precedence or something like that, it would become a much more complex state machine. Now let's simulate the state transitions and stack operations of this pushdown state machine on the input A,B,C$.

In the illustration below, S | abc | def denotes the current state S, the states abc currently on the stack (top of the stack on the right), and the pending input def. So the initial configuration must be
IDLIST0 | null | A,B,C$

And off we go! (It would be too ugly to spell this out in words, so it is shown as a figure.)

If the process runs to completion and both the stack and the input end up empty, parsing has finished successfully without any errors.
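
The walk-through can also be replayed in code. The Python sketch below is a hedged reconstruction: the transition table is my reading of the final machine from the prose (including the "fake shift IDLIST0" on the comma edge), not a copy of the author's diagram, and `TRANSITIONS`, `tokenize`, and `parse` are names I made up for the illustration.

```python
import re

# (state, token) -> (stack operations, next state).
# An operation is ("push", state) or ("pop", set of states allowed on top).
TRANSITIONS = {
    ("IDLIST0", "NAME"): ([("push", "IDLIST0")], "ID1"),
    ("IDLIST3", "NAME"): ([("push", "IDLIST3")], "ID1"),
    # Reduce ID, fake-shift IDLIST0, reduce the left-recursive IDLIST, read ",".
    ("ID1", ","): ([("pop", {"IDLIST0", "IDLIST3"}),
                    ("push", "IDLIST0"),
                    ("pop", {"IDLIST0"})], "IDLIST3"),
    # Reduce ID and finish; the stack must be empty afterwards (checked in parse).
    ("ID1", "$"): ([("pop", {"IDLIST0", "IDLIST3"})], "FINISH"),
}

def tokenize(text):
    """Split into NAME, ',' and '$' tokens; reject anything else."""
    raw = re.findall(r"[A-Za-z_]\w*|,|\$", text)
    if "".join(raw) != re.sub(r"\s+", "", text):
        raise ValueError("unexpected character in input")
    return [t if t in (",", "$") else "NAME" for t in raw]

def parse(text):
    """Run the machine; return (accepted, trace of (state, stack) after each token)."""
    state, stack, trace = "IDLIST0", [], []
    for token in tokenize(text):
        if (state, token) not in TRANSITIONS:
            return False, trace
        ops, target = TRANSITIONS[(state, token)]
        for kind, arg in ops:
            if kind == "push":
                stack.append(arg)
            elif not stack or stack[-1] not in arg:
                return False, trace
            else:
                stack.pop()
        state = target
        trace.append((state, tuple(stack)))
    return state == "FINISH" and not stack, trace
```

With this table, `parse("A,B,C$")` accepts and ends in FINISH with an empty stack, while inputs such as `"A,$"` or `"A"` are rejected. Keeping the pop, push, pop sequence on the comma edge explicit, rather than cancelling it out, makes the fake shift visible in the simulation.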

That covers the general process of generating a pushdown automaton from a grammar and using it to parse. Development has currently reached the "generate a nondeterministic pushdown automaton" stage. Once I can generate the deterministic pushdown automaton, that is, the last state machine diagram above, I will write the next article on how to adjust the pushdown automaton for more complex grammars. It will also focus on look-ahead and on why LALR(1) is designed the way it is.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
