Last time, we talked about the context-free languages used in syntax analysis, the productions used to describe a context-free grammar, derivations, and parse trees. Today we will discuss how to write a parser. The method described today is called recursive descent, a method that is well suited to writing a parser by hand and is very simple. Recursive descent places some restrictions on the grammar a language may use. Nevertheless, it is the mainstream parsing method today, because developers retain a high degree of control over it and it has advantages in producing error messages. Even Microsoft's official C# compiler uses a handwritten recursive descent parser.
Writing a parser by recursive descent requires no class libraries at all. For a simple parser, you do not even need the lexical analysis library we built earlier. Let's take an example: a string expression that represents a binary tree. Its grammar is:
    N → a ( N , N )
    N → ε
The terminal a stands for any English letter, and ε stands for the empty string. The grammar says that a node of the binary tree is either empty, or a letter followed by a pair of parentheses. Inside the parentheses, the left child of the node is written to the left of the comma and the right child to the right of the comma. For example, the string A(B(,C(,)),D(,)) represents a binary tree whose root A has the left child B and the right child D, where B has an empty left child and the right child C, and C and D have no children.
Note: the grammar stipulates that the parentheses and the comma cannot be omitted even when a node has no children (the children are empty). A tree with a single node must therefore be written A(,). Now we need to write a parser that takes such a string as input and builds the binary tree in memory. The in-memory binary tree is represented by the following class:
    class Node
    {
        public Node LeftChild { get; private set; }
        public Node RightChild { get; private set; }
        public char Label { get; private set; }

        public Node(char label, Node left, Node right)
        {
            Label = label;
            LeftChild = left;
            RightChild = right;
        }
    }
This is a Microsoft interview question that once stumped many candidates. Are you confident you could write this program? Many candidates reach for a stack, or try to search recursively for the comma positions in order to split the string. With recursive descent, however, the program is very easy to write. Let's look at the general steps for writing a recursive descent parser:
- Use an index to record the current scan position. It is usually an integer field.
- Write a method for each nonterminal.
- If a nonterminal has more than one production, the method must decide which one to use. This decision is called branch prediction.
- When processing a single production, advance the scan index from the first step whenever the expected terminal is encountered, and call the corresponding method from the second step whenever a nonterminal is encountered.
- If a parse result is needed (such as the binary tree in this example), construct it before the method returns.
Let's try it out now. First, create a class with a field that stores the current scan position. Then create a method for each nonterminal. Our grammar has only one nonterminal, N, so we only need one method:
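    class BinaryTreeParser
    {
        private string m_inputString;
        private int m_index;

        // Constructor that initializes the input string and the index
        public BinaryTreeParser(string input)
        {
            m_inputString = input;
            m_index = 0;
        }

        Node ParseNode()
        {
            // Body to be filled in step by step
            throw new NotImplementedException();
        }
    }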
Going back to the productions, we see that the nonterminal N has two of them, so the ParseNode method must begin with a branch prediction. Branch prediction is done by lookahead. That is, we first "peek" at the character at the current position and then decide which production to use for the rest of the analysis. One of the two productions of N yields the structure a ( N , N ), while the other yields the empty string. So we know at least one possibility: if we see a letter, we should continue with the production N → a ( N , N ). When, then, should we use N → ε? Look at every place where N appears on the right-hand side of a production. If N derives the empty string, the character that follows N appears immediately, and that character is either a comma or a right parenthesis. So this is our branch prediction:
- If the next character is an English letter, predict the branch N → a ( N , N )
- If the next character is a comma or a right parenthesis, predict the branch N → ε
Translated into code, this becomes:
    Node ParseNode()
    {
        int lookAheadIndex = m_index;
        char lookAheadChar = m_inputString[lookAheadIndex];

        if (Char.IsLetter(lookAheadChar))
        {
            // Continue the analysis with N → a ( N , N )
        }
        else if (lookAheadChar == ',' || lookAheadChar == ')')
        {
            // Continue the analysis with N → ε
        }
        else
        {
            throw new Exception("Syntax error");
        }
    }
Next, let's see how to handle each of the two branches. Start with N → ε. In this case the nonterminal derives the empty string, so we do not move the current index and simply return null to represent an empty node. Now for the branch N → a ( N , N ). If the input string has no syntax errors, we will encounter, in order, a letter, a left parenthesis, N, a comma, N, and a right parenthesis. Following the rules above, whenever we meet a terminal we advance the current index to scan forward, and whenever we meet a nonterminal we recursively call the corresponding method. The complete method (ignoring syntax errors) looks like this:
    Node ParseNode()
    {
        int lookAheadIndex = m_index;
        char lookAheadChar = m_inputString[lookAheadIndex];

        if (Char.IsLetter(lookAheadChar))
        {
            // Continue the analysis with N → a ( N , N )
            char label = m_inputString[m_index++]; // parse the letter
            m_index++;                             // left parenthesis: value not needed, skip it
            Node left = ParseNode();               // nonterminal N: recursive call
            m_index++;                             // comma: skip it
            Node right = ParseNode();              // nonterminal N: recursive call
            m_index++;                             // right parenthesis: skip it

            return new Node(label, left, right);
        }
        else if (lookAheadChar == ',' || lookAheadChar == ')')
        {
            // Continue the analysis with N → ε
            // No input characters are consumed; return null directly
            return null;
        }
        else
        {
            throw new Exception("Syntax error");
        }
    }
Because of the constraints of the grammar, once the branch prediction is done we know exactly which character or nonterminal must come next, and no further checks are needed (unless we want to detect syntax errors). So there is no need to search for the position of the comma. When the parse reaches the point where a comma is expected, the comma must be there. Isn't that great? A complete parser takes only a few lines of code. If you are interested, you can add the remaining auxiliary code and test it against real input strings. The code above assumes that the input string is syntactically correct, but in the real world programs always contain mistakes, so a compiler must be able to detect syntax errors. Adding syntax error checks to the program above is very easy: just verify that the character at each position really equals the terminal required by the production. I leave this as an exercise.
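One possible way to do this exercise is sketched below. It is only an illustration under stated assumptions, not the author's own solution: the names CheckedNode, CheckedBinaryTreeParser, Peek and Expect are hypothetical, the input is assumed to contain no whitespace, and end of input is treated as a third lookahead that selects N → ε so that an empty string parses as an empty tree.

```csharp
using System;

// Hypothetical names for illustration; not part of the article's code.
class CheckedNode
{
    public char Label { get; }
    public CheckedNode LeftChild { get; }
    public CheckedNode RightChild { get; }

    public CheckedNode(char label, CheckedNode left, CheckedNode right)
    {
        Label = label;
        LeftChild = left;
        RightChild = right;
    }
}

class CheckedBinaryTreeParser
{
    private readonly string m_inputString;
    private int m_index;

    public CheckedBinaryTreeParser(string input)
    {
        m_inputString = input ?? throw new ArgumentNullException(nameof(input));
        m_index = 0;
    }

    // Parses the whole input and rejects any trailing characters.
    public CheckedNode Parse()
    {
        CheckedNode root = ParseNode();
        if (m_index != m_inputString.Length)
            throw new FormatException($"Unexpected character at position {m_index}.");
        return root;
    }

    // Returns the current character without consuming it; '\0' marks end of input.
    private char Peek() =>
        m_index < m_inputString.Length ? m_inputString[m_index] : '\0';

    // Consumes one terminal after verifying it is the one the production requires.
    private void Expect(char expected)
    {
        if (Peek() != expected)
            throw new FormatException($"Expected '{expected}' at position {m_index}.");
        m_index++;
    }

    private CheckedNode ParseNode()
    {
        char lookAheadChar = Peek();
        if (char.IsLetter(lookAheadChar))
        {
            // N → a ( N , N )
            char label = m_inputString[m_index++];
            Expect('(');
            CheckedNode left = ParseNode();
            Expect(',');
            CheckedNode right = ParseNode();
            Expect(')');
            return new CheckedNode(label, left, right);
        }
        if (lookAheadChar == ',' || lookAheadChar == ')' || lookAheadChar == '\0')
        {
            // N → ε: consume nothing and return an empty node.
            return null;
        }
        throw new FormatException(
            $"Unexpected character '{lookAheadChar}' at position {m_index}.");
    }
}
```

With this sketch, `new CheckedBinaryTreeParser("A(B(,C(,)),D(,))").Parse()` returns a tree rooted at A, while a truncated input such as `A(,` raises a `FormatException` at the position where the right parenthesis is missing.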
The branch prediction above was worked out by "human observation". Compiler textbooks usually present algorithms for computing FIRST sets and FOLLOW sets, which tell us which characters a production can start with. With them, branch prediction can be derived automatically, which makes it possible to generate recursive descent parsers automatically. ANTLR is a well-known tool built on this principle. If you are interested, consult a compiler textbook. In practice, I find "human observation" not difficult at all, because the grammars of programming languages have very distinctive features and we gain plenty of experience with them by writing code every day.
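For the binary tree grammar, the sets in question are small enough to write out by hand. The following is a worked illustration using the usual textbook definitions, where $ (an assumed notation here) denotes end of input:

```latex
\begin{aligned}
\mathrm{FIRST}\bigl(a\,(\,N\,,\,N\,)\bigr) &= \{\, a \,\} \\
\mathrm{FIRST}(\varepsilon) &= \{\, \varepsilon \,\} \\
\mathrm{FOLLOW}(N) &= \{\ \texttt{,}\ \ \texttt{)}\ \ \$\ \}
\end{aligned}
```

The prediction rule then reads: choose N → a ( N , N ) when the lookahead is in the first set, and choose N → ε when the lookahead is in FOLLOW(N). This matches the result of human observation, except for $, which can only be the lookahead for N when the entire input is the empty string.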
Next we will study the restrictions that recursive descent places on the grammar. First, we must make branch predictions by looking ahead. A grammar that supports recursive descent must allow the production to be chosen by looking at most k characters ahead while scanning from left to right. We call such a grammar an LL(k) grammar. In this name, the first L means that the string is scanned from left to right, which can be seen from our index variable increasing from 0. The second L stands for leftmost derivation; you may remember the leftmost derivation example from the previous article. If you trace a recursive descent parser in a debugger, it is easy to see that it really does perform a leftmost derivation (it always expands the leftmost nonterminal of the current sentential form). The k in parentheses means that up to k characters of lookahead are needed. If no fixed number k of lookahead characters at the start of a nonterminal's parsing method is enough to determine which production to use, the grammar cannot be parsed by recursive descent. For example, consider the following grammar:
    F → id
    F → ( E )
    E → F * F
    E → F / F
When we write the parsing method for the nonterminal E, we need to predict which of the two productions of E applies. However, both productions of E start with F, and F itself can be an arbitrarily long expression, so no matter how many characters we look ahead, we cannot tell whether to use the multiplication production or the division production. In this case, we can use left factoring to convert the grammar into an LL(k) grammar:
    F → id
    F → ( E )
    G → * F
    G → / F
    E → F G
We factor out the common left factor F and put what remains into a new nonterminal G. Branch prediction is easy when parsing G, and no branch prediction is needed at all when parsing E. In practice, left factoring not only turns a grammar into LL(k) form, it also avoids parsing the same prefix repeatedly and so improves performance.
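As a rough sketch of how the left-factored grammar maps onto parsing methods, consider the following. It rests on several assumptions that are not in the article: id is a single letter, the input contains no whitespace, the result is returned as a string in prefix notation, and the names FactoredParser, Peek and Expect are hypothetical. Passing the already parsed F into ParseG is just one convenient way to build the result.

```csharp
using System;

// Illustrative sketch only; all names are hypothetical.
class FactoredParser
{
    private readonly string m_input;
    private int m_index;

    public FactoredParser(string input)
    {
        m_input = input;
        m_index = 0;
    }

    // Returns the current character without consuming it; '\0' marks end of input.
    private char Peek() => m_index < m_input.Length ? m_input[m_index] : '\0';

    private void Expect(char expected)
    {
        if (Peek() != expected)
            throw new FormatException($"Expected '{expected}' at position {m_index}.");
        m_index++;
    }

    // Entry point: parses a complete E and returns it in prefix notation.
    public string Parse()
    {
        string result = ParseE();
        if (m_index != m_input.Length)
            throw new FormatException($"Unexpected character at position {m_index}.");
        return result;
    }

    // E → F G : a single production, so no branch prediction is needed.
    private string ParseE()
    {
        string left = ParseF();
        return ParseG(left);
    }

    // G → * F | / F : one character of lookahead selects the production.
    private string ParseG(string left)
    {
        char op = Peek();
        if (op != '*' && op != '/')
            throw new FormatException($"Expected '*' or '/' at position {m_index}.");
        m_index++;
        string right = ParseF();
        return $"({op} {left} {right})";
    }

    // F → id | ( E ) : a letter predicts id, '(' predicts the parenthesized form.
    private string ParseF()
    {
        char c = Peek();
        if (char.IsLetter(c))
        {
            m_index++;
            return c.ToString();
        }
        if (c == '(')
        {
            m_index++;
            string inner = ParseE();
            Expect(')');
            return inner;
        }
        throw new FormatException($"Expected an identifier or '(' at position {m_index}.");
    }
}
```

For the input `(a*b)/c`, `Parse()` returns `(/ (* a b) c)`. Note that in this grammar G has no empty production, so a lone `a` is a valid F but not a valid E and is rejected.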
Next let's look at the second important restriction of LL(k) grammars: left recursion is not supported. Left recursion means that the first symbol on the right-hand side of a production may be the very nonterminal that the production defines. The grammar below is a direct example of left recursion:
    F → id
    E → E + F
    E → F
This grammar is similar to the unambiguous binary operator grammar we arrived at near the end of the previous article. However, it is left recursive: the first symbol produced by E is E itself. Imagine writing the recursive descent parsing function for E. If the function calls itself recursively at its very beginning, no input is consumed at all, and the recursion never ends. Left recursion is therefore a grammatical structure that must be eliminated. The solution is to convert the left recursion into an equivalent right recursive form:
    F → id
    E → F G
    G → + F G
    G → ε
Keep this example in mind. It is not just an example but a general recipe for eliminating most left recursion! We will use this transformation again and again when writing the miniSharp parser.
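A minimal sketch of a parser for the right recursive form follows. It assumes, beyond what the article states, that id is a single letter and that the input contains no whitespace; SumParser and its members are hypothetical names. Because ParseG receives the operand built so far and combines it before recursing, the result still groups to the left, as the original left recursive grammar would.

```csharp
using System;

// Illustrative sketch only; all names are hypothetical.
class SumParser
{
    private readonly string m_input;
    private int m_index;

    public SumParser(string input)
    {
        m_input = input;
        m_index = 0;
    }

    // Returns the current character without consuming it; '\0' marks end of input.
    private char Peek() => m_index < m_input.Length ? m_input[m_index] : '\0';

    // Entry point: parses a complete E and returns a fully parenthesized string.
    public string Parse()
    {
        string result = ParseE();
        if (m_index != m_input.Length)
            throw new FormatException($"Unexpected character at position {m_index}.");
        return result;
    }

    // E → F G
    private string ParseE()
    {
        string first = ParseF();
        return ParseG(first);
    }

    // G → + F G | ε
    private string ParseG(string left)
    {
        if (Peek() == '+')
        {
            m_index++; // consume '+'
            string right = ParseF();
            // Combine first, then recurse, so that a+b+c groups as ((a+b)+c).
            return ParseG($"({left}+{right})");
        }
        // G → ε: consume nothing. Any other lookahead is treated as the end of
        // the sum here; Parse() reports leftover characters as an error.
        return left;
    }

    // F → id (assumption: an identifier is a single letter)
    private string ParseF()
    {
        char c = Peek();
        if (!char.IsLetter(c))
            throw new FormatException($"Expected an identifier at position {m_index}.");
        m_index++;
        return c.ToString();
    }
}
```

For the input `a+b+c`, `Parse()` returns `((a+b)+c)`. Each call to ParseG consumes a `+` before recursing, so the recursion always makes progress and terminates, unlike a direct translation of E → E + F.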
Because LL(k) grammars cannot contain left recursion or common left factors, many common grammars become less elegant once they are converted to LL(k) form. For this reason many programmers prefer LR(k) parsers. LR means scanning from left to right and producing a rightmost derivation. LR grammars allow left recursion and common left factors, but they cannot be parsed by recursive descent. Instead they must be parsed with the shift-reduce, or bottom-up, method. Personally, I find the principles behind LR parsers very elegant and subtle, but I am not going to cover them in this series. Any compiler textbook gives a detailed introduction. Of course, if my VBF library supports LR parsers in the future, I may write some dedicated articles. Who knows?
I hope that after reading this article you can use recursive descent to write parsers for some LL(k) grammars. In the next article, I will show how to "combine" parsers using the magic of LINQ syntax in C# and VB, so stay tuned!
I hope you will keep following my VBF project: https://github.com/Ninputer/VBF and my Weibo: http://weibo.com/ninputer. Thank you for your support!