5.1 Intermediate Code Generation and Optimization: Introduction
During parsing and semantic checking, we dealt with three concepts: statements (statement), expressions (expression), and external declarations (externaldeclaration). From the declarations we built the corresponding type structures and stored the type information of each identifier in the symbol table. In the intermediate code generation phase we no longer need to process external declarations; we only need to generate intermediate code for statements and expressions. The file ucl\tranexpr.c generates intermediate code for expressions (the name tranexpr abbreviates "translate expression"), and ucl\transtmt.c generates intermediate code for statements.
We gave the example in Figure 5.1.1 in "Section 1.4 UCC Compiler Preview"; for ease of reading we revisit it here. In Figure 5.1.1, the recursive function on lines 1 to 9 computes the factorial of n, and the while loop on lines 12 to 15 prints 1! through 10!. Lines 19 to 45 show, in textual form, the intermediate code generated by the UCC compiler, which internally represents intermediate code as three-address code structures. The "Dragon Book" treats both syntax trees and three-address code as intermediate code; in our discussion of the UCC compiler, unless stated otherwise, "intermediate code" refers to three-address code. A three-address instruction contains two source operands, one destination operand, and an operator. For example, in t1 = a + b, the plus sign "+" is the operator, a and b are the two source operands, and t1 is the destination operand (the result of the operation). These correspond to three "addresses", hence the name three-address code.
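To make the three "addresses" concrete, here is a minimal C sketch of one instruction of the form dst = src1 op src2. The names TAC and FormatTAC and the use of plain strings for operands are assumptions made for illustration; they are not UCC's own definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical three-address instruction: dst = src1 op src2. */
struct TAC {
    const char *dst;   /* destination operand, e.g. "t1" */
    const char *src1;  /* first source operand           */
    char        op;    /* operator, e.g. '+'             */
    const char *src2;  /* second source operand          */
};

/* Writes the textual form of inst into buf and returns the length
   that snprintf reports. */
int FormatTAC(const struct TAC *inst, char *buf, size_t size)
{
    assert(inst != NULL && buf != NULL);
    return snprintf(buf, size, "%s = %s %c %s",
                    inst->dst, inst->src1, inst->op, inst->src2);
}
```

Filling the struct with "t1", "a", '+', "b" and formatting it yields the text t1 = a + b used in the example above.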
Figure 5.1.1 Basic Block
The intermediate code on line 20 of Figure 5.1.1 is a conditional jump, and the goto on line 23 is an unconditional jump; assembly language has instructions corresponding to both. Conditional and unconditional jumps in a low-level language transfer the control flow, which is what lets us implement control structures such as the if branch statement and the while loop statement of the high-level language C. A function return also changes the control flow. In the intermediate code generated by the UCC compiler, the return 1 on line 22 only sets the return value to 1; the actual return is performed by the ret instruction on line 30. The advantage of this approach is that even if the C code contains multiple return statements (such as those on lines 5 and 7), at the intermediate code level we only set the return value to 1 on line 22 and to t2 on line 29. The actual return is handled in one place (the ret instruction on line 30), which means that a function has exactly one entry and exactly one exit. When executing a function call, we push the arguments and the return address onto the stack in turn and then jump unconditionally to the start address of the function, so a function call can also be regarded as an unconditional jump.
After generating the intermediate code we also need to optimize it, and at that point we must take changes in control flow into account. We would like to treat several adjacent intermediate instructions as a whole, such as the 4 instructions on lines 26 to 29, where control flow can only enter at the first instruction of the whole (the intermediate code on line 26) and leave at the last instruction (the intermediate code on line 29). We call such a "whole" a basic block; the "BB3" on line 25 denotes the 3rd basic block. The whole C program can thus be seen as composed of a number of basic blocks.
Consider the intermediate code on lines 19 to 45 of Figure 5.1.1. From a static point of view, these basic blocks are arranged from top to bottom, and we can store them in a linked list. From the dynamic semantics of conditional and unconditional jumps, however, the picture is different. For example, if we regard basic block BB7 on line 41 as a node, then after executing the conditional jump on line 42 we may next execute either the intermediate code on line 44 or the intermediate code on line 37. This is equivalent to having one directed edge from BB7 on line 41 to BB8 on line 43, and another directed edge from BB7 to BB6. Basic blocks thus become nodes and transfers of control flow become directed edges, so the routes the program can take during execution form a "graph" data structure whose nodes are basic blocks. This graph is called the control flow graph (CFG). Subsequent analyses and optimizations are based on data structures such as the control flow graph.
It should be noted that the UCC compiler performs intermediate code optimization only within basic blocks and does no interprocedural optimization, so when dividing code into basic blocks it does not treat a function call as the last instruction of a basic block. For example, in basic block BB6 on line 36 of Figure 5.1.1, the function call on line 39 is not the last instruction of the block. In the intermediate code optimization phase, the PeepHole() function in ucl\simp.c optimizes call instructions. "Peephole" refers to peephole optimization: we look at the world through a small hole or window and see only a small, local part of it at a time. In the UCC compiler, the size of this window is generally limited to one basic block. For example, consider the following two intermediate instructions:
t1 = f();
num = t1;
Since the UCC compiler does not force the function call f() to be the last instruction of a basic block, these 2 instructions can be in the same basic block. When the PeepHole() function performs peephole optimization on that block, it finds that the temporary variable t1 is superfluous, and the 2 instructions can be optimized into the single instruction shown below:
num = f();
To facilitate such optimizations, the UCC compiler does not treat function calls as transfers of control flow when dividing code into basic blocks. But we know that when the machine instructions actually execute, a function call does transfer the control flow. When the UCC compiler generates x86 assembly in a later stage, the EmitBlock function in ucl\x86.c therefore special-cases the call instruction in order to save some registers that are in use before the function call.
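The folding of a call result into its final destination can be sketched in a few lines of C. This is a simplified illustration, not the logic of UCC's PeepHole(): the Inst structure, the dstIsTemp flag, and FoldCallTemps are hypothetical, and the sketch assumes that a flagged temporary is used exactly once, by the copy that immediately follows the call.

```c
#include <assert.h>
#include <string.h>

enum Op { OP_CALL, OP_MOV };

/* Hypothetical instruction. OP_CALL means dst = src(); OP_MOV means dst = src. */
struct Inst {
    enum Op     op;
    const char *dst;
    const char *src;
    int         dstIsTemp;  /* nonzero if dst is a single-use temporary */
};

/* Rewrites "t = f(); x = t;" as "x = f();" in place, looking only at
   adjacent instructions of one basic block. Returns the new count. */
int FoldCallTemps(struct Inst *code, int n)
{
    int out = 0;
    assert(code != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (i + 1 < n &&
            code[i].op == OP_CALL && code[i].dstIsTemp &&
            code[i + 1].op == OP_MOV &&
            strcmp(code[i + 1].src, code[i].dst) == 0) {
            /* Let the call write straight into the copy's destination. */
            struct Inst folded = code[i];
            folded.dst = code[i + 1].dst;
            folded.dstIsTemp = code[i + 1].dstIsTemp;
            code[out++] = folded;
            i++;  /* skip the copy, which is now redundant */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}
```

The pass only works because both instructions sit in the same basic block; if the call ended its block, the copy would be out of the window.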
In short, to store the intermediate code on lines 19 to 45 of Figure 5.1.1 we need a linked list of basic blocks, and to describe the dynamic transfer of control between basic blocks we need the control flow graph data structure. In Figure 5.1.2, struct irinst on lines 2 to 8 describes one three-address instruction. opds[3] on line 7 holds the 3 operands, each represented by a symbol object (struct symbol); the concepts related to symbols were introduced in "Section 2.5 UCC Compiler Symbol Table Management". opcode on line 6 stores the operator. Since a basic block can contain several intermediate instructions, prev on line 3 and next on line 4 link them into a doubly linked list, and ty on line 5 records the type of the result of the operation.
Figure 5.1.2 Related data structures
The structure on lines 15 to 32 of Figure 5.1.2 describes a basic block. The prev on line 16 and the next on line 17 link the basic blocks into a doubly linked list. This doubly linked list captures the "static structure" of the basic blocks shown in Figure 5.1.1. In the dynamic structure, the control flow graph (CFG), a node can have multiple predecessors and multiple successors: succs on line 20 records all successors of the current basic block, and preds on line 22 records all its predecessors. ninst on line 25 records how many intermediate instructions the current basic block contains, nsucc on line 27 records the number of successors, and npred on line 29 records the number of predecessors. sym on line 18 holds the name of the basic block, such as "BB3". insth on line 23 is a placeholder instruction that serves only as the head node of the doubly linked list of instructions and does not correspond to any actual code.
Since a basic block Bx can have multiple predecessors {B1, B2, B3, ..., Bn}, there is a directed edge from each predecessor to Bx. Similarly, Bx can have multiple successors, and there is a directed edge from Bx to each of them. In Figure 5.1.2, struct cfgedge on lines 10 to 13 describes a directed edge: the bb field on line 11 holds the predecessor (or successor) at the other end of the edge, and the next field on line 12 chains several predecessors (or successors) into a singly linked list. The function DrawCFGEdge(head, tail) on line 49 constructs a directed edge from basic block head to basic block tail. Because tail is then a successor of head, the AddSuccessor function on line 50 adds tail to the successor list in the succs field of head. And because head is a predecessor of tail, head must also be added to the predecessor list pointed to by the preds field of tail; this is done by the AddPredecessor function on line 51.
head --> tail
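The edge bookkeeping described above can be sketched as follows. The field names follow the text, but this is a reduced sketch rather than UCC's code: the basic block keeps only its CFG fields, the single helper AddEdge is a hypothetical stand-in for the two functions AddSuccessor and AddPredecessor, and memory comes from malloc and is never freed.

```c
#include <assert.h>
#include <stdlib.h>

struct bblock;

/* One entry in an edge list: bb is the block at the other end. */
struct cfgedge {
    struct bblock  *bb;
    struct cfgedge *next;
};

/* Only the CFG-related fields of a basic block are kept here. */
struct bblock {
    struct cfgedge *succs;  /* list of successors   */
    struct cfgedge *preds;  /* list of predecessors */
    int nsucc;              /* number of successors   */
    int npred;              /* number of predecessors */
};

/* Pushes bb onto the front of the singly linked list *list. */
static void AddEdge(struct cfgedge **list, struct bblock *bb)
{
    struct cfgedge *e = malloc(sizeof *e);
    assert(e != NULL);
    e->bb = bb;
    e->next = *list;
    *list = e;
}

/* Records the directed edge head --> tail on both endpoints. */
void DrawCFGEdge(struct bblock *head, struct bblock *tail)
{
    AddEdge(&head->succs, tail);  /* tail is a successor of head   */
    head->nsucc++;
    AddEdge(&tail->preds, head);  /* head is a predecessor of tail */
    tail->npred++;
}
```

Drawing the two edges BB7 --> BB8 and BB7 --> BB6 from the earlier example leaves BB7 with two successors, and BB8 and BB6 with one predecessor each.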
The function AppendInst on line 54 of Figure 5.1.2 adds an intermediate instruction to the current basic block; the 4 statements on lines 55 to 59 perform the insertion into the doubly linked list.
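A tail insertion into a circular doubly linked list with a head node takes four pointer assignments, as the sketch below shows. It is an illustration under assumptions: the structures are cut down to the list fields, InitBlock is a hypothetical helper, and the basic block is passed as a parameter, whereas the text speaks of a "current" basic block.

```c
#include <assert.h>
#include <stddef.h>

enum { NOP, ADD };  /* hypothetical opcodes */

struct irinst {
    struct irinst *prev;
    struct irinst *next;
    int opcode;
};

struct bblock {
    struct irinst insth;  /* head node, holds no real code */
    int ninst;            /* number of real instructions   */
};

/* Makes the instruction list empty: the head node links to itself. */
void InitBlock(struct bblock *bb)
{
    assert(bb != NULL);
    bb->insth.opcode = NOP;
    bb->insth.prev = bb->insth.next = &bb->insth;
    bb->ninst = 0;
}

/* Inserts inst just before the head node, that is, at the tail. */
void AppendInst(struct bblock *bb, struct irinst *inst)
{
    assert(bb != NULL && inst != NULL);
    bb->insth.prev->next = inst;   /* old tail points forward to inst */
    inst->prev = bb->insth.prev;   /* inst points back to old tail    */
    inst->next = &bb->insth;       /* inst closes the circle          */
    bb->insth.prev = inst;         /* head now sees inst as the tail  */
    bb->ninst++;
}
```

Because the head node is always present, the insertion needs no special case for an empty list.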
Next, let's take a first look at the overall flow of intermediate code generation, shown in Figure 5.1.3. The Translate function on lines 55 to 64 translates the abstract syntax tree (AST) into three-address code; the while loop on line 57 walks over every function in the current C file, and the actual work is done by TranslateFunction on line 60. Lines 38 to 54 of Figure 5.1.3 give the main code of TranslateFunction. Because of the way we handle return statements, a function definition has exactly one entry and one exit no matter how complex its internal control flow is. Lines 43 and 44 call CreateBBlock() to create the entry and exit basic blocks. Lines 20 to 26 give the code of CreateBBlock(); line 23 sets the head node to the empty instruction NOP (short for "no operation", an instruction that does nothing). Line 46 calls TranslateStatement() to translate the function body, which is a compound statement (compoundstatement). Lines 1 to 16 list a table of function pointers; the functions on lines 2 to 15 translate the individual kinds of statements, and the TranslateStatement function on lines 17 to 19 simply looks up this table. We will analyze each of the functions on lines 2 to 15 in later sections. The Optimize() function on line 48 optimizes the generated intermediate code, and the while loop on line 50 names the basic blocks, for example "BB1" and "BB2". Lines 38 to 54 of Figure 5.1.3 are thus the main flow of intermediate code generation and optimization.
Figure 5.1.3 Translate ()
Note that after the call to CreateBBlock() on line 43 of Figure 5.1.3 returns, the newly created basic block has not yet been added to the doubly linked list. Only when we call StartBBlock(BBlock bb) on line 27 is the basic block pointed to by the parameter bb added to the list; the code on lines 29 and 30 performs the actual insertion. The if statement on line 31 of Figure 5.1.3 checks whether the last instruction of the current basic block may transfer control to the basic block pointed to by bb. If it may, we call DrawCFGEdge on line 34 to construct a directed edge from the current basic block to bb, and line 36 then makes bb the new current basic block. An unconditional jump instruction JMP at the end of a basic block transfers control to a basic block that need not be adjacent; for example, the goto BB4 instruction on line 23 of Figure 5.1.1, in basic block BB1, jumps to BB4, which is not adjacent to BB1. In that case the directed edge is constructed when the unconditional jump instruction is generated; see the function GenerateJump in ucl\gen.c. Similarly, if the last instruction of the basic block is an indirect jump like the one shown below, we do not call DrawCFGEdge on line 34 of Figure 5.1.3 either; instead, the function GenerateIndirectJump() in ucl\gen.c constructs the directed edges according to the actual jump targets. Neither function is complicated, so we do not list them here. Indirect jumps are used in the translation of the switch statement.
// Determine the jump target based on the value of t0
goto (BB1,BB2,BB3,)[t0];
If the last instruction of the current basic block is a conditional jump, control may fall through to the adjacent next basic block, so we need to call DrawCFGEdge on line 34 of Figure 5.1.3 to construct the directed edge; line 42 of Figure 5.1.1 is an example. And if the last instruction is not a jump instruction at all, control must flow to the adjacent next basic block, so we again call DrawCFGEdge on line 34 of Figure 5.1.3; line 40 of Figure 5.1.1 is an example.
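The fall-through rule can be condensed into a short sketch: link the new block into the list, and draw an edge unless the current block ends in an unconditional or indirect jump. This is a reconstruction from the description, not the code of Figure 5.1.3. The opcode names other than JMP, the lastOpcode field, and the use of edge counters in place of full edge lists are assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum { NOP, MOV, JZ, JMP, IJMP };  /* hypothetical opcode set */

struct bblock {
    struct bblock *prev, *next;
    int lastOpcode;    /* opcode of the block's final instruction  */
    int nsucc, npred;  /* edge counts stand in for full edge lists */
};

/* The basic block currently being filled with instructions. */
struct bblock *CurrentBB;

/* Reduced version: only counts the edge head --> tail. */
static void DrawCFGEdge(struct bblock *head, struct bblock *tail)
{
    head->nsucc++;
    tail->npred++;
}

/* Appends bb to the block list and makes it the current block. */
void StartBBlock(struct bblock *bb)
{
    assert(CurrentBB != NULL && bb != NULL);
    CurrentBB->next = bb;
    bb->prev = CurrentBB;
    /* After JMP or IJMP control never falls through to bb; those
       edges are assumed to be drawn where the jump is generated. */
    if (CurrentBB->lastOpcode != JMP && CurrentBB->lastOpcode != IJMP)
        DrawCFGEdge(CurrentBB, bb);
    CurrentBB = bb;
}
```

A block ending in a conditional jump such as JZ, or in an ordinary instruction such as MOV, gets the fall-through edge; a block ending in JMP does not.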