Build Your Own Compiler (12): Generating Managed Code

Source: Internet
Author: User
Tags emit

In the previous installment we completed an important phase of the compiler: semantic analysis. Every variable and type in the program now has its correct definition, the type of every expression and statement is legal, and every method call has been bound to the correct method definition. We are now about to enter the next stage: code generation. The ultimate goal of code generation is to produce machine code that can run on the target machine, or a relocatable object file that can be linked with other libraries. Code generation and the optimizations performed at this stage are collectively referred to as the back end of the compiler.

Most compilers today first convert the result of parsing into an intermediate representation (IR) and then translate the intermediate representation into the final machine code. For example, Java is translated into JVM bytecode and C# is translated into IL, and each is then executed by its own virtual machine. JavaScript in IE9 is also translated into a bytecode first and is then executed by an interpreter or translated by a JIT. Even statically compiled languages such as C++ are first translated into an intermediate language and then into the final machine code. An intermediate representation is not necessarily a bytecode. The abstract syntax tree (AST) produced by the syntax analysis stage is itself a commonly used intermediate representation. The Expression Tree introduced in .NET 3.5, used by the Dynamic Language Runtime, takes an AST as its intermediate representation. Why is this practice so popular? Because translating through an intermediate language has the following benefits:

 

  1. An intermediate language separates the front end of the compiler from the back end, so that the two parts can be developed relatively independently.
  2. The same intermediate language can be the compilation target of many different source languages, and code for many different target machines can be generated from it. The CLR's IL is a typical example of this.
  3. Many optimizations can be applied directly to the intermediate language, so their results carry over to every target platform.

 

The compiler we are writing naturally needs an intermediate language too. For the sake of hands-on practice we will eventually define our own intermediate language, but getting the compiled programs to run that way takes a lot of work. To enjoy the sense of accomplishment of running the target code early, and to verify the correctness of the compiler front end, in this stage we will first compile miniSharp into the CLR's intermediate language, CIL (also known as MSIL), using the Reflection.Emit library built into .NET.

 

First, let's look at the characteristics of IL. IL is a bytecode stored in binary form inside .NET assemblies. What we usually see is its textual assembly form, as produced by disassembling with ILDASM or ILSpy. For example:

.method public hidebysig newslot
    instance int32 ComputeFac (
        int32 num
    ) cil managed
{
    // Method begins at RVA 0x2050
    // Code size 30 (0x1e)
    .maxstack 6
    .locals init (
        [0] int32
    )
    IL_0000: ldarg.1
    IL_0001: ldc.i4.1
    IL_0002: clt
    IL_0004: brfalse IL_0010
    IL_0009: ldc.i4.1
    IL_000a: stloc.0
    IL_000b: br IL_001c
    IL_0010: ldarg.1
    IL_0011: ldarg.0
    IL_0012: ldarg.1
    IL_0013: ldc.i4.1
    IL_0014: sub
    IL_0015: call instance int32 Fac::ComputeFac(int32)
    IL_001a: mul
    IL_001b: stloc.0
    IL_001c: ldloc.0
    IL_001d: ret
} // End of method Fac::ComputeFac

Compared with machine language, IL is a highly abstract intermediate language. An assembly contains rich metadata that maps directly to the classes and methods in the source code, while IL itself is used only to describe the logic of method bodies. It reflects little of how the CPU operates at run time and stays much closer to the statements and expressions of the source code. We can therefore say that IL is a very high-level intermediate language. IL is a stack machine. Note that this "stack" has nothing to do with the heap and stack of run-time memory. IL's stack is the evaluation stack, which does not physically exist at run time, but which we must imagine to exist in order to understand how IL executes. The role of the evaluation stack is to hold the intermediate results of computations, in contrast to a register machine, which keeps them in registers. Every IL instruction can operate only on the top of the evaluation stack.

 

Looking at the IL code above, the first instruction, ldarg.1, loads argument 1 (num, since argument 0 of an instance method is this) onto the top of the stack, and the second, ldc.i4.1, pushes the 32-bit integer constant 1 onto the evaluation stack. Note that ldc.i4.1 is a single instruction with no operand. IL has many such short-form instructions that eliminate or shrink the operand, thereby reducing the size of the target code. After these two instructions the evaluation stack holds two values: the top of the stack is the 32-bit constant 1, and beneath it is the value of argument 1. Next, the clt instruction pops both values and compares them. If the value popped later is less than the value popped first, it pushes the 32-bit integer 1 onto the evaluation stack; otherwise it pushes 0. Suppose the first argument passed to this method is 0. The process then goes as follows:

 

Instruction   Action                                 Evaluation stack (top first)
(start)                                              (empty)
ldarg.1       Push the argument value 0              0
ldc.i4.1      Push the constant 1                    1, 0
clt           Pop 1 and 0; 0 < 1, so push 1          1
The next instruction, brfalse, pops the value on top of the evaluation stack and decides whether to jump based on it. You can work out what every other instruction does in the same way. Any IL instruction pushes some values onto the stack, pops some values off the stack, or pops some values and then pushes others. These different behaviors are called stack transitions. Each instruction has a fixed stack transition behavior; once you understand it, you fully understand the instruction.
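To make the stack transitions of ldarg.1, ldc.i4.1 and clt concrete, here is a minimal sketch that replays them on an ordinary Stack<int>. It is only a toy model for illustration: the class and method names (EvalStackDemo, RunCompare) are made up for this example, and the CLR does not execute IL this way.

```csharp
using System;
using System.Collections.Generic;

// Toy model of the evaluation stack; names are hypothetical, not part of any API.
static class EvalStackDemo
{
    // Replays "ldarg.1; ldc.i4.1; clt" and returns the value left on top of the stack.
    public static int RunCompare(int arg1)
    {
        var stack = new Stack<int>();

        stack.Push(arg1);                     // ldarg.1 : ... -> ..., arg1
        stack.Push(1);                        // ldc.i4.1: ... -> ..., 1

        int value2 = stack.Pop();             // clt pops the top value first
        int value1 = stack.Pop();             // then the value beneath it
        stack.Push(value1 < value2 ? 1 : 0);  // clt: ..., value1, value2 -> ..., result

        return stack.Peek();
    }

    static void Main()
    {
        Console.WriteLine(RunCompare(0)); // 0 < 1 holds, so 1 is left on the stack
        Console.WriteLine(RunCompare(5)); // 5 < 1 does not hold, so 0 is left on the stack
    }
}
```

The comment after each line uses the same "before -> after" notation that the OpCodes documentation uses to describe stack transitions.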

 

The documentation for the OpCodes class on MSDN describes the stack transition of every instruction in detail. Whenever we need to know what an IL instruction means, it is the best reference. With a basic grasp of IL and the evaluation stack, most instructions are easy to understand. Here I will only explain a few special rules.

 

You will notice that many instructions in the IL instruction table come in several versions. For example, the ldloc instruction loads a local variable onto the top of the evaluation stack, and it exists as ldloc, ldloc.s, ldloc.0, ldloc.1, and so on. Here ldloc is the long form of the instruction and the others are short forms. Because IL is bytecode, each instruction is encoded in the assembly as a one- or two-byte opcode. The long form ldloc is itself encoded in two bytes (FE 0C) and takes a uint16 (two-byte) operand, so it occupies four bytes in total. A method rarely has anywhere near 65536 local variables; most have only one or two, and a dozen is already a lot. Using the long form everywhere would therefore be wasteful. The short form ldloc.s is encoded in a single byte (11) and its operand is a uint8 (one byte), so the instruction occupies only two bytes. However, ldloc.s can only load local variables whose index is in the range 0 to 255. Finally, there are four shortest forms for the first four, most commonly used local variables. For example, ldloc.0 is encoded in a single byte (06) and has no operand. When generating code, we need to pick the instruction according to the index of the local variable being accessed:

private void EmitLoadLocal(int locIndex)
{
    switch (locIndex)
    {
        case 0:
            m_ilgen.Emit(OpCodes.Ldloc_0);
            break;
        case 1:
            m_ilgen.Emit(OpCodes.Ldloc_1);
            break;
        case 2:
            m_ilgen.Emit(OpCodes.Ldloc_2);
            break;
        case 3:
            m_ilgen.Emit(OpCodes.Ldloc_3);
            break;
        default:
            if (locIndex <= 255)
            {
                m_ilgen.Emit(OpCodes.Ldloc_S, (byte)locIndex);
            }
            else
            {
                m_ilgen.Emit(OpCodes.Ldloc, (short)locIndex);
            }
            break;
    }
}
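The same idea applies to other instruction families. As an illustration that is not taken from the miniSharp source, the sketch below chooses the shortest ldc.i4 form for an integer constant and checks the result by running it through a DynamicMethod. The names ConstantEmitter, EmitLoadInt32 and RoundTrip are hypothetical; the opcodes and the ILGenerator and DynamicMethod calls are the standard System.Reflection.Emit ones.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical helper, analogous to EmitLoadLocal but for integer constants.
static class ConstantEmitter
{
    // Operand-free forms for the constants 0..8, indexed by value.
    static readonly OpCode[] ShortForms =
    {
        OpCodes.Ldc_I4_0, OpCodes.Ldc_I4_1, OpCodes.Ldc_I4_2,
        OpCodes.Ldc_I4_3, OpCodes.Ldc_I4_4, OpCodes.Ldc_I4_5,
        OpCodes.Ldc_I4_6, OpCodes.Ldc_I4_7, OpCodes.Ldc_I4_8
    };

    // Emits the shortest instruction that pushes 'value' and returns the opcode chosen.
    public static OpCode EmitLoadInt32(ILGenerator il, int value)
    {
        if (value == -1)
        {
            il.Emit(OpCodes.Ldc_I4_M1);               // no operand
            return OpCodes.Ldc_I4_M1;
        }
        if (value >= 0 && value <= 8)
        {
            il.Emit(ShortForms[value]);               // no operand
            return ShortForms[value];
        }
        if (value >= sbyte.MinValue && value <= sbyte.MaxValue)
        {
            il.Emit(OpCodes.Ldc_I4_S, (sbyte)value);  // one-byte signed operand
            return OpCodes.Ldc_I4_S;
        }
        il.Emit(OpCodes.Ldc_I4, value);               // four-byte operand
        return OpCodes.Ldc_I4;
    }

    // Builds and runs a method whose body is "push constant; ret".
    public static int RoundTrip(int value)
    {
        var method = new DynamicMethod("LoadConst", typeof(int), Type.EmptyTypes);
        ILGenerator il = method.GetILGenerator();
        EmitLoadInt32(il, value);
        il.Emit(OpCodes.Ret);
        var f = (Func<int>)method.CreateDelegate(typeof(Func<int>));
        return f();
    }

    static void Main()
    {
        foreach (int v in new[] { -1, 5, 100, 100000 })
        {
            Console.WriteLine(RoundTrip(v)); // each constant comes back unchanged
        }
    }
}
```

Note that the operand of ldc.i4.s is a signed byte, so its range is -128 to 127, whereas the operand of ldloc.s is an unsigned index.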

 

Next, we will start writing the IL code generator for the miniSharp language. As in the semantic analysis stage, all we need to write is an AST visitor implementation. Note that we need to generate not only the IL code of the methods but also the definitions of the assembly, module, classes, methods, constructors, fields, and so on. Reflection.Emit provides a Builder type for each of these structures. They are convenient to use, but you must pay attention to the following rules:

  1. To generate an exe, the PEFileKinds value given for the assembly's entry point should be ConsoleApplication (the default is Dll).
  2. Each class corresponds to a TypeBuilder. Once a class has been fully defined, its CreateType method must be called to actually create the type. Before a class is created, its parent class must already have been created. Therefore, classes must be created in inheritance order.
  3. TypeBuilder also has the methods of Type, such as GetConstructor and GetMethod, but these can only be used after CreateType has been called on the TypeBuilder. Therefore, we must keep track of the member information of unfinished types ourselves.
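Rules 2 and 3 can be sketched as follows. This is a minimal illustration, not code from the miniSharp compiler: the class names Animal and Dog, the method Legs, and the wrapper TypeOrderDemo are invented for the example, and the assembly is created in memory with AssemblyBuilderAccess.Run rather than saved to disk.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical demo: a base class and a derived class built with TypeBuilder.
static class TypeOrderDemo
{
    public static int BuildAndCallLegs()
    {
        var assembly = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("DemoAssembly"), AssemblyBuilderAccess.Run);
        ModuleBuilder module = assembly.DefineDynamicModule("DemoModule");

        // Both types can be *defined* in any order; the derived type names its parent builder.
        TypeBuilder baseBuilder = module.DefineType(
            "Animal", TypeAttributes.Public | TypeAttributes.Class);
        TypeBuilder derivedBuilder = module.DefineType(
            "Dog", TypeAttributes.Public | TypeAttributes.Class, baseBuilder);

        // Rule 3: keep the MethodBuilder ourselves. It is what we would pass to a
        // call instruction, because baseBuilder.GetMethod is unusable before CreateType.
        MethodBuilder legs = baseBuilder.DefineMethod(
            "Legs", MethodAttributes.Public, typeof(int), Type.EmptyTypes);
        ILGenerator il = legs.GetILGenerator();
        il.Emit(OpCodes.Ldc_I4_4);
        il.Emit(OpCodes.Ret);

        // Rule 2: create the parent first, then the child.
        Type animal = baseBuilder.CreateType();
        Type dog = derivedBuilder.CreateType();

        // After CreateType, ordinary reflection works on the finished types.
        object instance = Activator.CreateInstance(dog);
        return (int)dog.GetMethod("Legs").Invoke(instance, null);
    }

    static void Main()
    {
        Console.WriteLine(BuildAndCallLegs()); // calls the inherited Legs method
    }
}
```

In a real code generator the saved builders would typically live in a dictionary keyed by the compiler's own method or type symbols, which is one way to read the "save the member information" advice above.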

 

The following code visits the classes in inheritance order, so that each base class is generated before the classes derived from it:

public override AstNode VisitProgram(Program ast)
{
    List<ClassDecl> classesInHierarchyOrder = new List<ClassDecl>();

    // classes without a base class come first
    var topBaseClasses = from c in ast.Classes where c.BaseClass.Type == null select c;
    classesInHierarchyOrder.AddRange(topBaseClasses);

    // repeatedly append classes whose base class is already in the list
    while (classesInHierarchyOrder.Count < ast.Classes.Count)
    {
        foreach (var c in ast.Classes)
        {
            // skip classes that are already ordered, so that none is added twice
            if (classesInHierarchyOrder.Contains(c)) continue;

            foreach (var b in classesInHierarchyOrder.ToArray())
            {
                if (c.BaseClass.Type == b.Type)
                {
                    classesInHierarchyOrder.Add(c);
                }
            }
        }
    }

    foreach (var c in classesInHierarchyOrder)
    {
        Visit(c);
    }

    Visit(ast.MainClass);
    return ast;
}

 

Next is the generation of MainClass. Here we use a trick: in the CLR, a static class is simply a class that is both abstract and sealed.

public override AstNode VisitMainClass(MainClass ast)
{
    m_currentType = m_module.DefineType(
        ast.Type.Name, TypeAttributes.Class | TypeAttributes.Abstract | TypeAttributes.Sealed);
    m_currentMethod = m_currentType.DefineMethod(
        "Main", MethodAttributes.Public | MethodAttributes.Static, typeof(void), new[] { typeof(string[]) });
    m_ilgen = m_currentMethod.GetILGenerator();

    foreach (var s in ast.Statements)
    {
        Visit(s);
    }

    m_ilgen.Emit(OpCodes.Ret);
    m_currentType.CreateType();
    m_mainMethod = m_currentMethod;
    return ast;
}

 

Once the classes and methods are defined, we generate the code of the method bodies. The most important constructs to translate here are statements and expressions. There are two rules to observe:

  1. After an expression has been evaluated, its result should be left on top of the evaluation stack.
  2. After a statement has been executed, the evaluation stack should be empty.

If these rules are not followed, the generated code may well be wrong, so be very careful. The following shows how code is generated for two basic statements, if-else and while.

public override AstNode VisitIfElse(IfElse ast)
{
    var ifBlock = m_ilgen.DefineLabel();
    var elseBlock = m_ilgen.DefineLabel();
    var endif = m_ilgen.DefineLabel();

    Visit(ast.Condition);
    //the e-stack should have a bool value
    m_ilgen.Emit(OpCodes.Brfalse, elseBlock);

    //if block
    m_ilgen.MarkLabel(ifBlock);
    Visit(ast.TruePart);
    m_ilgen.Emit(OpCodes.Br, endif);

    //elseblock
    m_ilgen.MarkLabel(elseBlock);
    Visit(ast.FalsePart);

    //after if
    m_ilgen.MarkLabel(endif);
    return ast;
}

public override AstNode VisitWhile(While ast)
{
    var beforeWhile = m_ilgen.DefineLabel();
    var afterWhile = m_ilgen.DefineLabel();

    m_ilgen.MarkLabel(beforeWhile);
    Visit(ast.Condition);
    //the e-stack should have a bool value
    m_ilgen.Emit(OpCodes.Brfalse, afterWhile);

    Visit(ast.LoopBody);
    m_ilgen.Emit(OpCodes.Br, beforeWhile);

    m_ilgen.MarkLabel(afterWhile);
    return ast;
}

Here the if statement is translated with the brfalse instruction. IL actually has many conditional branch instructions, such as blt and bge, into which a construct like if (a > b) can be translated directly, and that is more efficient. This time I took the lazy route: comparisons such as greater-than and less-than are all computed with value-producing instructions such as clt and cgt, and brfalse then performs the conditional jump. The code above also shows how to use Label in the Emit API. When translating assignment statements and array element assignment statements, note that assigning to a local variable, to a method parameter, and to a field of a class use different instructions with different stack transitions, so each case must be handled separately. For example, a field access instruction such as ldfld (or stfld when assigning) requires the target object to be pushed onto the stack first. If the target is this, use the ldarg.0 instruction (argument 0 of an instance method is always the this reference).
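For comparison, here is a minimal sketch of the more direct translation mentioned above, in which the comparison and the jump are fused into a single bge instruction. It is not part of the miniSharp generator; BranchDemo and BuildMin are hypothetical names, and a DynamicMethod is used only so that the emitted IL can be run immediately.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical demo of a fused compare-and-branch.
static class BranchDemo
{
    // Emits the body of: static int Min(int a, int b) { if (a < b) return a; else return b; }
    public static Func<int, int, int> BuildMin()
    {
        var method = new DynamicMethod(
            "Min", typeof(int), new[] { typeof(int), typeof(int) });
        ILGenerator il = method.GetILGenerator();
        Label elseBlock = il.DefineLabel();

        il.Emit(OpCodes.Ldarg_0);         // push a (static method, so argument 0 is a)
        il.Emit(OpCodes.Ldarg_1);         // push b
        il.Emit(OpCodes.Bge, elseBlock);  // pop both; jump when a >= b, i.e. when !(a < b)

        // if block
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ret);

        // else block
        il.MarkLabel(elseBlock);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Ret);

        return (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));
    }

    static void Main()
    {
        Func<int, int, int> min = BuildMin();
        Console.WriteLine(min(3, 5)); // takes the if block
        Console.WriteLine(min(7, 2)); // takes the else block
    }
}
```

The condition a < b becomes a branch on its negation (bge) to the else block, replacing the clt plus brfalse pair with one instruction. Doing this in an AST visitor would require VisitIfElse to inspect the condition node instead of simply visiting it.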

 

Next we demonstrate the translation of two basic kinds of expression: binary operators and method calls.

public override AstNode VisitBinary(Binary ast)
{
    //push operands
    Visit(ast.Left);
    Visit(ast.Right);

    switch (ast.Operator)
    {
        case BinaryOperator.Add:
            m_ilgen.Emit(OpCodes.Add);
            break;
        case BinaryOperator.Substract:
            m_ilgen.Emit(OpCodes.Sub);
            break;
        case BinaryOperator.Multiply:
            m_ilgen.Emit(OpCodes.Mul);
            break;
        case BinaryOperator.Divide:
            m_ilgen.Emit(OpCodes.Div);
            break;
        case BinaryOperator.Less:
            m_ilgen.Emit(OpCodes.Clt);
            break;
        case BinaryOperator.Greater:
            m_ilgen.Emit(OpCodes.Cgt);
            break;
        case BinaryOperator.Equal:
            m_ilgen.Emit(OpCodes.Ceq);
            break;
        case BinaryOperator.LogicalAnd:
            m_ilgen.Emit(OpCodes.And);
            break;
        case BinaryOperator.LogicalOr:
            m_ilgen.Emit(OpCodes.Or);
            break;
        default:
            m_ilgen.Emit(OpCodes.Pop);
            m_ilgen.Emit(OpCodes.Pop);
            m_ilgen.Emit(OpCodes.Ldc_I4_0);
            break;
    }
    return ast;
}

public override AstNode VisitCall(Call ast)
{
    var methodRInfo = GetClrMethod(ast.Method.MethodInfo);

    //push target object
    Visit(ast.Target);

    //push arguments
    foreach (var arg in ast.Arguments)
    {
        Visit(arg);
    }

    m_ilgen.EmitCall(OpCodes.Call, methodRInfo, null);
    return ast;
}

Note that the && and || operators are not compiled into short-circuit evaluation here, so their meaning differs slightly from that of C#. Supporting short-circuit evaluation is easy, and you can try it yourself. When translating binary operators, the default branch should never be reached if semantic analysis worked correctly. It is there only as error handling, and even so it must keep the evaluation stack balanced: it pops the two operands and pushes one result. When translating a method call, the target object of the call is pushed onto the stack first, then the arguments are pushed from left to right, and finally the call instruction performs the call.
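One possible shape for a short-circuit && is sketched below: the right operand is skipped with a branch when the left operand is false. This is an illustration under assumptions, not the author's implementation. ShortCircuitDemo, Right and RightEvaluations are hypothetical names, and the right operand is a call with a visible side effect so that skipping it can be observed.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical demo of short-circuit evaluation for "left && Right()".
static class ShortCircuitDemo
{
    public static int RightEvaluations;

    // Stands in for the right operand; counts how often it is evaluated.
    public static bool Right()
    {
        RightEvaluations++;
        return true;
    }

    // Emits the body of: static bool And(bool left) { return left && Right(); }
    public static Func<bool, bool> BuildAnd()
    {
        // skipVisibility: true lets the dynamic method call into this non-public class.
        var method = new DynamicMethod(
            "And", typeof(bool), new[] { typeof(bool) },
            typeof(ShortCircuitDemo).Module, true);
        ILGenerator il = method.GetILGenerator();
        Label skipRight = il.DefineLabel();
        Label end = il.DefineLabel();
        MethodInfo right = typeof(ShortCircuitDemo).GetMethod("Right");

        il.Emit(OpCodes.Ldarg_0);             // push left
        il.Emit(OpCodes.Brfalse, skipRight);  // pop it; if false, do not evaluate right
        il.Emit(OpCodes.Call, right);         // left was true: the result is right's value
        il.Emit(OpCodes.Br, end);

        il.MarkLabel(skipRight);
        il.Emit(OpCodes.Ldc_I4_0);            // left was false: the result is false

        il.MarkLabel(end);                    // exactly one value is on the stack on both paths
        il.Emit(OpCodes.Ret);

        return (Func<bool, bool>)method.CreateDelegate(typeof(Func<bool, bool>));
    }

    static void Main()
    {
        Func<bool, bool> and = BuildAnd();
        Console.WriteLine(and(false) + " " + RightEvaluations); // Right() was skipped
        Console.WriteLine(and(true) + " " + RightEvaluations);  // Right() ran once
    }
}
```

Both paths reach the end label with exactly one value on the evaluation stack, which keeps the rule that an expression leaves its result on top of the stack. In VisitBinary this would mean handling LogicalAnd before the right operand is visited, instead of visiting both operands up front.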

 

Once CreateType has been called on every TypeBuilder, a final call to AssemblyBuilder.Save writes the target assembly to disk!

public void Create(Ast.AstNode ast, string url)
{
    Visit(ast);
    Debug.Assert(m_assembly != null);

    m_assembly.SetEntryPoint(m_mainMethod, PEFileKinds.ConsoleApplication);
    m_assembly.Save(url);
}

 

Now we can try it out. Let's compile a piece of miniSharp code (a factorial calculation):

static class ProgramEntry
{
    // a comment (written in Chinese in the original source)
    public static void Main(string[] args)
    {
        Fac o;
        o = new Fac();
        System.Console.WriteLine(o.ComputeFac(8));
    }
}

class Fac
{
    public int ComputeFac(int num)
    {
        int num_aux;
        if (num < 1)
            num_aux = 1;
        else
            num_aux = num * (this.ComputeFac(num - 1));
        return num_aux;
    }
}

 

Generated assembly:

Running result:

 

Doesn't it feel like an accomplishment to see your own compiler compile source code correctly? If all you want is to build a managed programming language, generating IL is the last step. But the CLR has done so much of the work for us that this cannot satisfy our thirst for knowledge. So in the next stage we will implement the compiler back end ourselves, from the intermediate language down to the target machine code. From the next installment on, the interval between posts in this series will become long and irregular, because I also need to learn and practice.

I hope you will continue to follow my VBF project at https://github.com/Ninputer/VBF and my microblog at http://weibo.com/ninputer. Thank you for your support!
