Excerpted from http://blog.chinaunix.net/uid-26548237-id-3839979.html
"Self-Cultivation of Programmers", Chapter 2: Compiling and Linking
2.1 The hidden process
The classic C "Hello World" program is something almost every programmer can write with their eyes closed and compile and run in one go. It has become the default way to get started with a language and to test a development environment.
- #include <stdio.h>
- int main()
- {
-     printf("Hello world\n");
-     return 0;
- }
Under Linux, compiling this program with GCC takes only a simple command such as gcc test.c (assuming the source file is named test.c), which by default produces the executable a.out.
In fact, this process can be broken down into 4 steps: preprocessing, compilation, assembly, and linking. The details are as shown in the figure.
2.1.1 Preprocessing
First, the source file test.c and the header files it uses, such as stdio.h, are processed by the preprocessor cpp into a .i file. This first step is equivalent to the following command (-E means preprocess only):
- gcc -E test.c -o test.i
Preprocessing mainly handles the preprocessor directives in the source file, that is, the lines beginning with "#", such as "#include" and "#define". The main rules are as follows:
(1) Delete all "#define" directives and expand all macro definitions;
(2) Process all conditional directives, such as "#if", "#ifdef", "#elif", "#else", and "#endif";
(3) Process each "#include" directive by inserting the included file at the position of the directive. Note that this is recursive: the included file may itself include other files;
(4) Delete all comments, both "//" and "/* */";
(5) Add line number and file name markers, e.g. # 2 "test.c" 2, so that the compiler can generate line number information for debugging and show line numbers in compile errors and warnings;
(6) Keep all #pragma directives, because the compiler needs them.
The preprocessed .i file contains no macro definitions, because all macros have been expanded and all included files have been inserted into it. So when we cannot tell whether a macro is defined correctly or whether a header file is included correctly, we can inspect the preprocessed file to find the problem.
2.1.2 Compilation
Compilation runs lexical analysis, syntax analysis, semantic analysis, and optimization on the preprocessed file and produces the corresponding assembly code file. This is the core of the whole build process and also one of its most complex parts. The compilation step is equivalent to the following command:
- gcc -S test.i -o test.s
Current versions of GCC merge preprocessing and compilation into a single step, carried out by a program called cc1. We can use the following command directly:
- gcc -S test.c -o test.s
Either way we get the assembly output file test.s. For C code, the program that preprocesses and compiles is cc1; for C++ there is a corresponding program called cc1plus.
So the gcc command is really just a wrapper around these background programs: depending on its arguments, it calls the preprocessing compiler cc1, the assembler as, and the linker ld.
2.1.3 Assembly
The assembler turns assembly code into instructions the machine can execute, and almost every assembly statement corresponds to one machine instruction. Compared with the compiler, the assembler's job is therefore fairly simple: there is no complex syntax, no semantics, and no instruction optimization. It only needs to translate one by one according to the table that maps assembly instructions to machine instructions.
The assembly step can be done by calling the assembler as:
- as test.s -o test.o
or
- gcc -c test.s -o test.o
Alternatively, use the gcc command to start from the C source file and preprocess, compile, and assemble it, producing the object file directly:
- gcc -c test.c -o test.o
2.1.4 Linking
Why does the assembler output an object file instead of an executable file directly? What exactly does linking involve, and why is it needed? We can call ld to produce a program that can actually run. In practice, many files have to be linked together to obtain a.out, the final executable file.
2.2 What the compiler does
From the most intuitive point of view, a compiler is a tool that translates a high-level language into machine language. For example, a program written in C or C++ can be translated by the compiler into instructions and data that the machine can execute.
Compilation is generally divided into 6 steps: scanning (lexical analysis), parsing (syntax analysis), semantic analysis, source code optimization, code generation, and target code optimization. The entire process is as shown in the figure.
We will use a simple example to describe the process from source code to final target code. Suppose we have the following line of C source code:
- array[index] = (index + 4) * (2 + 6)
- CompilerExpression.c
2.2.1 Lexical analysis
First the source code is fed to the scanner. The scanner's task is simple: it performs lexical analysis, using an algorithm similar to a finite state machine to split the character sequence of the source code into a series of tokens. The line above, for example, contains 28 non-whitespace characters, and scanning it produces 16 tokens, as shown in the table below.
The tokens produced by lexical analysis generally fall into the following categories: keywords, identifiers, literals (numbers, strings, and so on), and special symbols (such as the plus sign and the equals sign). While recognizing tokens, the scanner also does other work: for example, it stores identifiers in the symbol table and numeric and string constants in the literal table, for use in later steps.
A program called lex can implement lexical scanning: it splits the input string into tokens according to lexical rules that the user describes in advance.
In addition, for languages with a preprocessing stage, such as C, macro substitution and file inclusion are generally not considered part of the compiler proper and are handed over to a standalone preprocessor.
2.2.2 Syntax Analysis
The parser then analyzes the tokens produced by the scanner and builds a syntax tree. The whole analysis uses context-free grammar techniques. Put simply, the syntax tree generated by the parser is a tree whose nodes are expressions. The C statement in our example is an expression, and a complex expression is composed of simpler expressions. After parsing, it forms the syntax tree shown in the figure.
As you can see, the entire statement is treated as an assignment expression: its left side is an array expression, its right side is a multiplication expression, the array expression consists of two symbol expressions, and so on. Symbols and numbers are the smallest expressions; they are not made up of other expressions, so they are usually the leaf nodes of the syntax tree. During parsing, the precedence and meaning of many operators are also determined. For example, multiplication takes precedence over addition, while a parenthesized expression takes precedence over multiplication. If an expression is illegal, for example because of mismatched parentheses or a missing operator, the compiler reports the error in the parsing phase.
Syntax analysis also has a ready-made tool, called yacc. Like lex, it works from rules supplied by the user: given the grammar rules, it parses the input token sequence and builds a syntax tree.
2.2.3 Semantic Analysis
Semantic analysis is performed by the semantic analyzer. Parsing only checks the syntactic form of an expression; it does not know whether the statement is actually meaningful. For example, multiplying two pointers in C is meaningless, yet the statement is syntactically legal. The semantics a compiler can analyze are static semantics, meaning semantics that can be determined at compile time. The counterpart is dynamic semantics, which can only be determined at run time.
Static semantics typically include matching declarations against types and converting between types. For example, when a floating-point expression is assigned to an integer expression, an implicit conversion from floating point to integer is involved, and semantic analysis has to handle it. When a floating-point value is assigned to a pointer, the semantic analyzer finds that the types do not match and the compiler reports an error.
Dynamic semantics generally refer to semantic problems that only show up at run time; for example, using 0 as a divisor is a run-time semantic error.
After the semantic analysis phase, every expression in the syntax tree is annotated with a type, and where implicit conversions are required, the semantic analyzer inserts the corresponding conversion nodes into the syntax tree. After semantic analysis, the syntax tree above takes the form shown in the figure.
As you can see, each expression, including symbols and numbers, is annotated with a type. Almost all the expressions in our example are integers, so no conversion is needed.
The semantic analyzer also updates the types of the symbols in the symbol table.
2.2.4 Intermediate language generation
Modern compilers optimize at many levels, and there is often an optimization pass at the source code level. The source-level optimizer described here may be defined differently, or differ in other ways, from one compiler to another. It optimizes at the source code level. In the example above, it can see that the expression (2 + 6) can be optimized, because its value can be determined at compile time.
The optimized syntax tree is as shown in the figure.
You can see that the expression (2 + 6) has been optimized to 8. In practice it is difficult to optimize directly on the syntax tree, so the source-level optimizer usually converts the whole syntax tree into intermediate code, a sequential representation of the syntax tree that is already very close to target code. It is, however, generally independent of the target machine and the runtime environment: for example, it contains no data sizes, variable addresses, or register names. Intermediate code takes different forms in different compilers; three-address code is a common one.
Intermediate code allows the compiler to be split into a front end and a back end. The front end generates machine-independent intermediate code, and the back end converts the intermediate code into target machine code. A cross-platform compiler can therefore use the same front end for every platform, together with several back ends for the different machine platforms.
2.2.5 Target code generation and optimization
The intermediate code produced by the source-level optimizer marks the point after which everything belongs to the compiler back end. The back end mainly consists of the code generator and the target code optimizer.
The code generator converts the intermediate code into target machine code. This step depends heavily on the target machine, because different machines have different word lengths, registers, integer data types, and floating-point data types.
For the intermediate code of the above example, the code generator might generate the following code sequence:
Finally, the target code optimizer optimizes the target code above, for example by choosing a suitable addressing mode or by using a shift instead of a multiplication.
After scanning (lexical analysis), parsing, semantic analysis, source code optimization, target code generation, and target code optimization, the source code has finally been compiled into target code. But one problem remains: the addresses of index and array have not been determined. If we use the assembler to turn this target code into instructions that can really run on the machine, where do the addresses of index and array come from? If index and array are defined in the same compilation unit as the source code above, the compiler can allocate space for them and determine their addresses. But what if they are defined in other program modules?
In fact, the final absolute addresses of global variables and functions defined in other modules are only determined at final link time. So a modern compiler compiles one source file into one unlinked object file, and the linker eventually links these object files together to form an executable file.
2.3 The linker is older than the compiler
2.4 Module assembly: static linking
Each source code module is compiled independently, and the modules are then "assembled" as needed; this assembly process is linking. The main job of linking is to handle the places where modules refer to each other, so that the modules fit together correctly.
In principle, the linker's work is nothing more than fixing up the references that some instructions make to the addresses of other symbols. The linking process mainly consists of address and space allocation, symbol resolution, and relocation.
The most basic static linking process is as shown in the figure. The source file of each module (such as a .c file) is compiled by the compiler into an object file (such as a .o file), and the object files and libraries are linked together to form the final executable file. The most common library is the runtime library, a collection of basic functions that programs need in order to run.
A library is really a package of object files: some of the most commonly used code is compiled into object files and then packed together for storage.
During linking, instructions that refer to functions defined in other object files have to be adjusted again, and the same problem applies to variables defined in other object files.
Let's look at this process with concrete CPU instructions. Suppose a global variable called var is defined in object file A, and we want to access it from object file B. For example, object file B contains this instruction:
- movl $0x2a, var
This instruction assigns 0x2a to the variable var, which is equivalent to the C statement var = 42. When we compile object file B, we get the machine code for this instruction, as shown in the figure.
Since the compiler does not know the address of var when it compiles object file B, it sets the target address in the mov instruction to 0 and leaves it to the linker to fix when A and B are linked. Suppose that after A and B are linked, the address of var is determined to be 0x1000; the linker then changes the target address part of the instruction to 0x1000. This process of fixing up addresses is called relocation, and each place that needs to be fixed is called a relocation entry.
PS: Besides its own data and binary code, each object file also provides three tables: the unresolved symbol table, the export symbol table, and the address relocation table.
(1) The unresolved symbol table lists every symbol that is referenced in this compilation unit but not defined in it, together with the addresses where it appears;
(2) The export symbol table lists the symbols that this compilation unit defines and is willing to provide to other units, together with their addresses;
(3) The address relocation table records every reference in this compilation unit to its own addresses.
The compiler puts extern-declared variables into the unresolved symbol table and not into the export symbol table (external linkage).
The compiler puts static-declared global variables into neither the unresolved symbol table nor the export symbol table, so other units cannot use them (internal linkage).
Ordinary variables and functions are put into the export symbol table.
Having read through this chapter and written these notes, I now have a preliminary understanding of compiling and linking, though only a superficial one. To be continued.
References: "Self-Cultivation of Programmers", online sources, etc.