C/C++ compile and link process

Source: Internet
Author: User

Linking in detail
Some people who write C++ programs (C++ is assumed throughout) are at a loss when they meet an "unresolved external symbol" or "duplicated external symbols" error, because such an error message cannot be traced to a particular line of source. Others do not understand why some parts of the language are (or are not) designed the way they are. After reading this article, you may find some answers.
First, look at how we write a program. If you use an IDE (Visual Studio, Eclipse, Dev-C++, and so on), you may never see how a program is put together (which is why many people object to beginners using an IDE). With an IDE, all you do is create a set of .cpp and .h files in a project and click "Compile" in the menu when you have finished writing. Programmers did not always work this way. They would open an editor, write the code as plain text, and then type at the command line:
cc 1.cpp -o 1.o
cc 2.cpp -o 2.o
cc 3.cpp -o 3.o
Here cc stands for some C++ compiler, followed by the .cpp file to compile, with -o naming the output file (please forgive me for not using any particular popular compiler as the example). The current directory will then contain:
1.o 2.o 3.o
Finally, the programmer types
link 1.o 2.o 3.o -o a.out
to produce the final executable, a.out. A modern IDE follows the same steps; it just automates them.
Let's analyze the process above and see what we can learn.
First, each .cpp file is compiled separately. During each compilation, unless one .cpp file #includes another .cpp file (a serious mistake in C++ code), the compiler knows only about the one .cpp file it is currently compiling and is completely unaware that the others exist.
Second, each .cpp file compiles to a .o file, and the linker reads all of these .o files to produce the executable.
With this intuition in place, let's look at how a C++ program is organized.

First you need to know a few concepts:
Compilation: the process by which the compiler translates source code, which exists as text, into an object file in machine language.
Compilation unit: in C++, each .cpp file is a compilation unit. As the compilation process above shows, compilation units know nothing about one another.
Object file: the file produced by compilation. It contains all the code and data of the compilation unit in machine-code form, plus some other information.

Let's look at the compilation process. We skip parsing and the like and go straight to object file generation. Suppose we have a file 1.cpp:
int n = 1;

void f()
{
    ++n;
}

The object file 1.o compiled from it will have a region (assume it is called the binary segment) containing the data and function above, namely n and f. Expressed as file offsets, it might look like this:
Offset   Content   Length
0x000    n         4
0x004    f         ??
Note: this is only a guess and does not represent the real layout of an object file. The items in an object file are not necessarily contiguous, not necessarily in this order, and certainly do not necessarily start at 0x000.
Now look at the contents of function f starting at 0x004 (a guess for the x86 platform):
0x004    inc DWORD PTR [0x000]
0x00?    ret
Note that ++n has been translated to inc DWORD PTR [0x000], that is, add 1 to the DWORD (4 bytes) at address 0x000 of this unit.

Now suppose there is another file, 2.cpp, as follows:
extern int n;

void g()
{
    ++n;
}
Then the binary segment of its object file 2.o should be:
Offset   Content   Length
0x000    g         ??
Why is there no space for n (that is, no definition of n)? Because n is declared extern, which says that n is defined in a different compilation unit. Remember that at compile time the compiler cannot know anything about other compilation units, so it does not know where n actually is, and it cannot fill in the ??? part of inc DWORD PTR [???] in the binary code of g. What can be done? The job has to be left to the linker. So that the linker knows which addresses have not been filled in, the object file also carries an "unresolved symbol table". Likewise, the object file that provides the definition of n (that is, 1.o) carries an "export symbol table", which tells the linker which addresses it can provide.
To recap: each object file, besides its own data and binary code, must provide at least two tables, the unresolved symbol table and the export symbol table, which tell the linker what the file needs and what it can provide. The next question is how to match entries between the two tables. This calls for a new concept: the symbol. In C++, every variable and function has its own symbol. For example, the symbol of the variable n is "n". The symbol of a function is more complex: the function name is combined with its parameters, calling convention, and so on to produce a unique string. The symbol of f might be "_f" (this varies from compiler to compiler).
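To make the idea of a function symbol concrete, here is a minimal sketch of how a name and its parameter types might be combined into one unique string. The make_symbol helper and its "_name__type" format are invented for illustration only; real compilers use their own, more elaborate encodings.

```cpp
#include <string>
#include <vector>

// Hypothetical scheme: join the function name and its parameter types so
// that overloads such as f(int) and f(double) get distinct symbols.
// This is NOT the encoding used by any real compiler.
std::string make_symbol(const std::string& name,
                        const std::vector<std::string>& param_types)
{
    std::string symbol = "_" + name;
    for (const std::string& type : param_types) {
        symbol += "__" + type;
    }
    return symbol;
}
```

Under this toy scheme a function f with no parameters gets the symbol "_f", as in the tables that follow, while f(int) and f(double) would get two different symbols.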
So the export symbol table of 1.o is
Symbol   Address
n        0x000
_f       0x004
and its unresolved symbol table is empty.
The export symbol table of 2.o is
Symbol   Address
_g       0x000
and its unresolved symbol table is
Symbol   Address
n        0x001
Here 0x001 is the address, within the binary encoding of inc DWORD PTR [???] that starts at 0x000, where the ??? is stored (this assumes that bytes 2 through 5 of the machine code for inc hold the absolute address of the operand; check the manual for the exact encoding). The table tells the linker that at location 0x001 of this compilation unit there is an address whose value is unknown but whose symbol is n.
At link time, the linker finds the unresolved symbol n in 2.o, searches the export symbol tables of all compilation units, finds the exported symbol n in 1.o, and writes the address of n, 0x000, into location 0x001 of 2.o.
"Wait," you may object. If that is done, won't the body of g become inc DWORD PTR [0x000], which, as we understood earlier, adds 1 to the 4 bytes at address 0x000 of this unit rather than to the corresponding location in 1.o? Quite right. Because the addresses of every compilation unit start at 0, addresses would collide when the units are finally stitched together. So the linker adjusts the addresses of each unit as it stitches them. In this example, suppose address 0x00000000 of 2.o is placed at 0x00001000 in the executable and address 0x00000000 of 1.o is placed at 0x00002000. Then the export symbol table of 1.o is effectively
Symbol   Address
n        0x000 + 0x2000
_f       0x004 + 0x2000
and its unresolved symbol table is empty.
The export symbol table of 2.o is
Symbol   Address
_g       0x000 + 0x1000
and its unresolved symbol table is
Symbol   Address
n        0x001 + 0x1000
So the final code of g becomes inc DWORD PTR [0x000 + 0x2000].
One hole remains: since the final address of n is 0x2000, the earlier code of f, inc DWORD PTR [0x000], is now wrong. To handle this, the object file also provides a table called the address redirection table (usually known as a relocation table).
For 1.o, the address redirection table is
Address
0x005
This table needs no symbols. When the linker processes it and finds that location 0x005 holds an address that needs redirecting, it simply adds 0x2000 to the 4 bytes starting at 0x005.
To summarize: when the compiler compiles a .cpp file into an object file, the object file contains, in addition to the data and code of that .cpp file, at least three tables: the unresolved symbol table, the export symbol table, and the address redirection table.
The unresolved symbol table lists every symbol that is referenced in the compilation unit but not defined in it, together with the addresses at which it is referenced.
The export symbol table lists the symbols that the compilation unit defines and is willing to provide to other compilation units, together with their addresses.
The address redirection table records every reference the compilation unit makes to its own addresses.
When linking, the linker first decides where each object file goes in the final executable. It then walks the address redirection table of every object file and redirects the addresses recorded there (that is, adds the address at which the compilation unit actually starts in the executable). Next, it walks the unresolved symbol table of every object file, finds the matching symbol among all exported symbols, and writes the symbol's actual address (its offset plus the start address, in the executable, of the compilation unit that defines it) at the location recorded in the unresolved symbol table. Finally, it writes the contents of all object files at their respective positions, does some other work, and an executable is produced.
After the final link of 1.o 2.o ...., the resulting executable probably looks like this:
0x00000000  ????  (some other information)
....
0x00001000  inc DWORD PTR [0x00002000]  // start of 2.o, which is the definition of g
0x00001005  ret                         // assuming inc is 5 bytes; this is the end of g
....
0x00002000  0x00000001                  // start of 1.o, which is the definition of n (initialized to 1)
0x00002004  inc DWORD PTR [0x00002000]  // start of f
0x00002009  ret                         // assuming inc is 5 bytes; this is the end of f
...
...
Real linking is more complicated, because the data and code in a real object file are divided into several sections and redirection has to be carried out section by section, but the principle is the same.
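The linking steps described above can be sketched as a toy linker. This is only a minimal model under the same assumptions as the text (a 5-byte inc whose last 4 bytes hold a little-endian address, and units placed at 0x1000 and 0x2000). The ObjectFile struct, link_units, and the other helper names are invented for illustration and do not correspond to any real object file format.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy object file: raw bytes plus the three tables described in the text.
struct ObjectFile {
    std::vector<std::uint8_t> bytes;                                // code and data, offsets start at 0
    std::map<std::string, std::uint32_t> exports;                   // symbol -> offset of its definition
    std::vector<std::pair<std::string, std::uint32_t>> unresolved;  // symbol -> offset of the address to fill in
    std::vector<std::uint32_t> redirections;                        // offsets of addresses that refer to this unit
    std::uint32_t base = 0;                                         // where the unit is placed in the executable
};

// Read a 32-bit little-endian value stored at `offset`.
std::uint32_t read32(const std::vector<std::uint8_t>& bytes, std::uint32_t offset)
{
    std::uint32_t value = 0;
    for (int i = 3; i >= 0; --i) {
        value = (value << 8) | bytes.at(offset + i);
    }
    return value;
}

// Write a 32-bit little-endian value at `offset`.
void write32(std::vector<std::uint8_t>& bytes, std::uint32_t offset, std::uint32_t value)
{
    for (int i = 0; i < 4; ++i) {
        bytes.at(offset + i) = static_cast<std::uint8_t>(value >> (8 * i));
    }
}

// Patch every unit in place. Each unit's `base` must already be set by the caller.
void link_units(std::vector<ObjectFile>& units)
{
    // Step 1: redirect addresses that refer to the unit itself.
    for (ObjectFile& unit : units) {
        for (std::uint32_t offset : unit.redirections) {
            write32(unit.bytes, offset, read32(unit.bytes, offset) + unit.base);
        }
    }
    // Step 2: compute the final address of every exported symbol.
    std::map<std::string, std::uint32_t> addresses;
    for (const ObjectFile& unit : units) {
        for (const auto& entry : unit.exports) {
            addresses[entry.first] = unit.base + entry.second;
        }
    }
    // Step 3: fill in every unresolved reference.
    for (ObjectFile& unit : units) {
        for (const auto& entry : unit.unresolved) {
            const auto found = addresses.find(entry.first);
            if (found == addresses.end()) {
                throw std::runtime_error("unresolved external symbol: " + entry.first);
            }
            write32(unit.bytes, entry.second, found->second);
        }
    }
}

const std::uint8_t kInc = 0xFF;  // placeholder one-byte opcode, not the real x86 encoding
const std::uint8_t kRet = 0xC3;  // ret

// 1.o from the text: n at 0x000, f at 0x004.
ObjectFile make_unit1()
{
    ObjectFile unit;
    unit.bytes = {1, 0, 0, 0, kInc, 0, 0, 0, 0, kRet};
    unit.exports = {{"n", 0x000}, {"_f", 0x004}};
    unit.redirections = {0x005};
    return unit;
}

// 2.o from the text: g at 0x000, with an unresolved reference to n at 0x001.
ObjectFile make_unit2()
{
    ObjectFile unit;
    unit.bytes = {kInc, 0, 0, 0, 0, kRet};
    unit.exports = {{"_g", 0x000}};
    unit.unresolved = {{"n", 0x001}};
    return unit;
}
```

With 1.o placed at 0x2000 and 2.o at 0x1000, calling link_units leaves 0x2000 in the address field of both f and g, which matches the executable layout described in the text. The sketch does not detect duplicated symbols; a later definition simply overwrites an earlier one.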



Now we can look at a few classic link errors:
unresolved external symbol ...
Clearly, the linker found an unresolved symbol but could not find a matching entry in any export symbol table.
The solution, of course, is to provide a definition of the symbol in some compilation unit (note that the symbol may be a variable or a function). Also check whether a file that should be linked has been left out.
duplicated external symbols ...
Here the same symbol appears more than once among the export symbol tables, so the linker cannot decide which one to use. The cause may be a name clash, or something else.
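The two errors above amount to simple checks over the symbol tables. The following is a minimal sketch of those checks; SymbolTables, LinkDiagnostics, and diagnose are hypothetical names introduced here, and a real linker reports these errors in its own way.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one object file's symbol tables (names only).
struct SymbolTables {
    std::vector<std::string> exported;    // symbols this unit defines and exports
    std::vector<std::string> unresolved;  // symbols this unit needs from elsewhere
};

struct LinkDiagnostics {
    std::set<std::string> unresolved_externals;  // needed, but exported by no unit
    std::set<std::string> duplicated_externals;  // exported by more than one unit
};

LinkDiagnostics diagnose(const std::vector<SymbolTables>& units)
{
    // Count how many units export each symbol.
    std::map<std::string, int> export_count;
    for (const SymbolTables& unit : units) {
        for (const std::string& name : unit.exported) {
            ++export_count[name];
        }
    }

    LinkDiagnostics result;
    // A symbol exported more than once is a duplicate.
    for (const auto& entry : export_count) {
        if (entry.second > 1) {
            result.duplicated_externals.insert(entry.first);
        }
    }
    // A needed symbol that nobody exports is unresolved.
    for (const SymbolTables& unit : units) {
        for (const std::string& name : unit.unresolved) {
            if (export_count.find(name) == export_count.end()) {
                result.unresolved_externals.insert(name);
            }
        }
    }
    return result;
}
```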


Now let's look at some features the C++ language provides:
extern: tells the compiler that the symbol is defined in a different compilation unit, so the symbol goes into the unresolved symbol table. (External linkage.)
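As a small illustration of the difference between an extern declaration and a definition, consider the sketch below. The two halves would normally live in separate files (2.cpp and 1.cpp in the text); they are placed in one file here only so that the sketch compiles alone.

```cpp
// --- what 2.cpp would contain ---
extern int n;      // declaration only: "n exists somewhere", no storage reserved here

void g()
{
    ++n;           // the address of n is left for the linker to fill in
}

// --- what 1.cpp would contain ---
int n = 1;         // the one definition: storage for n is reserved here
```

Calling g() once changes n from 1 to 2, because both names refer to the same object.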

static: placed in front of the declaration of a global function or variable, it indicates that the compilation unit does not export the symbol of that function or variable, so it cannot be used in other compilation units. (Internal linkage.) A static local variable is stored in the same way as a global variable, but its symbol is likewise not exported.
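Both uses of static can be seen in the short sketch below. The names hidden_counter, bump_hidden, count_calls, and hidden_value are made up for this example.

```cpp
// Internal linkage: this symbol is not exported, so another .cpp file
// could define its own `hidden_counter` without a clash.
static int hidden_counter = 0;

// Internal linkage as well: only this compilation unit can call it.
static void bump_hidden()
{
    ++hidden_counter;
}

// External linkage function that uses a static local variable.
// `calls` lives as long as a global variable, but its name is not exported.
int count_calls()
{
    static int calls = 0;   // initialized once, keeps its value between calls
    bump_hidden();
    return ++calls;
}

// Accessor so that code outside this unit can observe the hidden counter.
int hidden_value()
{
    return hidden_counter;
}
```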

Default linkage: functions and variables default to external linkage; const variables default to internal linkage. (You can change the linkage by adding extern or static.)

Pros and cons of external linkage: a symbol with external linkage can be used throughout the program (because the symbol is exported). However, no other compilation unit may export the same symbol (otherwise the result is a duplicated external symbols error).

Pros and cons of internal linkage: a symbol with internal linkage cannot be used in other compilation units. However, different compilation units may each have an internally linked symbol of the same name.

Why a header file should generally contain only declarations and no definitions: a header file may be included in several compilation units. If it contains a definition, every compilation unit that includes it defines the same symbol, and if that symbol has external linkage the result is a duplicated external symbols error. Therefore, if a header file must contain a definition, the defined symbol must be guaranteed to have internal linkage only.

Why constants default to internal linkage while variables do not:
So that constants such as const int n = 0 can be defined in header files. Because constants are read-only, it does not matter that each compilation unit has its own definition. If a variable defined in a header file had internal linkage and several compilation units included that definition, a modification made by one compilation unit would not affect the "same" variable in the others, which could have unintended consequences.

Why functions default to external linkage:
Although functions, unlike variables, are read-only, they change easily while the code is being written. If functions had internal linkage by default, people would tend to define them in header files, and every change to such a function would force all compilation units that include the header to be recompiled. In addition, any static local variables defined in the function would then also be defined in the header file.

Why static member variables of a class cannot be initialized in place: in-place initialization means something like this:
class A
{
    static char msg[] = "aha";
};
This is not allowed because a class declaration usually lives in a header file, and allowing it would amount to defining a non-const variable in the header file.
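The usual way around this restriction is to declare the static member inside the class and define it once outside. The sketch below shows that pattern; the file names in the comments are hypothetical, and public: is added only so the member can be read from outside the class.

```cpp
// a.h (would be included by several .cpp files)
class A
{
public:
    static char msg[];      // declaration only, no storage reserved
};

// a.cpp (exactly one compilation unit)
char A::msg[] = "aha";      // the single definition of A::msg
```

Because the definition appears in only one compilation unit, only one export symbol for A::msg exists, no matter how many files include the header.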

What happens in C++ when a header file defines a const object:
Usually nothing. As with a const int defined in a header file, each compilation unit that includes the header defines its own copy of the object, but because the object is const, this has no effect. There are, however, two situations that can break this:
1. If code takes the address of the const object and relies on that address being unique, the address may differ between compilation units. (This is seldom done in practice.)
2. If the object has a mutable member and one compilation unit modifies it, the other compilation units are not affected.
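The second situation can be illustrated with a const object that has a mutable member. The Cache type and shared_cache object below are hypothetical, and the sketch uses a single file, so it shows only that a const object can still change state; with several compilation units, each would change its own copy.

```cpp
// Hypothetical type with a mutable member.
struct Cache
{
    mutable int hits;        // may change even when the object is const

    Cache() : hits(0) {}

    // A const member function that still modifies the mutable member.
    int lookup() const
    {
        return ++hits;
    }
};

// If this line lived in a header, every including compilation unit would
// get its own internally linked copy, each with its own `hits`.
const Cache shared_cache;
```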

Why static constants of a class cannot be initialized in place:
Because that would amount to defining a const object in the header file. As an exception, int and char constants can be initialized in place, because they can be optimized directly into immediate values, much like a macro.
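The exception for integral constants looks like this in practice. Buffer and kSize are names chosen for this sketch.

```cpp
class Buffer
{
public:
    static const int kSize = 16;   // allowed: integral constant initialized in place
    char data[kSize];              // kSize is usable as a compile-time constant
};
```

Here kSize is used only for its value, so the compiler can substitute 16 directly, as the text describes.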

Inline functions:
An inline function in C++ is similar to a macro, so the question of linkage does not arise.

Why inline functions are usually defined in header files:
Because compilation units know nothing about one another at compile time, if an inline function is defined in a .cpp file, other compilation units that call it cannot see its definition, so the call cannot be expanded. An inline function defined in a .cpp file can therefore be used only within that .cpp file.
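A minimal sketch of the usual arrangement follows. The header name square.h, the file user.cpp, and both functions are hypothetical; the two parts are shown in one block only so the example compiles alone.

```cpp
// square.h (hypothetical header, included by every .cpp file that needs square)
#ifndef SQUARE_H
#define SQUARE_H

// The full definition is visible wherever the header is included,
// so the compiler can expand calls in place if it chooses to.
inline int square(int x)
{
    return x * x;
}

#endif  // SQUARE_H

// user.cpp (hypothetical): would start with #include "square.h"
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}
```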

What happens if an inline function in a header file is not inlined:
If the compiler declines to inline a function defined in a header file, it automatically defines the function in each compilation unit that includes the header and does not export the symbol.

If a function that the compiler declined to inline defines a static local variable, where is the variable defined:
Earlier compilers defined one copy in each compilation unit and therefore produced incorrect results. Newer compilers solve the problem, though by what means the author does not know.

Why the export keyword has not been implemented:
export requires the compiler to look up function definitions across compilation units, which makes the compiler very hard to implement.
