When we first meet programming, the first small project is usually "Hello World", and within a short time we can write Hello World in whatever language we are learning. Don't be fooled by how little it does: it only prints a few letters, yet most people are still unclear about the internal mechanism behind this simple program. So today we will talk about how a program actually runs.
How does the text "Hello World" end up on the display? The code the CPU executes is certainly different from the code we write, so what does it look like? How does the code we write become code the CPU can execute? Where does the code live while the program runs, and how is it organized? Where are the program's variables stored? How does a function call work? This article briefly discusses these questions.
The process hidden by the development platform
Each language has its own development platform, and most of our programs are born there. Converting program source code into an executable file actually takes many steps and is quite complex, but today's development platforms take care of all of it themselves. This is convenient, but it also hides many implementation details. As a result, most programmers are only responsible for writing code, while the other complex transformations are done silently by the development platform.
As I understand it, the process from source code to executable file can be divided into the following stages:
1. The source code is translated into machine language, and the resulting machine code is organized according to certain rules. We'll call the result file A for the moment.
2. File A is linked with the files B it needs in order to run (such as library functions), forming file A+.
3. File A+ is loaded into memory and run.
(Reference books and other materials may list more steps than these; to keep things simple I have summarized the process in three.)
These are the key steps in producing and running an executable file, and none of them can be skipped. Now you can see how much the development platform keeps us in the dark. The following sections clear away the fog and show the development platform's true face.
Object file
There is a classic saying in the computer field:
"Any problem in computer science can be solved by another layer of indirection."
That is, any problem in computer science can be solved by adding an intermediate layer.
For example, to get from A to B, you can first convert A into an intermediate file A+, and then convert A+ into the B we need. (Polya's "How to Solve It" also describes this approach.) Adding an intermediate layer often simplifies a problem.
The process from source code to executable file can be understood the same way: the problem is solved by (repeatedly) adding intermediate layers between the two.
As mentioned above, the source program is first converted into intermediate file A, and the intermediate file is then converted into the file we actually need.
This is the way of thinking to keep in mind when working with these files.
The more professional name for file A above is the object file. It is not an executable program: it has to be linked with other object files and loaded before it can run. For a source program, the first thing the development platform does is translate it into machine language, and the most important part of that is compilation. As many readers know, compiling translates source code into machine language (in practice, a pile of binary code). Compiler knowledge is very important, but it is not the focus of this article; interested readers can look it up on their own.
Object file format:
Now let's look at how an object file is organized (that is, its structure).
Origin:
Imagine you were the designer: how would you organize this binary code? Just as items on a desk are kept tidy by category, the translated binary code is stored by category for ease of management: the parts that represent code are put together, and the parts that represent data are put together. The binary code is thus divided into blocks, and each such block is called a segment.
Standard:
As with many things in computer science, a standard was set for this binary layout to ease communication between people and compatibility between programs, and so COFF (Common Object File Format) was born. The object file formats used by today's mainstream operating systems such as Windows and Linux are similar to COFF and can be regarded as variants of it.
a.out:
a.out is the default name of the compiler's output file. That is, if you do not give the output a name when compiling, the compiler produces a file named a.out.
Why that name in particular? Interested readers can look it up.
The following diagram gives a more intuitive view of the object file:
The diagram shows the typical structure of an object file. Real files may differ, but they are all derived from this basis.
ELF file header: the first segment in the diagram. The file header is the head of the object file and contains basic information about it, such as the file version, the target machine type, and the program entry address.
Text segment: holds mainly the code of the program.
Data segment: holds the data of the program, such as variables.
Relocation segment:
The relocation segment includes text relocation and data relocation, and holds the relocation information. Code often references external functions or variables. Because they are only referenced, these functions and variables do not exist inside the object file itself, so they must be given actual addresses before they can be used (this happens at link time). The relocation tables provide the information needed to find those actual addresses. With that in mind, text relocation and data relocation are not hard to understand.
Symbol table: The symbol table contains all the symbol information from the source code, including each variable name, function name, and so on, and records information about each symbol. For example, if the code contains the symbol "student", the symbol table holds the corresponding entry, including the segment in which the symbol resides, its attributes (read and write permissions), and so on.
In fact, the origin of the symbol table can be traced back to the lexical analysis phase of compilation: during lexical analysis, each symbol in the code and its attributes are recorded in the symbol table.
String table: Similar in function to the symbol table; it stores string information.
One more thing worth saying: the object file is stored in binary form and is itself a binary file.
A real object file is more complex than this model, but the idea is the same: contents are stored by type, together with information describing the segments and information used for linking.
Dissecting a.out
Hello World
Enough talk. Let's look at the object file produced by compiling Hello World; the program is written in C.
A simple Hello World source file:
So that the data segment has something in it, "int a=5" has been added here.
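If you are using VC, click Run to see the result.
To see exactly how things are handled internally, we compile with GCC.
Run
gcc hello.c
Look at the directory: there is now a new file, a.out. (By default gcc also runs the linker, so this a.out is already linked; gcc -c hello.c stops after compilation and produces the unlinked object file hello.o.)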
What we want now is to see what is inside a.out. Some readers may think of opening it in Vim as text, and I was once naive enough to think so too. But a.out does not give up its contents that easily: Vim does not work. "Most of the problems we run into have already been met and solved by our predecessors", and indeed there is a powerful tool called objdump. With it we can examine every detail of the object file. readelf is also very useful and is described later.
Both tools generally ship with Linux; look up the details yourself if needed.
Note: The code here is mostly compiled with GCC under Linux, and the object files are inspected with objdump and readelf. I include the output of every command, so readers who have never used Linux should have no trouble following along. I use Ubuntu, and it feels good.
The following shows the organization of a.out (start address, size, and so on for each segment):
The command to view it is objdump -h a.out
As described in the object file format above, the contents are stored by category. The object file is divided into 6 segments.
From left to right, the first column (Idx Name) is the index and name of the segment, Size is its size, VMA is the virtual address, LMA is the load address (treated here as the physical address), and File off is the offset within the file, that is, the distance of the segment from a reference point (usually the beginning of the file). The last column, Algn, is the alignment of the segment, which we ignore for the time being.
.text segment: the code segment.
.data segment: the data segment mentioned above; it holds the data from the source code, typically initialized data.
.bss segment: also a data segment, but it holds uninitialized data. Because no space has yet been allocated for this data, it is stored separately.
.rodata segment: the read-only data segment; the data stored in it can only be read.
.comment segment: contains compiler version information.
The remaining two segments have no practical bearing on our discussion, so we will not introduce them; I believe they contain information used for linking, compiling, and loading.
Note:
The object file format shown here lists only the main parts. In practice there are some tables that are not listed. If you are also using Linux, you can use objdump -x to list the sections in more detail.
A deeper look at a.out
The section above used an example to describe the typical segments of an object file, mainly information about each segment, such as its size and related attributes.
So what exactly is stored in these segments? What is in the .text segment? Again we turn to objdump.
objdump -s a.out: the -s option shows the contents of the object file in hexadecimal.
The results are as follows:
As shown, the hexadecimal contents of each segment are listed. The output is divided into two columns: the left column is the hexadecimal representation, and the right column shows the corresponding text.
The string "Hello World" is clearly visible in the .rodata read-only data segment. Oops, it seems I misspelled "Hello" in the program and added a "w" at the end. Please forgive me.
You can also look up the ASCII values of "Hellow world": the corresponding hexadecimal values are exactly the contents shown.
The .comment segment shown above contains version information about the compiler: the name GCC, followed by the version number.
Disassembling a.out
Compilation first turns the source text into assembly and then translates that into machine language (another intermediate layer). Having looked at so much of a.out, it is well worth studying its assembly form.
objdump -d a.out lists the assembly form of the file. Only the main part is shown here, namely the main function; in fact there is more work to do before main starts executing and after it finishes.
That is, initializing the execution environment of the function and releasing the space the function occupied.
In the figure above, the left side is the hexadecimal form of the code and the right side is the assembly form. Readers familiar with assembly should be able to read most of it, so I will not go into detail here.
The a.out file header
When introducing the object file format we mentioned the file header, which contains basic information about the object file, such as the file version, the target machine type, and the program entry address.
The file header looks like this:
It can be viewed with readelf -h. (What is shown here is hello.o, the file produced by compiling the source file hello.c without linking; it is mostly the same as what you see for a.out.)
The output is divided into two columns: the left column is the attribute and the right column is its value. The first line is often called the magic number and is followed by a series of numbers; I won't go into their specific meaning, which you can look up.
Next comes further information about the object file. Since it has little to do with the issues discussed here, we will not go into it.
The above used a concrete example to show how an object file is organized internally. The object file is only an intermediate step in producing an executable; how a program actually runs has not yet been discussed. How the object file becomes an executable, and how the executable is executed, are covered in the following sections.
A brief look at linking
Put simply, linking means combining several files into one executable.
If program A refers to a function defined in file B, then for A to execute normally the code of that function in B must be placed together with the code of A; merging A and B into one file is linking.
There is a dedicated program for linking, called the linker. It processes a number of input object files and combines them into one output file. These object files often reference each other's data and functions.
Above we saw the disassembly of Hello World. That file had not yet been linked, so where it references an external function, the function's address is not yet known.
For example:
Here the call instruction calls the printf() function. Because printf() is not in this file, its address cannot be determined, and the hexadecimal code uses "FF FF FF" as a stand-in for it. After linking, this becomes the actual address of the function, because by then the function has been brought into the file.
Classification of linking: depending on when the related data or functions are combined into one file, linking can be divided into static linking and dynamic linking.
Static linking:
The linking work is done before the program executes; once linking is complete the file can be executed. But there is an obvious drawback, illustrated by library functions. If both file A and file B need a library function, each file contains its own copy of that function after linking. When A and B run at the same time, there are two copies of the library function in memory, which wastes storage space, and the waste becomes especially noticeable at scale. Static linking also has other disadvantages, such as being hard to upgrade. To solve these problems, many programs now use dynamic linking.
Dynamic linking: unlike static linking, dynamic linking is performed only when the program is loaded and executed. In the example above, if both A and B use the library function fun(), only one copy of fun() needs to be in memory while A and B are running.
There is much more to say about linking, which will be covered in dedicated articles later; I will not expand on it here.
A brief explanation of loading
We know that for a program to run it must be loaded into memory. In the past the entire program was loaded into physical memory; today a virtual memory mechanism is generally used, in which each process has its own complete address space, giving the impression that each process can use the entire memory. A memory manager then maps virtual addresses to actual physical memory addresses.
Following this description, a program's addresses can be divided into virtual addresses and physical addresses. The virtual address is the address the program has in its virtual address space, and the physical address is the actual address at which it is loaded.
When you looked at the segment listing above, you may have noticed that because the file is unlinked and not loaded, the virtual and physical addresses of each segment are 0.
Loading can be understood as first assigning a virtual address to each part of the program and then establishing a mapping from virtual addresses to physical addresses. The key part is this mapping from virtual to physical addresses. After loading is complete, the CPU's program counter (PC) points to the start of the code, and the program executes sequentially.
The purpose of this article is to sort out how a program runs and what is hidden behind the execution of an executable file. Getting from source code to an executable typically involves a number of intermediate steps, each of which produces an intermediate file. Today's integrated development environments simply hide these steps, and as we grow used to them we gradually overlook these important technical details. This article only introduces the main line of the process; each detail could be expanded into an article of its own.
I hope that after reading this article you will no longer think of "Hello World" as just a very simple little experiment, and that you have learned something about how a program is built and how it runs.