ELF Files
ELF is an open standard file format. The executables of various UNIX systems use the ELF format, which comes in three types:
- Relocatable object files
- Executable files
- Shared libraries
Now let's analyze the format of the object file max.o and of the linked file max from the previous article, to understand how assembling, linking, loading, and execution work.
1. Object File
The ELF format offers two different views of a file. To the assembler and the linker, an ELF file is a collection of sections described by the Section Header Table. To the loader, which runs the file, it is a collection of segments described by the Program Header Table, as shown in the figure:
The left side is the view of the assembler and the linker. The ELF Header at the beginning of the file records basic information such as the architecture and the operating system, and gives the locations of the Section Header Table and the Program Header Table within the file. The Program Header Table is not used during assembly and linking, so it is optional there; the Section Header Table holds the descriptions of all sections. From the loader's perspective, the file also starts with the ELF Header, and the descriptions of all segments are held in the Program Header Table. The Section Header Table is not used during loading, so it is optional there. Note that the Section Header Table and the Program Header Table do not have to sit at the beginning and end of the file; their positions are given by the ELF Header.
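To make the role of the ELF Header concrete, here is a minimal Python sketch that decodes a 32-bit little-endian ELF header and reads out where the two tables live. The field names follow the Elf32_Ehdr structure of the ELF specification; the function name parse_elf32_header and the hand-built sample header are assumptions made for illustration, not output taken from max.o.

```python
import struct

# Elf32_Ehdr: 16-byte e_ident followed by 13 fixed-width fields (52 bytes).
ELF32_EHDR = struct.Struct("<16sHHIIIIIHHHHHH")

def parse_elf32_header(data):
    """Decode a 32-bit little-endian ELF header into a dict of fields."""
    fields = ELF32_EHDR.unpack_from(data, 0)
    if fields[0][:4] != b"\x7fELF":
        raise ValueError("missing ELF magic number")
    names = ("e_type", "e_machine", "e_version", "e_entry", "e_phoff",
             "e_shoff", "e_flags", "e_ehsize", "e_phentsize", "e_phnum",
             "e_shentsize", "e_shnum", "e_shstrndx")
    return dict(zip(names, fields[1:]))

# Hand-built sample header for a relocatable i386 file (illustrative values).
sample = ELF32_EHDR.pack(
    b"\x7fELF\x01\x01\x01" + bytes(9),  # magic, 32-bit, little-endian, v1
    1,            # e_type: 1 = ET_REL (relocatable object file)
    3,            # e_machine: 3 = EM_386
    1, 0,         # e_version, e_entry (no entry point in an object file)
    0,            # e_phoff: 0 means there is no Program Header Table
    180,          # e_shoff: file offset of the Section Header Table
    0, 52, 0, 0,  # e_flags, e_ehsize, e_phentsize, e_phnum
    40, 8,        # e_shentsize, e_shnum
    5)            # e_shstrndx: hypothetical index of .shstrtab

header = parse_elf32_header(sample)
print(header["e_shoff"], header["e_shnum"], header["e_phnum"])
```

For a real file, the same function could be applied to the first 52 bytes read from disk in binary mode.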
Every section declared with .section in the assembly source becomes a section in the object file, and the assembler adds some sections of its own (such as the symbol table). A segment is a region with uniform attributes that is loaded into memory when the program runs, and it consists of one or more sections. For example, two sections that must both be readable and writable once loaded belong to the same segment. Some sections matter only to the assembler and the linker; they are not used at run time and need not be loaded into memory, so they do not belong to any segment.
An object file still has to be processed by the linker, so it must have a Section Header Table. An executable has to be loaded and run, so it must have a Program Header Table. A shared library has to be loaded and run, and is also dynamically linked at load time, so it has both a Section Header Table and a Program Header Table.
Next we use the readelf tool to read the ELF Header and the Section Header Table of the object file max.o.
The ELF Header shows that the operating system is UNIX and the architecture is 80386. The Section Header Table contains eight section headers and starts at file offset 180 (0xb4). Each section header is 40 bytes, so the table occupies 320 bytes and ends at file offset 0x1f3. The object file has no Program Header Table.
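The extent of the Section Header Table follows directly from three ELF Header fields. The short sketch below redoes that arithmetic; the function name section_header_table_range is made up for this example, and the inputs are the figures quoted above (start offset 180, eight headers of 40 bytes each).

```python
def section_header_table_range(e_shoff, e_shentsize, e_shnum):
    """Return (first offset, last offset, total size) of the table."""
    size = e_shentsize * e_shnum
    return e_shoff, e_shoff + size - 1, size

first, last, size = section_header_table_range(180, 40, 8)
print(hex(first), hex(last), size)  # 0xb4 0x1f3 320
```

Decimal 180 is 0xb4, and 180 + 320 - 1 = 499, which is 0x1f3.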
The description of each section is read from the Section Header Table. .text and .data are the sections declared in the assembly source; the others were added automatically by the assembler. Addr is the address at which a section is loaded into memory (all addresses in a program are virtual addresses). Load addresses are filled in at link time, so for now they are blank and read 00000000. The Off and Size columns give each section's position and length in the file. For example, .data starts at file offset 0x60 and is 0x24 bytes long: the max program defines nine 4-byte integers, 36 bytes in total, which is 0x24 in hexadecimal. From this information we can draw the layout of the whole object file:
(Note: the addresses in the reference book differ from the actual ones; they are for reference only.)
We use the hexdump tool to print all the bytes of the object file.
The left column is the offset in the file, the middle shows each byte in hexadecimal, and the right column shows the same bytes as ASCII characters. An asterisk (*) stands for omitted lines that are all zeros. The .data section corresponds to this part:
This section will later be loaded into memory.
The .shstrtab and .strtab sections both store ASCII text:
As can be seen, .shstrtab stores the name of each section, and .strtab stores the names of the symbols used in the program. Each name is a string terminated by '\0'.
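Names in a string table are looked up by byte offset: the name starts at that offset and runs to the next '\0'. The sketch below illustrates this in Python. The helper names read_name and all_names are invented here, and the sample table is a hypothetical .strtab holding the symbol names mentioned in this article; the order and offsets in the real max.o may differ.

```python
def read_name(strtab, offset):
    """Return the NUL-terminated ASCII string starting at offset."""
    end = strtab.index(b"\x00", offset)
    return strtab[offset:end].decode("ascii")

def all_names(strtab):
    """List every non-empty name stored in the string table."""
    return [s.decode("ascii") for s in strtab.split(b"\x00") if s]

# Hypothetical string table: a leading NUL (the empty name at offset 0),
# then each name followed by its terminating NUL.
strtab = b"\x00data_items\x00start_loop\x00loop_exit\x00_start\x00"

print(read_name(strtab, 1))   # data_items
print(read_name(strtab, 12))  # start_loop
print(all_names(strtab))
```

Offset 0 conventionally yields the empty string, which is why a string table begins with a zero byte.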
If a global variable in C has no initializer in the code, it is initialized to 0 when the program is loaded. Such data belongs to the .bss section. Like .data, it holds readable and writable data, but in the ELF file the .data section has to occupy space to store initial values, whereas .bss does not. In other words, .bss has a section header in the file but no corresponding section contents. The amount of memory .bss takes up when the program is loaded is recorded in its section header.
We continue with the last part of the readelf output, which is the information read from the .rel.text and .symtab sections.
.rel.text tells the linker which places in the instructions need to be relocated.
.symtab is the symbol table. The Ndx column gives the number of the section each symbol belongs to; for example, data_items is in section 3, which the Section Header Table shows to be .data. The Value column gives the address each symbol stands for. In an object file, a symbol's address is relative to the start of the section it is in; for example, data_items sits at the very beginning of .data, so its address is 0. The Bind column shows that the _start symbol is GLOBAL while the other symbols are LOCAL.
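The columns readelf prints for .symtab map onto the 16-byte Elf32_Sym records stored in the section. The following sketch decodes one such record. The function name parse_sym and the two sample records are assumptions for illustration: the name offsets are hypothetical, and section index 1 for .text is assumed rather than taken from the readelf output.

```python
import struct

# Elf32_Sym: st_name, st_value, st_size (32-bit), st_info, st_other (8-bit),
# st_shndx (16-bit); 16 bytes in total.
ELF32_SYM = struct.Struct("<IIIBBH")
BIND_NAMES = {0: "LOCAL", 1: "GLOBAL", 2: "WEAK"}

def parse_sym(raw):
    """Decode one symbol table entry into the columns readelf shows."""
    st_name, st_value, st_size, st_info, st_other, st_shndx = ELF32_SYM.unpack(raw)
    return {
        "name_offset": st_name,  # offset of the name in .strtab
        "value": st_value,       # section-relative address in an object file
        "bind": BIND_NAMES.get(st_info >> 4, "OTHER"),  # high 4 bits of st_info
        "ndx": st_shndx,         # number of the section holding the symbol
    }

# data_items: LOCAL, value 0, in section 3 (.data).
data_items = ELF32_SYM.pack(1, 0, 0, 0x00, 0, 3)
# _start: GLOBAL, value 0, assumed to be in section 1 (.text).
start = ELF32_SYM.pack(33, 0, 0, 0x10, 0, 1)

print(parse_sym(data_items))
print(parse_sym(start))
```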
Only the .text section remains to be analyzed. The objdump tool can disassemble the machine instructions in a program. Is the disassembly identical to the original assembly code? Let's compare:
The left side shows the bytes of each machine instruction and the right side shows the disassembly. All symbols have been replaced with addresses, as in je 23. Note that a number without a $ prefix denotes a memory address, not an immediate value. The <loop_exit> that follows is not part of the instruction; it is the symbol name that the disassembler looked up in .symtab and .strtab and appended for readability. At this point the addresses in all jump instructions and memory access instructions (such as mov 0x0(,%edi,4),%eax) are section-relative symbol addresses. Next, the linker will modify these instructions, changing the addresses to the memory addresses used at load time, so that they execute correctly.
2. Executable File
Let's analyze the executable file max following the same steps as in the previous section and see what changes the linker has made.
In the ELF Header, Type has changed to EXEC: the file is now an executable rather than an object file. The entry point address has changed to 0x8048074 (the address of the _start symbol). There are also two program headers that were not there before, and two fewer section headers.
In the Section Header Table, the load addresses of .text and .data have been filled in as 0x8048074 and 0x80490a0. The .bss section is unused, so it has been removed. The .rel.text section is needed only for linking; once linking is complete it serves no purpose, so it has been removed as well.
The newly added Program Header Table describes two segments. The .text section, together with the ELF Header and the Program Header Table in front of it, forms one segment (FileSiz gives its total length as 0x9e), and the .data section forms the other segment (total length 0x38). The VirtAddr column shows that the first segment is loaded at virtual address 0x08048000 and the second at 0x080490a0. The Flg column shows that the first segment is readable and executable and the second is readable and writable. The value 0x1000 (4 KB) in the last column, Align, is the memory page size on the x86 platform. Loading requires that one page of the file correspond to one page of memory. The relationship is as follows:
This executable is tiny, less than one page in total, yet the two segments must be loaded into two different memory pages, because the MMU's permission protection works per page: a page can have only one set of permissions. In addition, each segment must have the same offset within its file page as within the memory page it is loaded into. For example, the second segment is at offset 0xa0 in the file, and its offset within the memory page at 0x08049000 is also 0xa0, so it starts at 0x080490a0. This simplifies the implementation of the linker and the loader. It also follows that the load address of the .text section is 0x08048074, which is both the address of the _start symbol and the entry point of the program. The values in the object file's symbol table were relative addresses; now they have all become absolute addresses. There are also three new symbols, __bss_start, _edata, and _end, which were added during linking. The loader can use this information to initialize the .bss section to 0.
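The page rule can be checked mechanically from the program headers. The sketch below decodes two Elf32_Phdr records and compares each segment's offset within its file page to its offset within its memory page. The function names are invented for this example. The sample records use the offsets, addresses, sizes, and permissions quoted in this section; the p_paddr and p_memsz values are assumptions (set equal to p_vaddr and p_filesz).

```python
import struct

# Elf32_Phdr: p_type, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz,
# p_flags, p_align; eight 32-bit fields.
ELF32_PHDR = struct.Struct("<8I")
PT_LOAD = 1
PF_X, PF_W, PF_R = 1, 2, 4

def flags_to_str(flags):
    """Render permission bits the way readelf's Flg column does."""
    return (("R" if flags & PF_R else " ")
            + ("W" if flags & PF_W else " ")
            + ("E" if flags & PF_X else " "))

def describe_segment(raw):
    (p_type, p_offset, p_vaddr, p_paddr,
     p_filesz, p_memsz, p_flags, p_align) = ELF32_PHDR.unpack(raw)
    return {
        "page_base": p_vaddr & ~(p_align - 1),  # start of the memory page
        "file_page_offset": p_offset % p_align,
        "mem_page_offset": p_vaddr % p_align,
        "flags": flags_to_str(p_flags),
    }

text_seg = ELF32_PHDR.pack(PT_LOAD, 0x00, 0x08048000, 0x08048000,
                           0x9e, 0x9e, PF_R | PF_X, 0x1000)
data_seg = ELF32_PHDR.pack(PT_LOAD, 0xa0, 0x080490a0, 0x080490a0,
                           0x38, 0x38, PF_R | PF_W, 0x1000)

for seg in (text_seg, data_seg):
    info = describe_segment(seg)
    # The offsets within the file page and the memory page must agree.
    assert info["file_page_offset"] == info["mem_page_offset"]
    print(hex(info["page_base"]), hex(info["mem_page_offset"]), info["flags"])
```

The two segments land in different pages (0x08048000 and 0x08049000), which is what lets them carry different permissions.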
Let's take a look at the disassembly result:
The relative addresses in the instructions have become absolute addresses. Let's look closely at what changed. First, the jump instructions. In the object file they were:
......
11: 74 10    je 23 <loop_exit>
......
1d: 7e ef    jle e <start_loop>
......
21: eb eb    jmp e <start_loop>
......
Now it is changed to the following:
......
8048085: 74 10    je 8048097 <loop_exit>
......
8048091: 7e ef    jle 8048082 <start_loop>
......
8048095: eb eb    jmp 8048082 <start_loop>
......
Did they change? Actually, only the disassembly output differs; the instruction bytes have not changed at all. How can an unchanged instruction jump to a new address? Because the jump instruction does not hold a complete memory address; it holds the number of bytes to jump forward or backward relative to the current instruction. A memory address is 32 bits, and these jump instructions are only 16 bits long, so they clearly cannot hold a complete memory address. This is called a relative jump.
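The targets printed by the disassembler can be recomputed from the bytes. In these two-byte x86 jumps the second byte is a signed 8-bit displacement, counted from the address of the following instruction. The sketch below applies that rule to the listings above; the function name short_jump_target is invented for this example.

```python
def short_jump_target(addr, insn):
    """addr: address of the 2-byte jump; insn: its opcode and displacement."""
    disp = int.from_bytes(insn[1:2], "little", signed=True)
    return addr + len(insn) + disp  # relative to the next instruction

# Object file: addresses are offsets within .text.
print(hex(short_jump_target(0x11, bytes.fromhex("7410"))))  # 0x23
print(hex(short_jump_target(0x1d, bytes.fromhex("7eef"))))  # 0xe
print(hex(short_jump_target(0x21, bytes.fromhex("ebeb"))))  # 0xe

# Executable: same bytes, new addresses, new targets.
print(hex(short_jump_target(0x8048085, bytes.fromhex("7410"))))  # 0x8048097
print(hex(short_jump_target(0x8048091, bytes.fromhex("7eef"))))  # 0x8048082
print(hex(short_jump_target(0x8048095, bytes.fromhex("ebeb"))))  # 0x8048082
```

Because only the difference between the two addresses is encoded, moving the whole section leaves the instruction bytes valid.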
Now look at the memory access instructions. In the object file they were:
......
5: 8b 04 bd 00 00 00 00     mov 0x0(,%edi,4),%eax
......
14: 8b 04 bd 00 00 00 00    mov 0x0(,%edi,4),%eax
......
Now it is changed to the following:
......
8048079: 8b 04 bd a0 90 04 08    mov 0x80490a0(,%edi,4),%eax
......
8048088: 8b 04 bd a0 90 04 08    mov 0x80490a0(,%edi,4),%eax
......
The address in the instruction was originally 0x00000000 and is now 0x080490a0 (note the little-endian byte order). So how does the linker know that these two places must be changed? It patches them according to the relocation information in the .rel.text section of the object file:
......
Relocation section '.rel.text' at offset 0x2b0 contains 2 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000008  00000201 R_386_32          00000000   .data
00000017  00000201 R_386_32          00000000   .data
......
The Offset value in the first column is the position within the .text section that needs to be patched. The offsets 8 and 0x17, relative to the start of .text, are exactly the positions of the 00 00 00 00 bytes in the two instructions.
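What the linker does with these two entries can be sketched in a few lines of Python. For an R_386_32 relocation, the 32-bit value already stored at the given offset (the addend) is added to the final address of the referenced symbol, and the sum is written back in little-endian order. The function name apply_relocations is invented here, and the .text buffer is a stand-in: it is zero-filled except for the two mov instructions, and its length of 0x2a bytes is an assumption.

```python
import struct

R_386_32 = 1  # relocation type: symbol address + addend, absolute 32-bit

def apply_relocations(text, relocs, symbol_addrs):
    """Patch a copy of text according to (r_offset, r_info) entries."""
    out = bytearray(text)
    for r_offset, r_info in relocs:
        sym_index, rel_type = r_info >> 8, r_info & 0xff
        if rel_type != R_386_32:
            raise NotImplementedError("only R_386_32 is handled in this sketch")
        (addend,) = struct.unpack_from("<I", out, r_offset)
        value = (symbol_addrs[sym_index] + addend) & 0xffffffff
        struct.pack_into("<I", out, r_offset, value)
    return bytes(out)

# Stand-in for .text: only the two mov instructions are filled in.
text = bytearray(0x2a)
text[0x05:0x0c] = bytes.fromhex("8b04bd00000000")
text[0x14:0x1b] = bytes.fromhex("8b04bd00000000")

# The two entries from .rel.text; Info 0x201 = symbol 2, type 1 (R_386_32).
relocs = [(0x08, 0x00000201), (0x17, 0x00000201)]
# Symbol 2 is the .data section, placed at 0x080490a0 by the linker.
patched = apply_relocations(text, relocs, {2: 0x080490a0})

print(patched[0x05:0x0c].hex(" "))  # 8b 04 bd a0 90 04 08
print(patched[0x14:0x1b].hex(" "))  # 8b 04 bd a0 90 04 08
```

The patched bytes match the instructions seen in the disassembly of the executable.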