Analysis of executable file formats on UNIX/LINUX platforms (1)

Source: Internet
Author: User

This article discusses three main executable file formats on UNIX/Linux: a.out (assembler and link editor output), COFF (Common Object File Format), and ELF (Executable and Linking Format). It begins with an overview of executable file formats and describes the ELF loading process to show how the contents of an executable file relate to the loading operation. It then examines each of the three formats in turn, focusing on the dynamic linking mechanism of ELF files, and compares the advantages and disadvantages of the formats along the way. It closes with a brief summary of the three formats and the author's comments on executable file formats.

Overview of executable file formats

Compared with other file types, executable files are probably the most important files in an operating system, because they are what actually carries out operations. The size, speed, resource usage, extensibility, and portability of an executable file are closely tied to the definition of its file format and to the way the file is loaded. Studying executable file formats is therefore valuable both for writing high-performance programs and for understanding certain hacker techniques.

Whatever the executable file format, some basic elements are required. Obviously, the file must contain code and data. Because the file may reference symbols (variables and functions) defined in external files, relocation information and symbol information are also needed. Some auxiliary information, such as debugging information and hardware information, is optional. Essentially every executable file format stores this information in separate regions, called segments or sections. The exact meaning of "segment" and "section" differs slightly between file formats, but it is clear from context and is not a key issue here. Finally, an executable file usually has a file header that describes the overall structure of the file.

Three concepts are closely related to executable files: compiling, linking, and loading. Source files are compiled into object files, multiple object files are linked into a final executable file, and the executable file is loaded into memory to run. Because this article focuses on executable file formats, it also pays particular attention to the loading process. The following is a brief description of how an ELF file is loaded on Linux.

1: The kernel first reads the ELF file header, then reads the other data structures that the header points to, finds the segments marked as loadable, and calls mmap() to map their contents into memory. The kernel passes each segment's flags directly to mmap(); the flags indicate whether the segment is readable, writable, and executable in memory. Naturally, the text segment is read-only and executable, while the data segment is readable and writable. This approach makes use of the memory protection provided by modern operating systems and processors. The well-known shellcode writing technique (reference 17) is a practical example of getting around this protection.

2: The kernel looks up the name of the dynamic linker recorded in the PT_INTERP segment of the ELF file and loads the dynamic linker. On modern Linux systems the dynamic linker is usually /lib/ld-linux.so.2; it is described in detail later.

3: The kernel places a set of tag-value pairs on the stack of the new process to describe operations relevant to the dynamic linker.

4: The kernel passes control to the dynamic linker.

5: The dynamic linker checks which external files (shared libraries) the program depends on and loads them as needed.

6: The dynamic linker relocates the program's external references. Put simply, it tells the program the addresses of the external variables and functions it references; these addresses lie within the memory range where the shared libraries are loaded. Dynamic linking also supports lazy binding: a symbol is relocated only when it is actually referenced, which is a great help to program efficiency.

7: The dynamic linker executes the code in the .init section of the ELF file to initialize the running program. In early systems the initialization code was the function _init(void) (the function name was fixed). In modern systems the corresponding form is

    void
    __attribute__((constructor))
    init_function(void)
    {
        ......
    }

The function name is arbitrary.

8: The dynamic linker passes control to the program, which starts executing at the entry point defined in the ELF file header. In the a.out and ELF formats the entry point is stored explicitly, while in the COFF format it is defined implicitly by the specification.
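Step 1 above says that a segment's flags decide the protection of the mapped memory. The following is a minimal sketch of that translation, not the kernel's actual code. The names SEG_FLAG_R, SEG_FLAG_W, SEG_FLAG_X, prot_from_segment_flags, and format_prot are hypothetical; the numeric values are those the ELF specification assigns to PF_R, PF_W, and PF_X.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>   /* PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC */

/* Permission bits of an ELF segment (p_flags). Local, hypothetical names
 * are used so the sketch does not depend on <elf.h>. */
enum {
    SEG_FLAG_X = 0x1,   /* same value as PF_X: executable */
    SEG_FLAG_W = 0x2,   /* same value as PF_W: writable   */
    SEG_FLAG_R = 0x4    /* same value as PF_R: readable   */
};

/* Translate segment flags into the protection argument of mmap().
 * Bits other than R/W/X are ignored. */
int prot_from_segment_flags(uint32_t p_flags)
{
    int prot = PROT_NONE;

    if (p_flags & SEG_FLAG_R) prot |= PROT_READ;
    if (p_flags & SEG_FLAG_W) prot |= PROT_WRITE;
    if (p_flags & SEG_FLAG_X) prot |= PROT_EXEC;
    return prot;
}

/* Render a protection value as a string such as "r-x". */
void format_prot(int prot, char out[4])
{
    assert(out != NULL);
    out[0] = (prot & PROT_READ)  ? 'r' : '-';
    out[1] = (prot & PROT_WRITE) ? 'w' : '-';
    out[2] = (prot & PROT_EXEC)  ? 'x' : '-';
    out[3] = '\0';
}
```

With these helpers, a text segment flagged readable and executable formats as "r-x" and a data segment flagged readable and writable formats as "rw-", matching the description in step 1.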


From the description above, the most important parts of loading a file are mapping the program's text and data segments into memory and relocating externally defined symbols. Relocation is an important concept in linking. An executable program usually consists of a main program file, a number of object files, and several shared libraries. (Note: with certain tricks it is also possible to write a program without a main function; see reference 2.) A C program may reference variables or functions defined in a shared library; in other words, the addresses of those variables and functions must be known when the program runs. With static linking, all external definitions used by the program are copied into the executable file. With dynamic linking, the executable file contains only reference information for the external definitions, and the actual relocation happens when the program runs. Static linking has two major problems. First, if a variable or function in the library changes, the program must be recompiled and relinked. Second, if several programs reference the same variable or function, it appears multiple times on disk and in memory, wasting space. Comparing the sizes of the executable files produced by the two linking methods shows the difference clearly.

Analysis of the a.out File Format

The a.out format differs slightly across machine platforms and UNIX operating systems; on the MC680x0 platform, for example, it has six sections. The most "standard" form of the format is discussed below.

The a.out file contains seven sections, in the following order:

exec header (execution header, which can also be regarded as the file header)

text segment

data segment

text relocations (text relocation table)

data relocations (data relocation table)

symbol table

string table

Data Structure of the execution header:

    struct exec {
        unsigned long a_midmag;  /* magic number and other information */
        unsigned long a_text;    /* length of the text segment */
        unsigned long a_data;    /* length of the data segment */
        unsigned long a_bss;     /* length of the BSS segment */
        unsigned long a_syms;    /* length of the symbol table */
        unsigned long a_entry;   /* program entry point */
        unsigned long a_trsize;  /* length of the text relocation table */
        unsigned long a_drsize;  /* length of the data relocation table */
    };


The file header mainly describes the length of each section. The most important field is a_entry (the program entry point), which is where the system starts executing program code after it has loaded the program and checked the environment. The same kind of field also appears in the ELF header discussed later. From the a.out layout and the header data structure we can see that the format is very compact: it contains only the information needed to run the program (text, data, and BSS), and the order of the sections is fixed. Such a structure lacks extensibility. For example, it cannot hold the debugging information that is common in "modern" executable files. The tool originally used to debug a.out files was adb, and adb is a machine-language debugger!
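Because the header stores only lengths and the section order is fixed, the file offset of every section can be derived by simple addition. The sketch below illustrates this under two assumptions that are not stated in the article: the fields are 32 bits wide, and the sections follow the header back to back with no alignment padding (some a.out variants page-align the text segment). The names aout_exec32, aout_layout, and aout_compute_layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit rendering of the exec header; assumes "unsigned long" is 32 bits. */
struct aout_exec32 {
    uint32_t a_midmag;
    uint32_t a_text;
    uint32_t a_data;
    uint32_t a_bss;
    uint32_t a_syms;
    uint32_t a_entry;
    uint32_t a_trsize;
    uint32_t a_drsize;
};

/* File offsets of the six sections that follow the header. */
struct aout_layout {
    uint32_t text_off;
    uint32_t data_off;
    uint32_t trel_off;   /* text relocations */
    uint32_t drel_off;   /* data relocations */
    uint32_t sym_off;    /* symbol table */
    uint32_t str_off;    /* string table */
};

/* Derive section offsets from the lengths in the header, assuming the
 * sections are stored contiguously in the fixed order. */
void aout_compute_layout(const struct aout_exec32 *hdr, struct aout_layout *out)
{
    assert(hdr != NULL && out != NULL);

    out->text_off = (uint32_t)sizeof *hdr;            /* header is 32 bytes */
    out->data_off = out->text_off + hdr->a_text;
    out->trel_off = out->data_off + hdr->a_data;      /* BSS takes no file space */
    out->drel_off = out->trel_off + hdr->a_trsize;
    out->sym_off  = out->drel_off + hdr->a_drsize;
    out->str_off  = out->sym_off  + hdr->a_syms;
}
```

Note that a_bss does not contribute to any offset: the BSS segment is only a size to reserve in memory, which is part of why the format is so compact.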


The a.out file contains a symbol table and two relocation tables. The contents of these three tables are used when object files are linked into an executable file. In the final executable a.out file, the length of all three tables is 0. An a.out file includes all external definitions in the executable at link time. From a program design point of view this is hard-coding, or strong coupling between modules. Later discussion shows how the ELF format and its dynamic linking mechanism improve on this.

a.out is the executable file format used by early UNIX systems. It was designed by AT&T and has now largely been replaced by the ELF format. The design of a.out is relatively simple, but its ideas were clearly inherited and developed by later executable file formats. For more on the a.out format, see reference 16 and read the source code in reference 15. Reference 12 discusses how to run a.out files on "modern" Red Hat Linux.

COFF File Format Analysis

The COFF format is more complex than the a.out format. The most important change is the addition of a section table, so that besides the .text, .data, and .bss sections a file can contain other sections as well. An optional header is also added, in which different operating systems can define their own specific content.

The COFF file layout is as follows:

File Header

Optional Header

Section 1 Header

.........

Section n Header

Raw Data for Section 1

Raw Data for Section n

Relocation Info for Sect. 1

Relocation Info for Sect. n

Line Numbers for Sect. 1 (line number data)

Line Numbers for Sect. n (line number data)

Symbol Table

String Table

Data Structure of the file header:

    struct filehdr
    {
        unsigned short f_magic;   /* magic number */
        unsigned short f_nscns;   /* number of sections */
        long           f_timdat;  /* file creation time */
        long           f_symptr;  /* file offset of the symbol table */
        long           f_nsyms;   /* number of symbol table entries */
        unsigned short f_opthdr;  /* length of the optional header */
        unsigned short f_flags;   /* flags */
    };

The magic number in the COFF file header has a different meaning from that in the other two formats: it identifies the target machine type. For example, 0x014c denotes the i386 platform and 0x268 denotes the Motorola 68000 series. When a COFF file is executable, the F_EXEC flag (0x0002) is set in the f_flags field, which also indicates that the file has no unresolved symbols; in other words, relocation was completed at link time. This also shows that the original COFF format does not support dynamic linking. To solve this problem and add new features, some operating systems extended the COFF format. Microsoft designed a file format named PE (Portable Executable), whose main extension is a set of specialized headers added to the COFF file header; see reference 18 for details. Some UNIX systems also extended COFF, for example with XCOFF (Extended Common Object File Format), which supports dynamic linking; see reference 5.
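The two checks described above, identifying the machine from f_magic and testing F_EXEC in f_flags, can be sketched as follows. The constant values are the ones quoted in the text; the names COFF_MAGIC_I386, COFF_MAGIC_M68K, COFF_F_EXEC, coff_filehdr_view, coff_machine_name, and coff_is_executable are hypothetical, and the struct holds only the two fields the sketch needs rather than the full file header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COFF_MAGIC_I386 0x014c   /* i386, as quoted in the text */
#define COFF_MAGIC_M68K 0x0268   /* Motorola 68000 series, as quoted in the text */
#define COFF_F_EXEC     0x0002   /* file is executable, no unresolved symbols */

/* Only the two header fields this sketch inspects. */
struct coff_filehdr_view {
    uint16_t f_magic;
    uint16_t f_flags;
};

/* Map the magic number to a human-readable machine name. */
const char *coff_machine_name(uint16_t f_magic)
{
    switch (f_magic) {
    case COFF_MAGIC_I386: return "i386";
    case COFF_MAGIC_M68K: return "Motorola 68000";
    default:              return "unknown";
    }
}

/* Return 1 if the F_EXEC bit is set in f_flags, 0 otherwise. */
int coff_is_executable(const struct coff_filehdr_view *hdr)
{
    assert(hdr != NULL);
    return (hdr->f_flags & COFF_F_EXEC) != 0;
}
```

The flag is tested as a bit rather than compared for equality, since f_flags may carry other flag bits alongside F_EXEC.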


The file header is followed by the optional header. The COFF specification allows the optional header to have length 0, but on Linux the optional header must be present. On Linux its data structure is as follows:

    typedef struct
    {
        char magic[2];       /* magic number */
        char vstamp[2];      /* version stamp */
        char tsize[4];       /* text segment length */
        char dsize[4];       /* initialized data segment length */
        char bsize[4];       /* uninitialized data segment length */
        char entry[4];       /* program entry point */
        char text_start[4];  /* base address of the text segment */
        char data_start[4];  /* base address of the data segment */
    }
    COFF_AOUTHDR;

When the magic field is 0413, the COFF file is executable. Note that this optional header defines the program entry point explicitly. Standard COFF does not define the entry point explicitly; execution usually starts at the beginning of the .text section, which is not a good design.
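Since the Linux optional header stores every field as a raw byte array, a reader has to assemble the values itself. The sketch below checks the 0413 magic and extracts the entry point, assuming little-endian byte order (as on i386) and the field offsets implied by the structure above. All names (COFF_EXEC_MAGIC, coff_aouthdr_entry, and the offset constants) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COFF_EXEC_MAGIC 0413   /* octal, the value quoted in the text */

enum {
    OFF_MAGIC         = 0,    /* magic[2] */
    OFF_ENTRY         = 16,   /* after magic, vstamp, tsize, dsize, bsize */
    COFF_AOUTHDR_SIZE = 28    /* 2 + 2 + 6 * 4 bytes */
};

/* Decode little-endian 16-bit and 32-bit fields from raw bytes. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* If the buffer holds an optional header with the executable magic,
 * store its entry point in *entry and return 1; otherwise return 0. */
int coff_aouthdr_entry(const unsigned char *opthdr, size_t len, uint32_t *entry)
{
    assert(opthdr != NULL && entry != NULL);

    if (len < COFF_AOUTHDR_SIZE)
        return 0;
    if (le16(opthdr + OFF_MAGIC) != COFF_EXEC_MAGIC)
        return 0;
    *entry = le32(opthdr + OFF_ENTRY);
    return 1;
}
```

Storing fields as byte arrays avoids compiler padding and host byte-order issues in the structure definition, at the cost of explicit decoding like this.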

As mentioned above, the COFF format has one thing the a.out format lacks: a section table, in which each section header entry describes one section's data in detail. The COFF format can therefore hold more sections, and specific sections can be added as needed; this shows both in the definition of COFF itself and in the COFF extensions mentioned earlier. In the author's opinion, the section table may be the greatest improvement of COFF over a.out. The data structure of a COFF section header is described briefly below. Because sections matter mostly for compiling and linking, this article does not go into more detail. In addition, ELF and COFF define sections very similarly, so the corresponding discussion is omitted from the ELF analysis that follows.

    struct COFF_scnhdr
    {
        char s_name[8];     /* section name */
        char s_paddr[4];    /* physical address */
        char s_vaddr[4];    /* virtual address */
        char s_size[4];     /* section length */
        char s_scnptr[4];   /* file offset of the section data */
        char s_relptr[4];   /* file offset of the section's relocation information */
        char s_lnnoptr[4];  /* file offset of the line number information */
        char s_nreloc[2];   /* number of relocation entries */
        char s_nlnno[2];    /* number of line number entries */
        char s_flags[4];    /* section flags */
    };

Note: in the Linux header file coff.h the comment on the s_paddr field is "physical address", but the field seems better understood as "the size of the memory occupied by the section once it is loaded". The s_flags field indicates the section type, such as text, data, or BSS. Line number information also appears in COFF sections; it describes the mapping between binary code and source line numbers and is useful for debugging.
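To illustrate how the section table is found and read, the sketch below computes the file offset of a section header and copies out its name. The sizes 20 and 40 are obtained by summing the field widths of the file header and section header structures shown above, assuming no padding. The 8-byte s_name field is treated as not necessarily NUL-terminated. The names coff_scnhdr_offset and coff_section_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    COFF_FILHSZ  = 20,   /* file header: 2+2+4+4+4+2+2 bytes */
    COFF_SCNHSZ  = 40,   /* section header: 8+4*6+2+2+4 bytes */
    COFF_NAMELEN = 8     /* width of s_name */
};

/* File offset of section header `index` (0-based): the section table
 * starts right after the file header and the optional header. */
size_t coff_scnhdr_offset(unsigned f_opthdr, unsigned index)
{
    return (size_t)COFF_FILHSZ + f_opthdr + (size_t)index * COFF_SCNHSZ;
}

/* Copy s_name into a 9-byte buffer and guarantee NUL termination,
 * since a name that uses all 8 bytes has no terminator of its own. */
void coff_section_name(const char s_name[8], char out[9])
{
    assert(s_name != NULL && out != NULL);
    memcpy(out, s_name, COFF_NAMELEN);
    out[COFF_NAMELEN] = '\0';
}
```

For a file with a 28-byte optional header, the first section header would sit at offset 48 and the third at offset 128 under these assumptions.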

ELF File Format Analysis


There are three types of ELF files: relocatable files, also known as object files, with the suffix .o; shared object files, that is, library files, usually with the suffix .so; and executable files, which are the subject of this article. In general, the format of an executable file differs from that of the other two types mainly in the point of view from which the file is observed: one is called the linking view and the other the execution view.
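The three file types can be told apart by the e_type field of the ELF header, a 16-bit value stored immediately after the 16-byte e_ident array (see the Elf32_Ehdr structure later in this section). The sketch below reads that field from raw header bytes and names the type. The values 1, 2, and 3 are those the ELF specification assigns to ET_REL, ET_EXEC, and ET_DYN; the function names are hypothetical, and the caller is assumed to know the byte order already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ELF_IDENT_SIZE = 16 };   /* size of e_ident (EI_NIDENT) */

/* Read the 16-bit e_type field that follows e_ident.
 * `hdr` must point to at least 18 bytes of the ELF header. */
uint16_t elf_type_from_header(const unsigned char *hdr, int little_endian)
{
    const unsigned char *p;

    assert(hdr != NULL);
    p = hdr + ELF_IDENT_SIZE;
    return little_endian ? (uint16_t)(p[0] | (p[1] << 8))
                         : (uint16_t)((p[0] << 8) | p[1]);
}

/* Name the three file types discussed in the text. */
const char *elf_type_name(uint16_t e_type)
{
    switch (e_type) {
    case 1:  return "relocatable file (.o)";     /* ET_REL  */
    case 2:  return "executable file";           /* ET_EXEC */
    case 3:  return "shared object file (.so)";  /* ET_DYN  */
    default: return "other";
    }
}
```

One caveat: on current Linux distributions, position-independent executables are also marked ET_DYN, so the type value alone does not always separate programs from libraries.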

First, here is the overall layout of an ELF file:

ELF header

Program header table

Segment 1

Segment 2

.........

Segment n

Section header table (optional)

Segments are made up of sections. The section header table describes each section and is optional for executable programs. The author of reference 1 reports setting all the data in the section header table to 0 and the program still running correctly! The ELF header is the road map of the file and describes its overall structure. Its data structure is as follows:

    typedef struct
    {
        unsigned char e_ident[EI_NIDENT];  /* magic number and related information */
        Elf32_Half    e_type;              /* object file type */
        Elf32_Half    e_machine;           /* hardware architecture */
        Elf32_Word    e_version;           /* object file version */
        Elf32_Addr    e_entry;             /* program entry point */
        Elf32_Off     e_phoff;             /* file offset of the program header table */
        Elf32_Off     e_shoff;             /* file offset of the section header table */
        Elf32_Word    e_flags;             /* processor-specific flags */
        Elf32_Half    e_ehsize;            /* size of the ELF header */
        Elf32_Half    e_phentsize;         /* size of one program header table entry */
        Elf32_Half    e_phnum;             /* number of program header table entries */
        Elf32_Half    e_shentsize;         /* size of one section header table entry */
        Elf32_Half    e_shnum;             /* number of section header table entries */
        Elf32_Half    e_shstrndx;          /* section header table index of the section name string table */
    } Elf32_Ehdr;

e_ident[0] to e_ident[3] hold the ELF magic number: 0x7f, 'E', 'L', and 'F', in that order. Every ELF file must contain this magic number. Reference 3 describes how to view the ELF magic number using programs, tools, the /proc file system, and other methods. e_ident[4] gives the word size of the hardware: 1 means 32 bits and 2 means 64 bits. e_ident[5] gives the data encoding: 1 means little-endian (the least significant byte occupies the lowest address) and 2 means big-endian (the most significant byte occupies the lowest address). e_ident[6] gives the ELF header version, which currently must be 1. e_ident[7] to e_ident[15] are padding bytes, usually 0. The ELF standard specifies that these bytes are ignored, but they can in fact be put to use. For example, the virus Lin/Glaurung.676/666 (reference 1) sets e_ident[7] to 0x21 to mark a file as infected, and the bytes can also be used to store executable code (reference 2). Most fields in the ELF header describe the subordinate headers and are relatively simple. It is worth noting that some viruses modify the e_entry field (the program entry point) to point to the virus code, such as the Lin/Glaurung.676/666 virus mentioned above.
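The interpretation of the first seven e_ident bytes described above can be sketched as a small decoder. This is an illustrative helper, not part of any library: the names elf_ident_info and elf_parse_ident are hypothetical, and values outside those listed in the text are reported as 0 (bits) or -1 (byte order).

```c
#include <assert.h>
#include <stddef.h>

/* Decoded view of the leading e_ident bytes. */
struct elf_ident_info {
    int is_elf;          /* 1 if the magic number matches */
    int bits;            /* 32, 64, or 0 if unrecognized */
    int little_endian;   /* 1 little-endian, 0 big-endian, -1 unrecognized */
    int version;         /* ELF header version byte */
};

/* Decode e_ident[0..6]; `len` is the number of bytes available. */
struct elf_ident_info elf_parse_ident(const unsigned char *e_ident, size_t len)
{
    struct elf_ident_info info = {0, 0, -1, 0};

    assert(e_ident != NULL);
    if (len < 7)
        return info;

    /* e_ident[0..3]: magic number 0x7f 'E' 'L' 'F' */
    if (e_ident[0] != 0x7f || e_ident[1] != 'E' ||
        e_ident[2] != 'L'  || e_ident[3] != 'F')
        return info;
    info.is_elf = 1;

    /* e_ident[4]: 1 = 32-bit, 2 = 64-bit */
    info.bits = (e_ident[4] == 1) ? 32 : (e_ident[4] == 2) ? 64 : 0;

    /* e_ident[5]: 1 = little-endian, 2 = big-endian */
    info.little_endian = (e_ident[5] == 1) ? 1 : (e_ident[5] == 2) ? 0 : -1;

    /* e_ident[6]: header version */
    info.version = e_ident[6];
    return info;
}
```

A caller would typically read the first 16 bytes of a file into a buffer and pass them to elf_parse_ident before trusting any other header field, since the byte order decoded here determines how the remaining fields must be read.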


