Assembly language development in Linux

Source: Internet
Author: User
(Note: I have an assembly language course this semester, so I found this document to learn from.)
The advantages of assembly language are that it is fast and can operate on hardware directly, which matters greatly for performance-critical applications. Because Linux is an operating system written in C, many programmers forget that on Linux they can still drop down to assembly language to optimize program performance. This article is a guide for programmers writing assembly code on the Linux platform: it introduces the syntax and development tools of Linux assembly language and uses examples to show how to develop practical Linux assembly programs.
 

I. Introduction
Assembly language is one of the most basic programming languages. Although it is not widely used, its importance is beyond doubt, because it can do things that many other languages cannot. Take the Linux kernel as an example: most of the code is written in C, but assembly code is unavoidable in some key areas, mainly in the Linux boot code. This code is so closely tied to the hardware that even C falls short, whereas assembly language can play to its strengths and make the most of the hardware.

 

In most cases, Linux programmers do not need assembly language. Even low-level programs such as hardware drivers can be implemented entirely in C on Linux, and GCC is an excellent compiler that already optimizes the generated code well. There are good reasons to set assembly language aside for a while. In reality, though, Linux programmers sometimes still need, or are forced, to use assembly. The reasons are simple: compactness, efficiency, and independence from libc. When porting Linux to a specific embedded hardware environment, the first problems are how to shrink the system and improve execution efficiency, and at that point assembly language may be the only thing that helps.

 

Assembly language interacts directly with the low-level software, and even the hardware, of the computer. It has the following advantages:

Direct access to hardware-related memory or I/O ports;
Complete control over the generated binary code, free of compiler limitations;
More precise control of critical code, avoiding deadlocks caused by concurrent thread access or shared hardware devices;
Code can be optimized for a specific application to improve running speed;
The capabilities of the hardware can be exploited to the fullest.

At the same time, assembly language is a very low-level language, only one step above writing binary machine instructions by hand. It therefore has some unavoidable disadvantages:

The code is hard to understand and hard to maintain;
Bugs are easy to introduce and hard to track down;
Code can only be optimized for a specific architecture and processor;
Development is slow, time-consuming, and monotonous.

Assembly code on Linux comes in two forms. The first is pure assembly code, where the entire program is written in assembly language. Even so, the assembly tools on the Linux platform borrow some conveniences from C: programmers can use preprocessor directives such as #include and #ifdef, and can simplify code with macro definitions. The second is inline assembly code, meaning assembly snippets embedded in C programs. Although the ANSI C standard says nothing about inline assembly, every C compiler in practical use provides an extension for it, including GCC on the Linux platform.

 

II. Linux assembly syntax format
Most Linux programmers have previously worked only with DOS/Windows assembly language, which uses Intel-style syntax. UNIX and Linux systems, however, mostly use the AT&T format. The two differ considerably in syntax:
 

In AT&T assembly format, register names must be prefixed with '%'. In Intel assembly format, register names take no prefix. For example:

AT&T format              Intel format
pushl %eax               push eax

 

In AT&T assembly format, the prefix '$' marks an immediate operand. In Intel assembly format, immediate operands take no prefix. For example:

AT&T format              Intel format
pushl $1                 push 1

 

The source and destination operands appear in opposite order in AT&T and Intel formats. In Intel assembly format the destination operand is on the left of the source operand, while in AT&T assembly format the destination operand is on the right. For example:

AT&T format              Intel format
addl $1, %eax            add eax, 1

 

In AT&T assembly format, the size of an operand is determined by the last letter of the instruction mnemonic. The suffixes 'b', 'w', and 'l' indicate that the operand is a byte (8 bits), a word (16 bits), or a long (32 bits). In Intel assembly format, the operand size is expressed with a prefix such as "byte ptr" or "word ptr". For example:

AT&T format              Intel format
movb val, %al            mov al, byte ptr val

 
In AT&T assembly format, the operands of absolute (indirect) jump and call instructions must be prefixed with '*'; Intel format needs no prefix.

The far jump and far call instructions are written "ljmp" and "lcall" in AT&T assembly format, and "jmp far" and "call far" in Intel assembly format:

AT&T format                    Intel format
ljmp $section, $offset         jmp far section:offset
lcall $section, $offset        call far section:offset

 

The corresponding far return instruction is:

AT&T format              Intel format
lret $stack_adjust       ret far stack_adjust

 

In AT&T assembly format, a memory operand is written as:

section:disp(base, index, scale)

In Intel assembly format, a memory operand is written as:

section:[base + index*scale + disp]

Because Linux uses 32-bit linear addresses in protected mode, address calculation can ignore the segment base and offset and use the following formula:

disp + base + index * scale

 

The following are some examples of memory operands:

AT&T format                        Intel format
movl -4(%ebp), %eax                mov eax, [ebp - 4]
movl array(, %eax, 4), %eax        mov eax, [eax*4 + array]
movw array(%ebx, %eax, 4), %cx     mov cx, [ebx + 4*eax + array]
movb $4, %fs:(%eax)                mov byte ptr fs:[eax], 4
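
The address formula can be checked with a few lines of C. This is only an illustrative sketch: the function name effective_address and the sample values are made up for the example, and it models the arithmetic of a 32-bit memory operand rather than any real API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: computes disp + base + index * scale the way a
 * 32-bit x86 memory operand does (the arithmetic wraps modulo 2^32). */
static uint32_t effective_address(uint32_t disp, uint32_t base,
                                  uint32_t index, uint32_t scale)
{
    /* x86 only encodes scale factors of 1, 2, 4 or 8. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return disp + base + index * scale;
}
```

For instance, if array sits at the assumed address 0x8049000 and %eax holds 3, the operand array(, %eax, 4) corresponds to effective_address(0x8049000, 0, 3, 4), which is 0x804900c. Likewise, -4(%ebp) with %ebp equal to 0xbffff6c0 yields 0xbffff6bc.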

 

III. Hello world!
I do not know what would happen if I broke with tradition, but since the first example in almost every programming language prints the string "Hello World!" on the screen, we will introduce assembly programming on Linux the same way.

 

In Linux there are many ways to display a string on the screen, but the simplest is to use a system call provided by the Linux kernel. The biggest advantage of this approach is that the program talks to the kernel directly. It does not need to link against a library such as libc or use the ELF interpreter, so the code is small and runs fast.

 

On the i386 architecture covered here, Linux is a 32-bit operating system running in protected mode with a flat memory model, and ELF is currently the most common binary format. An ELF executable is generally divided into the sections .text, .data, and .bss: .text is the read-only code area, .data is the readable and writable data area, and .bss is the readable and writable area for uninitialized data. In ELF, code and data areas are collectively called sections. You can use other standard sections or add custom ones as needed, but an ELF executable should have at least a .text section. The following is our first assembly program, written in AT&T assembly format:

 

Example 1. AT&T format

# hello.s
.data                               # data section
msg:    .string "Hello, world!\n"   # string to output (.string appends a NUL)
        len = . - msg               # string length (including that NUL)

.text                               # code section
.global _start                      # specify the entry point

_start:                             # display a string on the screen
    movl $len, %edx                 # parameter 3: string length
    movl $msg, %ecx                 # parameter 2: string to display
    movl $1, %ebx                   # parameter 1: file descriptor (stdout)
    movl $4, %eax                   # system call number (sys_write)
    int  $0x80                      # call the kernel

                                    # exit the program
    movl $0, %ebx                   # parameter 1: exit code
    movl $1, %eax                   # system call number (sys_exit)
    int  $0x80                      # call the kernel

 

Programmers meeting AT&T-format assembly code for the first time often find it obscure. That does not matter, because on the Linux platform you can also write assembly programs in Intel format:

 

Example 2. Intel format

; hello.asm
section .data                       ; data section
msg     db  "Hello, world!", 0xA    ; string to output
len     equ $ - msg                 ; string length

section .text                       ; code section
global _start                       ; specify the entry point

_start:                             ; display a string on the screen
    mov edx, len                    ; parameter 3: string length
    mov ecx, msg                    ; parameter 2: string to display
    mov ebx, 1                      ; parameter 1: file descriptor (stdout)
    mov eax, 4                      ; system call number (sys_write)
    int 0x80                        ; call the kernel

                                    ; exit the program
    mov ebx, 0                      ; parameter 1: exit code
    mov eax, 1                      ; system call number (sys_exit)
    int 0x80                        ; call the kernel

 

Although the two assembly programs above use completely different syntax, they do the same thing: call sys_write, provided by the Linux kernel, to display a string, and then call sys_exit to exit. The definitions of all system calls can be found in the Linux kernel source file include/asm-i386/unistd.h.

 

IV. Linux assembly tools
Linux offers many kinds of assembly tools, but as on DOS/Windows, the most basic ones are the assembler, the linker, and the debugger.

 

1. Assembler

The assembler converts source programs written in assembly language into object code in binary form. The standard assembler on the Linux platform is gas, the back-end assembler that GCC relies on, and it is usually shipped in the binutils package. gas uses standard AT&T assembly syntax and can assemble programs written in AT&T format:

[xiaowp@gary code]$ as -o hello.o hello.s

 

Another commonly used assembler on Linux is NASM. It offers good macro support and a considerable number of object code formats, including bin, a.out, coff, elf, and rdf. NASM uses a hand-written parser, so it runs much faster than gas. More importantly, it uses Intel assembly syntax and can assemble programs written in Intel format:

[xiaowp@gary code]$ nasm -f elf hello.asm

 

2. Linker

The object code produced by the assembler cannot run directly on the computer. It must first be processed by the linker to produce executable code. The linker typically combines several object files into one executable, so a program can be split into modules that are developed separately and then combined into one application. Linux uses ld as its standard linker, which is also part of the binutils package. After gas or NASM has assembled the program into object code, ld links it into an executable:

[xiaowp@gary code]$ ld -s -o hello hello.o

 

3. Debugger

It is sometimes said that programs are not written but debugged into existence, which shows how important debugging is in software development, especially when programming in assembly language. On Linux you can debug assembly code with a general-purpose debugger such as GDB or DDD, or with ALD (Assembly Language Debugger), which is designed specifically for assembly code.

 

From a debugging point of view, the advantage of gas is that it can include a symbol table in the generated object code, which allows source-level debugging with GDB and DDD. To include the symbol table in the executable, assemble and link as follows:

[xiaowp@gary code]$ as --gstabs -o hello.o hello.s
[xiaowp@gary code]$ ld -o hello hello.o

The --gstabs option tells the assembler to add a symbol table to the generated object code. Note that the -s option is not passed to ld here, since it would strip the symbol table from the object code during linking.

 

Debugging assembly code in GDB and DDD works the same way as debugging C code. You can set breakpoints to interrupt the program, inspect the current values of variables and registers, and single-step through the code. Figure 1 shows assembly code being debugged in DDD:

Figure 1. Debugging an assembly program with DDD

Assembly programmers often work in harsh software and hardware environments, where the small and lean ALD may suit their needs better. The following therefore describes how to debug an assembly program with ALD. First, start the debugger by running the ald command from the command line. Its argument is the executable to debug:

[xiaowp@gary doc]$ ald hello
Assembly Language Debugger 0.1.3
Copyright (C) 2000-2002 Patrick Alken

hello: ELF Intel 80386 (32 bit), LSB, Executable, Version 1 (current)
Loading debugging symbols...(15 symbols loaded)
ald>

 

When the ALD prompt appears, use the disassemble command to disassemble the code section:

ald> disassemble -s .text
Disassembling section .text (0x08048074 - 0x08048096)
08048074  BA0F000000    mov edx, 0xf
08048079  B998900408    mov ecx, 0x8049098
0804807E  BB01000000    mov ebx, 0x1
08048083  B804000000    mov eax, 0x4
08048088  CD80          int 0x80
0804808A  BB00000000    mov ebx, 0x0
0804808F  B801000000    mov eax, 0x1
08048094  CD80          int 0x80

 

The first column of the output is the address of each instruction. It can be used to set a breakpoint for program execution:

ald> break 0x08048088
Breakpoint 1 set for 0x08048088

 

Once the breakpoint is set, use the run command to start the program. When the breakpoint is reached, ALD suspends the program and displays the current values of all registers:

ald> run
Starting program: hello
Breakpoint 1 encountered at 0x08048088
eax = 0x00000004 ebx = 0x00000001 ecx = 0x08049098 edx = 0x0000000f
esp = 0xbffff6c0 ebp = 0x00000000 esi = 0x00000000 edi = 0x00000000
ds = 0x0000002b es = 0x0000002b fs = 0x00000000 gs = 0x00000000
ss = 0x0000002b cs = 0x00000023 eip = 0x08048088 eflags = 0x00000246

Flags: PF ZF IF

08048088  CD80          int 0x80

 

To single-step through the assembly code, use the next command:

ald> next
Hello, world!
eax = 0x0000000f ebx = 0x00000000 ecx = 0x08049098 edx = 0x0000000f
esp = 0xbffff6c0 ebp = 0x00000000 esi = 0x00000000 edi = 0x00000000
ds = 0x0000002b es = 0x0000002b fs = 0x00000000 gs = 0x00000000
ss = 0x0000002b cs = 0x00000023 eip = 0x0804808f eflags = 0x00000346

Flags: PF ZF TF IF

0804808F  B801000000    mov eax, 0x1

 

For a detailed list of all the debugging commands ALD supports, use the help command:

ald> help
Commands may be abbreviated.
If a blank command is entered, the last command is repeated.
Type `help <command>' for more specific information on <command>.

General commands
attach clear continue detach disassemble
enter examine file help load
next quit register run set
step unload window write

Breakpoint related commands
break delete disable enable ignore
lbreak tbreak

 

V. System calls
Even the simplest assembly program inevitably needs operations such as input, output, and exit, and to perform them it must call services provided by the operating system, that is, system calls. Unless your program does nothing but arithmetic, it is hard to avoid them. In fact, apart from the system calls themselves, assembly programming is often very similar across operating systems.

 

On the Linux platform there are two ways to use system calls: through the wrapper functions in the C library (libc), or directly from assembly. Calling them directly from assembly is the most efficient way to use Linux kernel services, because the resulting program does not need to link against any library and communicates with the kernel directly.

 

As in DOS, system calls in Linux are implemented through an interrupt (int 0x80). When the int 0x80 instruction is executed, register eax holds the system call number, and the parameters passed to the system call are placed, in order, in registers ebx, ecx, edx, esi, and edi. When the system call completes, the return value is available in register eax.

 

All system call numbers can be found in the file /usr/include/bits/syscall.h. For convenience they are defined as macros of the form SYS_<name>, such as SYS_write and SYS_exit. For example, the frequently used write function is defined as follows:

ssize_t write(int fd, const void *buf, size_t count);

 

This function is ultimately implemented by the sys_write system call. Following the convention above, the parameters fd, buf, and count are placed in registers ebx, ecx, and edx respectively, and the system call number SYS_write is placed in register eax. After the int 0x80 instruction is executed, the return value can be read from register eax.
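
The same convention is what the C library's generic syscall() wrapper hides. The following is a minimal sketch, assuming a Linux system with glibc, where syscall() and the SYS_write macro come from <unistd.h> and <sys/syscall.h>. The helper name raw_write is made up for this example.

```c
#define _GNU_SOURCE        /* make sure syscall() is declared */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: invokes sys_write directly by number.
 * On i386 this corresponds to eax = SYS_write, ebx = fd, ecx = buf,
 * edx = count, with the result coming back in eax. */
static long raw_write(int fd, const void *buf, size_t count)
{
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, count);
}
```

Calling raw_write(1, "Hello, world!\n", 14) prints the string on standard output and returns the number of bytes written, just as eax holds the return value after int 0x80.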

 

You may have noticed that at most five registers are available for passing parameters to a system call. Does that mean no system call takes more than five parameters? Of course not. The mmap function, for example, takes six parameters, all of which must eventually be passed to the system call sys_mmap:

void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);

 

When a system call needs more than five parameters, the system call number is still placed in register eax when int 0x80 is executed. The difference is that all the parameters are placed in a contiguous memory area, and a pointer to that area is stored in register ebx. After the system call completes, the return value is still in register eax. (This describes the original i386 interface, as used by the old mmap call; later kernels can also take a sixth parameter in register ebp.)

 

Because all we need is a contiguous memory area for the parameters, we can pass them on the stack just as in an ordinary function call. Note that Linux uses the C calling convention, which means the parameters must be pushed in reverse order: the last parameter is pushed first and the first parameter last. When the stack is used to pass system call parameters, the current stack pointer must also be copied into register ebx before int 0x80 is executed.
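
To make the memory layout concrete, the six mmap parameters can be pictured as a C structure. This is a hypothetical sketch: the names mmap_args and nth_param are made up for illustration, and the structure is not the kernel's own definition. It only shows that a block built by pushing the parameters in reverse order has the first parameter at the lowest address, which is the address handed over in ebx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of a six-parameter block, lowest address first. */
struct mmap_args {
    unsigned long start;
    unsigned long length;
    unsigned long prot;
    unsigned long flags;
    unsigned long fd;
    unsigned long offset;
};

/* Returns the n-th parameter (0-based) by indexing from the base address
 * of the block, the way a reader holding only that pointer would. */
static unsigned long nth_param(const struct mmap_args *block, size_t n)
{
    assert(block != NULL && n < 6);
    const unsigned long *words = (const unsigned long *)block;
    return words[n];
}
```

With a block initialized as {0, 4096, 3, 34, (unsigned long)-1, 0}, nth_param(&block, 1) returns the length field, 4096, because each parameter sits one machine word above the previous one.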

 

VI. Command line parameters
In Linux, when an executable is started from the command line, its parameters are saved on the stack: first argc, then the array argv of pointers to the individual command-line arguments, and finally the array envp of pointers to the environment variables. Assembly programs often need to process these parameters. The following code shows how to handle command-line parameters in assembly:

 

Example 3. Processing command-line parameters

# args.s
.text
.globl _start

_start:
    popl %ecx               # argc

vnext:
    popl %ecx               # next argv entry
    test %ecx, %ecx         # NULL pointer marks the end
    jz   exit

    movl %ecx, %ebx
    xorl %edx, %edx
strlen:
    movb (%ebx), %al
    inc  %edx
    inc  %ebx
    test %al, %al
    jnz  strlen
    movb $10, -1(%ebx)      # replace the terminating NUL with a newline

    movl $4, %eax           # system call number (sys_write)
    movl $1, %ebx           # file descriptor (stdout)
    int  $0x80

    jmp  vnext

exit:
    movl $1, %eax           # system call number (sys_exit)
    xorl %ebx, %ebx         # exit code
    int  $0x80

    ret
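
For comparison, the same walk over the argument vector can be written in C. This is a rough sketch, not a line-for-line translation of the assembly. The function name print_args is made up, and the code relies on argv ending with a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: writes each argument on its own line and returns
 * how many were printed, or -1 if a write fails. Like the assembly loop,
 * it stops at the NULL pointer that terminates argv and measures each
 * string by hand. */
static int print_args(char *const argv[])
{
    assert(argv != NULL);
    int count = 0;
    for (; argv[count] != NULL; count++) {
        const char *s = argv[count];
        size_t len = 0;
        while (s[len] != '\0')          /* same job as the strlen: loop */
            len++;
        if (write(STDOUT_FILENO, s, len) < 0 ||
            write(STDOUT_FILENO, "\n", 1) < 0)
            return -1;
    }
    return count;
}
```

Calling print_args(argv) from main prints the program name followed by each argument. Unlike the assembly version, it leaves the strings untouched and writes the newline separately.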

 

VII. GCC inline assembly
Although assembly programs run fast, they are slow and inefficient to develop. If you only want to optimize critical code sections, a better approach may be to embed assembly instructions in a C program and take advantage of both the high-level language and assembly language. In general, though, embedding assembly statements in C code is considerably more complicated than writing "pure" assembly, because you must work out how registers are allocated and how they are bound to the variables in the C code.

 

GCC provides good support for inline assembly. The most basic format is:

__asm__("asm statements");

For example:

__asm__("nop");

To execute several assembly statements together, separate them with "\n\t", for example:

__asm__("pushl %eax\n\t"
        "movl $0, %eax\n\t"
        "popl %eax");

 

Assembly statements embedded in C code can rarely be independent of the code around them, so the full inline assembly format is needed more often:

__asm__("asm statements" : outputs : inputs : registers-modified);

 

An assembly statement inserted into C code is divided into four parts separated by ":". The first part is the assembly code itself, usually called the instruction part, and its format is basically the same as in ordinary assembly language. The instruction part is required, while the other parts can be omitted depending on the situation.

When embedding assembly statements in C code, the main difficulty is how to bind operands to variables in the C code. GCC solves this as follows: the programmer writes the actual instructions but gives only placeholder operands and constraints for register use, and GCC and gas then decide how registers and variables are bound together.

In the instruction part of a GCC inline assembly statement, a number prefixed with '%' (for example %0, %1) denotes a placeholder operand that will be bound to a register. The number of placeholder operands used in the instruction part tells how many variables need to be bound to registers, and GCC and gas handle them according to the given constraints. Because placeholder operands use '%' as their prefix, a specific register name must be written with two '%' characters to avoid confusion.

The output part follows the instruction part. It lists the conditions under which output variables are bound to placeholder operands. Each condition is called a "constraint", and several constraints may be given when necessary, separated by commas. Each output constraint begins with '=', followed by a letter describing the operand type, and then the variable to bind to. Registers or operands bound to the operands described in the output part do not keep their previous contents once the inline assembly code has run, which GCC relies on when scheduling registers.

The input part follows the output part. Input constraints have the same format as output constraints but without the '=' sign. If an input constraint requires a register, GCC allocates one and inserts the instructions needed to load the operand into it. GCC assumes that the inline assembly code leaves input-only operands unchanged, so an operand that the code modifies must be declared as an output instead.

Some operations use further registers, besides those for input and output, to hold intermediate results, which inevitably destroys the previous contents of those registers. The last part of the GCC inline assembly format lists these registers with side effects so that GCC can take appropriate measures.

 

The following is a simple example of inline assembly:

Example 4. Inline assembly

/* inline.c */
#include <stdio.h>

int main()
{
    int a = 10, b = 0;

    __asm__ __volatile__("movl %1, %%eax;\n\t"
                         "movl %%eax, %0;"
                         : "=r" (b)      /* output */
                         : "r" (a)       /* input */
                         : "%eax");      /* clobbered register */

    printf("Result: %d, %d\n", a, b);
    return 0;
}

 

The program above assigns the value of variable a to variable b. The following points need explaining:

Variable b is the output operand, referenced as %0, and variable a is the input operand, referenced as %1.
Both the input and output operands use the "r" constraint, which places a and b in registers. The difference between an input and an output constraint is that the output constraint carries the constraint modifier '='.
To use register eax in an inline assembly statement, write two '%' before the register name, that is, %%eax. Inline assembly identifies variables by %0, %1, and so on, and any identifier with a single '%' is treated as an operand, not a register.
The last part of the inline assembly statement tells GCC that the code changes the value in register eax, so GCC should not use that register to hold any other value across the statement.
Because variable b is declared as the output operand, its stored value is updated once the inline assembly statement has executed.

Operands used in inline assembly are numbered from 0, starting with the first constraint in the output part, and each constraint counts once. The instruction part refers to an operand by prefixing its number with '%'. Note that when the instruction part of an inline assembly statement references an operand, it treats it as a 32-bit long word, although the actual operand may be a word or a byte, so the correct qualifier must be given in the constraint:
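
A slightly larger sketch shows three operands at once. It is an illustration under stated assumptions, not part of the original example: the function name add_asm is made up, the assembly path is compiled only by GCC-compatible compilers targeting x86, and other targets fall back to plain C.

```c
#include <assert.h>

/* Hypothetical helper: adds two values with inline assembly.
 * %0 is the output (sum), %1 and %2 are the inputs (a and b). */
static unsigned add_asm(unsigned a, unsigned b)
{
    unsigned sum;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("movl %1, %0\n\t"      /* sum = a  */
             "addl %2, %0"          /* sum += b */
             : "=&r" (sum)          /* '&': written before %2 is read */
             : "r" (a), "r" (b)
             : "cc");               /* addl changes the flags */
#else
    sum = a + b;                    /* portable fallback */
#endif
    assert(sum == a + b);           /* sanity check against plain C */
    return sum;
}
```

The '&' modifier matters here. Because %0 is written by the first instruction while %2 is still needed by the second, GCC must not place sum and b in the same register.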

 

Qualifier                Meaning
"m", "V", "o"            Memory operand
"r"                      Any register
"q"                      One of the registers eax, ebx, ecx, edx
"i", "h"                 Immediate operand
"E", "F"                 Floating-point number
"g"                      Any
"a", "b", "c", "d"       Registers eax, ebx, ecx, and edx, respectively
"S", "D"                 Registers esi and edi
"I"                      Constant (0 to 31)
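
The register-specific qualifiers are useful when an instruction insists on particular registers. The following sketch assumes a GCC-compatible compiler on x86 and falls back to plain C elsewhere. The function name mul_u32 is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 32 x 32 -> 64 bit unsigned multiplication. */
static uint64_t mul_u32(uint32_t x, uint32_t y)
{
    uint32_t lo, hi;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* mull multiplies eax by its operand and leaves the 64-bit product
     * in edx:eax, so the operands are pinned with "a" and "d". */
    __asm__ ("mull %3"
             : "=a" (lo), "=d" (hi)
             : "a" (x), "r" (y)
             : "cc");
#else
    uint64_t p = (uint64_t)x * y;   /* portable fallback */
    lo = (uint32_t)p;
    hi = (uint32_t)(p >> 32);
#endif
    uint64_t product = ((uint64_t)hi << 32) | lo;
    assert(product == (uint64_t)x * y);   /* sanity check against plain C */
    return product;
}
```

Here "a" serves as both an input and an output constraint, which is allowed because the instruction reads eax before it writes the result.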

 

VIII. Summary
The Linux operating system is written in C, and assembly is considered only when necessary. Even so, it is an important way to reduce code size and improve performance, and it is the best choice when interacting directly with hardware. Linux provides excellent tools for developing assembly programs, and GCC inline assembly lets you combine the strengths of C and assembly language.
