I. Statement
Although most of the Linux kernel is written in C, some parts are unavoidably written in assembly language. Some of this assembly code lives in standalone assembly source files, most notably the Linux startup code. The rest is embedded in C programs using GCC's inline assembly. This article gives a brief introduction to inline assembly in GCC. Its main aim is to help people who are starting to read the Linux kernel source get up to speed faster.
The main sources for this article are two GNU info files: as.info and gcc.info. If you find the introduction here not detailed enough, you can consult those two files, which are the more authoritative source. If you would rather not be overwhelmed by the volume of information in them, I suggest reading this article first and turning to them only when necessary.
II. Introduction
A considerable amount of assembly code still exists in the Linux kernel. If you want to read the Linux source smoothly, you cannot avoid it. The assembly code in Linux comes in two main forms: one is standalone assembly source, mostly Linux startup code; the other is embedded in the C code of Linux using GCC's inline assembly statement, asm. This article mainly covers the second form.
First, let me introduce the assembly syntax supported by as. The assembly language most of us learn is Intel-style, whereas the kernel code in Linux is written in AT&T syntax. Most people are not very familiar with that format, so I think an introduction is necessary.
Next, I will introduce the GCC inline assembly format. GCC's inline assembly provides a convenient way to embed assembly instructions directly in C source. It gives direct control over the generated instruction sequence and interfaces well with C, so it is used in many places in the Linux code.
III. The GCC inline assembly statement asm
With GCC's asm statement you can embed assembly instructions directly in C code, and you can use C expressions to specify the operands of those instructions. This is a great convenience.
To use this feature, you first write an assembly instruction template (somewhat similar to the instruction templates in a machine description file), and then specify a constraint string for each operand. For example:
extern __inline__ void change_bit(int nr, volatile void * addr)
{
	__asm__ __volatile__( LOCK_PREFIX
		"btcl %1,%0"
		:"=m" (ADDR)
		:"Ir" (nr));
}
In the above function:
LOCK_PREFIX: a macro. If __SMP__ is defined, it expands to "lock;", which specifies the bus-lock prefix; otherwise it expands to "".
ADDR: also a macro, defined as (*(volatile struct __dummy *) addr).
"btcl %1,%0": the embedded assembly instruction. btcl is the opcode; %1 and %0 are placeholders for the instruction's two operands. The two constraint strings that follow describe these operands.
:"=m" (ADDR): the constraint string after the first colon describes the "output" operand of the instruction. The ADDR in parentheses ties the operand to a C expression. This constraint says that "%0" in the instruction is the memory operand that the addr pointer points to, and that it is an output memory operand.
:"Ir" (nr): the constraint string after the second colon describes the "input" operand of the instruction. It says that "%1" in the instruction is the variable nr, and that this operand may be either an immediate operand or a register operand.
* Note: constraint strings map to placeholders as follows. Counting across all constraint strings (those after the first colon and those after the second colon together), the first string describes operand "%0", the second describes operand "%1", and so on.
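As a rough, self-contained sketch of the same pattern (not the kernel's own code), the function below toggles one bit of a 32-bit word. The name toggle_bit, the use of "+m" in place of the ADDR macro, and the plain-C fallback for non-x86 targets are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the kernel): toggle bit nr of *word. */
static inline void toggle_bit(uint32_t *word, int nr)
{
    assert(nr >= 0 && nr < 32);          /* keep the bit index inside one word */
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("btcl %1,%0"
                         : "+m"(*word)   /* %0: memory operand, read and written */
                         : "Ir"(nr)      /* %1: constant 0..31 or a register */
                         : "cc");        /* btc changes the carry flag */
#else
    *word ^= (uint32_t)1 << nr;          /* assumption: portable fallback */
#endif
}
```

Calling toggle_bit(&w, 3) twice on the same word should leave it unchanged, since the bit is flipped and then flipped back.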
① Assembly instruction template
The assembly instruction template in an asm statement consists of the assembly instruction sequence followed by the constraint strings. One asm statement can contain several assembly instructions. Within the instruction sequence, placeholders refer to C expressions. The GCC versions this article was written for allow at most ten placeholders per asm statement: %0, %1, ..., %9 (more recent releases raise this limit). The operand constraint strings follow the instruction sequence and describe each placeholder: which C expression it corresponds to, and what kind of operand it may be. There are three groups of constraint strings: the output operand constraints (after the first colon that follows the instruction sequence), the input operand constraints (after the second colon), and a third group after the third colon. Entries within the same group are separated by commas. Counting across the output and input groups, the first constraint string in the asm statement describes placeholder "%0", the second describes "%1", and so on, regardless of which group a string belongs to. If the instruction sequence has no output operand, the first constraint string (which then describes an input operand) must still be preceded by two colons, so that the compiler knows the instruction has no outputs.
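To make the "two colons when there is no output" rule concrete, here is a minimal sketch of an asm statement that has input operands only. The helper name store_word and the plain-C fallback for non-x86 targets are assumptions for illustration; the empty output list is the point of the example.

```c
#include <stdint.h>

/* Hypothetical helper: store value at *dst using an asm with inputs only. */
static inline void store_word(uint32_t *dst, uint32_t value)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("movl %1,(%0)"
                         :                        /* no output operands */
                         : "r"(dst), "r"(value)   /* %0 = dst, %1 = value */
                         : "memory");             /* the asm writes memory GCC cannot see */
#else
    *dst = value;                                 /* assumption: portable fallback */
#endif
}
```

Because the store happens through a pointer that GCC only sees as an input, the third group lists "memory" so the compiler does not assume *dst is unchanged.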
The C expression corresponding to an output operand must be an lvalue; input operands, of course, carry no such restriction. An output operand described with "=" is write-only. In other words, this form does not directly support an instruction that reads an operand, computes with it, and stores the result back into the same operand; such instructions need a specific form. If an assembly instruction has an input-output operand, the template must use two placeholders for the operand's two roles: one for input and one for output. For example:
asm("addl %2,%0" : "=r"(foo) : "0"(foo), "g"(bar));
In the statement above,
"%0" is an input-output operand.
"=r"(foo) constrains its output role: the result of the instruction is stored in the C variable foo.
The instruction has no explicit "%1" operand, but there is a constraint string for it, "0"(foo). This implicit "%1" operand describes the input role of "%0": the "0" in its constraint string requires "%1" and "%0" to share the same location.
The template can therefore be read as follows: the instruction adds the values in "%1" and "%2" and stores the result in "%0", where "%1" occupies the same location as "%0". Note that the "0" constraint on "%1" is sufficient to guarantee that "%1" and "%0" share the same location.
However, the following statement will not work reliably for the same input-output operation:
asm("addl %2,%0" : "=r"(foo) : "r"(foo), "g"(bar));
Although "%0" and "%1" in this statement both refer to the C variable foo, GCC does not guarantee that they end up in the same location in the generated assembly.
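The working form with the "0" matching constraint can be wrapped in a small function to try it out. This is only a sketch: the name add_matching and the plain-C fallback for non-x86 targets are assumptions, and "cc" is listed because addl changes the flags.

```c
/* Hypothetical helper: returns foo + bar using a matching constraint. */
static inline int add_matching(int foo, int bar)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__("addl %2,%0"
            : "=r"(foo)            /* %0: output, any general register */
            : "0"(foo), "g"(bar)   /* %1: same location as %0; %2: reg, mem or immediate */
            : "cc");
#else
    foo += bar;                    /* assumption: portable fallback */
#endif
    return foo;
}
```

The "0" on the first input is what forces the old value of foo into the register that "addl" then updates in place.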
Some assembly instructions change the values of certain registers, and the compiler must be told about this in the asm statement. For that reason there is a third group of strings, separated by a colon from the input operand constraints. These strings are register names, and they indicate that the instructions change the contents of those registers.
Inline assembly may also refer to hardware registers directly. As we already know, register names in AT&T syntax carry a "%" prefix. To keep that "%" in the generated assembly, a register reference inside an asm statement that has operands must be written with "%%" in front of the register name. If the instructions change a hardware register, do not forget to tell the compiler by adding the register to the third group of strings. Some instructions change the CPU flags register EFLAGS; in that case add "cc" to the third group.
To keep GCC from altering the assembly instructions in an asm statement during optimization, add the volatile keyword after asm.
One asm statement can contain several assembly instructions. Separate them with ";" or "\n".
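The points above (the "%%" prefix, the third group of strings, volatile, and several instructions in one statement) can be seen together in the following sketch. The function name double_via_eax is made up for illustration, routing the value through eax is deliberately artificial, and the plain-C fallback for non-x86 targets is an assumption.

```c
/* Hypothetical helper: doubles x, deliberately routing the value through eax. */
static inline unsigned int double_via_eax(unsigned int x)
{
    unsigned int out;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__(
        "movl %1,%%eax\n\t"       /* "%%eax" is emitted as "%eax" */
        "addl %%eax,%%eax\n\t"    /* instructions are separated by "\n\t" */
        "movl %%eax,%0"
        : "=r"(out)
        : "r"(x)
        : "eax", "cc");           /* tell GCC that eax and the flags are changed */
#else
    out = x + x;                  /* assumption: portable fallback */
#endif
    return out;
}
```

Because "eax" appears in the third group, GCC will not pick eax for either %0 or %1, so the explicit use of the register inside the template is safe.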
② Operand constraint characters
In an operand constraint string, constraint characters describe the corresponding operand. Some common constraint characters are listed below (others are not covered here; see gcc.info):
1. "m": The operand is a memory variable.
2. "o": The operand is a memory variable whose addressing mode must be "offsettable", that is, base or base-plus-index addressing.
3. "V": The operand is a memory variable whose addressing mode is not "offsettable".
4. ">": The operand is a memory variable whose address is automatically incremented.
6. "r": The operand is a general register.
7. "i": The operand is an immediate operand (its value can be determined at assembly time).
8. "n": The operand is an immediate operand with a known numeric value. Some systems do not support assembly-time constants for operands narrower than a word; such operands should use "n" instead of "i".
9. "g": The operand may be an immediate, a memory variable, or a register, as long as the register is a general register.
10. "X": The operand may be of any type.
11. "0", "1", ..., "9": The operand matches the operand with that number. For example, if "0" is used to describe the "%1" operand, then "%1" is in fact the same operand as "%0".
12. "p": The operand is a valid memory address (pointer).
13. "=": The operand is write-only in the instruction (output operand).
14. "+": The operand is both read and written in the instruction (input-output operand).
22. "f": A floating point register.
23. "t": The first floating point register.
24. "u": The second floating point register.
27. "I": An integer constant between 0 and 31 (for 32-bit shift instructions).
28. "J": An integer constant between 0 and 63 (for 64-bit shift instructions).
29. "N": An immediate between 0 and 255 (for the "out" instruction).
30. "G": A standard 80387 floating point constant.
Note: some uncommon constraints are not covered here. As for some others, such as "%" and "&", my compiler knowledge is limited and I am not very familiar with their meaning; if you are willing to fill them in, I would be very grateful. In any case, the constraint characters that appear in the kernel code are largely the ones listed above.
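A few of the constraint characters from the list can be tried in isolation. The two helpers below are illustrative sketches only: the names rotl3 and load_word are invented, the rotate count 3 is arbitrary, and the plain-C fallbacks for non-x86 targets are assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: rotate v left by 3 bits. */
static inline uint32_t rotl3(uint32_t v)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__("roll %1,%0"
            : "+r"(v)      /* "+": read and written; "r": general register */
            : "I"(3)       /* "I": constant in 0..31, emitted as $3 */
            : "cc");
#else
    v = (v << 3) | (v >> 29);   /* assumption: portable fallback */
#endif
    return v;
}

/* Hypothetical helper: load a word through a memory operand. */
static inline uint32_t load_word(const uint32_t *p)
{
    uint32_t v;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl %1,%0"
            : "=r"(v)      /* "=": write-only output in a register */
            : "m"(*p));    /* "m": memory operand */
#else
    v = *p;                /* assumption: portable fallback */
#endif
    return v;
}
```

Compiling a file that contains these functions with gcc -S and reading the lines between #APP and #NO_APP shows how each placeholder was filled in.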
Supplement to the article "embedded assembly language in GCC"
When I first came across assembly code in AT&T format, the pile of strange symbols gave me a headache. I had to chew through the GCC documentation slowly, and it took a while to get started. Later I found the article "embedded assembly language in GCC" written by lingxi on the Internet, and reading it helped me a great deal. After a few months of working with the source code I gradually gained some experience. To spare others the pain I went through on first meeting AT&T-format assembly code, I have written up what I learned to share with everyone. If there are any errors, corrections are welcome so that we can improve it together.
This article uses examples to explain inline assembly in GCC.
1. How GCC processes inline assembly
When compiling inline assembly, GCC takes the following steps:
Variable input: based on the constraint, each input operand is placed in a suitable register. If the constraint specifies an immediate ("i") or a memory variable ("m"), this step is skipped. If the constraint does not pin down the kind of operand (for example "g"), GCC decides as needed whether to load the operand into a register. After this step, every placeholder corresponds to a register, a memory variable, or an immediate. This is what the content after the second colon means. For example, ::"a"(foo),"i"(100),"m"(bar) means that %0 corresponds to the eax register, %1 to 100, and %2 to the memory variable bar.
Code generation: next, using this one-to-one correspondence (which also includes the output operands), GCC replaces the placeholders in the assembly code with those registers, memory variables, or immediates, somewhat like macro expansion. Note that this step does not check whether the assembly produced by the substitution is legal. For example, given the statement asm("movl %0,%1" :: "m"(foo), "m"(bar)); if you compile the source file with gcc -c -S, the generated assembly file contains an instruction like movl foo,bar, which is obviously invalid. The error is only caught later, when the assembly is checked.
Variable output: based on the output constraint, the register contents are written out to the memory variable. If the output operand's constraint specifies a memory variable ("m"), this step is skipped. This is what the content after the first colon means. For example, asm("mov %0,%1" : "=m"(foo), "=a"(bar) :); compiles to:
#APP
movl foo, %eax
#NO_APP
movl %eax, bar
Although this statement is a bit odd, it shows clearly how GCC works.
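The three steps can be observed with a small function that mixes an "a" operand, an "i" operand and an "m" operand, mirroring the constraint list used in the explanation above. This is a sketch under stated assumptions: the name add_const_and_mem is invented, the non-x86 fallback is an assumption, and the "&" on the output (explained later in this article) is added because eax is written before the memory operand is read.

```c
/* Hypothetical helper: returns foo + 100 + *bar. */
static inline int add_const_and_mem(int foo, const int *bar)
{
    int result;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("addl %2,%0\n\t"       /* becomes: addl $100, %eax */
            "addl %3,%0"           /* becomes: addl <address of *bar>, %eax */
            : "=&a"(result)        /* output step: eax is copied to result afterwards */
            : "0"(foo),            /* input step: foo is loaded into eax first */
              "i"(100),            /* immediate: nothing to load */
              "m"(*bar)            /* memory operand: nothing to load */
            : "cc");
#else
    result = foo + 100 + *bar;     /* assumption: portable fallback */
#endif
    return result;
}
```

Only foo needs an explicit load before the #APP block; the immediate and the memory operand are substituted into the template as they are.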
Take a piece of code from arch/i386/kernel/apm.c as an example and compare it before and after compilation:
Source program:
	__asm__(
		"pushl %%edi\n\t"
		"pushl %%ebp\n\t"
		"lcall %%cs:\n\t"
		"setc %%al\n\t"
		"addl %1,%2\n\t"
		"popl %%ebp\n\t"
		"popl %%edi\n\t"
		:"=a"(ea),"=b"(eb),
		 "=c"(ec),"=d"(ed),"=S"(es)
		:"a"(eax_in),"b"(ebx_in),"c"(ecx_in)
		:"memory","cc");
Assembly code after compilation:
	movl eax_in, %eax
	movl ebx_in, %ebx
	movl ecx_in, %ecx
	#APP
	pushl %edi
	pushl %ebp
	lcall %cs:
	setc %al
	addl %ebx, %ecx		# %1 is eb (in ebx), %2 is ec (in ecx)
	popl %ebp
	popl %edi
	#NO_APP
	movl %eax, ea
	movl %ebx, eb
	movl %ecx, ec
	movl %edx, ed
	movl %esi, es
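The same load, execute, store pattern with fixed-register constraints shows up in much smaller cases. As an illustrative sketch (unrelated to the APM code), the function below wraps the x86 divl instruction, which takes its dividend in edx:eax and leaves the quotient in eax and the remainder in edx. The name udivmod32 and the non-x86 fallback are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: *q = n / d and *r = n % d using one divl. */
static inline void udivmod32(uint32_t n, uint32_t d, uint32_t *q, uint32_t *r)
{
    uint32_t quot, rem;
    assert(d != 0);                      /* divl faults on a zero divisor */
#if defined(__i386__) || defined(__x86_64__)
    __asm__("divl %4"
            : "=a"(quot), "=d"(rem)      /* results arrive in eax and edx */
            : "0"(n), "1"(0u), "rm"(d)   /* eax = n, edx = 0, divisor in reg or memory */
            : "cc");
#else
    quot = n / d;                        /* assumption: portable fallback */
    rem = n % d;
#endif
    *q = quot;
    *r = rem;
}
```

GCC emits the moves into eax and edx before the #APP block and the moves into quot and rem after #NO_APP, just as in the listing above.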
2. Explanation of the content after the third colon
The content after the third colon exists mainly for the sake of GCC's optimizer. It tells GCC how this assembly code uses registers and memory, so that GCC does not make mistakes while optimizing. It can contain "eax", "ebx", "ecx" and so on, meaning that the assembly code explicitly operates on that register. For example, with asm("mov %%eax,%0" : "=r"(foo) :: "eax"); GCC will avoid using eax for temporaries during optimization, and will avoid keeping a memory variable cached in eax across this assembly code.
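Besides register names, the third group can also hold the special string "memory", which says that the assembly code may read or write memory in ways GCC cannot see. A common use is an empty asm statement that acts purely as a compiler barrier. The sketch below shows the idea; the names shared_flag, compiler_barrier and store_then_reload are invented for illustration.

```c
/* Hypothetical global used only for this illustration. */
static int shared_flag;

/* An empty template with a "memory" entry: no instruction is emitted,
 * but GCC must assume that memory may have changed at this point. */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" ::: "memory");
}

static int store_then_reload(int v)
{
    shared_flag = v;
    compiler_barrier();    /* shared_flag may not stay cached in a register here */
    return shared_flag;    /* reloaded from memory after the barrier */
}
```

The barrier constrains only the compiler; it does not by itself order memory accesses as seen by other processors.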
The following code, compiled with gcc -O2, shows the effect of "eax" after the third colon in inline assembly:
Source program:
	int main()
	{
		int bar = 1;
		bar = fun();
		bar++;
		return bar;
	}
Assembly code after compilation:
	pushl %ebp
	movl %esp, %ebp
	call fun
	incl %eax	# obviously, bar uses the eax register by default
	leave
	ret
After the inline assembly is added:
Source program:
	int main()
	{
		int bar = 1;
		bar = fun();
		asm volatile ("" ::: "eax");
		bar++;
		return bar;
	}
Assembly code after compilation:
	pushl %ebp
	movl %esp, %ebp
	call fun
	movl %eax, %edx	# bar now uses the edx register
	incl %edx
	movl %edx, %eax	# put the return value of main() in eax
	ret
"memory" is another common entry. It says that the assembly code changes memory in unpredictable ways, so GCC will not keep memory variables cached in registers across the assembly code during optimization; otherwise synchronization errors could occur. With the example above in mind, this is easy to understand.
3. Description of the "&" modifier
This is a common modifier used on output operands. It tells GCC that the register used for the output operand must not also be used for an input operand.
For "g" and "r" operands, GCC generally picks the same register for input and output operands in order to make good use of the few general registers available. If the code is not written carefully, however, unexpected errors can occur. For example:
asm("call fun; mov %%ebx,%1" : "=a"(foo) : "r"(bar));
GCC's compiled result has foo and bar both using the eax register:
movl bar, %eax
#APP
call fun
movl %ebx, %eax
#NO_APP
movl %eax, foo
The intent of this code is to store the return value of fun() in the variable foo, but something goes wrong along the way: the return value is overwritten with the value of ebx. So this code is wrong. The fix is to add the "&" modifier to the output operand:
asm("call fun; mov %%ebx,%1" : "=&a"(foo) : "r"(bar));
This way GCC makes the input operand find another register and no longer uses eax for it.
Note:
In fact, "&" is of no significance when reading code; only GCC cares about it, at compile time.
If you do want to know what "&" does, here is an explanation.
1. When GCC processes inline assembly, if two input operands have the same value, it may allocate them to the same register to reduce register use.
2. GCC treats the inline assembly as a single unit; it does not know whether it contains one instruction or several. It generally assumes that all input operands have been read before any output is written. This assumption holds in most cases, but it may fail when there are several instructions. For example, given the definition:
#define add1(a, b) asm("incl %0\n\taddl %2,%0" : "=r"(res) : "0"(a), "r"(b))
which computes a + b + 1, look at the instructions generated for add1(a, a), assuming %eax holds the value of a:
# %0 = eax, %2 = eax
#APP
incl %eax
addl %eax, %eax
#NO_APP
The result is wrong. Anyone who reads or writes assembly can see this easily, but GCC cannot. Add an "&" to the definition:
#define add1(a, b) asm("incl %0\n\taddl %2,%0" : "=&r"(res) : "0"(a), "r"(b))
This tells GCC that operand %0 is earlyclobber and must not be assigned the same register as %2. Now add1(a, a) generates:
movl %eax, %edx
# %0 = edx, %2 = eax
#APP
incl %edx
addl %eax, %edx	# edx = output
#NO_APP
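The earlyclobber version can be tried as an ordinary function. The sketch below follows the add1 definition with "&"; writing it as a function named add_plus_one rather than a macro, and the plain-C fallback for non-x86 targets, are assumptions for illustration.

```c
/* Hypothetical helper: returns a + b + 1, safe even when a and b are equal. */
static inline int add_plus_one(int a, int b)
{
    int res;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("incl %0\n\t"          /* %0 is written here, before %2 is read */
            "addl %2,%0"
            : "=&r"(res)           /* "&": %0 must not share a register with %2 */
            : "0"(a), "r"(b)
            : "cc");
#else
    res = a + b + 1;               /* assumption: portable fallback */
#endif
    return res;
}
```

With the "&" in place, add_plus_one(5, 5) is expected to give 11; dropping the "&" would allow the compiler to reuse one register for both operands and produce 12 instead.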
4. Description of the "%" modifier (supplemented by la)
"%": declares that this operand and the next operand are commutative in the instruction. The compiler may swap the two operands if doing so satisfies the operand constraints at lower cost. This is often used in instruction templates for addition instructions that really have only two operands, where the result must be stored in one of them.