Support for Fast System Calls on New CPUs in Linux 2.6
Reposted from: IBM developerWorks China
Liu Zui, Linux enthusiast
May 2004
This article analyzes how Linux 2.6 supports sysenter/sysexit, the fast system call instructions introduced on Intel CPUs. By understanding how these instructions work, Linux driver and kernel developers can use the mechanism in their own code to improve performance, and can avoid the limitations it brings (for example, a system call cannot be issued this way from inside another system call).
Preface

In the Linux 2.4 kernel, user-mode (ring 3) code asks kernel-mode (ring 0) code to do work on its behalf through system calls, and a system call is issued with the software interrupt instruction int 0x80. In x86 protected mode, when the CPU handles an int instruction it first fetches the corresponding gate descriptor from the interrupt descriptor table (IDT) and determines the gate type. It then compares the DPL of the gate descriptor with the CPL of the code that executed int. The call is allowed only when CPL <= DPL, that is, when the caller is at least as privileged as the level the descriptor requires. Finally, the CPU switches stacks, jumps, and raises the privilege level according to the descriptor contents. When the kernel code finishes, it returns with the iret instruction, which restores the user stack and jumps back to the less privileged code.

For system calls, much of this work on the way from ring 3 to ring 0 wastes CPU cycles. A system call always goes from ring 3 to ring 0 (the exception is an int instruction issued inside the kernel, which is mostly the work of hackers' kernel modules). The privilege levels before and after the transition are fixed: the CPL must be 3 and the DPL of the int 0x80 gate must be 3, so the comparison of the gate DPL against the caller's CPL is entirely unnecessary. For this reason, Intel x86 CPUs from the Pentium II 300 (family 6, model 3, stepping 3) onward support a new pair of system call instructions, sysenter and sysexit. sysenter enters ring 0 from ring 3, and sysexit returns from ring 0 to ring 3. Because they perform no privilege-level check and no stack operations, they execute much faster than int n/iret.
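The gate check described in the preface can be sketched in a few lines of C. This is only an illustration of the CPL <= DPL comparison, not CPU or kernel code; the helper name int_gate_allows and the RING0/RING3 constants are hypothetical.

```c
#include <assert.h>

/* x86 privilege levels: numerically lower means more privileged. */
enum { RING0 = 0, RING3 = 3 };

/*
 * Hypothetical helper: the CPL <= DPL test applied to a software
 * interrupt gate. Returns 1 if the int instruction may proceed.
 */
int int_gate_allows(unsigned cpl, unsigned gate_dpl)
{
    assert(cpl <= RING3 && gate_dpl <= RING3);   /* only rings 0..3 exist */
    return cpl <= gate_dpl;
}
```

For the int 0x80 gate both values are always 3, so int_gate_allows(RING3, RING3) can never fail. That is the redundancy that sysenter removes.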
Performance comparison of different system call methods

The following comparison of the sysenter/sysexit and int n/iret instructions on an Intel Pentium CPU comes from the Internet.

Table 1: System call performance
Test hardware: Intel Pentium III CPU, 450 MHz (processor family 6, model 7, stepping 2)

System call method | Time spent in user mode | Time spent in kernel mode |
Based on sysenter/sysexit | 9.833 microseconds | 6.833 microseconds |
Based on interrupt int n | 17.500 microseconds | 7.000 microseconds |

Data sources: [1], [2]
Table 2: Comparison of int 0x80 and sysenter execution speed on various CPUs

CPU | int 0x80 | sysenter |
Athlon XP 1600+ | 277 | 169 |
800 MHz mode 1 Athlon | 279 | 170 |
2.8 GHz P4 Northwood HT | 1152 | 442 |
The figures above are the average number of CPU clock cycles consumed per call over 100000 getppid() system calls. Data source: [3]

Ever since this technology appeared, people have been considering adding support for these instructions to Linux. On the kernel.org mailing list, a long thread titled "Intel P6 vs P7 system call performance" discussed whether they were needed. The reason given in the thread is that Intel made a design mistake in the Pentium 4: a system call made through a software interrupt costs 5 to 10 times as many CPU clock cycles on a Pentium 4 as on a Pentium III or an AMD Athlon. On the Pentium 4 platform, therefore, using sysenter/sysexit for system calls is imperative.

The sysenter/sysexit system call mechanism

The sysenter/sysexit instructions are described in Intel's Software Developer's Manual (Vol. 2B; Vol. 3, Section 4.8.7). According to the manual, sysenter lets user code at privilege level 3 call kernel code at privilege level 0, and sysexit lets code at privilege level 0 return to user space. sysenter can be executed at privilege levels 3, 2, and 1 (Linux uses only level 3), while sysexit can be executed only at privilege level 0.

A system that executes sysenter must meet two conditions:

1. The target ring 0 code segment must be a flat-mode, 4 GB, readable and executable, non-conforming code segment.
2. The target ring 0 stack segment must be a flat-mode, 4 GB, readable and writable, expand-up stack segment.

The Intel manual also points out how sysenter/sysexit differ from int n/iret: sysenter and sysexit are not a matched pair. sysenter does not push the return address that sysexit needs, and the address sysexit returns to is not necessarily the instruction following sysenter.
The control transfers performed by sysenter/sysexit are driven by a set of model-specific registers (MSRs):

SYSENTER_CS_MSR: specifies the code segment selector of the ring 0 code to execute; the stack segment selector used by the target ring 0 code is derived from it as well.
SYSENTER_EIP_MSR: specifies the entry address of the ring 0 code to execute.
SYSENTER_ESP_MSR: specifies the stack pointer used by the ring 0 code to execute.

These registers are set with the wrmsr instruction. wrmsr takes the value to write in EDX:EAX, with EDX holding the high 32 bits and EAX the low 32 bits; for the registers above, EDX is 0. The MSR to write is selected by the value in ECX: SYSENTER_CS_MSR, SYSENTER_ESP_MSR, and SYSENTER_EIP_MSR correspond to 0x174, 0x175, and 0x176. Note that wrmsr can be executed only in ring 0.

One more feature matters here: the descriptors for the ring 0 code segment, the ring 0 stack segment, the ring 3 code segment, and the ring 3 stack segment are laid out consecutively, in that order, in the global descriptor table (GDT). Knowing the ring 0 code segment selector in SYSENTER_CS_MSR is therefore enough to derive the other three.

When ring 3 code executes sysenter, the CPU does the following:

1. Load the value of SYSENTER_CS_MSR into the CS register.
2. Load the value of SYSENTER_EIP_MSR into the EIP register.
3. Load the value of SYSENTER_CS_MSR plus 8 (the ring 0 stack segment descriptor) into the SS register.
4. Load the value of SYSENTER_ESP_MSR into the ESP register.
5. Switch the privilege level to ring 0.
6. If the VM flag of the EFLAGS register is set, clear it.
7. Start executing the specified ring 0 code.

When the ring 0 code has finished and executes sysexit to return to ring 3, the CPU does the following:

1. Load the value of SYSENTER_CS_MSR plus 16 (the ring 3 code segment descriptor) into the CS register.
2. Load the value of the EDX register into the EIP register.
3. Load the value of SYSENTER_CS_MSR plus 24 (the ring 3 stack segment descriptor) into the SS register.
4. Load the value of the ECX register into the ESP register.
5. Switch the privilege level to ring 3.
6. Continue executing the ring 3 code.

So before sysenter can be used to enter ring 0, the ring 0 code information must be set with wrmsr, and before sysexit is executed, EDX and ECX must hold the correct values.

How do we know whether the CPU supports sysenter/sysexit?

According to Intel's CPU manual, the cpuid instruction tells us whether the CPU supports sysenter/sysexit: set EAX to 1 and execute cpuid, and bit 11 of EDX (named SEP) indicates support. After cpuid, the family, model, and stepping of the CPU must also be checked, because the Pentium Pro is said to report SEP without actually supporting sysenter/sysexit. In Intel's reference check, a CPU that reports SEP is treated as not supporting the instructions when its family is 6 and its model and stepping are both less than 3; otherwise support is confirmed.

Support for sysenter/sysexit system calls in Linux

In the 2.4 series, even the latest version at the time of writing, 2.4.26-rc2, has no support for sysenter/sysexit. Support first appeared in 2002: written by Linus Torvalds, it was first added to the 2.5 kernel and, after many rounds of testing and patches, was officially included in the 2.6 kernel.
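Before the kernel can rely on the instructions, it has to apply the detection rule described earlier (CPUID leaf 1, the SEP bit, then the family/model/stepping check). The sketch below expresses that rule as a pure C function. It assumes the CPUID results have already been read and decoded; the name sep_usable is hypothetical, and the exclusion follows the check published in Intel's manual (family 6 with model and stepping both below 3) rather than the code of any particular kernel.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_EDX_SEP (1u << 11)   /* SEP flag: bit 11 of EDX, CPUID leaf 1 */

/*
 * Hypothetical helper: returns 1 if sysenter/sysexit may be used.
 * edx, family, model and stepping are the values reported by
 * CPUID with EAX = 1 (an assumption: the caller has decoded them).
 */
int sep_usable(uint32_t edx, unsigned family, unsigned model, unsigned stepping)
{
    assert(stepping <= 0xF);            /* stepping is a 4-bit field */

    if (!(edx & CPUID_EDX_SEP))
        return 0;                       /* CPU does not advertise SEP */

    /* Early family 6 parts (Pentium Pro) set SEP but lack the instructions. */
    if (family == 6 && model < 3 && stepping < 3)
        return 0;

    return 1;
}
```

With the Pentium III from Table 1 (family 6, model 7, stepping 2) and the SEP bit set, the function reports the instructions as usable, while a family 6, model 1 part is rejected even though it sets SEP.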
The original patch and the discussion around it are archived at http://kerneltrap.org/node/view/531/1996 and http://lwn.net/Articles/18414/.

To understand how a system call is carried out, we cannot look at the kernel code in isolation. Most system calls are wrapped in library functions for applications to call; when an application calls such a function, glibc is responsible for entering the kernel and invoking the system call. With a 2.4 kernel and an older glibc, all the library function does is issue the system call with the int instruction. The interface the kernel has to provide is very simple: as long as the IDT has an entry for int 0x80, the library can complete the call.

The 2.6 kernel supports both the int 0x80 method and the sysenter method. The kernel therefore provides a piece of entry code to user space and decides at boot time, based on the CPU type, which system call method that code uses. glibc completes a system call simply by calling the entry code, without caring which method is in use. This keeps the changes to glibc to a minimum: in the glibc source, the "int $0x80" instruction only has to be replaced with a call to the entry address.

The rest of this article uses kernel 2.6.0 together with glibc 2.3.3, which supports the sysenter method, as an example to analyze how system calls are implemented.

Kernel preparation at startup

The entry code mentioned above lives in two files, one per call method. The code that uses sysenter is in arch/i386/kernel/vsyscall-sysenter.S, and the code that uses the int interrupt is in arch/i386/kernel/vsyscall-int80.S; in both, the entry point is named __kernel_vsyscall. The binary code built from these two files is included by arch/i386/kernel/vsyscall.S, which exports its start and end addresses.
When the 2.6 kernel starts, it calls the new function sysenter_setup (see arch/i386/kernel/sysenter.c). In this function the kernel maps a fixed-address page at the top of the virtual address space (4 KB, from 0xffffe000 to 0xffffefff) to a free physical page. It then runs cpuid to check whether the CPU supports sysenter/sysexit. If not, it copies the int-based entry code to this page and returns. If the CPU does support sysenter/sysexit, it copies the sysenter-based entry code to the page and then uses the on_each_cpu macro to run enable_sep_cpu on every CPU.

In enable_sep_cpu, the kernel sets ss1 in the TSS of the current CPU to the code segment used by the kernel, and esp1 to the 256-byte stack reserved inside the TSS structure. On x86, the ss1 and esp1 fields of the TSS are meant to hold the stack segment and stack pointer for ring 1. At boot, the kernel cannot predict what ESP should be when sysenter later enters ring 0, and applications are not allowed to execute wrmsr to set it dynamically, so esp1 is pointed at a fixed buffer and used to fill the MSR. Since ring 1 is not used at all, this has no effect on the system. How the kernel fixes up ESP to point at the correct ring 0 stack after entry is described later. (For the TSS structure, see include/asm-i386/processor.h.)

The kernel then uses the wrmsr(msr, val1, val2) macro, which executes the wrmsr instruction, to set the MSRs of the current CPU. In each call the third argument, which goes into EDX, is 0.
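To illustrate how a value is handed to wrmsr, the following sketch builds the ECX/EAX/EDX register image for one MSR write. It does not execute wrmsr, which is possible only in ring 0. The struct wrmsr_regs and the function make_wrmsr_regs are hypothetical names introduced for this example, and the kernel code segment selector 0x60 in the usage note is an assumption about the 2.6 i386 GDT layout.

```c
#include <assert.h>
#include <stdint.h>

/* MSR indices of the three sysenter registers (Intel-documented values). */
#define SYSENTER_CS_MSR  0x174u
#define SYSENTER_ESP_MSR 0x175u
#define SYSENTER_EIP_MSR 0x176u

/* Register image that a wrmsr instruction would consume. */
struct wrmsr_regs {
    uint32_t ecx;   /* MSR index */
    uint32_t eax;   /* low 32 bits of the value */
    uint32_t edx;   /* high 32 bits of the value */
};

/* Hypothetical helper: split a value into the ECX/EAX/EDX operands. */
struct wrmsr_regs make_wrmsr_regs(uint32_t msr, uint64_t value)
{
    struct wrmsr_regs r;

    assert(msr >= SYSENTER_CS_MSR && msr <= SYSENTER_EIP_MSR);
    r.ecx = msr;
    r.eax = (uint32_t)(value & 0xffffffffu);
    r.edx = (uint32_t)(value >> 32);
    return r;
}
```

Calling make_wrmsr_regs(SYSENTER_CS_MSR, 0x60) yields ecx = 0x174, eax = 0x60, and edx = 0. Every value the kernel writes here fits in 32 bits on i386, which is why the third macro argument is always 0.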
SYSENTER_CS_MSR is set to the code segment used by the kernel. SYSENTER_ESP_MSR is set to esp1, that is, it points to the stack inside the TSS of the current CPU. SYSENTER_EIP_MSR is set to sysenter_entry, the kernel entry point that handles sysenter (see arch/i386/kernel/entry.S). With that, the preparation for sysenter is complete.

Once the kernel has booted, the code page it mapped is accessible in the address space of every process; for applications the page is, of course, read-only. Running a recent version of the ldd tool on any executable shows the following:
    [root@test]# file dynamic
    dynamic: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped
    [root@test]# ldd dynamic
            linux-gate.so.1 => (0xffffe000)
            libc.so.6 => /lib/tls/libc.so.6 (0x4002c000)
            /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
The so-called "linux-gate.so.1" is the code mapped by the kernel. No such library file actually exists on the system; the name is supplied by ldd itself. Older versions of ldd can detect this code too, but because it has no name and no matching library file can be found, they may display it incorrectly. For background, see http://sources.redhat.com/ml/libc-alpha/2003-09/msg00263.html.

How a user-mode function enters the kernel

To use the new system call method together with the kernel, glibc has to be modified as well. The changes are included in glibc 2.3.2 and later. In the glibc source file sysdeps/unix/sysv/linux/i386/sysdep.h, the INTERNAL_SYSCALL macro that issues system calls expands differently under different build options. With the I386_USE_SYSENTER option enabled, which turns on sysenter/sysexit support, there are two ways to make the call: statically linked programs (built with the -static option) use "call *_dl_sysinfo", and dynamically linked programs use "call *%gs:0x10".

Where do these two forms actually end up in glibc? In both cases they amount to calling code at a fixed address. We will verify this with a small program and GDB.

The first case is a statically linked program. The code is very simple (the listing is missing from this copy; judging from the disassembly below, main does little more than call getuid()).

Compile the code statically with GCC by adding the -static option, then load it in GDB and disassemble main.
    [root@test opt]# gcc test.c -o ./static -static
    [root@test opt]# gdb ./static
    (gdb) disassemble main
    0x08048204 <main+0>: push %ebp
    0x08048205 <main+1>: mov %esp,%ebp
    0x08048207 <main+3>: sub $0x8,%esp
    0x0804820a <main+6>: and $0xfffffff0,%esp
    0x0804820d <main+9>: mov $0x0,%eax
    0x08048212 <main+14>: sub %eax,%esp
    0x08048214 <main+16>: call 0x804cb20 <__getuid>
    0x08048219 <main+21>: leave
    0x0804821a <main+22>: ret
As you can see, main calls the __getuid function. Next, disassemble __getuid.
    (gdb) disassemble 0x804cb20
    0x0804cb20 <__getuid+0>: push %ebp
    0x0804cb21 <__getuid+1>: mov 0x80aa028,%eax
    0x0804cb26 <__getuid+6>: mov %esp,%ebp
    0x0804cb28 <__getuid+8>: test %eax,%eax
    0x0804cb2a <__getuid+10>: jle 0x804cb40 <__getuid+32>
    0x0804cb2c <__getuid+12>: mov $0x18,%eax
    0x0804cb31 <__getuid+17>: call *0x80aa054
    0x0804cb37 <__getuid+23>: pop %ebp
    0x0804cb38 <__getuid+24>: ret
The above is only part of the __getuid function. It loads the eax register with 0x18, the number of the getuid system call, and then calls another function. Where is the entry point of that function? That depends on the value stored at address 0x80aa054.
    (gdb) x 0x80aa054
    0x80aa054 <_dl_sysinfo>: 0x0804d7f6
This value does not look like an address in the page mapped by the kernel, but the output does confirm that 0x80aa054 is the address of the _dl_sysinfo pointer. Next, start the program, stop at its first statement, and look at the value again.
    (gdb) b main
    Breakpoint 1 at 0x804820a
    (gdb) r
    Starting program: /opt/static
    Breakpoint 1, 0x0804820a in main ()
    (gdb) x 0x80aa054
    0x80aa054 <_dl_sysinfo>: 0xffffe400
The value of the _dl_sysinfo pointer has changed and now points to 0xffffe400. If the program continues, __getuid will call the code at address 0xffffe400.

Next, compile the same code with dynamic linking, which is the default, then load it in GDB and disassemble main.
    [root@test opt]# gcc test.c -o ./dynamic
    [root@test opt]# gdb ./dynamic
    (gdb) disassemble main
    0x08048204 <main+0>: push %ebp
    0x08048205 <main+1>: mov %esp,%ebp
    0x08048207 <main+3>: sub $0x8,%esp
    0x0804820a <main+6>: and $0xfffffff0,%esp
    0x0804820d <main+9>: mov $0x0,%eax
    0x08048212 <main+14>: sub %eax,%esp
    0x08048214 <main+16>: call 0x8048288
    0x08048219 <main+21>: leave
    0x0804821a <main+22>: ret
Because the libc library is loaded only when the program is initialized, we first start the program, stop at the first statement of main, and then disassemble the getuid library function.
    (gdb) b main
    Breakpoint 1 at 0x804820a
    (gdb) r
    Starting program: /opt/dynamic
    Breakpoint 1, 0x0804820a in main ()
    (gdb) disassemble getuid
    Dump of assembler code for function getuid:
    0x40219e50 <__getuid+0>: push %ebp
    0x40219e51 <__getuid+1>: mov %esp,%ebp
    0x40219e53 <__getuid+3>: push %ebx
    0x40219e54 <__getuid+4>: call 0x40219e59 <__getuid+9>
    0x40219e59 <__getuid+9>: pop %ebx
    0x40219e5a <__getuid+10>: add $0x84b0f,%ebx
    0x40219e60 <__getuid+16>: mov 0xffffd87c(%ebx),%eax
    0x40219e66 <__getuid+22>: test %eax,%eax
    0x40219e68 <__getuid+24>: jle 0x40219e80 <__getuid+48>
    0x40219e6a <__getuid+26>: mov $0x18,%eax
    0x40219e6f <__getuid+31>: call *%gs:0x10
    0x40219e76 <__getuid+38>: pop %ebx
    0x40219e77 <__getuid+39>: pop %ebp
    0x40219e78 <__getuid+40>: ret
As you can see, the library function getuid sets the eax register to 0x18, the number of the getuid system call, and then calls the function pointed to by %gs:0x10. GDB cannot display data in segments other than DS, so the value stored at %gs:0x10 cannot be inspected directly. A program can, however, copy the value at %gs:0x10 into a local variable, and the value obtained that way is 0xffffe400. The code for doing so is not covered here.

In both the static and the dynamic case, then, we end up at the code at 0xffffe400, which is the system call entry code mapped by the kernel. In GDB we can disassemble this address directly to see the code.
    (gdb) disassemble 0xffffe400 0xffffe414
    Dump of assembler code from 0xffffe400 to 0xffffe414:
    0xffffe400: push %ecx
    0xffffe401: push %edx
    0xffffe402: push %ebp
    0xffffe403: mov %esp,%ebp
    0xffffe405: sysenter
    0xffffe407: nop
    0xffffe408: nop
    0xffffe409: nop
    0xffffe40a: nop
    0xffffe40b: nop
    0xffffe40c: nop
    0xffffe40d: nop
    0xffffe40e: jmp 0xffffe403
    0xffffe410: pop %ebp
    0xffffe411: pop %edx
    0xffffe412: pop %ecx
    0xffffe413: ret
    End of assembler dump.
This is exactly the code in arch/i386/kernel/vsyscall-sysenter.S. The entry code is the part up to and including sysenter, and the code that handles the return from the kernel starts at 0xffffe410 (the sysenter_return label mentioned later points here). The entry code first saves ECX and EDX (because sysexit needs these two registers) and EBP, and then executes sysenter to jump to the ring 0 code of the kernel, that is, to sysenter_entry.

Processing and returning in the kernel

For the full implementation of sysenter_entry, see arch/i386/kernel/entry.S. The kernel handles a sysenter entry somewhat differently from an int entry. Right after sysenter enters ring 0, ESP does not point to the correct kernel stack; it points to a buffer inside the TSS of the current CPU (see the previous section). The first thing to do is therefore to fix ESP. Fortunately, the esp0 member of the TSS holds the ring 0 ESP value, so the kernel loads esp0 from the TSS into ESP.

Once ESP points at the correct stack, a second difference has to be dealt with. Because sysenter does not enter ring 0 through a gate, the stack contents differ from what an int instruction leaves behind. After an int instruction enters ring 0, the stack holds the following values:

Low address
User-mode EIP (return address) |
User-mode CS |
User-mode EFLAGS |
User-mode ESP |
User-mode SS (same as DS) |
High address

To keep the code simple and reusable, the kernel pushes these values onto the stack itself with pushl instructions. Note that the value the kernel pushes as the user-mode EIP is the address of a code label, sysenter_return, which is located after the sysenter instruction in vsyscall-sysenter.S (the run of nop instructions between them is the code used when the kernel returns an error).

From this point on, the code that handles the system call is exactly the same as for the interrupt method. The kernel saves all the registers and then looks up the corresponding entry in the system call table to complete the call. Finally, the kernel takes the previously saved user-mode EIP and ESP from the stack, places them in the EDX and ECX registers, and executes sysexit to return to user mode. Back in user mode, EBP, EDX, and ECX are popped from the stack and control returns to glibc.

Support in other operating systems and on other hardware platforms

It is worth mentioning that, starting with Windows XP, the Windows system call method also changed from the software interrupt int 0x2e to sysenter. Because the int method is no longer supported at all, the minimum CPU required for Windows XP is a Pentium II 300 MHz. Other operating systems, such as the *BSD family, do not currently support sysenter.

On the CPU side, AMD CPUs support a corresponding pair of instructions, syscall/sysret. Pure 32-bit AMD CPUs do not support sysenter, but AMD's amd64 series can support sysenter/sysexit in some modes. The Linux kernel code for the amd64 architecture uses syscall/sysret. It is not yet possible to say which of the two instruction pairs will eventually become the standard.
Future

Intel's sysenter/sysexit and AMD's syscall/sysret are collectively known as "fast system call instructions". Compared with the interrupt instruction they take less time, although as CPU designs evolve, a gap as wide as the one on the Intel Pentium 4 should not occur again. Compared with the interrupt method, fast system call instructions still have limitations; for example, while one system call is being handled, another system call cannot be issued through a fast system call instruction. So not every system call has to be implemented with them. For a complex system call such as fork, the difference between the two methods is negligible next to the time the call itself takes, and there is no need for the fast method. For system calls that run briefly and are timing-sensitive, such as getuid and gettimeofday, the fast method should be used. Only by staying flexible and choosing the method per system call can we get the best performance and the most complete functionality.

References

[1] VxWorks Optimized for Intel Architecture, Hdei Nunoe, Wind River, member of technical staff; Leo Samson, Wind River, technical marketing engineer; David Hillyard, Intel Corporation, Mgr., platform effecect
[2] Kernel Entry/Kernel Exit, Marcus Voelp, University Karlsruhe
[3] Dave Jones' blog, http://diary.codemonkey.org.uk/index.php?month=12&year=2002
[4] Linux kernel source v2.6.0, http://www.kernel.org/ [Linus Torvalds, 2004]
[5] GNU C Library glibc 2.3.3 source, http://www.gnu.org/software/libc/libc.html
[6] Linux kernel mailing list, "Intel P6 vs P7 system call performance", http://www.ussg.iu.edu/hypermail/linux/kernel/0212.1/index.html#1286 and http://www.ussg.iu.edu/hypermail/linux/kernel/0212.3/index.html#54
[7] Linux kernel mailing list, first introduction of sysenter/sysexit support: "Add "sysenter" support on x86, and a "vsyscall" page.", http://lwn.net/Articles/18414/
About the author

Liu Zui is a Linux enthusiast who has studied driver development and kernel security, and is interested in the Linux kernel and Java virtual machines. You can contact him at liuzirui@ustc.edu.