I. Overview
A system call is the mechanism by which an application interacts with the kernel. As an interface, it lets an application enter the operating system kernel and use the resources the kernel provides, such as operating hardware, enabling and disabling interrupts, and changing the privilege mode. A system call is implemented as a software interrupt. Like any interrupt, it has two attributes: an interrupt number and an interrupt handler. On 32-bit x86, Linux uses interrupt 0x80 as the system call entry, and the address of the interrupt handler is stored in the interrupt vector table.
II. Process
The read() system call in linux-2.6.38 is used as the example here.
In user space, read() is declared in <unistd.h> (#include <unistd.h>). Its prototype is ssize_t read(int fd, void *buf, size_t count). The following pseudocode shows how the user-space read() function is defined:
     1  ssize_t read(int fd, void *buf, size_t count)
     2  {
     3      long res;
     4      %eax = __NR_read
     5      %ebx = fd
     6      %ecx = (long)buf
     7      %edx = count
     8      int $0x80
     9      res = %eax
    10      return res;
    11  }
Line 4: the eax register holds the read() system call number, which is defined in /arch/x86/include/asm/unistd_32.h (#define __NR_read 3). Lines 5 to 7: the three parameters are placed in three registers (parameters are passed through registers). Line 8: the system call is executed and the program enters the kernel. Line 9: the return value is read back from the eax register.
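The register-level pseudocode above can be approximated from C with the glibc syscall() wrapper, which takes the system call number followed by its arguments and traps into the kernel. This is a minimal sketch, assuming Linux with glibc. raw_read is a hypothetical name chosen for illustration, and the instruction actually used to enter the kernel depends on the architecture and C library, so it need not be int $0x80.

```c
#define _GNU_SOURCE         /* makes syscall() visible in <unistd.h> on glibc */
#include <sys/syscall.h>    /* SYS_read: the system call number for read */
#include <sys/types.h>      /* ssize_t */
#include <unistd.h>         /* syscall(), pipe(), write() */

/*
 * raw_read (hypothetical helper): issue the read system call by number,
 * bypassing the read() library wrapper. The number and the three
 * arguments correspond to eax, ebx, ecx and edx in the pseudocode.
 * On failure syscall() returns -1 and sets errno.
 */
ssize_t raw_read(int fd, void *buf, size_t count)
{
    return (ssize_t)syscall(SYS_read, fd, buf, count);
}
```

Calling raw_read(fd, buf, n) should behave like read(fd, buf, n); going through the number directly only makes the role of __NR_read visible.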
After line 8 is executed, the program enters the kernel. Because this is an interrupt, control transfers to the handler registered for vector 0x80 in the interrupt vector table. The interrupt vector table is initialized in /arch/x86/kernel/traps.c:
     1  void __init trap_init(void)
     2  {
     3  ...................
     4
     5  #ifdef CONFIG_X86_32
     6      set_system_trap_gate(SYSCALL_VECTOR, &system_call);
     7      set_bit(SYSCALL_VECTOR, used_vectors);
     8  #endif
     9  ...................
    10  }
Line 6 registers the handler. SYSCALL_VECTOR is the interrupt number of the system call. It is defined in /arch/x86/include/asm/irq_vectors.h as follows:
    #ifdef CONFIG_X86_32
    # define SYSCALL_VECTOR 0x80
    #endif
It is exactly 0x80. system_call is the entry point of the system call interrupt handler. After int $0x80 is executed, this routine runs. It is defined in /arch/x86/kernel/entry_32.S:
     1  ENTRY(system_call)
     2      RING0_INT_FRAME # can't unwind into user space anyway
     3      pushl_cfi %eax # save orig_eax
     4      SAVE_ALL
     5      GET_THREAD_INFO(%ebp)
     6      # system call tracing in operation / emulation
     7      testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
     8      jnz syscall_trace_entry
     9      cmpl $(nr_syscalls), %eax
    10      jae syscall_badsys
    11  syscall_call:
    12      call *sys_call_table(,%eax,4)
    13      movl %eax,PT_EAX(%esp) # store the return value
        ...........
SAVE_ALL on line 4 is a macro, which is also defined in this file:
    .macro SAVE_ALL
        cld
        PUSH_GS
        pushl_cfi %fs
        /*CFI_REL_OFFSET fs, 0;*/
        pushl_cfi %es
        /*CFI_REL_OFFSET es, 0;*/
        pushl_cfi %ds
        /*CFI_REL_OFFSET ds, 0;*/
        pushl_cfi %eax
        CFI_REL_OFFSET eax, 0
        pushl_cfi %ebp
        CFI_REL_OFFSET ebp, 0
        pushl_cfi %edi
        CFI_REL_OFFSET edi, 0
        pushl_cfi %esi
        CFI_REL_OFFSET esi, 0
        pushl_cfi %edx
        CFI_REL_OFFSET edx, 0
        pushl_cfi %ecx
        CFI_REL_OFFSET ecx, 0
        pushl_cfi %ebx
        CFI_REL_OFFSET ebx, 0
        movl $(__USER_DS), %edx
        movl %edx, %ds
        movl %edx, %es
        movl $(__KERNEL_PERCPU), %edx
        movl %edx, %fs
        SET_KERNEL_GS %edx
    .endm
Its main job is to push the registers onto the stack.
Line 9 of system_call checks whether the value in eax is greater than or equal to nr_syscalls. nr_syscalls is one greater than the largest valid system call number, and it is defined in /arch/x86/kernel/entry_32.S:
    #define nr_syscalls ((syscall_table_size)/4)
syscall_table_size is the size of the system call table in bytes. The system call table itself is an array that stores the addresses of the system call service routines. Each element is a long (4 bytes on 32-bit x86), so dividing the size by 4 gives the number of system calls.
If the system call number passed in through the eax register is valid, line 12 is executed and looks up the corresponding service routine in the system call table. sys_call_table is defined in /arch/x86/kernel/syscall_table_32.S:
    ENTRY(sys_call_table)
        .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */
        .long sys_exit
        .long ptregs_fork
        .long sys_read
        .long sys_write
        .long sys_open /* 5 */
        .long sys_close
        .................
call *sys_call_table(,%eax,4) calls the function whose address is stored at offset %eax * 4 from the start of sys_call_table. Here %eax = 3, so the entry at index 3, sys_read(), is called. sys_read() is defined in /fs/read_write.c:
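The bounds check and the indexed call can be modelled in user-space C as an array of function pointers. This is only a sketch of the idea, not kernel code: every demo_ name is hypothetical, the handlers are toy stand-ins, and the table has four entries instead of the full list. The -ENOSYS result for an out-of-range number reflects what the syscall_badsys path is understood to store as the return value.

```c
#include <errno.h>   /* ENOSYS */

/* Every demo handler takes three word-sized arguments, like ebx, ecx, edx. */
typedef long (*demo_syscall_fn)(long a, long b, long c);

/* Toy handlers (hypothetical); each returns something easy to recognise. */
long demo_restart(long a, long b, long c) { (void)a; (void)b; (void)c; return 0; }
long demo_exit(long a, long b, long c)    { (void)b; (void)c; return a; }
long demo_fork(long a, long b, long c)    { (void)a; (void)b; (void)c; return 1; }
long demo_read(long fd, long buf, long count) { (void)fd; (void)buf; return count; }

/* Stand-in for sys_call_table: the index is the system call number. */
static const demo_syscall_fn demo_call_table[] = {
    demo_restart,   /* 0 */
    demo_exit,      /* 1 */
    demo_fork,      /* 2 */
    demo_read,      /* 3 */
};

/* Same idea as nr_syscalls: table size in bytes divided by the entry size. */
#define DEMO_NR_SYSCALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    /* cmpl $(nr_syscalls), %eax ; jae syscall_badsys (unsigned compare) */
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;

    /* call *sys_call_table(,%eax,4): index the table, then call through it */
    return demo_call_table[nr](a, b, c);
}
```

Because nr is unsigned, a negative number passed by the caller becomes a very large value and is rejected by the same comparison, which is also how the unsigned jae branch behaves.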
    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
        struct file *file;
        ssize_t ret = -EBADF;
        int fput_needed;

        file = fget_light(fd, &fput_needed);
        if (file) {
            loff_t pos = file_pos_read(file);
            ret = vfs_read(file, buf, count, &pos);
            file_pos_write(file, pos);
            fput_light(file, fput_needed);
        }

        return ret;
    }
The parameters match those of the user-space function. SYSCALL_DEFINE3 is a macro defined in /include/linux/syscalls.h:
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
SYSCALL_DEFINEx is also a macro, defined in the same file:
    #define SYSCALL_DEFINEx(x, sname, ...) \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
    .....
    #define __SYSCALL_DEFINEx(x, name, ...) \
        asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__))
    ......
After macro expansion, the following function is declared:
    asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);
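The token pasting that turns the name read into sys_read can be tried outside the kernel with a cut-down set of macros. The sketch below is an assumption-laden simplification: the DEMO_ macros and demo_sys_sum3 are hypothetical, asmlinkage is left out, and the argument-list helper is written out for exactly three (type, name) pairs rather than built up the way the kernel's __SC_DECL macros are.

```c
/* Turn three (type, name) pairs into an ordinary parameter list. */
#define DEMO_SC_DECL3(t1, a1, t2, a2, t3, a3) t1 a1, t2 a2, t3 a3

/* Paste the prefix onto the name and select the DECL macro by arity x. */
#define DEMO_SYSCALL_DEFINEx(x, name, ...) \
    long demo_sys##name(DEMO_SC_DECL##x(__VA_ARGS__))

/* Prepend the underscore, as SYSCALL_DEFINE3 does with _##name. */
#define DEMO_SYSCALL_DEFINE3(name, ...) \
    DEMO_SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

/*
 * Expands to: long demo_sys_sum3(int a, int b, int c)
 * The braces that follow become the function body.
 */
DEMO_SYSCALL_DEFINE3(sum3, int, a, int, b, int, c)
{
    return (long)a + b + c;
}
```

Running the file through the preprocessor only (for example with the compiler's -E option) shows the expanded function header directly.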
asmlinkage is a macro defined as __attribute__((regparm(0))). It tells the compiler that the function takes all of its parameters from the stack (the earlier SAVE_ALL pushed the parameters onto the stack).
After the interrupt handler has finished, RESTORE_REGS is invoked to restore the registers:
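Why the stack works as the argument area can be seen with a toy model of the pushes. The sketch below replays the push order of system_call and SAVE_ALL listed earlier on a small array that grows downward, then reads the three words just above the return address, which is where a stack-argument callee looks for its first three parameters. All demo_ names are hypothetical, unused registers are filled with zero, and the return address is a placeholder value.

```c
#include <stdint.h>

enum { DEMO_STACK_WORDS = 16 };

/* Toy 32-bit stack that grows toward lower indices, like the x86 stack. */
struct demo_stack {
    uint32_t words[DEMO_STACK_WORDS];
    int sp;                         /* index of the current top of stack */
};

static void demo_push(struct demo_stack *s, uint32_t value)
{
    s->words[--s->sp] = value;
}

/* What a stack-argument callee would read as its first three arguments. */
struct demo_args {
    uint32_t arg1, arg2, arg3;
};

struct demo_args demo_enter_syscall(uint32_t eax, uint32_t ebx,
                                    uint32_t ecx, uint32_t edx)
{
    struct demo_stack s = { .words = {0}, .sp = DEMO_STACK_WORDS };
    struct demo_args seen;

    demo_push(&s, eax);             /* pushl_cfi %eax (orig_eax) */
    demo_push(&s, 0);               /* SAVE_ALL: gs */
    demo_push(&s, 0);               /* fs */
    demo_push(&s, 0);               /* es */
    demo_push(&s, 0);               /* ds */
    demo_push(&s, eax);             /* eax */
    demo_push(&s, 0);               /* ebp */
    demo_push(&s, 0);               /* edi */
    demo_push(&s, 0);               /* esi */
    demo_push(&s, edx);             /* edx: third parameter */
    demo_push(&s, ecx);             /* ecx: second parameter */
    demo_push(&s, ebx);             /* ebx: first parameter, pushed last */
    demo_push(&s, 0xdeadbeefu);     /* return address pushed by call */

    /* The callee finds its arguments just above the return address. */
    seen.arg1 = s.words[s.sp + 1];
    seen.arg2 = s.words[s.sp + 2];
    seen.arg3 = s.words[s.sp + 3];
    return seen;
}
```

Because ebx, ecx and edx are pushed last and in reverse order, they end up laid out exactly like three arguments pushed by an ordinary C caller.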
    1  ..............
    2      CFI_REMEMBER_STATE
    3      je ldt_ss # returning to user-space with LDT SS
    4  restore_nocheck:
    5      RESTORE_REGS 4 # skip orig_eax/error_code
    6  ...............
RESTORE_REGS on line 5 is defined as follows:
     1  .macro RESTORE_REGS pop=0
     2      RESTORE_INT_REGS
     3  1:  popl_cfi %ds
     4      /*CFI_RESTORE ds;*/
     5  2:  popl_cfi %es
     6      /*CFI_RESTORE es;*/
     7  3:  popl_cfi %fs
     8      /*CFI_RESTORE fs;*/
     9      POP_GS \pop
    10  .................
RESTORE_INT_REGS on line 2 is defined as follows:
    .macro RESTORE_INT_REGS
        popl_cfi %ebx
        CFI_RESTORE ebx
        popl_cfi %ecx
        CFI_RESTORE ecx
        popl_cfi %edx
        CFI_RESTORE edx
        popl_cfi %esi
        CFI_RESTORE esi
        popl_cfi %edi
        CFI_RESTORE edi
        popl_cfi %ebp
        CFI_RESTORE ebp
        popl_cfi %eax
        CFI_RESTORE eax
    .endm
That covers most of the system call path. Tracing read() any further leads into the file system code, which will be discussed later.