The operating system provides processes running in user mode with a set of interfaces for interacting with hardware devices (such as the CPU, disks, printers, and so on).
Unix systems implement most of the interfaces between user mode processes and hardware devices by means of system calls issued to the kernel.
POSIX APIs and system calls
Let us first stress the difference between an application programming interface (API) and a system call. The former is a function definition that specifies how to obtain a given service, while the latter is an explicit request to the kernel made via a software interrupt.
Unix systems provide programmers with library functions for many APIs. Some of the APIs defined by the libc standard C library refer to wrapper routines, whose only purpose is to issue a system call. Usually, each system call has a corresponding wrapper routine, which defines the API that applications should use.
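As a rough illustration of the API versus system call distinction, the sketch below obtains the process ID twice: once through the getpid() wrapper routine and once by naming the system call number through libc's generic syscall() helper. The function names pid_via_wrapper and pid_via_syscall are made up for this example, and the code assumes a Linux system with glibc or a compatible C library.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* getpid(), syscall() */

/* Hypothetical helper: ask for the process ID through the libc
 * wrapper routine, that is, through the API. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Hypothetical helper: ask for the same service by naming the system
 * call number. syscall() is itself a generic libc routine that issues
 * the request to the kernel. */
pid_t pid_via_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both helpers should report the same process ID, since the wrapper ultimately asks the kernel for the same service.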
System call handlers and service routines
When a user mode process invokes a system call, the CPU switches to kernel mode and starts executing a kernel function.
Because the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the required system call; the eax register is used for this purpose. Additional parameters are usually passed when invoking a system call.
All system calls return an integer value. The conventions for these return values differ from those used by wrapper routines.
In the kernel, positive or zero values denote successful termination of the system call, while negative values denote an error condition.
In the latter case, the value is the negated error code that must be returned to the application in the errno variable. The kernel never sets or uses errno; the wrapper routine sets it after returning from the system call.
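The return-value convention can be sketched in user space as follows. This is only a model under the rule stated above (any negative value is a negated error code); wrapper_result is a hypothetical name, not a libc or kernel function.

```c
#include <errno.h>

/* Hypothetical helper: turn a raw kernel return value into the value
 * a wrapper routine would hand back to the application. */
long wrapper_result(long raw)
{
    if (raw < 0) {
        errno = (int)-raw;   /* the wrapper, not the kernel, sets errno */
        return -1;           /* the API reports failure as -1 */
    }
    return raw;              /* zero or positive: success, passed through */
}
```

For instance, a raw value of -ENOENT would come back to the application as -1 with errno set to ENOENT, while a raw value of 3 would be returned unchanged.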
The system call handler, which has a structure similar to that of the other exception handlers, performs the following operations:
Saves the contents of most registers on the kernel mode stack.
Handles the system call by invoking a corresponding C function called the system call service routine.
Exits from the handler: the registers are loaded with the values saved on the kernel mode stack, and the CPU switches back from kernel mode to user mode.
The name of the service routine associated with the xyz() system call is usually sys_xyz(), although there are a few exceptions.
To associate each system call number with its service routine, the kernel uses a system call dispatch table. This table is stored in the sys_call_table array and has NR_syscalls entries (289 in the Linux 2.6.11 kernel): the nth entry contains the address of the service routine of the system call with number n.
The NR_syscalls macro is just a static limit on the maximum number of system calls that can be implemented; it does not indicate the number of system calls actually implemented.
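The dispatch table idea can be modeled in user space as an array of function pointers indexed by the call number. Everything in this sketch (MODEL_NR_SYSCALLS, the model_ routines, the single long argument) is hypothetical; it only shows the indexing and the fact that a slot below the limit may still be unimplemented.

```c
#include <errno.h>

#define MODEL_NR_SYSCALLS 4   /* hypothetical static limit on table size */

typedef long (*service_routine_t)(long arg);

/* Hypothetical service routines. */
static long model_sys_double(long arg)    { return 2 * arg; }
static long model_sys_increment(long arg) { return arg + 1; }

/* Placeholder for slots that exist in the table but are not implemented. */
static long model_ni_syscall(long arg)    { (void)arg; return -ENOSYS; }

/* Entry n holds the address of the routine for call number n. */
static const service_routine_t model_call_table[MODEL_NR_SYSCALLS] = {
    model_ni_syscall,      /* 0: unimplemented */
    model_sys_double,      /* 1 */
    model_sys_increment,   /* 2 */
    model_ni_syscall,      /* 3: unimplemented */
};

/* Look up and invoke the routine for call number nr. */
long model_dispatch(unsigned long nr, long arg)
{
    if (nr >= MODEL_NR_SYSCALLS)   /* number beyond the static limit */
        return -ENOSYS;
    return model_call_table[nr](arg);
}
```

Here the table has four slots but only two real routines, which mirrors the point that the table size is a limit rather than a count of implemented calls.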
Entering and exiting a system call
Native applications can invoke a system call in two different ways:
1. By executing the int $0x80 assembly language instruction. In older versions of Linux, this was the only way to switch from user mode to kernel mode.
2. By executing the sysenter assembly language instruction.
Similarly, the kernel can exit from a system call, thus switching the CPU back to user mode, in two ways:
1. By executing the iret assembly language instruction.
2. By executing the sysexit assembly language instruction.
Issuing a system call via the int $0x80 instruction
The traditional way to invoke a system call uses the int assembly language instruction.
Vector 128 (0x80 in hexadecimal) is associated with the kernel entry point. The trap_init() function, invoked during kernel initialization, sets up the Interrupt Descriptor Table entry for vector 128 as follows:
set_system_gate(SYSCALL_VECTOR, &system_call);
The call loads the following values into the fields of the gate descriptor:
Segment Selector: The __KERNEL_CS Segment Selector of the kernel code segment.
Offset: The pointer to the system_call() system call handler.
Type: Set to 15. Indicates that the exception is a trap and that the corresponding handler does not disable maskable interrupts.
DPL: Set to 3. This allows user mode processes to invoke the exception handler.
Therefore, when a user mode process issues an int $0x80 instruction, the CPU switches to kernel mode and starts executing instructions from the system_call address.
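The descriptor fields listed above can be pictured as bit fields of a 64-bit value. The sketch below packs them using the 32-bit x86 gate descriptor layout; pack_gate and the gate_ accessors are hypothetical helpers written for this note, not kernel functions, and any selector or handler address passed to them is just a sample value.

```c
#include <stdint.h>

/* Hypothetical helper: pack a 32-bit x86 gate descriptor. */
uint64_t pack_gate(uint32_t offset, uint16_t selector,
                   unsigned type, unsigned dpl)
{
    uint64_t d = 0;
    d |= (uint64_t)(offset & 0xFFFFu);      /* bits 0-15: offset, low half */
    d |= (uint64_t)selector << 16;          /* bits 16-31: segment selector */
    d |= (uint64_t)(type & 0xFu) << 40;     /* bits 40-43: gate type */
    d |= (uint64_t)(dpl & 0x3u) << 45;      /* bits 45-46: DPL */
    d |= (uint64_t)1 << 47;                 /* bit 47: present flag */
    d |= (uint64_t)(offset >> 16) << 48;    /* bits 48-63: offset, high half */
    return d;
}

/* Accessors that read the fields back out of a packed descriptor. */
unsigned gate_type(uint64_t d)     { return (unsigned)((d >> 40) & 0xFu); }
unsigned gate_dpl(uint64_t d)      { return (unsigned)((d >> 45) & 0x3u); }
uint16_t gate_selector(uint64_t d) { return (uint16_t)(d >> 16); }
uint32_t gate_offset(uint64_t d)
{
    return (uint32_t)((d & 0xFFFFu) | ((d >> 48) << 16));
}
```

With type 15 and DPL 3, the byte holding the type, DPL, and present flag works out to 0xEF, which is the pattern of a trap gate reachable from user mode.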
The system_call() function starts by saving the system call number and all the CPU registers that may be used by the exception handler on the stack.
ENTRY(system_call)
    pushl %eax                      # save orig_eax
    SAVE_ALL
    GET_THREAD_INFO(%ebp)
                                    # system call tracing in operation
    testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
    jnz syscall_trace_entry
    cmpl $(nr_syscalls), %eax
    jae syscall_badsys
syscall_call:
    call *sys_call_table(,%eax,4)   # call the service routine for the system call number in eax
    movl %eax,EAX(%esp)             # store the return value
syscall_exit:
    cli                             # make sure we don't miss an interrupt
                                    # setting need_resched or sigpending
                                    # between sampling and the iret
    movl TI_flags(%ebp), %ecx
    testw $_TIF_ALLWORK_MASK, %cx   # current->work
    jne syscall_exit_work
Exiting from the system call
When the system call service routine terminates, the system_call() function gets its return code from eax and stores it in the stack location where the user mode value of the eax register is saved.
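The way the return value replaces the saved eax can be modeled with a small structure standing in for the registers saved on the kernel stack. This is a heavily reduced sketch: struct saved_regs, model_system_call, and model_add_one are hypothetical, and the real saved frame holds many more registers.

```c
/* Hypothetical, much reduced stand-in for the saved register frame. */
struct saved_regs {
    long ebx;        /* first system call parameter */
    long eax;        /* call number on entry, return value on exit */
    long orig_eax;   /* copy of the call number, kept for later use */
};

/* Hypothetical service routine used to exercise the model. */
long model_add_one(long x)
{
    return x + 1;
}

/* Model of the handler's bookkeeping around one service routine call. */
void model_system_call(struct saved_regs *regs, long (*service)(long))
{
    regs->orig_eax = regs->eax;          /* keep the call number */
    regs->eax = service(regs->ebx);      /* result overwrites the saved eax */
}
```

When the saved frame is later restored, the process finds the result in eax, which is where the wrapper routine expects it.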
Issuing a system call via the sysenter instruction
The int assembly language instruction is inherently slow because it performs several consistency and security checks.
The sysenter instruction
The sysenter assembly language instruction uses three special registers that must be loaded with the following information:
SYSENTER_CS_MSR: The Segment Selector of the kernel code segment
SYSENTER_EIP_MSR: The linear address of the kernel entry point
SYSENTER_ESP_MSR: The kernel stack pointer
The vsyscall page
A wrapper function in the libc standard library can use the sysenter instruction only if both the CPU and the Linux kernel support it.
Entering the system call
When a system call is issued via the sysenter instruction, the following steps occur in order:
1. The wrapper routine in the standard library loads the system call number into the eax register and calls the __kernel_vsyscall() function.
2. The __kernel_vsyscall() function saves the contents of ebp, edx, and ecx on the user mode stack, copies the user stack pointer into ebp, and then executes the sysenter instruction.
3. The CPU switches from user mode to kernel mode, and the kernel starts executing the sysenter_entry() function.
4. The sysenter_entry() assembly language function runs.
Exiting from the system call
When the system call service routine terminates, the sysenter_entry() function performs essentially the same operations as the system_call() function.
First, it gets the return code of the system call service routine from eax and stores it in the kernel stack location where the user mode value of the eax register is saved.
The function then disables local interrupts and checks the flags in the thread_info structure of the current process.
If any of the flags is set, some work must be done before returning to user mode.
The sysexit instruction
sysexit is the assembly language instruction paired with sysenter: it allows a fast switch from kernel mode to user mode.
The SYSENTER_RETURN code
The code at the SYSENTER_RETURN label is stored in the vsyscall page, and it is executed when a system call entered via sysenter is terminated by an iret or sysexit instruction.
Parameter passing
Like ordinary functions, system calls often require input/output parameters, which may be actual values, addresses of variables in the address space of the user mode process, or even addresses of data structures that include pointers to user mode functions.
Before a system call is issued, its parameters are written into CPU registers. The kernel then copies them from the registers onto the kernel mode stack before invoking the system call service routine, because service routines are ordinary C functions.
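From the application side, passing a call number together with additional parameters can be sketched with libc's generic syscall() helper. raw_write is a hypothetical name; on i386 the helper would place the number in eax and the parameters in further registers, but the code below leaves that detail to the C library and assumes a Linux system.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: issue the write system call by number, passing
 * its three parameters (file descriptor, buffer address, length). */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Note that the second parameter is an address in the process address space, which is exactly the kind of parameter the kernel has to validate before using it.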
Verifying the parameters
All system call parameters must be carefully checked before the kernel attempts to satisfy a user request. The type of check depends both on the system call and on the specific parameter.
One type of check is common to all system calls. Whenever a parameter specifies an address, the kernel must check whether it is inside the process address space. There are two possible ways to perform this check:
Verify that the linear address belongs to the process address space and, if so, that the memory region containing it has the proper access rights. (Used by early kernels.)
Verify only that the linear address is lower than PAGE_OFFSET. (Used since the Linux 2.2 kernel.)
This coarse check is crucial to ensure that neither the process address space nor the kernel address space is accessed illegally.
Addresses passed to system calls are checked by the access_ok() macro, which acts on two parameters, addr and size; the type parameter is not used in the definition below.
#define access_ok(type,addr,size) (likely(__range_ok(addr,size) == 0))
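The coarse range check can be sketched in user space as a comparison against a fixed limit. This is a model, not the kernel macro: model_access_ok is a hypothetical name, addresses are taken to be 32 bits wide, and MODEL_PAGE_OFFSET assumes the common 3 GB user / 1 GB kernel split.

```c
#include <stdint.h>

#define MODEL_PAGE_OFFSET 0xC0000000u   /* assumption: 3 GB / 1 GB split */

/* Returns 1 if the range [addr, addr + size) lies entirely below
 * MODEL_PAGE_OFFSET, and 0 otherwise. */
int model_access_ok(uint32_t addr, uint32_t size)
{
    /* Add in 64 bits so that addr + size cannot wrap around to a
     * small value and slip past the comparison. */
    uint64_t end = (uint64_t)addr + size;
    return end <= MODEL_PAGE_OFFSET;
}
```

The widened addition matters: a range starting near the top of the 32-bit space with a large size would otherwise wrap around and look valid.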
Accessing the process address space
System call service routines often need to read or write data in the process address space. Linux includes a set of macros that make this access easier. We describe two of them, get_user() and put_user(). The first reads 1, 2, or 4 consecutive bytes from an address, while the second writes data of those sizes to an address.
#define get_user(x,ptr)                                         \
({  int __ret_gu;                                               \
    unsigned long __val_gu;                                     \
    __chk_user_ptr(ptr);                                        \
    switch(sizeof (*(ptr))) {                                   \
    case 1:  __get_user_x(1,__ret_gu,__val_gu,ptr); break;      \
    case 2:  __get_user_x(2,__ret_gu,__val_gu,ptr); break;      \
    case 4:  __get_user_x(4,__ret_gu,__val_gu,ptr); break;      \
    default: __get_user_x(X,__ret_gu,__val_gu,ptr); break;      \
    }                                                           \
    (x) = (__typeof__(*(ptr)))__val_gu;                         \
    __ret_gu;                                                   \
})
#define put_user(x,ptr)                                         \
    __put_user_check((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr)))
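The size-based dispatch in get_user() can be imitated in user space with an ordinary function. model_get_user is hypothetical and differs from the macro in two ways that are assumptions of this sketch: a NULL source stands in for a failed address check, and an unsupported size is rejected at run time with -EFAULT rather than at build time.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: copy 1, 2, or 4 bytes from src into *dst,
 * zero-extended. Returns 0 on success and -EFAULT otherwise. */
int model_get_user(uint32_t *dst, const void *src, size_t size)
{
    uint8_t  b;
    uint16_t h;
    uint32_t w;

    if (src == NULL)             /* stand-in for a failed address check */
        return -EFAULT;

    switch (size) {
    case 1: memcpy(&b, src, 1); *dst = b; return 0;
    case 2: memcpy(&h, src, 2); *dst = h; return 0;
    case 4: memcpy(&w, src, 4); *dst = w; return 0;
    default: return -EFAULT;     /* size not handled by this model */
    }
}
```

A caller would pass sizeof of the pointed-to object as the size, which is what the macro derives automatically from the pointer type.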
Dynamic address checking: the fixup code
As seen previously, the access_ok() macro makes a coarse check on the validity of the linear addresses passed as system call parameters. This check only ensures that the user mode process is not attempting to tamper with the kernel address space.
However, a linear address passed as a parameter may still not belong to the process address space. In this case, a Page Fault exception occurs when the kernel tries to use such an address.
There are four cases in which a Page Fault exception may occur in kernel mode:
1. The kernel attempts to address a page belonging to the process address space, but either the corresponding page frame does not exist or the kernel tries to write a read-only page. In these cases, the handler must allocate and initialize a new page frame.
2. The kernel addresses a page belonging to its own address space, but the corresponding Page Table entry has not yet been initialized. In this case, the kernel must properly set up some entries in the Page Tables of the current process.
3. A kernel function includes a programming bug that causes the exception when the function runs, or the exception is caused by a transient hardware error. When this occurs, the handler must perform a kernel oops.
4. The case discussed in this chapter: a system call service routine attempts to read or write a memory area whose address has been passed as a system call parameter but does not belong to the process address space.
The exception tables
The key to determining the source of a Page Fault is that the kernel accesses the process address space only through a limited set of functions and macros.
It therefore takes little effort to put the address of each kernel instruction that accesses the process address space into a structure called the exception table.
When a Page Fault occurs in kernel mode, the do_page_fault() handler examines the exception table: if it includes the address of the instruction that triggered the exception, the error is caused by a bad system call parameter; otherwise, it is caused by a more serious bug.
Linux defines several exception tables. The main exception table is automatically generated by the C compiler when the kernel program image is built. It is stored in the __ex_table section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ex_table and __stop___ex_table.
Each entry of an exception table is an exception_table_entry structure that has two fields:
insn: The linear address of an instruction that accesses the process address space.
fixup: The address of the assembly language code to be invoked when a Page Fault exception triggered by the instruction located at insn occurs.
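The lookup performed on such a table can be sketched as a search keyed by the faulting instruction address. The structure and function names below are hypothetical, and the sketch assumes the entries are sorted by insn so that a binary search applies; a return value of 0 means no entry was found, that is, the fault is not a known user-access fault.

```c
#include <stddef.h>

/* Hypothetical mirror of an exception table entry. */
struct model_extable_entry {
    unsigned long insn;    /* address of the instruction that may fault */
    unsigned long fixup;   /* address of the code to run if it does */
};

/* Binary search over a table sorted by insn. Returns the fixup
 * address for fault_addr, or 0 if the table has no matching entry. */
unsigned long model_search_extable(const struct model_extable_entry *table,
                                   size_t n, unsigned long fault_addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == fault_addr)
            return table[mid].fixup;
        if (table[mid].insn < fault_addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

A fault handler modeled on this would jump to the returned fixup address when one is found and treat the fault as a kernel bug otherwise.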
Generating the exception tables and the fixup code
The GNU Assembler .section directive allows programmers to specify which section of the executable file contains the code that follows.
Kernel wrapper routines
Although system calls are used mainly by user mode processes, they can also be invoked by kernel threads, which cannot use library functions. To simplify the declarations of the corresponding wrapper routines, Linux defines a set of seven macros called _syscall0 through _syscall6.
In the name of each macro, the numbers 0 through 6 correspond to the number of parameters used by the system call (excluding the system call number).
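The idea of generating a wrapper from a macro can be imitated in user space. The MODEL_SYSCALL0 and MODEL_SYSCALL1 macros below are hypothetical and only in the spirit of the kernel macros: they rely on libc's syscall() helper instead of inline assembly, and the code assumes a Linux system.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_close */
#include <unistd.h>        /* syscall() */

/* Hypothetical macro: define model_<name>() for a system call that
 * takes no parameters besides its number. */
#define MODEL_SYSCALL0(type, name)                  \
    type model_##name(void)                         \
    {                                               \
        return (type)syscall(SYS_##name);           \
    }

/* Hypothetical macro: same, for a system call with one parameter. */
#define MODEL_SYSCALL1(type, name, type1, arg1)     \
    type model_##name(type1 arg1)                   \
    {                                               \
        return (type)syscall(SYS_##name, arg1);     \
    }

MODEL_SYSCALL0(long, getpid)          /* defines long model_getpid(void) */
MODEL_SYSCALL1(long, close, int, fd)  /* defines long model_close(int fd) */
```

Each macro invocation expands to a complete function definition, so declaring a new wrapper takes one line that names the return type, the system call, and its parameters.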
Understanding the Linux Kernel, day 09: system calls