Linux system calls

Last Update:2018-02-13 Source: Internet

Author: User


I. Preface

When a user-space program executes the SWI instruction to request a kernel service, it actually completes a "crossing" from user mode into kernel mode. It is a bit like a weekend at home watching a movie: something urgent comes up, you press the pause button, and the world inside the movie freezes. The world of programs is the same. After a SWI, user-space execution pauses: the data on the user stack, the text segment, the static data area, the heap ... everything stops. Execution is abruptly transferred to another world: the stack becomes the kernel stack, the text being executed becomes the binary code of vector_swi, and the associated data areas change as well.

How does all this happen? There is only one CPU, so what does the hardware do here, and what does the software do? Crossing into another world is fun, but how do you find the way back? This article tries to answer all of these questions clearly.

The code in this article comes from the 4.4.6 kernel, and the ARM processor is used as the example.

II. Building the user context on the kernel stack

The code is as follows (the Cortex-M code and the Thumb instruction set code are omitted):

ENTRY(vector_swi)
    sub     sp, sp, #S_FRAME_SIZE
    stmia   sp, {r0 - r12}              @ Calling r0 - r12
 ARM(   add     r8, sp, #S_PC       )
 ARM(   stmdb   r8, {sp, lr}^       )   @ Calling sp, lr
    mrs     r8, spsr                    @ called from non-FIQ mode, so ok.
    str     lr, [sp, #S_PC]             @ Save calling PC
    str     r8, [sp, #S_PSR]            @ Save CPSR
    str     r0, [sp, #S_OLD_R0]         @ Save OLD_R0

By the time vector_swi starts executing, the hardware has already done quite a lot:

(1) It saves the CPSR into SPSR_svc and saves the return address (the instruction following the user-space SWI instruction) into LR_svc.

(2) It sets the CPSR: CPSR.M = '10011' (SVC mode), CPSR.I = '1' (IRQs disabled), CPSR.IT = '00000000' (If-Then state cleared), CPSR.J = '0', CPSR.T = SCTLR.TE (the J and T bits select the instruction set state and are not relevant to this article), and CPSR.E = SCTLR.EE (the byte order, also not relevant to this article).

(3) It sets the PC to the address of the SWI exception vector.
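The three hardware steps can be modelled in a few lines of Python. This is only a sketch for a SWI taken from ARM (not Thumb) state; the high-vector base 0xffff0000 and SCTLR.TE = SCTLR.EE = 0 are assumptions made for the example, and the function name take_swi is hypothetical.

```python
# Simplified model of ARM exception entry for a SWI taken from ARM state.
# Assumptions: high vectors at 0xffff0000, SCTLR.TE = 0 and SCTLR.EE = 0.

MODE_MASK = 0x1F
MODE_SVC = 0x13                        # CPSR.M = 0b10011
T_BIT = 1 << 5                         # Thumb state
I_BIT = 1 << 7                         # IRQ disable
E_BIT = 1 << 9                         # data endianness
J_BIT = 1 << 24                        # Jazelle state
IT_MASK = (0x3F << 10) | (0x3 << 25)   # IT[7:2] and IT[1:0]
VECTOR_BASE = 0xFFFF0000               # assumed: high vectors
SWI_VECTOR_OFFSET = 0x08


def take_swi(cpsr, swi_address):
    """Return the register values the hardware sets up when a SWI is taken."""
    spsr_svc = cpsr                              # (1) old CPSR preserved
    lr_svc = (swi_address + 4) & 0xFFFFFFFF      # (1) next instruction
    new_cpsr = cpsr & ~(MODE_MASK | T_BIT | E_BIT | J_BIT | IT_MASK)
    new_cpsr |= MODE_SVC | I_BIT                 # (2) SVC mode, IRQs masked
    pc = VECTOR_BASE + SWI_VECTOR_OFFSET         # (3) SWI exception vector
    return {"cpsr": new_cpsr, "spsr_svc": spsr_svc,
            "lr_svc": lr_svc, "pc": pc}
```

For example, take_swi(0x60000010, 0x8000) models a user-mode program whose SWI instruction sits at address 0x8000: the old CPSR ends up in spsr_svc, and lr_svc points at 0x8004.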

Everything after that is software. Since the code operates on the stack, the first thing to establish is where the current stack is. sp_svc was set up as early as the process switch: it is the kernel stack of the current process.

When task A switches to task B, an important step is switching the HW context. Both tasks run on the same CPU, so the CPU's registers and state are saved into a block of memory (task A's hardware context), and the CPU is loaded with the values from task B's hardware context, which include sp_svc. The process switch completes in kernel mode and execution eventually returns to task B's user space, but by then the kernel stack of task B (sp_svc) has already been determined.

When a system call enters the kernel, the kernel stack is therefore ready, and at this moment it is empty. After the code above runs, the user-space context has been built on the kernel stack: r0-r12, the user-mode sp and lr, the return PC, the saved CPSR, and old_r0.

We will not go through the code line by line; it is simple and you can read it yourself. By the way, does this saved context look familiar? Look at the ARM interrupt handling article: the context saved on an interrupt is the same as the one saved on a system call. Note also that the user context saved on the kernel stack is not the full HW context. The HW context is in-memory data that preserves the complete state of the CPU at some point: not just the core registers, but also the state of other HW blocks inside the CPU, such as the FPU registers and status. So the question is: is it enough to save only the core registers when a system call enters the kernel? Whether it is enough depends on the conventions of the system call interface. For Linux, the agreement is that kernel code is not allowed to execute floating-point instructions (the FPU is the example here; other blocks are similar). If it really must, code has to be added before and after the kernel's use of the FPU to save and restore the FPU context, so that the FPU context is unchanged when returning to user space.

The last interesting question: why is r0 pushed twice, once as r0 and once as old_r0? During a system call r0 plays two roles: it passes a parameter and it carries the return value. On entry, both the r0 and old_r0 slots hold the system call's parameter; once the system call completes, the r0 slot holds the value returned to user space. You might think a single r0 slot would be enough; why it is not is described later.
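The saved frame can be sketched with ctypes. This is a model, not the kernel's definition: the field names are chosen for the example, and the layout (eighteen 32-bit words in the order r0-r12, sp, lr, pc, cpsr, old_r0) is an assumption based on the offsets S_PC, S_PSR, S_OLD_R0 and S_FRAME_SIZE used in the entry code. save_user_context is a hypothetical helper that mirrors the store instructions one by one.

```python
import ctypes

REG_NAMES = [f"r{i}" for i in range(13)] + ["sp", "lr", "pc", "cpsr", "old_r0"]


class PtRegs(ctypes.Structure):
    """Model of the 32-bit ARM register frame saved on the kernel stack."""
    _fields_ = [(name, ctypes.c_uint32) for name in REG_NAMES]


# Offsets corresponding to the assembler constants used by vector_swi.
S_R0 = PtRegs.r0.offset
S_PC = PtRegs.pc.offset
S_PSR = PtRegs.cpsr.offset
S_OLD_R0 = PtRegs.old_r0.offset
S_FRAME_SIZE = ctypes.sizeof(PtRegs)


def save_user_context(r, user_sp, user_lr, lr_svc, spsr_svc):
    """Fill a frame the way the entry code does (r is the list r0..r12)."""
    frame = PtRegs()
    for i, value in enumerate(r):            # stmia sp, {r0 - r12}
        setattr(frame, f"r{i}", value)
    frame.sp, frame.lr = user_sp, user_lr    # stmdb r8, {sp, lr}^
    frame.pc = lr_svc                        # str lr, [sp, #S_PC]
    frame.cpsr = spsr_svc                    # str r8, [sp, #S_PSR]
    frame.old_r0 = r[0]                      # str r0, [sp, #S_OLD_R0]
    return frame
```

Under this layout the r0 and old_r0 slots start out with the same value; only the r0 slot is later overwritten with the return value.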

III. A few simple initialization operations

The code is as follows:

    zero_fp
    alignment_trap r10, ip, __cr_alignment
    enable_irq
    ct_user_exit
    get_thread_info tsk

zero_fp clears the frame pointer. When a debugger unwinds the stack, fp equal to 0 marks the outermost function. For the kernel, call tracing ends here; we cannot unwind back into user-space function calls. As mentioned in the previous section, the hardware disables IRQs on entry, and enable_irq turns interrupt handling on this CPU back on. ct_user_exit belongs to the context tracking subsystem and is not covered in depth here.

Alignment deserves a few more words. ARM64 hardware supports unaligned accesses, but only to Normal memory, and not for instructions with memory ordering requirements (exclusive load/store and load-acquire/store-release instructions do not support them). Accesses generated by instruction fetch, and accesses to Device memory, must be aligned. When an instruction performs an unaligned access there are two options, controlled by SCTLR_ELx.A: raise a fault, or perform the unaligned access in hardware. An unaligned memory access is split into two transactions on the bus. All ARMv8 processors support unaligned access in hardware, so ARM64 should not need software to implement it.

Processors that implement ARMv8 do not need to worry about alignment, but some 32-bit ARM hardware does not support unaligned access. In that case the kernel can be configured (CONFIG_ALIGNMENT_TRAP) to emulate unaligned access in software. This is a workaround for missing hardware support and it costs a great deal of performance, so do not enable it unless there is no alternative. The code is very simple and is not explained here.

IV. How is the system call number obtained?

There are two system call conventions. One is the old OABI, where the system call number is encoded in the SWI instruction; the other is the ARM EABI, where the system call number is passed in r7. To stay compatible with the old OABI, CONFIG_OABI_COMPAT must be defined. This adds a little overhead to each system call and makes the kernel larger; the benefit is that user programs built for the old OABI still run on this kernel. Of course, if user space is known to follow only the EABI, you can leave CONFIG_OABI_COMPAT undefined.

The relevant code is as follows:

#if defined(CONFIG_OABI_COMPAT)

 USER(  ldr     r10, [lr, #-4]      )   @ get SWI instruction
 ARM_BE8(rev    r10, r10)               @ little endian instruction

#elif defined(CONFIG_AEABI)

#else
    /* Legacy ABI only. */
 USER(  ldr     scno, [lr, #-4]     )   @ get SWI instruction
#endif

Under the EABI the system call number is taken directly from r7 and no special code is needed, which is why the CONFIG_AEABI case is empty. Under the old convention the number has to be read from the SWI instruction itself, so lr (actually lr_svc, which holds the address of the instruction after the SWI) is used to locate the SWI instruction at [lr, #-4].

    uaccess_disable tbl

    adr     tbl, sys_call_table         @ load syscall table pointer

#if defined(CONFIG_OABI_COMPAT)
    bics    r10, r10, #0xff000000
    eorne   scno, r10, #__NR_OABI_SYSCALL_BASE
    ldrne   tbl, =sys_oabi_call_table
#elif !defined(CONFIG_AEABI)
    bic     scno, scno, #0xff000000     @ mask off SWI op-code
    eor     scno, scno, #__NR_SYSCALL_BASE  @ check OS number
#endif

The system call number is obtained from the low 24 bits of the SWI instruction. Under the EABI the number is passed in r7 and user space always issues "swi 0", so if the low 24 bits of the SWI instruction are 0, the caller follows the EABI.

After the code above runs, r7 (scno) holds the system call number and r8 (tbl) holds the syscall table pointer. With the values of r7 and r8 we know where to go next.
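The CONFIG_OABI_COMPAT decision can be sketched as follows. This is a model of the bics/eorne/ldrne sequence rather than kernel code: decode_syscall is a hypothetical name, the tables are represented by their symbol names, and 0x900000 is taken to be the value of __NR_OABI_SYSCALL_BASE on ARM.

```python
NR_OABI_SYSCALL_BASE = 0x900000    # assumed value of __NR_OABI_SYSCALL_BASE


def decode_syscall(swi_insn, r7):
    """Return (scno, table name) for a 32-bit ARM SWI instruction word."""
    imm24 = swi_insn & 0x00FFFFFF          # bics r10, r10, #0xff000000
    if imm24 == 0:                         # "swi 0": EABI, number is in r7
        return r7, "sys_call_table"
    # OABI: the number is encoded in the instruction, offset by the base.
    scno = imm24 ^ NR_OABI_SYSCALL_BASE    # eorne scno, r10, #__NR_OABI_SYSCALL_BASE
    return scno, "sys_oabi_call_table"     # ldrne tbl, =sys_oabi_call_table
```

With these assumptions, an EABI caller that loads 4 into r7 and executes "swi 0" (instruction word 0xef000000) and an OABI caller that executes "swi 0x900004" (instruction word 0xef900004) both end up with scno equal to 4, but with different tables.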

V. Parameter passing

The code that issues the SWI instruction lives in glibc. Conceptually it looks like this:

......

return value = SWI (parameter 1, parameter 2, ...);

......

Seen this way, a system call resembles an ordinary C function call: both have parameters and a return value. However, because the mode has changed, the parameters cannot be passed on the stack (the SWI causes a stack switch); they can only be passed in registers.

On ARM, the standard procedure call convention passes the first four parameters in r0-r3 and pushes the remaining ones onto the stack. As described in the previous two sections, the system call number and the system call table have been found, so the kernel's system call function can now be called. A kernel system call function has the following form:

......

return value = sys_xxx (parameter 1, parameter 2, ...);

......

So some code is still needed to make the transition to sys_xxx:

local_restart:
    ldr     r10, [tsk, #TI_FLAGS]       @ check for syscall tracing
    stmdb   sp!, {r4, r5}               @ push fifth and sixth args

    tst     r10, #_TIF_SYSCALL_WORK     @ are we tracing syscalls?
    bne     __sys_trace

    cmp     scno, #NR_syscalls          @ check upper syscall limit
    badr    lr, ret_fast_syscall        @ return address
    ldrcc   pc, [tbl, scno, lsl #2]     @ call sys_* routine

This code has to emulate a C function call, so the fifth and sixth system call parameters must be pushed onto the stack (some system calls take more than four parameters; on the SWI interface they are passed in r0-r5). If the system call number is within range, ldrcc pc, [tbl, scno, lsl #2] hands control directly to the corresponding sys_xxx function. Note how the return address is set: an instruction such as bl cannot be used here, so lr has to be set by hand (badr lr, ret_fast_syscall).
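The dispatch step can be modelled like this. Everything in the sketch is hypothetical except the structure it illustrates: a three-entry table stands in for sys_call_table, the handler names are invented, and returning -ENOSYS for an out-of-range number is a simplification, since the listing above does not show what the kernel does in that case.

```python
import errno

NR_SYSCALLS = 3    # hypothetical table size for this sketch


def sys_zero(a, b, c, d, e, f):      # hypothetical handlers
    return 0


def sys_sum6(a, b, c, d, e, f):
    return a + b + c + d + e + f


def sys_first(a, b, c, d, e, f):
    return a


SYS_CALL_TABLE = [sys_zero, sys_sum6, sys_first]


def entry_address(tbl, scno):
    """Address of the table slot read by ldrcc pc, [tbl, scno, lsl #2]."""
    return tbl + (scno << 2)         # each table entry is 4 bytes


def dispatch(scno, regs):
    """regs holds the user r0-r5; the result is what sys_xxx leaves in r0."""
    stack = [regs[4], regs[5]]       # stmdb sp!, {r4, r5}
    if scno < NR_SYSCALLS:           # cmp scno, #NR_syscalls ... ldrcc
        # Arguments 1-4 arrive in r0-r3, arguments 5-6 on the stack.
        return SYS_CALL_TABLE[scno](regs[0], regs[1], regs[2], regs[3], *stack)
    return -errno.ENOSYS             # simplified out-of-range handling
```

Calling dispatch(1, [1, 2, 3, 4, 5, 6]) reaches the six-argument handler with all six values, which is the effect the stmdb and ldrcc pair is meant to achieve.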

VI. Returning to user space

A number of things are handled before returning to user space, such as signal delivery and process scheduling. This is done by checking the flags field in struct thread_info:

    disable_irq_notrace                 @ disable interrupts
    ldr     r1, [tsk, #TI_FLAGS]        @ re-check for syscall tracing
    tst     r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
    bne     fast_work_pending

    restore_user_regs fast = 1, offset = S_OFF

fast_work_pending:
    str     r0, [sp, #S_R0+S_OFF]!      @ returned r0

......

The best-known flag here is _TIF_NEED_RESCHED; when it is set, a reschedule has been requested, and the return from a system call to user space is a necessary scheduling point. The other flags are not relevant to this scenario. In short, if anything else remains to be done, the code jumps to fast_work_pending; otherwise restore_user_regs restores the user-space context. One small detail: if additional processing is needed (for example, a pending signal), r0 would be clobbered and the return value of sys_xxx lost, so r0 is saved into the S_R0 slot of the user context (pt_regs). That is why pt_regs has two r0-related fields, S_R0 and S_OLD_R0.

The code that restores the user-space context (restore_user_regs) is as follows:

    mov     r2, sp
    ldr     r1, [r2, #\offset + S_PSR]  @ get calling cpsr
    ldr     lr, [r2, #\offset + S_PC]!  @ get pc
    msr     spsr_cxsf, r1               @ save in spsr_svc
    .if     \fast
    ldmdb   r2, {r1 - lr}^              @ get calling r1 - lr
    .else
    ldmdb   r2, {r0 - lr}^              @ get calling r0 - lr
    .endif
    mov     r0, r0                      @ ARMv5T and earlier require a nop
                                        @ after ldm {}^
    add     sp, sp, #\offset + S_FRAME_SIZE
    movs    pc, lr                      @ return & move spsr_svc into cpsr

The code is simple: it restores the user context from the values pushed onto the kernel stack on system call entry. One detail concerns the kernel stack: before movs pc, lr returns to user space, the add sp, sp, #\offset + S_FRAME_SIZE instruction pops the saved user context, leaving the kernel stack empty again. The other is how r0 is set when returning to user space, since it carries the return value of the system call. There are two cases:

(1) With no pending work (fast equals 1), r0 itself holds the return value of the sys_xxx function.

(2) With pending work (fast equals 0), the r0 slot in struct pt_regs (the saved context for returning to user space) holds the return value of the sys_xxx function.

restore_user_regs also takes a parameter called offset. On system call entry, parameters 5 and 6 were pushed onto the stack, which leaves sp 8 bytes below pt_regs; offset compensates for that.
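The stack pointer arithmetic can be checked with a short sketch. The constants are assumptions for the example: S_FRAME_SIZE is taken as 72 bytes (eighteen 4-byte words) and S_OFF as the 8 bytes occupied by r4 and r5. The slow path assumes that restore_user_regs is eventually invoked with offset = 0, which the listings above do not show; the function names are hypothetical.

```python
S_FRAME_SIZE = 72    # assumed: 18 saved words of 4 bytes each
S_OFF = 8            # two words pushed by "stmdb sp!, {r4, r5}"


def fast_path_sp(sp_top):
    """Follow sp_svc through entry, argument push and the fast return."""
    sp = sp_top - S_FRAME_SIZE       # sub sp, sp, #S_FRAME_SIZE
    sp -= S_OFF                      # stmdb sp!, {r4, r5}
    sp += S_OFF + S_FRAME_SIZE       # add sp, sp, #\offset + S_FRAME_SIZE (offset = S_OFF)
    return sp


def slow_path_sp(sp_top):
    """Same, but through fast_work_pending, where sp is written back first."""
    sp = sp_top - S_FRAME_SIZE       # sub sp, sp, #S_FRAME_SIZE
    sp -= S_OFF                      # stmdb sp!, {r4, r5}
    sp += 0 + S_OFF                  # str r0, [sp, #S_R0+S_OFF]! (S_R0 = 0, writeback)
    sp += 0 + S_FRAME_SIZE           # add sp, sp, #\offset + S_FRAME_SIZE (offset = 0)
    return sp
```

Both paths leave sp_svc where it was before the system call, so the kernel stack is empty again on the next entry.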
