Implementation of system calls under Linux
1. Introduction to the architecture and system calls of the Unix/Linux operating system
What is an operating system, and what is a system call?
The operating system is a virtual machine abstracted from the hardware, on which the user can run applications. It is responsible for interacting directly with the hardware, providing common services to user programs and isolating them from hardware details. Because programs should not depend on the underlying hardware, applications written this way can move easily between different Unix systems. System calls are the interface through which the Unix/Linux operating system provides services to user programs: the application requests a service from the operating system and hands control over to it, and when the service is complete the operating system returns control and the result to the user program.
Unix/Linux system architecture
A Unix/Linux system is divided into three layers: user, kernel, and hardware.
The system call is the boundary between user programs and the kernel. Through a system call, a process can switch from user mode to kernel mode, complete the requested service in kernel mode, and then return to user mode.
The system call interface looks much like an ordinary function call in a C program; the library typically maps these function calls onto the primitives needed to enter the operating system.
These primitives provide only a basic set of functionality; by referencing and encapsulating them, the library can offer a rich and powerful set of system call wrappers. This embodies the idea of separating mechanism from policy: the system call is only the basic mechanism for entering the kernel, while policy is embodied in the system call library.
Examples: execv, execl, execlv, opendir, readdir, ...
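To make the mechanism/policy split concrete, here is a minimal, hypothetical sketch (not from the original article): read() is the bare system call mechanism, and a library-level helper such as the read_all() below adds the policy of retrying short reads and interrupted calls.

#include <unistd.h>
#include <errno.h>

/* Hypothetical library-level helper layered over the read() primitive. */
ssize_t read_all(int fd, void *buf, size_t count)
{
        size_t done = 0;
        while (done < count) {
                ssize_t n = read(fd, (char *)buf + done, count - done);
                if (n == 0)                     /* end of file */
                        break;
                if (n < 0) {
                        if (errno == EINTR)     /* interrupted: retry */
                                continue;
                        return -1;              /* real error */
                }
                done += n;
        }
        return (ssize_t)done;
}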
Unix/Linux execution modes, address space, and context
Execution mode (run state):
A computer running a Unix/Linux system needs at least two modes of operation: a high-privilege kernel mode and a low-privilege user mode.
In fact, many computers have more than two execution modes. For example, the Intel 80x86 architecture has four privilege levels, with level 0 being the most privileged. Unix needs only two: the kernel runs at the high privilege level, called kernel mode; everything else, including shells, editors, the X Window system, and so on, runs at the low privilege level, called user mode. The main reason for having different execution modes is protection: because user processes run at the lower privilege level, they cannot accidentally or deliberately damage other processes or the kernel. The damage a misbehaving program can cause is localized and does not affect other activities or processes in the system. When a user process needs to perform some function in privileged mode, it must enter privileged mode strictly through the interface provided by system calls, and then execute only the limited functionality those calls provide.
Each execution mode has its own stack. In Linux these are the user stack and the kernel stack. The user stack holds the parameters, local variables, and other data of function calls made while executing in user mode. Some systems provide a dedicated interrupt stack for interrupt handling, but on x86 there is no separate interrupt stack; interrupts are handled on the kernel stack of the current process.
Address space:
The fundamental purpose of protection in privileged mode is to protect the address space: a user process must not be able to access the entire address space. Only through the tightly restricted interface of system calls can a process enter kernel mode and access data in the protected part of the address space, which is usually reserved for the operating system. In addition, processes must not be able to access each other's address spaces at will. Therefore a mechanism is needed that, on a single piece of physical memory, protects different address ranges within the same process as well as the address spaces of different processes.
In Unix/Linux this protection is implemented by the virtual memory management mechanism: in a virtual memory system, the addresses used by a process do not correspond directly to physical memory cells. Each process has its own virtual address space, and references to virtual addresses are translated into references to physical addresses by the address translation mechanism. Because all processes share the physical memory resource, there must be a way to protect this shared resource, and the virtual memory system does this well: the address space of each process is mapped through the address translation mechanism to different physical pages, which ensures that a process can only access the pages corresponding to its own address space and cannot access or modify the pages belonging to the address spaces of other processes.
The virtual address space is divided into two parts: user space and system space. In user mode only user space can be accessed; in kernel mode both system space and user space are accessible. System space occupies a fixed range in the virtual address space of every process, and because only one kernel instance runs in the system, all processes map to this single kernel address space. The kernel maintains global data structures and some per-process object information; the latter includes information that allows the kernel to access the address space of any process. Through the address translation mechanism, a process can access the address space of the current process directly (via the MMU), and the kernel can access the address spaces of other processes through special methods.
Although all processes share the kernel, system space is protected and cannot be accessed by a process in user mode. If a process needs to access the kernel, it must do so through the system call interface. When a process invokes a system call, it executes a special set of instructions (which are platform dependent; every system provides a dedicated trap instruction, and on x86 Linux uses the int instruction) to put the system into kernel mode and hand control to the kernel, which performs the operation on behalf of the process. When the system call completes, the kernel executes another set of instructions to return the system to user mode and hand control back to the process.
Context:
The context of a process can be divided into three parts: user-level context, register context, and system-level context.
User-level context: the text (code), data, user stack, and shared memory regions;
Register context: the program counter (EIP), i.e., the address of the next instruction the CPU will execute; the processor status register (EFLAGS); the stack pointer; and the general-purpose registers;
System-level context: the process table entry (proc structure) and the u-area, which in Linux are combined into the task_struct; the region table and page tables (mm_struct, vm_area_struct, pgd, pmd, pte, etc.); and the kernel stack.
All of this context information constitutes the running environment of a process. When a process switch occurs, all the context information must be switched before the newly scheduled process can run. A process can be viewed as an abstraction of this set of contexts.
Function and classification of system calls
The activity of the operating system kernel at run time can be divided into two parts: the top half and the bottom half. The top half provides system call or trap services to applications; it is synchronous, is caused by the currently executing process, executes in the context of the current process, and may directly access the data structures of the current process. The bottom half consists of the subroutines that handle hardware interrupts; it is asynchronous, and which subroutines are invoked and executed is independent of the current process. The top half is allowed to block, because doing so blocks only the current process; the bottom half is not allowed to block, because blocking the bottom half would block an innocent process or even the entire kernel.
System calls can be regarded as a subroutine library shared by all Unix/Linux processes, but one that runs in privileged mode and can access kernel data structures and the user-level information they support. The primary purpose of system calls is to let users use the operating system's capabilities for device management, the file system, process control, interprocess communication, and memory management without having to understand the internal structure of the operating system or the details of the hardware, thereby reducing the user's burden, protecting the system, and increasing resource utilization.
System calls fall into two groups: those that interact with the file subsystem and those that interact with the process subsystem. The file-subsystem group includes system calls that interact with device files and with ordinary files (open, close, ioctl, create, unlink, ...). The process-related group includes process control system calls (fork, exit, getpid, ...), as well as system calls for interprocess communication, memory management, process scheduling, and so on.
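As an illustrative sketch only (the file name and program structure are arbitrary), the small program below touches both groups: file-subsystem calls (open, read, close) and process-subsystem calls (fork, wait, _exit).

#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[64];
        int fd = open("/etc/hostname", O_RDONLY);   /* file subsystem */
        if (fd >= 0) {
                read(fd, buf, sizeof(buf));
                close(fd);
        }
        pid_t pid = fork();                          /* process subsystem */
        if (pid == 0) {
                write(1, "child\n", 6);              /* child prints and exits */
                _exit(0);
        }
        wait(NULL);                                  /* parent waits for the child */
        return 0;
}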
2. Implementation of system calls under Linux
(Take i386 as an example)
A. How system calls in Linux trap into the kernel
System calls look like ordinary function calls when used, but they are fundamentally different: a function call does not cause a transition from user mode to kernel mode, whereas, as mentioned earlier, a system call requires such a mode transition.
On every platform there is a specific instruction that switches the execution of a process from user mode to kernel mode; this instruction is called the operating system trap. After executing the trap instruction, the process can run the system call code in kernel mode.
Linux implements this with a software interrupt: on the x86 platform the instruction is int 0x80. In other words, in Linux the system call interface is a special case of an interrupt handler. How the system call is implemented through this interrupt handler is described in detail later.
This requires some initialization of int 0x80 at system startup, which proceeds as follows:
1. The assembly routine setup_idt (linux/arch/i386/kernel/head.S) initializes the IDT (Interrupt Descriptor Table); at this point the entry offset of every descriptor is set to ignore_int
(setup_idt:
        lea ignore_int,%edx
        movl $(__KERNEL_CS << 16),%eax
        movw %dx,%ax            /* selector = 0x0010 = cs */
        movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */
        lea SYMBOL_NAME(idt_table),%edi
        mov $256,%ecx
rp_sidt:
        movl %eax,(%edi)
        movl %edx,4(%edi)
        addl $8,%edi
        dec %ecx
        jne rp_sidt
        ret
The resulting descriptors have selector = __KERNEL_CS, DPL = 0, TYPE = E, P = 1.)
2. start_kernel() (linux/init/main.c) calls trap_init() (linux/arch/i386/kernel/traps.c) to set up the interrupt descriptor table. In that function, the entry for the system call is actually set by calling set_system_gate(SYSCALL_VECTOR, &system_call). SYSCALL_VECTOR is 0x80, and system_call is an assembly function; it is the handler for interrupt 0x80 and performs two main tasks: a. saving the register context; b. jumping to the system call handler function. These are covered in more detail later. (Supplementary note: gate descriptors.
set_system_gate() is defined in linux/arch/i386/kernel/traps.c, along with several similar functions: set_intr_gate(), set_trap_gate(), and set_call_gate(). These all use the same low-level helper, _set_gate(), whose job is to set up a gate descriptor. Each entry in the IDT is a gate descriptor.
#define _set_gate(gate_addr, type, dpl, addr) ......
#define set_system_gate(n, addr) \
        _set_gate(idt_table + n, 15, 3, addr)
The purpose of a gate descriptor is to control transfer. It contains a selector, which here is always __KERNEL_CS (pointing to a segment descriptor in the GDT), the entry function offset, the gate's descriptor privilege level (DPL), and a type field. For set_system_gate the DPL is 3, which means the gate may also be entered from privilege level 3 (the least privileged level), and the type is 15, a 386 trap gate.)
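As a rough illustration of what _set_gate() writes into an IDT entry, here is a sketch of the 8-byte gate descriptor layout; the struct and field names are invented for clarity, and the real kernel assembles these bytes with inline assembly rather than a C structure.

/* Invented names, for illustration only. */
struct gate_desc_sketch {
        unsigned short offset_low;    /* handler offset, bits 0..15           */
        unsigned short selector;      /* always __KERNEL_CS for these gates   */
        unsigned char  zero;          /* unused for interrupt/trap gates      */
        unsigned char  type_attr;     /* P(1 bit) | DPL(2 bits) | 0 | type(4) */
        unsigned short offset_high;   /* handler offset, bits 16..31          */
};
/* For set_system_gate(): P = 1, DPL = 3, type = 15  ->  type_attr = 0xEF */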
B. Data structures related to system calls
1. Naming convention for system call handler functions
Handler function names begin with "sys_", followed by the name of the system call. For example, the handler for the system call fork() is named sys_fork().
asmlinkage int sys_fork (struct pt_regs regs);
(Additional note on asmlinkage: it tells the compiler that the function takes all of its arguments from the stack.)
2. System call number
The kernel defines a unique number for each system call. The numbers are defined in linux/include/asm/unistd.h, as follows:
#define __NR_exit       1
#define __NR_fork       2
#define __NR_read       3
#define __NR_write      4
......
When a user makes a system call, the system call number is passed as an argument to interrupt 0x80. This number is in fact an index into the system call table (sys_call_table) introduced below, through which the address of the matching system call handler can be found.
3. The system call table
The system call table is defined as follows (linux/arch/i386/kernel/entry.S):
ENTRY(sys_call_table)
        .long SYMBOL_NAME(sys_ni_syscall)
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        ......
The system call table records the entry address of each system call handler; using the system call number as an offset into the table, the handler's address is found easily. NR_syscalls, defined in linux/include/linux/sys.h, is the maximum number of system calls the table can hold; NR_syscalls = 256.
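Conceptually, the dispatch performed by the system_call assembly shown later could be sketched in C roughly as follows; this is only an illustration, not kernel code, and the function pointer type is simplified (real handlers take differing argument lists).

typedef int (*syscall_fn)(void);
extern syscall_fn sys_call_table[];        /* NR_syscalls entries */

int dispatch_sketch(unsigned int nr)
{
        if (nr >= 256)                     /* NR_syscalls: bad call number */
                return -1;
        return sys_call_table[nr]();       /* call *sys_call_table(,%eax,4) */
}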
C. How the system call function interface is translated into the trap instruction
As mentioned earlier, a system call enters kernel mode through a trap instruction and then looks up the matching handler entry address in the system call table using the system call number passed to the kernel. This process is described in detail here.
Again we take x86 as the example:
Because the trap instruction is a special instruction that depends on the platform the operating system is implemented on (on x86, as noted, it is int 0x80), it is clearly not something users should write directly in their programs, as that would make user programs hard to port. Therefore the layer above the operating system implements a corresponding system call library. Each system call has an entry point in this library (such as fork, open, close, and so on) that is visible to the programmer. These entry points take the corresponding system call number as an argument and execute the trap instruction int 0x80 to sink into the kernel and run the real system call handler. When a process calls the entry point of a particular system call in the library, a stack frame is created for the library function, just as for any other function call. When the process executes the trap instruction, the processor switches to kernel mode and executes kernel code on the kernel stack.
Here is an example (linux/include/asm/unistd.h):
#define _syscallN(type, name, type1, arg1, type2, arg2, ...) \
type name(type1 arg1, type2 arg2, ...) \
{ \
        long __res; \
        __asm__ volatile ("int $0x80" \
                : "=a" (__res) \
                : "0" (__NR_##name), "b" ((long)(arg1)), "c" ((long)(arg2))); \
        ...... \
        __syscall_return(type, __res); \
}
When you call a system call entry function defined in the system call library, what actually executes is code like the above. It involves some GCC inline assembly; rather than explain it in detail, we simply summarize its meaning:
__NR_##name is the system call number; for example, when name == ioctl it becomes __NR_ioctl. It is placed in the register eax and passed as an argument to the interrupt 0x80 handler. The other system call arguments arg1, arg2, ... are placed in the general-purpose registers ebx, ecx, ... and serve as the arguments of the system call handler. How these arguments are passed into the kernel is described later.
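As a hedged example of how such a wrapper might be instantiated (old kernel headers assumed; in a normal program this would clash with the C library's own write()), the _syscall3 macro from the same header could be used to generate a write() wrapper:

#include <linux/unistd.h>   /* __NR_write and the _syscall3 macro (old kernels) */

/* This expands to a function  int write(int fd, const char *buf,
 * unsigned int count)  whose body loads __NR_write into eax, puts
 * fd/buf/count into ebx/ecx/edx, executes int $0x80, and converts the
 * result with __syscall_return().                                      */
_syscall3(int, write, int, fd, const char *, buf, unsigned int, count)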
The following example illustrates:
int func1()
{
        int fd, retval;
        fd = open(filename, ...);
        ......
        ioctl(fd, cmd, arg);
        ......
}
int func2()
{
        int fd, retval;
        fd = open(filename, ...);
        ......
        __asm__ __volatile__ (
                "int $0x80\n\t"
                : "=a" (retval)
                : "0" (__NR_ioctl),
                  "b" (fd),
                  "c" (cmd),
                  "d" (arg));
}
These two functions should produce the same result when run on Linux/x86.
Several library functions can map to the same system call entry point. The system call entry point defines the true syntax and semantics of each system call, but the library functions usually provide a more convenient interface. For example, the exec system call has several invocation forms: execl, execle, and so on. They are in fact just different interfaces to the same system call; each library function processes its own arguments to provide its own flavor, but all of them are eventually mapped to the same kernel entry point.
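A rough sketch of the idea, with simplifications (fixed-size argument array, no overflow handling; my_execl is a hypothetical name): execl-style library functions collect their variable arguments into a vector and end up calling the same execve entry point.

#include <stdarg.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical simplified version of execl(): at most 32 arguments. */
int my_execl(const char *path, const char *arg0, ...)
{
        const char *argv[32];
        int i = 1;
        va_list ap;

        argv[0] = arg0;
        va_start(ap, arg0);
        while (i < 31 && (argv[i] = va_arg(ap, const char *)) != NULL)
                i++;
        argv[i] = NULL;
        va_end(ap);

        /* every exec* library variant ultimately reaches the execve entry point */
        return execve(path, (char *const *)argv, environ);
}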
D. What happens when a system call traps into the kernel?
When a process makes a system call, it first calls a function defined in the system call library, which is usually expanded into the _syscallN form shown above and enters the kernel through int 0x80; its arguments are also passed to the kernel through registers.
In this section we look at system_call, the handler for int 0x80.
If you think about it, you will notice that the execution state is completely different before and after the call: before, user-mode code runs on the user stack; after, kernel code runs on the kernel stack. So, to make sure that after the system call finishes inside the kernel execution can return to the call site and continue with the user code, a context layer must be pushed when entering kernel mode and popped when returning from the kernel, so that the user process can continue to run.
So how is this context information saved, and what does it contain? Again we illustrate with x86.
When the int instruction is executed, the following actually happens:
1. Because the int instruction causes a control transfer between different privilege levels, the kernel stack information (SS and ESP) of the higher privilege level is obtained from the TSS (Task State Segment);
2. The lower-privilege stack information (SS and ESP) is saved on the higher-privilege stack (i.e., the kernel stack);
3. EFLAGS and the outer CS and EIP are pushed onto the higher-privilege stack (the kernel stack);
4. CS and EIP are loaded from the IDT (transferring control to the interrupt handler), and execution enters the interrupt 0x80 handler system_call, which first uses the macro SAVE_ALL, defined as follows:
#define SAVE_ALL \
        cld; \
        pushl %es; \
        pushl %ds; \
        pushl %eax; \
        pushl %ebp; \
        pushl %edi; \
        pushl %esi; \
        pushl %edx; \
        pushl %ecx; \
        pushl %ebx; \
        movl $(__KERNEL_DS),%edx; \
        movl %edx,%ds; \
        movl %edx,%es;
The job of this macro is to push the register context onto the kernel stack. For system calls it also pushes the system call arguments: unlike the call instruction, the int instruction, when transferring control between privilege levels, does not automatically copy the parameters from the outer stack to the inner stack. So when making a system call, the arguments must be placed in the registers as shown in the earlier example; after the trap into the kernel, SAVE_ALL pushes those registers onto the kernel stack so that the kernel can use the arguments passed in by the user. The source code of system_call is given below:
ENTRY(system_call)
        pushl %eax                      # save orig_eax
        SAVE_ALL
        GET_CURRENT(%ebx)
        cmpl $(NR_syscalls),%eax
        jae badsys
        testb $0x20,flags(%ebx)         # PF_TRACESYS
        jne tracesys
        call *SYMBOL_NAME(sys_call_table)(,%eax,4)
        ......
The work done here is:
1. Save the eax register, because the eax saved by SAVE_ALL will be overwritten by the call's return value;
2. Invoke SAVE_ALL to save the register context;
3. Check whether the current call is a legitimate system call (eax holds the system call number; it must be less than NR_syscalls);
4. If the PF_TRACESYS flag is set, jump to the system call trace path, which suspends the current process and sends SIGTRAP to its parent; this is mainly used to set breakpoints for debugging;
5. If the PF_TRACESYS flag is not set, jump to the system call's handler: eax (the system call number mentioned earlier) is used as an offset into the system call table sys_call_table to fetch the handler's entry address, and control jumps to that address.
(Supplemental Note:
1. The GET_CURRENT macro
#define GET_CURRENT(reg) \
        movl %esp, reg; \
        andl $-8192, reg;
Its effect is to leave in reg a pointer to the task_struct of the current process. Because in Linux the kernel stack occupies the same two-page (8192-byte) area as the task_struct, ANDing the stack pointer with -8192 yields the task_struct pointer. The member flags is at offset 4 in the task_struct, so the instruction testb $0x20,flags(%ebx) tests task_struct->flags.
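Expressed in C purely as a sketch (the kernel does this in assembly), the same trick looks like this:

struct task_struct;                      /* defined by the kernel */

/* Sketch only: mask off the low 13 bits of the kernel stack pointer to
 * find the task_struct that shares the same 8192-byte area. */
static inline struct task_struct *get_current_sketch(unsigned long esp)
{
        return (struct task_struct *)(esp & ~8191UL);
}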
2. Parameters in the stack
As mentioned earlier, SAVE_ALL is how the system call arguments get into the kernel. After SAVE_ALL has executed and the call instruction has invoked the handler, the stack looks just like that of an ordinary function call with parameters. The parameters correspond to stack slots in the order (arg1, ebx), (arg2, ecx), (arg3, edx), ..., which is exactly the reverse of the order in which SAVE_ALL pushed them; and these are precisely the arguments the user intended to pass to the kernel when making the system call. The following are two typical ways a kernel handler makes use of these parameters:
asmlinkage int sys_fork (struct pt_regs regs);
asmlinkage int sys_open(const char *filename, int flags, int mode);
In sys_fork, the entire stack is treated as a single argument of type struct pt_regs; the layout of that structure matches the stack, so all the information on the stack is available (see the pt_regs sketch after the example below). In sys_open, the parameters filename, flags, and mode correspond exactly to the positions of ebx, ecx, and edx on the stack, which are exactly the registers the user placed those arguments in when calling the system call through the C library:
__asm__ __volatile__ (
        "int $0x80\n\t"
        : "=a" (retval)
        : "0" (__NR_open),
          "b" (filename),
          "c" (flags),
          "d" (mode));
3. How the kernel uses user-space parameters
When making system calls, some parameters are pointers that refer to addresses within the segment selected by the user-space DS register. In versions before 2.2, the segment in the kernel-mode DS register and the user-mode segment had different base addresses (the former 0xc0000000, the latter 0x00000000), so such parameters could not be read from the correct location directly. They had to be read from the user data segment through special kernel functions (such as memcpy_fromfs and memcpy_tofs), which use the FS register as the segment register for reading the parameters; on entering kernel mode for a system call, FS is set to USER_DS (while DS is set to KERNEL_DS). From version 2.2 on, the data segments used in user mode and kernel mode have the same base address (both 0x00000000), so this cumbersome process is no longer needed and the parameters can be used directly (a usage sketch follows the listings below). For 2.2 and later versions, from linux/arch/i386/kernel/head.S:
ENTRY(gdt_table)
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* not used */
        .quad 0x00cf9a000000ffff        /* 0x10 kernel 4GB code at 0x00000000 */
        .quad 0x00cf92000000ffff        /* 0x18 kernel 4GB data at 0x00000000 */
        .quad 0x00cffa000000ffff        /* 0x23 user   4GB code at 0x00000000 */
        .quad 0x00cff2000000ffff        /* 0x2b user   4GB data at 0x00000000 */
For 2.0, from linux/arch/i386/kernel/head.S:
ENTRY(gdt)
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* not used */
        .quad 0xc0c39a000000ffff        /* 0x10 kernel 1GB code at 0xc0000000 */
        .quad 0xc0c392000000ffff        /* 0x18 kernel 1GB data at 0xc0000000 */
        .quad 0x00cbfa000000ffff        /* 0x23 user   3GB code at 0x00000000 */
        .quad 0x00cbf2000000ffff        /* 0x2b user   3GB data at 0x00000000 */
In the 2.0 kernel, the SAVE_ALL macro definition contains these additional statements:
"MOVL $" STR (Kernel_ds) ",%edx/n/t"/
"mov%dx,%ds/n/t"/
"mov%dx,%es/n/t"/
"MOVL $" STR (User_ds) ",%edx/n/t"/
"mov%dx,%fs/n/t"/
"Movl $0,%edx/n/t"/
E. Call return
Returning from a system call involves more work than entering it. This processing is needed almost every time the kernel returns to user mode, and is described briefly here:
1. Check whether there are pending soft interrupts; if so, jump to soft interrupt handling;
2. Check whether the current process needs to be rescheduled; if so, jump to the scheduler;
3. If the current process has pending signals that have not yet been handled, jump to signal handling;
4. Use RESTORE_ALL to pop everything that SAVE_ALL pushed onto the kernel stack, and use iret to return to user mode.
F. An example
To tie together the data structures related to system calls and the steps a system call goes through under Linux, the following shows how to add a new system call to Linux.
The system call implemented here simply prints a message on the console; it does nothing else.
1. Modify linux/include/asm-i386/unistd.h and add a line:
#define __NR_hello ??? (the number may vary depending on the kernel version)
2. Add a hello.c in a suitable directory (e.g. linux/kernel), and modify the Makefile in that directory (add the corresponding .o file to the object list).
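A minimal sketch of what hello.c and a user-side test might look like, assuming the steps above plus the usual remaining step of adding a .long SYMBOL_NAME(sys_hello) entry to sys_call_table in entry.S; the exact details depend on the kernel version.

/* ---- kernel side: linux/kernel/hello.c (sketch) ---- */
#include <linux/kernel.h>     /* printk */
#include <linux/linkage.h>    /* asmlinkage */

asmlinkage int sys_hello(void)
{
        printk("hello: greetings from a new system call\n");
        return 0;
}

/* ---- user side: invoking it directly through int 0x80 (sketch) ----
int main(void)
{
        int ret;
        __asm__ __volatile__ ("int $0x80"
                : "=a" (ret)
                : "0" (__NR_hello));
        return ret;
}
*/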