Reference:
"Linux kernel design and implementation"
0 Summary
The Linux system call path:
The layers are as follows:
User program ---> C library (the API): int 0x80 ---> system_call ---> system call service routine ---> kernel code
First, what we usually call the user API is in fact the C library provided by the system.
A system call is made with the software interrupt instruction int 0x80, and that instruction is wrapped inside a C library function.
(A software interrupt differs from the hardware interrupts we usually talk about in that it is triggered by an instruction rather than by a hardware peripheral.)
Executing int 0x80 makes the CPU jump to a preset address in kernel space. That address is the system call handler, the system_call() function.
(Note: the system call handler system_call() is not a system call service routine. A service routine is the kernel function that implements one specific system call, whereas the system call handler is the common entry code that runs before any service routine; it is what int 0x80 jumps to, for every system call. Put simply, every system call starts by calling a C library function that contains the int 0x80 instruction; execution then moves to the system call handler system_call(), which goes on to run the specific service routine selected by the system call number.)
How does system_call() find the right service routine? It looks it up in the system call table, sys_call_table, using the system call number. Before int 0x80 executes, the system call number is placed in the EAX register. system_call() reads EAX, multiplies the value by 4 to get an offset, and adds that offset to the base address of sys_call_table. The entry at that location holds the address of the service routine.
The service routine then runs. Note that a service routine takes its parameters only from the stack. The parameters are therefore placed in registers before system_call() runs, and system_call() pushes those registers onto the stack first. After system_call() exits, the user program can read the (possibly modified) values back from the registers.
In addition: a system call enters the kernel through the software interrupt int 0x80, jumps to the system call handler system_call(), and then runs the corresponding service routine. Because the kernel is acting on behalf of the user process, this code runs in process context, not interrupt context. During a system call the kernel can therefore access much of the user process's information, can be preempted by other processes, and can sleep.
When the system call completes and control is about to be handed back to the process that made the call, the kernel runs the scheduler once. If a higher-priority process is runnable, or the current process has used up its time slice, the scheduler picks another process to run.
1 Meaning of system calls
The Linux kernel provides a set of subroutines that implement system functions; these are called system calls. A system call looks much like an ordinary library function call, except that system calls are provided by the operating system kernel and run in kernel mode, while ordinary functions are provided by a library or by the user and run in user mode.
In general, a process cannot access the kernel: it cannot touch the memory the kernel occupies, nor can it call kernel functions. The CPU hardware enforces this (which is why it is called "protected mode"). To interact with processes running in user space, the kernel provides a set of interfaces. Through these interfaces, applications can access hardware devices and other operating system resources. The interfaces act as messengers between applications and the kernel: applications issue requests, and the kernel either satisfies them or turns them down. The main reason for providing these interfaces is to keep the system stable and reliable and to stop applications from doing whatever they please and causing serious trouble.
System calls add a middle layer between user-space processes and hardware devices. This layer serves three main purposes:
(1) It provides user space with a uniform abstraction of the hardware. For example, when an application reads a file, it need not care about the type of disk or media, or even about which file system the file lives on.
(2) System calls ensure the stability and security of the system. Acting as an intermediary between hardware devices and applications, the kernel can arbitrate each requested access based on permissions and other rules. This prevents applications from, for example, using hardware incorrectly, stealing resources from other processes, or otherwise harming the system.
(3) Each process runs in a virtual system, and providing a common interface between user space and the rest of the system is what makes this possible. If applications were free to access the hardware without the kernel knowing, multitasking and virtual memory would be almost impossible to implement, let alone good stability and security. In Linux, system calls are the only means by which user space enters the kernel; apart from exceptions and interrupts, they are the only legitimate entry point into the kernel.
2 Relationship between the API, POSIX, and the C library
In general, applications are written against an application programming interface (API) rather than directly against system calls. This matters because the programming interface an application uses does not have to correspond one-to-one with the system calls the kernel provides. An API defines a set of programming interfaces used by applications. An interface can be implemented as one system call, through several system calls, or without using any system calls at all. In fact, the same API can exist on several operating systems, presenting exactly the same interface to applications while being implemented differently on each system.
In the Unix world, the most popular application programming interface is based on the POSIX standard, whose goal is to provide a portable operating system standard based largely on Unix. POSIX is a good illustration of the relationship between APIs and system calls. On most Unix systems, the API functions defined by POSIX correspond closely to system calls.
Linux system calls, like those of most Unix systems, are provided as part of the C library. The C library implements the main API of Unix systems, including the standard C library functions and the system call interface. Every C program can use the C library, and because of the nature of the C language, other languages can easily wrap it for their own use.
From the programmer's point of view, system calls are irrelevant; programmers only deal with the API. Conversely, the kernel only deals with system calls; how library functions and applications use them is not its concern.
A common maxim of Unix interface design is "provide mechanism, not policy". In other words, Unix system calls abstract the functions needed to accomplish a specific purpose; how those functions are used is not something the kernel needs to care about. Separating mechanism from policy is a highlight of Unix design. Most programming problems can be divided into two parts: what functionality to provide (mechanism) and how that functionality is used (policy).
3 Implementation of system calls
3.1 System call handlers
You may wonder: "When I type cat /proc/cpuinfo, how does the cpuinfo() function get called?" Once the kernel has finished booting, control flow changes from the relatively intuitive "which function is called next?" to being driven by system calls, exceptions, and interrupts.
User-space programs cannot execute kernel code directly. They cannot call functions in kernel space, because the kernel resides in a protected address space. If a process could read and write the kernel's address space directly, system security would be out of control. An application therefore has to notify the system in some way, telling the kernel that it needs a system call performed and that it wants the system to switch to kernel mode so that the kernel can execute the system call on the application's behalf.
The mechanism for notifying the kernel is a software interrupt. First, the user program sets up the parameters of the system call; one of them is the system call number. Once the parameters are set, the program executes the system call instruction. On x86, the software interrupt is generated by the int instruction. This instruction raises an exception: an event that makes the processor switch to kernel mode, jump to a new address, and start executing the exception handler located there. In this case the exception handler is the system call handler, and it is closely tied to the hardware architecture.
The code at the new address saves the state of the program, works out which system call was requested, calls the kernel function that implements that system call, restores the state of the user program, and then returns control to it. System calls are one way in which functions defined in device drivers eventually get called.
3.2 System call number
In Linux, each system call is assigned a system call number, so that the call can be identified by this unique number. When a user-space process executes a system call, the number, not the name, indicates which system call is meant.
The system call number is critical. Once assigned it must never change, otherwise applications that have already been compiled would break. Linux has a "not implemented" system call, sys_ni_syscall(), which does nothing except return -ENOSYS; it exists specifically to handle invalid system calls.
Because all system calls enter the kernel in the same way, merely trapping into kernel space is not enough: the system call number must also be passed to the kernel. On x86, the number is passed in the EAX register. Before trapping into the kernel, user space puts the number of the desired system call into EAX, and the system call handler reads it from EAX as soon as it runs. Other architectures work similarly.
The kernel keeps the list of all registered system calls in the system call table, stored in sys_call_table. The table is architecture-specific and is typically defined in entry.S. It assigns a unique system call number to every valid system call. sys_call_table is a table of function pointers to the kernel functions that implement the various system calls:
ENTRY(sys_call_table)
        .long SYMBOL_NAME(sys_ni_syscall)       /* 0 - old "setup()" system call */
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
        .long SYMBOL_NAME(sys_close)
        .long SYMBOL_NAME(sys_waitpid)
        ...
        .long SYMBOL_NAME(sys_capget)
        .long SYMBOL_NAME(sys_capset)           /* 185 */
        .long SYMBOL_NAME(sys_sigaltstack)
        .long SYMBOL_NAME(sys_sendfile)
        .long SYMBOL_NAME(sys_ni_syscall)       /* STREAMS1 */
        .long SYMBOL_NAME(sys_ni_syscall)       /* STREAMS2 */
        .long SYMBOL_NAME(sys_vfork)            /* 190 */
system_call() checks the given system call number for validity by comparing it with NR_syscalls. If the number is greater than or equal to NR_syscalls, the function returns -ENOSYS. Otherwise it executes the corresponding system call:
        call *sys_call_table(,%eax,4)
Because each entry in the system call table is stored as a 32-bit (4-byte) value, the kernel multiplies the given system call number by 4 and uses the result to locate the entry in the table.
3.3 Parameter passing
Besides the system call number, most system calls need some parameters to be passed in. When the exception occurs, these parameters have to travel from user space to the kernel. The simplest way is to store them in registers, just like the system call number. On x86, EBX, ECX, EDX, ESI, and EDI hold the first five parameters, in that order. Needing six or more parameters is rare; in that case a single register holds a pointer to the user-space memory where all the parameters are stored.
The return value is also passed to user space in a register. On x86 it is stored in EAX. Much of the following description of the system call handler is specific to x86, but the implementations on other architectures are very similar.
3.4 Parameter Verification
System calls must carefully check that all of their parameters are valid. For example, system calls related to file I/O must check that the file descriptor is valid, and process-related functions must check that the supplied PID is valid. Every parameter must be checked to ensure that it is not only valid, but also correct.
One of the most important checks is whether a pointer supplied by the user is valid. Imagine that a process could pass any pointer to the kernel without a check: it could hand over a pointer to memory it has no right to access and trick the kernel into copying data it is not allowed to see, such as data belonging to another process. Before accepting a pointer from user space, the kernel must ensure that:
(1) The memory area the pointer refers to belongs to user space. A process must not be able to trick the kernel into reading data in kernel space.
(2) The memory area the pointer refers to lies within the address space of the process. A process must not be able to trick the kernel into reading another process's data.
(3) For a read, the memory is marked readable; for a write, the memory is marked writable. A process must not be able to bypass memory access restrictions.
The kernel provides two functions that perform the necessary checks and copy data back and forth between kernel space and user space. Note that the kernel must never blindly follow a pointer from user space; one of these two functions must always be used. To write data to user space, the kernel provides copy_to_user(), which takes three parameters. The first is the destination address in the process's address space. The second is the source address in kernel space. The last is the length of the data to copy, in bytes.
To read data from user space, the kernel provides copy_from_user(), which is analogous to copy_to_user(). It copies data from the location given by the second parameter to the location given by the first parameter, and the third parameter gives the number of bytes to copy.
On failure, both functions return the number of bytes that could not be copied. On success they return 0. When such an error occurs, the system call returns the standard -EFAULT.
Note that both copy_to_user() and copy_from_user() may block. This happens when the page containing the user data has been swapped out to disk and is not in physical memory. The process then sleeps until the page fault handler has brought the page back from disk into physical memory.
3.5 Return values for system calls
System calls (often called syscalls in Linux) are usually invoked through functions. They typically take one or more parameters (inputs) and may have side effects, such as writing to a file or copying data to a given pointer. To avoid confusion with normal return values, the library function does not return the error code directly; it stores the error in a global variable named errno. A negative return value usually indicates an error, and a return value of 0 usually indicates success. If a system call fails, you can read errno to find out what went wrong. The perror() library function translates this variable into an error string that the user can understand.
The error messages corresponding to the values of errno are defined in errno.h; you can also view them with the command "man 3 errno". Note that errno is only set when a function fails. If the function succeeds, the value of errno is undefined; it is not reset to 0. It is also best to copy errno into another variable before handling the error, because even a function like printf() can change errno if it fails during error handling.
Of course, a system call ultimately performs a definite operation. Take the getpid() system call: by definition it returns the PID of the current process. Its implementation in the kernel is very simple:
asmlinkage long sys_getpid(void)
{
        return current->tgid;
}
Although this system call is very simple, two things stand out. First, note the asmlinkage qualifier in the function definition. This is a bit of magic that tells the compiler to take the function's parameters only from the stack. All system calls need this qualifier. Second, note that the getpid() system call is defined in the kernel as sys_getpid(). This is the naming convention that all Linux system calls should follow.
4 Adding a new system call
Adding a new system call to Linux is relatively easy. Designing and implementing the system call is the hard part; adding it to the kernel takes little effort. Let's look at the steps needed to implement a new Linux system call.
The first step is to decide on its purpose. What will it do? Every system call should have one clear purpose. Linux discourages multiplexed system calls (a single system call that does different things depending on the value of a parameter). ioctl() should be regarded as a counter-example.
What are the parameters, return value, and error codes of the new system call? The interface should be clean and have as few parameters as possible. When designing the interface, think ahead as far as you can. Are you restricting the function unnecessarily? The more general the design, the better. Do not assume that the way the system call is used today is the way it will be used tomorrow: its purpose may stay the same while its usage changes. Is the system call portable? Do not make assumptions about the machine's word size or byte order. When writing a system call, always keep portability and robustness in mind, considering not only present needs but also future ones.
Once the system call has been written, registering it as an official system call is trivial:
(1) Add an entry at the end of the system call table. This must be done for every architecture that supports the system call. Counting from 0, the position of the entry in the table is its system call number.
(2) For every supported architecture, the system call number must be defined in <asm/unistd.h>.
(3) The system call must be compiled into the kernel image (it cannot be compiled as a module). Simply putting it in a relevant file under kernel/ is enough.
Let's go through these steps in more detail with a fictitious system call, foo(). First, we add sys_foo to the system call table. For most architectures the table is in the entry.S file and looks like this:
ENTRY(sys_call_table)
        .long sys_restart_syscall       /* 0 */
        .long sys_exit
        .long sys_fork
        .long sys_read
        .long sys_write
We add the new system call to the end of the table:
        .long sys_foo
Although no number is given explicitly, our system call is assigned the next number in order, 283. We have to add the system call to the system call table of every architecture we want to support. The system call number need not be the same on every architecture.
Next, we add the system call number to <asm/unistd.h>, which is in the following format:
/* This file contains the system call numbers */
#define __NR_restart_syscall    0
#define __NR_exit               1
#define __NR_fork               2
#define __NR_read               3
#define __NR_write              4
/* ... */
#define __NR_mq_getsetattr      282
We then add the following line to the list:
#define __NR_foo                283
Finally, let's implement the foo() system call. Whatever the configuration, the system call must be compiled into the core kernel image, so we put it in kernel/sys.c. You could also put it in the file whose code is most closely related to its function.
asmlinkage long sys_foo(void)
{
        return THREAD_SIZE;
}
That's it! Strictly speaking, the foo() system call can now be invoked from user space.
Creating a new system call is easy, but that does not mean it is encouraged. Usually a module is a better alternative to creating a new system call.
5 Accessing system calls
5.1 System Call Context
The kernel is in process context while it executes a system call. The current pointer points to the current task, that is, the process that made the system call.
In process context, the kernel can sleep and can be preempted. Both points are important. First, being able to sleep means that system calls can use most of the functionality the kernel provides, which greatly simplifies kernel programming. Second, being preemptible in process context means that, just like a process in user space, the current process can be preempted by another process. Because the new process may execute the same system call, care must be taken to make system calls reentrant. The same concern applies, of course, to symmetric multiprocessing.
When the system call returns, control is still in system_call(), which is ultimately responsible for switching back to user space and letting the user process continue.
5.2 System Call Access example
The operating system uses the system call table to translate a system call number into a specific system call. The table contains the address of the function that implements each system call. For example, the read() system call is implemented by the function sys_read(). The system call number of read() is 3, so sys_read() is the fourth entry in the system call table (numbering starts at 0). The kernel reads the entry at address sys_call_table + (3 * word_size) to obtain the address of sys_read().
Once the correct address has been found, control is transferred to that system call. sys_read() is defined in the file fs/read_write.c. The function looks up the file structure associated with the fd number passed to read(). That structure contains pointers to the functions used to read data from that particular type of file. After a few checks, sys_read() calls the file-specific read() function, which actually reads the data from the file, and then returns. The file-specific functions are defined elsewhere, for example in socket code, file system code, or device driver code. This is one place where a particular kernel subsystem ends up cooperating with the rest of the kernel.
When the read function finishes, it returns from sys_read(), and control passes to ret_from_sys. This code checks for work that must be done before switching back to user space. If there is nothing to do, it restores the state of the user process and returns control to the user program.
5.3 Direct access to system calls from user space
Typically, system calls are supported by the C library. A user program can use system calls (or call library functions that in turn make the system calls) by including the standard header files and linking against the C library. But if you have only just written a system call, glibc probably does not support it yet. Fortunately, Linux itself provides a set of macros for accessing system calls directly. They set up the registers and issue the trap instruction. These macros are _syscalln(), where n ranges from 0 to 6 and is the number of parameters passed to the system call; the macro has to know exactly how many parameters to load into the registers, and in what order. For example, the open() system call is defined as:
long open(const char *filename, int flags, int mode)
Without library support, the macro form for calling this system call directly is:
#define __NR_open 5
_syscall3(long, open, const char *, filename, int, flags, int, mode)
The application can then use open() directly.
Each macro takes 2 + 2 × n parameters. The first parameter is the return type of the system call. The second is the name of the system call. After that come the type and name of each parameter, in the order the system call takes them. __NR_open is defined in <asm/unistd.h> and is the system call number. The macro expands into a C function with inline assembly. The assembly performs the steps discussed in the previous sections: it loads the system call number and the parameters into the registers and triggers the software interrupt to trap into the kernel. To call the open() system call, simply place the macro above in the application.
Let's write the macro for the foo() system call we created earlier, and then some test code to show off our work.
#include <stdio.h>

#define __NR_foo 283
_syscall0(long, foo)

int main()
{
        long stack_size;

        stack_size = foo();
        printf("The kernel stack size is %ld\n", stack_size);
        return 0;
}
6 Practical considerations
(1) A system call has to be compiled into the kernel ahead of time, and it needs an officially assigned system call number.
(2) A system call has to be registered on every supported architecture.
(3) System calls cannot be accessed directly from scripts.
(4) Avoid creating new system calls where possible; creating a device node can be used instead.