Methods for exchanging information between Linux kernel space and user space


About the author:

Kang Hua holds a master's degree in computer science and works mainly on research and development in the Linux operating system kernel, Linux technical standards, computer security, and software testing. He currently works at the MII-HP Linux Software Laboratory, part of the Software and Integrated Circuit Promotion Center of the Ministry of Information Industry. He can be contacted at [email protected].

Abstract: System-level development, such as writing device drivers and kernel modules, often requires exchanging information between the kernel and user programs. Linux provides several ways to do this. This article summarizes the commonly used methods and uses simple examples to show their characteristics and usage. Some of them are very familiar; others apply only under special conditions. Comparing these methods deepens our understanding of the Linux kernel and, more importantly, helps us apply kernel-level development techniques on Linux more skillfully.

Kernel space vs. user space

As a Linux developer, you should first understand the difference between kernel space and user space. A great deal has already been written on this topic, so we describe it only briefly here.

In modern computer architectures, memory management usually includes a protection mechanism. Its purpose is to prevent one task in the system from accessing memory that belongs to another task or to the operating system. The Intel x86 architecture, for example, provides privilege levels as a protection mechanism and restricts access to memory areas according to them. On top of this architecture, the Linux operating system divides itself in two: the core software is separated from ordinary applications and runs at a higher privilege level (Linux runs the kernel at privilege level 0 on Intel systems). It resides in a protected memory space and has full permission to access hardware devices. Linux calls this memory space kernel space.

Everything else runs as applications in user space. Applications see only the system resources they are allowed to use. They cannot use certain system features, cannot access hardware directly, cannot access kernel space directly, and are subject to other specific restrictions. (Linux runs user programs at privilege level 3 on Intel systems.)

From a security standpoint, placing user space and kernel space under this asymmetric access mechanism is very effective. It resists prying by malicious users and limits the damage that poorly written user programs can do, which makes the system more stable and reliable. However, if user programs were completely forbidden to access and use kernel-space resources, the system could not provide any meaningful functionality. So that user programs can use the resources controlled by kernel space without violating the privilege rules above, standard access interfaces are defined at every level, from the hardware architecture up to the operating system. For details on the x86 architecture, see reference 1.

Hardware architectures generally provide a "gate" mechanism. A gate lets a low-privileged application enter high-privileged kernel space when certain events occur. On the Intel x86 architecture, Linux uses the hardware "system gate" interface (invoked with the int $0x80 machine instruction) to build the system calls that form the software interface, which gives applications a channel for moving from user mode into kernel mode. Going through the system gate with a system call requires no special permission, but the caller cannot enter the kernel at an arbitrary location: the entry point is fixed by the system call, and this restriction is what keeps the kernel safe. An analogy: as a tourist you can buy a ticket to enter a safari park, but you must stay in the sightseeing bus and follow the prescribed route. Do not get off the bus, because that is too dangerous: either you lose your life or you frighten the animals.
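The idea that every kernel service is reached through one numbered entry point can be seen from user space. The following is a minimal sketch, not part of the original article: the helper name raw_getpid is hypothetical, and it uses the C library's generic syscall() function rather than a hand-written int $0x80 instruction (modern C libraries may use a different entry instruction than int $0x80).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our process ID by system call number, bypassing the
 * getpid() wrapper. Both paths enter the kernel through the same gate. */
pid_t raw_getpid(void)
{
    long result = syscall(SYS_getpid);

    assert(result > 0);   /* getpid never fails */
    return (pid_t)result;
}
```

Calling raw_getpid() from main() and comparing the result with getpid() should give the same value, since the wrapper and the raw call end up at the same kernel routine.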

For reasons of efficiency and code size, kernel code cannot use the standard library functions (there are other considerations as well; see reference 2 for details), so kernel development is less convenient than user program development. In addition, the kernel as described here is "non-preemptive" (kernel preemption is only a configuration option in Linux 2.6): a process running in kernel space is not replaced by another process unless it gives up the CPU voluntarily, for example by calling sleep() or schedule(). Therefore, whether it runs in process context (for example while serving a read system call) or in interrupt context (in an interrupt service routine), kernel code must not occupy the CPU for long, otherwise other programs cannot run and can only wait.

Interaction between kernel space and user space

More and more applications now need a kernel-level program and a user-level program that work together on a task. The typical pattern is as follows. First, a kernel service program is written that uses the privileges and services of kernel space to receive, process, and cache data. Then a user program is written to interact with that kernel service program. Specifically, the user program can configure the parameters of the kernel service program, extract the data it provides, and also feed it data to process.

Typical examples include Netfilter (kernel service program: firewall) with iptables (user-level program: rule setup), IPsec (kernel service program: VPN protocol part) with IKE (user-level program: VPN key negotiation), and of course a large number of device drivers with their application software. Each of these consists of a kernel-level and a user-level program that complete a task together by exchanging information.

Information Interaction Methods

Information exchange between a user program and the kernel is bidirectional: information can be sent from user space to kernel space, and data can be submitted from kernel space to user space. A user program can of course also extract data from the kernel on its own initiative. Below we summarize the methods by which the kernel and user programs exchange data.

Depending on which side initiates the transfer, information exchange falls into two categories: the user program sends data to or extracts data from the kernel, and the kernel submits data or requests to user space. We start with the first category.

Information interaction initiated by user-level programs

A: Writing your own system call

As described above, the system call is the most basic way for a user-level program to access the kernel. Linux currently provides roughly 200 standard system calls (see include/asm-i386/unistd.h and arch/i386/kernel/entry.S in the kernel source tree) and lets us add our own system calls to exchange information with the kernel. Suppose, for example, that we want to build a system call log that records all system call activity for intrusion detection. We can write a kernel service program that collects all system call requests and records them in a buffer it maintains in the kernel. A complex intrusion detection program cannot be implemented inside the kernel, so the records in this buffer must be extracted into user space. The most direct approach is to write a new system call that extracts the buffer contents. Once the kernel service program and the new system call are in place, we can write the intrusion detection program in user space. It invokes the new system call periodically, by polling, or on demand to extract data from the kernel, and then performs the detection.

B: Writing a driver

A hallmark of Linux/UNIX is that everything is treated as a file. The system defines a simple but complete driver interface through which a user program can interact with kernel drivers in a uniform way. Most users and developers of the system are already familiar with this interface and the corresponding development process.

The driver runs in kernel space, and a user-space application interacts with it through a file under the /dev/ directory. This is the familiar file operation sequence: open(), read(), write(), ioctl(), close(). (Note that not every kernel driver uses this interface. Network drivers and the various protocol stacks are used differently: socket programming, for example, also has the concepts of open() and close(), but its kernel implementation and external usage differ greatly from those of ordinary drivers.) For programming details, see references 3 and 4.

This article is not concerned with what a device driver does inside the kernel, such as interrupt handling, device management, and data processing. We focus on the part that interacts with user-level programs. The operating system defines a unified interface for this part, namely the open(), read(), write(), ioctl(), and close() calls mentioned above. Each driver implements these calls independently according to its own needs and hides the functionality and services it provides behind the unified interface. A user-level program selects the driver or service it needs (in practice, it selects a file under the /dev/ directory) and then interacts with the driver in the kernel by following the interface and the file operation sequence above. Object-oriented terms make this easy to explain: the system defines an abstract interface, and each concrete driver is an implementation of that interface.
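The "abstract interface, concrete implementation" idea can be illustrated in plain user-space C with a table of function pointers. This is only a simplified model under our own assumptions: struct demo_file_ops, the two toy "drivers", and demo_read_once are hypothetical names for illustration and are not the kernel's actual driver interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a driver's operation table:
 * one abstract interface expressed as function pointers. */
struct demo_file_ops {
    int    (*open)(void);
    size_t (*read)(char *buf, size_t len);
    int    (*release)(void);
};

/* Toy "driver" 1: every read fills the buffer with zero bytes. */
static int zero_open(void) { return 0; }
static size_t zero_read(char *buf, size_t len) { memset(buf, 0, len); return len; }
static int zero_release(void) { return 0; }

/* Toy "driver" 2: every read returns a fixed greeting. */
static int hello_open(void) { return 0; }
static size_t hello_read(char *buf, size_t len)
{
    static const char msg[] = "hello";
    size_t n = (sizeof msg - 1 < len) ? sizeof msg - 1 : len;

    memcpy(buf, msg, n);
    return n;
}
static int hello_release(void) { return 0; }

const struct demo_file_ops zero_ops  = { zero_open, zero_read, zero_release };
const struct demo_file_ops hello_ops = { hello_open, hello_read, hello_release };

/* The caller knows only the interface, not which driver is behind it.
 * Returns the number of bytes read, or 0 if open fails. */
size_t demo_read_once(const struct demo_file_ops *ops, char *buf, size_t len)
{
    size_t n;

    assert(ops && ops->open && ops->read && ops->release);
    if (ops->open() != 0)
        return 0;
    n = ops->read(buf, len);
    ops->release();
    return n;
}
```

Passing &hello_ops or &zero_ops to demo_read_once() selects the "driver" much as choosing a file under /dev/ selects a real one; the calling code stays the same.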

Drivers are therefore another important way for user space and the kernel to exchange information. In essence, ioctl, read, and write are also carried out through system calls, but these calls have already been encapsulated and defined in a standard way by the kernel. The user therefore does not need to modify kernel code and recompile the kernel, as is necessary when adding a new system call. Instead, the new (possibly virtual) device can be packaged as a module and installed into the kernel with insmod. For the design of this mechanism, see reference 5; for programming details, see reference 6.

In Linux, devices fall broadly into three classes: character devices, block devices, and network interfaces. Character devices are devices that must be accessed sequentially, like a byte stream, such as character terminals and serial ports. Block devices can be accessed in random order, in whole blocks of data, such as hard disks. Network interfaces cover complex network input and output services, typically a network card together with a protocol stack. Implementing the system call log as a character device driver is also straightforward. The part of the kernel that collects and records information can be written as a character device driver. The absence of a physical device is not a problem: a Linux device driver is a software abstraction that can provide services together with hardware or as a pure software service (although it cannot, of course, avoid using memory). In the driver, open() can start the service, read() can return the processed records, ioctl() can set the record format, and close() can stop the service. write() is not needed, so it can be left unimplemented. Finally, we create a device file under the /dev/ directory that corresponds to the new system call log driver.

C: Using the proc file system

proc is a special file system provided by Linux. Its purpose is to offer a convenient way for users and the kernel to interact. It uses the file system as its interface, so an application can obtain the running state of the system and other kernel data safely and conveniently through ordinary file operations.

The proc file system is mostly used for monitoring, managing, and debugging the system. Many familiar management tools, such as ps and top, read kernel information through proc. Besides reading kernel information, the proc file system also supports writing, so it can be used to pass information into the kernel as well. For example, by modifying the system parameter files under /proc/sys, we can change kernel parameters directly at run time, as in the following command:

echo 1 > /proc/sys/net/ipv4/ip_forward

This turns on the switch that controls IP forwarding in the kernel and so enables routing on the running Linux system. Many other kernel options can be queried and adjusted directly through the proc file system in the same way.
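Because proc entries behave like ordinary files, a program can query them with the standard I/O functions. The sketch below is an illustration only; the helper name read_first_line is our own and not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a (proc) file into buf, without the trailing
 * newline. Returns 0 on success and -1 if the file cannot be read. */
int read_first_line(const char *path, char *buf, size_t size)
{
    FILE *fp;

    assert(path != NULL && buf != NULL && size > 0);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(buf, (int)size, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}
```

For example, read_first_line("/proc/sys/net/ipv4/ip_forward", buf, sizeof buf) would leave "0" or "1" in buf on a system where that entry exists, reflecting whether IP forwarding is currently enabled.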

Besides the file entries that the system already provides, proc offers an interface for creating new entries in the kernel to share data with user programs. For the system call log program (whether it is built as a driver or as a simple kernel module), we could create a new proc entry that shows how many system calls have been made, how often each individual system call is used, and so on. We could also add entries for setting logging rules, for example to stop logging the open system call. For details on using the proc file system, see reference 7.

D: Using a virtual file system

Some kernel developers think that the ioctl() system call tends to make interfaces ambiguous and hard to control, and that putting information into the proc file system clutters it, so they discourage overusing either. They recommend implementing a separate virtual file system in place of ioctl() and /proc, because a file system interface is clear and easy for users to access, and because a virtual file system makes system administration tasks easier and more efficient to perform with scripts.

As an example, consider how to access kernel information through a virtual file system. We can implement a virtual file system named sagafs in which the file log corresponds to the system call log stored by the kernel. The log can then be read with ordinary file access:

# cat /sagafs/log

Exchanging information through a virtual file system (VFS) makes system management more convenient and clearer. Some programmers may object that the VFS API is complicated and hard to master. There is no need to worry: the 2.5 kernel provides a sample called libfs that wraps the common VFS operations for users who are unfamiliar with file systems. See the references for more on interacting through the VFS.

E: Using memory mapping

Linux lets a user program access memory directly through the memory mapping mechanism. Memory mapping means that a specific region of kernel memory is mapped into the address space of a user-level program, so that user space and kernel space share the same piece of memory. The effect is easy to see: whatever the kernel changes at these addresses, the user program can see and use immediately, without any data copy. When information is exchanged through a system call, by contrast, the data must be copied at some point in the operation, either from the kernel to a user buffer or from a user buffer to the kernel. For applications with high data volumes and tight timing requirements this is a fatal drawback: many of them simply cannot afford the time and resources spent on copying.
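The "shared page, no copy" effect can be tried out entirely in user space. The sketch below is an analogy rather than a driver example: it shares one anonymous mapping between a parent and a child process instead of between the kernel and a user program, and the function name shared_page_demo is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one shared page, let a child process write a value into it, and
 * return what the parent then reads from the same page.
 * Returns -1 on failure, so value must be non-negative. */
int shared_page_demo(int value)
{
    int *shared;
    int seen;
    pid_t pid;

    assert(value >= 0);
    shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {            /* child: plays the role of the producer */
        *shared = value;
        _exit(0);
    }

    /* parent: wait for the writer, then read the page directly */
    if (waitpid(pid, NULL, 0) < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    seen = *shared;            /* no read() call and no copy involved */
    munmap(shared, sizeof *shared);
    return seen;
}
```

The parent never calls read(): the value written by the child is simply there when the parent looks. A driver that implements mmap gives a user program the same kind of direct view of a kernel buffer.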

We once developed a driver for a high-speed sampling device that required 16-bit real-time sampling at a repetition rate of 1 kHz and 20 megabits per millisecond. This staggering amount of data had to be sampled, transferred by DMA, and processed within milliseconds, which is impossible if the data has to be copied. Memory mapping was the only option. We reserved a region of memory and configured it as a ring queue that receives the DMA output of the sampling device. We then mapped this region into the user space of the data processing program, so that data just sampled and transferred to the host could be processed by the user-space program immediately.
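A ring queue of the kind mentioned here can be sketched in a few lines of C. This is a generic single-producer, single-consumer illustration under our own assumptions (the names, the slot count of 8, and the layout are hypothetical), not the driver's actual code, and it omits the memory barriers and DMA handling that a real kernel/user shared buffer would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* power of two, chosen only for illustration */

struct sample_ring {
    uint16_t slots[RING_SLOTS];   /* 16-bit samples, as in the text */
    unsigned head;                /* count of samples ever written */
    unsigned tail;                /* count of samples ever read */
};

void ring_init(struct sample_ring *r)
{
    assert(r != NULL);
    r->head = 0;
    r->tail = 0;
}

/* Producer side (the device in the text). Returns 0, or -1 if full. */
int ring_put(struct sample_ring *r, uint16_t sample)
{
    assert(r != NULL);
    if (r->head - r->tail == RING_SLOTS)
        return -1;
    r->slots[r->head % RING_SLOTS] = sample;
    r->head++;
    return 0;
}

/* Consumer side (the user-space program). Returns 0, or -1 if empty. */
int ring_get(struct sample_ring *r, uint16_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return -1;
    *out = r->slots[r->tail % RING_SLOTS];
    r->tail++;
    return 0;
}
```

Because the producer only advances head and the consumer only advances tail, the two sides never write the same index, which is one reason this layout is popular for buffers shared between a device and a reader.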

In practice, memory mapping is commonly used wherever the kernel and user space need to exchange large amounts of data quickly, especially in applications with strict real-time requirements. The X Window System server is a typical example: the X server exchanges large amounts of data with video memory, and mapping the graphics display memory directly into user space improves performance significantly compared with lseek/write.

Not every kind of application is suited to mmap. For stream-oriented character devices such as serial ports and mice, mmap has little to offer. Shared memory also brings a synchronization problem. No dedicated synchronization mechanism is provided for memory shared between user programs and kernel code, so reads and writes must be designed very carefully to ensure that the two sides do not interfere with each other.

mmap is built entirely on shared memory. That is what makes it so convenient, and also what makes it particularly hard to control.

Information interaction initiated by the kernel

For kernel-initiated interaction, what interests us most is how the kernel sends messages to a user program and how the user program receives them. The concrete questions usually come down to these: Can the kernel invoke a user program? Can the kernel tell a user process that an event has occurred by sending it a signal?

The biggest difference from the methods described earlier is that here the kernel takes the initiative, rather than waiting passively for a system call before it returns information.

A: Calling a user program from kernel space

Even in the kernel we sometimes need to carry out user-level operations, such as opening a file to read specific data or running a user program to perform some function. Much data and functionality already exists or has been implemented in user space, so there is no need to spend resources duplicating it. In addition, to stay flexible enough to support unknown but possible changes, the kernel is sometimes designed to rely on user-space resources to complete a task. For example, the part of the kernel that loads modules dynamically needs to call kmod. When kmod is compiled, however, it is impossible to have compiled every kernel module (if it were possible, dynamic module loading would be pointless), so kmod cannot know the location or loading method of modules that appear later. Dynamic module loading therefore adopts the following strategy: the actual loading is done by the modprobe program in user space. In the simplest case, modprobe calls insmod with the module name passed by the kernel as a parameter and loads the required module that way.

Starting a user program from the kernel still goes through the execve system call, but here the call is made from kernel space, whereas system calls are normally made from user space. If the system call takes parameters, this causes a problem. The implementation of a system call checks the validity of its parameters, and the check requires all parameters to lie in user space, that is, at addresses between 0x00000000 and 0xc0000000. If the parameters are passed from the kernel (at addresses above 0xc0000000), the check rejects the call. To solve this, we can use the set_fs macro to change the checking policy so that kernel addresses are accepted as parameter addresses. The kernel can then use the system call directly.
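The effect of changing the limit can be modeled with a few lines of user-space C. This is a simplified model of the range check, not the kernel's implementation: the names demo_range_ok, DEMO_USER_DS, and DEMO_KERNEL_DS are hypothetical, and the limit values assume the 32-bit x86 layout with a 3 GB user space described in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Limits for 32-bit x86 with the 3 GB / 1 GB split (assumed values). */
#define DEMO_USER_DS   0xC0000000u   /* limit while user pointers are checked */
#define DEMO_KERNEL_DS 0xFFFFFFFFu   /* limit after a set_fs(KERNEL_DS)-style change */

/* Simplified range check: the whole buffer [addr, addr + size) must lie
 * at or below the current limit. Returns 1 if accepted, 0 if rejected. */
int demo_range_ok(uint32_t addr, uint32_t size, uint32_t limit)
{
    uint64_t end = (uint64_t)addr + size;   /* 64-bit sum avoids wrap-around */

    assert(limit != 0);
    return end <= (uint64_t)limit;
}
```

With DEMO_USER_DS as the limit, a buffer at 0xc0100000 is rejected; with DEMO_KERNEL_DS the same buffer passes, which is the change in policy that the set_fs macro brings about.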

For example, kmod calls set_fs(KERNEL_DS) before it runs modprobe through execve:

set_fs(KERNEL_DS);

/* Go, go, go ... */
if (execve(program_path, argv, envp) < 0)
        return -errno;

In the code above, program_path is "/sbin/modprobe", argv is { modprobe_path, "-s", "-k", "--", (char *)module_name, NULL }, and envp is { "HOME=/", "TERM=linux", "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL }.

Opening a file from the kernel likewise uses the open system call with parameters, so the set_fs macro must be called first there as well.

B: Using the brk system call to export kernel data

The kernel and user space pass data mainly with the get_user(ptr) and put_user(datum, ptr) routines, which can be found in most system calls that need to transfer data. But how can kernel data be passed to user space when we are not inside a system call initiated by a user program, that is, when no buffer location in user space has been explicitly provided?

Clearly put_user() can no longer be used directly, because there is no destination buffer to give it. So we borrow the brk system call and the address space of the current process. brk sets the size of a process's heap. Every process has its own heap, and dynamic memory allocation functions such as malloc actually obtain their memory from it. We use brk to extend the heap of the current process by a new temporary buffer and then use put_user() to export kernel data into this known user-space location.
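What "extending the heap by a temporary buffer" means can be seen from user space with sbrk(), the C library's companion to brk. The sketch below only illustrates the heap mechanics and is not the kernel-side trick itself; the function name heap_scratch_demo is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Grow the heap by len bytes, copy src into the new area, check what is
 * stored there, then shrink the heap back. Returns 1 if the data was
 * stored intact, 0 if not, and -1 if the heap could not be changed. */
int heap_scratch_demo(const char *src, size_t len)
{
    void *old_top;
    int same;

    assert(src != NULL && len > 0);
    old_top = sbrk((intptr_t)len);      /* returns the previous heap top */
    if (old_top == (void *)-1)
        return -1;
    memcpy(old_top, src, len);          /* the new area is ordinary user memory */
    same = (memcmp(old_top, src, len) == 0);
    if (sbrk(-(intptr_t)len) == (void *)-1)   /* restore the old heap top */
        return -1;
    return same;
}
```

After the call, sbrk(0) reports the same heap top as before, just as the kernel-side technique restores the original break once it has finished with the temporary buffer.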

Recall how we called a user program from the kernel above. There we had to bypass the parameter check. With the brk method there is another way: extend the heap of the current process, copy the system call's parameters into the newly extended user space with put_user(), and pass the new address as the parameter to execve. The parameter check is then no longer an obstacle.

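char *program_path = "/bin/ls";
unsigned long mmm;
int ret, tmp;

/* Locate the current top of the heap. */
mmm = current->mm->brk;
/* Use brk to extend the heap by a new 256-byte buffer. */
ret = brk((void *)(mmm + 256));
/* Copy the parameter that execve needs into the new buffer. put_user()
   moves one simple value at a time, so the string as a whole is copied
   with its multi-byte counterpart copy_to_user(). */
copy_to_user((void *)(mmm + 2), program_path, strlen(program_path) + 1);
/* Run /bin/ls using the path that now lies in user space
   (argv and envp are left NULL in this sketch). */
execve((char *)(mmm + 2), NULL, NULL);
/* Restore the original heap top. */
tmp = brk((void *)mmm);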

This method is not general (in particular, it has side effects) and should be regarded only as a trick. Still, it shows that once you are familiar with the kernel's structure, you can do many unexpected things.

C: Using signals

Inside the kernel, signals are used mainly to notify a user program of a serious error and to kill the current process forcibly, in which case the kernel sends SIGKILL to tell the process to terminate. The kernel sends signals with routines such as send_sig(), and the sender has to know the target process, identified by its process ID (PID), in advance. So to send a signal from the kernel that asynchronously notifies a user process to perform some task, the kernel must already know that process's PID. Searching for the PID of a particular process at run time is laborious and may require traversing the entire list of process control blocks. Signaling a specific user process is therefore a poor method and is generally not used in the kernel. Signals are used in the kernel only to notify the current process (whose PID is available through the current variable) to perform some common operation, such as terminating. For kernel developers this method is of little use.
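The receiving side of a signal, and the need to know the target's PID when sending one, can be illustrated from user space. This sketch uses only the standard POSIX calls sigaction(), kill(), and getpid(); the function name signal_self_demo is hypothetical, and the process signals itself because its own PID is the one PID it always knows.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

/* Handler: only records that the signal arrived. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Install a handler for signo, send signo to our own PID, and report
 * whether the handler ran. Returns 1 or 0, or -1 on error. */
int signal_self_demo(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) != 0)
        return -1;

    got_signal = 0;
    if (kill(getpid(), signo) != 0)   /* the sender must name a PID */
        return -1;
    return got_signal;
}
```

In a single-threaded program, signal_self_demo(SIGUSR1) should report that the handler ran: a signal that a process sends to itself and has not blocked is delivered before kill() returns.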

Message operations are in a similar situation, so we do not discuss them further here.

Summary: information exchange initiated by user-level programs, whether through a standard call or through the driver interface, generally relies on system calls. Cases where the kernel initiates the exchange are rare. They have no standard interface and are very inconvenient to use. In general, therefore, prefer the user-initiated methods described earlier in this article. After all, the kernel is fundamentally designed as a passive service provider for user-level programs, and our own development should follow the same design principle.


References

1 Zhou Mingde, 80386 and Its Programming under Protected Mode, Tsinghua University Press, 1993

2 Robert Love, Linux Kernel Development, Sams Publishing, 2003

3 W. Richard Stevens, Advanced Programming in the UNIX Environment, Addison-Wesley, 1992

4 W. Richard Stevens, UNIX Network Programming, Prentice Hall, 1998

5 Maurice J. Bach, The Design of the UNIX Operating System, Prentice Hall, 1990

6 Linux Device Drivers, O'Reilly

7 Ori Pomerantz, Linux Kernel Module Programming Guide, 1999

Some notes and a summary on data transfer between Linux user space and kernel space:

(1) Leave aside Linux's support for segmented memory mapping. In protected mode, whether the CPU is running in user mode or kernel mode, the addresses that executing code accesses are virtual addresses. The MMU reads the value in control register CR3 as the pointer to the current page directory and, following the paged memory mapping mechanism (see the relevant documentation), translates the virtual address into a real physical address so that the CPU can actually access physical memory.

(2) On 32-bit Linux every process has a 4 GB address space. When a process accesses an address in its virtual memory space, how is it kept apart from the virtual spaces of other processes? Each process has its own page directory (PGD). Linux stores a pointer to it in the memory descriptor that belongs to the process's task_struct, namely in mm->pgd of struct mm_struct. Each time a process is scheduled (schedule()), the Linux kernel loads CR3 with that process's PGD pointer (switch_mm()).

(3) When a new process is created, a new page directory (PGD) is created for it, and the page directory entries for the kernel range are copied from the kernel's page directory swapper_pg_dir to the corresponding positions in the new process's PGD. The call chain is:
do_fork() -> copy_mm() -> mm_init() -> pgd_alloc() -> get_pgd_fast() -> get_pgd_slow() -> memcpy(pgd + USER_PTRS_PER_PGD, swapper_pg_dir + USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t))
As a result, the page directory of every process consists of two parts. The first part is "user space" and maps the process's own space (0x00000000 to 0xbfffffff), that is, 3 GB of virtual addresses. The second part is "system space" and maps 0xc0000000 to 0xffffffff, 1 GB of virtual addresses. The second part is identical in the page directory of every process in a Linux system. From a process's point of view, then, each process has 4 GB of virtual space: the lower 3 GB are its own user space, and the top 1 GB is system space shared with all other processes and the kernel.
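The split of the page directory can be checked with simple arithmetic. The sketch below assumes 32-bit x86 without PAE, where each of the 1024 page directory entries maps 4 MB; the DEMO_ names and helper functions are our own and merely mirror the kernel constants named above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for 32-bit x86 without PAE and a 3 GB / 1 GB split. */
#define DEMO_PGDIR_SHIFT        22            /* each entry maps 4 MB */
#define DEMO_PTRS_PER_PGD       1024u
#define DEMO_PAGE_OFFSET        0xC0000000u
#define DEMO_USER_PTRS_PER_PGD  (DEMO_PAGE_OFFSET >> DEMO_PGDIR_SHIFT)

/* Index of the page directory entry that maps a virtual address. */
unsigned demo_pgd_index(uint32_t vaddr)
{
    return vaddr >> DEMO_PGDIR_SHIFT;
}

/* Returns 1 if the address is mapped by one of the entries that are
 * copied from the kernel's page directory, 0 if it is a user entry. */
int demo_is_kernel_entry(uint32_t vaddr)
{
    unsigned idx = demo_pgd_index(vaddr);

    assert(idx < DEMO_PTRS_PER_PGD);
    return idx >= DEMO_USER_PTRS_PER_PGD;
}
```

Under these assumptions the user part occupies entries 0 to 767 and the kernel part entries 768 to 1023, so the memcpy in the call chain above copies 256 entries.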

(4) Now consider the following scenario:
Process A sets the computer's network host name with the system call sethostname(const char *name, size_t len).
This scenario necessarily involves transferring data from user space to kernel space: name is an address in user space, and the system call must copy the data there to an address in the kernel. Let us look at the process in detail. A system call is made by placing its parameters in the registers ebx, ecx, edx, esi, and edi (at most five parameters; this scenario has two, name and len), storing the system call number in eax, and then executing the interrupt instruction "int $0x80", which brings process A into system space. The CPU privilege level of the process (3) is numerically no greater than the descriptor privilege level of 3 set on the trap gate for system calls, so the process enters system space unimpeded and executes system_call(), the function registered for int $0x80. Because system_call() lies in kernel space, it runs at privilege level 0, and the CPU switches the stack to the kernel stack, that is, the system-space stack of process A. When the kernel creates the task_struct structure for a new process, it allocates two contiguous pages, 8 KB in total, and uses about 1 KB at the bottom for task_struct (for example, #define alloc_task_struct() ((struct task_struct *) __get_free_pages(GFP_KERNEL, 1))). The rest of the memory serves as the system-space stack. When the process switches from user space to system space, the stack pointer esp therefore becomes (alloc_task_struct() + 8192). This is also why system space usually obtains the task_struct address of the current process through the current macro (see its implementation).
Each time the process enters system space from user space, the user-space SS, ESP, EFLAGS, CS, and EIP have already been pushed onto the system stack. system_call() then pushes eax, and SAVE_ALL pushes ES, DS, EAX, EBP, EDI, ESI, EDX, ECX, and EBX in turn. Finally the entry at sys_call_table + 4 * %eax is called, which in this scenario is sys_sethostname().
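The relationship between the 8 KB region, the stack pointer, and the location of task_struct is plain address arithmetic. The following sketch models it with 32-bit integers; the names are hypothetical, and it assumes, as described above, that the two pages form an 8 KB-aligned region with task_struct at its bottom.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_STACK_SIZE 8192u   /* two 4 KB pages per process */

/* Given any stack pointer inside the 8 KB region, clear the low 13 bits
 * to obtain the base of the region, where task_struct is stored. */
uint32_t demo_task_base(uint32_t esp)
{
    return esp & ~(DEMO_STACK_SIZE - 1u);
}

/* Initial kernel stack pointer: the top of the same region. */
uint32_t demo_stack_top(uint32_t task_base)
{
    assert((task_base & (DEMO_STACK_SIZE - 1u)) == 0);   /* 8 KB aligned */
    return task_base + DEMO_STACK_SIZE;
}
```

However deep the kernel stack has grown, masking esp always lands on the same base address, so the current process can be found without any lookup table.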

(5) In sys_sethostname(), after some protection checks, the kernel calls copy_from_user(to, from, n), where to points to system_utsname.nodename in kernel space (for example 0xe625a000) and from points to user space (for example 0x8010fe00). Process A has now entered the kernel and is running in system space, so the MMU translates virtual addresses to physical addresses according to A's PGD, and the data is copied from user space to system space. Before copying, the kernel checks that the user-space address and length are valid. It does not check whether the whole interval of that length starting at the user-space address is actually mapped. If an address in the interval turns out to be unmapped or fails a read/write permission check, it is treated as a bad address and raises a page fault, which the page fault handler deals with. The call sequence is: copy_from_user() -> generic_copy_from_user() -> access_ok() + __copy_user_zeroing().
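From user space, a bad address that is caught during such a copy shows up as the error EFAULT. The sketch below is an illustration under our own assumptions: the function name write_from_unmapped is hypothetical, and it uses write() on a pipe rather than sethostname(), since setting the host name requires privileges.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Ask the kernel to copy len bytes from a user address that is no longer
 * mapped. Returns the errno reported by the kernel, 0 if the write
 * unexpectedly succeeded, or -1 if the setup failed. */
int write_from_unmapped(size_t len)
{
    int fds[2];
    long page = sysconf(_SC_PAGESIZE);
    void *gone;
    int err = 0;

    assert(len > 0 && page > 0 && len <= (size_t)page);
    if (pipe(fds) != 0)
        return -1;

    /* Map one page and unmap it again: 'gone' then lies inside the
     * user address range but has no mapping behind it. */
    gone = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (gone == MAP_FAILED || munmap(gone, (size_t)page) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (write(fds[1], gone, len) < 0)   /* kernel copy from user space fails */
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

The address passes the range check because it lies in user space, but the copy itself faults on the unmapped page, and the system call fails with EFAULT instead of crashing the kernel.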

(6) Summary:
* A process's address space is 0 to 4 GB
* In user mode a process can access only 0 to 3 GB; the 3 GB to 4 GB range is accessible only in kernel mode
* A process enters kernel mode through a system call
* The 3 GB to 4 GB part of the virtual space is the same for every process
* Entering kernel mode from user mode does not change CR3, but it does switch the stack
