Writing Linux Kernel Modules: A Detailed Guide

Last Update:2016-01-07 Source: Internet

Author: User

Kernel programming often looks like black magic, and in Arthur C. Clarke's sense of the word, it probably is. The Linux kernel is very different from user space: casual attitudes have to go, and you must be careful, because a bug in your code affects the entire system. Floating-point arithmetic is not readily available, the stack is fixed and small, and the code you write is always asynchronous, so you need to think about concurrency. Despite all this, the Linux kernel is just a big, complex C program that is open for everyone to read, learn from, and improve, and you can be one of those who do.

Perhaps the simplest way to start kernel programming is to write a kernel module: a piece of code that can be loaded into the kernel dynamically. What modules can do is limited: for example, they can't add or remove fields in common data structures such as process descriptors (LCTT note: doing so could break the entire kernel and the system). In all other respects, however, they are full-fledged kernel-level code, and they can be compiled into the kernel whenever needed (which lifts all the restrictions). It's perfectly possible to develop and compile a module outside the Linux source tree (unsurprisingly, this is called out-of-tree development), which is handy if you just want to experiment and don't intend to submit your changes to the mainline kernel.

In this tutorial, we will develop a simple kernel module that creates a /dev/reverse device. A string written to the device is read back with its words in reverse order ("Hello World" reads back as "World Hello"). This is a popular programmer interview puzzle, and you are likely to earn some bonus points for showing that you can implement it at the kernel level. Before you begin, a word of warning: a bug in your module can crash the system and (unlikely, but possible) cause data loss. Make sure you have backed up your important data before you start or, better still, experiment in a virtual machine.

Avoid root where possible

By default, /dev/reverse is accessible only to root, so you will have to run your test programs with sudo. To get around this, create a /lib/udev/rules.d/99-reverse.rules file that contains the following:

SUBSYSTEM=="misc", KERNEL=="reverse", MODE="0666"

Don't forget to reinsert the module. Letting non-root users access device nodes is generally not a good idea, but it is very useful during development, not to mention that running test binaries as root is not a good idea either.

Structure of the module

Since most Linux kernel modules are written in C (apart from low-level, architecture-specific parts), it is recommended that you keep your module in a single file (for example, reverse.c). We've put the complete source code on GitHub; here we'll look at some fragments of it. To begin, let's include some common headers and describe the module using predefined macros:

Everything here is straightforward, except MODULE_LICENSE(): it is more than just a marker. The kernel strongly favours GPL-compatible code, so if you set the license to something that is not GPL-compatible (e.g. "Proprietary"), certain kernel functions will not be available to your module.
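A minimal sketch of such a preamble might look as follows. The author string and the description text are placeholders chosen for illustration, and the fragment has to be built as part of a kernel module rather than as a standalone program.

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

/* "GPL" keeps GPL-only kernel symbols available to the module. */
MODULE_LICENSE("GPL");
/* Placeholder values: substitute your own name and description. */
MODULE_AUTHOR("Your Name <you@example.com>");
MODULE_DESCRIPTION("In-kernel phrase reverser");
```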

When should I not write kernel modules?

Kernel programming is fun, but writing (and especially debugging) kernel code in a real-world project requires particular skills. In general, you should go down to the kernel level only if there is no other way to solve your problem. In the following cases, you are probably better off staying in user space:

You want to develop a USB driver: have a look at libusb.
You want to develop a file system: try FUSE.
You are extending Netfilter: libnetfilter_queue may help you.

Code inside the kernel typically performs better, but for many projects the performance lost by staying in user space is not critical.

Since kernel programming is always asynchronous, there is no main() function that Linux executes sequentially to run your module. Instead, you provide callback functions for various events, like this:
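A minimal sketch of such callbacks, assuming the names reverse_init() and reverse_exit() used throughout this article. The message strings are illustrative, and the fragment is meant to be built with the kernel build system, not run on its own.

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

/* Called once when the module is inserted; returns 0 on success. */
static int __init reverse_init(void)
{
    printk(KERN_INFO "reverse device has been registered\n");
    return 0;
}

/* Called once when the module is removed. */
static void __exit reverse_exit(void)
{
    printk(KERN_INFO "reverse device has been unregistered\n");
}

/* Tell the kernel which functions are the life-cycle callbacks. */
module_init(reverse_init);
module_exit(reverse_exit);
```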

Here, we define the functions that are called when the module is inserted and removed. Only the first one is required. For now, they just print a message to the kernel ring buffer (accessible from user space through the dmesg command); KERN_INFO is the log level (note that no comma follows it). __init and __exit are attributes: pieces of metadata attached to a function (or variable). Attributes are rare in user-space C code but quite common in the kernel. Everything marked __init is freed for reuse after initialization (remember the old "Freeing unused kernel memory..." message?). __exit indicates that the function can safely be dropped when the code is built statically into the kernel, where no cleanup is ever needed. Finally, the two macros module_init() and module_exit() set reverse_init() and reverse_exit() as the life-cycle callbacks of our module. The actual function names don't matter: you could call them init() and exit(), or start() and stop(), if you wish. They are declared static and are therefore invisible outside the module. In fact, no function in the kernel is visible to other code unless it is explicitly exported. Nevertheless, prefixing your functions with the module name is a common convention among kernel programmers.

These are the basics; let's do something more interesting. Modules can accept parameters, like this:

# modprobe foo bar=1

The modinfo command shows all the parameters a module accepts, and they are also available as files under /sys/module//parameters. Our module needs a buffer to store the strings written to it, so let's make its size user-configurable. Add the following three lines under MODULE_DESCRIPTION():
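One way these three lines could look is sketched below. The default of 8192 bytes and the description string are assumptions made for illustration; module_param() and MODULE_PARM_DESC() are the standard kernel macros for declaring and describing a parameter.

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

/* Assumed default size; override with "insmod reverse.ko buffer_size=N". */
static unsigned long buffer_size = 8192;
/* Mode 0444 exposes the value read-only to everyone through sysfs. */
module_param(buffer_size, ulong, 0444);
MODULE_PARM_DESC(buffer_size, "Internal buffer size");
```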

Here, we define a variable to hold the value, wrap it into a parameter, and make it readable by everyone through sysfs. The parameter's description (the last line) appears in the output of modinfo.

Since the user can set buffer_size directly, we need to sanitize it in reverse_init(). You should always check data that comes from outside the kernel; if you don't, you open yourself up to kernel oopses or even security holes.

A non-zero return value from the module initialization function means that the module failed to load.

Navigation

When you develop modules, the Linux kernel itself is the source of everything you need. However, it is quite big, and you may have trouble finding what you are after. Luckily, there are tools that make navigating such a huge code base easier. First of all, there is cscope, a venerable tool that runs in a terminal. Simply run make cscope && cscope in the top-level directory of the kernel sources. Cscope integrates well with Vim and Emacs, so you can use it from your favourite editor.

If terminal-based tools aren't your cup of tea, visit http://lxr.free-electrons.com. It is a web-based kernel navigation tool with fewer features than cscope (for example, you cannot easily find the usages of a function), but it still provides enough for quick lookups.

Now it's time to compile the module. You need the headers for the kernel version you are running (linux-headers or an equivalent package) and build-essential (or a similar package). Next, create a standard Makefile template:

Now run make to build your first module. If you typed everything correctly, you will find reverse.ko in the current directory. Insert the module with sudo insmod reverse.ko, and run the following command:

Congratulations! However, the message the module has just logged is only an illusion for now: there is no device node yet. Let's fix that.

Miscellaneous devices

In Linux, there is a special type of character device called a "miscellaneous" (or simply "misc") device. It is designed for small device drivers with a single entry point, which is exactly what we need. All misc devices share the same major number (10), so a single driver (drivers/char/misc.c) can look after all of them, and they are told apart by their minor numbers. In every other respect, they are just normal character devices.

To register a minor number (and an entry point) for the device, you declare a struct miscdevice, fill in its fields (note the syntax), and call misc_register() with a pointer to this structure. For this to work, you also need to include the linux/miscdevice.h header file:

Here, we request the first available (dynamic) minor number for a device named "reverse"; the ellipsis stands for omitted code that we have already seen. Don't forget to unregister the device when the module is removed.

The 'fops' field stores a pointer to a struct file_operations (declared in linux/fs.h), which is the entry point to our module. reverse_fops is defined as follows:

Here, reverse_fops contains a set of callback functions (also called methods) that are executed when user-space code opens the device, reads from it, writes to it, or closes the file descriptor. If you omit any of these callbacks, a sensible default is used instead. That's why we explicitly set llseek to noop_llseek(), which (as the name implies) does nothing. The default implementation changes the file pointer, and we don't want our device to be seekable for now (making it seekable is your homework for today).
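Registration, deregistration, and the method table described above can be sketched together as follows. The variable name reverse_misc_device is an assumption, and only the llseek method is filled in here; the open, read, write, and release methods would be added to reverse_fops in the same way. The fragment must be built as a kernel module.

```c
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/miscdevice.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

/* Method table: callbacks left unset fall back to kernel defaults. */
static const struct file_operations reverse_fops = {
    .owner  = THIS_MODULE,
    .llseek = noop_llseek,          /* the device is not seekable */
};

static struct miscdevice reverse_misc_device = {
    .minor = MISC_DYNAMIC_MINOR,    /* first available minor number */
    .name  = "reverse",             /* node appears as /dev/reverse */
    .fops  = &reverse_fops,
};

static int __init reverse_init(void)
{
    /* misc_register() returns 0 on success or a negative errno. */
    return misc_register(&reverse_misc_device);
}

static void __exit reverse_exit(void)
{
    misc_deregister(&reverse_misc_device);
}

module_init(reverse_init);
module_exit(reverse_exit);
```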

Close and open

Let's implement the methods. We will allocate a new buffer for each opened file descriptor and free it on close. This is actually unsafe: if a user-space application leaks descriptors (perhaps intentionally), it can hog RAM and render the system unusable. In the real world you should always think about such possibilities, but for this tutorial the approach is acceptable.

We need a structure to describe the buffer. The kernel provides many generic data structures: linked lists (which are doubly linked), hash tables, trees, and so on. Buffers, however, are usually implemented from scratch. We will call ours "struct buffer":

data is a pointer to the string stored in the buffer, and end points to the first byte after the end of the string. read_ptr is where read() should start reading. The buffer size is stored for completeness; we don't use this field yet. You shouldn't assume that users of your structure will initialize all of these fields correctly, so it is better to encapsulate buffer allocation and deallocation in functions. They are usually named buffer_alloc() and buffer_free().

Kernel memory is allocated with kmalloc() and freed with kfree(); the kzalloc() flavour also sets the memory to all zeroes. Unlike the standard malloc(), its kernel counterpart takes a second argument: flags specifying the type of memory requested. Here, GFP_KERNEL says that we need normal kernel memory (not in the DMA or high-memory zones) and that the function may sleep (reschedule the process) if needed. sizeof(*buf) is a common way to get the size of a structure accessed through a pointer.

You should always check the return value of kmalloc(): dereferencing a NULL pointer results in a kernel oops. Also note the use of the unlikely() macro. It (and its counterpart, likely()) is widely used in the kernel to indicate that a condition is almost always false (or true, respectively). It doesn't affect control flow, but it helps modern processors boost performance through branch prediction.

Finally, note the gotos. They are often considered evil, but the Linux kernel (and some other system software) uses them to implement centralized function exits. The result is less deeply nested, more readable code, much like try-catch blocks in higher-level languages.
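The allocation pattern described above (zeroed memory, checked return values, unlikely(), and goto-based cleanup) can be tried out in user space. The sketch below is a stand-in rather than the module's code: calloc() and free() play the roles of kzalloc(..., GFP_KERNEL) and kfree(), and the field layout follows the description of "struct buffer" given earlier.

```c
#include <stdlib.h>

/* Branch hint in the style of the kernel macro (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

struct buffer {
    char *data;          /* start of the stored string */
    char *end;           /* first byte after the string */
    char *read_ptr;      /* where the next read starts */
    unsigned long size;  /* capacity of data in bytes */
};

/* Allocates a zeroed, empty buffer; returns NULL on failure. */
static struct buffer *buffer_alloc(unsigned long size)
{
    struct buffer *buf;

    buf = calloc(1, sizeof(*buf));
    if (unlikely(!buf))
        goto out;

    buf->data = calloc(1, size);
    if (unlikely(!buf->data))
        goto out_free;

    buf->size = size;
    buf->end = buf->data;        /* empty: nothing stored yet */
    buf->read_ptr = buf->data;
    return buf;

out_free:
    free(buf);                   /* undo the first allocation */
out:
    return NULL;
}

static void buffer_free(struct buffer *buf)
{
    free(buf->data);
    free(buf);
}
```

Each failure point jumps to a label that undoes only what has been set up so far, so the function has a single, easy-to-audit cleanup path.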

With buffer_alloc() and buffer_free() in place, the open and close methods become very simple.

struct file is a standard kernel data structure that stores information about an open file, such as the current file position (file->f_pos), flags (file->f_flags), or open mode (file->f_mode). Another field, file->private_data, is used to associate the file with arbitrary data. Its type is void *, and it is opaque to the kernel outside the file's owner. We store a buffer there.

If the buffer allocation fails, we signal this to the calling user-space code by returning a negative value (-ENOMEM). The C library (such as glibc) that made the open(2) system call detects this and sets errno appropriately.
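A sketch of what the two methods might look like, assuming buffer_alloc(), buffer_free(), and the buffer_size parameter are defined elsewhere in the module as described above. The names reverse_open() and reverse_close() are assumptions; they would be hooked up as the .open and .release members of reverse_fops.

```c
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/kernel.h>

/* Assumed to be defined elsewhere in the module, as described in the text. */
struct buffer;
static unsigned long buffer_size;
static struct buffer *buffer_alloc(unsigned long size);
static void buffer_free(struct buffer *buf);

static int reverse_open(struct inode *inode, struct file *file)
{
    struct buffer *buf = buffer_alloc(buffer_size);

    if (unlikely(!buf))
        return -ENOMEM;          /* user space sees errno == ENOMEM */

    file->private_data = buf;    /* one buffer per open descriptor */
    return 0;
}

static int reverse_close(struct inode *inode, struct file *file)
{
    buffer_free(file->private_data);
    return 0;
}
```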

Learn how to read and write

The "read" and "write" methods are where the real work is done. When data is written to the buffer, we drop its previous contents and reverse the phrase in place, without any temporary storage. The read method simply copies the data from the kernel buffer to user space. But what should reverse_read() do if the buffer holds no data yet? In user space, the read() call would block until data is available. In the kernel, you must wait. Luckily, there is a mechanism for this, called 'wait queues'.

The idea is simple. If the current process needs to wait for some event, its descriptor (a struct task_struct, available as 'current') is put into a non-runnable (sleeping) state and added to a queue. Then schedule() is called to select another process to run. The code that generates the event uses the queue to wake up the waiters by putting them back into the TASK_RUNNING state. The scheduler will select one of them some time later. Linux has several non-runnable states, most notably TASK_INTERRUPTIBLE (a sleep that can be interrupted by a signal) and TASK_KILLABLE (a sleep in which the process can be killed). All of this has to be handled correctly, and wait queues do it for you.

A natural place to store the head of our read wait queue is struct buffer, so start by adding a wait_queue_head_t read_queue field to it. You should also include the linux/sched.h header file. A wait queue head can be declared statically with the DECLARE_WAIT_QUEUE_HEAD() macro. In our case, dynamic initialization is needed, so add the following line to buffer_alloc():

We wait for data to become available, that is, for the condition read_ptr != end to become true. We also want the wait to be interruptible (say, by Ctrl+C). So the "read" method should start like this:

We loop until data is available, and use wait_event_interruptible() (it is a macro, not a function, which is why the queue is passed by value) to wait if it isn't. If wait_event_interruptible() is interrupted, it returns a non-zero value, which we translate to -ERESTARTSYS. This code means that the system call should be restarted. The file->f_flags check accounts for files opened in non-blocking mode: if there is no data, we return -EAGAIN.

We can't use if() instead of while(), since many processes may be waiting for the data. When the write method wakes them up, the scheduler picks the one to run in an unpredictable way, so by the time this code gets its chance to execute, the buffer may be empty again. Now we need to copy the data from buf->data to user space. The copy_to_user() kernel function does just that:

The call can fail if the user-space pointer is wrong; if that happens, we return -EFAULT. Remember: don't trust anything that comes from outside the kernel!

A little arithmetic is needed so that the data can be read in arbitrary chunks. The method returns the number of bytes read, or an error code.
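The chunking arithmetic can be illustrated with a user-space sketch. The helper name buffer_read_chunk() is hypothetical, and memcpy() stands in for copy_to_user(); the point is only how the request is clamped to the unread bytes and how read_ptr advances.

```c
#include <stddef.h>
#include <string.h>

struct buffer {
    char *data;          /* start of the stored string */
    char *end;           /* first byte after the string */
    char *read_ptr;      /* where the next read starts */
    unsigned long size;  /* capacity of data in bytes */
};

/* Copies at most `size` unread bytes to `out` and advances read_ptr.
 * Returns the number of bytes copied (0 when nothing is left). */
static long buffer_read_chunk(struct buffer *buf, char *out, size_t size)
{
    size_t avail = (size_t)(buf->end - buf->read_ptr);

    if (size > avail)
        size = avail;            /* never read past the stored string */
    memcpy(out, buf->read_ptr, size);
    buf->read_ptr += size;       /* the next call continues from here */
    return (long)size;
}
```

Because read_ptr is advanced by exactly the number of bytes handed out, a caller can drain the buffer in pieces of any size.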

The write method is shorter. First, we check that the buffer has enough space, then we use the copy_from_user() function to fetch the data. Then the read_ptr and end pointers are reset, and the contents of the buffer are reversed:

Here, reverse_phrase() does all the heavy lifting. It relies on the reverse_word() function, which is quite short and is marked inline. This is another common optimization; however, you shouldn't overuse it, since aggressive inlining makes the kernel image unnecessarily large.
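One possible implementation of the two helpers is sketched below; it is plain C and works equally well in user space. It uses the classic in-place technique: reverse every word, then reverse the whole phrase. The signatures (half-open ranges [start, end)) are an assumption and may differ from the article's source code.

```c
#include <stddef.h>
#include <string.h>

/* Reverses the bytes in [start, end) in place. */
static inline void reverse_word(char *start, char *end)
{
    while (end - start > 1) {
        char tmp = *start;

        end--;
        *start = *end;
        *end = tmp;
        start++;
    }
}

/* Reverses the order of space-separated words in [start, end) in place. */
static void reverse_phrase(char *start, char *end)
{
    char *word_start = start;
    char *space;

    /* Step 1: reverse each word individually. */
    while ((space = memchr(word_start, ' ',
                           (size_t)(end - word_start))) != NULL) {
        reverse_word(word_start, space);
        word_start = space + 1;
    }
    reverse_word(word_start, end);   /* the last word */

    /* Step 2: reverse the whole phrase, restoring each word's spelling. */
    reverse_word(start, end);
}
```

No temporary storage beyond a single char is needed, which matches the "in place" requirement stated earlier.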

Finally, we need to wake up the processes waiting for data on read_queue, as described earlier. wake_up_interruptible() does just that:
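Put together, the write method might look roughly like the sketch below. It assumes a reverse_phrase() helper that reverses the words in the half-open range [start, end) and is defined elsewhere in the module; the choice of -EFBIG for oversized writes is also an assumption. No locking is shown here.

```c
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

struct buffer {
    wait_queue_head_t read_queue;   /* readers sleep here */
    char *data, *end, *read_ptr;
    unsigned long size;
};

/* Assumed helper, defined elsewhere in the module. */
static void reverse_phrase(char *start, char *end);

static ssize_t reverse_write(struct file *file, const char __user *in,
                             size_t size, loff_t *off)
{
    struct buffer *buf = file->private_data;

    if (size > buf->size)
        return -EFBIG;              /* assumed error code */
    if (copy_from_user(buf->data, in, size))
        return -EFAULT;             /* bad user-space pointer */

    /* Drop the old contents and reverse the new phrase in place. */
    buf->end = buf->data + size;
    buf->read_ptr = buf->data;
    reverse_phrase(buf->data, buf->end);

    /* Data is ready: wake the processes sleeping on the read queue. */
    wake_up_interruptible(&buf->read_queue);
    return size;
}
```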

Phew! You now have a kernel module that at least compiles successfully. Now it's time to test it.

Debugging Kernel Code

Perhaps the most common debugging method in the kernel is printing. You can use plain printk() (presumably with the KERN_DEBUG log level) if you wish. However, there are better ways. Use pr_debug() or, if you are writing a device driver that has its own "struct device", dev_dbg(): they support the dynamic debug (dyndbg) feature and can be enabled or disabled on request (see Documentation/dynamic-debug-howto.txt). For pure development messages, use pr_devel(), which does nothing unless DEBUG is defined. To enable DEBUG for our module, add the following line to the Makefile:

With that done, use dmesg to view the debug messages generated by pr_debug() or pr_devel(). Alternatively, you can send debug messages directly to the console. To do so, either set the console_loglevel kernel variable to 8 or greater (echo 8 > /proc/sys/kernel/printk), or temporarily print the debug message in question at a high log level such as KERN_ERR. Naturally, you should remove debug statements of this kind before publishing your code.

Note that kernel messages appear on the console, not in a terminal emulator window such as xterm; this is the reason you will find recommendations not to do kernel development in the X environment.

Surprise, surprise!

Compile the module and load it into the kernel:

Everything seems to be in place. Now, to test whether the module works as expected, we'll write a small program that reverses its first command-line argument. main() (without error checking) may look like this:
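The core of such a test program can be sketched as a small helper. The name reverse_roundtrip() is hypothetical and, unlike the article's main(), this version does check for errors. It uses only standard POSIX calls: open(2), write(2), read(2), and close(2).

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes `phrase` to the device at `path`, then reads the result back
 * into `out` as a NUL-terminated string.
 * Returns 0 on success, -1 on any error. */
static int reverse_roundtrip(const char *path, const char *phrase,
                             char *out, size_t out_size)
{
    size_t len = strlen(phrase);
    ssize_t n;
    int fd;

    if (len >= out_size)
        return -1;               /* no room for the result and its NUL */

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (write(fd, phrase, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    n = read(fd, out, len);
    close(fd);
    if (n < 0)
        return -1;

    out[n] = '\0';
    return 0;
}
```

A main() built on it would call reverse_roundtrip("/dev/reverse", argv[1], buf, sizeof(buf)) and print the contents of buf on success.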

Run it like this:

Now let's make things a little more fun. We will create two processes that share a file descriptor (and hence the kernel buffer). One will continuously write strings to the device, while the other reads them. The example below uses the fork(2) system call, but pthreads would work just as well. I have also omitted the code that opens and closes the device, as well as the error checks (again):

What do you expect this program to output? Here is what I got on my laptop:

Read:dog lazy The over jumped Fox Brown Quick A
Read:a Kcicq Brown Fox jumped over the lazy dog
Read:a Kciuq Nworb XOR jumped Fox Brown Quick A
Read:a Kciuq Nworb XOR jumped Fox Brown Quick A
...

What's going on here? It's a race. We assumed that read and write were atomic, that is, executed in one go from beginning to end. However, the kernel is preemptive, and it rescheduled the process running the kernel part of the write operation somewhere inside the reverse_phrase() function. If the process that does the read() is scheduled before the writer gets a chance to finish, it sees the data in an inconsistent state. Such bugs are really hard to track down. But how do we fix this one?

Basically, we need to make sure that no read method can run until the write method returns. If you have ever written a multithreaded application, you have probably seen synchronization primitives (locks) such as mutexes or semaphores. Linux has them as well, but there are nuances. Kernel code can run in process context (working "on behalf of" user-space code, as our methods do) and in interrupt context (for example, in an IRQ handler). If you are in process context and the lock you need is already taken, you simply sleep and retry until you succeed. You can't sleep in interrupt context, so the code spins in a loop until the lock becomes available. The corresponding primitive is called a spinlock, but in our case a simple mutex, an object that only one process can "hold" at any given time, is sufficient. For performance reasons, real code might also use a read-write semaphore.

Locks always protect some data (in our case, a "struct buffer" instance), and it is very common to embed them in the structure they protect. So we add a mutex ('struct mutex lock') to "struct buffer". We must also initialize the mutex with mutex_init(); buffer_alloc() is a good place for that. Code that uses mutexes must also include linux/mutex.h.

A mutex is much like a traffic light: it is useless unless drivers look at it and obey the signals. So we need to update reverse_read() and reverse_write() to acquire the mutex before doing anything with the buffer and to release it when they are done. Let's look at the read method; write works in just the same way:

We acquire the lock at the very beginning of the function. mutex_lock_interruptible() either grabs the mutex and returns, or puts the process to sleep until the mutex becomes available. As before, the _interruptible suffix means that the sleep can be interrupted by a signal.

Here is our "wait for data" loop. You should never sleep while holding a mutex, or a situation called "deadlock" may occur. So, if there is no data, we release the mutex and call wait_event_interruptible(). When it returns, we reacquire the mutex and continue as usual:

Finally, the mutex is unlocked when the function ends or if an error occurs while the mutex is held. Recompile the module (don't forget to reload it) and run the test again. You should see no corrupted data this time.
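The whole locked read method could be sketched as follows. The field names follow the text; the exact layout of struct buffer is an assumption, as is the premise that buffer_alloc() has already called init_waitqueue_head() and mutex_init() on the two synchronization members. The fragment is meant to be built as part of the module.

```c
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

struct buffer {
    wait_queue_head_t read_queue;   /* readers sleep here */
    struct mutex lock;              /* protects the fields below */
    char *data, *end, *read_ptr;
    unsigned long size;
};

static ssize_t reverse_read(struct file *file, char __user *out,
                            size_t size, loff_t *off)
{
    struct buffer *buf = file->private_data;
    ssize_t result;

    if (mutex_lock_interruptible(&buf->lock))
        return -ERESTARTSYS;

    while (buf->read_ptr == buf->end) {
        /* Never sleep while holding the mutex. */
        mutex_unlock(&buf->lock);
        if (file->f_flags & O_NONBLOCK)
            return -EAGAIN;
        if (wait_event_interruptible(buf->read_queue,
                                     buf->read_ptr != buf->end))
            return -ERESTARTSYS;
        if (mutex_lock_interruptible(&buf->lock))
            return -ERESTARTSYS;
    }

    /* Clamp the request to the unread part of the buffer. */
    size = min_t(size_t, size, buf->end - buf->read_ptr);
    if (copy_to_user(out, buf->read_ptr, size)) {
        result = -EFAULT;
        goto out_unlock;
    }
    buf->read_ptr += size;
    result = size;

out_unlock:
    mutex_unlock(&buf->lock);
    return result;
}
```

Every early return in the loop happens at a point where the mutex is not held, so the only path that needs the unlock label is the one taken after the data has been found.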

What's next?

Now you have had a taste of kernel hacking. We have only scratched the surface of the topic, and there is much more to explore. Our first module was intentionally simple, but the concepts you have learned stay the same in more complex scenarios. Concurrency, method tables, registering callbacks, putting processes to sleep, and waking them up are things every kernel hacker should be comfortable with, and now you have seen all of them in action. Perhaps one day your kernel code will end up in the mainline Linux source tree; if so, please contact us!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
