"Reading Notes", "Linux kernel design and implementation" process management and scheduling


At university I worked on embedded projects with my advisor and wrote an I²C device driver, but that was the extent of my Linux kernel knowledge. Many of the Android vulnerabilities that lead to root live in the kernel, and studying them is interesting, but when reading the analysis articles I always felt separated by a layer of paper I couldn't quite punch through. So I decided to study the Linux kernel systematically, and bought two books: "Linux Kernel Design and Implementation (3rd Edition)" and "Understanding the Linux Kernel (3rd Edition)".

0x00 some crap
    • Object-oriented thinking.

Although the Linux kernel is written in C and assembly, neither of which is an object-oriented language, it contains a great deal of object-oriented design. For example, you can think of a process in the kernel as an object: it has variables representing the process's state, and an ops structure of function pointers representing all the operations that can be performed on the process.
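As a rough illustration of that style (the names below are made up for this example, not taken from the kernel), state lives in a struct and behavior in a table of function pointers:

#include <stdio.h>

struct thing;

struct thing_ops {                      /* the "methods" of the object */
    void (*start)(struct thing *t);
    void (*stop)(struct thing *t);
};

struct thing {                          /* the "object": state plus an ops pointer */
    int state;
    const struct thing_ops *ops;
};

static void my_start(struct thing *t) { t->state = 1; puts("started"); }
static void my_stop(struct thing *t)  { t->state = 0; puts("stopped"); }

static const struct thing_ops my_ops = { .start = my_start, .stop = my_stop };

int main(void)
{
    struct thing t = { 0, &my_ops };
    t.ops->start(&t);                   /* dispatch through the pointer table */
    t.ops->stop(&t);
    return 0;
}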

    • Learning principles, ideas, and frameworks.

The Linux kernel code base is huge; the task_struct structure that represents a process alone is about 1.7 KB (kernel 2.6, 32-bit). If we fuss over every member variable it is easy to get lost in the details, so learning the principles and ideas matters more; once you can extrapolate from them, analyzing the kernel source becomes much easier.

0x01 What does a process look like in the Linux kernel?

The kernel stores processes in a circular doubly linked list called the task list. Each element of the list is a structure of type task_struct, called the process descriptor.

Processes in the kernel are also called tasks.

Note that the data contained in the process descriptor can completely describe an executing program: the files it has open, the process's address space, its pending signals, the process's state, and much more.

In kernels before 2.6, task_struct was stored at the end of each process's kernel stack, so that on architectures with few registers, such as x86, its location could be computed cheaply from the stack pointer. In the 2.6 kernel, however, task_struct is allocated dynamically by the slab allocator, so the end of the kernel stack now holds only a small structure called thread_info, whose task member points to the task_struct.

In the file (f:\linux-2.6.32.67\arch\x86\include\asm\thread_info.h) you can see the definition of the thread_info structure for the x86 architecture:

struct thread_info {
    struct task_struct    *task;           /* main task structure */
    struct exec_domain    *exec_domain;    /* execution domain */
    __u32                 flags;           /* low level flags */
    __u32                 status;          /* thread synchronous flags */
    __u32                 cpu;             /* current CPU */
    int                   preempt_count;   /* 0 => preemptable, <0 => BUG */
    mm_segment_t          addr_limit;
    struct restart_block  restart_block;
    void __user           *sysenter_return;
#ifdef CONFIG_X86_32
    unsigned long         previous_esp;    /* ESP of the previous stack in
                                              case of nested (IRQ) stacks */
    __u8                  supervisor_stack[0];
#endif
    int                   uaccess_err;
};

In the kernel, most of the code that handles processes works directly on the task_struct structure, so locating task_struct quickly is important; it directly affects the speed of the operating system. On register-rich (RISC) processors such as PowerPC, the address of the current task_struct is kept directly in a register. On x86, only the thread_info structure can be placed at the end of the kernel stack, and task_struct is found indirectly by computing an offset: masking out the 13 least significant bits of the stack pointer yields the address of thread_info. The operation is done by the current_thread_info() function, which compiles to the following:

movl $-8192, %eax
andl %esp, %eax

This assumes a stack size of 8 KB. Finally, the task_struct address is obtained by dereferencing the task member of thread_info:

current_thread_info()->task;
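For comparison, the C counterpart in the same header does the same masking, roughly as follows (THREAD_SIZE is 8192 here, and current_stack_pointer is a register variable bound to %esp in that header):

static inline struct thread_info *current_thread_info(void)
{
    return (struct thread_info *)
        (current_stack_pointer & ~(THREAD_SIZE - 1));
}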

The state field of the process descriptor describes the current state of the process. Every process in the system is in exactly one of the following five states:

    • TASK_RUNNING (running): the process is runnable; it is either currently executing or sitting in a run queue waiting to execute.

    • TASK_INTERRUPTIBLE (interruptible): the process is sleeping (that is, blocked), waiting for some condition to come true.

    • TASK_UNINTERRUPTIBLE (uninterruptible): identical to TASK_INTERRUPTIBLE except that receiving a signal will not wake the process.

    • __TASK_TRACED: the process is being traced by another process, for example by a debugger via ptrace().

    • __TASK_STOPPED (stopped): process execution has stopped. This occurs when the process receives a signal such as SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU.

The kernel can set a process's state with the set_task_state(task, state) function:

set_task_state(task, state);    /* set task 'task' to state 'state' */

About the process context

Executable code is an important part of a process. The code is read in from an executable file and executed within the process's address space. Normally a program executes in user space; when it makes a system call or triggers an exception, it enters kernel space. At that point we say the kernel is "executing on behalf of the process" and is in process context. When the kernel exits, the program resumes execution in user space, unless a higher-priority process has become runnable in the meantime, in which case the scheduler steps in and runs it instead.

The process descriptor also stores the relationships between processes. Each task_struct contains a parent pointer to the task_struct of the parent process, and a list of child processes called children. So, for the current process, you can obtain the process descriptor of its parent with the following code:

struct task_struct *my_parent = current->parent;

You can also iterate over its child processes as follows:

struct task_struct *task;
struct list_head *list;

list_for_each(list, &current->children) {
    task = list_entry(list, struct task_struct, sibling);
    /* task now points to one of current's children */
}

The process descriptor of the init process is statically allocated, as init_task. The following code nicely demonstrates the relationship among all processes:

struct task_struct *task;

for (task = current; task != &init_task; task = task->parent)
    ;
/* task now points to init */

In fact, starting from any process in the system, you can trace your way to any other. Often, though, you simply want to iterate over all processes; because the task list is a circular doubly linked list, that is easy to do.

To get the next process in the list:

list_entry(task->tasks.next, struct task_struct, tasks)

To get the previous process:

list_entry(task->tasks.prev, struct task_struct, tasks)

These two operations are implemented by the next_task(task) and prev_task(task) macros, respectively. In practice, the for_each_process(task) macro provides a way to iterate over the entire task list. On each iteration, task points to the next element of the list:

struct task_struct *task;

for_each_process(task) {
    /* this pointlessly prints the name and PID of each task */
    printk("%s[%d]\n", task->comm, task->pid);
}
0x02 What does fork() do?

Next, let's look at how the kernel creates and terminates processes.

A process comes into being by creating a new address space, reading in an executable file, and beginning execution. Linux splits this into two functions: fork() and exec(). First, fork() creates a child process by copying the current process; the child differs from the parent only in its PID, its PPID, and certain resources and statistics (for example pending signals, which are not inherited). Then exec() reads an executable file and loads it into the address space for execution.

A notable feature of Linux's fork() is copy-on-write. That is, the new process created by fork() does not immediately copy the parent's address space; data is copied only at the moment it needs to be written. In other cases, such as calling exec() immediately after fork() returns, the parent's resources never need to be copied at all, which makes fork() very efficient. Linux prides itself on fast process creation and execution, and this feature matters a great deal for that.
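A minimal user-space sketch of the fork()+exec() pattern just described (running /bin/ls here is an arbitrary choice for the example):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();             /* copy-on-write duplicate of this process */

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* child: immediately replace the address space, so the
           copy-on-write pages are never actually copied */
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("execl");            /* reached only if exec fails */
        _exit(127);
    }
    waitpid(pid, NULL, 0);          /* parent: wait for the child */
    return 0;
}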

fork() invokes clone() with the appropriate flags, and clone() in turn calls do_fork().

do_fork() does the bulk of the work of process creation; it is defined in the kernel/fork.c file. The function calls copy_process() and then starts the new process running. The copy_process() function mainly does the following:

    1. Calls dup_task_struct() to create a kernel stack, a thread_info structure, and a task_struct structure for the new process. At this point the child's and the parent's descriptors are identical.

    2. Checks that the number of processes does not exceed the system's limits.

    3. The child now begins to differentiate itself from its parent. Many members of the process descriptor are cleared to 0 or set to initial values; these are the descriptor members that are not inherited, mainly statistical information. Most of the data in task_struct remains unchanged.

    4. The child's state is set to TASK_UNINTERRUPTIBLE to ensure that it does not run yet.

    5. copy_process() calls copy_flags() to update the flags member of task_struct. The PF_SUPERPRIV flag, which indicates whether the process has superuser privileges, is cleared; the PF_FORKNOEXEC flag, which indicates a process that has not yet called exec(), is set.

    6. Calls alloc_pid() to assign a valid PID to the new process.

    7. Depending on the flags passed to clone(), copy_process() either duplicates or shares open files, filesystem information, signal handlers, the process address space, and the namespace.

    8. Finally, copy_process() cleans up and returns a pointer to the new child.

Back in do_fork(): if copy_process() returns successfully, the newly created child is woken up and set running. The kernel deliberately runs the child first, because in the common case the child calls exec() immediately; running the parent first might write to the address space and trigger copy-on-write overhead that running the child first avoids.

When a process terminates, the kernel must release the resources it holds and notify its parent, chiefly so that the parent can collect information about the exited child. Generally, a process's exit is caused by an explicit or implicit call to exit(). Termination is handled by do_exit(), defined in the kernel/exit.c file, which mainly does the following:

    1. Sets the PF_EXITING flag in the flags member of task_struct.

    2. Calls del_timer_sync() to remove any kernel timers. On return, it is guaranteed that no timer is queued and no timer handler is running.

    3. If BSD process accounting is enabled, do_exit() calls acct_update_integrals() to write out accounting information.

    4. Then calls exit_mm() to release the mm_struct held by the process; if no other process is using it (that is, the address space is not shared), it is freed completely.

    5. Next calls sem_exit(): if the process is queued waiting on an IPC semaphore, it is removed from the queue here.

    6. Calls exit_files() and exit_fs() to decrement the reference counts of the file descriptors and the filesystem data, respectively. If either reference count drops to 0, the object is freed.

    7. Sets the exit_code member of task_struct to the exit code supplied to exit(), for the parent to retrieve and query.

    8. Calls exit_notify() to send signals to the parent, to find an adoptive parent for the task's children (another member of its thread group, or the init process), and to set the process's exit state (the exit_state member of task_struct) to EXIT_ZOMBIE.

    9. do_exit() calls schedule() to switch to a new process. Because a process in the EXIT_ZOMBIE state is never scheduled again, this is the last code the process ever executes; do_exit() never returns.

At this point, all resources associated with the process have been freed. It holds only its kernel stack, its thread_info structure, and its task_struct structure; the sole purpose of its continued existence is to provide information to its parent. Once the parent retrieves the information, or notifies the kernel that the information is irrelevant, the memory still held by the process is returned to the system.

The parent receives the exit notification of a child through the wait() family of functions, which all come down to the wait4() system call. Deleting the process descriptor is done separately, via release_task(), which the parent (or, for orphans that were reparented as described above, the adoptive parent) invokes to remove the descriptor.
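A small user-space sketch of a parent collecting a child's exit code through the wait family (ultimately the wait4() system call mentioned above):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0)
        exit(7);                    /* child: this value lands in exit_code */

    int status;
    waitpid(pid, &status, 0);       /* reap the zombie; the kernel can now
                                       release its descriptor */
    if (WIFEXITED(status))
        printf("child %d exited with code %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}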

0x03 Threads and processes, as seen from the Linux kernel

Including "Unix Programming Art" and some predecessors have said that Linux do not use the thread. Because the process under Linux differs from systems such as windows, it is fairly lightweight, and the multi-process system does not affect the entire system when a process hangs, and the master process can restart the worker process. In a multithreaded system, a thread hangs out of the entire system and crashes. Now there is a chance to look at the Linux thread implementation from the kernel perspective. The result is surprisingly, the Linux process is not different from the thread, including the structure is task_struct to describe. Threads are only considered a process that shares certain resources with other processes, and in the kernel's view, there is no concept of threading. ;(

0x04 The evolution of process scheduling in the Linux kernel

A processor can be occupied by only one process at a time, so the reason we need process scheduling is simple. If we were asked to design a scheduling algorithm, the easiest one to think of and implement is to walk the whole process list, giving each process a fixed execution period, say 10 ms, in rotation. This is the most primitive scheduling algorithm: round-robin time slicing. The scheduler before the Linux 2.5 kernel was roughly this primitive, and it coped poorly with large numbers of processes and with multiprocessors.
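A toy user-space illustration of that primitive round-robin idea (not kernel code; the task names and the 10 ms quantum are just the example from above):

#include <stdio.h>

struct task {
    const char *name;
    struct task *next;              /* circular list, like the kernel's task list */
};

static void schedule_round_robin(struct task *current, int slots)
{
    const int quantum_ms = 10;      /* everyone gets the same fixed slice */

    while (slots-- > 0) {
        printf("running %s for %d ms\n", current->name, quantum_ms);
        current = current->next;    /* preempt and move on to the next task */
    }
}

int main(void)
{
    struct task c = { "editor",  NULL };
    struct task b = { "decoder", &c };
    struct task a = { "shell",   &b };
    c.next = &a;                    /* close the circle */

    schedule_round_robin(&a, 6);
    return 0;
}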

The Linux 2.5 kernel performed major surgery on process scheduling, introducing the O(1) scheduler, which shed the limitations of its predecessor through a static time-slice algorithm and per-processor run queues. However, it had an innate weakness with latency-sensitive processes. A latency-sensitive process is one with heavy user interaction, such as a desktop program, which needs to respond quickly to the user's actions.

The Linux 2.6 kernel introduced the Completely Fair Scheduler, CFS for short. It draws on queueing theory and brings the concept of fair scheduling to the Linux scheduler.

0x05 What problem does the current scheduling algorithm solve?

The CFS algorithm was introduced to solve the O(1) scheduler's poor responsiveness to latency-sensitive processes.

0x06 An analysis of how CFS works

Before analyzing how CFS schedules, it is worth first thinking about what scheduling policy we ought to follow.

    • I/O-bound and CPU-bound processes

A text editor is an I/O-bound process, because it spends its time waiting for and handling user input. A video decoder is CPU-bound: its work consists mainly of heavy computation. A scheduling policy typically has to balance two conflicting goals: process responsiveness (short response time) and maximum system utilization (high throughput). To meet this, schedulers often use quite complex algorithms to pick the most worthwhile process to run, and frequently cannot guarantee that low-priority processes are treated fairly. To ensure the responsiveness of interactive applications and desktop systems, Linux optimizes for process response and prefers to favor I/O-bound processes.

    • Process priority

The most basic class of scheduling algorithms is priority-based scheduling. The usual practice is that high-priority processes run first and low-priority ones later, with equal-priority processes run round-robin. On some systems, high-priority processes may also receive longer time slices.

Linux uses two separate priority ranges. The first is the nice value, which runs from -20 to +19 with a default of 0; a larger nice value means a lower priority. With the ps -el command you can see a column labeled NI: that is the process's nice value.

The second is the real-time priority. Its values are configurable and by default range from 0 to 99. In contrast to nice values, a higher real-time priority number means a higher process priority.
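A quick user-space sketch for poking at both ranges (illustrative; on Linux the usable SCHED_FIFO range usually comes back as 1 to 99):

#include <stdio.h>
#include <sched.h>
#include <sys/resource.h>

int main(void)
{
    /* nice value of this process: -20 (highest) .. +19 (lowest), default 0 */
    printf("nice value: %d\n", getpriority(PRIO_PROCESS, 0));

    /* valid real-time priority range for the SCHED_FIFO policy */
    printf("SCHED_FIFO priority range: %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    return 0;
}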

    • Time slices

A time slice is a value that indicates how long a process may run before it is preempted. The scheduling policy must set a default time slice, and this is not easy: a slice that is too long makes the system feel unresponsive, while a slice that is too short significantly increases the overhead of process switching. Here again is the conflict we saw earlier: I/O-bound processes want short time slices to stay responsive, while CPU-bound processes want long time slices to stay efficient (for example, by keeping their caches hot).

Process priority and the time slice are the two classic concepts of traditional process scheduling.

With that background, let's see how the CFS algorithm works.

Imagine a Linux system running only two processes: a text editor and a video decoder. CFS no longer assigns the editor a fixed priority and time slice, but instead allocates it a proportion of the processor. If the two programs have the same nice value, that proportion is 50% each; they split the processor's time evenly. But the editor will obviously spend most of its time waiting for user input, so it certainly won't use its 50%, while the video decoder is free to use more than 50% of the processor time and finish its decoding sooner. When the editor wakes up, CFS sees this situation and, to honor its promise that processes share the processor fairly, immediately preempts the video decoder and lets the editor run.

We see that the core of the CFS algorithm is: "Completely fair".

CFS's starting point is the model of a perfect multitasking processor. In this model, we could run two processes truly simultaneously over any 10 ms window, each using half of the processor's capacity. Of course, the model is unrealistic, because one processor cannot actually run multiple processes at the same time. So CFS runs each process for a while, round-robin style, always selecting the process that has run the least as the next one to run, instead of assigning each process a fixed time slice. CFS computes how long a process should run as a function of the total number of runnable processes, rather than deriving a time slice from the nice value; the nice value is instead used by CFS as a weight for the proportion of processor time a process receives.

Note that as the number of runnable tasks tends to infinity, each one's processor share and running time tend to 0, which would cause unacceptable switching costs. So CFS introduces a floor on the time slice each process can receive, called the minimum granularity. By default this value is 1 ms, which guarantees that even as the number of runnable processes grows without bound, each gets at least 1 ms to run. In other words, when there are very, very many processes, CFS is no longer a perfectly fair scheduling algorithm, because we still have to take turns executing.
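A back-of-the-envelope sketch of that trade-off. The 20 ms targeted latency used here is an assumption for the example (the real value is a kernel tunable), divided evenly among equal-nice tasks and floored at the 1 ms minimum granularity:

#include <stdio.h>

int main(void)
{
    const double target_latency_ms  = 20.0;  /* assumed target; a kernel tunable */
    const double min_granularity_ms = 1.0;   /* the floor described above */

    for (int nr_running = 1; nr_running <= 64; nr_running *= 2) {
        double slice = target_latency_ms / nr_running;
        if (slice < min_granularity_ms)
            slice = min_granularity_ms;      /* fairness yields to the floor */
        printf("%2d runnable tasks -> each runs ~%.2f ms per round\n",
               nr_running, slice);
    }
    return 0;
}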

CFS is implemented mainly in kernel/sched_fair.c, with four components deserving particular attention:

    • Time accounting

CFS uses the vruntime variable to record how long a program has run and thus how much longer it ought to run. The accounting is normalized (that is, weighted) across all runnable processes and kept in nanoseconds, so it is decoupled from timer ticks.
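A sketch of the weighting idea: actual runtime is scaled by the ratio of the nice-0 weight to the task's own weight, so a lower-priority task's vruntime grows faster. The weight of 1024 for nice 0 matches the kernel; the other weights below are illustrative values, not quoted from the kernel's table:

#include <stdio.h>

#define NICE_0_WEIGHT 1024ULL               /* the kernel's weight for nice 0 */

/* virtual runtime charged for delta_ns of real runtime at a given weight */
static unsigned long long charge(unsigned long long delta_ns,
                                 unsigned long long weight)
{
    return delta_ns * NICE_0_WEIGHT / weight;
}

int main(void)
{
    unsigned long long delta = 10ULL * 1000 * 1000;   /* ran 10 ms, in ns */

    printf("nice 0    (weight 1024): vruntime += %llu ns\n", charge(delta, 1024));
    printf("high prio (weight 3072): vruntime += %llu ns\n", charge(delta, 3072));
    printf("low prio  (weight  256): vruntime += %llu ns\n", charge(delta, 256));
    return 0;
}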

    • Process selection

CFS selects the process with the smallest vruntime (after nice weighting) to execute next; that is the core of the CFS scheduling algorithm: pick the task with the minimum vruntime. The remaining work is how to find that task efficiently. CFS organizes the queue of runnable processes as a red-black tree in which each node's key is the runnable process's vruntime; the leftmost leaf node is the next process to run. All the work then boils down to insertion, deletion, and rebalancing of the red-black tree.
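Picking the leftmost node does not even require walking the tree every time; the 2.6-era code in kernel/sched_fair.c caches it, roughly like this:

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;   /* cached leftmost node */

    if (!left)
        return NULL;                              /* no runnable entity */

    return rb_entry(left, struct sched_entity, run_node);
}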

    • Scheduler entry point

The scheduler's entry point is the schedule() function, defined in kernel/sched.c. It is the portal through which the rest of the kernel invokes the process scheduler: it selects which process can run and puts it into execution. schedule() first finds the appropriate scheduling class via pick_next_task(), then asks that scheduling class which program to run. Its implementation really is that simple.
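The book sketches the heart of pick_next_task() roughly like this: walk the scheduler classes from highest priority down, and take the first runnable task any class offers:

class = sched_class_highest;
for ( ; ; ) {
    p = class->pick_next_task(rq);
    if (p)
        return p;                   /* found a runnable task in this class */
    class = class->next;            /* fall through to the next class;
                                       the idle class always returns a task */
}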

    • Sleeping and waking up

A sleeping (blocked) process is in a special non-runnable state. A process can go to sleep for various reasons, but always to wait for some event: the event might be a certain amount of time elapsing, more data arriving from file I/O, or a hardware event. A process can also be forced into sleep when it tries to acquire a semaphore that is already taken. Whatever the cause of the sleep, the kernel handles it the same way: the process marks itself as sleeping, removes itself from the red-black tree of runnable processes, puts itself on a wait queue, and calls schedule() to select and execute another process. Waking up is the reverse: the process is set runnable and moved from the wait queue back into the red-black tree.
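The recommended kernel-side pattern for going to sleep looks roughly like this (q stands for the wait queue to sleep on and condition for the awaited event; both are placeholders):

/* 'q' is a wait_queue_head_t * we wish to sleep on */
DEFINE_WAIT(wait);

add_wait_queue(q, &wait);
while (!condition) {                /* 'condition' is the event we await */
    prepare_to_wait(q, &wait, TASK_INTERRUPTIBLE);
    if (signal_pending(current))
        break;                      /* woken by a signal: handle it */
    schedule();                     /* sleep until something wakes us */
}
finish_wait(q, &wait);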

"Reading Notes", "Linux kernel design and implementation" process management and scheduling

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.