Simple Unix-Scheduler Details

Source: Internet
Author: User

0. Everyone who has looked at multi-process scheduling knows the famous nice value (and nice system call) on Unix. What does "nice" mean? "Good", of course. The natural guess is that a larger nice value is better; in fact, the "nicer" a process is, the lower its priority. Why not call the value badness instead?
In fact, once you see that an operating system's multi-process scheduler is an "altruistic" system, the naming is no longer puzzling. Being nice is not for your own benefit but for everyone else's. An altruistic system is a system for the collective, like TCP flow control and congestion control, or like religious and social organizations: it has a negative-feedback mechanism that keeps fluctuations within a bounded range. If you are short, everyone donates a little and keeps a little less for themselves; no single member has to make up the whole difference, because no single member could. The opposite principle, "to everyone who has, more will be given; from the one who has nothing, even what he has will be taken away," describes a selfish system, such as UDP, or real-time processes, or a person with no faith to answer to...
1. The UNIX V6 scheduler. The strengths of UNIX V6 carry through to later Unix systems. The user-mode priority of a process is computed by a formula (the smaller the value, the higher the priority):
prio = USER + p_cpu / 4 + 2 * nice
The fields are described as follows:
USER: the baseline priority of user mode. It guarantees that every user-mode priority value is numerically larger (i.e., lower) than the kernel-mode sleep priorities, so that a process that wakes up inside the kernel, for example when its I/O completes, gets the CPU quickly.
p_cpu: the process's accumulated CPU usage. Note that from 4.3BSD onward this field is not a simple ever-growing tick counter: it is decayed periodically by the factor (2 * load) / (2 * load + 1), so CPU time consumed further in the past contributes less and less. Why stress p_cpu? Because in an altruistic system, the longer you have been holding the CPU, the less you deserve to keep benefiting from that behavior. As p_cpu accumulates, prio grows, the priority drops, and the process eventually yields the CPU; the net behavior is still altruistic. And why a decay factor? In an altruistic system, occupying the CPU for a long stretch is "bad" behavior, but a fault that has since been corrected can be overlooked: CPU time used long ago serves only as a reference and is not decisive. The goal of an altruistic system is not punishment.
Inside the decay factor of p_cpu there is load, the system load, roughly the total number of runnable processes. The larger the load, the closer the factor (2 * load) / (2 * load + 1) gets to 1, so p_cpu decays more slowly: under heavy load, a running process's usage history lingers, its priority recovers more slowly while it waits, and it yields the CPU sooner to the others, so each process runs in shorter stints. The net effect is that every process's scheduling cycle stays roughly comparable and no process goes too hungry.
Nice: The larger the nice, the lower the priority.
Now look at the prio formula again. If nice is negative, the process is not so "altruistic", because its nice value is not so nice! In that case p_cpu is what restrains it. If the nice value is positive and large, the process is very nice indeed in the altruistic system, and p_cpu matters little. That is the point of dividing p_cpu by 4: it down-weights the term. Consider what the division means when nice is negative: the nice value lets the process keep occupying the CPU, and p_cpu / 4 stays a comparatively small term, so the process is reined in only gradually. Finally, note that even in an altruistic system not everyone is equal. This is why nice is confined to a range rather than fixed at one value: some processes are simply important and naturally carry a low nice value, no argument needed. Even in a religious system there are the Pope, the bishops...
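To make the arithmetic concrete, here is a minimal sketch of the idea in C. The constants (USER = 50, the tick rate, the load value) and the struct fields are illustrative assumptions, not the actual kernel definitions; only the decay factor and the priority formula are taken from the text above.

    #include <stdio.h>

    #define USER 50               /* assumed user-mode base priority */

    struct proc {
        int p_cpu;                /* accumulated (decayed) CPU usage     */
        int p_nice;               /* nice value, negative = less nice    */
        int p_pri;                /* computed priority, smaller = higher */
    };

    /* Called on every clock tick while p is running. */
    static void tick(struct proc *p)
    {
        p->p_cpu++;
    }

    /* Called once per second: decay the history, then recompute priority. */
    static void schedcpu(struct proc *p, int load)
    {
        /* decay factor (2*load)/(2*load+1), done here with integer math */
        p->p_cpu = p->p_cpu * 2 * load / (2 * load + 1);
        p->p_pri = USER + p->p_cpu / 4 + 2 * p->p_nice;
    }

    int main(void)
    {
        struct proc hog  = { .p_nice = 0 };
        struct proc idle = { .p_nice = 0 };
        for (int sec = 1; sec <= 5; sec++) {
            for (int t = 0; t < 100; t++)
                tick(&hog);       /* the hog runs the whole second */
            schedcpu(&hog, 2);
            schedcpu(&idle, 2);
            printf("sec %d: hog prio=%d idle prio=%d\n",
                   sec, hog.p_pri, idle.p_pri);
        }
        return 0;
    }

Running it shows the hog's prio value climbing away from the idle process's, which is exactly the "yield the CPU eventually" behavior described above.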
Once you understand this simple UNIX scheduling mechanism, you see how simple and elegant it is: every process's priority fluctuates up and down, and preemption can happen at any time (both when the clock interrupt notices a higher-priority process and when a process that has just woken up at a high kernel priority preempts a lower one; this article does not cover the latter, see UNIX V6 or Windows Internals for details). The p_cpu decay calculation ensures that even under very heavy load no process starves for too long; a growing load only shortens how long each process runs at a stretch. Because switching becomes denser, this policy suits interactive desktop workloads very well.
The following figure shows UNIX V6 scheduling:

2. A note: in the description above I did not say how the system finds the highest-priority process, nor how all processes are kept, whether in an array, a linked list, or some other data structure such as an AVL tree or a red-black tree... That is an implementation question, not an essential one. A first version is usually built on an array or a linked list, to keep it simple and get a runnable version as soon as possible; later it can be migrated to a more efficient data structure such as a red-black tree. That is an optimization question.
If you read the kernel code from Linux 0.11 up through 2.4, you will find that the system walks a linked list to pick the highest-priority process; this is the so-called O(n) algorithm, meaning its time complexity grows in proportion to the system load. People generally pay attention only to the implementation question: there are plenty of articles on the Internet about the O(n) and O(1) algorithms, but almost none describing priority calculation, and the latter matters more than the former. The schedulers of Linux 2.6.0 through 2.6.23 and of Windows NT are both O(1) implementations and can be said to be nearly the same: they maintain one run list per priority level, plus a per-CPU bitmap indicating which priorities have a ready process/thread. The difference is that Windows NT keeps only one array of lists, because it relies on dynamic priority boosting and decay back to the base priority, whereas the Linux O(1) scheduler keeps two arrays of lists, because priority boosting in Linux is only a secondary optimization. What does this difference mean?
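The bitmap-plus-priority-lists idea is easy to sketch. The following C fragment is illustrative only: the fixed NR_PRIO, the naive word scan, and names such as runqueue and pick_next are my own stand-ins, not the kernels' actual structures, which use optimized find-first-bit primitives and per-CPU data.

    #include <stddef.h>

    #define NR_PRIO 140                              /* assumed number of priority levels */

    struct task {
        int prio;                                    /* 0 = highest priority              */
        struct task *next;                           /* simple singly linked ready list   */
    };

    struct runqueue {
        unsigned long bitmap[(NR_PRIO + 63) / 64];   /* bit set => that list is non-empty */
        struct task *queue[NR_PRIO];
    };

    static void enqueue(struct runqueue *rq, struct task *t)
    {
        t->next = rq->queue[t->prio];
        rq->queue[t->prio] = t;
        rq->bitmap[t->prio / 64] |= 1UL << (t->prio % 64);
    }

    /* O(1): scan a fixed-size bitmap, then take the head of that priority's list. */
    static struct task *pick_next(struct runqueue *rq)
    {
        for (int w = 0; w < (NR_PRIO + 63) / 64; w++) {
            if (rq->bitmap[w]) {
                int bit = __builtin_ctzl(rq->bitmap[w]);
                int prio = w * 64 + bit;
                struct task *t = rq->queue[prio];
                rq->queue[prio] = t->next;
                if (!rq->queue[prio])
                    rq->bitmap[w] &= ~(1UL << bit);
                return t;
            }
        }
        return NULL;                                 /* nothing runnable */
    }

The work per decision depends only on the (fixed) number of priority levels, not on the number of runnable tasks, which is the whole point of calling it O(1).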
Let me add that in 2008 I implemented a scheduler on the 2.6.18 kernel, and there was no CFS yet (or perhaps there was, but I found no information about it). We needed real-time audio/video transmission that tolerated frame loss, and the system's behavior was not allowed to change with the load. My design fixed the average time slice of a process at a constant t and then computed:
Tt = nr_task * t
Then divide Tt into nr_task segments, where the length of segment n is:
Ttn = Tt * (Wn / Wt)
Here Wn is the weight of process n and Wt is the total weight. For the weights I simply normalized the nice values so that all the weights sum to 1. That is the whole design; doesn't it look like CFS? I think so too! In the implementation, however, I used a linked list rather than a red-black tree, because a linked list is simple (later I wanted to redo it with a heap); traversing a linked list is really not a big deal. Later I read CFS and found the following code, which determines the scheduling period covering all processes:
    if (unlikely(nr_running > 5)) {
            period = sysctl_sched_min_granularity;
            period *= nr_running;
    }
I think CFS makes the same mistake my 2008 scheduler made: if nr_running is very large, the scheduling period becomes very long. Following the 4.3BSD UNIX idea, the total period covering all processes should stop growing at some point! Fine, can we simply cap the maximum value of period? Yes! But wouldn't it be more elegant if the priority formula itself guaranteed the bound?
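Here is a small sketch of the weight-proportional split with a hard cap on the total period, in the spirit of the 08 scheduler formulas above. The function and constant names (task_slice_ms, SLICE_MS, PERIOD_CAP_MS) and the numbers are invented for illustration.

    /* Split a scheduling period among tasks in proportion to their weights,
     * but never let the period itself grow beyond a fixed cap. */
    #include <stdio.h>

    #define SLICE_MS       10      /* assumed average slice t per task        */
    #define PERIOD_CAP_MS  200     /* assumed upper bound on the total period */

    static double task_slice_ms(double weight, double total_weight, int nr_task)
    {
        double period = (double)nr_task * SLICE_MS;     /* Tt = nr_task * t       */
        if (period > PERIOD_CAP_MS)
            period = PERIOD_CAP_MS;                     /* the cap discussed above */
        return period * (weight / total_weight);        /* Ttn = Tt * (Wn / Wt)    */
    }

    int main(void)
    {
        /* Three tasks with normalized weights summing to 1. */
        double w[] = { 0.5, 0.3, 0.2 };
        for (int i = 0; i < 3; i++)
            printf("task %d gets %.1f ms per period\n",
                   i, task_slice_ms(w[i], 1.0, 3));
        return 0;
    }

With the cap in place, adding more tasks shrinks each task's share instead of stretching the round, which is exactly the 4.3BSD-style behavior argued for above.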
3. Time-slice scheduling. Most modern operating systems schedule on the basis of time slices; what differs is the relationship between the time slice and the process priority. In the Linux O(n) and O(1) schedulers the time slice is a function of priority, with the nature of the process's recent execution as another input, and a dynamic priority is computed from both. In Windows NT the time slice of every thread is fixed; the priority consists of the thread's base priority plus a dynamically adjusted boost, and threads of different priorities run for the same length of time per turn. The difference is that a higher-priority thread can always preempt a lower-priority one and run first. This is fundamentally different from UNIX V6 process scheduling; to repeat, UNIX V6 is not based on time slices, and a process there has no time slice at all.
The advantage of time-slice scheduling is that the slices are allocated in advance, which avoids recomputing a dynamic priority for every process on every decision and raises system throughput. But the traditional way of computing time slices has a drawback: it easily starves processes/threads. Windows NT uses a fixed time slice precisely to keep any one priority level from getting an overly long slice, yet it still has to rely on an external balancer to discover starved threads in time and hoist them into a high-priority queue. UNIX V6 process scheduling (really 4.3BSD and later) has no such problem at all:
A. The priority is adjusted dynamically by a single formula, adapting to the system load;
B. The kernel-mode sleep priorities are kept separate, which improves response speed once I/O completes.
3.1. Variable time slices. Before Linux 2.6.23 the kernel used variable-length time slices: the time slice is a function of the static priority. For a fixed nice value the slice is fixed, that is, each nice value maps to one slice length, and the queued processes are then scheduled by the policy "run the highest-priority process first". This brings two problems:
1. If a large number of processes sit at the same high priority, they rotate among themselves, and every process below that priority starves for a long time;
2. How to respond quickly once I/O completes.
Let's see how these problems were handled before Linux 2.6.23; the answers fall into two schemes. In the O(n) algorithm the system strictly runs each process for the time slice allocated to it, exactly once, and then sets its remaining slice to 0; only when the time slices of all runnable processes have reached 0 does it reset every process's slice in one pass. Because each process runs its own slice only once per round, even a high-priority process eventually uses its slice up, which keeps processes from starving forever. But O(n) does not solve the problem of responding quickly to I/O completion! Note also that the O(n) scheduler is not preemptive in this respect: a process is not preempted before its time slice runs out.
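A toy fragment of this "one round, then refill everyone" policy is shown below. The refill follows the general shape of the 2.4-era recalculation (keep half of any leftover slice, add a nice-derived quantum), but the names and constants here (otask, BASE_QUANTUM, the nice/4 mapping) are simplified assumptions, not the kernel's definitions.

    #define BASE_QUANTUM 6                    /* assumed ticks for nice == 0        */

    struct otask {
        int counter;                          /* remaining time slice, in ticks     */
        int nice;
        int runnable;                         /* 0 = sleeping, e.g. waiting on I/O  */
    };

    /* O(n)-style epoch: when every *runnable* task has exhausted its slice,
     * refill all tasks; sleepers carry over half of what they saved up. */
    static void refill_if_epoch_over(struct otask *tasks, int n)
    {
        for (int i = 0; i < n; i++)
            if (tasks[i].runnable && tasks[i].counter > 0)
                return;                       /* the round is not over yet */
        for (int i = 0; i < n; i++)
            tasks[i].counter = (tasks[i].counter >> 1)
                             + BASE_QUANTUM - tasks[i].nice / 4;
    }

The carried-over half slice is what lets a process that slept through the round (for example on I/O) start the next round with a slightly bigger allowance.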
The O(1) algorithm is similar: after a process has run the time slice allocated to it exactly once, the slice is reset and the process is moved to the expired array; once every process is in the expired array, the expired array is swapped with the active array. So the O(1) and O(n) schedulers are essentially the same, differing only in implementation, with one real exception: O(1) adds preemption, along with some fairly fiddly machinery for dynamic priority adjustment, "time-slice splitting", and starvation avoidance:
Preemption: as long as no locks are held, a process can be preempted, whether it is in kernel mode or user mode and whether or not its time slice is used up. This effectively shrinks the granularity of kernel protection.
Dynamic priority adjustment: the priority is adjusted according to the ratio of the process's sleep time to its actual CPU time. The principle is that the longer a process sleeps, the larger the boost it is given, which is again the altruistic give-back (see the sketch after this list);
Time-slice splitting: in interactive scenarios, the very long slice of a compute-heavy process is broken into smaller pieces so that other processes get chances to run in between;
Starvation avoidance: with the mechanisms above it is inevitable that boosted high-priority processes keep occupying the CPU, so a starvation-avoidance mechanism is added. This is really quite similar to the Windows NT balancer; the difference is that in Windows NT priority adjustment and preemption are the main line, while in Linux they are auxiliary.
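A rough sketch of the sleep-ratio bonus mentioned in the list above. The real O(1) scheduler keeps an exponentially averaged sleep_avg and derives a bonus from it; the names and constants below (o1task, MAX_BONUS, MAX_SLEEP_AVG, effective_prio) are simplified stand-ins rather than the kernel's exact definitions.

    #define MAX_BONUS      10     /* assumed maximum priority boost              */
    #define MAX_SLEEP_AVG  1000   /* assumed sleep-average ceiling, in ms        */

    struct o1task {
        int static_prio;          /* derived from nice, not changed here         */
        int sleep_avg;            /* grows while sleeping, shrinks while running */
    };

    /* Interactive (long-sleeping) tasks get a boost, CPU hogs get a penalty. */
    static int effective_prio(const struct o1task *t)
    {
        int bonus = MAX_BONUS * t->sleep_avg / MAX_SLEEP_AVG - MAX_BONUS / 2;
        return t->static_prio - bonus;        /* smaller value = higher priority */
    }

A task that sleeps most of the time ends up with sleep_avg near the ceiling and gets the full positive bonus; a task that never sleeps gets a negative bonus, i.e., a penalty.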
Appendix: comparing the Linux O(n) and O(1) schedulers, priority has two effects: first, whether a high-priority process gets to run at all; second, how long a high-priority process runs each time it does. This gives high-priority processes an absolute advantage in the system, and the scheduling behavior follows the scheduler itself strictly rather than extra add-on mechanisms.
Have you ever wondered whether O(1) is really better than O(n)? O(1) is more reasonable only for interactive applications. For throughput-oriented servers it can even be worse than O(n), especially with kernel preemption: when you configure a kernel build you will see that for a server, as opposed to a desktop, disabling kernel preemption is recommended. The total time is fixed, so the more of it goes to switching, the less goes to the tasks themselves; moreover, server tasks are generally best run to completion in one go to maximize cache utilization, and switching flushes caches and the TLB.
Personally I feel that O(1) does not improve performance much over O(n); it probably matters mainly for desktop systems such as Ubuntu and SUSE.
3.2. Fixed time slices. Windows NT uses fixed-time-slice scheduling: whether a thread has high or low priority, it runs one fixed slice at a time, and each time the scheduler picks the first thread of the highest non-empty priority queue. The idea is very simple, but starvation is hard to avoid, because NT neither limits each thread to a single run per round as Linux's O(n) does, nor maintains an expired queue as Linux's O(1) does. So how does Windows NT solve the problem? First look at the figure:

The key to the Windows NT scheduler is dynamic priority adjustment. Seen as a whole, any thread keeps jumping between priority queues over time; for sleeping threads, Windows NT inherits the UNIX V6 sleep/wakeup priority idea almost completely.
Because the NT scheduler has no notion of "one scheduling round", it does not stipulate how many time slices a thread runs; only the slice length is fixed. A high-priority thread may well run several consecutive slices, as long as no higher-priority thread becomes ready. The way lower-priority threads get in is for the system to raise them into a higher-priority queue, and there are many reasons for a boost: an event the thread waited on is signaled; the thread wakes after a sleep; a lock it waited for is released; its I/O completes (inherited from UNIX V6); an executive resource it waited for becomes available; it belongs to the interactive foreground window; or it has been starved for too long. Each of these boosts has its own fairly involved implementation and its own boost value. This shows that the main line of the NT scheduler is priority adjustment; each thread has a fixed base priority, toward which the boosted priority decays back.
As for NT's O(1) scheduler, I personally think it stands out among modern operating systems, though it is still not as good as Linux's CFS. The elegance of the NT scheduler is that it relies on only a small number of external factors, much like UNIX V6; the one blemish is the balancer. NT defines different time-slice lengths for the server and home editions: the server edition pursues throughput and avoids frequent switching, so its slices are longer, while the home edition pursues responsiveness, so its slices are shorter. This too is in line with the UNIX V6 design philosophy.
The NT scheduler does not really exhibit the behavior of an altruistic system; it depends entirely on external interventions such as the various boost reasons and the balancer. That means it is not a self-consistent scheduling system, and only inside NT can it play its role.
In 2008 I designed my own scheduler (for the company, in fact), which I later named the 08 scheduler. The motivation was that the O(1) scheduler could not meet our real-time requirements, especially under high load: the time slice is fixed, so the more processes there are, the longer one round of scheduling takes. That is the biggest drawback of Linux's round-by-round scheduling philosophy. The NT scheduler has no such problem, because a thread's priority can be boosted at any time for any of those reasons, and the balancer clears up any thread starved for more than about four seconds. But Linux does not work that way, so I implemented my own 08 scheduler.
4. CFS scheduling. Still, my 08 scheduler did not solve every problem. Why? Because I had been poisoned by the round-by-round drug! In my head there was always the traditional Linux notion that all processes must be scheduled round by round, and in each round a process may be scheduled only once. There is actually nothing wrong with that idea; on the contrary, it is the best way to avoid chaos and complexity. The key is how you understand it and how you implement it.
Linux after 2.6.23 adopts the CFS scheduling algorithm, and CFS, too, is at bottom a "round-by-round" scheduler.
4.1. An overview of CFS. What does CFS do? It is completely fair! How is that fairness reflected? First look at a picture:

Switch your thinking to a virtual clock. The system always picks the process with the smallest virtual clock to run, and each run advances that process's virtual clock by an equal amount, so the virtual clocks of all processes chase each other forward at the same pace. That is the embodiment of equality! In each scheduling round every process gets a chance to run. For how long? Uniformly, for n virtual clock cycles. But how does that map onto the real clock? The answer is that the n virtual cycles are interpreted through the weights. The real time corresponding to one virtual clock cycle is:
Tr = T * (Wn / Wbase)
Here T is the clock-tick interval, Wn is the weight of the current process, and Wbase is the reference weight, i.e., the weight of a process with nice 0. You can see that the higher a process's weight, the more real time one of its virtual cycles corresponds to. And how long should a process run within an arbitrary period T? The answer: T * (Wn / Wt), where Wt is the total weight of all processes in the queue.
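A tiny sketch of the virtual-clock bookkeeping described above. The names (ctask, vclock, W_BASE, pick_min) and the flat array are illustrative assumptions; the real CFS keeps these entities in a red-black tree.

    struct ctask {
        double vclock;            /* virtual clock; the scheduler runs the minimum */
        double weight;            /* Wn; W_BASE is the weight of nice 0            */
    };

    #define W_BASE 1024.0         /* assumed reference weight for nice == 0 */

    /* Charge real runtime to the task: heavier tasks' virtual clocks advance
     * more slowly, so they earn more real time before losing the minimum spot. */
    static void account(struct ctask *t, double real_ms)
    {
        t->vclock += real_ms * W_BASE / t->weight;
    }

    /* Pick the task with the smallest virtual clock (CFS uses a red-black
     * tree for this; a linear scan shows the same idea). */
    static struct ctask *pick_min(struct ctask *tasks, int n)
    {
        struct ctask *best = &tasks[0];
        for (int i = 1; i < n; i++)
            if (tasks[i].vclock < best->vclock)
                best = &tasks[i];
        return best;
    }

Because account() divides by the weight, a process with twice the weight accumulates virtual time half as fast and therefore receives twice the real CPU time over a round, which is exactly the T * (Wn / Wt) share stated above.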
This is very similar to my 08 scheduler! Only more refined. Is that just because it uses a red-black tree?...
Between Linux 2.6.23 and 2.6.25 the CFS implementation was simple: it maintained a fair_clock variable per run queue to track the advance of the virtual clock, and each task carried a key holding its current virtual clock value. To keep things fair, the scheduler always ran the task with the smallest virtual clock; the tasks were stored in a red-black tree, though they could just as well live in a heap or even a linked list. From kernel 2.6.25 on, the CFS implementation became more direct: instead of tracking a per-queue virtual clock, it applies the T * (Wn / Wt) time directly to decide whether process n has used up its share in the current round of scheduling.
Either way, the CFS scheduler still has a problem: the total scheduling period over all processes is not bounded, so hunger can still occur. However, the red-black tree insertion keeps moving tasks from right to left, so although a task may wait a long time, its hunger is never permanent (the same would hold with a linked list; this property comes from the virtual clocks advancing together, not from the data structure!). That is also the advantage of round-by-round scheduling!
The Linux CFS scheduler grafts "round-by-round" scheduling onto the smooth priority evolution of UNIX V6, which is a wonderful solution.
4.2. Sleep/wakeup. The CFS scheduler described above is this simple, and I have not touched on priority boosting, for example the boost after I/O completes. In fact CFS does not care about that at all: it only has to make sure that a process waking from sleep comes back with a relatively small virtual clock. It does not dynamically change the process's weight; it just computes, based on the weight, how much the virtual clock should be pulled back. Since CFS always picks the process with the smallest virtual clock, the freshly woken process naturally gets to run early, and how much of a head start it gets is determined by its weight.
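A sketch of that wakeup placement, reusing the ctask structure from the sketch above. The half-latency credit and the names (SCHED_LATENCY_MS, place_on_wakeup, min_vclock) are simplified assumptions in the spirit of CFS's placement logic, not the kernel's exact rules.

    #define SCHED_LATENCY_MS 20.0   /* assumed target length of one round */

    /* On wakeup, pull the sleeper's virtual clock up near the queue minimum,
     * minus a small credit, so it runs soon but cannot hoard unlimited credit. */
    static void place_on_wakeup(struct ctask *t, double min_vclock)
    {
        double credit = SCHED_LATENCY_MS / 2.0;     /* assumed wakeup credit           */
        double target = min_vclock - credit;
        if (t->vclock < target)
            t->vclock = target;                     /* cap credit earned while asleep  */
    }

The cap is the important part: a process that slept for an hour should wake up slightly ahead of the pack, not with an hour's worth of CPU credit.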
I will not go into the fine details of CFS here; the details are an implementation matter. Once you understand the principle, you can work out the details yourself, can't you?
5. The simple UNIX scheduler. The UNIX V6 scheduler is simple, extremely simple. As you can see, neither Linux nor NT has really surpassed the UNIX V6 (in fact 4.3BSD+) scheduler; they all resort to assorted tricks, tweaks, and extra machinery such as balancers.
Finally, a word on kernel preemption. Early Unix systems forbade kernel preemption in order to protect kernel data. Since that is really just a lock-like protection policy, the ban did not last; in the end kernel preemption was implemented by almost all modern operating systems, including NT, Linux, and the modern Unix variants. When Unix first implemented kernel preemption it defined a set of points at which preemption is allowed, namely points where nothing is held that could cause mutual-exclusion problems. But there is also the opposite approach: rather than defining preemptible points in a non-preemptible kernel, define non-preemptible points in a preemptible kernel. The latter is easier to implement, and NT, Linux, and Solaris all do it that way.
Now that CFS exists, is UNIX V6 outdated? Have you ever studied queueing? Simple things never go out of date; it is only that you do not understand them...
