In Linux, the current process scheduler is CFS, which replaced the earlier time-slice rotation scheduler. CFS smooths out the dynamic priority calculation, so that at any moment the whole system can be preempted by whichever execution entity currently deserves to run. This is, in fact, the basic principle of a time-sharing system: every process or thread can be interrupted, and by always running the most deserving entity, the system behaves like a truly "fair", altruistic system. To understand the nature of this altruism, studying the statistics or the code of the CFS scheduler does not work well, and studying the scheduler of the Windows NT kernel would waste a great deal of time and energy and eventually drown you in details.
In fact, the simplest early versions of UNIX are the best embodiment of this idea. Their code is concise and their logic is simple.
0. Time-Sharing System
The von Neumann stored-program machine has two main components, the CPU and the memory; viewed more abstractly, it is a Turing machine. If you want to build a time-sharing system, look at how people used computers in the early days, or how they shared machine tools, or any shared machine: people lined up holding a numbered ticket, and an administrator called the numbers, much like handling business in a bank today. In a computer these mechanics are much simpler to implement. There are only two things to work out: how to share the CPU and how to share the memory. These two points seem simple, but there are details. To share the CPU, you could execute tasks one after another, starting the next task only when the previous one has finished. But if a task lasts too long, everyone waiting in the queue is stuck for a long time, so whole tasks are too coarse a unit for time-sharing, and a finer-grained scheme becomes a requirement. To achieve finer granularity, a context is needed, so that a task interrupted by the administrator can later resume where it left off. The context must be stored somewhere, namely in memory, which brings us to how the memory is used.
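To make the idea concrete, here is a minimal sketch of what such a saved context might look like; the field names and layout are illustrative, not the actual UNIX V6 or PDP-11 format.

/* A minimal sketch, not a real kernel layout: the "context" that has to be
 * saved in memory so the administrator -- the kernel -- can interrupt a
 * task and resume it later.  Field names are illustrative. */
struct context {
    unsigned int pc;       /* program counter: where to resume execution  */
    unsigned int sp;       /* stack pointer                                */
    unsigned int psw;      /* processor status word: flags, priority level */
    unsigned int regs[6];  /* general-purpose registers                    */
};
/* On a timer interrupt the kernel saves the running task's context and
 * restores another task's context; this swap is the whole trick behind
 * fine-grained time sharing. */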
Speaking of memory usage: in short, the whole physical memory is shared by all processes. How is it shared? An allocation policy is needed, that is, one memory area is allocated to one process and another area to another process. Segmentation and paging are easy to understand from this idea. Roughly speaking, segments are used to distinguish different kinds of memory within one process, such as code, data and stack, while paging divides memory between processes: one page belongs to this process, another page belongs to that one. How the pages are divided must be recorded in a table. For a modern operating system this table is the familiar page table; for the ancient PDP-11 it is the contents of the APR register group. The job of a time-sharing operating system includes managing these tables and managing the whole physical memory. When one process switches to another, the contents of these tables must be switched as well. On the modern x86 platform this is the familiar switch of the CR3 register; however, when CR3 points to a page table resident in physical memory, that physical memory cannot be used by the process itself, because it stores process metadata. In the PDP-11 era, physical memory was so small that storing metadata this way was out of the question. The PDP-11 did not implement a precise paging mechanism; it only implemented a simple virtual address space, with the APR register group defining the mapping between virtual addresses and physical memory pages. Therefore, in the PDP-11 era, whole processes were swapped in and out. There was no such thing as part of a process resident in memory while the rest lived in swap space, to be brought in by page faults on demand.
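The following sketch, with invented names, shows the core of the idea: each process owns a small translation table, and switching processes means pointing the MMU at a different table.

/* Conceptual sketch only, with invented names: each process owns a small
 * table mapping its virtual pages to physical pages, and switching
 * processes means pointing the MMU at a different table -- writing CR3 on
 * x86, or reloading the APR register set on the PDP-11. */
#define NPAGES 8                       /* illustrative table size */

struct addr_map {
    unsigned int phys_page[NPAGES];    /* virtual page i -> physical page */
};

static struct addr_map *current_map;   /* the table the MMU consults now  */

static void mmu_load_map(struct addr_map *next)
{
    current_map = next;  /* real hardware: write CR3 / reload the APRs */
}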
Essentially, a simple time-sharing system covers two aspects: how the CPU is used and how physical memory is used. Of course, if physical memory is large enough, several processes can reside in memory at the same time, which undoubtedly improves efficiency, but that is not the essence of a time-sharing system; the later demand-paging mechanism is not the essence of a time-sharing system either, it is merely one way of managing the MMU.
In the implementation of a time-sharing system, memory management is extremely important, because memory is organized by space rather than driven by the clock the way the CPU is. How to associate the use of space with clock ticks and establish a mapping between them is the key to implementing a time-sharing system, so I will spend a lot of space on the details of memory management. A time-sharing implementation keeps interrupting and fragmenting work, and this is where the concept of preemption arises: in a fine-grained time-sharing system unfinished tasks can be seized, so when dynamic priority calculation produces a process with a higher priority, the time already granted to the current process can be preempted.
1. Process Scheduling for Unix v6
We know that UNIX was born in 1969, but that newborn UNIX was not really UNIX, because it did not yet have the famous fork call. It can only be called a time-sharing system that ran a fixed pair of processes; and because those two processes were attached to two terminals, and the terminals belonged to users, it was also a multi-user system. So although UNIX was immature in 1969, it was indeed a multi-process, multi-user time-sharing system, and the history of modern operating systems had begun. The first mature UNIX is UNIX V6, the version covered by Lions' famous book, and it is the one I call simple UNIX.
Before looking at the scheduling mechanism of simple UNIX V6, you must understand the memory management mechanism that early UNIX systems ran on. The PDP-11 has no sophisticated paging MMU of the kind found on the x86 platform. The PDP-11 uses a group of registers called APRs to implement the virtual address space. Note that the mapping table is held in a group of registers instead of residing in physical memory. Because the number of registers is limited, this register table cannot be large, so the translation from virtual addresses to physical addresses is doomed to be simple; you cannot use the concept of multi-level page tables to sneer at the PDP-11's management of its virtual address space. Even the multi-level page table mechanism developed step by step. I will write a separate article on the virtual memory address space to describe how simple UNIX defines virtual addresses; here I can only sum up the virtual address space in a few words. If the C language gives programmers a set of general-purpose tools, then the virtual address space gives programmers a general, fixed-size, contiguous working area. With the C language and the virtual address space, programmers no longer have to care about machine details: C hides the details of the underlying processor, and the virtual address space hides the size, type and other details of physical memory. Programmers think their data is in memory, but the memory here is only virtual memory; the real location of the data may be physical memory, disk, or the network.
Back to the topic above. The PDP-11 uses a simple group of registers to hold the mapping between virtual memory addresses and physical memory addresses. This is the most complete yet simplest mechanism that existed before on-demand paging. Measured against the definition of a time-sharing system, the implementation is simple and efficient; it only needs to ensure the following:
Each process has its own set of APR registers storing the mapping between virtual and physical addresses, and the APR register group is switched when processes are switched;
One register pair maps one page, an 8 KB virtual page onto an 8 KB physical page. A group has eight pairs, mapping eight pages, so the total virtual address space of a process is 8 × 8 KB = 64 KB (see the sketch below).
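Here is a rough C sketch of the APR idea. The field names and the translate helper are illustrative (the real hardware uses PAR/PDR register pairs with a more involved format), but the arithmetic shows how eight 8 KB slots give a 64 KB address space while the in-page offset passes straight through.

#define APR_SLOTS 8
#define PAGE_SIZE (8 * 1024)            /* 8 KB per page */

struct apr_pair {
    unsigned int par;   /* page address register: physical base (64-byte units) */
    unsigned int pdr;   /* page descriptor register: length and access bits     */
};

struct apr_set {
    struct apr_pair apr[APR_SLOTS];     /* the whole per-process "page table"   */
};

/* Translate a 16-bit virtual address through the register set. */
unsigned long translate(const struct apr_set *s, unsigned int vaddr)
{
    unsigned int page   = vaddr / PAGE_SIZE;    /* pick one of the 8 slots       */
    unsigned int offset = vaddr % PAGE_SIZE;    /* in-page offset passes through */
    return (unsigned long)s->apr[page].par * 64 + offset;
}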
The simple UNIX V6 MMU mechanism described above already contains the later multi-level page table mechanism. The only difference between multi-level page tables and the MMU of the PDP-11 is that multi-level page tables make the table entries resident in memory. This happened because processes needed more and more address space, which made multi-level page tables a requirement, and the requirement could be satisfied because memory itself kept growing, while registers became less and less suited to the task. The MMU's TLB cache, also known as the fast table, to some extent plays the role that registers such as the APRs once played. It is also worth noting that the term "multi-level page table" is narrower than it sounds: even in the PDP-11's APR period the MMU table was effectively multi-level, because a register maps a whole page rather than a single address, and the offset within the page comes directly from the virtual address.
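For comparison, a minimal two-level page-table walk might look like the sketch below; the 32-bit layout, 4 KB pages and field masks are illustrative, not any particular architecture's exact format.

#include <stdint.h>

#define PTE_VALID 0x1u

typedef uint32_t pte_t;

/* Walk a two-level table for a 32-bit virtual address with 4 KB pages:
 * the top index picks a second-level table, the middle index picks the
 * page, and the low 12 bits pass straight through as the offset. */
uint32_t walk(const pte_t *top_level, uint32_t vaddr)
{
    uint32_t top_idx = vaddr >> 22;             /* bits 31..22 */
    uint32_t mid_idx = (vaddr >> 12) & 0x3ffu;  /* bits 21..12 */
    uint32_t offset  = vaddr & 0xfffu;          /* bits 11..0  */

    pte_t first = top_level[top_idx];
    if (!(first & PTE_VALID))
        return 0;                               /* unmapped region: no table */

    const pte_t *second = (const pte_t *)(uintptr_t)(first & ~0xfffu);
    pte_t leaf = second[mid_idx];
    if (!(leaf & PTE_VALID))
        return 0;                               /* unmapped page */

    return (leaf & ~0xfffu) | offset;           /* physical address */
}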
However, when physical memory is very small, a process cannot have a large virtual address space, nor can the MMU table be built as a multi-level page table, because that wastes even more physical memory on metadata. Some people say that multi-level page tables save memory, which is a big mistake: they save memory only under one premise, namely that most of the huge 4 GB address space is unused, so that most page tables and page table entries never need to be created. If every page of the virtual address space is used, multi-level page tables need more memory, not less. Besides, multi-level page tables only become usable once precise virtual memory management with demand paging has been implemented. I will describe UNIX memory management in detail later.
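A quick back-of-the-envelope calculation makes the claim concrete. The numbers below assume a 32-bit address space, 4 KB pages and 4-byte entries, which is purely illustrative.

#include <stdio.h>

int main(void)
{
    /* flat table: 2^20 entries of 4 bytes, always present = 4 MB        */
    unsigned long flat   = (1ul << 20) * 4;
    /* sparse process: one 4 KB top-level table plus two 4 KB
     * second-level tables actually in use                               */
    unsigned long sparse = 4096ul + 2 * 4096ul;
    /* densely used 4 GB space: top-level table plus all 1024
     * second-level tables -- more than the flat table                   */
    unsigned long dense  = 4096ul + 1024ul * 4096ul;

    printf("flat: %lu  sparse: %lu  dense: %lu (bytes)\n",
           flat, sparse, dense);
    return 0;
}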
Well, after so much about the MMU, we can finally talk about process scheduling. Remember that in the PDP-11 era the virtual address space was only 64 KB, while the physical memory bus was 18 bits wide, i.e. 256 KB. The physical address space was much larger than the virtual address space, and the installed physical memory was usually more than 64 KB, so it was entirely feasible for a whole process to reside in physical memory at once. A process could be fully resident while it ran and be swapped out to the swap space when it had not run for a long time. For the implementation of the UNIX V6 time-sharing system, this hardware architecture enabled a great and simple process scheduling mechanism. The figure shows the scheduling mechanism of UNIX V6:
(I have to omit the figure here, because I plan to use it in the next article to compare this scheduler with others, which makes the comparison more compact; if you understand the following points, you can picture the figure in your mind.)
This scheduling mechanism has the following points:
1) Process switching requires the assistance of process 0. Whenever a process gives up the CPU, it hands the execution right to process 0, and process 0 then decides which process runs next.
Why must everything go through process 0? If you understood the PDP-11 memory management described above, you will immediately see why. Because UNIX V6 on the simple PDP-11 does not implement demand paging, it must ensure that all pages of the process about to run are completely resident in memory, and process 0 is what guarantees this. Process 0 is also called the swap process or the scheduling process. Once it is woken up, it performs the swapping: the process about to run is swapped into memory, and processes that have not run for a long time are swapped out. When this is done and no process needs to be swapped in or out, process 0 hands control to the highest-priority process and goes to sleep; if there is no swapping to do in the first place, it simply gives the execution right to the highest-priority process and sleeps directly.
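To visualize process 0's job, here is a toy model written in the spirit of the V6 sched() loop rather than as a reproduction of it; the structure fields, the in_core flag and the priority values are all invented for the sketch.

#include <stdio.h>

struct proc {
    const char *name;
    int pri;        /* lower value = more deserving, as in V6    */
    int runnable;
    int in_core;    /* whole image resident in physical memory?  */
};

static struct proc table[] = {
    { "editor", 100, 1, 0 },
    { "daemon", 110, 1, 1 },
    { "idle",   127, 1, 1 },
};

/* pick the runnable process with the best (numerically lowest) priority */
static struct proc *pick(void)
{
    struct proc *best = NULL;
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].runnable && (!best || table[i].pri < best->pri))
            best = &table[i];
    return best;
}

int main(void)
{
    struct proc *p = pick();
    if (p && !p->in_core) {
        /* process 0's real work: swap out a long sleeper, swap p in whole */
        printf("swapping %s into core\n", p->name);
        p->in_core = 1;
    }
    if (p)
        printf("handing the CPU to %s (pri %d) and going back to sleep\n",
               p->name, p->pri);
    return 0;
}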
2) Priorities are recalculated continuously. As soon as a process with a higher priority becomes ready, it can preempt the current process at any time in user mode; a process cannot be preempted while running in kernel mode.
Simple UNIX V6 establishes the principle that a process must be preemptible at any moment (it has no concept of a time slice), with one restriction: an execution stream in kernel mode cannot be preempted. In UNIX V6 the priority of every process is recalculated every N clock interrupts; in the sources N is HZ, i.e. one second, but it can be changed before compilation. The point of this recalculation is to find a more deserving process and let it preempt the current one! This is exactly the idea behind today's Linux CFS scheduler and the Windows NT scheduler, only the implementation strategies differ. However, to protect kernel data structures, an exception is defined: an execution stream in kernel mode cannot be preempted by a user process. Although this policy was eventually broken by kernel preemption, it is still the best choice in the server field, especially in environments that demand high throughput. The system (preemptive scheduling) stays the same, while the policy (who is exempt from preemption) keeps pace with the times.
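The recalculation itself can be sketched in a few lines. The formula below follows the flavor of the V6 approach (accumulated CPU time pushes the priority value up, i.e. makes it worse, nice adds a bias, and a better candidate raises a reschedule flag), but the constants and names are illustrative rather than quoted from the V6 sources.

#define PUSER  100      /* base priority of user-mode execution        */
#define MAXPRI 127      /* worst possible priority value               */

static int runrun;      /* set when the current process should yield   */

/* recompute one process's priority from its accumulated CPU usage;
 * more CPU used => larger value => less deserving */
int recompute_priority(int cpu_ticks, int nice, int current_best)
{
    int pri = cpu_ticks / 16 + PUSER + nice;
    if (pri > MAXPRI)
        pri = MAXPRI;
    if (pri < current_best)
        runrun = 1;     /* someone now deserves the CPU more than the
                           currently running process                   */
    return pri;
}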
3) Processes run at smooth intervals, and the chance of process starvation is very small.
Compare this with the O(n) scheduler of Linux 2.4 and the O(1) scheduler of Linux 2.6. The biggest problem with those two algorithms is process starvation: all the complicated "tricks" for recalculating priorities ended up being counterproductive. Fortunately, CFS was still half-hidden behind the curtain at that time, otherwise... Actually the problem is not fatal, because Win7 faces the same issue, but a system built on Windows NT has a balancer, something like a kernel thread, that periodically scans for starved threads. So while the Linux O(1) scheduler did its best to compensate interactive processes, Windows put its effort into boosting starved threads rather than compensating interactive ones. Windows has always had a server edition and a home edition, which is a bit like choosing whether to enable kernel preemption when compiling Linux; the difference between the Windows NT server and home editions also shows up in the length of the time slice. Simple UNIX has no such problem. First, process priorities are recalculated continuously, and a process with a higher priority can preempt at any time; second, the priorities of kernel-mode execution streams are different from those of user-mode execution streams.
It is worth noting that simple UNIX process scheduling is not based on time slices. No process is given a pre-computed time slice before it runs; as soon as a more deserving process becomes ready, it immediately preempts the current one (kernel-mode execution excluded). This smooth behavior requires that the scheduling policy be preemptive, and that the kernel not be preemptible (forbidding kernel preemption was only a way to protect critical kernel data before a better solution existed; once a better protection mechanism appeared, the kernel had to become preemptible too!).
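In code, the rule boils down to checking a reschedule flag only at the boundary where the kernel is about to return to user mode, never in the middle of a kernel path. The sketch below uses invented names to show exactly that.

static int runrun;      /* set by the periodic priority recalculation   */

static void swtch(void)
{
    /* placeholder for the real context switch routine */
}

/* the only place the flag is honored: on the way back to user mode */
void return_to_user_mode(void)
{
    if (runrun) {       /* a more deserving process became ready        */
        runrun = 0;
        swtch();        /* yield before dropping back to user mode      */
    }
    /* ...restore user registers and return to user mode... */
}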
4) When a process sleeps in the kernel, the priority it wakes up with is determined by the reason it blocked. Kernel-mode priorities are higher than user-mode priorities, because the kernel needs to finish its work and get out of the way quickly.
In this respect I think the Windows NT kernel and the Solaris kernel are superb! They map every execution stream onto some priority; even interrupts have their own priorities, and even an ordinary process, while doing certain work, can run at an interrupt-level priority. Solaris goes further and schedules interrupts as threads... but you know, as early as UNIX V6 in 1975 this idea was already embodied in the system. UNIX V6 associates the priority of a sleeping process with the reason it went to sleep. If you have read Windows Internals, you will find that in Windows the priority boost after I/O completion is likewise tied to the kind of I/O: the boost a thread gets after disk I/O completes is not the same as the boost it gets after sound card I/O completes. How similar this is to UNIX V6.
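The mechanism can be sketched as a sleep routine whose caller passes the priority it wants to wake up with. The constants below follow the spirit of the V6 values (lower numbers are more urgent) but are not quoted from the source, and the routine is only a placeholder.

#include <stdio.h>

#define PRI_SWAP  -100  /* waiting for swap I/O: most urgent when woken */
#define PRI_DISK   -50  /* waiting for disk I/O                         */
#define PRI_TTY     10  /* waiting for terminal input                   */
#define PRI_CHILD   40  /* waiting for a child process to exit          */

/* placeholder: a real kernel would record the channel and the priority,
 * put the caller to sleep, and give it that priority when it is woken */
static void sleep_on(void *channel, int wakeup_priority)
{
    printf("sleeping on %p, will wake at priority %d\n",
           channel, wakeup_priority);
}

static char disk_buffer[512];

void wait_for_disk_block(void)
{
    /* the caller chooses its wake-up priority by the reason it blocks */
    sleep_on(disk_buffer, PRI_DISK);
}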
2. What Is Simplicity?
I said UNIX V6 is simple, but what does simple mean?
Simplicity means streamlining. In the next article you will find that almost every idea in the scheduling mechanisms of modern operating systems can be traced back to the scheduling mechanism of UNIX V6, and some, such as early Linux versions, even do worse than UNIX V6. The comparisons will include:
The UNIX V6 scheduler and the schedulers of all versions of Windows NT
The UNIX V6 scheduler and the Linux CFS scheduler
Simple UNIX and preemption