Several interesting explanations of the Linux kernel (the process scheduling algorithm, the page replacement algorithm, and non-linear mappings and the working set)


1. Time-slice calculation in the O(1) scheduler and the CFS scheduler

Before 2.6.23, Linux used the O(1) scheduler, a priority-based, time-slice scheduling algorithm. The "O(1)" refers only to some clever data structures: leaving dynamic compensation/penalty aside, once a process's priority is determined, its time slice is fixed. From 2.6.23 onward the kernel uses CFS, a weight-based algorithm without fixed time slices. The time a process runs on each turn is not fixed; instead, within a quasi-fixed scheduling period, each process receives time in proportion to its weight. If we still use the term "time slice", then under CFS the length of each run depends both on the process's weight and on the total number of runnable processes.
Even without considering dynamic compensation/penalty, O(1) faces a dual-slope problem. To explain it, first the process priority formula:
prio = MAX_RT_PRIO + nice + 20
where MAX_RT_PRIO is 100 and nice is any integer in the closed interval [-20, 19]. The time-slice calculation (in milliseconds) reflects the dual slope:
If prio < 120: time_slice = 20 * (140 - prio)
If prio >= 120: time_slice = 5 * (140 - prio)
Clearly, once prio is determined, each process's time slice is also determined. With 120 as the dividing line, high and low priorities use different slopes; the intent is to give high-priority processes a real advantage without weakening low-priority ones too much. From the logic of O(1) it also follows that every process must complete one round of scheduling, i.e., each process must get a chance to run once, so the length of a "round" grows with the number of processes.
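To make the dual slope concrete, here is a minimal, self-contained sketch in C using the simplified formulas above (the real kernel derives the slice through a SCALE_PRIO()-style macro, so treat the numbers as illustrative, not as kernel source):

```c
#include <stdio.h>

#define MAX_RT_PRIO 100

/* Simplified O(1) priority and time-slice rules from the formulas above. */
static int prio_from_nice(int nice)            /* nice in [-20, 19] */
{
    return MAX_RT_PRIO + nice + 20;            /* yields 100..139 */
}

static int time_slice_ms(int prio)
{
    if (prio < 120)
        return 20 * (140 - prio);              /* steep slope: high priority */
    return 5 * (140 - prio);                   /* gentle slope: low priority */
}

int main(void)
{
    int nices[] = { -20, 0, 19 };
    for (int i = 0; i < 3; i++) {
        int prio = prio_from_nice(nices[i]);
        printf("nice %3d -> prio %3d -> slice %3d ms\n",
               nices[i], prio, time_slice_ms(prio));
    }
    return 0;
}
```

Running it shows the jump at the boundary: nice -20 gets 800 ms, nice 0 gets 100 ms, nice 19 gets only 5 ms; the slice is fixed per priority, regardless of how many processes exist.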
Now let's see how CFS turns this around. The CFS scheduler is very simple and has few formulas. Still ignoring dynamic compensation/penalty, CFS is entirely weight-driven: the Linux kernel maps the 40 nice levels to 40 weights. To simplify the discussion, assume the weights are 1, 1*1.2, 1*1.2*1.2, 1*1.2*1.2*1.2, ..., each 1.2 times the previous (the real kernel table grows by roughly 1.25x per nice level). Define a fixed scheduling period; within any such period, a process runs for period * (its weight / sum of all weights). So if the number of processes grows, all processes slow down together, smoothly: each one's run per period shrinks, and the time slice is no longer fixed. "Complete fairness" means that a process with a large weight advances its virtual clock slowly, while a process with a small weight advances its virtual clock quickly; at each scheduling point (clock tick, wakeup, fork, and so on) CFS picks the process with the smallest virtual clock to run. Compared with O(1) this is a smoother scheme that delivers fairness in latency, while throughput is still divided by weight (and weight maps back to priority). O(1), by contrast, is more of a throughput-fair scheduler.
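A minimal sketch of that proportional-share arithmetic, under the article's simplifying assumption of 1.2x weight steps (the task names and the 12 ms period are made up for illustration):

```c
#include <stdio.h>

/* Toy CFS model: each task's wall-clock share of a period is
 * weight/total_weight, and its virtual clock advances as run/weight. */
struct task { const char *name; double weight; double vruntime; };

static double slice(double period, double w, double total)
{
    return period * (w / total);
}

int main(void)
{
    struct task t[] = { {"A", 1.0, 0}, {"B", 1.2, 0}, {"C", 1.44, 0} };
    double period = 12.0, total = 1.0 + 1.2 + 1.44;

    for (int i = 0; i < 3; i++) {
        double run = slice(period, t[i].weight, total);
        /* Heavier tasks accumulate virtual time more slowly, so the
         * pick-smallest-vruntime rule favors them proportionally. */
        t[i].vruntime += run / t[i].weight;
        printf("%s: runs %.2f ms of %.0f ms, vruntime += %.2f\n",
               t[i].name, run, period, run / t[i].weight);
    }
    return 0;
}
```

Note that over one full period every task's vruntime advances by the same amount (period/total_weight); that equal advance is precisely the fairness invariant CFS maintains.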
In summary, O(1) computes a fixed time slice for each process, whereas CFS computes the share of a common period that each process runs. Their starting points differ; indeed, they point in opposite directions.
Now, an evaluation. CFS is smoother and well suited to interactive processes, because interactive processes are starvation-sensitive: they do not occupy the CPU often, but when they need it, they must get it immediately. For service processes with high throughput requirements CFS is less suitable; what such a process needs is to run as long as possible once it holds the CPU, and the fixed time slice of O(1) serves that better. By the usual taxonomy, I/O-intensive processes are mostly interactive (or storage-class): they are driven by I/O and should be given the CPU at any moment, because they will not hold it long. CPU-intensive processes, on the other hand, should get the CPU less often than I/O-intensive ones, because once they get it, they occupy it for a long time. Broadly speaking, CFS fits desktop clients better, while O(1) fits servers better.
This article does not discuss two other schedulers: the Windows scheduler and the Linux BFS scheduler. The former is based on dynamic priority boosting and decay, which suits desktop applications; the latter is a priority-bucketed O(n) algorithm that disregards multi-core and NUMA scalability and is better suited to mobile devices.

2. Major and other page faults

The virtual address spaces of all processes share a limited amount of physical memory, so demand paging is unavoidable; it works because at any point in time the CPU needs only a small number of physical pages mapped. The question is: when a page fault occurs, i.e. the "present" bit in the page table entry is 0, where should the new page come from? The answer is simple: from wherever costs the least.
We must first consider what kind of page the fault is asking for; there are roughly three cases:
1) A never-mapped address, i.e. the address has not yet been mapped to any physical page.
2) The address once mapped a page, but the page was swapped out to swap space.
3) The page the address mapped belongs to a file in a filesystem, but the mapping has been removed.
For these three cases, the "minimum cost" strategy differs.
First look at case 1). This is simple: allocate a page directly from the buddy system. The cost of allocating a single page is quite small, because the buddy system keeps a per-CPU page pool, and allocating from that pool requires no lock. But is the cost small enough? It seems so, yet not absolutely. Consider a read: suppose a page was once mapped at the faulting virtual address and the mapping was later released. Part of that page's data may still sit in the CPU's cache lines, so when a fault occurs on a read of that address, we would love to get the original page back. A nice vision, but how do we track that page? Would the cost of tracking the page-to-process relationship cancel out the benefit of cache warmth? In practice this is hard, because shared memory must be considered; the relationship is many-to-many. So Linux's memory subsystem does not track it at all; instead it relies on probabilistic behavior. Releases into the per-CPU page pool are split into cold release and hot release: a hot release pushes the page onto the head of the pool, a cold release appends it to the tail, and allocation always takes from the head. With a bit of luck, a process can get back the read-only page that was just unmapped. How does the kernel make a process "lucky enough"? It is somewhat metaphysical yet practical: the kernel uses a quasi-LRU algorithm to prevent page thrashing between processes, and locality guarantees that a page a process accessed recently has a high chance of being accessed again soon.
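Here is a toy model of the hot/cold pool behavior described above; the list discipline (hot to the head, cold to the tail, allocate from the head) is the point, not the kernel's actual data structures:

```c
#include <stdio.h>

/* Toy per-CPU page pool with hot/cold release, illustration only. */
struct page { int id; struct page *next; };

static struct page *head, *tail;

/* Hot release: push on the head (page is likely still in CPU cache). */
static void release_hot(struct page *p)
{
    p->next = head;
    head = p;
    if (!tail)
        tail = p;
}

/* Cold release: append to the tail (cache lines probably evicted). */
static void release_cold(struct page *p)
{
    p->next = NULL;
    if (tail)
        tail->next = p;
    else
        head = p;
    tail = p;
}

/* Allocation always pops the head, so hot pages are reused first. */
static struct page *alloc_page(void)
{
    struct page *p = head;
    if (p) {
        head = p->next;
        if (!head)
            tail = NULL;
    }
    return p;
}

int main(void)
{
    struct page a = {1, NULL}, b = {2, NULL}, c = {3, NULL};
    release_cold(&a);   /* pool: [1]       */
    release_hot(&b);    /* pool: [2, 1]    */
    release_cold(&c);   /* pool: [2, 1, 3] */
    printf("first allocation gets page %d (the hot one)\n", alloc_page()->id);
    return 0;
}
```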
Now case 2). The kernel runs a page-reclaim daemon thread; when it finds that a rarely accessed page due for reclaim is dirty, it does not immediately start IO to write it to swap space, but first parks it in the swap cache. This gives the page one more chance to be reused without IO, and the principle of locality again stands behind the strategy. When a page fault occurs, the swap cache is searched first, and if the page is found, no IO is needed. The practical value is enormous: from the memory hierarchy we know that memory access and disk IO differ by several orders of magnitude in latency, so whatever need not be flushed from the swap cache to the swap partition isn't.
Finally case 3), which is similar to 2), except that it involves the filesystem's page cache, organized as a radix tree per file. That tree is independent of page-table reclaim: "flushing off a page belonging to a file" only removes the page-table mapping, while the page very likely still sits in the file's radix tree. If, when the fault occurs, the page is found in the radix tree, then only the mapping needs to be re-created and no disk IO is required.
In summary: avoid disk IO whenever possible. A fault serviced without disk IO is minor; one requiring IO is major; it is only a name. Today's kernels subdivide the minor category further, but that is not the point here, so we lump them together as "other".
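A hedged sketch of the resulting classification logic; lookup_swap_cache() and lookup_page_cache() below are hypothetical stand-ins for the kernel's real swap-cache and radix-tree lookups, stubbed so the program runs:

```c
#include <stdbool.h>
#include <stdio.h>

enum fault_kind { FAULT_MINOR, FAULT_MAJOR };

/* Stubs: pretend lookups, standing in for the real caches. */
static bool lookup_swap_cache(unsigned long addr) { return addr % 2; }
static bool lookup_page_cache(unsigned long addr) { return addr % 3; }

static enum fault_kind classify_fault(unsigned long addr, bool first_touch)
{
    if (first_touch)                /* case 1: fresh page from the buddy system */
        return FAULT_MINOR;
    if (lookup_swap_cache(addr))    /* case 2: still in the swap cache */
        return FAULT_MINOR;
    if (lookup_page_cache(addr))    /* case 3: still in the file's radix tree */
        return FAULT_MINOR;
    return FAULT_MAJOR;             /* nothing cached: disk IO is unavoidable */
}

int main(void)
{
    printf("%s\n", classify_fault(0x6000, false) == FAULT_MAJOR
                   ? "major (IO)" : "minor (no IO)");
    return 0;
}
```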
The LRU algorithm deserves a mention. In general, almost all operating systems use a quasi-LRU rather than true LRU, because true LRU exists only in theory; implementing it is unrealistic, not merely because the hardware cost would be huge, but more because its results are no better than quasi-LRU's, and can be worse. True LRU behaves like a stack. Spatial locality certainly matters, but consider a loop: at the loop boundary, the access pattern hits the opposite extreme of spatial locality. This is the Lévy flight! The short Lévy hop agrees with spatial locality, but the long Lévy jump is its antithesis. Incidentally, the behavior of human society as a whole also follows the Lévy-flight pattern: if quantitative change is the short hop, then qualitative change is the long jump, the fundamental principle Marx spoke of.
The Linux kernel simulates LRU with a dual-clock, second-chance algorithm, and it is very effective.
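For flavor, here is a minimal single-clock second-chance simulation, the simpler ancestor of the kernel's two-list scheme; each frame carries a referenced bit that buys it one extra pass of the hand:

```c
#include <stdio.h>
#include <string.h>

#define NFRAMES 4

static int frame[NFRAMES];   /* which page each frame holds (-1 = empty) */
static int refbit[NFRAMES];  /* referenced bit per frame */
static int hand;             /* the clock hand */

/* Sweep until a frame with a clear referenced bit is found; clear bits
 * along the way, giving each referenced frame its "second chance". */
static int evict(void)
{
    for (;;) {
        if (refbit[hand]) {
            refbit[hand] = 0;
            hand = (hand + 1) % NFRAMES;
        } else {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
    }
}

static void access_page(int page)
{
    for (int i = 0; i < NFRAMES; i++)
        if (frame[i] == page) { refbit[i] = 1; return; }   /* hit */
    int v = evict();                                       /* miss */
    printf("page %d evicts frame %d (held page %d)\n", page, v, frame[v]);
    frame[v] = page;
    refbit[v] = 1;
}

int main(void)
{
    memset(frame, -1, sizeof(frame));
    int trace[] = { 1, 2, 3, 4, 1, 5, 1, 2 };
    for (unsigned i = 0; i < sizeof(trace) / sizeof(*trace); i++)
        access_page(trace[i]);
    return 0;
}
```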

3. Non-linear mapping and the working set of the process address space

The Linux kernel uses a VMA to represent a segment of the process address space. What that segment maps is managed by the VMA itself; to the upper layers it is just a contiguous stretch of virtual memory in the address space.
In general, one part of a file corresponds to one VMA; to map a different part of the same file, you need another VMA. With luck these VMAs can sit next to each other, but if some other mapping occupies the hole between the two, things get ugly. So you need a way to "re-lay-out" the file within a single mapping, and the figure below shows the idea (a code sketch follows it):


[Figure: remapping different file pages into one contiguous VMA]
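The kernel once exposed exactly this re-layout through the remap_file_pages(2) system call (deprecated and merely emulated since kernel 3.16). A minimal sketch, assuming a file data.bin of at least three pages exists:

```c
/* Non-linear file mapping via the (deprecated) remap_file_pages(2) call. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* One VMA covering 3 pages: initially a linear mapping of pages 0..2. */
    char *base = mmap(NULL, 3 * psz, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Rewire the first page of the window to file page 2: the same virtual
     * addresses now show different file content, with no extra VMA. */
    if (remap_file_pages(base, psz, 0, 2, 0) != 0)
        perror("remap_file_pages");   /* emulated on kernels >= 3.16 */

    printf("first byte of the window now: %c\n", base[0]);
    munmap(base, 3 * psz);
    close(fd);
    return 0;
}
```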


But explaining this purely in terms of files undersells the fun. Operating systems have the concept of a working set, which is also grounded in the principle of locality. A working set maps changing content into a fixed window of virtual address space; if the CPU's cache lines are indexed by virtual address, temporal and spatial locality pay off handsomely within that window, and the TLB benefits as well. A working set defined over virtual addresses is a true decoupling of the virtual address space from physical memory.
In essence, a non-linear mapping does not necessarily target a file; its essence is to "map different content into the same segment of virtual addresses".

