As the first article in this series, let me start with the slab system. I have discussed this topic with colleagues and friends in recent days and feel it is fairly representative, hence its place at the front. Strictly by operating-system theory, process management would be more important; by my own interests, I/O management and the TCP/IP stack carry more weight. I will cover those in later articles.
The slab of the Linux kernel comes from a very simple idea: prepare in advance some data structures that will be allocated and freed frequently. However, the standard slab implementation proved too complex and too costly to maintain, so the smaller slub was derived from it. This article therefore discusses slub; wherever slab is mentioned below, slub is meant. In addition, since this article is mainly about kernel optimization rather than an introduction to the basic principles, readers who want the details of slab and its code should search for them or read the source themselves.
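For reference, this is how kernel code typically uses the slab API to prepare such a cache in advance. The API calls are the kernel's real ones; struct foo and foo_cache are placeholder names for illustration:

```c
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

/* "struct foo" and foo_cache are placeholder names for illustration. */
struct foo {
    int id;
    char payload[56];
};

static struct kmem_cache *foo_cache;

static int __init foo_init(void)
{
    /* Create a cache dedicated to struct foo objects. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

static void foo_use(void)
{
    struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

    if (!f)
        return;
    /* ... use the object ... */
    kmem_cache_free(foo_cache, f);
}
```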
A simple slab on a single CPU. The following figure shows the sequence in which a slab on a single CPU allocates and frees objects:
As you can see, it is very simple, and it fully achieves slab's original design goal.
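To make the single-CPU model concrete, here is a toy sketch in C of a slab as a page carved into equal-sized objects chained on a free list. This is an illustrative model, not the kernel's code; all names are made up:

```c
#include <stddef.h>

/* Toy model: a slab is one page carved into equal-sized objects,
 * chained through their first word on a free list. */
struct object {
    struct object *next;
};

struct slab {
    struct object *freelist;      /* head of the free-object list */
};

void *slab_alloc(struct slab *s)
{
    struct object *obj = s->freelist;

    if (obj)
        s->freelist = obj->next;  /* pop the head: O(1), no search */
    return obj;                   /* NULL means the slab is exhausted */
}

void slab_free(struct slab *s, void *p)
{
    struct object *obj = p;

    obj->next = s->freelist;      /* push back onto the head */
    s->freelist = obj;
}
```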
Scaling to multi-core CPUs. Now let us naively extend the above model to a multi-core CPU; a similar allocation sequence is shown below:
We can see that when there is only a single slab, if multiple CPUs allocate objects at the same time, conflicts are unavoidable, and the only way to resolve them is to lock the queue. But this greatly increases latency: the request for a single object starts at T0 and does not finish until T4, which is far too long.
The direct route to lock-free parallel operation on multiple CPUs is to replicate a set of identical data structures to each CPU; the one and only way is to add "per-CPU variables". For slab, the model can be expanded to look like this:
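A minimal sketch of this per-CPU extension, using the kernel's real per-CPU helpers (DEFINE_PER_CPU, get_cpu_ptr) around an illustrative structure; struct object is the same toy type as in the earlier sketch:

```c
#include <linux/percpu.h>

struct object {
    struct object *next;          /* as in the toy model above */
};

struct cpu_slab {
    struct object *freelist;      /* private to one CPU */
};

static DEFINE_PER_CPU(struct cpu_slab, cpu_slabs);

static void *slab_alloc_percpu(void)
{
    struct cpu_slab *c;
    struct object *obj;

    c = get_cpu_ptr(&cpu_slabs);  /* this CPU's copy, preemption off */
    obj = c->freelist;
    if (obj)
        c->freelist = obj->next;  /* no lock: no other CPU touches it */
    put_cpu_ptr(&cpu_slabs);
    return obj;
}
```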
If you think that is the end of the story, then there would be no point in this article.
First, let us look at a simple problem: what if one CPU's slab cache has no objects left to allocate, while the slab caches of the other CPUs still hold a large number of idle objects? As shown below:
This can happen because the demand on each per-CPU slab is closely tied to the processes/threads that execute on that CPU; for example, if CPU0 only handles networking, it will have a large demand for data structures such as skb. As for the final question: if we always chose to allocate a new page (or pages, depending on the object size and the order of the slab cache) from the buddy system, then over time slab would become unevenly distributed between CPUs and would likely consume a great deal of physical memory, which is not what we expect.
Before proceeding, we must be clear that slab needs to be balanced between CPUs, and that this must be done by slab's internal mechanisms. It is completely different from load balancing processes between CPUs: a process is governed by a core scheduling mechanism, such as time slices or the stride of a virtual clock, whereas a slab object is entirely in the hands of its user. As long as the object is still in use, the user cannot be deprived of the right to keep using it unless the user frees it himself. Therefore slab load balancing must be designed to be cooperative rather than preemptive.
All right. We now know that allocating a new page from the buddy system is not a good idea. It should be the final decision; before resorting to it, we first try another route.
Now let us pose a second question, as shown below:
No one can guarantee that the CPU that allocates a slab object and the CPU that frees it are the same CPU; no one can guarantee that a CPU does not allocate new page(s) during the lifetime of a slab object; and nothing regulates how these operations interleave in the meantime. How should these problems be solved? In fact, once you understand how they are solved, you will have thoroughly understood the whole slab framework.
Solving the problem: a tiered slab cache. Stepless speed change is always desirable.
If a CPU's slab cache is exhausted, reaching directly into the slab cache of a peer CPU is considered a reckless and immoral practice. So why not set up another level of slab cache, from which getting an object is not as simple and direct as from the per-CPU slab cache, yet not really difficult either, just slightly more expensive? Isn't this exactly the scenario the CPU's L1, L2 and L3 caches were designed for? This has in fact become the standard pattern of cache design. The same idea applies to slab, and it is precisely the slub implementation in the Linux kernel.
Now we can state the concept and explain it.
Linux kernel slab cache: an object cache model divided into three levels.
Level 1 slab cache: a list of free objects, one per CPU; objects are allocated and freed from it without locking.
Level 2 slab cache: a list of free objects on a shared page(s), one cache per CPU; allocating or freeing an object only requires locking that page(s). It is mutually exclusive with the Level 1 slab cache, and neither contains the other.
Level 3 slab cache: a list of page(s) shared by all CPUs on a NUMA node, managed in units of page(s). A page(s) obtained here is promoted into the requesting CPU's Level 1 slab cache, and the same page(s) then also exists as a Level 2 shared page(s).
Shared page(s): a page(s) occupied by one or more CPUs. Each of these CPUs can hold a list of free objects on the page(s) that does not interfere with the others', and the page(s) also has a unique Level 2 slab cache free list, which does not conflict with any of those Level 1 free lists. Multiple CPUs must contend for this Level 2 slab cache; once obtained, it can be promoted into the winner's Level 1 slab cache.
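To make the three levels concrete, here is a minimal sketch of the allocation fast path and fallback order, loosely modeled on slub's behavior. All names (struct obj_cache, level1_alloc and friends) are illustrative, not the kernel's actual symbols:

```c
struct obj_cache;                          /* opaque here, illustrative */

/* Illustrative helpers standing in for the three levels. */
void *level1_alloc(struct obj_cache *s);   /* per-CPU freelist, no lock   */
void *level2_alloc(struct obj_cache *s);   /* shared page list, page lock */
void *level3_alloc(struct obj_cache *s);   /* per-node list, node lock;
                                              promotes page(s) to Level 1 */
void *alloc_from_buddy(struct obj_cache *s);

void *tiered_alloc(struct obj_cache *s)
{
    void *obj;

    if ((obj = level1_alloc(s)))   /* fast path: no contention at all */
        return obj;
    if ((obj = level2_alloc(s)))   /* cheap contention on one page(s) */
        return obj;
    if ((obj = level3_alloc(s)))   /* wider contention on the NUMA node */
        return obj;
    return alloc_from_buddy(s);    /* last resort: brand-new page(s) */
}
```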
The slab cache is illustrated as follows:
Its behavior is as follows:
Two scenarios show the details of the general object allocation process:
In fact, for multiple CPUs sharing a page(s), there is yet another way to play it, as shown below:
The buddy system. Above we briefly experienced the Linux kernel's slab design; I kept it short, since anything too long becomes hard to digest. But in the end, if Level 3 cannot obtain page(s) either, the request finally falls through to the ultimate fallback: the buddy system.
The buddy system is designed to prevent memory allocation from fragmenting, so it tries as hard as possible to do two things:
1). Allocate memory blocks that are as large as possible.
2). Merge contiguous small chunks of memory into large chunks.
We can understand these principles through the following diagram, and a sketch of the merge rule follows it:
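Here is a minimal sketch of the merge rule, assuming the classic buddy invariant: the buddy of a block at page-frame number pfn and order o is found by flipping bit o of pfn. The free-list helpers are illustrative, not kernel symbols:

```c
#include <stdbool.h>

#define MAX_ORDER 11              /* largest block: 2^10 pages, as in Linux */

/* Illustrative helpers, not kernel symbols. */
bool buddy_is_free(unsigned long pfn, unsigned int order);
void remove_from_free_list(unsigned long pfn, unsigned int order);
void add_to_free_list(unsigned long pfn, unsigned int order);

/* The buddy of the block at page-frame number pfn and order o is the
 * block with bit o of pfn flipped. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/* On free: keep merging with the buddy as long as it is also free. */
void buddy_free(unsigned long pfn, unsigned int order)
{
    while (order < MAX_ORDER - 1) {
        unsigned long buddy = buddy_pfn(pfn, order);

        if (!buddy_is_free(buddy, order))
            break;
        remove_from_free_list(buddy, order);
        pfn &= buddy;             /* lower pfn of the merged pair */
        order++;
    }
    add_to_free_list(pfn, order);
}
```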
Note that this article is about optimization, not a popular-science introduction to the buddy system, so I assume everyone already understands the buddy system.
Since most slab cache objects are small structures of no more than one page (and not only in the slab system: memory requests larger than one page are far fewer than requests of one page or less), there is a significant demand for single-page allocations. From the buddy system's allocation principle, continuously allocating large numbers of single pages will cause a large number of blocks with order greater than 0 to be split into single pages. On a single-core CPU this is not a problem, but on a multi-core CPU every CPU performs such allocations, and the buddy system's split and merge operations involve a lot of linked-list manipulation, so the lock overhead is huge and therefore needs to be optimized!
For the buddy system's single-page allocation demand, the Linux kernel's approach is a "per-CPU single-page cache"!
Each CPU has a single-page cache pool. When a single page is needed, it can be taken from the current CPU's page pool without locking. When there are not enough pages in the pool, the system pulls a batch of pages from the buddy system into the pool; conversely, when a single page is freed, it is released into the per-CPU single-page cache.
To keep the number of pages in the per-CPU single-page cache from becoming too large or too small (too many would affect the buddy system, too few would hurt the CPU's demand), the system maintains two watermarks: when the number of cached pages falls below the low watermark, pages are fetched in bulk from the buddy system into the pool, and when the number of cached pages exceeds the high watermark, some pages are released back to the buddy system.
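A sketch of this watermark logic as described above. The kernel's actual structure for this is struct per_cpu_pages; everything below, including the low/high field names and all helpers, is illustrative:

```c
struct page;                      /* opaque here */

struct pcp_pool {
    int count;                    /* pages currently cached */
    int low, high;                /* refill / drain watermarks */
    int batch;                    /* pages moved per locked bulk operation */
};

/* Illustrative helpers: the first two manage the pool's page list only;
 * the last two talk to the buddy system under its lock and return the
 * number of pages actually moved. */
struct page *pcp_pop(struct pcp_pool *pool);
void pcp_push(struct pcp_pool *pool, struct page *page);
int refill_from_buddy(struct pcp_pool *pool, int nr);
int drain_to_buddy(struct pcp_pool *pool, int nr);

struct page *pcp_alloc(struct pcp_pool *pool)
{
    if (pool->count <= pool->low)
        pool->count += refill_from_buddy(pool, pool->batch);
    if (pool->count == 0)
        return NULL;              /* even the buddy system is empty */
    pool->count--;
    return pcp_pop(pool);         /* lock-free: the pool is per-CPU */
}

void pcp_free(struct pcp_pool *pool, struct page *page)
{
    pcp_push(pool, page);
    pool->count++;
    if (pool->count >= pool->high)
        pool->count -= drain_to_buddy(pool, pool->batch);
}
```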
Summary. In a multi-CPU operating-system kernel, the key overhead is the cost of locks. I think this was baked in from the start, because at the outset multi-core CPUs had not yet appeared: on a single-core CPU, almost all protection of shared data could be implemented simply by "disabling interrupts" and "disabling preemption". When the multi-core era arrived, operating systems were simply ported to the new platform, so synchronization was bolted on afterwards on top of the single-core design. In short, today's mainstream operating systems were created in the single-core era, born for single-core environments, and for multi-core environments their initial design may be inherently problematic.
In any case, the one and only way to optimize is to prohibit, or at least minimize, lock operations. The idea is to create a "per-CPU cache" for shared key data structures, and these caches fall into two types:
1). Data-path caches. Data structures such as the routing table can be protected with RCU; of course, creating a local routing-table cache for each CPU would also work. The question is when to update them: since all the caches are peers, a batch synchronization mechanism is necessary. (See the RCU sketch after this list.)
2). Management-mechanism caches. For example the slab object cache, whose lifecycle depends entirely on the user: it has no synchronization problem, but it does have a management problem. The idea of a hierarchical cache is excellent here. It closely resembles the CPU's L1/L2/L3 caches, using the same mechanism of smoothly increasing cost and gradually increasing capacity, and combined with well-designed swap-in/swap-out algorithms the effect is very noticeable.
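As an illustration of case 1), here is a minimal RCU sketch using the kernel's real RCU API: readers traverse the shared structure without locks, while writers publish a new version and defer freeing the old one. struct my_table and its field are placeholders for a routing table or similar:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_table {                 /* placeholder for a routing table etc. */
    int data;
};

static struct my_table __rcu *active_table;

static int table_lookup(void)
{
    struct my_table *t;
    int val = -1;

    rcu_read_lock();              /* read side: no lock, no atomic op */
    t = rcu_dereference(active_table);
    if (t)
        val = t->data;
    rcu_read_unlock();
    return val;
}

/* Writers must still serialize among themselves (e.g. a mutex, omitted). */
static void table_update(struct my_table *newt)
{
    struct my_table *old = rcu_dereference_protected(active_table, 1);

    rcu_assign_pointer(active_table, newt);  /* publish the new version */
    synchronize_rcu();                       /* wait out current readers */
    kfree(old);
}
```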
The one and only way to optimize multi-core Linux kernel paths: slab and the buddy system