Analysis of the slub allocator in Linux

Source: Internet
Author: User

In Linux memory management, the slub allocator uses slab-style object pools to manage memory used by the kernel. The original slab allocator is genuinely complicated, which put me off studying it for a long time.
But in today's Linux kernel, slab has largely been replaced by its simplified successor, slub. I recently took the time to read slub's code and noted down some of my understanding.
Although slub is implemented in the kernel, its design is also worth borrowing for user-space object pools.

The general idea of slub is similar to slab's: memory in the object pool is allocated and reclaimed in "large blocks"; each large block is then divided into "small blocks" according to the object size, and users allocate and free objects one "small block" at a time.
The structure of slub is as follows:

In addition, kmem_cache has the following members (which will be explained later):
int size;             /* space occupied by each object */
int objsize;          /* object size */
int offset;           /* offset of the "next" pointer within the space occupied by the object */
int refcount;         /* reference count */
int inuse;            /* object size excluding the next pointer */
int align;            /* alignment in bytes */
void (*ctor)(void *); /* constructor */
unsigned long min_partial; /* minimum number of pages kept on kmem_cache_node */
struct kmem_cache_order_objects oo;  /* preferred page-allocation policy (how many consecutive pages to allocate, divided into how many objects) */
struct kmem_cache_order_objects min; /* fallback (minimum) page-allocation policy */
(There are other members supporting options such as debugging and periodic memory reclaim; they are not listed here.)

 

General Structure

kmem_cache is the object-pool manager; each kmem_cache manages one type of object. All managers are linked into a doubly linked list headed by slab_caches.
This is the same as in the original slab allocator.

Memory management in kmem_cache is carried out by the member kmem_cache_node. In a NUMA (Non-Uniform Memory Access) environment, a kmem_cache maintains a group of kmem_cache_node structures, one per memory node; each kmem_cache_node manages only the memory belonging to its own node. (In a non-NUMA environment there is only one kmem_cache_node.)

The actual memory is managed via the page structure. kmem_cache_node strings a set of pages together through its partial pointer (nr_partial records the length of the list); these pages are the memory in the object pool. A page structure not only represents the memory but also contains some union fields that record how objects in that memory are allocated (these fields are meaningful only while the page belongs to the slub allocator; in other contexts they are interpreted differently).
The original slab was more complex: its page structures only managed memory and knew nothing of "objects". An additional control structure (struct slab) managed the objects, with pointer arrays inside it defining the object boundaries.

As mentioned above, memory in the object pool is allocated and reclaimed in large blocks, and a page is exactly such a large block. A page is divided into several small blocks, each holding one object. The free objects are kept on a linked list whose head is page->freelist; only unallocated objects are on the list. Each object's "next" pointer is stored inside the object, at offset kmem_cache->offset. (See the figure above.)
In slab, the "large block" was the slab structure carrying the control information; the page structure merely represented memory, a resource referenced by the slab.
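The freelist arrangement just described can be sketched in a few lines of user-space C. This is an illustrative model, not kernel code: a plain buffer stands in for a page, and `off` plays the role of kmem_cache->offset.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative model of slub's in-place freelist: each free object
 * stores the pointer to the next free object inside itself, at byte
 * offset `off` (the role of kmem_cache->offset). */

static void *get_next(void *obj, size_t off) {
    void *next;
    memcpy(&next, (char *)obj + off, sizeof next);
    return next;
}

static void set_next(void *obj, size_t off, void *next) {
    memcpy((char *)obj + off, &next, sizeof next);
}

/* Carve a raw block into `nobj` objects of `size` bytes each and thread
 * them into a freelist, as a fresh page->freelist would look. */
static void *init_freelist(char *block, size_t size, int nobj, size_t off) {
    for (int i = 0; i < nobj - 1; i++)
        set_next(block + i * size, off, block + (i + 1) * size);
    set_next(block + (size_t)(nobj - 1) * size, off, NULL);
    return block;
}

/* Allocation: pop the head object; the list head moves to its next. */
static void *freelist_pop(void **head, size_t off) {
    void *obj = *head;
    if (obj)
        *head = get_next(obj, off);
    return obj;
}
```

Because the next pointer lives inside memory that is free anyway, the list costs no extra space per object.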

Each page does not necessarily represent a single page but 2^order consecutive pages, where order is taken from oo or min in kmem_cache. When allocating, slub first tries the order value in oo to get a suitably sized run of consecutive pages (this value is computed when the kmem_cache is created: with it, enough consecutive pages are allocated that dividing the memory into small blocks leaves little waste at the tail). If that allocation fails (after long uptime, memory fragmentation makes large runs of consecutive pages hard to find), the order value in min is used instead, allocating the minimum number of consecutive pages that can hold the object (this value is also computed at kmem_cache creation).
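The trade-off between order and tail waste can be made concrete with a toy calculation. The threshold below (waste at most 1/16 of the slab) is an assumption for illustration, not the kernel's exact heuristic:

```c
/* Toy version of the order/waste trade-off: how many objects fit in
 * 2^order contiguous pages, and how much is wasted at the tail. */
enum { TOY_PAGE_SIZE = 4096 };

static int objects_per_slab(int order, int size) {
    return (TOY_PAGE_SIZE << order) / size;
}

static int slab_waste(int order, int size) {
    return (TOY_PAGE_SIZE << order) % size;
}

/* Pick the smallest order whose tail waste is at most 1/16 of the slab.
 * The threshold is an assumption for illustration, not the kernel's
 * exact calculation. */
static int pick_order(int size, int max_order) {
    for (int order = 0; order <= max_order; order++) {
        int slab = TOY_PAGE_SIZE << order;
        if (objects_per_slab(order, size) > 0 &&
            slab_waste(order, size) <= slab / 16)
            return order;
    }
    return max_order;  /* fall back to the largest allowed order */
}
```

For a 256-byte object a single page already divides evenly, while an awkward size such as 1700 bytes needs a higher order before the tail waste becomes acceptable.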

kmem_cache_node strings a group of pages together through its partial pointer; these pages must not be completely occupied (a page is divided into page->objects object-sized slots, of which page->inuse are in use; if page->objects == page->inuse, the page is full). A full page is removed from the list. A completely free page (page->inuse == 0) is usually released, unless nr_partial (the list length) is below kmem_cache's min_partial. (Since this is a pool, a certain amount of stock should be kept, and min_partial is the minimum stock. It is also computed at kmem_cache creation: a larger object size yields a larger min_partial, because larger sizes need longer runs of consecutive pages, which are harder to allocate than single pages, so more of them should be cached.)
The original slab kept three lists of slabs: "full", "partial", and "free". In slub, "free" and "partial" are merged into the partial list above, and "full" pages are not tracked at all. They need not be: a full page cannot satisfy allocations and can only respond to frees, and when an object is freed, its page structure can be found from the object's address (page structure addresses correspond to memory addresses; see "Linux memory management analysis"). Tracking full pages does make it easy to inspect the allocator's state, so in debug mode kmem_cache_node still provides a full list.
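The page-handling rules described above can be condensed into a small decision function. The enum and names here are hypothetical; this only models the policy stated in the text:

```c
/* Model of the partial-list policy: what happens to a page after one
 * object on it is freed. Names and enum are illustrative, not kernel. */
enum page_action { KEEP_ON_PARTIAL, ADD_TO_PARTIAL, RELEASE_PAGE };

static enum page_action after_free(int inuse_before, int objects,
                                   int nr_partial, int min_partial) {
    if (inuse_before == objects)
        return ADD_TO_PARTIAL;      /* was full, now has a free slot */
    if (inuse_before - 1 == 0 && nr_partial >= min_partial)
        return RELEASE_PAGE;        /* now empty and stock is sufficient */
    return KEEP_ON_PARTIAL;         /* stays on the partial list */
}
```

Note that a newly emptied page survives only while the partial list is shorter than min_partial, which is exactly the "minimum stock" idea above.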

Allocation and release

Objects are not allocated and freed directly on kmem_cache_node but on kmem_cache_cpu. A kmem_cache maintains a group of kmem_cache_cpu structures, one per CPU in the system. kmem_cache_cpu effectively provides each CPU with an allocation cache, avoiding the contention that would arise if every CPU operated on kmem_cache_node directly. It also tends to keep cached objects on a single CPU, improving that CPU's cache hit rate. A kmem_cache_cpu caches only one page.

The original slab gave each CPU an array_cache structure to cache objects; objects were organized differently in array_cache than in the slab structure, which added complexity. slub organizes objects through the page structure in both places, so the organization is uniform.

When allocating an object, slub first tries the kmem_cache_cpu. If that fails, a page is moved from kmem_cache_node to kmem_cache_cpu. There are two reasons the fast path can fail: the page on kmem_cache_cpu is full, or the node being allocated from does not match the node of the cached page. If the page is full, it is simply detached from kmem_cache_cpu (or, in debug mode, moved to the full list of the corresponding kmem_cache_node); if the node does not match, the cached page is first moved back to the partial list of its kmem_cache_node (and further, if the page is completely free and the partial list is no shorter than min_partial, the page is released).
When an object is freed, the page it belongs to can be found from its address, and the object is put back on that page. There is some special logic here, though. If the page is currently cached by a kmem_cache_cpu, nothing extra is needed; otherwise, putting the object back requires locking the page (other CPUs may be allocating or freeing objects on it at the same time). In addition, if the page was full before the free, it becomes partial afterwards and must be added to the partial list of its kmem_cache_node; if the page becomes completely free after the free, it should be released.
There is one more detail about freeing. Since the object goes back to its page, what if that page is being cached by another CPU (another CPU's kmem_cache_cpu is using it)? It doesn't matter: kmem_cache_cpu and page each have their own freelist pointer. When a page is cached by a CPU, all objects on page->freelist move to the kmem_cache_cpu's freelist (in fact, just a pointer assignment) and page->freelist becomes NULL; frees go to page->freelist. The two freelists do not interfere. But this does seem to create a problem: if a cached page's freelist becomes non-NULL because of a free, the page may then be cached by the kmem_cache_cpu of yet another CPU, so several kmem_cache_cpus may end up caching the same page. One CPU's hardware cache can then hold another CPU's objects (because CPU cache lines are not aligned to object boundaries), so a write to an object on one CPU may invalidate a cache line on another.

When a kmem_cache is created, slub computes a sensible layout for the object pool from the various parameters (see the figure above). objsize is the object's size; after alignment, it becomes inuse. The object's "next" pointer (located by offset) may be stored after inuse, expanding the space actually occupied by the object to size.
In fact, offset does not always point just past inuse (otherwise inuse could serve in its place). offset takes one of two values: inuse or 0. offset locates the next pointer, and next strings objects together on the free list. While an object needs its next pointer, it is by definition free, so the space inside the object is unused; the first word of the object can therefore serve as the next pointer, and offset is 0. In some special cases, however, the space inside the object cannot be reused for the next pointer. For example, if the kmem_cache has a constructor, the object's space is already constructed even while free; then offset equals inuse, and the next pointer must be stored after the object's space.
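Assuming the rules above, the layout calculation can be sketched as follows. This is a simplification: the real kernel also accounts for debug metadata and other options.

```c
#include <stddef.h>

/* Sketch of the layout rules: objsize -> inuse (aligned object size) ->
 * offset (where the free-list "next" pointer lives) -> size (total
 * space per object). Debug metadata and kernel options are ignored. */
struct layout { size_t size, inuse, offset; };

static size_t align_up(size_t n, size_t a) {
    return (n + a - 1) / a * a;
}

static struct layout compute_layout(size_t objsize, size_t align,
                                    int has_ctor) {
    struct layout l;
    l.inuse = align_up(objsize, align);
    if (has_ctor) {
        /* Object memory stays constructed even while free, so the next
         * pointer cannot overlay it: store it after the object. */
        l.offset = l.inuse;
        l.size = align_up(l.inuse + sizeof(void *), align);
    } else {
        /* A free object's memory is reusable: overlay next at offset 0. */
        l.offset = 0;
        l.size = l.inuse;
    }
    return l;
}
```

For a 20-byte object aligned to 8 bytes, size stays 24 without a constructor, but grows to 32 with one, since the next pointer must live after the object.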

About kmem_cache_cpu

The earlier sections on object allocation and freeing focused on the process. Now let's look more closely at the role of kmem_cache_cpu.

Without kmem_cache_cpu, the object allocation process would be:
1. Select a page from the partial list of the corresponding kmem_cache_node;
2. Select an object from the freelist of the selected page.
This is the most basic allocation process.

But this process has problems. The partial list of kmem_cache_node is global, and so is the freelist of each page:
1. Step 1 must lock access to the partial list;
2. Step 2 must also lock access to page->freelist;
3. An object may be freed on cpu1 and immediately allocated on cpu2, which is bad for CPU caching.

kmem_cache_cpu was introduced to address these problems. Each CPU gets a kmem_cache_cpu instance that caches the page selected in step 1, so step 1 needs no lock; and the objects on that page are used on the same CPU for a while, which helps CPU caching.
The freelist inside kmem_cache_cpu is what avoids locking in step 2.

Assume there were no kmem_cache_cpu->freelist, and page->freelist initially holds four objects: 1, 2, 3, and 4. Consider this sequence of events:
1. The page is cached by cpu1, and objects 1 and 2 are allocated;
2. Because the node requested on cpu1 does not match the page's node, the page is put back on the kmem_cache_node partial list, leaving objects 3 and 4 on page->freelist;
3. The page is then cached by cpu2 (it was put back on the partial list in the previous step).
Now page->freelist can be accessed by both cpu1 and cpu2: when object 1 or 2 is freed (they were allocated on cpu1), cpu1 will touch page->freelist; and cpu2 obviously touches page->freelist when allocating.

To avoid locking, kmem_cache_cpu maintains its own freelist and takes over all objects from page->freelist.
That way, cpu1 deals only with page->freelist while cpu2 deals only with kmem_cache_cpu's freelist, so no lock is needed.
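The takeover of page->freelist and the independence of the two lists can be modeled in miniature. All names here are illustrative; the real kernel uses atomic operations, and the remote free takes a page lock:

```c
#include <stddef.h>
#include <string.h>

/* Miniature of the two freelists: the per-CPU cache takes over all of
 * page->freelist in one pointer move, and a free from another CPU
 * pushes onto page->freelist without touching the per-CPU list. */

static void *get_next(void *obj, size_t off) {
    void *n;
    memcpy(&n, (char *)obj + off, sizeof n);
    return n;
}

static void set_next(void *obj, size_t off, void *n) {
    memcpy((char *)obj + off, &n, sizeof n);
}

struct toy_page { void *freelist; };
struct toy_cpu  { struct toy_page *page; void *freelist; };

/* A CPU starts caching a page: grab the whole free chain at once. */
static void cpu_cache_page(struct toy_cpu *c, struct toy_page *p) {
    c->page = p;
    c->freelist = p->freelist;
    p->freelist = NULL;          /* the page's own list is now empty */
}

/* Fast-path allocation: pop from the per-CPU list, no lock needed. */
static void *cpu_alloc(struct toy_cpu *c, size_t off) {
    void *obj = c->freelist;
    if (obj)
        c->freelist = get_next(obj, off);
    return obj;
}

/* A free from another CPU goes to page->freelist, leaving the caching
 * CPU's freelist untouched (the kernel locks the page here). */
static void remote_free(struct toy_page *p, void *obj, size_t off) {
    set_next(obj, off, p->freelist);
    p->freelist = obj;
}
```

After a remote free, page->freelist is non-empty again even though another CPU is still caching the page, which is exactly the situation discussed in the next section.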

 

A page used by multiple CPUs at the same time

As mentioned above, objects on the same page may be cached by different CPUs, which can hurt CPU caching. But it seems this cannot be avoided.

First, in its initial state, right after a page is handed to slub, all objects belonging to the page are free and live on page->freelist.
The page may then be cached by some CPU's kmem_cache_cpu, say cpu0's; that kmem_cache_cpu takes over all of the page's objects, and page->freelist becomes empty.
Next, some of the page's objects may be allocated on cpu0.
Then, because of a NUMA node mismatch, the page may be detached from cpu0's kmem_cache_cpu; page->freelist now holds the unallocated objects (the others having been allocated on cpu0).

At this point, some objects of the page are in use on cpu0, while others sit on page->freelist.
There are two options:
1. Do not put the page back on the partial list, preventing other CPUs from using it;
2. Put the page back on the partial list, letting other CPUs use it.

With the first option, objects of the same page are never cached by different CPUs. But the page cannot be used again until cpu0 caches it once more, or until every object cpu0 took from it has been freed, at which point it can go back on the partial list or be released.
That way, a page with free objects could be unusable for some time (in extreme cases, forever). The system would be rather uncontrollable...

slub chooses the second option: the page goes back on the partial list and can immediately be used by other CPUs. So the problem of objects from the same page being cached by different CPUs is simply unavoidable...

 

Vs Slab

Compared with slab, slub has another interesting feature. When creating a new object pool, if an existing kmem_cache has a size equal to or slightly larger than the requested one, no new kmem_cache is created; the existing one is reused instead. That is why kmem_cache maintains a refcount (reference count), indicating how many times it is being reused.

In addition, slub drops a rather interesting feature of slab: coloring.
What is coloring? When a chunk of memory is divided into small blocks by object size, the division may not come out even, leaving some leftover space at the edges. Coloring puts this leftover to use: the starting address of the "small blocks" is not always offset 0 within the "large block" but floats between 0 and the leftover size. This gives the low-order address bits of objects of the same type more variety.
Why bother? Because of the CPU cache. As we know from operating-system principles, to improve memory-access efficiency the CPU provides a cache, so there is a mapping from memory to cache. When an instruction accesses a memory address, the CPU first checks whether that address is already cached.
How does the memory-to-cache mapping work? In other words, how does the CPU know whether a memory address is cached?
An extreme design is "fully associative mapping": any memory address can map to any cache location. But then, given an address, the CPU has too many possible cache locations to check; it would need a huge mapping table and a lot of lookup time to decide whether an address is cached. That is undesirable.
So cache mappings are always constrained: a memory address can map only to certain cache locations. Typically, the low-order bits of the memory address determine where the memory is cached (for example, cache_location = address % cache_size).
Returning to slab: coloring reduces the probability that objects of the same type share the same low-order address bits, and thus the probability that those objects conflict in their cache mapping.
What is this good for? Objects of the same type are often used together, in arrays, linked lists, vectors, and so on. When we traverse such a collection, if every object can stay in the CPU cache, the traversal will inevitably run faster. That is the point of coloring.
slub drops coloring because it works harder to use memory fully and to minimize leftover space in the first place. Moreover, since a kmem_cache can be reused by several object types of similar size, the more it is reused, the less coloring would mean.
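A toy direct-mapped cache makes the formula concrete, and shows how a coloring offset changes which line an object maps to. The cache geometry below (64-byte lines, 8 KB cache) is made up for illustration:

```c
/* Toy direct-mapped cache: the low-order address bits select the line,
 * as in cache_line = (addr / LINE_SIZE) % NLINES. */
enum { LINE_SIZE = 64, NLINES = 128 };  /* 8 KB toy cache */

static unsigned cache_line(unsigned long addr) {
    return (addr / LINE_SIZE) % NLINES;
}

/* Coloring shifts a slab's first object by `color` cache lines, so
 * same-type objects in different slabs land on different lines. */
static unsigned long colored_start(unsigned long slab_base, int color) {
    return slab_base + (unsigned long)color * LINE_SIZE;
}
```

Without coloring, the first objects of two aligned slabs map to the same line and evict each other; a color offset of one line separates them.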

