Analyzing slab allocation, starting from kmalloc


PS: These notes are based on the Linux 4.9 source code. Some code is omitted for brevity; '-->' marks a function called from inside the preceding one.

First, look at /proc/slabinfo.

As the figure above shows, the slab allocator manages many kinds of objects. According to ULK (Understanding the Linux Kernel), caches fall into two categories: general caches and specific (private) caches.
What is the difference? A specific cache is created with kmem_cache_create() and holds one particular type of object, such as the familiar dentry, buffer_head and task_struct. A general cache is not dedicated to any one type; it serves kmalloc allocations, for example kmalloc-8, kmalloc-16 and dma-kmalloc-8.
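
To make the kmalloc-N naming concrete, here is a minimal user-space sketch of how a request size could map to a general cache. It is not kernel code: kmalloc_class_size is a hypothetical helper, and the minimum class of 8 bytes and the 96 and 192 byte classes are assumptions that depend on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function: return the object size of the
 * general cache that would serve a request of `size` bytes, assuming a
 * minimum class of 8 bytes, the two non-power-of-two classes 96 and 192,
 * and power-of-two classes otherwise. No upper limit is modelled.
 */
size_t kmalloc_class_size(size_t size)
{
    size_t class_size = 8;

    if (size == 0)
        return 0;               /* the kernel hands back a special pointer here */
    if (size > 64 && size <= 96)
        return 96;
    if (size > 128 && size <= 192)
        return 192;
    while (class_size < size && class_size <= SIZE_MAX / 2)
        class_size <<= 1;       /* round up to the next power of two */
    return class_size;
}
```

Under these assumptions a 9-byte request is served from kmalloc-16 and a 70-byte request from kmalloc-96.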

Now follow the slab allocation path, starting from kmalloc:

static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
                      unsigned long caller)
{
    struct kmem_cache *cachep;
    void *ret;

    /* kmalloc_slab() picks the kmem_cache that matches the requested size */
    cachep = kmalloc_slab(size, flags);
    if (unlikely(ZERO_OR_NULL_PTR(cachep)))
        return cachep;
    /* analyzed below */
    ret = slab_alloc(cachep, flags, caller);
    ...
    return ret;
}

/* slab_alloc() eventually reaches ____cache_alloc() */
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
    void *objp;
    struct array_cache *ac;

    /* get the CPU-local cache */
    ac = cpu_cache_get(cachep);
    if (likely(ac->avail)) {    /* there is a free obj */
        ac->touched = 1;
        objp = ac->entry[--ac->avail];

        STATS_INC_ALLOCHIT(cachep);
        goto out;
    }

    STATS_INC_ALLOCMISS(cachep);
    /* the CPU-local cache has no free obj, go to the next allocation step */
    objp = cache_alloc_refill(cachep, flags);
    /* 'ac' may have been updated by cache_alloc_refill(), so fetch it again */
    ac = cpu_cache_get(cachep);

out:
    if (objp)
        kmemleak_erase(&ac->entry[ac->avail]);
    return objp;
}

First, look at struct array_cache:

struct array_cache {
    unsigned int avail;      /* number of objects currently available on this CPU */
    unsigned int limit;      /* maximum number of objects in the per-CPU cache; beyond this, objects are handed back to the shared cache or the slabs */
    unsigned int batchcount; /* number of objects moved in or out in one go */
    /*
     * When are objects moved in and out?
     * In: when the array_cache is empty (avail == 0), up to batchcount
     * objects are pulled from the shared cache; if the shared cache has
     * none, they are taken from the slabs.
     * Out: when the local cache is full, batchcount objects are moved
     * out of it.
     * (PS: more details follow below.)
     */
    unsigned int touched;    /* set when the local cache was used recently */
    /* older kernels also had a spinlock_t lock here; the 4.9 definition does not */
    void *entry[];           /*
                              * Must have this definition in here for the proper
                              * alignment of array_cache. Also simplifies accessing
                              * the entries.
                              */
    /* entry[] holds the addresses of the cached free objects */
};

cachep->cpu_cache is the struct array_cache __percpu *cpu_cache member of struct kmem_cache.

ac = cpu_cache_get(cachep);
--> return this_cpu_ptr(cachep->cpu_cache);

Why allocation tries the per-CPU cache first:
Each CPU has its own hardware caches (L1, L2, L3), so an object that was freed on a CPU is likely to still be in that CPU's hardware cache.
The kernel therefore maintains a cpu_cache for each CPU. When a new object is needed, it first tries to take an object of the right size from the current CPU's list of free objects (cpu_cache). On a hit, the object's memory is probably still in the hardware cache (this is not guaranteed, since other data may have evicted it), so nothing has to be swapped in and the access is much faster. This per-CPU free object list is empty once system initialization completes; objects are added to it only when they are freed.

In other words, allocation first tries the CPU-local cache. If it holds a free object, that object is used directly; otherwise the second step runs. When does the second step run?

When ac->avail == 0. avail is the number of objects available on the current CPU, so 0 means there is nothing to hand out.
If there is a free object, it is handed out directly:

objp = ac->entry[--ac->avail];

ac points to this CPU's struct array_cache, and ac->entry[] is the array of pointers to this cache's free objects. As the expression shows, allocation starts from the last entry: the object at index avail - 1 is the most recently freed, hence the hottest and the most likely to still be in the CPU's hardware cache.
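
The last-in, first-out behaviour of entry[] can be sketched in user space as follows. This is a simplified model, not the kernel structure: struct ac_model, ac_put, ac_get and the capacity AC_LIMIT are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define AC_LIMIT 4                      /* hypothetical capacity */

struct ac_model {
    unsigned int avail;                 /* number of cached free objects */
    void *entry[AC_LIMIT];              /* stack of free object pointers */
};

/* Free path: push the object on top of the stack. Returns 0 if full. */
int ac_put(struct ac_model *ac, void *objp)
{
    if (ac->avail == AC_LIMIT)
        return 0;
    ac->entry[ac->avail++] = objp;
    return 1;
}

/* Allocation fast path: pop the most recently freed object, or NULL. */
void *ac_get(struct ac_model *ac)
{
    if (!ac->avail)
        return NULL;                    /* the kernel would refill here */
    return ac->entry[--ac->avail];
}
```

Putting a and then b into an empty model and calling ac_get twice returns b first and a second, which is the "hottest object first" order described above.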

static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
    int batchcount;
    struct kmem_cache_node *n;
    struct array_cache *ac, *shared;
    int node;
    void *list = NULL;
    struct page *page;

    check_irq_off();
    /* get the node id (PS: if NUMA nodes are unfamiliar, read up on them first) */
    node = numa_mem_id();

    ac = cpu_cache_get(cachep);     /* get the CPU-local cache */
    batchcount = ac->batchcount;    /* number of objects to fetch in one go */
    /*
     * BATCHREFILL_LIMIT is 16. The branch is taken when the CPU-local cache
     * has not been used recently and batchcount is greater than 16.
     * The kernel comment below explains why.
     */
    if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
        /*
         * If there was little recent activity on this cache, then
         * perform only a partial refill. Otherwise we could generate
         * refill bouncing.
         */
        batchcount = BATCHREFILL_LIMIT;
    }
    /*
     * Get the kmem_cache_node. Why per node? struct kmem_cache contains
     *     struct kmem_cache_node *node[MAX_NUMNODES];
     * MAX_NUMNODES is the number of nodes, so every node maintains its own
     * kmem_cache_node, matching that node's local physical memory.
     */
    n = get_node(cachep, node);

    BUG_ON(ac->avail > 0 || !n);
    shared = READ_ONCE(n->shared);
    /* no free obj available (no shared cache, or no usable obj in it) */
    if (!n->free_objects && (!shared || !shared->avail))
        goto direct_grow;

    spin_lock(&n->list_lock);
    shared = READ_ONCE(n->shared);

    /* See if we can refill from the shared array */
    /* if objects were transferred we are done; transfer_objects is analyzed below */
    if (shared && transfer_objects(ac, shared, batchcount)) {
        shared->touched = 1;
        goto alloc_done;
    }

    while (batchcount > 0) {
        /* Get slab alloc is to come from. */
        page = get_first_slab(n, false);    /* analyzed below */
        if (!page)      /* no slab page available, jump to must_grow */
            goto must_grow;

        check_spinlock_acquired(cachep);

        batchcount = alloc_block(cachep, ac, page, batchcount);
        fixup_slab_list(cachep, n, page, &list);
    }

    /* below, a new slab is allocated */
must_grow:
    n->free_objects -= ac->avail;
alloc_done:
    spin_unlock(&n->list_lock);
    fixup_objfreelist_debug(cachep, &list);

direct_grow:
    if (unlikely(!ac->avail)) {
        /* Check if we can use obj in pfmemalloc slab */
        if (sk_memalloc_socks()) {
            void *obj = cache_alloc_pfmemalloc(cachep, n, flags);

            if (obj)
                return obj;
        }

        /* this is the main point of interest, analyzed further below */
        page = cache_grow_begin(cachep, gfp_exact_node(flags), node);

        /*
         * cache_grow_begin() can reenable interrupts,
         * then ac could change.
         */
        ac = cpu_cache_get(cachep);
        if (!ac->avail && page)
            alloc_block(cachep, ac, page, batchcount);
        cache_grow_end(cachep, page);

        if (!ac->avail)
            return NULL;
    }
    ac->touched = 1;

    return ac->entry[--ac->avail];
}

static int transfer_objects(struct array_cache *to,
        struct array_cache *from, unsigned int max)
{
    /* Figure out how many entries to transfer */
    /*
     * from->avail:            objects available in the shared cache
     * max:                    batchcount, the most to fetch in one go
     * to->limit - to->avail:  room left in the CPU-local cache
     */
    int nr = min3(from->avail, max, to->limit - to->avail);

    if (!nr)
        return 0;

    /* copy nr obj pointers from the shared cache into the local cache */
    memcpy(to->entry + to->avail, from->entry + from->avail - nr,
            sizeof(void *) * nr);

    from->avail -= nr;
    to->avail += nr;
    return nr;
}
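
A user-space sketch of the same transfer step may help in following the min3 bound and the pointer arithmetic. It is a model under stated assumptions: struct xfer_cache, xfer_objects, min3u and the fixed capacity XFER_CAP are hypothetical stand-ins, not kernel names.

```c
#include <assert.h>
#include <string.h>

#define XFER_CAP 8                      /* hypothetical array capacity */

struct xfer_cache {
    unsigned int avail;                 /* entries currently stored */
    unsigned int limit;                 /* maximum entries allowed */
    void *entry[XFER_CAP];
};

static unsigned int min3u(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a < b ? a : b;

    return m < c ? m : c;
}

/* Move up to max pointers from the top of `from` to the top of `to`. */
unsigned int xfer_objects(struct xfer_cache *to, struct xfer_cache *from,
                          unsigned int max)
{
    unsigned int nr = min3u(from->avail, max, to->limit - to->avail);

    if (!nr)
        return 0;
    /* the nr newest entries of `from` are appended to `to` */
    memcpy(to->entry + to->avail, from->entry + from->avail - nr,
           sizeof(void *) * nr);
    from->avail -= nr;
    to->avail += nr;
    return nr;
}
```

With 5 entries in the source, max set to 4 and room for 3 in the destination, 3 entries move, and they are the 3 most recently added ones.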

static struct page *get_first_slab(struct kmem_cache_node *n, bool pfmemalloc)
{
    struct page *page;

    /* take a page from slabs_partial */
    page = list_first_entry_or_null(&n->slabs_partial,
            struct page, lru);
    if (!page) {
        n->free_touched = 1;
        /* take a page from slabs_free */
        page = list_first_entry_or_null(&n->slabs_free,
                struct page, lru);
    }

    /* pfmemalloc == false here, so this branch is not analyzed in this article */
    if (sk_memalloc_socks())
        return get_valid_first_slab(n, page, pfmemalloc);

    return page;
}

static __always_inline int alloc_block(struct kmem_cache *cachep,
        struct array_cache *ac, struct page *page, int batchcount)
{
    /*
     * There must be at least one object available for
     * allocation.
     */
    BUG_ON(page->active >= cachep->num);

    while (page->active < cachep->num && batchcount--) {
        STATS_INC_ALLOCED(cachep);
        STATS_INC_ACTIVE(cachep);
        STATS_SET_HIGH(cachep);

        /* move objects from the slab into the CPU-local cache */
        ac->entry[ac->avail++] = slab_get_obj(cachep, page);
    }

    return batchcount;
}

/* The actual allocation step follows; if it is unclear, see the description below. */
static void *slab_get_obj(struct kmem_cache *cachep, struct page *page)
{
    void *objp;

    objp = index_to_obj(cachep, page, get_free_obj(page, page->active));
    page->active++;

#if DEBUG
    if (cachep->flags & SLAB_STORE_USER)
        set_store_user_dirty(cachep);
#endif

    return objp;
}

static inline freelist_idx_t get_free_obj(struct page *page, unsigned int idx)
{
    return ((freelist_idx_t *)page->freelist)[idx];
}

static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
                 unsigned int idx)
{
    /*
     * page->s_mem points to the first obj, cache->size is the obj size and
     * idx is the index of the obj inside the slab, read from the freelist
     * slot at position page->active.
     */
    return page->s_mem + cache->size * idx;
}
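
The index arithmetic above can be tried out in user space with a small model. This is only a sketch: struct slab_model, model_index_to_obj, model_get_obj and the constants NUM_OBJS and OBJ_SIZE are hypothetical, and the freelist is kept as a separate array for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OBJS 4                      /* hypothetical cachep->num */
#define OBJ_SIZE 32                     /* hypothetical cachep->size */

typedef unsigned char idx_t;            /* stands in for freelist_idx_t */

struct slab_model {
    unsigned char mem[NUM_OBJS * OBJ_SIZE]; /* object area, like s_mem */
    idx_t freelist[NUM_OBJS];           /* object indices */
    unsigned int active;                /* objects handed out so far */
};

/* Object index -> object address, as index_to_obj() does. */
void *model_index_to_obj(struct slab_model *s, unsigned int idx)
{
    return s->mem + (size_t)OBJ_SIZE * idx;
}

/* Take the next free object: read freelist[active], then bump active. */
void *model_get_obj(struct slab_model *s)
{
    void *objp = model_index_to_obj(s, s->freelist[s->active]);

    s->active++;
    return objp;
}
```

If the freelist holds {2, 0, 3, 1}, the first call returns the object at offset 2 * OBJ_SIZE and the second the object at offset 0, so the order of allocation is decided by the freelist contents and not by the position of the objects.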

We now know how a free obj is taken, but not how a slab as a whole is laid out. So first look at the structure of a slab, as shown in the figure:

The 4.9 kernel differs from earlier versions in that it places the freelist after the objects.
Next, analyze how this layout comes about.
Two pieces of code are involved:

1. the part of cache_alloc_refill after the direct_grow label
2. kmem_cache_create, which creates a kmem_cache

The following sections do not walk through the code in detail, since there is too much of it to read comfortably; they only explain how the layout in the figure comes about.
Start with the first piece, the code after direct_grow:

page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
    /* allocate physical pages */
    --> page = kmem_getpages(cachep, local_flags, nodeid);
    /* analyzed below */
    --> freelist = alloc_slabmgmt(cachep, page, offset,
            local_flags & ~GFP_CONSTRAINT_MASK, page_node);
    /* analyzed below */
    --> cache_init_objs(cachep, page);

static void *alloc_slabmgmt(struct kmem_cache *cachep,
        struct page *page, int colour_off,
        gfp_t local_flags, int nodeid)
{
    void *freelist;
    void *addr = page_address(page);

    /* page->s_mem points colour_off bytes past the start of the page */
    page->s_mem = addr + colour_off;
    /* number of obj allocated so far, initialized to 0 */
    page->active = 0;

    /* which branch is taken depends on the flags of cachep */
    if (OBJFREELIST_SLAB(cachep))
        freelist = NULL;
    else if (OFF_SLAB(cachep)) {
        /* the freelist is kept apart from the obj and is allocated separately */
        /* Slab management obj is off-slab. */
        freelist = kmem_cache_alloc_node(cachep->freelist_cache,
                local_flags, nodeid);
        if (!freelist)
            return NULL;
    } else {
        /* the freelist lives in the same slab, right after the obj */
        /* We will use last bytes at the slab for freelist */
        freelist = addr + (PAGE_SIZE << cachep->gfporder) -
                cachep->freelist_size;
        /*
         * The off-slab case is not covered here. The freelist starts
         * freelist_size bytes before the end of the slab.
         * How large is freelist_size, and what exactly does freelist
         * point to? Both are analyzed below.
         */
    }

    return freelist;
}

cache_init_objs(cachep, page);
    --> shuffled = shuffle_freelist(cachep, page);
                count = cachep->num;
                for (i = 0; i < count; i++)
                    set_free_obj(page, i, i);
            --> set_free_obj(page, i, i);

static inline void set_free_obj(struct page *page,
                    unsigned int idx, freelist_idx_t val)
{
    /* idx runs from 0 to count - 1 */
    /* How large is cachep->num? Analyzed below. */
    ((freelist_idx_t *)(page->freelist))[idx] = val;
}

So far we know the structure of a slab, but not yet how its parts are spaced or how large they are. That is analyzed next.
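
The on-slab layout described so far (objects starting at the colour offset, freelist in the last bytes of the slab, freelist initialized to 0..num-1) can be sketched as plain offset arithmetic. The names struct slab_layout, on_slab_layout and init_freelist are hypothetical, and a one-byte freelist entry is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct slab_layout {
    size_t s_mem_off;                   /* offset of the first object */
    size_t freelist_off;                /* offset of the freelist array */
};

/*
 * On-slab layout: objects start colour_off bytes into the slab and the
 * freelist occupies the last num * idx_size bytes.
 */
struct slab_layout on_slab_layout(size_t slab_size, size_t colour_off,
                                  unsigned int num, size_t idx_size)
{
    struct slab_layout l;

    l.s_mem_off = colour_off;
    l.freelist_off = slab_size - num * idx_size;
    return l;
}

/* Identity initialization, as the non-shuffled path does: freelist[i] = i. */
void init_freelist(unsigned char *freelist, unsigned int num)
{
    unsigned int i;

    for (i = 0; i < num; i++)
        freelist[i] = (unsigned char)i;
}
```

For a 4096-byte slab with 15 objects of 256 bytes and a colour offset of 64, the objects occupy bytes 64 to 3903 and the freelist starts at byte 4081, so the two regions do not overlap.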

The second step is to analyze kmem_cache_create:

struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align,
        unsigned long flags, void (*ctor)(void *))
    --> create_cache
        --> __kmem_cache_create

/* a few important parts are analyzed first */
int __kmem_cache_create(struct kmem_cache *cachep, unsigned long flags)
{
    size_t size = cachep->size;
    /*
     * The code that computes size is scattered. In short, size is the
     * obj size plus padding added for alignment.
     */
    ...
    if (ralign < cachep->align) {
        ralign = cachep->align;
    }
    ...
    /* cachep->align is the alignment passed to kmem_cache_create; ralign starts as the word size */
    cachep->align = ralign;
    /* cache line size, usually 64 bytes (note: used for coloring, analyzed later) */
    cachep->colour_off = cache_line_size();
    /* Offset must be a multiple of the alignment. */
    if (cachep->colour_off < cachep->align)
        cachep->colour_off = cachep->align;
    ...
    if (set_objfreelist_slab_cache(cachep, size, flags)) {    /* analyzed below */
        flags |= CFLGS_OBJFREELIST_SLAB;
        goto done;
    }
    ...
done:
    /* freelist_size = number of freelist entries * size of one entry */
    cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
    cachep->flags = flags;
    cachep->allocflags = __GFP_COMP;
    if (flags & SLAB_CACHE_DMA)
        cachep->allocflags |= GFP_DMA;
    cachep->size = size;    /* size is the obj size + padding */
    cachep->reciprocal_buffer_size = reciprocal_value(size);
    ....

static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,
        size_t size, unsigned long flags)
{
    size_t left;

    cachep->num = 0;

    if (cachep->ctor || flags & SLAB_DESTROY_BY_RCU)
        return false;

    /* analyzed below */
    left = calculate_slab_order(cachep, size,
            flags | CFLGS_OBJFREELIST_SLAB);
    if (!cachep->num)
        return false;

    /* freelist_size must not be larger than object_size */
    if (cachep->num * sizeof(freelist_idx_t) > cachep->object_size)
        return false;

    /* the number of colours of the slab */
    cachep->colour = left / cachep->colour_off;

    return true;
}

calculate_slab_order(cachep, size, flags | CFLGS_OBJFREELIST_SLAB);
    --> num = cache_estimate(gfporder, size, flags, &remainder);

static unsigned int cache_estimate(unsigned long gfporder, size_t buffer_size,
        unsigned long flags, size_t *left_over)
{
    unsigned int num;
    size_t slab_size = PAGE_SIZE << gfporder;

    if (flags & (CFLGS_OBJFREELIST_SLAB | CFLGS_OFF_SLAB)) {
        num = slab_size / buffer_size;
        *left_over = slab_size % buffer_size;
    } else {    /* this is the case we look at */
        /*
         * num is the slab size divided by
         * (buffer_size + sizeof(freelist_idx_t)).
         * Note: buffer_size equals cachep->size.
         */
        num = slab_size / (buffer_size + sizeof(freelist_idx_t));
        /* remaining space, which is what slab coloring uses */
        *left_over = slab_size %
            (buffer_size + sizeof(freelist_idx_t));
    }

    return num;
}
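
A worked instance of the on-slab branch makes the numbers tangible. The sketch below restates the division in user space; estimate_on_slab is a hypothetical name, and the 4096-byte page and one-byte freelist entry are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u           /* assumed page size */

/*
 * User-space restatement of the on-slab branch: each object costs
 * buffer_size bytes plus one freelist entry of idx_size bytes.
 */
unsigned int estimate_on_slab(unsigned int gfporder, size_t buffer_size,
                              size_t idx_size, size_t *left_over)
{
    size_t slab_size = (size_t)MODEL_PAGE_SIZE << gfporder;
    size_t per_obj = buffer_size + idx_size;

    *left_over = slab_size % per_obj;
    return (unsigned int)(slab_size / per_obj);
}
```

For order 0 and 256-byte objects, each object costs 257 bytes, so 15 objects fit (15 * 257 = 3855) and 241 bytes are left over. With a 64-byte colour_off that leftover gives 241 / 64 = 3 colours.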


That covers the structure of a slab. The remaining topic is slab coloring:

First find the source, that is, where the colour_off in the slab structure comes from. Go back to the place where a new slab is allocated,
in the cache_grow_begin function:

    n->colour_next++;    /* advance from the colour used by the previous slab */
    if (n->colour_next >= cachep->colour)    /* cachep->colour was set above */
        n->colour_next = 0;

    offset = n->colour_next;
    if (offset >= cachep->colour)
        offset = 0;

    offset *= cachep->colour_off;    /* this offset is the colour_off argument of alloc_slabmgmt */

Where does cachep->colour come from?
The cache_estimate function above reports the remaining space in left_over. Going back up the call chain to
set_objfreelist_slab_cache, we find:
cachep->colour = left / cachep->colour_off;
left is the left_over mentioned above.
So now we know where cachep->colour comes from. Coloring is simply an offset.
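
The effect of the colour counter is easiest to see by stepping it a few times. The following user-space sketch mirrors the lines quoted from cache_grow_begin; struct colour_model and next_slab_offset are hypothetical names, and the counter is assumed to start at 0.

```c
#include <assert.h>
#include <stddef.h>

struct colour_model {
    unsigned int colour;                /* number of colours, like cachep->colour */
    unsigned int colour_off;            /* bytes per colour step, like cachep->colour_off */
    unsigned int colour_next;           /* per-node counter, like n->colour_next */
};

/* Return the byte offset of the first object in the next new slab. */
size_t next_slab_offset(struct colour_model *c)
{
    size_t offset;

    c->colour_next++;
    if (c->colour_next >= c->colour)
        c->colour_next = 0;

    offset = c->colour_next;
    if (offset >= c->colour)            /* covers the colour == 0 case */
        offset = 0;

    return offset * c->colour_off;
}
```

With 3 colours and a 64-byte step, successive slabs start their objects at offsets 64, 128, 0, 64 and so on, so objects with the same index in different slabs tend to land on different cache lines.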

Next, how an obj inside a slab is taken and released:
Taking an obj is done by the slab_get_obj function, whose code was already shown above.

How do we find the release path? We are analyzing kmalloc, and its counterpart is of course kfree. Following kfree leads to the release function slab_put_obj.

static void slab_put_obj(struct kmem_cache *cachep,
            struct page *page, void *objp)
{
    /* get the index of the obj being released */
    unsigned int objnr = obj_to_index(cachep, page, objp);
#if DEBUG    /* debug only, can be ignored */
    unsigned int i;

    /* Verify double free bug */
    for (i = page->active; i < cachep->num; i++) {
        if (get_free_obj(page, i) == objnr) {
            pr_err("slab: double free detected in cache '%s', objp %p\n",
                   cachep->name, objp);
            BUG();
        }
    }
#endif
    page->active--;
    if (!page->freelist)
        page->freelist = objp + obj_offset(cachep);

    /* write objnr (the index of the obj just released) into the freelist slot at page->active */
    set_free_obj(page, page->active, objnr);
}
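
Taken together, slab_get_obj and slab_put_obj treat the freelist as a stack of object indices with page->active as the stack pointer. A minimal user-space sketch of that round trip follows; struct fl_slab, fl_get and fl_put are hypothetical names, and object addresses are left out so that only indices are tracked.

```c
#include <assert.h>

#define FL_NUM 4                        /* hypothetical cachep->num */

struct fl_slab {
    unsigned char freelist[FL_NUM];     /* stands in for the freelist_idx_t array */
    unsigned int active;                /* number of objects in use */
};

/* Allocation: hand out the index stored at freelist[active]. */
unsigned int fl_get(struct fl_slab *s)
{
    return s->freelist[s->active++];
}

/* Release: step active back and record the freed index in that slot. */
void fl_put(struct fl_slab *s, unsigned int objnr)
{
    s->active--;
    s->freelist[s->active] = (unsigned char)objnr;
}
```

Starting from the freelist {0, 1, 2, 3}, taking three objects and then releasing 0 and 2 leaves {0, 2, 0, 3} with active == 1, so the next allocations return 2, then 0, then 3: the most recently released object is reused first.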

That is a lot of description; the details are easier to follow alongside a diagram.
I recommend reading the blog post below, which is very well written and which I also consulted for this analysis.
