Linux kernel memory management data structures


mem_map

mem_map is a global variable that points to an array of struct page describing all physical pages in the system; each page structure in the array corresponds to one physical page frame.

mem_map is only valid when the system has a single node. On the ARM platform there is only one node:

    /*
     * With no DISCONTIG, the global mem_map is just set as node 0's
     */
    if (pgdat == NODE_DATA(0)) {
        mem_map = NODE_DATA(0)->node_mem_map;



NODE_DATA(0)->node_mem_map

The node_mem_map member of every memory node in the system points to a struct page array that describes the physical page frames of all zones belonging to that node.

node_mem_map is allocated in alloc_node_mem_map:

    /* ia64 gets its own node_mem_map, before this, without bootmem */
    if (!pgdat->node_mem_map) {
        unsigned long size, start, end;
        struct page *map;

        /*
         * The zone's endpoints aren't required to be MAX_ORDER
         * aligned but the node_mem_map endpoints must be in order
         * for the buddy allocator to function correctly.
         */
        start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
        end = pgdat_end_pfn(pgdat);
        end = ALIGN(end, MAX_ORDER_NR_PAGES);
        size = (end - start) * sizeof(struct page);
        map = alloc_remap(pgdat->node_id, size);
        if (!map)
            map = memblock_virt_alloc_node_nopanic(size,
                                   pgdat->node_id);
        pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
    }
}

The comments in the code above tell us the following:

1. A zone's endpoints do not have to be MAX_ORDER aligned.

2. The mem_map, however, must be allocated MAX_ORDER aligned, so that the buddy allocator can work correctly.

3. mem_map is allocated through memblock, so its start address is determined dynamically. Its size is the number of pages in the aligned region multiplied by sizeof(struct page) (36 bytes here).
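To make the size calculation concrete, here is a standalone sketch that repeats the alignment arithmetic from alloc_node_mem_map with made-up values; MAX_ORDER_NR_PAGES, the pfn range and the 36-byte struct page size are assumptions for illustration only:

#include <stdio.h>

/* Hypothetical values, only for illustrating the arithmetic. */
#define MAX_ORDER_NR_PAGES  (1UL << 10)   /* assume MAX_ORDER = 11          */
#define PAGE_STRUCT_SIZE    36UL          /* sizeof(struct page) assumed    */

int main(void)
{
    unsigned long node_start_pfn = 0x60003; /* not MAX_ORDER aligned        */
    unsigned long node_end_pfn   = 0x80001; /* not MAX_ORDER aligned        */

    /* Round the start down and the end up to MAX_ORDER_NR_PAGES. */
    unsigned long start = node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
    unsigned long end   = (node_end_pfn + MAX_ORDER_NR_PAGES - 1) &
                          ~(MAX_ORDER_NR_PAGES - 1);

    unsigned long size  = (end - start) * PAGE_STRUCT_SIZE;

    printf("mem_map covers pfn [%#lx, %#lx), size = %lu bytes\n",
           start, end, size);
    return 0;
}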


meminfo

meminfo stores the memory configuration captured during the boot stage; it is used by the early initialization functions.

struct meminfo {
    int nr_banks;
    struct membank bank[NR_BANKS];
};

/*
 * This keeps memory configuration data used by a couple memory
 * initialization functions, as well as show_mem() for the skipping
 * of holes in the memory map.  It is populated by arm_add_memory().
 */
struct meminfo meminfo;

U-Boot must pass in the memory bank configuration parameters; during initialization the kernel fills meminfo from them. The memory configuration can be passed in two ways: through the memory node of the device tree, or through the traditional command line parameter mem=size@start.

Only the device tree way of passing the memory layout is described here.

early_init_dt_scan_memory parses the memory node of the device tree:

    ->early_init_dt_add_memory_arch

        ->arm_add_memory(base, size): fills one bank; base and size are page aligned, and any part beyond the 4 GB range is truncated (a simplified sketch follows).
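The following is a simplified, standalone sketch of the page alignment and 4 GB truncation described above. It is not the real arm_add_memory from the kernel; the structure and helper names are invented for illustration:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096ULL
#define PAGE_MASK   (~(PAGE_SIZE - 1))
#define LIMIT_4G    (1ULL << 32)

/* Hypothetical bank record, loosely modelled on struct membank. */
struct fake_bank {
    uint64_t start;
    uint64_t size;
};

/* Align the bank to page boundaries and clip anything above 4 GB. */
static int fake_add_memory(struct fake_bank *bank, uint64_t start, uint64_t size)
{
    uint64_t aligned_start = (start + PAGE_SIZE - 1) & PAGE_MASK; /* round up   */
    uint64_t end;

    size -= aligned_start - start;
    end   = aligned_start + (size & PAGE_MASK);                   /* round down */

    if (aligned_start >= LIMIT_4G)
        return -1;                    /* bank lies entirely above 4 GB */
    if (end > LIMIT_4G)
        end = LIMIT_4G;               /* truncate the part above 4 GB  */

    bank->start = aligned_start;
    bank->size  = end - aligned_start;
    return 0;
}

int main(void)
{
    struct fake_bank b;

    if (!fake_add_memory(&b, 0x80000123ULL, 0x100000000ULL))
        printf("bank: start=%#llx size=%#llx\n",
               (unsigned long long)b.start, (unsigned long long)b.size);
    return 0;
}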

Adjusting meminfo

After meminfo has been filled from the device tree it is not fixed; the kernel later re-adjusts the banks in meminfo.

sanity_check_meminfo walks every bank; if a bank crosses vmalloc_limit, it is split into two banks.
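A minimal standalone sketch of the splitting idea (not the real sanity_check_meminfo; the structures and the limit value are made up):

#include <stdint.h>
#include <stdio.h>

#define NR_FAKE_BANKS 8

struct fake_bank {
    uint64_t start;
    uint64_t size;
    int highmem;               /* set for the part above the limit */
};

struct fake_meminfo {
    int nr_banks;
    struct fake_bank bank[NR_FAKE_BANKS];
};

/*
 * Walk the banks; any bank that crosses 'limit' is split into a
 * lowmem part below the limit and a highmem part above it.
 */
static void split_banks(struct fake_meminfo *mi, uint64_t limit)
{
    for (int i = 0; i < mi->nr_banks && mi->nr_banks < NR_FAKE_BANKS; i++) {
        struct fake_bank *b = &mi->bank[i];

        if (b->start < limit && b->start + b->size > limit) {
            struct fake_bank *hi = &mi->bank[mi->nr_banks++];

            hi->start   = limit;
            hi->size    = b->start + b->size - limit;
            hi->highmem = 1;
            b->size     = limit - b->start;
        }
    }
}

int main(void)
{
    struct fake_meminfo mi = {
        .nr_banks = 1,
        .bank = { { .start = 0x80000000ULL, .size = 0x40000000ULL } },
    };

    split_banks(&mi, 0xa0000000ULL);   /* hypothetical vmalloc limit */

    for (int i = 0; i < mi.nr_banks; i++)
        printf("bank %d: %#llx + %#llx (highmem=%d)\n", i,
               (unsigned long long)mi.bank[i].start,
               (unsigned long long)mi.bank[i].size, mi.bank[i].highmem);
    return 0;
}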


memblock

memblock was introduced in kernel 2.6.35 to replace bootmem and simplify the boot-stage memory management code. The memblock code was not written from scratch; it reuses the existing logical memory block (LMB) code, which had long been in use on the Microblaze, PowerPC, SuperH and SPARC architectures.

struct memblock_region {
    phys_addr_t base;
    phys_addr_t size;
    unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;
#endif
};

struct memblock_type {
    unsigned long cnt;      /* number of regions */
    unsigned long max;      /* size of the allocated array */
    phys_addr_t total_size; /* size of all regions */
    struct memblock_region *regions;
};

struct memblock {
    bool bottom_up;         /* is bottom up direction? */
    phys_addr_t current_limit;
    struct memblock_type memory;
    struct memblock_type reserved;
};

extern struct memblock memblock;

memblock contains two memory region lists: memory and reserved.

memory describes the memory regions managed by memblock, while reserved describes the regions memblock has set aside (both ranges allocated through memblock and ranges reserved via the memblock_reserve interface). Address ranges recorded in reserved can no longer be handed out for allocation.
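To make the two lists concrete, here is a schematic (not standalone-runnable) sketch of how they are typically populated during early boot, assuming the 3.x-era memblock interface (memblock_add, memblock_reserve, memblock_alloc from <linux/memblock.h>); all addresses and sizes are made up:

#include <linux/memblock.h>
#include <linux/sizes.h>

static void __init example_populate_memblock(void)
{
    phys_addr_t pgtbl;

    /* RAM banks discovered from the device tree go into "memory". */
    memblock_add(0x80000000, SZ_512M);

    /* The kernel image and the initrd are carved out in "reserved". */
    memblock_reserve(0x80008000, SZ_8M);   /* kernel text/data (made-up size) */
    memblock_reserve(0x88000000, SZ_4M);   /* initrd (made-up address)        */

    /*
     * An allocation is satisfied from "memory" minus "reserved", and the
     * chosen range is then recorded in "reserved" as well.
     */
    pgtbl = memblock_alloc(SZ_16K, SZ_16K);
    (void)pgtbl;
}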


swapper_pg_dir


totalram_pages totalreserve_pages zone

struct zone {
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    unsigned long watermark[NR_WMARK];

    /*
     * We don't know if the memory that we're going to allocate will be freeable
     * or/and it will be released eventually, so to avoid totally wasting several
     * GB of ram we must reserve some of the lower zone memory (otherwise we risk
     * to run OOM on the lower zones despite there's tons of freeable ram
     * on the higher zones). This array is recalculated at runtime if the
     * sysctl_lowmem_reserve_ratio sysctl changes.
     */
    long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
    int node;
#endif

    /*
     * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
     * this zone's LRU.  Maintained by the pageout code.
     */
    unsigned int inactive_ratio;

    struct pglist_data    *zone_pgdat;
    struct per_cpu_pageset __percpu *pageset;

    /*
     * This is a per-zone reserve of pages that should not be
     * considered dirtyable memory.
     */
    unsigned long        dirty_balance_reserve;

#ifndef CONFIG_SPARSEMEM
    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long        *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

#ifdef CONFIG_NUMA
    /*
     * zone reclaim becomes active if more unmapped pages exist.
     */
    unsigned long        min_unmapped_pages;
    unsigned long        min_slab_pages;
#endif /* CONFIG_NUMA */

    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    unsigned long        zone_start_pfn;

    /*
     * spanned_pages is the total pages spanned by the zone, including
     * holes, which is calculated as:
     *     spanned_pages = zone_end_pfn - zone_start_pfn;
     *
     * present_pages is physical pages existing within the zone, which
     * is calculated as:
     *    present_pages = spanned_pages - absent_pages(pages in holes);
     *
     * managed_pages is present pages managed by the buddy system, which
     * is calculated as (reserved_pages includes pages allocated by the
     * bootmem allocator):
     *    managed_pages = present_pages - reserved_pages;
     *
     * So present_pages may be used by memory hotplug or memory power
     * management logic to figure out unmanaged pages by checking
     * (present_pages - managed_pages). And managed_pages should be used
     * by page allocator and vm scanner to calculate all kinds of watermarks
     * and thresholds.
     *
     * Locking rules:
     *
     * zone_start_pfn and spanned_pages are protected by span_seqlock.
     * It is a seqlock because it has to be read outside of zone->lock,
     * and it is done in the main allocator path.  But, it is written
     * quite infrequently.
     *
     * The span_seq lock is declared along with zone->lock because it is
     * frequently read in proximity to zone->lock.  It's good to
     * give them a chance of being in the same cacheline.
     *
     * Write access to present_pages at runtime should be protected by
     * lock_memory_hotplug()/unlock_memory_hotplug().  Any reader who can't
     * tolerant drift of present_pages should hold memory hotplug lock to
     * get a stable value.
     *
     * Read access to managed_pages should be safe because it's unsigned
     * long. Write access to zone->managed_pages and totalram_pages are
     * protected by managed_page_count_lock at runtime. Idealy only
     * adjust_managed_page_count() should be used instead of directly
     * touching zone->managed_pages and totalram_pages.
     */
    unsigned long        managed_pages;
    unsigned long        spanned_pages;
    unsigned long        present_pages;

    const char        *name;

    /*
     * Number of MIGRATE_RESEVE page block. To maintain for just
     * optimization. Protected by zone->lock.
     */
    int            nr_migrate_reserve_block;

#ifdef CONFIG_MEMORY_HOTPLUG
    /* see spanned/present_pages for more description */
    seqlock_t        span_seqlock;
#endif

    /*
     * wait_table        -- the array holding the hash table
     * wait_table_hash_nr_entries    -- the size of the hash table array
     * wait_table_bits    -- wait_table_size == (1 << wait_table_bits)
     *
     * The purpose of all these is to keep track of the people
     * waiting for a page to become available and make them
     * runnable again when possible. The trouble is that this
     * consumes a lot of space, especially when so few things
     * wait on pages at a given time. So instead of using
     * per-page waitqueues, we use a waitqueue hash table.
     *
     * The bucket discipline is to sleep on the same queue when
     * colliding and wake all in that wait queue when removing.
     * When something wakes, it must check to be sure its page is
     * truly available, a la thundering herd. The cost of a
     * collision is great, but given the expected load of the
     * table, they should be so rare as to be outweighed by the
     * benefits from the saved space.
     *
     * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
     * primary users of these fields, and in mm/page_alloc.c
     * free_area_init_core() performs the initialization of them.
     */
    wait_queue_head_t    *wait_table;
    unsigned long        wait_table_hash_nr_entries;
    unsigned long        wait_table_bits;

    ZONE_PADDING(_pad1_)

    /* Write-intensive fields used from the page allocator */
    spinlock_t        lock;

    /* free areas of different sizes */
    struct free_area    free_area[MAX_ORDER];

    /* zone flags, see below */
    unsigned long        flags;

    ZONE_PADDING(_pad2_)

    /* Write-intensive fields used by page reclaim */

    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t        lru_lock;
    struct lruvec        lruvec;

    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-cpu counter
     * drift allowing watermarks to be breached
     */
    unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* pfn where compaction free scanner should start */
    unsigned long        compact_cached_free_pfn;
    /* pfn where async and sync compaction migration scanner should start */
    unsigned long        compact_cached_migrate_pfn[2];
#endif

#ifdef CONFIG_COMPACTION
    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     */
    unsigned int        compact_considered;
    unsigned int        compact_defer_shift;
    int            compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* Set to true when the PG_migrate_skip bits should be cleared */
    bool            compact_blockskip_flush;
#endif

    ZONE_PADDING(_pad3_)

    /* Zone statistics */
    atomic_long_t        vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;


spanned_pages: the total number of physical pages spanned by the zone from start_pfn to end_pfn, including the holes in between. spanned_pages = zone_end_pfn - zone_start_pfn

present_pages: the physical pages that actually exist within the zone. present_pages = spanned_pages - absent_pages (pages in holes)

managed_pages: the present pages that are managed by the buddy system. managed_pages = present_pages - reserved_pages
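A worked example with made-up numbers, just to exercise the three relations above:

#include <stdio.h>

int main(void)
{
    /* Hypothetical zone layout, only to illustrate the formulas. */
    unsigned long zone_start_pfn = 0x60000;
    unsigned long zone_end_pfn   = 0x80000;
    unsigned long absent_pages   = 1024;   /* pfns falling into holes            */
    unsigned long reserved_pages = 2048;   /* incl. bootmem/memblock allocations */

    unsigned long spanned_pages = zone_end_pfn - zone_start_pfn;
    unsigned long present_pages = spanned_pages - absent_pages;
    unsigned long managed_pages = present_pages - reserved_pages;

    printf("spanned=%lu present=%lu managed=%lu\n",
           spanned_pages, present_pages, managed_pages);  /* 131072 130048 128000 */
    return 0;
}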
Computing managed_pages

mem_init->free_all_bootmem

free_all_bootmem has two implementations, one in nobootmem.c and one in bootmem.c. By default the kernel enables CONFIG_NO_BOOTMEM, so the implementation in nobootmem.c is used.

unsigned long __init free_all_bootmem(void)
{
    unsigned long pages;

    reset_all_zones_managed_pages();

    /*
     * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
     *  because in some case like Node0 doesn't have RAM installed
     *  low ram will be on Node1
     */
    pages = free_low_memory_core_early();
    totalram_pages += pages;

    return pages;
}

reset_all_zones_managed_pages clears managed_pages to 0 for every zone.

free_low_memory_core_early then recalculates managed_pages for the zones (only low memory is counted):

static unsigned long __init free_low_memory_core_early(void)
{
    unsigned long count = 0;
    phys_addr_t start, end;
    u64 i;

    for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL) {
        count += __free_memory_core(start, end);
    }

    return count;
}

It walks all the free memory regions described by memblock; a free region is what remains of the memory regions after the reserved regions are subtracted.
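As a toy model of "free = memory minus reserved", the following standalone sketch subtracts a sorted, non-overlapping list of reserved ranges from a single memory range and counts the remaining pages. It only illustrates the idea behind for_each_free_mem_range, not the real memblock iterator, and all values are made up:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

struct range { uint64_t start, end; };   /* half-open [start, end) */

/*
 * Count the free pages in 'mem' after removing the sorted,
 * non-overlapping 'resv' ranges: memory minus reserved.
 */
static uint64_t count_free_pages(struct range mem,
                                 const struct range *resv, int nr)
{
    uint64_t free_bytes = 0;
    uint64_t cursor = mem.start;

    for (int i = 0; i < nr; i++) {
        uint64_t rs = resv[i].start > mem.start ? resv[i].start : mem.start;
        uint64_t re = resv[i].end   < mem.end   ? resv[i].end   : mem.end;

        if (rs > cursor)
            free_bytes += rs - cursor;   /* gap before this reserved range */
        if (re > cursor)
            cursor = re;
    }
    if (mem.end > cursor)
        free_bytes += mem.end - cursor;  /* tail after the last reservation */

    return free_bytes >> PAGE_SHIFT;
}

int main(void)
{
    struct range mem = { 0x80000000ULL, 0xa0000000ULL };       /* 512 MB */
    struct range resv[] = {
        { 0x80008000ULL, 0x80808000ULL },                      /* ~8 MB  */
        { 0x88000000ULL, 0x88400000ULL },                      /* 4 MB   */
    };

    printf("free pages: %llu\n",
           (unsigned long long)count_free_pages(mem, resv, 2));
    return 0;
}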




