Kernel Matters: Memory Management (1)

Source: Internet
Author: User

Wherever there are people, there is a world with its own heroes and rules. To introduce the world of memory management, we first have to start with its main characters.


In a NUMA system, physical memory is first divided into several nodes. Each node is further divided into a number of zones. Each zone is in turn associated with an array that holds the descriptors of all the page frames belonging to that zone.
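To make the hierarchy concrete, here is a minimal user-space sketch in C. Every `toy_` name is a hypothetical illustration, not a kernel definition; the sketch only models "a node contains zones, a zone contains page frame descriptors" and walks that hierarchy to count frames.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types: a user-space model, not the kernel's definitions. */
struct toy_page {
    unsigned long flags;            /* stand-in for a page frame descriptor */
};

struct toy_zone {
    struct toy_page *frames;        /* descriptors of the frames in this zone */
    size_t nr_frames;               /* length of the frames array */
};

#define TOY_MAX_ZONES 3

struct toy_node {
    struct toy_zone zones[TOY_MAX_ZONES];
    int nr_zones;                   /* how many entries of zones[] are in use */
};

/* Walk node -> zone and add up the page frames that belong to the node. */
size_t toy_node_frames(const struct toy_node *node)
{
    size_t total = 0;

    assert(node->nr_zones >= 0 && node->nr_zones <= TOY_MAX_ZONES);
    for (int i = 0; i < node->nr_zones; i++)
        total += node->zones[i].nr_frames;
    return total;
}
```

A node with two zones of 4 and 6 frames would report 10 frames in total.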


As you can see, this world has three important characters: nodes, zones, and page frames. Their relationships and standing can be roughly pictured as follows (the figure is taken from "Professional Linux Kernel Architecture"):


[Figure: the relationship between nodes, zones, and page frames]


Let's introduce these three characters one by one. Note that we mainly describe the parts we care about.


1. Node

In the kernel, a node is described by the structure pg_data_t.


    typedef struct pglist_data {
        struct zone node_zones[MAX_NR_ZONES];
        struct zonelist node_zonelists[MAX_ZONELISTS];
        int nr_zones;
    #ifdef CONFIG_FLAT_NODE_MEM_MAP
        struct page *node_mem_map;
    #endif
        unsigned long node_start_pfn;
        unsigned long node_present_pages; /* total number of physical pages */
        unsigned long node_spanned_pages; /* total size of physical page
                                             range, including holes */
        int node_id;
        wait_queue_head_t kswapd_wait;
        struct task_struct *kswapd;
        int kswapd_max_order;
    } pg_data_t;


  • node_zones: Each node is divided into several zones, and the information for those zones is stored in this array.


  • node_zonelists: What if we want to allocate memory from a particular zone and that zone does not have enough free memory? As the saying goes, a wily hare has three burrows: we need to keep a fallback for ourselves, and node_zonelists is that fallback. It lists all the alternative zones. These alternatives are ranked by priority, of course; only when the first alternative cannot satisfy the request does the kernel go looking for the next one.


  • nr_zones: As the name implies, the number of zones in the node.


  • node_mem_map: An array of the descriptors of all page frames that belong to this node. The page frame is the rank-and-file soldier: whether you are a regiment (zone) or a division (node), in the end you are made up of soldiers.


  • node_start_pfn: Note that page frames are numbered in one global sequence across all nodes. It is not the case that this division has a soldier 9527 and that division also has a soldier 9527; there is only one 9527 in the whole army. node_start_pfn is the number of the first page frame in this node.


  • node_present_pages: The number of page frames actually present in this node.


  • node_spanned_pages: This is also a count of the page frames in this node, but memory holes are included, so node_spanned_pages is generally larger than node_present_pages. A memory hole arises when some regions of the RAM address space are reserved for I/O mappings or by the BIOS, which leaves a gap in the address space. Ah, this suddenly reminds me of a friend's true story ... That year he joined an impressive team working on distributed memory management, which seemed incredibly high-end at the time, and there were memory holes in the middle of the address space. So one of the great beauties on the team named the list as_hole. Nobody objected at the time, but at every code review people would tiptoe around that variable whenever it came up, until later ... (I am quoting someone else's true story so directly; is that infringement? If the copyright owner contacts me, I will delete this paragraph.)


  • node_id: The number of the node: 0, 1, 2, ...


  • kswapd_wait, kswapd, kswapd_max_order: These three members are used by the swapping mechanism; we skip them for now.
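The fallback idea behind node_zonelists can be sketched in a few lines of C. This is only a toy model under stated assumptions: `toy_zone`, `toy_zonelist`, and `toy_alloc_from_zonelist` are hypothetical names, and the real allocator checks far more than a single free-page counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; the real kernel structures are far richer. */
struct toy_zone {
    const char *name;
    unsigned long free_pages;
};

#define TOY_MAX_FALLBACK 4

struct toy_zonelist {
    /* Candidate zones in priority order, terminated by NULL. */
    struct toy_zone *zones[TOY_MAX_FALLBACK + 1];
};

/*
 * Try each zone in priority order and take nr_pages from the first one
 * that has enough free pages. Returns that zone, or NULL if none fits.
 */
struct toy_zone *toy_alloc_from_zonelist(struct toy_zonelist *zl,
                                         unsigned long nr_pages)
{
    assert(zl != NULL);
    for (size_t i = 0; i < TOY_MAX_FALLBACK && zl->zones[i] != NULL; i++) {
        struct toy_zone *z = zl->zones[i];

        if (z->free_pages >= nr_pages) {
            z->free_pages -= nr_pages;
            return z;
        }
    }
    return NULL;
}
```

With a list of { Normal (2 free pages), DMA (8 free pages) }, a request for 1 page is served by Normal, while a request for 4 pages falls through to DMA.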



2. Memory Zones

This character is more complicated. For now we only want a first impression, enough to recognize the face. We will deal with it often in the future and get to know it gradually.


In the kernel, a memory zone is described by struct zone.


    struct zone {
        /* fields commonly accessed by the page allocator */
        unsigned long           pages_min, pages_low, pages_high;
        unsigned long           lowmem_reserve[MAX_NR_ZONES];
        struct per_cpu_pageset  pageset[NR_CPUS];
        /*
         * free areas of different sizes
         */
        spinlock_t              lock;
        struct free_area        free_area[MAX_ORDER];
        unsigned long           *pageblock_flags;

        ZONE_PADDING(_pad1_)

        /* fields commonly accessed by the page reclaim scanner */
        spinlock_t              lru_lock;
        struct list_head        active_list;
        struct list_head        inactive_list;
        unsigned long           nr_scan_active;
        unsigned long           nr_scan_inactive;
        unsigned long           pages_scanned;  /* since last reclaim */
        unsigned long           flags;          /* zone flags, see below */

        /* zone statistics */
        atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
        int prev_priority;

        ZONE_PADDING(_pad2_)

        /* Rarely used or read-mostly fields */
        wait_queue_head_t       *wait_table;
        unsigned long           wait_table_hash_nr_entries;
        unsigned long           wait_table_bits;

        /*
         * Discontig memory support fields.
         */
        struct pglist_data      *zone_pgdat;
        unsigned long           zone_start_pfn;
        unsigned long           spanned_pages;  /* total size, including holes */
        unsigned long           present_pages;  /* amount of memory (excluding holes) */

        /*
         * rarely used fields:
         */
        const char              *name;
    } ____cacheline_internodealigned_in_smp;


This structure has a lot more in it. It contains two ZONE_PADDING markers, which divide the structure into three parts.


Why is ZONE_PADDING here at all? The story goes like this. On SMP systems, multiple CPUs often access the same zone structure concurrently, so locks are necessary. The structure contains two locks, zone->lock and zone->lru_lock. To make sure these two locks land in different CPU cache lines, the kernel has to harden its heart and put a galaxy of padding between them.
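The effect of such padding can be illustrated in user space. The sketch below is not how the kernel macro is written; it uses C11 `alignas` as a substitute, assumes a 64-byte cache line, and every `toy_`/`TOY_` name is hypothetical. It shows that forcing the second lock onto a cache line boundary keeps the two locks on separate lines.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64   /* assumed cache line size; real sizes vary by CPU */

/* Hypothetical stand-in for a lock; the kernel's spinlock_t is different. */
struct toy_lock {
    int locked;
};

struct toy_zone {
    /* Part 1: touched by the page allocator. */
    struct toy_lock lock;
    unsigned long free_pages;

    /* Part 2: touched by reclaim. alignas() pushes lru_lock to the start
     * of the next cache line, playing the role of the padding. */
    alignas(TOY_CACHE_LINE) struct toy_lock lru_lock;
    unsigned long pages_scanned;
};

/* Compile-time check that the second part starts on a cache line boundary. */
static_assert(offsetof(struct toy_zone, lru_lock) % TOY_CACHE_LINE == 0,
              "lru_lock must start a new cache line");

/* Returns 1 if the two locks fall on different cache lines, otherwise 0. */
int toy_locks_on_separate_lines(void)
{
    size_t first = offsetof(struct toy_zone, lock) / TOY_CACHE_LINE;
    size_t second = offsetof(struct toy_zone, lru_lock) / TOY_CACHE_LINE;

    return first != second;
}
```

The price is wasted space: the bytes between the end of the first part and the next cache line boundary are unused, which is the trade-off the padding accepts to avoid two CPUs bouncing one cache line back and forth.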


The first part is used mainly by the page allocator when it serves requests for memory pages.

    • pages_min, pages_low, pages_high: These three are called "watermarks". They come into play when physical memory is allocated and when memory is reclaimed.

      • If the number of free memory pages is greater than pages_high, the zone is considered to have plenty of free memory.

      • If the number of free memory pages is less than pages_low, memory reclaim is required.

      • If the number of free memory pages is less than pages_min, memory reclaim is under enormous pressure and the situation is dangerous.


    • lowmem_reserve: An array that specifies how many memory pages must be held in reserve, with one entry per zone. These reserved pages are mainly for handling low-memory emergencies.


    • pageset: This is a per-CPU cache of memory pages. The kernel allocates some pages ahead of time and keeps them in this cache, so a request for a single page can be served directly from it. This is one of the kernel's usual tricks, and we will see more examples like it later.


    • free_area: This is where the famous buddy system lives. A later post will be devoted to the buddy system. For now it is enough to know that this member holds the free memory pages of the zone.

    • pageblock_flags: Also used by the buddy system, as part of the anti-fragmentation mechanism.


    • ZONE_PADDING(_pad1_): A gorgeous dividing line.
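The three watermarks described in the list above can be turned into a small classification function. This is a hedged sketch, not kernel code: the `toy_` names are hypothetical, and the in-between state `TOY_ZONE_OK` is a label introduced here for the range the text does not name.

```c
#include <assert.h>

/* Hypothetical toy model of the three watermarks. */
struct toy_zone {
    unsigned long pages_min, pages_low, pages_high;
    unsigned long free_pages;
};

enum toy_pressure {
    TOY_ZONE_IDLE,       /* free > pages_high: plenty of free memory */
    TOY_ZONE_OK,         /* between pages_low and pages_high (our own label) */
    TOY_ZONE_RECLAIM,    /* free < pages_low: reclaim is required */
    TOY_ZONE_CRITICAL    /* free < pages_min: reclaim is under heavy pressure */
};

/* Classify a zone by comparing its free pages with the watermarks. */
enum toy_pressure toy_zone_pressure(const struct toy_zone *z)
{
    assert(z->pages_min <= z->pages_low && z->pages_low <= z->pages_high);

    if (z->free_pages < z->pages_min)
        return TOY_ZONE_CRITICAL;
    if (z->free_pages < z->pages_low)
        return TOY_ZONE_RECLAIM;
    if (z->free_pages > z->pages_high)
        return TOY_ZONE_IDLE;
    return TOY_ZONE_OK;
}
```

The checks run from the most severe condition to the least, so a zone below pages_min is reported as critical rather than merely as needing reclaim.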


The second part is used mainly by the memory reclaim mechanism. All memory pages are classified by how active they are: active or inactive. Pages of each kind are kept on their own LRU linked list. This classification matters when memory is reclaimed, because inactive pages are reclaimed first. In the words of Lu Xun: either erupt out of the silence, or perish in silence.
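The "inactive pages go first" rule can be sketched with two simple lists. This is a simplification under stated assumptions: the `toy_` names are hypothetical, the lists are singly linked rather than the kernel's list_head, and the real reclaim logic involves scanning, reference checks, and much more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy page; the kernel links pages with list_head instead. */
struct toy_page {
    int id;
    struct toy_page *next;
};

struct toy_zone {
    struct toy_page *active_list;    /* recently used pages */
    struct toy_page *inactive_list;  /* candidates for reclaim */
};

/* Push a page onto the front of a list. */
void toy_list_push(struct toy_page **list, struct toy_page *page)
{
    assert(list != NULL && page != NULL);
    page->next = *list;
    *list = page;
}

/* Pop the first page of a list, or return NULL if the list is empty. */
struct toy_page *toy_list_pop(struct toy_page **list)
{
    struct toy_page *page = *list;

    if (page != NULL) {
        *list = page->next;
        page->next = NULL;
    }
    return page;
}

/* Demote one page from active to inactive. Returns 1 if a page moved. */
int toy_deactivate_one(struct toy_zone *z)
{
    struct toy_page *page = toy_list_pop(&z->active_list);

    if (page == NULL)
        return 0;
    toy_list_push(&z->inactive_list, page);
    return 1;
}

/* Reclaim takes pages only from the inactive list. */
struct toy_page *toy_reclaim_one(struct toy_zone *z)
{
    return toy_list_pop(&z->inactive_list);
}
```

In this toy model an active page can only be reclaimed after it has first been demoted to the inactive list, which is one way to express the priority described above.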


    • active_list, inactive_list: The linked lists of active and inactive memory pages.


    • nr_scan_active, nr_scan_inactive: The number of active and inactive memory pages that need to be scanned during memory reclaim.


    • flags: Describes the state of the zone.

    typedef enum {
        ZONE_ALL_UNRECLAIMABLE,     /* all pages pinned */
        ZONE_RECLAIM_LOCKED,        /* prevents concurrent reclaim */
        ZONE_OOM_LOCKED,            /* zone is in OOM killer zonelist */
    } zone_flags_t;


    • vm_stat: Holds various statistics about the zone, such as NR_FREE_PAGES and NR_FILE_PAGES. These statistics are updated constantly.


    • prev_priority: Saves the priority with which the zone was last scanned during memory reclaim. (Still not sure what this is? It doesn't matter; when we get to the details of memory reclaim, we will come back to these variables.)


    • ZONE_PADDING(_pad2_): Another gorgeous dividing line.
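Since flags is a single unsigned long while the zone states form an enum, one natural reading is that each enumerator is a bit number within flags. The sketch below works under that assumption. The enum mirrors the one listed above, the `toy_` helpers are hypothetical, and they are deliberately non-atomic, so they illustrate only the bit arithmetic and not safe concurrent use.

```c
#include <assert.h>
#include <limits.h>

/* Mirrors the enum listed above; each value is treated as a bit number. */
typedef enum {
    ZONE_ALL_UNRECLAIMABLE,   /* all pages pinned */
    ZONE_RECLAIM_LOCKED,      /* prevents concurrent reclaim */
    ZONE_OOM_LOCKED           /* zone is in OOM killer zonelist */
} zone_flags_t;

/* Hypothetical toy zone holding only the flags word. */
struct toy_zone {
    unsigned long flags;
};

/* Set the bit that corresponds to flag (non-atomic, illustration only). */
void toy_zone_set_flag(struct toy_zone *z, zone_flags_t flag)
{
    assert((unsigned int)flag < sizeof(unsigned long) * CHAR_BIT);
    z->flags |= 1UL << flag;
}

/* Clear the bit that corresponds to flag (non-atomic, illustration only). */
void toy_zone_clear_flag(struct toy_zone *z, zone_flags_t flag)
{
    assert((unsigned int)flag < sizeof(unsigned long) * CHAR_BIT);
    z->flags &= ~(1UL << flag);
}

/* Return 1 if the bit for flag is set, otherwise 0. */
int toy_zone_test_flag(const struct toy_zone *z, zone_flags_t flag)
{
    return (int)((z->flags >> flag) & 1UL);
}
```

Because each state owns its own bit, a zone can carry several states at once, for example being reclaim-locked and OOM-locked at the same time.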


The third part consists of fields that are rarely used or read-mostly.

    • wait_table, wait_table_hash_nr_entries, wait_table_bits: These three brothers implement a hash table of wait queues. When a memory page is temporarily unavailable, a process that wants to use that page waits in one of these queues.


    • zone_pgdat: The node to which the zone belongs. This is how a zone finds its organization; very important.


    • zone_start_pfn: The number of the first page frame in the zone.

    • spanned_pages: The number of memory pages in the zone, including memory holes. Sound familiar?

    • present_pages: The number of memory pages actually present in the zone, excluding holes.


    • name: Anyone making their way in this world needs a name.
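The relationship between zone_start_pfn, spanned_pages, and present_pages comes down to simple arithmetic, sketched below. The `toy_` names are hypothetical and the struct keeps only the three fields under discussion.

```c
#include <assert.h>

/* Hypothetical toy zone with only the fields discussed above. */
struct toy_zone {
    unsigned long zone_start_pfn;   /* first page frame number in the zone */
    unsigned long spanned_pages;    /* size of the PFN range, holes included */
    unsigned long present_pages;    /* pages that actually exist */
};

/* Does the zone's PFN range [start, start + spanned) cover this pfn? */
int toy_zone_spans_pfn(const struct toy_zone *z, unsigned long pfn)
{
    return pfn >= z->zone_start_pfn &&
           pfn - z->zone_start_pfn < z->spanned_pages;
}

/* Number of page frame numbers in the range that fall into memory holes. */
unsigned long toy_zone_hole_pages(const struct toy_zone *z)
{
    assert(z->present_pages <= z->spanned_pages);
    return z->spanned_pages - z->present_pages;
}
```

Note that a zone spanning a given page frame number does not guarantee the page is present: the number may fall into a hole, which is exactly why the two counters differ.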


(To be continued...)

This article is from the "Kernel blog" blog. Please contact the author before reproducing it.
