Linux Memory Management Source Analysis-Overview

If you reprint this, please credit the source: http://www.cnblogs.com/tolimit/

http://www.cnblogs.com/tolimit/p/4551428.html

I have recently been studying the kernel's memory management framework. There is a lot of material, so here is a summary.

Segmentation and Paging

Let's look at a diagram first.

The memory addresses we work with in code are not the actual physical memory addresses. The addresses used in our code are logical addresses, and they are converted to physical addresses by two mechanisms: segmentation and paging. Because Linux makes only limited use of the segmentation mechanism, a logical address under Linux can be treated as a linear address. In other words, the addresses we use when coding are linear addresses, and they only need to pass through the paging mechanism to become physical addresses. So the Linux paging model is the more important thing to explain.

The system divides all of physical memory into page frames, each typically 4KB (the hardware also allows 4MB frames). This means that with 1GB of physical memory, the system divides it into 262,144 page frames. When we supply a linear address, the paging mechanism translates it into the physical address of the corresponding page frame. The figure below shows the Linux paging model.
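As a quick sanity check of that arithmetic, here is a minimal user-space sketch, assuming the usual 4KB page size, that computes the frame count for 1GB and splits an address into a page frame number and an offset:

    #include <stdio.h>

    #define PAGE_SHIFT 12                      /* 4KB pages: 2^12 bytes */
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
        unsigned long phys_mem = 1UL << 30;    /* 1GB of physical memory */
        unsigned long addr     = 0x12345678;   /* an arbitrary example address */

        /* 1GB / 4KB = 262,144 page frames, as stated above */
        printf("page frames: %lu\n", phys_mem / PAGE_SIZE);

        /* a paged address splits into a frame number and an offset inside the frame */
        printf("frame %lu, offset 0x%lx\n", addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1));
        return 0;
    }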

Linux uses a four-level paging model consisting of the page global directory (PGD), the page upper directory (PUD), the page middle directory (PMD), and the page table (PTE). Each of these tables occupies one page. Linux does not use all four levels on every piece of hardware: on a 32-bit system booted without Physical Address Extension (PAE), only a two-level page table is used, and Linux leaves the page upper directory and the page middle directory empty. On a 32-bit system with PAE enabled, Linux uses a three-level page table, and the page upper directory is empty. On a 64-bit system, Linux chooses a three-level or four-level page table depending on the hardware. The actual translation from a linear address to a physical address is performed automatically by the CPU.
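To make the four levels concrete, here is a hedged sketch of how kernel code of this era typically walks them for a given mm_struct and linear address. The helper walk_to_page() is an illustrative name, not a kernel function; the *_offset() macros and the pte helpers are the kernel's own interfaces (newer kernels add a fifth p4d level between PGD and PUD):

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    /* Illustrative helper (not a kernel function): walk PGD -> PUD -> PMD -> PTE
     * for one linear address and return the backing struct page, or NULL. */
    static struct page *walk_to_page(struct mm_struct *mm, unsigned long addr)
    {
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
        struct page *page = NULL;

        pgd = pgd_offset(mm, addr);            /* index into the page global directory */
        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return NULL;
        pud = pud_offset(pgd, addr);           /* page upper directory */
        if (pud_none(*pud) || pud_bad(*pud))
            return NULL;
        pmd = pmd_offset(pud, addr);           /* page middle directory */
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            return NULL;
        pte = pte_offset_map(pmd, addr);       /* page table entry */
        if (pte_present(*pte))
            page = pte_page(*pte);             /* struct page of the mapped frame */
        pte_unmap(pte);
        return page;
    }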

Each process has its own page global directory. When a process runs, the system loads the address of the process's page global directory into the CR3 register; when the process is switched out, the page global directory address held in CR3 is saved back into the process descriptor. Later we will also meet the CR2 register, which is used in page fault handling (it holds the faulting linear address).
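As a rough illustration of that hand-off on x86 (a simplified sketch, not the kernel's actual switch_mm() code; switch_pgd() is a made-up name), CR3 must receive the physical address of the incoming process's page global directory:

    #include <linux/mm_types.h>
    #include <asm/page.h>            /* __pa() */
    #include <asm/special_insns.h>   /* write_cr3() on x86 */

    /* Simplified illustration of the CR3 switch done on a context switch:
     * CR3 holds the physical address of the incoming process's PGD. */
    static inline void switch_pgd(struct mm_struct *next)
    {
        write_cr3(__pa(next->pgd));  /* what the kernel's load_cr3(next->pgd) boils down to */
    }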

Because each process has its own page global directory, if there are 100 processes in memory the system has to keep 100 complete sets of page tables, which sounds like it would consume a great deal of memory. In practice, the system only allocates a path through the page tables when the process actually uses it. For example, when we access a linear address, the page upper directory, page middle directory, page table and page that the address maps to may not exist yet; in that case the system raises a page fault exception, and the page fault handler allocates the page frames needed for the page upper directory, page middle directory, page table and page required by that linear address.
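This lazy allocation is easy to observe from user space. A minimal sketch, assuming a Linux system with 4KB pages: map an anonymous region, then watch the minor page fault counter rise only when the pages are first touched.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        size_t len = 64 * 4096;                       /* 64 pages */
        struct rusage before, after;

        /* mmap only records the region; no page tables or frames are allocated yet */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        getrusage(RUSAGE_SELF, &before);
        memset(buf, 0x5a, len);                       /* first touch: faults allocate frames */
        getrusage(RUSAGE_SELF, &after);

        printf("minor faults while touching: %ld\n",
               after.ru_minflt - before.ru_minflt);   /* roughly one per page */
        munmap(buf, len);
        return 0;
    }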

Address Space

When a linear address is turned into a corresponding physical address by the paging mechanism, we call this a mapping; for example, after the linear address 0x00000001 passes through the paging mechanism, the corresponding physical address might be 0xffffff01.

Linux divides the linear address range into two parts: the process address space and the kernel address space. Each process has its own 3GB process address space, isolated from every other process; that is, linear address 0x00000001 in process A and linear address 0x00000001 in process B are not the same address, and process A cannot directly access process B's address space through its own. Linear addresses above 3GB (that is, above 0xc0000000) belong to kernel space; the kernel address space is 1GB, running from 0xc0000000 to 0xffffffff. Within the kernel address space, the kernel maps the first 896MB of linear addresses directly onto the first 896MB of physical addresses, so the physical address behind the kernel linear address 0xc0000001 is 0x00000001; the two differ by a constant offset of 0xc0000000.
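That constant offset is exactly what the kernel's address conversion macros encode for the directly mapped region. A simplified 32-bit x86 sketch (the real definitions live in the architecture headers and differ across configurations):

    /* Simplified 32-bit x86 view of the direct mapping (see asm/page.h) */
    #define PAGE_OFFSET 0xc0000000UL

    /* kernel virtual -> physical, valid only for the directly mapped first 896MB */
    #define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
    /* physical -> kernel virtual, same restriction */
    #define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))

    /* Example from the text: __pa(0xc0000001) == 0x00000001 */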

The Linux kernel divides physical memory into three zones (illustrated by the allocation sketch below):

ZONE_DMA: page frames in the range 0MB~16MB, directly mapped into the kernel address space, usable for DMA by older ISA-based devices.

ZONE_NORMAL: page frames in the range 16MB~896MB, the ordinary page frames, directly mapped into the kernel address space.

ZONE_HIGHMEM: page frames above 896MB, not directly mapped; the kernel accesses these page frames through permanent mappings and temporary mappings.
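From an allocator's point of view the zones are selected through GFP flags. A hedged sketch of how kernel code might request one page frame from each zone (alloc_pages(), __free_pages() and the GFP flags are standard kernel interfaces; zone_alloc_examples() is just an illustrative name, and error handling is trimmed):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    static void zone_alloc_examples(void)
    {
        /* ZONE_DMA: frames below 16MB, for old ISA-style DMA */
        struct page *dma_page  = alloc_pages(GFP_DMA | GFP_KERNEL, 0);

        /* ZONE_NORMAL: ordinary, directly mapped kernel memory */
        struct page *norm_page = alloc_pages(GFP_KERNEL, 0);

        /* ZONE_HIGHMEM: frames above 896MB, not directly mapped;
         * they must be kmap()'d before the kernel can address them */
        struct page *high_page = alloc_pages(GFP_KERNEL | __GFP_HIGHMEM, 0);

        if (dma_page)  __free_pages(dma_page, 0);
        if (norm_page) __free_pages(norm_page, 0);
        if (high_page) __free_pages(high_page, 0);
    }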

The entire structure is shown below

For the ZONE_DMA and ZONE_NORMAL zones, the kernel maps their addresses directly; only the ZONE_HIGHMEM zone is not mapped by default, and it is mapped only when needed, through either a temporary mapping or a permanent mapping.
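A brief sketch of those two access paths for a ZONE_HIGHMEM page frame: kmap()/kunmap() for the permanent mapping and kmap_atomic()/kunmap_atomic() for the temporary one (the mapping functions are the kernel's own; touch_highmem_page() is an illustrative name):

    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    static void touch_highmem_page(struct page *page)
    {
        /* Permanent mapping: may sleep, so only usable in process context */
        void *vaddr = kmap(page);
        memset(vaddr, 0, PAGE_SIZE);
        kunmap(page);

        /* Temporary (atomic) mapping: cannot sleep, safe in interrupt context,
         * but must be released quickly because the mapping slots are scarce */
        vaddr = kmap_atomic(page);
        memset(vaddr, 0, PAGE_SIZE);
        kunmap_atomic(vaddr);
    }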

Node and Zone Descriptors

To support NUMA architectures, a node is used to describe the memory belonging to one locality. NUMA essentially lets many machines run as if they were a single system, so each machine has its own memory, and each machine's memory forms one node. For an ordinary PC, the whole machine is a single node. A node is represented by the struct pglist_data structure:

/* Memory node descriptor; all node descriptors are saved in
 * struct pglist_data *node_data[MAX_NUMNODES] */
typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];           /* zone descriptor array of this node */
    struct zonelist node_zonelists[MAX_ZONELISTS];  /* zonelists used by the page allocator: the
                                                     * zones of all nodes linked in fallback order */
    int nr_zones;                                   /* number of zones in this node */
#ifdef CONFIG_FLAT_NODE_MEM_MAP                     /* means !SPARSEMEM */
    struct page *node_mem_map;                      /* page descriptor array covering every frame in the node */
#ifdef CONFIG_MEMCG
    struct page_cgroup *node_page_cgroup;           /* used by the memory cgroup (resource limiting) mechanism */
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
    struct bootmem_data *bdata;                     /* used only during the kernel initialization phase */
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
    spinlock_t node_size_lock;                      /* spin lock protecting the node size fields */
#endif
    /* Global PFN of the first frame in this node. On NUMA a frame has two numbers,
     * a global PFN and a node-local index: a frame with local index 1 in node 2 might
     * have global PFN 1001, in which case node_start_pfn holds 1000 for easy conversion */
    unsigned long node_start_pfn;
    unsigned long node_present_pages;               /* size of the node in page frames, excluding holes */
    unsigned long node_spanned_pages;               /* size of the node in page frames, including holes */
    int node_id;                                    /* node identifier */
    wait_queue_head_t kswapd_wait;                  /* wait queue used by this node's kswapd daemon */
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;                     /* kswapd kernel thread descriptor;
                                                     * protected by mem_hotplug_begin/end() */
    int kswapd_max_order;                           /* order of the free block kswapd should try to create */
    enum zone_type classzone_idx;
#ifdef CONFIG_NUMA_BALANCING                        /* fields below are for NUMA load balancing */
    spinlock_t numabalancing_migrate_lock;          /* lock serializing the migrate rate-limiting window */
    unsigned long numabalancing_migrate_next_window; /* rate-limiting time interval */
    unsigned long numabalancing_migrate_nr_pages;    /* pages migrated during the rate-limiting interval */
#endif
} pg_data_t;

All node descriptors in the system are stored in the node_data array. Inside the pg_data_t node descriptor, the node_zones array holds all the zone descriptors of that node. Although the system divides physical memory into three regions, logically it is divided into four zones; the extra one is ZONE_MOVABLE. This is a virtual zone that does not correspond to a region of physical memory; its main purpose is to help avoid memory fragmentation, and its memory comes either entirely from the ZONE_HIGHMEM zone or entirely from the ZONE_NORMAL zone. We will see this in the initialization functions later.
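A hedged sketch of how this layout can be walked: iterate over the online nodes through their descriptors and print each populated zone. for_each_online_node(), NODE_DATA() and populated_zone() are standard kernel helpers; dump_nodes_and_zones() is just an illustrative name.

    #include <linux/mmzone.h>
    #include <linux/nodemask.h>
    #include <linux/printk.h>

    static void dump_nodes_and_zones(void)
    {
        int nid, i;

        for_each_online_node(nid) {
            pg_data_t *pgdat = NODE_DATA(nid);   /* same descriptor as node_data[nid] */

            for (i = 0; i < pgdat->nr_zones; i++) {
                struct zone *zone = &pgdat->node_zones[i];

                if (!populated_zone(zone))       /* skip empty zones, e.g. an unused ZONE_MOVABLE */
                    continue;
                pr_info("node %d zone %-8s spans %lu pages\n",
                        nid, zone->name, zone->spanned_pages);
            }
        }
    }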

Each node also has a kswapd kernel thread, which swaps infrequently used pages belonging to processes or to the kernel out to disk in order to free up more available memory.

Let's take another look at the zone descriptor:

/* Memory zone descriptor */
struct zone {
    /* Read-mostly fields */
    /* Zone watermarks, accessed with the *_wmark_pages(zone) macros.
     * WMARK_MIN (pages_min):   number of reserved page frames in the zone
     * WMARK_LOW (pages_low):   lower bound for page frame reclaim, also used as a
     *                          threshold by the zone allocator; usually 5/4 of pages_min
     * WMARK_HIGH (pages_high): upper bound for page frame reclaim, also used as a
     *                          threshold by the zone allocator; usually 3/2 of pages_min */
    unsigned long watermark[NR_WMARK];
    /* Number of page frames the zone must keep in reserve for critical low-memory
     * handling and for atomic allocation requests (i.e. requests that cannot block)
     * issued from interrupt handlers or critical regions */
    long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
    int node;
#endif
    /* Target ratio of active_anon to inactive_anon pages on this zone's LRU,
     * maintained by the pageout code */
    unsigned int inactive_ratio;
    /* Points to the node this zone belongs to */
    struct pglist_data *zone_pgdat;
    /* Per-CPU page frame cache, containing per-CPU lists of single page frames */
    struct per_cpu_pageset __percpu *pageset;
    /* Per-zone reserve of pages that should not be considered dirtyable memory */
    unsigned long dirty_balance_reserve;
#ifndef CONFIG_SPARSEMEM
    /* Flags for a pageblock_nr_pages block; see pageblock-flags.h.
     * In SPARSEMEM this map is stored in struct mem_section */
    unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_NUMA
    /* Zone reclaim becomes active if more unmapped pages exist */
    unsigned long min_unmapped_pages;
    unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */
    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    unsigned long zone_start_pfn;
    /* Pages normally available: total pages (excluding holes) minus reserved pages */
    unsigned long managed_pages;
    /* Total size of the zone in pages, including holes */
    unsigned long spanned_pages;
    /* Total size of the zone in pages, excluding holes */
    unsigned long present_pages;
    /* Conventional name of the zone: "DMA", "Normal", "HighMem" */
    const char *name;
    /* Number of page blocks on the MIGRATE_RESERVE list of the buddy system */
    int nr_migrate_reserve_block;
#ifdef CONFIG_MEMORY_ISOLATION
    /* Number of isolated pageblocks */
    unsigned long nr_isolate_pageblock;
#endif
