"Deep understanding of Linux memory Management" Learning notes (i) 0.02.01 revised version, the Red Letter section for the revised content

Source: Internet
Author: User
Tags memory usage

Notice: This text may not be reprinted or used, commercially or non-commercially, without my permission. If you need to do so, please contact: yrj1978@hotmail.com

Introduction
Why I am writing these notes:
1. The Chinese translation of the book is too poor to read. Reading the English original gives a much better understanding of the author's ideas, so I am keeping these notes as a memo.
2. My knowledge of the Linux kernel has been unsystematic. I am using this book to study the kernel systematically.
3. I have been writing a too-small, too-simple OS of my own: single-process, privileged mode, 64-bit protected mode. The bootloader is done and I have ideas for the kernel, but I am confused about how to implement memory management. I hope reading this book will help with writing my own OS.
4. To overcome laziness and read more. I hope to read 5 pages a day and finish the 700-plus pages of the original in half a year.

Shortcomings:

I cannot yet fully grasp the essence of Linux memory management, so there are bound to be many places where I have misunderstood something. Please point out my mistakes so that I can improve. Thank you.

How to study:

On a first reading there may be many places you do not understand. Do not worry. You may need to read up on file systems at that point.

Or read all the notes first and then look back, and some of those places will make sense.

Anyway
I. Overview

Available tools

CodeViz: a tool for generating call graphs from code. I have not used it yet, but I am interested in trying it to build call graphs.
http://www.csn.ul.ie/~mel/projects/codeviz/
Linux Cross Reference (LXR): a web-based tool for reading and navigating the Linux kernel source. It is quite cumbersome to install, so I suggest reading the code directly on its official website.
http://lxr.linux.no/linux+v2.6.24/
Module

The Linux memory management code is divided into 4 parts:
Out-of-memory code in mm/oom_kill.c, which appears to be used to kill processes.
Virtual memory allocation code in mm/vmalloc.c.
Physical page allocation code in mm/page_alloc.c.
Creation of VMAs (virtual memory areas) and management of the memory regions within a process.
These modules, together with other kernel code, form more complex subsystems such as the page replacement policy and buffered input/output.

II. Physical memory

From the hardware point of view there are two mainstream memory architectures. The first is non-uniform memory access (NUMA); I do not know which systems use this model. In it, memory is divided into banks, and the cost of accessing a bank depends on where it sits: for example, one bank may be dedicated to CPU access and another placed near peripheral cards for DMA. The second is uniform memory access (UMA), which is what PCs use: the CPU and the peripheral devices all access the same memory, with no difference between them.

The Linux kernel needs to support both architectures. It introduces the concept of a node: one node corresponds to one bank, and a UMA system has only one node. Linux describes a node with the data structure "struct pglist_data", defined in include/linux/mmzone.h. (The structure is typedef'd to pg_data_t.)

On NUMA systems, the memory of the whole system is managed through a node_data array of pg_data_t pointers, because there may be several nodes. On UMA systems such as PCs, the single object "struct pglist_data contig_page_data" is the only node in the system and manages all memory.

Each node is divided into several zones, each of which describes a range of memory. A zone is described by struct zone (struct zone_struct, typedef'd to zone_t, in 2.4 kernels). There are three zone types: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM, and each has a different use. ZONE_DMA is at the low end of physical memory, mainly because ISA devices can only use low addresses for DMA. ZONE_NORMAL is mapped directly by the kernel into the upper part of the linear address space, which is described in detail later in this chapter. ZONE_HIGHMEM is the remaining memory in the system, which the kernel does not map directly.
On a PC, the zones are laid out as follows:
ZONE_DMA        0 - 16 MB
ZONE_NORMAL     16 MB - 896 MB
ZONE_HIGHMEM    896 MB - end of physical memory
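
As a rough illustration of the layout above, the following C sketch maps a physical address to a zone. The enum, macro and function names are made up for this example and are not kernel definitions; only the 16 MB and 896 MB boundaries come from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Boundaries taken from the table above (32-bit x86 PC layout). */
#define MB            (1024ULL * 1024ULL)
#define DMA_LIMIT     (16ULL * MB)    /* end of ZONE_DMA */
#define NORMAL_LIMIT  (896ULL * MB)   /* end of ZONE_NORMAL */

/* Hypothetical enum for illustration; not the kernel's own definition. */
enum zone_kind { KIND_DMA, KIND_NORMAL, KIND_HIGHMEM };

/* Return the zone a physical address falls in.
 * mem_end is the end of physical memory; phys must lie below it. */
static enum zone_kind zone_of(uint64_t phys, uint64_t mem_end)
{
    assert(phys < mem_end);
    if (phys < DMA_LIMIT)
        return KIND_DMA;
    if (phys < NORMAL_LIMIT)
        return KIND_NORMAL;
    return KIND_HIGHMEM;
}
```

With 1 GB of memory, an address just below 16 MB falls in the DMA zone, 16 MB itself is the first byte of the normal zone, and anything from 896 MB upward is high memory.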

Most kernel operations use only ZONE_NORMAL.
System memory consists of a large number of fixed-size blocks called pages; on x86 the page size is 4,096 bytes. Each physical page is described by a struct page object, and these objects are kept in the global mem_map array. The array itself is stored in the memory just after the kernel's low-address area, that is, around where ZONE_NORMAL begins.
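
The relationship between a physical address, its page frame and the descriptor in a mem_map-style array can be sketched in C as follows. This is a simplified model under stated assumptions: struct page_desc and the helper names are hypothetical stand-ins, and the array is assumed to be flat with entry 0 describing frame 0.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                    /* 4,096-byte pages on x86 */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Simplified stand-in for the kernel's struct page (hypothetical fields). */
struct page_desc {
    unsigned long flags;
    int           refcount;
};

/* Physical address -> page frame number (PFN). */
static unsigned long phys_to_pfn(unsigned long phys)
{
    return phys >> PAGE_SHIFT;
}

/* PFN -> descriptor, assuming a flat array whose entry 0 describes frame 0. */
static struct page_desc *pfn_to_desc(struct page_desc *map, size_t nr_pages,
                                     unsigned long pfn)
{
    assert(pfn < nr_pages);
    return &map[pfn];
}

/* Descriptor -> PFN, recovered by pointer subtraction. */
static unsigned long desc_to_pfn(const struct page_desc *map,
                                 const struct page_desc *p)
{
    return (unsigned long)(p - map);
}
```

Because every descriptor has the same size, the index of a descriptor in the array is the page frame number, so both directions of the lookup are simple arithmetic.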

Because ZONE_NORMAL is limited in size, Linux also supports access to high memory, which is described in a later section. The sections that follow mainly describe nodes, zones and pages and how they relate to each other.
Nodes

A node is represented by pg_data_t, that is, struct pglist_data, which is defined in <linux/mmzone.h>:

typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];
    struct zonelist node_zonelists[MAX_ZONELISTS];
    int nr_zones;

    struct page *node_mem_map;
    struct bootmem_data *bdata;
    unsigned long node_start_pfn;
    unsigned long node_present_pages; /* total number of physical pages */
    unsigned long node_spanned_pages; /* total size of physical page
                                         range, including holes */
    int node_id;
    wait_queue_head_t kswapd_wait;
    struct task_struct *kswapd;
    int kswapd_max_order;
} pg_data_t;
node_zones: the zones of this node: ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM.
node_zonelists: the order in which zones are tried when allocating memory. It is set up by the build_zonelists() function in mm/page_alloc.c when free_area_init_core() is called.
nr_zones: the number of zones in this node, between 1 and 3. Not every node has 3 zones; for example, some have no ZONE_DMA.
node_mem_map: points to the first struct page of this node, which lies somewhere within mem_map.
bdata: used only by the boot memory allocator, described later.
node_start_pfn: PFN is the abbreviation of page frame number. This member gives the position in physical memory of the node's first page frame. In 2.4 and earlier kernels this was expressed as a physical address. Later, as hardware developed, physical memory could exceed the 4 GB that a 32-bit address can express, so it is now expressed in pages instead.
node_present_pages: the number of page frames in the node that can actually be used.
node_spanned_pages: the number of page frames the node spans, including the usable ones and also the area taken up by the mem_map mentioned later. (Corrected in this revision.) The English original describes it this way: "'node spanned pages' is the total area that is addressed by the node, including any holes that may exist", that is, the whole range the node can address, holes included.
node_id: the ID of the node, starting from 0.
kswapd_wait: the wait queue for this node's kswapd.

On a single-node system, contig_page_data is the only node object in the system.
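
The difference between node_present_pages and node_spanned_pages is easiest to see with numbers. The C sketch below assumes a node's usable memory is given as a list of page frame ranges; struct pfn_range and both function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One usable range of page frames, [start_pfn, end_pfn). Hypothetical type. */
struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Pages actually present: the sum of the range lengths. */
static unsigned long present_pages(const struct pfn_range *r, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(r[i].start_pfn <= r[i].end_pfn);
        total += r[i].end_pfn - r[i].start_pfn;
    }
    return total;
}

/* Pages spanned: from the lowest start to the highest end, holes included. */
static unsigned long spanned_pages(const struct pfn_range *r, size_t n)
{
    if (n == 0)
        return 0;
    unsigned long lo = r[0].start_pfn, hi = r[0].end_pfn;
    for (size_t i = 1; i < n; i++) {
        if (r[i].start_pfn < lo)
            lo = r[i].start_pfn;
        if (r[i].end_pfn > hi)
            hi = r[i].end_pfn;
    }
    return hi - lo;
}
```

For two ranges [0, 160) and [256, 1024), the present count is 928 pages while the spanned count is 1024 pages; the difference of 96 pages is the hole between the ranges.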

Zone

Each zone is described by a struct zone object. It holds the zone's usage state, such as page usage statistics, free memory areas and locks for mutually exclusive access. struct zone is defined in <linux/mmzone.h> (members related only to NUMA and memory hotplug are omitted here):
struct zone {
    unsigned long free_pages;
    unsigned long pages_min, pages_low, pages_high;
    unsigned long lowmem_reserve[MAX_NR_ZONES];

    struct per_cpu_pageset pageset[NR_CPUS];

    spinlock_t lock;

    struct free_area free_area[MAX_ORDER];

    ZONE_PADDING(_pad1_)    /* for cache-line alignment */

    spinlock_t lru_lock;
    struct list_head active_list;
    struct list_head inactive_list;
    unsigned long nr_scan_active;
    unsigned long nr_scan_inactive;
    unsigned long nr_active;
    unsigned long nr_inactive;
    unsigned long pages_scanned;
    int all_unreclaimable;

    atomic_t reclaim_in_progress;

    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

    int prev_priority;

    ZONE_PADDING(_pad2_)    /* for cache-line alignment */
    wait_queue_head_t *wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;

    struct pglist_data *zone_pgdat;
    unsigned long zone_start_pfn;

    unsigned long spanned_pages;
    unsigned long present_pages;

    const char *name;
} ____cacheline_internodealigned_in_smp;
free_pages: the number of unused (unallocated) pages.
pages_min, pages_low and pages_high: the zone's watermarks, which steer page reclaim; described in the following section.
lowmem_reserve[MAX_NR_ZONES]: because some code can only use memory in the low address range, part of that low memory is reserved in advance so that other allocations do not use it up.
pageset[NR_CPUS]: per-CPU page-management objects, each holding lists of pages. Each CPU maintains its own page lists, which avoids spinlock contention. The size of the array is NR_CPUS (the number of CPUs), which is fixed at compile time.
lock: spinlock that protects the zone against concurrent access.
free_area: free-area information used by the buddy allocator, with each bit recording whether the corresponding pages can be allocated.
lru_lock: spinlock for the LRU (least recently used) lists.
reclaim_in_progress: atomic counter used by reclaim operations.
active_list: the list of active pages.
inactive_list: the list of inactive pages.
refill_counter: the number of pages removed from the active list.
nr_active: the number of active pages.
nr_inactive: the number of inactive pages.
pressure: an index of how hard the zone is being scanned for pages to reclaim.
all_unreclaimable: set to 1 if, after two scans, the zone's pages still cannot be reclaimed.
pages_scanned: the number of pages scanned since pages were last reclaimed.
wait_table: a hash table of wait queues for processes waiting on a page to be unlocked. It is used by wait_on_page() and unlock_page(). Using a hash table of queues instead of one single wait queue keeps processes from waiting on the resource for a long time.
wait_table_hash_nr_entries: the number of wait queues in the hash table.
zone_pgdat: points to the pglist_data object that the zone belongs to.
zone_start_pfn: same meaning as node_start_pfn. It gives the position in physical memory of the zone's first page frame.
present_pages, spanned_pages: same meaning as the corresponding members of the node.
name: the zone's name as a string: "DMA", "Normal" or "HighMem".
ZONE_PADDING: spinlocks are used very frequently, so aligning some members to cache lines helps performance. This macro ensures that zone->lock, zone->lru_lock and zone->pageset use different cache lines.

Zone watermarks

Translated literally, a watermark is the water level of the zone. The metaphor is a reservoir: when the stock of water is very low the inflow is increased, when it reaches a certain mark the inflow is reduced, and when the reservoir is full the inlet may be closed. pages_min, pages_low and pages_high are marks of this kind.

When little memory is available in the system, the kernel thread kswapd is woken to reclaim and free pages. The parameters pages_min, pages_low and pages_high govern its behavior.

Each zone has three watermarks, pages_min, pages_low and pages_high, which help track how much pressure memory allocation is putting on the zone. The interaction between kswapd and these 3 parameters is shown in the following diagram:

The number of pages for pages_min is computed in free_area_init_core() during memory initialization. It is the number of pages in the zone divided by a factor greater than 1; the initial value is usually ZoneSizeInPages / 128.
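
The arithmetic can be sketched in C as below. This is only an illustration of the ratios stated in these notes (pages_min as the zone size divided by 128, with pages_low and pages_high at the default two and three times pages_min described next); real kernels compute these values differently depending on the version. struct watermarks and compute_watermarks() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical container for the three thresholds described in these notes. */
struct watermarks {
    unsigned long pages_min;
    unsigned long pages_low;
    unsigned long pages_high;
};

/* Derive the thresholds from the zone size, using the ratios in the notes. */
static struct watermarks compute_watermarks(unsigned long zone_pages)
{
    struct watermarks w;

    assert(zone_pages >= 128);          /* keep pages_min at least 1 */
    w.pages_min  = zone_pages / 128;    /* ZoneSizeInPages / 128 */
    w.pages_low  = 2 * w.pages_min;     /* default: twice pages_min */
    w.pages_high = 3 * w.pages_min;     /* default: three times pages_min */
    return w;
}
```

For a 16 MB zone of 4,096 pages this gives pages_min = 32, pages_low = 64 and pages_high = 96.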

pages_low: when the number of free pages drops to pages_low, the kswapd thread is woken and starts reclaiming and freeing pages. By default this value is twice pages_min.

pages_min: when the number of free pages drops to pages_min, the page allocator does kswapd's work itself, synchronously.

pages_high: when the number of free pages reaches pages_high, the kswapd thread goes back to sleep. This value is usually 3 times pages_min.

Calculation of the size of the zone

The setup_memory() function calculates the size of each zone:

A PFN is an offset into physical memory counted in pages. The first PFN the system can use is held in the min_low_pfn variable. It starts at the first page after the _end label, which is where the kernel image ends. The variable is defined in mm/bootmem.c. The last PFN available to the system is held in the max_pfn variable, and how it is initialized depends entirely on the hardware architecture. On x86, the find_max_pfn() function obtains the highest page frame number by reading the e820 table, which is created by the BIOS. This variable is also defined in mm/bootmem.c.

On x86, the max_low_pfn variable is computed by the find_max_low_pfn() function and marks the last page of ZONE_NORMAL. This is the physical memory the kernel can access directly, and it is tied to the kernel/userspace split of the linear address space marked by the PAGE_OFFSET macro. (Original: "This is the physical memory directly accessible by the kernel and is related to the kernel/userspace split in the linear address space marked by PAGE_OFFSET.") My understanding is that the kernel can access this range directly because the PAGE_OFFSET macro converts the virtual addresses the kernel uses straight into physical addresses. This variable is also defined in mm/bootmem.c. On machines with little memory, max_pfn and max_low_pfn have the same value.

The three values min_low_pfn, max_pfn and max_low_pfn are also used to compute the start and end of high memory. The corresponding variables highstart_pfn and highend_pfn are defined in arch/i386/mm/init.c. They are used when allocating high-memory pages, which is described later.
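
A minimal C sketch of how these limits relate, assuming the 896 MB low-memory limit from the zone layout given earlier. The struct and function names are hypothetical, and the real setup code handles more cases (for example configuration options and holes reported by the e820 table).

```c
#include <assert.h>

#define PAGE_SHIFT  12
/* Assumed low-memory limit of 896 MB, expressed as a page frame number. */
#define MAXMEM_PFN  ((896UL * 1024 * 1024) >> PAGE_SHIFT)

/* Hypothetical container mirroring the variables named in the notes. */
struct pfn_limits {
    unsigned long max_low_pfn;    /* end of ZONE_NORMAL */
    unsigned long highstart_pfn;  /* start of high memory */
    unsigned long highend_pfn;    /* end of high memory */
};

/* Derive the limits from max_pfn, the last page frame in the system. */
static struct pfn_limits compute_limits(unsigned long max_pfn)
{
    struct pfn_limits l;

    assert(max_pfn > 0);
    l.max_low_pfn   = max_pfn < MAXMEM_PFN ? max_pfn : MAXMEM_PFN;
    l.highstart_pfn = l.max_low_pfn;  /* high memory begins where lowmem ends */
    l.highend_pfn   = max_pfn;
    return l;
}
```

With 512 MB of memory (max_pfn = 131072) max_low_pfn equals max_pfn and the high-memory range is empty; with 2 GB (max_pfn = 524288) max_low_pfn is capped at 229376 and high memory covers the rest.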

Zone wait queue table

When I/O is being performed on a page, the page is locked to prevent inconsistent data from being accessed. A process that wants to use the page first calls wait_on_page(), which adds the process to a wait queue. When the I/O is complete, unlock_page() unlocks the page and the processes waiting on the queue are woken. Each page could have its own wait queue, but that many separate queues would be too expensive in memory. The alternative is to keep the queues in struct zone.

It would be possible to have just one queue in struct zone, but then whenever one page was unlocked, every sleeping process waiting on any page in that zone would be woken, which causes a thundering herd problem. Managing several wait queues through a hash table solves this, and zone->wait_table is that hash table. With a hash table some processes may still be woken unnecessarily, but this should not happen often. The following diagram shows the relationship between processes and the wait queues:

The hash table of wait queues is allocated and set up in free_area_init_core(). The number of entries in the table is computed by wait_table_size() and stored in zone->wait_table_hash_nr_entries (zone->wait_table_size in older kernels). There are at most 4,096 wait queues. For smaller tables, the size is the smallest power of 2 that can hold NoPages / PAGES_PER_WAITQUEUE queues, where NoPages is the number of pages managed by the zone and PAGES_PER_WAITQUEUE is defined as 256. (Original: "For smaller tables, the size of the table is the minimum power of 2 required to store NoPages / PAGES_PER_WAITQUEUE number of queues, where NoPages is the number of pages in the zone and PAGES_PER_WAITQUEUE is defined to be 256.")
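
The sizing rule described above can be sketched in C as follows. This follows only what these notes state (a power of two large enough for NoPages / 256 queues, capped at 4,096); the constant MAX_WAIT_QUEUES and the helper wait_table_bits_for() are names chosen for this example, and the real kernel function may apply further limits.

```c
#include <assert.h>

#define PAGES_PER_WAITQUEUE 256UL
#define MAX_WAIT_QUEUES     4096UL   /* upper limit stated in the notes */

/* Smallest power of two >= nr_pages / PAGES_PER_WAITQUEUE, capped at 4096. */
static unsigned long wait_table_size(unsigned long nr_pages)
{
    unsigned long want = nr_pages / PAGES_PER_WAITQUEUE;
    unsigned long size = 1;

    while (size < want)
        size <<= 1;
    return size < MAX_WAIT_QUEUES ? size : MAX_WAIT_QUEUES;
}

/* Number of bits needed to index a table of the given power-of-two size. */
static unsigned int wait_table_bits_for(unsigned long size)
{
    unsigned int bits = 0;

    assert(size != 0 && (size & (size - 1)) == 0);  /* must be a power of two */
    while ((1UL << bits) < size)
        bits++;
    return bits;
}
```

A zone of 229,376 pages (896 MB) needs 896 queues, which rounds up to a table of 1,024 entries indexed by 10 bits.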

The following formula can be used to compute this value:

zone->wait_table_bits is a factor in the algorithm that derives, from a page's address, the index of its wait queue in the hash table. The page_waitqueue() function returns the wait queue for a page in the zone. It uses a simple multiplicative hash on the virtual address of the struct page to pick the queue.
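
A multiplicative hash of this kind can be sketched in C as below. This is a 32-bit illustration rather than the kernel's code: the function name hash_index() is made up, the address is passed as a plain 32-bit integer, and the constant is assumed to be the 32-bit GOLDEN_RATIO_PRIME value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 32-bit golden-ratio prime used for multiplicative hashing. */
#define GOLDEN_RATIO_PRIME_32 0x9e370001U

/* Multiply the address by the prime (wrapping modulo 2^32) and keep the
 * top 'bits' bits of the product as the table index. */
static uint32_t hash_index(uint32_t addr, unsigned int bits)
{
    assert(bits > 0 && bits <= 32);
    uint32_t product = addr * GOLDEN_RATIO_PRIME_32;
    return product >> (32 - bits);
}
```

Taking the high bits of the product spreads nearby addresses across the table, and the result is always smaller than 2^bits, so it can be used directly as an index into a table of that size.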

To find the index of the wait queue in the hash table, page_waitqueue() multiplies the page's address by GOLDEN_RATIO_PRIME and right-shifts the product so that zone->wait_table_bits bits remain as the index.

Initialization of zones

Zones are initialized after the kernel page tables have been fully set up by the paging_init() function, which is described in later sections. The process naturally differs between architectures, but the goal is the same: to determine what parameters to pass to free_area_init() (for UMA) or free_area_init_node() (for NUMA). The NUMA case is omitted here.

Parameters of the free_area_init() function:

unsigned long *zones_sizes: an array holding the number of pages in each zone in the system. At this point it is not yet known which pages in a zone are free for allocation; that information is not available until the boot memory allocator has finished.
