Linux is available for a wide variety of architectures, so an architecture-independent way of describing memory is needed. This chapter describes the data structures used to manage memory banks, page frames, and the flags that affect VM behaviour.
The first important concept in the VM is Non-Uniform Memory Access (NUMA). On large machines, the cost of accessing memory varies with the distance between the processor and the memory bank. For example, a bank of memory may be assigned to each CPU, or a bank suitable for DMA operations may be placed near a particular device or card.
Each bank is called a node. In Linux, whether the machine is Non-Uniform Memory Access (NUMA) or Uniform Memory Access (UMA), a node is represented by struct pglist_data (typedef'd to pg_data_t). Every node in the system is kept on the pgdat_list linked list, and each node is linked to the next via pg_data_t->node_next. On UMA machines such as PC desktops, there is only one static pg_data_t structure, called contig_page_data. Nodes are discussed further in section 2.1.
Each node is divided into a number of memory ranges called zones. A zone is described by struct zone_struct and is one of three types: ZONE_DMA, ZONE_NORMAL, or ZONE_HIGHMEM. Each type serves a different purpose. ZONE_DMA contains low physical memory for devices that cannot address more than 16 MiB of physical memory. ZONE_NORMAL is directly mapped into the lower part of the kernel's linear address space, discussed further in section 4.1. ZONE_HIGHMEM is the remaining memory that cannot be directly mapped into kernel space.
For x86 machines, the zones are laid out as follows:

ZONE_DMA: first 16 MiB of memory
ZONE_NORMAL: 16 MiB - 896 MiB
ZONE_HIGHMEM: 896 MiB - end
Most kernel operations use ZONE_NORMAL, making it the most performance-critical zone. Zones are discussed further in section 2.2. System memory is divided into fixed-size page frames. Each physical page frame is represented by a struct page, all of which are stored in the global mem_map array, which is usually placed at the start of ZONE_NORMAL or just after the area reserved for the kernel image.
We will discuss struct page in detail in section 2.4 and the global mem_map array in section 3.7. Figure 2.1 illustrates the relationship between these data structures.
Because the amount of memory the kernel can access directly (the size of ZONE_NORMAL) is limited, Linux supports additional physical memory through high memory, discussed in section 2.7. Before introducing high memory management, we first discuss how nodes, zones, and pages are represented.
2.1 nodes
As mentioned before, each memory node is described by a pg_data_t, a typedef for struct pglist_data. When a page is allocated, Linux uses a node-local allocation policy: memory is allocated from the node closest to the CPU the process is running on. This data structure is defined in <linux/mmzone.h>.
struct bootmem_data;
typedef struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	struct zonelist node_zonelists[MAX_ZONELISTS];
	int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
	struct page *node_mem_map;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	struct page_cgroup *node_page_cgroup;
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
	struct bootmem_data *bdata;
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
	/*
	 * Must be held any time you expect node_start_pfn, node_present_pages
	 * or node_spanned_pages stay constant.  Holding this will also
	 * guarantee that any pfn_valid() stays that way.
	 *
	 * Nests above zone->lock and zone->size_seqlock.
	 */
	spinlock_t node_size_lock;
#endif
	unsigned long node_start_pfn;
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page
					     range, including holes */
	int node_id;
	wait_queue_head_t kswapd_wait;
	struct task_struct *kswapd;
	int kswapd_max_order;
} pg_data_t;
Introduction to Data Structure members
node_zones: an array containing the zones of this node.
node_zonelists: the preferred allocation order for zones. build_zonelists() in mm/page_alloc.c builds this list and specifies the fallback order: when no space is available in the preferred zone, the fallback zones are tried in turn. For example, if an allocation from ZONE_HIGHMEM fails, ZONE_NORMAL is tried; if ZONE_NORMAL fails, ZONE_DMA is tried.
nr_zones: the number of zones in this node, between 1 and 3. Not every node has three zones; a node may not have ZONE_DMA, for example.
node_mem_map: the struct page of the first physical page frame in the node. It points to some position within the global mem_map array.
bdata: during system startup, the kernel needs memory before the memory-management subsystem is initialised. The boot memory allocator uses this member; it is discussed in chapter 5.
node_start_pfn: the page frame number of the first page frame in the node. Page frames of all nodes in the system are numbered sequentially, so each page frame number is globally unique.
node_present_pages: the number of physical page frames present in the node.
node_spanned_pages: the size of the range of page frames spanned by the node, including holes.
All nodes in the system are maintained on the pgdat_list list, which is built in the initialisation function init_bootmem_core(). The initialisation of pgdat_list is discussed in section 5.3. Up to kernel 2.4.18, code traversed the pgdat_list linked list as follows:
pg_data_t *pgdat;

pgdat = pgdat_list;
do {
	/* do something with pgdat */
} while ((pgdat = pgdat->node_next));
In more recent kernel versions, the macro for_each_pgdat provides a cleaner way to traverse pgdat_list.
2.2 zones
Each zone is described by a struct zone_struct, which stores statistics about page usage, free-page information, and locks protecting the zone from concurrent access. The structure is declared in <linux/mmzone.h>:
typedef struct zone_struct {
	spinlock_t lock;
	unsigned long free_pages;
	unsigned long pages_min, pages_low, pages_high;
	int need_balance;
	free_area_t free_area[MAX_ORDER];
	wait_queue_head_t *wait_table;
	unsigned long wait_table_size;
	unsigned long wait_table_shift;
	struct pglist_data *zone_pgdat;
	struct page *zone_mem_map;
	unsigned long zone_start_paddr;
	unsigned long zone_start_mapnr;
	char *name;
	unsigned long size;
} zone_t;
Introduction to Data Structure members:
lock: a spinlock protecting the zone from concurrent access.
free_pages: the total number of free pages in the zone.
pages_min, pages_low, pages_high: the zone watermarks. These three values affect the behaviour of the pageout daemon and are described in detail in section 2.2.1. From them it can be seen that the pageout daemon tracks the usage of each zone individually.
need_balance: a flag that tells kswapd the zone needs balancing. It is set when the number of free pages in the zone reaches one of the zone watermarks.
free_area: the free-area bitmaps recording the usage of pages in the zone. The buddy allocator uses these bitmaps to allocate and free pages.
wait_table: a hash table of wait queues for processes waiting on a page; the wait_on_page() and unlock_page() functions use this hash table. We discuss wait_table in section 2.2.3.
wait_table_size: the number of wait queues in the hash table, which is a power of 2.
wait_table_shift: the number of bits in a long minus the binary logarithm of the table size above.
zone_pgdat: points to the pg_data_t of the node this zone belongs to.
zone_mem_map: the struct page, within the global mem_map, of the first physical page frame in the zone.
zone_start_paddr: the starting physical address of the zone.
zone_start_mapnr: the page frame number of the zone's first physical page frame, that is, its offset within the global mem_map.
name: a string describing the zone: "DMA", "Normal", or "HighMem".
size: the number of page frames in the zone.
2.2.1 zone watermarks
When available memory in the system runs low, the pageout daemon kswapd is woken to begin freeing pages. Under severe memory pressure, the allocator frees memory synchronously itself, which is called direct-reclaim. The parameters controlling pageout behaviour are similar to those used by FreeBSD and Solaris.
Each zone has three watermarks, pages_low, pages_min, and pages_high, which help track how much pressure the zone is under. pages_min is calculated in free_area_init_core() as ZoneSizeInPages / 128, bounded below by 20 pages and above by 255 pages.
Depending on where the number of free pages sits relative to these watermarks, the system takes different actions.
pages_low: when the number of free pages falls below pages_low, the buddy allocator wakes kswapd to begin freeing pages. The default value is twice pages_min.
pages_min: when the number of free pages falls below pages_min, the allocator frees pages synchronously itself rather than waiting for kswapd; this is direct reclaim.
pages_high: once kswapd has been woken, it does not stop reclaiming and go back to sleep until the number of free pages reaches pages_high. The default value is three times pages_min.
Whatever these parameters are called on similar systems, the net effect is the same: a pageout daemon or thread is woken to free pages.
2.2.2 calculating the size of zones
The zone sizes are calculated in setup_memory(), shown in Figure 2.3.
The PFN (physical frame number) is the offset, counted in pages, of a physical page frame within physical memory. The first usable PFN in the system, min_low_pfn, is located at the first page after the end of the loaded kernel image. The value is stored in mm/bootmem.c.
max_pfn, the PFN of the last physical page frame, is found in an architecture-specific fashion.
max_low_pfn is the highest PFN in low memory and marks the end of ZONE_NORMAL. This is the most physical memory the kernel can access directly; its value depends on the kernel/userspace address-space split. The value is stored in mm/bootmem.c. On low-memory machines, max_pfn equals max_low_pfn.
With min_low_pfn, max_low_pfn, and max_pfn, it is straightforward to determine the start and end of high memory, highstart_pfn and highend_pfn.
2.2.3 zone wait queue table
2.3 zone Initialization
2.4 initializing mem_map
mem_map is created during system startup. On UMA systems, free_area_init() uses contig_page_data as the node and the global mem_map as the node's local mem_map.
free_area_init_core() allocates a local mem_map for the node being initialised. The array is allocated from the boot memory allocator via alloc_bootmem_node().
2.5 pages
Every physical page frame in the system has an associated struct page, which is used to keep track of that frame's status. In the 2.2 kernel, this structure resembled its System V counterpart, but like the other UNIX variants it has since been rewritten. struct page is defined in <linux/mm.h>:
typedef struct page {
	struct list_head list;
	struct address_space *mapping;
	unsigned long index;
	struct page *next_hash;
	atomic_t count;
	unsigned long flags;
	struct list_head lru;
	struct page **pprev_hash;
	struct buffer_head *buffers;
#if defined(CONFIG_HIGHMEM) || defined(WAIT_PAGE_VIRTUAL)
	void *virtual;
#endif
} mem_map_t;
The following describes the member variables in the page structure:
list: pages may belong to several linked lists, and this member is used to link them together. For example, pages mapped by a file may belong to one of the three circular linked lists kept by the address_space: clean_pages, dirty_pages, and locked_pages. In the slab allocator, once a page has been allocated by the slab allocator, this field stores pointers to the slab and cache structures managing the page. It is also used to link free blocks together.
mapping: when files or devices are memory-mapped, their inode has an associated address_space. This member points to that address space if the page belongs to the file. If the page is an anonymous mapping, the address_space is swapper_space, which manages the swap address space.
index: this member has two uses, depending on the state of the page. If the page is part of a file mapping, index is the page's offset within the file. If the page is part of the swap cache, index is the offset within the swap address space swapper_space. Alternatively, if a block of pages is being freed by a particular process, the order of the block being freed is stored in index; this is set by the function __free_pages_ok().
next_hash: pages that are part of a file mapping are hashed by inode and offset. This member links together pages that share the same hash bucket.
count: the reference count of the page. When it drops to 0, the page may be freed; when it is greater than 0, the page is in use by one or more processes or by the kernel.
flags: architecture-independent flags describing the status of the page. They are defined in <linux/mm.h> and are also listed in Table 2.1.
lru: for the page replacement policy, pages that may be swapped out sit on either the active_list or the inactive_list. lru is the list head for these least-recently-used lists; both are discussed in chapter 10.
pprev_hash: the complement of next_hash, making the hash chain a doubly linked list.
buffers: if a page has buffers for a block device associated with it, this member records the buffer_head. An anonymous page backed by a swap file may also have a buffer_head, which is needed to synchronise the page with backing storage in block-sized chunks.
virtual: normally only pages from ZONE_DMA and ZONE_NORMAL are directly mapped by the kernel; pages from ZONE_HIGHMEM are not. When a high-memory page is mapped, this member records its virtual address.
2.6 mapping pages to zones
As recently as kernel 2.4.18, struct page contained a zone member pointing to the zone the page belongs to. This was later considered wasteful: even a single pointer consumes a significant amount of memory when multiplied over millions of struct pages. In more recent kernels, the zone member has been removed and the highest ZONE_SHIFT bits of page->flags are used instead to determine which zone the page belongs to. These bits record the index of the page's zone in zone_table.
zone_table is created at system startup and is declared as follows in mm/page_alloc.c:
zone_t *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
EXPORT_SYMBOL(zone_table);
MAX_NR_ZONES is the maximum number of zones in a node, typically 3. MAX_NR_NODES is the maximum number of nodes in the system. EXPORT_SYMBOL() makes zone_table accessible to loadable modules. The table is treated like a multi-dimensional array. In free_area_init_core(), all the pages in a node are initialised; first, the zone is recorded in the table:
zone_table[nid * MAX_NR_ZONES + j] = zone
where nid is the node ID, j is the index of the zone within the node, and zone is the zone_t structure. For each page, the function set_page_zone() is then called:
set_page_zone(page, nid * MAX_NR_ZONES + j)
This sets the zone index of the parameter @page to nid * MAX_NR_ZONES + j.
2.7 high memory
Because the address space usable by the kernel (ZONE_NORMAL) is limited in size, the kernel supports additional physical memory through high memory. Two thresholds exist on 32-bit x86 systems: 4 GiB and 64 GiB.
The 4 GiB limit is the maximum addressable space of a 32-bit physical address. To access memory between roughly 1 GiB and 4 GiB (the 1 GiB lower bound is not fixed: it depends on the space reserved for vmalloc and on the kernel/userspace address-space split), the kernel must use kmap to temporarily map pages into ZONE_NORMAL. This is discussed further in chapter 9.
The second limit, 64 GiB, is related to PAE (Physical Address Extension), Intel's invention that allows more RAM to be used on 32-bit systems. It uses additional bits for memory addressing, extending the addressable range to 2^36 bytes (64 GiB).
In theory, PAE allows 64 GiB to be addressed, but in practice a Linux process cannot access that much RAM, because the virtual address space is still 4 GiB. It is impossible for a user to malloc() all of physical memory within a single process.
Furthermore, PAE does not allow the kernel itself to make use of that much RAM. Each page frame is described by a struct page, which here is 44 bytes and is stored in the kernel virtual address space in ZONE_NORMAL. That means describing 1 GiB of physical memory takes about 11 MiB of kernel memory, and 16 GiB takes 176 MiB, putting considerable pressure on ZONE_NORMAL. This does not sound too bad until other, smaller, structures are considered, such as page table entries (PTEs), which require about 16 MiB in the worst case. This makes 16 GiB the practical limit on usable physical memory for Linux on x86. To access more physical memory, the advice is to use a 64-bit machine.