Linux page frame management

Source: Internet
Author: User

The previous post explained the segmentation and paging mechanisms of the Linux kernel on the 80x86 architecture and described the Linux memory layout in detail. With these basic concepts in place, this post discusses how the kernel dynamically manages the available memory.

 

On a 32-bit 80386 processor, Linux uses 4 KB pages as the standard unit of memory allocation. The kernel must record the current state of each page frame: for example, it must be able to tell which page frames hold process pages and which hold kernel code or kernel data. The kernel must also be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it contains no useful data. It is not free if it contains user-mode process data, data in a software cache, dynamically allocated kernel data structures, data buffered by a device driver, kernel module code, and so on.

 

The kernel describes the state of a page frame with a page descriptor of type struct page. All page descriptors are stored in the global mem_map array, and the index into the array is the page frame number (PFN). Because each descriptor is 32 bytes long, mem_map takes slightly less than 1% of total RAM (32 bytes per 4096-byte page frame, about 0.78%).
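The size estimate can be checked with a little arithmetic. The following is a minimal user-space sketch, not kernel code: the helper names nr_page_frames and mem_map_bytes are made up for illustration, and the 32-byte descriptor size is simply the figure quoted in the text.

```c
#include <assert.h>

#define PAGE_SHIFT 12                    /* 4 KB pages, as on 32-bit 80x86 */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define DESC_SIZE  32ULL                 /* descriptor size quoted in the text */

/* Number of page frames needed to cover ram_bytes of RAM (hypothetical helper). */
unsigned long nr_page_frames(unsigned long long ram_bytes)
{
    assert(ram_bytes % PAGE_SIZE == 0);  /* this sketch assumes whole pages */
    return (unsigned long)(ram_bytes >> PAGE_SHIFT);
}

/* Bytes occupied by a mem_map array that describes ram_bytes of RAM. */
unsigned long long mem_map_bytes(unsigned long long ram_bytes)
{
    return nr_page_frames(ram_bytes) * DESC_SIZE;
}
```

For 1 GB of RAM this gives 262144 page frames and an 8 MB mem_map, which is 32/4096 of the memory it describes.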

 

How is a page descriptor associated with the 4 KB page frame it describes? With the mem_map array this is simple. Given the address pd of a page descriptor, subtracting mem_map from pd yields the page frame number PFN. The physical address of that page frame is then physaddr = PFN << PAGE_SHIFT.

Once the physical address physaddr of the page frame is known, its virtual address depends on the value of physaddr:
1. If physaddr < 896 MB, the virtual address is physaddr + PAGE_OFFSET (PAGE_OFFSET = 3 GB).
2. If physaddr >= 896 MB, the page frame is not statically mapped. A virtual address has to be obtained through the kernel's high-memory mapping mechanisms.

 

Once it has a virtual address for the page frame, the kernel can access the physical page normally.
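The chain from descriptor to PFN to physical address to virtual address can be sketched in user space as follows. This is an illustration only: the array size NR_FRAMES, the trimmed struct page, and the helper names desc_to_pfn, pfn_to_phys and phys_to_virt_lowmem are assumptions of the sketch, not kernel definitions.

```c
#include <assert.h>

#define PAGE_SHIFT   12
#define PAGE_OFFSET  0xC0000000UL        /* 3 GB */
#define HIGH_LIMIT   (896UL << 20)       /* 896 MB: end of the direct mapping */
#define NR_FRAMES    1024                /* toy size, chosen for the sketch */

struct page { unsigned long flags; int _count; };   /* heavily trimmed */

struct page mem_map[NR_FRAMES];          /* one descriptor per page frame */

/* Descriptor address -> page frame number: plain pointer subtraction. */
unsigned long desc_to_pfn(const struct page *pd)
{
    assert(pd >= mem_map && pd < mem_map + NR_FRAMES);
    return (unsigned long)(pd - mem_map);
}

/* Page frame number -> physical address of the page frame. */
unsigned long pfn_to_phys(unsigned long pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Physical -> kernel virtual address for low memory. Returns 0 for
 * addresses at or above 896 MB, which have no permanent mapping. */
unsigned long phys_to_virt_lowmem(unsigned long physaddr)
{
    if (physaddr >= HIGH_LIMIT)
        return 0;
    return physaddr + PAGE_OFFSET;
}
```

For the descriptor at index 5, this yields PFN 5, physical address 0x5000 and virtual address 0xC0005000.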

 

The kernel provides the virt_to_page(addr) macro, which yields the address of the page descriptor corresponding to the linear address addr. The pfn_to_page(pfn) macro yields the descriptor address for page frame number pfn. Conversely, the page_to_pfn(pg) macro yields the page frame number for the page descriptor pg. Note: on the 80x86 architecture these macros are not computed directly from the mem_map array but from the zone_mem_map of the memory zone. The principle is the same:

#define page_to_pfn(pg)                                         \
({                                                              \
        struct page *__page = pg;                               \
        struct zone *__zone = page_zone(__page);                \
        (unsigned long)(__page - __zone->zone_mem_map)          \
                + __zone->zone_start_pfn;                       \
})
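The same zone-relative arithmetic can be tried outside the kernel. In the sketch below, each toy descriptor carries a direct pointer to its zone, which is an assumption made for brevity (the kernel encodes the zone in the flags field), and the names toy_page_to_pfn and toy_pfn_to_page are hypothetical.

```c
#include <assert.h>

struct zone;

struct page {
    struct zone *zone;              /* toy shortcut, not how the kernel stores it */
};

struct zone {
    struct page  *zone_mem_map;     /* first descriptor of the zone */
    unsigned long zone_start_pfn;   /* PFN of the zone's first page frame */
    unsigned long spanned_pages;    /* number of page frames in the zone */
};

/* Offset of the descriptor inside its zone, plus the zone's first PFN. */
unsigned long toy_page_to_pfn(const struct page *pg)
{
    const struct zone *z = pg->zone;
    return (unsigned long)(pg - z->zone_mem_map) + z->zone_start_pfn;
}

/* Inverse mapping for a PFN known to lie inside zone z. */
struct page *toy_pfn_to_page(const struct zone *z, unsigned long pfn)
{
    assert(pfn >= z->zone_start_pfn &&
           pfn < z->zone_start_pfn + z->spanned_pages);
    return z->zone_mem_map + (pfn - z->zone_start_pfn);
}
```

With a zone that starts at PFN 4096, the fourth descriptor of the zone maps to PFN 4099, and the inverse function maps PFN 4099 back to that descriptor.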

 

A word of caution: do not confuse two concepts. Although physaddr is a physical address, it does not follow that the data of a given page is actually present in physical memory. How do we know whether a page has been swapped out? This is where the paging mechanism from the previous post comes in: if a page has been swapped out for whatever reason, the Present flag of its page table entry is 0, and accessing it raises a page fault exception. Page faults are covered in a later post.

 

Only two fields of the page descriptor need to be discussed in detail here:
1. _count: the usage reference counter of the page. If it is -1, the page frame is free and can be allocated to any process or to the kernel itself. If it is greater than or equal to 0, the page frame is assigned to one or more processes, or is used to hold kernel data structures. The page_count() function returns _count plus 1, that is, the number of users of the page.
2. flags: contains up to 32 flags that describe the state of the page frame. For each PG_xyz flag, the kernel defines macros that manipulate its value. Usually the PageXyz macro returns the value of the flag, while the SetPageXyz and ClearPageXyz macros set and clear the corresponding bit.
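The two fields can be illustrated with a small user-space sketch. It is a simplification under stated assumptions: the bit numbers are chosen for the example, the struct is trimmed to two fields, the function is named toy_page_count to make clear it is not the kernel's, and plain bit operations stand in for the atomic ones a kernel would need.

```c
#include <assert.h>

enum { PG_locked = 0, PG_dirty = 4 };   /* bit numbers assumed for the sketch */

struct page {
    unsigned long flags;                /* one bit per PG_xyz flag */
    int _count;                         /* -1 means the page frame is free */
};

/* Number of users of the page frame: _count + 1. */
int toy_page_count(const struct page *p)
{
    assert(p->_count >= -1);
    return p->_count + 1;
}

/* Non-atomic stand-ins for the PageXyz / SetPageXyz / ClearPageXyz family. */
#define PageDirty(p)       (((p)->flags >> PG_dirty) & 1UL)
#define SetPageDirty(p)    ((p)->flags |=  (1UL << PG_dirty))
#define ClearPageDirty(p)  ((p)->flags &= ~(1UL << PG_dirty))
```

A free page frame (_count equal to -1) reports zero users, and setting or clearing PG_dirty touches only that one bit of flags.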

 

Flag name          Description
PG_locked          The page is locked; for example, it is involved in a disk I/O operation.
PG_error           An I/O error occurred while transferring the page.
PG_referenced      The page has been recently accessed.
PG_uptodate        Set after a read operation completes, unless a disk I/O error occurred.
PG_dirty           The page has been modified.
PG_lru             The page is in the active or inactive page list.
PG_active          The page is in the active page list.
PG_slab            The page frame is included in a slab.
PG_highmem         The page frame belongs to the ZONE_HIGHMEM zone.
PG_checked         Used by some filesystems (such as Ext2 and Ext3).
PG_arch_1          Not used on the 80x86 architecture.
PG_reserved        The page frame is reserved for kernel code or is unusable.
PG_private         The private field of the page descriptor stores meaningful data.
PG_writeback       The page is being written to disk by means of the writepage method.
PG_nosave          Used for system suspend/resume.
PG_compound        The page frame is handled through the extended paging mechanism.
PG_swapcache       The page belongs to the swap cache.
PG_mappedtodisk    All data in the page frame corresponds to blocks allocated on disk.
PG_reclaim         The page has been marked to be written to disk in order to reclaim memory.
PG_nosave_free     Used for system suspend/resume.

 

How does the system allocate memory for a process or for the kernel, that is, how does it hand out page frames together with their page descriptors? This is the job of the kernel's zoned page frame allocator and the buddy system algorithm. Before going into those details, some necessary concepts are introduced.

 

1. Non-uniform memory access (NUMA) Architecture

 

Linux 2.6 supports the Non-Uniform Memory Access (NUMA) model, in which the time a given CPU needs to access different memory locations may vary. The physical memory of the system is partitioned into several nodes. Within a single node, the time a given CPU needs to access any page is the same, but this time may differ from one CPU to another. For every CPU, the kernel tries to minimize the number of accesses to costly nodes by carefully choosing where the kernel data structures most often referenced by that CPU are stored.

 

Each node is represented by a descriptor of type pg_data_t. The descriptors of all nodes are kept in a singly linked list whose first element is pointed to by the kernel global variable pgdat_list. On ordinary 80x86 PCs memory access is uniform (UMA): access time is the same for every CPU, so NUMA support is not needed. The kernel still uses the node concept, but with a single node that contains all the physical memory in the system. The pgdat_list variable therefore points to a list with exactly one element, the node 0 descriptor, which is stored in the contig_page_data variable.

 

Three fields of the pg_data_t descriptor deserve attention: node_zones, node_zonelists, and node_mem_map, whose types are zone_t[], zonelist_t[], and struct page * respectively. The first two describe the memory zones of the node, which are discussed next; node_mem_map is the array of page descriptors for all page frames of the node. Together, these three fields tie the zones and page frames to their node.

 

2. Memory Zones

 

The Linux kernel must cope with two hardware constraints of the 80x86 architecture:
(1) The Direct Memory Access (DMA) processors for old ISA buses have a strong limitation: they can address only the first 16 MB of RAM.
(2) In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.

 

To cope with these two limitations, Linux 2.6 partitions the physical memory of every memory node into three zones. On the 80x86 UMA architecture the zones are:
ZONE_DMA: contains page frames of memory below 16 MB
ZONE_NORMAL: contains page frames of memory at and above 16 MB and below 896 MB
ZONE_HIGHMEM: contains page frames of memory at and above 896 MB
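The zone boundaries translate directly into a classification function. The sketch below is illustrative only: zone_of_phys and zone_name are hypothetical helpers, and the boundaries are the 16 MB and 896 MB limits of the 80x86 UMA layout described here.

```c
#include <assert.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define MB(x) ((unsigned long)(x) << 20)

/* Zone that a physical address falls into on 80x86 UMA. */
enum zone_type zone_of_phys(unsigned long physaddr)
{
    if (physaddr < MB(16))
        return ZONE_DMA;
    if (physaddr < MB(896))
        return ZONE_NORMAL;
    return ZONE_HIGHMEM;
}

/* Conventional zone name, for printing. */
const char *zone_name(enum zone_type z)
{
    static const char *const names[] = { "DMA", "normal", "highmem" };
    assert((unsigned)z <= (unsigned)ZONE_HIGHMEM);
    return names[z];
}
```

The last byte below 16 MB still belongs to ZONE_DMA, while the address 16 MB itself is the first byte of ZONE_NORMAL; the same holds at the 896 MB boundary.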

 

The ZONE_DMA and ZONE_NORMAL zones contain the "normal" page frames. Because they are linearly mapped into the fourth gigabyte of the linear address space, the kernel can access them directly. Page frames in ZONE_HIGHMEM cannot be accessed directly through that linear mapping; the kernel must first map them into the fourth gigabyte through its high-memory mapping mechanisms.

 

Each memory zone has its own descriptor of type zone_t. Many of its fields are used for page frame reclaiming. In fact, every page descriptor has links to its memory node and to the zone within that node to which it belongs. Why are they not visible as fields? To save space, these links are not stored as ordinary pointers; they are encoded as indexes and stored in the high-order bits of the flags field.

 

The fields of zone_t are as follows:

Type                       Name                Description
unsigned long              free_pages          Number of free pages in the zone
unsigned long              pages_min           Number of reserved pages in the zone
unsigned long              pages_low           Low watermark for page frame reclaiming; also used by the zone allocator as a threshold value
unsigned long              pages_high          High watermark for page frame reclaiming; also used by the zone allocator as a threshold value
unsigned long[]            lowmem_reserve      Number of page frames in each zone that must be reserved for handling low-on-memory situations
struct per_cpu_pageset[]   pageset             Data structure used to implement special caches of single page frames
spinlock_t                 lock                Spin lock protecting the descriptor
struct free_area[]         free_area           Identifies the blocks of free page frames in the zone
spinlock_t                 lru_lock            Spin lock for the active and inactive lists
struct list_head           active_list         List of active pages in the zone
struct list_head           inactive_list       List of inactive pages in the zone
unsigned long              nr_scan_active      Number of active pages to be scanned when reclaiming memory
unsigned long              nr_scan_inactive    Number of inactive pages to be scanned when reclaiming memory
unsigned long              nr_active           Number of pages in the zone's active list
unsigned long              nr_inactive         Number of pages in the zone's inactive list
unsigned long              pages_scanned       Counter used when reclaiming page frames in the zone
int                        all_unreclaimable   Flag set when the zone is full of unreclaimable pages
int                        temp_priority       Temporary zone priority (used when reclaiming page frames)
int                        prev_priority       Zone priority, ranging between 12 and 0 (used by the page frame reclaiming algorithm)
wait_queue_head_t *        wait_table          Hash table of wait queues of processes waiting for one of the pages of the zone
unsigned long              wait_table_size     Size of the wait queue hash table
unsigned long              wait_table_bits     Power-of-2 order of the size of the wait queue hash table array
struct pglist_data *       zone_pgdat          Memory node
struct page *              zone_mem_map        Pointer to the first page descriptor of the zone
unsigned long              zone_start_pfn      Index of the first page frame of the zone
unsigned long              spanned_pages       Total size of the zone in pages, including holes
unsigned long              present_pages       Total size of the zone in pages, excluding holes
char *                     name                Pointer to the conventional name of the zone: "DMA", "normal", or "highmem"

 

Because the number of flags that characterize a page frame is limited, it is always possible to reserve the highest-order bits of the flags field to encode the memory node and zone number. The page_zone() function takes the address of a page descriptor as its parameter; it reads the highest-order bits of the flags field in the descriptor and uses them as an index into the zone_table array to find the address of the corresponding zone descriptor. The zone_table array is initialized at boot time with the addresses of all zone descriptors of all memory nodes.
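A minimal user-space sketch of this encoding follows. It assumes a single node and two reserved bits (the constant NODEZONE_BITS is chosen for the example), and the names toy_set_page_zone and toy_page_zone are hypothetical stand-ins that mirror the lookup described above rather than reproduce the kernel source.

```c
#include <assert.h>
#include <limits.h>

#define NODEZONE_BITS   2   /* assumption: one node, three zones */
#define NODEZONE_SHIFT  (sizeof(unsigned long) * CHAR_BIT - NODEZONE_BITS)

struct zone { const char *name; };
struct page { unsigned long flags; };

struct zone zones[3] = { { "DMA" }, { "normal" }, { "highmem" } };

/* Filled at startup with the address of every zone descriptor. */
struct zone *zone_table[1 << NODEZONE_BITS] = {
    &zones[0], &zones[1], &zones[2]
};

/* Store a zone index in the top bits of flags, leaving the low bits alone. */
void toy_set_page_zone(struct page *pg, unsigned long idx)
{
    assert(idx < (1UL << NODEZONE_BITS));
    pg->flags &= ~(~0UL << NODEZONE_SHIFT);   /* clear the old index */
    pg->flags |= idx << NODEZONE_SHIFT;       /* store the new one */
}

/* Read the top bits back and look the zone up in zone_table. */
struct zone *toy_page_zone(const struct page *pg)
{
    return zone_table[pg->flags >> NODEZONE_SHIFT];
}
```

Storing an index this way costs no extra memory per descriptor, at the price of a shift and a table lookup on every page-to-zone conversion.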

 

When the kernel invokes a memory allocation function, it must specify the zones from which the requested page frames may come. The kernel usually specifies which zones it is willing to use. To express the preferred zones in a memory allocation request, the kernel uses the zonelist data structure, which is an array of pointers to zone descriptors. The 80x86 has only three zones, so each zonelist holds pointers to some or all of the three zones in order of preference, and the zonelist array of a node contains one such ordering per kind of request.

For example, to allocate a page frame, the allocator picks the zonelist that matches the request and starts with its first element, the preferred zone. For a high-memory request the preferred zone is ZONE_HIGHMEM; if that zone is exhausted, the allocator falls back to ZONE_NORMAL and then to ZONE_DMA. A request for DMA memory can be satisfied only from ZONE_DMA, because page frames in the other zones lie above the 16 MB that ISA DMA can address.
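The fallback walk over a zonelist can be sketched as follows. This is a toy model, assuming the fallback order used by the 2.6 kernel, in which a request may be served by its preferred zone or by a more restrictive one, never by a less restrictive one. The helper names build_zonelist and alloc_from_zonelist, the trimmed struct zone, and the one-page-at-a-time bookkeeping are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

struct zone { const char *name; unsigned long free_pages; };

/* NULL-terminated array of zone pointers, most preferred zone first. */
struct zonelist { struct zone *zones[MAX_NR_ZONES + 1]; };

/* Fallback order for a request whose preferred zone is 'preferred':
 * that zone first, then every more restrictive zone below it. */
void build_zonelist(struct zonelist *zl, struct zone *all, int preferred)
{
    int n = 0;
    assert(preferred >= ZONE_DMA && preferred < MAX_NR_ZONES);
    for (int k = preferred; k >= ZONE_DMA; k--)
        zl->zones[n++] = &all[k];
    zl->zones[n] = NULL;
}

/* Take one page from the first zone in the list that still has free
 * pages. Returns that zone, or NULL if every zone is exhausted. */
struct zone *alloc_from_zonelist(struct zonelist *zl)
{
    for (struct zone **z = zl->zones; *z != NULL; z++) {
        if ((*z)->free_pages > 0) {
            (*z)->free_pages--;
            return *z;
        }
    }
    return NULL;
}
```

With ZONE_HIGHMEM empty, a high-memory request is served from ZONE_NORMAL and, once that is drained, from ZONE_DMA, whereas the list built for a DMA request contains ZONE_DMA alone.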
