Linux Advanced Applications: The CpuMemSets Implementation in Linux

I. Preface
The Non-Uniform Memory Access (NUMA) architecture is the main branch of the Distributed Shared Memory (DSM) family. By combining distributed memory technology with a single system image (SSI), it strikes a compromise between the ease of SMP programming and the scalability of MPP systems, and has become one of the mainstream architectures for today's high-performance servers. Major server vendors have successively launched high-performance servers based on the NUMA architecture, such as HP's Superdome, SGI's Altix 3000 and Origin 3000, IBM's x440, NEC's TX7, and AMD's Opteron.
As NUMA high-performance servers are gradually adopted, system software has been heavily optimized for this distributed shared memory architecture in the scheduler, memory management, and user-level interfaces. For example, the SGI Origin 3000 ccNUMA system has been widely used in many fields and is a very successful system. To optimize its performance, SGI's IRIX operating system implements CpuMemSets on it: by binding applications to processors and memory, it exploits the advantage of NUMA's local memory access. The Linux community has also implemented CpuMemSets in its NUMA project, and it is used in SGI's Altix 3000 servers.
In this paper, we take SGI ProPack v2.2 as the research object and analyze the concrete implementation of CpuMemSets in Linux 2.4.20. CpuMemSets is an open source project from SGI consisting of four parts: patches for the Linux 2.4 kernel, a user library, Python modules, and the runon command. Together they partition processors and memory blocks, control the distribution of system resources (processors and memory blocks) to the kernel, tasks, and virtual storage areas, and provide support for dplace, runon, and other NUMA tools used to optimize NUMA performance on Linux.
II. Related Work
Partitioning technology was first deployed on mainframes and is now widely used in the server field. It supports running multiple instances of one operating system, or instances of several different operating systems, on a single server. Its main features are machine independence, strong fault isolation, and single-point management. With partitioning support, workloads that would otherwise be spread across multiple servers running different operating systems can run simultaneously on one server in one location, effectively implementing server consolidation: a partitioned server can act as an application server running Windows for the marketing department while running Linux for the engineering department; while most users run Windows 2000 Advanced Server, another partition can be used by the development group to test other operating systems; or all partitions can run in a single operating system environment. The main differences among the various partitioning implementations lie in the fault-isolation method (hardware or software), the granularity of partition resources, the flexibility and virtualization of partition resources, and support for dynamic partition reconfiguration. Typical examples include IBM's LPAR and DLPAR (AIX 5L 5.1), HP's nPartitions and vPartitions (HP-UX 11i), Sun's Dynamic Domains (Solaris 8), and Compaq's AlphaServer partitions (Tru64 Unix 5.1). However, partitioning conflicts with the single-system-image advantage of a NUMA system.
From the user's perspective, a NUMA system provides transparent access to local and remote memory. From the performance perspective, however, because the memory modules are physically distributed across different nodes, access latency is non-uniform, which has a large impact on system performance: in such systems, the latency of accessing a remote node's memory is generally one to two orders of magnitude higher than local access latency. Page migration and page replication are among the main methods for dynamically optimizing data locality. In essence they are prediction techniques: based on collected information they predict future accesses to a page and then decide whether to migrate or replicate it. Appropriate page replication and migration policies can reduce cache capacity and conflict misses, balance the disparity between remote and local access latencies, and optimize NUMA system performance. However, most existing page migration and replication policies depend heavily on the architecture and on special hardware support, resulting in high overhead and poor generality.
In a NUMA multiprocessor system, a task can run on any processor, but its execution is interrupted in various situations. If the interrupted task is resumed on a different processor, it loses the data it had built up in the original processor's cache. Accessing cached data takes only a few nanoseconds, while accessing main memory takes roughly 50 nanoseconds, so the processor runs at main-memory speed until the task has executed long enough for its working data to refill the new processor's cache. To address this, the system can use a cache-affinity scheduling policy: it records the processor on which a task last executed and maintains this association, and when an interrupted task is resumed it tries to resume it on that same processor. However, because applications have different characteristics and their working sets change dynamically, the benefit of processor affinity scheduling is limited.
Users are both the consumers and the performance evaluators of the system; they know best the requirements and evaluation criteria of their applications. On a large NUMA system, users often want to dedicate a portion of the processors and memory to particular applications. CpuMemSets gives users this more flexible control (the system's processors and memory can be divided into possibly overlapping sets), lets multiple processes continue to see the system as a single system image, and guarantees, without rebooting, that certain processor and memory resources are allocated to specified applications at different times. It is thus a useful complement to partitioning, page migration, and affinity scheduling.
III. System Implementation
Before introducing the concrete implementation of CpuMemSets in Linux 2.4.20, we first explain several basic concepts used by CpuMemSets:
Processor: a physical processor on which tasks are scheduled; DMA engines, vector processors, and other dedicated processors are not included;
Memory block: in SMP and UP systems all memory is equidistant from all processors, so there is no distinction; in a NUMA system, memory blocks can be grouped into equivalence classes according to their distance from the processors. CpuMemSets does not consider special memory of differing speed, such as input/output device buffers or frame buffers;
Task: an entity that at any moment is either running on a processor or waiting for an event, a resource, an interrupt, or a processor;
Virtual storage area: one of the virtual address areas the kernel maintains for each task, which may be shared by several tasks. A page in a virtual storage area is either unallocated, allocated but swapped out, or allocated and resident in memory. The memory blocks from which a virtual storage area may allocate, and their allocation order, can be specified.
CpuMemSets provides Linux with a mechanism for binding system services and applications to specified processors for scheduling and for allocating memory on specified nodes. CpuMemSets adds two layers, cpumemmap (cmm) and cpumemset (cms), on top of the existing Linux scheduling and memory allocation code. The lower cpumemmap layer provides a simple mapping pair: it maps between the system's processor numbers and the application's processor numbers, and between system memory block numbers and application memory block numbers. This mapping need not be one-to-one; a single system number may correspond to several application numbers. The upper cpumemset layer specifies on which of the processors corresponding to its application processor numbers a task may be scheduled, and from which of the memory blocks corresponding to its application memory block numbers the kernel or a virtual storage area may allocate pages; in other words, it defines the set of resources available to a kernel, a task, or a virtual storage area. In this two-layer structure, system resource numbers are used by the kernel for scheduling and memory allocation, while application resource numbers are used by user processes when specifying an application's resource set. System numbers are valid system-wide from boot time, whereas application numbers are only meaningful to the user processes that share the same cmm. Moreover, changes to physical resource numbers caused by load balancing or hot plugging are invisible at the application-number level.
Process scheduling and memory allocation in Linux were extended with CpuMemSets support in a way that keeps the existing code working: data structures such as cpus_allowed and mems_allowed continue to partition resources in terms of system processor numbers and system memory block numbers. The CpuMemSets API supports cpusets, dplace, runon, psets, MPI, OpenMP, and nodesets, and a /proc interface is provided to display the structure and settings of cmm and cms as well as their relationships to tasks, virtual storage areas, the kernel, system resource numbers, and application resource numbers. The following sections describe cpumemmap, cpumemset, process scheduling, memory allocation, and the API.
3.1 cmm & cms
3.1.1 Data Structure
The data structures of cpumemmap and cpumemset are shown below; they are defined in include/linux/cpumemset.h. In cpumemmap, the cpus and mems arrays hold a group of system processor numbers and a group of system memory block numbers respectively, implementing the mapping from application resource numbers (array subscripts) to system resource numbers (array element values). In cpumemset, the cpus array holds a group of application processor numbers, while the mems array points to a group of memory lists of type cms_memory_list_t; each memory list describes a group of application memory blocks (mems) and the group of application processor numbers (cpus) that share that list. The memory allocation policy is determined by the policy field of the cpumemset, with local priority as the default. A cpumemset is associated with its cpumemmap through the cmm field. The role of the counter fields in the two structures is described later.
[include/linux/cpumemset.h]

typedef struct cpumemmap {
        int nr_cpus;                /* number of cpus in map */
        int nr_mems;                /* number of mems in map */
        cms_scpu_t *cpus;           /* array maps application to system cpu num */
        cms_smem_t *mems;           /* array maps application to system mem num */
        atomic_t counter;           /* reference counter */
} cpumemmap_t;

typedef struct cpumemset {
        cms_setpol_t policy;        /* CMS_* policy flag: memory allocation policy */
        int nr_cpus;                /* Number of cpus in this CpuMemSet */
        int nr_mems;                /* Number of Memory Lists in this CpuMemSet */
        cms_acpu_t *cpus;           /* The 'nr_cpus' app cpu nums in this set */
        cms_memory_list_t *mems;    /* Array of 'nr_mems' Memory Lists */
        unsigned long mems_allowed; /* mems_allowed vector used by vmas */
        cpumemmap_t *cmm;           /* associated cpumemmap */
        atomic_t counter;           /* reference counter */
} cpumemset_t;

typedef struct cms_memory_list {
        int nr_cpus;                /* Number of cpus sharing this memory list */
        int nr_mems;                /* Number of memory nodes in this list */
        cms_acpu_t *cpus;           /* Array of cpus sharing this memory list */
        cms_amem_t *mems;           /* Array of 'nr_mems' memory nodes */
} cms_memory_list_t;
We use a NUMA system with four nodes as an example to illustrate the use of the above data structures. Assume each node consists of four processors and one memory block. The system numbers of the 16 processors are c0 (0), c1 (1), ..., c15 (15), and the system numbers of the four memory blocks are mb0 (0), mb1 (1), mb2 (2), and mb3 (3). We construct a cpumemmap containing only the processors and memory blocks of nodes 2 and 3, bind an application to the odd-numbered processors, and allocate memory blocks with local priority. The resulting data structures are shown in Figure 1.
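To make the example concrete, the following is a minimal sketch (my own static initialization, not code from the patch) of what such a cmm and cms could look like, assuming nodes 2 and 3 contribute system processors 8-15 and system memory blocks 2 and 3, that the application runs only on the odd application processor numbers, and that each memory list puts the local block first. The field names follow the structures quoted above; the policy value is a placeholder, since the article does not list the actual CMS_* flag for local priority.

/* Sketch with assumed values, for illustration only. */
cms_scpu_t ex_scpus[8] = { 8, 9, 10, 11, 12, 13, 14, 15 }; /* app cpu 0..7 -> system cpu 8..15 */
cms_smem_t ex_smems[2] = { 2, 3 };                         /* app mem 0..1 -> system mem 2..3  */

cpumemmap_t ex_cmm = {
        .nr_cpus = 8, .nr_mems = 2,
        .cpus = ex_scpus, .mems = ex_smems,
};

cms_acpu_t node2_cpus[] = { 1, 3 };     /* odd app cpus on node 2 (system cpus 9, 11)  */
cms_acpu_t node3_cpus[] = { 5, 7 };     /* odd app cpus on node 3 (system cpus 13, 15) */
cms_amem_t node2_mems[] = { 0, 1 };     /* local block (app mem 0) first */
cms_amem_t node3_mems[] = { 1, 0 };     /* local block (app mem 1) first */

cms_memory_list_t ex_lists[2] = {
        { .nr_cpus = 2, .nr_mems = 2, .cpus = node2_cpus, .mems = node2_mems },
        { .nr_cpus = 2, .nr_mems = 2, .cpus = node3_cpus, .mems = node3_mems },
};

cms_acpu_t ex_acpus[] = { 1, 3, 5, 7 }; /* the application cpu numbers in this set */

cpumemset_t ex_cms = {
        .policy  = CMS_DEFAULT,         /* placeholder for the local-priority policy flag */
        .nr_cpus = 4, .nr_mems = 2,
        .cpus    = ex_acpus,
        .mems    = ex_lists,
        .cmm     = &ex_cmm,
        /* mems_allowed is filled in later by mems_allowed_build() */
};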
3.1.2 Basic Operations

Multiple tasks may concurrently access, attach, release, or replace a given cms and cmm. These operations are performed through a set of routines (the cmsGetHandle*(), cmsNewHandle*(), cmsExchange*(), cmsAttachNew*(), cmsDiscard(), and cmsRelease() families used below).

These routines guarantee the integrity of parallel operations on CpuMemSets through locking and through the reference counters kept in cmm and cms. The counter in each cms records both a user reference count (the total number of tasks, virtual storage areas, and kernels pointing to this cms) and a handle reference count (the total number of handles temporarily pointing to this cms). The counter in each cmm records a cms reference count (the total number of cms structures pointing to this cmm) and a handle reference count (the total number of handles temporarily pointing to this cmm).
To replace a CpuMemSet, perform the following three steps (a sketch of this sequence appears after the lists below):
1. Call the appropriate cmsGetHandle*() routine to safely obtain handles on the current cms and cmm;
2. Construct the new cms and cmm;
3. Call the appropriate cmsExchange*() routine to safely exchange the old cms and cmm for the new ones.
To access the cms and cmm of a task or virtual storage area, perform the following three steps:
1. Call the appropriate cmsGetHandle*() routine to safely obtain handles on the cms and cmm;
2. Use the cms and cmm; during this period there is no guarantee that the original task or virtual storage area is still using them;
3. Call cmsRelease() to release the handles.
To attach or discard a handle, follow these steps:
1. Call the appropriate cmsGetHandle*() or cmsNewHandle*() routine to safely obtain the cms and cmm handles;
2. Call the appropriate cmsAttachNew*() or cmsDiscard() routine.
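The three-step replacement sequence can be pictured as follows. This is a sketch written as C-style pseudocode: the cmsGetHandle*(), cmsExchange*(), and cmsRelease() names come from the text above, but their exact signatures are not shown in the article and are assumed here purely for illustration.

/* Sketch only -- helper names and signatures are assumptions, not the patch's interfaces. */
static void replace_task_cpumemset_sketch(struct task_struct *p,
                                          cpumemset_t *new_cms,
                                          cpumemmap_t *new_cmm)
{
        cpumemset_t *old_cms;

        /* Step 1: safely obtain a handle on the task's current cms (and, through
         * it, its cmm); this bumps the handle reference counters. */
        old_cms = cmsGetHandleCurrentCMS(p);          /* assumed helper name */

        /* Step 2: the new cms and cmm are built by the caller; attach the map. */
        new_cms->cmm = new_cmm;

        /* Step 3: safely exchange old for new, then drop the temporary handle. */
        cmsExchangeTask(p, new_cms);                  /* arguments assumed   */
        cmsRelease(old_cms);
}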
3.1.3 Basic settings
The kernel has its own kernel_cms. Early in start_kernel() (after build_all_zonelists() and before trap_init()), the kernel first calls cms_cm_static_init() to build a static initial cmm and cms for kernel_cms, containing only the processor executing the kernel and the memory block of its node, and sets kernel_cms->mems_allowed to -1UL so that the kernel can use all memory blocks during cpu_init(). Then, at the end of start_kernel(), the kernel executes cms_cm_init(), which creates the cmm and cms caches, builds a cmm and cms containing all processors and memory blocks for kernel_cms, and passes it to init_task. If the cpumemset_minimal parameter is given at boot, the minimal set constructed by cms_cm_static_init() is kept instead.
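The boot-time ordering described above can be summarized in the following skeleton (a sketch of the call sequence only, not the real start_kernel()):

/* Skeleton of the initialization ordering described above (sketch, not actual code). */
void start_kernel_sketch(void)
{
        /* ... early setup ... */
        build_all_zonelists();
        cms_cm_static_init();   /* minimal cmm&cms for kernel_cms: the boot processor
                                 * and its node; kernel_cms->mems_allowed = -1UL so
                                 * cpu_init() can allocate from any memory block      */
        trap_init();
        /* ... the rest of kernel initialization ... */
        cms_cm_init();          /* create the cmm/cms caches and give kernel_cms (and
                                 * init_task) a cmm&cms covering all processors and
                                 * memory blocks, unless cpumemset_minimal was passed */
}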
Each process has two cms pointers: current_cms affects its processor scheduling and the creation of its virtual storage areas, while child_cms is inherited by the children it forks. Both current_cms and child_cms of a newly created process are inherited from the parent process's child_cms.
[include/linux/sched.h]

struct task_struct {
        ...
        cpumask_t cpus_allowed;
        ...
        cpumemset_t *current_cms;
        cpumemset_t *child_cms;
        /* stash mems_allowed of most recent vma to page fault here */
        unsigned long mems_allowed;
        ...
};
[kernel/fork.c]

int do_fork(unsigned long clone_flags, unsigned long stack_start,
            struct pt_regs *regs, unsigned long stack_size)
{
        ...
        SET_CHILD_CMS(p, current);
        ...
}
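The body of SET_CHILD_CMS() is not reproduced in the article. A plausible reading, consistent with the inheritance rule above (both of the child's cms pointers come from the parent's child_cms), might look like the following sketch; the actual macro and its reference-count handling may differ.

/* Hypothetical expansion, for illustration only -- not the macro from the patch. */
#define SET_CHILD_CMS_SKETCH(child, parent) do {                           \
        (child)->current_cms = (parent)->child_cms;                        \
        (child)->child_cms   = (parent)->child_cms;                        \
        atomic_add(2, &(parent)->child_cms->counter); /* two new users */  \
} while (0)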
Each virtual storage area has its own vm_mems_allowed bit vector. The vm_mems_allowed of a newly created virtual storage area is inherited, via cms_current_mems_allowed, from the mems_allowed of the creating task's current_cms. An attached virtual storage area, such as an mmap'ed memory object or a shared memory region, inherits it from the mems_allowed of the attaching process's current_cms. The mems_allowed bit vector of a cms is built by mems_allowed_build(cms) from all the memory lists in the cms.
[kernel/cpumemset.c]

static unsigned long
mems_allowed_build(cpumemset_t *cms)
{
        int i;
        unsigned long mems_allowed = 0;

        for (i = 0; i < cms->nr_mems; i++)
                mems_allowed |= mems_allowed_value(cms, cms->mems + i);
        return mems_allowed;
}
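As a concrete illustration (my own numbers, continuing the four-node example): if a task's cms memory lists resolve through its cmm to system memory blocks 2 and 3, and assuming mems_allowed_value() contributes one bit per system block, the resulting vector simply has bits 2 and 3 set:

/* Illustration with assumed values, not code from the patch. */
unsigned long example_mems_allowed = (1UL << 2) | (1UL << 3);   /* == 0xc */
/* A zone on node 2 or 3 passes CHECK_MEMS_ALLOWED(example_mems_allowed, zone);
 * zones on nodes 0 and 1 are skipped by the allocator. */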
3.2 Process Scheduling and Memory Allocation
The kernel calls update_cpus_allowed(struct task_struct *p) to rebuild a task's cpus_allowed bit vector from the processor list of its current_cms, thereby affecting the processor scheduling of the task.
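The article does not reproduce update_cpus_allowed() itself. A minimal sketch of the idea, assuming cpumask_t can be treated as a plain bit vector in this 2.4 kernel and using the structure fields quoted earlier, might be:

/* Sketch only -- the real routine lives in the CpuMemSets patch. */
static void update_cpus_allowed_sketch(struct task_struct *p)
{
        cpumemset_t *cms = p->current_cms;
        unsigned long allowed = 0;
        int i;

        for (i = 0; i < cms->nr_cpus; i++) {
                cms_acpu_t app = cms->cpus[i];        /* application cpu number       */
                cms_scpu_t sys = cms->cmm->cpus[app]; /* translate to a system number */
                allowed |= 1UL << sys;
        }
        p->cpus_allowed = allowed;    /* the scheduler will now only pick these cpus */
}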
The kernel allocates memory for a task according to the vm_mems_allowed bit vector of the virtual storage area being faulted; if the allocation happens in interrupt context, it uses the mems_allowed of kernel_cms instead. The macro CHECK_MEMS_ALLOWED(mems_allowed, zone) checks whether the node of a zone falls into the memory block set described by mems_allowed.
[mm/memory.c]

int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, int write_access)
{
        ...
        /*
         * We set the mems_allowed field of the current task to
         * the one pointed by the faulting vma. The current
         * process will then use the correct mems_allowed mask
         * if a new page has to be allocated.
         */
        if (!in_interrupt())
                current->mems_allowed = vma->vm_mems_allowed;
        ...
}
[mm/page_alloc.c]

struct page *__alloc_pages(...)
{
        ...
        if (in_interrupt())
                mems_allowed = kernel_cms->mems_allowed;
        else
                mems_allowed = current->mems_allowed;
        if (mems_allowed == 0) {
                printk(KERN_DEBUG "workaround zero mems_allowed in alloc_pages\n");
                mems_allowed = -1UL;
        }
        ...
        if (!CHECK_MEMS_ALLOWED(mems_allowed, z))
                continue;
        ...
}
[include/linux/cpumemset.h]

/* Used in __alloc_pages() to see if we can allocate from a node */
#define CHECK_MEMS_ALLOWED(mems_allowed, zone) \
        ((1UL << (zone)->zone_pgdat->node_id) & (mems_allowed))
If the processor currently executing the task appears in the cms of the virtual storage area, pages are allocated from that processor's memory list; otherwise they are allocated from the memory list of the CMS_DEFAULT_CPU entry defined in the virtual storage area's cms.
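The selection just described might be sketched as follows. This is an illustration only: the article names CMS_DEFAULT_CPU but does not show how the lookup is implemented, so the encoding of the default entry in the cpus array is an assumption here.

/* Sketch only: choose the memory list of a cms that applies to the executing cpu. */
static cms_memory_list_t *pick_memory_list_sketch(cpumemset_t *cms, cms_acpu_t app_cpu)
{
        cms_memory_list_t *dflt = NULL;
        int i, j;

        for (i = 0; i < cms->nr_mems; i++) {
                cms_memory_list_t *list = cms->mems + i;
                for (j = 0; j < list->nr_cpus; j++) {
                        if (list->cpus[j] == app_cpu)
                                return list;        /* list owned by this cpu      */
                        if (list->cpus[j] == CMS_DEFAULT_CPU)
                                dflt = list;        /* remember the fallback entry */
                }
        }
        return dflt;    /* used when the executing cpu is not in this cms */
}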
3.3 API Design

CpuMemSets provides a series of kernel-level and application-level programming interfaces, defined in the kernel's include/linux/cpumemset.h file and in the library source file CpuMemSets/cpumemsets.h (as shown in Table 2).
When the user-level interface is called to change a cmm and cms, the system bit vectors used by the kernel scheduler and memory allocator, such as cpus_allowed and mems_allowed, change accordingly, so the scheduling code uses the new system processor numbers and the memory allocation code allocates pages from the new memory blocks. Pages already allocated from the old memory blocks, however, are not migrated unless additional measures are taken. Specifically, cmsAttachNewTask(), cmsExchangeTask(), and cmsExchangePid() invoke update_cpus_allowed() to rebuild the task's cpus_allowed bit vector from the processor list of its current_cms, while cms_set() invokes mems_allowed_build() to rebuild the mems_allowed bit vectors of the affected virtual storage areas, tasks, and the kernel from the memory lists of the task's current current_cms.
[kernel/cpumemset.c]

static int
cms_set(unsigned long *preq, char *rec, int size, target_option cm_or_cms)
{
        ...
        if (choice == CMS_VMAREA) {
        ...
As for permission protection, only root can modify the cms and cmm used by the kernel or those of arbitrary tasks; ordinary users can only modify the cms and cmm of their own tasks and virtual storage areas, and tasks with the same uid may modify each other's cms and cmm. Only root can expand its own cmm; ordinary users can only shrink theirs.
[kernel/cpumemset.c]

/*
 * Unless you have CAP_SYS_ADMIN capability, you can only shrink cmm.
 */

static int
cm_restrict_checking(cpumemmap_t *oldmap, cpumemmap_t *newmap)
{
        int i;

        if (capable(CAP_SYS_ADMIN))
                return 0;

        /* newmap must be a subset of oldmap */
        for (i = 0; i < newmap->nr_cpus; i++)
                if (!foundin(newmap->cpus[i], oldmap->cpus, oldmap->nr_cpus))
                        return -EINVAL;
        for (i = 0; i < newmap->nr_mems; i++)
                if (!foundin(newmap->mems[i], oldmap->mems, oldmap->nr_mems))
                        return -EINVAL;
        return 0;
}
IV. Examples

Example 1: display the processors in the current task's current_cms
/*
 * sample1 - display current cpumemset cpus
 *
 * Compile:
 *     cc sample1.c -o sample1 -lcpumemsets
 *
 * Displays on stdout the number and a list of the cpus
 * on which the current process is allowed to execute.
 */

#include <stdio.h>
#include <stdlib.h>
#include "cpumemsets.h"

int
main(void)
{
        int i;
        cpumemset_t *pset;

        pset = cmsQueryCMS(CMS_CURRENT, (pid_t)0, (void *)0);
        if (pset == (cpumemset_t *)0) {
                perror("cmsQueryCMS");
                exit(1);
        }
        printf("Current CpuMemSet has %d cpu(s):\n\t", pset->nr_cpus);
        for (i = 0; i < pset->nr_cpus; i++)
                printf("%s%d", (i > 0 ? "," : ""), pset->cpus[i]);
        printf("\n");
        exit(0);
}
Example 2: restrict the child tasks of the current task to processor 0, then start a shell
/*
 * sample2 - change child cpumemset cpus to just cpu 0
 *
 * Compile:
 *     cc sample2.c -o sample2 -lcpumemsets
 *
 * Change the cpus which the child task is allowed to
 * execute on to just cpu 0. Start a subshell,
 * instead of just exiting, so that the user has
 * the opportunity to verify that the change occurred.
 */
# PS1='sub> ' ./sample2
Invoking subshell running on cpu 0.
sub> ./sample1
Current CpuMemSet has 1 cpu(s):
        0
sub> exit
V. Summary

CpuMemSets adds the cpumemmap (cmm) and cpumemset (cms) layers on top of the existing Linux scheduling and memory allocation code, giving Linux a mechanism for binding system services and applications to specified processors for scheduling and for allocating memory on specified nodes. Judging from its data structures and control mechanisms, the current implementation is simple and practical, though there is still room for further optimization. However, CpuMemSets only provides a mechanism for exploiting the advantage of local memory access and optimizing NUMA performance on Linux; how to build appropriate NUMA optimization policies on top of this support requires more in-depth study.