Implementation of CpuMemSets in Linux

   I. Preface
The Non-Uniform Memory Access (NUMA) architecture is the main branch of the Distributed Shared Memory (DSM) architecture. It combines distributed memory technology with single system image (SSI) technology, striking a compromise between the ease of programming of SMP systems and the scalability of MPP systems, and has become one of the mainstream architectures for today's high-performance servers. Major server manufacturers have successively launched high-performance servers based on the NUMA architecture, such as HP's Superdome, SGI's Altix 3000 and Origin 3000, IBM's x440, NEC's TX7, and AMD's Opteron.
  
As NUMA high-performance servers have become more widely deployed, system software has been heavily optimized for this distributed shared memory architecture in the scheduler, memory management, and user-level interfaces. For example, the SGI Origin 3000 ccNUMA system has been widely used in many fields and is a very successful system; to optimize its performance, SGI's IRIX operating system implements CpuMemSets, which binds applications to processors and memory and thereby exploits the advantage of NUMA's fast local memory access. The Linux community has also implemented CpuMemSets in its NUMA project, and it has been deployed on SGI's Altix 3000 servers.
  
In this paper, we take SGI ProPack v2.2 as the object of study and analyze the concrete implementation of CpuMemSets in Linux 2.4.20. CpuMemSets is an open source project from SGI consisting of four parts: patches for the Linux 2.4 kernel, a user library, a Python module, and the runon command. Together they partition processors and memory blocks, control the distribution of system resources (processors and memory blocks) to the kernel, tasks, and virtual memory areas, and provide support for dplace, runon, and other NUMA tools that optimize NUMA performance on Linux.
  
   II. Related work
Partitioning technology first appeared on mainframes and is now widely used in the server field. It supports running multiple instances of one operating system, or instances of several different operating systems, on a single server. Its main characteristics are machine independence, strong fault isolation, and single-point management. With partitioning support, workloads that would otherwise be spread across several servers can run simultaneously on one server in a single location, which is preferable to distributing multiple servers across an organization to support different operating systems, and thus effectively enables server consolidation. A server that supports partitioning can, for example, serve as an application server running Windows for a marketing department while simultaneously running Linux for an engineering department; it can also let a development group test another operating system in a separate partition while most users run Windows 2000 Advanced Server, or dedicate all nodes to a single operating system environment. Partitioning implementations differ mainly in the fault-isolation method (hardware or software), the granularity of partitioned resources, the flexibility and virtualization of partition resources, and support for dynamic partition reconfiguration. Typical examples include IBM's LPAR and DLPAR (AIX 5L 5.1), HP's nPartitions and vPartitions (HP-UX 11i), Sun's Dynamic Domains (Solaris 8), and Compaq's AlphaServer partitions (Tru64 UNIX 5.1). However, the partitioning approach conflicts with the single-system-image advantage of a NUMA system.
  
From the user's perspective, a NUMA system provides transparent access to local and remote main memory. From the performance perspective, however, because the memory modules are physically distributed across different nodes, access latency is non-uniform, which can have a large impact on system performance: in such systems, the latency of accessing a remote node's memory is generally one to two orders of magnitude higher than that of a local access. Page migration and page replication are among the main methods for dynamically optimizing data locality. In essence they are prediction techniques: based on collected access information, the system predicts future accesses to a page and then decides whether to migrate or replicate it. Appropriate page replication and migration policies can reduce cache capacity and conflict misses, compensate for the gap between remote and local access latencies, and improve NUMA system performance. However, most existing page migration and replication policies depend heavily on the architecture and on special hardware support, which leads to high overhead and poor portability.
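To make the idea concrete, the following is a minimal, generic sketch of a threshold-based migration decision driven by per-node access counters. It is not code from IRIX or from the CpuMemSets patch; the page_stats structure, the threshold, and the node limit are all hypothetical names introduced only for illustration.

/* Generic illustration of a threshold-based page migration policy.
 * All names (page_stats, MIGRATE_THRESHOLD, MAX_NODES) are hypothetical. */

#define MAX_NODES          8
#define MIGRATE_THRESHOLD  64   /* remote accesses must exceed local by this margin */

struct page_stats {
    int home_node;                    /* node currently holding the page          */
    unsigned long access[MAX_NODES];  /* observed accesses per node, collected by
                                         hardware counters or page-fault sampling */
};

/* Decide whether a page should move; return the target node, or -1 to stay put. */
static int pick_migration_target(const struct page_stats *ps)
{
    int node, best = ps->home_node;
    unsigned long best_count = ps->access[ps->home_node];

    for (node = 0; node < MAX_NODES; node++) {
        if (ps->access[node] > best_count + MIGRATE_THRESHOLD) {
            best = node;
            best_count = ps->access[node];
        }
    }
    return (best == ps->home_node) ? -1 : best;
}

The threshold hedges against ping-ponging: a page is only moved when remote accesses clearly dominate the accesses from its current home node.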
  
In a NUMA-structured multi-processor system, a task can run on any processor. However, in various situations, the execution of a task is interrupted. when the interrupted task is resumed, if you choose to resume execution on another processor, it will cause it to lose the original processor cache data. We know that it takes only a few nanoseconds to access the cache data, and it takes about 50 nanoseconds to access the primary storage. At this time, the processor runs at the access level to the master memory until the task runs for enough time, and the data required for running the task is refilled with the cache of the processor. To solve this problem, the system can use the processor to dispatch tasks on each node in close proximity to the scheduling policy: The system records the processor that finally executes the task and maintains this relationship, when resuming an interrupted task, try to resume the task execution on the processor that finally executes the task. However, because applications have different characteristics and the working set has dynamic attributes, the function of close-to-scheduling of processors is limited.
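The sketch below shows the shape of such a policy in the spirit of the Linux 2.4 scheduler's goodness() calculation, where a runnable task gets a bonus when it last ran on the CPU that is currently choosing its next task. The types and the bonus value are simplified for illustration and are not the verbatim kernel code.

/* Simplified sketch of cache-affinity weighting: prefer the processor whose
 * cache is still warm with the task's data. */

#define CACHE_AFFINITY_BONUS 15   /* cf. PROC_CHANGE_PENALTY in the 2.4 kernel */

struct affine_task {
    int counter;        /* remaining time slice (base weight) */
    int last_processor; /* CPU the task last executed on      */
};

static int task_weight(const struct affine_task *p, int this_cpu)
{
    int weight = p->counter;

    if (weight == 0)
        return 0;                       /* no time slice left in this epoch */

    if (p->last_processor == this_cpu)
        weight += CACHE_AFFINITY_BONUS; /* prefer the cache-warm processor  */

    return weight;
}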
  
Users operate the system and evaluate its performance, so they understand best what their applications require and how they should be measured. On a large NUMA system, users often want to dedicate a portion of the processors and memory to certain special applications. CpuMemSets gives users this flexible control (the system's processors and memory can be divided into possibly overlapping sets), lets multiple processes continue to see the system as a single system image, and guarantees, without rebooting, that specified processor and memory resources are reserved for a given application at different times. It is therefore a useful complement to partitioning, page migration, and cache-affinity scheduling.
  
   III. System Implementation
Before introducing the concrete implementation of CpuMemSets in Linux 2.4.20, we first explain several basic concepts used by CpuMemSets:
  
Processor: a physical processor on which tasks can be scheduled; it does not include DMA devices, vector processors, or other dedicated processors.
Memory block: in SMP and UP systems, all memory blocks are equidistant from all processors, so there is no distinction; in NUMA systems, memory blocks can be grouped into equivalence classes according to their distance from the processors. CpuMemSets does not consider special memory with different speeds, such as input/output device caches or frame buffers.
  
Task: an entity that at any moment is either running or waiting for an event, a resource, an interrupt, or a processor.
Virtual memory area: one of the virtual address regions the kernel maintains for each task; a virtual memory area can be shared by multiple tasks. A page in a virtual memory area may be unallocated, allocated but swapped out, or allocated and resident in memory. The memory blocks from which pages of a virtual memory area may be allocated, and the order in which they are tried, can be specified.
  
CpuMemSets provides a mechanism in Linux for binding system services and applications to specified processors for scheduling and for allocating their memory on specified nodes. CpuMemSets adds two layers, cpumemmap (cmm) and cpumemset (cms), on top of the existing Linux scheduling and memory allocation code. The lower cpumemmap layer provides a simple mapping: it maps system processor numbers to application processor numbers, and system memory block numbers to application memory block numbers. This mapping is not necessarily one-to-one; one system number can correspond to several application numbers. The upper cpumemset layer specifies on which application-numbered processors the kernel, a task, or a virtual memory area may be scheduled, and from which application-numbered memory blocks the kernel or virtual memory may allocate pages; in other words, it defines the set of resources available to a kernel, task, or virtual memory area. In this two-layer structure, the system numbers of resources are used by the kernel for scheduling and memory allocation, while the application numbers are used by user processes when they specify an application's resource set. System numbers are valid system-wide from boot time, whereas application numbers are valid only for the user processes that share the same cmm. In addition, changes in the physical numbering of resources caused by load balancing or hot plugging are invisible to the application numbers.
  
Process scheduling and memory allocation in Linux gain CpuMemSets support while keeping the existing code paths working: the existing data structures such as cpus_allowed and mems_allowed continue to partition resources, expressed in terms of system processor numbers and system memory block numbers (a rough sketch of the cpus_allowed check follows below). The CpuMemSets API supports cpusets, dplace, runon, psets, MPI, OpenMP, and nodesets, and a /proc interface is provided to display the structure and settings of cmm and cms as well as their relationships to tasks, virtual memory areas, the kernel, system resource numbers, and application resource numbers. The following sections describe cpumemmap, cpumemset, process scheduling, memory allocation, and the APIs.
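As a rough illustration of how a 2.4-era scheduler consults cpus_allowed (simplified types; not the verbatim kernel source), the test boils down to a bitmask intersection between the candidate CPU and the mask stored in the task structure:

/* Simplified illustration of a cpus_allowed check: a task may be scheduled on
 * a CPU only if the corresponding bit is set in its mask. The structure is
 * reduced to the one field that matters here. */

struct masked_task {
    unsigned long cpus_allowed;   /* bitmask of system CPU numbers */
};

static int can_run_on(const struct masked_task *p, int cpu)
{
    return (p->cpus_allowed & (1UL << cpu)) != 0;
}

CpuMemSets keeps this check unchanged; it simply derives the mask from the system numbers held in the task's cms/cmm pair.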
  
3.1 cmm & cms
3.1.1 Data structure
  
The data structures of cpumemmap and cpumemset are shown below; they are defined in include/linux/cpumemset.h. In cpumemmap, the cpus and mems fields point to an array of system processor numbers and an array of system memory block numbers respectively, implementing the mapping from application resource numbers (the array subscripts) to system resource numbers (the array element values). In cpumemset, the cpus field points to an array of application processor numbers, while the mems field points to an array of memory lists of type cms_memory_list_t; each memory list describes a set of application memory block numbers (mems) together with the set of application processor numbers (cpus) that share that list. The memory allocation policy is determined by the policy field of the cpumemset; local-first allocation is used by default. A cpumemset is associated with its cpumemmap through the cmm field. The purpose of the counter fields in the two structures is described later.
  
[include/linux/cpumemset.h]
typedef struct cpumemmap {
    int nr_cpus;                  /* number of cpus in map */
    int nr_mems;                  /* number of mems in map */
    cms_scpu_t *cpus;             /* array maps application to system cpu num */
    cms_smem_t *mems;             /* array maps application to system mem num */
    atomic_t counter;             /* reference counter */
} cpumemmap_t;

typedef struct cpumemset {
    cms_setpol_t policy;          /* CMS_* policy flag: memory allocation policy */
    int nr_cpus;                  /* number of cpus in this CpuMemSet */
    int nr_mems;                  /* number of memory lists in this CpuMemSet */
    cms_acpu_t *cpus;             /* the 'nr_cpus' app cpu nums in this set */
    cms_memory_list_t *mems;      /* array of 'nr_mems' memory lists */
    unsigned long mems_allowed;   /* mems_allowed vector used by vmas */
    cpumemmap_t *cmm;             /* associated cpumemmap */
    atomic_t counter;             /* reference counter */
} cpumemset_t;
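As a minimal illustration of how the two layers fit together, the sketch below translates an application CPU number into the system CPU number that the scheduler actually uses by indexing the associated cpumemmap. This is a self-contained sketch with simplified stand-in types, not code from the CpuMemSets patch; the array-subscript-to-element-value convention is the one described above.

/* Sketch: translating an application CPU number into a system CPU number.
 * A plain int array stands in for the kernel's cms_scpu_t array. */
#include <stdio.h>

struct cpumemmap_sketch {
    int  nr_cpus;
    int *cpus;        /* index = application cpu num, value = system cpu num */
};

/* Return the system CPU number for an application CPU number, or -1. */
static int app_to_system_cpu(const struct cpumemmap_sketch *cmm, int app_cpu)
{
    if (app_cpu < 0 || app_cpu >= cmm->nr_cpus)
        return -1;
    return cmm->cpus[app_cpu];
}

int main(void)
{
    /* An application sees CPUs 0..2, which the system knows as 4, 5 and 7. */
    int map[] = { 4, 5, 7 };
    struct cpumemmap_sketch cmm = { 3, map };

    printf("app cpu 1 -> system cpu %d\n", app_to_system_cpu(&cmm, 1));
    return 0;
}

Because the application only ever sees the subscripts 0..nr_cpus-1, renumbering of the underlying system resources (for example after hot plugging) can be absorbed by rewriting the cpus array without the application noticing.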
 