High-performance server technology based on NUMA architecture (2)

   III. NUMA Scheduler
In a NUMA system, the latency of local memory access is lower than that of remote memory access, so application performance can be greatly improved by scheduling processes on processors close to the memory they use. The scheduler in the Linux 2.4 kernel performs poorly on SMP platforms because it uses a single run queue and therefore scales badly: when there are many runnable tasks, the CPUs contend for the shared queue and limit throughput. During 2.5 kernel development, Ingo Molnar wrote a multi-queue scheduler known as the O(1) scheduler, which was merged into the 2.5 kernel starting with 2.5.2. The O(1) scheduler gives each processor its own run queue. However, because the O(1) scheduler is not aware of the node structure of a NUMA system, it cannot guarantee that a process keeps running on the same node after being rescheduled. Erich Focht therefore developed a node-aware NUMA scheduler on top of Ingo Molnar's O(1) scheduler, and also back-ported it to the 2.4.x kernel. It was initially developed for the 2.4 kernel on IA64-based NUMA machines, and Matt Dobson later ported it to x86-based NUMA-Q hardware.
  
3.1 Initial Load Balancing
  
Each task is created with a HOME node (the HOME node is the node from which the task obtains its initial memory allocation); it is the least-loaded node in the system at the time the task is created. Because Linux currently does not support migrating a task's memory from one node to another, the HOME node remains unchanged for the life of the task. The initial load balancing of a task (that is, the selection of its HOME node) is performed at the exec() system call by default, but it can also be performed at the fork() system call. The node_policy field in the task structure determines how the initial load-balancing target is selected.
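
A minimal sketch of such a HOME-node choice, assuming a hypothetical per-node load array (node_load[], NR_NODES and select_home_node() are illustrative names, not actual kernel code):

    #include <limits.h>

    #define NR_NODES 4                        /* assumed node count for the example */

    static unsigned long node_load[NR_NODES]; /* run-queue length per node */

    /* Pick the least-loaded node; this becomes the new task's HOME node. */
    static int select_home_node(void)
    {
        int node, best = 0;
        unsigned long best_load = ULONG_MAX;

        for (node = 0; node < NR_NODES; node++) {
            if (node_load[node] < best_load) {
                best_load = node_load[node];
                best = node;
            }
        }
        return best;
    }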
    
3.2 Dynamic Load Balancing
  
Within a node, the NUMA scheduler behaves like the O(1) scheduler. Dynamic load balancing on an idle processor is triggered by the clock interrupt every 1 ms: the scheduler tries to find a heavily loaded processor and migrate tasks from it to the idle processor. On a heavily loaded processor, balancing is triggered at a longer interval. The scheduler searches only among the processors of the current node, and only tasks that are not currently running can be moved from the cache pool to other idle processors.
  
If the load within the current node is already well balanced, the loads of the other nodes are examined. If the load of another node exceeds that of the current node by more than 25%, that node is selected for load balancing. If the local node's load is only average, task migration from that node is deferred; if the imbalance is severe, the deferral is very short. The deferral time depends on the system topology.
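
The cross-node decision can be sketched as follows; find_busiest_node() and the node_load array here are assumptions used for illustration, not the actual kernel routine:

    /* Return the most loaded remote node, but only if its load exceeds the
     * current node's load by more than 25%; -1 means balance locally only. */
    static int find_busiest_node(int this_node, int nr_nodes,
                                 const unsigned long *node_load)
    {
        int node, busiest = -1;
        unsigned long this_load = node_load[this_node];
        unsigned long max_load = this_load + this_load / 4;  /* 125% threshold */

        for (node = 0; node < nr_nodes; node++) {
            if (node == this_node)
                continue;
            if (node_load[node] > max_load) {
                max_load = node_load[node];
                busiest = node;
            }
        }
        return busiest;
    }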
  
   IV. CpuMemSets
The SGI Origin 3000 ccNUMA system has been widely used in many fields and is a very successful system. To optimize the performance of the Origin 3000, SGI's IRIX operating system implements CpuMemSets on it: by binding an application to specific CPUs and memory, it takes full advantage of the NUMA system's fast local memory access. Linux has also implemented CpuMemSets in its NUMA project, and it has been applied in SGI's Altix 3000 server.
  
CpuMemSets provides a mechanism for Linux to schedule system services and applications on specified CPUs and to allocate memory on specified nodes. CpuMemSets adds a two-layer structure, cpumemmap and cpumemset, on top of the existing Linux scheduling and resource-allocation code. The lower cpumemmap layer provides a simple mapping pair whose main functions are to map system CPU numbers to application CPU numbers and to map system memory-block numbers to application memory-block numbers. The upper cpumemset layer specifies on which application CPUs a process may schedule its tasks and from which application memory blocks the kernel or a virtual memory area may allocate.
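
The two layers can be pictured with the following illustrative C structures; the names and fields are assumptions used only for explanation, not SGI's actual definitions, and the maps are shown indexed by application number as described in section 4.1:

    typedef struct cpumemmap {
        int  nr_cpus;        /* number of application CPU numbers          */
        int *cpu_map;        /* application CPU number  -> system CPU      */
        int  nr_mems;        /* number of application memory-block numbers */
        int *mem_map;        /* application memory block -> system block   */
    } cpumemmap_t;

    typedef struct cpumemset {
        int  nr_cpus;        /* application CPUs the process may run on    */
        int *cpus;
        int  nr_mems;        /* application memory blocks it may use       */
        int *mems;
        cpumemmap_t *map;    /* underlying application-to-system mapping   */
    } cpumemset_t;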
  
4.1 cpumemmap
  
The kernel's task-scheduling and memory-allocation code uses system numbers: each CPU and memory block in the system has a corresponding system number. The CPU and memory-block numbers used by an application are application numbers, which the application uses in its cpumemmap to specify its CPU and memory affinity. Each process, each virtual memory area, and the Linux kernel has a cpumemmap. These maps are inherited at fork(), at exec(), or when a virtual memory area is created, and a process with root privileges can extend a cpumemmap, adding system CPUs and memory blocks to it. Modifying the map causes the kernel scheduling code to begin using the new system CPUs and the memory-allocation code to allocate pages from the new memory blocks; memory already allocated on the old blocks cannot be migrated. Holes are not allowed in a cpumemmap: if the size of a cpumemmap is n, it must map application numbers 0 through n-1. The mapping from application numbers to system numbers need not be one-to-one; multiple application numbers can map to the same system number.
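
Reusing the illustrative cpumemmap_t above, the no-holes, many-to-one property can be expressed as a simple lookup (app_cpu_to_system() is a hypothetical helper, not part of any real API):

    /* Translate an application CPU number into a system CPU number.
     * Valid application numbers are exactly 0 .. nr_cpus-1 (no holes);
     * several application numbers may map to the same system CPU. */
    static int app_cpu_to_system(const cpumemmap_t *map, int app_cpu)
    {
        if (app_cpu < 0 || app_cpu >= map->nr_cpus)
            return -1;                 /* outside the map */
        return map->cpu_map[app_cpu];
    }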
  
4.2 cpumemset
  
When the system boots, the Linux kernel creates a default cpumemmap and cpumemset; this initial cpumemmap and cpumemset cover all of the CPUs and memory blocks in the system.
  
The Linux kernel schedules a task only on the CPUs listed in that task's cpumemset, and allocates memory for a user's virtual memory area only from the memory list of that area's cpumemset; kernel memory is allocated only from the cpumemset memory list attached to the CPU that is executing the allocation request.
  
A newly created virtual memory area obtains its cpumemset from the current cpumemset of the creating task. If it is attached to an existing virtual memory area, the situation is more complex; for example, memory-mapped objects and Unix System V shared memory segments can be attached to several processes, or attached several times to different addresses in the same process. When attaching to an existing memory area, the new virtual memory area by default inherits the cpumemset of the process performing the attach; if the CMS_SHARE flag is set, the new virtual memory area is linked to the same cpumemset as the existing area.
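
A sketch of this attach rule, assuming the illustrative cpumemset_t above; CMS_SHARE is the flag named in the text, but the helper and its reading of "the same cpumemset" as the existing area's set are assumptions:

    #define CMS_SHARE 0x1

    /* Decide which cpumemset a newly attached virtual memory area uses. */
    static cpumemset_t *vma_attach_cpumemset(cpumemset_t *area_cms,
                                             cpumemset_t *task_cms,
                                             int flags)
    {
        if (flags & CMS_SHARE)
            return area_cms;   /* link to the existing area's cpumemset      */
        return task_cms;       /* default: inherit the attaching task's set  */
    }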
  
When allocating a page, if the CPU on which the task is running has a corresponding memory list in the cpumemset, the kernel allocates from that CPU's memory list; otherwise, it allocates from the memory list of the cpumemset's default CPU.
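
The per-CPU fallback rule can be sketched as follows; the structures here (a per-CPU memory list plus a default CPU) are assumptions used only to illustrate the selection order:

    struct cms_memory_list {
        int  nr_mems;
        int *mems;                        /* application memory-block numbers */
    };

    struct cms_alloc_view {
        int nr_cpus;
        struct cms_memory_list **per_cpu; /* NULL if a CPU has no list */
        int default_cpu;
    };

    /* Pick the memory list used to satisfy an allocation made on 'cpu'. */
    static struct cms_memory_list *
    cms_list_for_alloc(const struct cms_alloc_view *cms, int cpu)
    {
        if (cpu >= 0 && cpu < cms->nr_cpus && cms->per_cpu[cpu] != NULL)
            return cms->per_cpu[cpu];            /* list attached to this CPU */
        return cms->per_cpu[cms->default_cpu];   /* fall back to default CPU  */
    }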
  
4.3 Hard Partitioning and CpuMemSets
  
In a large NUMA system, users often want to dedicate part of the CPUs and memory to certain special applications. Currently there are two main approaches: hard partitioning and soft partitioning. Hard partitioning of a large NUMA system conflicts with its single-system-image advantage, whereas CpuMemSets gives users more flexible control: it can divide the system's CPUs and memory into possibly overlapping sets, allows multiple processes to view the system as a single system image without rebooting, and ensures that specified CPU and memory resources are assigned to the designated applications at different times.
  
SGI's CpuMemSets soft-partitioning technology effectively addresses the shortcomings of hard partitioning. A single-system SGI ProPack Linux server can be divided into multiple different systems, each with its own console, root filesystem, and IP address. Each software-defined group of CPUs can be treated as a partition, and each partition can be rebooted, installed, shut down, and upgraded independently. Partitions communicate over SGI NUMAlink connections; cross-partition global shared memory is supported by the XPC and XPMEM kernel modules, which allow a process in one partition to access the physical memory of another partition.
  
   V. Test
To verify the performance and efficiency of a Linux NUMA system, we tested the NUMA architecture of the SGI Altix 350 at SGI's Shanghai office.
  
The system configuration is as follows:
CPU: 8 × 1.5 GHz Itanium 2
Memory: 8 GB
Interconnect structure: ring topology (see figure below)

   [Figure: Ring topology of the four compute modules of the SGI Altix 350]
Test cases:
  
1. Presta MPI test package (Benchmark from ASCI Purple)
The interconnect topology shows that memory accesses within a compute module do not traverse the interconnect and therefore have the lowest latency, while accesses to other compute modules must traverse one or two interconnect hops and have higher latency. We used the Presta MPI test package to measure the impact of each hop on the system. The specific results are as follows:
    
2. NPB (NAS Parallel Benchmarks) test from NASA
  
The above tests show that the SGI Altix 350 system delivers high memory-access and computing performance, and that Linux NUMA technology has reached the practical stage.