NUMA trade-off and optimization settings

Source: Internet
Author: User
 

When NUMA is disabled at the OS layer but enabled at the BIOS layer, performance suffers: QPS drops by 15-30%.

When NUMA is disabled at the BIOS layer, performance is unaffected regardless of whether NUMA is enabled at the OS layer.

Install numactl:
# yum install numactl -y
# numastat (equivalent to cat /sys/devices/system/node/node0/numastat; detailed statistics for all memory nodes in the system are recorded under the /sys/devices/system/node/ directory)
# numactl --hardware (lists the NUMA nodes on the system)
# numactl --show (shows the current binding policy)

 

 

 

On a RedHat or CentOS system, you can determine whether NUMA is enabled at the BIOS layer with:
# grep -i numa /var/log/dmesg
If the output is "No NUMA configuration found", NUMA is disabled; otherwise it is enabled, and you will see something like "NUMA: Using 30 for the hash shift."
You can view the machine's NUMA topology with the lscpu command.
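As a concrete illustration, here is a minimal sketch of counting the NUMA nodes reported by lscpu; the sample output is hard-coded (an assumption, for a hypothetical two-node box) so the pipeline itself is visible. On a real machine, replace the echo with `lscpu | grep -i numa`.

```shell
# Hard-coded lscpu-style sample (assumption, for illustration only)
lscpu_sample='NUMA node(s):        2
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23'

# Count the per-node "NUMA nodeN CPU(s)" lines
node_count=$(echo "$lscpu_sample" | grep -c '^NUMA node[0-9]')
echo "NUMA nodes: $node_count"
```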

A high numa_miss value indicates that the allocation policy needs adjusting, for example by binding the process to specific CPUs to raise the local-memory hit rate.
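A rough sketch of how such a check might be scripted, using made-up numastat-style figures (on a real machine, pipe the output of numastat in instead):

```shell
# Made-up numastat-style sample (assumption, for illustration)
numastat_sample='numa_hit              1598348028       834277394
numa_miss               22970191        91186368
numa_foreign            91186368        22970191'

# Sum numa_hit and numa_miss across all nodes, then print the miss share
echo "$numastat_sample" | awk '
/^numa_hit/  { for (i = 2; i <= NF; i++) hit  += $i }
/^numa_miss/ { for (i = 2; i <= NF; i++) miss += $i }
END { printf "miss ratio: %.2f%%\n", 100 * miss / (hit + miss) }'
```

With these sample figures the script prints `miss ratio: 4.48%`.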


---------------------------------------------

Modern machines have multiple CPUs and multiple memory banks. In the past we treated memory as one large block to which every CPU had identical access; this is the classic SMP model. But as the number of processors grows, shared memory causes more and more access contention, and once memory access becomes the bottleneck, performance stops scaling. NUMA (non-uniform memory access) was introduced for exactly this environment.

Suppose a machine has two processors and four memory banks. We group one processor with two memory banks and call the group a NUMA node, so this machine has two NUMA nodes. Physically, the processor and memory banks within a node are closer together, so access between them is faster. For example, with processors cpu1 and cpu2, and memory banks memory1.1 and memory1.2 placed beside cpu1 and memory2.1 and memory2.2 beside cpu2, NUMA node1 accesses memory1.1 and memory1.2 faster than it accesses memory2.1 and memory2.2. So under NUMA, efficiency is highest when each CPU accesses only the memory banks within its own node.

When running a program, you can use numactl --membind and --physcpubind to decide which CPUs and which memory the program uses. CPU-topology provides a comparison table: running the program within one node's resources versus across multiple nodes' resources showed a gap of 38s versus 28s. It is therefore practically useful to confine a program to a single NUMA node.

But is NUMA always a win? No: there is the NUMA trap, which the article "Swap's Crime and Punishment" addresses. The symptom is that the server starts using swap, and may even stall, while it still has free memory. This can be caused by the NUMA restriction: if a process is limited to the memory of its own NUMA node, then once that node's memory is used up it will not take memory from other nodes but will start swapping instead, or worse: if the machine has no swap configured, it may crash outright! In that case you can use numactl --interleave=all to lift the per-node restriction.

 

The conclusion: whether to use NUMA should be decided by the specific workload.

If your program consumes a large amount of memory, you should usually lift the NUMA node limit (or disable NUMA in hardware), because such a program is very likely to run into the NUMA trap.

Conversely, if your program does not use much memory but needs the fastest possible run time, you should usually restrict it to a single NUMA node.

---------------------------------------------------------------------

Kernel parameter overcommit_memory:

This is the memory overcommit (allocation) policy.

Possible values: 0, 1, and 2.

0: the kernel checks heuristically whether enough memory is available before granting an allocation; if there is enough, the request succeeds, otherwise it fails and an error is returned to the process.

1: the kernel always grants the allocation, regardless of the current memory state.

2: the kernel forbids overcommit: total committed memory may not exceed swap plus overcommit_ratio percent of physical RAM (overcommit_ratio defaults to 50).
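To make the mode-2 limit concrete, here is a small sketch of the arithmetic the kernel applies; the RAM and swap sizes are assumptions. On a real machine you can compare the result with the CommitLimit line in /proc/meminfo.

```shell
# Assumed sizes for illustration: 16 GB RAM, 8 GB swap, default ratio 50
ram_kb=16777216
swap_kb=8388608
overcommit_ratio=50   # value of /proc/sys/vm/overcommit_ratio

# Under vm.overcommit_memory=2:
#   CommitLimit = swap + RAM * overcommit_ratio / 100
commit_limit_kb=$(( swap_kb + ram_kb * overcommit_ratio / 100 ))
echo "CommitLimit: ${commit_limit_kb} kB"
```

With these figures the commit limit works out to 16777216 kB, i.e. allocations beyond 16 GB of committed address space are refused.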

Kernel Parameter zone_reclaim_mode:

Optional values: 0 and 1

A. When a node has insufficient memory:

1. If it is 0, the system will tend to allocate memory from other nodes.

2. If the value is 1, the system will tend to reclaim the cache memory from the local node.

B. The page cache is very important for performance, so 0 is usually the better choice.

----------------------------------------------------------------------

MongoDB NUMA Problems

The MongoDB log shows the following warning:

WARNING: You are running on a NUMA machine.

We suggest launching mongod like this to avoid performance problems:

numactl --interleave=all mongod [other options]

Solution: temporarily change the NUMA memory allocation policy to interleave=all (allocate pages round-robin across all nodes):

1. Prefix the original startup command with numactl --interleave=all.

For example: # numactl --interleave=all ${mongodb_home}/bin/mongod --config conf/mongodb.conf

2. Modify the kernel parameter:

echo 0 > /proc/sys/vm/zone_reclaim_mode; echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf

----------------------------------------------------------------------

I. NUMA and SMP

NUMA and SMP are two CPU-related hardware architectures. In SMP, all CPUs contend for one bus to access all of memory; the advantage is resource sharing, the disadvantage is fierce bus contention. As the number of CPUs on a PC server grows (not just the number of CPU cores), the drawback of bus contention becomes more and more apparent, so Intel introduced the NUMA architecture with the Nehalem CPUs, and AMD released the Opteron CPUs based on the same architecture.

The biggest feature of NUMA is its notions of node and distance. NUMA divides the two most precious hardware resources, CPU and memory, into resource groups (nodes), with roughly equal CPU and memory in each group. The number of nodes depends on the number of physical CPUs (most current PC servers have two physical CPUs with four cores each); distance defines the cost for one node to access another node's resources and supplies data for resource-scheduling optimization algorithms.

II. NUMA-related policies

1. Each process (or thread) inherits its NUMA policy from its parent process and is assigned a preferred node. If the NUMA policy permits, the process can also use resources on other nodes.

2. NUMA's CPU binding policies are cpunodebind and physcpubind: cpunodebind restricts the process to the cores of the given nodes, while physcpubind pins the process to specific cores more precisely.

3. NUMA's memory allocation policies are localalloc, preferred, membind, and interleave:

localalloc makes the process allocate memory only from the current node;

preferred is looser: it names a recommended node, and if that node runs out of memory the process may try other nodes;

membind names several nodes, and the process may request memory only from those nodes;

interleave makes the process allocate memory from the named nodes in round-robin (RR) order.
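The four memory policies above can be tried directly from the shell. In this sketch the trailing `true` is a stand-in for the real workload, and the node numbers assume a two-node machine; the block is guarded so it degrades to a no-op where numactl is not installed.

```shell
# Skip gracefully on machines without numactl
command -v numactl >/dev/null 2>&1 || { echo "numactl not installed"; exit 0; }

numactl --localalloc     true  # allocate on whatever node the process runs on
numactl --preferred=0    true  # prefer node 0, fall back to other nodes
numactl --membind=0      true  # allocate from node 0 only
numactl --interleave=all true  # spread pages round-robin over all nodes
echo "policies demonstrated"
```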

 

 

NUMA's default memory allocation policy prefers the local memory of the node the process runs on, which leads to unbalanced memory use across nodes. When one node's memory is exhausted, swap is used instead of memory from remote nodes, even though those nodes still have free memory. This is the so-called swap insanity phenomenon.

MySQL is multi-threaded and not NUMA-aware. If a machine runs only one MySQL instance, we can disable NUMA, in any of three ways:

1. at the hardware layer, disable it in the BIOS;

2. at the OS layer, boot the kernel with numa=off;

3. use the numactl command to change the memory allocation policy to interleave.

If a machine runs multiple MySQL instances, we can bind each instance to a different NUMA node and use a bound memory allocation policy to force allocation from the local node. This both exploits the hardware's NUMA characteristics and avoids the poor multi-core CPU utilization of a single MySQL instance.
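A hypothetical dry-run sketch of that per-node binding; the config paths and ports are invented for illustration. Remove the leading `echo` to actually launch the instances.

```shell
# Dry run: print one numactl-wrapped mysqld command per NUMA node
# (paths and ports are made-up examples)
for node in 0 1; do
  port=$((3306 + node))
  echo numactl --cpunodebind=$node --membind=$node \
       mysqld --defaults-file=/etc/mysql/my${port}.cnf --port=$port
done
```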

III. The relationship between NUMA and swap

As we may have noticed, NUMA's memory allocation policy is not fair to processes (or threads). In current RedHat Linux, localalloc is the default NUMA memory policy. This setting makes it easy for a memory-hungry program to exhaust one node's memory, and when that happens, processes on that node start swapping even though plenty of page cache could still be freed and plenty of free memory remains elsewhere.

IV. Solving the swap problem

Although NUMA's internals are fairly complex, the swap problem is easy to solve: simply change the NUMA policy with numactl --interleave before starting MySQL.

It is worth noting that the numactl command can be used not only to adjust the NUMA policy, but also to view the resource usage of each node.

 

 

I. CPU
Start with CPU.
If you look carefully, you may notice an interesting phenomenon on some servers: when you cat /proc/cpuinfo, the CPU frequency differs from the nominal frequency:
# cat /proc/cpuinfo
processor : 5
model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
cpu MHz : 1200.000
This is an Intel E5-2620 CPU, nominally 2.00 GHz with 24 logical CPUs, yet CPU 5 is running at only 1.2 GHz.
Why?
This is caused by a recent CPU feature: power-saving mode. When the operating system and CPU hardware are not busy, the CPU frequency is lowered to save power and reduce heat. That is good news for the environment, but it can be a disaster for MySQL.
To ensure that MySQL can fully use the CPU, we recommend setting the CPU to maximum-performance mode. This can be configured in the BIOS or in the operating system; doing it in the BIOS is better and more thorough. Because BIOS types differ widely, we will not detail the steps here.
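From the OS side, one way to see whether power saving is active is to read the cpufreq scaling governor from sysfs; a minimal sketch (the sysfs path is standard on Linux but absent in many virtual machines, in which case the loop simply prints nothing):

```shell
# Print the scaling governor for each CPU that exposes cpufreq
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  [ -r "$f" ] && echo "${f}: $(cat "$f")"
done
echo "governor check finished"
```

On hardware that exposes cpufreq, switching every governor to `performance` (for example with `cpupower frequency-set -g performance`) corresponds to the maximum-performance mode recommended above.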
Next, let's look at memory and see what can be optimized there.
I) First, NUMA.
Non-uniform memory access (NUMA) is a recent memory-management architecture, contrasted with symmetric multi-processing (SMP). [Diagram comparing the two architectures omitted.] We will not go into the details of NUMA here, but the key point is easy to see: under SMP, every CPU pays the same cost to access memory; under NUMA, local memory access is cheaper than non-local access. Based on this property, the operating system lets us set a memory allocation policy for each process. The currently supported options are:
--interleave=nodes
--membind=nodes
--cpunodebind=nodes
--physcpubind=cpus
--localalloc
--preferred=node
In short, you can pin the process to specific CPUs and allocate its memory locally, from given nodes, or round-robin. Unless the policy is set to --interleave=nodes (round-robin, i.e. memory may be allocated on any NUMA node), Linux will not give the process memory left over on other NUMA nodes even when it is available, and will instead fall back to swap. Any experienced sysadmin or DBA knows how badly swap hurts database performance.
So the simplest remedy is to disable this feature.
NUMA can be disabled in the BIOS, in the operating system, or temporarily when starting a process.
A) Because BIOS types differ widely, how to disable NUMA there varies greatly, and we will not detail it here.
B) Disable it in the operating system by appending numa=off to the kernel line of /etc/grub.conf, as follows:
kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/mapper/VolGroup-root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=VolGroup/root rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=auto rd_LVM_LV=VolGroup/swap rhgb crashkernel=auto quiet KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM numa=off
In addition, you can set vm.zone_reclaim_mode=0 so that the kernel allocates from other nodes rather than aggressively reclaiming local memory.
C) Disable the NUMA policy when starting MySQL:
numactl --interleave=all mysqld
Of course, the best way is to disable it in BIOS.
II) Next, vm.swappiness.
vm.swappiness is the operating system's policy for swapping out physical memory. Its value is a percentage from 0 to 100, with a default of 60. Setting vm.swappiness to 0 means swap as little as possible; 100 means swap out inactive memory pages as aggressively as possible.
Concretely, when memory is nearly full, the system uses this parameter to decide whether to swap out rarely used inactive memory or to drop cached data. The cache holds data read from disk, which by locality of reference may well be read again; inactive memory, as the name suggests, is memory mapped by applications but not used for a long time.
We can see the amount of inactive memory with vmstat:
# vmstat -an 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b swpd     free    inact   active  si  so  bi  bo  in  cs us sy id wa st
 1  0    0 27522384  326928  1704644   0   0 153  11  10   0  0  0 100 0
You can see more detail through /proc/meminfo:
# cat /proc/meminfo | grep -i inact
Inactive:         326972 kB
Inactive(anon):      248 kB
Inactive(file):   326724 kB
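As a quick sanity check on these counters, Inactive is the sum of Inactive(anon) and Inactive(file); a sketch with the sample figures above hard-coded so the pipeline stands on its own:

```shell
# Sample figures copied from the meminfo output above
meminfo_sample='Inactive:         326972 kB
Inactive(anon):      248 kB
Inactive(file):   326724 kB'

# Sum the anon and file components; the result matches the Inactive total
echo "$meminfo_sample" | awk '/^Inactive\(/ { sum += $2 } END { print sum " kB" }'
```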
Let's dig a little deeper into inactive memory. In Linux, a memory page can be in one of three states: free, active, and inactive. As is well known, the Linux kernel maintains several LRU lists internally for memory management, such as LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, and LRU_UNEVICTABLE. LRU_INACTIVE_ANON and LRU_ACTIVE_ANON manage anonymous pages, while LRU_INACTIVE_FILE and LRU_ACTIVE_FILE manage the page cache. The kernel periodically moves pages from the active lists to the inactive lists based on their access patterns, and inactive memory can then be swapped out.
In general MySQL, and InnoDB in particular, manages its own caches, which occupy a large amount of memory that may not be touched frequently. If Linux wrongly swaps this memory out, a great deal of CPU and IO is wasted. And since InnoDB manages its cache itself, having the OS hold the same file data in page cache brings it almost no benefit.
Therefore, we had best set vm.swappiness=1 or 0 on MySQL servers.
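Applying that recommendation might look like the following fragment (requires root; shown as a config sketch, not executed here):

```shell
# Config fragment (assumes root); not run as part of this article
sysctl -w vm.swappiness=1                      # take effect immediately
echo "vm.swappiness = 1" >> /etc/sysctl.conf   # persist across reboots
```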

 
