Loading editor... Source: http://www.realzyy.com /? P = 923 Linux has many good memory and IO Scheduling Mechanisms, but they are not suitable for all scenarios. One of the headaches for DBA Linux is that it does not avoid ing the address space allocated to MySQL to swap because MySQL is very important. For systems that frequently perform read/write operations, the data seems to be in the memory but is actually very bad on the disk, and the growth in response time is likely to drag down the entire system. This blog mainly talks about how we, as a DBA, try to avoid MySQL from being attacked by swap. First, we need to understand some basic things, such as why swap is generated. Assume that the physical memory is 16 GB, And the SWAp is 4 GB. If MySQL occupies 12 GB of physical memory and other programs or system modules require 6 GB of memory, the operating system may map some of the address space of MySQL to swap. When CP is a large file or mysqldump is used to export a large database, the file system will usually apply for a large amount of memory from Linux as the cache. If you are not careful, l will use swap. This scenario is common. The following are the simplest three adjustment methods: 1. Change the content of/proc/sys/Vm/swappiness to 0 (temporary) and/etc/sysctl. add VM on Conf. swappiness = 0 (permanent) determines whether Linux prefers to use swap or release the file system cache. When the memory is tight, the lower the value, the more inclined it is to release the file system cache. Of course, this parameter can only reduce the probability of using swap, and does not prevent Linux from using swap. 2. Modify the MySQL Configuration Parameter innodb_flush_method to enable the o_direct mode. In this case, the buffer pool of InnoDB directly bypasses the file system cache to access the disk, but the redo log still uses the file system cache. It is worth noting that the redo log is in the overwriting mode, and will not occupy too much even if the file system cache is used. 3. Adding the MySQL Configuration Parameter memlock will force the address space of the mysqld process to be locked in the physical memory. This is a very overbearing requirement for the OS. You must use the root account to start MySQL to take effect. Another complicated method is to specify that MySQL uses the large page memory (large page ). On Linux, the large page memory will not be swapped out of the physical memory, which is similar to memlock. Specific configuration method can refer to: http://harrison-fisk.blogspot.co... pages-on-linux.html
Http://www.realzyy.com /? P = 943
I previously introduced four methods for MySQL to avoid using swap. The Principle and Implementation Mechanism need to be supplemented here. If you are not interested in Linux APIs, you can skip them directly.
I. objective of setting swap in the operating system
A necessary condition for running a program is sufficient memory, and the memory is usually a resource that is relatively tight in the system. To meet the requirements of more programs, the operating system virtualizes some memory addresses and maps them to swap. For a program, it only knows that the operating system has allocated memory addresses to itself, but it is not clear whether these memory addresses are mapped to physical memory or swap.
The physical memory and swap have the same functions, but they differ greatly in performance because of the differences in physical storage components (memory and disk. The operating system will switch in and out based on the memory usage characteristics of the program, and leave the physical memory to the program that requires it as much as possible. However, such scheduling follows predefined rules and does not fully meet the needs of the program. Some special programs (such as MySQL) want to store their data in the physical memory forever to provide higher performance. Therefore, the operating system sets several APIs to provide "Special Services" for callers ".
Ii. Several APIs provided by Linux
1. mlockall () and munlockall ()
This pair of functions enables the caller's address space to be resident in the physical memory, and can also cancel this privilege as needed. The flag bit of mlockall () can be any combination of mcl_current and mcl_future, they represent "Keep allocated address space resident physical memory" and "keep future allocated address space resident physical memory ". For Linux, this is very domineering for functions. Only the root user has the permission to call them.
2. shmget () and shmat ()
This function can apply to the operating system for large page memory (large page ). The large page memory features pre-allocated and permanent physical memory. Because the shared memory segment is used, the page table may be smaller than the traditional small page memory allocation method. For multi-process Memory sharing programs (such as Oracle), large page memory can save a lot of page table overhead; For MySQL, performance and resource overhead are not significantly changed, the advantage is that it reduces the possibility of memory addresses being mapped to swap. As to why it is reduced rather than completely avoided, I will explain it later.
3. o_direct and posix_memalign ()
Both methods do not reduce memory usage. The caller's intention is to obtain higher system privileges, rather than saving system resources. O_direct is a more idealized method. By avoiding double buffer, it saves the file system cache overhead and eventually reduces swap usage. O_direct is a flag related to Linux I/O scheduling and is called in open functions. Files opened using the o_direct flag do not use the file system cache for read/write operations. Traditional databases (Oracle and MySQL) usually have o_direct related switches, which also reduces memory usage while improving performance. Posix_memalign () is used to apply for an alignment memory address. Only the memory address requested by posix_memalign () can be used to read and write file descriptors in o_direct mode.
4. madvise () and fadvise ()
This is also mild to the function, which can pass the caller's expectation of the data access mode to Linux for better performance.
We are interested in the two flags madv_dontneed and fadv_noreuse. The former recommends that Linux release the specified memory area, while the latter recommends that the file system release the cache occupied by the specified file.
Iii. MySQL memory usage code
1. memlock
Query memlock In the MySQL source code directory. you can know that this parameter is used to make MySQL call mlockall (). Matching in the source code shows that both NDB, MyISAM, and mysqld call mlockall (). NDB is a storage engine that can exist independently of MySQL. The method in which mysqld calls mlockall () is somewhat unexpected. The flag passed to mlockall () in the init_server_components () function is mcl_current, that is, the memory applied for later does not need to be locked. Let's take a look at the call sequence of MyISAM: mlockall () <-lock_memory () <-mi_repair (). MyISAM calls the mlockall () function only when it is fixed.
2. Large-pages
According to the Linux kernel documentation, there are two ways to use large page memory: one is to create a hugetlb file and map it To the memory address of the program, then perform normal read/write operations. The other is the shmget () + shmat () mentioned earlier, which is exactly the way MySQL uses. Matching shmget In the MySQL source code directory shows that bdb, NDB, InnoDB, and MyISAM call this function. Next, let's take a look at the commonly used InnoDB and MyISAM engines.
In InnoDB, we can find that OS _mem_alloc_large () calls shmget (), while only the buf_pool_init () -- InnoDB buffer pool initialization function that calls OS _mem_alloc_large. According to the Observed conclusion, InnoDB will use large page memory in the buffer pool based on the configuration parameters, and the redo log does not seem to have this treatment.
For MyISAM, direct calls to shmget () cannot be found in the storage-level code. This is because MyISAM is the native storage engine of MySQL, and many functions are stored in the mysys directory on the previous layer. By searching shmget (), we can find that the call sequence of MyISAM is as follows: shmget () <-my_large_malloc_int () <-my_large_malloc () <-init_key_cache (). That is to say, MyISAM only uses the index cache to use the large page memory, which is easy to understand, because MyISAM data is directly thrown to the file system for caching and cannot use the large page memory.
3. innodb_flush_method
O_direct is a unique parameter of bdb, NDB, and InnoDB. Here we only discuss InnoDB, a common engine. Matching o_direct in the source code directory of InnoDB makes it easy to find a function called OS _file_set_nocache (). This function is used to change the file opening mode to o_direct mode. After tracking, we will find that only the OS _file_create () function calls OS _file_set_nocache (). Although the function name contains "CREATE", OS _file_create () will choose to open or create a new file based on the input parameters. At the same time, OS _file_create () will also call OS _file_set_nocache () to disable the corresponding cache of the file system according to the MySQL configuration. The OS _file_create () function contains the following code:
/* We disable OS caching (o_direct) only on data files */
If (type! = OS _log_file &&
Srv_unix_file_flush_method = srv_unix_o_direct)
{
OS _file_set_nocache (file, name, mode_str );
}
This Code indicates that only InnoDB data files are eligible to use the o_direct mode, and redo log cannot be used.
The above analysis is based on MySQL 5.0.85, and InnoDB is Innobase.
There may be some discrepancies in different versions. You are welcome to join the discussion.
References:
Virtual Memory @ Wiki
All about Linux swap space
Hugetlb-large page support in the Linux Kernel
Page Table @ Wiki
Http://www.realzyy.com /? P = 1245
It must be admitted that even after reading MySQL how to avoid using swap and MySQL and how to avoid using swap (2), swap may still be stubborn to reproduce on the host. Fortunately, the victims of many swap problems have worked tirelessly to find the ultimate cause-NUMA. Here we stand on the shoulders of giants and give you a brief explanation of NUMA's principles and optimization methods.
I. NUMA and SMP
NUMA and SMP are two CPU-related hardware architectures. In the SMP architecture, all CPUs compete for one bus to access all the memory. The advantage is resource sharing, while the disadvantage is that the competition for bus is fierce. As the number of CPUs on the PC Server increases (not just the number of CPU cores), the disadvantages of bus contention gradually become more and more obvious, so Intel launched the NUMA architecture on the nehalem CPU, AMD also launched opteron CPU Based on the same architecture.
The biggest feature of NUMA is the introduction of the concepts of node and distance. NUMA strictly divides the two most valuable hardware resources, CPU and memory, into resource groups ), the CPU and memory in each resource group are almost the same. The number of resource groups depends on the number of physical CPUs (most of the existing PC servers have two physical CPUs, each of which has four cores ); distance is used to define the resource overhead for each node to call and provide data support for resource scheduling optimization algorithms.
Ii. NUMA-related policies
1. Each process (or thread) inherits the NUMA policy from the parent process and assigns a priority node. If the NUMA policy permits, the process can call resources on other nodes.
2. NUMA's CPU allocation policies include cpunodebind and physcpubind. Cpunodebind specifies the cores on which the process runs, while physcpubind can more precisely specify the cores on which the process runs.
3. The memory allocation policies of NUMA include localalloc, preferred, membind, and interleave. Localalloc requires that the process request to allocate memory from the current node. Preferred is loose and specifies a recommended node to obtain the memory. If the recommended node does not have enough memory, the process can try other nodes. Membind can specify several nodes, and the process can only request memory allocation from these specified nodes. Interleave specifies that a process requests memory allocation from a specified number of nodes in an RR algorithm.
Iii. Relationship between NUMA and swap
We may have discovered that NUMA's memory allocation policy is not fair for processes (or threads. In existing RedHat Linux, localalloc is the default NUMA memory allocation policy. This configuration option makes it easy for the resource exclusive program to exhaust the memory of a node. When the memory of a node is exhausted, Linux allocates the node to a process (or thread) that consumes a large amount of memory, and swap generates the node properly. Although there are still many page caches that can be released, there is even a lot of free memory.
4. Solve swap Problems
Although NUMA's principle is relatively complex, it is actually easy to solve swap: you only need to use numactl-interleave to modify the NUMA policy before starting MySQL.
It is worth noting that the numactl command can be used not only to adjust the NUMA policy, but also to view the resource usage of each node. It is a very worth studying command.
References:
The MySQL "SWAp insanity" problem and the effects of the NUMA Architecture
NUMA status: item Definition
Linux administrator's Manual (# Man numactl)