How MySQL avoids using the swap partition of Linux to improve read and write performance
Linux has a lot of good memory, IO scheduling mechanism, but it does not apply to all scenarios. One of the headaches for DBAs is that it doesn't have to be that MySQL is important to avoid mapping the address space assigned to MySQL to swap. For systems with frequent read and write operations, the data appears to be in memory and the disk is actually very bad, and response time growth is likely to directly drag down the entire system. This blog is mainly about how we, as DBAs, try to avoid MySQL being killed by swap.
First we need to know something about the basics, such as why swap is generated. Suppose our physical memory is 16g,swap is 4G. If MySQL itself already occupies 12G of physical memory while other programs or system modules require 6G of memory, then the operating system may map a portion of the address space owned by MySQL to swap.
CP a large file, or use mysqldump to export a large database, the file system will often apply to Linux for a large amount of memory as a cache, accidentally will cause L to use swap. This scenario is more common, and the following are the simplest three adjustment methods:
1,/proc/sys/vm/swappiness content changed to 0 (temporary),/etc/sysctl.conf add vm.swappiness=0 (permanent)
This parameter determines whether Linux is inclined to use swap or is disposed to release the filesystem cache. In the case of tight memory, the lower the value the more inclined to release the file system cache.
Of course, this parameter only reduces the probability of using swap, and does not prevent Linux from using swap.
2, modify the configuration parameters Innodb_flush_method MySQL, turn on O_direct mode.
In this case, the InnoDB buffer pool will directly bypass the file system cache to access the disk, but redo log will still use the file system cache. It is important to note that Redo log is overwrite mode and does not take up too much even if the file system cache is used.
3. Add MySQL configuration Parameters Memlock
This parameter forces the address space of the mysqld process to remain locked in physical memory, which is a very overbearing requirement for the OS. You must start MySQL with the root account to take effect.
There is also a more sophisticated way to specify that MySQL uses large page memory (Large page). Large pages of memory on Linux will not be swapped out of physical memory, and Memlock have the same wonderful. The specific configuration method can be consulted: http://harrison-fisk.blogspot.com/2009/01/enabling-innodb-large-pages-on-linux.html
The four methods of how MySQL avoids using swap are described earlier. Here you need to add the principle and implementation mechanism, for the Linux API is not interested in students can skip directly.
I. The purpose of the operating system setting swap
A necessary condition for a program to run is sufficient memory, which is often a source of tension within the system. To meet the requirements of more programs, the operating system virtual a portion of the memory address and maps it to swap. For a program, it knows only that the operating system allocates memory addresses to itself, but it is unclear whether the memory addresses are mapped to physical memory or swap.
Physical memory and swap are functionally identical, except for the difference in performance between physical storage components (memory and disk). The operating system will swap in and out according to the characteristics of the program using memory, leaving the physical memory as much as possible to the program that needs it most. However, this scheduling is based on a predetermined set of rules, and does not fully meet the needs of the program. Some special programs, such as MySQL, want their data to always be in physical memory to provide higher performance. The operating system then sets up several APIs to provide the caller with a "special service".
Ii. several APIs provided by Linux
1, Mlockall () and Munlockall ()
This pair of functions allows the caller's address space to reside in physical memory, or it can be canceled when needed. The flag bit of mlockall () can be any combination of mcl_current and mcl_future, respectively, representing "keeping the allocated address space resident physical memory" and "maintaining the future allocated address space resident physical memory". For Linux, this is a very overbearing function, and only the root user has permission to invoke it.
2, Shmget () and Shmat ()
This pair of functions can be applied to the operating system using large page memory (Large page). Large-page memory is characterized by pre-allocation and perpetual physical memory, because the way that shared memory segments are used, page table is likely to be smaller than traditional small-page allocations. For multi-process shared memory programs (such as Oracle), large page memory saves a lot of page table overhead, while for MySQL there is no significant change in performance and resource overhead, and the benefit is to reduce the possibility that memory addresses are mapped to swap. As for why it is reduced rather than completely avoided, then explain it later.
3, O_direct and Posix_memalign ()
None of the above two methods will reduce the amount of memory used, and the caller's intention is to get higher system privileges rather than conserve system resources. O_direct is a more idealized way to save the file system cache overhead and ultimately reduce the swap usage by avoiding double buffer. O_direct is a flag related to Linux IO scheduling and is called in the Open function. Files opened with the O_DIRECT flag will not be used to read or write to the file system cache. Traditional databases (ORACLE, MySQL) basically have o_direct-related switches that improve performance while reducing memory usage. As for Posix_memalign (), it is used to apply the aligned memory address. Only the memory address requested by posix_memalign () can be used to read and write file descriptors in O_direct mode.
4, Madvise () and Fadvise ()
This is also a more moderate function, you can pass the caller's expectation of the data access pattern to Linux, in order to get better performance.
We are more interested in the two flags of Madv_dontneed and Fadv_noreuse. The former would recommend that Linux release the specified memory area, and the latter would recommend that the file system release the cache occupied by the specified file.
Third, MySQL memory usage related to some code
1, Memlock
In the MySQL source directory query Memlock, you can know that the role of this parameter is to make MySQL call Mlockall (). In the source code inside the match can know NDB, MyISAM and mysqld all call Mlockall (). NDB is a storage engine that can exist independently of MySQL and is not listed here. The way Mysqld calls Mlockall () is somewhat unexpected, and the flag passed to Mlockall () in the init_server_components () function is mcl_current, which means that the requested memory is not locked. And look at the sequence of calls to MyISAM: Mlockall () <-lock_memory () <-Mi_repair (), MyISAM will call the Mlockall () function only when repairing.
2, Large-pages
According to the kernel documents of Linux, there are two ways to use large-page memory: One is to create a hugetlb type of file, and mmap it into the memory address of the program, and then perform normal read and write operations. The other is the Shmget () +shmat () that was mentioned earlier, and it is the way MySQL uses it. In the MySQL source directory to match Shmget, you can find BDB, NDB, InnoDB, MyISAM all call this function. Then look at the more commonly used InnoDB and MyISAM engines. The
can be found in InnoDB os_mem_alloc_large () called Shmget (), and the function calling Os_mem_alloc_large () is only buf_pool_init ()--innodb Buffer The initialization function of the pool. According to the observation, InnoDB will use the large page memory in buffer pool according to the configuration parameters, Redo log seems to have no such treatment.
for MyISAM, a direct call to Shmget () is not found in the storage level code. This is because MyISAM is the native storage engine for MySQL, and many functions are stored in the Mysys directory on the previous layer. by searching Shmget (), we can find MyISAM in the order of invocation: Shmget () <-my_large_malloc_int () <-my_large_malloc () <-Init_key_cache ()。 That is to say, MyISAM only the index cache to use the large page memory, it is very easy to understand, because the MyISAM data is thrown directly to the file system cache, unable to use large page memory.
3, Innodb_flush_method
O_direct is a unique parameter for BDB, NDB, and InnoDB, where only the more common engine of InnoDB is discussed. In the InnoDB source directory to match the O_direct, it is easy to find a function called Os_file_set_nocache (), and this function is to change the way the file is opened to O_direct mode. Tracking again, you will find that only the Os_file_create () function calls Os_file_set_nocache (). Although there is create in the function name, actually os_file_create () chooses to open or create a new file depending on the parameters passed in. Os_file_create () also calls Os_file_set_nocache () to close the corresponding cache of the file system, depending on the configuration of MySQL. The following code is in the Os_file_create () function:
/* We Disable OS caching (o_direct) only on data files */
if (type! = Os_log_file &&
Srv_unix_file_flush_method = = Srv_unix_o_direct)
{
Os_file_set_nocache (file, name, MODE_STR);
}
This code means that only InnoDB data files are eligible to use O_direct mode, Redo log is not available.
The above analysis is based on the 5.0.85 version of the original Mysql,innodb is innobase.
There may be some discrepancies in the version, please participate in the discussion.
Reference documents:
Virtual Memory@wiki
All about Linux swap space
Hugetlb–large Page Support in the Linux Kernel
Page Table@wiki
Source: http://www.taobaodba.com/html/552_mysql_avoid_swap.html
Http://www.taobaodba.com/html/554_mysql_avoid_swap_2.html
Permanent Link: http://www.ha97.com/4201.html
How MySQL avoids using the swap partition of Linux to improve read and write performance