How does MySQL avoid using Linux swap partitions to improve read/write performance?

Source: Internet
Author: User
Linux has many good memory and IO Scheduling Mechanisms, but they are not suitable for all scenarios. One of the headaches for DBA Linux is that it won't be caused by MySQL

Linux has many good memory and IO Scheduling Mechanisms, but they are not suitable for all scenarios. One of the headaches for DBA Linux is that it won't be caused by MySQL

Linux has many good memory and IO Scheduling Mechanisms, but they are not suitable for all scenarios. One of the headaches for DBA Linux is that it does not avoid ing the address space allocated to MySQL to swap because MySQL is very important. For systems that frequently perform read/write operations, the data seems to be in the memory but is actually very bad on the disk, and the growth in response time is likely to drag down the entire system. This blog mainly talks about how we, as a DBA, try to avoid MySQL from being attacked by swap.

First, we need to understand some basic things, such as why swap is generated. Assume that the physical memory is 16 GB, And the swap is 4 GB. If MySQL occupies 12 GB of physical memory and other programs or system modules require 6 GB of memory, the operating system may map some of the address space of MySQL to swap.

When cp is a large file or mysqldump is used to export a large database, the file system will usually apply for a large amount of memory from Linux as the cache. If you are not careful, L will use swap. This scenario is common. The following are the three simplest adjustment methods:

1. Change the/proc/sys/vm/swappiness content to 0 (temporary), and add vm. swappiness = 0 (permanent) to/etc/sysctl. conf)
This parameter determines whether Linux prefers to use swap or release the file system cache. When the memory is tight, the lower the value, the more inclined it is to release the file system cache.
Of course, this parameter can only reduce the probability of using swap, and does not prevent Linux from using swap.

2. Modify the MySQL Configuration Parameter innodb_flush_method to enable the O_DIRECT mode.
In this case, the buffer pool of InnoDB directly bypasses the file system cache to access the disk, but the redo log still uses the file system cache. It is worth noting that the Redo log is in the overwriting mode, and will not occupy too much even if the file system cache is used.

3. Add the MySQL Configuration Parameter memlock.
This parameter forces the IP address space of the mysqld process to be locked in the physical memory, which is an overbearing requirement for the OS. You must use the root account to start MySQL to take effect.

Another complicated method is to specify that MySQL uses the Large Page memory (Large Page ). On Linux, the large page memory will not be swapped out of the physical memory, which is similar to memlock.

I previously introduced four methods for MySQL to avoid using swap. The Principle and Implementation Mechanism need to be supplemented here. If you are not interested in Linux APIs, you can skip them directly.

I. objective of setting swap in the operating system
A necessary condition for running a program is sufficient memory, and the memory is usually a resource that is relatively tight in the system. To meet the requirements of more programs, the operating system virtualizes some memory addresses and maps them to swap. For a program, it only knows that the operating system has allocated memory addresses to itself, but it is not clear whether these memory addresses are mapped to physical memory or swap.
The physical memory and swap have the same functions, but they differ greatly in performance because of the differences in physical storage components (memory and disk. The operating system will switch in and out based on the memory usage characteristics of the program, and leave the physical memory to the program that requires it as much as possible. However, such scheduling follows predefined rules and does not fully meet the needs of the program. Some special programs (such as MySQL) want to store their data in the physical memory forever to provide higher performance. Therefore, the operating system sets several APIs to provide "Special Services" for callers ".

Ii. Several APIs provided by Linux
1. mlockall () and munlockall ()
This pair of functions enables the caller's address space to be resident in the physical memory, and can also cancel this privilege as needed. The flag bit of mlockall () can be any combination of MCL_CURRENT and MCL_FUTURE, they represent "Keep allocated address space resident physical memory" and "keep future allocated address space resident physical memory ". For Linux, this is very domineering for functions. Only the root user has the permission to call them.

2. shmget () and shmat ()
This function can apply to the operating system for Large Page memory (Large Page ). The large page memory features pre-allocated and permanent physical memory. Because the shared memory segment is used, the page table may be smaller than the traditional small page memory allocation method. For multi-process Memory sharing programs (such as Oracle), large page memory can save a lot of page table overhead; For MySQL, performance and resource overhead are not significantly changed, the advantage is that it reduces the possibility of memory addresses being mapped to swap. As to why it is reduced rather than completely avoided, I will explain it later.

3. O_DIRECT and posix_memalign ()
Both methods do not reduce memory usage. The caller's intention is to obtain higher system privileges, rather than saving system resources. O_DIRECT is a more idealized method. By avoiding double buffer, it saves the file system cache overhead and eventually reduces swap usage. O_DIRECT is a flag related to Linux I/O scheduling and is called in open functions. Files opened using the O_DIRECT flag do not use the file system cache for read/write operations. Traditional databases (ORACLE and MySQL) usually have O_DIRECT related switches, which also reduces memory usage while improving performance. Posix_memalign () is used to apply for an alignment memory address. Only the memory address requested by posix_memalign () can be used to read and write file descriptors in O_DIRECT mode.

4. madvise () and fadvise ()
This is also mild to the function, which can pass the caller's expectation of the data access mode to Linux for better performance.
We are interested in the two flags MADV_DONTNEED and FADV_NOREUSE. The former recommends that Linux release the specified memory area, while the latter recommends that the file system release the cache occupied by the specified file.

Iii. MySQL memory usage code
1. memlock
Query memlock In the MySQL source code directory. you can know that this parameter is used to make MySQL call mlockall (). Matching in the source code shows that both NDB, MyISAM, and mysqld call mlockall (). NDB is a storage engine that can exist independently of MySQL. The method in which mysqld calls mlockall () is somewhat unexpected. The flag passed to mlockall () in the init_server_components () function is MCL_CURRENT, that is, the memory applied for later does not need to be locked. Let's take a look at the call sequence of MyISAM: mlockall () <-lock_memory () <-mi_repair (). MyISAM calls the mlockall () function only when it is fixed.

2. large-pages
According to the Linux kernel documentation, there are two ways to use large page memory: one is to create a hugetlb file and map it To the memory address of the program, then perform normal read/write operations. The other is the shmget () + shmat () mentioned earlier, which is exactly the way MySQL uses. Matching shmget In the MySQL source code directory shows that BDB, NDB, InnoDB, and MyISAM call this function. Next, let's take a look at the commonly used InnoDB and MyISAM engines.
In InnoDB, we can find that OS _mem_alloc_large () calls shmget (), while only the buf_pool_init () -- InnoDB Buffer Pool initialization function that calls OS _mem_alloc_large. According to the Observed conclusion, InnoDB will use large page memory in the Buffer Pool based on the configuration parameters, and the Redo log does not seem to have this treatment.
For MyISAM, direct calls to shmget () cannot be found in the storage-level code. This is because MyISAM is the native storage engine of MySQL, and many functions are stored in the mysys directory on the previous layer. By searching shmget (), we can find that the call sequence of MyISAM is as follows: shmget () <-my_large_malloc_int () <-my_large_malloc () <-init_key_cache (). That is to say, MyISAM only uses the index cache to use the large page memory, which is easy to understand, because MyISAM data is directly thrown to the file system for caching and cannot use the large page memory.

3. innodb_flush_method
O_DIRECT is a unique parameter of BDB, NDB, and InnoDB. Here we only discuss InnoDB, a common engine. Matching O_DIRECT in the source code directory of InnoDB makes it easy to find a function called OS _file_set_nocache (). This function is used to change the file opening mode to O_DIRECT mode. After tracking, we will find that only the OS _file_create () function calls OS _file_set_nocache (). Although the function name contains "create", OS _file_create () will choose to open or create a new file based on the input parameters. At the same time, OS _file_create () will also call OS _file_set_nocache () to disable the corresponding cache of the file system according to the MySQL configuration. The OS _file_create () function contains the following code:
/* We disable OS caching (O_DIRECT) only on data files */
If (type! = OS _LOG_FILE &&
Srv_unix_file_flush_method = SRV_UNIX_O_DIRECT)
{
OS _file_set_nocache (file, name, mode_str );
}
This Code indicates that only InnoDB data files are eligible to use the O_DIRECT mode, and Redo log cannot be used.

The above analysis is based on MySQL 5.0.85, and InnoDB is Innobase.
There may be some discrepancies in different versions. Please compare them in detail.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.