Case study: MySQL killed by the OOM killer because of insufficient swap space

Source: Internet
Author: User

Background:

  • The machine has 256 GB of memory and runs two MySQL instances. Each buffer pool is set to 106 GB, 212 GB in total;
  • A database was migrated to this environment, and an OOM kill message appeared on the first day: one MySQL instance was forcibly killed because swap space was insufficient;
  • The mysqld process of that instance was not cleaned up completely and became a zombie process. As a result, the instance could not be restarted, and the only option was to reboot the machine.

Investigation:

The top output taken after the OOM kill showed that the memory had not been released, because mysqld had become a zombie process.

The buffer pool (BP) of each MySQL instance is set to 106 GB, but the RES values reached 119 GB and [value missing] GB respectively, which is close to the physical memory ceiling of the machine. The machine has only 7 GB of swap, and all of it was used up.

Now the cause is clear, and the solution is simple: reduce the BP to 90 GB.

Note: No similar fault has occurred in the more than 10 days since the adjustment.

Extension

1. MySQL memory overhead

innodb_buffer_pool_size defines the size of the buffer pool, but the buffer pool itself needs additional data structures to manage it.

For example, each page in the buffer pool needs a buf_block_t to manage it. This memory is not included in the parameter.

This extra consumption is about 8% of the BP size. For more information, see http://mysqlha.blogspot.co.uk/2008/11/innodb-memory-overhead.html.

This is only the overhead of the global buffers. Once the session buffers are added, the memory MySQL needs is even higher.
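
As a rough worked example of the figures above, the following sketch estimates the global-buffer footprint of the instances in this case. The function names and the fixed 8% ratio are assumptions made for illustration; session buffers are deliberately left out, so the real footprint would be higher.

```c
/* Assumption taken from the text: management structures add roughly 8%
 * on top of innodb_buffer_pool_size. Session buffers are NOT modeled. */
#define BP_OVERHEAD_RATIO 0.08

/* Estimated global-buffer footprint (GB) of a single instance. */
double instance_footprint_gb(double buffer_pool_gb)
{
    return buffer_pool_gb * (1.0 + BP_OVERHEAD_RATIO);
}

/* Estimated footprint (GB) of several identical instances on one machine. */
double total_footprint_gb(double buffer_pool_gb, int instances)
{
    return instance_footprint_gb(buffer_pool_gb) * instances;
}
```

With the values from this case, total_footprint_gb(106.0, 2) comes to roughly 229 GB of the 256 GB on the machine before any session buffers are counted, which fits the observation that RES ended up well above the 106 GB setting.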

2. Why does swap happen?

First, a rough look at Linux memory management. On a NUMA architecture, Linux memory is divided into multiple nodes, while a non-NUMA system has only one; each node is described by pg_data_t. Each node is divided into three zones, and each zone is described by the zone_struct struct.

Each zone has an active LRU and an inactive LRU, and each of them is split into an anonymous page list and a file cache page list, for a total of four LRU lists.

A zone also defines pages_low, pages_min, and pages_high. When the zone's free memory falls below pages_low, kswapd is woken up to reclaim memory; when it falls below pages_min, memory is reclaimed synchronously. Reclaim continues until the zone's free memory reaches pages_high.
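
The watermark behavior described above can be sketched as a small user-space model. The type and function names here are hypothetical and are not kernel APIs; the sketch only mirrors the three thresholds named in the text.

```c
/* Hypothetical user-space model of the zone watermarks; not kernel code. */
enum reclaim_action {
    RECLAIM_NONE,        /* free pages are at or above pages_low */
    RECLAIM_BACKGROUND,  /* below pages_low: wake kswapd */
    RECLAIM_SYNCHRONOUS  /* below pages_min: reclaim synchronously */
};

struct zone_watermarks {
    unsigned long pages_min;
    unsigned long pages_low;
    unsigned long pages_high;
};

/* Decide which kind of reclaim the current free page count calls for. */
enum reclaim_action reclaim_action_for(const struct zone_watermarks *wm,
                                       unsigned long free_pages)
{
    if (free_pages < wm->pages_min)
        return RECLAIM_SYNCHRONOUS;
    if (free_pages < wm->pages_low)
        return RECLAIM_BACKGROUND;
    return RECLAIM_NONE;
}

/* Reclaim keeps going until free pages reach pages_high. */
int kswapd_should_stop(const struct zone_watermarks *wm,
                       unsigned long free_pages)
{
    return free_pages >= wm->pages_high;
}
```

For example, with watermarks of 100, 200, and 300 pages, a zone with 150 free pages would wake kswapd, and reclaim would continue until 300 pages are free.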

Linux caches a lot of data, such as the page cache and the slab cache. When these pages are reclaimed, they are written back to disk if necessary and can then be reused directly. Other memory pages, such as anonymous pages in the user-space address space and pages in IPC shared memory, cannot be reclaimed directly; they can only be swapped out to the swap partition.

When does the OS reclaim memory?

1. Periodic reclaim: kswapd wakes up periodically. If the zone's free memory is below pages_low, it reclaims pages; if the free memory is below pages_min, pages are reclaimed synchronously;

2. Direct reclaim: when Linux allocates memory or creates a buffer for a user process and the system does not have enough physical memory, it reclaims pages directly. If the OS still cannot obtain enough pages after trying to reclaim memory, it calls select_bad_process and runs the OOM killer;

Either way, shrink_list() is called in the end. The scan logic for the four lists is defined in the get_scan_count function in vmscan.c, where the scan_balance variable determines which LRU lists are reclaimed. The general logic is as follows:

1. If swap is disabled or there is no swap space left, only the file-backed lists are scanned; the anonymous page lists are not scanned.

    if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
        scan_balance = SCAN_FILE;
        goto out;
    }

2. If this is not global reclaim and swappiness = 0, the anonymous page lists are not scanned.

    if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
        scan_balance = SCAN_FILE;
        goto out;
    }

3. If this is global reclaim and the zone's free pages plus the pages on its file-backed lists do not exceed the zone's high watermark (pages_high), anonymous pages are reclaimed. In this case the system swaps even if swappiness is set to 0.

    if (global_reclaim(sc)) {
        unsigned long zonefile;
        unsigned long zonefree;

        zonefree = zone_page_state(zone, NR_FREE_PAGES);
        zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
                   zone_page_state(zone, NR_INACTIVE_FILE);

        if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
            scan_balance = SCAN_ANON;
            goto out;
        }
    }

4. If the inactive file list is large enough, anonymous pages are not considered for reclaim, that is, no swap is performed.

    if (!inactive_file_is_low(lruvec)) {
        scan_balance = SCAN_FILE;
        goto out;
    }

At this point we have a general understanding of what swappiness = 0 does: it cannot completely prevent swapping.
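
To make the order of the four checks easier to follow, here is a minimal user-space sketch of the same decision sequence. The struct, the enum, and the function name are hypothetical; the fallthrough value MODEL_SCAN_PROPORTIONAL is an assumption standing in for whatever the kernel does when none of the four checks applies, which the excerpts do not show.

```c
#include <stdbool.h>

/* Hypothetical model of the four checks; names are not kernel symbols. */
enum model_scan {
    MODEL_SCAN_FILE,         /* scan file-backed lists only */
    MODEL_SCAN_ANON,         /* scan anonymous lists (swap) */
    MODEL_SCAN_PROPORTIONAL  /* assumed fallthrough: no early exit taken */
};

struct reclaim_input {
    bool may_swap;              /* models sc->may_swap */
    long nr_swap_pages;         /* models get_nr_swap_pages() */
    bool global_reclaim;        /* models global_reclaim(sc) */
    int swappiness;             /* models vmscan_swappiness(sc) */
    unsigned long zone_free;    /* models NR_FREE_PAGES */
    unsigned long zone_file;    /* models active + inactive file pages */
    unsigned long high_wmark;   /* models high_wmark_pages(zone) */
    bool inactive_file_is_low;  /* models inactive_file_is_low(lruvec) */
};

enum model_scan model_scan_balance(const struct reclaim_input *in)
{
    /* 1. No swap allowed or no swap space left: file lists only. */
    if (!in->may_swap || in->nr_swap_pages <= 0)
        return MODEL_SCAN_FILE;

    /* 2. Not global reclaim and swappiness is 0: file lists only. */
    if (!in->global_reclaim && in->swappiness == 0)
        return MODEL_SCAN_FILE;

    /* 3. Global reclaim with too few free + file pages: swap anyway. */
    if (in->global_reclaim &&
        in->zone_file + in->zone_free <= in->high_wmark)
        return MODEL_SCAN_ANON;

    /* 4. Plenty of inactive file pages: file lists only. */
    if (!in->inactive_file_is_low)
        return MODEL_SCAN_FILE;

    return MODEL_SCAN_PROPORTIONAL;
}
```

Note that check 3 never looks at swappiness, which is why a machine with swappiness = 0 can still swap once free and file-backed pages drop to the high watermark.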

When is an OOM kill triggered?

When system memory is exhausted and the swap partition is full, the kernel can no longer allocate free memory, so the OOM killer is started;

The kernel call path is out_of_memory() -> select_bad_process() -> oom_kill_process().

select_bad_process() selects the process to kill. It scans every process in the system and calls oom_badness() on each. The logic of oom_badness() is as follows:

1. Get the oom_score_adj of the process. If it equals OOM_SCORE_ADJ_MIN, that is, -1000, the process is not killed. This value is exposed in /proc/NNN/oom_score_adj and can be modified manually.

    adj = (long)p->signal->oom_score_adj;
    if (adj == OOM_SCORE_ADJ_MIN) {
        task_unlock(p);
        return 0;
    }

2. Calculate the score from the memory consumed by the process. If it is a root process, 3% is subtracted from the score so that it is less likely to be killed. Finally, points is returned.

    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
             atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
    task_unlock(p);

    if (has_capability_noaudit(p, CAP_SYS_ADMIN))
        points -= (points * 3) / 100;

select_bad_process() compares the values returned by oom_badness() for each process, picks the process with the highest score (when scores tie, it prefers a thread group leader), and returns it to out_of_memory(), which calls oom_kill_process() to send a SIGKILL signal and kill it.
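
The scoring and selection just described can be sketched in user space as follows. This is a simplified model with hypothetical names: it only reproduces the two steps excerpted above, and it leaves out the way the real kernel also folds oom_score_adj into the final score, because the excerpts do not show that part.

```c
#include <stddef.h>

#define MODEL_OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical per-process snapshot; not a kernel structure. */
struct proc_mem {
    int pid;
    long oom_score_adj;
    unsigned long rss_pages;        /* resident pages */
    unsigned long swap_pages;       /* swapped-out pages */
    unsigned long pagetable_pages;  /* page table pages */
    int has_sys_admin;              /* nonzero if CAP_SYS_ADMIN is held */
};

/* Simplified badness: 0 means "never kill". */
unsigned long model_badness(const struct proc_mem *p)
{
    unsigned long points;

    if (p->oom_score_adj == MODEL_OOM_SCORE_ADJ_MIN)
        return 0;

    points = p->rss_pages + p->swap_pages + p->pagetable_pages;

    /* Privileged processes get a 3% discount. */
    if (p->has_sys_admin)
        points -= (points * 3) / 100;

    return points;
}

/* Return the process with the highest nonzero score, or NULL if none. */
const struct proc_mem *model_select_victim(const struct proc_mem *procs,
                                           size_t n)
{
    const struct proc_mem *victim = NULL;
    unsigned long best = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long points = model_badness(&procs[i]);
        if (points > best) {
            best = points;
            victim = &procs[i];
        }
    }
    return victim;
}
```

Because the score is dominated by resident and swapped memory, a mysqld with a buffer pool of more than 100 GB is almost always the highest-scoring process on a dedicated database machine, which matches what happened in this case.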

Instead of killing a process, Linux can also be configured to panic directly when OOM occurs. This takes effect when vm.panic_on_oom = 1.
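
Step 1 of the oom_badness() logic notes that /proc/NNN/oom_score_adj can be modified manually. A minimal sketch of doing so from C is shown below. The helper names are hypothetical; the accepted range of -1000 to 1000 matches the kernel's limits, and lowering the value typically requires elevated privileges.

```c
#include <stdio.h>

#define ADJ_MIN (-1000)
#define ADJ_MAX 1000

/* Build "/proc/<pid>/oom_score_adj" in buf.
 * Returns 0 on success, -1 if the buffer is too small. */
int oom_score_adj_path(int pid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%d/oom_score_adj", pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write adj to the process's oom_score_adj file.
 * Returns 0 on success, -1 on a bad value or an I/O error. */
int set_oom_score_adj(int pid, int adj)
{
    char path[64];
    FILE *f;
    int ok;

    if (adj < ADJ_MIN || adj > ADJ_MAX)
        return -1;
    if (oom_score_adj_path(pid, path, sizeof path) != 0)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fprintf(f, "%d\n", adj) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Setting -1000 on mysqld would exempt it from the OOM killer, but the memory shortage itself remains and another process would be chosen instead, so this is a mitigation rather than a fix for an oversized buffer pool.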

Conclusion

We now have a general idea of what swappiness = 0 means and what causes an OOM kill. To avoid this behavior, leave enough memory for the OS, generally around 20% of the physical memory.
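
As a rough sizing aid, the 20% rule can be combined with the roughly 8% buffer pool overhead mentioned earlier. The function below is a hypothetical helper, not a MySQL or kernel API, and it assumes identical instances, a positive instance count, and no allowance for session buffers.

```c
/* Largest per-instance buffer pool (GB) that still leaves reserve_ratio
 * of physical memory to the OS, assuming each buffer pool costs an extra
 * overhead_ratio in management structures. Session buffers are ignored. */
double max_buffer_pool_gb(double physical_gb, double reserve_ratio,
                          double overhead_ratio, int instances)
{
    double usable_gb = physical_gb * (1.0 - reserve_ratio);
    return usable_gb / (instances * (1.0 + overhead_ratio));
}
```

For the machine in this case, max_buffer_pool_gb(256.0, 0.20, 0.08, 2) gives about 94.8 GB per instance, so the 90 GB setting chosen in the fix stays within that budget with a little room for session buffers.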
