MySQL redo deadlock troubleshooting and analysis


Problem Background

When I got to work on Monday, I first asked my colleagues about last week's test results and was told that MySQL Server hangs in the multi-instance scenario and testing could not continue. The stock (unmodified) version does not have this problem; it only appears in our new version, which made me uneasy. Fortunately, the environment in which the hang occurred was still available, which provided a good basis for troubleshooting, and so I plunged into an intense debugging session. The problem instance looked like this:

The symptoms were:

The problem appears at a concurrency of 384;
The MySQL server cannot execute transaction-related statements, and even simple SELECT statements fail;
All threads are waiting and cannot be killed.

On-Site Environment Collection

First, use the pstack tool to capture stack traces from the problem instance, so that the specific problem threads can be located.

Then use the pt-pmp tool to aggregate the collected stacks into hang.info; pt-pmp collapses identical stacks and reports how many threads share each one, which makes the analysis below much easier.

Problem Analysis

We can see from the stack that there are several types of threads:

User threads waiting to enter the InnoDB engine layer. innodb_thread_concurrency = 16 in the test environment, and when the number of active threads inside InnoDB exceeds this value, incoming threads have to queue, so a large number of queued threads appear. The influence of this parameter deserves an article of its own; for space reasons it is not expanded here, and interested readers can refer to the official documentation: 14.14 InnoDB Startup Options and System Variables;
Background threads that need to write redo logs during operations, including page cleaner threads and asynchronous io threads;
The purge thread reading pages and the master thread operating on the change buffer;
A large number of user threads that need to write redo logs.
From the above classification it is not hard to see that all threads that need to write redo logs are waiting for log_sys->mutex. So which thread currently holds this mutex that protects the redo log buffer? We can follow this clue and work through the following questions:

Question 1: Which thread holds log_sys->mutex?
Question 2: Why does the thread holding log_sys->mutex not continue to execute? Is it waiting for another lock, or for some other reason?
Question 3: Assuming it is not a hardware problem, what does the overall contention between these resources look like?

1. Question 1: from the surface to the core

When looking for the thread that holds log_sys->mutex, two points help us locate it quickly:

Because log_sys->mutex can only be held by one thread at a time, any stack in the pt-pmp output that is shared by more than one thread can be ruled out;
Since the thread holds log_sys->mutex, it should still be somewhere in the log-writing path, so we can focus on stacks that go through the log-writing logic, i.e. mtr_log_reserve_and_write or log_write_up_to.
Following these two ideas, I quickly located the relevant thread (Thread 446) in the pstack output.

Here is a brief overview of the MySQL redo log writing process (leaving out the undo & buffer pool parts). When data is modified, MySQL first records redo logs according to the operation type; the main flow is:

Record the state before the operation, generating different redo record types depending on the operation; the redo types are listed in src/storage/innobase/include/mtr0mtr.h;
Record the state after the operation; the content differs by record type, see the function recv_parse_or_apply_log_rec_body() for details;
Write the records into the redo log buffer and add the corresponding dirty pages to the flush list of the buffer pool;
At commit, decide whether to sync the redo log according to the value of innodb_flush_log_at_trx_commit (a rough sketch of this last step follows below).
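To make the commit-time decision concrete, here is a rough sketch loosely modeled on the 5.6 code path; the function name and exact arguments are simplifications for illustration, not the verbatim source:

    // Rough sketch: commit-time redo flush, governed by
    // innodb_flush_log_at_trx_commit. Simplified from the 5.6 logic.
    void flush_redo_at_commit(lsn_t commit_lsn, ulint flush_log_at_trx_commit)
    {
        switch (flush_log_at_trx_commit) {
        case 0:
            // Do nothing here: the log is written and synced roughly
            // once per second by a background thread.
            break;
        case 1:
            // Write the log buffer up to commit_lsn and fsync it.
            log_write_up_to(commit_lsn, LOG_WAIT_ONE_GROUP, TRUE);
            break;
        case 2:
            // Write up to commit_lsn to the OS cache, but do not fsync.
            log_write_up_to(commit_lsn, LOG_WAIT_ONE_GROUP, FALSE);
            break;
        }
    }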

The stack above shows that the thread hung after writing its redo records and while adding dirty pages to the flush list; in other words, after this thread acquired log_sys->mutex, it got stuck while acquiring log_sys->log_flush_order_mutex, and meanwhile a large number of other threads were waiting for it to release log_sys->mutex. Question 1 now has an answer. So what is log_sys->log_flush_order_mutex, and which thread holds it?

Note:

1. The MySQL buffer pool maintains a dirty-page list ordered by LSN (the flush list); this makes it easy, during checkpoint & log_free_check, to determine how far the checkpoint can be advanced and which dirty pages to flush;
2. When inserting pages into the flush list, a lock is needed to keep the list ordered by LSN. Using log_sys->mutex for this would aggravate contention on log_sys->mutex under high concurrency, so a separate mutex was introduced to protect the LSN ordering of dirty pages. The code is roughly described below:
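A minimal sketch of the hand-off between the two mutexes at the tail of a mini-transaction commit, paraphrased from the mtr0mtr.cc logic; the two helper names are placeholders for illustration, not the real function names:

    // Paraphrased tail of an mtr commit (5.6); the two helpers are placeholders.
    void mtr_commit_tail(mtr_t* mtr)
    {
        mutex_enter(&log_sys->mutex);

        lsn_t start_lsn = log_sys->lsn;
        copy_mtr_log_to_log_buffer(mtr);      // append this mtr's redo records
        lsn_t end_lsn = log_sys->lsn;

        log_flush_order_mutex_enter();        // <-- Thread 446 hangs here
        mutex_exit(&log_sys->mutex);          // other threads may now write redo

        // Dirty pages are appended to the buffer pool flush list in LSN
        // order, protected only by log_flush_order_mutex.
        add_dirty_pages_to_flush_list(mtr, start_lsn, end_lsn);

        log_flush_order_mutex_exit();         // must wake up any waiting thread
    }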


2. Question 2: making every effort

While troubleshooting Question 1 we identified the thread holding log_sys->mutex, which had hung while acquiring log_sys->log_flush_order_mutex. The thread stacks can therefore be divided into the following categories:

Thread 446: holds log_sys->mutex and waits for log_sys->log_flush_order_mutex in order to add dirty pages to the buffer pool flush list;
Threads that need log_sys->mutex to write or read log information;
An unknown thread that holds log_sys->log_flush_order_mutex and has hung while doing something else.


Therefore, the key to the problem is to find the thread that holds log_sys->log_flush_order_mutex.

The following operations are performed to find related threads:

Search the code for every location that acquires log_sys->log_flush_order_mutex;


Cross-checking these search results against the threads in the pstack output showed that essentially no thread was holding log_sys->log_flush_order_mutex; attaching gdb to the MySQL server and printing log_sys->log_flush_order_mutex gave {waiters = 1, lock_word = 0}!!! In other words, Thread 446 is waiting on a mutex that is actually free, and yet it really is registered as a waiter. Since ours is a release build there is little more information to be extracted, and with a debug build the problem is hard to reproduce. The definition of log_flush_order_mutex is roughly as follows:
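The mutex involved is InnoDB's own mutex type. Abridged and paraphrased from sync0sync.h and log0log.h in 5.6, keeping only the fields that matter here:

    // Abridged from storage/innobase/include/sync0sync.h (5.6).
    struct ib_mutex_t {
        os_event_t            event;      // event that waiters block on in the sync array
        volatile lock_word_t  lock_word;  // 1 = locked, 0 = free (set/cleared atomically)
        ulint                 waiters;    // 1 if there are (or may be) waiting threads
        // ... list node, debug and statistics fields omitted ...
    };

    // Abridged from storage/innobase/include/log0log.h (5.6).
    struct log_t {
        ib_mutex_t  mutex;                  // protects the redo log buffer
        ib_mutex_t  log_flush_order_mutex;  // protects LSN ordering of the flush list
        lsn_t       lsn;                    // current log sequence number
        // ...
    };

gdb showing {waiters = 1, lock_word = 0} therefore means: the mutex is free, yet Thread 446 is still registered as a waiter and was never woken up.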

From the above analysis, we can find the answer to question 2:

Only two threads are related to log_sys->log_flush_order_mutex: one is Thread 446, the other is the last thread that called log_flush_order_mutex_exit;
That thread failed to wake up Thread 446 when it released log_sys->log_flush_order_mutex, so Thread 446 hangs, the other threads can never acquire log_sys->mutex, and the instance becomes unavailable;
log_sys->log_flush_order_mutex is currently not held by any thread.

3. Question 3: Unexpected

From the analysis of Question 2 we know that log_sys->log_flush_order_mutex is not held by any thread. Then why is Thread 446 never woken up? Was a signal lost, or is it a bug in the program? If a signal were simply lost, why can the problem be reproduced so reliably? There was nothing similar in the official bug list, and searching the community turned up very little useful information. The analysis seemed to have reached a dead end, and the pressure quietly mounted... There seemed to be no way forward, yet every problem has a cause, and once the cause is found there is a solution... So I turned my attention back to the stack of Thread 446 and examined the functions on it:

From the analysis of Question 2 we can conclude that Thread 446 was not woken up during some thread's call to log_flush_order_mutex_exit, so we can follow this function to see how it wakes up other threads; when there is no other lead, walking through the code step by step sometimes pays off. Going deeper along the call chain, attention lands on mutex_exit_func, whose comments are worth noting:
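Abridged, and with its long comment paraphrased rather than quoted, the relevant part of mutex_exit_func() in storage/innobase/include/sync0sync.ic looks like this:

    // Abridged from mutex_exit_func() (sync0sync.ic, 5.6); comment paraphrased.
    void mutex_exit_func(ib_mutex_t* mutex)
    {
        mutex_reset_lock_word(mutex);    // lock_word = 0: the mutex is now free

        // The code assumes that resetting the lock word acts as a memory
        // barrier, i.e. that the read of 'waiters' below is ordered after it.
        // On a processor that reorders the read, a thread that registers
        // itself as a waiter just in between can be missed and left hanging.
        //
        // The safety net: srv_error_monitor_thread calls
        // sync_arr_wake_threads_if_sema_free() about once per second to wake
        // up any thread stuck waiting on a semaphore that is actually free.

        if (mutex_get_waiters(mutex) != 0) {
            mutex_signal_object(mutex);  // wake up the waiting threads
        }
    }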

Two pieces of information can be obtained from these comments:

Because of missing memory-barrier guarantees, the order in which mutex_get_waiters & mutex_reset_lock_word take effect may differ from the order in which they are called, and in that case a waiter can be left hanging;
A dedicated function, sync_arr_wake_threads_if_sema_free(), was written as a safety net for exactly this problem (sketched below).
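sync_arr_wake_threads_if_sema_free() is implemented in storage/innobase/sync/sync0arr.cc. Paraphrased, and with the global wait array reduced to a single placeholder variable, it does the following:

    // Paraphrased sketch of sync_arr_wake_threads_if_sema_free() (5.6);
    // 'wait_array' stands in for the global sync wait array(s).
    void sync_arr_wake_threads_if_sema_free()
    {
        for (ulint i = 0; i < wait_array->n_cells; i++) {
            sync_cell_t* cell = sync_array_get_nth_cell(wait_array, i);

            // A cell whose thread waits on a semaphore that is in fact free
            // (e.g. a mutex with lock_word == 0) can safely be woken up.
            if (cell->wait_object != NULL && sync_arr_cell_can_wake_up(cell)) {
                os_event_set(sync_cell_get_event(cell));  // wake the stuck waiter
            }
        }
    }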

So the comments tell us that it is not a lost signal as such, but that instruction reordering across threads (a memory-barrier issue) can make the calls take effect in a different order than they were issued. The problem is known, but since sync_arr_wake_threads_if_sema_free() exists precisely to mitigate it, why does the hang still happen? With this clue there was a little hope... Searching for sync_arr_wake_threads_if_sema_free shows that it is only called from srv_error_monitor_thread, a thread that monitors internal MySQL anomalies and prints error information; the notorious 600-second semaphore-wait suicide (abort) is also its work. This raises the questions:

The instance had been hung over the whole weekend; why was no anomaly detected and no abort triggered? And since sync_arr_wake_threads_if_sema_free can wake up such a stuck thread, why didn't it?

Following this idea, I checked the srv_error_monitor_thread stack in the pstack output and found that this thread had itself hung while acquiring log_sys->mutex, so it could never run sync_arr_wake_threads_if_sema_free() or its usual anomaly checks, which answers the questions above. A rough sketch of the monitor loop shows why:
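Paraphrased, the monitor loop in srv_error_monitor_thread() (storage/innobase/srv/srv0srv.cc, 5.6) looks roughly like this; check_long_semaphore_waits() is a placeholder for the real long-wait logic built around sync_array_print_long_waits():

    // Paraphrased sketch of srv_error_monitor_thread() (srv0srv.cc, 5.6).
    void srv_error_monitor_thread()
    {
        lsn_t old_lsn = srv_start_lsn;

        for (;;) {
            // Sanity check that the LSN never moves backwards.
            // log_get_lsn() acquires log_sys->mutex -- in this hang the
            // monitor thread blocks forever right here, so nothing below
            // this line ever runs.
            lsn_t new_lsn = log_get_lsn();
            if (new_lsn < old_lsn) {
                // print an error about the log sequence number decreasing
            }
            old_lsn = new_lsn;

            // Safety net for missed mutex wake-ups (never reached here).
            sync_arr_wake_threads_if_sema_free();

            // Long semaphore wait check that aborts the server after ~600 s
            // (never reached here either, hence no automatic abort).
            check_long_semaphore_waits();

            os_thread_sleep(1000000);   // run roughly once per second
        }
    }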



After the above analysis, the problem becomes clearer, and the process can be simply summarized:

Thread 446 holds log_sys->mutex but is never woken up while waiting for log_sys->log_flush_order_mutex;
Thread XXX hit the memory-barrier problem when releasing log_sys->log_flush_order_mutex, so Thread 446 was not woken up;
Thread 470 (srv_error_monitor_thread) hangs while acquiring log_sys->mutex and therefore cannot execute sync_arr_wake_threads_if_sema_free(), so the whole instance hangs;
Thread 470 needs the log_sys->mutex held by Thread 446, while Thread 446 needs Thread 470 to wake it up before it can release log_sys->mutex: a deadlock.

Combined with the state of log_sys->log_flush_order_mutex ({waiters = 1, lock_word = 0}), this is the complete hang cycle of the instance: a missed wake-up on log_flush_order_mutex leaves Thread 446 stuck while holding log_sys->mutex, and every other thread, including the monitor thread that could have rescued it, queues up behind that mutex.


For an introduction to memory barriers, see:

Memory barrier: http://name5566.com/4535.html

Problem Solving

Now that the cause of the problem is known, it can be solved without much trouble. There are two approaches:

Simply remove the log_get_lsn() check at this spot. It is a sanity check added by the developers to catch (and crash on) an abnormal LSN, and it is of limited use;
Keep the check but replace log_get_lsn() with log_peek_lsn(). The latter does a try-lock first and simply returns without performing the check when the lock cannot be taken immediately; this approach is more elegant (see the sketch below).

The patched version did not reproduce the problem during testing.
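A sketch of what log_peek_lsn() does, modeled on the description above rather than copied from the upstream patch:

    // Sketch of log_peek_lsn(), modeled on the fix described above.
    // Unlike log_get_lsn(), it never blocks on log_sys->mutex: if the mutex
    // cannot be taken immediately, it reports failure and the caller simply
    // skips the LSN sanity check for this round.
    bool log_peek_lsn(lsn_t* lsn)
    {
        if (mutex_enter_nowait(&log_sys->mutex) == 0) {   // 0 == acquired
            *lsn = log_sys->lsn;
            mutex_exit(&log_sys->mutex);
            return true;
        }

        return false;   // mutex busy: do not wait, just skip the check
    }

In srv_error_monitor_thread the check then becomes, roughly: if (log_peek_lsn(&new_lsn)) { compare new_lsn with old_lsn as before }.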

Problem Expansion

Although our problem was solved, the same issue exists in the official version, so why was there nothing in the bug list? Checking the latest upstream code showed that the problem has already been fixed there, and the fix is exactly the second method listed above, as its commit message makes clear.


Bug impact scope: MySQL 5.6.28 and earlier versions all have this problem.
