Troubleshooting and Resolving a MySQL Redo Log Deadlock

Source: Internet
Author: User

Problem background

On Monday, while catching up with colleagues on last week's testing, I was told that in a multi-instance scenario the MySQL server had hung and testing could not continue. The original version did not have this problem; only the new version did. I could not help but shiver a little, and at the same time feel curious. Fortunately, the environment where the hang occurred was still intact, which gave us a good basis for troubleshooting, and we immediately threw ourselves into an intense investigation. The symptoms were as follows:

The problem occurred at a concurrency of 384;
the MySQL server could not execute transaction-related statements, and even a simple SELECT would not run;
all threads were in a waiting state and could not be terminated with KILL.
Collecting information from the scene

First, we captured the stack of the problem instance with the pstack tool, so that individual threads could be examined and the problem thread located:


We then used the pt-pmp tool to aggregate the stack traces in hang.info, as follows:

Problem analysis

As the stacks show, the threads fall into several categories:

User threads waiting to enter the InnoDB engine layer: the test environment has innodb_thread_concurrency=16, and threads queue when the number of active threads inside InnoDB exceeds this value, hence the large number of queued threads. The effect and role of this parameter would make a good article on its own; for reasons of space it is not expanded here, and interested readers can refer to the official documentation: 14.14 InnoDB Startup Options and System Variables;
background threads that need to write redo log during their work, mainly the page cleaner threads and the asynchronous I/O threads;
purge threads reading pages, and the master thread operating on the change buffer;
a large number of user threads that need to write redo log.
From this classification it is easy to see that every thread that needs to write redo log is waiting for log_sys->mutex; in other words, some thread is holding the mutex that protects the redo log buffer and not releasing it. Following this clue, the investigation needs to answer three questions:

Question one: which thread holds log_sys->mutex?
Question two: why does the thread holding log_sys->mutex make no progress? Is it waiting for another lock, or is there some other reason?
Question three: assuming it is not a hardware problem, how did the whole resource contention unfold?

1. Question one: following the trail

When looking for the thread holding log_sys->mutex, two observations help us locate it quickly:

Since log_sys->mutex can be held by only one thread at a time, we can exclude any stack in the pt-pmp output that is shared by more than one thread;
since the thread holds log_sys->mutex, it should still be somewhere in the log-writing logic, so we can focus on stacks containing mtr_log_reserve_and_write or log_write_up_to.
Following this approach, the thread was quickly found in the pstack output:

Here we briefly describe how MySQL writes redo log (omitting the undo and buffer pool parts). When data is modified, MySQL first records a redo log entry whose type depends on the operation. The main steps are:

1. Record the data before the operation, generating different redo log entries for different operation types; the redo types can be found in src/storage/innobase/include/mtr0mtr.h.
2. Record the data after the operation; the content differs per type, see the function recv_parse_or_apply_log_rec_body().
3. Write the log into the redo log buffer, and add the dirty pages involved to the buffer pool's flush list.
4. Depending on the value of innodb_flush_log_at_trx_commit, decide whether to perform a sync at commit time.
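The write path described above can be sketched in simplified C++. This is a minimal model: the names mirror InnoDB's (log_sys->mutex, log_flush_order_mutex, the flush list), but all data structures and logic here are illustrative stand-ins, not the server's implementation.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical, simplified model of InnoDB's redo write path.
struct log_sys_t {
    std::mutex mutex;              // protects the redo log buffer
    std::mutex flush_order_mutex;  // protects flush-list LSN ordering
    uint64_t lsn = 0;
    std::vector<std::string> buf;  // stand-in for the redo log buffer
};

struct buf_pool_t {
    std::vector<uint64_t> flush_list;  // dirty pages, ordered by oldest LSN
};

// Sketch of a mini-transaction commit: copy the log records into the
// redo buffer, then add the dirty page to the flush list while holding
// flush_order_mutex so the list stays ordered by LSN.
uint64_t mtr_commit_sketch(log_sys_t& log, buf_pool_t& pool,
                           const std::string& rec) {
    log.mutex.lock();
    log.buf.push_back(rec);          // write redo into the log buffer
    uint64_t start_lsn = ++log.lsn;  // LSN assigned under log.mutex

    log.flush_order_mutex.lock();    // taken BEFORE log.mutex is released
    log.mutex.unlock();              // later writers may now enter
    pool.flush_list.push_back(start_lsn);  // appending preserves LSN order
    log.flush_order_mutex.unlock();
    return start_lsn;
}
```

Note the handover: flush_order_mutex is acquired before log.mutex is released, which is exactly the window in which the hang described below occurred.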

The stack above shows that the thread hung while adding dirty pages to the flush list after writing its redo log; that is, after acquiring log_sys->mutex it hung while trying to acquire log_sys->log_flush_order_mutex, and at that moment a large number of threads were waiting for it to release log_sys->mutex. Question one now has its answer. So what exactly is log_sys->log_flush_order_mutex, and which thread holds it?

Description

1. The MySQL buffer pool maintains a list of dirty pages ordered by LSN (the flush list), so that during checkpointing and log_free_check the position up to which the redo log can be advanced is quickly located;
2. adding a page to the flush list requires a lock to keep the list ordered by LSN, but using log_sys->mutex for this would cause heavy contention on log_sys->mutex under high concurrency and hurt performance, so a separate mutex was added just to protect the LSN ordering of the dirty pages, as shown in the code:


2. Question two: tracking down the holder

In question one we identified the thread holding log_sys->mutex, which hung while acquiring log_sys->log_flush_order_mutex, so the thread stacks can be divided into the following categories:

Thread 446, which holds log_sys->mutex and is waiting for log_sys->log_flush_order_mutex in order to add a dirty page to the buffer pool's flush list;
threads that need log_sys->mutex to write a log record or to read log information;
an unknown thread that holds log_sys->log_flush_order_mutex and hung while doing something else.


Therefore, the key to the problem is to find out which thread holds log_sys->log_flush_order_mutex.

To find that thread, we did the following:

Searched the code for every place that acquires log_sys->log_flush_order_mutex;


Combining the thread information in the pstack output with a careful read of the code found above, we saw that essentially no thread was holding log_sys->log_flush_order_mutex. We then attached gdb to the MySQL server and printed log_sys->log_flush_order_mutex, discovering {waiters = 1, lock_word = 0}! That is, thread 446 is waiting on an idle mutex that nobody actually holds. Because our build is a release build, much useful debugging information is unavailable, and the problem is hard to reproduce under a debug build. log_flush_order_mutex is defined as follows:

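What that gdb output means can be modeled with a tiny sketch. The struct below is a hypothetical stand-in for InnoDB's mutex layout, kept only to show the relationship between waiters and lock_word; the real definition differs in detail.

```cpp
#include <atomic>

// Hypothetical, simplified layout of an InnoDB-style mutex, to show
// what the gdb output {waiters = 1, lock_word = 0} means.
struct ib_mutex_sketch {
    std::atomic<unsigned> lock_word{0};  // 0 = free, nonzero = held
    std::atomic<unsigned> waiters{0};    // 1 = at least one thread sleeps on it
};

// A waiter that fails the fast path sets waiters = 1 and blocks,
// expecting the holder to check waiters on exit and wake it.
bool mutex_state_is_lost_wakeup(const ib_mutex_sketch& m) {
    // waiters == 1 with lock_word == 0: a thread sleeps on a free
    // mutex, i.e. the wakeup on release was lost -- what gdb showed.
    return m.waiters.load() == 1 && m.lock_word.load() == 0;
}
```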
From the above analysis, the answers to question two can be drawn:

Only two threads are related to log_sys->log_flush_order_mutex: one is thread 446, the other is the thread that last called log_flush_order_mutex_exit();
that thread failed to wake thread 446 while releasing log_sys->log_flush_order_mutex, leaving thread 446 hung and preventing the other threads from acquiring log_sys->mutex, which makes the instance unavailable;
log_sys->log_flush_order_mutex is not held by any thread.

3. Question three: the full picture

The analysis of question two shows that log_sys->log_flush_order_mutex is not held by any thread, so why was thread 446 never woken up? A lost signal, or a bug in the code? And if a lost signal, why is the problem stably reproducible? There was no similar bug in the official bug list, and a search of the community turned up very little useful information. At this point the analysis seemed to hit a dead end, and the pressure quietly mounted. There seemed to be no way forward, but every problem has a cause, and once the cause is found there is a solution. We turned our attention back to the stack of thread 446 and looked at the function again:

From the analysis of question two we concluded that some thread failed to wake thread 446 when it exited through log_flush_order_mutex_exit(), so we followed this function to see how it wakes other threads. With no better lead, the only option was to walk through the code step by step and hope for some insight. As the function calls deepened, we reached mutex_exit_func(), where a comment caught my attention:

Two points can be taken from that comment:

Because of memory-barrier effects, the calls to mutex_get_waiters() and mutex_reset_lock_word() may execute in the opposite order from the program order, which can cause a hang;
a dedicated function, sync_arr_wake_threads_if_sema_free(), was written specifically to work around this problem.
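The ordering requirement behind the first point can be sketched as follows. This is hypothetical code, not InnoDB's: the only point is that the holder must reset lock_word, issue a full barrier, and only then read waiters; if the CPU reorders the waiters read before the reset, a waiter that registers itself in between is never woken.

```cpp
#include <atomic>

// Hypothetical two-word mutex state, as in the sketch above.
struct mutex_words {
    std::atomic<unsigned> lock_word{1};  // 1 = held by the releasing thread
    std::atomic<unsigned> waiters{0};    // set by a thread before it blocks
};

// Release path: returns true when the releasing thread must signal
// waiters. The fence is the barrier the InnoDB comment talks about;
// without it, the waiters read could be reordered before the reset.
bool mutex_exit_sketch(mutex_words& m) {
    m.lock_word.store(0, std::memory_order_relaxed);      // mutex_reset_lock_word()
    std::atomic_thread_fence(std::memory_order_seq_cst);  // required full barrier
    return m.waiters.load(std::memory_order_relaxed) != 0;  // mutex_get_waiters()
}
```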

So it is not signal loss: because of missing memory barriers between threads, instructions can execute out of order, and this is an acknowledged possibility. But since sync_arr_wake_threads_if_sema_free() exists precisely to repair it, why did the hang still happen? With this clue, I felt we were finally onto something. A search showed that sync_arr_wake_threads_if_sema_free() is called only from srv_error_monitor_thread, the thread dedicated to monitoring MySQL's internal anomalies and printing error information; the infamous 600-second suicide (the long-semaphore-wait abort) is also its doing. So the questions became:

The machine had been hung since the weekend; why did the monitor thread not detect the anomaly and abort? And since sync_arr_wake_threads_if_sema_free() can wake lost waiters, why did it not wake thread 446?

Following this line of thought, we looked at the stack of srv_error_monitor_thread in the pstack output and found that this thread was itself hung acquiring log_sys->mutex, and therefore could neither execute sync_arr_wake_threads_if_sema_free() nor perform its normal anomaly checks. This answers both questions above; the detailed stack is as follows:



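The safety net that could not run can be sketched as follows: a minimal, hypothetical model of what sync_arr_wake_threads_if_sema_free() does, walking the sync wait array and waking any thread that is sleeping on a semaphore that is in fact free. The cell structure here is an illustrative stand-in for InnoDB's sync array cells.

```cpp
#include <vector>

// Hypothetical stand-in for an InnoDB sync-array cell.
struct sync_cell {
    bool in_use = false;     // a thread is registered in this cell
    unsigned lock_word = 0;  // state of the mutex the thread waits on
    bool woken = false;      // whether we have signaled the thread
};

// Counterpart of sync_arr_wake_threads_if_sema_free(): wake every
// waiter whose semaphore is free (lock_word == 0), repairing a lost
// wakeup. Returns the number of threads woken.
int wake_threads_if_sema_free_sketch(std::vector<sync_cell>& arr) {
    int woken = 0;
    for (sync_cell& c : arr) {
        if (c.in_use && c.lock_word == 0 && !c.woken) {
            c.woken = true;  // in InnoDB this signals the thread's event
            ++woken;
        }
    }
    return woken;
}
```

In the hang described here this repair never runs, because its caller, srv_error_monitor_thread, is itself blocked on log_sys->mutex.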
With the analysis above, the picture becomes clear, and the process can be summed up as follows:

Thread 446 acquired log_sys->mutex but was never woken up while waiting for log_sys->log_flush_order_mutex;
thread XXX hit the memory-barrier problem while releasing log_sys->log_flush_order_mutex, so thread 446 was not woken;
thread 470 (srv_error_monitor_thread) hung while acquiring log_sys->mutex and therefore could not execute sync_arr_wake_threads_if_sema_free(), which leaves the whole instance hung: thread 470 needs the log_sys->mutex held by thread 446, while thread 446 needs thread 470 to wake it up before it can release log_sys->mutex.
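The dependency just described forms a cycle in the wait-for graph, which is why no thread can ever make progress. As a small illustration (the thread names come from the analysis above; the graph code itself is hypothetical):

```cpp
#include <map>
#include <set>
#include <string>

// waiter -> the thread it is blocked on (one edge per waiter).
using WaitForGraph = std::map<std::string, std::string>;

// Follow wait-for edges from `node`; a revisited node means a cycle,
// i.e. a deadlock in which no participant can ever run again.
bool has_cycle_from(const WaitForGraph& g, std::string node) {
    std::set<std::string> seen;
    while (g.count(node)) {
        if (!seen.insert(node).second) return true;  // revisited: cycle
        node = g.at(node);
    }
    return false;  // the chain ends at a runnable thread
}
```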

Combining this with the state of log_sys->log_flush_order_mutex, the whole process of the instance hang is as follows:


For an introduction to memory barriers, see:

Memory barrier: http://name5566.com/4535.html

Problem solving

Now that the cause of the problem is known, it can be resolved. There are two approaches:

Simply remove the log_get_lsn() check here; it was added by a developer as extra diagnostic information to locate anomalous LSN writes, and since it crashes the server when triggered, it is of limited use anyway;
keep the check but replace log_get_lsn() with log_peek_lsn(): the latter first does a try-lock and, when the lock cannot be taken, returns immediately without performing the check. This approach is more elegant.
The modified version did not reproduce the problem during testing.
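The second fix can be sketched as follows. This is hypothetical code illustrating only the try-lock idea, not the actual patch: a log_peek_lsn()-style helper declines to run the diagnostic check rather than block on log_sys->mutex.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model: an atomic flag stands in for log_sys->mutex.
struct log_sys_sketch {
    std::atomic<bool> locked{false};
    uint64_t lsn = 0;
};

// Returns true and fills *out only if the lock could be taken without
// blocking; a caller doing an optional sanity check simply skips the
// check when this returns false, instead of waiting on the mutex.
bool log_peek_lsn_sketch(log_sys_sketch& log, uint64_t* out) {
    bool expected = false;
    if (!log.locked.compare_exchange_strong(expected, true)) {
        return false;  // mutex busy: skip the check instead of blocking
    }
    *out = log.lsn;
    log.locked.store(false);
    return true;
}
```

The design point is that the check becomes best-effort: it can never introduce a lock dependency, so the monitor-style caller can never be drawn into the deadlock.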

Problem extension

Although the problem is solved, the official version certainly has the same issue, so why was there nothing in the bug list? After reviewing the latest upstream code, we found that the problem has already been fixed there, using the second method listed above; the commit message is as follows:


Bug impact range: MySQL 5.6.28 and earlier versions are affected.
