Fixed an ext4 bug. A kernel bug showed up in production: the whole machine kernel panicked while MySQL was running "reset master". The DBAs were a great help and found reproduction steps: there was a landmine of a file -- just open it and call fdatasync on it, and the kernel panics. The panicking code is the J_ASSERT(journal->j_running_transaction != NULL) in jbd2_journal_commit_transaction. But why would jbd2 go and commit a transaction when there is no running_transaction at all?

So I added a trace to everything that wakes up the kjournald2 kernel thread (the thread in which jbd2_journal_commit_transaction is called), i.e. to the callers of wake_up(&journal->j_wait_commit), and soon found the cause. When you fdatasync a freshly opened file, ext4_sync_file is called, which kicks off a jbd2 journal commit: jbd2_log_start_commit takes the journal lock and then calls __jbd2_log_start_commit, whose code is:

int __jbd2_log_start_commit(journal_t *journal, tid_t target)
{
    /*
     * Are we already doing a recent enough commit?
     */
    if (!tid_geq(journal->j_commit_request, target)) {
        /*
         * We want a new commit: OK, mark the request and wakeup the
         * commit thread.  We do _not_ do the commit ourselves.
         */
        journal->j_commit_request = target;
        jbd_debug(1, "JBD: requesting commit %d/%d\n",
                  journal->j_commit_request,
                  journal->j_commit_sequence);
        wake_up(&journal->j_wait_commit);
        return 1;
    }
    return 0;
}

From the trace, journal->j_commit_request was 2177452108 and target was 0. j_commit_request is obviously bigger than target, so the if branch should not be taken -- and yet it was, because tid_geq is implemented like this:

static inline int tid_geq(tid_t x, tid_t y)
{
    int difference = (x - y);
    return (difference >= 0);
}

Unsigned int 2177452108 minus 0, converted to int -- guess what the result is? -2117515188! The implementation of tid_geq looked odd, so I read the surrounding comments: jbd2 gives every transaction a tid, which increases monotonically and is an unsigned int, so it wraps around easily. That is exactly why tid_geq is written this way -- after wraparound, a small tid such as 0 counts as "later" than 2177452108. So with commit_request at 2177452108 and target at 0, the code concludes that tid 2177452108 has already been committed and that tid 0, being "later", still needs to be committed, so it wakes up kjournald2 (that wake_up call). kjournald2 wakes up, finds there is no running_transaction, and there is the tragedy.
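To double-check the arithmetic, here is a small user-space sketch (my own test harness, not kernel code -- the typedef and main() are only there to make it compile on its own) that copies the tid_geq definition and feeds it the two values seen in the trace:

#include <stdio.h>

typedef unsigned int tid_t;      /* jbd2 tids are unsigned ints */

static inline int tid_geq(tid_t x, tid_t y)
{
    int difference = (x - y);
    return (difference >= 0);
}

int main(void)
{
    tid_t commit_request = 2177452108u;  /* value seen in the trace */
    tid_t target = 0;                    /* the bogus target */

    /* 2177452108 does not fit in a signed int, so on the usual
     * two's-complement machines the difference comes out negative */
    printf("(int)(x - y) = %d\n", (int)(commit_request - target));
    printf("tid_geq(%u, %u) = %d\n", commit_request, target,
           tid_geq(commit_request, target));
    return 0;
}

It prints -2117515188 and 0; in other words, __jbd2_log_start_commit decides that tid 0 has not been committed yet and goes ahead and wakes kjournald2.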
From the trace, most of the target values passed into __jbd2_log_start_commit were non-zero, so this 0 had to be coming from somewhere unusual. I went through the upstream code and found a patch from Ted (Theodore Ts'o) from this March:

commit ...
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Wed Mar 16 17:16:31 2011 -0400

    ext4: Initialize fsync transaction ids in ext4_new_inode()

    When allocating a new inode, we need to make sure i_sync_tid and
    i_datasync_tid are initialized.  Otherwise, one or both of these two
    values could be left initialized to zero, which could potentially
    result in BUG_ON in jbd2_journal_commit_transaction.

    (This could happen by having journal->commit_request getting set to
    zero, which could wake up the kjournald process even though there is
    no running transaction, which then causes a BUG_ON via the
    J_ASSERT(j_running_transaction != NULL) statement.

    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 2fd3b0e..a679a48 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1054,6 +1054,11 @@ got:
         }
     }
 
+    if (ext4_handle_valid(handle)) {
+        ei->i_sync_tid = handle->h_transaction->t_tid;
+        ei->i_datasync_tid = handle->h_transaction->t_tid;
+    }
+
     err = ext4_mark_inode_dirty(handle, inode);
     if (err) {
         ext4_std_error(sb, err);

Aha, that's it. Because i_sync_tid and i_datasync_tid were never assigned, the default value 0 was carried all the way down into ext4_sync_file, and __jbd2_log_start_commit then mistook 0 for a new transaction that still needed committing (when in fact the current transaction had not even been attached as running_transaction yet), and everything went wrong from there. I applied the patch, ran the reproduction steps again, and the kernel no longer panics.

Since it is that easy to reproduce, why had no other machine run into it? Because commit_request must first grow to a very large value, one that turns negative when converted to int. I tried creating an empty file and fdatasync-ing it over and over on ext4 (a minimal version of that loop is sketched at the end of this post): commit_request advanced by roughly 1 million every 10 minutes, so even at that rate it would take at least 14 days to reach 2 billion. The I/O pressure online is nowhere near a manual stress test, which is why the bug only fired a few months in, once commit_request had climbed past 2 billion. Red Hat's latest 2.6.32-220 kernel has this problem, so watch out.

Thanks to @yuanyun and @Xiyu for providing the reproduction steps. The hardest part of fixing a kernel bug is reproducing it, and the two of them handed me the steps directly -- how thoughtful!

======

I had wanted to use ksplice to patch the kernel without a reboot, so the DBAs could pick up this fix without restarting their machines. But after digging into ksplice, I found it requires the kernel to be built with the gcc options -ffunction-sections and -fdata-sections, and those two options conflict with -pg, which our kernel tracing depends on. So... for now there is no solution, and no way to have ksplice upgrade the kernel online for us.
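For reference, here is a rough sketch of the create-and-fdatasync loop mentioned above, which I used to estimate how fast the jbd2 tid counter climbs. It is my own reconstruction rather than the exact program we ran; the file name and the progress interval are arbitrary. (On an unpatched kernel, this same new-inode-plus-fdatasync pattern is presumably also what steps on the landmine once commit_request has crossed 2^31.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *path = "tidbump.tmp";   /* any file on an ext4 mount */
    long i;

    for (i = 0; ; i++) {
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            exit(1);
        }
        /* fdatasync on the freshly created inode drives journal commits,
         * pushing journal->j_commit_request forward */
        if (fdatasync(fd) < 0)
            perror("fdatasync");
        close(fd);
        unlink(path);                   /* next round allocates a new inode */

        if (i % 100000 == 0)
            printf("%ld iterations\n", i);
    }
    return 0;
}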