xtrabackup備份陷阱引起的一次嚴重業務故障

最後更新：2015-08-27 來源：互聯網

上載者：User

創建阿里雲帳戶，並獲得超過 40 款產品的免費試用版；而企業帳戶則可以享有總值 $1200 的免費試用版。立即註冊！

標籤：mysql xtrabackup

一、情境：早上7點被電話叫醒，說夜間有大量業務訂單不成功，web server上報了很多sqlException，懷疑跟新部署的xtrabackup備份指令碼有關。生產系統內容為redHat5.8，mysql版本為5.6，xtrabackup版本為2.2.10

二、分析過程：

1、上db一看，備份時間確實很異常，1點鐘的crontab，運行到6:55才完成：

innobackupex: Backup created in directory ‘/backup/inc_1‘
150825 06:55:57 innobackupex: Connection to database server closed
150825 06:55:57 innobackupex: completed OK!

2、之前測試的時候，與生產同樣規模的資料庫（30G)，全備半小時內可以完成，增量備份5分鐘左右完成，為何這次備份耗了接近6個小時？難道是鎖表？不應該啊，網上所有的資訊都說xtrabackup備份過程是絕對不會鎖innodb表的，只會鎖MyISAM表，而業務表全是innodb，問題在哪裡呢？繼續往前看備份日誌：

150825 01:04:09 innobackupex: Continuing after ibbackup has suspended
150825 01:04:09 innobackupex: Executing FLUSH TABLES WITH READ LOCK...
>> log scanned up to (547712171257)
>> log scanned up to (547712171257)
>> log scanned up to (547712171257)

。。。。。。

一堆log scanned up to

150825 06:55:44 innobackupex: All tables locked and flushed to disk
150825 06:55:44 innobackupex: Starting to backup non-InnoDB tables and files

。。。。。。

150825 06:55:57 innobackupex: All tables unlocked

很明顯，xtrabackup在執行"FLUSH TABLES WITH READ LOCK"的時候被阻塞了，一直阻塞到06:55才執行成功。之前沒太理解"FLUSH TABLES WITH READ LOCK"的威力，這個東西的執行，會遇到兩種情況：

1）、很快執行成功，這樣子後續的備份能較快完成。問題是"FLUSH TABLES WITH READ LOCK"到“All tables unlocked（unlock tables）"期間，肯定是會鎖表的，這裡不管你是innodb表還是MyISAM表。從上面的日誌可以看出，鎖了13秒，這13秒，所有的DML語句都會被阻塞。

2）、無法快速執行成功，這樣災難就來了，也就是本次故障發生的根源。什麼會阻塞"FLUSH TABLES WITH READ LOCK"的執行呢？可以做下實驗確認下

a、運行前有長查詢沒有執行完

650) this.width=650;" src="http://s3.51cto.com/wyfs02/M00/72/26/wKioL1XdyMmS1OdOAAJwS7lsKBU012.jpg" title="lock1.jpg" alt="wKioL1XdyMmS1OdOAAJwS7lsKBU012.jpg" />

這裡可以看到，一個長查詢就把"FLUSH TABLES WITH READ LOCK"給阻塞了。

b、運行前有任何錶被顯式加了鎖

對任何一個表，比如t1，執行lock table t1 read，然後再在其它會話執行"FLUSH TABLES WITH READ LOCK"，阻塞。如果對t1加寫鎖lock table t1 write，更會阻塞"FLUSH TABLES WITH READ LOCK"。

不管什麼原因，"FLUSH TABLES WITH READ LOCK"被阻塞後，結果是很悲哀的。不僅所有庫的DML語句會被阻塞，而且長查詢或者顯式加鎖的資料庫，查詢都會有問題。

三、DB日誌分析

可以從慢查詢日誌來驗證一下前面的分析：

1、訂單

# Time: 150825 6:55:57
# [email protected]: root[root] @ [172.17.5.15] Id: 9922648
# Query_time: 1534.181557 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SET timestamp=1440456957;
INSERT INTO `globallink_g`.`g_order` (`id`, `userid`, `create_date`, `modify_date`, `amount_paid`, `order_status`) VALUES (234977, ‘test‘, ‘2015-08-25 06:36:39‘, ‘2015-08-25 06:36:42‘, 232, 1);

這麼一條簡單的sql，執行了近半個小時，而且在6:55:57才執行完成，這正好是備份完成（解鎖的時刻）

2、報表

# Time: 150825 6:55:58
# [email protected]: root[root] @ [192.168.1.200] Id: 9921401
# Query_time: 4173.803182 Lock_time: 0.000081 Rows_sent: 1 Rows_examined: 9742
SET timestamp=1440456958;
call g_getActiveImsi(‘201508250551‘,‘10008‘);

這個也是被阻塞了1個多小時，在備份完成後才執行完畢。

四、總結

1、該生產系統的資料庫結構描述是傳統的HA模式，即資料目錄在共用磁陣，cluster軟體監控mysql進程，不是流行的mysql master-slave架構，如果是master-slave架構，可以直接備份從庫，這樣子不會對主庫有任何影響。

2、"xtrabackup全程備份不鎖innodb表，只鎖MyISAM表"這種說法有問題。經過測試，在innobackupex命令裡面加個--no-lock參數，就真的不會鎖表了，但使用這個參數有幾個前提。

官方解釋如下：

Use this option to disable table lock with FLUSH TABLES WITH READ LOCK. Use it only if ALL your tables are InnoDB and you DO NOT CARE about the binary log position of the backup. This option shouldn’ t be used if there are any DDL statements being executed or if any updates are happening on non-InnoDB tables (this includes the system MyISAM tables in the mysql database), otherwise it could lead to an inconsistent backup. If you are considering to use --no-lock because your backups are failing to acquire the lock,this could be because of incoming replication events preventing the lock from succeeding. Please try using
--safe-slave-backup to momentarily stop the replication slave thread, this may help the backup to succeed and you then don’t need to resort to using this option. xtrabackup_binlog_info is not created when –no-lock option is used (because SHOW MASTER STATUS may be inconsistent), but under certain conditions xtrabackup_binlog_pos_innodb can be used instead to get consistent binlog coordinates as described in Working with Binary Logs.

3、這個問題在業務不繁忙的系統上，可能還不容易顯現。半夜1點，鎖表鎖個10幾秒鐘，問題不大。

如果在業務繁忙的系統，或者備份剛好和某個繁重的task(日結月結之類的）在一起運行，那麼xtrabackup很有可能引起業務表長時間的鎖定導致業務故障。

本文出自 “記憶片段” 部落格，謝絕轉載！

xtrabackup備份陷阱引起的一次嚴重業務故障

本文章原先以中文撰寫並發佈於 aliyun.com，亦設英文版本，僅作資訊用途。本網站不對文章的準確性，完整性或可靠性或其任何翻譯作出任何明示或暗示的陳述或保證。如對該文章有任何疑慮或投訴，請傳送電郵至 info-contact@alibabacloud.com 並提供相關疑慮或投訴的詳細說明。職員會於 5 個工作天內與您聯絡，一經驗證之後，即會刪除該侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

xtrabackup備份陷阱引起的一次嚴重業務故障

聯繫我們

熱門內容

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

xtrabackup備份陷阱引起的一次嚴重業務故障

聯繫我們

熱門內容

熱門主題

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support