公測以來的兩次全服維護,都遇到了個別伺服器停服時間異常長的情況,具體表現為停服流程阻塞在MySQL存檔,待儲存資料一直不減少,持續大概一個小時到兩個小時之後繼續儲存,成功停服,並SQLError報錯“Lost connection to MySQL server during query”。
兩個修改方案,一個是按照官網建議增大net_read_timeout逾時時間,另一個是縮短odbc.ini的Readtimeout和Writetimeout逾時時間。前者的意義在於閥值提高不要觸發MySQL的報錯中斷連線,後者的意義在於報錯中斷連線之後能儘快恢複重連繼續儲存。前者需要重啟MySQL,維護時存在風險,我們決定先採用第二種方案。
一 問題描述
11月4日伺服器停服維護,2023伺服器存檔阻塞。
11月11日伺服器停服維護,2028,2096,10018伺服器存檔阻塞。
具體資料如下:
伺服器id |
報錯時間 |
報錯時DBServiceExec慢心跳最長時間 |
2028 |
2014-11-11_08:28:01 |
6013009ms (100min) |
2096 |
2014-11-11_07:06:09 |
6000006ms(100min) |
10018 |
2014-11-11_08:42:32 |
6601007ms(110min) |
2023 |
2014-11-04_05:59:47 |
2950005ms(49min) |
以上伺服器異常表現類似,DBserviceExec慢心跳時間超長,出現下面錯誤記錄檔後,儲存恢複正常,伺服器正常停服。
[unixODBC][MySQL][ODBC 3.51 Driver][mysqld-5.5.20-rel24.1-log]Lost connection to MySQL server during query
根據官網資料,這個問題一般出現在進行儲存大量資料時,停服存檔符合這個情況。
詳細內容見標紅部分。
B.5.2.3 Lost connection to MySQL server
There are three likely causes for this error message.
Usually it indicates network connectivity trouble and you should check the condition of your network if this error occurs frequently. If the error message includes “during query,” this is probably the case you are experiencing.
Sometimes the “during query” form happens when millions of rows are being sent as part of one or more queries. If you know that this is happening, you should try increasing net_read_timeout from its default of 30 seconds to 60 seconds or longer, sufficient for the data transfer to complete.
More rarely, it can happen when the client is attempting the initial connection to the server. In this case, if yourconnect_timeout value is set to only a few seconds, you may be able to resolve the problem by increasing it to ten seconds, perhaps more if you have a very long distance or slow connection. You can determine whether you are experiencing this more uncommon cause by using SHOW GLOBAL STATUS LIKE ‘Aborted_connects‘. It will increase by one for each initial connection attempt that the server aborts. You may see “reading authorization packet” as part of the error message; if so, that also suggests that this is the solution that you need.
If the cause is none of those just described, you may be experiencing a problem with BLOB values that are larger than max_allowed_packet, which can cause this error with some clients. Sometime you may see anER_NET_PACKET_TOO_LARGE error, and that confirms that you need to increase max_allowed_packet.
二 解決方案
1 官網給出修改net_read_timeout 配置項。此修改需要重啟Mysql。
2 錯誤記錄檔是Mysql的Client端在Server端無響應逾時後報出,可以通過修改Client端的逾時配置項,減少響應等待時間。此修改需要修改odbc.ini設定檔。
Readtimeout=N,N單位為秒,實際等待時間為3*N 秒。
Writetimeout=N,N單位為秒,實際等待時間為net_retry_count* N秒,Mysql預設配置net_retry_count為10,TCP/IP預設逾時時間為10min,所以,可以看出最大逾時時間為100min。和停服異常阻塞時間基本吻合。
根據11月10日2000服日誌,總結向資料庫寫入和讀取資料時的時間,建議配置項數值如下:
Readtimeout = 300,單位為秒,實際等待時間為900 秒,因每次串連處理為毫秒級,所以主要作用是減少等待逾時時間。
Writetimeout = 180,單位為秒,實際等待時間為1800秒,因每次串連處理為毫秒級,所以主要作用是減少等待逾時時間。
兩個配置項具體資訊見下表:
readtimeout |
The timeout in seconds for attempts to read from the server. Each attempt uses this timeout value and there are retries if necessary, so the total effective timeout value is three times the option value. You can set the value so that a lost connection can be detected earlier than the TCP/IP Close_Wait_Timeout value of 10 minutes. This option works only for TCP/IP connections, and only for Windows prior to MySQL 5.1.12. Corresponds to the MYSQL_OPT_READ_TIMEOUToption of the MySQL Client Library. Added in 3.51.27. |
writetimeout |
The timeout in seconds for attempts to write to the server. Each attempt uses this timeout value and there are net_retry_countretries if necessary, so the total effective timeout value isnet_retry_count times the option value. This option works only for TCP/IP connections, and only for Windows prior to MySQL 5.1.12. Corresponds to the MYSQL_OPT_WRITE_TIMEOUT option of the MySQL Client Library. Added in 3.51.27. |
修改readtimeout 和 writetimeout主要是配置socket讀寫逾時,配置項原理如下:
+void vio_timeout(Vio *vio, timeout_type which, uint milliseconds)
{
#ifdef __WIN__
- ulong wait_timeout= (ulong) timeout * 1000;
- (void) setsockopt(vio->sd, SOL_SOCKET,
- which ? SO_SNDTIMEO : SO_RCVTIMEO, (char*) &wait_timeout,
- sizeof(wait_timeout));
+ ulong wait_timeout;
+#else
+ struct timeval tv;
+#endif /* __WIN__ */
+ int optname;
+ switch (which)
+ {
+ case VIO_READ_TIMEOUT:
+ optname = SO_RCVTIMEO;
+ break;
+ case VIO_WRITE_TIMEOUT:
+ optname = SO_SNDTIMEO;
+ break;
+ default:
+ return;
+ }
+#ifdef __WIN__
+ wait_timeout= (ulong) milliseconds;
+ setsockopt(vio->sd, SOL_SOCKET, optname, (char*) &wait_timeout,
+ sizeof(wait_timeout));
+#else
+ tv.tv_sec = milliseconds / 1000;
+ tv.tv_usec = milliseconds % 1000 * 1000;
+ setsockopt(vio->sd, SOL_SOCKET, optname, &tv, sizeof(tv));
#endif /* __WIN__ */
}