在同一個叢集中,兩個從節點 db117,db118 都各出現了1次 “serverStatus was very slow”
db117出現 very slow:
2017-09-14T18:55:49.599+0800 [conn13113916] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 1214, after repl: 1214, at end: 1214 }
前後沒有出現大的查詢(>10秒以上)。其它效能指標也正常。
只是前面看到空間增加了30個 串連。
2017-09-14T18:55:48.157+0800 [initandlisten] connection accepted from 10.116.207.64:57026 #13113915 (264 connections now open)
...................
2017-09-14T18:55:49.551+0800 [initandlisten] connection accepted from 10.174.177.10:64840 #13113946 (295 connections now open)
2017-09-14T18:55:49.599+0800 [conn13113916] serverStatus was very slow:
應該是突然查詢數量上來,雖然每個查詢的時間不長,但IO,CPU 有個波動的原因。
(但在監控資料中也沒看到有突增)
此現象在DB118 中昨天也出現一次,但不是在同時。
2017-09-14T14:12:16.102+0800 [initandlisten] connection accepted from 10.116.207.112:10800 #13301443 (286 connections now open)
..............................................................
2017-09-14T14:12:19.780+0800 [initandlisten] connection accepted from 10.168.168.212:51264 #13301552 (395 connections now open)
2017-09-14T14:12:19.819+0800 [conn13300638] query Mallcoo.MallCard query: { UID: 17433632, Status: 1, MallID: 273 } planSummary: IXSCAN { UID: 1.0, MallID: 1.0 } ntoreturn:1 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:4079097 nreturned:1 reslen:99 4079ms
2017-09-14T14:12:19.844+0800 [conn13301520] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 1516, after repl: 1516, at end: 1526 }
2017-09-14T14:12:19.844+0800 [conn13301520] command admin.$cmd command: serverStatus { serverStatus: 1.0 } keyUpdates:0 numYields:0 locks(micros) r:83 reslen:7456 1557ms
DB118 14:12 時間,IO有一個高峰點,從 2%-->12%。而db117 卻沒有,問題原因比較難找。
從兩次的slow 日誌來看, 2017-09-14T14:12:19.844+0800 [conn13301520] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 1516, after repl: 1516, at end: 1526 }
慢的原因都在:after recordStats: 1516 ,
也就是說有查詢,而記憶體裡沒資料,出現了大量的 page faults.
再就是IO ,CPU 有個暫時的繁忙狀態。
這兩個從節點記憶體現在為64G,雖然不是很高。但暫時因為業務相對增長比較慢,不會增加記憶體。
如果有大量的此類預警日誌出現,現場可用 mongostat 查看 faults 狀態值情況,如果一直都高於 5以上。可以認為是記憶體不足了。