自從把我的一個網站搬家到linode 1G 的vps上後,總感覺有些不太正常。首先dnspod監控上顯示網站有時能訪問,有時不行。然後wp supercache外掛程式在執行預緩衝任務時經常失敗後重啟,郵件內容如“[http://www.111cn.net] 預緩衝可能已失去響應。預緩衝已經重新啟動。”。還有一個情況是linode有時會發來磁碟io使用高的警示郵件。
今天抽空檢查了下伺服器,發現一個現象是mysql進程頻繁無故重啟。
150424 17:41:14 [Note] Event Scheduler: Loaded 0 events
150424 17:41:14 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.5.37' socket: '/tmp/mysql.sock' port: 3306 Source distribution
150424 19:27:16 mysqld_safe Number of processes running now: 0
150424 19:27:16 mysqld_safe mysqld restarted
150424 19:27:24 [Note] Plugin 'InnoDB' is disabled.
150424 19:27:24 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
150424 19:27:24 [Note] - '0.0.0.0' resolves to '0.0.0.0';
150424 19:27:24 [Note] Server socket created on IP: '0.0.0.0'.
150424 19:27:24 [Warning] 'user' entry 'root@li676-235' ignored in --skip-name-resolve mode.
150424 19:27:24 [Warning] 'proxies_priv' entry '@ root@li676-235' ignored in --skip-name-resolve mode.
150424 19:27:25 [Note] Event Scheduler: Loaded 0 events
150424 19:27:25 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.5.37' socket: '/tmp/mysql.sock' port: 3306 Source distribution
通過這個日誌,mysql在提示他運行困難,也就是表示伺服器資源不夠用了,接下來開始檢查。
The error log message “mysqld_safe Number of processes running now: 0″ indicates scarcity for resources to pursue the operations.
運行free -m ,發現空閑記憶體還有很多,差不多才用一半。
[root@li676-235 ~]# free -m
total used free shared buffers cached
Mem: 990 903 87 0 114 351
-/+ buffers/cache: 436 554
Swap: 255 53 202
為了確定伺服器資源是否真的不夠用了,方法是查看系統日誌中是否有oom(Out Of memory) killer運行過,果然在日誌中發現有很多類似日誌。
[root@li676-235 var]# egrep -i "oom|kill|mysql" /var/log/messages |more
Apr 23 13:36:16 li676-235 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr 23 13:36:16 li676-235 kernel: mysqld cpuset=/ mems_allowed=0
Apr 23 13:36:16 li676-235 kernel: CPU: 0 PID: 16020 Comm: mysqld Not tainted 3.18.5-x86_64-linode52 #1
Apr 23 13:36:16 li676-235 kernel: [<ffffffff8112695f>] ? oom_kill_process+0x65/0x32f
Apr 23 13:36:16 li676-235 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 23 13:36:16 li676-235 kernel: [12128] 0 12128 26564 1 12 71 0 mysqld_safe
Apr 23 13:36:16 li676-235 kernel: [12405] 501 12405 155926 2868 120 3676 0 mysqld
Apr 23 13:36:16 li676-235 kernel: Out of memory: Kill process 9703 (php-fpm) score 41 or sacrifice child
Apr 23 13:36:16 li676-235 kernel: Killed process 9703 (php-fpm) total-vm:266976kB, anon-rss:38932kB, file-rss:0kB
Apr 23 13:36:23 li676-235 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr 23 13:36:24 li676-235 kernel: mysqld cpuset=/ mems_allowed=0
Apr 23 13:36:24 li676-235 kernel: CPU: 0 PID: 12405 Comm: mysqld Not tainted 3.18.5-x86_64-linode52 #1
Apr 23 13:36:24 li676-235 kernel: [<ffffffff8112695f>] ? oom_kill_process+0x65/0x32f
Apr 23 13:36:24 li676-235 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 23 13:36:24 li676-235 kernel: [12128] 0 12128 26564 1 12 71 0 mysqld_safe
Apr 23 13:36:24 li676-235 kernel: [12405] 501 12405 156056 2873 120 3676 0 mysqld
Apr 23 13:36:24 li676-235 kernel: Out of memory: Kill process 18168 (php-fpm) score 38 or sacrifice child
Apr 23 13:36:24 li676-235 kernel: Killed process 18168 (php-fpm) total-vm:263724kB, anon-rss:24872kB, file-rss:0kB
通過這個日誌,可以清楚看到oom kill在“Apr 23 13:36:16”被php-fpm觸發,也就是php在那時被系統強制kill掉了。我設定php-fpm進程最大可以開啟的數量是20。按照20*30 = 600M計算,php佔用的資源最大可能會大於600M 。
“Apr 23 13:36:16”時nginx日誌如下圖所示:
配合nginx日誌,可以確定當時php進程被開啟很多,從而導致系統資源不夠用。
我的解決辦法是減少pm.max_children的數值。當然這個值調小後肯定會犧牲網站的效能,不過我分析日誌後發現正常情況網站是沒那麼高並發的,所以影響應該不大。
Linux VPS伺服器記憶體不夠用的問題暫時先調整到這,先觀察幾天。