I. Introduction
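Our DBA has left the company, so for now I am looking after all of the company's Oracle database servers on the side. Today I logged in to the database server for one province and noticed that the SSH login took about 30 seconds to complete. I then checked the load and memory, with the following results.
Load:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R32S0-0.jpg]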
I had never seen a load this high. The highest I had seen before was a little over 1000, and that was caused by a Java problem.
Memory:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R353T-1.jpg]
Even the swap space was used up, and only 71 MB of physical memory was left, which is very dangerous.
top:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R364B-2.jpg]
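The memory figures above can also be read directly from the kernel instead of from a screenshot. The following is a minimal sketch, assuming a Linux system that exposes /proc/meminfo; the function name mem_summary is made up for illustration.

```shell
#!/bin/sh
# Print free physical memory and free swap in MB from /proc/meminfo-style input.
# Reads stdin so it works on saved output as well as on the live file.
# mem_summary is a hypothetical helper name, not a standard tool.
mem_summary() {
    awk '
        $1 == "MemFree:"  { mem  = int($2 / 1024) }   # values are reported in kB
        $1 == "SwapFree:" { swap = int($2 / 1024) }
        END { printf "mem_free_mb=%d swap_free_mb=%d\n", mem, swap }
    '
}

# Live usage on Linux:
#   mem_summary < /proc/meminfo
#   cat /proc/loadavg      # 1-, 5- and 15-minute load averages come first
```

Keeping the parser on stdin makes it easy to compare a snapshot taken before the fix with one taken afterwards.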
top showed 6 zombie processes and a large number of perl processes. First, a look at the zombie processes:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R3I13-3.jpg]
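A listing like the one in the screenshot can be produced by filtering ps output on the process state. This is only a sketch: list_zombies is a hypothetical helper, and it assumes a ps that accepts the POSIX -o format options.

```shell
#!/bin/sh
# List zombie processes (state starting with Z) as "PID PPID COMMAND".
# Expects lines of "PID PPID STAT COMMAND" on stdin.
# list_zombies is a hypothetical helper name.
list_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# Live usage:
#   ps -e -o pid= -o ppid= -o stat= -o comm= | list_zombies
```

The parent PID column matters because a zombie cannot be killed directly. It disappears only when its parent reaps it or exits.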
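They were all [sh] <defunct> processes. I had run into this before: each time it was caused by scripts started from cron without their error output being redirected to the null device. The fix is to append >>/dev/null 2>&1 after the script in the cron entry. I checked cron to see whether the same thing was happening here:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R31018-4.jpg]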
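To find cron jobs that lack the redirection described above, the crontab can be filtered mechanically. The sketch below assumes the crontab text arrives on stdin; cron_jobs_without_redirect is a hypothetical helper, and the script path in the comment is a placeholder.

```shell
#!/bin/sh
# Print crontab job lines that do not contain the 2>&1 redirection.
# Blank lines, comments and VAR=value settings are skipped.
# cron_jobs_without_redirect is a hypothetical helper name.
cron_jobs_without_redirect() {
    awk '
        /^[ \t]*(#|$)/                         { next }   # comment or blank line
        /^[ \t]*[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ { next }   # environment setting
        !/2>&1/                                { print }  # job without the redirect
    '
}

# Live usage:
#   crontab -l | cron_jobs_without_redirect
# A job line with the redirect (placeholder path):
#   */5 * * * * /path/to/script.sh >>/dev/null 2>&1
```

This checks only for the literal 2>&1, so a job that redirects stderr some other way would still be reported and needs a manual look.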
Sure enough, the error output was not being redirected. After adding >>/dev/null 2>&1 and restarting the cron service, the zombie problem was solved. Next, the perl processes:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R33J3-5.jpg]
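When one command dominates the process table, counting processes per command name shows it quickly. This is a rough sketch with a hypothetical helper name, top_commands, reading one command name per line.

```shell
#!/bin/sh
# Count processes per command name, highest count first.
# Expects one command name per line on stdin.
# top_commands is a hypothetical helper name.
top_commands() {
    awk '{ count[$1]++ } END { for (c in count) print count[c], c }' |
        sort -k1,1nr -k2,2
}

# Live usage:
#   ps -e -o comm= | top_commands | head
```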
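There were 2726 perl processes, consuming a large amount of CPU and memory. I searched Metalink and found that this problem is caused by a fault in OEM. Oracle's description of the problem and its solution are as follows: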
- Server Has 100% Of Cpu Because Of Dbresp.pl [ID 764140.1]
- Modified:07-Feb-2012 Type:PROBLEM Status:MODERATED Priority:3
- This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review.
- Applies to:
- Enterprise Manager Base Platform - Version: 10.2.0.1 and later [Release: 10.2 and later ]
- Information in this document applies to any platform.
- ***Checked for relevance on 07-Feb-2012***
- Symptoms
- Server has 100% of CPU because of dbresp.pl . There are more than 50 process from this script
-
- emagent.trc shows:
- 2009-01-21 10:19:50 Thread-4099931040 WARN engine: Missing Properties : [limitSwitch]
- 2009-01-21 10:19:50 Thread-4099931040 ERROR engine: [oracle_database,orcl, alertLog] : nmeegd_GetMetricData failed : Missing Properties : [limitSwitch]
- 2009-01-22 06:54:33 Thread-4105165728 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
- 2009-01-22 06:54:33 Thread-4105165728 ERROR command: failed to kill process 4793 running perl: (errno=3: No such process)
- 2009-01-22 06:54:33 Thread-4105165728 ERROR engine: [oracle_database,orlc, Response] : nmeegd_GetMetricData failed : Metric execution timed out in 600 seconds
- Cause
- The Response metric is making a timed out then the Agent starts other process to take the Response metric. The process to kill the PID taking the Response metric is failing increasing the process running dbresp.pl
-
- Before the Response metric starts to do the timed out there is other error:
- 2009-01-21 10:19:50 Thread-4099931040 WARN engine: Missing Properties : [limitSwitch]
- 2009-01-21 10:19:50 Thread-4099931040 ERROR engine: [oracle_database,orcl,alertLog] :
- nmeegd_GetMetricData failed : Missing Properties : [limitSwitch]
- Solution
- 1. Stop DBConsole
-
- emctl stop dbconsole
-
- 2. Kill any running process.
-
- ps -ef | grep /opt/app/oracle/<hostname>_<sid>
-
- Kill any returned process.
-
- 3. Follow fix
-
- Note.361612.1 Ext/Mod Problem Performance Agent High CPU Consumption Gen
-
- 4. Start DB Console
-
- emctl start dbconsole
-
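II. Applying the solution
Following this solution, I shut down OEM first. Before doing that, here is my system and database environment.
OS version: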
- oracleserver:~ # cat /etc/SuSE
- SuSE-release SuSEconfig/
- oracleserver:~ # cat /etc/SuSE-release
- SUSE Linux Enterprise Server 10 (x86_64)
- VERSION = 10
- PATCHLEVEL = 3
Database version:
- SQL> select * from v$version;
-
- BANNER
- ----------------------------------------------------------------
- Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
- PL/SQL Release 10.2.0.1.0 - Production
- CORE 10.2.0.1.0 Production
- TNS for Linux: Version 10.2.0.1.0 - Production
- NLSRTL Version 10.2.0.1.0 - Production
1. Log in as the oracle user, then stop OEM
- oracleserver:~ # su - oracle
- oracle@oracleserver:~> id
- uid=1000(oracle) gid=1000(oinstall) groups=1000(oinstall),1001(dba)
- oracle@oracleserver:~> emctl stop dbconsole
- TZ set to Asia/Shanghai
- Oracle Enterprise Manager 10g Database Control Release 10.2.0.1.0
- Copyright (c) 1996, 2005 Oracle Corporation. All rights reserved.
- http://oracleserver.site:1158/em/console/aboutApplication
- Stopping Oracle Enterprise Manager 10g Database Control ...
- ... Stopped.
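Note that when you stop OEM, nothing is printed at first, and neither the system log nor the Oracle alert log shows anything either. Be patient: this step took 30 minutes to complete for me. If the command prints nothing after you run it, I recommend waiting longer rather than pressing Ctrl+C to abort it.
2. Kill the perl processes
With OEM stopped, check memory and the perl processes again.
perl processes:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R36342-6.jpg]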
Still 2726, no change.
Memory:
[Screenshot: http://img1.51cto.com/attachment/201206/113326232.jpg]
55 MB free. Now kill the perl processes with kill -9 $(ps -ef|grep perl|grep -v grep|awk '{print $2}')
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R31059-8.jpg]
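The command above matches every process whose ps line contains "perl", including unrelated perl jobs. A narrower selection is sketched below. It assumes the runaway processes are a perl interpreter running dbresp.pl, as the Oracle note suggests; select_pids is a hypothetical helper name.

```shell
#!/bin/sh
# Print PIDs of perl processes running a given script (default: dbresp.pl).
# Expects lines of "PID COMMAND ARGS..." on stdin.
# Requiring the command itself to be perl keeps this awk process, whose own
# arguments mention the script name, out of the result.
# select_pids is a hypothetical helper name.
select_pids() {
    script=${1:-dbresp.pl}
    awk -v s="$script" '$2 ~ /(^|\/)perl$/ && index($0, s) { print $1 }'
}

# Dry run first, then kill. Plain kill sends TERM; -9 is the fallback:
#   pids=$(ps -e -o pid= -o args= | select_pids)
#   echo "$pids"
#   [ -n "$pids" ] && kill $pids
```

Printing the PID list before killing anything gives a chance to confirm that only the intended processes were selected.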
Then check the perl processes again:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R32P6-9.jpg]
The perl processes are gone. Check memory:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R363V-10.jpg]
Free memory is now 6673 MB, which is back to normal. Check the load:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R32942-11.jpg]
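The load is back to normal: the 1-minute average is 3.15, the 5-minute average is 242.76, and the 15-minute average is 1236.57. The 1-minute load is 3, but my server has 16 cores, so a load of 3 is not a problem.
Server CPU core count:
[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R35401-12.jpg]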
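The reasoning above, that a load of 3 is fine on 16 cores, amounts to dividing the load average by the core count. A minimal sketch of that arithmetic follows; load_per_core is a hypothetical helper name, and the "below about 1.0 per core" reading is a common rule of thumb rather than a hard limit.

```shell
#!/bin/sh
# Print the load average divided by the number of CPU cores.
# Arguments: load average, core count.
# load_per_core is a hypothetical helper name.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# With the figures from the text: 1-minute load 3.15 on 16 cores
load_per_core 3.15 16     # prints 0.20

# Live values on Linux (nproc is part of GNU coreutils):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```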
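The problem is now solved. If you want to turn OEM back on to monitor Oracle, run emctl start dbconsole as the oracle user.
Tip: when dealing with a database fault, I recommend first working out how the problem arose and then deciding how to approach and fix it. If you have a Metalink account, it is best to log in and search there for the cause and the solution. I do not really recommend searching Baidu or Google for fixes, because many of the answers found there are not necessarily accurate or suitable for your situation. If your production database has a problem and you follow a fix from Baidu or Google without understanding the cause or the reasoning behind the fix, you are relying on luck. If it works, everyone is happy. If it does not, or it makes things worse, you may not be far from losing your job.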
This article comes from the "吟—技術交流" blog. Please keep this attribution: http://dl528888.blog.51cto.com/2382721/911535