A real-world Oracle case: fixing OEM consuming large amounts of CPU and memory

Last updated: 2013-12-29 Source: Internet


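I. Introduction

Our DBA left the company, so for now I am also looking after all of the company's Oracle database servers. Today I logged in to the database server for one of the provinces and found that the ssh login took about 30 seconds to complete. I then checked the load and the memory. The details are as follows.

Load:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R32S0-0.jpg]

I had never seen a load this high. The highest I had seen before was a little over 1000, and that was caused by a Java problem.

Memory:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R353T-1.jpg]

Even the swap space was used up, and only 71 MB of physical memory was left, which is very dangerous.

top:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R364B-2.jpg]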
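The screenshots do not show exactly which commands were run. As a minimal sketch, assuming a Linux host that exposes /proc/loadavg and /proc/meminfo, the same load and free-memory figures could be read like this. The function names load_summary and mem_free_mb are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not from the original article.

# Print the three load averages from a /proc/loadavg-style line on stdin.
load_summary() {
    awk '{ printf "1min=%s 5min=%s 15min=%s\n", $1, $2, $3 }'
}

# Print free physical memory and free swap in MB
# from /proc/meminfo-style input on stdin (values there are in kB).
mem_free_mb() {
    awk '
        $1 == "MemFree:"  { mem  = int($2 / 1024) }
        $1 == "SwapFree:" { swap = int($2 / 1024) }
        END { printf "MemFree=%dMB SwapFree=%dMB\n", mem, swap }
    '
}

# Usage on a Linux host:
#   load_summary < /proc/loadavg
#   mem_free_mb  < /proc/meminfo
```

Both functions read from standard input, so they can be tried on saved output from another machine as well as on the live /proc files.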

top showed 6 zombie processes and a large number of perl processes. First, the zombie processes:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R3I13-3.jpg]
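One way to produce a zombie listing like the one in the screenshot is to filter ps output on the process state. This is a sketch, assuming a procps-style ps that accepts -eo pid,ppid,stat,comm; the function name list_zombies is made up here. The parent PID is printed because a zombie is already dead and cannot be killed; it only goes away once its parent reaps it or exits.

```shell
#!/bin/sh
# Hypothetical helper, not from the original article.
# Reads "ps -eo pid,ppid,stat,comm" output on stdin and prints
# "PID PPID COMMAND" for every process whose state starts with Z (zombie).
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Usage on a Linux host:
#   ps -eo pid,ppid,stat,comm | list_zombies
```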

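They were all [sh] <defunct> processes. I had run into this before, and each time it was because a script started from cron did not have its error output redirected to the null device. The fix is to append >>/dev/null 2>&1 after the script in the cron entry. I checked cron to see whether it matched my guess:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R31018-4.jpg]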
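The redirection >>/dev/null 2>&1 can be tried outside cron to see what it does. This is an illustrative sketch: the function noisy_job and the crontab path in the comment are made up, not taken from the server in the article.

```shell
#!/bin/sh
# Hypothetical crontab entry using the redirection (the path is an assumption):
#   */5 * * * * /path/to/script.sh >>/dev/null 2>&1

# A stand-in job that writes to both stdout and stderr.
noisy_job() {
    echo "normal output"
    echo "error output" >&2
}

# stdout is appended to /dev/null first, then stderr is pointed at
# wherever stdout now goes, so nothing is left to capture.
quiet_output=$(noisy_job >>/dev/null 2>&1)
echo "captured: [${quiet_output}]"
```

The order matters: redirections are applied left to right, so writing 2>&1 before >>/dev/null would send stderr to the original stdout and discard only the normal output.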

Sure enough, the error output was not being redirected. After adding >>/dev/null 2>&1 and restarting the cron service, that problem was solved. Next, the perl processes:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R33J3-5.jpg]
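A process count like the one in this screenshot can be obtained by counting matching lines of ps output. This is a sketch with a made-up function name, count_procs; it assumes a grep that supports -E and -c.

```shell
#!/bin/sh
# Hypothetical helper, not from the original article.
# Counts the lines on stdin that match the extended regex given as $1.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_procs() {
    grep -E -c -- "$1" || true
}

# Usage on a live system. Writing the pattern as [p]erl keeps the grep
# process itself from being counted, because its own command line
# contains "[p]erl" and not "perl":
#   ps -ef | count_procs '[p]erl'
```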

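There were 2726 perl processes, using a large amount of CPU and memory. A search on metalink showed that this problem is caused by a fault in OEM. Oracle's description of the problem and its solution are as follows: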

 
  Server Has 100% Of Cpu Because Of Dbresp.pl [ID 764140.1]
  ________________________________________
  Modified:07-Feb-2012 Type:PROBLEM Status:MODERATED Priority:3
  ________________________________________
  This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review.
  Applies to:
  Enterprise Manager Base Platform - Version: 10.2.0.1 and later [Release: 10.2 and later ]
  Information in this document applies to any platform.
  ***Checked for relevance on 07-Feb-2012***
  Symptoms  
  Server has 100% of CPU because of dbresp.pl . There are more than 50 process from this script  
   
  emagent.trc shows:  
  2009-01-21 10:19:50 Thread-4099931040 WARN engine: Missing Properties : [limitSwitch]   
  2009-01-21 10:19:50 Thread-4099931040 ERROR engine: [oracle_database,orcl, alertLog] : nmeegd_GetMetricData failed : Missing Properties : [limitSwitch]   
  2009-01-22 06:54:33 Thread-4105165728 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds   
  2009-01-22 06:54:33 Thread-4105165728 ERROR command: failed to kill process 4793 running perl: (errno=3: No such process)   
  2009-01-22 06:54:33 Thread-4105165728 ERROR engine: [oracle_database,orlc, Response] : nmeegd_GetMetricData failed : Metric execution timed out in 600 seconds   
  Cause  
  The Response metric is making a timed out then the Agent starts other process to take the Response metric. The process to kill the PID taking the Response metric is failing increasing the process running dbresp.pl  
   
  Before the Response metric starts to do the timed out there is other error:  
  2009-01-21 10:19:50 Thread-4099931040 WARN engine: Missing Properties : [limitSwitch]  
  2009-01-21 10:19:50 Thread-4099931040 ERROR engine: [oracle_database,orcl,alertLog] :  
  nmeegd_GetMetricData failed : Missing Properties : [limitSwitch]  
  Solution  
  1. Stop DBConsole  
   
  emctl stop dbconsole  
   
  2. Kill any running process.  
   
  ps -ef | grep /opt/app/oracle/<hostname>_<sid> 
   
  Kill any returned process.  
   
  3. Follow fix  
   
  Note.361612.1 Ext/Mod Problem Performance Agent High CPU Consumption Gen  
   
  4. Start DB Console  
   
  emctl start dbconsole

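II. Applying the solution

Following this solution, I shut down OEM first. Before doing that, here is my system and database environment.

System version: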

 
  oracleserver:~ # cat /etc/SuSE-release   
  SUSE Linux Enterprise Server 10 (x86_64)  
  VERSION = 10 
  PATCHLEVEL = 3

Database version:

 
  SQL> select * from v$version;  
   
  BANNER  
  ----------------------------------------------------------------  
  Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi  
  PL/SQL Release 10.2.0.1.0 - Production  
  CORE    10.2.0.1.0  Production  
  TNS for Linux: Version 10.2.0.1.0 - Production  
  NLSRTL Version 10.2.0.1.0 - Production

1. Log in as the oracle user, then shut down OEM

 
  oracleserver:~ # su - oracle  
  oracle@oracleserver:~> id  
  uid=1000(oracle) gid=1000(oinstall) groups=1000(oinstall),1001(dba)  
  oracle@oracleserver:~> emctl stop dbconsole  
  TZ set to Asia/Shanghai  
  Oracle Enterprise Manager 10g Database Control Release 10.2.0.1.0    
  Copyright (c) 1996, 2005 Oracle Corporation.  All rights reserved.  
  http://oracleserver.site:1158/em/console/aboutApplication  
  Stopping Oracle Enterprise Manager 10g Database Control ...   
   ...  Stopped.

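Note that when you shut down OEM, nothing is printed at first, and neither the system logs nor the Oracle alert log shows anything either. You still need to wait patiently; for me this step took 30 minutes to complete. If the command prints nothing after you run it, I recommend waiting longer rather than pressing Ctrl+C to abort it as soon as you see no output.

2. Kill the perl processes

With OEM shut down, check the memory and the perl processes again.

perl processes:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R36342-6.jpg]

Still 2726, no change.

Memory:

[Screenshot: http://img1.51cto.com/attachment/201206/113326232.jpg]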

55 MB free. Now kill the perl processes with kill -9 $(ps -ef|grep perl|grep -v grep|awk '{print $2}')

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R31059-8.jpg]
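The kill command above matches every process with "perl" anywhere in its ps line. A narrower selection is sketched below, assuming the runaway processes are the dbresp.pl scripts named in the Oracle note and that ps -ef prints the command starting in the eighth column, as procps does. The function name pids_for_perl_script is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper, not from the original article.
# Reads "ps -ef" output on stdin and prints the PID (column 2) of every
# process whose executable (column 8) is perl and whose line mentions
# the script name passed as $1.
pids_for_perl_script() {
    awk -v script="$1" '$8 ~ /(^|\/)perl$/ && index($0, script) { print $2 }'
}

# Usage: review the list first, then kill.
#   ps -ef | pids_for_perl_script dbresp.pl
#   ps -ef | pids_for_perl_script dbresp.pl | xargs -r kill
# (xargs -r is a GNU extension that skips the kill when the list is empty.)
```

Requiring column 8 to be perl also keeps the awk process in the pipeline from selecting itself. Trying a plain kill before kill -9 gives the processes a chance to exit cleanly.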

Then check the perl processes again:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R32P6-9.jpg]

The perl processes are gone. Check the memory:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R363V-10.jpg]

Free memory is now 6673 MB, back to normal. Check the load:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R32942-11.jpg]

The load is back to normal: the 1-minute average is 3.15, the 5-minute average is 242.76, and the 15-minute average is 1236.57. The 1-minute load is about 3, but my server has 16 cores, so a load of 3 is not a problem.

Number of CPU cores on the server:

[Screenshot: http://www.bkjia.com/uploads/allimg/131229/195R35401-12.jpg]
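The reasoning that a load of 3 is fine on 16 cores amounts to dividing the load average by the core count. This is a small sketch with a made-up function name, load_per_core; a result well below 1 means the run queue is shorter than the number of cores.

```shell
#!/bin/sh
# Hypothetical helper, not from the original article.
# Prints load average ($1) divided by core count ($2), to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# With the figures from the article:
#   load_per_core 3.15 16
# On a live Linux host (counting logical CPUs from /proc/cpuinfo):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(grep -c ^processor /proc/cpuinfo)"
```

With the article's numbers, 3.15 / 16 is about 0.20 per core.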

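The problem is now solved. If you want OEM to monitor Oracle again, run emctl start dbconsole as the oracle user.

Tip: for most database faults, I recommend first working out how the problem arose and then finding an approach and a method for solving it. If you have a metalink account, it is best to log in there and search for the cause and the solution. I do not really recommend searching Baidu or Google for solutions, because many of the answers found there are not necessarily accurate or suitable for your case. If your production database has a problem and you follow a solution from Baidu or Google without understanding the cause of the problem or the reasoning behind the fix, you are relying on luck. If it works, everyone is happy; if it does not, or it makes things worse, you may not be far from losing your job.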

This article comes from the “吟—技術交流” blog. Please be sure to keep this source: http://dl528888.blog.51cto.com/2382721/911535
