一個系統最近頻頻死掉,癥狀是系統的CPU資源全部耗盡,系統內容如下:
雙CPU、RAID5
Oracle 8.1.7.0.1
Weblogic 8.12
JAVA JDBC
FT的是Oracle和weblogic分別跑在RedHat 7.3上!也不知道是怎麼想的!
字元集是ZHS16GBK
系統每隔一段時間就會出現CPU被99%佔用,兩個CPU空閑全為0,同時v$session_wait中的事件都是SQL*Net more data to client或SQL*Net message from client,alert.log中一切正常,StatsPack Report也沒有發現任何的不正常,唯獨SQL*Net事件等待次數非常的多!最初以為是不是內網的交換器出問題了,網路原因導致資料庫問題?
後來經分析發現,當在Web用戶端查詢某個表的特定記錄時,CPU佔用率直線上升,而Web卻查不到這個記錄,但是這個記錄在資料庫表裡的確存在!如果只有一個Web在查這個記錄一個CPU被耗盡,如果開了多個Web查這個記錄比如開5個,則這5個server process平均每個佔20%左右CPU,同時兩個CPU被耗盡,系統掛掉!但是在SQLPLUS裡查詢該記錄沒有任何的不正常!
針對一個號碼的測試如下(X為隱藏部分):
現在在Web用戶端查詢130XXXXXXXX這個號碼,然後TOP裡看到:
1:22pm up 1 day, 10:15, 4 users, load average: 0.98, 0.50, 0.33
106 processes: 104 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states: 15.0% user, 6.0% system, 0.0% nice, 78.4% idle
CPU1 states: 90.3% user, 1.4% system, 0.0% nice, 7.2% idle
Mem: 1031044K av, 1021448K used, 9596K free, 0K shrd, 46060K buff
Swap: 2097112K av, 37424K used, 2059688K free 860096K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
5309 oracle 25 0 8832 8832 8252 R 99.0 0.8 0:59 oracle
3026 oracle 16 0 138M 138M 137M S 2.9 13.7 27:41 oracle
3028 oracle 16 0 144M 144M 143M S 2.5 14.3 28:59 oracle
3030 oracle 16 0 129M 129M 129M S 2.1 12.8 23:25 oracle
5193 oracle 15 0 1092 1092 856 S 2.1 0.1 1:39 top
5310 oracle 15 0 1092 1092 856 R 2.1 0.1 0:06 top
1234 oracle 15 0 136M 136M 135M S 0.3 13.5 1:27 oracle
3020 oracle 15 0 26004 25M 25380 S 0.3 2.5 3:48 oracle
1199 oracle 15 0 11788 11M 11256 S 0.1 1.1 0:13 oracle
1219 oracle 15 0 161M 161M 159M S 0.1 16.0 0:47 oracle
1 root 15 0 476 432 416 S 0.0 0.0 0:05 init
2 root 15 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU1
5 root 15 0 0 0 0 SW 0.0 0.0 0:06 kswapd
6 root 25 0 0 0 0 SW 0.0 0.0 0:00 bdflush
7 root 15 0 0 0 0 SW 0.0 0.0 0:00 kupdated
8 root 25 0 0 0 0 SW 0.0 0.0 0:00 mdrecoveryd
16 root 15 0 0 0 0 SW 0.0 0.0 0:28 kjournald
95 root 16 0 0 0 0 SW 0.0 0.0 0:00 khubd
此時:
SQL> select a.sid,status from v$session a,v$process b where a.paddr=b.addr and b.spid=5309;
SID STATUS
---------- --------
29 ACTIVE
SQL> select sid,event,wait_time,seconds_in_wait,state from v$session_wait where sid=29;
SID EVENT WAIT_TIME SECONDS_IN_WAIT STATE
---------- ----------------------------- ---------- --------------- -------------------
29 SQL*Net more data to client -1 56 WAITED SHORT TIME
在SQLPLUS中:
SQL> select rowid,tel,name from teluser_tbl where tel='130XXXXXXXX';
ROWID TEL NAME
------------------ -------------------- ---------------------
AAAGHgAAFAAAU27AAA 130XXXXXXXX 某某華南工程有限公
顯然這裡"公"後面少了"司"
用PL/SQL Developer的"SQL視窗"查詢,"司"是一個亂碼:
TEL NAME
-------------------- ---------------------
130XXXXXXXX 某某華南工程有限公¢
用PL/SQL Developer的"命令視窗"查詢,"公"後面是個"?"
問題就出在這個NAME欄位上!如果把"公"後面的亂碼刪掉就沒有任何問題了!系統就一切正常了!
我用dbms_system.set_sql_trace_in_session來跟蹤PID=5309的session不會產生任何trace檔案,也就是他的SQL已經執行完了,正在給Web返回資料,而Web始終接收不到這個資料,server process卻懸在這裡並且佔用著大量的CPU資源。
我開7個IE視窗查詢同樣的手機號碼,TOP如下:
7:46pm up 16:40, 4 users, load average: 7.81, 7.77, 7.62
105 processes: 96 sleeping, 9 running, 0 zombie, 0 stopped
CPU0 states: 96.4% user, 3.1% system, 0.0% nice, 0.0% idle
CPU1 states: 99.0% user, 1.0% system, 0.0% nice, 0.0% idle
Mem: 1031044K av, 1021452K used, 9592K free, 0K shrd, 36812K buff
Swap: 2097112K av, 42588K used, 2054524K free 881416K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
3438 oracle 25 0 9592 9592 8988 R 32.7 0.9 21:29 oracle
3431 oracle 25 0 11900 11M 11300 R 32.3 1.1 21:57 oracle
3436 oracle 25 0 9136 9136 8552 R 30.9 0.8 20:51 oracle
3454 oracle 25 0 11916 11M 11260 R 24.5 1.1 14:39 oracle
3433 oracle 25 0 25028 24M 24376 R 23.5 2.4 16:06 oracle
3442 oracle 25 0 9092 9092 8508 R 23.5 0.8 15:51 oracle
3452 oracle 25 0 9056 9056 8472 R 23.5 0.8 14:36 oracle
這七個進程的EVENT全部是:SQL*Net more data to client
而WEB端不會有任何現在,也就是資料根本無法傳遞給WEB!!!
相應的SQL很簡單,執行計畫很好,成本也很低,我用TOP-PID+v$session、v$process、v$sqlarea抓到SQL,執行計畫如下:
select rowid, t.tel,t.password,t.language,t.isfree,t.status,
t.enddate,t.endprocess,t.province,t.city,t.code,t.name,
t.address,t.postcode,t.companytel,t.familytel,t.fax,t.qq,
t.email,t.regdate,t.unregdate,t.logincount
from teluser_tbl t where t.tel ='133XXXXXXXX'
Elapsed: 00:00:00.45
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TELUSER_TBL'
2 1 INDEX (UNIQUE SCAN) OF 'SYS_C001736' (UNIQUE)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
1628 bytes sent via SQL*Net to client
425 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
顯然這是個不錯執行計畫。
把這個記錄的block dump出來:
SQL> select file_id from dba_extents where segment_name='TELUSER_TBL';
FILE_ID
---------------
5
SQL> set serveroutput on
SQL> declare
2 a number;
3 begin
4 a:=dbms_rowid.rowid_block_number('AAAGHgAAFAAAU27AAA');
5 dbms_output.put_line(a);
6 end;
7 /
85435
PL/SQL 過程已成功完成。
SQL> declare
2 a number;
3 begin
4 a:=dbms_rowid.rowid_row_number('AAAGHgAAFAAAU27AAA');
5 dbms_output.put_line(a);
6 end;
7 /
0
PL/SQL 過程已成功完成。
SQL> alter system dump datafile 5 block 85435;
System altered.
TRACE檔案如下(只貼row 0的):
[oracle@sxdb udump]$ less ora_5301.trc
/opt/ora/admin/sxit/udump/ora_5301.trc
Oracle8i Enterprise Edition Release 8.1.7.0.1 - Production
With the Partitioning option
JServer Release 8.1.7.0.1 - Production
ORACLE_HOME = /opt/ora/8.1.7
System name: Linux
Node name: sxdb
Release: 2.4.18-3smp
Version: #1 SMP Thu Apr 18 07:27:31 EDT 2002
Machine: i686
Instance name: sxit
Redo thread mounted by this instance: 1
Oracle process number: 45
Unix process pid: 5301, image: oracle@sxdb (TNS V1-V3)
*** 2005-01-20 13:11:49.539
*** SESSION ID:(19.25087) 2005-01-20 13:11:49.538
Start dump data blocks tsn: 4 file#: 5 minblk 85435 maxblk 85435
buffer tsn: 4 rdba: 0x01414dbb (5/85435)
scn: 0x0000.058b81ce seq: 0x01 : 0x02 tail: 0x81ce0601
frmt: 0x02 chkval: 0x0000 type: 0x06=trans data
Block header dump: 0x01414dbb
Object id on Block? Y
seg/obj: 0x61e0 csc: 0x00.58b81cd itc: 1 : - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01
Itl Xid Uba Flag Lck Scn/Fsc
0x01 xid: 0x0003.031.0000aa4c uba: 0x00c003de.10f4.48 --U- 1 fsc 0x0000.058b81ce
data_block_dump
===============
tsiz: 0x1fb8
hsiz: 0x58
pbl: 0x09724644
bdba: 0x01414dbb
flag=-----------
ntab=1
nrow=35
frre=-1
fsbo=0x58
fseo=0x312
avsp=0x389
tosp=0x389
0xe:pti[0] nrow=35 offs=0
0x12:pri[0] offs=0x1edf
0x14:pri[1] offs=0x1e13
0x16:pri[2] offs=0x1d47
0x18:pri[3] offs=0x1c75
0x1a:pri[4] offs=0x1ba9
0x1c:pri[5] offs=0x1add
0x1e:pri[6] offs=0x1a11
0x20:pri[7] offs=0x1945
0x22:pri[8] offs=0x1879
0x24:pri[9] offs=0x17ad
0x26:pri[10] offs=0x16e2
0x28:pri[11] offs=0x1617
0x2a:pri[12] offs=0x154c
0x2c:pri[13] offs=0x1481
0x2e:pri[14] offs=0x13b6
0x30:pri[15] offs=0x12eb
0x32:pri[16] offs=0x1220
0x34:pri[17] offs=0x1155
0x36:pri[18] offs=0x108c
0x38:pri[19] offs=0xfc1
0x3a:pri[20] offs=0xef6
0x3c:pri[21] offs=0xe2d
0x3e:pri[22] offs=0xd62
0x40:pri[23] offs=0xc99
0x42:pri[24] offs=0xbce
0x44:pri[25] offs=0xb05
0x46:pri[26] offs=0xa39
0x48:pri[27] offs=0x96d
0x4a:pri[28] offs=0x8a5
0x4c:pri[29] offs=0x7d9
0x4e:pri[30] offs=0x70d
0x50:pri[31] offs=0x645
0x52:pri[32] offs=0x57d
0x54:pri[33] offs=0x3e2
0x56:pri[34] offs=0x312
block_row_dump:
tab 0, row 0, @0x1edf
tl: 217 fb: --H-FL-- lb: 0x0 cc: 50
col 0: [11] 31 33 30 3X 3X 3X 3X 3X 3X 3X 3X (X-是我隱藏的,避免泄露客戶資訊)
col 1: [ 6] 32 30 30 35 30 31
col 2: [ 4] b9 e3 b6 ab
col 3: [ 4] b7 f0 c9 bd
col 4: [ 4] 30 37 36 35
col 5: [19] c2 b7 c7 c5 bb aa c4 cf b9 a4 b3 cc d3 d0 cf de b9 ab cb ——這就是NAME欄位一共儲存了19個位元組,這行記錄的NAME是"某某華南工程有限公"一共9個字元,顯然資料庫多儲存了一個位元組!而這個位元組造成了亂碼!
col 6: *NULL*
col 7: *NULL*
col 8: *NULL*
col 9: *NULL*
col 10: *NULL*
col 11: *NULL*
col 12: *NULL*
col 13: [14] 32 30 30 35 30 31 30 35 31 37 31 34 33 32
col 14: [14] 32 30 30 35 30 31 30 35 31 37 31 34 33 32
col 15: [ 2] c1 02
col 16: [ 1] 80
col 17: [ 1] 80
col 18: [ 2] c1 02
col 19: [14] 32 30 30 35 30 31 30 35 31 37 31 34 33 32
col 20: [14] 32 30 30 35 30 31 30 35 31 37 31 34 33 32
col 21: [ 1] 80
col 22: [ 1] 80
col 23: [ 2] c1 04
col 24: [ 1] 80
col 25: [ 2] c1 06
col 26: [ 2] c1 02
col 27: [ 4] 30 30 30 30
col 28: [ 1] 80
col 29: *NULL*
col 30: [14] 32 30 30 35 30 31 30 35 31 37 31 34 33 32
col 31: [ 1] 80
col 32: *NULL*
col 33: [ 1] 80
col 34: [ 1] 80
col 35: *NULL*
col 36: *NULL*
col 37: *NULL*
col 38: *NULL*
col 39: *NULL*
col 40: *NULL*
col 41: [ 1] 80
col 42: [ 1] 80
col 43: [ 1] 80
col 44: [ 1] 80
col 45: [ 1] 80
col 46: [ 1] 80
col 47: [ 1] 80
col 48: [14] 32 30 30 35 30 33 30 37 31 37 31 34 33 32
col 49: [ 2] c1 0c
由於這個欄位不是很重要,最後用update teluser_tbl set name=rtrim(name,substr(name,-1));節斷了最右端的一個字元,系統復原正常。
後來開發人員詳細檢查程式,發現C語言在讀入資料的時候少讀了一個位元組!
那為什麼這個位元組會造成如此嚴重的問題呢,目前唯一的解釋是ZHS16GBK字元集的Bug,這個位元組在ZHS16GBK字元集中對應的字元與OS底層通訊代碼衝突,造成通訊故障,從而引發資料庫故障。
先到這裡,等有空再把這個問題詳細研究一下。