[oracle@node1 crsd]$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
[oracle@node1 crsd]$ crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
[root@node1 crs]# ps -ef|grep crs
root 3926 1 0 17:46 ? 00:00:00 /bin/sh /etc/init.d/init.crsd run
root 29408 25855 0 22:09 pts/1 00:00:00 grep crs
[root@node1 bin]# ./racgvip
There is no VIP name
[root@node1 crsd]# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Device or resource busy] [16]
Shutdown has begun. The daemons should exit soon.
[root@node1 crsd]# raw -qa
/dev/raw/raw1: bound to major 8, minor 17
/dev/raw/raw2: bound to major 8, minor 33
[root@node1 crsd]# ls -al /dev/raw/raw2
crw-rw---- 1 oracle dba 162, 2 Sep 15 17:45 /dev/raw/raw2
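The `raw -qa` output reports each binding as a major/minor pair rather than a device name. The following is a minimal sketch for decoding such a pair, assuming the conventional Linux numbering for SCSI disks (major 8, 16 minors per disk). `sd_name` is a hypothetical helper written for this note, not part of any Oracle or system tool.

```shell
# Hypothetical helper: decode a (major, minor) pair as printed by `raw -qa`
# into an sd device name. Assumes major 8 and 16 minors per disk (sda..sdp).
sd_name() {
    major=$1
    minor=$2
    if [ "$major" -ne 8 ]; then
        echo "unsupported major: $major" >&2
        return 1
    fi
    disk=$((minor / 16))     # 0 -> sda, 1 -> sdb, ...
    part=$((minor % 16))     # 0 -> whole disk, 1..15 -> partition number
    letter=$(printf '%s' abcdefghijklmnop | cut -c "$((disk + 1))")
    if [ "$part" -eq 0 ]; then
        echo "/dev/sd$letter"
    else
        echo "/dev/sd$letter$part"
    fi
}

sd_name 8 17    # prints /dev/sdb1
sd_name 8 33    # prints /dev/sdc1
```

Under that assumption, raw1 on node1 is bound to the first partition of sdb and raw2 to the first partition of sdc.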
[root@node1 bin]# ./crsctl query css votedisk
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Device or resource busy] [16]
[root@node1 bin]# ./ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
[root@node1 ~]# ll /etc/oracle/ocr.loc
-rw-r--r-- 1 root oinstall 45 2012-01-17 /etc/oracle/ocr.loc
[root@node1 bin]# more /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
[root@node1 ~]# dd if=/dev/raw/raw1 of=/opt/oracle/ocr_raw.bak
dd: opening '/dev/raw/raw1': Device or resource busy
lsof|grep /dev/raw/raw1
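No process was holding it.
I wanted to format the partition behind raw1. While doing so I found that sdb1 was actually 10.7 GB, not the 100 MB raw device.
A system administrator came over to help,
and after running fdisk on sdb the system hit file system errors at boot. After entering the root password at the boot prompt, I was able to run fdisk on sdb again,
partition the 10.7 GB sdb as sdb1, partition the raw-device disk as sdc1, and then format with mkfs.ext3 /dev/sdb1.
That got the system to boot. I then updated the corresponding bindings in /etc/sysconfig/rawdevices.
After another reboot, dd could back up the contents of /dev/raw/raw1 without reporting an error.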
[root@node1 tmp]# dd if=/dev/zero of=/dev/raw/raw1 bs=512 count=2048
2048+0 records in
2048+0 records out
[root@node1 tmp]# dd if=/dev/zero of=/dev/raw/raw2 bs=512 count=2048
2048+0 records in
2048+0 records out
The raw devices are working normally…
No new errors were produced in /tmp.
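Zeroing the head of a raw device with dd is destructive, so it helps to take a copy and verify it first. The following is a minimal sketch of that order of operations. `backup_then_zero` and its arguments are hypothetical names; the block size and count are taken from the dd commands above. It is worth trying on a scratch file before pointing it at a device.

```shell
# Hypothetical helper: back up the first COUNT 512-byte blocks of TARGET,
# verify the copy, and only then overwrite those blocks with zeros.
backup_then_zero() {
    target=$1
    backup=$2
    count=${3:-2048}
    # 1. copy the region that is about to be overwritten
    dd if="$target" of="$backup" bs=512 count="$count" 2>/dev/null || return 1
    # 2. re-read the same region and confirm the copy is byte-identical
    dd if="$target" bs=512 count="$count" 2>/dev/null | cmp -s - "$backup" || return 1
    # 3. only now overwrite; conv=notrunc keeps a regular test file's length
    dd if=/dev/zero of="$target" bs=512 count="$count" conv=notrunc 2>/dev/null
}

# Example (not run here):
# backup_then_zero /dev/raw/raw1 /opt/oracle/ocr_raw.bak 2048
```

If the verification step fails, the function returns before anything is overwritten.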
Stop CRS:
[root@node1 ~]# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage
Shutdown has begun. The daemons should exit soon.
Run the OCR restore:
ocrconfig -restore /opt/oracle/crshome/product/10.2.0/db_1/cdata/crs/backup00.ocr
No response.
Check the OCR log:
cd /opt/oracle/crshome/product/10.2.0/db_1/log/node1/client
[root@node1 client]# cat ocrconfig_6090.log
Oracle Database 10g CRS Release 10.2.0.1.0 Production Copyright 1996, 2005 Oracle. All rights reserved.
2012-09-19 10:51:08.056: [ OCRCONF][3086915264]ocrconfig starts...
2012-09-19 10:51:08.109: [ OCROSD][3086915264]utopen:12:Not enough space in the backing store
2012-09-19 10:51:08.109: [ OCROSD][3086915264]utopen:10:None of the OCR devices are usable
2012-09-19 10:51:08.109: [ OCRRAW][3086915264]phy_rec:1:could not open OCR device
2012-09-19 10:51:08.109: [ OCRCONF][3086915264]Failed to restore OCR from [/opt/oracle/crshome/product/10.2.0/db_1/cdata/crs/backup00.ocr]
2012-09-19 10:51:08.109: [ OCRCONF][3086915264]Exiting [status=failed]...
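Because ocrconfig printed nothing on the terminal, the client log is the only place the failure shows up. The following is a minimal sketch that checks such a log for the failure marker seen above. `ocr_log_failed` is a hypothetical helper, and it assumes the `Exiting [status=failed]` line is what distinguishes a failed run, which is all the log excerpt here confirms.

```shell
# Hypothetical helper: report whether an ocrconfig client log ended in failure.
# Returns 0 and prints the OCROSD/OCRRAW detail lines if the run failed,
# returns 1 otherwise.
ocr_log_failed() {
    log=$1
    if grep -q 'Exiting \[status=failed\]' "$log"; then
        grep -E '\[ *(OCROSD|OCRRAW)\]' "$log"
        return 0
    fi
    return 1
}

# Example (not run here):
# ocr_log_failed ocrconfig_6090.log && echo "restore failed, see lines above"
```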
Probably a permissions problem:
[root@node1 client]# ll /dev/raw/raw*
crw-rw---- 1 root disk 162, 1 Sep 18 18:41 /dev/raw/raw1
crw-rw---- 1 root disk 162, 2 Sep 18 18:41 /dev/raw/raw2
The permissions had been changed on purpose: OCR kept running and holding the raw device busy, which is why dd could not read it.
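A quick way to spot this kind of drift is to compare the device node against the ownership it is supposed to have. The following is a minimal sketch, assuming the intended state is `oracle:dba` with mode 660, as in the earlier listing of /dev/raw/raw2. `check_perms` is a hypothetical helper and relies on GNU `stat -c`.

```shell
# Hypothetical helper: compare a path's owner, group and mode against
# an expected "owner:group mode" string. Uses GNU stat (-c), as on Linux.
check_perms() {
    path=$1
    want=$2          # e.g. "oracle:dba 660"
    have=$(stat -c '%U:%G %a' "$path") || return 1
    if [ "$have" = "$want" ]; then
        echo "ok: $path is $have"
    else
        echo "mismatch: $path is $have, expected $want"
        return 1
    fi
}

# Example (not run here):
# check_perms /dev/raw/raw1 "oracle:dba 660"
```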
Temporarily disable CRSD autostart:
[root@node1 opt]# vi /etc/inittab
# Run xdm in runlevel 5
x:5:respawn:/etc/X11/prefdm -nodaemon
#h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
#h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
#h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null
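The same edit can be made without an interactive editor, which also leaves a backup to roll back to. The following is a minimal sketch, assuming the three Oracle entries carry the ids h1, h2 and h3 as shown above. `disable_crs_respawn` is a hypothetical helper.

```shell
# Hypothetical helper: comment out the h1/h2/h3 respawn entries in an
# inittab-style file, keeping a .bak copy of the original.
# Lines that are already commented are left untouched.
disable_crs_respawn() {
    file=$1
    cp -p "$file" "$file.bak" || return 1
    sed 's/^\(h[123]:\)/#\1/' "$file.bak" > "$file"
}

# Example (not run here):
# disable_crs_respawn /etc/inittab
```

Restoring the original is then a matter of copying the `.bak` file back.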
A colleague pointed out that the partitioning was still wrong:
Disk /dev/sdc: 107 MB, 107374080 bytes
64 heads, 32 sectors/track, 102 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Device Boot Start End Blocks Id System
/dev/sdc1 102 102 1024 83 Linux
Disk /dev/sdd: 107 MB, 107374080 bytes
64 heads, 32 sectors/track, 102 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Device Boot Start End Blocks Id System
/dev/sdd1 * 1 102 104432 83 Linux
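The two listings differ in the Start column: sdc1 covers only cylinder 102, while sdd1 covers cylinders 1 to 102. The following is a minimal sketch of the arithmetic behind the Blocks column. `part_blocks` is a hypothetical helper, and the first-track adjustment is an assumption about fdisk's DOS-compatible layout that happens to reproduce both figures.

```shell
# Hypothetical helper: expected size in 1 KiB blocks of a partition spanning
# cylinders START..END, given the heads and sectors/track that fdisk reports.
# Assumption: a partition starting at cylinder 1 skips the first track,
# which holds the partition table in fdisk's DOS-compatible layout.
part_blocks() {
    start=$1
    end=$2
    heads=$3
    sectors=$4
    cyl_bytes=$((heads * sectors * 512))
    bytes=$(( (end - start + 1) * cyl_bytes ))
    if [ "$start" -eq 1 ]; then
        bytes=$((bytes - sectors * 512))
    fi
    echo $((bytes / 1024))
}

part_blocks 102 102 64 32   # prints 1024   (the /dev/sdc1 row)
part_blocks 1 102 64 32     # prints 104432 (the /dev/sdd1 row)
```

So sdc1 is a single 1 MiB cylinder at the very end of a 107 MB disk.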
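Repartition: fdisk /dev/sdc
Re-import the raw contents from the previously exported raw1.file.
Run again: ocrconfig -restore /opt/oracle/crshome/product/10.2.0/db_1/cdata/crs/backup00.ocr
No response. Infuriating!
Rebooting the system did not help either….
The next day I decided to work on node 2. Node 2 reported the same error, caused by a disk having been added on SCSI bus 0, which changed the device names.
Unlike node 1, it had not been through my two colleagues' hands.
Node 2 booted.
Edit /etc/sysconfig/rawdevices: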
[root@node2 ~]# cat /etc/sysconfig/rawdevices
# This file and interface are deprecated.
# Applications needing raw device access should open regular
# block devices with O_DIRECT.
# raw device bindings
# format: <rawdev> <major> <minor>
# <rawdev> <blockdev>
# example: /dev/raw/raw1 /dev/sda1
# /dev/raw/raw2 8 5
/dev/raw/raw1 /dev/sdc1
/dev/raw/raw2 /dev/sdd1
[root@node2 ~]# service rawdevices restart
After that, OCR still did not work; rebooting the system fixed it.
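Since shifted device names were the root of the problem, it is worth listing the active bindings and confirming that each target still exists after a disk is added. The following is a minimal sketch. `list_bindings` and `check_bindings` are hypothetical helpers, and they handle only the `<rawdev> <blockdev>` form of the file, not the `<rawdev> <major> <minor>` form.

```shell
# Hypothetical helper: print the active "<rawdev> <blockdev>" bindings from a
# rawdevices-style file, skipping comments and blank lines.
list_bindings() {
    awk '!/^[[:space:]]*#/ && NF >= 2 { print $1, $2 }' "$1"
}

# Hypothetical helper: report bindings whose target is not a block device.
check_bindings() {
    list_bindings "$1" | while read -r rawdev blockdev; do
        [ -b "$blockdev" ] || echo "missing block device for $rawdev: $blockdev"
    done
}

# Example (not run here):
# check_bindings /etc/sysconfig/rawdevices
```

This only confirms that the target exists. Whether it is the right disk still has to be checked against its size in `fdisk -l`.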
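crsctl check crs: all three daemons were OK.
crs_stat -t: all node 2 resources were OK.
My idea was that node 2 would automatically repair the contents of the OCR disk, so that node 1 could read valid OCR content and start successfully.
I shut down node 2:
crsctl stop crs (the virtual machine was rather busy)
I started node 1 and nothing had changed: OCR wrote no logs under /tmp or the client directory, and there were no CRS logs either.
Infuriating. Had the OCR program itself been damaged? Surely not. I started node 2 and compared the files one by one:
ll /dev/raw/raw* (permissions)
cat /etc/sysconfig/rawdevices (device names)
Today I brought along the book 大話RAC and turned to page 163, the OCR tools part of chapter 6, where it covers configuring whether the CRS stack starts automatically.
It says the crsctl disable crs command actually modifies the following file: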
/etc/oracle/scls_scr/dbp/root/crsstart
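Note: replace dbp with node1.
Comparing the file on the two nodes showed that node 2 had enable and node 1 had disable.
I remembered that a colleague had asked me to stop CRS from starting automatically on node 1. So I changed the value to enable and rebooted node 1.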
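Reading this flag on both nodes would have exposed the difference much earlier. The following is a minimal sketch that prints the recorded setting for a node, assuming the file layout quoted above with the host name in place of dbp. `crs_autostart` is a hypothetical helper, and the optional second argument exists only so it can be pointed at a copy of the directory tree.

```shell
# Hypothetical helper: print the autostart setting recorded for a node.
# The base directory defaults to /etc/oracle/scls_scr, per the path in the text.
crs_autostart() {
    node=$1
    base=${2:-/etc/oracle/scls_scr}
    file="$base/$node/root/crsstart"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # strip the trailing newline and any stray whitespace, then print one line
    tr -d '[:space:]' < "$file"
    echo
}

# Example (not run here), once on each node:
# crs_autostart node1
# crs_autostart node2
```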
Checking with ps, there is no longer just /etc/init.d/init.crsd run but a whole list of processes:
[root@node1 ~]# ps -ef | grep crs*
root 3392 1 0 15:38 ? 00:00:00 crond
root 3427 1 0 15:38 ? 00:00:00 anacron -s
root 4045 1 0 15:38 ? 00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /opt/oracle/crshome/product/10.2.0/db_1/log/node1/evmd; exec /opt/oracle/crshome/product/10.2.0/db_1/bin/evmd '
root 4052 1 1 15:38 ? 00:00:08 /opt/oracle/crshome/product/10.2.0/db_1/bin/crsd.bin reboot
oracle 4773 4045 0 15:39 ? 00:00:01 /opt/oracle/crshome/product/10.2.0/db_1/bin/evmd.bin
root 4890 4752 0 15:39 ? 00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/crshome/product/10.2.0/db_1/log/node1/cssd;
[root@node1 ~]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
[root@node1 ~]# su - oracle
[oracle@node1 ~]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....C1.inst application ONLINE ONLINE node1
ora....C2.inst application ONLINE ONLINE node2
ora.MYRAC.db application ONLINE ONLINE node2
ora....SM1.asm application ONLINE ONLINE node1
ora....E1.lsnr application ONLINE ONLINE node1
ora.node1.gsd application ONLINE ONLINE node1
ora.node1.ons application ONLINE ONLINE node1
ora.node1.vip application ONLINE ONLINE node1
ora....SM2.asm application ONLINE ONLINE node2
ora....E2.lsnr application ONLINE ONLINE node2
ora.node2.gsd application ONLINE ONLINE node2
ora.node2.ons application ONLINE ONLINE node2
ora.node2.vip application ONLINE ONLINE node2
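Summary
1 When adding a disk, watch out for device names changing.
2 When partitioning, pay attention to the start and end cylinders, especially when fdisk shows 1 at both prompts while creating the partition.
3 OCR is initialized before CRS; if OCR cannot start, CRS cannot start either.
4 My two colleagues know the commands well and work fast, which makes it very easy to overlook details in the output.
5 Do not change CRS settings by trial and error, especially before the problem has been precisely located.
6 Record every change by hand, in a notebook or a Word document, because repeated changes and trial and error can easily damage the environment.
7 This bug took a week. I asked many people for help. The ones who made a real difference were two good colleagues, while people in the chat group supplied commands and file names, which at least made me more familiar with some Linux commands and configuration files. So when you cannot solve something alone, get some sleep or ask someone else. As the saying goes, the onlooker sees more clearly than the player. After too long at a problem your head gets foggy and your eyes tire, and important information and hints slip past.
8 Fortunately this was a virtual machine. On a production system, where the problem has to be fixed quickly amid noise, pressure, and stuffy heat, I probably could not have solved it. Under pressure, trial and error might well have caused even more problems.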
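Point 6 can be partly automated. The following is a minimal sketch of a wrapper that appends each command to a log file before running it. `logged` and `CHANGE_LOG` are hypothetical names, and this supplements a written record rather than replacing it, since edits made inside an editor are not captured.

```shell
# Hypothetical change log: defaults to a file in the home directory.
CHANGE_LOG=${CHANGE_LOG:-$HOME/crs_changes.log}

# Hypothetical helper: record timestamp, user, host and command line,
# then run the command unchanged and return its exit status.
logged() {
    printf '%s %s@%s: %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(id -un)" "$(uname -n)" "$*" \
        >> "$CHANGE_LOG"
    "$@"
}

# Example (not run here):
# logged service rawdevices restart
```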