Environment:
Two HP ML570 servers running Linux AS 4.5 and Oracle 10g
The two servers form an Oracle RAC cluster and connect to an HP MSA1000 through a SAN switch.
Fault symptom:
The cabinet housing one of the Oracle RAC nodes lost power, and both RAC nodes went down at the same time. In addition, all four ocfs2 partitions mounted from the storage disappeared (for example, /dev/sda1 was gone and only /dev/sda remained) and could not be mounted. As a result, the Oracle services could not be started.
Fault analysis and troubleshooting:
The customer's database had not been backed up, so the repair had to be carried out with great care.
A. First, make sure the storage hardware and its connectivity are healthy.
B. Check that the OS is running normally and can access the storage.
C. Restore the lost partition table.
Because I had set up the partitions originally, I knew their number and sizes. The plan was therefore to re-create the partition table with exactly the same layout as before. Rewriting the partition table with an identical layout should not touch the data itself, but because the customer had no backup, the operation was still highly risky. It was, however, the only option available at the time.
D. After fdisk finishes, reboot the server.
A miracle occurred: the data was still there, and the services started normally.
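Before and after re-creating a partition table, it can help to look at what the first sector of the disk actually contains. The following is only a minimal sketch, not the procedure used in this case. It assumes the disk carries a classic MBR (DOS) partition table with 512-byte sectors, which the original note does not state; the function names `parse_mbr` and `read_first_sector` are made up for illustration. It only reads the disk and lists the four primary slots; logical partitions inside an extended partition are not followed.

```python
import struct
import sys

SECTOR_SIZE = 512        # classic sector size (assumption)
TABLE_OFFSET = 446       # the MBR partition table starts at byte 446
ENTRY_SIZE = 16          # four 16-byte primary entries
SIGNATURE = b"\x55\xaa"  # boot signature stored at bytes 510-511


def parse_mbr(sector):
    """Return (signature_ok, entries) for one 512-byte MBR sector.

    Each entry is a dict with the slot number, type byte, starting LBA,
    and sector count. Unused slots are skipped.
    """
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need a full 512-byte sector")
    signature_ok = sector[510:512] == SIGNATURE
    entries = []
    for index in range(4):
        start = TABLE_OFFSET + index * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # Layout: status (1 byte), CHS start (3, skipped), type (1),
        # CHS end (3, skipped), start LBA (4), sector count (4).
        _status, ptype, start_lba, num_sectors = struct.unpack("<B3xB3xII", raw)
        if ptype == 0 and num_sectors == 0:
            continue  # unused slot
        entries.append({
            "number": index + 1,
            "type": ptype,
            "start_lba": start_lba,
            "sectors": num_sectors,
        })
    return signature_ok, entries


def read_first_sector(path):
    """Read the first sector of a device or image file (read-only)."""
    with open(path, "rb") as handle:
        return handle.read(SECTOR_SIZE)


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, parts = parse_mbr(read_first_sector(sys.argv[1]))
    print("boot signature present:", ok)
    for part in parts:
        print(part)
    if not parts:
        print("no primary partition entries found")
```

Run against the affected device (for example /dev/sda, as root), an empty result would match the symptom described above, and a populated result after fdisk would confirm that the table was written back with the expected start sectors and sizes.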
Note: Nothing in this world is absolute, and nothing is guaranteed. Oracle RAC only provides redundancy across the two servers; it does not provide redundancy for the storage. We therefore recommend implementing a workable backup policy going forward.
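The recovery above worked only because the original partition layout was remembered. As a small complement to a real database backup policy (not a substitute for it), the first sector of each shared disk could be saved to a file while the system is healthy. The sketch below assumes an MBR-style disk with 512-byte sectors; `backup_first_sector` and the paths passed to it are hypothetical names chosen for illustration.

```python
import hashlib

SECTOR_SIZE = 512  # classic sector size (assumption)


def backup_first_sector(device_path, backup_path):
    """Copy the first 512 bytes of device_path to backup_path.

    Returns the SHA-256 hex digest of the saved sector so the copy
    can be verified later.
    """
    with open(device_path, "rb") as source:
        sector = source.read(SECTOR_SIZE)
    if len(sector) != SECTOR_SIZE:
        raise IOError("could not read a full sector from %s" % device_path)
    with open(backup_path, "wb") as target:
        target.write(sector)
    return hashlib.sha256(sector).hexdigest()
```

The saved file should be kept somewhere other than the shared storage itself; otherwise it is lost together with the disk it describes.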
However, there is one thing I have never figured out: only one RAC node lost power, so how could the partition table of the shared partitions on the storage be lost?