P570, hard disk fault.
The machine has two volume groups (VGs), rootvg and datavg. rootvg is not mirrored; datavg is mirrored. Both errpt and the HMC reported a problem with a hard disk, and the error code pointed to bad blocks on the disk. The goal was to fix the problem while saving as much user data as possible.
The first plan was to remove the datavg mirror and add the freed hdisk2 to rootvg as a mirror for the faulty disk. After a long synchronization, the LV status in rootvg showed every LV as syncd except hd1 (/home), which was stale. Then, to be safe, I started to tar the entire /home file system into a file system in datavg. Probably because of the bad blocks, this ran for a very long time with very high I/O wait. Commands such as df -g, iostat, and vmstat hung for a long time, and the system was effectively stuck, so I decided to wait until the next day to continue.
I checked the machine the next morning. lsvg -l rootvg showed that many LVs besides /home had also gone stale, and lsvg showed the faulty hdisk0 in the missing state, so nothing could be done with it. I tried again to tar some files and directories under /home into datavg and then re-create the /home file system. rmlv and rmfs both failed, reporting that with only the last good disk left the integrity of the VG could not be guaranteed. Unmirroring rootvg succeeded for the other LVs, but rmlvcopy failed with an error on the /home LV. lspv -M showed that only two LPs on hdisk1 were stale, while the two corresponding LPs on hdisk0 were good, so I tried to migrate those two good LPs from hdisk0 to hdisk1 with migratelp. The migration hung, and after I stopped it with Ctrl+C the PVs count of the /home LV had changed to 3, which was very strange. I then tried to remove hdisk0 from rootvg directly with reducevg, which failed with the same error as above. Replacing the disk in this state was not possible either. There was nothing left to do but restart the machine.
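The stale check described above (reading the LV STATE column of lsvg -l rootvg) can be automated. The following is only a minimal sketch: it assumes the usual seven-column lsvg -l layout, and the sample text and its numbers are made up for illustration, not captured from this machine. The function name stale_lvs is hypothetical.

```python
def stale_lvs(lsvg_output):
    """Return (lv_name, mount_point) pairs whose LV STATE contains 'stale'."""
    stale = []
    for line in lsvg_output.splitlines():
        fields = line.split()
        # Data rows are assumed to have 7 columns:
        # name, type, LPs, PPs, PVs, state, mount point.
        # The VG name line and the header line do not match and are skipped.
        if len(fields) != 7 or not fields[2].isdigit():
            continue
        name, state, mount = fields[0], fields[5], fields[6]
        # LV STATE looks like "open/syncd" or "open/stale".
        if "stale" in state.split("/"):
            stale.append((name, mount))
    return stale


# Illustrative sample only (invented values), in the assumed lsvg -l layout.
SAMPLE = """\
rootvg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
hd5                 boot       1       2       2    closed/syncd  N/A
hd4                 jfs2       4       8       2    open/syncd    /
hd1                 jfs2       40      80      2    open/stale    /home
"""

if __name__ == "__main__":
    for name, mount in stale_lvs(SAMPLE):
        print(f"{name} ({mount}) is stale")
```

Run regularly against real lsvg -l output, a check like this would flag the first stale LV early, before more of them go stale as happened here overnight.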
After the restart, the HMC showed error code 0552 during partition startup. Selecting a different hdisk as the boot device still produced 0552, so the only option was to boot the partition from the network. After that boot, the original hdisk0 was no longer visible, and importvg failed with a message that the VGDA information was faulty. Around the same time, the HMC used to manage these servers also failed: its graphical interface would not display at all, and the fault persisted after the HMC was restarted, so the partition had to be attached to another HMC. In the end, none of the methods I tried could restore the system, and the only choice was to reinstall AIX.
This case shows that important data must be backed up. Here, rootvg had no mirror, the server had no tape drive, and mksysb had never been used to back up the system. Even though this was a development and test machine, the Oracle data was placed directly under /home. Raw devices aside, the Oracle data should at least have lived on an LV on a different hard disk. Hardware faults are inevitable, but that is no excuse for having no backup. Do not rely on the hardware too much. Important data must be backed up.
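One simple way to notice that a backup has never been taken is to check that a backup image exists and is recent. The sketch below is an illustration under assumptions: the path /backup/rootvg.mksysb, the seven-day limit, and the function name backup_is_fresh are all hypothetical, and it presumes the mksysb image is written to a file on another disk rather than to tape.

```python
import os
import time


def backup_is_fresh(path, max_age_days=7, now=None):
    """Return True if path is a non-empty file modified within max_age_days."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    # Allow the caller to pass a fixed "now" so the check is easy to test.
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds <= max_age_days * 86400


if __name__ == "__main__":
    # Hypothetical location of a mksysb image kept on a disk outside rootvg.
    image = "/backup/rootvg.mksysb"
    if backup_is_fresh(image):
        print("backup OK")
    else:
        print("backup missing or too old")
```

This only checks that the file exists and how old it is. It does not verify that the image can actually be restored.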