Fault description
Two hard drives in an HP MSA2000 FC storage array's RAID5 group were damaged and went offline, and only one hot spare activated successfully, so the RAID5 group collapsed and the LUNs on top of it stopped working. The user then contacted North Asia Data Recovery Center. The whole storage space consists of 8 x 450GB SAS hard disks: 7 of them form a RAID5 group and the remaining 1 is configured as a hot spare.
The storage became unavailable because some of the disks in the RAID group went offline. After receiving the disks, we first ran a physical inspection of every drive and found no physical faults. We then scanned each disk for bad sectors with a bad-sector detection tool and again found nothing.
Solution:
1. Back up the data
Considering the safety and recoverability of the data, all source disks must be backed up before any recovery work starts, in case the data cannot be recovered a second time for some other reason. Every disk is imaged to a file with the dd command or with WinHex.
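As a rough illustration of what this step does (dd with conv=noerror,sync or WinHex would normally be used), here is a minimal Python sketch that copies a raw device into an image file and zero-fills unreadable regions; the device and image paths are placeholders.

```python
# Minimal imaging sketch: copy a raw disk device to an image file, padding unreadable
# regions with zeros (roughly what `dd conv=noerror,sync` does). Paths are placeholders.
CHUNK = 1024 * 1024  # 1 MiB per read

def image_disk(device_path, image_path, chunk=CHUNK):
    pos = 0
    with open(device_path, "rb", buffering=0) as src, open(image_path, "wb") as dst:
        while True:
            try:
                src.seek(pos)
                data = src.read(chunk)
            except OSError:
                data = b"\x00" * chunk          # unreadable region: pad and move on
            if not data:
                break                           # end of device
            dst.write(data)
            pos += chunk

if __name__ == "__main__":
    image_disk("/dev/rdsk/c0t0d0", "/backup/disk0.img")
```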
2. Analyze the cause of the failure
Since the first two checks found neither physical faults nor bad sectors, we can infer that unstable reads and writes on some disks triggered the failure. The HP MSA2000 controller applies a very strict disk-health policy: as soon as a disk's performance becomes unstable, the controller treats it as a bad disk and kicks it out of the RAID group. Once the number of dropped disks exceeds what the RAID level can tolerate, the RAID group becomes unusable and every LUN built on top of it becomes unavailable as well. According to the initial assessment, 6 LUNs are built on this RAID group, all of them presented to HP-UX servers and managed as LVM logical volumes at the upper layer; the important data is an Oracle database and an OA (office automation) server.
3. Analyze the RAID group structure
The LUNs on the HP MSA2000 are built on RAID groups, so the parameters of the underlying RAID group have to be analyzed first, and the original RAID group is then reconstructed from them. While examining each data disk we found that the data on disk 4 differed from that on the other data disks, so we initially took it to be the hot spare. We then analyzed the remaining data disks, traced how the Oracle data pages are distributed across each disk, and from that distribution derived the key RAID group parameters: stripe (chunk) size, disk order, and parity rotation / data direction.
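Once those parameters are known, they define a fixed mapping from an offset in the virtual RAID volume to a position on one of the member disks. The short sketch below shows that mapping for a left-symmetric RAID5 layout; the 64KB stripe-unit size and the rotation scheme are illustrative assumptions, not the values recovered in this case.

```python
# How the derived parameters are used: map a byte offset of the virtual RAID-5 volume to
# (member disk index, byte offset on that disk). A left-symmetric parity rotation and a
# 64 KiB stripe unit are assumed here purely for illustration.
CHUNK = 64 * 1024        # stripe-unit size (assumption)
NDISKS = 7               # members of the RAID-5 group

def map_raid5_offset(vol_offset, chunk=CHUNK, ndisks=NDISKS):
    chunk_no, within = divmod(vol_offset, chunk)      # which data chunk, offset inside it
    stripe, pos = divmod(chunk_no, ndisks - 1)        # stripe number, data position in stripe
    parity_disk = ndisks - 1 - (stripe % ndisks)      # parity rotates backwards
    data_disk = (parity_disk + 1 + pos) % ndisks      # data starts right after the parity disk
    return data_disk, stripe * chunk + within

# Example: where does byte offset 1 GiB of the virtual volume live?
print(map_raid5_offset(1 << 30))
```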
4. Analyze the offline disks in the RAID group
Using the RAID parameters derived above, we tried to virtualize the original RAID group with North Asia's in-house RAID virtualization program. Because two drives had dropped out of the RAID group, the order in which they went offline had to be determined first. Careful comparison of the data on each drive showed that, on the same stripe, one drive's data clearly differed from the others, so we tentatively took it to be the first drive to drop. Checking those stripes with North Asia's in-house RAID verification program showed that this interpretation produced the most consistent data, which confirmed which drive went offline first.
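One way this kind of check can be automated (a hedged sketch, not the in-house tool itself): build the two possible member sets, one per hypothesis about which drive dropped first, and measure how often RAID5 parity actually holds; the set that still contains the long-stale drive violates parity on far more stripes. All image names, slot positions and sizes below are illustrative.

```python
# Compare the two "which drive dropped first" hypotheses by parity consistency.
from functools import reduce

CHUNK = 64 * 1024   # stripe-unit size from step 3 (assumption)

def zero_parity_ratio(images, chunk_size=CHUNK, stripes_to_test=4096):
    """Fraction of tested stripes whose member chunks XOR to zero."""
    files = [open(p, "rb") for p in images]
    try:
        zero = tested = 0
        for _ in range(stripes_to_test):
            chunks = [f.read(chunk_size) for f in files]
            if any(len(c) < chunk_size for c in chunks):
                break
            tested += 1
            parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)
            zero += not any(parity)
        return zero / tested if tested else 0.0
    finally:
        for f in files:
            f.close()

# Hypothesis A: the slot-2 drive dropped first (spare rebuilt slot 2, slot-5 drive is current).
set_a = ["/backup/disk0.img", "/backup/disk1.img", "/backup/spare.img",
         "/backup/disk3.img", "/backup/disk4.img", "/backup/offline_slot5.img",
         "/backup/disk6.img"]
# Hypothesis B: the slot-5 drive dropped first (slot-2 drive is current, spare stands in for slot 5).
set_b = ["/backup/disk0.img", "/backup/disk1.img", "/backup/offline_slot2.img",
         "/backup/disk3.img", "/backup/disk4.img", "/backup/spare.img",
         "/backup/disk6.img"]

for name, members in (("A", set_a), ("B", set_b)):
    print(name, zero_parity_ratio(members))
# The hypothesis with the clearly higher ratio matches reality; the drive it leaves out
# of the member set is the one that went offline first.
```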
5. Analyze the LUN information in the RAID group
Because the LUNs are built on the RAID group, the latest state of the RAID group has to be virtualized from the information analyzed above. We then analyzed how the LUNs are allocated within the RAID group and extracted the data-block allocation map of each LUN. Since there are 6 LUNs at the bottom layer, a block-distribution map has to be extracted for each of them. We then wrote a program for this, parsed the data maps of all the LUNs, and exported the data of every LUN according to its map.
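The MSA2000's LUN-mapping metadata is proprietary, so the sketch below assumes the allocation map has already been decoded into a list of extents; it only illustrates the final export step of copying each mapped extent out of the virtualized RAID volume into a per-LUN image. Paths, sizes and the example map are made up.

```python
# Export one LUN from the virtualised RAID-group image, given an already-decoded
# allocation map. Each map entry is (offset_in_lun, offset_in_raid_volume, length),
# all in bytes. Paths, sizes and the example map are illustrative.
def export_lun(raid_image, lun_image, lun_map, lun_size):
    with open(raid_image, "rb") as raid, open(lun_image, "wb") as lun:
        lun.truncate(lun_size)                  # pre-size so unmapped holes stay zeroed
        for lun_off, raid_off, length in lun_map:
            raid.seek(raid_off)
            lun.seek(lun_off)
            lun.write(raid.read(length))

# Hypothetical map with two 64 MiB extents of a 45 GB LUN.
example_map = [(0, 4 * 1024 ** 3, 64 * 1024 ** 2),
               (64 * 1024 ** 2, 9 * 1024 ** 3, 64 * 1024 ** 2)]
export_lun("/backup/raid_virtual.img", "/backup/lun0.img", example_map, 45 * 10 ** 9)
```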
6. Parse the LVM logical volumes
Parsing the exported LUNs showed that all of them contain HP-UX LVM logical-volume information. We tried to parse the LVM metadata on each LUN and found three volume groups: a 45GB group divided into one LV, which holds the OA server-side data; a 190GB group divided into one LV, which holds temporary backup data; and the remaining 4 LUNs, which make up a volume group of roughly 2.1TB containing a single LV that holds the Oracle database files. We wrote a program to interpret the LVM metadata and tried to extract the LV volumes in each volume group, but the interpreter reported errors.
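Conceptually, an LV is a sequence of logical extents, each one mapped to a physical extent on one of the physical volumes (here, the exported LUN images). The sketch below only shows that stitching step; the extent size, the data-area offset and the LE-to-PE map are placeholders standing in for values that have to come from the parsed HP-UX LVM metadata.

```python
# Rebuild one LV by concatenating its physical extents in logical-extent order.
# PE size, data_start and the LE->PE map are placeholders for values taken from the
# parsed HP-UX LVM metadata (the default PE size is commonly 4 MiB).
PE_SIZE = 4 * 1024 * 1024

# le_map[i] = (LUN image acting as the physical volume, physical extent index) for LE i
le_map = [("/backup/lun2.img", 10), ("/backup/lun2.img", 11), ("/backup/lun3.img", 4)]

def stitch_lv(le_map, lv_image, pe_size=PE_SIZE, data_start=0):
    # data_start = byte offset of physical extent 0 on each PV (after the LVM headers)
    with open(lv_image, "wb") as out:
        for pv_path, pe_index in le_map:
            with open(pv_path, "rb") as pv:
                pv.seek(data_start + pe_index * pe_size)
                out.write(pv.read(pe_size))

stitch_lv(le_map, "/backup/lv_oracle.img")
```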
7. Repair the LVM logical volumes
We carefully analyzed the cause of the program errors: a development engineer debugged the point of failure, and a senior file-system engineer examined the recovered LUNs to check whether the LVM metadata itself had been corrupted when the storage went down. Careful inspection confirmed that the storage crash had indeed damaged the LVM metadata. We repaired the damaged areas by hand, updated the program accordingly, and re-parsed the LVM logical volumes.
8. Parse the VxFS file system
We set up an HP-UX environment, mapped the extracted LV volumes to it, and tried to mount the file systems. The mount failed, so we attempted to repair the VxFS file system with the "fsck -F vxfs" command, but the volume still could not be mounted afterwards. We suspected that some metadata of the underlying VxFS file system was corrupted and would have to be repaired by hand.
9. Repair the VxFS file system
We analyzed the extracted LVs carefully and checked the file system for completeness against the on-disk structures of VxFS. The analysis showed that the underlying VxFS file system really was damaged: the file system was performing I/O operations at the moment the storage went down, so some of its metadata files were never updated and were left corrupted. We repaired these corrupted metadata files by hand until the VxFS file system could be parsed correctly, then mapped the repaired LV volumes to HP-UX again and tried to mount the file systems; this time the mount succeeded without errors.
10. Recover all user files
After mounting the file systems on the HP-UX server, we backed up all user data to the designated disk space. The total amount of user data is about 1.2TB.
11. Verify that the database files are complete
We used Oracle's database-file verification tool dbv (DBVERIFY) to check the integrity of every database file and found no errors. We then ran North Asia's in-house (and stricter) Oracle database verification tool, which found that some database files and log files failed its consistency checks. Senior database engineers were assigned to repair those files, and the check was repeated until every file passed completely.
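A simple way to run the dbv pass over many datafiles is to loop over them and flag any file whose report shows failing or corrupt pages. The sketch below does that via subprocess; the datafile path, the 8K block size and the output-matching heuristic are assumptions (dbv's report wording can vary by Oracle version).

```python
# Run Oracle's DBVERIFY (dbv) over every restored datafile and flag files whose report
# shows failing or corrupt pages. Paths, the 8K block size and the output-matching
# heuristic are illustrative; dbv ships with the Oracle installation.
import glob
import subprocess

for datafile in sorted(glob.glob("/restore/oradata/*.dbf")):
    result = subprocess.run(["dbv", f"file={datafile}", "blocksize=8192"],
                            capture_output=True, text=True)
    # DBVERIFY prints counters such as "Total Pages Failing" / "Total Pages Marked Corrupt"
    bad = [line.strip() for line in result.stdout.splitlines()
           if ("Failing" in line or "Corrupt" in line) and not line.rstrip().endswith(": 0")]
    print(datafile, "OK" if not bad else "CHECK -> " + "; ".join(bad))
```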
12. Start the Oracle Database
Since the HP-UX environment we provide does not have this version of Oracle installed, we arranged with the user to bring the original production server to the North Asia Data Recovery Center, attached the recovered Oracle database to the HP-UX server from the original production environment, and attempted to start the Oracle database. The database started successfully.
13. Data validation
Together with the user, we started the Oracle database and the OA server and installed the OA client on a local laptop. The newest data records and the historical records were checked through the OA client, and the user arranged for staff from several departments to verify the data remotely. In the end the data was verified as correct and complete, and the recovery was successful.
Because the failure scene was preserved well and no risky operations were performed after the crash, the later data recovery was greatly helped. Although the recovery ran into quite a few technical bottlenecks, all of them were resolved, and the whole job was finished within the expected time. The user was satisfied with the recovered data, and every service, including the Oracle database and the OA server, started normally.
This article is from the "Zhang Yu (Data Recovery)" blog, please be sure to keep this source http://zhangyu.blog.51cto.com/197148/1891767