Fault Description:
An EMC FC AX-4 storage array at a hospital in Beijing crashed. Two hard drives in its RAID5 array failed, but only one hot spare was successfully activated, so the RAID5 array went down and the LUN built on top of it stopped working. The storage consists of twelve 1TB SATA hard disks: ten form the RAID5 array, and the remaining two are configured as hot spares.
Since neither the physical inspection nor the bad-sector scan of the disks found a fault, it can be inferred that unstable reads and writes on some disks caused the failure. The EMC controller applies a strict disk-checking policy: once a disk's performance becomes unstable, the controller treats it as a bad disk and kicks it out of the RAID group. Once the RAID group loses more disks than its RAID level can tolerate, the group becomes unusable and the LUNs built on it become unavailable. According to the initial information, the RAID group holds only one LUN, which is allocated to a Sun minicomputer, and the upper-level file system is ZFS.
Resolution process
1. Hard Drive detection
The entire storage became unavailable because some disks went offline. Therefore, after the disks were received, all of them were inspected physically, and no physical failure was found. They were then scanned with a bad-sector detection tool, and no bad sectors were found.
2. Backup Data
With the safety and recoverability of the data in mind, all source data must be backed up before any recovery is attempted, in case the data cannot be recovered a second time for other reasons. All disks were imaged to files with WinHex. Because the source disks use 520-byte sectors, a dedicated tool is also needed to convert all of the backed-up data from 520-byte to 512-byte sectors.
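The 520-to-512 conversion can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in this case: it assumes the 8 extra bytes sit at the end of every 520-byte sector, and the function name `strip_sector_trailers` is hypothetical.

```python
import io

RAW_SECTOR = 520   # sector size on the source disks, per the text
DATA_SECTOR = 512  # sector size expected by ordinary analysis tools

def strip_sector_trailers(src, dst, sectors_per_read=2048):
    """Copy src to dst, keeping the first 512 bytes of each 520-byte sector.

    Assumption: the 8 extra bytes are a trailer at the end of each sector.
    src must return full reads until end of file (regular files and
    BytesIO do). Returns the number of sectors converted.
    """
    count = 0
    while True:
        chunk = src.read(RAW_SECTOR * sectors_per_read)
        if not chunk:
            break
        # Ignore a trailing partial sector at the very end of the image.
        usable = len(chunk) - len(chunk) % RAW_SECTOR
        for off in range(0, usable, RAW_SECTOR):
            dst.write(chunk[off:off + DATA_SECTOR])
            count += 1
    return count

if __name__ == "__main__":
    # Three fake sectors: 512 payload bytes followed by 8 trailer bytes.
    raw = b"".join(bytes([i]) * 512 + b"\xEE" * 8 for i in range(3))
    out = io.BytesIO()
    print(strip_sector_trailers(io.BytesIO(raw), out), len(out.getvalue()))
```

For real images, `src` and `dst` would be files opened in binary mode, and the conversion should always run on the backup copies, never on the source disks.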
3. Analyze RAID Group structure
LUNs on EMC storage are built on RAID groups, so the underlying RAID group must be analyzed first and then reconstructed from the information obtained. Analysis of each disk showed that disks 8 and 11 contain no data. The management interface shows that disks 8 and 11 are both hot spares, and that hot spare 8 had replaced the failed disk 5. It can therefore be concluded that although hot spare 8 was activated successfully, the RAID5 group was by then missing another disk, so the data was never synchronized to disk 8. The other 10 disks were then analyzed to determine the data distribution pattern, the RAID stripe size, and the order of the disks.
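Once the stripe size, disk order, and parity rotation are known, any logical offset in the RAID group can be mapped to a disk and an offset on that disk. The sketch below assumes a left-symmetric RAID5 layout purely for illustration; the text does not state which rotation the AX-4 uses, and the function names are hypothetical.

```python
def locate_chunk(logical_chunk, n_disks):
    """Map a logical data chunk to (row, data_disk, parity_disk).

    Assumption: left-symmetric RAID5. Parity starts on the last disk and
    moves one disk to the left on each row; data chunks start on the disk
    right after the parity disk and wrap around.
    Disk numbers are positions in the analyzed disk order.
    """
    row, k = divmod(logical_chunk, n_disks - 1)
    parity_disk = (n_disks - 1) - (row % n_disks)
    data_disk = (parity_disk + 1 + k) % n_disks
    return row, data_disk, parity_disk

def locate_byte(offset, n_disks, chunk_size):
    """Map a logical byte offset to (disk, byte offset on that disk)."""
    chunk, within = divmod(offset, chunk_size)
    row, disk, _ = locate_chunk(chunk, n_disks)
    return disk, row * chunk_size + within

if __name__ == "__main__":
    # Print the layout of the first rows of a 4-disk example.
    for chunk in range(6):
        print(chunk, locate_chunk(chunk, n_disks=4))
```

A different rotation or a delayed-parity layout would change only `locate_chunk`; the analysis in this step is what determines which variant applies.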
4. Analyzing the offline disks in the RAID group
Based on the RAID information obtained above, an attempt was made to virtualize the original RAID group with a RAID virtualization program developed by North Asia. However, because two disks in total dropped out of the RAID group, the order in which they dropped must be determined. Careful analysis of the data on each disk showed that, within the same stripe, the data on one disk was clearly different from the data on the others, so this disk was tentatively judged to be the first to drop offline. The stripe was then checked with North Asia's self-developed RAID verification program, which showed that the data was most consistent when that disk was excluded, confirming that it was the disk that dropped first.
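The parity reasoning behind this step can be illustrated with a short sketch. In RAID5 the XOR of all members of a stripe is zero, so a stripe that contains stale data from the first dropped disk fails the check, and the missing member can be recomputed from the others. This is not North Asia's verification program; the function names are hypothetical, and the XOR check alone only shows that a stripe is inconsistent, not which member is stale.

```python
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def stripe_is_consistent(chunks):
    """True if data and parity chunks of one stripe XOR to all zeros."""
    return not any(xor_chunks(chunks))

def rebuild_member(chunks, missing_index):
    """Recompute one member of a stripe from all the other members."""
    others = [c for i, c in enumerate(chunks) if i != missing_index]
    return xor_chunks(others)

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]
    stripe = data + [xor_chunks(data)]          # append the parity chunk
    print(stripe_is_consistent(stripe))         # current stripe

    stale = list(stripe)
    stale[1] = b"old!"                          # member 1 dropped out earlier
    print(stripe_is_consistent(stale))          # stale data breaks parity
    print(rebuild_member(stale, 1))             # recomputed from the others
```

Running the consistency check over many stripes, with and without the suspect disk, gives the kind of evidence described above for deciding which disk dropped first.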
5. Analyzing LUN information in a RAID group
Because the LUN is built on the RAID group, the RAID group must first be reconstructed from the information analyzed above. The LUN's allocation information within the RAID group, and the data block map that describes that allocation, are then analyzed. Since there is only one LUN on the group, only that LUN needs to be parsed. Using this information, the North Asia RAID recovery program (datahf.net) interprets the LUN's data map and exports all of the LUN's data.
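Exporting a LUN from its data map amounts to copying the mapped regions of the reconstructed RAID group in LUN order. The sketch below assumes a hypothetical map format, a list of `(raid_offset, length)` extents; the real EMC on-disk map is not described in the text, and `export_lun` is an illustrative name.

```python
import io

def export_lun(raid, extents, out, bufsize=1 << 20):
    """Copy the extents of one LUN from the virtual RAID group to out.

    raid    -- seekable binary stream over the reconstructed RAID group
    extents -- hypothetical data map: (raid_offset, length) in LUN order
    Returns the number of bytes written.
    """
    total = 0
    for start, length in extents:
        raid.seek(start)
        remaining = length
        while remaining:
            block = raid.read(min(bufsize, remaining))
            if not block:
                raise EOFError(f"extent at {start} runs past the end of the RAID image")
            out.write(block)
            remaining -= len(block)
            total += len(block)
    return total

if __name__ == "__main__":
    raid = io.BytesIO(b"0123456789ABCDEF")
    lun = io.BytesIO()
    export_lun(raid, [(8, 4), (0, 4)], lun)
    print(lun.getvalue())
```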
6. Interpreting the ZFS file system and repairing
The generated LUN was interpreted with the ZFS file system interpreter developed in-house by North Asia Data Recovery (datahf.net), but the program reported errors while interpreting some of the file system metadata files. Development engineers were immediately assigned to debug the program and find the cause of the errors, and file system engineers were assigned to check whether the ZFS version was simply not supported by the program. After up to 7 hours of analysis and debugging, it was found that the sudden storage failure had corrupted some of the ZFS metadata files, which prevented the interpreter from parsing the file system properly.
This analysis makes it clear that the storage failure damaged some ZFS metadata files, so these must be repaired before the file system can be parsed properly. Analysis of the corrupted metadata files showed that IO operations were in progress on the ZFS file system when the storage failed, so some metadata files were not fully updated and were left corrupted. These metadata files were repaired manually so that the ZFS file system could be parsed properly.
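One general way to flag damaged ZFS blocks is to recompute a block's checksum and compare it with the value stored in the block pointer that references it. The sketch below implements Fletcher-4, one of the checksum algorithms ZFS supports. Whether this pool used Fletcher-4, which byte order applies, and whether the engineers located the damage this way are all assumptions, not facts from this case.

```python
import struct

MASK64 = (1 << 64) - 1

def fletcher4(data, byteorder="<"):
    """Fletcher-4 over 32-bit words with four 64-bit running sums.

    byteorder is a struct prefix: "<" little-endian, ">" big-endian.
    Assumption: the caller knows the byte order the pool was written in.
    """
    if len(data) % 4:
        raise ValueError("block length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack(byteorder + "I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d

def block_is_intact(data, stored_checksum, byteorder="<"):
    """Compare a recomputed checksum with the four stored 64-bit words."""
    return fletcher4(data, byteorder) == tuple(stored_checksum)

if __name__ == "__main__":
    block = struct.pack("<2I", 1, 2)
    print(fletcher4(block))
```

A mismatch only shows that a block differs from what its parent expected; deciding how to repair it still requires the manual analysis described above.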
7. Export all data
Use the program to parse the repaired ZFS file system and parse all the file nodes and directory structures. Some of the file directories are as follows:
[Image: 2-2.jpg]
8. Verifying the latest data
Because the data are all either text files or DCM images, verification does not require building an elaborate environment. Engineers on the user side verified a sample of the data; the results showed no problems, and the data was complete. Some of the verified files are shown below:
[Image: 2-3.jpg]
[Image: 2-4.jpg]
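A quick automated sanity check can complement the manual verification of the DCM images. The sketch below assumes the files are DICOM Part 10 files, which begin with a 128-byte preamble followed by the marker `DICM`. It only checks that marker; it does not validate image content, and the function name is hypothetical.

```python
import io

def looks_like_dicom(stream):
    """True if the stream has the DICOM Part 10 marker at offset 128."""
    stream.seek(128)
    return stream.read(4) == b"DICM"

if __name__ == "__main__":
    good = io.BytesIO(b"\x00" * 128 + b"DICM" + b"...dataset...")
    bad = io.BytesIO(b"zeroed or truncated file")
    print(looks_like_dicom(good), looks_like_dicom(bad))
```

For recovered files on disk, the stream would be the file opened in binary mode; files that fail the check are candidates for closer inspection rather than proof of damage.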
EMC FC AX-4 storage crash: data recovery process for a RAID5 array with failed hard drives