I. Fault description of the RAID data recovery case
The HP FC MSA2000 storage consists of eight 450 GB SAS hard disks: seven form a RAID 5 array and the remaining one serves as a hot spare. Two hard disks in the RAID 5 array failed, but only one hot spare was available to take over. As a result, the RAID 5 array went down and the upper-layer LUNs could no longer be used.
Because some disks in the RAID array were offline, the entire storage was unavailable. After receiving the disks, we performed a physical check on all of them and found no physical faults. We then ran a bad-sector detection tool on the disks and found no bad sectors.
II. RAID array data backup
For data safety and recoverability, all source data must be backed up before RAID data recovery begins, so that a failed recovery attempt cannot make the data unrecoverable. We used the dd command or the WinHex tool to image every disk to a file.
[Screenshot: part of the backed-up data]
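As an illustration of the imaging step only (this is not the tool or the output used in this case), the following minimal Python sketch copies a source disk to an image file block by block. The function name `image_disk`, the default block size, and the zero-fill policy for unreadable blocks are assumptions made for the example; dd and WinHex have their own options for handling read errors.

```python
import os


def image_disk(src_path, dst_path, block_size=1 << 20):
    """Copy src_path to dst_path block by block.

    Blocks that raise a read error are zero-filled in the image so that
    offsets stay aligned; their (offset, length) pairs are returned.
    """
    bad_regions = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(want)
            except OSError:
                # Unreadable block: record it and keep the image aligned.
                bad_regions.append((offset, want))
                data = bytes(want)
            if not data:
                break  # unexpected end of source
            dst.write(data)
            offset += len(data)
    return bad_regions
```

An empty return value means every block was read, which corresponds to the "no bad sectors" result described above. All later analysis would then work on the image files rather than on the original disks.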
III. RAID data recovery fault analysis
1. Analyze the causes of the RAID fault
Because the previous two steps found no physical faults or bad sectors on the disks, we inferred that some disks had been dropped because of unstable read/write performance. The HP MSA2000 controller applies a strict disk-check policy: once a disk performs unstably, the controller treats it as a bad disk and kicks it out of the RAID group. Once the number of dropped disks exceeds what the RAID level tolerates, the RAID group becomes unavailable, and so do the upper-layer LUNs built on it. Our initial investigation showed that there were 6 LUNs on the RAID group, all allocated to HP-UX minicomputers, with LVM logical volumes created on top of them. The important data were the Oracle database and the OA server.
2. Analyze the RAID group structure
The LUNs on the HP MSA2000 are built on RAID groups, so we first had to analyze the underlying RAID group and then reconstruct it from that information. Analyzing each disk showed that the data on disk 4 differed from the data on the other disks, so disk 4 was provisionally taken to be the hot spare. We then analyzed the remaining disks, examined how Oracle database pages were distributed across them, and from that distribution derived the key RAID group parameters: stripe size, disk order, and data direction.
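One property that helps when comparing member disks is that, in a healthy RAID 5 stripe, the XOR of the chunks at the same offset on all members is zero. The sketch below measures how often that holds for a given set of member images. It is an illustration under assumptions, not the analysis program used in this case: `xor_blocks` and `parity_ok_ratio` are hypothetical names, and the members are passed as in-memory byte strings for brevity.

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    length = len(blocks[0])
    acc = 0
    for block in blocks:
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(length, "big")


def parity_ok_ratio(members, chunk_size):
    """Fraction of non-empty chunks whose XOR across all members is zero."""
    total = ok = 0
    size = min(len(m) for m in members)
    for off in range(0, size - chunk_size + 1, chunk_size):
        chunks = [bytes(m[off:off + chunk_size]) for m in members]
        if not any(any(c) for c in chunks):
            continue  # all-zero chunks satisfy parity trivially: no evidence
        total += 1
        if not any(xor_blocks(chunks)):
            ok += 1
    return ok / total if total else 0.0
```

Running such a check over different subsets of the eight images is one way a disk that does not belong with the others (for example, an unused hot spare) could show up: the subset that leaves it out would score noticeably higher than the subsets that include it.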
3. Analyze the RAID group's offline disks
Using the RAID information obtained above, we tried to virtualize the original RAID group with a RAID virtualization program developed in-house by North Asia. Because two disks had dropped out of the RAID group, we had to determine the order in which they dropped. Careful analysis of each disk showed that the data on one disk differed significantly from the data on the others, so that disk was provisionally identified as the first to drop. A RAID verification program developed in-house by North Asia then confirmed that excluding this disk gave the best verification result, which established it as the first dropped disk.
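The idea of testing which disk is best left out can be sketched as follows. Parity alone cannot name the stale disk, because a stripe with one outdated member fails the XOR check no matter which member is excluded. The sketch therefore rebuilds each candidate from the XOR of the other members and counts how many rebuilt chunks pass a content check. Everything here is an assumption for illustration: the function names, the toy record format (payload plus a 4-byte SHA-256 prefix), and the validator. A real check would rely on the file format actually stored on the array, such as database block structure, and the in-house verification program is not public.

```python
import hashlib


def xor_bytes(blocks):
    """XOR equal-length byte strings together."""
    length = len(blocks[0])
    acc = 0
    for block in blocks:
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(length, "big")


def score_candidates(members, chunk_size, is_valid):
    """For each member, rebuild it from the others and count valid chunks.

    The member whose exclusion yields the most valid rebuilt chunks is the
    most likely stale (first-dropped) disk.
    """
    scores = {}
    for cand in range(len(members)):
        others = [bytes(m) for i, m in enumerate(members) if i != cand]
        rebuilt = xor_bytes(others)
        scores[cand] = sum(
            1
            for off in range(0, len(rebuilt), chunk_size)
            if is_valid(rebuilt[off:off + chunk_size])
        )
    return scores


# Toy record format (hypothetical): payload followed by a 4-byte digest.
def make_record(payload):
    return payload + hashlib.sha256(payload).digest()[:4]


def record_is_valid(chunk):
    return chunk[-4:] == hashlib.sha256(chunk[:-4]).digest()[:4]
```

The toy validator uses a non-linear digest on purpose. An XOR-linear check, such as a plain XOR checksum, would still pass on data that is itself an XOR combination of valid records, so it could not separate correctly rebuilt chunks from garbage.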
4. Analyze the LUN information in the RAID group
Because the LUNs are built on the RAID group, we first virtualized the most recent state of the RAID group from the analysis above. We then analyzed how the LUNs are laid out in the RAID group and the data block map allocated to each LUN. With 6 LUNs on the array, we needed to extract the data block map of each one. We wrote a program against this information to parse the data maps of all the LUNs and then exported each LUN's data according to its map.
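The two layers described here, a virtual RAID 5 volume and LUNs carved out of it by a block map, can be sketched as below. This is a sketch under stated assumptions: the left-symmetric parity rotation is only one common RAID 5 layout and is not confirmed for the MSA2000, the extent list format `(array_offset, length)` is hypothetical, and the member images are assumed to be bytes-like objects (for example memory-mapped image files).

```python
def locate(chunk_no, n_disks):
    """Map a logical data chunk number to (disk index, stripe number).

    Assumes a left-symmetric layout: parity moves from the last disk
    towards the first, and data starts on the disk after the parity disk.
    """
    stripe, k = divmod(chunk_no, n_disks - 1)
    parity = (n_disks - 1) - (stripe % n_disks)
    disk = (parity + 1 + k) % n_disks
    return disk, stripe


def raid5_read(members, chunk_size, offset, length):
    """Read `length` bytes at logical `offset` from the virtual array."""
    out = bytearray()
    while length > 0:
        chunk_no, within = divmod(offset, chunk_size)
        disk, stripe = locate(chunk_no, len(members))
        take = min(length, chunk_size - within)
        start = stripe * chunk_size + within
        out += members[disk][start:start + take]
        offset += take
        length -= take
    return bytes(out)


def export_lun(members, chunk_size, extents):
    """Concatenate (array_offset, length) extents, in LUN order."""
    return b"".join(
        raid5_read(members, chunk_size, off, ln) for off, ln in extents
    )
```

With a different parity rotation only `locate` would change. For large LUNs the export would be streamed to a file extent by extent instead of being joined in memory.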
IV. LVM logical volume parsing and VxFS file system repair
1. Parse the LVM logical volumes
Analysis of the exported LUNs showed that every LUN contained HP-UX LVM logical volume information. Parsing the LVM information in each LUN revealed three LVM volume groups in total. A 45 GB volume group held one LV storing the OA server data, and a 190 GB volume group held one LV storing temporary backup data. The remaining four LUNs formed a terabyte-scale volume group with a single LV storing the Oracle database files. We wrote a program to interpret the LVM metadata and tried to interpret the LV in each volume group, but the interpreter reported an error.
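Conceptually, an LVM logical volume is an ordered list of fixed-size extents, each stored on one of the physical volumes (here, the LUNs). The sketch below shows only that address translation. The extent map, the extent size, and the `data_start` offset are hypothetical inputs; in a real recovery they would have to be parsed from the HP-UX LVM metadata on the LUNs, whose on-disk format is not described in this article.

```python
def lv_to_pv(lv_offset, extent_map, extent_size, data_start):
    """Translate a byte offset in the LV to (pv index, byte offset on PV).

    extent_map[i] is (pv_index, physical_extent_number) for logical
    extent i; data_start is where extent 0 begins on each PV.
    """
    le, within = divmod(lv_offset, extent_size)
    pv_index, pe = extent_map[le]
    return pv_index, data_start + pe * extent_size + within


def read_lv(pvs, extent_map, extent_size, data_start, offset, length):
    """Read `length` bytes at `offset` of the LV from bytes-like PV images."""
    out = bytearray()
    while length > 0:
        pv_index, pv_offset = lv_to_pv(offset, extent_map, extent_size, data_start)
        take = min(length, extent_size - offset % extent_size)
        out += pvs[pv_index][pv_offset:pv_offset + take]
        offset += take
        length -= take
    return bytes(out)
```

For the volume group that spans four LUNs, `pvs` would hold the four exported LUN images and the extent map would interleave extents from all of them.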
2. Repair the LVM logical volumes
We analyzed the cause of the program error carefully: a development engineer debugged the location where the program failed, while senior file system engineers checked the recovered LUNs for LVM information damaged by the storage failure. The check confirmed that the LVM information had indeed been damaged by the storage failure. We manually repaired the damaged areas, updated the program accordingly, and re-parsed the LVM logical volumes.
3. Parse the VxFS file system
We built an HP-UX environment, mapped the interpreted LV to HP-UX, and tried to mount the file system. The mount failed with an error. We then tried to repair the VxFS file system with the "fsck -F vxfs" command, but the file system still could not be mounted after the repair. We suspected that some of the underlying VxFS metadata was damaged and that manual repair was required.
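Before any manual repair, a quick sanity check is to see whether an LV image still carries a recognizable VxFS superblock. The sketch below looks for the VxFS magic number at two candidate offsets. The constants are assumptions taken from the Linux freevxfs driver as best recalled (magic 0xA501FCF5, big-endian at byte offset 8192 for HP-UX volumes, little-endian at byte offset 1024 for x86 volumes) and should be checked against that source before being relied on; the function name is hypothetical.

```python
import struct

# Assumed values, based on the Linux freevxfs driver; verify before use.
VXFS_MAGIC = 0xA501FCF5
CANDIDATES = (
    (8192, ">I", "big-endian (HP-UX style)"),
    (1024, "<I", "little-endian (x86 style)"),
)


def find_vxfs_superblock(image):
    """Return (offset, description) of the first matching magic, or None."""
    for offset, fmt, label in CANDIDATES:
        raw = bytes(image[offset:offset + 4])
        if len(raw) == 4 and struct.unpack(fmt, raw)[0] == VXFS_MAGIC:
            return offset, label
    return None
```

A match only shows that the superblock header survived. As this case shows, other metadata can still be damaged even when the superblock looks intact.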
4. Repair the VxFS file system
We analyzed the parsed LV carefully and checked the integrity of the file system against the on-disk structure of VxFS. The analysis confirmed a problem in the underlying VxFS file system: the file system was performing I/O when the storage went down, so some file system metadata was left un-updated or damaged. We manually repaired the corrupted metadata so that the VxFS file system could be parsed normally. We then attached the repaired LV to the HP-UX minicomputer again and tried to mount the file system. This time the file system mounted successfully without errors.
V. Check the Oracle database files and start the database
1. Restore all user files
After the file system was mounted on the HP-UX machine, all user data was backed up to the designated disk space. The user data totals about 1.2 TB.
[Screenshot: part of the file directories]
2. Check whether database files are complete
We used the Oracle database file verification tool "DBV" to check whether each database file was complete, and it reported no errors. A stricter check with the Oracle database inspection tool developed in-house by North Asia found problems in some database files and log files. Senior database engineers repaired these files, and the verification was repeated until every file passed.
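Checking many data files with DBV is easy to script. The sketch below builds one `dbv` command line per data file using the documented `file`, `blocksize`, and `logfile` parameters and collects the tool's output. The helper names, the example path in the usage note, and the default block size of 8192 are assumptions; the block size must match the database's actual block size.

```python
import subprocess


def dbv_command(datafile, block_size=8192, logfile=None):
    """Build the argument list for one DBVERIFY run."""
    cmd = ["dbv", f"file={datafile}", f"blocksize={block_size}"]
    if logfile:
        cmd.append(f"logfile={logfile}")
    return cmd


def run_dbv(datafiles, block_size=8192, runner=subprocess.run):
    """Run DBV on each data file and return {path: combined output text}."""
    reports = {}
    for path in datafiles:
        done = runner(
            dbv_command(path, block_size), capture_output=True, text=True
        )
        # Keep both streams, since the report may arrive on either one.
        reports[path] = (done.stdout or "") + (done.stderr or "")
    return reports
```

For example, `dbv_command("/restore/oradata/system01.dbf")` (a hypothetical path) yields `["dbv", "file=/restore/oradata/system01.dbf", "blocksize=8192"]`. The collected reports would then be reviewed for pages that DBV marks as corrupt.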
3. Start the Oracle database
Because the HP-UX environment we provided did not have this version of Oracle installed, we arranged with the users to bring the original production environment to the North Asia data recovery center. We then attached the recovered Oracle database to the HP-UX server from the original production environment and tried to start it. The Oracle database started successfully.
[Screenshot: partial view of the result]
VI. Data verification
With the users' help, we started the Oracle database and the OA server and installed the OA client on a local notebook. We used the OA client to verify the latest data records and the historical data records, and the users arranged remote verification by staff from different departments. The final verification confirmed that the data was correct and complete, and that the recovery was successful.
VII. Data recovery conclusions
Because the on-site environment was well preserved after the fault and no risky operations were performed, the later data recovery work was much easier. Although several technical bottlenecks came up during the recovery, they were all resolved one by one. The entire recovery was completed within the expected time, and the users were satisfied with the recovered data. The Oracle database service, the OA server, and the other services can all be started normally.