EVA4400 storage: a virtual machine and database data recovery success story

Source: Internet
Author: User
Tags: create directory, dba


First, fault description


The EVA storage system consists of one EVA4400 controller, three EVA4400 expansion enclosures and 28 FC 300 GB hard drives. Two disks dropped out of the array, which left some LUNs unavailable and caused others to go missing, so the entire storage system became unusable. After receiving the disks, the North Asia engineers first checked all of them physically and found no physical failure. They then scanned the disks with a bad-sector detection tool and found no bad sectors. The bad-sector detection log is as follows:
Figure I:


Second, back up the data


To keep the data safe and recoverable, all source data must be backed up before any recovery work begins, so that an improper operation cannot leave the data unrecoverable. WinHex was used to image every disk to a file. Because the source disks hold a large amount of data, the backup took a long time. Part of the backup is shown below:
Figure II:
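The imaging step above was done with WinHex. As a rough illustration only, a disk-to-file copy with a checksum could be sketched in Python as follows. The function names image_disk and sha256_of are hypothetical and are not part of the original workflow; the sketch assumes the source disk is readable as an ordinary file or device node.

```python
import hashlib


def image_disk(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path in chunks and return (bytes_copied, sha256_hex)."""
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of the source reached
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash an existing file so an image can be compared with its source later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the hash returned by image_disk with sha256_of run on the image file gives a simple check that the image was written completely before any recovery work touches it.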


Third, fault analysis and recovery process

1. Analyzing the cause of the failure
Since the first two steps found neither a physical failure nor bad sectors, it can be inferred that unstable reads and writes on some disks caused the failure. The EVA controller checks disks very strictly: once a disk's performance becomes unstable, the controller treats it as a bad disk and kicks it out of the disk group. Once the number of dropped disks in the same stripe of a LUN reaches the limit the stripe can tolerate, that LUN becomes unavailable. If every LUN in the EVA contains the dropped disks, every LUN is affected, so it is not surprising that losing two disks made all LUNs in the storage unusable. At present 8 LUNs still exist, 7 LUNs are damaged, and 6 LUNs are missing. Data has to be recovered for all LUNs.
2. Analyzing the structure of the LUNs
HP EVA LUNs store data in the form of RAID entries. The EVA builds each RAID entry from different blocks on each disk, and there are many types of RAID entry. We need to work out which type of RAID entry makes up each LUN and which disk blocks each RAID entry is built from. This information is stored in the LUN_MAP, and each LUN has its own LUN_MAP. The EVA stores the LUN_MAPs on different disks and uses an index to record their locations. So by finding this index on each disk, we can locate the LUN_MAPs and obtain the information for the existing LUNs.
3. Analyzing the missing LUNs
Although the index to the LUN_MAPs is recorded on disk, it only covers the existing LUNs; the missing LUNs are not indexed. Deleting a LUN in the EVA only clears the LUN's index entry and does not erase its LUN_MAP. We therefore need to scan all disks for every LUN_MAP data block and then exclude the LUN_MAPs of the existing LUNs. The remaining LUN_MAPs do not all belong to the deleted LUNs, because some are simply old copies, and this cannot be determined from the LUN_MAP alone. The only option is to restore the data for every LUN_MAP with a program and then check manually which LUNs are the deleted ones.
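The scan-and-exclude idea described above can be sketched roughly as follows. The source does not describe the on-disk LUN_MAP format, so everything format-specific here is an assumption: the block size, the idea that a LUN_MAP block starts with a recognizable signature, and the function names are all hypothetical.

```python
def scan_for_map_blocks(stream, signature, block_size=512):
    """Return offsets of block-aligned positions that start with `signature`.

    `stream` is any seekable binary file object, e.g. an open disk image.
    The signature and block size are placeholders, not the real EVA format.
    """
    offsets = []
    offset = 0
    while True:
        stream.seek(offset)
        head = stream.read(len(signature))
        if len(head) < len(signature):  # ran past the end of the image
            break
        if head == signature:
            offsets.append(offset)
        offset += block_size
    return offsets


def unindexed_maps(found_offsets, indexed_offsets):
    """Candidates for deleted or stale LUN_MAPs: found by scanning, absent from the index."""
    return sorted(set(found_offsets) - set(indexed_offsets))
```

The offsets returned by unindexed_maps are only candidates. As noted above, they mix deleted LUNs with stale copies, so each one still has to be restored and reviewed by hand.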
4. Analyzing the dropped disks
The earlier failure analysis showed that the disks have no obvious physical fault and no bad sectors, yet they were still kicked out of the EVA disk group for performance reasons. These dropped disks contain stale data, so they must be excluded when the data is regenerated. But how can we tell which disks dropped out? Because most of the LUNs use a RAID5 structure, we can compute the parity of each RAID entry of a LUN with the RAID5 parity algorithm and compare it with the stored parity to determine whether the entry contains a dropped disk. Checking every entry in a LUN's LUN_MAP in this way shows which RAID entries of the LUN contain a dropped disk, and a disk that appears in all of these RAID entries must be a dropped disk. Exclude these disks and then restore the data for all LUNs according to the LUN_MAPs.
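The parity comparison described above relies on a basic RAID5 property: the XOR of all members of an entry (data blocks plus the parity block) is zero when the entry is consistent. The following is a minimal sketch of that check, not the actual Chk_raid.exe, whose implementation the source does not show. Representing an entry as a mapping from a disk identifier to its block is an assumption made for illustration.

```python
from collections import Counter


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("all blocks in a RAID entry must have the same length")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def entry_is_consistent(entry):
    """entry maps disk id -> block (data and parity alike); consistent means XOR is all zero."""
    return not any(xor_blocks(list(entry.values())))


def rank_suspect_disks(entries):
    """Count how often each disk appears in inconsistent entries, most frequent first."""
    counts = Counter()
    for entry in entries:
        if not entry_is_consistent(entry):
            counts.update(entry.keys())
    return counts.most_common()
```

The ranking is only a hint: a disk that shows up in every inconsistent entry is the strongest candidate, but the result should still be confirmed by manual analysis, as the article describes.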
5. Writing the data recovery programs
The analysis and solutions above ultimately have to be implemented in code. We wrote Scan_map.exe to scan for all LUN_MAPs and, combined with manual analysis, derive the most accurate LUN_MAPs. We wrote Chk_raid.exe to check the RAID entries and detect the dropped disks in all LUNs, which were then excluded after manual analysis. We wrote lun_recovery.exe to restore all LUN data from the LUN_MAPs.
6. Recovering all LUN data
With each program performing its own function, lun_recovery.exe was finally used together with the LUN_MAPs to restore the data for all LUNs. Each LUN was then checked manually to confirm that it matched the description given by Party A's engineers. The recovered data for some LUNs is shown below:
Figure III:
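To illustrate how a RAID5 entry can still yield its data once a dropped disk is excluded, here is a small sketch. It is not lun_recovery.exe itself. The entry layout (an ordered list of data members plus one parity member) and all function names are assumptions for the example; the real EVA RAID entry types are not described in the source.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("all blocks in a RAID entry must have the same length")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def recover_entry(data_members, parity_member, excluded):
    """Return the data of one RAID5 entry, rebuilding an excluded data block if needed.

    data_members: ordered list of (disk_id, block); parity_member: (disk_id, block).
    """
    bad = [disk for disk, _ in data_members if disk in excluded]
    if not bad:
        return b"".join(block for _, block in data_members)
    if len(bad) > 1 or parity_member[0] in excluded:
        raise ValueError("RAID5 can rebuild only one lost member per entry")
    # XOR of the surviving data blocks and the parity block gives the lost block.
    survivors = [block for disk, block in data_members if disk not in excluded]
    survivors.append(parity_member[1])
    rebuilt = xor_blocks(survivors)
    return b"".join(rebuilt if disk in excluded else block
                    for disk, block in data_members)


def recover_lun(entries, excluded):
    """Concatenate the recovered data of all entries, in LUN_MAP order."""
    return b"".join(recover_entry(data, parity, excluded) for data, parity in entries)
```

An entry that has lost more than one member cannot be rebuilt this way, which matches the earlier observation that a LUN becomes unavailable once the dropped disks in a stripe reach the limit.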

Fourth, validation of data


According to Party A's engineers, the data on the LUNs falls into two parts: VMware virtual machines, and raw devices on HP-UX that hold Oracle DBF database files. Since we recover whole LUNs and cannot see the files inside them, we have to check manually which LUNs hold VMware data and which are HP-UX raw devices. The LUNs are then mounted in the corresponding verification environments to confirm that the recovered data is intact.
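The manual check of which LUN holds which kind of data can be partly assisted by looking for known header signatures at known offsets. The sketch below only shows the mechanism. The source gives no offsets or magic values for VMFS or HP-UX LVM, so the signature table is a caller-supplied parameter, and any real values must be taken from format documentation or a known-good volume. The function name classify_lun is hypothetical.

```python
def classify_lun(stream, signatures):
    """Return the label of the first signature that matches, or None.

    stream: seekable binary file object for a recovered LUN image.
    signatures: list of (label, offset, magic_bytes) supplied by the caller.
    """
    for label, offset, magic in signatures:
        stream.seek(offset)
        if stream.read(len(magic)) == magic:
            return label
    return None
```

A LUN for which classify_lun returns None still needs manual inspection, which is how the classification was done in this case.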
1. Deploying a VMware Virtual machine verification environment
ESXi 5.5 was installed on a Dell server as the virtual host, and the recovered LUNs were attached to it over iSCSI. However, scanning for VMFS volumes in the VMware vSphere Client found nothing. We later learned that the customer's virtual hosts ran ESXi 3.5, so the VMFS volumes probably could not be scanned directly because of the version difference, and we switched to another verification method. We extracted all the virtual machine files from the LUNs that hold VMware virtual machines, shared them to the virtual host over NFS, and added the virtual machines to the inventory one by one. Some of the recovered virtual machine files are shown below:
Figure IV:

2. Verifying the VMware virtual machines
After all virtual machines had been added to the virtual host over NFS, they were powered on and were found to boot into their operating systems. Because we did not have the login passwords, we could not verify whether the files inside the virtual machines were complete. Party A later arranged for an engineer to connect to our server remotely, log in to every virtual machine and verify the data inside, and no problems were found. All virtual machine data was restored successfully. Some of the powered-on virtual machines are shown below:
Figure V:

3. Deploying an Oracle Database Validation environment
For raw device recovery testing and later data validation, an Oracle environment has to be built first.
According to the environment information provided by Party A's engineer, the original system is an HP Itanium minicomputer. Our HP machine is an rx2660 (Itanium 2), which has the same architecture and is compatible. We planned to install single-instance Oracle software on this machine.



Operating system: HP-UX B.11.31
Database: Oracle 10.2.0.1.0 Enterprise Edition - 64bit for HP-UX



The following are simple steps to install the environment:
(1) Check the environment
# uname -a
HP-UX byhpux1 B.11.31 U ia64 1447541358 unlimited-user license
This machine has the IA64 architecture, the operating system is HP-UX, and the version is B.11.31.
Then check the storage space of each file system to make sure there is enough room.
(2) Check the installation dependencies
Check the patch bundles required by Oracle 10g against the installation guide "B19068.pdf".
Check:
# swlist -l bundle | grep "GOLD"
# swlist -l patch | grep PHNE_31097
If a required patch is not found, download it from the official website and install it. Install the patch bundle:
# swinstall -s /patchcd/goldqpk11i -x autoreboot=true -x patch_match_target=true
(3) Create the user and group
# groupadd dba
# useradd -G dba -d /home/oracle oracle
# passwd oracle
(4) Create directories and modify permissions
Create the directories:
# mkdir -p /opt/oracle/product/10.2/oracledb
# chown -R oracle:dba /opt/oracle

Modify permissions:
# chown oracle:dba /usr/oracle_inst/database/
# chmod 755 /usr/oracle_inst/database/
(5) Set the environment variables
# vi /home/oracle/.profile
(6) Install Oracle
Installing Oracle requires a graphical interface, so first test that the graphical interface starts properly.
# export DISPLAY=192.168.0.1:0.0
$ ./runInstaller
Once the graphical interface is up, the installation is fairly simple. Only the software is installed here, not a database instance.
(7) Test the database connection
# su - oracle
$ sqlplus / as sysdba


4. Verifying the Oracle database
(1) Attach the raw devices
Some of the LUNs are raw devices, but the LUNs we recovered exist as files, so the file-based LUNs have to be attached to HP-UX. Install the iSCSI initiator on the HP-UX server as follows:
Check that the package is complete:
# swlist -d @ /tmp/b.11.31.03d_iscsi-00_b.11.31.03d_hp-ux_b.11.31_ia_pa.depot
Install the package:
# swinstall -x autoreboot=true -s /tmp/b.11.31.03d_iscsi-00_b.11.31.03d_hp-ux_b.11.31_ia_pa.depot iSCSI-00
Add the iSCSI executables to the PATH:
# PATH=$PATH:/opt/iscsi/bin
Check that iSCSI was installed successfully:
# iscsiutil -l
Configure the iSCSI initiator name:
# iscsiutil /dev/iscsi -i -N iqn.2014-10-15:lun
Add the target iSCSI device:
# iscsiutil -a -I 10.10.1.9
To delete a target iSCSI device, if needed:
# iscsiutil -d -I 10.10.1.9
Verify that the target iSCSI device was added successfully:
# iscsiutil -pD
Discover the target device:
# /usr/sbin/ioscan -H 255
Create the device files for the target:
# /usr/sbin/insf -H 255
(2) Import the external VG information
Create the VG directory:
# mkdir /dev/vgscope
Create the VG group device file:
# mknod /dev/vgscope/group c 64 0x030000
Check that the PV is usable:
# pvdisplay -l /dev/dsk/c2t0d0
Import the PV into the VG:
# vgimport -v /dev/vgscope /dev/dsk/c2t0d0
Activate the VG:
# vgchange -a y vgscope
View the VG information:
# vgdisplay -v vgscope
Figure VI:

(3) Change the LV names
Because the VG was rebuilt in the new environment and the PV was then imported into the new VG, all the LV names changed. The LVs have to be renamed manually to their previous names.
Figure VII:

Because the original environment had two database instances, both stored on raw devices, the DB instances are created following the original configuration and names.
At the same time, with the customer's assistance, all LVs are mounted at the file system level and their permissions are modified.
Figure VIII:

With the assistance of the customer's DBA, the DB instances are installed following the original configuration, and all raw device files are identified.
Then adjust the configuration parameters, check the database storage status, and prepare to start the database.


First switch to the scope instance (the most important one) and start the database.
SQL> startup mount;
SQL> select file#, error from v$recover_file; -- check for damaged files
There are no corrupted files.
SQL> alter database open;
The database started without errors, although slowly. We then queried the users and randomly queried two tables of one user, and the result sets were returned normally. Then the connection was suddenly interrupted. After reconnecting, the database status showed as closed. Starting the database again did not help: it was forcibly shut down again.
After initial checks, this issue could not be fixed with the general methods for recovering the database state.



Verifying the NJYY database
After switching the environment variables to the other database, NJYY, opening the database reported a memory error and the database could not be opened. Initial checks showed that the data files were not damaged.
SQL> startup mount;
SQL> select file#, error from v$recover_file;
SQL> alter database open;
Error 4030 detected in background process



5. Repairing the Oracle databases
Fault repair
For the scope database, based on the operations and symptoms above, the preliminary judgment was that the problem lay in the undo tablespace or the logs. Integrity and consistency checks on the data files found only one corrupted file, undo01.dbf, which confirmed that the undo tablespace was corrupted. The corrupted undo tablespace was dropped by command and rebuilt in its original location.
The other files were checked and no problems were found. The database was restarted and started normally; data queries and integrity checks were also normal.
A full export of the database was then done with exp, which completed normally after 3 hours.
For the NJYY database, checks showed that the memory parameters had not been set. After adjusting them by command, the database returned to normal and could be started and used normally.
Finally, a full export of this database was done with exp, which completed normally after 4 hours.



Detailed validation
After the preliminary verification was complete, Party A asked its DBA and business staff to carry out further, detailed verification by connecting to the databases remotely. The validation environment and the validation of each database were completed with our coordination.
The final verification confirmed that the databases were fully recovered, with no problems.
After the data was validated, the data migration began. Considering the size of the databases and the recovery time, we chose expdp for the full database export, because expdp is more efficient than exp.
After writing the export script and testing it in the test environment without problems, we exported the scope database first. Errors began 24 minutes after the export started:
ORA-39171: Job is experiencing a resumable wait.
ORA-01654: unable to extend index SYSTEM.SYS_MTABLE_00003A964_IND_1 by 8 in tablespace SYSTEM
The cause was that the SYSTEM tablespace was full. When exporting with expdp, export bookkeeping data is written to the SYSTEM.SYS_MTABLE_00003A964_IND_1 object in the SYSTEM tablespace. When a large amount of data is exported, this object keeps growing, and an error is raised once it reaches the total capacity of the SYSTEM tablespace. At first this seemed wrong, because a tablespace normally extends automatically. It turned out that the SYSTEM tablespace sits on a raw device with a fixed capacity of 1 GB and cannot grow. The expdp tool therefore could not be used for the export. Only exp could be used, which is slower but does not run out of SYSTEM tablespace.
Finally, a full export of the scope database was done with exp. The backup completed successfully after 6 hours and produced 172 GB of backup files.
For the NJYY database, a full export with exp completed normally after 7 hours and produced 140 GB of backup files. The database backup files were then backed up locally as a safe cold backup.


Fifth, transfer of data


1. Handing over the VMware virtual machine files and Oracle dump files
After verifying that all data was intact, we copied the VMware virtual machine files and Oracle dump files to a 2 TB Seagate hard drive, and copied the recovered LUN data to two separate 3 TB disks. On arrival at Party A's site, the VMware virtual machine files and Oracle dump files were handed over, and Party A began verifying them.
2. Imaging the LUN data to Party A's EVA4400 storage
Because Party A required all LUN data to be restored to the original environment, the HP EVA4400 had to be reconfigured and LUNs of the same sizes as before recreated. The recovered LUN data was then imaged onto the new EVA LUNs with WinHex. Problems with the HP EVA4400 controller meant that a long time was spent debugging before the EVA4400 could be reset. After all LUN data had been imaged, Party A arranged for an Oracle database engineer to verify the recovered Oracle databases. The check found that two missing DBF files prevented the Oracle service from starting. Analysis showed that these two DBF files had existed as files before the EVA failure and were restored into LVs during our recovery. Party A's engineer, however, did not rebuild the VG when restoring the LVs but restored all LVs according to the previous vg_map, which is why the problem occurred. The workaround was to recreate the two LVs, extract the two files from the underlying LUN, and dd them into the newly created LVs. The Oracle service was then started again and came up normally, which solved the problem.
Because the on-site environment was well preserved after the failure and no risky operations were performed, the later data recovery was greatly helped. Although the recovery ran into many technical bottlenecks, all of them were resolved. The entire data recovery was completed within the expected time, and Party A was quite satisfied with the recovered data.
Future data security recommendations
1. Have staff patrol the machine room regularly and handle alarms promptly.
2. Administrators should operate the storage cautiously to avoid data loss through operator error.
3. On site, some modules of the EVA controller were found to be unstable and should be replaced promptly.
4. The EVA storage failure was caused by unstable disks, and these disks probably come from the same batch. The other disks of that batch are likely close to their limits as well, so if conditions permit, it is recommended to replace the whole batch.


