After two days of relentless effort, I finally recovered production-server data that had been deleted by mistake. I am recording the incident and how it was resolved as a warning to myself and to others, so that nobody repeats this mistake. I also hope that anyone facing a similar problem finds some inspiration here.
Accident background
I had assigned a young female colleague to install Oracle on a production server. She researched as she went, decided the installation had gone wrong, and set out to uninstall and reinstall it. The uninstall instructions she found online included a command that deletes the Oracle installation directory:
rm -rf $ORACLE_BASE/*
If the ORACLE_BASE variable is not set, the command becomes:
rm -rf /*
And she was logged in as root. The entire disk was wiped, including the Tomcat applications, the MySQL database, and everything else.
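The trap is ordinary shell expansion: an unset variable expands to an empty string, leaving only the trailing /*. Below is a minimal bash sketch of a guarded version; remove_oracle_base is a hypothetical helper, not part of any Oracle uninstall procedure. The ${ORACLE_BASE:?message} expansion aborts when the variable is unset or empty, so rm never receives /*. Running set -u at the top of a script gives similar protection against unset (though not empty) variables.

```bash
# Hypothetical helper: a guarded "rm -rf $ORACLE_BASE/*".
# The body uses ( ) so it runs in a subshell: a failed ${...:?} expansion
# then aborts only this function instead of the whole calling script.
remove_oracle_base() (
    # Stops here, with an error message, if ORACLE_BASE is unset or empty.
    base="${ORACLE_BASE:?ORACLE_BASE is not set; refusing to delete}"

    # Also refuse a value made only of slashes ("/", "//", ...).
    if [[ -z "${base//\//}" ]]; then
        echo "ORACLE_BASE points at /; refusing to delete" >&2
        exit 1
    fi

    # Quoted variable, glob outside the quotes: deletes the contents of $base only.
    rm -rf -- "$base"/*
)
```

With ORACLE_BASE unset, the function prints an error and returns non-zero without touching the disk.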
(Was MySQL still running at the time? Can Linux delete a file that is in use?) In any case, everything was wiped. The only survivor was one Tomcat log file, presumably because it was so large that its deletion had not finished yet.
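As for the question in parentheses: on Linux, rm only unlinks a file's name. The inode and its data blocks are freed only when the last name and the last open file descriptor are gone, so a running mysqld or Tomcat that still holds a file open keeps its data alive, and while that process runs the contents can be copied back out through /proc/<pid>/fd/. Below is a small self-contained bash demonstration; demo_deleted_but_open is illustrative only, and it assumes Linux /proc and bash 4+ (for $BASHPID). It presumably would not have helped here, since moving the disk to another server meant the processes holding the files were gone.

```bash
# Illustrative only: an unlinked file stays readable through an open descriptor.
demo_deleted_but_open() {
    local dir file
    dir=$(mktemp -d)
    file="$dir/data.log"
    echo "important row" > "$file"

    exec 3< "$file"            # hold the file open on fd 3, as a running daemon would
    rm -f "$file"              # unlink: the directory entry is gone...
    [[ -e "$file" ]] || echo "name removed"

    # ...but the inode survives while fd 3 is open, and /proc exposes it.
    # For a real daemon, use its PID instead of $BASHPID.
    cat "/proc/$BASHPID/fd/3"

    exec 3<&-                  # closing the last descriptor finally frees the blocks
    rm -rf "$dir"
}
```

On a live server, lsof +L1 lists open files whose link count has dropped to zero, together with the PID and descriptor number needed for the /proc path.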
Looking at her face, I knew the blame was mine: I had assigned her the task, had not made clear to her how much was at stake, and had given her no training. I had to carry the responsibility alone; how could I let her take the blame?
I called the data center and had the disk attached to another server. Over SSH I could see that every file was gone. This server ran a customer's production system that had been live for half a year, so it had to be recovered as quickly as possible. I looked for the offline database backups, only to find that the backup file was just 1 KB and contained nothing but a few familiar mysqldump comment lines (perhaps the backup script run by crontab was broken). The most recent real backup was from December 2013. When it rains, it pours.
It reminded me of a story a manager once told: a production system went down, and every backup turned out to be bad. The burned discs were scratched and the tape drive was broken (the story came from an industry veteran who presumably backed up to CD-ROM in those days). I never imagined it would happen to me. What now?
Once the department heads learned what had happened, they prepared a worst-case Plan B: a department head would personally take product manager AA to the customer's city on Sunday to talk with the customer's management on Monday, while BB and CC would go to the customer's administrators to try to smooth things over.
ext3grep: Clutching at Straws
I searched online for ways to recover the data and found ext3grep, which can restore files deleted with rm -rf. Our disk was formatted as ext3, and there were many success stories online. A glimmer of hope! I immediately unmounted the disk so that the blocks of the deleted files would not be overwritten, then downloaded and installed ext3grep (compiling and installing it was painful, but I'll skip that for now).
First, I ran the command that scans for file names:
ext3grep /dev/vgdata/logvol00 --dump-names
It printed every deleted file with its path. I was ecstatic: no need for Plan B, the files were all still there.
This tool cannot restore files by directory, so I ran the full-restore command:
ext3grep /dev/vgdata/logvol00 --restore-all
It turned out there was not enough free disk space for a full restore, so I had no choice but to restore files one at a time. I tried a few, and to my surprise some succeeded and some failed:
ext3grep /dev/vgdata/logvol00 --restore-file var/lib/mysql/aqsh/tb_b_attench.MYD
My heart sank. Had the deleted blocks already been overwritten? The odds of a full recovery looked poor, but if I could restore at least some files, perhaps the important data would be in .MYD files that were still recoverable. So I redirected all the file names into a file:
ext3grep /dev/vgdata/logvol00 --dump-names > /usr/allnames.txt
Then I filtered out all the MySQL database file names and saved them to mysqltbname.txt.
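The filtering step might look like the following bash sketch. It assumes that the --dump-names output lists paths relative to the filesystem root, in the same form the --restore-file commands above use (var/lib/mysql/...), that the schema of interest is aqsh, and that the tables are MyISAM (.frm/.MYD/.MYI); filter_mysql_names is a hypothetical helper.

```bash
# Hypothetical helper: keep only one schema's MyISAM table files from the name dump.
# Usage: filter_mysql_names ALL_NAMES_FILE OUTPUT_FILE [SCHEMA]
filter_mysql_names() {
    local all_names=$1 out=$2 schema=${3:-aqsh}
    # Anchor the match so files in subdirectories or other schemas are ignored.
    grep -E "^var/lib/mysql/${schema}/"'[^/]+\.(frm|MYD|MYI)$' "$all_names" \
        | LC_ALL=C sort -u > "$out"   # C locale: ordering does not depend on system locale
}
```

For example: filter_mysql_names /usr/allnames.txt /mysqltbname.txt. The sort -u also drops any duplicate lines.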
Next, I wrote a script to restore the files in that list one at a time:
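while read -r line
do
    echo "Begin to restore file: $line"
    # Report files that ext3grep could not restore, and keep going
    if ! ext3grep /dev/vgdata/logvol00 --restore-file "$line"
    then
        echo "Restore failed: $line"
        # exit 1    # uncomment to stop at the first failure instead
    fi
done < /mysqltbname.txt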
The script ran for about 20 minutes and recovered just over 40 files. Nowhere near enough: we had nearly 100 tables, each with three files (.frm, .MYD, .MYI), so more than 300 files in total! I put the recovered files back into the existing database directory, set their permissions to 777, and restarted MySQL. Some data came back, but the customer's most important data, the attendance records and the mobile-phone report data (which the customer uses for employee performance reviews), was still missing.
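With about 300 files at stake, it helps to know which tables are actually complete before copying anything back. A bash sketch, assuming the recovered files for one schema sit in a single directory (for example under RESTORED_FILES/, which we believe is ext3grep's default output directory, followed by var/lib/mysql/aqsh); report_table_status is a hypothetical helper.

```bash
# Hypothetical helper: list each table and which of its three MyISAM files are missing.
# Usage: report_table_status DIRECTORY
report_table_status() {
    local dir=$1
    local f name table ext missing
    for f in "$dir"/*.frm "$dir"/*.MYD "$dir"/*.MYI; do
        [[ -e "$f" ]] || continue        # an unmatched glob stays literal; skip it
        name=${f##*/}                    # strip the directory
        echo "${name%.*}"                # strip the extension -> table name
    done | LC_ALL=C sort -u | while read -r table; do
        missing=""
        for ext in frm MYD MYI; do
            [[ -e "$dir/$table.$ext" ]] || missing+=" .$ext"
        done
        if [[ -z "$missing" ]]; then
            echo "OK       $table"
        else
            echo "MISSING  $table:$missing"
        fi
    done
}
```

A MyISAM table that lacks only its .MYI index can often be rebuilt with REPAIR TABLE ... USE_FRM; without the .MYD file, the table's rows are gone.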
What now? Along the way I also tried another tool, extundelete. Its syntax is essentially the same as ext3grep's and it presumably works on the same principle, but it is said to be able to restore by directory. Worth a try:
extundelete /dev/vgdata/logvol00 --restore-directory var/lib/mysql/aqsh
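Since the two tools share the same command-line shape, one way to get the most out of both is to try ext3grep on each file first and fall back to extundelete only for the failures. A bash sketch under two assumptions: both tools accept <device> --restore-file <path> as in the commands in this article, and both exit non-zero when a file cannot be restored (the script above relies on this for ext3grep; we have not verified it for extundelete). restore_with_fallback and the EXT3GREP/EXTUNDELETE overrides are hypothetical.

```bash
# Tool names can be overridden, e.g. with full paths to freshly compiled binaries.
EXT3GREP=${EXT3GREP:-ext3grep}
EXTUNDELETE=${EXTUNDELETE:-extundelete}

# Hypothetical helper. Usage: restore_with_fallback DEVICE LIST_FILE FAILED_FILE
restore_with_fallback() {
    local device=$1 list=$2 failed=$3 path
    : > "$failed"                          # start with an empty failure list
    while read -r path; do
        [[ -n "$path" ]] || continue       # skip blank lines
        # </dev/null keeps the tools from consuming the list on stdin.
        if "$EXT3GREP" "$device" --restore-file "$path" </dev/null >/dev/null 2>&1; then
            echo "ext3grep:    $path"
        elif "$EXTUNDELETE" "$device" --restore-file "$path" </dev/null >/dev/null 2>&1; then
            echo "extundelete: $path"
        else
            echo "$path" >> "$failed"      # neither tool could restore it
        fi
    done < "$list"
}
```

For example: restore_with_fallback /dev/vgdata/logvol00 /mysqltbname.txt /tmp/failed.txt. The two tools write to different output directories (by default RESTORED_FILES and RECOVERED_FILES respectively, as far as we know), and a successful exit status does not prove the restored file is intact.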
Sure enough, nothing came back! Those files were corrupted. I reported to my manager that we would have to fall back on Plan B, then went home feeling helpless (it was the weekend, so I went home to rest and think of something).
Brainwave: Binlog
The next morning I woke up early (too much on my mind) and went back to the office. (This weekend was a write-off anyway; if I got away without being criticized, publicly reprimanded, fined, or fired, I would count myself lucky. Who cares about the weekend?)
I kept running ext3grep and extundelete, but those were the only tricks I had. I set the system up on a test server to see whether the data could be pieced together some other way: import the old mysqldump backup on the test server, copy the recovered files over it, fix the file permissions, and restart MySQL.
Wait, wait: what about the binlog? Our deployment standard requires binary logging to be enabled. Maybe the data could be recovered from the binlog?
In the dumped list of file names I found three binlog files: mysql-bin.000001, mysql-bin.000009, and mysql-bin.000010. First I tried to restore 000001:
ext3grep /dev/vgdata/logvol00 --restore-file var/lib/mysql/mysql-bin.000001
It failed.
Of the other two files, mysql-bin.000010 was several hundred MB and looked more promising. I ran the restore command on it and, incredibly, it succeeded!
I quickly copied it to the test server with scp and replayed the binlog:
mysqlbinlog /usr/mysql-bin.000010 | mysql -uroot -p
After I entered the password, the command seemed to hang (a good sign). After a long wait it finally finished. I opened the application and... thank you CCTV, thank you MTV, the data was back!
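Only one binlog could be recovered here, but when several survive, the MySQL manual recommends replaying them through a single mysqlbinlog invocation and one server connection rather than one pipe per file. A bash sketch, assuming the files keep MySQL's zero-padded names (mysql-bin.000009, mysql-bin.000010, ...) and sit in the same directory; replay_binlogs and the MYSQLBINLOG/MYSQL overrides are hypothetical. As with any binlog replay, the result is only correct if the database starts from the state it was in when the first replayed binlog began.

```bash
MYSQLBINLOG=${MYSQLBINLOG:-mysqlbinlog}
MYSQL=${MYSQL:-mysql}

# Hypothetical helper. Usage: replay_binlogs BINLOG_FILE...
replay_binlogs() {
    local files=()
    (( $# > 0 )) || { echo "usage: replay_binlogs BINLOG_FILE..." >&2; return 1; }
    # Zero-padded sequence numbers sort correctly as plain text (C locale).
    mapfile -t files < <(printf '%s\n' "$@" | LC_ALL=C sort)
    # One mysqlbinlog process for all files, piped into a single mysql session.
    "$MYSQLBINLOG" "${files[@]}" | "$MYSQL" -uroot -p
}
```

For example, replay_binlogs /usr/mysql-bin.000010 /usr/mysql-bin.000009 replays 000009 before 000010 regardless of argument order.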
Postscript
Although we were lucky enough to get the data back, the whole process was nerve-racking. I also feel guilty and afraid that my mistake dragged my colleagues and managers into sharing the responsibility. I want to remember this incident and never make the same mistake again. My reflections:
1. When I assigned the server maintenance to my colleague, I didn't brief her on the situation beforehand, and I wasn't paying attention myself; both management and process were chaotic. On a live production system, every change must be planned before it is made.
2. The automatic backup was broken and nobody checked it. The person in charge of offline backups downloaded a 1 KB file from the server every time and never noticed. Everyone needs to be clear about their own responsibilities.
3. The incident was not caught in time, so more data was written to the disk and some files became unrecoverable. We need a monitoring program that sends an SMS alert to the person responsible as soon as a service behaves abnormally.
Based on readers' comments, I'm adding one more:
4. Don't operate as root. Create accounts with different permission levels on the server.
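For point 2, a check as simple as the following would likely have caught the 1 KB backups on the first night. A bash sketch, assuming plain-SQL dumps produced by mysqldump with its default comments enabled, which as far as we know end with a "-- Dump completed" line only when the dump runs to completion; check_backup and its default 100 KB threshold are hypothetical and should be tuned to the real database size. The non-zero return can feed whatever alerting is in place, such as the SMS alarm from point 3.

```bash
# Hypothetical helper. Usage: check_backup DUMP_FILE [MIN_BYTES]
check_backup() {
    local file=$1 min_bytes=${2:-102400} size
    if [[ ! -s "$file" ]]; then
        echo "FAIL: $file is missing or empty"
        return 1
    fi
    size=$(wc -c < "$file")
    size=$((size))                       # normalise (some wc versions pad with spaces)
    if (( size < min_bytes )); then
        echo "FAIL: $file is only $size bytes (expected at least $min_bytes)"
        return 1
    fi
    # A finished dump ends with "-- Dump completed ..."; look near the end of the file.
    if ! tail -n 5 "$file" | grep -q '^-- Dump completed'; then
        echo "FAIL: $file has no mysqldump completion marker"
        return 1
    fi
    echo "OK: $file ($size bytes)"
}
```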
During this incident, several colleagues who had nothing to do with the project or the accident volunteered to help: they looked things up and helped with testing, and one colleague stayed until 1 a.m. running recovery tests. The product manager, despite enormous pressure from the customer, did not panic or blame the developers and the person who ran the command, which let everyone calm down and think about solutions. The department heads also actively looked for solutions, worked overtime testing alongside us, and tracked progress in real time.
Thanks to everyone's efforts, the matter came to a reasonably successful conclusion. On Monday morning we held a team retrospective to draw lessons from it; we must do everything we can to prevent accidents like this.
Links to the tools used in this article:
1. ext3grep: https://code.google.com/p/ext3grep/
Compiling and installing it pulls in a lot of dependencies; search the web for installation instructions. Unfortunately, the author's HOWTO is blocked by the Great Firewall, so I got around the firewall and downloaded it as a PDF. Reading it will give you a deeper understanding of the Linux file system.
This tool has a bug: it aborts with the error "ext3grep: init_directories.cc:534: void init_directories(): Assertion `lost_plus_found_directory_iter != all_directories.end()' failed.", and the recovery fails. The author released a patch for it; I don't understand why the patch was not included in a new release.
2. extundelete: http://extundelete.sourceforge.net/
Its functionality is similar to ext3grep's, and it presumably works on the same principle. It claims to be able to restore whole directories, but that did not work for me.