Server RAID Disk Bad Sector Repair in Practice

Source: Internet
Author: User

Last week our online monitoring system, Nagios, sent an alert that roughly amounted to a disk array error.

I logged on to the server that raised the alarm and ran the disk array diagnostic tool for a detailed check. The report showed:

Media Error Count: 2


Since this was only a warning, it was not a particularly serious error. Dell engineers confirmed that the disk had bad sectors. Because this was an image server and it had a backup, we decided not to go to the data center to deal with it for the time being.

Two days later, a MySQL database server raised the same alarm. Worse, this time the diagnostic report showed:

Media Error Count: 24

Other Error Count: 2


It seems our servers have entered a period of frequent hardware failures this year. If your Dell servers were not purchased directly from the manufacturer, be careful if you are the one maintaining them.

You know what I mean!


So I emailed the Director and the Development Manager to describe the fault and the emergency measures already in place. (The image server's files are backed up to a different machine. The database servers run in a master-slave structure, so there was little to worry about there, and a daily backup plan covers both local and remote copies.) The server's disks support hot swapping, which means a disk can be replaced without stopping services. Still, for safety and stability, we agreed to do the work at night. Honestly, that was purely a lack of confidence: replacing the disk during the day would have had little impact (at most a higher I/O load). Working at night avoided the business and traffic peaks and gave us plenty of time to solve the problem.

The most reassuring thing was the server's array configuration: RAID 5 plus a hot spare, built from four disks. The great benefit is that if any one of the three disks in the RAID 5 array is damaged, the array stays safe for the time being.
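As a rough illustration of why a three-disk RAID 5 survives the loss of one disk, here is a minimal sketch of XOR parity on a single stripe. The block contents and the helper name xor_blocks are made up for this example. A real controller rotates parity across the disks and works on much larger stripes, so treat this as a toy model rather than how the RAID card actually does it.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-disk RAID 5: two data blocks plus one parity block.
data_a = b"imag"
data_b = b"es!!"
parity = xor_blocks(data_a, data_b)

# If the disk holding data_b fails, XOR of the two survivors recovers it.
recovered_b = xor_blocks(data_a, parity)
print(recovered_b == data_b)
```

The same XOR works whichever of the three blocks is lost, which is why any single disk can fail. A second failure before the rebuild finishes cannot be recovered this way.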

With a hot spare configured, when a disk in the RAID array fails, the hot spare takes its place and the array is rebuilt onto it. Once the damaged disk has been replaced, the hot spare detects the new disk and copies its data over to it. After that synchronization completes, the hot spare returns to its original role. Amazing :)

Once you understand how the array works, there is nothing to worry about. The next night, I carried out the plan.

First, I manually backed up the important files so that the latest backups (images, databases, and so on) were on the remote machine. OK.

Then, as planned, I pulled the faulty disk and immediately inserted a new one. The new disk's light flashed for several seconds while the motherboard recognized it, and then stayed steadily lit.

At this point the read/write light on the fourth disk, the hot spare, started flashing. As you may have guessed, the hot spare had detected that a disk went offline and automatically joined the RAID array to rebuild it. The rebuild took about 30 minutes.

To verify my guess, I restarted the image server and opened the RAID card configuration tool from the BIOS. It showed that the hot spare was being used to rebuild the array and that the status of the newly inserted disk was READY.

After about 30 minutes, the RAID rebuild completed, and the new disk's status immediately changed to replacing. This process also took about 30 minutes.

Inside the operating system, the disk status was shown as copyback.

In the final, normal state, the hot spare had returned to its standby role, and the rebuilt array was working normally.

I ran the check again, and the error was gone.
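The sequence described above (rebuild onto the hot spare, copyback to the new disk, spare returns to standby) can be sketched as a toy model. The class name ToyArray, the disk labels, and the event strings are all hypothetical and chosen for this example. It only tracks which disk plays which role and does not talk to any RAID controller.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyArray:
    members: List[str]                 # disks currently in the RAID 5 set
    hot_spare: Optional[str]           # disk held in reserve, if any
    events: List[str] = field(default_factory=list)

    def swap_failed_disk(self, failed: str, new_disk: str) -> None:
        slot = self.members.index(failed)
        spare = self.hot_spare
        if spare is None:
            raise RuntimeError("no hot spare available")

        # Step 1: the spare takes the failed disk's place and the array
        # rebuilds onto it; the newly inserted disk just waits.
        self.members[slot] = spare
        self.hot_spare = None
        self.events.append(f"{spare}: rebuild")
        self.events.append(f"{new_disk}: ready")

        # Step 2: after the rebuild, data is copied from the spare to the
        # new disk (copyback), and the new disk becomes the array member.
        self.events.append(f"{new_disk}: copyback")
        self.members[slot] = new_disk

        # Step 3: the spare goes back to its standby role.
        self.hot_spare = spare
        self.events.append(f"{spare}: hot spare")


array = ToyArray(members=["disk0", "disk1", "disk2"], hot_spare="disk3")
array.swap_failed_disk(failed="disk1", new_disk="disk1-new")
print(array.events)
```

The point of the model is the ordering: the array is back to full redundancy after step 1, and steps 2 and 3 only restore the original layout.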

Note: the diagnostic tool mentioned above was run as follows:

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll
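To watch the counters without reading the whole report by eye, a small parser can pull them out per disk. This is a minimal sketch: the "Media Error Count" and "Other Error Count" labels come from the reports quoted earlier, while the "Slot Number" label and the SAMPLE text are my assumptions about the report layout, not captured output. The function name parse_pdlist is made up for this example.

```python
def parse_pdlist(text: str) -> list:
    """Return one dict of error counters per physical disk in the report."""
    disks = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Label: value" line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":  # assumed to start each disk's section
            disks.append({"slot": int(value)})
        elif key in ("Media Error Count", "Other Error Count") and disks:
            disks[-1][key] = int(value)
    return disks


# Illustrative text only, shaped like the report; not real tool output.
SAMPLE = """\
Slot Number: 0
Media Error Count: 0
Other Error Count: 0

Slot Number: 1
Media Error Count: 24
Other Error Count: 2
"""

for disk in parse_pdlist(SAMPLE):
    if disk.get("Media Error Count", 0) > 0:
        print(f"slot {disk['slot']}: {disk['Media Error Count']} media errors")
```

In practice the text would come from running the command above, for example via subprocess.run with capture_output=True and text=True, and the result could feed a Nagios check.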



This article is from the "Shadow Knight" blog, please be sure to keep this source http://andylhz2009.blog.51cto.com/728703/1348992
