Are your RAID 5 arrays secure?

Source: Internet
Author: User
Tags: arrays

Many people have seen a server's RAID 5 array fail: often, shortly after one disk dies, a second disk fails as well.

Reference: RAID 5 also uses parity to protect data, but instead of storing the parity on a dedicated disk, it distributes the parity blocks across all the disks alongside the data. If any one disk fails, its contents can be reconstructed from the data and parity on the remaining disks. The usable capacity is that of n-1 disks. If two disks fail, the data is lost.
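The reconstruction described above can be illustrated with XOR parity. This is a minimal sketch rather than a real RAID implementation: the block contents and the helper name `xor_blocks` are made up for illustration, and a real array rotates the parity block across disks stripe by stripe.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing the second block: XOR the survivors with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # the lost block is recovered
```

With two blocks missing, the single parity block no longer carries enough information to recover either of them, which is why a second failure is fatal.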

In theory, the probability of two hard drives failing at the same time is very low, so why does it happen?

Reference: Mathematically, the mean time between failures (MTBF) of each disk is roughly 500,000 to 1.5 million hours (that is, about one failure every 50 to 150 years). In practice this ideal is rarely achieved: under most thermal and mechanical conditions, a drive's working life is significantly shorter. Since each disk has a different lifespan and any disk in the array can fail, statistically the probability that some disk in an array of N disks fails is roughly N times the probability of a single disk failing. Taken together, if the array has a fair number of disks and their MTBF is short, disk failures are likely within the expected lifetime of the array (for example, once every few months or every few years).
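As a rough illustration of the scaling with N, the sketch below converts an MTBF into a failure probability over one year. It assumes a constant failure rate (an exponential lifetime model) and independent disks; both are simplifications, and the 8-disk array is an arbitrary example.

```python
import math

HOURS_PER_YEAR = 8760

def failure_probability(mtbf_hours, hours):
    """P(a single disk fails within `hours`), assuming a constant failure rate."""
    return 1.0 - math.exp(-hours / mtbf_hours)

def any_disk_failure_probability(n_disks, mtbf_hours, hours):
    """P(at least one of n independent disks fails within `hours`)."""
    p_single = failure_probability(mtbf_hours, hours)
    return 1.0 - (1.0 - p_single) ** n_disks

# Assumed inputs: MTBF of 500,000 hours, one year of operation, 8 disks.
single = failure_probability(500_000, HOURS_PER_YEAR)
array = any_disk_failure_probability(8, 500_000, HOURS_PER_YEAR)

print(f"single disk: {single:.2%}, 8-disk array: {array:.2%}")
```

Under these assumptions the array figure comes out close to, and slightly below, 8 times the single-disk figure. A shorter real-world MTBF raises both numbers.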

What are the chances of two disks failing at the same time (where "at the same time" means a second disk fails before the first has been fully rebuilt)? If the MTBF of a RAID 5 array is taken to be on the order of MTBF^2, such an event occurs about once every 10^15 hours (far less often than once in 10,000 years), so its probability is extremely low regardless of working conditions. Mathematically the possibility exists, but in practice it would not need to be considered. Yet two disks do sometimes fail at the same time, so we cannot ignore the possibility entirely; the real cause of such double failures has essentially nothing to do with MTBF.
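To put a number on the MTBF-only view, the sketch below estimates the chance that one of the surviving disks suffers an outright failure during the rebuild window. It uses the same constant-failure-rate model; the 4-disk array and the 24-hour rebuild time are assumed values, not figures from the text.

```python
import math

def second_failure_probability(n_disks, mtbf_hours, rebuild_hours):
    """P(any of the n-1 surviving disks fails while the array is rebuilding)."""
    survivors = n_disks - 1
    return 1.0 - math.exp(-survivors * rebuild_hours / mtbf_hours)

# Assumed inputs: 4 disks, MTBF of 500,000 hours, 24-hour rebuild.
p = second_failure_probability(n_disks=4, mtbf_hours=500_000, rebuild_hours=24)

print(f"{p:.4%}")
```

Even with these conservative inputs the result is a small fraction of one percent, consistent with the point that MTBF alone does not explain the double failures seen in practice.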

To explain this, we first need a concept most people rarely encounter: the bit error rate (BER) of a hard disk. It is an important parameter of a drive and a measure of its reliability. It expresses the probability of an unrecoverable read error when reading back data that was written to the disk. Such errors are statistically rare, and BER is usually quoted as the number of bits read per read error.

As hard disk capacities have grown, the chance of hitting a read error while reading a whole drive has grown too, because BER has not improved in step with capacity. Reading an entire 1 TB drive, which is what a RAID rebuild requires, is more likely to hit an error than reading an entire 300 GB drive.
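The effect of capacity can be sketched by treating every bit read as an independent trial with error probability equal to the BER. The value 1e-14 (one error per 10^14 bits read) is an assumption here, chosen because it is a figure often quoted for consumer drives; the decimal capacities are also assumptions.

```python
import math

def read_error_probability(capacity_bytes, ber):
    """P(at least one unrecoverable read error while reading every bit once)."""
    bits = capacity_bytes * 8
    # 1 - (1 - ber)**bits, written with log1p/expm1 to stay accurate for tiny ber.
    return -math.expm1(bits * math.log1p(-ber))

ASSUMED_BER = 1e-14  # assumption: one unrecoverable error per 10^14 bits read

p_300gb = read_error_probability(300e9, ASSUMED_BER)
p_1tb = read_error_probability(1e12, ASSUMED_BER)

print(f"300 GB: {p_300gb:.1%}, 1 TB: {p_1tb:.1%}")
```

At a fixed BER the probability grows almost linearly with capacity while it is small, so the 1 TB drive is a little over three times as likely to hit an error as the 300 GB drive.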

How likely is such an error? Put another way, how many gigabytes of data must be read before a one-byte read error occurs? See this article:
http://lenciel.cn/docs/scsi-sata-reliability/

Consider the different classes of hard drive. Enterprise, server, and data-center drives used to be SCSI or Fibre Channel, while commercial and consumer drives were IDE; today the corresponding classes are SAS and SATA.

Their MTBF is similar, but the BER of a cheap SATA drive is much higher than that of an expensive SCSI drive.

That is, a SATA drive is much more likely than a SCSI drive to hit a sector it cannot read. There is also a firmware difference. When a consumer drive cannot read a sector, or hits a bad sector on write, it spends more than a minute trying to correct the error, and if that fails it remaps the sector to a spare. That delay exceeds what the array controller will tolerate, so the controller drops the disk from the array. An enterprise-class disk does this work in the background, without pausing for a minute and without affecting array operation. This firmware behavior is a separate matter from the per-bit error rate that BER describes.

According to the calculations in that article, the probability of being unable to read every sector of a 1 TB hard disk reaches 56%. So if you use cheap, high-capacity SATA disks, there is little hope of successfully rebuilding the RAID after a drive fails.

With 1 TB SATA drives in RAID 5, when one drive fails, the remaining two or more drives (a RAID 5 array needs at least 3 disks) are very likely to hit a read error during the rebuild, causing it to fail.
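A rebuild has to read every surviving disk in full, so the exposure grows with the number of disks as well as with their size. The sketch below extends the per-bit model to a whole array. The BER of 1e-14 and the decimal 1 TB capacity are assumptions, so the output illustrates the trend and is not meant to reproduce the figures in the linked article.

```python
import math

def rebuild_failure_probability(n_disks, capacity_bytes, ber):
    """P(a RAID 5 rebuild hits at least one unrecoverable read error)."""
    # All n-1 surviving disks must be read end to end.
    bits_to_read = (n_disks - 1) * capacity_bytes * 8
    return -math.expm1(bits_to_read * math.log1p(-ber))

ASSUMED_BER = 1e-14   # assumption: one unrecoverable error per 10^14 bits read
ONE_TB = 1e12         # assumption: decimal terabyte

for n_disks in (3, 5, 8):
    p = rebuild_failure_probability(n_disks, ONE_TB, ASSUMED_BER)
    print(f"{n_disks} x 1 TB: {p:.1%} chance the rebuild hits a read error")
```

The result is very sensitive to the assumed BER: a drive whose real-world error rate is ten times worse than this assumption pushes the same arrays much closer to certain failure.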

This is why, back when RAID 5 arrays were built from small disks, losing two disks at once was rare; now that disks are large, the problem is increasingly likely.
