Analysis of why RAID5 tends to lose a second disk immediately after losing one disk


Many people have run into this with server RAID5 arrays: one disk fails, and a second disk drops out almost immediately afterward.

We all know that RAID5 only tolerates the loss of a single disk.

RAID 5 also relies on parity to keep data safe, but it does not set aside a separate disk to hold the parity; instead, the parity blocks are interleaved with the data blocks across all the disks in the array. That way, if any one disk is damaged, the lost data can be reconstructed from the parity held on the remaining disks. The usable capacity is n-1 disks.
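The parity RAID5 uses is just an XOR across each stripe, so any single missing block can be recovered by XOR-ing together whatever survives. A minimal Python sketch of the idea (my own illustration, not any real RAID implementation):

    # Minimal illustration of RAID5-style XOR parity (not a real RAID implementation).
    # A stripe holds n-1 data blocks plus one parity block; any single lost block
    # can be rebuilt by XOR-ing the remaining blocks together.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    # Three data blocks on three disks, parity on a fourth (a 4-disk RAID5 stripe).
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Simulate losing disk 1: rebuild its block from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]
    print("rebuilt block:", rebuilt)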

If two disks fail, the data is gone.
In theory the probability of two drives failing at the same time is very low, so why does it keep happening?

Mathematically speaking, the mean time between failures (MTBF) of a single disk is roughly 500,000 to 1,500,000 hours (that is, one failure every 50-150 years). That ideal figure is rarely reached in practice; under ordinary cooling and mechanical conditions a drive's actual trouble-free life is noticeably shorter. Since every disk ages differently and any disk in the array can be the one with a problem, an array of n disks is statistically far more likely to see a failure than a single disk. Put those factors together: if the array holds a reasonable number of disks and their effective MTBF is on the short side, a disk failure during the array's expected service life is quite likely (say, one every few months or every few years).

So what are the odds of two disks being damaged "at the same time" ("at the same time" meaning a second disk fails before the first has been replaced and the array fully rebuilt)? If the MTBF of a RAID 5 array really behaved like MTBF squared, such an event would occur only once every 10^15 hours or so (far more than 10,000 years). Whatever the working conditions, the probability of that is vanishingly small. Mathematically it exists, but in practice we would never worry about it. And yet two disks failing "at the same time" does happen, so we cannot dismiss it entirely; in reality the second failure usually has very little to do with MTBF at all.
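The article only gestures at "MTBF squared". For reference, the textbook back-of-the-envelope for RAID5 (not taken from this article) also divides by the number of disk pairs and by the repair window; a minimal sketch with assumed, illustrative numbers:

    # Textbook mean-time-to-data-loss approximation for RAID5:
    #   MTTDL ~ MTBF^2 / (N * (N - 1) * MTTR)
    # All figures below are illustrative assumptions, not numbers from the article.

    mtbf_hours = 1_000_000   # assumed per-disk MTBF (within the 0.5M-1.5M range quoted above)
    n_disks = 6              # disks in the array
    mttr_hours = 24          # assumed time to replace the disk and finish the rebuild

    mttdl_hours = mtbf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)
    print(f"MTTDL ~ {mttdl_hours:.3g} hours (~{mttdl_hours / 8760:,.0f} years)")
    # Roughly 1.4e9 hours, on the order of 150,000 years; on MTBF arithmetic
    # alone, a double failure "should" practically never happen.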

Today I happened to be testing a ZFS array, and the scrub results made the cause very clear, so I took the chance to write up the analysis.

First, let me introduce a concept that ordinary users rarely come across:

BER (Bit Error Rate). It is a very important parameter describing a hard disk's reliability, a measure of how likely the disk is to fail to give your data back.

It expresses the probability that data you have written to the disk will hit an unrecoverable read error when you read it back.

These unrecoverable ECC read errors are statistically rare; the figure is usually quoted as one unrecoverable error per so many bits read (commonly on the order of 10^14 bits for consumer drives).

As drive capacity grows, the amount of data that must be read to cover the whole drive grows with it, while BER has stayed roughly where it was, so the relative exposure keeps increasing. Reading a 1TB drive end to end takes far more bits than reading a 300GB drive, so the chance of hitting an error during a RAID rebuild is correspondingly larger.
So what are the odds of that error? Put another way, how much data can we read back before we run into an unrecoverable read error?

Read this article:
http://lenciel.cn/docs/scsi-sata-reliability/

Hard disks come in different classes. What used to be the enterprise/server/data-center class with SCSI or Fibre Channel interfaces now corresponds to SAS; what used to be the commercial/consumer class with IDE now corresponds to SATA.
Their MTBF (mean time between failures) figures are close, but the BER of a cheap SATA drive is much worse than the BER of an expensive SCSI drive.
In other words, an unreadable sector is a far more common event on SATA than on SCSI.
The gap in BER between the two classes (enterprise SCSI/FC/SAS versus commercial/consumer IDE/SATA) is roughly one to two orders of magnitude.

According to the calculation in that article, for a 1TB drive the probability that you cannot read back every sector reaches about 56%. So if you build your array out of cheap, high-capacity SATA disks, rebuilding the RAID after a disk failure is, for practical purposes, hopeless.
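The arithmetic behind figures like that is simple, even though the exact 56% depends on the assumptions the linked article makes (how much has to be read during a rebuild, and what the real-world error rate is). A minimal sketch using the commonly quoted spec-sheet error rates, 1 unrecoverable error per 10^14 bits for consumer SATA and 10^15 for enterprise drives:

    # Probability of hitting at least one unrecoverable read error (URE) while
    # reading a drive end to end:  p = 1 - (1 - ber)^bits_read.
    # The BER figures are the usual spec-sheet values; the article's exact 56%
    # rests on its own assumptions and is higher than these single-disk numbers.

    def p_read_failure(capacity_bytes, ber):
        bits = capacity_bytes * 8
        return 1 - (1 - ber) ** bits

    GB, TB = 10**9, 10**12

    for label, cap in [("72GB", 72 * GB), ("200GB", 200 * GB), ("1TB", 1 * TB)]:
        sata = p_read_failure(cap, 1e-14)   # consumer SATA
        sas = p_read_failure(cap, 1e-15)    # enterprise SCSI/SAS
        print(f"{label:>6}: SATA {sata:6.2%}   enterprise {sas:6.2%}")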

Let's get back to RAID5.
When RAID5 first caught on, hard disks were no bigger than about 100GB.

Back then, the disks in a RAID5 set were not large, 72GB for example. According to the article, the probability that such an array could not be recovered was about 1.1% (and note that 1.1% is already pretty good, because this only comes into play after a drive has already failed: the two probabilities multiply).

When drive capacity rises to the 200GB class and beyond, assume the failure probability grows roughly linearly with capacity [1]; the rebuild failure rate climbs to around 11%, already enough to make any shop owner swear.

And at 56%, that is, if you build a RAID5 from 1TB SATA disks, then when one disk fails it is almost certain that at least one of the remaining disks (RAID5 needs at least 3) will hit a read error, and the rebuild will fail.

That is why RAID5 built from small disks rarely lost two drives at once, while with today's large drives the odds of running into this keep getting bigger.

Some people will ask: 56%? Really?
The figure sounds hard to believe. There are probably two reasons why we never seem to notice a BER this high.

First, most of us do not run RAID on our personal drives. Our own drives are big these days (mainstream desktop SATA is around 500GB), but we seldom fill them, and most of what we write we rarely ever read again. Statistically, then, the bad sectors are most likely to land in the movies and music we never get around to watching or listening to again.

But for someone running a RAID, the whole drive gets read regularly. Even if the system were considerate enough not to report bad sectors sitting in files you never read, a rebuild or a scrub still has to touch every sector; every latent error gets found, and the 56% shows up.

Second, even when a few sectors are lost, it usually doesn't matter. For movies and music, most compression formats are designed to tolerate a certain amount of corruption: a few lost sectors may show up as nothing more than a brief patch of mosaic over some video, and you never suspect the hard disk.

Nowadays there are so-called surveillance-grade "enterprise" SATA drives. The trick is in the firmware: when the drive hits a read error it simply skips it and moves on without retrying (a standard drive, on hitting a sector CRC error, automatically re-reads until it gets correct data back). For surveillance data that is perfectly reasonable; most surveillance drives are written continuously and almost never read, unless an incident forces you to pull up the footage.
And even then the odds are small. I run dozens of 16-channel DVRs with hundreds of cameras, and playback is needed maybe once every few months, and even then only for a small slice of the data (say, one hour of video from a few specific cameras on one day).
But using this kind of drive for RAID5 is even worse: a RAID5 rebuild needs every byte to be read back absolutely correctly, and a drive that silently skips read errors cannot deliver that.

Now let's look at today's test data:
I used FreeBSD's raidz (ZFS software RAID) to build the equivalent of a RAID5 out of six 1TB drives. None of the drives is damaged; all six are healthy.
That doesn't matter for our purposes. The reason for using raidz here is that it has a data-patrol (scrub) feature, so we can see the results of a full patrol directly.
Ordinary hardware RAID cards, whether plug-in PCI/PCI-X/PCIe cards or RAID5 integrated on the motherboard, have no such feature.
Enterprise storage has it, but only at the disk-array level (IBM DS3000/4000/5000, Dell MD3000, etc.), and even then you never see the patrol results. At best the log tells you a disk failed a CRC check, a red light comes on, and the enclosure alarm tells you to swap the drive. You have no way of knowing whether the drive is completely dead, had a single read error, or merely has a few bad sectors. You're completely in the dark.
This is one of the nice things about ZFS: type zpool scrub tank on the command line (tank being the name of the pool you created) and it faithfully patrols the entire array, reading back every block, comparing it against the checksum recorded when it was written, reporting errors and repairing them from the redundant data. This runs in the background, and the progress, speed, time remaining and results are all visible in zpool status.

1|  scan: scrub in progress since Tue Jan 16:19:26 2012
2|      4.67T scanned out of 5.02T at 332M/s, 0h18m to go
3|      620K repaired, 92.88% done
4|config:
5|
6|      NAME             STATE     READ WRITE CKSUM
7|      ftp              ONLINE       0     0     0
8|        raidz1-0       ONLINE       0     0     0
9|          mfisyspd0p3  ONLINE       0     0     3  (repairing)
            mfisyspd5p3  ONLINE       0     0     6  (repairing)
            mfisyspd1p3  ONLINE       0     0     2  (repairing)
            mfisyspd4p3  ONLINE       0     0     8  (repairing)
            mfisyspd2p3  ONLINE       0     0     2  (repairing)
            mfisyspd3p3  ONLINE       0     0     3  (repairing)

errors: No known data errors

Let me explain what this output means. For convenience I have numbered the lines 1-9 on the left.
Line 2: data scanned / total data in the array, scan speed, and time remaining. Since this was a test, I was extreme enough to fill the array almost to the brim.
Line 3: the amount of data repaired and the patrol progress.

Lines 7 and 8 are the state of the pool and of the raidz vdev.
Line 9 onward lists the six drives and their status.
The important part is the last column (CKSUM, as labeled on line 6) of the drive lines from line 9 on, the number just before "(repairing)".
See that? See that?

Every drive has read errors, and the counts differ: the best disks have 2 each, and the worst has as many as 8, for a total of 24 checksum errors across the six drives.

OK. After a full scan of more than four hours, 620K of the array's data had to be repaired.
And the data had only just been written; the oldest of it was less than a week old.

  scan: scrub repaired 620K in 4h25m with 0 errors on Tue Jan 20:45:12 2012
config:

        NAME             STATE     READ WRITE CKSUM
        ftp              ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            mfisyspd0p3  ONLINE       0     0     3
            mfisyspd5p3  ONLINE       0     0     6
            mfisyspd1p3  ONLINE       0     0     2
            mfisyspd4p3  ONLINE       0     0     8
            mfisyspd2p3  ONLINE       0     0     2
            mfisyspd3p3  ONLINE       0     0     3

The only reason this repair was possible is that this is a RAID5-style array with every drive still healthy, so there was enough redundant checksum data to reconstruct from.

Don't underestimate that 620K. When your RAID5 does lose a drive (and by the MTBF arithmetic, with 50 drives you can expect at least one failure a year),

that 620K becomes deadly:
the remaining five drives can no longer provide any redundant checksum data to fall back on. Unless every single byte on them can be read back cleanly, the RAID5 is finished;
a single bad sector means the rebuild will fail, even if you replace the dead drive with a brand-new one straight away.
With the ordinary array cards on the market, the direct consequence is this: the rebuild onto the hot spare begins, or has not even begun yet
(no hot spare and the failed drive not noticed in time, or no cold spare in the warehouse, purchasing delays, and so on), and the degraded RAID5 limps along serving requests.
Sooner or later a read lands on one of those cursed BER sectors, the drive holding that sector gets kicked out of the array as well... and a second light turns red.

So the data above explains why RAID5 so often drops two disks at once. It is not the user's bad luck: seen through the BER lens, the disks had gone quietly bad long ago; we just never noticed. When one disk finally dies outright for MTBF reasons, the BER sectors on the survivors all jump out during the rebuild, and the RAID5 is dead.

We can also summarize what drives the probability of a RAID5 losing two disks at once:

1. The larger the disks, the greater the chance of hitting a BER sector: a RAID5 built from 100GB disks is safer than one built from 1TB disks.
2. The more disks in the array, the greater the chance: a 3-disk RAID5 is safer than a 6-disk RAID5.
3. The cheaper the disks, the greater the chance: a RAID5 built from SCSI/FC/SAS disks is safer than one built from IDE/SATA disks.
4. The more data stored in the array, the greater the chance: a RAID5 holding 100GB of data is safer than one holding 1TB.

Point 4 also depends on the array card: some cards rebuild only the sectors that actually hold data, while others read the entire disk regardless. The sketch below puts rough numbers on all four factors.
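A hedged sketch of how the four factors move the rebuild-failure probability, using the same formula as before and assuming the rebuild has to read back the surviving disks (in full, or only the used fraction for the smarter cards); the BER figures are the usual spec-sheet assumptions, not measurements:

    # Rebuild-failure probability for a degraded RAID5: the rebuild must read the
    # n-1 surviving disks (all of them, or only the used fraction on cards that
    # rebuild used sectors only).  BER values are assumed spec-sheet figures.

    def p_rebuild_failure(n_disks, capacity_bytes, ber, used_fraction=1.0):
        bits_to_read = (n_disks - 1) * capacity_bytes * used_fraction * 8
        return 1 - (1 - ber) ** bits_to_read

    GB, TB = 10**9, 10**12

    cases = [
        ("3 x 100GB SATA, full",   3, 100 * GB, 1e-14, 1.0),   # small old disks
        ("3 x 1TB SATA, full",     3,   1 * TB, 1e-14, 1.0),   # bigger disks
        ("6 x 1TB SATA, full",     6,   1 * TB, 1e-14, 1.0),   # more disks
        ("6 x 1TB SAS, full",      6,   1 * TB, 1e-15, 1.0),   # better disks
        ("6 x 1TB SATA, 10% used", 6,   1 * TB, 1e-14, 0.1),   # less data stored
    ]
    for label, n, cap, ber, used in cases:
        print(f"{label:25s} -> {p_rebuild_failure(n, cap, ber, used):6.1%}")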

From a data-protection standpoint, the ZFS software array has a unique advantage here: even when a BER sector is present, it can skip over the bad sector and keep reading the rest, which keeps the loss to a minimum (only the file containing the BER sector is affected, not the whole array).

Like this example:

  scan: scrub repaired 0 in 9h0m with 4768 errors on Thu Sep 8 08:23:00 2011
config:

        NAME        STATE     READ WRITE CKSUM
        ftp         ONLINE       0     0    39
          da2       ONLINE       0     0   156

errors: Permanent errors have been detected in the following files:

        ftp:<0x7ef04>
        ftp:<0x7ef11>
        ftp:<0x7ef12>
        ftp:<0x7ee1a>
        ftp:<0x7ef31>
        ftp:<0x7ef42>
        ftp:<0x7ee57>
        ftp:<0x7ef5e>
        ftp:<0x7ef6d>
        ftp:<0x7ee70>
        ftp:<0x7ee71>
        ftp:<0x7ef71>
        ftp:<0x7ee87>

A 10TB volume: the patrol found 156 checksum errors, yet the loss came to only 13 files.
ZFS can even point out which 13 files are affected, and those 13 files are not completely unreadable; only some bytes in them are corrupted.

Perhaps some of those 13 files are audio, video, or images, where losing a few bytes is harmless; perhaps there is another backup and they can simply be copied back; perhaps some are system files that a reinstall will restore. In short, with ZFS the loss can be kept within a tolerable range, and that is the advantage of the ZFS software array.

By the way, the zpool scrub command can be run on a schedule, for example every one or two weeks on a Saturday night, so errors are checked for and repaired automatically. I have never seen this kind of automatic data-patrol feature on any plug-in or on-board RAID card; only array-cabinet-class storage has anything comparable, and even there the patrol report never gets down to the level of individual files.
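A scheduled scrub can be as simple as a cron entry; a minimal sketch (the pool name "tank" and the Saturday 03:00 schedule are placeholders, adjust to your own setup):

    # /etc/crontab entry: scrub the pool "tank" every Saturday at 03:00
    # (pool name and schedule are placeholders)
    0   3   *   *   6   root   /sbin/zpool scrub tank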

If you are on a hardware RAID card, that's just sad... there is nothing like it at all.

OK, the cause has been analyzed and the lessons summed up.
Last section: what should you actually do?

Say I'm a download maniac who has to hoard data, terabytes of it, soon to break into three digits;
I'm broke and can't spend much, so it's the cheapest SATA disks only, ideally the cheapest and biggest WD Green drives;
my skills aren't quite there, ZFS is beyond me;
and I'm easily scared: after reading this article I keep imagining my RAID data vanishing at any moment and I can't sleep at night. What on earth do I do?

Solution 1. For large amounts of unimportant data, such as audio and video that can always be downloaded again, don't build an array at all. Use single disks, with each disk holding its own batch of data.

Once the data volume gets large enough, and these days that means a SATA array of more than 2TB in total, RAID5 not only fails to protect the data: if one disk breaks, you have a better-than-50% chance of losing a second, and then not a hair of your data is left.
If you're lucky, you can force the second failed drive back online and barely bring the array up, and then you still need an equal amount of spare space to evacuate the data.
Most people simply don't have that much spare space for the data, and even if you do, the back-and-forth copying is enough to keep you from a decent night's sleep for a week.
Some will say: not so, my RAID5 reads at 300MB/s, 2TB is only a couple of hours.
That's your healthy array. A degraded RAID5 delivers perhaps a tenth to a twentieth of its usual performance, plus there will be files that cannot be read and have to be skipped, plus the array may well drop out partway.
Then you force it back online and start over, and this will happen more than once: in the test above, six 1TB disks had accumulated 24 read errors between them, so your evacuation stalls 24 times. Pile all of that together and, unless you are an old data-center hand with good nerves who has worked through n incidents like this, the first time a novice runs into it is a nightmare.

Solution 2. Build a RAID1 array. For important and frequently updated data, such as financial data, photos and documents, RAID1 is recommended.
RAID1 costs you a whole disk, but the protection is much better.

Solution 3: cold backup to optical media. Burning discs, in other words, heh. Or cold-standby hard disks: keep two disk copies of everything and leave one unplugged in a drawer. This suits photos and similar data best.

Solution 4: RAID6. But this needs a fairly capable RAID card, and those cards are often very expensive, easily the price of one or two 1TB disks;
for the same usable capacity you also need one more disk. Personally, with six drives or fewer, I don't think it has any advantage over the RAID1 approach: no cost advantage and no performance advantage, so you might as well just do RAID1.
It does have real merits: even after losing a disk it is still effectively a RAID5, so performance doesn't fall off a cliff, the data still has redundancy, and you can calmly wait for the replacement drive to come back from warranty.
Note, though: this is for six or more 1TB-class disks. If your total capacity is below that, don't even think about RAID6.

Transferred from: https://blog.hackroad.com/operations-engineer/basics/10247.html
