Security performance of RAID

Source: Internet
Author: User

Problem:

Since data recovery is a remedy for data disasters, can a data catastrophe still happen in a RAID disk array system that is supposedly designed to be absolutely secure? Why? And what are the common failure types seen in RAID data recovery?

Reply

RAID was designed to address roughly three problems: capacity, IO performance, and storage security (redundancy). From the point of view of data recovery, capacity and IO performance are not the concern here; only storage security is.

The storage-security schemes common in RAID are RAID1, RAID5 and their variants. The basic design idea is similar: an algorithm maintained across several disks guarantees that when part of the data becomes abnormal, it can be restored by that same algorithm. Take the RAID5 design as a simple example. If we need to record two numbers, we can gain redundancy by recording one extra value: record 3 and 5, then also record 8 (their sum, 3+5). If we later lose one of the two numbers, say the 3, we only need to compute 8-5 to get the missing number back, and the same applies the other way around.

A disk array saves data with the same kind of algorithm. When a 3-disk RAID5 set is working normally, all data written to the RAID is written to specific disk addresses, and a computed value (usually called parity, or a checksum) is generated alongside it; this is the state with the best read/write efficiency. When one of the disks fails, the data originally stored on the failed disk is regenerated from the data on the other disks. The controller (a RAID card for hardware RAID; for software RAID it is in fact a driver) is responsible for this work and keeps the storage looking normal, so the operating system never thinks anything is wrong with the disk system.
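
The arithmetic above uses addition for simplicity; real RAID5 parity is usually computed with bitwise XOR, but the recovery principle is the same. Below is a minimal Python sketch of that principle, an illustration only, not any controller's actual implementation:

```python
# Minimal sketch of RAID5-style redundancy: parity is the XOR of the data blocks.
# Illustration only; real controllers work on fixed-size stripes and rotate parity.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR same-length blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# One stripe of a 3-disk RAID5: two data blocks plus one parity block.
d0 = b"\x03\x03\x03\x03"          # block on disk 0
d1 = b"\x05\x05\x05\x05"          # block on disk 1
p  = xor_blocks(d0, d1)           # parity block, written to disk 2

# If disk 0 fails, its block is rebuilt from the survivors,
# just as the missing number is recovered by computing 8 - 5.
assert xor_blocks(d1, p) == d0
```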

From the principle above, the storage security that RAID provides still has some weak points that are not easy to avoid. Hitting them is unlikely, but the value of the data stored on a RAID is often impossible to measure, and the slightest failure can turn into a large information disaster.

To get to the point, the common RAID failure scenarios are:

1. Staying in a degraded state without rebuilding in time: RAID provides data security through redundancy kept in extra storage space, but once a disk fails and goes offline, the array can no longer provide that redundancy. If the administrator does not replace the disk and rebuild the whole volume in time, and another of the remaining disks then fails, the RAID volume can no longer work. This type of failure accounts for a fairly high share of RAID data recovery cases, and it shows that good server maintenance and management does not come easily.

2. Controller failure: the controller is the data link between the physical disks and the operating system. Because how a RAID is put together is not a fixed, universal convention, factors such as disk capacity, number of disks, RAID level, logical disk partitioning, block (stripe) size and parity layout are recorded as RAID configuration information (RAID metadata), which is sometimes stored on the array card, sometimes on the disks, and sometimes on both (a rough illustration of what such metadata contains appears after this list). If the controller fails, in many cases simply installing a new controller cannot restore the RAID information, and lower-end controllers, built down to a cost, are much more vulnerable to this. Even if you remember the original RAID structure, recreating the array from scratch is the wrong way to recover the data (see the related articles).

3. Firmware algorithm defects: RAID creation, rebuild, degradation, protection and other operations are implemented in the controller by very complex algorithms, and this complexity exists precisely to make the behaviour as foolproof as possible. Manufacturers will not readily admit to controller bugs, but there is no doubt that such problems cannot be completely avoided on any controller, and bugs in the firmware algorithms can cause many unexplained failures. For example, in some server data recovery cases involving early-production Dell 2950 servers, the disk that the RAID reported offline and the disk indicated by the alarm light were inconsistent; when the customer replaced what looked like the failed disk and rebuilt, the wrong disk was pulled and the entire RAID group crashed.

4. Disks forced offline by blocked IO channels: to keep data as safe as possible, the RAID controller is designed to avoid writing data to unstable storage media. When IO between the controller and a physical disk exceeds a time threshold, or fails a consistency check, the controller decides that the device is no longer capable of stable, continuous work, forces it offline, and notifies the administrator to resolve the problem as soon as possible. This design is good and correct, but for random causes such as a loose physical link or a mechanical response timeout inside the disk (the disk itself may be perfectly intact), the controller cannot tell whether the device is as stable as before, and a seemingly minor hiccup can cause the whole RAID volume to fail. This type of failure happens with considerable probability and cannot really be avoided; it also accounts for most RAID failures in which the disks themselves have not actually failed. Many of our data recovery customers therefore go back and question the server manufacturer, who has their own difficulty: to some extent, the more "safely" a controller is designed, the more often this phenomenon occurs.

5. Controller stability: a RAID controller works most stably when the array is fully online (no offline disk). When some disks are damaged (possibly only a logical failure) and go offline, the controller has to work much harder; this is why many low-end RAID controllers show a sharp drop in read/write performance once a disk is offline. An overloaded controller greatly increases the likelihood of IO stalls in the data path, which in turn can lead to the forced-offline situation described in point 4 above. Controllers without a high-speed hardware processing chip and without a large cache have a much higher probability of this kind of failure; to avoid the business downtime and cost of data recovery after such a failure, try not to choose this kind of disk array controller.

6. Bad sectors on the disks: this situation is interesting. Many people assume that a normally working RAID cannot contain a bad disk, because as soon as a disk goes bad the RAID takes it offline, and after it is replaced and rebuilt the array again consists only of good disks. In practice this cannot be relied on, because over a long period of operation a RAID volume will rarely read every sector of every physical disk, let alone do so regularly. Bad sectors can develop in areas that are never read, or that used to read fine, and as long as they are not read or written the controller still believes the disk is good. The most immediate danger of these latent bad sectors appears during a rebuild. When one physical disk goes offline, technicians and official documentation all say to replace it as soon as possible, but the rebuild performs a full synchronization of all the remaining disks, so it is certain to read those bad sectors. The rebuild then cannot finish, the new disk cannot come online, and because bad sectors were found on an old disk the controller may take yet another disk offline, which can leave the RAID failed and unable to recover the data by itself (a toy sketch of this rebuild scenario appears after this list).

7. Human error: a significant share of data recovery cases could have been avoided, but situations like these keep happening: someone unrelated pulls a disk out of a running RAID by mistake; no spare disks are prepared; a failed disk is not replaced in time; disks are removed for dusting and their original order is forgotten; the original RAID configuration is accidentally deleted.

8. Other reasons I can't remember at the moment.
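
As a rough illustration of the RAID metadata mentioned in point 2, the fields below are the kind of information a controller has to keep somewhere (on the card, on the disks, or both). The names and layout here are hypothetical, not any vendor's actual on-disk format:

```python
# Hypothetical sketch of RAID configuration metadata (point 2 above).
# Field names and values are illustrative only, not a real controller's format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RaidMetadata:
    raid_level: int            # e.g. 1 or 5
    member_disks: List[str]    # physical disks, in array order
    disk_capacity_gb: int      # capacity of each member
    stripe_size_kb: int        # block (stripe unit) size
    parity_rotation: str       # e.g. "left-asymmetric"
    logical_volumes: List[str] = field(default_factory=list)

# Losing or overwriting this description is what makes a controller failure,
# or a careless re-creation of the array, so dangerous for the data.
cfg = RaidMetadata(
    raid_level=5,
    member_disks=["disk0", "disk1", "disk2"],
    disk_capacity_gb=300,
    stripe_size_kb=64,
    parity_rotation="left-asymmetric",
    logical_volumes=["/dev/sdb"],
)
```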
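
And for point 6, a toy continuation of the earlier parity sketch shows why one latent bad sector on a surviving disk can stop a rebuild: the stripe containing it can no longer be reconstructed. This is purely illustrative; a real controller's error handling is far more involved:

```python
# Toy model of a RAID5 rebuild hitting a latent bad sector (point 6 above).
# Purely illustrative, not real controller behaviour.

def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def read_block(disk, stripe):
    """Return the block, or raise if the sector is unreadable."""
    block = disk[stripe]
    if block is None:                      # latent bad sector, never read until now
        raise IOError("unrecovered read error")
    return block

# Disk 0 has gone offline; disks 1 and 2 survive, but disk 1 hides a bad sector.
disk1 = [b"\x05" * 4, None,        b"\x07" * 4]
disk2 = [b"\x06" * 4, b"\x01" * 4, b"\x02" * 4]

rebuilt = []
for stripe in range(3):
    try:
        rebuilt.append(xor_blocks(read_block(disk1, stripe), read_block(disk2, stripe)))
    except IOError:
        # The rebuild cannot finish; many controllers give up here and the
        # already-degraded array drops out entirely.
        print(f"rebuild aborted at stripe {stripe}")
        break
```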

Apart from the human errors, most of these disaster causes are difficult to avoid directly; they can only be addressed by combining RAID with backups and building an overall storage security solution. Other articles will cover these causes, together with security recommendations, separately from the data recovery topic.

