Analysis of ZFS RAID-Z Technology

Source: Internet
Author: User

RAID (Redundant Array of Inexpensive Disks) originally promised fast, reliable storage built from cheap disks. The key point was cheap, yet somehow that is not what we ended up with. Why?

RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, EVENODD, and row-diagonal parity) has never fully delivered on the RAID promise, and cannot, because of a fatal flaw known as the RAID-5 write hole. Whenever the data in a RAID stripe is updated, the parity must be updated too, so that all disks XOR to zero. It is that equation that lets you reconstruct data when a disk fails. The problem is that two or more disks cannot be updated atomically, so a RAID stripe can be damaged when power is lost.
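
The XOR relationship between data and parity can be illustrated with a minimal sketch. The block contents, the three-data-disk layout, and the helper name xor_blocks are assumptions made for illustration; this is not code from any RAID product.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data disks and one parity disk (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The invariant: data and parity together XOR to zero.
assert xor_blocks(data + [parity]) == bytes(4)

# Disk 1 fails: rebuild its block from the surviving data and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Reconstruction works only while the invariant holds, which is why a stripe whose data and parity disagree is dangerous.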

To see the problem, suppose you lose power after writing a data block but before writing the corresponding parity block. The data and parity for that stripe are now inconsistent, and they will stay inconsistent forever (unless a full-stripe write happens to overwrite the old data at some point). Therefore, if a disk fails, the RAID reconstruction process will generate garbage the next time you read any block on that stripe. Worse, it does so silently: it has no idea that it is handing you corrupt data.

There are software-only workarounds for this, but they are so slow that software RAID has died in the marketplace. Current RAID products all implement the RAID logic in hardware, where they can use NVRAM to survive power loss. This works, but it is expensive.
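
To make the write hole concrete, the following sketch replays the power-loss scenario in memory. The disk contents and the helper xor_blocks are hypothetical; the "power loss" is modelled simply by skipping the parity write.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A consistent stripe: three data disks plus parity (hypothetical contents).
disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(disks)

# Update disk 0, then "lose power" before the matching parity write.
disks[0] = b"ZZZZ"
# parity = xor_blocks(disks)   # this write never reaches the disk

# Later, disk 1 fails. Reconstruction uses the stale parity.
rebuilt_disk1 = xor_blocks([disks[0], disks[2], parity])

stripe_is_consistent = xor_blocks(disks + [parity]) == bytes(4)
print(stripe_is_consistent)        # False
print(rebuilt_disk1 == b"BBBB")    # False: garbage, and nothing flags it
```

Note that the block lost here (disk 1) is not the block that was being written (disk 0): the damage spreads to data that was never touched.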

Existing RAID schemes also have a nasty performance problem. When you do a partial-stripe write, that is, when you update less data than a single RAID stripe contains, the RAID system must read the old data and the old parity in order to compute the new parity. That is a huge performance hit. A full-stripe write can simply issue all of its writes asynchronously, whereas a partial-stripe write must complete synchronous reads before it can even start writing.
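
The difference between the two write paths can be sketched as follows. The function names and the returned read/write counts are illustrative assumptions for a single-parity stripe held in memory, not measurements from a real array.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def partial_stripe_write(stripe, parity, index, new_data):
    """Update one data block; return (new parity, reads, writes)."""
    old_data = stripe[index]            # synchronous read 1: old data
    old_parity = parity                 # synchronous read 2: old parity
    new_parity = xor(xor(old_parity, old_data), new_data)
    stripe[index] = new_data            # write 1: new data
    return new_parity, 2, 2             # write 2 is the new parity

def full_stripe_write(new_blocks):
    """Write a whole stripe; parity comes from the new data alone."""
    parity = new_blocks[0]
    for block in new_blocks[1:]:
        parity = xor(parity, block)
    return parity, 0, len(new_blocks) + 1

stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity, _, _ = full_stripe_write(stripe)
new_parity, reads, writes = partial_stripe_write(stripe, parity, 1, b"XXXX")

# The read-modify-write result matches a parity recomputed from scratch.
assert new_parity == full_stripe_write(stripe)[0]
```

The full-stripe path needs no reads at all, which is what allows its writes to be issued without waiting on the disks first.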

Once again, expensive hardware offers a solution: a RAID array can buffer partial-stripe writes in NVRAM while it waits for the disk reads to complete, so the read latency is hidden from the user. Of course, this only works until the NVRAM buffer fills up. No problem, your storage vendor will say! Just shell out more cash and buy more NVRAM. There is no problem your wallet cannot solve.

Partial-stripe writes pose a further problem for a transactional file system such as ZFS. A partial-stripe write necessarily modifies live data, which violates one of the rules that guarantee transactional semantics. (Losing power during a full-stripe write does no harm, for the same reason that losing power during any other write in ZFS does no harm: none of the blocks you are writing to are live yet.)

If only we did not have to do those pesky partial-stripe writes ...

Enter the world of RAID-Z.

RAID-Z is a data/parity scheme like RAID-5, but it uses a dynamic stripe width. Every block is its own RAID-Z stripe, regardless of block size. This means that every RAID-Z write is a full-stripe write. Combined with the copy-on-write transactional semantics of ZFS, this completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do a read-modify-write.
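
A rough way to picture dynamic stripe width is the sketch below. It is a deliberately simplified single-parity model, not the actual ZFS allocator: the function raidz_stripe and its sector labels are invented for illustration, and it ignores on-disk details such as where each row is placed and any padding.

```python
def raidz_stripe(data_sectors, ndisks):
    """Return the rows of one block's stripe as lists of sector labels.

    Simplified single-parity model: each row holds one parity sector
    followed by at most ndisks - 1 data sectors.
    """
    per_row = ndisks - 1
    rows = []
    for start in range(0, data_sectors, per_row):
        count = min(per_row, data_sectors - start)
        rows.append(["P"] + [f"D{start + i}" for i in range(count)])
    return rows

# Blocks of 1, 3, and 6 data sectors on a hypothetical 5-disk group.
for size in (1, 3, 6):
    print(size, raidz_stripe(size, ndisks=5))
```

In this model a one-sector block becomes a two-sector stripe (one parity, one data), while a six-sector block spans two rows. In every case the parity is computed from the block being written alone, so no old data has to be read back.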

Whoa, whoa, whoa -- that's it? Variable stripe width? Geez, that seems too simple. If it is such a good idea, why doesn't everybody do it?

The tricky bit is RAID-Z reconstruction. Because the stripes are all different sizes, there is no simple formula like "all the disks XOR to zero." You have to traverse the file system metadata to determine the RAID-Z geometry. Note that this would be impossible if the file system and the RAID array were separate products, which is why there is nothing like RAID-Z in the storage market today. You really need an integrated view of the logical and physical structure of the data to pull it off.

But wait, you say: isn't that slow? Isn't it expensive to traverse all the metadata? Actually, it is a trade-off. If your storage pool is very close to full, then yes, it is slower. Otherwise, metadata-driven reconstruction is actually faster, because it copies only live data and wastes no time copying unallocated disk space.
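
The trade-off can be pictured with a toy cost model. The BlockPointer record, the two cost functions, and the numbers are all hypothetical; the model counts only sectors copied and ignores the extra seeking that a metadata walk incurs, which is the part that hurts when a pool is nearly full.

```python
from dataclasses import dataclass

@dataclass
class BlockPointer:
    """Hypothetical metadata record for one live block on the failed disk."""
    offset: int    # first sector of the block on that disk
    sectors: int   # sectors the block occupies on that disk

def whole_disk_rebuild_cost(disk_sectors):
    # A traditional array cannot tell which sectors are in use,
    # so it reconstructs every sector on the disk.
    return disk_sectors

def metadata_rebuild_cost(live_blocks):
    # Walking the metadata visits only the allocated blocks.
    return sum(bp.sectors for bp in live_blocks)

disk_sectors = 1_000_000
live_blocks = [BlockPointer(0, 128), BlockPointer(4096, 256), BlockPointer(90_000, 8)]

print(whole_disk_rebuild_cost(disk_sectors))   # 1000000
print(metadata_rebuild_cost(live_blocks))      # 392
```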

More importantly, traversing the metadata means that ZFS can validate every block against its 256-bit checksum as it goes. Traditional RAID products cannot do this; they simply XOR the data together blindly.

Which brings us to the coolest thing about RAID-Z: self-healing data. In addition to handling whole-disk failure, RAID-Z can also detect and correct silent data corruption. Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks did not return the right answer, ZFS reads the parity and then performs combinatorial reconstruction to work out which disk returned bad data. It then repairs the damaged disk and returns good data to the application. ZFS also reports the incident through Solaris FMA, so that the system administrator knows that one of the disks is silently failing.
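
The read path described above might look roughly like the following sketch. It is not the ZFS implementation: self_healing_read and its helpers are hypothetical names, SHA-256 merely stands in for whatever 256-bit checksum the pool uses, and only single-parity reconstruction of one bad data column is handled.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def checksum(columns):
    return hashlib.sha256(b"".join(columns)).digest()

def self_healing_read(columns, parity, expected):
    """Return (good data columns, index of the bad column or None)."""
    if checksum(columns) == expected:
        return columns, None
    # Checksum mismatch: assume each data column in turn is the bad one,
    # rebuild it from the others plus parity, and re-check the checksum.
    for bad in range(len(columns)):
        others = columns[:bad] + columns[bad + 1:]
        candidate = list(columns)
        candidate[bad] = xor_blocks(others + [parity])
        if checksum(candidate) == expected:
            return candidate, bad
    raise IOError("unrecoverable: more than one column is damaged")

good = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(good)
expected = checksum(good)               # kept apart from the data it covers

on_disk = [b"AAAA", b"B?BB", b"CCCC"]   # disk 1 silently returned bad data
data, bad = self_healing_read(on_disk, parity, expected)
print(data == good, bad)                # True 1
```

The checksum is what makes the guess-and-check loop possible: parity alone can say that something in the stripe is wrong, but not which disk is lying.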

Finally, note that RAID-Z does not require any special hardware. It does not need NVRAM for correctness, and it does not need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap commodity disks.

For a real-world example of RAID-Z detecting and correcting silent data corruption on flaky hardware, see Eric Lowe's SATA saga.

The current RAID-Z algorithm is single-parity, but the RAID-Z concept works for any RAID flavor. A double-parity version is in the works.

One last thing that fellow programmers will appreciate: the entire RAID-Z implementation is only 599 lines.

 

 
