ZFS: Sun's Latest File System


* ZFS overview
* ZFS data integrity and security
  o Transaction-based copy-on-write
  o End-to-end checksums
* ZFS scalability
  o 128-bit architecture
  o Dynamic metadata
  o File system performance
* Resources

ZFS Overview

Sun recently added ZFS to the Solaris 10 06/06 operating system. ZFS is a ground-up redesign of the traditional UNIX file system. Sun engineers and members of the open source community drew on some of the best ideas on the market (such as Network Appliance snapshots, Veritas object-based storage management, transactions, and checksumming) and combined them with their own ideas and expertise to develop a new, streamlined approach to file system design. Although ZFS is still young, it has already made a strong impression on other UNIX vendors, and developers committed to open source have announced plans to port ZFS to their own operating systems (see Porting ZFS to Other Platforms on the OpenSolaris site).

With ZFS, Sun addresses important problems that have plagued other UNIX file systems: integrity, security, scalability, and difficulty of administration. This article is divided into two parts that examine how ZFS works behind the scenes and how that work saves the enterprise time and money. This first part discusses the data integrity and security model and ZFS scalability. The second part covers manageability and the continuing evolution of ZFS.

ZFS Data Integrity and Security

Data integrity and security are the most important properties of any file system. It is critical that information on disk is protected from bit rot, silent corruption, and malicious or accidental tampering. In the past, file systems have struggled in various ways to meet these challenges and still deliver reliable, accurate data.

File systems built on older technology (such as earlier versions of UFS) overwrite blocks in place when modifying in-use data. If power fails during the write, the data is corrupted and the file system may lose block pointers to important data. To address this, the fsck command does its best to find the dirty blocks and reconnect whatever information it can. Unfortunately, fsck has to examine the entire file system, which takes anywhere from several seconds to several hours depending on the size of the file system. For business-critical systems, every minute of downtime costs a great deal of money. To shorten recovery after a power failure, many file system implementations (including newer versions of UFS) added logging (journaling). If the log itself is corrupted, fsck can still be used to repair the file system. However, even logging UFS and most other journaling file systems do not log user data, because the overhead is too high.

For reliability, disks or file systems have traditionally been mirrored with some form of volume management software. If the two halves of the mirror become inconsistent because of a power failure, one half must be resynchronized to the other in full, even if only a few blocks are in question. During resynchronization, I/O performance suffers, and the system cannot always tell which copy of the data is undamaged. Sometimes it picks the wrong side of the mirror to trust, and good data is overwritten with bad. To address the performance problem, some volume managers introduced dirty region logging (DRL), so that only the regions being written when the power failed need to be resynchronized. This mitigates the performance problem effectively, but it still does not solve the problem of detecting which half of the mirror holds valid data.

ZFS solves these problems by making all modifications through transaction-based copy-on-write and by always checksumming every block in use in the file system.

Transaction-Based Copy-on-Write

ZFS combines the file system and the volume manager. Because storage is virtualized into pools, file-system-level commands need no knowledge of the underlying physical disks. All high-level interactions go through the Data Management Unit (DMU). The DMU is similar in concept to a memory management unit (MMU), except that it applies to disks rather than RAM. All transactions committed through the DMU are atomic, so the data is never left in an inconsistent state.

In addition to being transactional, ZFS performs only copy-on-write operations. This means that blocks containing in-use data on disk are never modified in place. The changed information is written to alternate blocks, and the block pointer to the in-use data is moved only after the write transaction completes. This applies all the way up the file system block structure, from the bottom to the top-level block, called the uberblock.

Figure 1: Copy-on-write transaction. The transaction selects unused blocks to write the modified data, and only afterwards changes the location that the block pointers refer to.

The pointer to the "correct" data is not moved until the entire write has completed, so nothing is corrupted even if the machine loses power in the middle of a write. (Note that only the pointer to the data moves.) This eliminates the need to journal the file system, and there is no need to run fsck or resynchronize a mirror after an unclean restart.
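The ordering described above (write the new data to an unused block first, move the pointer last) can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: the class CowStore, its single root pointer standing in for the uberblock, and the crash_before_pointer_move flag are all hypothetical names invented for the illustration and have nothing to do with the actual ZFS code.

```python
class CowStore:
    """Toy copy-on-write store: blocks holding live data are never overwritten."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes (stands in for the disk)
        self.next_id = 0     # next unused block id
        self.root = None     # stands in for the top-level (uberblock) pointer

    def _allocate(self, data):
        # New data always goes into a fresh, previously unused block.
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def commit(self, data, crash_before_pointer_move=False):
        new_id = self._allocate(data)    # step 1: write the new data elsewhere
        if crash_before_pointer_move:
            return                       # simulated power loss: pointer untouched
        self.root = new_id               # step 2: move the pointer in one step

    def read(self):
        return self.blocks[self.root]


store = CowStore()
store.commit(b"version 1")
store.commit(b"version 2", crash_before_pointer_move=True)
print(store.read())    # the interrupted write left the committed data intact
store.commit(b"version 3")
print(store.read())
```

Because the old block is never touched, an interrupted commit leaves the store exactly as it was before the write began, which is why no journal replay or fsck pass is needed in this model.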

End-to-End Checksums

To guard against accidental data corruption, ZFS provides memory-based end-to-end checksums. Most checksumming file systems protect only against bit rot, because they use self-consistent blocks in which the checksum is stored with the block itself. In that case, nothing outside the block is consulted to verify its validity. This style of checksum cannot catch the following:

* Phantom writes, where the write is silently dropped.
* Misdirected reads or writes, where the disk accesses the wrong block.
* DMA parity errors between the array and server memory, or from the driver, because the checksum validates the data only inside the array.
* Driver errors, where the data ends up in the wrong buffer inside the kernel.
* Accidental overwrites, such as swapping to a live file system.

In ZFS, the checksum is not stored in the block itself but next to the pointer to the block, and this continues all the way up to the uberblock. Only the uberblock has a self-validating SHA-256 checksum. All block checksums are computed in server memory, so any error anywhere in the tree is caught, including the misdirected reads and writes, parity errors, and phantom writes described above. In the past, the CPU load would have slowed the machine to a crawl, but CPUs are now fast enough to checksum disk transactions on the fly. ZFS not only catches these problems; in a mirrored or RAID-Z configuration, the data is also self-healing. (Part 2 of this article has more information about RAID-Z.)
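To see why keeping the checksum next to the pointer matters, consider a phantom write: the old block is still internally consistent, so a checksum stored inside the block would pass. The Python sketch below models the parent-side check. The names BlockPointer and verified_read, and the dictionary used as a disk, are assumptions made for the illustration and are not ZFS structures.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Parent-side pointer: where the child block lives and what it must hash to."""

    def __init__(self, address, data):
        self.address = address
        self.expected = checksum(data)


def verified_read(disk, pointer):
    # Verify the block against the checksum held by its parent pointer.
    data = disk[pointer.address]
    if checksum(data) != pointer.expected:
        raise IOError(f"checksum mismatch at block {pointer.address}")
    return data


disk = {0: b"old contents", 1: b"other file"}
ptr = BlockPointer(0, b"new contents")   # the parent expects the new data ...
# ... but the write was silently dropped, so block 0 still holds the old,
# internally consistent contents.
try:
    verified_read(disk, ptr)
    detected = False
except IOError:
    detected = True
print("phantom write detected:", detected)
```

A misdirected read fails in the same way, because whatever block the disk returns is compared against the checksum the parent recorded for the intended block.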

Here is one of Sun's favorite demos, which uses dd to show self-healing data. In it, c0t1d0s5 is one half of a mirror or part of a RAID-Z file system:

# dd if=/dev/urandom of=/dev/dsk/c0t1d0s5 bs=1024 count=100000

This writes garbage over one half of the mirror, but when those blocks are accessed, ZFS checksums them and recognizes that the data is bad. ZFS then checksums the other copy of the data and finds that it is valid. Instead of crashing because of the corruption, it resynchronizes the bad blocks on the damaged half of the mirror. In a RAID-Z configuration, ZFS checks the block on each disk in turn and compares the parity checksum until it finds a match. Once a match is found, ZFS knows it has a valid data block and repairs the bad copies on all the other disks. The resynchronization is completely transparent to users, who never notice a problem.
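The mirror case can be modeled in a few lines of Python. This is a rough sketch, not the ZFS implementation: the function self_healing_read, the dictionaries standing in for the two sides of the mirror, and the in-place repair are all simplifications assumed for the example.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def self_healing_read(mirror_sides, block_no, expected):
    """Return valid data for block_no and rewrite any side that fails the check."""
    good = None
    bad_sides = []
    for side in mirror_sides:
        data = side[block_no]
        if checksum(data) == expected:
            if good is None:
                good = data
        else:
            bad_sides.append(side)
    if good is None:
        raise IOError("no valid copy of the block on any side of the mirror")
    for side in bad_sides:
        side[block_no] = good    # repair the damaged copy from the good one
    return good


payload = b"important data"
side_a = {7: payload}
side_b = {7: b"\x00garbage written by dd"}    # simulated damage
result = self_healing_read([side_a, side_b], 7, checksum(payload))
print(result == payload, side_b[7] == payload)
```

The caller only ever sees valid data. The repair of the damaged side happens as a side effect of the read, which mirrors the transparency described above.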

ZFS continually checks for data corruption in the background through a process called "scrubbing." The file system code used for scrubbing is the same code used for resynchronizing, attaching a mirror, and replacing a disk, so the whole process is tightly integrated. An administrator can also force a check of the entire storage pool by running the zpool scrub command. (Part 2 of this article has more information about storage pools.)

# zpool scrub testpool
# zpool status

  pool: testpool
 state: ONLINE
 scrub: scrub completed with 0 errors on Thu Jun 29 12:47:15 2006

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0

After the dd command shown earlier has been run against one half of the mirror, the output looks similar to the following:

# zpool scrub testpool
# zpool status

  pool: testpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Thu Jun 29 12:51:29 2006

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     5  2.50K repaired
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0

The output now reports that:

* A device may be failing.
* Applications were not affected by the error. ZFS corrected it in the background.
* The device may need to be replaced.

The output also tells you:

* How to replace the faulty device
* How to clear the errors
* A URL with more information about this type of error and how to troubleshoot and resolve it
* Which disk had the errors and how much data was repaired

Running zpool online for the affected device in testpool clears the errors in the CKSUM column, although the status continues to show that data on c0t1d0s5 was repaired.

For more information, see Jeff Bonwick's blog post on ZFS mirror resynchronization.

Another security advantage of ZFS is its use of NFSv4/NT-style ACLs, with full allow/deny semantics and inheritance. Built on 17 different attributes, these access controls are extremely fine-grained.

ZFS Scalability

Data security and integrity are paramount, but a file system must also perform well and stand the test of time, or it will not be of much use. The ZFS designers removed or greatly reduced the limits imposed by modern file systems by using a 128-bit architecture and by making all metadata dynamic. In addition, ZFS implements pipelined I/O, dynamic block sizing, intelligent prefetch, dynamic striping, and built-in compression to improve performance.

128-Bit Architecture

Current industry trends show disk drive capacity doubling every nine months to a year. If the trend continues, file systems will need 64-bit addressing in about 10 to 15 years. Rather than planning only for 64-bit requirements, the ZFS designers took the long view and implemented a 128-bit file system. This means that ZFS offers more than 16 billion billion times the capacity of current 64-bit file systems. As ZFS chief designer Jeff Bonwick puts it in ZFS: The Last Word in File Systems, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans." Jeff also works through the math behind that statement in his blog post on 128-bit storage. Since no technology on the mass market can produce that effect, we should be safe for a while.
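The size of the jump from 64-bit to 128-bit addressing is easy to check with integer arithmetic. The short Python sketch below computes the capacity ratio and, taking the article's doubling rate of nine to twelve months as given, estimates how long the same growth trend would need to cover the extra 64 bits. The variable names are chosen for the illustration only.

```python
CAPACITY_64 = 2 ** 64      # addressable units with 64-bit addressing
CAPACITY_128 = 2 ** 128    # addressable units with 128-bit addressing

ratio = CAPACITY_128 // CAPACITY_64
print(f"128-bit / 64-bit capacity ratio: {ratio:.3e}")

# Each doubling of capacity uses up one more address bit.
doublings = ratio.bit_length() - 1

for months_per_doubling in (9, 12):
    years = doublings * months_per_doubling / 12
    print(f"doubling every {months_per_doubling} months: {years:.0f} years")
```

The ratio is itself 2^64, so at the stated growth rate the additional address space corresponds to 64 further doublings of capacity.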

Dynamic Metadata

Not only is ZFS 128-bit, its metadata is also dynamic. As a result, creating new storage pools and file systems is extremely fast. Only 1 to 2 percent of the writes to disk are metadata, which greatly reduces the initial overhead. For example, there are no static inodes, so the only limit on the number of inodes is the size of the storage pool.

The 128-bit architecture also means that there are no practical limits on the number of files, directories, and so on. Some of the theoretical limits are listed below, and they are hard to picture:

* The number of snapshots in any file system cannot exceed 2^48
* The number of files in a single file system cannot exceed 2^48
* A file system cannot exceed 16 exabytes
* A file cannot exceed 16 exabytes
* An attribute cannot exceed 16 exabytes
* A storage pool cannot exceed 3 x 10^23 petabytes
* The number of attributes of a file cannot exceed 2^48
* The number of files in a directory cannot exceed 2^48
* The number of devices in a storage pool cannot exceed 2^64
* The number of storage pools per system cannot exceed 2^64
* The number of file systems in each storage pool cannot exceed 2^64

File System Performance

The basic design of ZFS offers a large number of performance enhancements over traditional file systems. For starters, ZFS uses a pipelined I/O engine, similar in concept to a CPU pipeline. The pipeline operates on I/O dependencies, which can be ordered by priority and deadline. It provides scoreboarding, priorities, deadline scheduling, out-of-order issue, and I/O aggregation. ZFS also implements an intelligent prefetch algorithm that recognizes linear or algorithmic access patterns and predicts the next blocks to prefetch.
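The idea of recognizing an access pattern and predicting the next block can be sketched with a simple stride detector. This is not the ZFS prefetch algorithm, whose details the article does not give. The function predict_next_block and its min_run parameter are hypothetical and only illustrate the general technique.

```python
def predict_next_block(history, min_run=3):
    """Return the predicted next block number, or None if no steady stride is seen."""
    if len(history) < min_run:
        return None
    recent = history[-min_run:]
    strides = [b - a for a, b in zip(recent, recent[1:])]
    # A constant, non-zero stride over the recent accesses counts as a pattern.
    if strides[0] != 0 and all(s == strides[0] for s in strides):
        return recent[-1] + strides[0]
    return None


print(predict_next_block([10, 11, 12, 13]))     # sequential reads
print(predict_next_block([100, 96, 92, 88]))    # strided reads, going backwards
print(predict_next_block([5, 42, 7, 19]))       # no recognizable pattern
```

A real prefetcher would track several streams at once and adjust how far ahead it reads. The sketch shows only the pattern-recognition step.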

Wherever possible, ZFS uses concurrency to improve speed. The file system supports parallel reads and writes to the same file, as well as parallel, constant-time directory operations. The locking strategy is scalable and fast. In addition, operations within any given transaction can be performed in any order, so the DMU batches reads and writes to optimize disk operations. Because transactions are copy-on-write, contiguous blocks can be chosen for new data instead of seeking randomly across the disk, which lets the disks run at or near their full speed. ZFS also automatically matches the block size (from 512 bytes to 128 KB) to the workload for best performance.

ZFS dynamically stripes data across all available devices. When another disk or slice is added to the stripe, ZFS automatically incorporates the new space and rebalances writes across the stripe to make use of it. If a device enters a degraded state because of errors, ZFS does its best to avoid writing to that device and spreads the load across the others.
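One simple way to picture this behavior is an allocator that sends each new write to the healthy device with the most free space. The Python sketch below uses that policy purely as an assumption. The article does not describe the actual ZFS allocation policy, and the names pick_device and write_blocks and the device dictionaries are invented for the example.

```python
def pick_device(devices):
    """Pick the healthy device with the most free space for the next write."""
    healthy = [d for d in devices if not d["degraded"] and d["free"] > 0]
    if not healthy:
        raise RuntimeError("no healthy device with free space")
    return max(healthy, key=lambda d: d["free"])


def write_blocks(devices, count):
    """Place count blocks one at a time and report how many landed on each device."""
    placement = {d["name"]: 0 for d in devices}
    for _ in range(count):
        dev = pick_device(devices)
        dev["free"] -= 1
        placement[dev["name"]] += 1
    return placement


pool = [
    {"name": "disk0", "free": 10, "degraded": False},
    {"name": "disk1", "free": 10, "degraded": True},     # avoided while degraded
    {"name": "disk2", "free": 50, "degraded": False},    # newly added, mostly empty
]
placement = write_blocks(pool, 40)
print(placement)
```

Under this policy the newly added device absorbs writes until its free space matches that of the older devices, and the degraded device receives none.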

ZFS also provides built-in data compression on a per-file-system basis. Besides reducing disk space usage, compression can cut the amount of I/O required by a factor of two to three. For this reason, enabling compression can actually speed up some workloads that are I/O-bound rather than CPU-bound.
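The link between compression and I/O volume can be shown by counting how many disk blocks a record occupies before and after compression. In the Python sketch below, zlib stands in for the file system's compressor, and the 512-byte block size and the highly repetitive sample record are assumptions for the illustration. Real data will usually compress far less than this sample.

```python
import zlib

BLOCK_SIZE = 512    # bytes per disk block (assumed for the illustration)


def blocks_needed(data):
    """Number of whole blocks required to store data (ceiling division)."""
    return -(-len(data) // BLOCK_SIZE)


# Highly repetitive sample data, which compresses unusually well.
record = b"2006-06-29 12:47:15 INFO scrub completed with 0 errors\n" * 2000
compressed = zlib.compress(record)

print("blocks uncompressed:", blocks_needed(record))
print("blocks compressed:  ", blocks_needed(compressed))
print("round trip ok:", zlib.decompress(compressed) == record)
```

Fewer blocks written and read means less time waiting on the disk, at the cost of CPU time spent compressing and decompressing, which is why the benefit depends on whether the workload is limited by I/O or by CPU.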

For more information about ZFS benchmarks, see:

* Bill Moore's blog on ZFS benchmarking
* Roch's blog post comparing ZFS with UFS performance on day one
* James Dickens's blog posts comparing UFS zones with ZFS zones

Part 2 covers the operational side of ZFS: creating storage pools and file systems, setting file system properties, and taking snapshots and clones of data. It also looks at some of the more interesting features under development for ZFS.


Resources

* ZFS: Sun's Latest File System (Part 2: Ease of Management and Feature Enhancements)
* About ZFS in the Solaris 10 OS 06/06 update
* Download Solaris Express
* OpenSolaris community: ZFS
* OpenSolaris ZFS entry page
* ZFS Learning Center
* Porting ZFS
* Jeff Bonwick's blog post on ZFS mirror resynchronization
* Another blog post by Jeff Bonwick: You say zeta, I say zetta
* ZFS: The Last Word in File Systems
* Jeff Bonwick's blog post on choosing a 128-bit storage architecture
* Bill Moore's blog on ZFS benchmarking
* Roch's blog post comparing ZFS with UFS performance on day one
* James Dickens's blog posts comparing UFS zones with ZFS zones
