XFS: The future of Linux file systems in big data environments? XFS developer Dave Chinner recently argued that more users should consider XFS. XFS is often regarded as a file system for users with huge amounts of data, and its space allocation now scales "several orders of magnitude" better than ext4's. Metadata validation means that metadata is self-describing, protecting the file system against misdirected writes from the storage layer. So why do we still need ext4?
"51CTO February 7 Foreign headlines" Linux has a good variety of pieces of system, but often the most attention is two of them: Ext4 and Btrfs. XFS developer Dave Chinner recently claimed that he believes more users should consider XFS. He talked about the work that was done to solve the most serious extensibility problem in XFS, as well as what he thought would be the future. If he's right, we can expect more from XFS over the next few years.
XFS is often regarded as a file system for users with huge amounts of data. Dave says XFS is indeed a great fit for that role, and it has always performed well on many workloads. The problem has historically been metadata writes: the lack of strong support for workloads that generate large numbers of metadata write operations has been the file system's weak point. In short, metadata writes were slow, scaled poorly, and could not make use of more than one processor.
How slow? Dave showed several slides comparing fs-mark results for XFS and ext4. Even with a single thread, XFS fared much worse, running at only half the speed of ext4; as the thread count rose toward eight, the gap only widened, although beyond eight threads ext4 hits a bottleneck of its own and slows down. For I/O-intensive workloads with frequent metadata changes (untarring a tarball, for example), Dave said ext4 could be 20 to 50 times faster than XFS. Numbers like that were enough to show that XFS had a serious problem.
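To make "metadata-intensive" concrete, here is a minimal sketch in the spirit of fs-mark, though it is not the actual benchmark and the thread and file counts are arbitrary: many threads doing nothing but creating and deleting small files, so that nearly all of the resulting I/O is metadata and journal traffic.

```c
/* Minimal sketch of a metadata-heavy workload, in the spirit of
 * fs-mark; not the actual benchmark, and the counts are arbitrary. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define THREADS          8
#define FILES_PER_THREAD 10000

static void *churn(void *arg)
{
    long id = (long)arg;
    char path[64];

    for (int i = 0; i < FILES_PER_THREAD; i++) {
        snprintf(path, sizeof(path), "dir%ld/f%d", id, i);
        /* Each create/delete pair is almost pure metadata traffic:
         * directory entries, inodes, and journal writes. */
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0)
            close(fd);
        unlink(path);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    char dir[64];

    for (long t = 0; t < THREADS; t++) {
        snprintf(dir, sizeof(dir), "dir%ld", t);
        mkdir(dir, 0755);       /* one working directory per thread */
        pthread_create(&tid[t], NULL, churn, (void *)t);
    }
    for (long t = 0; t < THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```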
Delayed logging
The problem, it turned out, was in the journal I/O. XFS generates a great deal of log traffic for metadata changes; in the worst cases, almost all of the actual I/O traffic is journal writes rather than the data the user is trying to store. Over the years, people attacked the problem in a variety of ways, making significant algorithmic changes along with many optimizations and adjustments. What was not needed was any on-disk format change, though one may be coming anyway for other reasons.
A metadata-intensive workload may modify the same directory block many times in a short period, and each of those changes generates a record that must be written to the journal. That is the root cause of the huge log traffic. The solution is conceptually simple: delay the log updates and merge changes to the same directory block into a single entry. Implementing that concept in a scalable way took years of struggle, but progress has been made; delayed logging will be the only XFS journaling mode supported as of the 3.3 kernel.
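Conceptually, the idea amounts to keeping one pending log item per modified block and letting later modifications replace the pending item instead of appending new journal records. The following is a toy illustration of that idea, not XFS's actual data structures:

```c
/* Illustrative sketch of the idea behind delayed logging: coalesce
 * repeated changes to the same metadata block into one pending log
 * item instead of journaling a record for every change. This is a
 * toy model, not XFS's actual implementation. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 16          /* toy table: one slot per metadata block */

struct log_item {
    int  dirty;             /* does this block have a pending change? */
    char payload[32];       /* latest state of the block's metadata */
};

static struct log_item pending[NBLOCKS];

/* Record a metadata change. If the block already has a pending item,
 * the new state simply replaces it: no extra journal traffic. */
static void log_change(int blockno, const char *new_state)
{
    struct log_item *li = &pending[blockno % NBLOCKS];

    li->dirty = 1;
    strncpy(li->payload, new_state, sizeof(li->payload) - 1);
}

/* Checkpoint: each dirty block is written to the journal once,
 * no matter how many times it changed since the last flush. */
static void flush_log(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (pending[i].dirty) {
            printf("journal write: block %d -> %s\n", i, pending[i].payload);
            pending[i].dirty = 0;
        }
    }
}

int main(void)
{
    /* Three changes to directory block 3, e.g. three files created
     * in the same directory in quick succession... */
    log_change(3, "entries: a");
    log_change(3, "entries: a,b");
    log_change(3, "entries: a,b,c");
    flush_log();            /* ...result in a single journal write */
    return 0;
}
```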
The delayed logging technique itself was largely borrowed from the ext3 file system. Since the algorithm was already known to work, proving that it also applied to XFS took far less time. Beyond the performance benefits, the change ultimately reduced the amount of code. Anyone who wants to learn more about how it works should find what they need in the kernel documentation tree, in Documentation/filesystems/xfs-delayed-logging-design.txt.
Delayed logging is a big change, but it is by no means the only one. The fast path for log space reservation, one of the hottest paths in XFS, is now lock-free, though the slow path still requires a global lock at this stage. The asynchronous metadata writeback code used to produce badly scattered I/O, which hurt performance significantly; metadata writeback is now deferred and sorted before being written out. In Dave's words, this means the file system is doing the I/O scheduler's work. The I/O scheduler, however, works with a request queue that is typically limited to 128 entries, while XFS's deferred metadata writeback queue can hold thousands of entries, so the sorting has to be done in the file system before the I/O is submitted. An "active log item" mechanism improves the performance of those (large) sorted lists of log items by accumulating changes and applying them in batches. The metadata cache has also been moved out of the page cache, which tended to reclaim pages at inopportune times. And so on.
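The point about the file system doing the I/O scheduler's work can be pictured as follows. This is a hedged sketch with invented types and helpers, far simpler than the real code: a large deferred-writeback queue is sorted by disk address before submission, so the device sees mostly sequential I/O.

```c
/* Sketch of sorting a (large) deferred metadata writeback queue by
 * disk address before submission. The struct and submit function are
 * invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

struct meta_buf {
    unsigned long long daddr;   /* disk address of the buffer */
    /* ... buffer contents would live here ... */
};

static int cmp_daddr(const void *a, const void *b)
{
    const struct meta_buf *x = a, *y = b;

    if (x->daddr < y->daddr) return -1;
    if (x->daddr > y->daddr) return 1;
    return 0;
}

static void submit_write(const struct meta_buf *mb)
{
    printf("write at %llu\n", mb->daddr);   /* stand-in for real I/O */
}

static void writeback_deferred(struct meta_buf *queue, size_t n)
{
    /* The elevator's request queue holds on the order of 128 entries;
     * a deferred metadata queue can hold thousands, so sort here,
     * before the I/O is handed to the block layer. */
    qsort(queue, n, sizeof(queue[0]), cmp_daddr);
    for (size_t i = 0; i < n; i++)
        submit_write(&queue[i]);
}

int main(void)
{
    struct meta_buf q[] = { {900}, {12}, {407}, {13}, {406} };

    writeback_deferred(q, sizeof(q) / sizeof(q[0]));
    return 0;
}
```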
How do the file systems compare?
So how scalable is XFS now? With one or two threads it is still slightly slower than ext4, but it scales linearly up to eight threads, while ext4 gets worse and Btrfs fares worse still. XFS's scalability limits now lie in locking in the core virtual file system layer, not in the file-system-specific code at all. Directory traversal is now faster even with a single thread, and much faster with eight. These, he said, are not the kind of results the Btrfs developers would want to show people.
Space allocation in XFS now scales "several orders of magnitude" better than in ext4. The "bigalloc" feature added to ext4 in the 3.2 kernel can improve ext4's space allocation scalability by two orders of magnitude if a large enough cluster size is used. Unfortunately, it also inflates the space consumed by small files, to the point that storing a kernel tree can require 160GB. (As a purely hypothetical illustration of the effect: with a 4MB cluster size, 40,000 small files would occupy 160GB, since each file consumes a whole cluster.) Bigalloc does not fit well with the rest of ext4, and it forces the administrator to answer complex configuration questions when creating the file system, thinking through how it will be used over its entire lifetime. Dave said ext4 has architectural deficiencies, especially its use of bitmaps to track free space, an approach typical of 1980s file systems; it simply cannot be scaled up into a truly enormous file system.
Space allocation in Btrfs is slower still, worse even than ext4. Dave said the problem lies mainly in walking the free space cache, which is currently very processor-intensive. This is not an architectural problem in Btrfs, so it can be expected to be fixed; it just needs optimization work.
The future of Linux file systems
So where do things go from here? At this point, metadata performance and scalability in XFS can be considered a solved problem. The bottlenecks now show up in the virtual file system (VFS) layer, so the next round of work needs to happen there. The big challenge for the future, though, is reliability, and that may require some significant changes in the XFS file system.
Reliability is not just a matter of not losing data; XFS is, one hopes, already good at that. Going forward it is really a scalability problem. Taking a petabyte-scale file system offline to run a check-and-repair tool is simply impractical; in the future that work will have to be done online. That requires building proven, reliable fault-detection mechanisms into the file system so that metadata can be validated at run time. Other file systems are also adding mechanisms to validate data, but that is seen as beyond XFS's scope; Dave said data validation is better done at the storage array level or in the application.
Metadata validation means that metadata is self-describing, protecting the file system against misdirected writes from the storage layer. Simply adding checksums is not enough: a checksum only proves that what is there was deliberately written, not that it was written to the right place. Properly self-describing metadata can detect blocks written to the wrong location and can help in reassembling a badly broken file system. It also guards against the "reiserfs problem", in which the repair tool is confused by stale metadata, or by metadata from file system images stored as files inside the file system being repaired.
Making metadata self-describing requires a lot of changes. Every metadata block will contain the file system's UUID along with its own block number and inode number, so the file system can verify that the metadata came from where it expects. There will be checksums to detect corrupted metadata blocks, and an owner identifier to associate the metadata with the inode or directory it belongs to. A reverse-mapping allocation tree will let the file system quickly determine which file any given block belongs to.
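As a rough illustration of what self-describing metadata involves (the field names, layout, and checksum here are hypothetical stand-ins, not XFS's actual on-disk format), a metadata block header and its verifier might look something like this:

```c
/* Hypothetical self-describing metadata block header. The layout and
 * names are illustrative only, not XFS's actual on-disk format. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct meta_header {
    uint8_t  fs_uuid[16];   /* UUID of the file system owning the block */
    uint64_t blockno;       /* where this block believes it lives on disk */
    uint64_t owner_ino;     /* inode or directory this metadata belongs to */
    uint32_t checksum;      /* checksum over the block contents */
};

/* Stand-in checksum (FNV-1a); real code would use something like CRC32c. */
static uint32_t toy_checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 2166136261u;

    while (len--)
        sum = (sum ^ *p++) * 16777619u;
    return sum;
}

/* A checksum alone only proves that what is stored was deliberately
 * written; checking the UUID and block number as well catches writes
 * that landed in the wrong place and stale metadata from old or
 * nested file system images. */
static bool meta_verify(const struct meta_header *hdr,
                        const uint8_t expect_uuid[16],
                        uint64_t expect_blockno,
                        const void *body, size_t body_len)
{
    if (memcmp(hdr->fs_uuid, expect_uuid, sizeof(hdr->fs_uuid)) != 0)
        return false;       /* belongs to some other file system */
    if (hdr->blockno != expect_blockno)
        return false;       /* misdirected write, or read from the wrong place */
    return toy_checksum(body, body_len) == hdr->checksum;
}

int main(void)
{
    uint8_t uuid[16] = {0};
    char body[] = "directory entries ...";
    struct meta_header hdr = { .blockno = 42, .owner_ino = 7 };

    memcpy(hdr.fs_uuid, uuid, sizeof(uuid));
    hdr.checksum = toy_checksum(body, sizeof(body));

    /* Verifies only when the UUID, location, and contents all match. */
    return meta_verify(&hdr, uuid, 42, body, sizeof(body)) ? 0 : 1;
}
```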
Needless to say, the current XFS on-disk format has nowhere to store all of this extra data, which means an on-disk format change is coming. According to Dave, there is no intention of maintaining any kind of forward or backward format compatibility; the change will be a genuine break. That is deliberate, to give complete freedom to design a new format that can serve XFS users for the long term. While the format is being changed to add the reliability features described above, the developers will also add space for d_type in the directory structure, NFSv4 version counters, inode creation times, and possibly more. The maximum directory size (currently only 32GB) will also be raised.
All of this brings many advantages: proactive detection of file system corruption, the ability to locate and replace misplaced blocks, and better online file system repair. Dave said it means that, for a long time to come, XFS will remain the best file system for big data applications in a Linux environment.
What does all this mean from a Btrfs perspective? Dave said Btrfs is clearly not optimized for metadata-intensive workloads, and it has some serious scalability problems that could become a stumbling block. For a file system in the early stages of development, that is entirely to be expected. Some of these problems will take time to overcome, and it may turn out that some of them are never solved. On the other hand, the reliability features in Btrfs are well developed, and the file system should be fully capable of providing the storage capabilities expected over the next few years.
Ext4 has architectural scalability problems, and according to Dave's results it is no longer the fastest file system. Its reliability improvements are a collection of point solutions, and its on-disk format is showing its age. Ext4 will have a hard time supporting the storage requirements of the near future.
With all that in mind, Dave closed by posing a question. Thanks to its rich feature set, Btrfs will soon replace ext4 as the default file system in many Linux distributions. Meanwhile, ext4 performs worse than XFS on most workloads, including in its traditionally stronger application areas, and some of its scalability problems show up even on smaller server systems. A half-finished project does not always turn out well; Dave said ext4 is not as stable or as well tested as people think. So he asked: "Why do we still need ext4?"
Some may think the ext4 developers will come up with a good answer to that question, but so far nobody has offered one.