2018 Storage Technology Hotspots and Trends Summary


Over the past six months, I have read more than 30 papers and kept up writing a newsletter every one to two weeks, most of them related to storage. Here is a summary, as a reference for understanding the hotspots and trends in storage technology. This article covers new technical areas such as Open-channel SSD and Machine Learning for Systems, recent developments in older topics such as NVM, LSM-tree, and crash consistency, as well as industry progress.


Open-channel SSD


Open-channel SSDs have received relatively little attention in China. Compared to a traditional SSD, an Open-channel SSD provides only the most streamlined hardware: just NAND chips and a controller, with no Flash Translation Layer (FTL). Features originally in the FTL, such as logical address mapping, wear leveling, and garbage collection, must be implemented by the upper layer, whether that is the operating system or an application. In other words, an Open-channel SSD is a bare SSD that lets users design and implement their own FTL to achieve the best results for their workload.



We can illustrate the value of Open-channel SSDs with a concrete scenario. As a single-machine storage engine, RocksDB is widely used in many distributed storage systems. RocksDB stores data as an LSM-tree plus a WAL, where the LSM-tree holds data and indexes, and the WAL guarantees the integrity of written data. In the current implementation of RocksDB, both the SSTables of the LSM-tree and the WAL are files on a file system, so writing the WAL also triggers the file system's own data protection mechanism, such as journaling. Writing the journal in turn triggers the data protection mechanism of the SSD's FTL layer. A RocksDB write request therefore passes through three IO subsystems: RocksDB, the file system, and the FTL. To ensure data integrity, each layer generates write amplification, so that a single write can be magnified dozens or even hundreds of times. This phenomenon is vividly described as "log-on-log".




In fact, both the RocksDB WAL and the file system journal are temporary writes that do not need additional data protection from the layers below. The advent of the Open-channel SSD provides an opportunity to break this pattern: if RocksDB can bypass the file system layer and the FTL, the three layers of logs can be merged into one, avoiding write amplification and extracting the full performance of the SSD.


Beyond avoiding write amplification, there is an opportunity in the LSM-tree data structure itself. Because an SSTable is read-only once written, and an SSD block is also effectively read-only (it must be erased before it can be rewritten), RocksDB can exploit this property by aligning SSTables with SSD blocks, merging the LSM-tree's SSTable deletion with the SSD's block reclamation. This avoids the data copying caused by SSD block reclamation and eliminates the performance impact of GC. The paper "An Efficient Design and Implementation of LSM-Tree Based Key-Value Store on Open-Channel SSD" implements exactly this, running LevelDB directly on an Open-channel SSD.


In addition to avoiding write amplification, Open-channel SSDs also make IO isolation possible. Due to the physical nature of SSDs, performance is closely tied to the physical layout of data: an SSD's performance is the sum of the performance of its NAND chips. Each NAND chip offers very low IO performance on its own, but because the chips operate in parallel, the overall performance of the SSD is very high. In other words, data layout determines IO performance. In a traditional SSD, however, the FTL not only remaps the data layout but also runs GC tasks in the background, which makes performance unpredictable and impossible to isolate. An Open-channel SSD exposes this underlying information to the application, which can place data on different NAND chips, achieving physical isolation and, with it, performance isolation.


To make Open-channel SSDs easier to manage and operate, LightNVM emerged. LightNVM is the subsystem for Open-channel SSDs in the Linux kernel. It provides a new set of interfaces for managing Open-channel SSDs and performing IO on them. To work with the kernel's existing IO subsystem, there is also a pblk (Physical Block Device) layer, which implements FTL functionality on top of LightNVM and exposes the traditional block-layer interface upward, so that existing file systems can run directly on an Open-channel SSD via pblk. A FAST '17 paper, "LightNVM: The Linux Open-Channel SSD Subsystem", introduces LightNVM in detail.



LightNVM has now been merged into the mainline kernel. User-space programs can operate Open-channel SSDs through liblightnvm.


In January 2018, the Open-Channel SSD 2.0 specification was released. However, both Open-channel SSDs and LightNVM are still at a very early stage; Open-channel SSDs are hard to find on the market today and are not yet suitable for production use. Nonetheless, the benefits of Open-channel SSDs and host-based FTLs are enormous. For scenarios that pursue extreme storage performance, Open-channel SSD + LightNVM is likely to see real deployments in the future.


Non-volatile Memory (NVM)


NVM, PM (persistent memory), and SCM (storage class memory) all refer to the same thing: non-volatile memory. NVM has been studied in academia for many years, and the related research keeps moving forward.


Because of locality of access (the 80/20 rule), computer storage has always been organized as a hierarchy: from top to bottom, CPU cache, DRAM, SSD, HDD. The CPU cache and DRAM are volatile, while SSDs and HDDs are non-volatile. Although SSDs are much faster than HDDs, there is still a large gap compared to DRAM: SSDs offer response times on the order of 10 µs, while DRAM responds in nanoseconds, a gap of roughly 10,000x. Because of this huge performance gap between DRAM and SSDs, applications must design IO-related operations very carefully to keep IO from becoming the system's performance bottleneck.

The appearance of NVM fills this gap. NVM brings response times down to around 10 ns while remaining non-volatile, and its price per unit of capacity is lower than DRAM's. In addition, NVM is byte-addressable, rather than block-addressable like a disk. The advent of NVM breaks the traditional storage hierarchy and will have a profound impact on software architecture design.



NVM looks attractive, but for now it cannot simply be plugged in and used like memory or a disk. In a traditional operating system, the Virtual Memory Manager (VMM) manages volatile memory, while the file system manages persistent storage. NVM, however, is byte-addressable like memory and non-volatile like a disk. There are two main ways to use NVM:


Use NVM as persistent transactional memory, with management approaches including redo logging, undo logging, and log-structured designs.

Use NVM as a disk, providing block and file interfaces. For example, Direct Access (DAX), introduced in Linux, extends existing file systems so that they can run on NVM, as in ext4-DAX. There are also file systems designed specifically for NVM, such as PMFS and NOVA.
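As a minimal sketch of the second mode, the snippet below maps a file into memory and updates it with ordinary byte-granularity stores. It assumes the file lives on a DAX-mounted file system (the path and record layout here are hypothetical); on a DAX mount the stores bypass the page cache and reach NVM directly, while on a regular file system the same code still runs but goes through the page cache.

```c
/* Sketch: byte-addressable access to NVM through a DAX-mapped file.
 * Assumption: the file is on a DAX-capable file system (e.g. ext4-dax);
 * the path used by callers is illustrative. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int write_record(const char *path, const char *msg, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) < 0) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(p, msg, len);     /* byte-granularity store, no block IO issued */
    msync(p, 4096, MS_SYNC); /* on DAX this flushes CPU caches to media */
    munmap(p, 4096);
    close(fd);
    return 0;
}
```

The key point is that the application addresses persistent data with plain loads and stores; the block layer is no longer on the data path.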


Programming for NVM is very different from programming for traditional memory or disks. Here is a very simple example: a function that inserts a node at the tail of a doubly-linked list:


    void list_add_tail(struct cds_list_head *newp, struct cds_list_head *head)
    {
        head->prev->next = newp;
        newp->next = head;
        newp->prev = head->prev;
        head->prev = newp;
    }


On NVM, however, because memory is non-volatile, suppose a power failure occurs right after the first line of the function executes: when the system recovers, the list is left in an inconsistent, unrecoverable state. Moreover, because of the CPU cache sitting between the CPU and NVM, and because the CPU reorders stores for performance (out-of-order execution), NVM requires a special programming model: the NVM programming model. By explicitly declaring transactions, it provides atomic-operation semantics, ensuring that no intermediate state is visible when the system recovers.
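To make the list insert above crash-consistent, NVM programming models typically wrap the stores in a transaction built from an undo log plus explicit cache-line flushes and store fences. The sketch below defines cds_list_head locally and stubs the flush/fence primitives (on real hardware these would be CLFLUSH/CLWB and SFENCE, e.g. the _mm_clflush/_mm_sfence intrinsics); the log layout and names are illustrative, not any particular library's API.

```c
struct cds_list_head { struct cds_list_head *next, *prev; };

/* Stubs so the sketch runs anywhere; real code issues cache-line
 * flushes and store fences here. */
static void pmem_flush(const void *addr) { (void)addr; /* _mm_clflush */ }
static void pmem_fence(void)             { /* _mm_sfence */ }

/* Undo log: (address, old value) pairs recorded before each store. */
struct undo_rec { struct cds_list_head **addr; struct cds_list_head *old; };
struct undo_log { struct undo_rec rec[8]; int n; int committed; };

static void tx_store(struct undo_log *log, struct cds_list_head **addr,
                     struct cds_list_head *val)
{
    log->rec[log->n].addr = addr;      /* save the old value first */
    log->rec[log->n].old  = *addr;
    pmem_flush(&log->rec[log->n]);
    pmem_fence();                      /* log entry durable before the store */
    log->n++;
    *addr = val;
    pmem_flush(addr);
}

void list_add_tail_tx(struct undo_log *log, struct cds_list_head *newp,
                      struct cds_list_head *head)
{
    log->n = 0;
    log->committed = 0;
    tx_store(log, &head->prev->next, newp);
    tx_store(log, &newp->next, head);
    tx_store(log, &newp->prev, head->prev);
    tx_store(log, &head->prev, newp);
    pmem_fence();
    log->committed = 1;                /* commit point: one atomic store */
    pmem_flush(&log->committed);
    pmem_fence();
}

/* On recovery, an uncommitted transaction is rolled back in reverse. */
void tx_recover(struct undo_log *log)
{
    if (log->committed)
        return;
    for (int i = log->n - 1; i >= 0; i--)
        *log->rec[i].addr = log->rec[i].old;
}
```

If power fails before the commit flag reaches NVM, recovery undoes the partial stores; if it fails after, the insert is complete. Either way, no intermediate state survives.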


In distributed scenarios, to fully exploit the performance of NVM, it must be combined with RDMA. Given NVM's high performance and byte-addressable access, together with RDMA's access model, distributed NVM + RDMA calls for a new architecture design, spanning single-machine data structures, distributed data structures, distributed consensus algorithms, and so on. Here, Octopus, published last year by the computer science department of Tsinghua University, offers one line of thinking: a distributed file system implemented with NVM + RDMA, along with a set of RDMA-based RPCs for inter-node communication.


Awkwardly, though, while academia has studied NVM for decades, there is still no commercially available NVM product, and research can only be done on simulators. Intel and Micron collaborated in 2012 to develop 3D XPoint technology, considered the closest thing to a commercially available NVM. Intel released Optane, a disk product based on 3D XPoint, in 2017, but the memory-form NVM product (codenamed Apache Pass) has no definitive release date.


Even when NVM products do arrive, given NVM's price and capacity, as well as its complex programming model, there will be few pure-NVM deployments in practice; tiering, i.e. combining NVM + SSD + HDD, is more likely. Here, the Strata paper from SOSP also provides a good approach.


Machine Learning for Systems


Last year, Jeff Dean's Google Brain team published a very important paper, "The Case for Learned Index Structures". It is fair to say that this paper launched a new direction in the systems field: combining machine learning with systems. One has to marvel at Jeff Dean's influence on computer science.

This paper, together with Jeff Dean's talk at the NIPS 17 ML Systems Workshop, sends a strong signal: computer systems contain a large number of heuristic algorithms for making all kinds of decisions, such as how large a TCP window should be, whether data should be cached, and which task should be scheduled. Each algorithm involves trade-offs in performance, resource consumption, error rate, and other dimensions, and requires a great deal of human effort to select and tune. These are exactly the places where machine learning can help.


In "The Case for Learned Index Structures", the authors mention a typical scenario: database indexes. Traditional indexes usually use B-trees or B-tree variants. But these data structures are designed for the general case and the worst-case data distribution, and do not take the data distribution of a real application into account. For many special data distributions, a B-tree cannot achieve optimal time and space complexity. Achieving the best results requires a great deal of manual effort to optimize the data structure, and since the data distribution keeps changing, the tuning work never ends. The learned index proposed by the authors applies machine learning techniques to avoid the cost of this manual tuning.


In the paper, the authors treat the index data structure as a model: the model's input is a key, and its output is the position on disk of the value corresponding to that key. A B-tree, or any other data structure, is just one way to implement this model; there are other possible implementations, such as a neural network.
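The idea can be sketched in a few lines: replace the tree with a model that predicts a key's position in a sorted array, plus a bounded search that corrects the prediction using the worst-case error observed at training time. The simple linear fit below stands in for the paper's neural network / model hierarchy; all names are illustrative.

```c
/* Sketch of "the index is a model": a linear model predicts a key's
 * position in a sorted array; a bounded search fixes up the prediction. */
#include <stddef.h>

struct learned_index {
    const long *keys;     /* sorted key array */
    size_t n;
    double slope, intercept;
    long max_err;         /* worst prediction error at "training" time */
};

static long predict(const struct learned_index *ix, long key)
{
    long pos = (long)(ix->slope * (double)key + ix->intercept);
    if (pos < 0) pos = 0;
    if (pos >= (long)ix->n) pos = (long)ix->n - 1;
    return pos;
}

void learned_index_build(struct learned_index *ix, const long *keys, size_t n)
{
    ix->keys = keys;
    ix->n = n;
    /* trivial "training": fit a line through the first and last key */
    ix->slope = (double)(n - 1) / (double)(keys[n - 1] - keys[0]);
    ix->intercept = -ix->slope * (double)keys[0];
    ix->max_err = 0;
    for (size_t i = 0; i < n; i++) {   /* record the worst-case error */
        long err = predict(ix, keys[i]) - (long)i;
        if (err < 0) err = -err;
        if (err > ix->max_err) ix->max_err = err;
    }
}

/* Returns the position of key, or -1 if absent. */
long learned_index_lookup(const struct learned_index *ix, long key)
{
    long p = predict(ix, key);
    long lo = p - ix->max_err, hi = p + ix->max_err;
    if (lo < 0) lo = 0;
    if (hi >= (long)ix->n) hi = (long)ix->n - 1;
    for (long i = lo; i <= hi; i++)    /* bounded correction search */
        if (ix->keys[i] == key) return i;
    return -1;
}
```

The model itself stores only a slope, an intercept, and an error bound, which is where the memory savings over a tree of nodes come from.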



Compared with a B-tree, a neural network has significant advantages:


Because the keys do not need to be kept in memory, the memory footprint is small, avoiding disk accesses, especially when the index is large.

Lookups are faster, because the conditional branches introduced by tree traversal are avoided.


Through offline model training, a certain amount of computing resources is sacrificed in exchange for savings in memory and improved performance.


Of course, this approach has limitations. The most important is that a learned index can only index data with a fixed distribution: when data is inserted, the distribution changes and the original model becomes invalid. The solution is to index new data with a traditional data structure, leaving the learned index responsible only for the original data. When enough new data has accumulated, it is merged with the original data, and a new model is trained on the new distribution. This is quite feasible: the full data set is far larger than the increment, and being able to optimize the index over the full data is already of huge practical value.
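The hybrid scheme just described can be sketched as follows: inserts always go into a small delta structure (a plain array standing in here for a B-tree), lookups consult the delta before the frozen base, and a merge step folds the delta into the base, which is where a real system would retrain the learned model. All names and sizes are illustrative.

```c
/* Sketch of the delta-plus-merge scheme for handling inserts with a
 * learned index. The "learned" base is frozen between merges; the
 * delta absorbs new keys via a traditional structure. */
#include <stdlib.h>
#include <string.h>

#define DELTA_CAP 64

struct hybrid_index {
    long base[256];           /* "learned" part: frozen, sorted */
    size_t nbase;
    long delta[DELTA_CAP];    /* traditional part: recent inserts */
    size_t ndelta;
};

static int contains(const long *a, size_t n, long key)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] == key) return 1;
    return 0;
}

void hybrid_insert(struct hybrid_index *ix, long key)
{
    ix->delta[ix->ndelta++] = key;   /* always into the delta structure */
}

int hybrid_lookup(const struct hybrid_index *ix, long key)
{
    return contains(ix->delta, ix->ndelta, key) ||
           contains(ix->base, ix->nbase, key);
}

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Merge: fold the delta into the base and re-sort; a real system would
 * retrain the learned model over the merged data at this point. */
void hybrid_merge(struct hybrid_index *ix)
{
    memcpy(ix->base + ix->nbase, ix->delta, ix->ndelta * sizeof(long));
    ix->nbase += ix->ndelta;
    ix->ndelta = 0;
    qsort(ix->base, ix->nbase, sizeof(long), cmp_long);
}
```

Keeping the delta small bounds both the extra lookup cost and the frequency of retraining.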

Despite these limitations, the learned index has many applicable scenarios; for example, Google has applied it to Bigtable.
