As Ceph is adopted for a growing range of storage workloads, its performance and tuning have become a major concern for users. One of the key factors affecting performance is the implementation of the OSD storage engine. RADOS, Ceph's base component, is a strongly consistent object storage system, and its OSDs support the following storage engines:
The ObjectStore layer encapsulates all I/O operations of the underlying storage engine and exposes an interface with object and transaction semantics to the upper layers. MemStore is a memory-based implementation. KeyValueStore implements the interface mainly on top of a KV database (such as LevelDB or RocksDB), and its transactions rely on the KV database's own transaction support. FileStore is Ceph's current default storage engine and also the most widely used one. Its transactions are implemented with a journal (a journal file or a block device). Besides providing transactional properties such as consistency and atomicity, the journal can also merge multiple small writes into sequential journal writes to improve performance.
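The journal-based write path described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not Ceph's actual code: the class name `ToyJournaledStore`, its methods, and the in-memory journal are all hypothetical stand-ins for the journal device and the local object files.

```python
from collections import deque


class ToyJournaledStore:
    """Toy model of a journal-backed object store (not Ceph's actual code)."""

    def __init__(self):
        self.journal = deque()  # sequential, append-only log of pending transactions
        self.objects = {}       # object name -> bytearray (stands in for local files)

    def submit(self, transaction):
        """Synchronous step: append the whole transaction to the journal.

        `transaction` is a list of (object_name, offset, data) writes.
        In this model the transaction counts as committed once this returns.
        """
        self.journal.append(list(transaction))

    def apply_pending(self):
        """Asynchronous step: replay journaled transactions into the objects."""
        while self.journal:
            for name, offset, data in self.journal.popleft():
                buf = self.objects.setdefault(name, bytearray())
                end = offset + len(data)
                if len(buf) < end:
                    buf.extend(b"\0" * (end - len(buf)))  # grow the object if needed
                buf[offset:end] = data


store = ToyJournaledStore()
store.submit([("obj_a", 0, b"hello"), ("obj_b", 0, b"world")])  # one multi-object transaction
store.submit([("obj_a", 5, b"!")])
store.apply_pending()
```

Every byte passes through `journal` first and through `objects` second, so the sketch also shows why a journaled write costs two writes.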
In community use, FileStore has also exposed a number of problems: (1) The journal mechanism turns each write request on the OSD into two write operations (a synchronous write to the journal and an asynchronous write to the object). (2) To mitigate the previous problem, the community typically places the journal on a dedicated device such as an SSD to decouple journal writes from object writes, but the continuous cyclic journal writes shorten the SSD's lifetime. (3) Each object corresponds to one physical file in the OSD's local file system. In workloads with a large number of small objects, the OSD cannot cache the inode metadata of all local files, so a read or write may require several local I/Os and performance suffers. (4) The name of the local file that backs an object encodes the object name, RADOS namespace, object name hash, snapshot, and other information, so it may exceed the local file system's limit on file name length.
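Problem (4) can be made concrete with a small check. The file name layout below is hypothetical and is not FileStore's real encoding. It only shows how concatenating the object name, namespace, hash, and snapshot can exceed the 255-byte per-component limit that is typical of Linux file systems such as ext4 and XFS.

```python
NAME_MAX = 255  # typical per-component file name limit, in bytes, on ext4 and XFS


def toy_object_filename(name, namespace, name_hash, snapshot):
    """Join the pieces into one file name (hypothetical layout, not FileStore's encoding)."""
    return f"{name}__{snapshot}_{name_hash:08X}__{namespace}"


def fits_local_fs(filename, limit=NAME_MAX):
    """Return True if the UTF-8 encoded file name fits within the limit."""
    return len(filename.encode("utf-8")) <= limit


short = toy_object_filename("photo.jpg", "tenant1", 0x1A2B3C4D, "head")
long_ = toy_object_filename("x" * 300, "tenant1", 0x1A2B3C4D, "head")

print(short, fits_local_fs(short))
print(len(long_), fits_local_fs(long_))
```

An object name that is legal for the object store can therefore produce a file name the local file system rejects, so the engine needs some workaround for long names.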
To address these problems, a new storage engine, NewStore (also known as KeyFileStore), was introduced. Its key data structures are shown in the figure:
Its main features are: (1) It breaks the one-to-one correspondence between objects and local physical files. An index structure (the onode) records the mapping between an object and its local physical files, and the index data is stored in a KV database. (2) Create, append, and fragment-aligned overwrite operations on an object need no journal and still preserve transactional guarantees. (3) An unaligned update is first written synchronously to a write-ahead log (WAL, stored in the KV database) and then written asynchronously to the corresponding fragment file. (4) An onode cache is built on top of the KV database to speed up reads. (5) A single object can span multiple fragment files, and multiple objects can share one fragment file.
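The write-path decision in features (1) to (3) can be sketched as follows. This is a toy model under stated assumptions, not NewStore's implementation: `ToyNewStore`, `FRAGMENT_SIZE`, the key layout in `kv`, and the one-fragment-per-object mapping are all hypothetical simplifications. The model also ignores ordering between pending WAL records and later direct writes.

```python
FRAGMENT_SIZE = 4096  # hypothetical alignment unit chosen for this sketch


class ToyNewStore:
    """Toy model of the NewStore write path (illustrative only, not Ceph's code)."""

    def __init__(self):
        self.kv = {}         # stands in for the KV database: onodes and WAL records
        self.fragments = {}  # fragment name -> bytearray, stands in for fragment files
        self.wal_seq = 0

    def _onode(self, obj):
        # The onode maps an object to its fragment file and tracks the object size.
        return self.kv.setdefault(("onode", obj), {"fragment": "frag_" + obj, "size": 0})

    def _write_fragment(self, onode, offset, data):
        buf = self.fragments.setdefault(onode["fragment"], bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data
        onode["size"] = max(onode["size"], end)

    def write(self, obj, offset, data):
        onode = self._onode(obj)
        is_append = offset >= onode["size"]  # covers create (size 0) and append
        is_aligned = offset % FRAGMENT_SIZE == 0 and len(data) % FRAGMENT_SIZE == 0
        if is_append or is_aligned:
            self._write_fragment(onode, offset, data)  # no journal or WAL needed
            return "direct"
        # Unaligned overwrite: record it in the WAL first, apply it later.
        self.kv[("wal", self.wal_seq)] = (obj, offset, data)
        self.wal_seq += 1
        return "wal"

    def apply_wal(self):
        """Asynchronous step: apply WAL records in order and delete each one."""
        for key in sorted(k for k in self.kv if k[0] == "wal"):
            obj, offset, data = self.kv.pop(key)
            self._write_fragment(self._onode(obj), offset, data)


store = ToyNewStore()
r1 = store.write("a", 0, b"x" * 4096)   # create
r2 = store.write("a", 4096, b"y" * 100)  # append
r3 = store.write("a", 0, b"w" * 4096)   # fragment-aligned overwrite
r4 = store.write("a", 10, b"zz")        # unaligned overwrite
store.apply_wal()
```

Only the last write takes the WAL path. The first three go straight to the fragment, which is where the saving over a journal-for-everything design comes from.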
NewStore's design largely resolves the FileStore problems described above. NewStore also uses the following strategies to reduce the performance overhead of the WAL: (1) Once an update has been written to the fragment file, the corresponding WAL entry is immediately deleted from the KV database, since it has served its purpose and no longer needs to be kept. (2) The KV database's write buffer is enlarged so that WAL entries stay in the buffer as long as possible, avoiding unnecessary dumps to disk. (3) Multiple write buffers are forced to merge before their data is dumped to disk, which also avoids unnecessary dumps. In initial random read/write tests, NewStore showed a 60% performance improvement over FileStore.
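The effect of strategies (1) and (2) can be illustrated with a toy write buffer. This is a deliberately simplified sketch: `ToyWriteBuffer` and its counters are hypothetical and do not model any real KV database. In particular, a real log-structured store records a delete as a tombstone and cancels it against the earlier put during a merge, whereas this model drops the buffered entry directly.

```python
class ToyWriteBuffer:
    """Toy in-memory write buffer in front of an on-disk KV store (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of buffered keys that triggers a dump
        self.buffer = {}          # key -> value
        self.disk_writes = 0      # entries dumped to disk so far

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            self.flush()

    def delete(self, key):
        # An entry deleted while still buffered never has to reach the disk.
        self.buffer.pop(key, None)

    def flush(self):
        self.disk_writes += len(self.buffer)
        self.buffer.clear()


def run_wal_workload(buf, updates):
    """Write each WAL record, then delete it as soon as the update is applied."""
    for i in range(updates):
        buf.put(f"wal_{i}", b"payload")
        buf.delete(f"wal_{i}")
    return buf.disk_writes


small = run_wal_workload(ToyWriteBuffer(capacity=1), updates=5)
large = run_wal_workload(ToyWriteBuffer(capacity=8), updates=5)
print(small, large)
```

With the tiny buffer every WAL record is dumped before it can be deleted. With the larger buffer the records are removed while still in memory, so nothing is dumped.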
The previous post on this blog, "Massive small file storage and Ceph practice", described the massive small file storage problem from two angles, metadata management and the local storage engine, and improved the FileStore storage structure through the object class interface layer, which optimizes small-file storage performance to a large extent. NewStore, in contrast, works directly at the storage engine layer: it decouples objects from local physical files and lets multiple objects share one fragment file. The underlying idea of both small-file optimizations is the same, namely merged storage plus an index. NewStore is still under intensive development, however, and it will take some time before it is deployed in production. We believe that as the NewStore engine matures and is gradually rolled out, massive small file storage will no longer be a problem.
In addition, the code for the optimization scheme based on the object class interface layer has been published on GitHub. Testing and feedback are welcome.
References:
Images and related content are taken from the notes of the "NewStore" talk at Ceph Day Beijing, June 2015
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23414/focus=23417
http://docs.ceph.com/docs/master/rados/configuration/journal-ref/
http://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
http://www.cnblogs.com/wuhuiyuan/p/ceph-small-file-compound-storage.html
http://www.wzxue.com/ceph-filestore/
http://www.wzxue.com/ceph-keyvaluestore/
https://github.com/yxgup/ceph/tree/omap_indexed_compound
------------------------------------
http://www.cnblogs.com/wuhuiyuan/p/4907984.html
Original work by the author. Please credit the source when reposting.
Introduction to the Ceph NewStore Storage Engine