Thomas is the pseudonym I use on the Ceph China Community translation team. This article was first published in the Ceph China community and is now reproduced on my blog for everyone to share.
Ceph ObjectStore API Introduction
This article was translated by Thomas of the Ceph China Community and proofread by Chen.
English source: The Ceph ObjectStore API. You are welcome to join the translation team.
Brief Introduction
The object store is the part of the Ceph OSD that performs the actual data storage. Three object store implementations are currently available:
- FileStore: file system plus journal as the backing store
- KeyValueStore: based on a KV database (for example RocksDB or LevelDB)
- MemStore: memory as storage (all data is kept in memory in std::map or bufferlist)
In addition, NewStore (or BlueStore) is under development.
Related documents
- Ceph performance: interesting things going on
- Object Store Architecture Overview
Code
The object store source code is located in the os subfolder of the Ceph source folder. For convenience, see the Ceph GitHub repository.
The following description is based on Ceph commit-ish 6f8b54c from 2015-01-13.
ObjectStore API
The abstract class ObjectStore is the main API through which the OSD accesses storage.
It is a file-system-like API, but operations that change state are grouped into transactions, and it stores objects rather than files.
An object consists of:
- Byte data: similar to file contents in a file system
- Extended attributes: similar to the extended attributes of files in a file system; a collection of key-value pairs
- OMAP: similar in concept to extended attributes, but with a different address space size and access pattern
An object is identified by two IDs:
- Collection ID: coll_t cid (a collection is a set of objects)
- Object ID: ghobject_t oid
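The object model described above can be pictured with a small in-memory sketch. The names ToyObject, ToyKey and ToyStore are hypothetical, and plain strings stand in for coll_t and ghobject_t; this is an illustration under those assumptions, not Ceph code.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for one stored object; not a Ceph type.
struct ToyObject {
  std::string data;                           // byte data, like file contents
  std::map<std::string, std::string> xattrs;  // extended attributes
  std::map<std::string, std::string> omap;    // OMAP key-value pairs
};

// Strings stand in for coll_t (collection ID) and ghobject_t (object ID).
typedef std::pair<std::string, std::string> ToyKey;  // (cid, oid)

struct ToyStore {
  std::map<ToyKey, ToyObject> objects;

  // True if an object with this (cid, oid) pair has been created.
  bool exists(const std::string& cid, const std::string& oid) const {
    return objects.count(ToyKey(cid, oid)) != 0;
  }

  // Returns the object, creating an empty one on first access.
  ToyObject& get(const std::string& cid, const std::string& oid) {
    return objects[ToyKey(cid, oid)];
  }
};
```

The same oid in two different collections names two different objects, because the lookup key is the (cid, oid) pair.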
Operations
The following is not a complete list; it is only meant to give an impression. Note: some operations are only available through transactions.
Transaction operations (take a transaction as a parameter):
apply_transaction
queue_transactions
General File System operations:
mount
umount
mkfs
mkjournal
statfs
need_journal
sync
flush
snapshot
Object operations (take coll_t cid and ghobject_t oid as parameters):
exists
stat
read
fiemap
getattr
getattrs
Collection operations (take coll_t cid as a parameter):
collection_getattr
collection_empty
collection_list
list_collections
collection_exists
collection_getattrs
OMAP operations:
omap_get
omap_get_header
omap_get_keys
omap_get_values
omap_check_keys
Extended Reading
Transaction
Definition: a transaction is a sequence of primitive mutation operations. It is defined by the class ObjectStore::Transaction.
Supported operations (excerpt from ObjectStore.h):
class Transaction {
public:
  enum {
    OP_NOP = 0,
    OP_TOUCH = 9,               // cid, oid
    OP_WRITE = 10,              // cid, oid, offset, len, bl
    OP_ZERO = 11,               // cid, oid, offset, len
    OP_TRUNCATE = 12,           // cid, oid, len
    OP_REMOVE = 13,             // cid, oid
    OP_SETATTR = 14,            // cid, oid, attrname, bl
    OP_SETATTRS = 15,           // cid, oid, attrset
    OP_RMATTR = 16,             // cid, oid, attrname
    OP_CLONE = 17,              // cid, oid, newoid
    OP_CLONERANGE = 18,         // cid, oid, newoid, offset, len
    OP_CLONERANGE2 = 30,        // cid, oid, newoid, srcoff, len, dstoff
    OP_TRIMCACHE = 19,          // cid, oid, offset, len **DEPRECATED**
    OP_MKCOLL = 20,             // cid
    OP_RMCOLL = 21,             // cid
    OP_COLL_ADD = 22,           // cid, oldcid, oid
    OP_COLL_REMOVE = 23,        // cid, oid
    OP_COLL_SETATTR = 24,       // cid, attrname, bl
    OP_COLL_RMATTR = 25,        // cid, attrname
    OP_COLL_SETATTRS = 26,      // cid, attrset
    OP_COLL_MOVE = 8,           // newcid, oldcid, oid
    OP_STARTSYNC = 27,          // start a sync
    OP_RMATTRS = 28,            // cid, oid
    OP_COLL_RENAME = 29,        // cid, newcid
    OP_OMAP_CLEAR = 31,         // cid
    OP_OMAP_SETKEYS = 32,       // cid, attrset
    OP_OMAP_RMKEYS = 33,        // cid, keyset
    OP_OMAP_SETHEADER = 34,     // cid, header
    OP_SPLIT_COLLECTION = 35,   // cid, bits, destination
    OP_SPLIT_COLLECTION2 = 36,  /* cid, bits, destination
                                   doesn't create the destination */
    OP_OMAP_RMKEYRANGE = 37,    // cid, oid, firstkey, lastkey
    OP_COLL_MOVE_RENAME = 38,   // oldcid, oldoid, newcid, newoid
    OP_SETALLOCHINT = 39,       // cid, oid, object_size, write_size
    OP_COLL_HINT = 40,          // cid, type, bl
  };
  // ...
};
Each operation has a corresponding member function in the Transaction class (for example, OP_ZERO: zero(cid, oid, off, len)).
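The pairing of opcodes and member functions can be sketched as follows. ToyTransaction, ToyRecord and the TOY_* opcodes are hypothetical names with arbitrary numbering, and strings replace coll_t, ghobject_t and bufferlist; the sketch only shows the idea that each call appends one record to the transaction.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy opcodes for this sketch; the numbering is arbitrary.
enum ToyOp { TOY_TOUCH, TOY_WRITE, TOY_ZERO };

// One recorded operation together with its parameters.
struct ToyRecord {
  ToyOp op;
  std::string cid, oid;
  uint64_t off, len;
  std::string data;  // payload, used by TOY_WRITE only
};

class ToyTransaction {
 public:
  void touch(const std::string& cid, const std::string& oid) {
    add(TOY_TOUCH, cid, oid, 0, 0, "");
  }
  void write(const std::string& cid, const std::string& oid,
             uint64_t off, const std::string& data) {
    add(TOY_WRITE, cid, oid, off, data.size(), data);
  }
  void zero(const std::string& cid, const std::string& oid,
            uint64_t off, uint64_t len) {
    add(TOY_ZERO, cid, oid, off, len, "");
  }
  // The recorded operations, in the order they were added.
  const std::vector<ToyRecord>& ops() const { return ops_; }

 private:
  // Nothing is executed here; the operation is only appended to the list.
  void add(ToyOp op, const std::string& cid, const std::string& oid,
           uint64_t off, uint64_t len, const std::string& data) {
    ToyRecord r;
    r.op = op;
    r.cid = cid;
    r.oid = oid;
    r.off = off;
    r.len = len;
    r.data = data;
    ops_.push_back(r);
  }
  std::vector<ToyRecord> ops_;
};
```

Because the calls only record what should happen, the store can later apply, journal or replay the whole sequence as one unit.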
A transaction can have callbacks attached, for example the following three:
on_applied
on_commit
on_applied_sync
ObjectStore::Transaction objects are primarily used by the OSD to submit sequences of operations. For example, OSD::mkfs runs the following operations to initialize the meta collection:
ObjectStore::Transaction t;
t.create_collection(META_COLL);
t.write(META_COLL, OSD_SUPERBLOCK_POBJECT, 0, bl.length(), bl);
ret = store->apply_transaction(t);
The ObjectStore::Transaction class can also deserialize a sequence of operations from a buffer (note: the Transaction object is rebuilt from a byte stream). Journal replay works this way (note: the byte stream is read from the journal to reconstruct the Transaction object, which is then applied).
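The round trip from operations to a byte stream and back can be sketched with a toy format. Everything here is an assumption made for illustration: ToyWrite, encode, decode and replay_writes are hypothetical names, only a write operation is modelled, and the byte layout has nothing to do with Ceph's real encoding.

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

struct ToyWrite {  // one "write" operation; the only op in this sketch
  std::string oid;
  uint64_t off;
  std::string data;
};

// Append a 64-bit value in host byte order (a real format would fix the order).
static void put_u64(std::string& out, uint64_t v) {
  out.append(reinterpret_cast<const char*>(&v), sizeof(v));
}
// Append a length-prefixed string.
static void put_str(std::string& out, const std::string& s) {
  put_u64(out, s.size());
  out.append(s);
}
static bool get_u64(const std::string& in, size_t& pos, uint64_t& v) {
  if (in.size() - pos < sizeof(v)) return false;
  std::memcpy(&v, in.data() + pos, sizeof(v));
  pos += sizeof(v);
  return true;
}
static bool get_str(const std::string& in, size_t& pos, std::string& s) {
  uint64_t n;
  if (!get_u64(in, pos, n) || in.size() - pos < n) return false;
  s.assign(in, pos, n);
  pos += n;
  return true;
}

// Serialize the operations: a count, then oid, offset and data for each op.
std::string encode(const std::vector<ToyWrite>& ops) {
  std::string out;
  put_u64(out, ops.size());
  for (size_t i = 0; i < ops.size(); ++i) {
    put_str(out, ops[i].oid);
    put_u64(out, ops[i].off);
    put_str(out, ops[i].data);
  }
  return out;
}

// Rebuild the operations from bytes; returns false on truncated or extra input.
bool decode(const std::string& in, std::vector<ToyWrite>& ops) {
  size_t pos = 0;
  uint64_t count;
  if (!get_u64(in, pos, count)) return false;
  for (uint64_t i = 0; i < count; ++i) {
    ToyWrite w;
    if (!get_str(in, pos, w.oid) || !get_u64(in, pos, w.off) ||
        !get_str(in, pos, w.data))
      return false;
    ops.push_back(w);
  }
  return pos == in.size();
}

// Apply decoded writes to a toy store (oid -> contents), as a replay would.
void replay_writes(const std::vector<ToyWrite>& ops,
                   std::map<std::string, std::string>& store) {
  for (size_t i = 0; i < ops.size(); ++i) {
    std::string& obj = store[ops[i].oid];
    size_t end = ops[i].off + ops[i].data.size();
    if (obj.size() < end) obj.resize(end, '\0');
    obj.replace(ops[i].off, ops[i].data.size(), ops[i].data);
  }
}
```

A replay in this toy model is decode followed by replay_writes: the bytes written before a crash are enough to rebuild the operations and apply them again.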
Journal
The journal is important for recovery.
The base class ObjectStore does not implement journaling. The subclass JournalingObjectStore adds it.
JournalingObjectStore adds, for example, the following methods:
- journal_start
- journal_stop
- journal_write_close
- journal_replay
In addition:
- _op_journal_transactions: adds transactions to the journal
- do_transactions: pure virtual function that applies transactions
For example, journal_replay is called during mount.
Implementation
There is currently only one implementation of Journal.
ObjectStore Implementation: FileStore
Because KeyValueStore is still experimental and MemStore is more of a reference/demo implementation, FileStore is the most widely used implementation.
FileStore implements the JournalingObjectStore class and therefore also ObjectStore. It implements both the operations of ObjectStore::Transaction and the other member operations.
The transaction operations are implemented in _do_transaction, which dispatches each one to a _$OPERATION method that does the detailed work. Because file systems differ in their characteristics, some operations and the feature detection methods are factored out into the abstract class FileStoreBackend. One operation with file-system-specific code is fiemap.
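The dispatch pattern can be sketched like this. ToyFileStore, ToyRecord and the TOY_* opcodes are hypothetical, and an in-memory map stands in for files on disk; only the shape of _do_transaction handing work to per-operation methods is meant to resemble the description above.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

enum ToyOp { TOY_TOUCH, TOY_WRITE, TOY_ZERO };  // arbitrary toy opcodes

struct ToyRecord {
  ToyOp op;
  std::string oid;
  uint64_t off;
  uint64_t len;
  std::string data;  // payload, used by TOY_WRITE only
};

class ToyFileStore {
 public:
  // Walk the recorded operations and hand each one to its own method.
  // Returns 0 on success or -1 at the first unknown opcode.
  int _do_transaction(const std::vector<ToyRecord>& ops) {
    for (size_t i = 0; i < ops.size(); ++i) {
      const ToyRecord& r = ops[i];
      switch (r.op) {
        case TOY_TOUCH: _touch(r.oid); break;
        case TOY_WRITE: _write(r.oid, r.off, r.data); break;
        case TOY_ZERO:  _zero(r.oid, r.off, r.len); break;
        default: return -1;
      }
    }
    return 0;
  }

  std::map<std::string, std::string> files;  // oid -> contents, stand-in for files

 private:
  // Create the object if it does not exist yet.
  void _touch(const std::string& oid) { files[oid]; }

  // Overwrite bytes at off, growing the object with zeros if needed.
  void _write(const std::string& oid, uint64_t off, const std::string& data) {
    std::string& f = files[oid];
    if (f.size() < off + data.size()) f.resize(off + data.size(), '\0');
    f.replace(off, data.size(), data);
  }

  // Zeroing is expressed as a write of len zero bytes in this sketch.
  void _zero(const std::string& oid, uint64_t off, uint64_t len) {
    _write(oid, off, std::string(len, '\0'));
  }
};
```

Keeping one small method per operation is what lets file-system-specific variants be swapped in for individual operations without touching the dispatch loop.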
About FIEMAP
FIEMAP gives you access to a file's extent data. Basically, you ask Linux to return the list of extents that hold the file's data. This is very useful for sparse files. Ceph FileStore uses it in FileStore::_do_sparse_copy_range.
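As a rough illustration of what such a query looks like outside Ceph, the sketch below asks Linux for the extents of a file through the FS_IOC_FIEMAP ioctl. The helper list_extents and the Extent struct are hypothetical names for this example, only the first 32 extents are fetched, and this is not how FileStore itself is written. It is Linux-only and fails cleanly on file systems that do not support FIEMAP.

```cpp
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <cstdint>
#include <vector>

// Logical offset and length of one data extent (hypothetical helper type).
struct Extent {
  uint64_t logical;
  uint64_t length;
};

// Fill out with up to 32 extents of the file at path.
// Returns false if the file cannot be opened or FIEMAP is not supported.
bool list_extents(const char* path, std::vector<Extent>& out) {
  const unsigned max_extents = 32;
  int fd = ::open(path, O_RDONLY);
  if (fd < 0) return false;

  // struct fiemap ends in a variable-length array of extents, so the
  // request header and the result slots are allocated as one zeroed buffer.
  std::vector<char> buf(
      sizeof(struct fiemap) + max_extents * sizeof(struct fiemap_extent), 0);
  struct fiemap* fm = reinterpret_cast<struct fiemap*>(buf.data());
  fm->fm_start = 0;
  fm->fm_length = FIEMAP_MAX_OFFSET;  // map the whole file
  fm->fm_flags = FIEMAP_FLAG_SYNC;    // flush dirty data before mapping
  fm->fm_extent_count = max_extents;

  bool ok = ::ioctl(fd, FS_IOC_FIEMAP, fm) == 0;
  if (ok) {
    for (unsigned i = 0; i < fm->fm_mapped_extents; ++i) {
      Extent e;
      e.logical = fm->fm_extents[i].fe_logical;
      e.length = fm->fm_extents[i].fe_length;
      out.push_back(e);
    }
  }
  ::close(fd);
  return ok;
}
```

For a sparse file the returned extents cover only the regions that hold data, so a copy routine can skip the holes between them.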
Extended reading:
FileStore Backend
The FileStore backend abstracts most of the file-system-specific optimizations and uncommon features out of the FileStore implementation.
If the underlying file system supports checkpoints, the FileStore backend uses the file system's snapshot feature to implement them. It also runs feature checks, such as testing whether FIEMAP is supported.
All concrete classes derive from a common base class, GenericFileStoreBackend.
Support for specific file systems:
- ZFS: implements checkpoints with ZFS snapshots
- XFS: sets the XFS extent size via set_alloc_hint
- Btrfs: implements checkpoints with btrfs snapshots and efficient file cloning with copy-on-write (COW)
KeyValueStore
KeyValueStore is an ObjectStore adapter on top of concrete KeyValueDB subclasses.
KeyValueDB is a generic interface class for KV databases that is also used elsewhere in the Ceph code. The largest part of the adapter maps the file-system-like operations, which use collection IDs and object IDs, onto the flat KV interface. Values in a KV database usually cannot be of arbitrary size, so both the key mapping and the striping of values are handled by the class StripObjectMap, which is part of KeyValueStore.
Object Mappings
GenericObjectMap is the common base class for mappers from ghobject_t and coll_t to KV. It looks somewhat similar to the KeyValueDB API but does not implement it. Instead, it is implemented by delegating to KeyValueDB.
StripObjectMap
StripObjectMap is part of the KeyValueStore source code and implements GenericObjectMap. It adds striping and caching. The default strip size is 4096 bytes (configurable).
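The striping idea can be sketched as follows, assuming the 4096-byte default. The key layout "cid/oid/strip number", the FlatKV typedef and the functions strip_key, strip_write and strip_read are all hypothetical; the real StripObjectMap uses its own key scheme and also tracks object sizes, which this toy does not.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

const uint64_t STRIP_SIZE = 4096;  // default strip size mentioned above

typedef std::map<std::string, std::string> FlatKV;  // stand-in for a KV database

// Hypothetical flat key layout: "<cid>/<oid>/<strip number>".
std::string strip_key(const std::string& cid, const std::string& oid,
                      uint64_t strip_no) {
  std::ostringstream os;
  os << cid << '/' << oid << '/' << strip_no;
  return os.str();
}

// Write data at byte offset off, splitting it across fixed-size strips.
void strip_write(FlatKV& kv, const std::string& cid, const std::string& oid,
                 uint64_t off, const std::string& data) {
  uint64_t done = 0;
  while (done < data.size()) {
    uint64_t pos = off + done;
    uint64_t in_strip = pos % STRIP_SIZE;  // offset inside the current strip
    uint64_t n = std::min<uint64_t>(STRIP_SIZE - in_strip, data.size() - done);
    std::string& strip = kv[strip_key(cid, oid, pos / STRIP_SIZE)];
    if (strip.size() < in_strip + n) strip.resize(in_strip + n, '\0');
    strip.replace(in_strip, n, data, done, n);
    done += n;
  }
}

// Read len bytes starting at off; bytes never written read back as zeros.
std::string strip_read(const FlatKV& kv, const std::string& cid,
                       const std::string& oid, uint64_t off, uint64_t len) {
  std::string out(len, '\0');
  uint64_t done = 0;
  while (done < len) {
    uint64_t pos = off + done;
    uint64_t in_strip = pos % STRIP_SIZE;
    uint64_t n = std::min<uint64_t>(STRIP_SIZE - in_strip, len - done);
    FlatKV::const_iterator it = kv.find(strip_key(cid, oid, pos / STRIP_SIZE));
    if (it != kv.end() && it->second.size() > in_strip) {
      uint64_t avail = std::min<uint64_t>(n, it->second.size() - in_strip);
      out.replace(done, avail, it->second, in_strip, avail);
    }
    done += n;
  }
  return out;
}
```

With this layout, a 5000-byte write at offset 4000 touches strips 0, 1 and 2, and no single value stored in the database exceeds 4096 bytes.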
Database Backends
- Kinetic: Seagate Kinetic Client GitHub repository
- LevelDB: Google LevelDB
- RocksDB: Facebook RocksDB
MemStore
MemStore stores everything in memory. Dump and restore are supported on mount/umount. To make object and collection lookups easy, it is built around C++ objects and hash tables.
According to the first commit of its implementation, it is a reference implementation of ObjectStore [1]. The implementation is still only 1,537 SLOC.
How to Add a New ObjectStore
Because code for handling file systems and KV databases already exists, it is not always necessary to write a brand-new ObjectStore.
A rough guide on where to start:
What kind of backend do you want to support?
- KV database: start from an implementation of KeyValueDB (note: see LevelDBStore.cc/h and RocksDBStore.cc/h)
- File system:
  - Look at the detect_features method of FileStoreBackend. Does the file system require special handling?
  - Does it support snapshots? If so, refer to BtrfsFileStoreBackend and ZFSFileStoreBackend.
- A completely different implementation? Look at MemStore (note: to understand what needs to be implemented, and to see the simplest prototype implementations of these methods).
[1] https://github.com/ceph/ceph/commit/aa63d6730a638591b0699c4215ed5cce2917d1c9