CEPH ObjectStore API Introduction

Thomas is the pseudonym I use on the Ceph China Community translation team. This translation was first published in the Ceph China community and is now reposted on my blog; feel free to share it.

This article was translated by thomas of the Ceph China Community and proofread by Chen.

English source: The Ceph ObjectStore API. You are welcome to join the translation team.

Introduction

The object store is the part of the Ceph OSD that performs the actual data storage. Three object store implementations are currently available:

    • FileStore: a file system plus a journal as backing storage
    • KeyValueStore: based on a KV database (for example RocksDB or LevelDB)
    • MemStore: memory as storage (all data is kept in memory in std::map or bufferlist)

NewStore and BlueStore are also under development.

Related documents

    • Ceph performance: interesting things going on
    • Object Store Architecture Overview

Code

The object store source code is located in the os subfolder of the Ceph source folder. For convenience: Ceph GitHub repository.

The following description is based on Ceph commit-ish 6f8b54c from 2015-01-13.

ObjectStore API

The abstract class ObjectStore is the main API through which the OSD accesses storage.

It is a file-system-like API, with two differences: operations that change state are grouped into transactions, and what it stores are objects, not files.

An object consists of:

    • Byte data: similar to the contents of a file in a file system
    • Extended attributes: similar to the extended attributes of a file in a file system; a collection of key-value pairs
    • OMAP: similar in concept to extended attributes, but with a different address space size and access pattern

An object is identified by the following two IDs:

    • Collection ID: coll_t cid (a collection is a set of objects)
    • Object ID: ghobject_t oid
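
As a rough illustration of this object model, the sketch below keeps objects in an in-memory map keyed by the (collection ID, object ID) pair. The names ToyObject and ToyObjectTable, and the use of std::string for both IDs, are assumptions made for this example; the real coll_t and ghobject_t carry more structure.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for coll_t and ghobject_t (the real types are richer).
using CollId = std::string;
using ObjId = std::string;

// One object: byte data, extended attributes, and an omap.
struct ToyObject {
    std::string data;                           // byte data, like file contents
    std::map<std::string, std::string> xattrs;  // extended attributes
    std::map<std::string, std::string> omap;    // separate key-value space
};

// A toy table in which every object is addressed by the pair (cid, oid).
struct ToyObjectTable {
    std::map<std::pair<CollId, ObjId>, ToyObject> objects;

    bool exists(const CollId& cid, const ObjId& oid) const {
        return objects.count(std::make_pair(cid, oid)) != 0;
    }
};
```

Keying on the pair keeps objects of one collection adjacent in the ordered map, which is one simple way to make per-collection listing cheap.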

Operations

The following is not a complete list; it only gives an impression. Note: some operations are only available through transactions.

Transaction operations (take a transaction as a parameter):

    • apply_transaction
    • queue_transactions

General file system operations:

    • mount
    • umount
    • mkfs
    • mkjournal
    • statfs
    • need_journal
    • sync
    • flush
    • snapshot

Object operations (take coll_t cid and ghobject_t oid as parameters):

    • exists
    • stat
    • read
    • fiemap
    • getattr
    • getattrs

Collection operations (take coll_t cid as a parameter):

    • collection_getattr
    • collection_empty
    • collection_list
    • list_collections
    • collection_exists
    • collection_getattrs

OMAP operations:

    • omap_get
    • omap_get_header
    • omap_get_keys
    • omap_get_values
    • omap_check_keys
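
To give a feel for the read side of such an API, here is a minimal sketch of an abstract interface with three of the operations named above, plus an in-memory implementation. The class names ToyReadApi and ToyMemStore and all signatures are simplified assumptions for illustration; they are not the declarations from ObjectStore.h.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>

using CollId = std::string;  // hypothetical stand-in for coll_t
using ObjId = std::string;   // hypothetical stand-in for ghobject_t

// Read-only slice of an ObjectStore-like interface (simplified signatures).
class ToyReadApi {
public:
    virtual ~ToyReadApi() = default;
    virtual bool exists(const CollId& cid, const ObjId& oid) const = 0;
    // Returns the number of bytes read, or a negative errno value.
    virtual int read(const CollId& cid, const ObjId& oid, std::uint64_t offset,
                     std::size_t len, std::string& out) const = 0;
    // Returns 0 on success, or a negative errno value.
    virtual int omap_get_keys(const CollId& cid, const ObjId& oid,
                              std::set<std::string>& keys) const = 0;
};

class ToyMemStore : public ToyReadApi {
public:
    struct Object {
        std::string data;
        std::map<std::string, std::string> omap;
    };
    std::map<std::pair<CollId, ObjId>, Object> objects;

    bool exists(const CollId& cid, const ObjId& oid) const override {
        return find(cid, oid) != nullptr;
    }
    int read(const CollId& cid, const ObjId& oid, std::uint64_t offset,
             std::size_t len, std::string& out) const override {
        const Object* o = find(cid, oid);
        if (!o) return -ENOENT;
        out.clear();
        if (offset < o->data.size())
            out = o->data.substr(static_cast<std::size_t>(offset), len);
        return static_cast<int>(out.size());
    }
    int omap_get_keys(const CollId& cid, const ObjId& oid,
                      std::set<std::string>& keys) const override {
        const Object* o = find(cid, oid);
        if (!o) return -ENOENT;
        for (const auto& kv : o->omap) keys.insert(kv.first);
        return 0;
    }

private:
    const Object* find(const CollId& cid, const ObjId& oid) const {
        auto it = objects.find(std::make_pair(cid, oid));
        return it == objects.end() ? nullptr : &it->second;
    }
};
```

Note that every call names both the collection and the object; there is no notion of an open file handle in this sketch.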

Further reading

    • ObjectStore.h, comments section

Transaction

Definition: a transaction is a sequence of primitive mutation operations. The class is defined as ObjectStore::Transaction.

Supported operations (excerpt from ObjectStore.h):

    class Transaction {
    public:
      enum {
        OP_NOP =          0,
        OP_TOUCH =        9,   // cid, oid
        OP_WRITE =        10,  // cid, oid, offset, len, bl
        OP_ZERO =         11,  // cid, oid, offset, len
        OP_TRUNCATE =     12,  // cid, oid, len
        OP_REMOVE =       13,  // cid, oid
        OP_SETATTR =      14,  // cid, oid, attrname, bl
        OP_SETATTRS =     15,  // cid, oid, attrset
        OP_RMATTR =       16,  // cid, oid, attrname
        OP_CLONE =        17,  // cid, oid, newoid
        OP_CLONERANGE =   18,  // cid, oid, newoid, offset, len
        OP_CLONERANGE2 =  30,  // cid, oid, newoid, srcoff, len, dstoff

        OP_TRIMCACHE =    19,  // cid, oid, offset, len  **DEPRECATED**

        OP_MKCOLL =       20,  // cid
        OP_RMCOLL =       21,  // cid
        OP_COLL_ADD =     22,  // cid, oldcid, oid
        OP_COLL_REMOVE =  23,  // cid, oid
        OP_COLL_SETATTR = 24,  // cid, attrname, bl
        OP_COLL_RMATTR =  25,  // cid, attrname
        OP_COLL_SETATTRS = 26, // cid, attrset
        OP_COLL_MOVE =    8,   // newcid, oldcid, oid

        OP_STARTSYNC =    27,  // start a sync

        OP_RMATTRS =      28,  // cid, oid
        OP_COLL_RENAME =  29,  // cid, newcid

        OP_OMAP_CLEAR =     31,  // cid
        OP_OMAP_SETKEYS =   32,  // cid, attrset
        OP_OMAP_RMKEYS =    33,  // cid, keyset
        OP_OMAP_SETHEADER = 34,  // cid, header
        OP_SPLIT_COLLECTION =  35,  // cid, bits, destination
        OP_SPLIT_COLLECTION2 = 36,  /* cid, bits, destination
                                       doesn't create the destination */
        OP_OMAP_RMKEYRANGE =  37,  // cid, oid, firstkey, lastkey
        OP_COLL_MOVE_RENAME = 38,  // oldcid, oldoid, newcid, newoid

        OP_SETALLOCHINT = 39,  // cid, oid, object_size, write_size
        OP_COLL_HINT =    40,  // cid, type, bl
      };
      // ...
    };

Each operation has a corresponding member function in the Transaction class (for example, OP_ZERO: zero(cid, oid, off, len)).
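
The pairing of one opcode with one member function can be sketched as a small recorder class: each call only appends a record, and nothing touches storage until the transaction is handed to a store. ToyOp, ToyOpRecord, and ToyTransaction are hypothetical names, and the record layout is an assumption; the real class encodes its operations into a buffer.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical opcodes for this sketch (not the values from ObjectStore.h).
enum class ToyOp { Touch, Write, Zero };

struct ToyOpRecord {
    ToyOp op;
    std::string cid;
    std::string oid;
    std::uint64_t off;
    std::uint64_t len;
    std::string data;
};

// Each member function records exactly one operation, in call order.
class ToyTransaction {
public:
    void touch(const std::string& cid, const std::string& oid) {
        ops_.push_back(ToyOpRecord{ToyOp::Touch, cid, oid, 0, 0, ""});
    }
    void write(const std::string& cid, const std::string& oid,
               std::uint64_t off, const std::string& data) {
        ops_.push_back(ToyOpRecord{ToyOp::Write, cid, oid, off, data.size(), data});
    }
    void zero(const std::string& cid, const std::string& oid,
              std::uint64_t off, std::uint64_t len) {
        ops_.push_back(ToyOpRecord{ToyOp::Zero, cid, oid, off, len, ""});
    }
    const std::vector<ToyOpRecord>& ops() const { return ops_; }

private:
    std::vector<ToyOpRecord> ops_;
};
```

Because the transaction is just data, it can be queued, journaled, or replayed later without re-running the caller's logic.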

A transaction can register callbacks, for example these three:

    • on_applied
    • on_commit
    • on_applied_sync
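
The following sketch shows one way such callbacks could be wired up. The ordering used here (journal the transaction, fire on_commit, apply, fire on_applied_sync inline, then run on_applied later from a finisher) is an assumption modeled on write-ahead journaling and is not specified by this article; ToyCallbackTransaction and ToyCallbackStore are hypothetical names.

```cpp
#include <functional>
#include <string>
#include <vector>

struct ToyCallbackTransaction {
    std::vector<std::string> ops;           // placeholder for real operations
    std::function<void()> on_commit;        // assumed: transaction is durable
    std::function<void()> on_applied_sync;  // assumed: applied, run inline
    std::function<void()> on_applied;       // assumed: applied, run deferred
};

class ToyCallbackStore {
public:
    void queue_transaction(const ToyCallbackTransaction& t) {
        journal_.push_back(t.ops);          // 1. pretend this is a durable write
        if (t.on_commit) t.on_commit();
        for (const std::string& op : t.ops) // 2. apply the operations
            applied_.push_back(op);
        if (t.on_applied_sync) t.on_applied_sync();
        if (t.on_applied)                   // 3. defer the asynchronous callback
            deferred_.push_back(t.on_applied);
    }
    // Runs the deferred on_applied callbacks, as a finisher thread might.
    void run_finisher() {
        for (std::function<void()>& cb : deferred_) cb();
        deferred_.clear();
    }
    const std::vector<std::string>& applied() const { return applied_; }

private:
    std::vector<std::vector<std::string>> journal_;
    std::vector<std::string> applied_;
    std::vector<std::function<void()>> deferred_;
};
```

Separating the durable and applied notifications lets a caller acknowledge a write once it is safe, even if it is not yet readable.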

ObjectStore::Transaction objects are primarily used to send a sequence of operations from the OSD to the object store. For example, OSD::mkfs runs the following operations to initialize the meta collection:

    ObjectStore::Transaction t;
    // create the meta collection, then write the superblock object into it
    t.create_collection(META_COLL);
    t.write(META_COLL, OSD_SUPERBLOCK_POBJECT, 0, bl.length(), bl);
    // hand the whole sequence to the store in one call
    ret = store->apply_transaction(t);

The ObjectStore::Transaction class can also deserialize a sequence of operations from a buffer (translator's note: the Transaction object is rebuilt from a byte stream). Journal replay redoes transactions in this way (translator's note: the byte stream is read from the journal, the Transaction object is reconstructed from it, and the object is then applied).
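
A minimal sketch of that round trip, assuming a made-up wire format (a little-endian 32-bit count, then for each operation one opcode byte and two length-prefixed strings). The format and the names ToyWireOp, toy_encode, and toy_decode are inventions for this example and have nothing to do with Ceph's real encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical operation record; real transactions carry typed arguments.
struct ToyWireOp {
    std::uint8_t code;
    std::string oid;
    std::string payload;
};

static void put_u32(std::string& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xffu));
}

static std::uint32_t get_u32(const std::string& in, std::size_t& pos) {
    if (pos > in.size() || in.size() - pos < 4)
        throw std::runtime_error("truncated buffer");
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    pos += 4;
    return v;
}

static void put_str(std::string& out, const std::string& s) {
    put_u32(out, static_cast<std::uint32_t>(s.size()));
    out += s;
}

static std::string get_str(const std::string& in, std::size_t& pos) {
    std::uint32_t n = get_u32(in, pos);
    if (in.size() - pos < n) throw std::runtime_error("truncated buffer");
    std::string s = in.substr(pos, n);
    pos += n;
    return s;
}

// Serializes the operations into a byte string, e.g. for a journal entry.
std::string toy_encode(const std::vector<ToyWireOp>& ops) {
    std::string buf;
    put_u32(buf, static_cast<std::uint32_t>(ops.size()));
    for (const ToyWireOp& op : ops) {
        buf.push_back(static_cast<char>(op.code));
        put_str(buf, op.oid);
        put_str(buf, op.payload);
    }
    return buf;
}

// Rebuilds the operation sequence from bytes; throws on a truncated buffer.
std::vector<ToyWireOp> toy_decode(const std::string& buf) {
    std::size_t pos = 0;
    std::uint32_t n = get_u32(buf, pos);
    std::vector<ToyWireOp> ops;
    for (std::uint32_t i = 0; i < n; ++i) {
        if (pos >= buf.size()) throw std::runtime_error("truncated buffer");
        ToyWireOp op;
        op.code = static_cast<std::uint8_t>(buf[pos++]);
        op.oid = get_str(buf, pos);
        op.payload = get_str(buf, pos);
        ops.push_back(op);
    }
    return ops;
}
```

Rejecting truncated input matters for replay, because the last journal entry may be only partially written after a crash.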

Journal

The journal is important for recovery.

The base class ObjectStore does not implement journaling. The subclass JournalingObjectStore adds journaling support.

JournalingObjectStore adds, for example, the following methods:

    • journal_start
    • journal_stop
    • journal_write_close
    • journal_replay

In addition:

    • _op_journal_transactions: adds transactions to the journal
    • do_transactions: pure virtual function that applies journaled transactions. It is called, for example, by journal_replay during mount.
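
The idea behind journaling and replay can be sketched with a toy write-ahead journal: every transaction is journaled with a sequence number before it is applied, and replay re-applies whatever was journaled but never applied. ToyJournalTxn and ToyJournalingStore are hypothetical names, the journal is an in-memory vector standing in for durable storage, and the method names only echo the ones listed above.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One "set key = value" operation stands in for a whole transaction.
struct ToyJournalTxn {
    std::string key;
    std::string value;
};

class ToyJournalingStore {
public:
    // Normal path: journal first, then apply, then remember the sequence.
    void submit(const ToyJournalTxn& t) {
        std::uint64_t seq = journal_write(t);
        do_transaction(t);
        applied_seq_ = seq;
    }
    // Appends to the journal only. Calling this without applying simulates
    // a crash between the journal write and the apply step.
    std::uint64_t journal_write(const ToyJournalTxn& t) {
        journal_.emplace_back(++last_seq_, t);
        return last_seq_;
    }
    // Re-applies journaled transactions newer than the last applied one.
    // Returns the number of transactions replayed.
    int journal_replay() {
        int replayed = 0;
        for (const auto& entry : journal_) {
            if (entry.first <= applied_seq_) continue;
            do_transaction(entry.second);
            applied_seq_ = entry.first;
            ++replayed;
        }
        return replayed;
    }
    std::string get(const std::string& key) const {
        auto it = state_.find(key);
        return it == state_.end() ? std::string() : it->second;
    }

private:
    void do_transaction(const ToyJournalTxn& t) { state_[t.key] = t.value; }

    std::vector<std::pair<std::uint64_t, ToyJournalTxn>> journal_;
    std::map<std::string, std::string> state_;
    std::uint64_t last_seq_ = 0;
    std::uint64_t applied_seq_ = 0;
};
```

Tracking the last applied sequence number is what makes replay safe to run more than once in this sketch.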

Implementation

There is currently only one Journal implementation:

ObjectStore implementation: FileStore

Because KeyValueStore is still experimental and MemStore is more of a reference/demo implementation, FileStore is the most widely used implementation.

FileStore implements the JournalingObjectStore class and, through it, ObjectStore. The class implements both the operations in ObjectStore::Transaction and the other member operations.

The transaction operations are implemented in _do_transaction, which dispatches each one to a _$OPERATION method that does the detailed work. Because file systems differ in their characteristics, some operations and feature-detection methods are factored out into the abstract class FileStoreBackend. One operation that depends on file-system-specific code is fiemap.
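
The dispatch pattern described here can be sketched as a switch over the opcode that forwards to one private method per operation. ToyDispatchOp, ToyDispatchRecord, and ToyFileStore are hypothetical names, and a std::map stands in for files on disk; this is not the FileStore code.

```cpp
#include <map>
#include <string>
#include <vector>

enum class ToyDispatchOp { Touch, Write, Remove };  // hypothetical opcodes

struct ToyDispatchRecord {
    ToyDispatchOp op;
    std::string oid;
    std::string data;
};

class ToyFileStore {
public:
    // Central dispatcher, in the spirit of _do_transaction.
    // Returns 0 on success or -1 at the first failing operation.
    int do_transaction(const std::vector<ToyDispatchRecord>& ops) {
        for (const ToyDispatchRecord& r : ops) {
            int ret = 0;
            switch (r.op) {
            case ToyDispatchOp::Touch:  ret = _touch(r.oid); break;
            case ToyDispatchOp::Write:  ret = _write(r.oid, r.data); break;
            case ToyDispatchOp::Remove: ret = _remove(r.oid); break;
            }
            if (ret < 0) return ret;
        }
        return 0;
    }
    bool exists(const std::string& oid) const { return files_.count(oid) != 0; }
    std::string contents(const std::string& oid) const {
        auto it = files_.find(oid);
        return it == files_.end() ? std::string() : it->second;
    }

private:
    // One small method per operation does the detailed work.
    int _touch(const std::string& oid) {
        files_.emplace(oid, std::string());  // no-op if it already exists
        return 0;
    }
    int _write(const std::string& oid, const std::string& data) {
        files_[oid] = data;
        return 0;
    }
    int _remove(const std::string& oid) {
        return files_.erase(oid) > 0 ? 0 : -1;
    }

    std::map<std::string, std::string> files_;  // stand-in for files on disk
};
```

Keeping the switch thin makes it easy to see which operations exist, while each helper can grow file-system-specific details on its own.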

About fiemap

Fiemap gives you access to a file's extent data. Basically, you ask the Linux kernel to return a map of where the file's data is located. This is very useful for sparse files. Ceph FileStore uses it in FileStore::_do_sparse_copy_range.
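
For readers who have not used the interface, the sketch below queries a file's extents with the Linux FS_IOC_FIEMAP ioctl. It is Linux-only, caps the result at 32 extents for brevity, and is not taken from the Ceph source; list_extents is a hypothetical helper name. Not every file system supports the ioctl, in which case the function simply reports failure.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <utility>
#include <vector>

// Fills `extents` with (logical offset, length) pairs for the file at `path`.
// Returns false if the file cannot be opened or the ioctl fails, for example
// because the file system does not support fiemap.
bool list_extents(const char* path,
                  std::vector<std::pair<std::uint64_t, std::uint64_t>>& extents) {
    const unsigned kMaxExtents = 32;  // arbitrary cap for this sketch
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    // struct fiemap is followed in memory by the extent array it returns.
    std::size_t bytes = sizeof(struct fiemap) +
                        kMaxExtents * sizeof(struct fiemap_extent);
    struct fiemap* fm = static_cast<struct fiemap*>(std::calloc(1, bytes));
    if (!fm) {
        close(fd);
        return false;
    }
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;  // map the whole file
    fm->fm_flags = FIEMAP_FLAG_SYNC;    // flush dirty data before mapping
    fm->fm_extent_count = kMaxExtents;

    bool ok = ioctl(fd, FS_IOC_FIEMAP, fm) == 0;
    if (ok) {
        extents.clear();
        for (unsigned i = 0; i < fm->fm_mapped_extents; ++i) {
            std::uint64_t logical = fm->fm_extents[i].fe_logical;
            std::uint64_t length = fm->fm_extents[i].fe_length;
            extents.emplace_back(logical, length);
        }
    }
    std::free(fm);
    close(fd);
    return ok;
}
```

A sparse-aware copy can then read and write only the returned ranges and skip the holes between them.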

Further reading:

    • LWN: Fiemap, an extent mapping ioctl
    • GNU coreutils copy implementation

FileStore backend

The FileStore backend abstracts most file-system-specific optimizations and uncommon features away from the FileStore implementation.

If the underlying file system supports it, the FileStore backend uses the file system's snapshot feature to implement checkpoints. It also runs feature detection, for example testing whether fiemap is supported.

All concrete classes derive from the common base class GenericFileStoreBackend.

Support for specific file systems:

    • ZFS: implements checkpoints with ZFS snapshots
    • XFS: sets the XFS extent size through set_alloc_hint
    • Btrfs: implements checkpoints with btrfs snapshots and efficient file cloning through copy-on-write
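
The split between a generic backend and file-system-specific subclasses can be sketched with a small class hierarchy. ToyBackend, ToySnapshotBackend, and checkpoint_if_supported are hypothetical names chosen for this example; the method set is a simplification and does not reproduce the FileStoreBackend interface.

```cpp
#include <string>
#include <vector>

// Generic behavior: no checkpoint support.
class ToyBackend {
public:
    virtual ~ToyBackend() = default;
    virtual const char* name() const { return "generic"; }
    virtual bool can_checkpoint() const { return false; }
    // Returns 0 on success, -1 if checkpoints are unsupported.
    virtual int create_checkpoint(const std::string&) { return -1; }
};

// A backend for a file system with snapshots overrides only what differs.
class ToySnapshotBackend : public ToyBackend {
public:
    const char* name() const override { return "snapshotting"; }
    bool can_checkpoint() const override { return true; }
    int create_checkpoint(const std::string& snap) override {
        snapshots.push_back(snap);  // stand-in for taking a real snapshot
        return 0;
    }
    std::vector<std::string> snapshots;
};

// The store asks the backend instead of hard-coding file system checks.
bool checkpoint_if_supported(ToyBackend& backend, const std::string& snap) {
    return backend.can_checkpoint() && backend.create_checkpoint(snap) == 0;
}
```

With this shape, supporting another file system means adding one subclass rather than touching the store's main code paths.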

KeyValueStore

KeyValueStore is an ObjectStore adapter for concrete subclasses of KeyValueDB.

KeyValueDB is a generic interface class for KV databases, which is also used elsewhere in the Ceph code. The largest part of the adapter maps file-system-like operations that use a collection ID and an object ID onto the flat KV interface. Values in a KV database usually cannot be arbitrarily large, so both the KV mapping and the striping of values are handled by the class StripObjectMap, which is part of KeyValueStore.
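
To illustrate what mapping onto a flat KV interface can look like, the sketch below builds one flat key per object and lists a collection with an ordered prefix scan. The key layout "<cid>/<oid>", the FlatKv alias, and both function names are assumptions for this example; the real mapping in Ceph is more involved.

```cpp
#include <map>
#include <string>
#include <vector>

using FlatKv = std::map<std::string, std::string>;  // stand-in for a KV database

// Hypothetical key layout: "<cid>/<oid>". Assumes '/' never occurs in cid.
std::string make_flat_key(const std::string& cid, const std::string& oid) {
    return cid + "/" + oid;
}

// Lists the object IDs of one collection by scanning keys with its prefix.
std::vector<std::string> list_flat_collection(const FlatKv& kv,
                                              const std::string& cid) {
    const std::string prefix = cid + "/";
    std::vector<std::string> oids;
    for (auto it = kv.lower_bound(prefix);
         it != kv.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        oids.push_back(it->first.substr(prefix.size()));
    }
    return oids;
}
```

The scan relies on the KV store keeping keys sorted, which is the case for LevelDB and RocksDB iterators as well as for std::map.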

Object Mappings

GenericObjectMap is the common base class for mappers from ghobject_t and coll_t to KV. It looks a little like the KeyValueDB API but does not implement it; instead, it uses KeyValueDB implementations by delegation.

StripObjectMap

StripObjectMap is part of the KeyValueStore source code and derives from GenericObjectMap. It adds striping and caching. The default stripe size is 4096 bytes (configurable).
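
Striping can be sketched as splitting a value into fixed-size chunks, each stored under its own key. The 4096-byte default comes from the text above; the key suffix format and the names StripeKv, stripe_key, put_striped, and get_striped are assumptions for this example, and caching is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>

using StripeKv = std::map<std::string, std::string>;  // stand-in for a KV database

// Hypothetical key layout: "<base>#<zero-padded stripe index>".
std::string stripe_key(const std::string& base, std::size_t index) {
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "#%08zu", index);
    return base + suffix;
}

// Splits `value` into stripes and stores them. Returns the stripe count.
// This sketch does not remove stale stripes left by an earlier, longer value.
std::size_t put_striped(StripeKv& kv, const std::string& base,
                        const std::string& value,
                        std::size_t stripe_size = 4096) {
    assert(stripe_size > 0);
    std::size_t count = 0;
    for (std::size_t off = 0; off < value.size(); off += stripe_size)
        kv[stripe_key(base, count++)] = value.substr(off, stripe_size);
    return count;
}

// Reassembles the value by reading stripes until one is missing.
std::string get_striped(const StripeKv& kv, const std::string& base) {
    std::string value;
    for (std::size_t i = 0;; ++i) {
        auto it = kv.find(stripe_key(base, i));
        if (it == kv.end()) break;
        value += it->second;
    }
    return value;
}
```

With striping, a small overwrite in the middle of a large object only has to rewrite the stripes it touches instead of the whole value.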

Database backends

    • Kinetic: Seagate Kinetic Client GitHub repository
    • LevelDB: Google LevelDB
    • RocksDB: Facebook RocksDB

MemStore

MemStore stores everything in memory. Dump and restore are supported on mount/umount. It is built around C++ objects and hash tables to make object and collection lookups easy.

The first commit message for its implementation describes it as a reference implementation of ObjectStore [1]. The implementation is still only 1,537 SLOC.

How to add a new ObjectStore

Because code for handling file systems and KV databases already exists, it is not always necessary to write a brand-new ObjectStore.

A rough guide on where to start:

What kind of backend do you want to support?

    • KV database: start from an implementation of KeyValueDB (translator's note: see LevelDBStore.cc/h and RocksDBStore.cc/h)
    • File system:
      • Look at the methods of FileStoreBackend, then at detect_features. Does the file system require special handling?
      • Does it support snapshots? If so, refer to BtrfsFileStoreBackend and ZFSFileStoreBackend.
    • Something completely different? Look at MemStore (translator's note: to understand what needs to be implemented, and to see the simplest prototype implementation of each method).
[1] https://github.com/ceph/ceph/commit/aa63d6730a638591b0699c4215ed5cce2917d1c9
