This article was also published on the Shanda Games G-Cloud official WeChat account; it is reposted here for easy reference.
Many IT folks have probably heard of Ceph. Riding on OpenStack's popularity, Ceph has become hotter and hotter. Running Ceph well, however, is not easy; in QQ groups we often hear beginners complain that Ceph's performance is too poor to be usable. Is that really the case? If you run your cluster with Ceph's default configuration, performance will naturally fall short of expectations. As the saying goes, jade must be cut and polished before it becomes a vessel; Ceph likewise has its temperament, and a well-configured, well-tuned Ceph performs quite well. Below, Shanda Games G-Cloud shares some practical experience with Ceph tuning; corrections and additions are welcome.
The configuration parameters discussed below are taken from Ceph Hammer 0.94.1.
Ceph Configuration parameter optimization
First, look at the diagram of the interfaces between the Ceph client and server:
Ceph is a unified, scalable distributed storage system that provides three access interfaces: Object, Block, and File System. All of them interact with the back-end OSDs through the underlying librados. The OSD is Ceph's object storage unit and implements the data storage function. Internally it contains many modules; the modules exchange messages through queues and cooperate to process IO. Typical modules include the network module Messenger, the data handling module FileStore, and the journal handling module FileJournal.
Along with all these modules, Ceph provides a wealth of configuration options; a rough count puts the number of configuration parameters in the thousands, so you can imagine how hard it is to configure them all well. Internally, G-Cloud mainly uses Ceph block storage, i.e. Ceph RBD, so the configuration parameter optimizations below are limited to the RBD client (librbd) and the OSD side.
Let's take a look at client-side optimization first.
rbd Client Configuration Optimizations
When Ceph is used as virtual machine block storage, Qemu interacts with the Ceph cluster through the librbd client library; the related configuration parameters mostly carry the rbd_ prefix. The current librbd configuration can be obtained with the following command:
# path/to/socket points to the admin socket file of some OSD
#> ceph --admin-daemon {path/to/socket} config show | grep rbd
Some of these configuration parameters are described in more detail below:
rbd cache: Enable the client cache; enabled by default.
rbd cache size: Maximum cache size; default 32MB.
rbd cache max dirty: Maximum amount of dirty data in the cache, used to control write-back; cannot exceed rbd cache size; default 24MB.
rbd cache target dirty: Amount of dirty data at which write-back starts; cannot exceed rbd cache max dirty; default 16MB.
rbd cache max dirty age: Maximum time a piece of dirty data may stay in the cache, so that dirty data does not linger unwritten for too long; default 1s.
Comments: Enabling the cache can significantly improve sequential read/write performance, and the larger the cache, the better the performance; if a certain amount of data loss is acceptable, enabling it is recommended.
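As a minimal sketch, these cache options would typically live in the [client] section of ceph.conf on the hypervisor; the values below are illustrative assumptions, not recommended defaults:

[client]
rbd cache = true
rbd cache size = 67108864           # 64MB, example value (default 32MB)
rbd cache max dirty = 50331648      # 48MB, must stay below rbd cache size
rbd cache target dirty = 33554432   # 32MB, write-back starts at this point
rbd cache max dirty age = 2         # seconds, example value (default 1s)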
rbd cache max dirty object: Maximum number of Objects in the cache; the default is 0, which means it is calculated by librbd from rbd cache size. By default an RBD image is logically split into 4MB chunks, each chunk is abstracted as an Object, and librbd manages the cache in units of Objects. Increasing this value can improve performance.
Comments: The calculated value is rather small in ceph-0.94.1; it is recommended to increase it using the formula from ceph-0.94.4, which is as follows:
obj = MIN(2000, MAX(10, cct->_conf->rbd_cache_size / 100 / sizeof(ObjectCacher::Object)));
When configuring this I took sizeof(ObjectCacher::Object) = 128; 128 is my estimate of the Object size based on the code.
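For illustration, with the default rbd cache size of 32MB and the 128-byte Object estimate above, the 0.94.4 formula gives MIN(2000, MAX(10, 33554432 / 100 / 128)) = MIN(2000, 2621) = 2000; so on 0.94.1 one could set the value explicitly in the same ballpark (an assumption on my part, not an upstream recommendation):

[client]
rbd cache max dirty object = 2000   # example value derived from the 0.94.4 formula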
rbd cache writethrough until flush: Defaults to true. It exists for compatibility with virtio drivers before linux-2.6.32, to avoid data loss when a guest never sends flush requests and data is therefore never written back. With this parameter set, librbd performs IO in writethrough mode until the first flush request is received, and only then switches to writeback mode.
Comments: If your Linux guests run kernels older than 2.6.32, it is recommended to keep this set to true; otherwise it can simply be turned off.
rbd cache block writes upfront: Whether to turn on synchronous IO; defaults to false. When enabled, librbd waits for the reply returned by the Ceph OSD before acknowledging the write.
Comments: When turned on, performance is the worst, but it is the safest.
rbd readahead trigger requests: Number of consecutive sequential requests that trigger readahead; default 10.
rbd readahead max bytes: Maximum size of a readahead request; default 512KB; 0 turns readahead off.
rbd readahead disable after bytes: Maximum amount of data served from the readahead cache; default 50MB. Once this threshold is exceeded, librbd turns readahead off and leaves readahead to the Guest OS (to prevent double caching); 0 means no limit.
Comments: If the workload is mainly sequential reads, enabling readahead is recommended.
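A hedged example for a workload dominated by sequential reads, again with illustrative values rather than defaults:

[client]
rbd readahead trigger requests = 10
rbd readahead max bytes = 524288          # 512KB per readahead request; 0 disables readahead
rbd readahead disable after bytes = 0     # example: never hand readahead back to the Guest OS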
objecter inflight ops: Client-side flow control; the maximum number of in-flight (unacknowledged) IO requests allowed. Exceeding the threshold blocks application IO; 0 means unlimited.
objecter inflight op bytes: Client-side flow control; the maximum amount of in-flight (unacknowledged) IO data allowed. Exceeding the threshold blocks application IO; 0 means unlimited.
Comments: These provide simple client-side flow control to prevent network congestion; when the host network is the bottleneck, the rbd cache may be flooded with IO stuck in the "sending" state, which in turn hurts IO performance. There is no need to change these values unless you have a special need, but if bandwidth is sufficient you can increase them as needed.
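If bandwidth really is plentiful and you choose to raise the client flow-control limits, a sketch might look like this (the numbers are assumptions, not defaults):

[client]
objecter inflight ops = 2048              # example value; 0 means unlimited
objecter inflight op bytes = 268435456    # 256MB of in-flight data at most; 0 means unlimited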
rbd ssd cache: Whether the SSD disk cache is enabled; enabled by default.
rbd ssd cache size: Maximum size of the cache; default 10G.
rbd ssd cache max dirty: Maximum amount of dirty data in the cache, used to control write-back; cannot exceed rbd ssd cache size; default 7.5G.
rbd ssd cache target dirty: Amount of dirty data at which write-back starts; cannot exceed rbd ssd cache max dirty; default 5G.
rbd ssd chunk order: Chunk size of the cache file; default 64KB = 2^16.
rbd ssd cache path: Path where the cache file is stored.
Comments: This is an RBD cache developed in-house by Shanda Games G-Cloud. The first four parameters have meanings similar to the rbd cache * parameters above; rbd ssd chunk order defines the chunk size of the cache file, which is the minimum allocation/reclamation unit of the cache file and directly affects how efficiently the cache file is used. librbd also computes a chunk size dynamically from the IO size and applies it to the cache file when appropriate.
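Note that these rbd ssd cache * options come from G-Cloud's in-house librbd patch and do not exist in upstream Ceph; purely to show how such a configuration would be laid out (every value here is an assumption), it might look like:

[client]
rbd ssd cache = true
rbd ssd cache size = 10737418240          # 10G
rbd ssd cache max dirty = 8053063680      # 7.5G
rbd ssd cache target dirty = 5368709120   # 5G
rbd ssd chunk order = 16                  # assumed to be the exponent: 2^16 = 64KB
rbd ssd cache path = /path/to/ssd/cache   # placeholder path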
The above is some of the client-side tuning experience Shanda Games G-Cloud has accumulated while using Ceph RBD; corrections and additions are welcome. Now let's continue with tuning the OSD side.
OSD Configuration Optimizations
The Ceph OSD side has a large number of configuration parameters, all of which are defined in the src/common/config_opts.h file; you can of course also view the cluster's current configuration with the following command:
#> ceph --admin-daemon {path/to/socket} config show
Due to limited space, only a few common configuration parameters are analyzed below:
osd op threads: Number of threads handling requests such as peering.
osd disk threads: Number of threads handling background operations such as snap trim, replica trim, and scrub.
filestore op threads: Number of IO threads.
Comments: More threads means more concurrency and better performance, but too many threads causes frequent context switching, which also hurts performance; so when setting thread counts, consider the node's CPU capability, the number of OSDs per node, and the type of storage media. Usually the first two parameters are kept small, and the last one is set larger to speed up IO processing. These values can be adjusted dynamically when anomalies such as peering occur.
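A sketch for the [osd] section, assuming nodes with spare CPU and fast media; the numbers are examples, not recommended defaults:

[osd]
osd op threads = 4            # example value
osd disk threads = 1          # kept small
filestore op threads = 8      # set larger to speed up IO processing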
filestore op thread timeout: Warning timeout when an IO thread is blocked.
filestore op thread suicide timeout: Suicide timeout for an IO thread; when a thread has been unresponsive for this long, Ceph terminates it, which causes the OSD process to exit.
Comments: If IO threads are timing out, use the relevant tools/commands (e.g. ceph osd perf) to analyze whether the OSD's disk is a bottleneck, the media has failed, and so on.
ms dispatch throttle bytes: Messenger flow control, controlling the depth of the DispatchQueue; 0 means unlimited.
Comments: The Messenger sits at the very front of the OSD, and its performance directly affects IO processing speed. In the Hammer version, IO requests are added to OSD::op_shardedwq, while some other requests go directly into the DispatchQueue; to keep the Messenger from becoming a bottleneck, this value can be set a bit larger.
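For example, on a 10Gbps front-end network one might raise it as follows (an assumed value, not a default):

[osd]
ms dispatch throttle bytes = 1073741824   # 1GB; 0 would mean unlimited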
osd_op_num_shards: Default 5; the number of queues in OSD::op_shardedwq that hold IO.
osd_op_num_threads_per_shard: Default 2; the number of IO dispatch threads assigned to each queue in OSD::op_shardedwq.
Comments: The total number of threads serving OSD::op_shardedwq is osd_op_num_shards * osd_op_num_threads_per_shard, i.e. 10 by default. IO requests come in through the Messenger, are added to OSD::op_shardedwq, and are handed to the back-end filestore by these dispatch threads. Depending on the front-end network (e.g. 10Gbps) and back-end media performance (e.g. SSD), you can consider increasing these values.
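A hedged sketch assuming 10Gbps networking and SSD-backed OSDs; 8 shards with 2 threads each gives 16 dispatch threads in total:

[osd]
osd_op_num_shards = 8                # example value (default 5)
osd_op_num_threads_per_shard = 2     # total dispatch threads = 8 * 2 = 16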
filestore_queue_max_ops: Controls the depth of the FileStore queue; the maximum number of uncompleted IOs.
filestore_queue_max_bytes: Controls the depth of the FileStore queue; the maximum amount of uncompleted IO data.
filestore_queue_committing_max_ops: If the OSD's back-end file system supports checkpoints, then filestore_queue_max_ops + filestore_queue_committing_max_ops is used as the maximum depth of the FileStore queue, i.e. the maximum number of uncompleted IOs.
filestore_queue_committing_max_bytes: If the OSD's back-end file system supports checkpoints, then filestore_queue_max_bytes + filestore_queue_committing_max_bytes is used as the maximum depth of the FileStore queue, i.e. the maximum amount of uncompleted IO data.
Comments: After the FileStore receives IO submitted by the dispatch threads, processing is first gated by the FileStore queue depth; if the amount of uncompleted IO in the queue exceeds the configured threshold, the request is blocked. So increasing these values is a sensible choice.
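If you do raise the FileStore queue depth, the relevant knobs sit in the [osd] section; the values below are assumptions for illustration only:

[osd]
filestore_queue_max_ops = 500
filestore_queue_max_bytes = 268435456             # 256MB
filestore_queue_committing_max_ops = 500          # only effective when checkpoints are supported
filestore_queue_committing_max_bytes = 268435456  # 256MB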
journal_queue_max_ops: Controls the depth of the FileJournal queue; the maximum number of uncompleted journal IOs.
journal_queue_max_bytes: Controls the depth of the FileJournal queue; the maximum amount of uncompleted journal IO data.
Comments: IO submitted by the dispatch threads to the FileStore is also gated by the FileJournal queue; if the amount of uncompleted IO in that queue exceeds the configured threshold, the request is likewise blocked. In general, increasing these values is a good choice; in addition, placing the journal on a dedicated disk improves IO performance considerably.
journal_max_write_entries: Maximum number of entries the FileJournal can process in one asynchronous journal write.
journal_max_write_bytes: Maximum amount of data the FileJournal can process in one asynchronous journal write.
Comments: These two parameters control how much IO each asynchronous journal write can handle; set reasonable values based on the performance of the disk holding the journal.
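A sketch assuming the journal lives on a dedicated SSD partition; all numbers are illustrative:

[osd]
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1073741824      # 1GB
journal_max_write_entries = 1000
journal_max_write_bytes = 1073741824      # 1GB per asynchronous journal write, example value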
filestore_wbthrottle_enable: Defaults to true; controls write-back throttling of the OSD's back-end file system.
filestore_wbthrottle_*_bytes_start_flusher: Amount of dirty data at which the xfs/btrfs file system starts flushing.
filestore_wbthrottle_*_bytes_hard_limit: Maximum amount of dirty data allowed for the xfs/btrfs file system, used to throttle FileStore IO.
filestore_wbthrottle_*_ios_start_flusher: Number of IO requests at which the xfs/btrfs file system starts flushing.
filestore_wbthrottle_*_ios_hard_limit: Maximum number of uncompleted IO requests allowed for the xfs/btrfs file system, used to throttle FileStore IO.
filestore_wbthrottle_*_inodes_start_flusher: Number of objects at which the xfs/btrfs file system starts flushing.
filestore_wbthrottle_*_inodes_hard_limit: Maximum number of dirty objects allowed for the xfs/btrfs file system, used to throttle FileStore IO.
Comments: The *_start_flusher parameters define the dirty-data thresholds at which the xfs/btrfs file system starts flushing data from the page cache to disk, while the *_hard_limit parameters throttle FileStore IO by blocking the filestore op thread threads. So setting larger values will improve performance.
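Illustrative values for an xfs-backed OSD (the * in the names becomes xfs or btrfs depending on the backing file system; these numbers are assumptions, not defaults):

[osd]
filestore_wbthrottle_enable = true
filestore_wbthrottle_xfs_bytes_start_flusher = 419430400    # 400MB
filestore_wbthrottle_xfs_bytes_hard_limit = 4294967296      # 4GB
filestore_wbthrottle_xfs_ios_start_flusher = 5000
filestore_wbthrottle_xfs_ios_hard_limit = 50000
filestore_wbthrottle_xfs_inodes_start_flusher = 5000
filestore_wbthrottle_xfs_inodes_hard_limit = 50000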
filestore_fd_cache_size: Size of the object file handle cache.
filestore_fd_cache_shards: Number of shards in the object file handle cache.
Comments: Caching file handles speeds up file access; my personal suggestion is to cache all file handles, but remember to raise the system's open-file limit so you do not run out of handles.
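For example (illustrative values; remember to raise the system's open-file limit, e.g. via ulimit, to match):

[osd]
filestore_fd_cache_size = 32768    # example value
filestore_fd_cache_shards = 32     # example value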
filestore_fiemap: Turn on the sparse read/write feature.
Comments: Turning this on helps speed up cloning and recovery.
filestore_merge_threshold: Minimum number of files below which a PG subdirectory is merged.
filestore_split_multiple: PG subdirectory split multiplier; default 2.
Comments: These two parameters control the merging and splitting of PG directories. When the number of files in a directory falls below filestore_merge_threshold, the directory's object files are merged into its parent directory; when the number of files exceeds filestore_merge_threshold * 16 * filestore_split_multiple, the directory is split into subdirectories. Setting reasonable values speeds up indexing of object files.
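A worked example under assumed values: with filestore_merge_threshold = 40 and filestore_split_multiple = 8, a PG subdirectory is split once it holds more than 40 * 16 * 8 = 5120 object files:

[osd]
filestore_merge_threshold = 40   # example value
filestore_split_multiple = 8     # split threshold = 40 * 16 * 8 = 5120 files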
filestore_omap_header_cache_size: Size of the extended-attribute header cache.
Comments: Caches the _Header objects of an object's extended attributes, reducing accesses to the back-end LevelDB database and improving lookup performance.
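For instance (an assumed value, to be sized against the number of objects per OSD):

[osd]
filestore_omap_header_cache_size = 40960   # example value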
This post is intended purely as an introduction; combining it with personal experience, it offers a number of suggested settings for Ceph configuration parameters, and I hope it gives you some ideas. Ceph configuration is still fairly complex and is a systems-engineering exercise: each parameter needs to be set with the network conditions, CPU capability, disk performance, and other factors of every node in mind. Comments and discussion are welcome! :-)