Configuration parameter tuning for Ceph performance optimization

This article was also published on the Shanda Games G-Cloud WeChat official account; it is reposted here for easy reference.

Ceph — I believe many IT friends have heard of it. Riding on the popularity of OpenStack, Ceph has become hotter and hotter. However, using Ceph well is not easy; in QQ groups we often hear beginners complain that Ceph's performance is too poor to be usable. Is that really the case? If you run your cluster with Ceph's default configuration, the performance will naturally not be satisfying. As the saying goes, jade that is not cut does not become a vessel; Ceph too has its temperament, and a well-configured, well-tuned Ceph performs nicely. Below, Shanda Games G-Cloud shares some practical experience with Ceph tuning; if there are mistakes, corrections are welcome.

The Ceph configuration parameters below are taken from the Ceph Hammer 0.94.1 release.

Ceph Configuration parameter optimization

First, an overview of the interfaces between the Ceph client and the server:

Ceph is a unified, scalable distributed storage system that provides three kinds of access interfaces: object, block, and file system. All of them interact with the back end through the underlying librados. The OSD is Ceph's object storage unit and implements the data storage function. Internally it contains many modules; the modules exchange messages through queues and cooperate to complete the processing of IO. Typical modules are the network module Messenger, the data handling module FileStore, the journal handling module FileJournal, and so on.

Facing so many modules, Ceph also provides a wealth of configuration options; a rough count puts the number of configuration parameters in the thousands, so you can imagine how hard it is to configure them all well. Internally, G-Cloud mainly uses Ceph block storage, i.e. Ceph RBD, so the configuration parameter optimizations below are limited to the RBD client (librbd) and the OSD side.
Let's first look at client-side optimization.

RBD Client Configuration Optimizations

When Ceph is used as virtual machine block storage, Qemu interacts with the Ceph cluster through the client library librbd; the related configuration parameters mostly carry the rbd_ prefix. The current librbd configuration can be obtained with the following command:

# path/to/socket points to an OSD's admin socket file
#> ceph --admin-daemon {path/to/socket} config show | grep rbd

Some of these configuration parameters are described in more detail below:

    • rbd cache: Enables the client-side cache; enabled by default.
    • rbd cache size: Maximum cache size; default 32MB.
    • rbd cache max dirty: Maximum amount of dirty data in the cache, used to control write-back; cannot exceed rbd cache size; default 24MB.
    • rbd cache target dirty: The dirty data size at which write-back begins; cannot exceed rbd cache max dirty; default 16MB.
    • rbd cache max dirty age: The maximum time a single piece of dirty data may stay in the cache before being written back, so that dirty data that never reaches the size thresholds is not held indefinitely; default 1s.

Comments: Enabling the cache can significantly improve sequential read and write performance; the larger the cache, the better the performance. If a certain amount of data loss is acceptable, it is recommended to enable it.
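
As an illustration only, these cache parameters live in the [client] section of ceph.conf on the hypervisor; the values below are arbitrary examples showing the syntax, not tuned recommendations:

[client]
rbd cache = true
rbd cache size = 67108864          # 64MB (example)
rbd cache max dirty = 50331648     # 48MB, must stay below rbd cache size
rbd cache target dirty = 33554432  # 32MB, write-back starts here
rbd cache max dirty age = 2        # seconds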

    • rbd cache max dirty object: The maximum number of Objects in the cache; the default is 0, meaning the value is computed by librbd from rbd cache size. A disk image is logically striped into 4MB chunks, each chunk is abstracted as an Object, and librbd manages the cache in units of Objects. Increasing this value can improve performance.

Comments: This value comes out rather small in ceph-0.94.1; it is recommended to increase it according to the calculation formula used in the ceph-0.94.4 version, which is as follows:

obj = MIN(2000, MAX(10, cct->_conf->rbd_cache_size / 100 / sizeof(ObjectCacher::Object)));

When I configured it, I took sizeof(ObjectCacher::Object) = 128; 128 is my estimate of the size of the Object structure based on reading the code.
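
If you prefer not to rely on the automatic calculation, you can also set the value explicitly in ceph.conf and verify what librbd actually uses through the admin socket; the figure below is only an example, not a recommendation:

[client]
rbd cache max dirty object = 1024   # example value, overrides the auto-calculated default

#> ceph --admin-daemon {path/to/socket} config show | grep rbd_cache_max_dirty_object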

    • rbd cache writethrough until flush: Defaults to true. It exists for compatibility with virtio drivers prior to linux-2.6.32, to avoid data never being written back because the guest does not send flush requests. With this parameter set, librbd performs IO in writethrough mode until the first flush request is received, and only then switches to writeback mode.

Comments: If your Linux guests use a kernel older than 2.6.32, it is recommended to keep this set to true; otherwise you can simply turn it off.

    • rbd cache block writes upfront: Whether to enable synchronous IO; defaults to false. When enabled, librbd only completes a write after receiving the acknowledgement returned by the Ceph OSD.

Comments: With this enabled, performance is the worst, but it is the safest option.

    • rbd readahead trigger requests: The number of consecutive sequential requests that trigger read-ahead; default 10.
    • rbd readahead max bytes: The maximum size of a single read-ahead request; default 512KB; 0 turns read-ahead off.
    • rbd readahead disable after bytes: The maximum amount of data to read ahead; default 50MB. Once this threshold is exceeded, librbd turns read-ahead off and leaves it to the guest OS (to avoid duplicate caching); 0 means read-ahead is never turned off.

Comments: If the workload is mainly sequential read IO, it is recommended to enable read-ahead.

    • objecter inflight ops: Client-side flow control; the maximum number of outstanding IO requests allowed. Exceeding the threshold blocks application IO; 0 means unlimited.
    • objecter inflight op bytes: Client-side flow control; the maximum amount of outstanding IO data allowed. Exceeding the threshold blocks application IO; 0 means unlimited.

Comments: This provides simple client-side flow control to prevent network congestion. When the host network is a bottleneck, the rbd cache can fill up with large amounts of IO stuck in the "being sent" state, which in turn hurts IO performance. There is no need to modify these values normally, but if bandwidth is plentiful you can increase them as needed; see the sketch below.
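
By way of example, the read-ahead and flow-control parameters above are also client-side settings; a sketch of the corresponding ceph.conf entries, with purely illustrative values, might be:

[client]
rbd readahead trigger requests = 10           # consecutive sequential requests before read-ahead kicks in
rbd readahead max bytes = 524288              # 512KB per read-ahead request; 0 disables read-ahead
rbd readahead disable after bytes = 52428800  # stop read-ahead after 50MB; 0 means never stop
objecter inflight ops = 1024                  # example cap on outstanding requests; 0 = unlimited
objecter inflight op bytes = 104857600        # example cap of 100MB of outstanding data; 0 = unlimited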

    • rbd ssd cache: Whether the SSD disk cache is enabled; enabled by default.
    • rbd ssd cache size: Maximum cache size; default 10GB.
    • rbd ssd cache max dirty: Maximum amount of dirty data in the cache, used to control write-back; cannot exceed rbd ssd cache size; default 7.5GB.
    • rbd ssd cache target dirty: The dirty data size at which write-back begins; cannot exceed rbd ssd cache max dirty; default 5GB.
    • rbd ssd chunk order: Cache file chunk size; default 64KB = 2^16.
    • rbd ssd cache path: The path where the cache file is stored.

Comments: This is an SSD-backed RBD cache developed in-house by Shanda Games G-Cloud. The first four parameters are analogous to the rbd cache * parameters above. rbd ssd chunk order defines the chunk size of the cache file, which is the smallest allocation/reclamation unit of the cache file; the chunk size directly affects how efficiently the cache file is used. librbd also computes a chunk size dynamically based on the IO size and applies it to the cache file when appropriate.

The above is some of the client-side tuning experience Shanda Games G-Cloud has gathered while using Ceph RBD; corrections and additions are welcome. Let's continue with OSD tuning.

OSD Configuration Optimizations

The Ceph OSD side has a large number of configuration parameters, all of which are defined in the src/common/config_opts.h file; of course, you can also view the cluster's current configuration parameters with the command:

#> ceph --admin-daemon {path/to/socket} config show

Due to limited space, only a few common configuration parameters are analyzed below:

    • osd op threads: Number of threads handling requests such as peering.
    • osd disk threads: Number of threads handling background work such as snap trim, replica trim, and scrub.
    • filestore op threads: Number of FileStore IO threads.

Comments: More threads means more concurrency and better performance, but too many threads cause frequent thread switching, which also hurts performance. So when setting the thread counts, take into account the node's CPU capability, the number of OSDs on the node, and the type of storage media. Usually the first two parameters are set to small values and the last one to a larger value to speed up IO processing. The values can also be adjusted dynamically when anomalies such as peering occur; see the example below.
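
As an illustration (the values are examples, not recommendations), the thread counts go into the [osd] section of ceph.conf, and they can also be changed at runtime through injectargs, which is handy when peering or recovery needs a temporary boost:

[osd]
osd op threads = 4
osd disk threads = 1
filestore op threads = 8

#> ceph tell osd.* injectargs '--filestore_op_threads 8'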

    • filestore op thread timeout: The time after which a blocked IO thread triggers a warning.
    • filestore op thread suicide timeout: The IO thread suicide timeout; when a thread has not responded for this long, Ceph terminates the thread, causing the OSD process to exit.

Comments: When IO threads time out, you should use the relevant tools and commands (e.g. ceph osd perf) to analyze whether the OSD's disk is a bottleneck, whether the media has failed, and so on.
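
For reference, ceph osd perf reports per-OSD journal commit and apply latencies, which makes a slow disk stand out; the output below is only a mock-up of the general shape, not real data:

#> ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                     2                    3
  1                    41                   86

An OSD whose latencies sit far above its peers, like osd.1 in this mock-up, is a candidate for closer inspection of its disk.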

    • ms dispatch throttle bytes: Messenger flow control; limits the depth of the DispatchQueue; 0 means unlimited.

Comments: The Messenger sits at the very front of the OSD, so its performance directly affects IO processing speed. In the hammer version, an IO request returns as soon as it is added to OSD::op_shardedwq, while some other requests are added directly to the DispatchQueue; to prevent the Messenger from becoming a bottleneck, you can set this value a bit higher.

    • osd_op_num_shards: Default 5; the number of queues in OSD::op_shardedwq that store IO requests.
    • osd_op_num_threads_per_shard: Default 2; the number of IO dispatch threads allocated to each queue in OSD::op_shardedwq.

Comments: The total number of threads serving OSD::op_shardedwq is osd_op_num_shards * osd_op_num_threads_per_shard, i.e. 10 by default. IO requests enter through the Messenger, are added to OSD::op_shardedwq, and are handed by the dispatch threads above to the back-end FileStore for processing. Depending on the front-end network (e.g. 10Gbps) and the back-end media (e.g. SSD), you may consider raising these values, for instance as sketched below.
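
A hedged sketch of the corresponding [osd] entries, with purely illustrative values, covering both the Messenger throttle and the sharded work queue discussed above:

[osd]
ms dispatch throttle bytes = 1048576000   # ~1GB; 0 would mean unlimited
osd op num shards = 8
osd op num threads per shard = 2          # 8 * 2 = 16 dispatch threads in total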

    • filestore_queue_max_ops: Controls the depth of the FileStore queue; the maximum number of uncompleted IOs.
    • filestore_queue_max_bytes: Controls the depth of the FileStore queue; the maximum amount of uncompleted IO data.
    • filestore_queue_committing_max_ops: If the OSD's back-end file system supports checkpoints, then filestore_queue_max_ops + filestore_queue_committing_max_ops becomes the maximum depth of the FileStore queue, i.e. the maximum number of uncompleted IOs.
    • filestore_queue_committing_max_bytes: If the OSD's back-end file system supports checkpoints, then filestore_queue_max_bytes + filestore_queue_committing_max_bytes becomes the maximum depth of the FileStore queue, i.e. the maximum amount of uncompleted IO data.

Comments: After FileStore receives the IO submitted by the dispatch threads, processing is first constrained by the FileStore queue depth; if the uncompleted IO in the queue exceeds the configured threshold, the request is blocked. Still, blindly increasing these values is not a wise choice.

    • journal_queue_max_ops: Controls the depth of the FileJournal queue; the maximum number of uncompleted journal IOs.
    • journal_queue_max_bytes: Controls the depth of the FileJournal queue; the maximum amount of uncompleted journal IO data.

Comments: The IO submitted by the dispatch threads to FileStore is also constrained by the FileJournal queue; if the uncompleted IO in the queue exceeds the configured threshold, the request is likewise blocked. In general, increasing these values is a good choice; in addition, using a dedicated journal disk brings a large improvement in IO performance.

    • journal_max_write_entries: The maximum number of entries FileJournal can handle in a single asynchronous journal IO.
    • journal_max_write_bytes: The maximum amount of data FileJournal can handle in a single asynchronous journal IO.

Comments: These two parameters control how much the journal's asynchronous IO can process at a time; they should usually be set to reasonable values based on the performance of the disk holding the journal file. A sketch of the journal-related settings follows.
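
To illustrate where these journal settings live, here is a hedged [osd] sketch with example values only (the dedicated journal device itself is chosen at OSD deployment time rather than through these options):

[osd]
journal queue max ops = 3000
journal queue max bytes = 1073741824   # ~1GB of outstanding journal data
journal max write entries = 1000
journal max write bytes = 104857600    # up to 100MB per asynchronous journal write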

    • filestore_wbthrottle_enable: Default true; enables write-back throttling of the OSD's back-end file system.
    • filestore_wbthrottle_*_bytes_start_flusher: The amount of dirty data at which the xfs/btrfs file system starts flushing.
    • filestore_wbthrottle_*_bytes_hard_limit: The maximum amount of dirty data allowed on the xfs/btrfs file system; used to throttle FileStore IO.
    • filestore_wbthrottle_*_ios_start_flusher: The number of IO requests at which the xfs/btrfs file system starts flushing.
    • filestore_wbthrottle_*_ios_hard_limit: The maximum number of uncompleted IO requests allowed on the xfs/btrfs file system; used to throttle FileStore IO.
    • filestore_wbthrottle_*_inodes_start_flusher: The number of dirty objects at which the xfs/btrfs file system starts flushing.
    • filestore_wbthrottle_*_inodes_hard_limit: The maximum number of dirty objects allowed on the xfs/btrfs file system; used to throttle FileStore IO.

Comments: The *_start_flusher parameters define the dirty-data thresholds at which the xfs/btrfs file system starts flushing, writing data from the page cache to disk, while the *_hard_limit parameters throttle FileStore IO by blocking the filestore op thread threads. So setting larger values will improve performance, as sketched below.
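
For example, the xfs variants of these throttling parameters (the btrfs ones are analogous) could be raised along the following lines in the [osd] section; the numbers are placeholders, not tested values:

[osd]
filestore wbthrottle enable = true
filestore wbthrottle xfs bytes start flusher = 41943040   # 40MB
filestore wbthrottle xfs bytes hard limit = 419430400     # 400MB
filestore wbthrottle xfs ios start flusher = 500
filestore wbthrottle xfs ios hard limit = 5000
filestore wbthrottle xfs inodes start flusher = 500
filestore wbthrottle xfs inodes hard limit = 5000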

    • filestore_fd_cache_size: Size of the object file handle cache.
    • filestore_fd_cache_shards: Number of shards in the object file handle cache.

Comments: Caching file handles speeds up file access. Personally I suggest caching all file handles; of course, please remember to raise the system's open-file limit so that handles are not exhausted. A sketch follows.
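
A hedged sketch of raising the handle cache together with the process file-descriptor limit; the sizes are examples, and the max open files option (set in the [global] section) controls the limit applied when the daemons are started:

[global]
max open files = 131072            # raise the fd limit applied to the Ceph daemons

[osd]
filestore fd cache size = 32768    # example: cache up to 32768 object file handles
filestore fd cache shards = 32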

    • filestore_fiemap: Enables the sparse read/write feature.

Comments: Enabling this feature helps speed up clone and recovery operations.

    • filestore_merge_threshold: The minimum number of files below which PG subdirectories are merged.
    • filestore_split_multiple: The PG subdirectory split multiplier; default 2.

Comments: These two parameters control the merging and splitting of PG directories. When the number of files in a subdirectory is less than filestore_merge_threshold, the object files in that directory are merged into its parent directory; if the number of files exceeds filestore_merge_threshold*16*filestore_split_multiple, the directory is split into two subdirectories. Setting reasonable values speeds up the indexing of object files; a worked example follows.
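
As a worked example of the formula above, using hypothetical values rather than recommendations: with filestore_merge_threshold = 10 and filestore_split_multiple = 2, a PG subdirectory is split once it holds more than 10 * 16 * 2 = 320 object files, and is merged back into its parent when it drops below 10 files:

[osd]
filestore merge threshold = 10   # merge a subdirectory below 10 files
filestore split multiple = 2     # split above 10 * 16 * 2 = 320 files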

    • filestore_omap_header_cache_size: Size of the extended-attribute header cache.

Comments: Caches objects' extended-attribute _Header objects, reducing accesses to the back-end LevelDB database and improving lookup performance.

Drawing on personal experience, the above gives suggested settings for a number of Ceph configuration parameters, hopefully offering you some ideas. Ceph configuration is still fairly complex and is a piece of systems engineering: every parameter needs to be set with the network conditions, CPU capability, disk performance and other factors of each node in mind. Comments and discussion are welcome! :-)
