Ceph Performance Optimization Summary (v0.94)

Source: Internet
Author: User

If you want to reprint please indicate the author, original address: http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/

I've been busy lately with Ceph storage optimization and testing, and I've read a lot of material, but I haven't found an article that lays out the methodology, so I'd like to summarize it here. Much of this content is not original to me; it is a consolidation of other people's work. If you find any problems, please point them out so I can improve the article.

Optimization methodology

Anything worth doing deserves a methodology. As the saying goes, "give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime": once the method is clear, every problem has a path to a solution. Summarizing the publicly available material, optimizing a distributed storage system comes down to the following points:

1. Hardware level
    • Hardware planning
    • SSD Selection
    • BIOS setup
2. Software level
    • Linux OS
    • Ceph configurations
    • PG Number Adjustment
    • CRUSH Map
    • Other factors
Hardware optimization

1. Hardware planning
    • Processor

The ceph-osd process consumes CPU while running, so it is common practice to bind each ceph-osd process to its own CPU core. If you use erasure coding (EC), you may need more CPU resources.

The ceph-mon process does not consume much CPU, so there is no need to reserve excessive CPU resources for it.

The ceph-mds process, on the other hand, is quite CPU-intensive, so it should be given more CPU resources.

    • Memory

ceph-mon and ceph-mds each require about 2 GB of memory, and each ceph-osd process requires about 1 GB, though 2 GB is better.
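As a quick sanity check, the per-daemon figures above can be turned into a rough node or cluster total. A minimal sketch, where the daemon counts are made-up examples (not from this article) and 2 GB per OSD is used to be safe:

```shell
# Rough memory sizing from the guidance above: ~2 GB per mon/mds, 1-2 GB per OSD.
# The daemon counts here are illustrative assumptions.
mons=3; mdss=1; osds=15
gb_per_mon=2; gb_per_mds=2; gb_per_osd=2
total_gb=$(( mons*gb_per_mon + mdss*gb_per_mds + osds*gb_per_osd ))
echo "${total_gb} GB"   # 38 GB
```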

    • Network planning

A 10GbE network is now essentially a prerequisite for running Ceph. When planning the network, also try to separate the client (public) network from the cluster network.

2. SSD Selection

Hardware selection directly determines the performance ceiling of a Ceph cluster. For cost reasons, a SATA SSD is generally chosen as the journal device; the Intel SSD DC S3500 series is basically the first choice in the deployments seen today, with the 400 GB model reaching about 11,000 IOPS for 4K random writes. If the budget allows, a PCIe SSD is recommended and will improve performance further. Note, however, that adding a journal does not deliver the throughput boost you might imagine, because journaled data must still be written to the data disk, which blocks subsequent requests; it does, however, make a big difference to latency.

To determine whether your SSD is suitable as a journal device, refer to Sébastien Han's "Ceph: How to Test if Your SSD Is Suitable as a Journal Device?", in which he also lists test results for common SSDs; among SATA SSDs, the Intel S3500 performs best in those results.
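The gist of that test is small sequential synchronous writes at queue depth 1, which mimics journal writes. A minimal fio job file in that spirit might look like the following; the device path is a placeholder, and fio will WRITE TO THE RAW DEVICE directly, so only point it at a disk you can wipe:

```ini
; Sketch of a journal-suitability test (parameters after Sébastien Han's method)
[journal-test]
filename=/dev/sdX
direct=1
sync=1
rw=write
bs=4k
numjobs=1
iodepth=1
runtime=60
time_based=1
group_reporting=1
```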

3. BIOS settings
    • Hyper-threading (HT)

For a basic cloud platform, enabling VT and HT is a must. Hyper-Threading (HT) uses special hardware support to present each physical core as two logical cores, so that a single processor can exploit thread-level parallelism; this suits multi-threaded operating systems and software, reduces CPU idle time, and improves overall CPU utilization.

    • Power saving

Disabling power saving still yields a noticeable performance gain, so set the power profile firmly to performance mode. This can also be adjusted at the operating-system level; see the linked reference for the detailed procedure. (Perhaps because it had already been set in the BIOS, I could not find the corresponding setting on CentOS 6.6.)

    • NUMA

Simply put, the NUMA design divides memory and CPUs into multiple zones, each called a node, with the nodes connected by a high-speed interconnect. A CPU accesses memory within its own node faster than memory on another node, and NUMA can hurt ceph-osd in some cases. There are two workarounds: disable NUMA in the BIOS, or use cgroups to bind each ceph-osd process to a CPU core and memory on the same node. Since the second looks more troublesome, it is usually simplest to disable NUMA at the system level at deployment time. On CentOS, add numa=off to the kernel line in /etc/grub.conf:

kernel /vmlinuz-2.6.32-504.12.2.el6.x86_64 ro root=UUID=870d47f8-0357-4a32-909f-74173a9f0633 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM   biosdevname=0 numa=off
Software optimization

1. Linux OS
    • Kernel pid_max: raise the kernel's limit on process/thread IDs, since each OSD daemon spawns many threads
echo 4194303 > /proc/sys/kernel/pid_max
    • Jumbo frames; the switch side must support the feature, and it takes effect once the host NIC is configured
ifconfig eth0 mtu 9000

Permanent setting:

echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0
/etc/init.d/network restart
    • read_ahead improves disk reads by prefetching data into memory. View the default value:
cat /sys/block/sda/queue/read_ahead_kb

According to some public Ceph talks, 8192 is an ideal value:

echo "8192" > /sys/block/sda/queue/read_ahead_kb
    • swappiness controls how readily the system uses swap. This tuning first appeared in UnitedStack's public documents; the presumed rationale is that swapping degrades system performance
echo "vm.swappiness = 0" >> /etc/sysctl.conf
    • I/O scheduler: there is already plenty of material online about tuning the I/O scheduler, so it is not repeated here. In short, use noop for SSDs and deadline for SATA/SAS disks
echo "noop" > /sys/block/sd[x]/queue/scheduler
echo "deadline" > /sys/block/sd[x]/queue/scheduler
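Whether a disk is rotational can be read from sysfs (/sys/block/<dev>/queue/rotational, where 1 means a spinning disk), so the choice can be scripted. A small sketch; pick_scheduler is a made-up helper name, not a Ceph or kernel tool:

```shell
# Choose an I/O scheduler from the kernel's rotational flag (1 = spinning disk).
pick_scheduler() {
    if [ "$1" -eq 0 ]; then echo noop; else echo deadline; fi
}
pick_scheduler 0   # SSD: prints noop
pick_scheduler 1   # SATA/SAS: prints deadline
```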
    • Cgroups

There seems to be little written about this. In a conversation with the Ceph community yesterday, Jan Schermer said he is preparing to contribute some of the scripts he uses in production, and he listed several reasons for using cgroups for isolation:

  • Processes and threads do not migrate between cores (better cache utilization)
  • Reduced NUMA impact
  • Less interference from network and storage controllers
  • Restricting the Linux scheduling domain via cpuset (not sure this matters, but it is best practice)
  • With HT enabled, an OSD can end up on thread 1 and a KVM guest on thread 2 of the same core, and that core's latency and performance then depend on what the other thread is doing

The concrete implementation of this point remains to be filled in.

2. Ceph Configurations

[global]

Parameter        Description                                                 Default  Recommended
public network   client access network                                       -        -
cluster network  cluster (replication) network                               -        -
max open files   if set, Ceph raises the process's max open FDs at startup   0        131072
    • To view the system-wide limit on open files, use:
    cat /proc/sys/fs/file-max
[osd] - filestore tuning

Parameter                             Description                                                          Default    Recommended
filestore xattr use omap              use an object map for xattrs (needed on ext4; usable on XFS/btrfs)   false      true
filestore max sync interval           maximum interval between journal-to-data-disk syncs (seconds)       5          15
filestore min sync interval           minimum interval between journal-to-data-disk syncs (seconds)       0.1        10
filestore queue max ops               maximum number of operations the data-disk queue accepts            500        25000
filestore queue max bytes             maximum bytes of operations in the data-disk queue (bytes)          100 << 20  10485760
filestore queue committing max ops    maximum number of operations the data disk can commit at once       500        5000
filestore queue committing max bytes  maximum bytes the data disk can commit at once (bytes)              100 << 20  10485760000
filestore op threads                  number of concurrent filesystem operation threads                   2          32
    • The main reason for enabling omap is that the ext4 filesystem only allows about 4 KB of xattr data per inode by default
    • The filestore queue parameters have little impact on performance; tuning them does not bring much real optimization
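The "<< 20" entries in the default columns are bit-shift expressions from the Ceph source: shifting left by 20 multiplies by 2^20, i.e. converts MB to bytes. A shell can evaluate them directly:

```shell
# 100 << 20 bytes is 100 MB; 10 << 20 bytes is 10 MB.
echo $(( 100 << 20 ))   # 104857600
echo $(( 10 << 20 ))    # 10485760
```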
[osd] - journal tuning

Parameter                  Description                                           Default   Recommended
osd journal size           OSD journal size (MB)                                 5120      20000
journal max write bytes    maximum bytes written to the journal at once (bytes)  10 << 20  1073714824
journal max write entries  maximum entries written to the journal at once        100       10000
journal queue max ops      maximum number of operations in the journal queue     300       50000
journal queue max bytes    maximum bytes in the journal queue (bytes)            10 << 20  10485760000
    • From the documentation: "Ceph OSD Daemons stop writes and synchronize the journal with the filesystem, allowing Ceph OSD Daemons to trim operations from the journal and reuse the space."
    • In other words, the Ceph OSD process stops accepting writes while it flushes journaled data to the data disk.
[osd] - OSD config tuning

Parameter                    Description                                                      Default             Recommended
osd max write size           maximum size of a single OSD write (MB)                          90                  512
osd client message size cap  maximum client data held in memory (bytes)                       524288000           2147483648
osd deep scrub stride        bytes read per deep-scrub chunk (bytes)                          524288              131072
osd op threads               number of operation threads in the OSD process                   2                   8
osd disk threads             threads for intensive operations such as recovery and scrubbing  1                   4
osd map cache size           cache of OSD maps to keep (MB)                                   500                 1024
osd map cache bl size        OSD map cache kept in the OSD process memory (MB)                50                  128
osd mount options xfs        mount options for the OSD's XFS data disk                        rw,noatime,inode64  rw,noexec,nodev,noatime,nodiratime,nobarrier
    • Increasing osd op threads and osd disk threads brings extra CPU overhead
[osd] - recovery tuning

Parameter                 Description                                                               Default  Recommended
osd recovery op priority  priority of recovery operations, 1-63; higher values use more resources   10       4
osd recovery max active   number of recovery requests active at the same time                       15       10
osd max backfills         maximum number of concurrent backfills on an OSD                          10       4
[client] tuning

Parameter                Description                                                                          Default   Recommended
rbd cache                enable RBD caching                                                                   true      true
rbd cache size           RBD cache size (bytes)                                                               33554432  268435456
rbd cache max dirty      dirty-byte limit in write-back mode (bytes); 0 switches the cache to write-through   25165824  134217728
rbd cache max dirty age  how long dirty data may sit in the cache before being flushed (seconds)              1         5
    • Turn off debug logging

3. PG Number

The PG and PGP numbers must be adjusted according to the number of OSDs, using the formula below; the final result should then be rounded to the nearest power of 2.

Total PGs = (Total_number_of_OSD * 100) / max_replication_count

For example, with 15 OSDs and a replica count of 3, the formula yields 500; the closest power of 2 is 512, so pg_num and pgp_num for the pool (volumes) should both be set to 512.
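The example above can be checked with a few lines of shell; the loop simply finds the smallest power of 2 that is at least the formula's result:

```shell
# PG count for 15 OSDs with 3 replicas, rounded up to a power of 2.
osds=15; replicas=3
raw=$(( osds * 100 / replicas ))   # 500
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
echo "$pg"   # 512
```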

4. CRUSH Map

CRUSH is very flexible; tuning the CRUSH map depends on the specifics of the deployment and needs to be analyzed case by case, so it is not covered in detail here.

5. Impact of other factors

At this year's Ceph Day (2015), Yun Jiexun shared a case in which a single poorly performing disk dragged down the performance of the whole cluster. The ceph osd perf command reports per-disk latency and can serve as an important monitoring indicator during operations. In the example below, OSD 8's disk latency is clearly much higher than the others', so that OSD should be considered for removal from the cluster:

ceph osd perf
osd  fs_commit_latency(ms)  fs_apply_latency(ms)
  0                     14                    17
  1                     14                    16
  2                     10                    11
  3                      4                     5
  4                     13                    15
  5                     17                    20
  6                     15                    18
  7                     14                    16
  8                    299                   329
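A slow OSD can be picked out of that output mechanically. This sketch feeds a captured sample through awk; the 100 ms threshold is a made-up value for illustration:

```shell
# Flag OSDs whose fs_commit_latency exceeds a (made-up) 100 ms threshold.
sample='osd fs_commit_latency(ms) fs_apply_latency(ms)
0 14 17
7 14 16
8 299 329'
echo "$sample" | awk 'NR > 1 && $2 > 100 { print "osd." $1 " is slow: " $2 " ms" }'
# osd.8 is slow: 299 ms
```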
For reference, a complete ceph.conf assembled from the recommended values above:

[global]
fsid = 059f27e8-a23f-4587-9033-3e3679d03b31
mon_host = 10.10..102, 10.10..101, 10.10..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
public network = 10.10..0/24
cluster network = 10.10..0/24
max open files = 131072

[mon]
mon data = /var/lib/ceph/mon/ceph-$id

[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd mkfs type = xfs
osd mkfs options xfs = -f
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4

[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 134217728
rbd cache max dirty age = 5

Optimization is a long-term, iterative process. All the methods above come from others; only through practice can you find what works for you. This article is only a beginning; additions are welcome, so that together we can turn it into a useful guide.

Copyright notice: this is an original article by the blogger; do not reproduce without the blogger's permission.
