To reprint this article, please credit the author and the original address: http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
I have recently been busy with Ceph storage optimization and testing and have gone through a variety of materials, but there does not seem to be a single article that explains the methodology, so I would like to summarize it here. Much of the content is not my own original work; it is a compilation. If you find any problems, please point them out so I can improve the article.
Optimization methodology
Anything you do should have a methodology behind it; as the saying goes, "give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime." Once the method is understood, every problem has a way to be solved. Summarizing and analyzing the publicly available material, the optimization of a distributed storage system comes down to the following points:
1. Hardware level
- Hardware planning
- SSD Selection
- BIOS setup
2. Software level
- Linux OS
- Ceph configurations
- PG Number Adjustment
- CRUSH Map
- Other factors
Hardware optimization
1. Hardware planning
The ceph-osd process consumes CPU resources while running, so it is common to bind each ceph-osd process to a dedicated CPU core. If you use EC (erasure coding) mode, you may need even more CPU resources.
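As a minimal illustration of core binding (my addition, not from the original article), the following sketch pins an already-running ceph-osd process to a single core with taskset; the OSD id, core number, and pgrep pattern are placeholders and may need adjusting for your environment.

# Hypothetical example: pin the osd.0 process to CPU core 4.
# Assumes the daemon was started as "ceph-osd -i 0 ...".
OSD_PID=$(pgrep -f "ceph-osd.*-i 0" | head -n 1)
taskset -cp 4 "$OSD_PID"

# Verify the resulting CPU affinity
taskset -cp "$OSD_PID"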
The ceph-mon process does not consume much CPU, so there is no need to reserve excessive CPU resources for it.
ceph-mds is quite CPU intensive, so it needs to be given more CPU resources.
ceph-mon and ceph-mds each require about 2 GB of memory; each ceph-osd process requires 1 GB of memory, and of course 2 GB is better.
A 10 GbE network is now basically essential for running Ceph; when planning the network, also try to separate the client (public) network from the cluster network.
2. SSD Selection
Hardware selection also directly determines the performance of the Ceph cluster. For cost reasons, a SATA SSD is generally chosen as the journal device; the Intel SSD DC S3500 series is basically the first choice in the scenarios seen today, with the 400 GB model reaching about 11000 IOPS for 4K random writes. If the budget allows, a PCIe SSD is recommended and will improve performance further. Note, however, that adding a journal does not bring the performance boost one might imagine, because data written to the journal still has to be flushed to the data disk before subsequent requests can proceed; it does, however, make a big difference to latency.
To determine whether your SSD is suitable as a journal device, you can refer to Sébastien Han's post "Ceph: How to Test if Your SSD Is Suitable as a Journal Device?", in which he also lists test results for common SSDs; judging from those results, the Intel S3500 performs best among SATA SSDs.
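As an illustration (my addition, not the exact commands from his post), that test method boils down to measuring small synchronous direct writes; a minimal fio sketch in that spirit, with the device path as a placeholder, could look like this:

# WARNING: writes directly to the raw device and destroys its data; use a spare disk.
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=ssd-journal-test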
3. BIOS settings
For a basic cloud platform, enabling both VT and HT is necessary. Hyper-Threading (HT) uses special hardware instructions to present two logical cores as if they were two physical chips, so that a single processor can exploit thread-level parallelism. It is therefore compatible with multi-threaded operating systems and software, reduces CPU idle time, and improves overall CPU utilization.
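A quick way (my addition) to confirm from the operating system that VT and HT are actually enabled:

# VT: look for the vmx (Intel) or svm (AMD) CPU flags
egrep -c '(vmx|svm)' /proc/cpuinfo

# HT: lscpu reports "Thread(s) per core"; 2 means Hyper-Threading is active
lscpu | grep -i 'thread(s) per core'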
Disabling CPU power-saving mode still yields a performance improvement, so set it firmly to performance mode. This can also be adjusted at the operating system level; the detailed procedure is described in the linked reference. Perhaps because it had already been adjusted in the BIOS, I could not find the corresponding settings on CentOS 6.6.
# Set the cpufreq governor to performance for every CPU that exposes one
for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    [ -f "$CPUFREQ" ] || continue
    echo -n performance > "$CPUFREQ"
done
Simply put, the idea of NUMA is to divide memory and CPUs into multiple zones, each called a node, and connect the nodes with a high-speed interconnect. CPU access to memory within its own node is faster than access to another node's memory. NUMA can affect ceph-osd in some situations. There are two solutions: one is to disable NUMA in the BIOS; the other is to use cgroups to bind each ceph-osd process to a CPU core and the memory of the same node. The second looks more troublesome, so in general you can simply disable NUMA at the system level when deploying. On CentOS, add numa=off to the kernel line in /etc/grub.conf to disable NUMA:
kernel /vmlinuz-2.6.32-504.12.2.el6.x86_64 ro root=UUID=870d47f8-0357-4a32-909f-74173a9f0633 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM biosdevname=0 numa=off
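After rebooting, you can check (my addition) whether the option took effect:

# Confirm numa=off is on the kernel command line
grep -o numa=off /proc/cmdline

# With NUMA disabled, only a single node should be reported (requires the numactl package)
numactl --hardware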
Software optimization
1. Linux OS
- Kernel pid max, the maximum process/thread ID value the system can use
echo 4194303 > /proc/sys/kernel/pid_max
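To make this persistent across reboots (my addition, following the same sysctl.conf pattern used elsewhere in this article):

echo "kernel.pid_max = 4194303" | tee -a /etc/sysctl.conf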
- Jumbo frames: the switch must support this feature, and it only takes effect after the system's NIC is configured accordingly
ifconfig eth0 mtu 9000
Permanent settings
"MTU=9000"|-a /etc/sysconfig/network-script/ifcfg-eth0/etc/init.d/networking restart
- Read_ahead improves disk read performance by prefetching data into memory. To view the default value:
cat /sys/block/sda/queue/read_ahead_kb
According to some public Ceph presentations, 8192 is the ideal value
echo"8192" > /sys/block/sda/queue/read_ahead_kb
- Swappiness controls how aggressively the system uses swap. I first saw this tuning in UnitedStack's public documents; the presumed reason for the adjustment is that using swap hurts system performance.
echo"vm.swappiness = 0"-a /etc/sysctl.conf
- I/O Scheduler: there is already plenty of material online about I/O scheduler tuning, so it will not be repeated here. In short, use noop for SSDs and deadline for SATA/SAS disks.
echo"deadline" > /sys/block/sd[x]/queue/schedulerecho"noop" > /sys/block/sd[x]/queue/scheduler
- Cgroup: there seems to be relatively little material on this topic. Yesterday, while communicating with the Ceph community, Jan Schermer said he was preparing to contribute some of the scripts used in his production environment; for now, he listed some of the reasons for using cgroups for isolation:
- Processes and threads do not migrate between different cores (better cache utilization)
- Reduce the impact of NUMA
- Less impact from the network and storage controllers
- Restricting the Linux scheduling domains by restricting cpusets (not sure how important this is, but it is best practice)
- If HT is enabled, the OSD may end up on thread 1 and a KVM guest on thread 2 of the same core, and the latency and performance of that core then depend on what the other thread is doing
The concrete implementation of this point still needs to be added!!!
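In the meantime, here is a minimal cpuset sketch of my own (these are not the scripts mentioned above; the cgroup mount point follows the CentOS 6 default, and the OSD id, cores, and NUMA node are placeholders):

# Create a cpuset cgroup for osd.0 and restrict it to cores 4-5 on NUMA node 0
mkdir -p /cgroup/cpuset/ceph-osd-0
echo 4-5 > /cgroup/cpuset/ceph-osd-0/cpuset.cpus
echo 0 > /cgroup/cpuset/ceph-osd-0/cpuset.mems

# Move the running osd.0 process into the cgroup
echo "$(pgrep -f 'ceph-osd.*-i 0' | head -n 1)" > /cgroup/cpuset/ceph-osd-0/tasks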
2. Ceph Configurations
[global]
| Name of parameter | Description | Default value | Recommended value |
| --- | --- | --- | --- |
| public network | client access network | | 192.168.100.0/24 |
| cluster network | cluster network | | 192.168.1.0/24 |
| max open files | if this option is set, Ceph sets the system's max open fds at startup | 0 | 131072 |
- To view the maximum number of open files in the system, use the command
cat /proc/sys/fs/file-max
[OSD]-Filestore
| Name of parameter | Description | Default value | Recommended value |
| --- | --- | --- | --- |
| filestore xattr use omap | use an object map for xattrs; needed for ext4 file systems, can also be used with XFS or btrfs | false | true |
| filestore max sync interval | maximum interval (seconds) for syncing from the journal to the data disk | 5 | 15 |
| filestore min sync interval | minimum interval (seconds) for syncing from the journal to the data disk | 0.1 | 10 |
| filestore queue max ops | maximum number of operations the data disk queue accepts | 500 | 25000 |
| filestore queue max bytes | maximum size of a data disk operation (bytes) | 100 << 20 | 10485760 |
| filestore queue committing max ops | number of operations the data disk can commit at once | 500 | 5000 |
| filestore queue committing max bytes | maximum number of bytes the data disk can commit at once (bytes) | 100 << 20 | 10485760000 |
| filestore op threads | number of concurrent file system operation threads | 2 | 32 |
- The main reason for enabling the omap option is that the ext4 file system only allows 4 KB for xattrs by default
- The filestore queue related parameters have little impact on performance; tuning them does not bring a fundamental improvement
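As a sketch of my own (not from the original article), options like these can be checked on a running OSD and changed without a restart; the osd id and value are illustrative:

# Show the current value via the admin socket (run on the host where osd.0 lives)
ceph daemon osd.0 config show | grep filestore_max_sync_interval

# Inject a new value into all OSDs at runtime (not persistent across restarts)
ceph tell osd.* injectargs '--filestore_max_sync_interval 15'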
[OSD]-Journal
| Name of parameter | Description | Default value | Recommended value |
| --- | --- | --- | --- |
| osd journal size | OSD journal size (MB) | 5120 | 20000 |
| journal max write bytes | maximum number of bytes the journal writes at once (bytes) | 10 << 20 | 1073714824 |
| journal max write entries | maximum number of entries the journal writes at once | 100 | 10000 |
| journal queue max ops | maximum number of operations in the journal queue at once | 500 | 50000 |
| journal queue max bytes | maximum number of bytes in the journal queue at once (bytes) | 10 << 20 | 10485760000 |
- Ceph OSD Daemons stop writes and synchronize the journal with the filesystem, allowing Ceph OSD Daemons to trim operations from the journal and reuse the space.
- In other words, the Ceph OSD process stops accepting writes while it flushes journaled data to the data disk.
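To see how the journal is behaving in practice, a quick check of my own (not from the original article) is to look at the journal-related performance counters exposed by the OSD admin socket; exact counter names vary by Ceph version:

# Dump performance counters for osd.0 and pick out the journal ones
ceph daemon osd.0 perf dump | grep journal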
[OSD]-OSD Config tuning
| Name of parameter | Description | Default value | Recommended value |
| --- | --- | --- | --- |
| osd max write size | maximum size of a single OSD write (MB) | 90 | 512 |
| osd client message size cap | maximum client data allowed in memory (bytes) | 524288000 | 2147483648 |
| osd deep scrub stride | number of bytes read at a time during deep scrub (bytes) | 524288 | 131072 |
| osd op threads | number of operation threads in the OSD process | 2 | 8 |
| osd disk threads | number of OSD threads for intensive operations such as recovery and scrubbing | 1 | 4 |
| osd map cache size | size of the OSD map cache (MB) | 500 | 1024 |
| osd map cache bl size | size of the in-memory OSD map cache in the OSD process (MB) | 50 | 128 |
| osd mount options xfs | mount options for the Ceph OSD xfs file system | rw,noatime,inode64 | rw,noexec,nodev,noatime,nodiratime,nobarrier |
- Increasing the OSD OP threads and Disk threads brings additional CPU overhead
[OSD]-recovery tuning
| Name of parameter | Description | Default value | Recommended value |
| --- | --- | --- | --- |
| osd recovery op priority | priority of recovery operations, value 1-63; a higher value consumes more resources | 10 | 4 |
| osd recovery max active | number of active recovery requests at the same time | 15 | 10 |
| osd max backfills | maximum number of backfills allowed on one OSD | 10 | 4 |
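In day-to-day operations these values can also be lowered temporarily on a live cluster so that recovery does not starve client I/O; a sketch of my own (values are illustrative and the change is not persistent across restarts):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'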
[client] - Client tuning
| Name of parameter | Description | Default value | Recommended value |
| --- | --- | --- | --- |
| rbd cache | enable the RBD cache | true | true |
| rbd cache size | RBD cache size (bytes) | 33554432 | 268435456 |
| rbd cache max dirty | maximum number of dirty bytes allowed when caching in write-back mode (bytes); if 0, write-through is used | 25165824 | 134217728 |
| rbd cache max dirty age | how long dirty data stays in the cache before being flushed to the storage disk (seconds) | 1 | 5 |
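As a convenience (my addition; the default ceph.conf path is assumed), the recommended values above can be appended to the client-side configuration like this:

cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 134217728
rbd cache max dirty age = 5
EOF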
- Turn off debug logging
3. PG Number Adjustment
The PG and PGP numbers must be adjusted according to the number of OSDs. The formula is as follows, but the final result should be rounded to the nearest power of 2.
Total PGs = (Total_number_of_OSD * 100) / max_replication_count
For example, with 15 OSDs and a replica count of 3, the formula gives 500; the nearest power of 2 is 512, so pg_num and pgp_num of the pool (volumes) should both be set to 512:
ceph osd pool set volumes pg_num 512
ceph osd pool set volumes pgp_num 512
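The result can be verified afterwards (my addition):

ceph osd pool get volumes pg_num
ceph osd pool get volumes pgp_num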
4. CRUSH Map
CRUSH is a very flexible mechanism; how to adjust the CRUSH map depends on the specific deployment environment and has to be analyzed case by case, so it will not be covered here.
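For reference, the general workflow for inspecting and editing the CRUSH map looks like the following sketch (my addition; the file names are arbitrary):

# Export the current CRUSH map and decompile it to text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt (buckets, rules) ...

# Recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin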
5. Impact of other factors
At this year's (2015) Ceph Day, Yun Jiexun shared a case in which cluster performance degraded because a single poorly performing disk was present in the cluster. ceph osd perf reports per-OSD disk latency and can also be used as an important monitoring indicator in daily operations. In the example below it is obvious that the disk behind OSD 8 has much higher latency, so you should consider taking that OSD out of the cluster:
ceph osd perf
osd   fs_commit_latency(ms)   fs_apply_latency(ms)
  0                      14                     17
  1                      14                     16
  2                      10                     11
  3                       4                      5
  4                      13                     15
  5                      17                     20
  6                      15                     18
  7                      14                     16
  8                     299                    329
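If the disk is confirmed to be bad, the problematic OSD can be removed from the cluster; a sketch of my own using the standard commands (osd id 8 follows the example above):

# Mark the OSD out so data rebalances away from it
ceph osd out 8

# After stopping the ceph-osd daemon on its host, remove it completely
ceph osd crush remove osd.8
ceph auth del osd.8
ceph osd rm 8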
Ceph.conf
[global]
fsid = 059f27e8-a23f-4587-9033-3e3679d03b31
mon_host = 10.10..102, 10.10..101, 10.10..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
public network = 10.10..0/24
cluster network = 10.10..0/24
max open files = 131072

[mon]
mon data = /var/lib/ceph/mon/ceph-$id

[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd mkfs type = xfs
osd mkfs options xfs = -f

filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000

journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000

osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4

[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 134217728
rbd cache max dirty age = 5
Summary
Optimization is a long-term, iterative process. All the methods above come from others; only through practice can you find what suits your own environment. This article is only a beginning; you are welcome to actively contribute so that together we can turn it into a practical guide.
Ceph Performance Optimization Summary (v0.94)