Ceph Performance Optimization Summary (v0.94)

Source: Internet
Author: User

If you want to reprint please indicate the author, original address: http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/

I've been busy lately with Ceph storage optimization and testing, and I've read a lot of material, but I haven't found an article that lays out the methodology, so I'd like to summarize it here. Much of this content is not original to me; it is a consolidation of other people's work. If you find any problems, please point them out so I can improve the article.

Optimization methodology

Anything worth doing deserves a methodology. As the saying goes, "give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime": once the method is clear, every problem has a path to a solution. Summarizing the publicly available material, optimizing a distributed storage system comes down to the following points:

1. Hardware level
    • Hardware planning
    • SSD Selection
    • BIOS setup
2. Software level
    • Linux OS
    • Ceph configurations
    • PG Number Adjustment
    • CRUSH Map
    • Other factors
Hardware optimization

1. Hardware planning
    • Processor

The ceph-osd process consumes CPU while running, so it is common practice to bind each ceph-osd process to its own CPU core. If you use erasure coding (EC), you may need more CPU resources.

The ceph-mon process does not consume much CPU, so there is no need to reserve excessive CPU resources for it.

The ceph-mds process, on the other hand, is quite CPU-intensive, so it should be given more CPU resources.

    • Memory

ceph-mon and ceph-mds each require about 2 GB of memory, and each ceph-osd process requires about 1 GB, though 2 GB is better.
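As a quick sanity check, the per-daemon figures above can be turned into a rough node or cluster total. A minimal sketch, where the daemon counts are made-up examples (not from this article) and 2 GB per OSD is used to be safe:

```shell
# Rough memory sizing from the guidance above: ~2 GB per mon/mds, 1-2 GB per OSD.
# The daemon counts here are illustrative assumptions.
mons=3; mdss=1; osds=15
gb_per_mon=2; gb_per_mds=2; gb_per_osd=2
total_gb=$(( mons*gb_per_mon + mdss*gb_per_mds + osds*gb_per_osd ))
echo "${total_gb} GB"   # 38 GB
```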

    • Network planning

A 10GbE network is now essentially a prerequisite for running Ceph. When planning the network, also try to separate the client (public) network from the cluster network.

2. SSD Selection

Hardware selection directly determines the performance ceiling of a Ceph cluster. For cost reasons, a SATA SSD is generally chosen as the journal device; the Intel SSD DC S3500 series is basically the first choice in the deployments seen today, with the 400 GB model reaching about 11,000 IOPS for 4K random writes. If the budget allows, a PCIe SSD is recommended and will improve performance further. Note, however, that adding a journal does not deliver the throughput boost you might imagine, because journaled data must still be written to the data disk, which blocks subsequent requests; it does, however, make a big difference to latency.

To determine whether your SSD is suitable as a journal device, refer to Sébastien Han's "Ceph: How to Test if Your SSD Is Suitable as a Journal Device?", in which he also lists test results for common SSDs; among SATA SSDs, the Intel S3500 performs best in those results.
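The gist of that test is small sequential synchronous writes at queue depth 1, which mimics journal writes. A minimal fio job file in that spirit might look like the following; the device path is a placeholder, and fio will WRITE TO THE RAW DEVICE directly, so only point it at a disk you can wipe:

```ini
; Sketch of a journal-suitability test (parameters after Sébastien Han's method)
[journal-test]
filename=/dev/sdX
direct=1
sync=1
rw=write
bs=4k
numjobs=1
iodepth=1
runtime=60
time_based=1
group_reporting=1
```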

3. BIOS settings
    • Hyper-threading (HT)

For a basic cloud platform, enabling VT and HT is a must. Hyper-Threading (HT) uses special hardware support to present each physical core as two logical cores, so that a single processor can exploit thread-level parallelism; this suits multi-threaded operating systems and software, reduces CPU idle time, and improves overall CPU utilization.

    • Power saving

Disabling power saving still yields a noticeable performance gain, so set the power profile firmly to performance mode. This can also be adjusted at the operating-system level; see the linked reference for the detailed procedure. (Perhaps because it had already been set in the BIOS, I could not find the corresponding setting on CentOS 6.6.)

    • NUMA

Simply put, the NUMA design divides memory and CPUs into multiple zones, each called a node, with the nodes connected by a high-speed interconnect. A CPU accesses memory within its own node faster than memory on another node, and NUMA can hurt ceph-osd in some cases. There are two workarounds: disable NUMA in the BIOS, or use cgroups to bind each ceph-osd process to a CPU core and memory on the same node. Since the second looks more troublesome, it is usually simplest to disable NUMA at the system level at deployment time. On CentOS, add numa=off to the kernel line in /etc/grub.conf:

kernel /vmlinuz-2.6.32-504.12.2.el6.x86_64 ro root=UUID=870d47f8-0357-4a32-909f-74173a9f0633 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM   biosdevname=0 numa=off
Software optimization

1. Linux OS
    • Kernel pid_max: raise the kernel's limit on process/thread IDs, since each OSD daemon spawns many threads
echo 4194303 > /proc/sys/kernel/pid_max
    • Jumbo frames; the switch side must support the feature, and it takes effect once the host NIC is configured
ifconfig eth0 mtu 9000

Permanent setting:

echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0
/etc/init.d/network restart
    • read_ahead improves disk reads by prefetching data into memory. View the default value:
cat /sys/block/sda/queue/read_ahead_kb

According to some public Ceph talks, 8192 is an ideal value:

echo "8192" > /sys/block/sda/queue/read_ahead_kb
    • swappiness controls how readily the system uses swap. This tuning first appeared in UnitedStack's public documents; the presumed rationale is that swapping degrades system performance
echo "vm.swappiness = 0" >> /etc/sysctl.conf
    • I/O scheduler: there is already plenty of material online about tuning the I/O scheduler, so it is not repeated here. In short, use noop for SSDs and deadline for SATA/SAS disks
echo "noop" > /sys/block/sd[x]/queue/scheduler
echo "deadline" > /sys/block/sd[x]/queue/scheduler
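Whether a disk is rotational can be read from sysfs (/sys/block/<dev>/queue/rotational, where 1 means a spinning disk), so the choice can be scripted. A small sketch; pick_scheduler is a made-up helper name, not a Ceph or kernel tool:

```shell
# Choose an I/O scheduler from the kernel's rotational flag (1 = spinning disk).
pick_scheduler() {
    if [ "$1" -eq 0 ]; then echo noop; else echo deadline; fi
}
pick_scheduler 0   # SSD: prints noop
pick_scheduler 1   # SATA/SAS: prints deadline
```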
    • Cgroups

There seems to be little written about this. In a conversation with the Ceph community yesterday, Jan Schermer said he is preparing to contribute some of the scripts he uses in production, and he listed several reasons for using cgroups for isolation:

  • Processes and threads do not migrate between cores (better cache utilization)
  • Reduced NUMA impact
  • Less interference from network and storage controllers
  • Restricting the Linux scheduling domain via cpuset (not sure this matters, but it is best practice)
  • With HT enabled, an OSD can end up on thread 1 and a KVM guest on thread 2 of the same core, and that core's latency and performance then depend on what the other thread is doing

The concrete implementation of this point remains to be filled in.

2. Ceph Configurations

[global]

Parameter        Description                                                 Default  Recommended
public network   client access network                                       -        -
cluster network  cluster (replication) network                               -        -
max open files   if set, Ceph raises the process's max open FDs at startup   0        131072
    • To view the system-wide limit on open files, use:
    cat /proc/sys/fs/file-max
[osd] - filestore tuning

Parameter                             Description                                                          Default    Recommended
filestore xattr use omap              use an object map for xattrs (needed on ext4; usable on XFS/btrfs)   false      true
filestore max sync interval           maximum interval between journal-to-data-disk syncs (seconds)       5          15
filestore min sync interval           minimum interval between journal-to-data-disk syncs (seconds)       0.1        10
filestore queue max ops               maximum number of operations the data-disk queue accepts            500        25000
filestore queue max bytes             maximum bytes of operations in the data-disk queue (bytes)          100 << 20  10485760
filestore queue committing max ops    maximum number of operations the data disk can commit at once       500        5000
filestore queue committing max bytes  maximum bytes the data disk can commit at once (bytes)              100 << 20  10485760000
filestore op threads                  number of concurrent filesystem operation threads                   2          32
    • The main reason for enabling omap is that the ext4 filesystem only allows about 4 KB of xattr data per inode by default
    • The filestore queue parameters have little impact on performance; tuning them does not bring much real optimization
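The "<< 20" entries in the default columns are bit-shift expressions from the Ceph source: shifting left by 20 multiplies by 2^20, i.e. converts MB to bytes. A shell can evaluate them directly:

```shell
# 100 << 20 bytes is 100 MB; 10 << 20 bytes is 10 MB.
echo $(( 100 << 20 ))   # 104857600
echo $(( 10 << 20 ))    # 10485760
```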
[osd] - journal tuning

Parameter                  Description                                           Default   Recommended
osd journal size           OSD journal size (MB)                                 5120      20000
journal max write bytes    maximum bytes written to the journal at once (bytes)  10 << 20  1073714824
journal max write entries  maximum entries written to the journal at once        100       10000
journal queue max ops      maximum number of operations in the journal queue     300       50000
journal queue max bytes    maximum bytes in the journal queue (bytes)            10 << 20  10485760000
    • From the documentation: "Ceph OSD Daemons stop writes and synchronize the journal with the filesystem, allowing Ceph OSD Daemons to trim operations from the journal and reuse the space."
    • In other words, the Ceph OSD process stops accepting writes while it flushes journaled data to the data disk.
[osd] - OSD config tuning

Parameter                    Description                                                      Default             Recommended
osd max write size           maximum size of a single OSD write (MB)                          90                  512
osd client message size cap  maximum client data held in memory (bytes)                       524288000           2147483648
osd deep scrub stride        bytes read per deep-scrub chunk (bytes)                          524288              131072
osd op threads               number of operation threads in the OSD process                   2                   8
osd disk threads             threads for intensive operations such as recovery and scrubbing  1                   4
osd map cache size           cache of OSD maps to keep (MB)                                   500                 1024
osd map cache bl size        OSD map cache kept in the OSD process memory (MB)                50                  128
osd mount options xfs        mount options for the OSD's XFS data disk                        rw,noatime,inode64  rw,noexec,nodev,noatime,nodiratime,nobarrier
    • Increasing osd op threads and osd disk threads brings extra CPU overhead
[osd] - recovery tuning

Parameter                 Description                                                               Default  Recommended
osd recovery op priority  priority of recovery operations, 1-63; higher values use more resources   10       4
osd recovery max active   number of recovery requests active at the same time                       15       10
osd max backfills         maximum number of concurrent backfills on an OSD                          10       4
[client] tuning

Parameter                Description                                                                          Default   Recommended
rbd cache                enable RBD caching                                                                   true      true
rbd cache size           RBD cache size (bytes)                                                               33554432  268435456
rbd cache max dirty      dirty-byte limit in write-back mode (bytes); 0 switches the cache to write-through   25165824  134217728
rbd cache max dirty age  how long dirty data may sit in the cache before being flushed (seconds)              1         5
    • Turn off debug logging

3. PG Number

The PG and PGP numbers must be adjusted according to the number of OSDs, using the formula below; the final result should then be rounded to the nearest power of 2.

Total PGs = (Total_number_of_OSD * 100) / max_replication_count

For example, with 15 OSDs and a replica count of 3, the formula yields 500; the closest power of 2 is 512, so pg_num and pgp_num for the pool (volumes) should both be set to 512.
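The example above can be checked with a few lines of shell; the loop simply finds the smallest power of 2 that is at least the formula's result:

```shell
# PG count for 15 OSDs with 3 replicas, rounded up to a power of 2.
osds=15; replicas=3
raw=$(( osds * 100 / replicas ))   # 500
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
echo "$pg"   # 512
```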

4. CRUSH Map

CRUSH is very flexible; tuning the CRUSH map depends on the specifics of the deployment and needs to be analyzed case by case, so it is not covered in detail here.

5. Impact of other factors

At this year's Ceph Day (2015), Yun Jiexun shared a case in which a single poorly performing disk dragged down the performance of the whole cluster. The ceph osd perf command reports per-disk latency and can serve as an important monitoring indicator during operations. In the example below, OSD 8's disk latency is clearly much higher than the others', so that OSD should be considered for removal from the cluster:

ceph osd perf
osd  fs_commit_latency(ms)  fs_apply_latency(ms)
  0                     14                    17
  1                     14                    16
  2                     10                    11
  3                      4                     5
  4                     13                    15
  5                     17                    20
  6                     15                    18
  7                     14                    16
  8                    299                   329
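A slow OSD can be picked out of that output mechanically. This sketch feeds a captured sample through awk; the 100 ms threshold is a made-up value for illustration:

```shell
# Flag OSDs whose fs_commit_latency exceeds a (made-up) 100 ms threshold.
sample='osd fs_commit_latency(ms) fs_apply_latency(ms)
0 14 17
7 14 16
8 299 329'
echo "$sample" | awk 'NR > 1 && $2 > 100 { print "osd." $1 " is slow: " $2 " ms" }'
# osd.8 is slow: 299 ms
```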
For reference, a complete ceph.conf assembled from the recommended values above:

[global]
fsid = 059f27e8-a23f-4587-9033-3e3679d03b31
mon_host = 10.10..102, 10.10..101, 10.10..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
public network = 10.10..0/24
cluster network = 10.10..0/24
max open files = 131072

[mon]
mon data = /var/lib/ceph/mon/ceph-$id

[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd mkfs type = xfs
osd mkfs options xfs = -f
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4

[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 134217728
rbd cache max dirty age = 5

Optimization is a long-term, iterative process. All the methods above come from others; only through practice can you find what works for you. This article is only a beginning; additions are welcome, so that together we can turn it into a useful guide.

Copyright notice: this is an original article by the blogger; do not reproduce without the blogger's permission.
