When planning a Ceph distributed storage cluster, hardware selection is very important because it directly affects the performance of the entire cluster. The following hardware selection guidelines are offered for reference:
1) CPU Selection
The Ceph metadata server dynamically redistributes its load, which makes it CPU-sensitive, so the metadata server should have solid processor performance (for example, a quad-core CPU). Ceph OSDs run the RADOS service, use CRUSH to calculate data placement, replicate data, and maintain their own copy of the cluster map, so OSDs also need reasonable processing power. Ceph monitors simply maintain the master copy of the cluster map, so they are not CPU-intensive.
2) RAM Selection
Metadata servers and monitors must be able to serve their data quickly, so they need sufficient memory (e.g., 1 GB of RAM per daemon instance). OSDs do not require much memory during normal operation (e.g., 500 MB of RAM per daemon instance), but they need considerably more during recovery (e.g., roughly 1 GB of RAM per 1 TB of storage per daemon). In general, more memory is better.
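As a rough illustration of those rules of thumb, the sketch below adds up a per-host RAM estimate; estimate_host_ram_gb is a hypothetical helper, and the per-daemon and per-terabyte figures are simply the example values quoted above, not fixed requirements.

```python
def estimate_host_ram_gb(num_osds, tb_per_osd, num_mons=0, num_mds=0):
    """Rough RAM estimate (GB) for one Ceph host, using the rules of thumb above:
    ~1 GB per monitor/MDS daemon and ~1 GB per 1 TB of OSD storage to cover recovery."""
    mon_mds_ram = 1.0 * (num_mons + num_mds)        # ~1 GB per monitor or metadata daemon
    osd_recovery_ram = 1.0 * num_osds * tb_per_osd  # ~1 GB per TB per OSD during recovery
    return mon_mds_ram + osd_recovery_ram

# Example: a host with 8 OSDs backed by 4 TB drives and one co-located monitor
print(estimate_host_ram_gb(num_osds=8, tb_per_osd=4, num_mons=1))  # -> 33.0 GB
```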
3) Data Storage Selection
Consider the tradeoff between cost and performance when planning data storage. Running the operating system and multiple daemons that read and write to a single drive at the same time can significantly degrade performance. There are also file system considerations: Btrfs can journal and write data in parallel, but it is not yet stable enough for production environments; XFS and EXT4 are the better choices.
Tip: Running multiple OSDs on partitions of a single disk is not recommended. Running an OSD together with a monitor or metadata server on partitions of a single disk is also not recommended.
Storage drives are subject to seek time, access time, read and write times, and total throughput limits. These physical limitations affect the performance of the entire system, especially during recovery. We recommend using a dedicated drive for the operating system and software, and assigning one drive to each OSD daemon running on the host. Most "slow OSD" problems arise from running the operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues can exceed the cost of an additional disk drive, you can simplify your cluster design and planning by avoiding the temptation to overload the OSD storage drives.
You can run multiple Ceph OSD daemons on the same hard drive, but this can lead to resource contention and lower overall throughput. You can also store journals and object data on the same drive, but this can increase the time it takes to journal a write and to acknowledge (ACK) the write to clients, because Ceph must write the operation to the journal before it can ACK the write.
Btrfs can write journal data and object data simultaneously, whereas XFS and EXT4 cannot. Ceph's recommended practice is to keep the operating system, OSD data, and OSD journals on separate drives.
4) Solid-State Drive Selection
One opportunity for performance improvement is to use solid-state drives (SSDs) to reduce random access time and read latency and to increase throughput. SSDs often cost more than 10 times as much per gigabyte as hard drives, but they can deliver access times at least 100 times faster.
SSDs have no moving mechanical parts, so they are not subject to the same limitations as hard drives. They do have notable limitations of their own, however, so it is important to consider their sustained sequential read and write performance. When storing multiple journals for multiple OSDs, an SSD with 400 MB/s sequential write throughput performs far better than a mechanical disk with 120 MB/s sequential write throughput.
Using SSDs for OSD object storage is expensive, but a significant performance boost may be achieved by storing the OSD journal on an SSD and the OSD object data on a separate hard drive. The OSD journal path defaults to /var/lib/ceph/osd/$cluster-$id/journal. You can mount this path on an SSD or an SSD partition so that journal files and data files live on separate disks.
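To make that sequential-write comparison concrete, here is a minimal back-of-the-envelope sketch; journals_per_ssd is a hypothetical helper, and the 400 MB/s and 120 MB/s values are just the example figures quoted above.

```python
def journals_per_ssd(ssd_seq_write_mb_s=400, hdd_seq_write_mb_s=120):
    """Estimate how many OSD journals one SSD can host before the SSD itself
    becomes the sequential-write bottleneck, using the example figures above."""
    return ssd_seq_write_mb_s // hdd_seq_write_mb_s

# With a 400 MB/s SSD and 120 MB/s data disks, about 3 journals fit per SSD.
print(journals_per_ssd())  # -> 3
```

In practice you would also leave headroom for journal size and SSD endurance, so treat a figure like this as an upper bound rather than a target.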
5) Network Selection
We recommend at least two gigabit NICs per machine. Most ordinary hard drives can deliver about 100 MB/s of throughput, and the NICs must be able to carry the combined throughput of the OSD disks, which is why at least two gigabit NICs are recommended: one for the public network and one for the cluster network (cluster_network). The cluster network (preferably not connected to the Internet) handles the additional load generated by data replication and helps prevent denial-of-service attacks that would interfere with placement groups, which cannot return to the active+clean state while OSD data is being replicated. Also consider deploying 10-gigabit network adapters: replicating 1 TB of data over a 1 Gbps network takes about 3 hours, and 3 TB (a typical drive configuration) takes about 9 hours; over 10 Gbps those replication times drop to roughly 20 minutes and 1 hour respectively.
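Those replication-time figures follow from simple arithmetic. The sketch below assumes the network link is the only bottleneck and ignores protocol and replication overhead, which is why its ideal results come out a little lower than the quoted 3-hour and 9-hour figures.

```python
def replication_hours(data_tb, link_gbps):
    """Hours to copy data_tb terabytes over a link_gbps network link,
    assuming the link is the only bottleneck (no protocol overhead)."""
    bits = data_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # bits / (bits per second)
    return seconds / 3600

for tb in (1, 3):
    print(f"{tb} TB over 1 Gbps: {replication_hours(tb, 1):.1f} h, "
          f"over 10 Gbps: {replication_hours(tb, 10) * 60:.0f} min")
```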
In a petabyte-scale cluster, OSD disk failures are the norm, not the exception, and system administrators want PGs to recover from the degraded state to the active+clean state as quickly as possible at a reasonable cost, so 10G NICs are worth considering. In addition, each top-of-rack router should communicate with the core routers at even higher throughput, for example 40 Gbps to 100 Gbps.
6) Other Considerations
You can run multiple OSD daemons per host, but you should ensure that the combined throughput of the OSD disks does not exceed the network bandwidth available for clients to read and write data. You should also consider what fraction of the cluster's data each host stores: if a particular host holds too large a percentage, its failure can cause problems, and Ceph will stop operations to prevent data loss.
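As a quick sanity check on that disk-versus-network balance, the hypothetical sketch below compares a host's aggregate OSD disk throughput with its NIC capacity, using the ~100 MB/s per-drive figure mentioned earlier.

```python
def network_is_bottleneck(num_osds, disk_mb_per_s=100, nic_gbps=1, nic_count=1):
    """Return True if the host's NICs cannot carry the aggregate OSD disk throughput.
    disk_mb_per_s defaults to the ~100 MB/s figure quoted for ordinary hard drives."""
    disk_throughput_mbit = num_osds * disk_mb_per_s * 8   # MB/s -> Mbit/s
    nic_throughput_mbit = nic_count * nic_gbps * 1000     # Gbit/s -> Mbit/s
    return disk_throughput_mbit > nic_throughput_mbit

# Example: 12 OSDs behind two 10 GbE NICs -> 9600 Mbit/s of disk vs 20000 Mbit/s of NIC
print(network_is_bottleneck(num_osds=12, nic_gbps=10, nic_count=2))  # -> False
```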
When running multiple OSD daemons per host, you also need to make sure the kernel is up to date. When many OSD daemons run on a single host (e.g., more than 20), they spawn a large number of threads, especially during recovery and rebalancing. Many Linux kernels default to a relatively small maximum thread count (e.g., 32768). If you run into this limit, consider setting kernel.pid_max to a higher value; the theoretical maximum is 4,194,303.
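For example, a minimal check of the current limit might look like the sketch below; it assumes a Linux host and only reads the value, since actually raising kernel.pid_max is done with sysctl or /etc/sysctl.conf as root.

```python
RECOMMENDED_PID_MAX = 4194303  # theoretical maximum mentioned above

def check_pid_max(path="/proc/sys/kernel/pid_max"):
    """Read the current kernel.pid_max and report whether it looks low for a dense OSD host."""
    with open(path) as f:
        current = int(f.read().strip())
    if current < RECOMMENDED_PID_MAX:
        print(f"kernel.pid_max is {current}; consider raising it, e.g. "
              f"'kernel.pid_max = {RECOMMENDED_PID_MAX}' in /etc/sysctl.conf")
    else:
        print(f"kernel.pid_max is {current}; no change needed")

if __name__ == "__main__":
    check_pid_max()
```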