Several new methods and ideas for Ceph performance optimization (takeaways from Shanghai Ceph Day)


A week ago, Shanghai Ceph Day was co-hosted by Intel and Red Hat. At this event a number of experts gave more than ten excellent talks. This article attempts to summarize the Ceph performance optimization knowledge and methods mentioned in those talks.

0. General Ceph Performance Optimization method

(1). Hardware level

    • Hardware planning: CPU, memory, network
    • SSD selection: use SSDs as journal storage
    • BIOS settings: turn on Hyper-Threading (HT), turn off power saving, turn off NUMA, etc.

(2). Software level

    • Linux OS: MTU, read_ahead, etc.
    • Ceph configuration and PG count adjustment: calculated with the PG formula Total PGs = (Total_number_of_OSDs x 100) / max_replication_count, then rounded up to the nearest power of two (a small calculation sketch follows this list).
    • CRUSH Map
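
As a rough illustration of the PG formula above, here is a minimal Python sketch; the per-OSD target of 100 PGs and the power-of-two rounding follow the commonly cited upstream guidance, and the 40-OSD example values are hypothetical:

```python
import math

def total_pgs(num_osds, replication_count, pgs_per_osd=100):
    """Estimate the total PG count for a pool using the common rule of thumb."""
    raw = (num_osds * pgs_per_osd) / replication_count
    # Round up to the nearest power of two, as the usual guidance suggests.
    return 2 ** math.ceil(math.log2(raw))

# Example: 40 OSDs with 3-way replication -> raw ~1333 -> 2048 PGs
print(total_pgs(40, 3))
```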

For more information, refer to the following articles:

    • Ceph Performance Optimization Summary (v0.94)
    • Measure Ceph RBD performance in a quantitative way
    • Ceph Performance Tuning -- journal and tcmalloc
    • Ceph Benchmarks
    • The official Ceph cuttlefish VS bobtail part 1: introduction and RADOS bench
1. Using a tiered cache (cache tiering)

This is obviously not a new Ceph feature, but at the meeting experts in this area described in detail its principle and usage, as well as how it is combined with erasure coding.

Brief summary:

    • Each cache tier uses one RADOS pool; the cache pool must be of the replicated type, while the backing pool can be either replicated or erasure coded.
    • Different tiers use different hardware media, and the cache pool must use media faster than that of the backing pool: for example, conventional HDDs or SATA SSDs in the backing pool, and fast media such as PCIe SSDs in the cache pool.
    • Each tier uses its own CRUSH rules, so that data is written to the designated storage media.
    • librados supports cache tiering internally, and in most cases it knows which tier a client's data should be placed in, so no changes are needed in the RBD, CephFS, or RGW clients.
    • The OSDs independently handle the data flow between the two tiers: promotion (HDD->SSD) and eviction (SSD->HDD); this movement is expensive and takes a long "warm up". A minimal setup sketch follows this list.
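
As a minimal sketch of how such a tier is typically wired up with the standard `ceph osd tier` commands; the pool names, PG counts, and the presence of a reachable cluster with admin credentials are assumptions:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and fail loudly if it errors."""
    subprocess.run(["ceph", *args], check=True)

# Hypothetical pool names: 'ecpool' is the backing pool, 'hotpool' is the cache pool.
ceph("osd", "pool", "create", "hotpool", "128", "128", "replicated")
ceph("osd", "tier", "add", "ecpool", "hotpool")            # attach the cache tier
ceph("osd", "tier", "cache-mode", "hotpool", "writeback")  # absorb writes in the cache
ceph("osd", "tier", "set-overlay", "ecpool", "hotpool")    # redirect client I/O to the cache
ceph("osd", "pool", "set", "hotpool", "hit_set_type", "bloom")  # track hits for promotion/eviction
```
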
2. Using better SSDs: Intel NVM Express (NVMe) SSDs

In Ceph clusters, SSDs are often used as journal and cache media to improve cluster performance. In the results presented, a cluster using SSDs as journals is 1.5 times faster than an all-HDD cluster for 64K sequential writes, and 32 times faster for 4K random writes.

Keeping the journal SSD separate from the OSD data SSD, rather than sharing a single SSD for both, also improves performance. With both placed on the same SATA SSD, compared with using two SSDs (journal on a PCIe SSD, OSD data on a SATA SSD), the 64K sequential write speed drops by 40% and the 4K random write speed drops by 13%.
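
For reference, a minimal fio sketch of the two workloads quoted above (64K sequential write and 4K random write); the target file, size, queue depth, and runtime are hypothetical, and fio with the libaio engine is assumed to be available:

```python
import subprocess

def run_fio(name, rw, bs, target="/var/lib/ceph/fio-testfile"):
    """Run one fio job with direct I/O; all sizes and depths are illustrative only."""
    subprocess.run([
        "fio",
        f"--name={name}", f"--filename={target}",
        f"--rw={rw}", f"--bs={bs}",
        "--size=4G", "--direct=1",
        "--ioengine=libaio", "--iodepth=16",
        "--runtime=60", "--time_based", "--group_reporting",
    ], check=True)

run_fio("seq-write-64k", "write", "64k")      # 64K sequential writes
run_fio("rand-write-4k", "randwrite", "4k")   # 4K random writes
```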

Naturally, then, more advanced SSDs improve Ceph cluster performance further. SSD media (flash) has gone through roughly three generations, each more advanced than the last, namely higher density (larger capacity) and faster reads and writes. The most advanced at present is the Intel NVMe SSD, whose features include:

    • A standardized software interface designed for PCIe drives
    • Designed specifically for SSDs, rather than inherited from interfaces built for other device types
    • The SSD-journal : HDD-OSD ratio can be raised from the usual 1:5 to 1:20 (see the sizing sketch after this list)
    • For all-SSD clusters, an all-NVMe-SSD Ceph cluster naturally performs best, but it is costly and performance is often limited by NIC/network bandwidth; in an all-SSD environment, the recommended configuration is therefore NVMe SSDs for journals and regular SSDs for OSD data disks.
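
As a rough sizing sketch for the journal:OSD ratio mentioned above; the journal-size rule of thumb (twice the expected throughput times the filestore max sync interval) follows the upstream FileStore guidance, and all concrete numbers below are hypothetical:

```python
import math

def journal_ssds_needed(num_hdd_osds, osds_per_journal_ssd=20):
    """How many journal SSDs a given number of HDD OSDs needs at a given ratio."""
    return math.ceil(num_hdd_osds / osds_per_journal_ssd)

def journal_partition_size_mb(expected_throughput_mb_s=500, filestore_max_sync_interval_s=5):
    """Rule of thumb: journal size ~= 2 * expected throughput * max sync interval."""
    return 2 * expected_throughput_mb_s * filestore_max_sync_interval_s

print(journal_ssds_needed(60))         # 60 HDD OSDs at 1:20 -> 3 NVMe journal SSDs
print(journal_partition_size_mb())     # e.g. 5000 MB per OSD journal partition
```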

Intel SSDs can also be used together with Intel Cache Acceleration Software (CAS), which intelligently places data on SSD or HDD according to the data's characteristics:

Test:

    • Test configuration: an Intel NVMe SSD as the cache, using Intel CAS Linux 3.0 with the hinting feature (to be released by the end of this year)
    • Test result: with a 5% cache, throughput doubled and latency was cut in half
3. Using better network devices: Mellanox NICs and switches

3.1 Higher-bandwidth, lower-latency NICs and switches

Mellanox is an Israel-based company with around 1,900 employees worldwide, focused on high-end networking equipment, with 2014 revenue of $463.6M. (Incidentally, a post on the Shuimu BBS says its China branch treats employees quite well.) Its main ideas and products:

    • Ceph's scale-out design requires higher network throughput and lower latency for replication, data sharing, and metadata (file) traffic
    • At present, 10GbE (10-Gigabit Ethernet) cannot meet the needs of a high-performance Ceph cluster (roughly, any cluster with more than 20 SSDs exceeds it), and the industry has begun moving fully into the 40GbE era; 25GbE currently offers a relatively good price/performance ratio
    • Most network equipment companies use Qualcomm chips, while Mellanox uses chips developed in-house; its latency (220 ns) is the lowest in the industry
    • A high-speed Ceph cluster requires two networks: a public network for client access, and a cluster network for heartbeat, replication, recovery, and re-balancing (see the configuration sketch after this list)
    • SSDs are now widely used in Ceph clusters, and fast storage devices require fast network equipment
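
A minimal sketch of the two-network split in ceph.conf; the `public network` and `cluster network` options are the standard Ceph settings, while the subnets and the output path here are hypothetical:

```python
from textwrap import dedent

# Hypothetical subnets: clients on 10.0.0.0/24, OSD replication/heartbeat on 10.0.1.0/24.
ceph_conf_fragment = dedent("""\
    [global]
    public network  = 10.0.0.0/24
    cluster network = 10.0.1.0/24
""")

# Merge this fragment into /etc/ceph/ceph.conf on every node, then restart the daemons.
with open("ceph.conf.fragment", "w") as f:
    f.write(ceph_conf_fragment)
```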

Actual test:

(1) Test environment: the cluster network uses a 40GbE switch, while the public network uses 10GbE and 40GbE devices respectively for comparison

(2) Test results: the cluster using 40GbE devices delivers 2.5 times the throughput of the 10GbE cluster, and IOPS increases by 15%.

Some companies have already used Mellanox network equipment to build all-SSD Ceph servers. For example, SanDisk's InfiniFlash uses Mellanox 40GbE NICs and two Dell R720 servers as OSD nodes, with SSD capacity in the terabyte range, and reaches a total throughput of 71.6 GB/s; Fujitsu and Monash University are also users.

3.2 RDMA Technology

Traditionally, accessing hard disk storage takes tens of milliseconds, while the network and protocol stack take hundreds of microseconds. In that era, 1Gb/s network bandwidth was typical, with the SCSI protocol used to access local storage and iSCSI to access remote storage. With SSDs, local storage access latency has dropped dramatically to hundreds of microseconds, so if the network and protocol stack do not improve, they become the performance bottleneck. That means the network needs higher bandwidth, such as 40Gb/s or even 100Gb/s, and while iSCSI can still be used to access remote storage, TCP is no longer sufficient; this is where RDMA comes in. RDMA stands for Remote Direct Memory Access, and it was created to address the latency of server-side data processing in network transfers. It transfers data over the network directly into a computer's memory area, quickly moving data from one system into remote system memory without involving the operating system, so it consumes very little of the computer's processing capacity. It eliminates external memory copies and context-switch operations, freeing up memory bus bandwidth and CPU cycles to improve application performance. By contrast, the conventional approach requires the system to parse and mark incoming data before storing it in the correct memory area.

Mellanox is the industry leader in this technology. Through kernel bypass and protocol offload it provides high bandwidth, low CPU usage, and low latency. The company has implemented XioMessenger in Ceph, which lets Ceph messages travel over RDMA instead of TCP and thereby improves cluster performance; this is available in the Ceph Hammer release.

For more information, refer to:

http://www.mellanox.com/related-docs/solutions/ppt_ceph_mellanox_ceph_day.pdf

http://ir.mellanox.com/releasedetail.cfm?ReleaseID=919461

What is RDMA?

Mellanox benchmarks Ceph on 100Gb Ethernet

RDMA Baidu Encyclopedia

4. Using better software: Intel SPDK-related technologies

4.1 Mid-tier cache scheme

This scheme adds a cache layer between the client application and the Ceph cluster, which improves the client's access performance. Features of this layer:

    • Provides iSCSI/NVMf/NFS and other protocol support to Ceph clients;
    • Improves reliability by using two or more nodes;
    • Adds a cache to increase access speed;
    • Uses a write log to guarantee data consistency across multiple nodes;
    • Connects to the back-end Ceph cluster using RBD (see the sketch after this list).
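
A minimal sketch of how such a gateway node might attach to the back-end cluster over RBD, using the standard `rados` and `rbd` Python bindings; the pool name, image name, and conffile path are hypothetical:

```python
import rados
import rbd

# Connect to the back-end Ceph cluster (config path and credentials are assumptions).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")           # hypothetical pool name
    image = rbd.Image(ioctx, "gateway-volume")  # hypothetical image name
    try:
        data = image.read(0, 4096)              # read the first 4 KiB of the image
        print(len(data), "bytes read")
    finally:
        image.close()
        ioctx.close()
finally:
    cluster.shutdown()
```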

4.2 Using Intel DPDK and UNS technology

Intel uses these technologies to improve Ceph's iSCSI access performance by implementing, entirely in user space, the DPDK NIC driver, a TCP/IP protocol stack (UNS), an iSCSI target, and an NVMe driver. Benefits:

    • Compared to the Linux-IO Target (LIO), its CPU overhead is only 1/7.
    • The user-space NVMe driver consumes 90% less CPU than the kernel-space NVMe driver

A major feature of this scheme is its use of a user-space NIC driver. To avoid conflicts with the kernel NIC driver, the actual configuration uses SR-IOV to virtualize the physical NIC into virtual NICs, which are then assigned to applications such as the OSD. Because the stack is entirely user-space, dependence on the kernel version is avoided.

Intel currently offers DPDK, UNS, and the optimized storage stack as a reference solution, which requires signing a usage agreement with Intel; the user-space NVMe driver is open source.

4.3 CPU data storage acceleration: ISA-L technology

This code library (ISA-L) uses the new instruction sets of Intel E5-2600/2400 and Atom C2000 product-family CPUs to implement its algorithms, maximizing CPU utilization and greatly increasing data access speed, but it currently supports only single-core x64 Xeon and Atom CPUs. In the example presented, erasure-coding (EC) speed increased by dozens of times and overall cost fell by 25% to 30%.
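
A minimal sketch of creating an erasure-coded pool backed by the ISA-L plugin; the `isa` plugin name and the `ceph osd erasure-code-profile` commands are standard Ceph, while the profile name, k/m values, and PG counts are hypothetical:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and fail loudly if it errors."""
    subprocess.run(["ceph", *args], check=True)

# Hypothetical profile: 2 data chunks + 1 coding chunk, accelerated by ISA-L.
ceph("osd", "erasure-code-profile", "set", "isa-profile",
     "plugin=isa", "k=2", "m=1")

# Create an EC pool that uses the profile (PG counts are illustrative).
ceph("osd", "pool", "create", "ecpool-isa", "128", "128", "erasure", "isa-profile")
```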

5. Using systematic tools and methods: a roundup of Ceph performance testing and tuning tools

Several Ceph performance testing and tuning tools were also released at this meeting.

5.1 Intel CeTune

This Intel tool can be used to deploy, benchmark, analyze, and tune Ceph clusters, and it is now open source (the code is available online). Key features include:

    • Users can configure and use CeTune through its WebUI
    • Deployment module: deploys Ceph using the CeTune CLI or GUI
    • Benchmark module: supports qemu-rbd, fio-rbd, COSBench, and other performance tests
    • Analysis module: iostat, sar, interrupts, performance counters, and other analysis tools
    • Report view: supports configuration download and chart views
5.2 Common performance testing and tuning tools

Ceph software stack (with possible performance impact points and tuning benefits):

Summary of performance-visibility tools:

Summary of benchmarking tools:

Summary of tuning tools:

6. Comprehensive evaluation

Compared with traditional performance optimization methods, the approaches above each bring some innovation of their own:

    • Better hardware, including SSDs and network equipment, naturally brings better performance, but cost rises correspondingly and the size of the improvement varies, so application scenario, cost, and optimization effect must be weighed together;
    • Better software is mostly not open source and mostly still at the testing stage, some distance from production use, and it is tightly bound to Intel hardware;
    • More systematic approaches are what the broader community of Ceph professionals needs to study and use, so that performance problems can be located and resolved more efficiently in day-to-day work;
    • Intel invests heavily in Ceph, and customers with Ceph cluster performance problems can send the relevant data to them, and they will provide recommendations accordingly.

Note: All of the above is based on the material presented at the meeting and the information sent out afterwards. If any of this content is not appropriate to publish here, please contact me. Thanks again to Intel and Red Hat for hosting this event.
