Block Storage System
Distributed storage offers excellent performance, tolerates a large number of failures, and is easy to scale, so we use Ceph to build a high-performance, high-reliability block storage system that backs the cloud hosts of our public cloud and managed cloud, as well as the cloud drive service.
Because a distributed block storage system is used, copying images is no longer necessary, so cloud host creation time drops to under 10 seconds, and cloud hosts can be live-migrated quickly, which makes maintaining the hardware and software of the physical servers much easier.
The most direct user experience of the block storage system comes from the cloud drive service; our cloud drive features are now:
Each cloud drive supports up to 6000 IOPS and 95 MB/s of throughput, with 4K random write latency below 2 ms (see the fio example after this list).
All data is stored as three replicas with strong consistency, and durability reaches 10 nines.
Creation, deletion, attach, and detach are all second-level operations.
Real-time snapshots.
Two types of cloud drives are available: performance and capacity.
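For reference, the IOPS and latency figures above are what a standard fio 4K random-write test inside the guest reports; the device name and runtime below are examples, and the test overwrites the target device:

fio --name=4k-randwrite --filename=/dev/vdb --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based \
    --group_reporting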
Software and Hardware Configuration
After many rounds of selection and testing, and after stepping into countless pitfalls, we settled on suitable software and hardware.
Software
Hardware
From SATA disks to SSDs, in order to increase IOPS and reduce latency.
From consumer SSDs to enterprise-class SSDs, for increased reliability.
From RAID cards to HBA cards, in order to increase IOPS and reduce latency.
Minimal Deployment Architecture
As hardware and software were upgraded and requirements changed, our deployment architecture kept evolving, striving for the best balance between cost, performance, and reliability.
A minimum-scale deployment has 12 nodes, with 3 SSDs on each node. Each node has two 10GbE ports and one GbE port; the virtual machine network and the storage network use the 10GbE ports, and the management network uses the GbE port. Each cluster has 3 Ceph monitor nodes.
Easy Expansion
One of the benefits of cloud computing is near-unlimited scalability, so as the underlying architecture of the cloud, the block storage system also needs the ability to scale out quickly. In our deployment architecture, the system is expanded in units of 12 nodes.
Transform OpenStack
Native OpenStack does not support unified storage: the back-end storage of the cloud host service (Nova), the image service (Glance), and the cloud drive service (Cinder) are all different, which causes a lot of internal friction. We unified the back end of these three services to cut virtual machine creation time, avoid image storms, and let virtual machines migrate freely.
Native OpenStack
The transformed OpenStack
Creating a virtual machine takes 1-3 minutes with native OpenStack, but less than 10 seconds with the transformed OpenStack. This is because nova-compute no longer needs to download the entire image over HTTP; the virtual machine boots by reading the image data directly from Ceph.
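The speed-up comes from copy-on-write cloning inside Ceph: with a unified RBD back end, a VM's boot disk can be created as a clone of a protected snapshot of the Glance image instead of a full copy. Conceptually it looks like the following; the pool and image names are only examples:

# Snapshot the uploaded image once and protect the snapshot (done per image).
rbd snap create images/ubuntu-14.04@base
rbd snap protect images/ubuntu-14.04@base

# Each new VM disk is a copy-on-write clone of that snapshot: near-instant,
# no image data is copied at creation time.
rbd clone images/ubuntu-14.04@base vms/instance-0001_disk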
We also added two features that OpenStack did not have: QoS and shared cloud drives. Another benefit of cloud computing is tenant resource isolation, so QoS is necessary. A shared cloud drive can be attached to multiple cloud hosts at once, which is useful for data-processing scenarios.
We also use OpenStack's multi-backend feature to support multiple cloud drive types; we currently offer performance and capacity drive types, which suit database and large-file workloads respectively.
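A sketch of how two such drive types can be exposed through Cinder's multi-backend support; the backend names are placeholders for whatever is declared as volume_backend_name in cinder.conf:

# Create one volume type per backend and bind it to the backend name.
cinder type-create performance
cinder type-key performance set volume_backend_name=ceph-ssd

cinder type-create capacity
cinder type-key capacity set volume_backend_name=ceph-sata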
High Performance
The main performance metrics of a storage system are IOPS and latency. Our IOPS optimization has reached the hardware bottleneck; going further would require faster SSDs or flash cards, or a change of the whole architecture. Our latency optimization is also nearly done and reaches the level of enterprise-class storage.
Complex I/O stacks
The entire block storage system has a long I/O stack, with each I/O request traversing many threads and queues.
Optimizing the operating system
Optimizing the parameters of the operating system can take full advantage of hardware performance.
CPU
Memory
Turn off NUMA
Set vm.swappiness=0
Block
FileSystem
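A sketch of how the memory, block-layer, and filesystem tunings above are commonly applied; the device name, scheduler, and mount options are examples rather than our exact values:

# Memory: never swap OSD processes out.
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p

# Memory: disable NUMA, either in the BIOS or by adding numa=off to the
# kernel command line in the bootloader configuration.

# Block: use the deadline I/O scheduler on the OSD data disks (sdb is an example).
echo deadline > /sys/block/sdb/queue/scheduler

# FileSystem: mount the OSD filesystem with noatime to avoid extra metadata writes.
mount -o remount,noatime /var/lib/ceph/osd/ceph-0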
Optimizing QEMU
QEMU is the direct consumer of the block storage system, and there are plenty of places in it to optimize:
Throttle: smooth I/O QoS algorithm
RBD: support for discard and flush
Burst: support for burst requests
virtio-scsi: multi-queue support
Optimizing Ceph
Optimizing Ceph was the biggest part of the work; many problems only show up after running for a long time and at scale.
High Reliability
Storage needs high reliability to guarantee that data stays available and is never lost. Because our architecture uses neither UPS nor NVRAM, a write request is acknowledged only after the data has landed on three disks, which maximizes the safety of user data.
How to Calculate Durability
Durability is the probability that data is not lost; it measures the reliability of a storage system and is commonly expressed as "how many nines". Data placement determines durability, and in Ceph the CRUSH map determines data placement, so the CRUSH map settings determine durability. We know right away that we need to modify the CRUSH map settings, but how should we change them, and how do we calculate durability?
We need a calculation model and a formula, which can be constructed from the following sources:
The Ceph reliability model
"CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data"
"Copysets: Reducing the Frequency of Data Loss in Cloud Storage"
"An Introduction to Ceph's CRUSH Data Distribution Algorithm"
The final formula is: P = func(N, R, S, AFR)
P: the probability of losing all replicas
N: the number of OSDs in the entire Ceph pool
R: the number of replicas
S: the number of OSDs in a bucket
AFR: the annualized failure rate of a disk
How do we get from the model to the formula? It takes four steps:
First, calculate the probability that a hard disk fails.
Define the cases in which lost data cannot be recovered.
Calculate the probability that any R OSDs fail.
Calculate the probability that Ceph loses a PG.
The probability of a hard disk failing follows a Poisson distribution:
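In our notation, the Poisson model gives the probability that a disk with annualized failure rate AFR fails within a window of T hours as (this reconstruction of the formula is ours):

$$ P_{fail}(T) = 1 - e^{-\lambda T}, \qquad \lambda = \frac{AFR}{365 \times 24} $$

Taking T as one year gives the yearly failure probability of a single disk; taking T as the recovery time gives the failure probability during recovery.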
Each PG in Ceph has R replicas stored on R OSDs. When the R OSDs holding a PG are all down, the data is inaccessible; when those R OSDs are all damaged, the data is unrecoverable.
The probability that any R OSDs suffer a correlated failure within a year is calculated as follows:
Calculate the probability that one OSD fails within a year.
Calculate the probability that (R-1) more OSDs fail within the recovery time.
Multiplying these probabilities gives the probability that any R OSDs suffer a correlated failure within a year; call it Pr.
Among N OSDs, the number of combinations of any R OSDs is C(N, R).
Because these arbitrary R OSDs do not necessarily hold replicas of the same PG, a failure of any R OSDs does not necessarily make data unrecoverable, i.e. it does not necessarily cause data loss.
Each PG maps to a set of R OSDs, called a copy set, and several PGs may map to the same copy set. Suppose there are M distinct copy sets; M is a very important number.
Let us restate the definition of a copy set: a copy set holds all the replicas of at least one PG. When a whole copy set is destroyed, that PG loses all of its replicas and all the data on it is unrecoverable. So the event "Ceph loses data" is "Ceph loses a PG", and "Ceph loses a PG" means a copy set is destroyed. The probability of losing data is therefore P = Pr * M / C(N, R).
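Putting the four steps together, with P1 the one-year failure probability and P2 the failure probability within the recovery time (both from the Poisson formula above), and with combinatorial factors that reflect our reading of the model (the first failed OSD can be any of the N, and R-1 of the remaining OSDs must then fail during its recovery):

$$ P_r = \binom{N}{1} P_1 \cdot \binom{N-1}{R-1} P_2^{\,R-1}, \qquad P = P_r \cdot \frac{M}{\binom{N}{R}} = M \cdot R \cdot P_1 \cdot P_2^{\,R-1}, \qquad \text{durability} = 1 - P $$

The simplification on the right follows because N * C(N-1, R-1) / C(N, R) = R, and it matches the intuition that each of the M copy sets is lost when any one of its R members fails and the other R-1 fail during the recovery window.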
The durability formula is a quantitative tool that tells us where to direct our effort. Let's warm up first: what is the durability with the default settings?
Say we have 3 racks, 8 nodes per rack, and 3 hard disks per node with one OSD per disk, for a total of 72 OSDs.
The CRUSH map uses the default settings.
Plugging these defaults into the durability formula gives the following result.
With the default settings, durability is 8 nines, already higher than typical RAID5, RAID10, and RAID6. But it is not enough for a public cloud: at public-cloud scale the expected number of failure events becomes large, which forces us to maximize durability.
There are many ways to improve durability, such as increasing the number of replicas or using erasure coding, but each has drawbacks: more replicas inevitably raise the cost, and erasure coding increases latency, which makes it unsuitable for a block storage service. Under cost and latency constraints, what else can we do to improve durability?
We already have the quantitative formula P = Pr * M / C(N, R), so let's start from it to improve durability, i.e. to reduce P. To reduce P we must reduce Pr or M, or increase C(N, R). Since C(N, R) is already fixed, we can only reduce Pr and M.
Reduce recovery time
From the definition of Pr we know that it depends on the recovery time: the shorter the recovery time, the lower Pr. So what determines the recovery time?
To shorten the recovery time we need more OSDs to take part in data recovery. With the host bucket we cannot add more OSDs per failure domain, because of the network bandwidth limit and the number of drive slots per host. The solution starts from the CRUSH map: we add a virtual bucket type, osd-domain, and no longer use the host bucket.
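Custom bucket types such as osd-domain (and replica-domain below) are introduced by editing the CRUSH map offline with the standard workflow; the edit itself (declaring the new bucket type, grouping OSDs into osd-domain buckets, and adjusting the rule) is specific to each cluster:

# Export and decompile the current CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt: add the osd-domain bucket type, group OSDs into
# osd-domain buckets, and change the replicated rule to choose leaves
# of type osd-domain instead of host.

# Recompile and inject the new map (this triggers data migration).
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin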
By using the osd-domain bucket we improved durability tenfold; we now have 9 nines of durability.
Reduce the Number of Copy Sets
How do we reduce the number of copy sets? Copy sets are determined by the PG mapping, so we again start from the CRUSH map's rules and conditions. The solution is to add another virtual bucket type, replica-domain, and no longer use the rack bucket. Each PG must be placed within a single replica-domain and cannot span replica-domains, which significantly reduces the number of copy sets.
With replica-domain, durability reaches 10 nines, 100 times better than with the default CRUSH map settings.
Automated Operation and Maintenance
Operating Ceph is demanding, and when something goes wrong the whole cloud platform is affected, so we treat availability as the goal of operations:
Reduce unnecessary data migration, thereby reducing slow requests and guaranteeing the SLA.
Deployment
Our entire cloud platform is deployed with Puppet, so we also use Puppet to deploy Ceph. A typical Ceph installation proceeds in stages:
Install the Ceph monitor cluster.
Format the disks and register each OSD using the filesystem UUID, obtaining an OSD ID.
Create the data directory according to the OSD ID and mount the disk on it.
Initialize the CRUSH map.
Puppet only needs to complete the first three steps; the fourth is usually done by a script, depending on the circumstances. The OSD ID is only obtained at execution time, while Puppet compiles its catalog before executing, which is awkward, so the puppet-ceph module has to be designed to be retried.
Compared with the puppet-ceph modules from eNovance and StackForge, the advantages of our puppet-ceph module are:
Shorter deployment time
Supports all Ceph parameters
Supports multiple disk types
Uses the WWN ID instead of the drive letter (see the example below)
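Drive letters like /dev/sdb can change across reboots, while the WWN-based symlinks under /dev/disk/by-id/ are stable identifiers for the same physical disk; the WWN below is a placeholder:

# List the stable WWN names of the disks in this machine.
ls -l /dev/disk/by-id/ | grep wwn

# Refer to the disk by WWN ID instead of by drive letter when creating
# and mounting the OSD filesystem.
mkfs.xfs -f /dev/disk/by-id/wwn-0x50014ee0aaaaaaaa
mount -o noatime /dev/disk/by-id/wwn-0x50014ee0aaaaaaaa /var/lib/ceph/osd/ceph-12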
Maintenance
Upgrading Ceph is simple; three commands are enough:
ceph osd set noout        # avoid uncontrollable behaviour in abnormal situations
ceph osd down x           # mark the OSD down in advance to reduce slow requests
service ceph restart osd.x
When replacing hardware or upgrading the kernel you need to reboot the machine; the steps are also simple:
Migrate the virtual machines on this machine to other machines.
ceph osd set noout
ceph osd down x           # mark all the OSDs on this machine down
service ceph stop osd.x
Reboot the machine.
You need to be very careful when expanding the cluster, because it triggers data migration:
Set the CRUSH map
Set the recovery options (see the example after this list)
Trigger the data migration at midnight
Watch the data migration speed and the bandwidth of each machine's network ports, and avoid saturating them
Watch the number of slow requests
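One common way to keep recovery from starving client I/O during such a migration is to throttle backfill and recovery; the values below are examples to be tuned per cluster:

# Throttle recovery and backfill so data migration does not overwhelm client I/O.
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Watch migration progress, cluster health, and slow requests while data moves.
ceph -w
ceph health detail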
You will always run into broken disks. Be very careful when replacing a disk: set the CRUSH map carefully and make sure the weight of the replica-domain stays unchanged throughout the replacement, so that no unnecessary data migration is triggered.
Monitoring
Ceph's own Calamari is nice but not practical enough, and its deployment and packaging are not mature; it has some bugs on CentOS, so we keep using our original tools:
Collect: use Diamond, with new collectors added to gather more detailed data.
Store: use Graphite, with appropriate collection and retention precision.
Display: use Grafana; after trying more than ten tools, we found Grafana the best-looking and most useful.
Alert: Zabbix agent combined with ceph health.
Following Ceph's software architecture, we divide each OSD into a number of throttle layers, which together form a throttle model.
With the throttle model we can monitor each throttle: we add new collectors to Diamond to watch these throttles and redefine the metric names.
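The per-OSD throttle counters come from the OSD admin socket, which is what such a collector can poll; osd.0 and the jq filter below are examples:

# Dump the OSD performance counters and keep only the throttle-* sections.
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
    | jq 'with_entries(select(.key | startswith("throttle-")))'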
In the end we get monitoring data for every throttle layer of each OSD, although day to day we mostly watch IOPS, throughput, OSD journal latency, read request latency, capacity usage, and so on.
Incidents
In the nearly one year the cloud platform has been in production, we have run into incidents large and small:
SSD garbage collection can make read/write latency spike to hundreds of milliseconds.
A network failure can cause the monitors to mark OSDs down.
A Ceph bug can crash the OSD process outright.
An XFS bug can crash all the OSD processes in the cluster.
SSD damage.
Ceph PG inconsistency.
Network bandwidth saturated during Ceph data recovery.
In general, Ceph is very stable and reliable.