February 20 Report: As the kernel of Aliyun's cloud operating system, the Apsara (Feitian) large-scale distributed computing platform plays the pivotal connecting role. Apsara runs on clusters of commodity servers interconnected over the network; it hides the complexity and unreliability of massive amounts of hardware and provides trusted computing and storage capabilities to the other components of the cloud OS.
Apsara is itself a complex distributed system composed of several components, of which the following two subsystems are the core.
Computing resource scheduling system (also known as Fuxi): manages and schedules the cluster's computing resources, dynamically allocating them among multiple cloud services to meet users' computing needs; it automatically detects server failures and migrates the services running on failed servers.
Distributed file system (also known as Pangu): manages all the disks in the cluster, places data so as to balance performance and data safety, detects disk failures, and replicates data to keep it safe.
In building the cloud computing platform, our engineers faced many technical challenges, including:
providing highly reliable computing and storage on top of unreliable hardware;
providing highly available services;
operating and maintaining massive amounts of hardware at low cost;
letting online and offline applications coexist;
overcoming the bandwidth limitations between nodes;
making the fullest possible use of computing resources; and so on.
Among these, unreliable hardware is the most fundamental challenge. Once a cluster grows to thousands of machines, events that are rare on a single machine become inevitable and frequent. Downtime caused by failures of hard disks, disk controllers, CPUs, memory, motherboards, power supplies, and so on happens every day. We call this kind of hardware failure a "hard" fault (a fail-stop fault). There is also a class of faults whose symptoms are less obvious, which we call "soft" faults: for example, a disk that is accessible but runs at only 1/10 of its normal speed, a server that is not down but whose programs run slowly, or a network that is intermittently flaky. Such "soft" faults also hurt service quality: for an online service, slow execution can cause client timeouts, and for an offline job, even if only 1% of the data processing tasks are slow, they delay the completion of the entire analysis job.
Both hard and soft faults harm the reliability, and even the availability, of the system, so detecting them and recovering from them promptly and effectively is critical. The industry already has mature solutions for detecting hard faults, so the first part of this article focuses on detecting soft faults; the second part discusses recovery strategies; finally, we describe how to meet the low-latency requirements of online applications while still guaranteeing data reliability.
Soft fault detection in a cloud environment
There are two ways to detect "soft" faults.
One is to design a dedicated detection method for each specific fault. But a "soft" fault can have many causes: slow execution may stem from server hardware trouble, a network problem, a disk problem, an operating system or software defect, and so on. Detecting each cause individually would make the system far too complex.
The other is to detect faults from their macroscopic symptoms. Consider the two examples below.
Example one: detecting that jobs run exceptionally slowly on a particular server.
We record the execution time of each job's tasks on each server. Because the input data is sliced evenly, the execution times on different servers should be roughly the same. If the execution time on some server exceeds three times the average, it is marked as "slow." If many different jobs are "slow" on the same server, there is good reason to suspect that the server has a problem (even though the cause is not yet known). The scheduling system automatically blacklists such a server and stops using it to run jobs; the suspicious servers are then examined, automatically or manually, to find the specific cause of the failure.
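Below is a minimal sketch of this macroscopic detection idea. The three-times-the-average threshold comes from the description above; the data structures, the blacklist threshold, and all names are illustrative assumptions rather than Fuxi's actual implementation.

```python
from collections import defaultdict

SLOW_FACTOR = 3          # a task is "slow" if it takes more than 3x the job's average time
SLOW_JOB_THRESHOLD = 5   # blacklist a server once this many different jobs run slowly on it (assumed)

def find_slow_servers(job_task_times):
    """job_task_times: {job_id: {server_id: task_execution_seconds}}.
    Returns the servers on which many different jobs run slowly."""
    slow_jobs_per_server = defaultdict(set)
    for job_id, per_server in job_task_times.items():
        avg = sum(per_server.values()) / len(per_server)
        for server_id, seconds in per_server.items():
            if seconds > SLOW_FACTOR * avg:
                slow_jobs_per_server[server_id].add(job_id)
    # A server that is slow for many different jobs is suspicious and gets blacklisted.
    return {server for server, jobs in slow_jobs_per_server.items()
            if len(jobs) >= SLOW_JOB_THRESHOLD}
```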
Example two: detecting slow disk reads and writes.
The distributed file system likewise records the time of every disk access. If a large fraction of the accesses to a particular disk take far longer than the system-wide average, that disk is probably about to fail. At this point the file system does three things (see the sketch after this list):
it stops writing new data to the disk, so that no more data is put at risk;
it starts creating additional replicas of the data already on the disk;
once every piece of data on the disk has an extra replica, the disk can be taken offline for servicing.
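A minimal sketch of this example, assuming per-disk access-time statistics are available; the thresholds (what counts as "far beyond" the average and as "a large fraction") and the disk interface are illustrative assumptions.

```python
import statistics

SLOW_FACTOR = 10   # assumed: "far beyond" = 10x the system-wide average access time
SLOW_RATIO = 0.2   # assumed: fraction of slow accesses that marks a disk as suspect

def suspect_disks(access_times):
    """access_times: {disk_id: [access_latency_seconds, ...]}.
    Returns disks where a large fraction of accesses is far above the global average."""
    all_times = [t for times in access_times.values() for t in times]
    global_avg = statistics.mean(all_times)
    suspects = []
    for disk_id, times in access_times.items():
        slow = sum(1 for t in times if t > SLOW_FACTOR * global_avg)
        if slow / len(times) > SLOW_RATIO:
            suspects.append(disk_id)
    return suspects

def handle_suspect_disk(disk):
    """The three-step reaction described above (illustrative disk interface)."""
    disk.stop_accepting_writes()           # 1. no new data goes to the risky disk
    disk.schedule_extra_replicas()         # 2. add replicas for the data already on it
    if disk.all_data_has_extra_replica():  # 3. only then take the disk offline
        disk.take_offline()
```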
Strategies for automatic failure recovery
Once a failure has been detected, an automated and timely recovery mechanism is needed. An automatic recovery mechanism that is not designed carefully, however, can become a double-edged sword. Let us start with a serious incident at Amazon's cloud services.
The Amazon EC2 mass outage
On April 21, 2011, Amazon's virtual machine service EC2 suffered a large-scale outage that lasted more than two days and affected Reddit, Foursquare, Quora, and many other sites. Amazon later published a detailed analysis of the incident. The trigger was routine maintenance on the cluster network: all traffic was switched to the backup network, which then became overloaded. The automatic recovery mechanism detected the network failure, concluded that a large number of servers had gone down, and began replicating data to replace the copies on those "failed" servers, triggering a "re-mirroring storm" (a large number of servers trying to create data mirrors at the same time). The resulting surge in traffic made the network overload worse, the failure spread through the cluster, and the system entered a vicious circle. Service was finally restored after two days through a combination of measures, including temporarily shutting down the automatic recovery system and adding extra hardware.
In this case, the fault detection and recovery strategy was "re-replicate the data when a server holding a replica goes down." The strategy is effective for small-scale, common problems such as a single server failing, but it becomes counterproductive during a wide-scale failure such as a network overload. Had there been no automatic recovery mechanism at all, the impact of the original fault would not have spread so far.
In fact, this pattern has appeared repeatedly in large-scale distributed system failures: an unexpected small- or medium-scale failure
→ the automatic recovery mechanism takes the wrong action
→ the failure gets worse and the system enters a vicious circle.
The 2008 failure of Amazon's S3 storage service began with nothing more than a single bit error in the internal state of its fault detection mechanism, yet the fault quickly spread through the whole system and made the service unavailable even though no hardware had failed.
Given this, our strategy is to limit the scope of the automatic recovery mechanism:
Under normal circumstances, only a small proportion of the servers in the cluster are failing at any given time; automatic recovery is effective then, and even a wrong decision cannot cause a disaster.
If a (rare) large-scale failure occurs, the sensible strategy is to reduce the system load as much as possible, because maintaining quality of service through automatic recovery is essentially impossible at that point. If the recovery mechanism tries to do an amount of work that exceeds a preset limit, that part of its logic is temporarily disabled.
Take the slow-disk handling described above as an example: since the average daily disk failure rate is well below one in a thousand, we put an upper limit on the automatic disk-offlining mechanism, for example allowing at most 1% of all disks to be taken offline through it. This limit prevents the recovery mechanism from causing a disaster in extreme cases, such as a large number of disks developing problems at once or the offlining mechanism itself malfunctioning.
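A minimal sketch of such a capped mechanism. The 1% cap comes from the text; the class, function names, and alerting path are illustrative assumptions.

```python
AUTO_OFFLINE_CAP = 0.01  # at most 1% of all disks may be taken offline automatically

class DiskOfflineGovernor:
    """Caps how many disks the automatic mechanism may take offline."""
    def __init__(self, total_disks):
        self.total_disks = total_disks
        self.auto_offlined = 0

    def try_offline(self, disk_id):
        limit = int(self.total_disks * AUTO_OFFLINE_CAP)
        if self.auto_offlined >= limit:
            # Cap reached: stop acting automatically and ask a human to investigate.
            alert_operators(f"auto-offline cap reached, refusing to offline {disk_id}")
            return False
        self.auto_offlined += 1
        take_disk_offline(disk_id)
        return True

def alert_operators(message):    # placeholder for the real alerting path
    print("ALERT:", message)

def take_disk_offline(disk_id):  # placeholder for the real offlining call
    print("offlining", disk_id)
```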
Data reliability and latency optimization
In a cloud environment, hardware failures are a fact of life for a distributed system, which makes guaranteeing data reliability a challenge.
The hardware that fails most often in the day-to-day operation of the cloud computing platform is the hard disk. Disk failures account for 80% of all failures in Aliyun's data centers. One reason is that disks are the most numerous component: a 3,000-node cluster has more than 30,000 disks. Even if each disk had a mean time between failures (MTBF) of 1,000,000 hours, 30,000 disks would still mean a disk failure roughly every 33 hours. Actual operating data shows that the MTBF values quoted by disk manufacturers cannot be relied on; failure rates in a production environment can be several times to dozens of times the nominal value.
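The 33-hour figure follows directly from the numbers above, assuming failures are independent and spread evenly over time:

```python
mtbf_hours = 1_000_000   # nominal per-disk MTBF from the text
disk_count = 30_000      # disks in a 3,000-node cluster

# With this many disks, the fleet as a whole expects one failure roughly every
# MTBF / N hours.
hours_between_failures = mtbf_hours / disk_count
print(round(hours_between_failures, 1))  # 33.3
```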
Disk failures hit the Pangu distributed file system most directly. To keep data safe, Pangu stores multiple replicas of all data. When creating a file, the user can specify the number of replicas, and the file system guarantees that the replicas are distributed across different nodes and different racks, so that no single hardware failure can make the data inaccessible.
Multi-replica storage is widely recognized in the industry as an effective way to prevent data loss; write requests are usually passed along a pipeline of replicas to reduce the load on any single node. The downside is increased write latency, because a write cannot complete until every replica has been written successfully.
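The latency cost is easy to see in a sketch of chained (pipelined) replication, where the write succeeds only after the whole chain has stored the data. The interface below is illustrative, not Pangu's actual protocol:

```python
def pipelined_write(chunk, replica_chain):
    """Each replica stores the chunk and forwards it to the next replica in the
    chain, so every node sends the data only once; the write succeeds only when
    the whole chain has written it."""
    if not replica_chain:
        return True                       # end of chain: every replica upstream succeeded
    head, rest = replica_chain[0], replica_chain[1:]
    if not head.store(chunk):             # this replica writes its local copy
        return False
    return pipelined_write(chunk, rest)   # ...and forwards the chunk to the next replica
```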
Because of the characteristics of disk reads and writes, writing multiple replicas through to disk in this way typically takes tens of milliseconds, and sometimes as long as 100 milliseconds. Online applications in a cloud environment sometimes have stricter latency requirements. Pangu solves this with an in-memory log file (used as a redo log).
The basic idea of the in-memory log file rests on the following observation: a single server can easily lose the data in its memory because of a power failure or a crash (which is why, on a single machine, log files are written to disk), but the probability of many servers failing at the same time is low enough to meet the data reliability requirements. For applications with strict latency requirements, Pangu provides an interface through which a write is considered successful once the data has reached the memory of a specified number of servers; Pangu's background threads then write the in-memory data to disk in batches.
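A minimal sketch of this idea: acknowledge a write once enough data servers hold the record in memory, and let a background thread flush the buffers in batches. The replica count, flush period, and all names are illustrative assumptions, not Pangu's actual interface.

```python
import threading
import time

ACK_REPLICAS = 3      # assumed: a write is acknowledged once this many servers hold it in memory
FLUSH_INTERVAL = 1.0  # assumed background flush period (seconds)

class MemoryRedoLog:
    """Minimal sketch of the in-memory redo log idea described above."""

    def __init__(self, data_servers):
        self.data_servers = data_servers
        flusher = threading.Thread(target=self._flush_loop, daemon=True)
        flusher.start()

    def append(self, record):
        acks = 0
        for server in self.data_servers:
            if server.buffer_in_memory(record):  # each server checks its own health first
                acks += 1
                if acks >= ACK_REPLICAS:
                    return True                  # low-latency ack: no disk I/O on this path
        return False                             # not enough healthy replicas accepted the record

    def _flush_loop(self):
        while True:                              # batched persistence in the background
            time.sleep(FLUSH_INTERVAL)
            for server in self.data_servers:
                server.flush_buffer_to_disk()
```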
Pangu balances the reliability and low latency of the in-memory log based on the following considerations.
The redo log itself is stored in multiple replicas, so a single-machine failure cannot damage or lose the data.
To reduce write latency, a write is acknowledged as soon as the redo log has been written into the memory buffers of multiple data servers; background workers then make sure the in-memory data is persisted to disk within a very short time.
The health of the redo log data is strictly monitored, and remedial action is taken promptly to keep the data reliable.
One advantage of a distributed system is that it masks single points of failure: replicating data across multiple servers greatly improves its reliability. On a single machine, data held in memory is easily lost; in a multi-machine environment, as long as the servers do not go down at the same time, and with strict policies as a safeguard, keeping data in memory can greatly improve performance without reducing reliability. Aliyun's data centers provide good hardware isolation and redundancy, as well as contingency measures such as UPS, giving memory buffering a solid hardware foundation.
The measures that keep the in-memory file data reliable are as follows.
Write-to-memory phase
Ensure that multiple data servers have successfully received the data and placed it in their memory buffers (this is the foundation of the redo log design).
Choose the data servers with hardware isolation in mind, to avoid correlated failures.
When accepting data, each data server checks its own health (a sketch follows this list):
the target disk is in a normal state and has enough free space;
the current workload is healthy, for example the memory and I/O queues are not overloaded.
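A minimal sketch of that self-check; the two conditions mirror the bullets above, while the thresholds and parameter names are illustrative assumptions.

```python
MIN_FREE_BYTES = 50 * 1024**3   # assumed threshold: keep at least 50 GB free on the target disk
MAX_IO_QUEUE_DEPTH = 128        # assumed threshold for "not overloaded" I/O
MAX_MEMORY_USAGE = 0.9          # assumed threshold: memory buffers below 90% full

def healthy_enough_to_accept(disk_ok, free_bytes, io_queue_depth, memory_usage):
    """Self-check a data server runs before accepting a redo-log record."""
    if not disk_ok or free_bytes < MIN_FREE_BYTES:
        return False            # disk abnormal or nearly full
    if io_queue_depth > MAX_IO_QUEUE_DEPTH or memory_usage > MAX_MEMORY_USAGE:
        return False            # server is overloaded right now
    return True
```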
Memory-to-disk persistence phase
Limit the maximum time a record may sit in the memory buffer before it reaches disk (within 30 seconds); a watchdog for this is sketched after the list.
If a write times out (for example because of a slow disk or an overloaded I/O queue), immediately notify the master server to create a replica elsewhere.
If a write fails outright (the disk is broken or full), immediately raise an alert and notify the master to re-replicate.
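A minimal watchdog sketch covering the rules above; the 30-second limit comes from the text, while the record layout and the notify_master callback are illustrative assumptions.

```python
import time

MAX_BUFFER_AGE = 30.0  # seconds a record may stay only in memory, per the limit above

def check_pending_flushes(pending, notify_master):
    """pending: [(record_id, enqueue_timestamp, status), ...] where status is
    'ok', 'slow', or 'failed'. notify_master asks the master to re-replicate."""
    now = time.time()
    for record_id, enqueued_at, status in pending:
        if status == 'failed':                    # disk broken or full
            notify_master(record_id, reason='write_error')
        elif now - enqueued_at > MAX_BUFFER_AGE:  # slow disk or overloaded I/O
            notify_master(record_id, reason='flush_timeout')
```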
Detection and replication phase
Monitor the disks for anomalies and check data integrity in the background; as soon as an exception is found, notify the master to re-replicate (sketched below).
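A minimal background integrity-check sketch; the chunk layout, the use of MD5, and the notify_master callback are illustrative assumptions, not Pangu's actual mechanism.

```python
import hashlib

def background_scrub(chunks, notify_master):
    """chunks: [(chunk_id, data_bytes, expected_checksum), ...].
    Recompute checksums in the background and ask the master to re-replicate
    any chunk that no longer matches its recorded checksum."""
    for chunk_id, data, expected in chunks:
        actual = hashlib.md5(data).hexdigest()
        if actual != expected:
            notify_master(chunk_id, reason='checksum_mismatch')
```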
As these measures show, the write-to-memory phase is preventive; the memory-to-disk persistence phase is the most dangerous, so we keep it as short as possible (the maximum write time is chosen based on expected performance) and act promptly once an error is detected; and the detection and replication phase handles typical disk failures while ensuring that no data is lost.
Summary
In designing and implementing the cloud computing platform, our engineers have spent a great deal of effort dealing with the reliability challenges posed by massive amounts of hardware. This article describes some of the design ideas, but far from all of them. Forging a robust large-scale distributed system requires good design, meticulous implementation, and rigorous testing. With Apsara as a stable and reliable cloud OS kernel, a rich variety of cloud computing services and applications have fertile soil in which to take root and grow. In later articles we will introduce these cloud services, all of which run on Aliyun's independently developed Apsara cloud computing platform.
(Responsible editor: Lu Guang)