A brief introduction to the cloud distributed system

Source: Internet
Author: User
Keywords: cloud computing, Apsara, distributed systems

Apsara ("Flying"), a large-scale distributed computing platform, plays the role of the kernel within the Aliyun cloud OS, much like the kernel of a single-machine operating system. Apsara joins clusters of commodity servers through the network, hides the complexity and unreliability of the massive underlying hardware, and provides reliable storage and computing capabilities to the other components of the cloud OS.

In other words, Apsara is itself a complex distributed system composed of multiple components, at whose core are the following two subsystems.

· Computing resource scheduling system (also known as Fuxi): manages and schedules cluster computing resources; dynamically allocates computing resources among multiple cloud services to meet users' computing needs; automatically detects server failures and migrates services off failed servers.

· Distributed file system (also known as Pangu): manages all the hard disks in the cluster; places data sensibly to balance performance and data safety; automatically detects disk failures and re-replicates data to keep it safe.

Engineers faced many technical challenges in implementing the Apsara cloud computing platform, including:

· Providing highly reliable computing and storage capabilities on top of unreliable hardware;

· Providing highly available services;

· Operating and maintaining massive amounts of hardware at low cost;

· Allowing online and offline applications to coexist;

· Overcoming bandwidth constraints between nodes;

· Maximizing the utilization of computing resources; and more.

Among these, unreliable hardware is the most fundamental challenge. Once a cluster grows to thousands of machines, low-probability events on a single machine become inevitable, frequent events. Failures of hard drives, hard disk controllers, CPUs, memory, motherboards, power supplies, and more can happen every day. We call this kind of hardware failure a "hard" fault (fail-stop fault). There is also another, less obvious type of failure, called a "soft" fault: for example, a disk is still accessible but at only 1/10 of its normal speed, a server is not down but programs on it run slowly, or the network works only intermittently. Such "soft" faults also hurt quality of service: slow execution of an online service can cause clients to time out, while for an offline job, even if only 1% of the data processing tasks are slow, they can delay the completion of the entire data analysis job.

Both hard and soft faults adversely affect the reliability, and even the availability, of the system, so detecting and recovering from faults promptly and effectively becomes ever more important. The industry already has mature solutions for hard fault detection, so the first part of this article focuses on the detection of soft faults; the second part discusses fault recovery strategy; finally, we introduce how to ensure data reliability while meeting the low-latency requirements of online applications.

Soft fault detection in a cloud environment

There are two ways to detect "soft" faults.

One idea is to design a detection method for each specific problem. However, a "soft" fault may have many causes: slow execution, for example, may be caused by server hardware problems, network problems, disk problems, operating system software problems, and so on. Detecting each cause individually could make the system overly complicated.

Another idea is to detect faults from their macroscopic symptoms. Here are two examples.

Example 1: detecting jobs that execute especially slowly on a particular server.

We count the execution time of each job on each server. Because the input data is sliced evenly, the execution time on each server should be roughly the same. A server is marked as "slow" for a job if it takes more than three times the average time. If different jobs are all "slow" on the same server, then we have every reason to suspect that something is wrong with that server (even though we do not yet know the cause). The scheduling system automatically blacklists the server and no longer uses it to execute jobs. The suspicious server is then checked, automatically or manually, for the specific cause of the malfunction.
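As a sketch, the slow-server heuristic might look like the following (a minimal illustration under the rules stated above; the function name, thresholds, and data shapes are assumptions, not Fuxi's actual interface):

```python
from collections import defaultdict

def find_slow_servers(job_times, threshold=3.0, min_jobs=2):
    """Flag servers on which several distinct jobs ran far slower than average.

    job_times maps job_id -> {server_id: execution_seconds}. A server is
    "slow" for a job if its time exceeds `threshold` times that job's
    cluster-wide average; servers slow for `min_jobs` or more distinct
    jobs become blacklist candidates for the scheduler.
    """
    slow_counts = defaultdict(int)
    for per_server in job_times.values():
        avg = sum(per_server.values()) / len(per_server)
        for server, seconds in per_server.items():
            if seconds > threshold * avg:
                slow_counts[server] += 1
    return {s for s, n in slow_counts.items() if n >= min_jobs}
```

Requiring several *different* jobs to be slow on the same server is what separates "this server is broken" from "this job is skewed."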

Example 2: detecting slow disk reads and writes.

Similarly, in the distributed file system we collect statistics on the access time of each disk. If a large proportion of a disk's accesses take far longer than the system average, the disk is most likely about to fail. The file system then does three things:

· Stops writing new data to the disk, so that no more data is put at risk;

· Begins creating additional copies of the data already on the disk;

· Once every piece of data on the disk has an extra copy, takes the disk offline and hands it over to operations for repair.
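The disk-side check can be sketched the same way (the thresholds here are illustrative assumptions; the article does not give Pangu's actual values):

```python
def find_suspect_disks(access_times, slow_factor=3.0, slow_ratio=0.2):
    """Flag disks whose access latencies are consistently abnormal.

    access_times maps disk_id -> list of recent access latencies (ms).
    A disk is suspect if more than `slow_ratio` of its accesses take
    over `slow_factor` times the system-wide average latency.
    """
    all_times = [t for ts in access_times.values() for t in ts]
    system_avg = sum(all_times) / len(all_times)
    suspects = []
    for disk, ts in access_times.items():
        slow = sum(1 for t in ts if t > slow_factor * system_avg)
        if slow / len(ts) > slow_ratio:
            suspects.append(disk)  # stop new writes, re-replicate, then offline
    return suspects
```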

Fault recovery strategy

After a fault is detected, an automatic and timely recovery mechanism is required. However, if not carefully thought through, automatic fault recovery can become a double-edged sword. Let's start with a serious incident at Amazon's cloud services.

The Amazon EC2 mass outage

On April 21, 2011, Amazon's web hosting service EC2 went down for more than two days, affecting websites such as Reddit, Foursquare, and Quora. Amazon later published a detailed analysis of the accident. The trigger was an operational error during a routine network upgrade: all network traffic was switched to the backup network, overloading it. The automatic failover mechanism detected the network failure, concluded that servers were down, and immediately started replicating data to replace the copies on the "down" servers, triggering a "mirroring storm" (a large number of servers all trying to create data mirrors at once). The additional replication traffic worsened the network overload, so the failure spread through the cluster in a vicious circle. Various measures, including temporarily shutting down the automatic failover system and adding hardware, eventually restored service, two and a half days after the failure began.

In this case, the automatic fault detection and recovery strategy was "re-replicate data when a server holding a copy is lost." This strategy works well for a small, common problem such as "one server fails," but can be counterproductive during a wide-ranging problem such as "network overload." Here, if there had been no automatic recovery mechanism, the impact of the failure would not have been nearly so large.

In fact, this pattern has recurred in past large-scale distributed system failures: an unforeseen, small-to-medium-scale failure

→ the automatic fault recovery mechanism takes the wrong action

→ the problem worsens into a vicious circle

The failure of Amazon's S3 storage service in 2008 was caused by nothing more than a single-bit error in the state of the fault-tolerance mechanism itself, but it likewise spread rapidly to the entire system, making the service unavailable even though no hardware had failed.

In this regard, our strategy is to limit the scope of the automatic fault recovery mechanism:

· Under normal circumstances, only a small proportion of the servers in the cluster are failed at any time; automatic recovery is effective, and even if it misfires it cannot cause a disaster;

· In the unlikely event of a large-scale failure, it is wiser to minimize system load, because it is virtually impossible to maintain quality of service through automatic recovery anyway. Once the automatic recovery mechanism attempts more operations than a preset limit, that part of its logic is temporarily disabled.

Take the disk offlining mentioned earlier as an example. Given that the average daily disk failure rate is below one in a thousand, we set an upper limit on the automatic offlining mechanism: for example, at any time at most 1% of the disks may be taken offline through this mechanism. This cap ensures that even in extreme cases, such as a large number of disks appearing faulty at once, or the automatic offlining mechanism itself malfunctioning, the recovery mechanism cannot itself cause a disaster.
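A minimal sketch of such a capped offlining mechanism (the class and parameter names are hypothetical, chosen only to illustrate the limit described above):

```python
class DiskOffliner:
    """Rate-limited automatic disk offlining.

    At most `cap` (e.g. 1%) of the fleet may be offline through this
    mechanism at any time; beyond that, requests are refused so that a
    misbehaving detector cannot itself cause a disaster.
    """

    def __init__(self, total_disks, cap=0.01):
        self.total_disks = total_disks
        self.cap = cap
        self.offlined = set()

    def request_offline(self, disk_id):
        if (len(self.offlined) + 1) / self.total_disks > self.cap:
            return False  # cap reached: suppress automation, alert operators
        self.offlined.add(disk_id)
        return True
```

The key design point is that a refusal is not an error: it is the mechanism deliberately ceding control back to human operators.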

Data reliability and real-time performance optimization

In a cloud environment, where a distributed system experiences frequent and varied hardware failures, guaranteeing data reliability is a real challenge for the file system.

The most troublesome hardware in the actual operation of the Apsara platform is the hard disk. Disk failures account for 80% of all failures in Aliyun data centers. One reason is simply that disks are the most numerous component: a cluster of 3,000 nodes contains more than 30,000 disks. Even if each disk achieved a mean time between failures (MTBF) of 1,000,000 hours, 30,000 disks would still mean a disk failure every 33 hours on average. Actual operational data show that manufacturers' nominal MTBF figures are not reliable; the failure rate of disks in a production environment can be several times the nominal value.
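The back-of-the-envelope arithmetic above is easy to verify:

```python
MTBF_HOURS = 1_000_000   # vendor-nominal mean time between failures per disk
DISKS = 30_000           # disks in a 3,000-node cluster

# Assuming independent failures, a fleet of N disks sees roughly one
# failure every MTBF / N hours.
hours_between_failures = MTBF_HOURS / DISKS
print(round(hours_between_failures, 1))  # 33.3
```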

The component most directly affected by disk failures is the Pangu distributed file system. To ensure data safety, Pangu keeps multiple copies of all data. When creating a file, the user can specify the number of copies of the file's data, and the file system ensures that the copies are distributed across different nodes and different racks, so that a single hardware failure cannot make the data inaccessible.

Multi-copy replication is widely accepted in the industry as an effective way to prevent data loss; write requests are usually propagated along a pipeline of replicas to reduce the load on any single node. However, this increases write latency, because a write operation cannot complete until all copies have been written successfully.

Because of disk read/write characteristics, writing these multiple copies to disk usually takes tens of milliseconds, sometimes more than 100 milliseconds. Online applications in the cloud sometimes have stricter real-time requirements. Pangu solves this problem with an in-memory redo log.

The basic idea behind the in-memory redo log rests on the following fact: although the probability of a single server losing in-memory data due to power loss or a crash is higher than the probability of a disk failing (which is why, on stand-alone systems, we write log files to disk to avoid losing in-memory data), the probability of multiple servers failing at the same time can be made low enough to meet data reliability requirements. For applications with high real-time requirements, Pangu provides an interface that treats data as successfully written once it is in the memory of a specified number of servers; a background thread in Pangu then writes the in-memory data to disk in batches.
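A highly simplified sketch of this write path (all names and data structures here are assumptions for illustration, not Pangu's real interface):

```python
import time

class InMemoryRedoLog:
    """Acknowledge a write once `min_copies` servers hold it in memory;
    a background step later persists buffered entries to disk."""

    def __init__(self, servers, min_copies=3, flush_deadline=30.0):
        self.servers = servers                # list of per-server buffer dicts
        self.min_copies = min_copies
        self.flush_deadline = flush_deadline  # seconds before a flush is overdue

    def append(self, entry):
        acked = 0
        for buf in self.servers:
            buf.setdefault("mem", []).append((time.time(), entry))
            acked += 1
            if acked >= self.min_copies:
                return True   # low-latency ack: entry is in N memories
        return False          # not enough healthy servers to meet min_copies

    def flush(self):
        """Background step: persist buffered entries, flag overdue buffers."""
        overdue = False
        now = time.time()
        for buf in self.servers:
            for ts, entry in buf.get("mem", []):
                if now - ts > self.flush_deadline:
                    overdue = True  # real system: notify master, re-replicate
                buf.setdefault("disk", []).append(entry)
            buf["mem"] = []
        return overdue
```

The latency win comes from `append` returning as soon as enough memory copies exist, while `flush` bounds how long data may live only in memory.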

Pangu balances the reliability and low latency of the in-memory redo log through the following considerations.

· The redo log is kept in multiple copies, so that a single failure cannot corrupt or lose data.

· To reduce write latency, a write is acknowledged once the redo log has been written into the memory buffers of multiple data servers; a background worker thread then ensures the in-memory data is persisted to disk within a short period.

· The health of the redo log data is strictly monitored, and remedial measures are taken promptly to ensure data reliability.

One advantage of distributed systems is that they mask single points of failure: replicating data across multiple servers greatly improves reliability. On a single machine, in-memory data is easily lost; but in a multi-machine environment, if we can ensure that servers do not go down simultaneously, supplemented by strict safeguard policies, in-memory data can deliver a large performance gain with no loss of reliability. Aliyun's data centers provide good hardware isolation and redundancy, as well as emergency measures such as UPS, giving us a hardware environment well suited to memory buffering.

Here are some of our considerations regarding in-memory data reliability.

Write-to-memory stage

· Ensure that multiple data servers successfully receive the data and place it in their memory buffers (this is the foundation of the redo log design).

· When selecting data servers, fully consider hardware isolation to avoid correlated failures.

· Each data server checks its own health when receiving data:

the disk to be written to is in a normal state and has sufficient free space;

the current workload is acceptable, e.g. memory and I/O queues are not overloaded.
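The self-health check above might be sketched as a simple predicate (the threshold values are illustrative assumptions, not Pangu's actual limits):

```python
def can_accept_data(disk_ok, free_bytes, mem_load, io_queue_len,
                    min_free=64 * 2**30, max_mem_load=0.8, max_io_queue=128):
    """A data server accepts redo-log data only when its own state is
    healthy: the target disk is writable with enough free space, and
    memory and I/O queues are not overloaded."""
    return (disk_ok
            and free_bytes >= min_free
            and mem_load <= max_mem_load
            and io_queue_len <= max_io_queue)
```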

Memory-to-disk persistence stage

· Limit the maximum time data may stay buffered in memory before being persisted to disk (within 30 seconds).

· On discovering a write timeout (such as an unusually slow disk or an overload of I/O requests), immediately notify the master server so that an extra replica can be made.

· On discovering a write error (a bad or full disk), immediately raise an alarm and inform the master.

Detection and replication phase

· Monitor disk anomalies and check data integrity in the background; on discovering an anomaly, immediately notify the master to start replication.

As you can see, the strategy in the write-to-memory stage is preventive; the most dangerous stage is memory-to-disk persistence, so we ensure that this stage is as short as the expected performance allows (by capping the write time) and that prompt measures are taken once the cap is exceeded; the detection and replication phase is the classic strategy of "the disk may break, but the data must not be lost."

Summary

In designing and implementing the Apsara cloud computing platform, engineers devoted a great deal of effort to the reliability challenges posed by massive amounts of hardware. This article has described some, but not all, of the design ideas involved. Tempering a robust large-scale distributed system requires good design, careful engineering, and rigorous testing. With this stable and reliable cloud OS kernel in place, all kinds of cloud computing services and applications have fertile soil in which to survive and grow. We will go on to introduce the various cloud services that Aliyun has built on top of this self-developed cloud computing platform.
