Linux high-performance computing clusters: Overview

Last Update: 2013-12-24 Source: Internet

Author: User


This article is the first part of the Linux high-performance cluster series. It introduces the basics of cluster systems and explains the two main types of clusters: high-availability clusters and high-performance clusters. The rest of the series focuses on Beowulf high-performance clusters: first the architecture of a Beowulf cluster, then its hardware, network, software, and system components, and finally cluster management software.

1 Cluster

1.1 What is a cluster

In short, a cluster is a group of computers that together provide users with a set of network resources as a single whole. Each individual computer system is a node of the cluster. In an ideal cluster, users are never aware of the underlying nodes: to them, the cluster is one system, not a collection of computers. In addition, administrators can add or remove nodes at will.

1.2 Why Cluster

Clusters are not a new concept. Computer manufacturers and research institutions began to research and develop cluster systems as early as the 1970s. These systems were not widely known because they were used mainly for scientific and engineering computing. Only with the emergence of Linux clusters did the concept become widespread.

Research on clusters was motivated by their excellent scalability. Raising CPU clock speed and bus bandwidth was the original way to improve computer performance, but the gains available this way are limited. Designers then added more CPUs and memory, which led to vector machines and symmetric multiprocessor (SMP) systems. However, once the number of CPUs exceeds a certain threshold, multiprocessor systems such as SMP scale very poorly. The main bottleneck is that the bandwidth available for CPU access to memory does not grow with the number of CPUs. In contrast, the performance of a cluster system grows almost linearly with the number of CPUs. Figure 1 illustrates this.

Figure 1. Scalability of several computer systems

Scalability is not the only advantage of cluster systems. Their main advantages are:

High scalability: as described above.
High availability: when one node in the cluster fails, its tasks can be handed over to other nodes, which effectively prevents a single point of failure.
High performance: a load-balancing cluster allows the system to serve more users at the same time.
Cost-effectiveness: high-performance systems can be built from inexpensive, industry-standard hardware.

1.2.1 Classification of cluster systems

Although cluster systems can be classified in several ways according to their features, they are generally divided into two categories:

High-availability (HA) clusters. These clusters are dedicated to providing highly reliable services.
High-performance computing (HPC) clusters. These clusters are designed to provide computing power that a single computer cannot provide.

2 High-availability clusters

2.1 What is high availability

The availability of a computer system is determined by its reliability and maintainability. Reliability is generally measured by MTTF (mean time to failure), and maintainability by MTTR (mean time to repair). Availability is defined as:

Availability = MTTF / (MTTF + MTTR) * 100%

The industry classifies computer systems into the following categories based on availability:

Availability (%)	Downtime per year	Availability classification
99.5	1.8 days	Conventional
99.9	8.8 hours	Available
99.99	52.6 minutes	Highly Available
99.999	5.3 minutes	Fault Resilient
99.9999	32 seconds	Fault Tolerant

Table 1. System availability classification
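The relationship between the availability formula and annual downtime can be checked with a few lines of Python. This is only a minimal sketch: the function names and the sample MTTF/MTTR values are illustrative assumptions, and a year is taken as 365 days.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Expected downtime per year, in hours, for a given availability fraction."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    # Hypothetical node: fails on average every 1000 h and takes 1 h to repair.
    a = availability(1000.0, 1.0)
    print(f"availability = {a * 100:.3f}%")

    # Downtime per year implied by a few availability levels.
    for percent in (99.9, 99.99, 99.999, 99.9999):
        minutes = annual_downtime_hours(percent / 100) * 60
        print(f"{percent}% -> {minutes:.1f} minutes per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the higher classes are so much harder to reach.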

For critical services, downtime is usually disastrous. The cost of downtime is huge. The following statistics list the losses caused by the downtime of different types of enterprise application systems.

Application system	Loss per minute (USD)
Call center	27000
Enterprise resource planning (ERP) system	13000
Supply chain management (SCM) system	11000
E-commerce system	10000
Customer service center system	27000

Table 2. Business losses caused by downtime

As enterprises rely more and more on information technology, the loss caused by system downtime also increases.

2.2 High-availability cluster

High-availability clusters use cluster technology to achieve high availability of computer systems. High-availability clusters generally work in two ways:

Fault-tolerant system: usually a master-slave pair of servers. The slave server monitors the status of the master server. While the master server is working normally, the slave server does not provide services. Once the master server fails, the slave server takes over and starts serving clients in its place.
Load-balancing system: all nodes in the cluster are active and share the system's workload. Web server clusters, database clusters, and application server clusters generally belong to this type.
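The two working modes can be illustrated with a toy model. The sketch below is not taken from any real HA package: the class names, the heartbeat timeout, and the node names are all assumptions chosen for illustration, and time is passed in explicitly so the behaviour is easy to follow.

```python
import itertools


class ActiveStandbyPair:
    """Master/slave pair: the slave serves only after the master goes silent."""

    def __init__(self, timeout: float):
        self.timeout = timeout       # seconds of silence tolerated (assumed value)
        self.last_heartbeat = 0.0    # time of the master's last heartbeat
        self.active = "master"

    def heartbeat(self, now: float) -> None:
        """Called whenever the slave hears from the master."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Slave-side check: take over if the master has been silent too long."""
        if self.active == "master" and now - self.last_heartbeat > self.timeout:
            self.active = "slave"
        return self.active


class RoundRobinBalancer:
    """Load-balancing mode: every node is active and takes requests in turn."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(list(nodes))

    def pick(self) -> str:
        """Return the node that should handle the next request."""
        return next(self._cycle)


if __name__ == "__main__":
    pair = ActiveStandbyPair(timeout=3.0)
    pair.heartbeat(now=1.0)
    print(pair.check(now=2.0))  # master was heard from 1 s ago
    print(pair.check(now=6.0))  # 5 s of silence exceeds the 3 s timeout

    balancer = RoundRobinBalancer(["web1", "web2", "web3"])
    print([balancer.pick() for _ in range(5)])
```

Real systems add much more (fencing, shared storage, health checks per service), but the contrast is the same: in the first mode one server sits idle until it is needed, while in the second every node carries part of the load.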

There are a lot of discussions about highly available clusters, so I will not go into detail here.

3 High-performance computing cluster

3.1 What is a high-performance computing cluster

In short, High-Performance Computing is a branch of computer science. It is dedicated to developing supercomputers, researching parallel algorithms, and developing related software. High-performance computing focuses on the following two types of problems:

Large-scale scientific problems, such as weather forecasting, terrain analysis, and biopharmaceutical research;
Storage and processing of massive data, such as data mining, image processing, and gene sequencing.

As the name suggests, high-performance clusters apply cluster technology to high-performance computing.

3.2 High Performance Computing Classification

High-performance computing can be classified in many ways. This section classifies it according to the relationship between parallel tasks.

3.2.1 High-throughput computing

One type of high-performance computing can be divided into a number of parallel subtasks that have no dependencies on one another. SETI@home (Search for Extraterrestrial Intelligence at Home) is an application of this type. The project uses idle computing resources on the Internet to search for signs of extraterrestrial intelligence. The SETI server sends a set of data and a pattern to each participating computing node on the Internet. The node searches the data for the given pattern and sends the result back to the server, which assembles the results returned by all nodes into a complete data set. Because such applications typically search massive amounts of data for certain patterns, this type of computing is called high-throughput computing. So-called Internet computing belongs to this category. In Flynn's taxonomy, high-throughput computing falls into the SIMD (Single Instruction/Multiple Data) category.
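The server/worker pattern described above can be sketched in a few lines. This is a simplified illustration, not SETI@home's actual protocol: local threads stand in for remote computing nodes, the "pattern" is a plain substring, and the function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(chunk: str, pattern: str) -> int:
    """Work unit: count non-overlapping occurrences of pattern in one chunk."""
    return chunk.count(pattern)


def run_server(chunks, pattern, workers=4):
    """Server side: hand out independent work units and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each call is independent: no worker needs data from another worker.
        counts = list(pool.map(lambda chunk: search_chunk(chunk, pattern), chunks))
    return sum(counts)


if __name__ == "__main__":
    data_chunks = ["abcabc", "xxabxx", "nothing", "ab"]
    print(run_server(data_chunks, "ab"))
```

Because the work units never communicate with each other, the only network traffic is sending a chunk out and getting a small result back, which is why this style of computing tolerates slow links such as the Internet.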

3.2.2 Distributed computing

The other type of computing is the opposite of high-throughput computing: although the work can be divided into parallel subtasks, the subtasks are closely coupled and must exchange large amounts of data. In Flynn's taxonomy, distributed high-performance computing falls into the MIMD (Multiple Instruction/Multiple Data) category.
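To see why such subtasks need constant data exchange, consider a toy smoothing computation in which every cell is repeatedly replaced by the average of itself and its two neighbours. The sketch below is an illustration under stated assumptions: the "nodes" are plain Python lists inside one process, and the function names are hypothetical. On a real cluster, each boundary value fetched from a neighbouring part would be a network message in every step.

```python
def step_serial(u):
    """One smoothing step on a whole array; the two end values stay fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def step_partitioned(parts):
    """The same step with the array split across several 'nodes'.

    Before it can compute, each node needs one boundary value from each
    neighbouring node (the exchange that a cluster performs over the network).
    """
    new_parts = []
    for k, part in enumerate(parts):
        has_left = k > 0
        has_right = k < len(parts) - 1
        padded = list(part)
        if has_left:
            padded.insert(0, parts[k - 1][-1])  # value received from left neighbour
        if has_right:
            padded.append(parts[k + 1][0])      # value received from right neighbour
        stepped = step_serial(padded)
        # Drop the borrowed cells again and keep only this node's own cells.
        start = 1 if has_left else 0
        end = len(stepped) - (1 if has_right else 0)
        new_parts.append(stepped[start:end])
    return new_parts


if __name__ == "__main__":
    u = [0.0, 0.0, 9.0, 0.0, 0.0, 3.0, 0.0, 0.0]
    parts = [u[:3], u[3:6], u[6:]]
    for _ in range(3):
        u = step_serial(u)
        parts = step_partitioned(parts)
    flattened = [x for part in parts for x in part]
    print(flattened == u)
```

Unlike the independent work units of high-throughput computing, no node here can advance to the next step until it has received fresh boundary values, so network latency and bandwidth directly limit performance.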

3.3 Linux High-Performance cluster system

When Linux high-performance clusters are mentioned, Beowulf is the first thing that comes to mind for many people. Beowulf began as the name of one well-known scientific computing cluster. Many later clusters adopted a similar architecture, so Beowulf has become a widely accepted type of high-performance cluster. Whatever their names, many cluster systems are derivatives of Beowulf clusters. There are, of course, cluster systems that differ from Beowulf; COW and Mosix are two other well-known examples.

3.3.1 Beowulf Cluster

Simply put, Beowulf is an architecture that uses multiple computers for parallel computing. A Beowulf system generally consists of multiple computing nodes and a management node connected through Ethernet or another network. The management node controls the entire cluster and provides file services and external network connections for the computing nodes. A Beowulf cluster is built from commodity hardware, such as ordinary PCs, Ethernet cards, and hubs, and rarely uses custom hardware or special devices. Its software is also widely available, for example Linux, PVM, and MPI.

The hardware, network, software, and application architecture of the Beowulf cluster system will be described in detail in the following sections.

3.3.2 Beowulf cluster and COW Cluster

Like Beowulf, a COW (Cluster of Workstations) is built from the most common hardware devices and software systems. It usually consists of one control node and multiple computing nodes. The main differences between COW and Beowulf are:

Computing nodes in a COW are mostly idle computing resources, such as office desktop workstations: ordinary PCs connected by an ordinary LAN. Because these nodes are used as workstations during the day, most cluster computing takes place at night and on weekends. Beowulf's computing nodes are dedicated to parallel computing and tuned for performance. They use message passing (PVM or MPI) over a high-speed network (Myrinet or Giganet) for inter-process communication (IPC).
Because the computing nodes in a COW are primarily intended for desktop use, they all have peripherals such as displays, keyboards, and mice. Beowulf's computing nodes usually have no such peripherals; they are usually accessed from the management node over the network or a serial line.
Because the computing nodes in a COW are usually connected by an ordinary LAN, high-performance applications on a COW are generally SIMD high-throughput computations such as SETI@home. Beowulf is optimized in hardware, network, and software for MIMD applications that require frequent data exchange.

3.3.3 Mosix Cluster

Classifying Mosix as a high-performance cluster is admittedly a stretch, but compared with clusters such as Beowulf, Mosix is a very special case: it is dedicated to implementing a Single System Image (SSI) for a cluster on Linux. A Mosix cluster joins Linux computers on a network into one cluster system, and the system automatically balances load between nodes. Because Mosix is implemented in the Linux kernel, user-mode applications can run on a Mosix cluster without any modification. Users seldom notice any difference between Linux and Mosix; to them, a Mosix cluster is simply a PC running Linux. Despite its many problems, Mosix remains a remarkable cluster system.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
