The advent of the public cloud has put massive HPC resources within reach of ordinary companies. In many cases, especially for temporary HPC projects, cloud solutions are more cost effective than purchasing the necessary computing resources in-house. Before the public cloud, only a handful of companies, such as large financial services firms, had the funds to buy the resources required for high-performance computing.
Over the past year we have seen considerable customer demand in the market, with companies across many industries evaluating software platforms for large-scale HPC clusters. When we first described our requirements to traditional HPC vendors, we were often asked which industry consortium or government agency was behind the project, because of the sheer size of the HPC environment. When we explained that it was a single private company rather than a large organization, we were met with good-natured skepticism. In the end, we decided to build the cluster software ourselves, with the goal of running it on both public and private clouds.
When we evaluated the numerous commercial and open source options, we found that most of them were optimized for running many different general-purpose applications on the cluster at the same time. To accommodate such requirements, clusters are hard-partitioned, with different operating systems installed on individual compute nodes and each machine reserved for specific applications regardless of how many computing resources the application actually requires. This results in fairly low utilization of computing resources, with an average utilization of only about 30%. That is acceptable for operators who want to build HPC clusters and lease them to the public as general-purpose resources, but it is not the best strategy for meeting the needs of typical business users.
HPC Solutions
We decided to build our HPC software on different principles. The project was initially driven by a small group of users with a strong interest in the solution. These users wanted to reduce cost and turnaround time: they were unwilling to spend more than one million dollars on the job, and they believed that if results took weeks to produce, the solution would not be competitive in the market.
In talking with users of HPC resources, we noted that companies typically build a single HPC application, or a collection of related applications, around a common computing platform, and that these users want a software platform that minimizes computing time and maximizes utilization of the available resources.
Our HPC cluster design differs in that it revolves around the principle of dynamically orchestrating individual cores to ensure maximum resource utilization. The cluster runs a single computing platform and processes related workloads from a single customer. This permits a relaxed security model in which code from different computing tasks can share an operating system, allowing us to switch cores between tasks almost in real time.
The cluster is designed to address a subset of problems in the HPC domain rather than to be a generic, all-encompassing computing solution. The problem subset we chose to handle is essentially highly parallel: the compute time of a single task is at least an order of magnitude longer than the time needed to distribute it, and the problem sets and result sets are small enough to be transmitted across the network topology in a time that is several orders of magnitude shorter than the total elapsed time of the job. The cluster software is designed to integrate with compute-intensive applications at the code level rather than expose a broad set of generic remote interfaces.
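To make the criteria concrete, here is a minimal sketch of the kind of suitability check implied above. The function name, the 10x and 100x factors, and the assumed link speed are illustrative assumptions, not part of the actual software.

```python
# Hypothetical sketch of the "highly parallel" suitability check described above.
# Thresholds, names, and the assumed 1 Gbit/s link are illustrative assumptions.

def is_suitable_for_cluster(compute_seconds: float,
                            distribution_seconds: float,
                            problem_bytes: int,
                            result_bytes: int,
                            job_elapsed_seconds: float) -> bool:
    """Return True if a task matches the problem subset the cluster targets."""
    # Compute time should be at least an order of magnitude above distribution time.
    if compute_seconds < 10 * distribution_seconds:
        return False
    # Payloads must be small enough that transfer time stays orders of magnitude
    # below the total elapsed time of the job.
    transfer_seconds = (problem_bytes + result_bytes) * 8 / 1e9
    return transfer_seconds * 100 < job_elapsed_seconds
```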
Cluster architecture
The cluster is designed to divide a work task into multiple compute tasks, execute those tasks efficiently on the available hardware, and return the results to the client application. The cluster runs on bare metal, in an OpenStack private cloud, and in the public cloud. The different deployment scenarios address different resource-availability questions. Bare metal is most effective when users can justify the cost of allocating fixed computing resources to a single application. A private cloud works well when internal hardware is shared between applications, making it easy to deploy different compute nodes or move spare capacity from the HPC cluster to other workloads that need it. The public cloud is a good deployment scenario when burst loads and temporary requirements make purchasing hardware impractical.
The HPC software uses Apache Libcloud to deploy across multiple hardware platforms; we have become major contributors to the Libcloud project and have used it in many of our software projects. Above that layer sit the HPC cluster components: the scheduler, the task interface nodes, the communication exchange fabric, and the compute nodes, which together control the efficient execution of tasks.
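For readers unfamiliar with Libcloud, the following is a minimal sketch of provisioning a single compute node through its driver API. The provider, credentials, image, and node names are placeholders; the real orchestration layer is considerably more involved.

```python
# Minimal Apache Libcloud sketch: provision one compute node on a public cloud.
# Provider, credentials, image id, and names are placeholders, not deployment code.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

cls = get_driver(Provider.EC2)                    # any Libcloud-supported provider works
driver = cls('ACCESS_KEY', 'SECRET_KEY', region='us-east-1')

size = [s for s in driver.list_sizes() if s.id == 'c5.4xlarge'][0]
image = driver.get_image('ami-12345678')          # placeholder image id
node = driver.create_node(name='hpc-compute-001', image=image, size=size)
print(node.id, node.state)
```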
Components
The communication exchange fabric consists of a set of RabbitMQ nodes. To simplify reconfiguration and the propagation of state information, each instance is assigned to either the control plane or the data plane; the data plane carries tasks to the compute nodes and returns results. There is no aggregating RabbitMQ instance, because we found that one can severely slow the rate at which clients reconnect. Instead, the client communication library distributes requests across the relevant RabbitMQ instances, which provides an effective and scalable mechanism. The problem and result payloads are usually sent in band; however, to keep RabbitMQ scalable for large payloads, a distributed cache cluster is also deployed. The distributed cache configuration pays off when a problem set exceeds tens of kilobytes or a single task's result set exceeds hundreds of kilobytes.
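A rough sketch of how a client library might spread publishes across data-plane RabbitMQ instances and divert large payloads to the cache follows. The broker hosts, queue name, cache interface, and the 64 KB threshold are assumptions made purely for illustration.

```python
# Illustrative sketch only: spread task publishes across data-plane RabbitMQ
# instances and push oversized payloads to a cache, referencing them by key.
# Hosts, queue names, cache client, and the threshold are assumptions.
import json
import uuid
import pika

DATA_PLANE_HOSTS = ['rabbit-data-1', 'rabbit-data-2', 'rabbit-data-3']
LARGE_PAYLOAD_BYTES = 64 * 1024

def pick_connection(task_id: str) -> pika.BlockingConnection:
    """Hash the task id onto one data-plane instance."""
    host = DATA_PLANE_HOSTS[hash(task_id) % len(DATA_PLANE_HOSTS)]
    return pika.BlockingConnection(pika.ConnectionParameters(host=host))

def submit_task(payload: bytes, cache) -> str:
    task_id = str(uuid.uuid4())
    if len(payload) > LARGE_PAYLOAD_BYTES:
        # Out-of-band: store the problem set in the distributed cache and
        # send only a reference through RabbitMQ.
        cache.set(task_id, payload)
        body = json.dumps({'task_id': task_id, 'cache_key': task_id})
    else:
        body = json.dumps({'task_id': task_id, 'inline': payload.decode('latin-1')})
    conn = pick_connection(task_id)
    channel = conn.channel()
    channel.queue_declare(queue='compute_tasks', durable=True)
    channel.basic_publish(exchange='', routing_key='compute_tasks', body=body)
    conn.close()
    return task_id
```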
The scheduler assigns computing resources to individual compute tasks. It manages the set of message queues and result delivery across the different RabbitMQ instances and, through the control plane, notifies a subset of the compute nodes to attach to a given queue. The scheduler is a collection of fairly involved policies that allow customers to reserve computing resources for different users, account for resource failures, and handle tasks that require specific hardware, such as GPUs, which may be present on only a subset of nodes. When the cluster nodes are located close together and traffic latency is low, scheduling decisions are made in 100-200 milliseconds.
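A simplified sketch of the kind of placement decision described above is shown below: match a queue's hardware requirements against a subset of nodes and notify them over the control plane. The class names, fields, and `notify` call are illustrative assumptions, not the actual scheduler.

```python
# Hypothetical scheduler sketch: match a task queue's requirements to a subset
# of compute nodes and tell them, via the control plane, which queue to consume.
# Class and field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ComputeNode:
    name: str
    free_cores: int
    has_gpu: bool = False

@dataclass
class TaskQueue:
    name: str
    cores_needed: int
    needs_gpu: bool = False
    assigned: list = field(default_factory=list)

def schedule(queue: TaskQueue, nodes: list[ComputeNode], control_plane) -> None:
    remaining = queue.cores_needed
    for node in nodes:
        if remaining <= 0:
            break
        if queue.needs_gpu and not node.has_gpu:
            continue
        take = min(node.free_cores, remaining)
        if take == 0:
            continue
        node.free_cores -= take
        remaining -= take
        queue.assigned.append((node.name, take))
        # Control-plane message asking the node to attach `take` cores to the queue.
        control_plane.notify(node.name, {'queue': queue.name, 'cores': take})
```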
Task interface nodes provide a way to submit work tasks to the cluster and retrieve results through REST and WSDL interfaces, and they provide redundancy by resubmitting failed tasks. Work tasks are delivered to the compute nodes through the data-plane communication exchange. Client application code on the compute nodes can divide a work task into compute tasks and submit them to the work queues established by the scheduler, or continue to split them into subtasks through a series of steps. Using HPC resources for this decomposition means that even computationally costly work steps can be completed in a very short period of time. Client code supplied by the end user interfaces with the HPC software through a set of class libraries and a well-defined API, and is written to decompose work into the various task types in the most efficient manner. The decomposition chain is then reversed to accumulate results, until the final result set is delivered to the interface host for client access; a sketch of this pattern follows this paragraph. The components in the task chain control the degree of parallelism of the task and can inform the scheduler of resource requirements, and if policy allows, the scheduler can transfer previously reserved but underutilized resources to other tasks.
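The decomposition chain might look roughly like the following. The `hpc` module, `TaskContext`, `submit`, and `result` names are purely hypothetical stand-ins for the real client class libraries and API, which the article does not show.

```python
# Purely hypothetical sketch of the decomposition chain described above.
# The `hpc` module and its TaskContext/submit/result calls are stand-ins for
# the real client class libraries and API.
import hpc

CHUNK = 1_000  # illustrative threshold for leaf-level work

def run_work_task(ctx: hpc.TaskContext, items):
    # Divide the work task into compute tasks and submit them to the
    # scheduler-built queues; each compute task may split further.
    tasks = [ctx.submit(run_compute_task, part) for part in split(items, CHUNK * 10)]
    # Reverse the decomposition chain: accumulate results for the interface host.
    return [r for t in tasks for r in ctx.result(t)]

def run_compute_task(ctx: hpc.TaskContext, items):
    if len(items) > CHUNK:
        subtasks = [ctx.submit(run_compute_task, part) for part in split(items, CHUNK)]
        return [r for t in subtasks for r in ctx.result(t)]
    return [compute(item) for item in items]   # leaf-level calculation

def split(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

def compute(item):
    ...  # the end user's compute-intensive code goes here
```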
A compute node pulls individual tasks from its assigned message queue, fetches any out-of-band data needed for execution, and either performs the final calculation or decomposes the task further and resubmits the pieces to the queues. The actual computing software is provided by the client application and distributed to the individual nodes through the orchestration layer. Each compute node monitors the utilization of the cores assigned to a given task queue, and if that queue stays idle beyond a policy-controlled timeout, the node pulls tasks from another queue instead. Accumulated results are returned through the result queues, or uploaded to the distributed cache cluster with a notification sent over the data plane.
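A compute node's main loop might look roughly like this sketch. The host and queue names, the cache client, the idle timeout, and the result-size threshold are assumptions made for illustration, not the real node software.

```python
# Illustrative compute-node loop: pull tasks from the assigned queue, fall back
# to another queue after an idle timeout, and return or cache the results.
# Hosts, queue names, cache client, and threshold values are assumptions.
import json
import time
import pika

IDLE_TIMEOUT = 5.0               # seconds before switching to another queue
LARGE_RESULT_BYTES = 256 * 1024

def worker_loop(assigned_queue, fallback_queue, cache, host='rabbit-data-1'):
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = conn.channel()
    queue = assigned_queue
    idle_since = time.monotonic()
    while True:
        method, _props, body = channel.basic_get(queue=queue, auto_ack=False)
        if body is None:
            # Assigned queue idle too long: switch queues per policy.
            if time.monotonic() - idle_since > IDLE_TIMEOUT:
                queue = fallback_queue
            time.sleep(0.1)
            continue
        idle_since = time.monotonic()
        task = json.loads(body)
        # Fetch out-of-band data from the distributed cache if referenced.
        payload = cache.get(task['cache_key']) if 'cache_key' in task else task['inline']
        result = execute(payload)          # client-supplied compute code
        if len(result) > LARGE_RESULT_BYTES:
            cache.set(task['task_id'] + ':result', result)
            reply = json.dumps({'task_id': task['task_id'], 'result_in_cache': True})
        else:
            reply = json.dumps({'task_id': task['task_id'], 'result': result.decode('latin-1')})
        channel.basic_publish(exchange='', routing_key='results', body=reply)
        channel.basic_ack(method.delivery_tag)

def execute(payload):
    ...  # the end user's compute-intensive code
```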