Optimizing Resource Management in Supercomputers with Slurm

Last Update: 2014-12-24 Source: Internet

Author: User

Keywords: Supercomputer, Slurm


To provide the most functionality in one architecture, the open source Slurm job scheduler (used by China's Tianhe-1A supercomputer and the upcoming IBM® Sequoia supercomputer) optimizes resource allocation and monitoring. This article looks at Slurm and how it manages workloads on parallel clusters.

Supercomputers are a classic example of an arms race. As the performance of these huge systems grows, they extend into new problem areas and provide a platform for solving new problems. Supercomputers are a source of national and corporate pride, as companies and countries compete to improve their LINPACK scores. Figure 1 shows the supercomputer arms race over the past five years, with the IBM Sequoia supercomputer projected to take the lead in 2012. As the figure shows, IBM Roadrunner was the first supercomputer to break the sustained petaflops barrier (and IBM Blue Gene®/L held the top spot from 2004 to 2008).

Figure 1. Supercomputer performance: 2008-2012

Early supercomputers were designed to model nuclear weapons. Today, they are applied much more widely, to computationally intensive problems in areas such as climate research, molecular modeling, large-scale physical simulation, and even brute-force code breaking.

1964 to present

What is the LINPACK benchmark?

To compare the performance of competing supercomputers, the LINPACK performance benchmark was created. LINPACK measures how fast a system executes floating-point operations. Specifically, LINPACK is a set of routines for solving dense systems of linear equations.
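As a rough illustration of what such a benchmark measures, the sketch below solves a random dense system with Gaussian elimination and partial pivoting in pure Python, then divides a nominal operation count by the elapsed time. It is not the LINPACK code itself. The function names `solve_dense` and `estimate_flops` are made up for this example, and the count of about 2/3·n³ + 2·n² operations is the conventional figure for this kind of solve, used here as an assumption.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on copies so the caller's matrix and vector are left untouched.
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def estimate_flops(n, seed=0):
    """Time one solve of a random n x n system; return (FLOPS estimate, residual)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    # Nominal operation count for the solve (an assumption, see the text above).
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    # Largest entry of |a x - b|, to confirm the answer is actually correct.
    residual = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    return ops / elapsed, residual


if __name__ == "__main__":
    rate, residual = estimate_flops(100)
    print(f"about {rate:.3e} FLOPS, residual {residual:.2e}")
```

Interpreted Python will report a rate many orders of magnitude below what the same hardware reaches with optimized numerical libraries, so the number is only meaningful as a demonstration of the measurement, not as a benchmark result.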

It is generally accepted that the first supercomputer was the Control Data Corporation (CDC) 6600, released in 1964 and designed by Seymour Cray. The 6600 filled four cabinets with hardware, a Freon cooling system, and a single CPU capable of 3 million floating-point operations per second (FLOPS). Although not lacking in aesthetics, its cabinets were visibly filled with colored wires connecting the peripheral unit processors to the single CPU to keep it as busy as possible.

Fast-forward to today, and the current supercomputer leader is Japan's Kei computer (built by Fujitsu). The system, which focuses on brute-force compute capacity, uses more than 88,000 SPARC64 processors and occupies 864 cabinets. The Kei supercomputer has the distinction of breaking the 10-petaflops barrier. Similar to the CDC 6600, Kei uses water cooling in addition to air cooling.

What is a supercomputer?

A supercomputer is not defined by any particular architecture; it is simply a design at the leading edge of computational performance. Today, this means a system that operates in the petaflops range (quadrillions of floating-point operations per second, or FLOPS) as measured by the LINPACK benchmark.

Regardless of how a supercomputer achieves those FLOPS, an underlying goal of any supercomputer architecture is to keep the compute resources as busy as possible whenever there is work to do. Just as the CDC 6600 used peripheral processors to keep its single CPU busy, modern supercomputers need the same fundamental capability. Let's look at one implementation of compute node resource management, the Simple Linux® Utility for Resource Management (Slurm).

Slurm Introduction

Slurm is a highly scalable and fault-tolerant cluster manager and job scheduling system for large clusters of compute nodes. Slurm maintains a queue of pending work and manages the overall resource utilization of that work. It also manages the available compute nodes in an exclusive or non-exclusive fashion, depending on the resource requirements. Finally, Slurm distributes each job to a set of allocated nodes to perform the work and monitors the parallel job until it completes.
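The responsibilities just described (a queue of pending work, exclusive or shared node allocation, dispatch, and tracking until completion) can be pictured with a deliberately small toy model. The sketch below is not Slurm's scheduling algorithm or API. The names `Job`, `Node`, and `ToyScheduler`, the strict first-in-first-out policy, and the fixed `slots` limit on shared nodes are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes: int        # number of compute nodes requested
    exclusive: bool   # True if the job needs whole nodes to itself


@dataclass
class Node:
    name: str
    slots: int = 2                            # hypothetical limit on jobs sharing a node
    jobs: list = field(default_factory=list)  # names of jobs currently placed here
    reserved: bool = False                    # True while held by an exclusive job

    def can_host(self, job):
        if self.reserved:
            return False
        if job.exclusive:
            return not self.jobs              # exclusive jobs need an empty node
        return len(self.jobs) < self.slots


class ToyScheduler:
    def __init__(self, nodes):
        self.nodes = nodes
        self.queue = deque()   # pending jobs in submission order
        self.running = {}      # job name -> list of allocated nodes

    def submit(self, job):
        self.queue.append(job)

    def schedule(self):
        """Start queued jobs in FIFO order until the job at the head does not fit."""
        started = []
        while self.queue:
            job = self.queue[0]
            free = [n for n in self.nodes if n.can_host(job)]
            if len(free) < job.nodes:
                break                          # head of the queue must wait
            self.queue.popleft()
            chosen = free[:job.nodes]
            for node in chosen:
                node.jobs.append(job.name)
                node.reserved = job.exclusive
            self.running[job.name] = chosen
            started.append(job.name)
        return started

    def complete(self, name):
        """Release the nodes of a finished job so later jobs can use them."""
        for node in self.running.pop(name):
            node.jobs.remove(name)
            node.reserved = False


if __name__ == "__main__":
    sched = ToyScheduler([Node(f"n{i}") for i in range(4)])
    for job in (Job("A", 2, True), Job("B", 2, False),
                Job("C", 2, False), Job("D", 3, True)):
        sched.submit(job)
    print(sched.schedule())   # A takes two nodes exclusively; B and C share the rest
    for done in ("A", "B", "C"):
        sched.complete(done)
    print(sched.schedule())   # D can start once enough empty nodes exist
```

A strict FIFO queue keeps the model short, at the cost of letting a large job at the head block smaller jobs behind it. Production schedulers use more elaborate policies to avoid idling nodes in that situation.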

Fundamentally, Slurm is a robust cluster manager (focused on what is needed rather than on feature richness) that is highly portable, scalable to large clusters of nodes, fault tolerant, and, most importantly, open source. Slurm began as an open source resource manager developed collaboratively by several organizations, including Lawrence Livermore National Laboratory. Today, Slurm is a leading resource manager used on many of the most powerful supercomputers.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
