Slurm Installation and Configuration
Slurm Introduction
Slurm is a highly scalable cluster manager and job scheduling system for large clusters of compute nodes. Slurm maintains a queue of pending work and manages the overall resource utilization of that work. It distributes jobs to a set of assigned nodes for execution.
Essentially, Slurm is a robust cluster manager that is highly portable, scalable to large node clusters, fault tolerant, and, more importantly, open source.
For the Slurm architecture, refer to http://slurm.schedmd.com/.
Install Slurm
The installation described here uses CentOS 6.5 as an example. Because Slurm is used in a cluster, we assume there are three machines running identical versions of Linux, named mycentos6x, mycentos6x1 and mycentos6x2, with mycentos6x as the control node.
Install Munge
First, Slurm uses Munge for authentication, so we have to install Munge first.
Download the installation package from the Munge website (https://github.com/dun/munge); here the munge-0.5.11.tar.bz2 file is used. Run the following commands as the root user.
Compile and install the munge packages:
# rpmbuild -tb --clean munge-0.5.11.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64
# rpm --install munge*.rpm
During compilation and installation of the RPM packages you may be prompted for some third-party packages, which you can install with "yum install -y xxx". For my first installation I needed the following packages:
# yum install -y rpm-build rpmdevtools bzip2-devel openssl-devel zlib-devel
After the installation is complete, you need to modify the permissions of the following directories:
# chmod -R 700 /etc/munge
# chmod -R 711 /var/lib/munge
# chmod -R 700 /var/log/munge
# chmod -R 0755 /var/run/munge
Also note: check the /etc/munge/munge.key file; its owner and group must be munge, otherwise startup will fail.
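If /etc/munge/munge.key does not exist after installation, you can generate one; create-munge-key is installed with the munge packages (the 400 mode below is one safe choice, since munged refuses to start with a key readable by others):
# /usr/sbin/create-munge-key
# chown munge:munge /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key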
Once the installation is complete, you can start the Munge service.
# /etc/init.d/munge start
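To check that Munge works, generate and decode a credential locally, and, once the key has been copied to the other machines (next step), across nodes as well (assuming ssh access between them):
# munge -n | unmunge
# munge -n | ssh mycentos6x1 unmunge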
Finally, you need to copy /etc/munge/munge.key to the other two machines, make sure the file permissions and owner are the same, and start the munge service on those machines as well.
Install Slurm
First create the slurm user:
# useradd slurm
# passwd slurm
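Note that the slurm user (and its UID) should be identical on every node in the cluster, since Munge authenticates by UID. One way to ensure this (the 1001 values are only an example) is:
# groupadd -g 1001 slurm
# useradd -u 1001 -g slurm slurm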
Download the installation package from the Slurm website (http://slurm.schedmd.com/); here the slurm-14.11.8.tar.bz2 package is used.
Compile and install the Slurm packages:
# rpmbuild -ta --clean slurm-14.11.8.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64
# rpm --install slurm*.rpm
During compilation and installation of the RPM packages I was prompted to install the following packages:
# yum install -y readline-devel pam-devel perl-DBI perl-ExtUtils-MakeMaker
After the installation is complete, change the owner and group of the following directory (it is the default StateSaveLocation in slurm.conf.example, so the slurm user must be able to write to it):
# chown slurm:slurm /var/spool
At this point the installation of Slurm is complete, but it cannot be started yet; we still need to configure it before we can start the Slurm service and submit jobs.
Configure Slurm
Enter the /etc/slurm/ directory, copy the slurm.conf.example file to slurm.conf, and edit /etc/slurm/slurm.conf.
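For example (vi is just one editor choice):
# cd /etc/slurm
# cp slurm.conf.example slurm.conf
# vi slurm.conf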
Here are the settings I changed in my file:
ControlMachine=mycentos6x
ControlAddr=192.168.145.100
SlurmUser=slurm
SelectType=select/cons_res
SelectTypeParameters=CR_Core
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
NodeName=mycentos6x,mycentos6x1,mycentos6x2 CPUs=4 RealMemory=500 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=IDLE
PartitionName=control Nodes=mycentos6x Default=YES MaxTime=INFINITE State=UP
PartitionName=compute Nodes=mycentos6x1,mycentos6x2 Default=NO MaxTime=INFINITE State=UP
Note: This configuration file needs to be deployed to every machine in the cluster.
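For example, push it from the control node with scp (assuming root ssh access to the other machines):
# scp /etc/slurm/slurm.conf mycentos6x1:/etc/slurm/
# scp /etc/slurm/slurm.conf mycentos6x2:/etc/slurm/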
Save the file, and then start the Slurm service with the following command
# /etc/init.d/slurm start
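On CentOS 6 you can also register the init scripts so that both services start automatically at boot; a sketch (the RPMs normally register these scripts already, in which case the --add lines are redundant):
# chkconfig --add munge
# chkconfig munge on
# chkconfig --add slurm
# chkconfig slurm on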
Test
After starting the Slurm service, we can use the following commands to view the cluster status and submit jobs
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
control*     up   infinite      1   idle mycentos6x
compute      up   infinite      2   idle mycentos6x1,mycentos6x2
# scontrol show slurmd
Active Steps             = NONE
Actual CPUs              = 2
Actual Boards            = 1
Actual sockets           = 1
Actual cores             = 2
Actual threads per core  = 1
Actual real memory       = 1464 MB
Actual temp disk space   = 29644 MB
Boot time                = 2015-07-22T09:50:34
Hostname                 = mycentos6x
Last slurmctld msg time  = 2015-07-22T09:50:37
Slurmd PID               = 27755
Slurmd Debug             = 3
Slurmd Logfile           = /var/log/slurmd.log
Version                  = 14.11.8
# scontrol show config
# scontrol show partition
# scontrol show node
# scontrol show job
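If a node later shows a DOWN or DRAINED state in sinfo, it can be returned to service with scontrol update (a standard scontrol subcommand; mycentos6x1 here is just one of the nodes above):
# scontrol update NodeName=mycentos6x1 State=RESUME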
Submit Job
# srun hostname
mycentos6x
# srun -N 3 -l hostname
0:mycentos6x
1:mycentos6x1
2:mycentos6x2
# srun sleep 60 &
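In addition to srun, jobs are usually submitted as batch scripts with sbatch. A minimal sketch, assuming the partitions configured above (the job.sh name and its contents are illustrative, not from the original article):
# cat job.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=compute
#SBATCH --nodes=2
srun hostname
# sbatch job.sh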
Query job
# squeue -a
JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
          debug sleep kongxx  R  0:06     1 mycentos6x
Cancel Job
# scancel <job_id>
Reference:
Slurm: http://slurm.schedmd.com/
Munge: https://github.com/dun/munge
Please credit this address with a link when reprinting.
Original article: http://blog.csdn.net/kongxx/article/details/48173829