A brief introduction to the TORQUE/MPI scheduling environment


One, download the programs and documentation

http://www.clusterresources.com/

Two, TORQUE/Maui

TORQUE is a distributed resource manager that manages batch jobs and the resources of distributed compute nodes. TORQUE is based on OpenPBS.

TORQUE's built-in job scheduler is fairly simple. If you need more sophisticated scheduling, use the Maui scheduler together with TORQUE.

Three, MPI

The Message Passing Interface (MPI) defines a standard for communication and cooperative computation between processes, and is used in distributed computing and cluster environments. Typically, TORQUE manages the computing resources and batch jobs on which MPI programs run.

Four, compile and install TORQUE

1, reference configure parameters are as follows

$ ./configure --enable-docs --enable-mom --enable-server --enable-clients --with-scp

--enable-mom, enable the compute node (MOM) component
--enable-server, enable the PBS management server component
--enable-clients, enable the client tools that connect to the PBS server

For the relationship between MOM, the PBS server, and the clients, please refer to the Administrator manual.

2, $ make clean; make

3, $ sudo make install

This installs the management server, compute node, and client components on the current node. The management server is mainly responsible for job submission and management; administrative operations are normally performed by connecting to it with the client tools. A compute node is a node that actually runs jobs. The management server node can also act as a compute node at the same time.

4, $ make packages

This builds installation packages for the compute nodes, so that TORQUE does not have to be compiled on every compute node. Each package is a self-extracting sh script with compressed data appended; copy it to the compute node with scp and run it there. A compute node only needs the MOM package. If jobs will be submitted from a compute node, install the clients package as well.
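
The distribution step can be sketched as a small shell function. This is only a sketch under assumptions: the package file name torque-package-mom-linux-x86_64.sh depends on the build platform, the host names node1 and node2 are hypothetical, and --install is the install switch the TORQUE documentation describes for these self-extracting packages. The function prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Print the commands needed to copy a TORQUE package to each compute
# node and run it there. Nothing is executed here.
# Assumed names: the package file and the hosts are placeholders.
print_deploy_commands() {
    pkg=$1
    shift
    for host in "$@"; do
        echo "scp $pkg $host:/tmp/"
        echo "ssh $host sudo sh /tmp/$(basename "$pkg") --install"
    done
}

print_deploy_commands torque-package-mom-linux-x86_64.sh node1 node2
```

Once the printed commands look right, they can be piped to sh. The same function works for the clients package on nodes that submit jobs.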

Five, compile and install Maui

1, Maui patch

Maui fails to compile on some platforms with the errors shown below. If that happens, apply the patch that follows them.

MPBSI.c:177: error: conflicting types for get_svrport
/usr/local/include/pbs_ifl.h:684: note: previous declaration of get_svrport is here
MPBSI.c:178: error: conflicting types for openrm
/usr/local/include/pbs_ifl.h:685: note: previous declaration of openrm is here
make[1]: *** [MPBSI.o] Error 1

@@ -174,8 +174,8 @@

 extern int pbs_errno;

-extern int get_svrport(const char *,char *,int);
-extern int openrm(char *,int);
+//extern int get_svrport(const char *,char *,int);
+//extern int openrm(char *,int);
 extern int addreq(int,char *);
 extern int closerm(int);
 extern int pbs_stagein(int,char *,char *,char *);

2, make and install

$ ./configure --with-pbs
$ make
$ sudo make install

Six, start the TORQUE cluster

1, configure the PBS server node list

/var/spool/torque/server_priv/nodes

## This is the TORQUE server "nodes" file.
##
## To add a node, enter its hostname, optional processor count (np=),
## and optional feature names.
##
## Example:
## host01 np=8 featureA featureB
## host02 np=8 featureA featureB
##
## For more information, please visit:
##

host1
host2
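
For more than a handful of nodes, the entries can be generated rather than typed. A minimal sketch, assuming every node has the same processor count; the value 8 and the host names are placeholders, and np= is the optional processor count described in the file header above.

```shell
#!/bin/sh
# Print one nodes-file entry per host in the form "<hostname> np=<count>".
make_nodes_file() {
    np=$1
    shift
    for host in "$@"; do
        echo "$host np=$np"
    done
}

# Prints to stdout; redirect to a scratch file, review it, then copy it
# to /var/spool/torque/server_priv/nodes as root.
make_nodes_file 8 host1 host2
```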

2, initialize the PBS server

Run the torque.setup script in the source tree to initialize the server

$ ./torque.setup root


3, configure the MOM node

/var/spool/torque/mom_priv/config
$pbsserver HOSTNAME # hostname running PBS server
$logevent # Bitmap of which events to log
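
When the same MOM configuration has to be written on many compute nodes, it can be produced by a short function. This is a sketch under assumptions: headnode is a hypothetical server name, and the value 255 for $logevent is taken from the example in the pbs_mom documentation, where it is described as enabling all event types except debug events. Check the value you actually want against the Administrator manual.

```shell
#!/bin/sh
# Print a MOM config for the given PBS server host.
# The format strings are single-quoted so the shell does not try to
# expand $pbsserver and $logevent, which are TORQUE directives and
# not shell variables.
make_mom_config() {
    server=$1
    printf '$pbsserver %s\n' "$server"
    printf '$logevent 255\n'   # assumed value, see the note above
}

make_mom_config headnode
```

Redirect the output to /var/spool/torque/mom_priv/config on each compute node.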

4, start the PBS server

Because TORQUE was compiled on the PBS server node, the service scripts can be found in the torque-4.1.0/contrib/init.d directory of the TORQUE source tree. Add these services to the system; please refer to the Administrator's Manual for the specific steps.

Both the pbs_server and trqauthd services need to be started. trqauthd is used to authenticate client connections to pbs_server.

Start the Maui server

$ /usr/local/maui/sbin/maui

5, start the MOM server on each compute node

$ pbs_mom

6, check node status

$ pbsnodes
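
The full node listing is long, so a one-line summary per node is often easier to scan. The sketch below assumes the usual layout of the listing: an unindented host name line followed by indented "key = value" lines, one of which is "state = ...". The sample input is hypothetical and only imitates that shape.

```shell
#!/bin/sh
# Read a pbsnodes listing on stdin and print "<hostname> <state>" per node.
summarize_nodes() {
    awk '
        /^[^ \t]/ { host = $1 }                       # unindented line is a node name
        $1 == "state" && $2 == "=" { print host, $3 } # indented state line
    '
}

# Hypothetical sample in the shape of a pbsnodes listing.
summarize_nodes <<'EOF'
host1
     state = free
     np = 8

host2
     state = down
     np = 8
EOF
```

On a real cluster the function is used as a filter on the output of pbsnodes. A node that reports down usually means pbs_mom is not running there or cannot reach the server.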

Seven, submit MPI jobs

1, configure the MPI environment on the cluster

2, write the MPI program

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i, sum, n, myid, numprocs;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm mycomm;
    int membershipkey;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stdout, "Process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);

    /* Split the world communicator into up to three groups by rank % 3. */
    membershipkey = myid % 3;

    MPI_Comm_split(MPI_COMM_WORLD, membershipkey, myid, &mycomm);

    /* From here on, myid and numprocs refer to the sub-communicator. */
    MPI_Comm_size(mycomm, &numprocs);
    MPI_Comm_rank(mycomm, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stdout, "After split: process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);

    n = 10000;
    /* World rank 0 is also rank 0 of its sub-communicator. */
    if (myid == 0)
        startwtime = MPI_Wtime();

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    i = n + 10000;

    /* Sum i over all processes; the result lands on world rank 0. */
    MPI_Reduce(&i, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Switch back to the world rank before printing. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("The sum is %d\n", sum);

        printf("Wall clock time = %f\n", endwtime - startwtime);
        fflush(stdout);
    }

    MPI_Comm_free(&mycomm);
    MPI_Finalize();
    return 0;
}

3, write the PBS script

PBS script job.sh

#!/bin/sh
#PBS -o /pbso
#PBS -e /pbse
#PBS -N mpijob
#PBS -l nodes=a+b,walltime=00:01:00
#PBS -q batch
#PBS -m abe
# print the time and date
mpiexec ./summe >> /mpiout


Description

#PBS -o sets the file that receives the program's standard output
#PBS -e sets the file that receives the program's standard error
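
The remaining directives in job.sh follow the standard qsub options: -N sets the job name, -l requests resources (here the nodes and the wall clock limit), -q selects the queue, and -m abe asks for mail when the job aborts, begins, and ends. As a sketch of how these fit together, the function below prints a job script from a few parameters. The output file names derived from the job name and the cd into $PBS_O_WORKDIR (the directory qsub was run from) are additions for illustration and are not part of the original script.

```shell
#!/bin/sh
# Print a PBS job script to stdout.
# Arguments: job name, nodes spec, walltime, then the command line to run.
make_job_script() {
    name=$1 nodes=$2 walltime=$3
    shift 3
    cat <<EOF
#!/bin/sh
#PBS -N $name
#PBS -l nodes=$nodes,walltime=$walltime
#PBS -q batch
#PBS -o $name.out
#PBS -e $name.err
cd "\$PBS_O_WORKDIR"
$*
EOF
}

make_job_script mpijob a+b 00:01:00 mpiexec ./summe
```

Redirect the output to a file such as job.sh and submit that file with qsub.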

4, submit the job

$ qsub ./job.sh

View the job status

$ qstat

5, common error 1: permission error at execution time

The environment is usually initialized with the command below, but the root user cannot be used to submit jobs, so permission errors may occur at execution time
$ ./torque.setup root

Solution: add the submitting user as a manager and operator

$ qmgr

Qmgr: set server managers += user@host
Qmgr: set server operators += user@host
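
The same change can be made without the interactive prompt by passing each directive to qmgr with its -c option. A minimal sketch: the function only prints the two commands, and alice@headnode is a hypothetical user@host value.

```shell
#!/bin/sh
# Print the qmgr commands that add a user as manager and operator.
# The argument is the user@host string. Nothing is executed here.
print_grant_commands() {
    who=$1
    echo "qmgr -c 'set server managers += $who'"
    echo "qmgr -c 'set server operators += $who'"
}

print_grant_commands alice@headnode
```

Run the printed commands on the PBS server as a user who is already a manager, for example the user that torque.setup was run for.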

6, common error 2: no output or error files are generated

This happens when SCP is not configured. Make sure that every pair of nodes, among the management server and the compute nodes, can log in to each other with SSH and copy files without a password prompt. Please use key authentication.
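
One way to verify this before submitting jobs is to try a non-interactive login to every host. The sketch below assumes OpenSSH: BatchMode=yes makes ssh fail instead of prompting for a password, so a missing key shows up as FAIL rather than as a hang. The SSH variable is a hypothetical hook that lets the ssh command be swapped out for testing.

```shell
#!/bin/sh
# Check that this node can reach each host over SSH without a password.
# Prints one "ok" or "FAIL" line per host; returns non-zero if any failed.
check_ssh() {
    failed=0
    for host in "$@"; do
        if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return $failed
}
```

Call it as check_ssh host1 host2, and repeat the check from each node, because output files are copied from the compute node back to the submitting host.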

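7, common error 3: mpiexec not found

This is a shell environment problem: the job does not run with the same PATH as your login shell. A simple workaround is to use the full path to mpiexec in the job script.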
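
The full path can be looked up in a login shell where mpiexec does work, using the POSIX command -v builtin. A small sketch; the messages are illustrative only.

```shell
#!/bin/sh
# Print the path of a command found on PATH; fail if it is not found.
resolve_cmd() {
    command -v "$1"
}

if path=$(resolve_cmd mpiexec); then
    echo "use this in job.sh: $path"
else
    echo "mpiexec is not on PATH; look under the MPI install prefix" >&2
fi
```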

8, view the execution results

$ cat mpiout
The sum is 40000
Wall clock time = 0.000190

9, view the execution logs

pbs_server

/var/spool/torque/server_logs/

MOM node

/var/spool/torque/mom_logs/

Eight, other

1, SSH/SCP login

$ ssh-keygen

$ ssh-copy-id -i .ssh/id_rsa.pub user@10.2.1.2

