One, download the programs and documentation
http://www.clusterresources.com/
Two, Torque/Maui
Torque is a distributed resource manager that manages batch jobs and the resources of distributed compute nodes. Torque is based on OpenPBS.
Torque's built-in job scheduler is fairly simple. If you need a more sophisticated scheduler, use Maui together with Torque.
Three, MPI
The Message Passing Interface (MPI) defines a standard for cooperative communication and computation between computers, and it is used in distributed computing and cluster environments. Typically, Torque manages the computing resources and batch jobs for MPI programs.
Four, compile and install Torque
1, the following configure parameters can be used as a reference
$ ./configure --enable-docs --enable-mom --enable-server --enable-clients --with-scp
--enable-mom, enable the compute node (MOM) component
--enable-server, enable the PBS management server component
--enable-clients, enable the client tools that connect to the PBS server
For the relationship between MOM, the PBS server, and the clients, please refer to the administrator manual
2, $ make clean; make
3, $ sudo make install
This installs the management server, compute node, and client components on the current node. The management server mainly handles job submission and management, and administrative operations are typically performed through a client connected to it. A compute node is a node that actually runs the job's computation. A single node can act as both a management server and a compute node at the same time.
4, $ make packages
Build installation packages for the compute nodes so that Torque does not have to be compiled on each of them. Copy the packages to the compute nodes with scp and run them there; each package is an sh script with the compressed data appended. Compute nodes only need the MOM package. If jobs will be submitted from a compute node, install the clients package as well.
Five, compile and install Maui
1, Maui patch
Maui fails to compile on some platforms with the errors below; in that case, apply the following patch
MPBSI.c:177: error: conflicting types for get_svrport
/usr/local/include/pbs_ifl.h:684: note: previous declaration of get_svrport was here
MPBSI.c:178: error: conflicting types for openrm
/usr/local/include/pbs_ifl.h:685: note: previous declaration of openrm was here
make[1]: *** [MPBSI.o] Error 1
@@ -174,8 +174,8 @@
 extern int pbs_errno;
-extern int get_svrport(const char *,char *,int);
-extern int openrm(char *,int);
+//extern int get_svrport(const char *,char *,int);
+//extern int openrm(char *,int);
 extern int addreq(int,char *);
 extern int closerm(int);
 extern int pbs_stagein(int,char *,char *,char *);
2, Make & Install
$ ./configure --with-pbs
$ make
$ sudo make install
Six, start the Torque cluster
1, configure the PBS server node list
/var/spool/torque/server_priv/nodes
## This is the TORQUE server "nodes" file.
##
## To add a node, enter its hostname, optional processor count (np=),
## and optional feature names.
##
## Example:
##    host01 np=8 featureA featureB
##    host02 np=8 featureA featureB
##
## For more information, please visit:
##
host1
host2
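For clusters with more than a handful of hosts, the node list can be generated instead of typed by hand. The following is a minimal sketch, not part of Torque itself: make_nodes_file is a hypothetical helper name, and np=8 is simply the processor count used in the file's own example comment. It prints one line per host in the "hostname np=N" form described above.

```shell
#!/bin/sh
# make_nodes_file NP HOST...
# Hypothetical helper: print one "hostname np=NP" line per host,
# in the format described by the comments in the nodes file.
make_nodes_file() {
    np=$1
    shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
}

# Print entries for the two hosts listed above (np=8 is an assumption).
make_nodes_file 8 host1 host2
```

Review the output before appending it to /var/spool/torque/server_priv/nodes. As far as I know, pbs_server reads this file when it starts, so a restart is probably needed after editing it.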
2, initialize the PBS server
Run the script in the source tree to initialize the server
$ ./torque.setup root
3, configure the MOM node
/var/spool/torque/mom_priv/config
$pbsserver HOSTNAME # hostname running PBS server
$logevent # Bitmap of which events to log
4, start the PBS server
Because Torque was compiled on the PBS server node, the service scripts can be found in the torque-4.1.0/contrib/init.d directory of the Torque source tree. Add these services to the system; please refer to the administrator manual for the specific steps.
The pbs_server and trqauthd services both need to be started; trqauthd handles authentication for pbs_server.
Start the Maui server
$ /usr/local/maui/sbin/maui
5, start the MOM service on the compute nodes
$ pbs_mom
6, check the node status
$ pbsnodes
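The full pbsnodes listing gets long on larger clusters. The sketch below reduces it to one "node state" line per node. It assumes the usual layout, where each node name starts in the first column and its attributes follow on indented "name = value" lines; summarize_nodes is a hypothetical helper name, not a Torque command.

```shell
#!/bin/sh
# summarize_nodes: read pbsnodes-style text on stdin and print
# "<node> <state>" for every node.
# Assumed input layout: node name in column one, attributes indented
# below it as "name = value".
summarize_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }                         # unindented line: a node name
        $1 == "state" && $2 == "=" { print node, $3 }   # indented state attribute
    '
}

# Typical use on the server node:
#   pbsnodes | summarize_nodes
```

Nodes reported as down or offline are the first thing to check when jobs stay queued.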
Seven, submit MPI jobs
1, configure the MPI environment on the cluster
2, write an MPI program
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i, sum, n, myid, numprocs;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm mycomm;
    int membershipkey;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    fprintf(stdout, "Process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);

    /* Split MPI_COMM_WORLD into groups by rank modulo 3. */
    membershipkey = myid % 3;
    MPI_Comm_split(MPI_COMM_WORLD, membershipkey, myid, &mycomm);
    MPI_Comm_size(mycomm, &numprocs);
    MPI_Comm_rank(mycomm, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    fprintf(stdout, "after split process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);

    n = 10000;
    if (myid == 0)  /* rank 0 within its split group */
        startwtime = MPI_Wtime();
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    i = n + 10000;
    /* Sum the value of i from every process onto rank 0 of MPI_COMM_WORLD. */
    MPI_Reduce(&i, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    /* Switch back to the rank in MPI_COMM_WORLD. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("The sum is %d\n", sum);
        printf("Wall clock time = %f\n", endwtime - startwtime);
        fflush(stdout);
    }
    MPI_Finalize();
    return 0;
}
3, write the PBS script
PBS script job.sh
#!/bin/sh
#PBS -o /pbso
#PBS -e /pbse
#PBS -N mpijob
#PBS -l nodes=a+b,walltime=00:01:00
#PBS -q batch
#PBS -m abe
#print the time and date
mpiexec ./summe >> /mpiout
Description
#PBS -o sets the file that receives the program's standard output
#PBS -e sets the file that receives the program's error output
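When several similar jobs are needed, a script like job.sh can be generated from a template. The following is only a sketch: write_job is a hypothetical helper, the node names a and b and the batch queue are taken from the script above, and the cd into PBS_O_WORKDIR (which Torque sets to the directory qsub was run from) is an addition not present in the original script.

```shell
#!/bin/sh
# write_job NAME PROGRAM OUTFILE
# Hypothetical helper: write a PBS job script for PROGRAM to OUTFILE.
write_job() {
    name=$1
    program=$2
    outfile=$3
    cat > "$outfile" <<EOF
#!/bin/sh
#PBS -N $name
#PBS -l nodes=a+b,walltime=00:01:00
#PBS -q batch
# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
cd "\$PBS_O_WORKDIR" || exit 1
# Print the time and date, then start the MPI program.
date
mpiexec "$program"
EOF
}

# Typical use:
#   write_job mpijob ./summe job.sh
```

Without -o and -e, Torque should place the output and error files in the submission directory under names derived from the job name and job id.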
4, submit the job
$ qsub ./job.sh
View the job status
$ qstat
5, common error 1: permission error at execution time
The environment is typically initialized with the following command, which makes root the administrator. Since jobs cannot be submitted as root, other users may run into permission errors at execution time
$ ./torque.setup root
Solution
$ qmgr
Qmgr: set server managers += user@host
Qmgr: set server operators += user@host
6, common error 2: the job's output and error files are not generated
This happens when scp is not configured. Make sure that any two nodes among the management server and the compute nodes can log in to each other and copy files over SSH; key-based authentication is recommended
7, common error 3: mpiexec not found
This is a shell environment problem. A simple workaround is to use the full path to mpiexec
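One way to find the full path is to resolve it once in a login shell where mpiexec works and then paste the result into the job script. This is a minimal sketch; resolve_cmd is a hypothetical helper built on the POSIX `command -v` builtin.

```shell
#!/bin/sh
# resolve_cmd NAME
# Hypothetical helper: print the path of NAME as found on PATH,
# or print an error and return non-zero if it is not found.
resolve_cmd() {
    command -v "$1" 2>/dev/null || {
        echo "$1 not found on PATH" >&2
        return 1
    }
}

# Typical use in an interactive shell where mpiexec is available:
#   resolve_cmd mpiexec
```

The printed path can then replace the bare mpiexec in job.sh, assuming MPI is installed at the same location on every compute node.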
8, view the execution results
$ cat mpiout
The sum is 40000
Wall clock time = 0.000190
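The reported sum can be checked by hand. Every process computes i = n + 10000 = 20000, and MPI_Reduce with MPI_SUM adds up the contributions of all processes. The sketch below assumes the job ran two processes, one on each of the two requested nodes, which the job script suggests but does not state explicitly; expected_sum is a hypothetical helper name.

```shell
#!/bin/sh
# expected_sum PROCS
# Hypothetical helper: the total that MPI_Reduce should deliver when
# each of PROCS processes contributes n + 10000, with n = 10000.
expected_sum() {
    n=10000
    echo $(( $1 * (n + 10000) ))
}

# Assumption: two processes, one per requested node.
expected_sum 2
```

With two processes this gives 2 * 20000 = 40000, which matches the output above.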
9, view the execution logs
pbs_server
/var/spool/torque/server_logs/
MOM node
/var/spool/torque/mom_logs/
Eight, other
1, SSH/SCP login
$ ssh-keygen
$ ssh-copy-id -i .ssh/id_rsa.pub user@10.2.1.2