Here is a list of frequently asked questions that may answer your questions before you even have to ask them.
- "Using your account"
- "Login Without Using password"
- "Using a job queuing system"
- "File backup"
- "Using the cluster in a courteous way"
"Using your account"
- Login and file transfer
Telnet, rlogin and FTP have been disabled on abacus for security reasons. Use ssh or slogin to log into your account on abacus, and use scp to transfer files between machines. Windows users can use "Secure Shell Client" for login and "Secure File Transfer Client" for file transfer. Windows users who do not have access to "Secure Shell Client" can download a free SSH client called putty.exe from the PuTTY web page. Those who prefer FTP-style file transfer can use sftp instead.
The head node is the login node; its address is abacus.uwaterloo.ca. The following examples show how to log into the head node. Logging into compute nodes directly is not recommended, but users can do so when necessary. The home directories look the same no matter which node users log into.
Example of logging into abacus from another Unix/Linux machine: suppose you are a user on a Unix machine named monolith, you want to log into abacus, your user name on abacus is "foobar" and your password is "Tricky". You do the following (the commands you need to type appear after the prompts):

monolith:~% ssh -l foobar abacus.uwaterloo.ca
foobar@abacus's password: Tricky
[foobar@head ~]$
Example of transferring files between abacus and another Unix/Linux machine: suppose you are a user on the Unix machine monolith, you want to transfer a file named file.txt, located in your home directory on monolith, to abacus; your user name on abacus is "foobar" and your password is "Tricky". You do the following:

monolith:~% scp file.txt foobar@abacus.uwaterloo.ca:
foobar@abacus's password: Tricky
Example of using sftp: suppose you are a user on the Unix machine monolith, you want to transfer files between monolith and abacus, your user name on abacus is "foobar" and your password is "Tricky". You do the following:

monolith:~% sftp foobar@abacus.uwaterloo.ca
foobar@abacus's password: Tricky
sftp>
- Changing the password
First, log into abacus, then issue the command 'passwd'. The system will prompt you for the old (existing) password and ask you to choose a new password. Please follow this guideline in choosing a password:
[foobar@head ~]$ passwd
- Login to compute nodes from a head node
Suppose you have logged into head and now want to log into node035 (i.e., quad32g001). You do:

[foobar@head ~]$ ssh node035
"Login Without Using password"
Users can generate an authentication key to log into abacus from another Unix machine without using a password. The authentication key is different for each machine, so each pair of machines needs to be set up individually. Suppose a user named "foobar" wants to log into abacus from another Unix machine, monolith. Follow these steps:
monolith:~% ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/foobar/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/foobar/.ssh/id_rsa.
Your public key has been saved in /home/foobar/.ssh/id_rsa.pub.
The key fingerprint is:
0c:44:8c:3e:b9:b4:20:e3:83:4b:19:d9:54:cf:65:35 foobar@monolith
Please note: when the system prompts for a passphrase, just press Enter; do not type a passphrase.
monolith:~% cd .ssh
monolith:~/.ssh% scp id_rsa.pub abacus:
On abacus,
[foobar@head ~]$ cd .ssh
If the file authorized_keys does not already exist:

[foobar@head .ssh]$ touch authorized_keys
[foobar@head .ssh]$ cat ~/id_rsa.pub >> authorized_keys

(Using ">>" appends, so this also works safely if authorized_keys already exists.)
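The steps above can be sketched as one script. The key contents below are a placeholder; on abacus the real id_rsa.pub arrives via the scp command shown earlier. Note that sshd refuses keys when permissions on ~/.ssh are too lax, so the chmod lines are worth including:

```shell
# Install a copied public key into authorized_keys (sketch).
# The echo'd key is a placeholder standing in for the real id_rsa.pub.
mkdir -p "$HOME/.ssh"
echo "ssh-rsa AAAAB3-DEMO-KEY foobar@monolith" > "$HOME/id_rsa.pub"  # placeholder
touch "$HOME/.ssh/authorized_keys"
cat "$HOME/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
# sshd ignores keys if these permissions are looser than this:
chmod 700 "$HOME/.ssh"
chmod 600 "$HOME/.ssh/authorized_keys"
```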
Now user foobar can log into abacus from monolith without typing the password:

monolith:~% ssh foobar@abacus
"Using a job queuing system"
Torque/PBS and Maui are installed on abacus for batch processing.
The Portable Batch System, PBS, is a workload management system for Linux clusters. It supplies commands to submit, monitor, and delete jobs. It has the following components.
Job server - also called pbs_server, provides the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job.
Job executor - a daemon (pbs_mom) that actually places the job into execution when it receives a copy of the job from the job server, and returns the job's output to the user.
Job scheduler - a daemon that contains the site's policy controlling which job is run, and where and when it is run. PBS allows each site to create its own scheduler. The Maui scheduler is used on abacus.
Below are the steps needed to run a user job:
- Create a job script containing the PBS options.
- Submit the job script file to PBS.
- Monitor the job.
PBS options
Below are some of the commonly used PBS options in a job script file. The options start with "#PBS".
Option                       Description
========================================================================
#PBS -N myjob                Assigns a job name. The default is the name
                             of the PBS job script.
#PBS -l nodes=4:ppn=2        The number of nodes and processors per node.
#PBS -q queuename            Assigns the queue your job will use.
#PBS -l walltime=01:00:00    The maximum wall-clock time during which
                             this job can run.
#PBS -o mypath/my.out        The path and file name for standard output.
#PBS -e mypath/my.err        The path and file name for standard error.
#PBS -j oe                   Join option that merges the standard error
                             stream with the standard output stream of
                             the job.
#PBS -W stagein=file_list    Copies the file onto the execution host
                             before the job starts.
#PBS -W stageout=file_list   Copies the file from the execution host
                             after the job completes.
#PBS -m b                    Sends mail to the user when the job begins.
#PBS -m e                    Sends mail to the user when the job ends.
#PBS -m a                    Sends mail to the user when the job aborts
                             (with an error).
#PBS -m ba                   Allows a user to have more than one command
                             with the same flag by grouping the messages
                             together on one line; otherwise only the
                             last command gets executed.
#PBS -r n                    Indicates that a job should not rerun if it
                             fails.
#PBS -V                      Exports all environment variables to the job.
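As an illustration, several of the options above can be combined at the top of one script. The job name, resource requests and the payload command here are hypothetical, and PBS treats the #PBS lines as comments when the script runs outside the batch system:

```shell
#!/bin/bash
# Hypothetical job script combining common PBS options (illustrative only).
#PBS -N myjob
#PBS -l nodes=1:ppn=2
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -m ba
#PBS -V
# Everything below the directives is the job's payload; a stand-in
# command is used here so the sketch is self-contained:
echo "myjob started" > myjob.log
```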
Job script example
A job script may consist of PBS directives, comments and executable statements. A PBS directive provides a way of specifying job attributes in addition to the command line options.
For example, a simple job script, named geo1.bash, contains the following lines:
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -V
PBS_O_WORKDIR=/home/huang/temp
myprog='/home/huang/software/nwchem-4.7/bin/linux64_x86_64/nwchem'
myargs='/home/huang/software/TCE-test/geo-0.98.nw'
cd $PBS_O_WORKDIR
$myprog $myargs >& out1
An example of running a job on a specific node contains the following lines:

#!/bin/bash
#PBS -l nodes=node035:ppn=1
#PBS -V
PBS_O_WORKDIR=/home/huang/temp
myprog='/home/huang/software/nwchem-4.7/bin/linux64_x86_64/nwchem'
myargs='/home/huang/software/TCE-test/geo-0.98.nw'
cd $PBS_O_WORKDIR
$myprog $myargs >& out1
Another example: an MPI job script, named geo2.bash, contains the following lines:

#!/bin/bash
#PBS -l nodes=4:ppn=4
#PBS -V
ncpus=16
PBS_O_WORKDIR=/home/huang/temp
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > .machinefile
myprog='/home/huang/software/nwchem-4.7/bin/linux64_x86_64/nwchem_mpi'
myargs='/home/huang/software/TCE-test/geo-0.98.nw'
mpirun='/opt/mpich.pgi/bin/mpirun'
$mpirun -np $ncpus -machinefile .machinefile $myprog $myargs >& out2
The above job script templates should be modified for the needs of the job. You need to change only the contents of the variables PBS_O_WORKDIR, myprog and myargs.
Submitting a job
Use the qsub command to submit the job,
qsub geo2.bash
PBS assigns a job a unique job identifier once it is submitted (e.g. 70.head). This job identifier is used to monitor the status of the job later. After a job has been queued, it is selected for execution based on the time it has been in the queue, its wall-clock time limit, and the number of processors it requests.
Monitoring a job
Below are the PBS commands for monitoring a job:
Command        Function
================================================================
qstat -a       Check the status of jobs, queues, and the PBS server.
qstat -f       Get all the information about a job, i.e. resources
               requested, resource limits, owner, source, destination,
               queue, etc.
qdel jobid     Delete a job from the queue.
qhold jobid    Hold a job if it is in the queue.
qrls jobid     Release a job from hold.
There are some quite useful Maui commands for monitoring a job, too:
Command           Description
================================================================
showq             Show a detailed list of submitted jobs.
showbf            Show the free resources (time and processors)
                  available at the moment.
checkjob jobid    Show a detailed description of the job jobid.
showstart jobid   Give an estimate of the expected start time of
                  the job jobid.
For example, to check the status of a job:

qstat -f 70.head

or

checkjob 70.head
"File backup"
File systems on the head node are backed up to tape drives once a week. An incremental backup of the /home file system to another Linux machine is done daily. Users are also encouraged to back up their files to another system or to removable media themselves for safety. For example, to copy files over to another Unix/Linux machine, users can use the rsync or scp commands. To copy files over to their PCs, users can use the 'SSH Secure File Transfer Client'.
"Using the cluster in a courteous way"
You might wonder why your jobs sometimes run slowly. There are numerous possible explanations for abacus's performance, but the system load and the NFS file system are the two common causes.
- High system load.
Abacus has 37 nodes: 33 are dual-CPU systems and 4 are quad-CPU systems. On an individual node, if more than 2 jobs are running on a dual system, or more than 4 on a quad system, each job is effectively assigned only part of a CPU for computation. Therefore, users are recommended to submit jobs through the job queuing system rather than logging into a compute node and running jobs there directly. The queuing system balances the load among the nodes automatically.
- I/O intensive jobs.
User home directories are mounted using the NFS file system. No matter which node a user's job runs on, file reading and writing to the /home file system take place on the head node via the NFS mount. Running jobs can be slowed down significantly if many of them are I/O intensive, since these jobs all need to access files on the head node simultaneously. Therefore, users are required to use the scratch space local to the compute nodes for intermediate files created by their running programs.
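The scratch-space pattern described above can be sketched as follows: do intermediate I/O in node-local scratch, and copy only the final result back to the NFS-mounted home directory. Here /tmp stands in for the node-local scratch area; the actual scratch path on abacus may differ:

```shell
# Work in node-local scratch, keep only the final result on NFS (sketch).
SCRATCH="/tmp/scratch.$$"        # per-job scratch directory (illustrative)
mkdir -p "$SCRATCH"
cd "$SCRATCH"
echo "intermediate data" > work.tmp   # stand-in for a program's temp file
cp work.tmp "$HOME/final.out"         # copy the final result to /home
cd "$HOME"
rm -rf "$SCRATCH"                     # clean up scratch when the job ends
```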
Briefly put, users should use the cluster in a courteous way and shouldn't run too many jobs at one time.