Basic HPC Usage
CONTENTS
- Introduction
- Applications for Windows
- Accessing the HPC
- Modules and Environments
- SLURM
- Activity
- Conclusion
- Notes
Introduction
A high-performance cluster (HPC) is a network of servers pooled together to maximize their computational capabilities for specific purposes, often computationally intensive work such as simulations and modeling. Users should treat the HPC as an extension of their personal computers that provides the extra computational power needed for research. However, users must exercise prudence when using any HPC: wrong input will produce wrong output.
For novice or first-time HPC users, the COARE Team prepared this basic HPC module for guidance on how to jump-start your HPC journey prior to running actual jobs. This module focuses on the overview and practicalities of using the Saliksik HPC.
An important reminder for all users: to use any HPC effectively, some basic knowledge of the Linux command-line interface (terminal) is needed. Here are some useful online references that cover the basics of the Linux terminal:
- https://opensource.com/article/21/8/linux-terminal
- https://ubuntu.com/tutorials/command-line-for-beginners
Of course, the Saliksik HPC also has its share of limitations, such as:
- Non-Linux programs are currently not supported.
- There is very limited support for applications that need graphical user interfaces (GUI). HPCs in general are optimized mostly for terminal applications.
- Jobs are limited to the availability of physical resources upon submission.
After this module, the users are expected to be able to:
- Log in to their HPC accounts;
- Perform file and folder transfers to (upload) and from (download) the HPC;
- Use environment modules;
- Manage their Anaconda environments and packages;
- Create SLURM job scripts; and
- Run and manage their SLURM jobs.
Applications for Windows
For this module, Windows OS users who opt not to use the built-in PowerShell application should install PuTTY and WinSCP for logging in and file transfers, respectively. The usage of these applications is discussed later.
Accessing the HPC
The HPC can only be accessed using passwordless SSH, so SSH key(s) need to be appended to the user's account. Every user is responsible for their own account. Account sharing is strictly prohibited as outlined in the COARE Acceptable Use Policy (AUP). This module assumes that the user hasn't logged in yet to their account.
Generating SSH Key Pairs
The SSH key pair consists of a private and public key. The public key is the one appended to the user's account in the HPC which is used to confirm the private key stored in the user's personal computer every time the user logs in to the HPC.
Windows OS users should follow only one of the two methods (Terminal or Graphical), since a key generated with one method will not work with the other. For example, an SSH key generated using ssh-keygen in the terminal (discussed below) cannot be used to log in with the graphical program PuTTY.
Terminal
To generate an SSH key, open your computer's terminal and use this command:
$ ssh-keygen -t rsa
Subsequent prompts for input from the user will be displayed:
Simply pressing the Enter key without typing anything at a prompt will use the default option indicated in parentheses. Users may opt to put a passphrase (password) on the key pair; however, it is more convenient not to, as the passphrase will be asked for every time the user logs in.
By default, the key pair will be stored in the ${HOME}/.ssh folder, where the private and public keys are called id_rsa and id_rsa.pub, respectively. The $HOME folder will look like /home/username for Linux/Unix, /Users/username for MacOS, and C:\Users\username for Windows. For all the said platforms, the shorthand for the $HOME folder is ~/ (tilde symbol).
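As a minimal sketch of the key generation, assuming OpenSSH's `ssh-keygen` is available locally: the commands below create an RSA key pair non-interactively in a throwaway folder, so that an existing `~/.ssh/id_rsa` is not overwritten. The file name `demo_id_rsa` and the comment `user@my-computer` are hypothetical, and `-N ""` sets an empty passphrase.

```shell
# Throwaway folder so that an existing ~/.ssh/id_rsa is left untouched.
keydir=$(mktemp -d)

# -t rsa: key type; -N "": empty passphrase; -C: optional comment;
# -f: output file (hypothetical name); -q: suppress the progress output.
ssh-keygen -q -t rsa -N "" -C "user@my-computer" -f "${keydir}/demo_id_rsa"

# The private key has no extension; the public key ends in .pub.
ls -l "${keydir}"

# Print the fingerprint of the public key.
ssh-keygen -l -f "${keydir}/demo_id_rsa.pub"

# Re-derive the OpenSSH-formatted public key from the private key.
ssh-keygen -y -f "${keydir}/demo_id_rsa"
```

For a real key, dropping `-f` and `-N` makes `ssh-keygen` ask for the location and passphrase interactively.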
The private key SHOULD ONLY be accessible to the user for security reasons as this can be used by another person to log in to your account. Here is what a sample RSA-type SSH private key looks like:
And here is its corresponding public key:
The public key will be appended to the user's account so they can log in to the HPC. Make sure it is in the OpenSSH format: ssh-rsa <some_long_str_here> <comment>, where <comment> is optional and usually takes the form <user>@<computer-name>.
Graphical
Open the PuTTYgen application, which comes with the PuTTY installation. The default parameters at the bottom, i.e., RSA key type and 2048 bits, are already good to use:
Click on the Generate button, then move the mouse pointer randomly over the blank area to generate the key:
The PuTTYgen interface will look like this after a key has been successfully generated:
Click the Save private key button to save the private key in your computer. The public key, which will be appended to your account, is inside the box labeled Public key for pasting into OpenSSH authorized_keys file. Send your public key to the COARE Team to have it appended to your account. If you need the OpenSSH-formatted public key from a previously created private key, simply click the Load button and locate the private key.
Logging In
The user may log in to the HPC after the COARE Team has appended the public key to the user's account.
Windows OS users should continue with the same method (Terminal or Graphical) that was used to generate the SSH key, since a key generated with one method will not work with the other (as previously explained). Likewise, settings configured in one method do not carry over to the other. For example, an SSH configuration file (discussed below) has no effect when logging in with PuTTY.
Terminal
Interactive Command
To log in to the HPC, use this command in your local machine's terminal:
$ ssh [-v] -i </path/to/ssh/priv/key> <username>@saliksik.asti.dost.gov.ph
The -i option specifies the path to your private key, which is ~/.ssh/id_rsa by default. To print more verbose messages with this command, add the -v option; more v's increase the verbosity (i.e., -vv and -vvv), but a single -v should suffice. The front end (login) node has the public domain name saliksik.asti.dost.gov.ph or IP address 202.90.149.55. After successfully logging in, the HPC's welcome page will be displayed:
SSH Configuration File
The SSH parameters can be saved into a configuration file for convenient log in every time. This will also come in handy later on when downloading and uploading files.
In Linux and MacOS terminals, use vim, nano, or any text editor. If the file doesn't exist, it will automatically be created upon saving:
$ nano ~/.ssh/config
For Windows PowerShell, a blank file should be created prior to editing with notepad.exe, because notepad.exe automatically adds a .txt filename extension, which would make the config file unusable:
> notepad.exe ~/.ssh/config
Here is a sample SSH configuration file:
Host          saliksik
User          username
Hostname      saliksik.asti.dost.gov.ph
IdentityFile  ~/.ssh/id_rsa
The column spacing above is optional and is only there for better readability, so a single space between the keyword and its value on each line will suffice. The value set for Host (in this case, saliksik) can now be used to shorten the full SSH login command into:
$ ssh saliksik
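To check what a `Host` entry resolves to without actually connecting, OpenSSH's `ssh -G` prints the effective configuration and exits. The sketch below writes the sample configuration to a throwaway file and passes it with `-F`, so that an existing `~/.ssh/config` is not touched; `username` is a placeholder, and a reasonably recent OpenSSH client is assumed.

```shell
cfgdir=$(mktemp -d)

# Same entries as the sample configuration, written to a throwaway file.
cat > "${cfgdir}/config" <<'EOF'
Host saliksik
    User username
    Hostname saliksik.asti.dost.gov.ph
    IdentityFile ~/.ssh/id_rsa
EOF

# -F: use this configuration file; -G: print the resolved options and exit.
resolved=$(ssh -F "${cfgdir}/config" -G saliksik)

# Show only the three options set in the file.
echo "${resolved}" | grep -E '^(user|hostname|identityfile) '
```

If the printed user, host name, or key path is not what was intended, the entry can be corrected before attempting a real login.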
Graphical
To log in using PuTTY, the minimum parameters needed are the username, hostname, and private key generated by PuTTYgen. Under the Session tab (the default tab), in the Host Name (or IP address) box, key in username@saliksik.asti.dost.gov.ph (or username@202.90.149.55):
Then, go to the Connection > SSH > Auth tab, and locate the private key previously created by PuTTYgen in the Private key file for authentication box:
To save the parameters, go back to the Session tab, put a name (such as saliksik) in the Saved Sessions box, and click Save. It should be added below Default Settings. In the future, to use the saved session settings, click on the name of the saved session, then click the Load button to load the saved parameters.
Finally, click Open at the bottom portion of the window to log in to the HPC. The following security alert might appear:
If logging in using PuTTY for the very first time, then this is normal as the server's host key is not yet recognized by PuTTY. However, if the server's host key has already been previously cached yet the alert still appeared, then kindly inform the COARE Team as this may be a security concern.
Users are also encouraged to explore the other settings of PuTTY, such as the terminal size, font size and color, etc.
HPC Layout
The Saliksik HPC is composed of the following nodes (servers):
- Front end (login)
  - This is where users log in to the HPC. DO NOT run jobs here. Use the debug nodes instead (will be explained later). Violators will be subject to the COARE AUP.
- Compute nodes
  - CPU nodes x 36. Every node has:
    - 88 logical CPUs (86 usable)
    - 500 GB RAM
  - GPU nodes
    - P40 nodes x 6. Every node has:
      - 24 CPUs (22 usable)
      - 1 TB RAM
      - NVIDIA Tesla P40 GPU x 1
    - A100 nodes x 2. Every node has:
      - 128 CPUs (126 usable)
      - 1 TB RAM
      - NVIDIA Tesla A100 GPU x 8
Storage Quotas
Each user has the following default storage quotas:
- Home (/home/username): 100 GB
- Scratch folders (/scratch[1-3]/username, symlinked to /home/username/scratch[1-3]): 5 TB for each scratch folder
The Saliksik HPC is regularly undergoing maintenance and streamlining operations, so this may change in the future with prior notice to users.
The home folder is intended for long-term data storage, while the scratch folders are for heavy input and output (I/O) file operations when running jobs. The scratch folders are also significantly faster than the home folder for read and write operations, so jobs should only be performed using the scratch folders and users are prohibited from running their jobs in their home folders. Please refer to the COARE AUP for more info.
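The relationship between a scratch folder's actual path and its symlink in the home folder can be inspected with `readlink`. The sketch below reproduces the layout in a throwaway folder, since the real `/scratch[1-3]` paths exist only on the HPC; every folder under `${layout}` is a stand-in for illustration.

```shell
layout=$(mktemp -d)

# Stand-ins for /scratch3/username and /home/username.
mkdir -p "${layout}/scratch3/username" "${layout}/home/username"
ln -s "${layout}/scratch3/username" "${layout}/home/username/scratch3"

# The entry in the home folder is only a link ...
ls -l "${layout}/home/username"

# ... and readlink prints the path it points to.
target=$(readlink "${layout}/home/username/scratch3")
echo "scratch3 -> ${target}"
```

On the HPC itself, the equivalent check would be `readlink ~/scratch3`.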
Uploading and Downloading Files
Terminal
Remote file transfers via the terminal are done using scp or rsync. All of the commands listed here should be run on the local computer for both upload and download operations.
Using scp
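On your computer, upload files and/or folders with scp using the following command:
$ scp [-r] [-v] -i </path/to/ssh/priv/key> </source/path/in/local/machine> <username>@saliksik.asti.dost.gov.ph:</dest/path/in/server>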
The scp options -r and -v are for recursive (entire folders) transfers and verbose output, respectively. The -i </path/to/ssh/priv/key> option specifies the private SSH key file to use. If an SSH configuration file was created (for example, Host is set as saliksik), the command can be shortened into:
$ scp [-r] [-v] </source/path/in/local/machine> saliksik:</dest/path/in/server>
Downloading with scp follows the same principle as above with one modification: the source and destination are switched. To download, use either the long or shortened (if there is an SSH configuration file) version of the command:
$ scp [-r] [-v] <host>:</source/path/in/server> </dest/path/in/local/machine>
More information about scp can be found in its manual pages:
$ man scp
Using rsync
The advantage of rsync over scp is that the former only updates the files in the destination that differ from the source, so when rsync detects no difference between the source and destination files, the transfer finishes immediately. With rsync, an interrupted transfer can be continued later without transferring everything from the start, whereas scp overwrites the destination file even if it is exactly the same as the source.
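On your computer, upload files with rsync using the following command:
$ rsync [-avhP] -e "ssh -i </path/to/ssh/priv/key>" </source/path/in/local/machine> <username>@saliksik.asti.dost.gov.ph:</dest/path/in/server>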
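The rsync options -a, -v, -h, and -P are for archive mode, verbose output, human-readable output, and keeping partially transferred files while showing progress, respectively. As with scp, the private SSH key file is specified with the -i option of ssh, but here it is passed through the -e option of rsync (e.g., -e "ssh -i </path/to/ssh/priv/key>"), because rsync's own -i option means itemized changes instead. If an SSH configuration file is set (for example, Host is also set as saliksik), the command can be shortened to:
$ rsync [-avhP] </source/path/in/local/machine> saliksik:</dest/path/in/server>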
To download files and/or folders, use either the long or shortened (again, if there is an SSH configuration file) version of the command:
$ rsync [-avhP] <host>:</source/path/in/server> </dest/path/in/local/machine>
For more information about rsync and its options, refer to its manual pages:
$ man rsync
Graphical
One non-terminal option to transfer files and folders to and from the HPC for Windows users is WinSCP.
To log in to the HPC, enter the following parameters in its interface:
- File protocol: SFTP
- Host name: saliksik.asti.dost.gov.ph (or 202.90.149.55)
- Port number: 22
- User name: (your HPC username)
- Password: (leave as blank)
Then, click the Advanced... button, which will bring up the Advanced Site Settings window:
Navigate to the SSH > Authentication tab and locate the private SSH key file generated using PuTTYgen:
Click OK to go back to the login interface. After configuring the login, click the Login button to connect to the HPC. Upon successful login, WinSCP will show the Commander interface, where local and remote files are shown on the left and right portions, respectively:
Uploading and downloading files to and from the HPC is as simple as "drag and drop" using WinSCP. Users are encouraged to explore the other settings and features of WinSCP, such as displaying hidden files (with dot prefixes, e.g., .bashrc).
Modules and Environments
Modules allow program installations with different versions to be used without them interfering with each other, thus effectively keeping each version in a sandboxed environment. In other words, modules allow programs to be used in isolation from others which avoids possible incompatibilities and inconsistencies. However, it should be noted that the COARE Team is gradually doing away with modules in favor of Anaconda environments, but modules are still used for programs that are not available in the Anaconda repository (anaconda.org).
Module Commands
Modules have the format <module_name>/<version>, for example: anaconda/3-2023.07-2.
List Available Modules
$ module avail [<module1> [<module2> ...]]
Without any argument, this command lists all available versions of all installed modules. When one or more module names are provided, the available versions of those modules are listed.
For example, running module avail without additional arguments will print a list of modules like the following, which is not exhaustive as it is constantly being updated:
On the other hand, when using the command module avail gromacs, for example, the available versions of the gromacs module are listed:
Load module(s)
$ module load <module1>[/<version1>] [<module2>[/<version2>] ...]
List loaded module(s)
$ module list
Reload currently loaded module(s)
Unload module(s)
$ module unload <module1> [<module2> ...]
Unload all loaded modules
$ module purge
Anaconda
Anaconda is a package and environment manager written primarily in Python. Its official website is anaconda.org.
Configure conda and mamba
Anaconda's default package manager is conda, although in practice mamba is better to use because it is much more efficient and its warning and error messages are more intuitive. However, it is still a good idea to be able to use both.
Initialize conda and mamba
As of writing, the latest Anaconda module is anaconda/3-2023.07-2. With previous Anaconda module versions, running conda activate would raise an error saying that the ~/.bashrc script had not yet been initialized. Loading the current module automatically initializes conda and mamba, so there is no need to modify your ~/.bashrc script:
$ module load anaconda/3-2023.07-2
$ conda activate # or mamba activate; activate base env
Change the Default Locations
The default locations for the conda environments and packages are ~/.conda/envs and ~/.conda/pkgs, respectively. The environments folder is the path prefix where environments are created (e.g., an environment named test will be created in ~/.conda/envs/test by default), while the packages folder is where the installers are downloaded and cached. Both folders are stored in the user's home folder by default. However, as previously explained, the home folder is significantly slower than the scratch folders. In addition, the storage quota for the home folder is significantly smaller than that of the scratch folders. Thus, to maximize job performance later on, the default paths for both folders will be changed to one of the scratch folders. The configuration set here will also be used by mamba.
To do this, use the following commands:
# (e.g., /scratch3/trainee/conda/envs)
$ conda config --add envs_dirs /<scratch>/<username>/conda/envs
$ conda config --add pkgs_dirs /<scratch>/<username>/conda/pkgs
If these paths don't exist, conda will automatically create them during package download or environment creation. To confirm the configuration, check that the conda configuration file (~/.condarc) has been modified; it should look like this (YAML format):
envs_dirs:
- /scratch3/username/conda/envs
pkgs_dirs:
- /scratch3/username/conda/pkgs
The ~/.condarc file may also be created and/or modified directly without running the above commands. Additional paths may be supplied to the envs_dirs and pkgs_dirs parameters, which is useful when, say, the first path becomes full or the user has no permission to write to it, so that the next path is used, and so on. The paths point directly to the scratch folder instead of to its counterpart in the user's home folder (/home/username/scratch[1-3]) because the former is the actual path of the scratch folder, while the latter is only a symlink to it.
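As a sketch of such a list of paths, the commands below write a `.condarc`-style file with two entries per parameter to a throwaway location (not to the real `~/.condarc`). The second entry under each key, on `/scratch2`, is a hypothetical additional path chosen only for illustration.

```shell
rcdir=$(mktemp -d)

# Two paths per parameter; the /scratch2 entries are hypothetical.
cat > "${rcdir}/condarc" <<'EOF'
envs_dirs:
  - /scratch3/username/conda/envs
  - /scratch2/username/conda/envs
pkgs_dirs:
  - /scratch3/username/conda/pkgs
  - /scratch2/username/conda/pkgs
EOF

# Count the configured paths (list items start with a dash).
n_paths=$(grep -c '^ *- ' "${rcdir}/condarc")
echo "configured paths: ${n_paths}"
```

Once the contents look right, the same lines can be placed in `~/.condarc`.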
Manage Environments
Create Environments
Default Way
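To create an Anaconda environment, simply use the following command template:
$ mamba create [-y] -n <env_name> <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] ...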
The -y argument is optional and tells mamba to assume that "yes" is the answer to all its questions. However, -y is required when creating the environment via SLURM, because the job would otherwise fail on interactive questions that cannot be answered. If the version and build of the package(s) are not defined, then the latest available will be installed.
For example, to create an environment named myenv containing the package hmmer from the bioconda channel (https://anaconda.org/bioconda/hmmer):
$ mamba create [-y] -n myenv -c bioconda hmmer
Using Multiple Channels and Packages
Of course, multiple channels and packages may be used, such as hmmer from the bioconda channel (https://anaconda.org/bioconda/hmmer) and sqsgenerator from the conda-forge channel (https://anaconda.org/conda-forge/sqsgenerator):
$ mamba create [-y] -n myenv -c bioconda -c conda-forge hmmer sqsgenerator
List Environments
To list the environments visible to the user, use the following command template:
$ conda env list
Activate an Environment
To activate an environment, use the following command template:
$ conda activate <env_name | env_path>
Remove an Environment
To remove an environment:
$ conda env remove [-y] -n <env_name>
Manage Packages
Install and Remove Packages
Default Way
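To install packages into an existing environment in a single line, use the following command template:
$ mamba install [-y] -n <env_name> <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] ...
To remove packages:
$ mamba remove [-y] -n <env_name> <package1> ...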
The above commands may also be done by activating the environment first prior to package installation or removal:
$ conda activate <env_name | env_path>
$ mamba install [-y] <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] ...
$ mamba remove [-y] <package1>[=<version1>=<build1>] ...
Specific Package Version and Build
A package may have different versions and builds available. For example, the pytorch package in the pytorch channel (https://anaconda.org/pytorch/pytorch) has multiple versions available and each version has multiple builds:
In the above screenshot, the linux-64 architecture offers multiple builds for version 1.11.0, namely: py3.10_cuda11.1_cudnn8.0.5_0, py3.10_cuda11.3_cudnn8.2.0_0, py3.10_cuda11.5_cudnn8.3.2_0, and py3.7_cpu_0. The other newer versions have multiple builds as well. The build for each package may be inferred from the name or accessed by pressing the icon, for example:
In the above example, pytorch may be installed by simply specifying the version, like so:
$ mamba create [-y] -n myenv -c pytorch pytorch=1.11.0
However, there may be instances where you need to install the CUDA-enabled (GPU) build but the latest build is CPU-only, so the above command would install the CPU build of pytorch version 1.11.0. To install the CUDA-enabled build, for example, py3.10_cuda11.1_cudnn8.0.5_0, use the command below (hint: this should be submitted to a GPU-capable node such as those in the gpu partition):
$ mamba create [-y] -n myenv -c pytorch pytorch=1.11.0=py3.10_cuda11.1_cudnn8.0.5_0
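The `<package>=<version>=<build>` specification used above is three fields separated by `=`. A small sketch, using only POSIX shell parameter expansion, that splits the specification from the example so that the Python and CUDA parts of the build string are easier to see. The check for `cuda` or `cpu` in the name is an assumption based on the build names listed earlier, not a general rule.

```shell
spec="pytorch=1.11.0=py3.10_cuda11.1_cudnn8.0.5_0"

# Split on "=" into package name, version, and build string.
pkg_name=${spec%%=*}      # text before the first "="
rest=${spec#*=}           # text after the first "="
pkg_version=${rest%%=*}   # text before the next "="
pkg_build=${rest#*=}      # remainder is the build string

echo "package : ${pkg_name}"
echo "version : ${pkg_version}"
echo "build   : ${pkg_build}"

# Classify the build from its name (naming convention assumed from the list above).
case "${pkg_build}" in
  *cuda*) echo "GPU (CUDA) build" ;;
  *cpu*)  echo "CPU-only build" ;;
  *)      echo "build type not encoded in the name" ;;
esac
```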
List Installed Packages
This operation may be done interactively, so there is no need to submit it via SLURM. There are two ways to list the packages installed in an environment:
Activate the environment, then list the packages:
$ conda activate <env_name | env_path>
$ <conda|mamba> list
Or, list the packages directly:
$ <conda|mamba> list -n <env_name>
SLURM
SLURM is the job and resource manager used in the HPC. Its official online documentation is at https://slurm.schedmd.com/documentation.html.
Partitions and Quality-of-Service (QOS)
The compute nodes previously listed are grouped into partitions and each partition has its default QOS. The default partition is debug. For all QOSes, the maximum number of concurrently running jobs is 30, while the maximum number of submitted jobs is 45.
Partition | Nodes | QOS | Limits | Remarks |
---|---|---|---|---|
debug | saliksik-cpu-[21-22] | debug_default | 86 CPUs, 1 day run time | |
batch | saliksik-cpu-[01-20,25-36] | batch_default | 86 CPUs, 7 days run time | |
serial | saliksik-cpu-[23-24] | serial_default | 86 CPUs, 14 days run time | |
gpu | saliksik-gpu-[01-06] | gpu-p40_default | 12 CPUs, 1 GPU, 3 days run time | To use the GPU, use either the |
gpu_a100 | saliksik-gpu-[09-10] | | | currently for limited access only |
Job Parameters
Required Parameters
These are the job parameters that are required prior to running any job:
- --account: (string) group account where job quotas are set;
- --partition: (string) which partition the job will be submitted to;
- --qos: (string) the appropriate QOS in the partition;
- --nodes: (integer) number of nodes to request;
- --ntasks: (integer) total number of CPUs to request;
- --output: (string) job log file
Optional Parameters
On the other hand, these are some of the optional job parameters:
- --ntasks-per-node: (integer) specify the number of CPUs per node to be requested (must not contradict --ntasks if also specified);
- --mem: (string) memory per node (e.g., 1G, 500K, 4GB, etc.);
- --job-name: (string) name for the job; will be displayed in job monitoring commands (as discussed later);
- --error: (string) job error file; recommended to not define this parameter and use only --output instead;
- --requeue: (no arg) make job eligible for requeue;
- --mail-type: (string) send an email to the user when the job is in the specified status, such as NONE, BEGIN, END, FAIL, REQUEUE, ALL, etc. (see sbatch manual for more info);
- --mail-user: (string) user's email address;
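One reading of the requirement that `--ntasks-per-node` must not contradict `--ntasks` is that the requested total has to fit within the number of nodes times the tasks per node. A minimal arithmetic sketch under that assumption; the values (2 nodes, 43 tasks per node, 86 tasks in total) are illustrative only and not a recommendation.

```shell
# Illustrative request; these would correspond to --nodes,
# --ntasks-per-node, and --ntasks in a job script.
nodes=2
ntasks_per_node=43
ntasks=86

# Largest number of tasks that fits on the requested nodes.
capacity=$((nodes * ntasks_per_node))

if [ "${ntasks}" -le "${capacity}" ]; then
  echo "consistent: ${ntasks} tasks fit in ${nodes} x ${ntasks_per_node} = ${capacity}"
else
  echo "inconsistent: ${ntasks} tasks exceed ${nodes} x ${ntasks_per_node} = ${capacity}"
fi
```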
For other parameters or more info regarding the above listed parameters, see the sbatch manual using the following command or go to the online manual.
$ man sbatch
Job Script
A job script is submitted to allocate resources for a job. The previously discussed job parameters and the commands to be used to run the job are placed here.
Here is a sample job script where comments have been included to describe what each block does:
#!/bin/bash
#SBATCH --account=<slurm_group_acct>
#SBATCH --partition=<partition>
#SBATCH --qos=<qos>
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks=<num_cpus>
#SBATCH --job-name="<jobname>"
#SBATCH --output="%x.out" ## <jobname>.out
##SBATCH --mail-type=ALL ## optional
##SBATCH --mail-user=<email_add> ## optional
##SBATCH --requeue ## optional
##SBATCH --ntasks-per-node=1 ## optional
##SBATCH --mem=24G ## optional: mem per node
##SBATCH --error="%x.%j.err" ## optional; better to use --output only
## For more `sbatch` options, use `man sbatch` in the HPC, or go to https://slurm.schedmd.com/sbatch.html.
## Set stack size to unlimited.
ulimit -s unlimited
## Benchmarking.
start_time=$(date +%s.%N)
## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE : ${SLURM_MEM_PER_NODE}"
## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
custom_tmp="yes"
if [[ $custom_tmp == "yes" ]]; then
    JOB_TMPDIR=/tmp/${USER}/SLURM_JOB_ID/${SLURM_JOB_ID}
    mkdir -p "${JOB_TMPDIR}"
    export TMPDIR="${JOB_TMPDIR}"
    echo "TMPDIR : ${TMPDIR}"
fi
## Reset modules.
module purge
module load <module1> [<module2> ...]
## Main job. Run your codes and executables here; `srun` is optional.
[srun] /path/to/exe1 <arg1> ...
[srun] /path/to/exe2 <arg2> ...
## Flush the TMPDIR.
if [[ $custom_tmp == "yes" ]]; then
    rm -rf "${TMPDIR}"
    echo "Cleared the TMPDIR (${TMPDIR})"
fi
## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"
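The temporary-folder handling in the script can also be written with a shell `trap`, so that the folder is removed even when a command in the main job fails before the cleanup block is reached. This is an alternative sketch, not part of the template above. Outside SLURM it falls back to a job ID of `demo` and to the user name from `id -un`, an assumption made only so that it can be tried on any machine.

```shell
# Outside SLURM these fall back to placeholder values.
job_id="${SLURM_JOB_ID:-demo}"
job_user="${USER:-$(id -un)}"

# Same folder layout as in the job script template.
JOB_TMPDIR="/tmp/${job_user}/SLURM_JOB_ID/${job_id}"
mkdir -p "${JOB_TMPDIR}"
export TMPDIR="${JOB_TMPDIR}"
echo "TMPDIR : ${TMPDIR}"

# Remove the folder whenever the script exits, whether it succeeded or not.
cleanup() {
  rm -rf "${JOB_TMPDIR}"
  echo "Cleared the TMPDIR (${JOB_TMPDIR})"
}
trap cleanup EXIT

# The main job would go here; a small file stands in for real work.
echo "intermediate data" > "${TMPDIR}/scratch.dat"
ls "${TMPDIR}"
```

With this pattern the separate cleanup block at the end of the script is no longer needed, since the `EXIT` trap runs on both normal and failed exits.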
Job Management
Submit Job Script
It is recommended to submit the job inside the folder containing the job script. It is also recommended that any and all input and/or output files be within the same folder where the job script is located. This is to avoid changing working directories, which may result in confusion and possible errors in accessing files/folders. For example, if the job folder is at /home/username/scratch3/test-job, where all the necessary input files are stored together with the job script named job.sbatch:
$ sbatch job.sbatch
Show Job Queue
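$ squeue [-u <username>]
If no argument is passed, all jobs in the queue will be displayed.
Show Job Parameters
$ scontrol show job <job_id>
Check Node and/or Partition Status
$ sinfo [-p <partition>]
Cancel Job(s)
$ scancel <job_id1> [<job_id2> ...]
You may only cancel jobs created under your account.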
Activity
Test your knowledge and skills acquired from this module by performing the following tasks.
Create Environment
For your first task, create an Anaconda environment via a SLURM job. The environment should have the following specifications:
- Name: mytestenv
- Channels: conda-forge
- Packages: openmpi-mpicc version 4.1.6
Install Additional Packages
Into the environment created above, install the following packages via another SLURM job:
- Channels: conda-forge, pytorch
- Packages:
  - gromacs version 2023.3 build mpi_openmpi_dblprec_hecbbb8f_0
  - pytorch-cuda version 11.8
Compile and Execute Code
Create a file (in any of your scratch folders) containing the following sample source code [1]:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}
Name the file mpi_hello_world.c. It will be compiled and executed using the mpicc and mpiexec executables, respectively, which were installed during the creation of the mytestenv environment. The following job script named mpi_hello_world.sbatch will be used:
#!/bin/bash
#SBATCH --account=<slurm_grp_acct>
#SBATCH --partition=debug
#SBATCH --qos=debug_default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name="mpi_hello_world"
#SBATCH --output="%x.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.com
#SBATCH --requeue
## Set stack size to unlimited.
ulimit -s unlimited
## Benchmarking.
start_time=$(date +%s.%N)
## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE : ${SLURM_MEM_PER_NODE}"
## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
custom_tmp="no"
if [[ $custom_tmp == "yes" ]]; then
    JOB_TMPDIR=/tmp/${USER}/SLURM_JOB_ID/${SLURM_JOB_ID}
    mkdir -p "${JOB_TMPDIR}"
    export TMPDIR="${JOB_TMPDIR}"
    echo "TMPDIR : ${TMPDIR}"
fi
## Reset modules.
module purge
module load anaconda/3-2023.07-2
## Main job. Run your codes and executables here. `srun` is optional.
conda activate mytestenv
mpicc mpi_hello_world.c -o mpi_hello_world.exe
mpiexec -n ${SLURM_NTASKS} ./mpi_hello_world.exe
## Flush the TMPDIR.
if [[ $custom_tmp == "yes" ]]; then
    rm -rf "${TMPDIR}"
    echo "Cleared the TMPDIR (${TMPDIR})"
fi
## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"
In the above job script, the source code is compiled using mpicc and the resulting binary file (mpi_hello_world.exe) is executed using mpiexec, where the number of processes is the same as that defined for the #SBATCH --ntasks parameter. It is expected that this job will spawn only a single process. This may be confirmed by checking the resulting output file named mpi_hello_world.out.
Benchmarking
It is also important to note how the total run time changes with the job parameters. Hence, the job log includes the message Total runtime (sec): <seconds>. For this activity, any difference in run time is irrelevant because no heavy compute workload is being done.
For actual compute jobs, however, this is a crucial step in benchmarking to see which combination of job parameters is optimal. As shown in the figure below, the relationship between run time and the number of processors used is not linear: compute performance will plateau (have little to no change) past a critical point. In the particular example below, the optimal number of processors is around 8. Therefore, it is essential to run benchmark tests prior to performing actual production runs.
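One way to collect these timings is to read the `Total runtime (sec)` line from each job log and compare it against the single-processor run. The sketch below creates two tiny log files purely so that it can run anywhere; the file names and the run times in them are invented placeholders, not measurements from the Saliksik HPC.

```shell
logdir=$(mktemp -d)

# Placeholder logs standing in for real job outputs (values are invented).
echo "Total runtime (sec): 120.0" > "${logdir}/bench_n1.out"
echo "Total runtime (sec): 40.0"  > "${logdir}/bench_n4.out"

# Extract the run time (the text after ": ") from a job log.
runtime_of() {
  grep 'Total runtime (sec)' "$1" | awk -F': ' '{ print $NF }'
}

t1=$(runtime_of "${logdir}/bench_n1.out")
t4=$(runtime_of "${logdir}/bench_n4.out")

# Speedup = T(1) / T(n); parallel efficiency = speedup / n.
speedup=$(awk -v a="${t1}" -v b="${t4}" 'BEGIN { printf "%.2f", a / b }')
efficiency=$(awk -v s="${speedup}" -v n=4 'BEGIN { printf "%.2f", s / n }')
echo "speedup on 4 tasks: ${speedup}, efficiency: ${efficiency}"
```

An efficiency that drops well below 1 as processors are added is a sign that the job has reached the plateau described above.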
Conclusion
Congratulations on completing the Basic HPC Usage Module. At this point, you should have learned how to:
- Log in to the HPC front end;
- Upload and download files to and from the HPC;
- Check the HPC layout and storage quotas;
- Use environment modules;
- Configure Anaconda;
- Create environments and install packages;
- Create SLURM job scripts;
- Run and manage SLURM jobs; and
- Benchmark your jobs.
Moving forward, users are enjoined to:
- Perform benchmark runs to optimize resource usage;
- Learn advanced Linux terminal usage;
- Learn advanced HPC usage;
- Learn the other best practices when using the HPC.
Notes
- [1] Code reference: https://mpitutorial.com/tutorials/mpi-hello-world