Introduction

A high-performance computing (HPC) cluster is a network of servers pooled together to maximize computational capability for specific purposes, often computationally intensive workloads such as simulations and modeling. Users should treat the HPC as an extension of their personal computers that provides the extra computational power needed for research. However, it is important for users to exercise prudence in using any HPC, i.e., wrong input will produce wrong output.

For novice or first-time HPC users, the COARE Team prepared this basic HPC module for guidance on how to jump-start your HPC journey prior to running actual jobs. This module focuses on the overview and practicalities of using the Saliksik HPC.

An important reminder for all users: to use any HPC effectively, some basic knowledge of the Linux command-line interface (terminal) is needed. Here are some useful online references that cover the basics of the Linux terminal:

Of course, the Saliksik HPC also has its share of limitations, such as:

  • Non-Linux programs are currently not supported.
  • There is very limited support for applications that need graphical user interfaces (GUI). HPCs in general are optimized mostly for terminal applications.
  • Jobs are limited to the availability of physical resources upon submission.

After completing this module, users are expected to be able to:

  • Log in to their HPC accounts;
  • Perform file and folder transfers to (upload) and from (download) the HPC;
  • Use environment modules;
  • Manage their Anaconda environments and packages;
  • Create SLURM job scripts; and
  • Run and manage their SLURM jobs.



Applications for Windows

For this module, Windows OS users who opt not to use the built-in PowerShell application should install PuTTY and WinSCP for logging in and file transfers, respectively. The usage of these applications is discussed in the sections below.


Before we proceed...

The following sections include commands that should be entered into the terminal which are indicated by the shell symbols $ (for Linux, Unix, and MacOS systems) and > (for Windows PowerShell). Unless otherwise indicated, terminal commands for Linux/Unix/MacOS systems may also be used for Windows PowerShell.

Commands or arguments enclosed in square brackets ([ and ]) are optional, while those in angle brackets (< and >) need to be supplied (e.g., <username>). Commands or arguments separated by a vertical bar (|) indicate different choices (e.g., -n name | -p path means that either the name or the path may be used).



Accessing the HPC

The HPC can only be accessed using passwordless SSH, so SSH key(s) need to be appended to the user's account. Every user is responsible for their own account. Account sharing is strictly prohibited as outlined in the COARE Acceptable Use Policy (AUP). This module assumes that the user hasn't logged in yet to their account.

Generating SSH Key Pairs

The SSH key pair consists of a private and a public key. The public key is appended to the user's account in the HPC and is used to verify the private key stored on the user's personal computer every time the user logs in to the HPC.

For Windows OS users, follow only one of the two methods below (Terminal or Graphical), as a key generated with one method will not work with the other. For example, an SSH key generated using ssh-keygen in the terminal (as discussed below) cannot be used to log in with the graphical program PuTTY.

Terminal

Note

This section is applicable for Linux, MacOS, and Windows PowerShell terminals.

To generate an SSH key, open your computer's terminal and use this command:

$ ssh-keygen [<args>]

Subsequent prompts for input from the user will be displayed:

[Screenshot: ssh-keygen prompts]

Simply pressing the Enter key without typing any input at the prompts will use the default options indicated in parentheses. Users may opt to put a passphrase (password) on the key pair; however, it is more convenient not to, as the passphrase will be asked for every time the user logs in.

By default, the key pair will be stored in the ${HOME}/.ssh folder, where the private and public keys are called id_rsa and id_rsa.pub, respectively. The $HOME folder looks like /home/username for Linux/Unix, /Users/username for MacOS, and C:\Users\username for Windows. On all of these platforms, the shorthand for the $HOME folder is ~ (tilde).
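
The defaults are sufficient for this module, but as an illustration, the key type, size, output file, and comment can also be set explicitly on the command line (the comment below is a hypothetical example):

$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -C "username@my-laptop"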

The private key SHOULD ONLY be accessible to you for security reasons, as anyone who obtains it can use it to log in to your account. Here is what a sample RSA-type SSH private key looks like:

[Screenshot: sample RSA private key]

And here is its corresponding public key:

[Screenshot: corresponding RSA public key]

The public key will be appended to the user's account so they can log in to the HPC. Make sure it is in the OpenSSH format like above — ssh-rsa <some_long_str_here> <comment> where <comment> is optional and usually takes the form <user>@<computer-name>.

Warning

DO NOT send your private key to anyone else, even the COARE Team (unless explicitly required by the Team).

Graphical

Note

This section is applicable for Windows OS only.

Open the PuTTYgen application, which comes with the PuTTY installation. The default parameters at the bottom, i.e., RSA key type and 2048 bits, are already good to use:

[Screenshot: PuTTYgen default key parameters]

Click the Generate button, then move the mouse pointer randomly over the blank area to generate the key:

[Screenshot: PuTTYgen generating an RSA SSH key]

The PuTTYgen interface will look like this after a key has been successfully generated:

[Screenshot: PuTTYgen interface after key generation]

Click the Save private key button to save the private key on your computer. The public key that will be appended to your account is inside the box labeled Public key for pasting into OpenSSH authorized_keys file:. Send your public key to the COARE Team to have it appended to your account. If you need the OpenSSH-formatted public key from a previously created private key, simply click the Load button and locate the private key.

Warning

Again, DO NOT send your private key to anyone else, even the COARE Team (unless explicitly required by the Team).

Logging In

The user may log in to the HPC after the COARE Team has appended the public key to the user's account.

For Windows OS users, follow the same method (Terminal or Graphical) that was used when generating the SSH key, as a key generated with one method will not work with the other (as previously explained). Likewise, any setting or configuration made in one method will not work with the other; for example, an SSH configuration file (as discussed below) cannot be used when logging in with PuTTY.

Terminal

Note

This section is applicable for Linux, MacOS, and Windows PowerShell terminals.

Interactive Command

To log in to the HPC, use this command in your local machine's terminal:

$ ssh [-v] [-i </path/to/ssh/priv/key>] <username>@saliksik.asti.dost.gov.ph  # or 202.90.149.55

The -i option specifies the path to your private key, which is ~/.ssh/id_rsa by default. To print more verbose messages, add the -v option; use more v's to increase verbosity (i.e., -vv and -vvv), although a single -v should suffice. The front end (login) node has the public domain name saliksik.asti.dost.gov.ph or IP address 202.90.149.55. After successfully logging in, the HPC's welcome page will be displayed:

[Screenshot: HPC welcome page]

SSH Configuration File

The SSH parameters can be saved into a configuration file for a more convenient login every time. This will also come in handy later on when downloading and uploading files.

In Linux and MacOS terminals, use vim, nano, or any text editor. If the file doesn't exist, it will automatically be created upon saving:

$ <vim|nano> ~/.ssh/config

For Windows PowerShell, create a blank file first before editing with notepad.exe; otherwise, notepad.exe automatically adds a .txt filename extension, which will make the config file unusable:

> ni ~/.ssh/config
> notepad.exe ~/.ssh/config

Here is a sample SSH configuration file:

Host          saliksik
User          username
Hostname      saliksik.asti.dost.gov.ph
IdentityFile  ~/.ssh/id_rsa

The column spacing above is optional and only for readability; a single space on each line will suffice. The value set for Host (in this case, saliksik) can now be used to shorten the full SSH login command to:

$ ssh [-v|-vv|-vvv] saliksik

Graphical

Note

This section is applicable for Windows OS only​​​​​​.

To log in using PuTTY, the minimum parameters needed are the username, hostname, and the private key generated by PuTTYgen. Under the Session tab (the default tab), in the Host Name (or IP address) box, key in username@saliksik.asti.dost.gov.ph (or username@202.90.149.55):

[Screenshot: PuTTY session configuration window]

Then, go to the Connection > SSH > Auth tab, and locate the private key previously created by PuTTYgen in the Private key file for authentication box:

[Screenshot: PuTTY SSH authentication settings]

To save the parameters, go back to the Session tab, put a name (such as saliksik) in the Saved Sessions box, and click Save. It should be added below Default Settings. In the future, to reuse the saved session, click on its name, then click the Load button to load the saved parameters.

Finally, click the Open button at the bottom of the window to log in to the HPC. The following security alert might appear:

[Screenshot: PuTTY security alert about an unknown server host key]

If logging in using PuTTY for the very first time, then this is normal as the server's host key is not yet recognized by PuTTY. However, if the server's host key has already been previously cached yet the alert still appeared, then kindly inform the COARE Team as this may be a security concern.

Users are also encouraged to explore the other settings of PuTTY, such as the terminal size, font size and color, etc.

HPC Layout

The Saliksik HPC is composed of the following nodes (servers):

  • Front end (login)
    • This is where users log in to the HPC. DO NOT run jobs here. Use the debug nodes instead (will be explained later). Violators will be subjected to the COARE AUP.
  • Compute nodes
    • CPU nodes x 36. Every node has:
      • 88 logical CPUs (86 usable)
      • 500 GB RAM
    • GPU nodes
      • P40 nodes x 6. Every node has:
        • 24 CPUs (22 usable)
        • 1 TB RAM
        • NVIDIA Tesla P40 GPU x 1
      • A100 nodes x 2. Every node has:
        • 128 CPUs (126 usable)
        • 1 TB RAM
        • NVIDIA Tesla A100 GPU x 8

Storage Quotas

Each user has the following default storage quotas:

  • Home (/home/username): 100 GB
  • Scratch folders (/scratch[1-3]/username symlinked to /home/username/scratch[1-3]): 5 TB for each scratch folder

The Saliksik HPC regularly undergoes maintenance and streamlining of operations, so these quotas may change in the future with prior notice to users.

The home folder is intended for long-term data storage, while the scratch folders are for heavy input and output (I/O) file operations when running jobs. The scratch folders are also significantly faster than the home folder for read and write operations, so jobs should only be performed using the scratch folders and users are prohibited from running their jobs in their home folders. Please refer to the COARE AUP for more info.
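
As a minimal sketch of this practice, a job folder can be created in one of the scratch folders after logging in (the folder name below is hypothetical):

# <scratch> can be "scratch1", "scratch2", or "scratch3"
$ mkdir -p /<scratch>/<username>/test-job
$ cd /<scratch>/<username>/test-job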

Uploading and Downloading Files

Terminal

Note

This section is applicable for Linux, MacOS, and Windows PowerShell terminals.

Remote file transfers via the terminal are done using scp or rsync. All of the commands listed here should be run on the local computer for both upload and download operations.

Using scp

In your computer, upload files and/or folders with scp using the following command:

$ scp [-r] [-v] [-i </path/to/ssh/priv/key>] </source/path/in/local/machine> <username>@saliksik.asti.dost.gov.ph:</dest/path/in/server>

The scp options -r and -v are for recursive (entire folder) transfers and verbose output, respectively. The -i </path/to/ssh/priv/key> option specifies the private SSH key file to use. If an SSH configuration file was created (for example, with Host set to saliksik), the command can be shortened to:

$ scp [-r] [-v] </source/path/in/local/machine> saliksik:</destination/path/in/server>

Downloading files using scp follows the same principle as above, with the source and destination switched. To download, use either the long or shortened (if there is an SSH configuration file) version of the command:

$ scp [-r] [-v] [-i </path/to/priv/ssh/key>] <username>@saliksik.asti.dost.gov.ph:</source/path/in/server> </dest/path/in/local/machine>
$ scp [-r] [-v] <host>:</source/path/in/server> </dest/path/in/local/machine>
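
For illustration, here is a hypothetical upload of a folder and download of a single file, assuming the SSH configuration Host saliksik set earlier and that the destination folders already exist (the paths are examples only):

$ scp -r ~/projects/inputs saliksik:/scratch3/<username>/test-job/
$ scp saliksik:/scratch3/<username>/test-job/results.log ~/Downloads/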

More information about scp can be found in its manual pages:

$ man scp

Using rsync

Caution

When using rsync, the path naming is critical: it interprets a slash (/) at the end to mean you're transferring the contents of the folder. For example, say inside your local machine's home folder there is a folder called folder1 containing a file called file2.​

If recursively transferring the folder into /home/username/dest in the HPC using, for example, the command $ rsync -avhP ~/folder1 saliksik:/home/username/dest, the entire folder will be transferred, so file2 will be stored as /home/username/dest/folder1/file2.

However, when $ rsync -avhP ~/folder1/ saliksik:/home/username/dest is used (mind the / after folder1), only the contents of folder1 will be transferred, so file2 will be stored as /home/username/dest/file2 in the HPC. That subtle trailing / makes a significant difference and may affect your file and folder transfers.

One key advantage of rsync over scp is that rsync only updates files that have changed in the destination: when it detects no difference between the source and destination files, the transfer terminates immediately. Interrupted rsync transfers can also be resumed later without starting over from scratch, whereas scp overwrites the destination files even if they are exactly the same as the source.

In your computer, upload files with rsync using the following command:

$ rsync [-a] [-v] [-h] [-P] [-e "ssh -i </path/to/ssh/priv/key>"] </source/path/in/local/machine> <username>@saliksik.asti.dost.gov.ph:</destination/path/in/server>

The rsync options -a, -v, -h, and -P are for archive mode (-a), verbose output (-v), human-readable output (-h), and keeping partially transferred files while showing progress (-P), respectively. Unlike scp, rsync does not take a -i option for the key file; to use a specific private SSH key, pass it through the remote shell option -e "ssh -i </path/to/ssh/priv/key>". If an SSH configuration file is set (for example, with Host also set to saliksik), the command can be shortened to:

$ rsync [-avhP] </source/path/in/local/machine> saliksik:</destination/path/in/server>

To download files and/or folders, use either the long or shortened (again, if there is an SSH configuration file) version of the command:

$ rsync [-avhP] [-e "ssh -i </path/to/priv/ssh/key>"] <username>@saliksik.asti.dost.gov.ph:</source/path/in/server> </dest/path/in/local/machine>
$ rsync [-avhP] <host>:</source/path/in/server> </dest/path/in/local/machine>
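
Likewise, here is a hypothetical rsync upload and download, again assuming the Host saliksik configuration (the paths are examples only; note the trailing-slash behavior described in the Caution above):

$ rsync -avhP ~/projects/inputs saliksik:/scratch3/<username>/test-job/
$ rsync -avhP saliksik:/scratch3/<username>/test-job/outputs/ ~/results/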

For more information about rsync and its options, refer to its manual pages:

$ man rsync

Graphical

Note

This section is applicable for Windows OS only​​​​​​.

One non-terminal option to transfer files and folders to and from the HPC for Windows users is WinSCP.

To log in to the HPC, enter the following parameters in its interface:

  • File protocol: SFTP
  • Host name: saliksik.asti.dost.gov.ph (or 202.90.149.55)
  • Port number: 22
  • User name: (your HPC username)
  • Password: (leave as blank)

[Screenshot: WinSCP login dialog]

Then, click the Advanced button, which will bring up the Advanced Site Settings window:

[Screenshot: WinSCP Advanced Site Settings dialog]

Navigate to the SSH > Authentication tab and locate the private SSH key file generated using PuTTYgen:

[Screenshot: WinSCP authentication page (Advanced Site Settings)]

Click OK to go back to the login interface. After configuring the login, click the Login button to connect to the HPC. Upon successful login, WinSCP will show the Commander interface, where local and remote files are shown on the left and right panels, respectively:

[Screenshot: WinSCP Commander interface]

Uploading and downloading files to and from the HPC is as simple as "drag and drop" using WinSCP. Users are encouraged to explore the other settings and features of WinSCP such as displaying hidden files (with dot prefixes, e.g. .bashrc), etc.



Modules and Environments

Modules allow program installations with different versions to be used without them interfering with each other, thus effectively keeping each version in a sandboxed environment. In other words, modules allow programs to be used in isolation from others which avoids possible incompatibilities and inconsistencies. However, it should be noted that the COARE Team is gradually doing away with modules in favor of Anaconda environments, but modules are still used for programs that are not available in the Anaconda repository (anaconda.org).

Module Commands

Modules have the format <module_name>/<version>, for example: anaconda/3-2023.07-2.

List Available Modules

Without any argument, this command will list all available versions of all installed modules. When one or more module names are provided, the available versions for the modules are listed:

$ module avail [<module1/version> <module2/version> ...]

For example, running module avail without additional arguments will print a list of modules like the following, which is not exhaustive as it is constantly being updated:

[Screenshot: example output of module avail]

On the other hand, when using the command module avail gromacs for example, the available versions of the gromacs module are listed:

[Screenshot: example output of module avail gromacs]

Load module(s)

$ module load <module1/version> [<module2/version> ...]

List loaded module(s)

$ module list

Reload currently loaded module(s)

$ module reload

Unload module(s)

$ module unload <module1/version> [<module2/version> ...]

Unload all loaded modules

$ module purge
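
Putting these together, a typical module workflow might look like the following (the module version is the one mentioned in this module; always check module avail for what is currently installed):

$ module avail anaconda
$ module load anaconda/3-2023.07-2
$ module list
$ module purge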

Anaconda

Anaconda is a package and environment manager written primarily in Python. Its official website is anaconda.org.

Configure conda and mamba

Anaconda's default package manager is conda, although in practice mamba is better to use because it's much more efficient and its warning and error messages are more intuitive. However, it's still a good idea to be able to use them both.

Initialize conda and mamba

As of writing, the latest Anaconda module is anaconda/3-2023.07-2. In the past, running $ conda activate would produce an error saying that the ~/.bashrc script had not yet been initialized. Loading the current module automatically initializes conda and mamba, so there is no need to modify your ~/.bashrc script as with previous Anaconda module versions.

$ module load anaconda/3-2023.07-2  # use the latest available
$ conda activate   # or mamba activate; activate base env

Change the Default Locations

The default locations for the conda environments and packages are ~/.conda/envs and ~/.conda/pkgs, respectively. The environments folder is the path prefix where environments are created (e.g., an environment named test will be created in ~/.conda/envs/test by default), while the packages folder is where installers are downloaded and cached. Both folders are stored in the user's home folder by default. However, as previously explained, the home folder is significantly slower than the scratch folders, and its storage quota is significantly smaller. Thus, to maximize job performance later on, the default paths for both folders will be changed to one of the scratch folders. The configuration set here will also be used by mamba.

To do this, use the following commands:

# <scratch> can be "scratch1", "scratch2", or "scratch3"
# (e.g., /scratch3/trainee/conda/envs)
$ conda config --add envs_dirs /<scratch>/<username>/conda/envs
$ conda config --add pkgs_dirs /<scratch>/<username>/conda/pkgs

If these paths don't exist, conda will automatically create them during package download or environment creation. To confirm the configuration, check that the conda configuration file has been modified; it should look like this (YAML format):

$ cat ~/.condarc

envs_dirs:
  - /scratch3/username/conda/envs
pkgs_dirs:
  - /scratch3/username/conda/pkgs

The ~/.condarc file may also be created and/or modified directly without running the above commands. Additional paths may be supplied to the envs_dirs and pkgs_dirs parameters, which is useful in cases where, say, the first path becomes full or the user has no permission to write to it; the next path will then be used, and so on. The paths point directly to the scratch folders instead of the symlinks in the user's home folder (/home/username/scratch[1-3]) because the former are the actual scratch paths while the latter are only symlinks.
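
For example, a fallback location on another scratch folder may be appended like this (the paths below are hypothetical; --append places the new path after the existing ones):

$ conda config --append envs_dirs /scratch1/<username>/conda/envs
$ conda config --append pkgs_dirs /scratch1/<username>/conda/pkgs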

Manage Environments

Create Environments

Caution

Creating environments may significantly use computational resources which is not allowed in the front end node. This operation should be performed in a compute node. Therefore, the commands discussed here should be submitted as a SLURM job. Refer to the next section (SLURM) on how to submit a job.

Default Way

To create an Anaconda environment, simply use the following command template:

$ mamba create [-y] <-n env_name | -p env_path> <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] <package2>[=<version2>=<build2>] ...

The -y argument is optional and tells mamba to assume "yes" as the answer to all of its prompts. However, -y is required when creating the environment via SLURM: without it, the job will fail because the interactive prompts cannot be answered. If the version and build of the package(s) are not specified, the latest available will be installed.

For example, to create an environment named myenv containing the package hmmer from the bioconda channel (https://anaconda.org/bioconda/hmmer):

$ mamba create [-y] -n myenv -c bioconda hmmer

Using Multiple Channels and Packages

Of course, multiple channels and packages may be used, such as hmmer from the bioconda channel (https://anaconda.org/bioconda/hmmer) and sqsgenerator from the conda-forge channel (https://anaconda.org/conda-forge/sqsgenerator):

$ mamba create [-y] -n myenv -c bioconda -c conda-forge hmmer sqsgenerator
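
Since environment creation must be submitted as a job, here is a minimal sketch of how the command above might be wrapped in a job script (the account, partition, QOS, and job name are placeholders; see the SLURM section below for the full template and parameter descriptions):

#!/bin/bash
#SBATCH --account=<slurm_group_acct>
#SBATCH --partition=debug
#SBATCH --qos=debug_default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name="create-myenv"
#SBATCH --output="%x.out"

module purge
module load anaconda/3-2023.07-2
mamba create -y -n myenv -c bioconda -c conda-forge hmmer sqsgenerator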

List Environments

To list the environments visible to the user, use the following command template:

$ <mamba|conda> env list  # or: <mamba|conda> info <-e|--envs>

Activate an Environment

To activate an environment, use the following command template:

$ <mamba|conda> activate <env_name|env_path>

Remove an Environment

To remove an environment:

$ <mamba|conda> env remove <-n env_name | -p env_path>
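
For instance, using the hypothetical myenv environment created earlier (deactivate an environment before removing it):

$ conda env list
$ conda activate myenv
$ conda deactivate
$ conda env remove -n myenv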

Manage Packages

Install and Remove Packages

Caution

Installing packages may significantly use computational resources which is not allowed in the front end node. This operation should be performed in a compute node. Therefore, the commands discussed here should be submitted as a SLURM job. Refer to the next section (SLURM) on how to submit a job.

Default Way

To install packages into an existing environment in a single line, use the following command template:

$ mamba install [-y] <-n env_name | -p env_path> <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>]

To remove packages:

$ mamba remove [-y] <-n env_name | -p env_path> <package1>[=<version1>=<build1>]

The above commands may also be done by activating the environment first prior to package installation or removal:

$ module load anaconda/3-2023.07-2
$ conda activate <env_name | env_path>
$ mamba install [-y] <-c channel1> [<-c channel2> ...] <package1>[=<version1>=<build1>] ...
$ mamba remove [-y] <package1>[=<version1>=<build1>] ...
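
As a concrete sketch, the lines below install and then remove an illustrative package (numpy is only an example) into the hypothetical myenv environment; as noted in the Caution above, they should form the body of a SLURM job script rather than be run on the front end:

module purge
module load anaconda/3-2023.07-2
mamba install -y -n myenv -c conda-forge numpy
mamba remove -y -n myenv numpy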

Specific Package Version and Build

A package may have different version and builds available. For example, the pytorch package in the pytorch channel (https://anaconda.org/pytorch/pytorch) has multiple versions available and each version has multiple builds:

[Screenshot: pytorch versions and builds listed on anaconda.org]

In the above screenshot, the linux-64 architecture offers multiple builds for version 1.11.0, namely: py3.10_cuda11.1_cudnn8.0.5_0, py3.10_cuda11.3_cudnn8.2.0_0, py3.10_cuda11.5_cudnn8.3.2_0, and py3.7_cpu_0. The other newer versions have multiple builds as well. The build for each package may be inferred from its name or viewed in the package's file details on anaconda.org, for example:

[Screenshot: build details for a pytorch package on anaconda.org]

In the above example, pytorch may be installed by simply specifying the version, like so:

$ module load anaconda/3-2023.07-2
$ mamba create [-y] -n myenv -c conda-forge pytorch=1.11.0

However, there may be instances where you need to install the CUDA-enabled (GPU) build but the latest build is CPU-only, so the above command would install the CPU build of pytorch version 1.11.0. To install the CUDA-enabled build, for example, py3.10_cuda11.1_cudnn8.0.5_0, use the command below (hint: this should be submitted to a GPU-capable node such as those in the gpu partition):

$ module load anaconda/3-2023.07-2
$ mamba create [-y] -n myenv -c conda-forge pytorch=1.11.0=py3.10_cuda11.1_cudnn8.0.5_0

List Installed Packages

This operation may be done interactively, so no need to submit this via SLURM. To list the packages installed in an environment, there are two ways:

Activate the environment, then list the packages:

$ <conda|mamba> activate <env_name | env_path>
$ <conda|mamba> list

Or, list the packages directly:

$ <conda|mamba> list <-n env_name | -p env_path>



SLURM

SLURM is the job and resource manager used in the HPC. Its official online documentation is at https://slurm.schedmd.com/documentation.html.

Partitions and Quality-of-Service (QOS)

The compute nodes previously listed are grouped into partitions and each partition has its default QOS. The default partition is debug. For all QOSes, the maximum number of concurrently running jobs is 30, while the maximum number of submitted jobs is 45.

Partition | Nodes                      | QOS             | Limits                          | Remarks
debug     | saliksik-cpu-[21-22]       | debug_default   | 86 CPUs, 1 day run time         |
batch     | saliksik-cpu-[01-20,25-36] | batch_default   | 86 CPUs, 7 days run time        |
serial    | saliksik-cpu-[23-24]       | serial_default  | 86 CPUs, 14 days run time       |
gpu       | saliksik-gpu-[01-06]       | gpu-p40_default | 12 CPUs, 1 GPU, 3 days run time | To use the GPU, add either the #SBATCH --gres=gpu:p40:1 or #SBATCH --gres=gpu:1 job parameter
gpu_a100  | saliksik-gpu-[09-10]       |                 |                                 | Currently for limited access only

Job Parameters

Required Parameters

These are the job parameters that are required prior to running any job:

  • --account: (string) group account where job quotas are set;
  • --partition: (string) which partition the job will be submitted to;
  • --qos: (string) the appropriate QOS in the partition;
  • --nodes: (integer) number of nodes to request;
  • --ntasks: (integer) total number of CPUs to request;
  • --output: (string) job log file 

Optional Parameters

On the other hand, these are some of the optional job parameters:

  • --ntasks-per-node: (integer) specify the number of CPUs per node to be requested (must not contradict --ntasks if also specified);
  • --mem: (string) memory per node (e.g., 1G, 500K, 4GB, etc.);
  • --job-name: (string) name for the job; will be displayed in job monitoring commands (as discussed later);
  • --error: (string) job error file; recommended to not define this parameter and use only --output instead;
  • --requeue: (no arg) make job eligible for requeue;
  • --mail-type: (string) send an email to the user when the job reaches the specified status, such as NONE, BEGIN, END, FAIL, REQUEUE, ALL, etc. (see the sbatch manual for more info);
  • --mail-user: (string) user's email address;

For other parameters or more info regarding the above listed parameters, see the sbatch manual using the following command or go to the online manual.

$ man sbatch

Job Script

A job script is submitted to allocate resources for a job. The previously discussed job parameters and the commands to be used to run the job are placed here.

Here is a sample job script where comments have been included to describe what each block does:

#!/bin/bash
#SBATCH --account=<slurm_group_acct>
#SBATCH --partition=<partition>
#SBATCH --qos=<qos>
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks=<num_cpus>
#SBATCH --job-name="<jobname>"
#SBATCH --output="%x.out"         ## <jobname>.out
##SBATCH --mail-type=ALL          ## optional
##SBATCH --mail-user=<email_add>  ## optional
##SBATCH --requeue                ## optional
##SBATCH --ntasks-per-node=1      ## optional
##SBATCH --mem=24G                ## optional: mem per node
##SBATCH --error="%x.%j.err"      ## optional; better to use --output only

## For more `sbatch` options, use `man sbatch` in the HPC, or go to https://slurm.schedmd.com/sbatch.html.

## Set stack size to unlimited.
ulimit -s unlimited

## Benchmarking.
start_time=$(date +%s.%N)

## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID          : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME        : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES   : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST    : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS          : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE    : ${SLURM_MEM_PER_NODE}"

## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
custom_tmpdir="yes"

if [[ $custom_tmpdir == "yes" ]]; then
  JOB_TMPDIR=/tmp/${USER}/${SLURM_JOB_ID}
  mkdir -p ${JOB_TMPDIR}
  export TMPDIR=${JOB_TMPDIR}
  echo "TMPDIR                : $TMPDIR"
fi

## Reset modules.
module purge
module load <module1> [<module2> ...]

## Main job. Run your codes and executables here; `srun` is optional.
[srun] /path/to/exe1 <arg1> ...
[srun] /path/to/exe2 <arg2> ...

## Flush the TMPDIR.
if [[ $custom_tmpdir == "yes" ]]; then
  rm -rf ${TMPDIR}
  echo "Cleared the TMPDIR (${TMPDIR})"
fi

## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"

Job Management

Submit Job Script

It is recommended to submit the job inside the folder containing the job script. It is also recommended that any and all input and/or output files be within the same folder where the job script is located. This is to avoid changing working directories which may result in confusion and possible errors in accessing files/folders. For example, if the job folder is at /home/username/scratch3/test-job where all the necessary input files are stored together with the job script named job.sbatch:

$ cd /home/username/scratch3/test-job
$ sbatch job.sbatch

Note

In the following commands, nodelist can be written as a single node (e.g., saliksik-cpu-01) or a combination (e.g., saliksik-cpu-[01-10] or saliksik-cpu-[01-10,15],saliksik-gpu-01). Likewise, the partition argument can also be a combination (e.g., batch,gpu).

Show Job Queue

If no argument is passed, all jobs in the queue will be displayed.

$ squeue [-u <username> ] [-p <partition>] [-w <nodelist>]

Show Job Parameters

$ scontrol show job <job_id>  # or jobid=<job_id>

Check Node and/or Partition Status

$ sinfo [-p <partition> | -n <nodelist>]

Cancel Job(s)

You may only cancel jobs created under your account.

$ scancel <job_id1> [<job_id2> ...]
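
For example, a typical monitoring sequence after submitting a job might look like this (the job ID below is hypothetical):

$ squeue -u <username>
$ scontrol show job 1234567
$ sinfo -p debug
$ scancel 1234567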



Activity

Test your knowledge and skills acquired from this module by performing the following tasks.

Create Environment

For your first task, create an Anaconda environment via a SLURM job. The environment should have the following specifications:

  • Name: mytestenv
  • Channels:
    • conda-forge
  • Packages:
    • openmpi-mpicc version 4.1.6

Install Additional Packages

Into the environment created above, install the following packages via another SLURM job:

  • Channels:
    • conda-forge
    • pytorch
  • Packages:
    • gromacs version 2023.3 build mpi_openmpi_dblprec_hecbbb8f_0
    • pytorch-cuda version 11.8

Compile and Execute Code

Create a file (in any of your scratch folders) containing the following sample source code:[1]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

This is a simple "hello world" program written in C to test whether the right number of processes is spawned as allocated. For this example, the file is named mpi_hello_world.c, which will be compiled and executed using the mpicc and mpiexec executables, respectively; both were installed during the creation of the mytestenv environment. The following job script, named mpi_hello_world.sbatch, will be used:

#!/bin/bash
#SBATCH --account=<slurm_grp_acct>
#SBATCH --partition=debug
#SBATCH --qos=debug_default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name="mpi_hello_world"
#SBATCH --output="%x.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.com
#SBATCH --requeue

## Set stack size to unlimited.
ulimit -s unlimited

## Benchmarking.
start_time=$(date +%s.%N)

## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID          : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME        : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES   : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST    : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS          : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE    : ${SLURM_MEM_PER_NODE}"

## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
custom_tmp="no"

if [[ $custom_tmp == "yes" ]]; then
  JOB_TMPDIR=/tmp/${USER}/${SLURM_JOB_ID}
  mkdir -p ${JOB_TMPDIR}
  export TMPDIR=${JOB_TMPDIR}
  echo "TMPDIR                : ${TMPDIR}"
fi

## Reset modules.
module purge
module load anaconda/3-2023.07-2

## Main job. Run your codes and executables here. `srun` is optional.
conda activate mytestenv
mpicc mpi_hello_world.c -o mpi_hello_world.exe
mpiexec -n ${SLURM_NTASKS} ./mpi_hello_world.exe

## Flush the TMPDIR.
if [[ $custom_tmp == "yes" ]]; then
   rm -rf $TMPDIR
  echo "Cleared the TMPDIR (${TMPDIR})"
fi

## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"

In the above job script, the source code is compiled using mpicc and the resulting binary file (mpi_hello_world.exe) is executed using mpiexec where the number of processors is the same as that defined for the #SBATCH --ntasks parameter. It is expected that this job will only spawn a single processor. This may be confirmed by checking the resulting output file named mpi_hello_world.out.

Experiment with modifying the --ntasks parameter to see if the same number of processors are spawned. Another experiment to try is to set inconsistent values where mpiexec uses more processors than allocated, such as --ntasks=5 but mpiexec -n 10, which should be expected to result in an error. You can also try setting the opposite where the number of processors allocated exceeds the number mpiexec will use to see how it will turn out.

Benchmarking

It is also important to note the resulting total run time as the job parameters change. Hence, the job log will include the message Total runtime (sec): <seconds>. For this activity, any difference in run time is irrelevant because no heavy compute workload is being done.

For actual compute jobs, however, this is a crucial step in benchmarking to see which combination of job parameters are optimal. As shown in the figure below, the relationship between run time vs. number of processors used is not linear — compute performance will plateau (have little to no change) past a critical point. In the particular example below, the optimal number of processors is around 8. Therefore, it is essential to run benchmark tests prior to performing actual production runs.

[Figure: execution time (in seconds) vs. number of processors]
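
One possible way to run such a sweep is to override --ntasks from the command line when submitting the same job script several times; sbatch command-line options take precedence over the matching #SBATCH directives (the values and job names below are illustrative):

$ for n in 1 2 4 8 16; do sbatch --ntasks=${n} --job-name="mpi_bench_${n}" mpi_hello_world.sbatch; done

Since the job script writes its log to %x.out, each submission will produce a separate mpi_bench_<n>.out file whose run times can then be compared.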



Conclusion

Congratulations on completing the Basic HPC Usage Module. At this point, you should have learned how to:

  • Log in to the HPC front end;
  • Upload and download files to and from the HPC;
  • Check the HPC layout and storage quotas;
  • Use environment modules;
  • Configure Anaconda;
  • Create environments and install packages;
  • Create SLURM job scripts;
  • Run and manage SLURM jobs; and
  • Benchmark your jobs.

Moving forward, users are enjoined to:

  • Perform benchmark runs to optimize resource usage;
  • Learn advanced Linux terminal usage;
  • Learn advanced HPC usage;
  • Learn the other best practices when using the HPC.



Notes

  1. ^ Code reference: https://mpitutorial.com/tutorials/mpi-hello-world