Saliksik Cluster

As part of the efforts to upgrade the COARE's current infrastructure, the COARE Team has started to implement the saliksik cluster, which comprises the COARE's next generation of HPC CPU and GPU resources.

Each CPU node has 88 threaded cores, up from 48 cores per node in the tux cluster. The next-generation GPU cluster comprises 6 NVIDIA Tesla P40s. The job scheduler remains the same SLURM workload manager used in the tux cluster (the previous setup), but the current setup has been upgraded and improved.
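
As a quick check of these resources, SLURM's sinfo command can list per-node CPU and GPU (GRES) information. The following is only a minimal sketch using standard sinfo format options; the exact output depends on the cluster's configuration.

# List each node with its CPU count and generic resources (e.g., GPUs)
sinfo -N -o "%N %c %G"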

Partitions

The saliksik cluster is divided into four (4) partitions:

  1.  Debug - for short jobs or for debugging code
  2.  Batch - for parallel jobs or for jobs requiring MPI
  3.  Serial - for single-core jobs or jobs not requiring MPI
  4.  GPU - for jobs that require GPU resources
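
As an illustration, a minimal sbatch script for a single-core job on the serial partition might look like the sketch below. The job name, output file, and executable are placeholders, and the QOS value is taken from the Job Scripts section further down.

#!/bin/bash
#SBATCH --partition=serial           ## single-core, non-MPI jobs
#SBATCH --qos=84c-1d_serial          ## QOS for the serial partition (see Job Scripts below)
#SBATCH --ntasks=1                   ## one task on one core
#SBATCH --job-name=serial_test       ## placeholder job name
#SBATCH --output=serial_test.%j.out  ## %j expands to the job ID

srun ./my_program                    ## placeholder executable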

Job Walltime Limits

Job limits are imposed to ensure fair usage of the COARE's resources and to prevent any single user from monopolizing them. Every job submitted to SLURM is subject to the COARE's SLURM Job Limits policy.

For the saliksik cluster, the following job walltime limits are implemented:

Partition   Walltime Limit
Debug       Maximum of 3 hours allowable runtime
Batch       Maximum of 3 days allowable runtime
Serial      Maximum of 7 days allowable runtime
GPU         Maximum of 3 days allowable runtime
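
Within an sbatch script, the requested walltime is set with the standard --time directive and must stay within the limit of the chosen partition. The example below is only a sketch; the value shown is illustrative.

#SBATCH --partition=batch
#SBATCH --time=2-12:00:00   ## days-hours:minutes:seconds; must not exceed the 3-day batch limit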

Job Scripts

Users should include the following in their sbatch script to specify the partition and corresponding QOS under which their job will run:

Partition   QOS
batch       240c-1h_batch
GPU         12c-1h_2gpu
debug       240c-1h_debug
serial      84c-1d_serial

For example:

#SBATCH --partition=gpu --qos=12c-1h_2gpu

NOTE: "c" in the QOS name refers to CPUs, and "2gpu" means a maximum resource of 2 GPUs.
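
Putting these pieces together, a complete GPU job script might look like the following sketch. The GPU count, CPU count, walltime, job name, and executable are placeholders to be adapted to the actual workload.

#!/bin/bash
#SBATCH --partition=gpu              ## GPU partition
#SBATCH --qos=12c-1h_2gpu            ## QOS for the GPU partition
#SBATCH --gres=gpu:1                 ## request 1 GPU (maximum of 2 under this QOS)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4            ## placeholder CPU count
#SBATCH --time=1-00:00:00            ## 1 day, within the 3-day GPU limit
#SBATCH --job-name=gpu_test          ## placeholder job name
#SBATCH --output=gpu_test.%j.out     ## %j expands to the job ID

srun ./my_gpu_program                ## placeholder executable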
