Saliksik Cluster
As part of the efforts to upgrade the COARE's current infrastructure, the COARE Team has started to implement the Saliksik cluster, which comprises the next generation of the COARE's HPC CPU and GPU resources.
Each CPU node has 88 threaded cores, up from 48 cores per node in the Tux cluster. The next-generation GPU cluster comprises six (6) NVIDIA Tesla P40 GPUs. The job scheduler is still the same SLURM workload manager used in the Tux cluster (the previous setup), but the current setup has been upgraded and improved.
Partitions
The Saliksik cluster is divided into four (4) partitions:
- Debug - for short jobs or for debugging code
- Batch - for parallel jobs or for jobs requiring MPI
- Serial - for single-core jobs or jobs not requiring MPI
- GPU - for jobs that require GPU resources
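As a quick check, the partitions visible on the cluster can be listed with SLURM's standard `sinfo` command, for example:

```bash
# Summarize the available partitions, their time limits, and node states
sinfo -s

# Show detailed node information for a specific partition, e.g. debug
sinfo --partition=debug --long
```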
Job Walltime Limits
Job limits are imposed to ensure fair usage of the COARE's resources and to prevent any single user from monopolizing them. Every job submitted to SLURM is subject to the COARE's SLURM Job Limits policy.
For the Saliksik cluster, the following job walltime limits are implemented:
Partition | Maximum Walltime |
Debug | 3 hours |
Batch | 3 days |
Serial | 7 days |
GPU | 3 days |
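These limits are enforced through SLURM's QOS settings. Assuming SLURM accounting is enabled on the cluster, the configured maximum walltimes can be inspected with the standard `sacctmgr` command:

```bash
# List each QOS with its configured maximum walltime (requires SLURM accounting)
sacctmgr show qos format=Name,MaxWall
```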
Job Scripts
Users should specify the partition they will run on, together with its corresponding QOS, in their sbatch script:
Partition | QOS |
debug | 240c-1h_debug |
batch | 240c-1h_batch |
serial | 84c-1d_serial |
gpu | 12c-1h_2gpu |
For example:
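Below is a minimal sketch of an sbatch script for the batch partition. The partition and QOS values come from the table above; the job name, task count, walltime, and executable are placeholders to adapt to your own job:

```bash
#!/bin/bash
#SBATCH --partition=batch           # partition from the table above
#SBATCH --qos=240c-1h_batch         # matching QOS for the batch partition
#SBATCH --job-name=sample_job       # placeholder job name
#SBATCH --output=sample_job.%j.out  # job log; %j expands to the job ID
#SBATCH --ntasks=8                  # placeholder MPI task count
#SBATCH --time=1-00:00:00           # 1 day, within the 3-day batch limit

# Load any required environment modules here (module names are site-specific)
# module load openmpi

# Launch the (placeholder) MPI program
srun ./my_mpi_program
```

Submit the script with `sbatch`, e.g. `sbatch sample_job.slurm`, and monitor it with `squeue -u $USER`.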