HPC
COARE's HPC service consists of a cluster of compute and storage servers that enables high-speed, resource-intensive computation and the processing of large datasets.
The system architecture of the COARE HPC service is detailed below:
The current capacity of COARE HPC is summarized below:
| Resource | Capacity |
| --- | --- |
| CPU | 30 TFLOPS |
| GPU | 75 TFLOPS |
The HPC service uses SLURM as its batch scheduler. The cluster is divided into four partitions: Debug, Batch, Serial, and GPU. Their specifications are listed below, followed by a sample job script:
Debug (2 Nodes)
- 44 cores, 88 threads
- 528 GB RAM

Batch (32 Nodes)
- 44 cores, 88 threads
- 528 GB RAM

Serial (2 Nodes)
- 44 cores, 88 threads
- 528 GB RAM

GPU (6 Nodes)
- 12 cores, 24 threads
- 1,056 GB (~1 TB) RAM
- NVIDIA Tesla P40
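For reference, a minimal SLURM job script might look like the sketch below. The partition names follow the list above (exact names and casing as configured on the cluster may differ; check with `sinfo`), and the job name, resource values, and executable are illustrative assumptions rather than COARE defaults.

```bash
#!/bin/bash
#SBATCH --job-name=example          # illustrative name (assumption)
#SBATCH --partition=batch           # one of the partitions listed above
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4           # keep within your CPU allocation
#SBATCH --mem=8G                    # keep within the node's RAM
#SBATCH --time=1-00:00:00           # 1 day; keep within the partition's walltime limit
#SBATCH --output=example_%j.out     # %j expands to the job ID
##SBATCH --gres=gpu:1               # uncomment when targeting the GPU partition

srun ./my_program                   # replace with your actual executable
```

Submit the script with `sbatch job.sh` and monitor it with `squeue -u $USER`.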
The home directory (/home) is COARE's network filesystem, which runs on GlusterFS. It serves as each user's home directory; users' scripts and input data are stored here.
The scratch directories (/scratch1, /scratch2, and /scratch3) make up COARE's parallel filesystem, which runs on Lustre. They are built to handle users' I/O-heavy workloads; the output of running jobs, including intermediate files, is stored here.
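As an illustration, a job can stage its inputs from /home and run its I/O-heavy work on a scratch directory. The per-user path layout under /scratch1 and the input filename below are assumptions; confirm your assigned scratch location with the COARE Team.

```bash
# Sketch: run from scratch, keeping /home for scripts and inputs.
WORKDIR=/scratch1/$USER/my_run          # assumed per-user layout under /scratch1
mkdir -p "$WORKDIR"
cp "$HOME"/inputs/data.in "$WORKDIR"/   # hypothetical input file staged from /home
cd "$WORKDIR"                           # job output and intermediate files land here
```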
NOTE: As part of its efforts to upgrade COARE's current infrastructure, the COARE Team has been implementing the saliksik cluster since Q2 of 2020; it comprises COARE's next generation of HPC CPUs and GPUs. For more information on saliksik, visit this Wiki page.
Default Allocation for HPC service
COARE provides a default allocation for each user to ensure fair and equitable use of the COARE services. For more information, please read COARE's Acceptable Use Policy (AUP).
The implementation of the saliksik cluster also required adjusting the default allocation provided to each COARE HPC user, which is summarized in the table below:
| Resource | Default allocation |
| --- | --- |
| CPU | 86 logical cores |
| Network filesystem (/home) | 100 GB usable |
| Parallel filesystem (/scratch1, /scratch2, and /scratch3) | 3 TB total across the scratch directories |
| GPU | 1 GPU |
| Max running jobs | 30 jobs |
| Max submitted jobs | 45 jobs |
| Job waiting time | No guarantee; depends on the status of the queue and the availability of the requested resource(s) |
| Job walltime limit (Debug) | Maximum of 1 day allowable runtime |
| Job walltime limit (Batch) | Maximum of 7 days allowable runtime |
| Job walltime limit (Serial) | Maximum of 14 days allowable runtime |
| Job walltime limit (GPU) | Maximum of 3 days allowable runtime |
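To stay within these limits, you can check your current usage with standard tools, as in the sketch below. `squeue` and `du` are stock commands; `lfs quota` is the standard Lustre quota query, though whether quota reporting is enabled this way on COARE is an assumption.

```bash
squeue -u "$USER"                   # running/pending jobs vs. the 30/45 job limits
du -sh "$HOME"                      # /home usage vs. the 100 GB quota
lfs quota -h -u "$USER" /scratch1   # Lustre scratch usage, if quota reporting is enabled
```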
Any request for an allocation increase is subject to the COARE Team's evaluation and approval.