HPC

Last modified by Administrator on Wed, 10/20/2021, 11:15 AM


The COARE's HPC service consists of a cluster of compute and storage servers that enables high-speed, resource-intensive computations and the processing of large datasets.

The system architecture for the COARE HPC service is detailed below:

[Figure: COARE HPC system architecture]

The current capacity of the COARE HPC is summarized below:
 

CPU: 30 TFLOPS
GPU: 72 TFLOPS

The HPC service uses SLURM as its batch scheduler. The cluster is divided into 4 partitions: Debug, Batch, Serial, and GPU. Below are the specifications per partition:

Debug (2 Nodes)

  • 44 cores, 88 threads
  • 528 GB RAM

Batch (14 Nodes)

  • 44 cores, 88 threads
  • 528 GB RAM

Serial (2 Nodes)

  • 44 cores, 88 threads
  • 528 GB RAM

GPU (6 Nodes)

  • 12 cores, 24 threads
  • 1,056 GB (~1 TB) RAM
  • NVIDIA Tesla P40
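Since the cluster is scheduled by SLURM, a job targets one of these partitions through a batch script. Below is a minimal sketch; the lowercase partition names, the script name, and the resource values are illustrative assumptions, not COARE-confirmed settings (the actual partition names can be checked with `sinfo` on the cluster).

```shell
#!/bin/bash
# Minimal SLURM batch script (a sketch; partition names are assumed --
# run `sinfo` on the cluster to see the real ones).
#SBATCH --job-name=hello-coare
#SBATCH --partition=batch        # debug, batch, serial, or gpu (assumed names)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0-12:00:00        # must fit within the partition's walltime limit
#SBATCH --output=%x-%j.out       # %x = job name, %j = job ID

# The actual workload goes here; SLURM sets SLURM_CPUS_PER_TASK at runtime.
msg="Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-4} CPUs"
echo "$msg"
```

The script would be submitted with `sbatch hello-coare.sh`, and `squeue -u $USER` shows its place in the queue.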

The home directory (/home) is the COARE's network filesystem, built on GlusterFS. Users' scripts and input data are stored here.

The scratch directories (/scratch1, /scratch2, and /scratch3) are the COARE's parallel filesystem, built on LustreFS to handle users' I/O-heavy workloads. The output of running jobs, including intermediary files, is stored here.
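The split between the two filesystems suggests a simple I/O pattern: do heavy reads and writes on a scratch directory, then keep only the final output on /home. The sketch below illustrates this; the directory defaults fall back to temporary paths so it also runs off-cluster, and the job-directory name is purely illustrative. On COARE one would set SCRATCH_ROOT to /scratch1/$USER and RESULTS_DIR to a path under $HOME.

```shell
#!/bin/bash
# I/O pattern sketch: work on the parallel filesystem (Lustre scratch),
# keep only final results on the network filesystem (GlusterFS home).
# Defaults are temporary dirs so the sketch runs anywhere; on the cluster,
# point SCRATCH_ROOT at /scratch1/$USER and RESULTS_DIR under $HOME.
SCRATCH_ROOT=${SCRATCH_ROOT:-$(mktemp -d)}
RESULTS_DIR=${RESULTS_DIR:-$(mktemp -d)}

JOB_DIR="$SCRATCH_ROOT/myjob"      # illustrative job directory name
mkdir -p "$JOB_DIR"
cd "$JOB_DIR"

# ...the I/O-heavy workload runs here, on scratch...
echo "intermediate data" > work.tmp
echo "final result" > result.dat

# Copy only the final output back; /home has a 100 GB usable quota.
cp result.dat "$RESULTS_DIR/"
echo "result stored in $RESULTS_DIR/result.dat"
```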

NOTE: As part of the efforts to upgrade the COARE's current infrastructure, the COARE Team has been implementing the saliksik cluster since Q2 2020; it comprises the next generation of the COARE's HPC CPUs and GPUs. For more information on saliksik, visit this Wiki page.

Default Allocation for HPC service

The COARE provides a default allocation for each user to ensure the fair and equitable use of the COARE Services. For more information, please read the COARE's Acceptable Use Policy (AUP).

The implementation of the saliksik cluster also required the adjustment of the default allocation provided for each COARE HPC user, which is summarized in the table below:

CPU: 240 logical cores
Network filesystem (/home): 100 GB usable
Parallel filesystem (scratch directories /scratch1, /scratch2, and /scratch3): 10 TB total across all scratch directories
GPU: 2 GPUs
Max running jobs: 30 jobs
Max submitted jobs: 45 jobs
Job waiting time: no guarantee; depends on the status of the queue and the availability of the requested resource/s
Job walltime limit:
  • Debug: maximum of 3 hours allowable runtime
  • Batch: maximum of 3 days allowable runtime
  • Serial: maximum of 7 days allowable runtime
  • GPU: maximum of 3 days allowable runtime
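These limits are enforced per partition, so the time and resources requested in a job script must fit within both the partition's cap and the default allocation. As a sketch, the relevant `#SBATCH` directives for a GPU job might look like the fragment below; the partition name and GRES syntax assume standard SLURM conventions rather than COARE-confirmed settings.

```shell
#SBATCH --partition=gpu       # assumed partition name; verify with `sinfo`
#SBATCH --gres=gpu:2          # default allocation permits up to 2 GPUs
#SBATCH --time=3-00:00:00     # days-hours:min:sec; GPU walltime cap is 3 days
```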

Any requests for allocation increase will be subject to the COARE Team's evaluation and approval.
