HPC



The COARE HPC is a cluster of compute and storage servers that enables high-speed, resource-intensive computations and the processing of large datasets.

The system architecture of the COARE HPC service is shown below:

[Figure: COARE HPC system architecture]

The current capacity of the COARE HPC is summarized below:
 

  • CPU: 30 Tflops
  • GPU: 75 Tflops

The HPC service uses SLURM as its batch scheduler. The cluster is divided into four partitions: Debug, Batch, Serial, and GPU. Below are the specifications per partition; a quick way to list the partitions and their limits with Slurm is shown after the list.

Debug (2 Nodes)

  • 44 cores, 88 threads
  • 528GB RAM 
     

Batch (32 Nodes)

  • 44 cores, 88 threads
  • 528GB RAM
     

Serial (2 Nodes) 

  • 44 cores, 88 threads
  • 528GB RAM
                   

GPU (6 Nodes) 

  • 12 cores, 24 threads
  • 1056GB (~1TB) RAM
  • NVIDIA Tesla P40
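
For reference, the partitions and their configured limits can be listed directly on the cluster with Slurm's sinfo command. This is a minimal sketch assuming standard Slurm client tools are available on a COARE login node; the actual partition names and limits are whatever the scheduler reports:

    # List each partition with its time limit, node count, CPUs per node, and memory per node (MB)
    sinfo -o "%P %l %D %c %m"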
     

The home directory (/home) is COARE's network filesystem, which runs on GlusterFS and serves as each user's home directory. Users' scripts and input data are stored here.

The scratch directories (/scratch1, /scratch2, and /scratch3) make up COARE's parallel filesystem, which runs on LustreFS. They are built to handle users' I/O-heavy workloads. The output of running jobs, including intermediate files, is stored here.
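
To tie the partitions and filesystems together, here is a minimal sketch of a Slurm batch script. The lowercase partition name, the resource values, and the /scratch1/$USER working-directory layout are assumptions for illustration only; adjust them to your own allocation and directories:

    #!/bin/bash
    #SBATCH --job-name=sample_job
    #SBATCH --partition=batch        # one of the partitions above (name assumed lowercase)
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=1-00:00:00        # 1 day; must fit within the partition's walltime limit
    #SBATCH --output=%x-%j.out

    # Run I/O-heavy work in the scratch (parallel) filesystem; /scratch1/$USER is an assumed layout
    WORKDIR=/scratch1/$USER/sample_job
    mkdir -p "$WORKDIR"
    cd "$WORKDIR"

    # Replace with the actual workload; scripts and input data typically live in /home
    srun hostname

Save the script (for example, as sample_job.sh), submit it with sbatch sample_job.sh, and monitor it with squeue -u $USER.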

NOTE: As part of its efforts to upgrade COARE's current infrastructure, the COARE Team has been implementing the saliksik cluster since Q2 2020. The saliksik cluster comprises the next generation of COARE's HPC CPU and GPU resources. For more information on saliksik, visit this Wiki page.

Default Allocation for the HPC Service

COARE provides a default allocation for each user to ensure the fair and equitable use of the COARE Services. For more information, please read COARE's Acceptable Use Policy (AUP).

The implementation of the saliksik cluster also required adjusting the default allocation provided to each COARE HPC user, which is summarized below:

  • CPU: 86 logical cores
  • Network filesystem (/home): 100 GB usable
  • Parallel filesystem (scratch directories /scratch1, /scratch2, and /scratch3): 3 TB total across scratch directories
  • GPU: 1 GPU
  • Max running jobs: 30 jobs
  • Max submitted jobs: 45 jobs
  • Job waiting time: no guarantee; depends on the status of the queue and the availability of the requested resource/s
  • Job walltime limit:
      • Debug: maximum of 1 day allowable runtime
      • Batch: maximum of 7 days allowable runtime
      • Serial: maximum of 14 days allowable runtime
      • GPU: maximum of 3 days allowable runtime
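
Before submitting additional work, you can check your queue against these limits with standard Slurm commands. The snippet below is a sketch assuming standard Slurm client tools on the login node; my_gpu_job.sh and the lowercase partition name gpu are placeholders:

    # Count running and pending jobs against the 30-running / 45-submitted limits
    squeue -u $USER -t RUNNING | tail -n +2 | wc -l
    squeue -u $USER | tail -n +2 | wc -l

    # Submit a GPU job within the default allocation: 1 GPU, walltime of at most 3 days
    sbatch --partition=gpu --gres=gpu:1 --time=3-00:00:00 my_gpu_job.sh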

Any request for an allocation increase will be subject to the COARE Team's evaluation and approval.
