Simple Linux Utility for Resource Management (SLURM)
SLURM is the native scheduler software that runs on COARE’s HPC cluster. Users request allocations of compute resources through SLURM, which arbitrates contention for resources by managing a queue of pending work.
SLURM Entities
The following terms describe the main entities used in SLURM:
Frontend
The frontend is the node where users log in to access the HPC. It should not be used for computing.
Job
An allocation of resources assigned to a user for a specified amount of time.
Job Step
Sets of (possibly parallel) tasks within a job.
Message Passing Interface (MPI)
A standardized and portable message-passing system designed to exchange information between processes running on different nodes.
Modules
Environment modules enable users to choose the software that they want to use and add these to their environment.
Node
A physical, stand-alone computer that handles computing tasks and runs jobs; a compute resource managed by SLURM.
Partitions
A logical set of nodes with the same queue parameters (job size limit, job time limit, permitted users, etc.).
Quality-of-Service (QOS)
The set of rules and limitations that apply to a category of job.
Runtime
The time required for a job to finish its execution.
Task
A unit of work within a job. A task is typically used to schedule an MPI process, which can in turn use several CPUs.
Types of Jobs
The following are the types of jobs that users can run in the HPC:
Multi-node parallel jobs
Multi-node parallel jobs use more than one node and require the Message Passing Interface (MPI) to communicate between nodes. These jobs usually need more computing resources (cores) than a single node can offer.
Single-node parallel jobs
Single-node parallel jobs use only one node, but multiple cores on that node. These include pthreads, OpenMP, and shared memory MPI.
Truly-serial jobs
Truly-serial jobs require only one core on one node.
Array jobs
Multiple jobs to be executed with identical parameters.
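As a rough illustration, the following sketch shows how each job type typically maps to resource directives (see the SLURM Parameters section below). The counts here are placeholders, not COARE defaults, and --cpus-per-task is a standard sbatch option not listed in that table:
## Multi-node parallel (MPI) job: e.g., 2 nodes with 24 tasks each
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
## Single-node parallel job: 1 node, several cores (e.g., OpenMP threads)
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
## Truly-serial job: 1 core on 1 node
#SBATCH --nodes=1
#SBATCH --ntasks=1
## Array job: 10 jobs with identical parameters, indices 1-10
#SBATCH --array=1-10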
SLURM Partitions
The COARE’s SLURM currently has four (4) partitions: debug, batch, serial, and GPU.
Debug
- COARE HPC's default partition
- Queue for small/short jobs
- Maximum runtime limit per job is 180 minutes or 3 hours
- Users may wish to compile or debug their codes in this partition
Batch
- Preferred for parallel jobs, especially those that require MPI
- Maximum runtime is 3 days
Serial
- Preferred for jobs that do not require MPI-enabled applications across multiple nodes
- Maximum runtime of 7 days
GPU
- Intended for jobs that use GPUs
- Users need to add the flag #SBATCH --gres=gpu:<count> to gain access to the GPU nodes (see the sketch after this list)
- Maximum runtime is 3 days
- Maximum of 2 GPUs per job
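As an illustration, the partition-related directives for a GPU job might look like the following. This is a minimal sketch: the QOS name is taken from the QOS Parameter section below, and the GPU count and runtime should be adjusted to your job.
#SBATCH --partition=gpu
#SBATCH --qos=12c-1h_2gpu
#SBATCH --gres=gpu:2          # request 2 GPU devices (partition maximum)
#SBATCH --time=3-00:00:00     # stay within the gpu partition's 3-day limit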
SLURM Job Limits
SLURM job limits are imposed for fair usage of the COARE's resources. Every job submitted to SLURM is subject to these limits, which prevent any single user from monopolizing resources.
The COARE’s policy on SLURM Job Limits is as follows:
- Users can request up to 168 hours (7 days) for a single job.
- Users can request up to 240 CPU cores (this can be just one job or allocated to multiple jobs).
- Users can have a total of 30 simultaneous running jobs.
Job limits implemented for the COARE's saliksik cluster are summarized in the table below:
Partition | Runtime Limit |
Debug | Maximum of 3 hours allowable runtime |
Batch | Maximum of 3 days allowable runtime |
Serial | Maximum of 7 days allowable runtime |
GPU | Maximum of 3 days allowable runtime |
Job Script
A job script contains the parameters needed to run a user's specific job. Users should specify the job's requirements in the script before submitting it to the scheduler.
SLURM Parameters
The table below lists some common parameters in a job script:
Script | Description |
#!/bin/bash | Shebang line; runs the script with bash |
#SBATCH | Script Directive |
-p, --partition=<name> | Submit job to a specific partition |
-q, --qos=<name> | Quality of Service |
-N, --nodes=<count> | Request number of nodes to be allocated to this job |
--ntasks-per-node=<count> | Processes per node |
-n, --ntasks=<count> | Total processes (across all nodes) |
--mem=<count> | RAM per node |
-J, --job-name=<name> | Job Name |
-o, --output=<file-name> | Standard Output File |
-e, --error=<file-name> | Standard Error File |
--mail-user=<email-address> | Email for job alerts |
--mail-type=<type> | Receive an email when certain event types occur |
-w, --nodelist=<nodes> | Request to run in a specific node/s |
--gres=gpu:<count> | Specifies the number of GPU devices |
-a, --array=<array-range> | Launch Job Arrays |
--requeue | Job restart |
-t, --time=<time> | Limit on the total runtime of the job allocation. Acceptable time formats: "minutes", "MM:SS", "HH:MM:SS", "days-hours", "D-HH:MM", and "D-HH:MM:SS" |
QOS Parameter
Each SLURM partition has its own QOS:
Partition | QOS |
Debug | 240c-1h_debug |
Batch | 240c-1h_batch |
Serial | 84c-1d_serial |
GPU | 12c-1h_2gpu |
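To confirm the limits attached to a QOS on the cluster, the sacctmgr command can be queried. This is a minimal sketch; the columns available depend on the site's accounting configuration.
sacctmgr show qos                        # list all QOS definitions
sacctmgr show qos format=Name,MaxWall    # show only QOS names and wall-time limits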
Job Script Example
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --qos=240c-1h_debug
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --job-name="<jobname>"
#SBATCH --output="%x.%j.out" # <jobname>.<jobid>.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=gridops@asti.dost.gov.ph
#SBATCH --requeue
##SBATCH --ntasks-per-node=1 # optional
##SBATCH --mem=24G # optional: mem per node
##SBATCH --error=JobName.%J.err # optional; better to use --output only
## For more `sbatch` options, use `man sbatch` in the HPC, or go to https://slurm.schedmd.com/sbatch.html.
## Set stack size to unlimited.
ulimit -s unlimited
## Benchmarking.
start_time=$(date +%s.%N)
## Print job parameters.
echo "Submitted on $(date)"
echo "JOB PARAMETERS"
echo "SLURM_JOB_ID : ${SLURM_JOB_ID}"
echo "SLURM_JOB_NAME : ${SLURM_JOB_NAME}"
echo "SLURM_JOB_NUM_NODES : ${SLURM_JOB_NUM_NODES}"
echo "SLURM_JOB_NODELIST : ${SLURM_JOB_NODELIST}"
echo "SLURM_NTASKS : ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_MEM_PER_NODE : ${SLURM_MEM_PER_NODE}"
## Create a unique temporary folder in the node. Using a local temporary folder usually results in faster read/write for temporary files.
## If this results in errors, just change to "no".
custom_tmp="yes"
if [[ $custom_tmp == "yes" ]]; then
    JOB_TMPDIR=/tmp/${SLURM_JOB_ID}
    mkdir -p ${JOB_TMPDIR}
    export TMPDIR=${JOB_TMPDIR}
    echo "TMPDIR : $TMPDIR"
fi
## Reset modules.
module purge
module load <module1> [<module2> ...]
## Run your codes/scripts/apps. `srun` is optional.
[srun] /path/to/exe1 <arg1> ...
[srun] /path/to/exe2 <arg2> ...
## Flush the TMPDIR.
if [[ $custom_tmp == "yes" ]]; then
    rm -rf $TMPDIR
    echo "Cleared the \$TMPDIR [${TMPDIR}]"
fi
## Benchmarking
end_time=$(date +%s.%N)
echo "Finished on $(date)"
run_time=$(python -c "print($end_time - $start_time)")
echo "Total runtime (sec): ${run_time}"
IMPORTANT!
- Set accurate resource requests and parameters. Doing so lets jobs be scheduled effectively, prevents your program from crashing, and avoids wasting resources. Before you submit your job, determine which partition to submit it to: debug, batch, serial, or gpu.
- Running jobs in /home is not allowed.
- Active files should be transferred to scratch directories.
- Scratch directories should not be used as long-term storage for your files. If you wish to store your files for a longer time, please use your /home directory.
Job Management
Users can manage their jobs by checking the status of nodes, submitting job scripts to the queue, checking a job’s status, or cancelling a job.
The following will be helpful in managing your jobs:
- Node Status
- Job Submission
- Job Status
- Job Cancellation
Node Status
sinfo - view information about SLURM nodes and partitions
Node State Codes | Description |
Allocated | The node has been allocated to one or more jobs |
Completing | All jobs associated with this node are in the process of COMPLETING. This node state will be removed when all of the job's processes have terminated |
Down | The node is unavailable for use |
Drained | The node is unavailable for use per system administrator request |
Draining | The node is currently executing a job, but will not be allocated to additional jobs |
Idle | The node is not allocated to any jobs and is available for use |
Mix | The node has some of its CPUs ALLOCATED while others are IDLE |
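For example (a minimal sketch; partition names follow the SLURM Partitions section above):
sinfo                  # summary of partitions and node states
sinfo -p batch         # limit the view to the batch partition
sinfo -N -l            # long, per-node listing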
Job Submission
sbatch - submit job script to the queue
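For example, assuming a job script named job_script.sbatch (a hypothetical name):
sbatch job_script.sbatch                      # submit using the #SBATCH directives inside the script
sbatch --partition=debug job_script.sbatch    # command-line options override the script's directives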
Job Status
squeue - view information about jobs located in the SLURM scheduling queue.
Job State Codes | Description |
CD (Completed) | Job has terminated all processes on all nodes with an exit code of zero |
CG (Completing) | Job is in the process of completing. Some processes on some nodes may still be active |
PD (Pending) | Job is awaiting resource allocation |
R (Running) | Job currently has an allocation |
Job Reason Codes | Description |
Dependency | This job is waiting for a dependent job to complete |
InvalidQOS | The job's QOS is invalid |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
Priority | One or more higher priority jobs exist for this partition or advanced reservation |
QOSJobLimit | The job's QOS has reached its maximum job count |
ReqNodeNotAvail | Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding |
scontrol - view SLURM configuration and state
nvidia-smi - check the usage ("occupancy") of the GPU devices on a GPU node
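Common usage patterns for these commands (a minimal sketch; replace <jobid> and <nodename> with actual values):
squeue -u $USER                  # list only your own jobs
squeue -j <jobid>                # show a specific job
scontrol show job <jobid>        # detailed job information, including its reason code
scontrol show node <nodename>    # detailed node information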
Job Cancellation
scancel - cancel submitted jobs
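For example:
scancel <jobid>             # cancel a specific job
scancel -u $USER            # cancel all of your pending and running jobs
scancel --name=<jobname>    # cancel jobs by job name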
For more information on the best practices in using the COARE HPC's SLURM, click here.