Simple Linux Utility for Resource Management (SLURM)


SLURM is the native scheduler software that runs on COARE’s HPC cluster. Users request allocations of compute resources through SLURM, which arbitrates contention for resources by managing a queue of pending work.

SLURM Entities

The following terms describe the main entities used in SLURM:

  • Frontend

    The frontend is where users log in to access the HPC. It should not be used for computing.

  • Job

    An allocation of resources assigned to a user for a specified amount of time.

  • Job Step

    Sets of (possibly parallel) tasks within a job.

  • Message Passing Interface (MPI)

    A standardized and portable message-passing system designed to exchange information between processes running on different nodes.

  • Modules

    Environment modules enable users to choose the software they want to use and add it to their environment (a brief example follows this list).

  • Node

    A physical, stand-alone computer that handles computing tasks and runs jobs; a compute resource managed by SLURM.

  • Partitions

    Logical set of nodes with the same queue parameters (job size limit, job time limit, users permitted to use it, etc.)

  • Quality-of-Service (QOS)

    The set of rules and limits that apply to a category of jobs.

  • Runtime

    The time required for a job to finish its execution.

  • Task

    A task is typically used to schedule an MPI process, which, in turn, can use several CPUs.
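
As a brief illustration of the Modules entity above (module names are placeholders; run module avail on the cluster to see what is actually installed):

module avail              # list the environment modules available on the cluster
module load <module>      # add a module (for example, a compiler or an MPI library) to your environment
module list               # show the currently loaded modules
module unload <module>    # remove a module from your environment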

Types of Jobs

The following are the types of jobs that users can run on the HPC (illustrative resource directives are sketched after this list):

  • Multi-node parallel jobs

    Multi-node parallel jobs use more than one node and require the Message Passing Interface (MPI) to communicate between nodes. These jobs usually require more computing resources (cores) than a single node can offer.

  • Single-node parallel jobs

    Single-node parallel jobs use only one node, but multiple cores on that node. These include pthreads, OpenMP, and shared memory MPI.

  • Truly-serial jobs

    Truly-serial jobs require only one core on one node.

  • Array jobs

    Multiple jobs to be executed with identical parameters.
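
As a rough sketch (the node, task, and array counts below are arbitrary illustrative values), the resource directives for each job type might look like the following:

# Multi-node parallel (MPI) job: 2 nodes, 24 tasks per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

# Single-node parallel job: 1 node, 8 cores
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

# Truly-serial job: one core on one node
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Array job: 10 jobs with identical parameters
#SBATCH --array=1-10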

SLURM Partitions

The COARE’s SLURM currently has four (4) partitions: debug, batch, serial, and GPU.

Debug

- COARE HPC's default partition
- Queue for small/short jobs
- Maximum runtime limit per job is 180 minutes (3 hours)
- Users may wish to compile or debug their codes in this partition

Batch

- Preferably for parallel jobs or jobs that require MPI
- Maximum runtime is 3 days

Serial

- Preferably for jobs that do not require MPI-enabled applications across multiple nodes
- Maximum runtime of 7 days

GPU

- For jobs that use GPUs
- Users need to add the flag #SBATCH --gres=gpu:<count> to gain access to the GPU nodes
- Maximum runtime of 3 days
- Maximum of 2 GPUs
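
A minimal sketch of the GPU-related directives (requesting 1 GPU here purely as an example; the GPU QOS name is listed under QOS Parameter below):

#SBATCH --partition=gpu
#SBATCH --qos=12c-1h_2gpu
#SBATCH --gres=gpu:1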

SLURM Job Limits

SLURM job limits are imposed to ensure fair usage of the COARE's resources. Every job submitted to SLURM is subject to these limits so that no single user can monopolize resources.

The COARE’s policy on SLURM Job Limits is as follows:

  • Users can request up to 168 hours (1 week, 7 days) for a single job.
  • Users can request up to 240 CPU cores (this can be just one job or allocated to multiple jobs).
  • Users can have a total of 30 simultaneous running jobs.

Job limits implemented for the COARE's saliksik cluster are summarized in the table below:

Partition    Allowable Runtime
Debug        Maximum of 3 hours
Batch        Maximum of 3 days
Serial       Maximum of 7 days
GPU          Maximum of 3 days

Job Script

A job script is a script that contains the parameters needed to run a user's specific job. Users should specify the job's requirements before submitting it to the scheduler.

SLURM Parameters

The table below lists some common parameters in a job script:

Script                              Description
#!/bin/bash                         Allows the script to run as a bash script
#SBATCH                             Script directive
-p, --partition=<name>              Submit the job to a specific partition (debug, batch, serial, gpu)
-q, --qos=<name>                    Quality of Service
-N, --nodes=<count>                 Number of nodes to be allocated to the job
--ntasks-per-node=<count>           Processes per node; meant to be used with the --nodes option
-n, --ntasks=<count>                Total processes (across all nodes)
--mem=<count>                       RAM per node; default units are megabytes (MB)
-J, --job-name=<name>               Job name
-o, --output=<file-name>            Standard output file
-e, --error=<file-name>             Standard error file
--mail-user=<email-address>         Email address for job alerts
--mail-type=<type>                  Receive an email when certain event types occur (valid values: BEGIN, END, FAIL, REQUEUE, ALL)
-w, --nodelist=<nodes>              Request to run on specific node(s)
--gres=gpu:<count>                  Number of GPU devices
-a, --array=<array-range>           Launch a job array
--requeue                           Allow the job to be requeued (restarted)
-t, --time=<time>                   Limit on the total runtime of the job allocation (acceptable formats: "minutes", "MM:SS", "HH:MM:SS", "days-hours", "D-HH:MM", "D-HH:MM:SS")
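
For example (a sketch only; the time limit, array range, and input file names are arbitrary values chosen for illustration), the --time and --array options could be combined as follows, with $SLURM_ARRAY_TASK_ID distinguishing the array tasks:

#SBATCH --time=0-01:30:00        # limit each job to 1 hour and 30 minutes
#SBATCH --array=1-5              # launch 5 array tasks with identical parameters

srun /path/to/binary input_${SLURM_ARRAY_TASK_ID}.dat   # each array task reads its own (hypothetical) input file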

NOTES
- A job script must begin with the #!/bin/bash directive on the first line. The subsequent lines begin with the SLURM directive #SBATCH followed by a parameter.
- For more information on SBATCH parameters, please visit this link.

QOS Parameter

Each SLURM partition has its own QOS:

Partition    QOS
Debug        240c-1h_debug
Batch        240c-1h_batch
Serial       84c-1d_serial
GPU          12c-1h_2gpu

NOTES:
- The c in the QOS means CPU.  
- 2gpu means a max resource of 2 GPUs. 
- The --qos parameter is required to successfully run in a partition.
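
For instance, a job submitted to the serial partition would pair the partition and QOS directives like this (a minimal sketch):

#SBATCH --partition=serial
#SBATCH --qos=84c-1d_serial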

Job Script Example

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --qos=240c-1h_debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=24G
#SBATCH --job-name="JobName"
#SBATCH --output=JobName.%J.out
#SBATCH --error=JobName.%J.err
#SBATCH --mail-user=gridops@asti.dost.gov.ph
#SBATCH --mail-type=ALL
#SBATCH --requeue

echo "SLURM_JOBID=$SLURM_JOBID"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "SLURM_NNODES=$SLURM_NNODES"
echo "SLURMTMPDIR=$SLURMTMPDIR"
echo "working directory = $SLURM_SUBMIT_DIR"

# Place commands to load environment modules here
module load <module>

# Set stack size to unlimited
ulimit -s unlimited

# MAIN
srun /path/to/binary 

IMPORTANT!
- It is important to set accurate resources and parameters. Doing so helps jobs get scheduled effectively, prevents your program from crashing, and avoids wasting resources. Before submitting your job, determine which partition to submit it to: debug, batch, serial, or gpu.
- Running jobs in /home is not allowed.
- Active files should be transferred to scratch directories.
- Scratch directories should not be used as long-term storage for your files. If you wish to store your files for a longer time, please use your /home directory.
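
A hedged sketch of this workflow inside a job script, assuming a scratch directory of the form /scratch1/<username> (the actual scratch paths on the COARE may differ):

# Copy input files from /home to the scratch directory and run the job from there
cp -r $HOME/my_inputs /scratch1/<username>/my_inputs    # my_inputs is a hypothetical directory name
cd /scratch1/<username>/my_inputs
srun /path/to/binary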

Job Management

Users can manage their jobs by checking the status of nodes, submitting a job script to the queue, checking a job’s status, or cancelling a job.

The following will be helpful in managing your jobs:

  • Node Status
  • Job Submission
  • Job Status
  • Job Cancellation

Node Status

sinfo - view information about SLURM nodes and partitions

sinfo

Node State Code    Description
Allocated          The node has been allocated to one or more jobs
Completing         All jobs associated with this node are in the process of COMPLETING. This node state will be removed when all of the job's processes have terminated
Down               The node is unavailable for use
Drained            The node is unavailable for use per system administrator request
Draining           The node is currently executing a job, but will not be allocated additional jobs
Idle               The node is not allocated to any jobs and is available for use
Mix                The node has some of its CPUs ALLOCATED while others are IDLE
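
To narrow the output, sinfo accepts standard filtering options, for example:

sinfo -p gpu        # show only the gpu partition
sinfo -N -l         # node-oriented, long-format listing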

Job Submission

sbatch - submit job script to the queue

sbatch <job-script>
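
For example, if the job script above were saved as job_script.slurm (an illustrative file name), it would be submitted with:

sbatch job_script.slurm

On success, sbatch prints the ID assigned to the job, which can then be used with squeue, scontrol, and scancel below.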

Job Status

squeue - view information about jobs located in the SLURM scheduling queue.

squeue -u <username>

Job State Code     Description
CD (Completed)     Job has terminated all processes on all nodes with an exit code of zero
CG (Completing)    Job is in the process of completing. Some processes on some nodes may still be active
PD (Pending)       Job is awaiting resource allocation
R (Running)        Job currently has an allocation

Job Reason Code     Description
Dependency          This job is waiting for a dependent job to complete
InvalidQOS          The job's QOS is invalid
JobLaunchFailure    The job could not be launched. This may be due to a file system problem, an invalid program name, etc.
Priority            One or more higher-priority jobs exist for this partition or advanced reservation
QOSJobLimit         The job's QOS has reached its maximum job count
ReqNodeNotAvail     Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding

scontrol - view SLURM configuration and state

scontrol show jobid=<jobid>

nvidia-smi - check the usage or “occupancy” of the GPU devices

nvidia-smi

NOTE:
This command is available on GPU nodes only.

Job Cancellation

scancel - cancel submitted jobs

scancel <jobid>
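
scancel can also cancel jobs in bulk by user, for example:

scancel -u <username>    # cancel all of your queued and running jobs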

For more information on the best practices in using the COARE HPC's SLURM, click here.

 
