Best Practices
To ensure an optimal user experience on the COARE, the COARE Team encourages all users to adhere to the following best practices for the COARE HPC service:
Most errors can be solved by reading the logs and understanding the behavior of the application; fixing them usually does not require special or root privileges. Since the COARE is a multidisciplinary infrastructure that caters to different fields in the academe, it is important that users be able to understand and debug their jobs on their own.
On debugging jobs
View the log file: Every job in SLURM has an output log file, named either by the parameter #SBATCH --output=<filename>.out in the SLURM script (with no spaces around the = sign) or by SLURM's default naming scheme, slurm-<jobid>.out. Use the tail -f command (or tailf, where it is still installed) to view the log as it is being appended.
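As a minimal sketch of the log-file setup described above, the snippet below writes a small SLURM script that names its own log file. The file name example_job.sh, the job name example, and the echo command are placeholders chosen for illustration and are not part of the COARE documentation.

```shell
#!/bin/bash
# Write a minimal SLURM script whose log file name is set explicitly.
# %x expands to the job name and %j to the job ID (standard SLURM filename patterns).
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=%x-%j.out

echo "Job started on $(hostname) at $(date)"
EOF

# On the frontend, submit the script and follow its log while the job runs:
#   sbatch example_job.sh        # prints "Submitted batch job <jobid>"
#   tail -f example-<jobid>.out  # press Ctrl+C to stop following
```

The submission and tail commands are left as comments so that the snippet itself only creates the script; run them by hand once the script looks right.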
On benchmarking jobs
Observe CPU, GPU, and memory utilization: Once a job has been allocated resources, the user can SSH into the node where the job runs.
To view the CPU and memory utilization, run htop -u $USER on the allocated node. Take note of the CPU% and RES columns, which show the CPU utilization and the resident memory usage, respectively. In htop, press the <F5> key to show the process tree; this gives you a better understanding of which processes are underutilizing their resources and which are overprovisioned.
To view the GPU utilization, run gpustat on the allocated node. From this output, the user can judge whether the job's parameters need to be adjusted to further optimize the run.
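The monitoring steps above can be sketched as follows. The helper name list_job_nodes is hypothetical; it only wraps squeue to show which node each of your jobs was allocated, so that you know where to SSH. The <node> placeholder stands for whatever node name squeue reports.

```shell
#!/bin/bash
# list_job_nodes (hypothetical helper): print "<jobid> <state> <nodelist>"
# for each of the current user's jobs.
list_job_nodes() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found; run this on a machine with SLURM installed" >&2
        return 1
    fi
    # -h drops the header; %i = job ID, %T = job state, %N = allocated node(s)
    squeue -u "${USER:-$(id -un)}" -h -o "%i %T %N"
}

# Typical sequence once a job is RUNNING (<node> is taken from the output above):
#   list_job_nodes
#   ssh <node>
#   htop -u "$USER"      # watch CPU% and RES; press F5 for the process tree
#   watch -n 2 gpustat   # refresh the GPU utilization every 2 seconds
```

The interactive commands are kept as comments because htop and gpustat only make sense on the node where the job is actually running.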
On submitting jobs
All jobs on the COARE's resources must be run through the SLURM scheduler. Jobs that do not use the SLURM scheduler will be terminated without warning, as these affect other users of the facility. To maximize the use of the COARE's resources and to optimize queueing time, users are highly encouraged to follow these tips:
- Benchmark the job. For an efficient and optimized run, it is important to run the job at a smaller scale before running it in a production environment.
- Allocate resources properly. This means setting the CPU, memory, and time limit properly to avoid overprovisioning of resources. For more information, see On benchmarking jobs.
- Set accurate parameters. Doing so helps your jobs get scheduled efficiently, prevents your program from crashing, and avoids wasting resources. Before you submit your job, also determine which partition (batch or debug) you will submit it to.
- Utilize the directories accordingly. Running jobs in /home is not allowed. Scratch directories should not be used as long-term storage for your files; if you wish to store your files for a longer time, please use your /home directory.
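The tips above can be combined into a small-scale benchmark script, sketched below under stated assumptions. The resource values, the SCRATCH_DIR variable, the program name my_program, and its input file are all hypothetical placeholders; replace them with figures from your own benchmarking and with your actual scratch path. Only the partition names batch and debug come from this page.

```shell
#!/bin/bash
# Write a small benchmark job script with explicit resource requests.
# All values below are illustrative assumptions, not COARE defaults.
cat > benchmark_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=benchmark
#SBATCH --partition=debug      # debug for short test runs, batch for production
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      # request only the cores the program can use
#SBATCH --mem=8G               # size this from the RES column seen in htop
#SBATCH --time=00:30:00        # an accurate time limit helps the scheduler
#SBATCH --output=%x-%j.out

# Run from a scratch directory, never from /home.
# SCRATCH_DIR is a hypothetical variable; the job stops here if it is unset.
cd "${SCRATCH_DIR:?set SCRATCH_DIR to your scratch directory}" || exit 1

# my_program and small_sample.dat are placeholders for your own application.
srun ./my_program --threads "$SLURM_CPUS_PER_TASK" --input small_sample.dat
EOF

# Check the syntax locally, then submit from the frontend:
#   bash -n benchmark_job.sh
#   sbatch benchmark_job.sh
```

Once the small run finishes, compare the requested CPU, memory, and time against what the job actually used, then scale the requests for the production run on the batch partition.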
Do’s and Don’ts
To maintain the stability of the COARE's resources and provide a good user experience to all users, the following are a few do’s and don’ts that COARE users are expected to comply with:
Do:
- Log in only to the frontend.
- Change your password at first login.
- Copy files to and from the HPC Frontend.
Don’t:
- Run heavy jobs or intensive applications on the frontend.
- Run jobs without using the SLURM utility.