Basics
+
The COARE HPC is one of the services offered by the COARE. It can be used to process massive amounts of data that require high-speed calculations and powerful computing resources.
+
As of Oct 2020, the COARE HPC's current capacity is as follows:
  • CPU: 30 Tflops
  • GPU: 72 Tflops
More information on the COARE HPC can be found here.
Accessing the HPC
+
The COARE account is valid for three (3) months upon approval of your COARE account application.
+
You may have used the wrong private key. You can confirm this by looking for the keywords "Permission denied (publickey)" in the verbose SSH output, which you can generate by adding -vv to your ssh command.
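For reference, a verbose connection attempt looks like this (the username and key path are placeholders):
$ ssh -vv -i [path to your private key] [user]@saliksik.asti.dost.gov.ph
Lines near the end of the output that mention "Permission denied (publickey)" indicate that the key you offered was not accepted.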
+
SSH keys in OpenSSH format can be generated via the CLI (command line interface), while PuTTYgen is an SSH key-generating GUI (graphical user interface) that primarily runs on Windows. The two also differ in file formats, as shown in detail in this Wiki section.
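As an example, an OpenSSH-format key pair can be generated from the command line with ssh-keygen (the filename and comment below are placeholders):
$ ssh-keygen -t rsa -b 4096 -C "[your email]" -f ~/.ssh/id_rsa_coare
This writes the private key to ~/.ssh/id_rsa_coare and the public key to ~/.ssh/id_rsa_coare.pub.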
+
  • "WARNING: UNPROTECTED PRIVATE KEY FILE!"
  • Permission denied (publickey,gssapi-keyex,gssapi-with-mic)/ bad permissions​
Since your private key file is in PuTTY (.ppk) and you already switched to Linux, you must convert this to OpenSSH format.On Linux, there is a package named putty-tools. To convert, run the the command:​
> puttygen [your ppk file] -O [new private key filename] -o id_rsa
On your Linux machine, change the permissions of your new private key file:
> sudo chmod 400 [your new private key]
To connect, you can use:
$ ssh [-v] [-i path to your private key] [user]@saliksik.asti.dost.gov.ph   # or 202.90.149.55
+
This error might be caused by the following:
  • PuTTY can't find the private key, specifically when PuTTY shows an error prompt stating that it can't open the private key '.ppk'. You may try to log in once again. Alternatively, you may opt to generate a new set of SSH keys and have these appended by the COARE Team.
  • You probably generated the SSH keys in a different format (e.g., SSH2). Please ensure that the SSH keys are in OpenSSH format. You may refer to this Wiki section for how it should appear. You can convert SSH keys in SSH2 format to OpenSSH, or you may also opt to generate a new set of SSH keys.
+
Since you changed devices when logging in with your COARE credentials, the error you encountered means that the permissions may have been altered, and OpenSSH is warning you that the permissions on your private key file are "too open". For this error, you may refer to this resource.
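A typical fix, assuming the key is on a Linux machine, is to restrict the file's permissions so that only your user can read it:
$ chmod 600 [path to your private key]
Setting the permissions to 400 (read-only), as shown earlier, also works.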
+
You may refer to this Wiki for the procedures on how to access the HPC. However, if you still encounter the same error even after following all the procedures in the Wiki link provided, you may submit a ticket through the COARE User Portal, as this might be a server network error.
+
No, because you might encounter access error prompts upon logging in, such as "No supported authentication methods available" or "Server refused our key". However, these SSH keys can still be converted to the correct OpenSSH or PuTTY formats.
+
To ensure that the generated OpenSSH public and private SSH key pairs are in the correct format, you may check this Wiki.
+
No. You have to request the COARE Team to add another public key so you can use your other devices to access the COARE. To do this, please log a service request ticket on the COARE User Portal.
HPC Resource Allocation and Quota Limits
+
The default allocation per COARE HPC user is summarized below:
  • CPU: 240 logical cores
  • Network filesystem (/home): 100 GB usable
  • Parallel filesystem (scratch directories /scratch1 and /scratch2): 5 TB for each scratch directory
  • GPU: 2 GPUs
  • Max running jobs: 30 jobs
  • Max submitted jobs: 45 jobs
  • Job waiting time: no guarantee; depends on the status of the queue and the availability of the requested resource/s
  • Job walltime limit:
      ◦ Batch and GPU: one (1) hour default; automatically extended by one (1) hour by the HPC job scheduler to a maximum of three (3) days only
      ◦ Debug: one (1) hour default; automatically extended by one (1) hour by the HPC job scheduler to a maximum of three (3) hours only
      ◦ Serial: one (1) day default; automatically extended by one (1) day by the HPC job scheduler to a maximum of seven (7) days only
For more information, visit the COARE Service Catalogue.
+
Some of the best practices on allocating memory and CPU can be found on our Wiki page on benchmarking/parallelizing jobs.
+
For more information about the QOS (Quality of Service) per job, please refer to the Wiki section regarding SLURM.
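If you want to inspect the QOS definitions directly on the HPC, the standard SLURM accounting command below lists them together with their limits (the exact columns shown depend on the cluster's configuration):
$ sacctmgr show qos
$ squeue -u [your username]    # lists your pending and running jobs and the partitions they use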
+
Any request for an allocation increase will be subject to the COARE Team's evaluation and the COARE's current capacity. If you already have a COARE account, you can request an increase by submitting a service request ticket through the COARE User Portal. If you do not have a COARE account yet, you can apply for one by following the instructions here. You can also email us at gridops@asti.dost.gov.ph before applying for a COARE account if you wish to discuss your request for a higher allocation first.
Navigating the HPC (Home and Scratch Directories)
+
Files in the home directory are not purged. However, the scratch directories are regularly purged, and all purged files are irrecoverable, so you will no longer be able to retrieve data that has been purged.
+
Scratch directories are intended for heavy I/O and temporary files. Because of this, the scratch directories are not suited for storing long-term/archived data. All files damaged in the scratch directories are irrecoverable.
+
Files that are accidentally deleted with the rm command can no longer be recovered due to the high volume of disk writes in the scratch folders. We highly recommend that you always back up your files and be mindful when deleting files and folders.
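As an illustration of a simple backup (the paths below are placeholders), important outputs can be copied from a scratch directory to the home directory with rsync:
$ rsync -av /scratch1/[your username]/[project folder]/ ~/backup/[project folder]/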
+
You may refer to this Wiki section which focuses on Transferring Files via WinSCP.
+
We do not delete files from the home and scratch directories except upon the user's request. All data stays in its respective directory, provided that the users are active and respond to the notifications we send regarding their accounts.
Running Jobs
+
None.
+
Here are some reasons why a job sits in the queue:
  • No available nodes
  • Maxed QOS allocations
  • Job has been assigned a low priority (since the scheduler increases the priority of jobs that have been queued longer)
+
This wiki on how to use SLURM contains some useful commands that may be helpful to you.
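For quick reference, these are some standard SLURM commands (the script name, username, and job ID below are placeholders):
$ sbatch [your script].slurm     # submit a batch script
$ squeue -u [your username]      # list your pending and running jobs
$ scancel [job id]               # cancel a job
$ sinfo                          # show partition and node status
$ scontrol show job [job id]     # show detailed information about a job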
+
The COARE Team will kill jobs that are running on the frontend. Another reason may be that there are errors in the script/application itself.
+
This can mean either that you have already maximized your QOS allocation or that your job has been assigned a lower priority (since the scheduler increases the priority of jobs that have been queued longer).
+
This wiki contains some recommendations on job submissions that would yield optimal results.
+
In the batch script, we recommend that you set this to "unlimited" so that you will not encounter stack limit errors.
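Assuming this refers to the stack size limit, the line below can be placed in the batch script before the application is launched:
ulimit -s unlimited    # remove the stack size limit for this job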
+
The default output file is slurm-[job ID].out, written in the directory where the sbatch command was run. This wiki shows how to set the output file.
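To set the output file explicitly, the standard SLURM directives below can be added to the submission script (the filenames are placeholders; %j expands to the job ID):
#SBATCH --output=[job name]_%j.out
#SBATCH --error=[job name]_%j.err    # optional: write standard error to a separate file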
+
From time to time, nodes fail beyond our control which may be caused by the following factors:
  • Software issues
  • Hardware issues
  • Storage is already full
In this particular case, the SLURM controller became unresponsive. If your jobs have a checkpoint mechanism or can be continued from their latest output, you can use that to continue your runs. If not, they may need to be resubmitted, but it would be better to create a new working directory in your scratch folder to keep the current outputs of the terminated jobs. In a SLURM script, the output is usually created in the format '[filename].out'. If the serial jobs can be scaled beyond 30 jobs, then we can allocate one or more nodes with at least 86 usable CPU threads per node to speed up your jobs.
+
For the installation of conda packages, please use SLURM when running jobs to prevent the frontend from crashing. You may use this SLURM script as a guide to install modules/software/packages. You may review the following for more information about SLURM and module/s installation:
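As a rough sketch only (the partition, QOS, environment, and package names below are placeholders and assumptions based on the defaults listed elsewhere on this page), an installation job submitted through SLURM might look like this:
#!/bin/bash
#SBATCH --partition=debug            # the debug partition is suggested for short tasks such as installations
#SBATCH --qos=debug_default
#SBATCH --job-name=conda_install
#SBATCH --ntasks=1
#SBATCH --output=install_%j.out

module load anaconda
conda create -y -n [environment name]
source activate [environment name]
conda install -y [package name]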
+
FROM → TO:
  • 240c-1h_debug → debug_default
  • 240c-1h_batch → batch_default
  • 84c-1d_serial → serial_default
  • 12c-1h_2gpu → gpu-p40_default
The QOS only defines the limits that apply when running jobs, such as the number of CPUs allowed, the number of concurrent running jobs, etc. Hence, configuring the QOS will not affect your data or any of the software installed in your account.
+
Jobs are queued in the order in which users submit them. You can use the batch partition, or your own machine, instead of the HPC's GPU partition to run any trial-and-error scripts.
+
Because many users submit jobs on our nodes, a newly submitted SLURM script is queued first and waits for available nodes. For the installation, you can log in to saliksik-cpu-20 or saliksik-cpu-21 and use a conda environment to run pip install:
  1. module load anaconda
  2. conda create -n [environment name]
  3. source activate [environment name]
  4. Install pip into the environment (see https://anaconda.org/anaconda/pip), then run pip install for the packages you need.
Important Note: Please DO NOT use the frontend for installation purposes.
+
Knowing which partition to use is a case-to-case decision, depending on what is stated in the software's documentation. However, here are some general descriptions per partition:
  • Debug – This is one of the partitions inside the COARE HPC that is used for short jobs. You may use this partition for test runs, creating conda environments, among others.
  • Batch – This is the COARE HPC's main partition, dedicated to processing multi-core/multi-node CPU jobs in parallel. Hence, it occupies the bulk of the COARE's CPU nodes.
  • Serial – This partition is specifically for serial jobs, i.e., those that only require 1 CPU core (e.g., running 5 replicas of a certain program on 5 CPUs, each running independently).
  • GPU – GPUs are used alongside a CPU to make quick work of numerically intensive operations. For certain workloads like image processing, training artificial neural networks, and solving differential equations, a GPU-enabled code can vastly outperform a CPU code. You may refer to this resource to learn more about the GPU.
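As a minimal sketch of selecting a partition in a submission script (the partition, QOS, core count, and executable below are placeholders; pick the partition that matches your workload as described above):
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --qos=batch_default
#SBATCH --ntasks=4
#SBATCH --output=run_%j.out

srun ./[your application]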
+
Yes, but you need to justify this. You can log a service request ticket for this request here.
Installation/Compilation/Containerization of Applications
+
While users are not given SUDO/ADMIN capabilities, they can still choose to compile their own applications. However, it is recommended that compilation and installation of software in the HPC be done by the COARE Team.
+
If the software that you need requires a license, you will need to provide the license for the software.
+
Run this command:
> module avail
+
Installing a specific software/package is handled on a case-by-case basis. The COARE Team refers to the documentation of the package to be installed, and the build is optimized with compiler flags that can speed up the application.
+
We have created a Wiki for transferring files, which you can view here.
+
We highly suggest that you load and install your conda environments and packages in your /scratch [1, 2 or 3] folder because it is much faster than your /home folder. However, since both the environments and packages folders are stored in your home folder by default, you may refer to this Wiki on changing the default locations to your scratch folder.
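One way to do this, assuming you use conda's own configuration mechanism, is to point the environment and package cache directories to your scratch folder (the paths below are placeholders):
$ conda config --add envs_dirs /scratch1/[your username]/conda/envs
$ conda config --add pkgs_dirs /scratch1/[your username]/conda/pkgs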
+
You may refer to this Wiki on how to see the list of the installed packages in your environment.
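For a quick check from the command line:
$ conda list                         # packages in the currently active environment
$ conda list -n [environment name]   # packages in a named environment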
+
Yes, you may install multiple packages. You can do this by submitting a job via SLURM. Below are the resource links for the scripts for installing multiple packages:
+
Most programs/packages can be installed by users. You can use the Anaconda environment to do so. Here is the official website of Anaconda. Furthermore, you may refer to this Wiki resource link for the step-by-step process. In case you encounter any problem/s with the installation, you may seek the assistance of the COARE Team by logging a service request ticket on the COARE User Portal.
+
If the program was installed directly as it is, you may try to check whether it is available through the Anaconda environment. If so, you should carry out the installation via conda/mamba. You may refer to this Wiki for the step-by-step procedures. We also suggest installing your package in your /scratch [1, 2 or 3] directory because it is much faster than your /home directory.
+
For the installation of a CUDA-enabled package, you may follow the standard installation procedures via conda/mamba as stated in this Wiki. However, keep in mind that the SLURM job submission script for the installation, as well as for running other pertinent jobs, should be configured with the following: #SBATCH --partition=gpu and #SBATCH --gres=gpu:p40:1 (or #SBATCH --gres=gpu:1). You may also refer to this section for the specific installation commands for a CUDA-enabled build [Note: Proceed to the CUDA-enabled (GPU) build part].
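A minimal sketch of such a submission script, assuming an Anaconda-based installation (the QOS, environment name, and script below are placeholders):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu-p40_default
#SBATCH --gres=gpu:1               # or gpu:p40:1, as noted above
#SBATCH --ntasks=1
#SBATCH --output=gpu_job_%j.out

module load anaconda
source activate [environment name]
python [your CUDA-enabled script]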
+
R packages can be installed by users. To do this, you can use the R module or Anaconda environment. If there is a specific R version that is not included in the list of available modules, you may ask the COARE Team to assist you in having this installed by logging a service request ticket on the COARE User Portal.
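As an illustration (the module, environment, and package names below are placeholders; check module avail for the exact R module name), an R package can typically be installed either through the R module:
$ module load R
$ Rscript -e 'install.packages("[package name]", repos="https://cran.r-project.org")'
or through an Anaconda environment:
$ module load anaconda
$ conda create -y -n r_env r-base
$ source activate r_env
$ conda install -y -c conda-forge r-[package name]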