
HPC4AI Documentation

Latest cluster news

17/01/2024

Four NVIDIA Grace Hopper Superchip boards are now available under the “gracehopper” Slurm partition.

27/11/2023

IMPORTANT: an update to the Spack system repository broke compatibility with user installations.
To solve the issue, you can either start with a fresh Spack installation or update Spack using the following commands:

  1. cd ~/spack
  2. git pull
  3. git checkout develop

If this procedure doesn’t solve the problem, we suggest starting with a fresh installation:
  1. rm -rf ~/spack ~/.spack
  2. spack_setup

26/08/2023

The Lustre filesystem is now also accessible on the frontend node.

Accessing the system

The required steps to access the cluster are:

  1. Request an account by filling out our registration form
  2. Once the account is approved, log into GitLab using the C3S Unified Login option 
  3. Upload your public SSH key to your account on the corresponding page
  4. Log in through SSH at <username>@epito.di.unito.it (see the example below)
  5. With the same account you can log into the web calendar to book the system resources (more details below)
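
As a quick sketch of steps 3 and 4 (a minimal example; replace <username> with your account name and adjust the key path if you use a different key type):

ssh-keygen -t ed25519                 # generate a key pair if you do not already have one
cat ~/.ssh/id_ed25519.pub             # this is the public key to upload to your account page
ssh <username>@epito.di.unito.it      # log into the cluster frontend
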
System Architecture

Nodes

  • 2 login nodes PowerEdge R7425
    • 2 x AMD EPYC 7281 16-Core Processor
    • 125 GiB RAM
    • InfiniBand FDR interconnection network
  • 4 compute nodes Supermicro SYS-2029U-TRT
    • 2 x Intel(R) Xeon(R) Gold 6230 CPU
    • 1536 GiB RAM
    • 1 x NVIDIA Tesla T4 GPU
    • 1 x NVIDIA Tesla V100S GPU
    • InfiniBand FDR interconnection network
  • 32 compute nodes Lenovo NeXtScale nx360 M5
    • 2 x Intel(R) Xeon(R) CPU E5-2697 v4
    • 125 GiB RAM
    • OmniPath interconnection network

Storage

All nodes in the system are equipped with:

  • a 785 TB NFS network storage, mounted at /archive
  • a 102 TB BeeGFS parallel file system, mounted at /beegfs

Each storage system provides a personal home folder for each user.
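
As a quick check (an illustrative command, not a required step), you can verify that both filesystems are mounted on a node and see the available space with:

df -h /archive /beegfs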

Spack environment

The system comes with a global Spack instance that is automatically sourced at login time.

Users can set up their own Spack instance in their home folder. In that case, automatic sourcing of the global Spack instance is disabled.

If needed, a setup script called “spack_setup” can be used to set up a personal Spack instance chained to the global one.
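
As a rough sketch of this workflow (the exact behaviour of spack_setup is site-specific; ~/spack is assumed as the location of the personal instance, as in the update notes above, and the package name is just an example):

spack_setup                                # create a personal Spack instance chained to the global one
source ~/spack/share/spack/setup-env.sh    # activate the personal instance in the current shell, if not already done at login
spack find                                 # list installed packages; chained global installs can be reused
spack install zlib                         # new packages are installed into your personal instance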

System Resources Reservation [This section is being updated]

Resources can be accessed for batch computations through the Slurm queue manager (version 22.05). In addition, a custom Slurm plugin allows users to book scheduled access to resources through an external calendar, the Booked Scheduler.

The system offers four job queues (Slurm partitions):

  • epito-compile (Time limit: 1h; Max resources: 2 cores, 1 node);
  • epito-1h (Time limit: 1h; Exclusive node access);
  • epito-12h (Time limit: 12h; Exclusive node access);
  • epito-booked (No time limit; Exclusive node access; reservation through the Booked Scheduler).
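
For example, a job can be submitted to one of the time-limited queues by naming the partition explicitly (a minimal sketch; “job.sh” and the build command are placeholders):

sbatch --partition=epito-1h job.sh                                      # exclusive node, 1-hour limit
sbatch --partition=epito-12h job.sh                                     # exclusive node, 12-hour limit
srun --partition=epito-compile --ntasks=1 --cpus-per-task=2 make -j 2   # e.g. a build on the 2-core compile queue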

Resource usage is accounted for using credits, the Booked Scheduler currency. The initial credit allocation allows a total of 5000 core-hours of computation; more can be requested by following this procedure.

To book resources on the system, first log into the web calendar. You can then select one or more epito nodes from the resource list to create a reservation. From the creation page, you can choose the start and end date and time, the resources needed, and a mandatory title and description.

Once the reservation is correctly created, the system will show a confirmation message.

Now log into the system via SSH. You can check that the corresponding Slurm reservation has been created with the command “scontrol show res”; the output lists the reservation name, its start and end times, and the reserved nodes.

You can now submit your batch script. Remember to specify the correct partition “epito-booked” and to use the reservation you created in the previous step, otherwise the job may not get the guaranteed resources or may be terminated after the default time limit. Here is a simple example of a script:

#!/bin/sh
#SBATCH --partition=epito-booked
#SBATCH --reservation=test_computing
#SBATCH --nodes=2
srun sleep 120

The #SBATCH directives specify the partition name, the reservation, and the minimum number of nodes requested for the job, one directive per line. With “srun” you specify the command to run on each node. You can find more information on the official sbatch documentation page.

Once the script is ready, you can submit the job with the “sbatch” command, passing the script file as argument. You can then check its status with the “squeue” command, which lists the job queue, showing the job ID, partition, running time and current status.

The script output and/or error logs will be written, while the job runs, to a file named slurm-<jobid>.out in the directory from which you submitted the job.
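
For example (“job.sh” is a placeholder for the script above, and <jobid> is the number reported by sbatch):

sbatch job.sh            # prints "Submitted batch job <jobid>"
squeue -u $USER          # list only your jobs, with their current status
cat slurm-<jobid>.out    # inspect the job output once it has started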

Help Desk

For any problem or question, e.g. to request the installation of additional software, please submit a ticket to the C3S helpdesk or send an email to support@hpc4ai.unito.it