Revision as of 17:53, 4 November 2025

This page presents the features of SLURM that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).

Users of Mufasa must use SLURM to run resource-heavy processes, i.e. computing jobs that require any of the following:

GPUs
multiple CPUs
a significant amount of RAM.

In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the login server virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).

SLURM in a nutshell

Computation jobs on Mufasa need to be launched via SLURM. SLURM provides jobs with access to the physical resources of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.

When a user runs a job, the job does not get executed immediately and is instead queued. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the priority assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule,

The greater the fraction of Mufasa's overall resources that a job asks for, the lower its priority.

The time available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.

In Mufasa 2.0 (this was different in Mufasa 1.0) access to system resources is managed via SLURM's Quality of Service (QOS) mechanism. To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.
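As a rough illustration of the requirements above, the following batch script sketches a job request that names a QOS and a time slot. The QOS name gpulight comes from the QOS list on this page; the job name, the requested amounts and the GPU type string are assumptions, to be adapted to the actual job.

```shell
#!/bin/bash
# Sketch of a SLURM batch script (job name and amounts are hypothetical).
#SBATCH --job-name=example
# The QOS must always be specified.
#SBATCH --qos=gpulight
# Requested time slot: SLURM kills the job if it runs longer than this.
#SBATCH --time=01:00:00
# Number of CPUs and amount of RAM requested.
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
# Assumed GPU type string; check the gres syntax section before using it.
#SBATCH --gres=gpu:3g.20gb:1

# SLURM sets SLURM_JOB_ID inside a job; the fallback only matters outside SLURM.
job_id=${SLURM_JOB_ID:-not-under-slurm}
echo "Job ${job_id} running on $(uname -n)"
```

Assuming the script is saved as job.sh (a hypothetical file name), it would be submitted with sbatch job.sh.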

Quality of Service (QOS)

In SLURM, different Quality of Service levels (QOSes) define different levels of access to the server's resources. SLURM jobs must always specify the QOS that they use: this choice determines what resources the job can access.

The list of Mufasa's QOSes and their main features can be inspected with command

sacctmgr list qos format=name%-11,priority,MaxJobsPerUser,maxwall,maxtres%-80

which provides an output similar to the following:

Name          Priority MaxJobsPU     MaxWall MaxTRES                                                                          
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- 
normal               0                                                                                                        
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G 
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G

The columns of this output are the following:

Name: name of the QOS

Priority: priority tier associated with the QOS

MaxJobsPU: maximum number of jobs from a single user that can be running with this QOS (note that there are also other limitations on the number of running jobs by the same user)

MaxWall: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format [days-]hours:minutes:seconds

MaxTRES: maximum amount of resources ("Trackable RESources") available to a job using the QOS, where:

cpu=K means that the maximum number of processor cores is K
gres/gpu:Type=K means that the maximum number of GPUs of class Type (see gres syntax) is K
mem=KG means that the maximum amount of system RAM is K GBytes
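To compare MaxWall values (or to check a requested duration against them) it can help to convert them to seconds. The function below is a minimal sketch that assumes only the [days-]hours:minutes:seconds format described above; the name walltime_to_seconds is hypothetical.

```shell
# Convert a duration in [days-]hours:minutes:seconds format to seconds.
walltime_to_seconds() {
    # $1: duration such as "12:00:00" or "3-00:00:00"
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print (($1 * 24 + $2) * 60 + $3) * 60 + $4 }   # days present
        NF == 3 { print ($1 * 60 + $2) * 60 + $3 }               # no days part
    '
}

walltime_to_seconds 12:00:00      # MaxWall of gpulight in the example output
walltime_to_seconds 3-00:00:00    # MaxWall of nogpu in the example output
```

Inputs in any other format produce no output, which a calling script can treat as an error.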

For instance, QOS gpulight provides jobs that use it with:

priority tier 8
a maximum of 1 running job per user
a maximum of 12 hours of duration
a maximum of 2 CPUs
a maximum of 64 GB of RAM
access to a maximum of 1 GPU of type gpu:3g.20gb
no access to GPUs of type gpu:40gb
no access to GPUs of type gpu:4g.20gb
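The reading of a MaxTRES value shown above can also be done by a script. The following is a minimal sketch; the function name tres_limit is hypothetical, and the string is copied from the gpulight row of the example output.

```shell
# Print the limit set for one resource in a MaxTRES string (nothing if absent).
tres_limit() {
    # $1: MaxTRES string, e.g. "cpu=2,mem=64G"
    # $2: resource name, e.g. "cpu" or "gres/gpu:3g.20gb"
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

# MaxTRES of QOS gpulight, as shown in the example output of sacctmgr.
gpulight='cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G'

tres_limit "$gpulight" cpu
tres_limit "$gpulight" gres/gpu:3g.20gb
tres_limit "$gpulight" mem
```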

The normal QOS is the one applied to jobs if no QOS is specified. normal provides no access at all to Mufasa's resources, so it is always necessary to specify a QOS (different from normal) when running a job via SLURM.

As seen in the example output from sacctmgr list qos above, each QOS has an associated priority tier. As a rule, the more powerful (i.e., rich in resources) a QOS is, the lower the priority of the jobs that use it. See Priority to understand how priority affects the execution order of jobs in Mufasa 2.0.

Important note. Some of the QOSes may be available only to a subset of users. In Mufasa, such a limitation is associated with the category that users belong to.

Amount of resource available to a QOS

The maximum amount of resources that a QOS has access to (available to the running jobs using the QOS, collectively) can be inspected with command

sacctmgr list qos format=name%-11,grpTRES%-34

which provides an output similar to

Name        GrpTRES                            
----------- ---------------------------------- 
normal                                         
nogpu       cpu=48,mem=384G                    
gpuheavy-20 gres/gpu:4g.20gb=4                 
gpuheavy-40 gres/gpu:40gb=3                    
gpulight    cpu=8,gres/gpu:3g.20gb=4,mem=256G  
gpu         cpu=24,gres/gpu:3g.20gb=3,mem=192G 
gpuwide     cpu=40,gres/gpu:4g.20gb=5,mem=320G

Note how the overall resources associated with the set of all QOSes exceed the resources actually available. With SLURM, multiple QOSes can be given access to the same physical resource (e.g., a CPU or a GPU), because SLURM guarantees that the overall request for resources from all running jobs does not exceed the overall availability of resources in the system. SLURM executes a job only when every resource that the job requests is free at that time.

Partitions

Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via QOSes, partitions are not very relevant.

In Mufasa 2.0, there is a single SLURM partition, called jobs, and all jobs run on it. The partition status of Mufasa can be inspected with

sinfo

which provides an output similar to the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
jobs*        up 3-00:00:00      1   idle gn01

As explained, partitions matter little in Mufasa 2.0, while QOSes are central.

Research users and students users

Users of Mufasa belong to two categories, which provide the users belonging to them with different access to system resources. The categories are:

Research users, i.e. academic personnel and Ph.D. students

have access to all QOSes
their jobs have a higher base priority

Students users, i.e. M.Sc. students

do not have access to QOS gpuheavy-20 and gpuheavy-40
their jobs have a lower base priority

You can inspect the differences between researcher and student users with command

sacctmgr list association format=account,priority,maxjobs,maxsubmit | grep -E 'Priority|research|students'

which provides an output similar to the following:

   Account   Priority MaxJobs MaxSubmit 
  research          4       2         4 
  students          1       1         2

This example output shows that the differences between research and students are the following:

base priority is 4 for jobs run by research users, while it is 1 for jobs run by students users
the maximum number of running jobs is 2 for research users, while it is 1 for students users
the maximum number of jobs submitted to SLURM at any time (i.e., running or waiting in the queue) is 4 for research users, while it is 2 for students users

Job priority

Once the execution of a job has been requested, the job is not run immediately: it is instead queued by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the priority of the jobs, and defines the order in which each job will reach execution.

SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. This goal is achieved by encouraging users to avoid asking for resources or execution time that their job does not need.

This mechanism creates a virtuous cycle. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.
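The ordering of the queue by priority can be reproduced with a short script. This is only a sketch: the function name rank_by_priority and the sample job IDs and priority values are hypothetical, and it assumes that each input line holds a job ID followed by its priority.

```shell
# Sort "job-id priority" lines (read from standard input) by decreasing priority.
rank_by_priority() {
    sort -k2,2nr
}

# On Mufasa the input could come from squeue, for instance:
#   squeue -h -t PENDING -o '%i %Q' | rank_by_priority
# Hypothetical sample data, used here so that the sketch runs without SLURM:
printf '%s\n' '101 1200' '102 4800' '103 300' | rank_by_priority
```

The job printed first is the one that SLURM would start first, as soon as the resources it requires are available.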

Elements determining job priority

In Mufasa, the priority of a job is computed by SLURM according to the following elements:

Category of user who launched the job - see Research users and students users: Used to provide higher priority to jobs run by research personnel

QOS used by the job - see Quality of Service (QOS): Used to provide higher priority to jobs requesting access to fewer system resources

Job duration, i.e. the execution time requested by the job: Used to provide higher priority to shorter jobs

Job size, i.e. the number of CPUs requested by the job: Used to provide higher priority to jobs requiring fewer CPUs

Age, i.e. the time that the job has been waiting in the queue: Used to provide higher priority to jobs which have been queued for a long time

FairShare, i.e. a factor computed by SLURM to balance use of the system by different users: Used to provide higher priority to jobs by users who use Mufasa less than others

System resources subjected to limitations

The hardware resources of Mufasa are limited. For this reason, some of them are subjected to limitations, i.e. (these are SLURM's own terms):

cpu: the number of processor cores that a job uses

mem: the amount of RAM that a job uses

gres: the amount of generic resources that a job uses: in Mufasa, gres are GPUs

These are some of the TRES (Trackable RESources) defined by SLURM. From SLURM's documentation: "A TRES is a resource that can be tracked for usage or used to enforce limits against."

SLURM provides jobs with access to resources only for a limited time, so

execution time is itself a limited resource.

When a resource is limited, a job cannot use arbitrary quantities of it. On the contrary, the job must specify how much of the resource it requests. Resource requests are done either by running the job on a partition for which a default amount of resources has been defined, or via the options of the command used to launch the job via SLURM.

`gres` syntax

Whenever it is necessary to specify the quantity of gres, i.e. generic resources, a special syntax must be used. In Mufasa gres resources are GPUs, so this syntax applies to GPUs. The number and types of Mufasa's GPUs are described here.

The name of each GPU resource takes the form

Name:Type

where Name is gpu and Type takes the following values [to be updated]:

40gb for GPUs with 40 Gbytes of onboard RAM
20gb for GPUs with 20 Gbytes of onboard RAM

So, for instance,

gpu:20gb

identifies the resource corresponding to GPUs with 20 GB of RAM.

When asking for a gres resource (e.g., in an srun command or an SBATCH directive of an execution script), the syntax required by SLURM is

<Name>:<Type>:<Quantity>

where Quantity is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type 20gb the syntax is

gpu:20gb:2
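A small helper can assemble the request string from its parts, which avoids typing mistakes in scripts. This is a sketch: the function name gres_spec is hypothetical, and the srun line in the comment uses placeholders rather than real values.

```shell
# Build a gres request string in the form gpu:<Type>:<Quantity>.
gres_spec() {
    # $1: GPU type (e.g. 20gb), $2: number of GPUs requested
    printf 'gpu:%s:%s\n' "$1" "$2"
}

gres_spec 20gb 2

# Possible use in a job request (placeholders to be filled in by the user):
#   srun --qos=<QOS> --time=<duration> --gres="$(gres_spec 20gb 2)" <command>
```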

SLURM's generic resources are defined in /etc/slurm/gres.conf. In order to make GPUs available to SLURM's gres management, Mufasa makes use of Nvidia's NVML library. For additional information see SLURM's documentation.

Looking for unused GPUs

GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to request a GPU that is not currently in use. This command

sinfo -O Gres:100

provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following [to be updated]:

GRES                                                                                                
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)

To know which of the GPUs are currently in use, use command

sinfo -O GresUsed:100

which provides an output similar to this:

GRES_USED                                                                                           
gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6)

By comparing the two lists (GRES and GRES_USED) in the examples above, you can see that in this example:

the system has 2 40 GB GPUs, all of which are in use
the system has 3 20 GB GPUs, of which one is not in use
the system has 6 10 GB GPUs, of which 3 are not in use
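The comparison above can be automated. The following is a minimal sketch that takes the two strings printed by sinfo and reports, for each GPU type, how many GPUs are not in use; the function name free_gpus is hypothetical and the strings are those of the example outputs above.

```shell
# Print "<type> <number of unused GPUs>" for each GPU type.
free_gpus() {
    # $1: GRES value, $2: GRES_USED value (as printed by sinfo)
    printf '%s\n%s\n' "$1" "$2" |
        sed 's/([^)]*)//g' |      # drop the "(S:...)" and "(IDX:...)" parts
        awk -F, '
            NR == 1 { n = NF
                      for (i = 1; i <= NF; i++) { split($i, p, ":"); order[i] = p[2]; total[p[2]] = p[3] } }
            NR == 2 { for (i = 1; i <= NF; i++) { split($i, p, ":"); used[p[2]] = p[3] } }
            END     { for (i = 1; i <= n; i++) print order[i], total[order[i]] - used[order[i]] }
        '
}

free_gpus 'gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)' \
          'gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6)'
```

On Mufasa the two arguments could be obtained with sinfo -h -O Gres:100 and sinfo -h -O GresUsed:100, after removing the padding spaces (e.g. with tr -d ' ').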

SLURM partitions

Execution queues for jobs in SLURM are called partitions. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command

sinfo

(see SLURM's documentation) provides a list of available partitions. Its output is similar to this [to be updated]:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*        up      20:00      1    mix gn01
small         up   12:00:00      1    mix gn01
normal        up 1-00:00:00      1    mix gn01
longnormal    up 3-00:00:00      1    mix gn01
gpu           up 1-00:00:00      1    mix gn01
gpulong       up 3-00:00:00      1    mix gn01
fat           up 3-00:00:00      1    mix gn01

In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside "debug" indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified. On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to.

The columns in the standard output of sinfo shown above correspond to the following information:

PARTITION: name of the partition

AVAIL: state/availability of the partition: see below

TIMELIMIT: maximum runtime of a job allowed by the partition, in format [days-]hours:minutes:seconds

NODES: number of nodes available to jobs run on the partition: for Mufasa, this is always 1 since there is only 1 node in the computing cluster

STATE: state of the node, expressed with SLURM's node state codes; typical values are mixed (shown as mix), meaning that some of the resources of the node are busy executing jobs while others are free, and allocated, meaning that all of the resources of the node are busy

NODELIST: list of nodes available to the partition: for Mufasa this field always contains gn01 since Mufasa is the only node in the computing cluster [to be updated]

As for hardware resources (such as CPUs, GPUs and RAM), the amounts of each resource available to Mufasa's partitions are set by SLURM's accounting system, and are not visible to sinfo. See Partition features for a description of these amounts.

Partition features

The output of sinfo (see above) provides a list of available partitions, but (except for time) it does not provide information about the amount of resources that a partition makes available to the user jobs which are run on it. The amount of resources is visible through command

sacctmgr list qos format=name%-10,maxwall,maxtres%-64

which provides an output similar to the following [to be updated]:

Name           MaxWall MaxTRES                                                          
---------- ----------- ---------------------------------------------------------------- 
normal      1-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G  
small         12:00:00 cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G    
longnormal  3-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G  
gpu         1-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G                    
gpulong     3-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G                    
fat         3-00:00:00 cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G

Its elements are the following (for more information, see SLURM's documentation):

Name: name of the partition

MaxWall: maximum wall clock duration of the jobs run on the partition (after which they are killed by SLURM), in format [days-]hours:minutes:seconds

MaxTRES: maximum amount of resources ("Trackable RESources") available to a job running on the partition, where:

cpu=K means that the maximum number of processor cores is K
gres/gpu:Type=K means that the maximum number of GPUs of class Type (see gres syntax) is K
mem=KG means that the maximum amount of system RAM is K GBytes

Note that additional limits may prevent a job from using all the resources of a partition. For instance, there may be a cap on the maximum number of GPUs that can be used at the same time by a single job and/or a single user.

Partitions of Mufasa 2.0

The features of the SLURM partitions of Mufasa 2.0 are the following:

Name of partition	example of use	max running jobs per user	GPU configurations available to each job	GPUs that the partition has access to	max CPUs per job	max RAM per job [GB]	max wall clock time per job [h]	default resources assigned to jobs	allowed users
gpulight	debug code	1	1 x 20 GB	4 x 20 GB	2	64	6	to be specified	researchers, students
nogpu	tasks not requiring GPUs (in particular: not AI)	1	-	none	16	128	72	to be specified	researchers, students
gpu	AI: train an already debugged model	1	1 x 20 GB	3 x 20 GB	8	64	24	to be specified	researchers, students
gpuwide	AI: search for optimal hyperparameter values	2	1 x 20 GB	5 x 20 GB	8	64	24	to be specified	researchers, students
gpuheavy	AI: train an already optimised model	1	1 x 20 GB or 2 x 20 GB or 1 x 40 GB	3 x 40 GB + 4 x 20 GB	8	128	72	to be specified	researchers

The overall resources associated with the set of all partitions exceed the resources actually available, as multiple partitions can be given access to the same resource (e.g., a CPU or a GPU). SLURM executes a job only when every resource that the job requests is free at that time.

Partition availability

An important piece of information that sinfo provides (column "AVAIL") is the availability (also called state) of partitions. Possible partition states are:

up = the partition is available: Currently running jobs will be completed; Currently queued jobs will be executed as soon as resources allow

drain = the partition is in the process of becoming unavailable (i.e., of going into the down state): Currently running jobs will be completed; Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)

down = the partition is unavailable: There are no running jobs; Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)

When a partition goes from up to drain no harm is done to running jobs. When a partition passes from any other state to down, running jobs (if they exist) get killed. A partition in state drain or down requires intervention by a Job Administrator to be restored to up.
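Before submitting a job, a script can check whether any partition is not in the up state. The following is a minimal sketch; the function name unavailable_partitions and the sample lines are hypothetical, and the input is assumed to hold one partition name and its availability per line.

```shell
# Print the names of the partitions whose availability is not "up".
unavailable_partitions() {
    # The trailing asterisk that marks the default partition is removed.
    awk '$2 != "up" { sub(/\*$/, "", $1); print $1 }'
}

# On Mufasa the input could come from sinfo, for instance:
#   sinfo -h -o '%P %a' | unavailable_partitions
# Hypothetical sample data, used here so that the sketch runs without SLURM:
printf '%s\n' 'debug* drain' 'gpu up' 'fat down' | unavailable_partitions
```

An empty output means that all partitions are up.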

