<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://biohpc.deib.polimi.it/index.php?action=history&amp;feed=atom&amp;title=SLURM-BAK</id>
	<title>SLURM-BAK - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://biohpc.deib.polimi.it/index.php?action=history&amp;feed=atom&amp;title=SLURM-BAK"/>
	<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM-BAK&amp;action=history"/>
	<updated>2026-05-10T01:50:43Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM-BAK&amp;diff=1634&amp;oldid=prev</id>
		<title>GiulioFontana: Created page with &quot;This page presents the features of SLURM that are most relevant to Mufasa&#039;s Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other...&quot;</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM-BAK&amp;diff=1634&amp;oldid=prev"/>
		<updated>2025-11-04T14:51:33Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s &lt;a href=&quot;/index.php?title=Roles&quot; title=&quot;Roles&quot;&gt;Job Users&lt;/a&gt;. Job Users can submit jobs for execution, cancel their own jobs, and see other...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must&amp;#039;&amp;#039;&amp;#039; use SLURM to run resource-heavy processes, i.e. computing jobs that require any of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* a significant amount of RAM.&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Bastion server|bastion server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;span style=&amp;quot;background:#FFFF00&amp;quot;&amp;gt;SLURM in a nutshell (and general usage rules)&amp;lt;/span&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their utilisation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule,&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;The greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower its priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;time&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
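For instance, a two-hour time slot can be requested when launching a job; a minimal sketch (the script name is a placeholder, not an actual Mufasa command):

```shell
# Request a 2-hour time slot; SLURM kills the job if it runs longer.
# ./my_job.sh stands for the actual command to execute.
srun --time=02:00:00 ./my_job.sh
```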
&lt;br /&gt;
To access system resources, user jobs make use of &amp;#039;&amp;#039;&amp;#039;[[#SLURM partitions|SLURM&amp;#039;s partitions]]&amp;#039;&amp;#039;&amp;#039;. A partition is basically a job queue providing access to a specified set of resources. Partitions differ in the set of resources that they provide access to because each of them is designed to fit a given type of job. The partition that a job is run on defines the maximum amount of each resource (e.g., RAM) that the job can request; also, the chosen partition defines the maximum &amp;#039;&amp;#039;execution time&amp;#039;&amp;#039; that the job can ask for. &lt;br /&gt;
&lt;br /&gt;
When a user launches a job via SLURM, they specify the &amp;#039;&amp;#039;partition&amp;#039;&amp;#039; on which the job must be run, the &amp;#039;&amp;#039;resources&amp;#039;&amp;#039; that the job needs for its execution (e.g., GPUs), and the &amp;#039;&amp;#039;execution time&amp;#039;&amp;#039; of the job. All these requests must be compatible with the limits associated to the chosen partition. When SLURM executes a job, SLURM reserves the resources requested by the job, for the time requested by the job, and gives the job exclusive access to the reserved resources. &lt;br /&gt;
Partitions can define a &amp;#039;&amp;#039;default amount&amp;#039;&amp;#039; for a resource: jobs run on the partition that do not specify how much of that resource they request are assigned the default amount (which may be zero).&lt;br /&gt;
&lt;br /&gt;
The partition on which a job is run influences the priority of the job. Partitions with fewer available resources are associated with higher priorities than &amp;quot;more powerful&amp;quot; partitions. Also, jobs requesting less than the maximum amount of resources or time allowed by the partition they are run on get a higher priority than jobs that ask for the maximum. Therefore, as a rule,&lt;br /&gt;
;:&amp;#039;&amp;#039;&amp;#039;A job should be run on the least powerful partition compatible with its needs&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
For the same reason, as a rule,&lt;br /&gt;
;:&amp;#039;&amp;#039;&amp;#039;A job should never ask for more resources, or a longer execution time, than necessary&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span style=&amp;quot;background:#FFFF00&amp;quot;&amp;gt;Job priority&amp;lt;/span&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue is determined by their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, and defines when each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. This goal is achieved by &amp;#039;&amp;#039;&amp;#039;encouraging users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
=== Elements influencing job priority ===&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
; Job duration, i.e. the execution time requested by the job&lt;br /&gt;
: This is used to assign higher priority to shorter jobs&lt;br /&gt;
&lt;br /&gt;
; Job size, i.e. the number of CPUs requested by the job&lt;br /&gt;
: This is used to assign higher priority to jobs requiring fewer CPUs.&lt;br /&gt;
&lt;br /&gt;
; Age, i.e. the length of time that the job has been waiting in the queue&lt;br /&gt;
: This is used to increase the priority of a job the longer it has been waiting for execution.&lt;br /&gt;
&lt;br /&gt;
; QOS (Quality Of Service), i.e. a factor associated to the resources requested by the job&lt;br /&gt;
: This is used to implement two different mechanisms that influence job priority, i.e.:&lt;br /&gt;
:: - to assign higher priority to jobs run on less powerful partitions&lt;br /&gt;
:: - to assign higher priority to jobs run by researchers (e.g., Ph.D. students) with respect to jobs run by M.Sc. students&lt;br /&gt;
&lt;br /&gt;
QOS is also used to set limits to the number of jobs by the same user that can be in execution at a given time, as well as the number of jobs by the same user that can be queued at a given time.&lt;br /&gt;
It is possible to get a list of the QOS that are defined in SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
The hardware resources of Mufasa are limited. For this reason, some of them are subject to limitations, namely (in SLURM&amp;#039;s own terms):&lt;br /&gt;
&lt;br /&gt;
; cpu&lt;br /&gt;
: the number of processor cores that a job uses&lt;br /&gt;
&lt;br /&gt;
; mem&lt;br /&gt;
: the amount of RAM that a job uses&lt;br /&gt;
&lt;br /&gt;
; gres&lt;br /&gt;
: the amount of &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; that a job uses: in Mufasa, gres are &amp;#039;&amp;#039;&amp;#039;GPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
These are some of the &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; defined by SLURM. From [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]: &amp;quot;&amp;#039;&amp;#039;A TRES is a resource that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
SLURM provides jobs with access to resources only for a limited time: i.e.,&lt;br /&gt;
; execution time&lt;br /&gt;
: is itself a limited resource.&lt;br /&gt;
&lt;br /&gt;
When a resource is limited, a job cannot use arbitrary quantities of it. On the contrary, the job must specify how much of the resource it requests. Resource requests are made either by running the job on a [[User Jobs#SLURM partitions|partition]] for which a default amount of resources has been defined, or via the options of the [[#Running_jobs_with_SLURM:_generalities|command]] used to launch the job via SLURM.&lt;br /&gt;
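As a sketch of the second option (script name and quantities are illustrative, not Mufasa defaults), resource requests can be passed as command options:

```shell
# Explicitly request 4 CPU cores, 32 GB of RAM and a 6-hour slot;
# any resource left unspecified falls back to the partition default.
srun --cpus-per-task=4 --mem=32G --time=06:00:00 ./my_job.sh
```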
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
Whenever it is necessary to specify the quantity of &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt;, i.e. generic resources, a special syntax must be used. In Mufasa &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are GPUs, so this syntax applies to GPUs. The number and types of Mufasa&amp;#039;s GPUs are described [[System#CPUs and GPUs|here]].&lt;br /&gt;
&lt;br /&gt;
The name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;Name:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Name&amp;lt;/code&amp;gt; is &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; and &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;[to be updated]&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of onboard RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of onboard RAM&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies the resource corresponding to GPUs with 20 GB of RAM.&lt;br /&gt;
&lt;br /&gt;
When asking for a &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;&amp;lt;Name&amp;gt;:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
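The same syntax can be used inside an execution script; a minimal sketch (script contents and quantities are illustrative placeholders):

```shell
#!/bin/bash
#SBATCH --gres=gpu:20gb:2   # two 20 GB GPUs, using the syntax above
#SBATCH --time=04:00:00     # 4-hour time slot
./train_model.sh            # placeholder for the actual workload
```

The script would then be submitted for execution with sbatch.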
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to request a GPU that is not currently in use. This command&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;[to be updated]&amp;lt;/span&amp;gt;:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED                                                                                           &lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6) &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) above, you can see that in this example:&lt;br /&gt;
&lt;br /&gt;
* the system has two 40 GB GPUs, both of which are in use&lt;br /&gt;
* the system has three 20 GB GPUs, of which one is not in use&lt;br /&gt;
* the system has six 10 GB GPUs, of which three are not in use&lt;br /&gt;
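The comparison can also be scripted; a minimal sketch that computes the number of free 20 GB GPUs, using the sample strings shown above in place of live sinfo output:

```shell
# Sample GRES strings copied from the sinfo examples above
GRES='gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)'
GRES_USED='gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6)'
# Extract the count that follows gpu:20gb: in each string
total=$(echo "$GRES" | grep -o 'gpu:20gb:[0-9]*' | cut -d: -f3)
used=$(echo "$GRES_USED" | grep -o 'gpu:20gb:[0-9]*' | cut -d: -f3)
echo "free 20gb GPUs: $((total - used))"
```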
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Execution queues for jobs in SLURM are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039;. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;[to be updated]&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug*        up      20:00      1    mix gn01&lt;br /&gt;
small         up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;debug&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified. On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to.&lt;br /&gt;
&lt;br /&gt;
The columns in the standard output of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; shown above correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
; PARTITION&lt;br /&gt;
: name of the partition&lt;br /&gt;
&lt;br /&gt;
; AVAIL&lt;br /&gt;
: state/availability of the partition: see [[User Jobs#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
; TIMELIMIT&lt;br /&gt;
: maximum runtime of a job allowed by the partition, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
; NODES&lt;br /&gt;
: number of nodes available to jobs run on the partition: for Mufasa, this is always 1 since [[System#The SLURM job scheduling system|there is only 1 node in the computing cluster]]&lt;br /&gt;
&lt;br /&gt;
; STATE&lt;br /&gt;
: state of the node (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes]); typical values are &amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt; - meaning that some of the resources of the node are busy executing jobs while others are free, and &amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt; - meaning that all of the resources of the node are busy&lt;br /&gt;
&lt;br /&gt;
; NODELIST&lt;br /&gt;
: list of nodes available to the partition: for Mufasa this field always contains &amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt; since [[System#The SLURM job scheduling system|Mufasa is the only node in the computing cluster]] &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;[to be updated]&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For what concerns hardware resources (such as CPUs, GPUs and RAM) the amounts of each resource available to Mufasa&amp;#039;s partitions are set by SLURM&amp;#039;s accounting system, and are not visible to &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;. See [[User Jobs#Partition features|Partition features]] for a description of these amounts.&lt;br /&gt;
&lt;br /&gt;
== Partition features ==&lt;br /&gt;
&lt;br /&gt;
The output of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; ([[User Jobs#SLURM partitions|see above]]) provides a list of available partitions, but (except for time) it does not provide information about the amount of resources that a partition makes available to the user jobs which are run on it. The amount of resources is visible through command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-10,maxwall,maxtres%-64&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;[to be updated]&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name           MaxWall MaxTRES                                                          &lt;br /&gt;
---------- ----------- ---------------------------------------------------------------- &lt;br /&gt;
normal      1-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G  &lt;br /&gt;
small         12:00:00 cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G    &lt;br /&gt;
longnormal  3-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G  &lt;br /&gt;
gpu         1-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G                    &lt;br /&gt;
gpulong     3-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G                    &lt;br /&gt;
fat         3-00:00:00 cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Its elements are the following (for more information, see [https://slurm.schedmd.com/qos.html SLURM&amp;#039;s documentation]):&lt;br /&gt;
&lt;br /&gt;
; Name&lt;br /&gt;
: name of the partition&lt;br /&gt;
&lt;br /&gt;
; MaxWall&lt;br /&gt;
: maximum wall clock duration of the jobs run on the partition (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
; MaxTRES&lt;br /&gt;
: maximum amount of resources (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job running on the partition, where&lt;br /&gt;
: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of processor cores is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
&lt;br /&gt;
Note that additional limits may prevent jobs from fully exploiting the resources of a partition. For instance, there may be a cap on the maximum number of GPUs that can be used at the same time by a single job and/or a single user.&lt;br /&gt;
&lt;br /&gt;
=== Partitions of Mufasa 2.0 ===&lt;br /&gt;
&lt;br /&gt;
The features of the SLURM partitions of Mufasa 2.0 are the following:&lt;br /&gt;
&lt;br /&gt;
{| class=wikitable&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| Name of partition&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| example of use &lt;br /&gt;
!align=&amp;quot;center&amp;quot;| max running jobs per user&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| GPU configurations available to each job&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| GPUs that the partition has access to&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| max CPUs per job&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| max RAM per job [GB]&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| max wall clock time per job [h]&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| default resources assigned to jobs&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| allowed users&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| gpulight&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| debug code&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 4 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 2&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 64&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 6&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;to be specified&amp;lt;/span&amp;gt;&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| researchers, students&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| nogpu&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| tasks not requiring GPUs (in particular: not AI)&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| -&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| none&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 16&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 128&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 72&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;to be specified&amp;lt;/span&amp;gt;&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| researchers, students&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| gpu&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| AI: train an already debugged model&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 3 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 8&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 64&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 24&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;to be specified&amp;lt;/span&amp;gt;&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| researchers, students&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| gpuwide&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| AI: search for optimal hyperparameter values&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 2&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 5 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 8&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 64&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 24&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;to be specified&amp;lt;/span&amp;gt;&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| researchers, students&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;center&amp;quot;| gpuheavy&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| AI: train an already optimised model&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 1 x 20 GB or 2 x 20 GB or 1 x 40 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 3 x 40 GB + 4 x 20 GB&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 8&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 128&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| 72&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| &amp;lt;span style=&amp;quot;background:#00FF00&amp;quot;&amp;gt;to be specified&amp;lt;/span&amp;gt;&lt;br /&gt;
|align=&amp;quot;center&amp;quot;| researchers&lt;br /&gt;
|}&lt;br /&gt;
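Putting the table to use: a quick debugging run needing a single 20 GB GPU might be sketched as follows (values taken from the gpulight row; treat them as provisional while the highlighted entries are being finalised, and the script name is a placeholder):

```shell
# Modest request on the least powerful GPU partition: 1 GPU,
# 2 cores, 16 GB RAM, 30 minutes.
srun --partition=gpulight --gres=gpu:20gb:1 --cpus-per-task=2 --mem=16G --time=00:30:00 ./debug_run.sh
```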
&lt;br /&gt;
The total amount of resources associated with the set of all partitions exceeds the resources actually available, since multiple partitions can be given access to the same resource (e.g., a CPU or a GPU). SLURM starts a job only when all the resources it requested are free.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
An important piece of information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides (column &amp;quot;AVAIL&amp;quot;) is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
; up = the partition is available&lt;br /&gt;
: Currently running jobs will be completed&lt;br /&gt;
: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
; drain = the partition is in the process of becoming unavailable (i.e., about to go into the &amp;#039;&amp;#039;down&amp;#039;&amp;#039; state)&lt;br /&gt;
: Currently running jobs will be completed&lt;br /&gt;
: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;#039;&amp;#039;up&amp;#039;&amp;#039; state)&lt;br /&gt;
&lt;br /&gt;
; down = the partition is unavailable&lt;br /&gt;
: There are no running jobs&lt;br /&gt;
: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;#039;&amp;#039;up&amp;#039;&amp;#039; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;#039;&amp;#039;up&amp;#039;&amp;#039; to &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; no harm is done to running jobs. When a partition passes from any other state to &amp;#039;&amp;#039;down&amp;#039;&amp;#039;, running jobs (if they exist) get killed. A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039;.&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
</feed>