<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://biohpc.deib.polimi.it/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=GiulioFontana</id>
	<title>Mufasa (BioHPC) - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://biohpc.deib.polimi.it/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=GiulioFontana"/>
	<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=Special:Contributions/GiulioFontana"/>
	<updated>2026-05-09T11:38:47Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2376</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2376"/>
		<updated>2026-05-07T14:48:57Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* How to know if your shell is a SLURM job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running jobs with SLURM =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must&amp;#039;&amp;#039;&amp;#039; use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM.&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa. This is a key difference between Mufasa 1.0 and [[System#Mufasa 2.0|Mufasa 2.0]].&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides two commands to run jobs, called [https://slurm.schedmd.com/srun.html srun] and [https://slurm.schedmd.com/sbatch.html sbatch]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In both cases, &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can be any Linux program (including shell scripts). By using &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, the command or script specified by &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; (including any programs it launches) is added to SLURM&amp;#039;s execution queues.&lt;br /&gt;
&lt;br /&gt;
The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for &amp;#039;&amp;#039;&amp;#039;interactive jobs&amp;#039;&amp;#039;&amp;#039;: i.e., processes that use the console to interact with their user during execution. &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell: it simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; provides an additional possibility: &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can in fact be an [[#Using execution scripts to run jobs|&amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039;]], i.e. a special (and SLURM-specific) type of Linux shell script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command. This is handy because it allows the user to record the parameters in an execution script instead of typing them on the command line when launching a job, which greatly reduces the possibility of mistakes. Also, an execution script is easy to keep and reuse.&lt;br /&gt;
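&lt;br /&gt;
For instance (the script name here is purely illustrative), launching a job described by an execution script is as simple as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch ./my_execution_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;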
&lt;br /&gt;
Immediately after a &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command is launched by a user, SLURM outputs a message informing the user that the job has been queued. The output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 queued and waiting for resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The shell is now locked while SLURM prepares the execution of the user program ([[#Detaching from a running job with screen|if you are using &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; you can detach from that shell and come back later]]). &lt;br /&gt;
&lt;br /&gt;
When SLURM is ready to run the program, it prints a message similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 has been allocated resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then executes the program.&lt;br /&gt;
&lt;br /&gt;
=== Options of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; commands is used to tell SLURM what resources the job needs in order to be executed and how much time it will need to complete its execution.&lt;br /&gt;
&lt;br /&gt;
Regarding resources, the most important option is &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, specifying which [[#SLURM Quality of Service (QOS)|SLURM QOS]] the job will use. A job run with a given QOS has access to all and only the resources available to that QOS. As a consequence, options that define how many resources to assign to the job can only provide resources that are available to the chosen QOS. Jobs that require resources not available to the chosen QOS do not get executed. &lt;br /&gt;
&lt;br /&gt;
If the user forgets to use option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, the job is run on the &amp;#039;&amp;#039;default qos&amp;#039;&amp;#039; (&amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt;) which has access to &amp;#039;&amp;#039;zero&amp;#039;&amp;#039; resources. Therefore it is always necessary to specify option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; when launching a SLURM job on Mufasa.&lt;br /&gt;
&lt;br /&gt;
More generally, the most relevant among the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:;‑‑qos=&amp;lt;qos_name&amp;gt;&lt;br /&gt;
:: specifies the [[SLURM#SLURM Quality of Service (QOS)|SLURM QOS]] that the job will use. It is mandatory to specify one.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The chosen QOS limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is available to the chosen QOS.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;‑‑qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; is used and options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task=&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt;) are omitted, the job is assigned the default amount of the resource (as defined by the chosen QOS). A notable exception concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;, which is always required (see below) if the job uses a QOS with access to GPUs.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; --job-name=&amp;lt;jobname&amp;gt;&lt;br /&gt;
:: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The default job name (i.e., the one assigned to the job when &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; is not used) is the executable program&amp;#039;s name.&lt;br /&gt;
&lt;br /&gt;
:;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
:: specifies what GPUs to assign to the job. &amp;lt;code&amp;gt;gpu_resources&amp;lt;/code&amp;gt; is a comma-delimited list where each element has the form &amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt; is one of the types of GPU available on Mufasa (see [[SLURM#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) and &amp;lt;code&amp;gt;&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt; is an integer between 1 and the number of GPUs of such type available to the partition. For instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:1,gpu:3g.20gb:1&amp;lt;/code&amp;gt;, corresponding to asking for one &amp;quot;full&amp;quot; GPU and 1 &amp;quot;small&amp;quot; GPU.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is &amp;#039;&amp;#039;&amp;#039;mandatory&amp;#039;&amp;#039;&amp;#039; if the job is run with a QOS that allows access to the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
:: specifies the amount of RAM to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
:: specifies how many CPUs to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the maximum time allowed to the job to complete, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;. When the time expires, the job (if still running) gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
:;‑‑pty&lt;br /&gt;
:: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[#Interactive jobs|Interactive jobs]])&lt;br /&gt;
&lt;br /&gt;
Note that GPU resources (if needed) must always be requested explicitly. For instance, in order to execute program &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; which needs one GPU of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; with QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; we can use the SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
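&lt;br /&gt;
A fuller, purely illustrative example that combines several of the options above (the specific amounts are hypothetical and must be compatible with the limits of the chosen QOS):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 --mem=32G --cpus-per-task=2 --time=12:00:00 --job-name=my_test ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;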
&lt;br /&gt;
== Interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
An &amp;#039;&amp;#039;&amp;#039;interactive job&amp;#039;&amp;#039;&amp;#039; is a process that uses the console to interact with its user during execution. Such a process is run manually by the user from a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) provided by SLURM. &lt;br /&gt;
&lt;br /&gt;
In order to ask SLURM to schedule the execution of a shell where the user can subsequently run the interactive job, it is necessary to use option &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For instance, to ask SLURM to run a shell with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;, the user should use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By not specifying any other options, the user is telling SLURM that they want the shell spawned by SLURM to be provided with the default amount of resources associated with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;. More generally, any combination of the other [[#Options of srun and sbatch|options of srun]] can be used together with &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
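&lt;br /&gt;
For instance (amounts purely illustrative), an interactive shell can be requested with a specific amount of RAM and CPUs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --mem=8G --cpus-per-task=2 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;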
&lt;br /&gt;
As with every other job request to SLURM, the request to run a shell must be made from the [[System#Login server|login server]]. As soon as possible (i.e., as soon as the necessary resources are available), SLURM will open (in the same terminal that the user used to launch the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command) a bash shell, where the user will be able to run their interactive programs. &lt;br /&gt;
&lt;br /&gt;
To the user, this means that the shell they were using to interact with the login server becomes a shell opened &amp;#039;&amp;#039;directly on Mufasa&amp;#039;&amp;#039;, with the command prompt changing from&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2-login:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process.&lt;br /&gt;
&lt;br /&gt;
When the user does not need the SLURM-spawned shell anymore, they should close it with the command (the same used for any other Linux shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to make the resources reserved for the interactive shell free again.&lt;br /&gt;
&lt;br /&gt;
== Non-interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands can become long and complex, and it&amp;#039;s easy to forget an option or make mistakes while typing them. For non-interactive jobs, there is a solution to this problem.&lt;br /&gt;
&lt;br /&gt;
When the user job is non-interactive, in fact, the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command can be substituted with a much simpler &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command&amp;#039;&amp;#039;&amp;#039;. As [[#Running jobs with SLURM|already explained]], &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can make use of an &amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039; to specify all the parts of the command to be run via SLURM. So the command becomes&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives are used to specify the values of the parameters that are otherwise set in the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;Note on Linux shell scripts&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;#039;&amp;#039;A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;have the “executable” flag set&amp;#039;&amp;#039; (see [[System#Changing file/directory ownership and permissions|here]] for details)&lt;br /&gt;
* &amp;#039;&amp;#039;have&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;as its very first line&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Usually, a Linux shell script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Within any shell script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;line)&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Use of blank lines as spacers is allowed.&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, composed of directives (each introduced by the keyword &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt;) with which the user specifies the values to be given to parameters&lt;br /&gt;
# [optionally] one or more &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
The template includes all the options [[#Using SLURM to run a container|already described above]], plus a few additional useful ones (for instance, those that make SLURM send email messages to the user when events occur in the lifecycle of their job). Information about all the possible options can be found in [https://slurm.schedmd.com/sbatch.html SLURM&amp;#039;s own documentation].&lt;br /&gt;
&lt;br /&gt;
In the template below, &amp;#039;&amp;#039;&amp;#039;#SBATCH directives&amp;#039;&amp;#039;&amp;#039; are requests made to SLURM. Notice that, though #SBATCH directives have a leading &amp;quot;#&amp;quot;, they are &amp;#039;&amp;#039;not&amp;#039;&amp;#039; comments, just as the &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line at the beginning of a shell script is not a comment even though it starts with &amp;quot;#&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Other lines in the script that begin with &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; not followed by SBATCH are comments.&lt;br /&gt;
&lt;br /&gt;
As for directives that request a given amount of a resource (including time): if a directive is missing from the execution script (or is commented out), the job is assigned the default amount of that resource.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-nodes=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑ntasks=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-partition=jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-qos=&amp;lt;qos_name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑output=./&amp;lt;filename&amp;gt;-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where the output of the job gets written (i.e., standard output gets redirected onto this file). &amp;quot;%j&amp;quot; is replaced by the job&amp;#039;s ID number.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑error=./&amp;lt;filename&amp;gt;-error-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where any error messages generated by the job get written (i.e., standard error gets redirected onto this file). &amp;quot;%j&amp;quot; is replaced by the job&amp;#039;s ID number.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --job-name=&amp;lt;name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
== Key concept ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;The key concept about executing jobs on Mufasa is that [[System#Containers|all computation on Mufasa must occur within containers]]&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if the user has writing permission on them: e.g., the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
The system used by Mufasa to create and execute containers is &amp;#039;&amp;#039;&amp;#039;[[System#Singularity|Singularity]]&amp;#039;&amp;#039;&amp;#039;. This wiki includes [[Singularity|directions]] on preparing containers with Singularity.&lt;br /&gt;
&lt;br /&gt;
The container where a user job runs must contain all the libraries needed by the job. In fact (for maintainability and safety reasons) &amp;#039;&amp;#039;&amp;#039;no software and no libraries are installed on Mufasa 2.0&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== Interactive and non-interactive user jobs ==&lt;br /&gt;
&lt;br /&gt;
This section explains how to execute a user job contained in a container. It considers two types of user jobs, i.e.:&lt;br /&gt;
;: Interactive user jobs&lt;br /&gt;
::: as [[#Interactive jobs|already explained]], these are jobs that require interaction with the user while they are running, via a bash shell running within the container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the container is in execution.&lt;br /&gt;
&lt;br /&gt;
;: Non-interactive user jobs&lt;br /&gt;
::: are the most common variety. The user prepares the container in such a way that, once running, it autonomously puts the user&amp;#039;s jobs into execution; the user has no communication with the container while it runs. Executing the container and running the required programs within the container&amp;#039;s environment is done via [[#Non-interactive jobs|execution scripts]].&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run an interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The first step to run an interactive user job on Mufasa is to run the [[System#Containers|container]] where the job will take place. Each user is in charge of preparing the container(s) where the user&amp;#039;s jobs will be executed.&lt;br /&gt;
&lt;br /&gt;
In order to run a container via SLURM by hand, i.e. via an interactive shell, a user must first open the shell with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [general_SLURM_options] ‑‑pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;[general_SLURM_options]&amp;lt;/code&amp;gt; are those [[#Options of srun and sbatch|already described above]].&lt;br /&gt;
&lt;br /&gt;
Then the user must run the container: this is done as follows.&lt;br /&gt;
&lt;br /&gt;
First, it is necessary to load the Singularity software module with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(if needed, the list of software modules available in the system can be obtained with command &amp;lt;code&amp;gt;module av&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Then, the user must use Singularity to run the container with the following command (see the [[Singularity|section about Singularity]] for further details):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which pulls the container from the specified repository and executes it. Possible values for &amp;lt;code&amp;gt;&amp;lt;repository&amp;gt;&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;lt;code&amp;gt;docker&amp;lt;/code&amp;gt; (Docker Hub)&lt;br /&gt;
:: &amp;lt;code&amp;gt;library&amp;lt;/code&amp;gt; (Sylabs Cloud Library)&lt;br /&gt;
:: &amp;lt;code&amp;gt;path/to/container&amp;lt;/code&amp;gt; if the container is local, i.e. located in the filesystem of Mufasa&lt;br /&gt;
&lt;br /&gt;
As soon as the container is in execution, the terminal window used so far to interact with Mufasa becomes a shell &amp;#039;&amp;#039;in the container&amp;#039;&amp;#039;. This shell belongs to the software environment of the container, and the user can use it to interact with the container&amp;#039;s own software environment and filesystem. &lt;br /&gt;
&lt;br /&gt;
It is easy to tell whether a shell is open on Mufasa or in the container, because in a container shell the system prompt becomes &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interaction between container filesystem and local filesystem === &lt;br /&gt;
&lt;br /&gt;
The filesystem inside the container and the local one, i.e. Mufasa&amp;#039;s, can interact. This means that the container can access the local filesystem to read and/or write files. However, the only parts of Mufasa&amp;#039;s filesystem that can be accessed by the container are those that the user running the container has access rights to.&lt;br /&gt;
&lt;br /&gt;
By default, the user&amp;#039;s &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa is automatically mapped onto &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; in the filesystem of the container. Any change made to that directory within the container is actually applied to the local &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa. &lt;br /&gt;
&lt;br /&gt;
The mapping of the home directory does not need to be explicitly requested. However, if the user needs other parts of Mufasa&amp;#039;s local filesystem (in addition to the home directory) to be mapped onto the container&amp;#039;s filesystem, this can be done with the following modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
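&lt;br /&gt;
For instance (both paths here are purely illustrative), to make a local directory &amp;lt;code&amp;gt;/data/datasets&amp;lt;/code&amp;gt; visible as &amp;lt;code&amp;gt;/datasets&amp;lt;/code&amp;gt; inside a container pulled from Docker Hub:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind /data/datasets:/datasets docker://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;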
&lt;br /&gt;
=== How to know if your shell is a SLURM job ===&lt;br /&gt;
To know if the shell you are using is being run via SLURM or not (becoming confused is easy...), use the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it provides an output, your shell is a SLURM job and the output is the ID of the job. &lt;br /&gt;
&lt;br /&gt;
If it doesn&amp;#039;t provide any output, your shell is not a SLURM job.&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run a non-interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
When the user job to be executed in a container is non-interactive, the mechanism based on an &amp;#039;&amp;#039;execution script&amp;#039;&amp;#039; already described in [[#Non-interactive jobs|Non-interactive jobs]] is employed. The command to run the script (which, in turn, runs the container where the user job takes place) is therefore&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The general features of a SLURM execution script and the SBATCH directives used for generic jobs have [[#Non-interactive jobs|already been described]]. Here, therefore, we focus on the SBATCH directives specific to running a non-interactive job within a container.&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
[[#Non-interactive jobs|#SBATCH directives already described above]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;module load amd/singularity&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt; &amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the last line of the script, &amp;lt;code&amp;gt;&amp;lt;command_to_run&amp;gt;&amp;lt;/code&amp;gt; is the command (e.g., the name of an executable script), complete with its path within the container&amp;#039;s filesystem, of the program to be run inside the container. Please refer to the [[Singularity|section about Singularity]] for details about its commands.&lt;br /&gt;
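&lt;br /&gt;
As a concrete illustration, a minimal script following the template could look like this (the job name, time limit, container and script path are all hypothetical examples to be adapted to your job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=my_experiment&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
singularity run docker://ubuntu /home/username/run_experiment.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;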
&lt;br /&gt;
The interactions between container filesystem and local filesystem in non-interactive jobs are exactly the same [[#Interaction between container filesystem and local filesystem|already described]] for interactive jobs. In particular, the user&amp;#039;s home directory is mapped by default onto the filesystem of the container.&lt;br /&gt;
&lt;br /&gt;
If, in addition to that, the user needs other parts of the filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command at the end of the script:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
&lt;br /&gt;
== Job output ==&lt;br /&gt;
&lt;br /&gt;
The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of Mufasa by the container where the computation takes place. &lt;br /&gt;
&lt;br /&gt;
As [[#Using SLURM to run a container|explained below]], SLURM includes a mechanism to mount a part of Mufasa&amp;#039;s own filesystem onto the container&amp;#039;s filesystem: so when the job running within the container writes to this mounted part, it actually writes to Mufasa&amp;#039;s filesystem. This means that when the container ends its execution, its output files persist in Mufasa&amp;#039;s filesystem (usually in a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) and can be retrieved by the user at a later time.&lt;br /&gt;
&lt;br /&gt;
The same mechanism can be used to allow user jobs running into a container to read their input data from Mufasa&amp;#039;s filesystem (usually a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Cancelling completed jobs ==&lt;br /&gt;
&lt;br /&gt;
When a user process run via SLURM has completed its execution and is not needed anymore, it is important to [[User_Jobs#Canceling_a_job_with_scancel|close it with scancel]], especially if much of the execution time requested by the job still remains.&lt;br /&gt;
&lt;br /&gt;
Cancelling a SLURM job makes the resources reserved by SLURM free again for other users, and thus speeds up the execution of the jobs still queued.&lt;br /&gt;
&lt;br /&gt;
Typically, one doesn&amp;#039;t know in advance how long a piece of code will take to complete its work. So please check from time to time whether it has finished and, if there is still time before your SLURM job&amp;#039;s requested duration ends, just &amp;#039;&amp;#039;scancel&amp;#039;&amp;#039; the job. Other users will be grateful :-)&lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
= Detaching from a running job with &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an [[#Interactive and non-interactive user jobs|interactive user job]], the shell where the command is running must remain open: if it closes, the job terminates. That shell runs in the terminal of your own PC where the [[System#Accessing Mufasa|SSH connection to Mufasa]] exists.&lt;br /&gt;
&lt;br /&gt;
If you do not plan to keep the SSH connection to Mufasa open (for instance because you have to turn off or suspend your PC), there is a way to keep your interactive job alive. Namely, you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then &amp;#039;&amp;#039;detach&amp;#039;&amp;#039; from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). &lt;br /&gt;
&lt;br /&gt;
Once you have detached from the screen session, you can close the SSH connection to Mufasa without damage. When you need to reach your (still running) job again, you can open a new SSH connection to Mufasa and then &amp;#039;&amp;#039;reattach&amp;#039;&amp;#039; to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A typical use case for &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is a program written so that it prints progress messages as it goes on with its work. You can then check its advancement by periodically reattaching to the screen where the program is running and reading the messages it has printed.&lt;br /&gt;
&lt;br /&gt;
Basic usage of &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is explained below.&lt;br /&gt;
&lt;br /&gt;
== Creating a screen session, running a job in it, detaching from it ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell, while your process will go on running in the screen&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your running job&lt;br /&gt;
&lt;br /&gt;
== Reattaching to an active screen session ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# In the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
&lt;br /&gt;
== Closing (i.e. destroying) a screen session ==&lt;br /&gt;
&lt;br /&gt;
When you do not need a screen session anymore:&lt;br /&gt;
&lt;br /&gt;
# reattach to the active screen session as explained [[#Reattaching to an active screen session|above]]&lt;br /&gt;
# destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash), then confirming that you really want to proceed&lt;br /&gt;
&lt;br /&gt;
Of course, any program (including SLURM jobs) running within the screen gets terminated when the screen is destroyed.&lt;br /&gt;
&lt;br /&gt;
= Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; to reserve resources =&lt;br /&gt;
&lt;br /&gt;
== What is &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/salloc.html &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;] is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future.&lt;br /&gt;
&lt;br /&gt;
The typical use of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is to &amp;quot;book&amp;quot; an interactive session where the user enjoys &amp;#039;&amp;#039;&amp;#039;complete control of a set of resources&amp;#039;&amp;#039;&amp;#039;. The resources that are part of this set are chosen by the user. Within the &amp;quot;booked&amp;quot; session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM.&lt;br /&gt;
&lt;br /&gt;
More precisely:&lt;br /&gt;
* the user, using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, specifies what resources they need and the time when they will need them;&lt;br /&gt;
* when the delivery time comes, SLURM creates an interactive shell session for the user;&lt;br /&gt;
* within such session, the user can use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources.&lt;br /&gt;
&lt;br /&gt;
Resource reservation using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is only possible if the request is made in advance of the delivery time. The more in demand the resources to be reserved are, the earlier the request should be made to ensure that SLURM is able to fulfill it.&lt;br /&gt;
&lt;br /&gt;
When a user makes a request for resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, the request (called an &amp;#039;&amp;#039;&amp;#039;allocation&amp;#039;&amp;#039;&amp;#039;) gets added to the SLURM job queue of the relevant partition as a job in &amp;lt;code&amp;gt;pending&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of SLURM&amp;#039;s process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; actually corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user.&lt;br /&gt;
&lt;br /&gt;
Until the delivery time specified by the user comes, the allocation remains in state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; state, the stronger this accumulation of priority: so, by requesting resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;&amp;#039;well in advance of the delivery time&amp;#039;&amp;#039;&amp;#039;, users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands use a similar syntax to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands. In particular, &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; lets a user specify what resources they need and -importantly- a &amp;#039;&amp;#039;&amp;#039;delivery time&amp;#039;&amp;#039;&amp;#039; for the requested resources (delivery time can also be specified with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, but in that case it is not very useful). &lt;br /&gt;
&lt;br /&gt;
The typical &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command has this form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc [general_SLURM_options] --begin=&amp;lt;time&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &lt;br /&gt;
&lt;br /&gt;
:; [general_SLURM_options]&lt;br /&gt;
:: represents the options already described in [[#Options of srun and sbatch|Options of srun and sbatch]]&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;--begin=&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the delivery time of the resources reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, according to the syntax described below. The delivery time must be a future time.&lt;br /&gt;
&lt;br /&gt;
=== Syntax of parameter &amp;lt;code&amp;gt;--begin&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
If the allocation is for the current day, you can specify &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; as hours and minutes in the form&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;HH:MM&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you want to specify a time on a different day, the form for &amp;lt;time&amp;gt; is &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM&amp;lt;/code&amp;gt;, where the uppercase &amp;#039;T&amp;#039; separates the date from the time (seconds may optionally be appended as &amp;lt;code&amp;gt;:SS&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
It is also possible to specify &amp;lt;time&amp;gt; as relative to the current time, in one of the following forms:&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kminutes&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Khours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kdays&amp;lt;/code&amp;gt;&lt;br /&gt;
where K is a (positive) integer.&lt;br /&gt;
&lt;br /&gt;
Examples:&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=16:00&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1hours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1days&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=2030-01-20T12:34:00&amp;lt;/code&amp;gt;&lt;br /&gt;
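&lt;br /&gt;
Putting it all together, a hypothetical request for one GPU and 64 GB of RAM for 8 hours, to be delivered in one day, could look like this (the resource options are illustrative; use the options appropriate to your job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc --gres=gpu:1 --mem=64G --time=08:00:00 --begin=now+1days&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;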
&lt;br /&gt;
Note that Mufasa&amp;#039;s time zone is GMT, so &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; must be expressed in GMT as well. If you want to know Mufasa&amp;#039;s current time, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
date&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Thu Nov 10 16:43:30 UTC 2022&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
In the typical scenario, the user of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; will make use of [[User_Jobs#Detaching from a running job with screen|screen]]. Command &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; creates a shell session (called &amp;quot;a screen&amp;quot;) that it is possible to abandon without closing it ([[#Creating_a_screen_session.2C_running_a_job_in_it.2C_detaching_from_it|detaching from the screen]]). It is then possible to reach the screen again at a later time ([[#Reattaching_to_an_active_screen_session|reattaching to the screen]]). This means that a user can create a screen, run &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; within it to create an allocation for time X, detach from the screen, and reattach to it just before time X to use the reserved resources from the interactive session created by &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
More precisely, the operations needed to do this are the following:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]].&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created run the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], specifying via its options the resources you need and the time at which you want them delivered.&lt;br /&gt;
# SLURM will respond with a message similar to &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Pending job allocation XXXX&amp;lt;/pre&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell.&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your resource allocation request.&lt;br /&gt;
# At the delivery time you specified in the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], connect to the login server with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you used &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;; as soon as SLURM provides you with the resources you reserved, the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; changes to the shell prompt.&lt;br /&gt;
# You are now in the interactive shell session you booked with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;. From here, you can run any programs you want, including &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&amp;lt;br&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! Therefore, if the job reaches the time limit, it gets &amp;#039;&amp;#039;&amp;#039;forcibly terminated&amp;#039;&amp;#039;&amp;#039; by SLURM. Termination depends exclusively on the time limit: it occurs even if the end time of the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.)&lt;br /&gt;
# Once the interactive shell session is not needed anymore, cancel it by exiting from the session with &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;exit&amp;lt;/pre&amp;gt; (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.)&lt;br /&gt;
# You are now back to your screen. Destroy it by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a resource request made with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
To cancel a request for resources made as explained in [[#How to use salloc|How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;]], follow these steps:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen where you used command &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You should see the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; (if the allocation is still pending) or &amp;quot;&amp;#039;&amp;#039;salloc: job XXXX queued and waiting for resources&amp;#039;&amp;#039;&amp;quot; (if the allocation is done and waiting for its start time). Now just press &amp;#039;&amp;#039;&amp;#039;Ctrl + C&amp;#039;&amp;#039;&amp;#039;. This communicates to SLURM your intention to cancel your request for resources.&lt;br /&gt;
# SLURM will communicate the cancellation with message &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Job allocation XXXX has been revoked.&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with tools to inspect and manage jobs. While a [[Roles|Job User]] is able to see all users&amp;#039; jobs, they are only allowed to interact with their own.&lt;br /&gt;
&lt;br /&gt;
The main commands used to interact with jobs are &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/squeue.html &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to inspect the scheduling queues and &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/scancel.html &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to terminate queued or running jobs.&lt;br /&gt;
&lt;br /&gt;
== Inspecting jobs with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Running command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output comprises the following information:&lt;br /&gt;
&lt;br /&gt;
:; JOBID&lt;br /&gt;
:: Numerical identifier of the job assigned by SLURM&lt;br /&gt;
:: This identifier is used to intervene on the job, for instance with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: the partition that the job is run on&lt;br /&gt;
&lt;br /&gt;
:; NAME&lt;br /&gt;
:: the name assigned to the job; can be personalised using the &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; option&lt;br /&gt;
&lt;br /&gt;
:; USER&lt;br /&gt;
:: username of the user who launched the job&lt;br /&gt;
&lt;br /&gt;
:; ST&lt;br /&gt;
:: job state (see [[SLURM#Job state|Job state]] for further information)&lt;br /&gt;
&lt;br /&gt;
:; TIME&lt;br /&gt;
:: time that has passed since the beginning of job execution&lt;br /&gt;
&lt;br /&gt;
:; NODES&lt;br /&gt;
:: number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)&lt;br /&gt;
&lt;br /&gt;
:; NODELIST (REASON)&lt;br /&gt;
:: name of the nodes where the job is being executed: for Mufasa it is always &amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;, which is the name of the node corresponding to Mufasa.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To limit the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; to the jobs owned by user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt;, invoke it like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u &amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interpreting Job state as provided by &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; PENDING&lt;br /&gt;
:: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; RUNNING&lt;br /&gt;
:: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; SUSPENDED&lt;br /&gt;
:: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETING&lt;br /&gt;
:: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETED&lt;br /&gt;
:: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them.&lt;br /&gt;
&lt;br /&gt;
== Knowing when jobs are expected to end or start ==&lt;br /&gt;
&lt;br /&gt;
If you are interested in understanding when jobs are expected to start or end, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -o &amp;quot;%5i %8u %10P %.2t |%19S |%.11L|&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID USER     PARTITION  ST |START_TIME          |  TIME_LEFT|&lt;br /&gt;
5307  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5308  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5296  cziyang  fat         R |2022-11-08T16:58:03 | 1-00:48:14|&lt;br /&gt;
5306  thuynh   fat         R |2022-11-10T08:13:30 | 2-16:03:41|&lt;br /&gt;
5297  gnannini fat         R |2022-11-08T17:55:54 | 1-01:46:05|&lt;br /&gt;
5336  ssaitta  gpu         R |2022-11-10T08:13:00 |    6:03:11|&lt;br /&gt;
5358  dmilesi  gpulong     R |2022-11-10T15:11:32 | 2-23:01:43|&lt;br /&gt;
5338  cziyang  gpulong     R |2022-11-10T09:45:01 | 1-17:35:12|&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; For running jobs (state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;)&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job started its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much remains of the running time requested by the job&lt;br /&gt;
&lt;br /&gt;
:; For pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;)&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job is expected to start its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much running time has been requested by the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Start and end times are forecasts based on the features of current jobs in the queues, and may change if running jobs end prematurely and/or if new jobs with higher priority are added to the queues. So these times should never be considered as certain.&lt;br /&gt;
&lt;br /&gt;
If you simply want to know when pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) are expected to begin execution, use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists pending jobs in order of increasing START_TIME (the job on top is the one which will be run first). For each pending job the command provides an output similar to the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
 5090       fat training   thuynh PD 2022-10-27T09:28:01      1 (null)               (Resources)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Getting detailed information about a job ==&lt;br /&gt;
&lt;br /&gt;
If needed, complete information about a job (either pending or running) can be obtained using command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show job &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; is the number from the first column of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The output of this command is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JobId=65 JobName=test_script.sh&lt;br /&gt;
   UserId=gfontana(10003) GroupId=gfontana(10004) MCS_label=N/A&lt;br /&gt;
   Priority=14208 Nice=0 Account=admin QOS=nogpu&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:55 TimeLimit=01:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-11-06T10:31:10 EligibleTime=2025-11-06T10:31:10&lt;br /&gt;
   AccrueTime=2025-11-06T10:31:10&lt;br /&gt;
   StartTime=2025-11-06T10:31:10 EndTime=2025-11-06T11:31:10 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-11-06T10:31:10 Scheduler=Main&lt;br /&gt;
   Partition=jobs AllocNode:Sid=mufasa2-login:42020&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=gn01&lt;br /&gt;
   BatchHost=gn01&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)&lt;br /&gt;
   Command=./test_script.sh&lt;br /&gt;
   WorkDir=/home/gfontana&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In particular, the line beginning with &amp;#039;&amp;#039;&amp;quot;StartTime=&amp;quot;&amp;#039;&amp;#039; provides expected times for the start and end of job execution. As explained in [[User_Jobs#Knowing_when_jobs_are_expected_to_end_or_start|Knowing when jobs are expected to end or start]], start time is only a prediction and subject to change.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a job with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
It is possible to cancel a job using command &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;, either while it is waiting for execution or while it is running (in the latter case you can choose which system signal to send to the process in order to terminate it). &lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user also depends on the overall duration of the jobs that you have run on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The following are some examples of use of &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; adapted from [https://slurm.schedmd.com/scancel.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
removes queued job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; from the execution queue.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=TERM &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGTERM (request to stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=KILL &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGKILL (force stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --state=PENDING --user=&amp;lt;username&amp;gt; --partition=&amp;lt;partition_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
cancels all pending jobs belonging to user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; in partition &amp;lt;code&amp;gt;&amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
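&lt;br /&gt;
Additionally, to cancel &amp;#039;&amp;#039;all&amp;#039;&amp;#039; of your own jobs at once (whatever their state), you can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --user=&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;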
&lt;br /&gt;
== Knowing what jobs you ran today ==&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct -X&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a list of all jobs run today by your user.&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2375</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2375"/>
		<updated>2026-05-07T14:48:33Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* How to know if your shell is a SLURM job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running jobs with SLURM =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must&amp;#039;&amp;#039;&amp;#039; use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM.&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa. This is a key difference between Mufasa 1.0 and [[System#Mufasa 2.0|Mufasa 2.0]].&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides two commands to run jobs, called [https://slurm.schedmd.com/srun.html srun] and [https://slurm.schedmd.com/sbatch.html sbatch]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In both cases, &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can be any Linux program (including shell scripts). By using &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, the command or script specified by &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; (including any programs launched by it) is added to SLURM&amp;#039;s execution queues.&lt;br /&gt;
&lt;br /&gt;
The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for &amp;#039;&amp;#039;&amp;#039;interactive jobs&amp;#039;&amp;#039;&amp;#039;: i.e., processes that use the console to interact with their user during job execution. &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell: it simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; provides an additional possibility: &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can in fact be an [[#Using execution scripts to run jobs|&amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039;]], i.e. a special (and SLURM-specific) type of Linux shell script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command. This is handy because it lets the user record the parameters in an execution script instead of typing them on the command line when launching a job, which greatly reduces the possibility of mistakes. Also, an execution script is easy to keep and reuse.&lt;br /&gt;
&lt;br /&gt;
Immediately after a &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command is launched by a user, SLURM outputs a message informing the user that the job has been queued. The output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 queued and waiting for resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The shell is now locked while SLURM prepares the execution of the user program ([[#Detaching from a running job with screen|if you are using &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; you can detach from that shell and come back later]]). &lt;br /&gt;
&lt;br /&gt;
When SLURM is ready to run the program, it prints a message similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 has been allocated resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then executes the program.&lt;br /&gt;
&lt;br /&gt;
=== Options of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; commands is used to tell SLURM what resources the job needs in order to be executed and how much time it will need to complete its execution.&lt;br /&gt;
&lt;br /&gt;
For what concerns resources, the most important option is &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, specifying which [[#SLURM Quality of Service (QOS)|SLURM QOS]] the job will use. A job run with a given QOS has access to all and only the resources available to that QOS. As a consequence, all options that define how many resources to assign to the job can only draw on resources that are available to the chosen QOS. Jobs that require resources that are not available to the chosen QOS do not get executed. &lt;br /&gt;
&lt;br /&gt;
If the user forgets to use option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, the job is run on the &amp;#039;&amp;#039;default qos&amp;#039;&amp;#039; (&amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt;) which has access to &amp;#039;&amp;#039;zero&amp;#039;&amp;#039; resources. Therefore it is always necessary to specify option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; when launching a SLURM job on Mufasa.&lt;br /&gt;
&lt;br /&gt;
More generally, the most relevant among the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:;‑‑qos=&amp;lt;qos_name&amp;gt;&lt;br /&gt;
:: specifies the [[SLURM#SLURM Quality of Service (QOS)|SLURM QOS]] that the job will use. It is mandatory to specify one.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The chosen QOS limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is available to the chosen QOS.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;‑‑qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; is used and options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task=&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt;) are omitted, the job is assigned the default amount of each resource (as defined by the chosen QOS). A notable exception concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;, which is always required (see below) if the job uses a QOS with access to GPUs.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; --job-name=&amp;lt;jobname&amp;gt;&lt;br /&gt;
:: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The default job name (i.e., the one assigned to the job when &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; is not used) is the executable program&amp;#039;s name.&lt;br /&gt;
&lt;br /&gt;
:;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
:: specifies what GPUs to assign to the job. &amp;lt;code&amp;gt;gpu_resources&amp;lt;/code&amp;gt; is a comma-delimited list where each element has the form &amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt; is one of the types of GPU available on Mufasa (see [[SLURM#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) and &amp;lt;code&amp;gt;&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt; is an integer between 1 and the number of GPUs of such type available to the partition. For instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:1,gpu:3g.20gb:1&amp;lt;/code&amp;gt;, corresponding to asking for one &amp;quot;full&amp;quot; GPU and one &amp;quot;small&amp;quot; GPU.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is &amp;#039;&amp;#039;&amp;#039;mandatory&amp;#039;&amp;#039;&amp;#039; if the job is run with a QOS that allows access to the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
:: specifies the amount of RAM to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
:: specifies how many CPUs to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the maximum time allowed to the job to complete, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;. When the time expires, the job (if still running) gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
:;‑‑pty&lt;br /&gt;
:: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[#Interactive jobs|Interactive jobs]])&lt;br /&gt;
&lt;br /&gt;
Note that GPU resources (if needed) must always be requested explicitly. For instance, in order to execute program &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; which needs one GPU of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; with QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; we can use the SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
An &amp;#039;&amp;#039;&amp;#039;interactive job&amp;#039;&amp;#039;&amp;#039; is a process that uses the console to interact with its user during execution. Such a process is run manually by the user from a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) provided by SLURM. &lt;br /&gt;
&lt;br /&gt;
In order to ask SLURM to schedule the execution of a shell where the user can subsequently run the interactive job, it is necessary to use option &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For instance, to ask SLURM to run a shell with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;, the user should use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By not specifying any other options, the user is telling SLURM that they want the shell spawned by SLURM to be provided with the default amount of resources associated with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;. More generally, any combination of the other [[#Options of srun and sbatch|options of srun]] can be used together with &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Like every other job request to SLURM, the request to run a shell must be made from the [[System#Login server|login server]]. As soon as possible (i.e., as soon as the necessary resources are available) SLURM will open (in the same terminal that the user used to launch the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command) a bash shell, where the user will be able to run their interactive programs. &lt;br /&gt;
&lt;br /&gt;
To the user, this means that the shell they were using to interact with the login server changes into a shell opened &amp;#039;&amp;#039;directly on Mufasa&amp;#039;&amp;#039;: the command prompt changes from&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2-login:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, the shell is the “base” one. If a number is printed, it is the SLURM job ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process.&lt;br /&gt;
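&lt;br /&gt;
The same check can be used inside a shell script, for instance to make the script refuse to run outside SLURM. The following is a minimal sketch (the error message is purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
if [ -z &amp;quot;$SLURM_JOB_ID&amp;quot; ]; then&lt;br /&gt;
    echo &amp;quot;Not inside a SLURM job: run me with srun or sbatch&amp;quot; &amp;gt;&amp;amp;2&lt;br /&gt;
    exit 1&lt;br /&gt;
fi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;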
&lt;br /&gt;
When the user does not need the SLURM-spawned shell anymore, they should close it (as with any other Linux shell) with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to make the resources reserved for the interactive shell free again.&lt;br /&gt;
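&lt;br /&gt;
Putting the options together, a request for an interactive shell with explicitly chosen resources may look like this (the resource amounts below are purely illustrative and must be adapted to your needs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --cpus-per-task=2 --mem=8G --time=02:00:00 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This asks SLURM for a shell with 2 CPUs, 8 GB of RAM and a 2-hour time limit, subject to the limits of QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;.&lt;br /&gt;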
&lt;br /&gt;
== Non-interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands are very complex, and it&amp;#039;s easy to forget some option or make mistakes while using them. For non-interactive jobs, there is a solution to this problem.&lt;br /&gt;
&lt;br /&gt;
When the user job is non-interactive, in fact, the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command can be substituted with a much simpler &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command&amp;#039;&amp;#039;&amp;#039;. As [[#Running jobs with SLURM|already explained]], &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can make use of an &amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039; to specify all the parts of the command to be run via SLURM. So the command becomes&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives are used to specify the values of the parameters that are otherwise set in the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;Note on Linux shell scripts&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;#039;&amp;#039;A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;have the “executable” flag set&amp;#039;&amp;#039; (see [[System#Changing file/directory ownership and permissions|here]] for details)&lt;br /&gt;
* &amp;#039;&amp;#039;have&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;as its very first line&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Usually, a Linux shell script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Within any shell script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;line)&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Use of blank lines as spacers is allowed.&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, composed of directives with which the user specifies the values to be given to parameters, each preceded by the keyword &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# [optionally] one or more &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
The template includes all the options [[#Options of srun and sbatch|already described above]], plus a few additional useful ones (for instance, those that make SLURM send email messages to the user when certain events occur in the lifecycle of their job). Information about all the possible options can be found in [https://slurm.schedmd.com/sbatch.html SLURM&amp;#039;s own documentation].&lt;br /&gt;
&lt;br /&gt;
In the template below, &amp;#039;&amp;#039;&amp;#039;#SBATCH directives&amp;#039;&amp;#039;&amp;#039; are requests made to SLURM. Notice that, though #SBATCH directives have a leading &amp;quot;#&amp;quot;, they are &amp;#039;&amp;#039;not&amp;#039;&amp;#039; comments: just like the &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line at the beginning of a shell script, they start with &amp;quot;#&amp;quot; but are not comments.&lt;br /&gt;
&lt;br /&gt;
Other lines in the script that begin with &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; not followed by SBATCH are comments.&lt;br /&gt;
&lt;br /&gt;
For what concerns directives that ask for a given amount of a resource (including time): if a directive is missing from the execution script (or commented out), the job is assigned the default amount of that resource.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-nodes=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑ntasks=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-partition=jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-qos=&amp;lt;qos_name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑output=./&amp;lt;filename&amp;gt;-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where the output of the job gets written (i.e., standard output gets redirected onto the file). &amp;quot;%j&amp;quot; is replaced with the job ID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑error=./&amp;lt;filename&amp;gt;-error-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where any error messages generated by the job get written (i.e., standard error gets redirected onto the file). &amp;quot;%j&amp;quot; is replaced with the job ID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --job-name=&amp;lt;name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
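&lt;br /&gt;
As an example, a filled-in execution script for a job that needs one &amp;quot;small&amp;quot; GPU might look as follows (the QOS name, GPU type and resource amounts below are purely illustrative and must be adapted to your own job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --partition=jobs&lt;br /&gt;
#SBATCH --qos=gpulight&lt;br /&gt;
#SBATCH --gres=gpu:3g.20gb:1&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --cpus-per-task=2&lt;br /&gt;
#SBATCH --time=0-12:00:00&lt;br /&gt;
#SBATCH --output=./my_job-%j.out&lt;br /&gt;
#SBATCH --error=./my_job-error-%j.out&lt;br /&gt;
#SBATCH --job-name=my_job&lt;br /&gt;
&lt;br /&gt;
./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;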
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
== Key concept ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;The key concept about executing jobs on Mufasa is that [[System#Containers|all computation on Mufasa must occur within containers]]&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if the user has writing permission on them: e.g., the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
The system used by Mufasa to create and execute containers is &amp;#039;&amp;#039;&amp;#039;[[System#Singularity|Singularity]]&amp;#039;&amp;#039;&amp;#039;. This wiki includes [[Singularity|directions]] on preparing containers with Singularity.&lt;br /&gt;
&lt;br /&gt;
The container where a user job runs must contain all the libraries needed by the job. In fact (for maintainability and safety reasons) &amp;#039;&amp;#039;&amp;#039;no software and no libraries are installed on Mufasa 2.0&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== Interactive and non-interactive user jobs ==&lt;br /&gt;
&lt;br /&gt;
This section explains how to execute a user job contained in a container. It considers two types of user jobs, i.e.:&lt;br /&gt;
;: Interactive user jobs&lt;br /&gt;
::: as [[#Interactive jobs|already explained]], these are jobs that require interaction with the user while they are running, via a bash shell running within the container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the container is in execution.&lt;br /&gt;
&lt;br /&gt;
;: Non-interactive user jobs&lt;br /&gt;
::: are the most common variety. The user prepares the container in such a way that, when in execution, the container autonomously puts the user&amp;#039;s jobs into execution. The user does not have any communication with the container while it is in execution. Executing the container and running the required programs within the container&amp;#039;s environment is done via [[#Using execution scripts to run jobs|execution scripts]].&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run an interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The first step to run an interactive user job on Mufasa is to run the [[System#Containers|container]] where the job will take place. Each user is in charge of preparing the container(s) where the user&amp;#039;s jobs will be executed.&lt;br /&gt;
&lt;br /&gt;
In order to run a container via SLURM by hand, i.e. via an interactive shell, a user must first open the shell with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [general_SLURM_options] ‑‑pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where [general_SLURM_options] are those [[#Options of srun and sbatch|already described above]].&lt;br /&gt;
&lt;br /&gt;
Then the user must run the container: this is done as follows.&lt;br /&gt;
&lt;br /&gt;
First, it is necessary to load the Singularity software module with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(if needed, the list of software modules available in the system can be obtained with command &amp;lt;code&amp;gt;module av&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Then, the user must use Singularity to run the container with command (see the [[Singularity|section about Singularity]] for further details)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which pulls the container from the specified repository and executes it. Possible values for &amp;lt;code&amp;gt;&amp;lt;repository&amp;gt;&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;lt;code&amp;gt;docker&amp;lt;/code&amp;gt; (Docker Hub)&lt;br /&gt;
:: &amp;lt;code&amp;gt;library&amp;lt;/code&amp;gt; (Sylabs Cloud Library)&lt;br /&gt;
:: &amp;lt;code&amp;gt;path/to/container&amp;lt;/code&amp;gt; if the container is local, i.e. located in the filesystem of Mufasa&lt;br /&gt;
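&lt;br /&gt;
For instance (the container names below are only examples), the following commands run a container pulled from Docker Hub and a local container file, respectively:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run docker://ubuntu:22.04&lt;br /&gt;
singularity run /home/username/my_container.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;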
&lt;br /&gt;
As soon as the container is in execution, the terminal window used so far to interact with Mufasa becomes a shell &amp;#039;&amp;#039;inside the container&amp;#039;&amp;#039;. This shell belongs to the software environment of the container, and the user can use it to interact with the container&amp;#039;s own software environment and filesystem.&lt;br /&gt;
&lt;br /&gt;
It is easy to tell whether a shell belongs to Mufasa or to a container, because in a container shell the system prompt becomes&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interaction between container filesystem and local filesystem === &lt;br /&gt;
&lt;br /&gt;
The filesystem inside the container and the local one, i.e. Mufasa&amp;#039;s, can interact. This means that the container can access the local filesystem to read and/or write files. However, the only parts of Mufasa&amp;#039;s filesystem that can be accessed by the container are those that the user running the container has access rights to.&lt;br /&gt;
&lt;br /&gt;
By default, the user&amp;#039;s &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa is automatically mapped onto &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; in the filesystem of the container. Any change made to that directory inside the container is actually applied to the local &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa.&lt;br /&gt;
&lt;br /&gt;
The mapping of the home directory does not need to be explicitly requested. However, if the user needs (in addition to the home directory) other parts of the local filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
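&lt;br /&gt;
As an illustrative example (all paths and the container name are hypothetical), the following command makes the local directory &amp;lt;code&amp;gt;/home/username/dataset&amp;lt;/code&amp;gt; visible as &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; inside the container:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind /home/username/dataset:/data docker://ubuntu:22.04&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;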
&lt;br /&gt;
=== How to know if your shell is a SLURM job ===&lt;br /&gt;
To find out whether the shell you are using is running as a SLURM job (it is easy to lose track), use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it provides an output, your shell is a SLURM job and the output is the ID of the job. &lt;br /&gt;
If it doesn&amp;#039;t provide any output, your shell is not a SLURM job.&lt;br /&gt;
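&lt;br /&gt;
The same check can be condensed into a single (purely illustrative) shell one-liner:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
[ -n &amp;quot;$SLURM_JOB_ID&amp;quot; ] &amp;amp;&amp;amp; echo &amp;quot;this shell is SLURM job $SLURM_JOB_ID&amp;quot; || echo &amp;quot;this shell is not a SLURM job&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;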
&lt;br /&gt;
== Using SLURM to run a non-interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
When the user job to be executed in a container is non-interactive, the mechanism based on an &amp;#039;&amp;#039;execution script&amp;#039;&amp;#039; already described in [[#Non-interactive jobs|Non-interactive jobs]] is employed. The command to run the script, which in turn runs the container where the user job takes place, is therefore&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The general features of a SLURM execution script and the SBATCH directives used for generic jobs have [[#Non-interactive jobs|already been described]]. Here we focus, therefore, on the SBATCH directives specifically used when SLURM is used to run a non-interactive job within a container.&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
[[#Non-interactive jobs|#SBATCH directives already described above]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;module load amd/singularity&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt; &amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the last line of the script, &amp;lt;code&amp;gt;&amp;lt;command_to_run&amp;gt;&amp;lt;/code&amp;gt; is the command (e.g., the name of an executable script), complete with its path within the container&amp;#039;s filesystem, of the program to be run inside the container. Please refer to the [[Singularity|section about Singularity]] for details about its commands.&lt;br /&gt;
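&lt;br /&gt;
As a purely illustrative example (partition, resources, container and script names are all hypothetical and must be adapted to your case), a complete execution script could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=my_job&lt;br /&gt;
#SBATCH --partition=gpu&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --cpus-per-task=2&lt;br /&gt;
#SBATCH --mem=16G&lt;br /&gt;
#SBATCH --time=12:00:00&lt;br /&gt;
&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
singularity run docker://my_repo/my_image /home/username/run_experiment.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;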
&lt;br /&gt;
The interactions between container filesystem and local filesystem in non-interactive jobs are exactly the same [[#Interaction between container filesystem and local filesystem|already described]] for interactive jobs. In particular, the user&amp;#039;s home directory is mapped by default onto the filesystem of the container.&lt;br /&gt;
&lt;br /&gt;
If, in addition to that, the user needs another part of Mufasa&amp;#039;s filesystem to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command at the end of the script:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
&lt;br /&gt;
== Job output ==&lt;br /&gt;
&lt;br /&gt;
The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of Mufasa by the container where the computation takes place. &lt;br /&gt;
&lt;br /&gt;
As [[#Using SLURM to run a container|explained above]], Singularity provides a mechanism to mount a part of Mufasa&amp;#039;s own filesystem onto the container&amp;#039;s filesystem: when the job running within the container writes to this mounted part, it actually writes to Mufasa&amp;#039;s filesystem. This means that when the container ends its execution, its output files persist in Mufasa&amp;#039;s filesystem (usually in a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) and can be retrieved by the user at a later time.&lt;br /&gt;
&lt;br /&gt;
The same mechanism can be used to allow user jobs running into a container to read their input data from Mufasa&amp;#039;s filesystem (usually a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Cancelling completed jobs ==&lt;br /&gt;
&lt;br /&gt;
When a user process run via SLURM has completed its execution and is not needed anymore, it is important to [[User_Jobs#Canceling_a_job_with_scancel|close it with scancel]], especially if much time remains before the end of the execution time requested by the job.&lt;br /&gt;
&lt;br /&gt;
Cancelling a SLURM job makes the resources reserved by SLURM free again for other users, and thus speeds up the execution of the jobs still queued.&lt;br /&gt;
&lt;br /&gt;
Typically, one doesn&amp;#039;t know in advance how long a piece of code will take to complete its work. So please check from time to time whether your job has finished and, if there is still time before the requested duration of your SLURM job ends, just &amp;#039;&amp;#039;scancel&amp;#039;&amp;#039; the job. Other users will be grateful :-)&lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
= Detaching from a running job with &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an [[#Interactive and non-interactive user jobs|interactive user job]], the shell where the command is running must remain open: if it closes, the job terminates. That shell runs in the terminal of your own PC where the [[System#Accessing Mufasa|SSH connection to Mufasa]] exists.&lt;br /&gt;
&lt;br /&gt;
If you do not plan to keep the SSH connection to Mufasa open (for instance because you have to turn off or suspend your PC), there is a way to keep your interactive job alive. Namely, you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then &amp;#039;&amp;#039;detach&amp;#039;&amp;#039; from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). &lt;br /&gt;
&lt;br /&gt;
Once you have detached from the screen session, you can close the SSH connection to Mufasa without damage. When you need to reach your (still running) job again, you can open a new SSH connection to Mufasa and then &amp;#039;&amp;#039;reattach&amp;#039;&amp;#039; to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A typical use case for screen is writing your program so that it prints progress messages as it goes on with its work. You can then check its progress by periodically reattaching to the screen where the program is running and reading the messages it has printed.&lt;br /&gt;
&lt;br /&gt;
Basic usage of &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is explained below.&lt;br /&gt;
&lt;br /&gt;
== Creating a screen session, running a job in it, detaching from it ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell, while your process will go on running in the screen&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your running job&lt;br /&gt;
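&lt;br /&gt;
Summing up, a typical session could look like this (the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; options are just an example):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
srun --partition=gpu --gres=gpu:1 --time=08:00:00 --pty /bin/bash&lt;br /&gt;
# ...work in the interactive job, then press ctrl + A followed by D to detach&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;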
&lt;br /&gt;
== Reattaching to an active screen session ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# In the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
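&lt;br /&gt;
If you have more than one active screen session, &amp;lt;code&amp;gt;screen -r&amp;lt;/code&amp;gt; alone is ambiguous; in that case you can list the active sessions and reattach to a specific one by its identifier:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -ls&lt;br /&gt;
screen -r &amp;lt;session_id&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;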
&lt;br /&gt;
== Closing (i.e. destroying) a screen session ==&lt;br /&gt;
&lt;br /&gt;
When you do not need a screen session anymore:&lt;br /&gt;
&lt;br /&gt;
# reattach to the active screen session as explained [[#Reattaching to an active screen session|above]]&lt;br /&gt;
# destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash), then confirming that you really want to proceed&lt;br /&gt;
&lt;br /&gt;
Of course, any program (including SLURM jobs) running within the screen gets terminated when the screen is destroyed.&lt;br /&gt;
&lt;br /&gt;
= Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; to reserve resources =&lt;br /&gt;
&lt;br /&gt;
== What is &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/salloc.html &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;] is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future.&lt;br /&gt;
&lt;br /&gt;
The typical use of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is to &amp;quot;book&amp;quot; an interactive session where the user enjoys &amp;#039;&amp;#039;&amp;#039;complete control of a set of resources&amp;#039;&amp;#039;&amp;#039;. The resources that are part of this set are chosen by the user. Within the &amp;quot;booked&amp;quot; session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM.&lt;br /&gt;
&lt;br /&gt;
More precisely:&lt;br /&gt;
* the user, using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, specifies what resources they need and the time when they will need them;&lt;br /&gt;
* when the delivery time comes, SLURM creates an interactive shell session for the user;&lt;br /&gt;
* within such session, the user can use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources.&lt;br /&gt;
&lt;br /&gt;
Resource reservation using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is only possible if the request is made in advance with respect to the delivery time. The more the requested resources are in demand, the further in advance the request should be made to ensure that SLURM is able to fulfill it.&lt;br /&gt;
&lt;br /&gt;
When a user makes a request for resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, the request (called an &amp;#039;&amp;#039;&amp;#039;allocation&amp;#039;&amp;#039;&amp;#039;) gets added to the SLURM job queue of the relevant partition as a job in &amp;lt;code&amp;gt;pending&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of SLURM&amp;#039;s process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user.&lt;br /&gt;
&lt;br /&gt;
Until the delivery time specified by the user comes, the allocation remains in state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; state, the stronger this accumulation of priority: so, by requesting resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;&amp;#039;well in advance of the delivery time&amp;#039;&amp;#039;&amp;#039;, users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands use a similar syntax to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands. In particular, &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; lets a user specify what resources they need and -importantly- a &amp;#039;&amp;#039;&amp;#039;delivery time&amp;#039;&amp;#039;&amp;#039; for the requested resources (delivery time can also be specified with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, but in that case it is not very useful). &lt;br /&gt;
&lt;br /&gt;
The typical &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command has this form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc [general_SLURM_options] --begin=&amp;lt;time&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &lt;br /&gt;
&lt;br /&gt;
:; [general_SLURM_options]&lt;br /&gt;
:: represents the options already described in [[#Options of srun and sbatch|Options of srun and sbatch]]&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;--begin=&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the delivery time of the resources reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, according to the syntax described below. The delivery time must be a future time.&lt;br /&gt;
&lt;br /&gt;
=== Syntax of parameter &amp;lt;code&amp;gt;--begin&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
If the allocation is for the current day, you can specify &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; as hours and minutes in the form&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;HH:MM&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you want to specify a time on a different day, the form for &amp;lt;time&amp;gt; is &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM&amp;lt;/code&amp;gt; (seconds may optionally be appended, as in &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM:SS&amp;lt;/code&amp;gt;), where the uppercase &amp;#039;T&amp;#039; separates date from time.&lt;br /&gt;
&lt;br /&gt;
It is also possible to specify &amp;lt;time&amp;gt; as relative to the current time, in one of the following forms:&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kminutes&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Khours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kdays&amp;lt;/code&amp;gt;&lt;br /&gt;
where K is a (positive) integer.&lt;br /&gt;
&lt;br /&gt;
Examples:&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=16:00&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1hours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1days&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=2030-01-20T12:34:00&amp;lt;/code&amp;gt;&lt;br /&gt;
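&lt;br /&gt;
Putting it all together, a complete (purely illustrative) request for one GPU and four CPUs for two hours, to be delivered tomorrow, could be:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=02:00:00 --begin=now+1days&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;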
&lt;br /&gt;
Note that Mufasa&amp;#039;s time zone is GMT, so &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; must be expressed in GMT as well. If you want to know Mufasa&amp;#039;s current time, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
date&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Thu Nov 10 16:43:30 UTC 2022&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
In the typical scenario, the user of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; will make use of [[User_Jobs#Detaching from a running job with screen|screen]]. Command &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; creates a shell session (called &amp;quot;a screen&amp;quot;) that can be abandoned without being closed ([[#Creating_a_screen_session.2C_running_a_job_in_it.2C_detaching_from_it|detaching from the screen]]) and reached again at a later time ([[#Reattaching_to_an_active_screen_session|reattaching to the screen]]). This means that a user can create a screen, run &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; within it to create an allocation for time X, detach from the screen, and reattach to it just before time X to use the reserved resources from the interactive session created by &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
More precisely, the operations needed to do this are the following:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]].&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created run the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], specifying via its options the resources you need and the time at which you want them delivered.&lt;br /&gt;
# SLURM will respond with a message similar to &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Pending job allocation XXXX&amp;lt;/pre&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell.&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your resource allocation request.&lt;br /&gt;
# At the delivery time you specified in the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], connect to the login server with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back in the screen where you used &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;; as soon as SLURM provides you with the resources you reserved, the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; changes to the shell prompt.&lt;br /&gt;
# You are now in the interactive shell session you booked with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;. From here, you can run any programs you want, including &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&amp;lt;br&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! Therefore, if the job reaches the time limit, it gets &amp;#039;&amp;#039;&amp;#039;forcibly terminated&amp;#039;&amp;#039;&amp;#039; by SLURM. Termination depends exclusively on the time limit: it occurs even if the end time of the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.)&lt;br /&gt;
# Once the interactive shell session is not needed anymore, cancel it by exiting from the session with &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;exit&amp;lt;/pre&amp;gt; (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.)&lt;br /&gt;
# You are now back to your screen. Destroy it by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a resource request made with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
To cancel a request for resources made as explained in [[#How to use salloc|How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;]], follow these steps:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen where you used command &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You should see the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; (if the allocation is still pending) or &amp;quot;&amp;#039;&amp;#039;salloc: job XXXX queued and waiting for resources&amp;#039;&amp;#039;&amp;quot; (if the allocation is done and waiting for its start time). Now just press &amp;#039;&amp;#039;&amp;#039;Ctrl + C&amp;#039;&amp;#039;&amp;#039;. This communicates to SLURM your intention to cancel your request for resources.&lt;br /&gt;
# SLURM will communicate the cancellation with message &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Job allocation XXXX has been revoked.&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with tools to inspect and manage jobs. While a [[Roles|Job User]] is able to see all users&amp;#039; jobs, they are only allowed to interact with their own.&lt;br /&gt;
&lt;br /&gt;
The main commands used to interact with jobs are &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/squeue.html &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to inspect the scheduling queues and &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/scancel.html &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to terminate queued or running jobs.&lt;br /&gt;
&lt;br /&gt;
== Inspecting jobs with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Running command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output comprises the following information:&lt;br /&gt;
&lt;br /&gt;
:; JOBID&lt;br /&gt;
:: Numerical identifier of the job assigned by SLURM&lt;br /&gt;
:: This identifier is used to intervene on the job, for instance with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: the partition that the job is run on&lt;br /&gt;
&lt;br /&gt;
:; NAME&lt;br /&gt;
:: the name assigned to the job; can be personalised using the &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; option&lt;br /&gt;
&lt;br /&gt;
:; USER&lt;br /&gt;
:: username of the user who launched the job&lt;br /&gt;
	&lt;br /&gt;
:; ST&lt;br /&gt;
:: job state (see [[SLURM#Job state|Job state]] for further information)&lt;br /&gt;
&lt;br /&gt;
:; TIME&lt;br /&gt;
:: time that has passed since the beginning of job execution&lt;br /&gt;
&lt;br /&gt;
:; NODES&lt;br /&gt;
:: number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)&lt;br /&gt;
&lt;br /&gt;
:; NODELIST (REASON)&lt;br /&gt;
:: name of the nodes where the job is being executed: for Mufasa it is always &amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;, which is the name of the node corresponding to Mufasa.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
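For example, the owner of job 520 in the listing above could terminate it with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel 520&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;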
To limit the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; to the jobs owned by user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt;, it can be used like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u &amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interpreting Job state as provided by &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; PENDING&lt;br /&gt;
:: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; RUNNING&lt;br /&gt;
:: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; SUSPENDED&lt;br /&gt;
:: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETING&lt;br /&gt;
:: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETED&lt;br /&gt;
:: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them.&lt;br /&gt;
&lt;br /&gt;
== Knowing when jobs are expected to end or start ==&lt;br /&gt;
&lt;br /&gt;
If you are interested in understanding when jobs are expected to start or end, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -o &amp;quot;%5i %8u %10P %.2t |%19S |%.11L|&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID USER     PARTITION  ST |START_TIME          |  TIME_LEFT|&lt;br /&gt;
5307  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5308  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5296  cziyang  fat         R |2022-11-08T16:58:03 | 1-00:48:14|&lt;br /&gt;
5306  thuynh   fat         R |2022-11-10T08:13:30 | 2-16:03:41|&lt;br /&gt;
5297  gnannini fat         R |2022-11-08T17:55:54 | 1-01:46:05|&lt;br /&gt;
5336  ssaitta  gpu         R |2022-11-10T08:13:00 |    6:03:11|&lt;br /&gt;
5358  dmilesi  gpulong     R |2022-11-10T15:11:32 | 2-23:01:43|&lt;br /&gt;
5338  cziyang  gpulong     R |2022-11-10T09:45:01 | 1-17:35:12|&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;:For running jobs (state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;):&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job started its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much remains of the running time requested by the job&lt;br /&gt;
&lt;br /&gt;
;:For pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;):&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job is expected to start its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much running time has been requested by the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Start and end times are forecasts based on the features of current jobs in the queues, and may change if running jobs end prematurely and/or if new jobs with higher priority are added to the queues. So these times should never be considered as certain.&lt;br /&gt;
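The TIME_LEFT values use SLURM's &amp;lt;code&amp;gt;[days-]hours:minutes:seconds&amp;lt;/code&amp;gt; notation. A small helper (a sketch; the name &amp;lt;code&amp;gt;slurm_to_seconds&amp;lt;/code&amp;gt; is illustrative, and it does not handle SLURM's shorter &amp;lt;code&amp;gt;mm:ss&amp;lt;/code&amp;gt; forms) converts such a value to seconds, e.g. for sorting jobs by remaining time:

```shell
# Convert a SLURM duration "[days-]hh:mm:ss" to a number of seconds.
slurm_to_seconds() {
  local t=$1 days=0
  case $t in
    *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the optional "days-" prefix
  esac
  local h=${t%%:*} rest=${t#*:}
  local m=${rest%%:*} s=${rest##*:}
  # 10# forces base-10 so leading zeros (e.g. "08") are not read as octal
  echo $(( (days*24 + 10#$h) * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_to_seconds 3-00:00:00   # 259200
slurm_to_seconds 6:03:11      # 21791
```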
&lt;br /&gt;
If you simply want to know when pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) are expected to begin execution, use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists pending jobs in order of increasing START_TIME (the job on top is the one which will be run first). For each pending job the command provides an output similar to the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
 5090       fat training   thuynh PD 2022-10-27T09:28:01      1 (null)               (Resources)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Getting detailed information about a job ==&lt;br /&gt;
&lt;br /&gt;
If needed, complete information about a job (either pending or running) can be obtained using command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show job &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; is the number from the first column of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The output of this command is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JobId=65 JobName=test_script.sh&lt;br /&gt;
   UserId=gfontana(10003) GroupId=gfontana(10004) MCS_label=N/A&lt;br /&gt;
   Priority=14208 Nice=0 Account=admin QOS=nogpu&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:55 TimeLimit=01:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-11-06T10:31:10 EligibleTime=2025-11-06T10:31:10&lt;br /&gt;
   AccrueTime=2025-11-06T10:31:10&lt;br /&gt;
   StartTime=2025-11-06T10:31:10 EndTime=2025-11-06T11:31:10 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-11-06T10:31:10 Scheduler=Main&lt;br /&gt;
   Partition=jobs AllocNode:Sid=mufasa2-login:42020&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=gn01&lt;br /&gt;
   BatchHost=gn01&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)&lt;br /&gt;
   Command=./test_script.sh&lt;br /&gt;
   WorkDir=/home/gfontana&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In particular, the line beginning with &amp;#039;&amp;#039;&amp;quot;StartTime=&amp;quot;&amp;#039;&amp;#039; provides expected times for the start and end of job execution. As explained in [[User_Jobs#Knowing_when_jobs_are_expected_to_end_or_start|Knowing when jobs are expected to end or start]], start time is only a prediction and subject to change.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a job with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
It is possible to cancel a job using command &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;, either while it is waiting for execution or when it is in execution (in this case you can choose what system signal to send the process in order to terminate it). &lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The following are some examples of use of &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; adapted from [https://slurm.schedmd.com/scancel.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
removes queued job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; from the execution queue.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=TERM &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGTERM (request to stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=KILL &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGKILL (force stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --state=PENDING --user=&amp;lt;username&amp;gt; --partition=&amp;lt;partition_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
cancels all pending jobs belonging to user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; in partition &amp;lt;code&amp;gt;&amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
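A common pattern built from these commands is a graceful cancel: send SIGTERM first, give the job a grace period to clean up, then SIGKILL it if it is still in the queue. The sketch below (the &amp;lt;code&amp;gt;cancel_gracefully&amp;lt;/code&amp;gt; helper is illustrative) only prints the commands it would run, so it can be tried on a machine without SLURM:

```shell
# Graceful-then-forced cancellation sketch. The commands are printed
# rather than executed, so this runs anywhere; remove the echos to use
# it on a real SLURM cluster.
cancel_gracefully() {
  local jobid=$1 grace=${2:-30}
  echo "scancel --signal=TERM $jobid"
  echo "sleep $grace"
  # squeue --job filters to a single job; -h suppresses the header line,
  # so any output at all means the job is still around and gets SIGKILL.
  echo "if squeue -h --job=$jobid | grep -q .; then scancel --signal=KILL $jobid; fi"
}

cancel_gracefully 5307 60
```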
&lt;br /&gt;
== Knowing what jobs you ran today ==&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct -X&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a list of all jobs run today by your user.&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2374</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2374"/>
		<updated>2026-05-07T14:48:11Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* How to know if your shell is a SLURM job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running jobs with SLURM =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must&amp;#039;&amp;#039;&amp;#039; use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM.&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa. This is a key difference between Mufasa 1.0 and [[System#Mufasa 2.0|Mufasa 2.0]].&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides two commands to run jobs, called [https://slurm.schedmd.com/srun.html srun] and [https://slurm.schedmd.com/sbatch.html sbatch]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In both cases, &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can be any Linux program (including shell scripts). By using &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, the command or script specified by &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; (including any programs launched by it) is added to SLURM&amp;#039;s execution queues.&lt;br /&gt;
&lt;br /&gt;
The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for &amp;#039;&amp;#039;&amp;#039;interactive jobs&amp;#039;&amp;#039;&amp;#039;: i.e., processes that use the console to interact with their user during job execution. &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell and simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; provides an additional possibility: &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can in fact be an [[#Using execution scripts to run jobs|&amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039;]], i.e. a special (and SLURM-specific) type of Linux shell script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command. This is handy because it lets the user write down the parameters in an execution script instead of having to type them on the command line when launching a job, which greatly reduces the possibility of mistakes. Also, an execution script is easy to keep and reuse.&lt;br /&gt;
&lt;br /&gt;
Immediately after a &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command is launched by a user, SLURM outputs a message informing the user that the job has been queued. The output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 queued and waiting for resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The shell is now locked while SLURM prepares the execution of the user program ([[#Detaching from a running job with screen|if you are using &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; you can detach from that shell and come back later]]). &lt;br /&gt;
&lt;br /&gt;
When SLURM is ready to run the program, it prints a message similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 has been allocated resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then executes the program.&lt;br /&gt;
&lt;br /&gt;
=== Options of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; commands is used to tell SLURM what resources the job needs in order to be executed and how much time it will need to complete.&lt;br /&gt;
&lt;br /&gt;
For what concerns resources, the most important option is &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, specifying which [[#SLURM Quality of Service (QOS)|SLURM QOS]] the job will use. A job run with a given QOS has access to all and only the resources available to that QOS. As a consequence, all options that define how many resources to assign to the job will only be able to provide the job with resources that are available to the chosen QOS. Jobs that require resources that are not available to the chosen QOS do not get executed. &lt;br /&gt;
&lt;br /&gt;
If the user forgets to use option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, the job is run on the &amp;#039;&amp;#039;default qos&amp;#039;&amp;#039; (&amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt;) which has access to &amp;#039;&amp;#039;zero&amp;#039;&amp;#039; resources. Therefore it is always necessary to specify option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; when launching a SLURM job on Mufasa.&lt;br /&gt;
&lt;br /&gt;
More generally, the most relevant among the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:;‑-qos=&amp;lt;qos_name&amp;gt;&lt;br /&gt;
:: specifies the [[SLURM#SLURM Quality of Service (QOS)|SLURM QOS]] that the job will use. It is mandatory to specify one.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The chosen QOS limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is available to the chosen QOS.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;‑‑qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; is used and options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task=&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt;) are omitted, the job is assigned the default amount of the resource (as defined by the chosen QOS). A notable exception concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;, which is always required (see below) if the job uses a QOS with access to GPUs.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; --job-name=&amp;lt;jobname&amp;gt;&lt;br /&gt;
:: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The default job name (i.e., the one assigned to the job when &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; is not used) is the executable program&amp;#039;s name.&lt;br /&gt;
&lt;br /&gt;
:;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
:: specifies what GPUs to assign to the job. &amp;lt;code&amp;gt;gpu_resources&amp;lt;/code&amp;gt; is a comma-delimited list where each element has the form &amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt; is one of the types of GPU available on Mufasa (see [[SLURM#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) and &amp;lt;code&amp;gt;&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt; is an integer between 1 and the number of GPUs of such type available to the partition. For instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:1,gpu:3g.20gb:1&amp;lt;/code&amp;gt;, corresponding to asking for one &amp;quot;full&amp;quot; GPU and 1 &amp;quot;small&amp;quot; GPU.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is &amp;#039;&amp;#039;&amp;#039;mandatory&amp;#039;&amp;#039;&amp;#039; if the job is run with a QOS that allows access to the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
:: specifies the amount of RAM to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
:: specifies how many CPUs to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the maximum time allowed to the job to complete, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;. When the time expires, the job (if still running) gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
:;‑‑pty&lt;br /&gt;
:: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[#Interactive jobs|Interactive jobs]])&lt;br /&gt;
&lt;br /&gt;
Note that GPU resources (if needed) must always be requested explicitly. For instance, in order to execute program &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; which needs one GPU of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; with QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; we can use the SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
An &amp;#039;&amp;#039;&amp;#039;interactive job&amp;#039;&amp;#039;&amp;#039; is a process that uses the console to interact with its user during job execution. Such a process is manually run by the user from a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) provided by SLURM. &lt;br /&gt;
&lt;br /&gt;
In order to ask SLURM to schedule the execution of a shell where the user can subsequently run the interactive job, it is necessary to use option &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For instance, to ask SLURM to run a shell with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;, the user should use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By not specifying any other options, the user is telling SLURM that they want the shell spawned by SLURM to be provided with the default amount of resources associated to QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;. More generally, any combination of the other [[#Options of srun and sbatch|options of srun]] can be used together with &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As every other job request to SLURM, the request to run a shell must be done from the [[System#Login server|login server]]. As soon as possible (i.e., as soon as the necessary resources are available) SLURM will open (in the same terminal that the user used to launch the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command) a bash shell, where the user will be able to run their interactive programs. &lt;br /&gt;
&lt;br /&gt;
To the user, the shell they were using to interact with the login server turns into a shell opened &amp;#039;&amp;#039;directly on Mufasa&amp;#039;&amp;#039;, and the command prompt changes from&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2-login:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, the shell is the “base” one. If a number is printed, it is the SLURM job ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process.&lt;br /&gt;
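This check can also be scripted, since SLURM exports &amp;lt;code&amp;gt;SLURM_JOB_ID&amp;lt;/code&amp;gt; into the environment of every job it runs. A minimal sketch:

```shell
# Branch on whether the current shell is running inside a SLURM job:
# SLURM sets SLURM_JOB_ID in the environment of every job step, while
# the "base" login shell has no such variable.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  where="inside SLURM job $SLURM_JOB_ID"
else
  where="on the login (base) shell"
fi
echo "Running $where"
```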
&lt;br /&gt;
When the user does not need the SLURM-spawned shell anymore, they should close it (as with any other Linux shell) with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to make the resources reserved for the interactive shell free again.&lt;br /&gt;
&lt;br /&gt;
== Non-interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands are very complex, and it&amp;#039;s easy to forget some option or make mistakes while using them. For non-interactive jobs, there is a solution to this problem.&lt;br /&gt;
&lt;br /&gt;
When the user job is non-interactive, in fact, the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command can be substituted with a much simpler &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command&amp;#039;&amp;#039;&amp;#039;. As [[#Running jobs with SLURM|already explained]], &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can make use of an &amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039; to specify all the parts of the command to be run via SLURM. So the command becomes&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives are used to specify the values of the parameters that are otherwise set in the [options] part of an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;Note on Linux shell scripts&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;#039;&amp;#039;A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;have the “executable” flag set&amp;#039;&amp;#039; (see [[System#Changing file/directory ownership and permissions|here]] for details)&lt;br /&gt;
* &amp;#039;&amp;#039;have&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;as its very first line&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Usually, a Linux shell script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Within any shell script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;line)&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Use of blank lines as spacers is allowed.&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, composed of directives with which the user specifies the values to be given to parameters, each preceded by the keyword &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# [optionally] one or more &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
The template includes all the options [[#Using SLURM to run a container|already described above]], plus a few additional useful ones (for instance, those that enable SLURM to send email messages to the user in response to events in the lifecycle of their job). Information about all the possible options can be found in [https://slurm.schedmd.com/sbatch.html SLURM&amp;#039;s own documentation].&lt;br /&gt;
&lt;br /&gt;
In the template below, &amp;#039;&amp;#039;&amp;#039;#SBATCH directives&amp;#039;&amp;#039;&amp;#039; are requests made to SLURM. Notice that, although #SBATCH directives have a leading &amp;quot;#&amp;quot;, they are &amp;#039;&amp;#039;not&amp;#039;&amp;#039; comments: just as the &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line at the beginning of a shell script starts with &amp;quot;#&amp;quot; but is not a comment either.&lt;br /&gt;
&lt;br /&gt;
Other lines in the script that begin with &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; not followed by SBATCH are comments.&lt;br /&gt;
&lt;br /&gt;
As for directives that request a given amount of a resource (including time), if they are missing from the execution script (or commented out), the job will be assigned the default amount of that resource.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-nodes=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑ntasks=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-partition=jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑-qos=&amp;lt;qos_name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑output=./&amp;lt;filename&amp;gt;-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where the output of the job gets written (i.e., standard output gets redirected onto the file). &amp;quot;%j&amp;quot; is replaced with the job&amp;#039;s JOBID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH ‑‑error=./&amp;lt;filename&amp;gt;-error-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where any error messages generated by the job get written (i.e., standard error gets redirected onto the file). &amp;quot;%j&amp;quot; is replaced with the job&amp;#039;s JOBID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --job-name=&amp;lt;name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
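As a concrete instance of the template, the sketch below fills in plausible values (the &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; partition and &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt; QOS from the examples on this page; the job name and resource amounts are illustrative, and &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; is omitted because the assumed QOS has no GPU access). The payload is a trivial computation, so the file also runs as an ordinary bash script:

```shell
#!/bin/bash

#----------------start of preamble----------------
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=jobs
#SBATCH --qos=nogpu
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00
#SBATCH --output=./myjob-%j.out
#SBATCH --error=./myjob-error-%j.out
#SBATCH --job-name=myjob
#----------------end of preamble----------------

# Payload: whatever the job must actually do (a placeholder computation here).
RESULT=$(( 6 * 7 ))
echo "computation finished, result: ${RESULT}"
```

Saved as, e.g., `my_execution_script.sh` and made executable, it would be submitted with `sbatch my_execution_script.sh`.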
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
== Key concept ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;The key concept about executing jobs on Mufasa is that [[System#Containers|all computation on Mufasa must occur within containers]]&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if the user has writing permission on them: e.g., the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
The system used by Mufasa to create and execute containers is &amp;#039;&amp;#039;&amp;#039;[[System#Singularity|Singularity]]&amp;#039;&amp;#039;&amp;#039;. This wiki includes [[Singularity|directions]] on preparing containers with Singularity.&lt;br /&gt;
&lt;br /&gt;
The container where a user job runs must contain all the libraries needed by the job. In fact (for maintainability and safety reasons) &amp;#039;&amp;#039;&amp;#039;no software and no libraries are installed on Mufasa 2.0&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== Interactive and non-interactive user jobs ==&lt;br /&gt;
&lt;br /&gt;
This section explains how to execute a user job contained in a container. It considers two types of user jobs, i.e.:&lt;br /&gt;
;: Interactive user jobs&lt;br /&gt;
::: as [[#Interactive jobs|already explained]], these are jobs that require interaction with the user while they are running, via a bash shell running within the container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the container is in execution.&lt;br /&gt;
&lt;br /&gt;
;: Non-interactive user jobs&lt;br /&gt;
::: are the most common variety. The user prepares the container in such a way that, when in execution, the container autonomously puts the user&amp;#039;s jobs into execution. The user does not have any communication with the container while it is in execution. Executing the container and running the required programs within the container&amp;#039;s environment is done via [[#Interactive jobs|execution scripts]].&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run an interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The first step to run an interactive user job on Mufasa is to run the [[System#Containers|container]] where the job will take place. Each user is in charge of preparing the container(s) where the user&amp;#039;s jobs will be executed.&lt;br /&gt;
&lt;br /&gt;
In order to run a container via SLURM by hand, i.e. via an interactive shell, a user must first open the shell with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [general_SLURM_options] --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where [general_SLURM_options] are those [[#Options of srun and sbatch|already described above]].&lt;br /&gt;
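&lt;br /&gt;
For instance (the partition name and resource amounts below are purely illustrative; adapt them to your needs), a shell requesting one GPU, 4 CPUs, 16 GB of RAM and a 2-hour duration can be opened with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=02:00:00 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;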
&lt;br /&gt;
Then the user must run the container: this is done as follows.&lt;br /&gt;
&lt;br /&gt;
First, it is necessary to load the Singularity software module with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(if needed, the list of software modules available in the system can be obtained with command &amp;lt;code&amp;gt;module av&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Then, the user must use Singularity to run the container with command (see the [[Singularity|section about Singularity]] for further details)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which pulls the container from the specified repository and executes it. Possible values for &amp;lt;code&amp;gt;&amp;lt;repository&amp;gt;&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;lt;code&amp;gt;docker&amp;lt;/code&amp;gt; (Docker Hub)&lt;br /&gt;
:: &amp;lt;code&amp;gt;library&amp;lt;/code&amp;gt; (the Sylabs Container Library)&lt;br /&gt;
:: &amp;lt;code&amp;gt;path/to/container&amp;lt;/code&amp;gt; if the container is local, i.e. located in the filesystem of Mufasa&lt;br /&gt;
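&lt;br /&gt;
For example, to pull and run a container image from Docker Hub (the image name below is purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run docker://ubuntu:22.04&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;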
&lt;br /&gt;
As soon as the container is in execution, the terminal window used so far to interact with Mufasa becomes a shell &amp;#039;&amp;#039;in the container&amp;#039;&amp;#039;. This shell belongs to the software environment of the container, and the user can use it to interact with the container&amp;#039;s own software environment and filesystem. &lt;br /&gt;
&lt;br /&gt;
It is easy to tell whether a shell is open to Mufasa or to the container, because in a container shell the system prompt becomes &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interaction between container filesystem and local filesystem === &lt;br /&gt;
&lt;br /&gt;
The filesystem inside the container and the local one, i.e. Mufasa&amp;#039;s, can interact. This means that the container can access the local filesystem to read and/or write files. However, the only parts of Mufasa&amp;#039;s filesystem that can be accessed by the container are those that the user running the container has access rights to.&lt;br /&gt;
&lt;br /&gt;
By default, the user&amp;#039;s &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa is automatically mapped onto &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; in the filesystem of the container. Any change made to that container directory is actually applied to the local &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa. &lt;br /&gt;
&lt;br /&gt;
The mapping of the home directory does not need to be explicitly requested. However, if the user needs (in addition to the home directory) other parts of the local filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
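&lt;br /&gt;
For example (the paths and image name below are purely illustrative), to make a local dataset directory visible as &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; inside the container:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind /home/username/dataset:/data docker://ubuntu:22.04&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;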
&lt;br /&gt;
=== How to know if your shell is a SLURM job ===&lt;br /&gt;
To know if the shell you are using is being run via SLURM or not (becoming confused is easy), use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it provides an output, your shell is a SLURM job and the output is the ID of the job. &lt;br /&gt;
If it doesn&amp;#039;t provide any output, your shell is not a SLURM job.&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run a non-interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
When the user job to be executed in a container is non-interactive, the mechanism based on an &amp;#039;&amp;#039;execution script&amp;#039;&amp;#039; already described in [[#Non-interactive jobs|Non-interactive jobs]] is employed. The command to run the script (which in turn runs the container where the user job takes place) is therefore&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The general features of a SLURM execution script and the SBATCH directives used for generic jobs have [[#Non-interactive jobs|already been described]]. Here we focus, therefore, on the SBATCH directives specifically used when SLURM is used to run a non-interactive job within a container.&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
[[#Non-interactive jobs|#SBATCH directives already described above]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;module load amd/singularity&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt; &amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the last line of the script, &amp;lt;code&amp;gt;&amp;lt;command_to_run&amp;gt;&amp;lt;/code&amp;gt; is the command (e.g., the name of an executable script), complete with its path within the container&amp;#039;s filesystem, of the program to be run inside the container. Please refer to the [[Singularity|section about Singularity]] for details about its commands.&lt;br /&gt;
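&lt;br /&gt;
As a filled-in example of the template above (the job name, partition, resource amounts, container path and script path are all purely illustrative; replace them with your own), an execution script might look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=my_training_job&lt;br /&gt;
#SBATCH --partition=gpu&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --cpus-per-task=4&lt;br /&gt;
#SBATCH --mem=16G&lt;br /&gt;
#SBATCH --time=08:00:00&lt;br /&gt;
#SBATCH --output=%j_output.txt&lt;br /&gt;
#SBATCH --error=%j_errors.txt&lt;br /&gt;
&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
singularity run /home/username/containers/my_container.sif /workspace/run_training.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;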
&lt;br /&gt;
The interactions between container filesystem and local filesystem in non-interactive jobs are exactly the same [[#Interaction between container filesystem and local filesystem|already described]] for interactive jobs. In particular, the user&amp;#039;s home directory is mapped by default onto the filesystem of the container.&lt;br /&gt;
&lt;br /&gt;
If, in addition to that, the user needs other parts of Mufasa&amp;#039;s filesystem to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command at the end of the script:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
&lt;br /&gt;
== Job output ==&lt;br /&gt;
&lt;br /&gt;
The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of Mufasa by the container where the computation takes place. &lt;br /&gt;
&lt;br /&gt;
As [[#Interaction between container filesystem and local filesystem|explained above]], Singularity includes a mechanism to mount a part of Mufasa&amp;#039;s own filesystem onto the container&amp;#039;s filesystem: when the job running within the container writes to this mounted part, it actually writes to Mufasa&amp;#039;s filesystem. This means that when the container ends its execution, its output files persist in Mufasa&amp;#039;s filesystem (usually in a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) and can be retrieved by the user at a later time.&lt;br /&gt;
&lt;br /&gt;
The same mechanism can be used to allow user jobs running into a container to read their input data from Mufasa&amp;#039;s filesystem (usually a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Cancelling completed jobs ==&lt;br /&gt;
&lt;br /&gt;
When a user process run via SLURM has completed its execution and is not needed anymore, it is important to [[User_Jobs#Canceling_a_job_with_scancel|close it with scancel]], especially if much time remains before the end of the execution time requested by the job.&lt;br /&gt;
&lt;br /&gt;
Cancelling a SLURM job makes the resources reserved by SLURM free again for other users, and thus speeds up the execution of the jobs still queued.&lt;br /&gt;
&lt;br /&gt;
Typically, one doesn&amp;#039;t know in advance how long a piece of code will take to complete its work. So please check from time to time whether it has finished, and, if there is still time before the duration of your SLURM job ends, just &amp;#039;&amp;#039;scancel&amp;#039;&amp;#039; the job. Other users will be grateful :-)&lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
= Detaching from a running job with &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an [[#Interactive and non-interactive user jobs|interactive user job]], the shell where the command is running must remain open: if it closes, the job terminates. That shell runs in the terminal of your own PC where the [[System#Accessing Mufasa|SSH connection to Mufasa]] exists.&lt;br /&gt;
&lt;br /&gt;
If you do not plan to keep the SSH connection to Mufasa open (for instance because you have to turn off or suspend your PC), there is a way to keep your interactive job alive. Namely, you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then &amp;#039;&amp;#039;detach&amp;#039;&amp;#039; from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). &lt;br /&gt;
&lt;br /&gt;
Once you have detached from the screen session, you can close the SSH connection to Mufasa without damage. When you need to reach your (still running) job again, you can open a new SSH connection to Mufasa and then &amp;#039;&amp;#039;reattach&amp;#039;&amp;#039; to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A typical use case for screen is to write your program so that it prints progress messages as it goes on with its work. You can then check its progress by periodically reattaching to the screen where the program is running and reading the messages it has printed.&lt;br /&gt;
&lt;br /&gt;
Basic usage of &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is explained below.&lt;br /&gt;
&lt;br /&gt;
== Creating a screen session, running a job in it, detaching from it ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell, while your process will go on running in the screen&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your running job&lt;br /&gt;
&lt;br /&gt;
== Reattaching to an active screen session ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# In the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
&lt;br /&gt;
== Closing (i.e. destroying) a screen session ==&lt;br /&gt;
&lt;br /&gt;
When you do not need a screen session anymore:&lt;br /&gt;
&lt;br /&gt;
# reattach to the active screen session as explained [[#Reattaching to an active screen session|above]]&lt;br /&gt;
# destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash), then confirming that you really want to proceed&lt;br /&gt;
&lt;br /&gt;
Of course, any program (including SLURM jobs) running within the screen gets terminated when the screen is destroyed.&lt;br /&gt;
&lt;br /&gt;
= Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; to reserve resources =&lt;br /&gt;
&lt;br /&gt;
== What is &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/salloc.html &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;] is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future.&lt;br /&gt;
&lt;br /&gt;
The typical use of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is to &amp;quot;book&amp;quot; an interactive session where the user enjoys &amp;#039;&amp;#039;&amp;#039;complete control of a set of resources&amp;#039;&amp;#039;&amp;#039;. The resources that are part of this set are chosen by the user. Within the &amp;quot;booked&amp;quot; session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM.&lt;br /&gt;
&lt;br /&gt;
More precisely:&lt;br /&gt;
* the user, using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, specifies what resources they need and the time when they will need them;&lt;br /&gt;
* when the delivery time comes, SLURM creates an interactive shell session for the user;&lt;br /&gt;
* within such session, the user can use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources.&lt;br /&gt;
&lt;br /&gt;
Resource reservation using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is only possible if the request is made in advance with respect to the delivery time. The more the resources that the user wants to reserve are in demand, the earlier the request should be made to ensure that SLURM is able to fulfill it.&lt;br /&gt;
&lt;br /&gt;
When a user makes a request for resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, the request (called an &amp;#039;&amp;#039;&amp;#039;allocation&amp;#039;&amp;#039;&amp;#039;) gets added to SLURM&amp;#039;s job queue for the relevant partition as a job in &amp;lt;code&amp;gt;pending&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of SLURM&amp;#039;s process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; actually corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user.&lt;br /&gt;
&lt;br /&gt;
Until the delivery time specified by the user comes, the allocation remains in state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; state, the stronger this accumulation of priority: so, by requesting resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;&amp;#039;well in advance of the delivery time&amp;#039;&amp;#039;&amp;#039;, users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands use a similar syntax to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands. In particular, &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; lets a user specify what resources they need and -importantly- a &amp;#039;&amp;#039;&amp;#039;delivery time&amp;#039;&amp;#039;&amp;#039; for the requested resources (delivery time can also be specified with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, but in that case it is not very useful). &lt;br /&gt;
&lt;br /&gt;
The typical &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command has this form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc [general_SLURM_options] --begin=&amp;lt;time&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &lt;br /&gt;
&lt;br /&gt;
:; [general_SLURM_options]&lt;br /&gt;
:: represents the options already described in [[#Options of srun and sbatch|Options of srun and sbatch]]&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;--begin=&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the delivery time of the resources reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, according to the syntax described below. The delivery time must be a future time.&lt;br /&gt;
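&lt;br /&gt;
For instance (the resource amounts and times below are purely illustrative), to reserve one GPU and 32 GB of RAM for a 4-hour session starting at 16:00:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc --gres=gpu:1 --mem=32G --time=04:00:00 --begin=16:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;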
&lt;br /&gt;
=== Syntax of parameter &amp;lt;code&amp;gt;--begin&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
If the allocation is for the current day, you can specify &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; as hours and minutes in the form&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;HH:MM&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you want to specify a time on a different day, the form for &amp;lt;time&amp;gt; is &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM&amp;lt;/code&amp;gt; (seconds may optionally be appended as &amp;lt;code&amp;gt;:SS&amp;lt;/code&amp;gt;), where the uppercase &amp;#039;T&amp;#039; separates the date from the time. &lt;br /&gt;
&lt;br /&gt;
It is also possible to specify &amp;lt;time&amp;gt; as relative to the current time, in one of the following forms:&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kminutes&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Khours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kdays&amp;lt;/code&amp;gt;&lt;br /&gt;
where K is a (positive) integer.&lt;br /&gt;
&lt;br /&gt;
Examples:&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=16:00&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1hours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1days&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=2030-01-20T12:34:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that Mufasa&amp;#039;s time zone is GMT, so &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; must be expressed in GMT as well. If you want to know Mufasa&amp;#039;s current time, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
date&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Thu Nov 10 16:43:30 UTC 2022&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
In the typical scenario, the user of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; will make use of [[User_Jobs#Detaching from a running job with screen|screen]]. Command &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; creates a shell session (called &amp;quot;a screen&amp;quot;) that it is possible to abandon without closing it ([[#Creating_a_screen_session.2C_running_a_job_in_it.2C_detaching_from_it|detaching from the screen]]). It is then possible to reach the screen again at a later time ([[#Reattaching_to_an_active_screen_session|reattaching to the screen]]). This means that a user can create a screen, run &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; within it to create an allocation for time X, detach from the screen, and reattach to it just before time X to use the reserved resources from the interactive session created by &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
More precisely, the operations needed to do this are the following:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]].&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created run the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], specifying via its options the resources you need and the time at which you want them delivered.&lt;br /&gt;
# SLURM will respond with a message similar to &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Pending job allocation XXXX&amp;lt;/pre&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell.&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your resource allocation request.&lt;br /&gt;
# At the delivery time you specified in the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], connect to the login server with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you used &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;; as soon as SLURM provides you with the resources you reserved, message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; changes to the shell prompt.&lt;br /&gt;
# You are now in the interactive shell session you booked with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;. From here, you can run any programs you want, including &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&amp;lt;br&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! Therefore, if the job reaches the time limit, it gets &amp;#039;&amp;#039;&amp;#039;forcibly terminated&amp;#039;&amp;#039;&amp;#039; by SLURM. Termination depends exclusively on the time limit: so it occurs even if the end time for the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.)&lt;br /&gt;
# Once the interactive shell session is not needed anymore, cancel it by exiting from the session with &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;exit&amp;lt;/pre&amp;gt; (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.)&lt;br /&gt;
# You are now back to your screen. Destroy it by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a resource request made with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
To cancel a request for resources made as explained in [[#How to use salloc|How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;]], follow these steps:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen where you used command &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You should see the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; (if the allocation is still pending) or &amp;quot;&amp;#039;&amp;#039;salloc: job XXXX queued and waiting for resources&amp;#039;&amp;#039;&amp;quot; (if the allocation is done and waiting for its start time). Now just press &amp;#039;&amp;#039;&amp;#039;Ctrl + C&amp;#039;&amp;#039;&amp;#039;. This communicates to SLURM your intention to cancel your request for resources.&lt;br /&gt;
# SLURM will communicate the cancellation with message &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Job allocation XXXX has been revoked.&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with tools to inspect and manage jobs. While a [[Roles|Job User]] is able to see all users&amp;#039; jobs, they are only allowed to interact with their own.&lt;br /&gt;
&lt;br /&gt;
The main commands used to interact with jobs are &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/squeue.html &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to inspect the scheduling queues and &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/scancel.html &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to terminate queued or running jobs.&lt;br /&gt;
&lt;br /&gt;
== Inspecting jobs with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Running command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output comprises the following information:&lt;br /&gt;
&lt;br /&gt;
:; JOBID&lt;br /&gt;
:: Numerical identifier of the job assigned by SLURM&lt;br /&gt;
:: This identifier is used to intervene on the job, for instance with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: the partition that the job is run on&lt;br /&gt;
&lt;br /&gt;
:; NAME&lt;br /&gt;
:: the name assigned to the job; can be personalised using the &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; option&lt;br /&gt;
&lt;br /&gt;
:; USER&lt;br /&gt;
:: username of the user who launched the job&lt;br /&gt;
	&lt;br /&gt;
:; ST&lt;br /&gt;
:: job state (see [[SLURM#Job state|Job state]] for further information)&lt;br /&gt;
&lt;br /&gt;
:; TIME&lt;br /&gt;
:: time that has passed since the beginning of job execution&lt;br /&gt;
&lt;br /&gt;
:; NODES&lt;br /&gt;
:: number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)&lt;br /&gt;
&lt;br /&gt;
:; NODELIST (REASON)&lt;br /&gt;
:: name of the node(s) where the job is being executed; for Mufasa it is always &amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;, which is the name of the node corresponding to Mufasa.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To limit the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; to the jobs owned by user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt;, it can be used like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u &amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interpreting Job state as provided by &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; PENDING&lt;br /&gt;
:: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; RUNNING&lt;br /&gt;
:: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; SUSPENDED&lt;br /&gt;
:: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETING&lt;br /&gt;
:: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETED&lt;br /&gt;
:: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them.&lt;br /&gt;
&lt;br /&gt;
== Knowing when jobs are expected to end or start ==&lt;br /&gt;
&lt;br /&gt;
If you are interested in understanding when jobs are expected to start or end, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -o &amp;quot;%5i %8u %10P %.2t |%19S |%.11L|&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID USER     PARTITION  ST |START_TIME          |  TIME_LEFT|&lt;br /&gt;
5307  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5308  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5296  cziyang  fat         R |2022-11-08T16:58:03 | 1-00:48:14|&lt;br /&gt;
5306  thuynh   fat         R |2022-11-10T08:13:30 | 2-16:03:41|&lt;br /&gt;
5297  gnannini fat         R |2022-11-08T17:55:54 | 1-01:46:05|&lt;br /&gt;
5336  ssaitta  gpu         R |2022-11-10T08:13:00 |    6:03:11|&lt;br /&gt;
5358  dmilesi  gpulong     R |2022-11-10T15:11:32 | 2-23:01:43|&lt;br /&gt;
5338  cziyang  gpulong     R |2022-11-10T09:45:01 | 1-17:35:12|&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; For running jobs (state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;):&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job started its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much remains of the running time requested by the job&lt;br /&gt;
&lt;br /&gt;
:; For pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;):&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job is expected to start its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much running time has been requested by the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Start and end times are forecasts based on the features of the jobs currently in the queues, and may change if running jobs end prematurely and/or if new jobs with higher priority are added to the queues. These times should therefore never be considered certain.&lt;br /&gt;
&lt;br /&gt;
If you simply want to know when pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) are expected to begin execution, use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists pending jobs in order of increasing START_TIME (the job on top is the one which will be run first). For each pending job the command provides an output similar to the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
 5090       fat training   thuynh PD 2022-10-27T09:28:01      1 (null)               (Resources)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Getting detailed information about a job ==&lt;br /&gt;
&lt;br /&gt;
If needed, complete information about a job (either pending or running) can be obtained using command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show job &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; is the number from the first column of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The output of this command is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JobId=65 JobName=test_script.sh&lt;br /&gt;
   UserId=gfontana(10003) GroupId=gfontana(10004) MCS_label=N/A&lt;br /&gt;
   Priority=14208 Nice=0 Account=admin QOS=nogpu&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:55 TimeLimit=01:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-11-06T10:31:10 EligibleTime=2025-11-06T10:31:10&lt;br /&gt;
   AccrueTime=2025-11-06T10:31:10&lt;br /&gt;
   StartTime=2025-11-06T10:31:10 EndTime=2025-11-06T11:31:10 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-11-06T10:31:10 Scheduler=Main&lt;br /&gt;
   Partition=jobs AllocNode:Sid=mufasa2-login:42020&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=gn01&lt;br /&gt;
   BatchHost=gn01&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)&lt;br /&gt;
   Command=./test_script.sh&lt;br /&gt;
   WorkDir=/home/gfontana&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In particular, the line beginning with &amp;#039;&amp;#039;&amp;quot;StartTime=&amp;quot;&amp;#039;&amp;#039; provides expected times for the start and end of job execution. As explained in [[User_Jobs#Knowing_when_jobs_are_expected_to_end_or_start|Knowing when jobs are expected to end or start]], start time is only a prediction and subject to change.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a job with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
It is possible to cancel a job using command &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;, either while it is waiting for execution or when it is in execution (in this case you can choose what system signal to send the process in order to terminate it). &lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The following are some examples of use of &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; adapted from [https://slurm.schedmd.com/scancel.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
removes queued job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; from the execution queue.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=TERM &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGTERM (request to stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=KILL &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGKILL (force stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --state=PENDING --user=&amp;lt;username&amp;gt; --partition=&amp;lt;partition_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
cancels all pending jobs belonging to user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; in partition &amp;lt;code&amp;gt;&amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
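&lt;br /&gt;
Another frequently useful variant cancels &amp;#039;&amp;#039;all&amp;#039;&amp;#039; of your own jobs at once, using the &amp;lt;code&amp;gt;--user&amp;lt;/code&amp;gt; option shown above on its own. Since &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; is only available on the cluster, the sketch below merely assembles and prints the command string; on Mufasa you would run the printed command directly:&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Sketch: build the scancel invocation that cancels every job owned by
# the current user. On Mufasa you would run the printed command as-is.
user="${USER:-unknown}"
echo "scancel --user=${user}"
```
&lt;br /&gt;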
&lt;br /&gt;
== Knowing what jobs you ran today ==&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct -X&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a list of all jobs run today by your user.&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2373</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2373"/>
		<updated>2026-05-07T14:46:14Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* How to know if your shell is running within SLURM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running jobs with SLURM =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must&amp;#039;&amp;#039;&amp;#039; use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM.&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa. This is a key difference between Mufasa 1.0 and [[System#Mufasa 2.0|Mufasa 2.0]].&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides two commands to run jobs, called [https://slurm.schedmd.com/srun.html srun] and [https://slurm.schedmd.com/sbatch.html sbatch]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In both cases, &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can be any Linux program (including shell scripts). By using &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, the command or script specified by &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; (including any programs it launches) is added to SLURM&amp;#039;s execution queues.&lt;br /&gt;
&lt;br /&gt;
The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for &amp;#039;&amp;#039;&amp;#039;interactive jobs&amp;#039;&amp;#039;&amp;#039;: i.e., processes that use the console to interact with their user during job execution. &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell: it simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; provides an additional possibility: &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can in fact be an [[#Using execution scripts to run jobs|&amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039;]], i.e. a special (and SLURM-specific) type of Linux shell script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command. This is handy because it allows you to write the parameters in an execution script instead of typing them on the command line each time you launch a job, which greatly reduces the possibility of mistakes. Also, an execution script is easy to keep and reuse.&lt;br /&gt;
&lt;br /&gt;
Immediately after a &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command is launched by a user, SLURM outputs a message informing the user that the job has been queued. The output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 queued and waiting for resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The shell is now locked while SLURM prepares the execution of the user program ([[#Detaching from a running job with screen|if you are using &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; you can detach from that shell and come back later]]). &lt;br /&gt;
&lt;br /&gt;
When SLURM is ready to run the program, it prints a message similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 has been allocated resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then executes the program.&lt;br /&gt;
&lt;br /&gt;
=== Options of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; commands is used to tell SLURM what resources the job needs in order to be executed and how much time it will need to complete its execution.&lt;br /&gt;
&lt;br /&gt;
For what concerns resources, the most important option is &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, specifying which [[#SLURM Quality of Service (QOS)|SLURM QOS]] the job will use. A job run with a given QOS has access to all and only the resources available to that QOS. As a consequence, all options that define how many resources to assign the job will only be able to provide the job with resources that are available to the chosen QOS. Jobs that require resources that are not available to the chosen QOS do not get executed. &lt;br /&gt;
&lt;br /&gt;
If the user forgets to use option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, the job is run on the &amp;#039;&amp;#039;default qos&amp;#039;&amp;#039; (&amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt;) which has access to &amp;#039;&amp;#039;zero&amp;#039;&amp;#039; resources. Therefore it is always necessary to specify option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; when launching a SLURM job on Mufasa.&lt;br /&gt;
&lt;br /&gt;
More generally, the most relevant among the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:;‑-qos=&amp;lt;qos_name&amp;gt;&lt;br /&gt;
:: specifies the [[SLURM#SLURM Quality of Service (QOS)|SLURM QOS]] that the job will use. It is mandatory to specify one.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The chosen QOS limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is available to the chosen QOS.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;‑‑qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; is used and options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task=&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt;) are omitted, the job is assigned the default amount of the resource (as defined by the chosen QOS). A notable exception concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;, which is always required (see below) if the job uses a QOS with access to GPUs.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; --job-name=&amp;lt;jobname&amp;gt;&lt;br /&gt;
:: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The default job name (i.e., the one assigned to the job when &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; is not used) is the executable program&amp;#039;s name.&lt;br /&gt;
&lt;br /&gt;
:;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
:: specifies what GPUs to assign to the job. &amp;lt;code&amp;gt;gpu_resources&amp;lt;/code&amp;gt; is a comma-delimited list where each element has the form &amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt;, in which &amp;lt;code&amp;gt;&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt; is one of the types of GPU available on Mufasa (see [[SLURM#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) and &amp;lt;code&amp;gt;&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt; is an integer between 1 and the number of GPUs of such type available to the partition. For instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:1,gpu:3g.20gb:1&amp;lt;/code&amp;gt;, corresponding to asking for one &amp;quot;full&amp;quot; GPU and one &amp;quot;small&amp;quot; GPU.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is &amp;#039;&amp;#039;&amp;#039;mandatory&amp;#039;&amp;#039;&amp;#039; if the job is run with a QOS that allows access to the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
:: specifies the amount of RAM to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
:: specifies how many CPUs to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the maximum time allowed to the job to complete, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;. When the time expires, the job (if still running) gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
:;‑‑pty&lt;br /&gt;
:: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[#Interactive jobs|Interactive jobs]])&lt;br /&gt;
&lt;br /&gt;
Note that GPU resources (if needed) must always be requested explicitly. For instance, to execute program &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt;, which needs one GPU of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt;, with QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt;, we can use the SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
An &amp;#039;&amp;#039;&amp;#039;interactive job&amp;#039;&amp;#039;&amp;#039; is a process that uses the console to interact with its user during job execution. Such a process is run manually by the user from a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) provided by SLURM. &lt;br /&gt;
&lt;br /&gt;
In order to ask SLURM to schedule the execution of a shell where the user can subsequently run the interactive job, it is necessary to use option &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For instance, to ask SLURM to run a shell with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;, the user should use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By not specifying any other options, the user is telling SLURM that they want the shell spawned by SLURM to be provided with the default amount of resources associated to QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;. More generally, any combination of the other [[#Options of srun and sbatch|options of srun]] can be used together with &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Like every other job request to SLURM, the request to run a shell must be made from the [[System#Login server|login server]]. As soon as possible (i.e., as soon as the necessary resources are available), SLURM will open (in the same terminal that the user used to launch the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command) a bash shell, where the user will be able to run their interactive programs. &lt;br /&gt;
&lt;br /&gt;
To the user, the shell they were using to interact with the login server appears to change into a shell opened &amp;#039;&amp;#039;directly on Mufasa&amp;#039;&amp;#039;; accordingly, the command prompt changes from&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2-login:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to know whether the current shell is the “base” shell or one run via SLURM is to execute the command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process.&lt;br /&gt;
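&lt;br /&gt;
The same check can be scripted. The following minimal sketch (the printed messages are illustrative) relies on the fact that SLURM sets &amp;lt;code&amp;gt;SLURM_JOB_ID&amp;lt;/code&amp;gt; in the environment of every job it spawns:&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Print whether this shell is running inside a SLURM job.
# SLURM sets SLURM_JOB_ID in the environment of every job it spawns.
if [ -n "$SLURM_JOB_ID" ]; then
    echo "This shell is SLURM job $SLURM_JOB_ID"
else
    echo "This is the base shell, not a SLURM job"
fi
```
&lt;br /&gt;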
&lt;br /&gt;
When the user does not need the SLURM-spawned shell anymore, they should close it with the command (the same used to close any other Linux shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
so as to free the resources reserved for the interactive shell.&lt;br /&gt;
&lt;br /&gt;
== Non-interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands are very complex, and it&amp;#039;s easy to forget some option or make mistakes while using them. For non-interactive jobs, there is a solution to this problem.&lt;br /&gt;
&lt;br /&gt;
When the user job is non-interactive, in fact, the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command can be substituted with a much simpler &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command&amp;#039;&amp;#039;&amp;#039;. As [[#Running jobs with SLURM|already explained]], &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can make use of an &amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039; to specify all the parts of the command to be run via SLURM. So the command becomes&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives are used to specify the values of the parameters that are otherwise set in the [options] part of an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;Note on Linux shell scripts&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;#039;&amp;#039;A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;have the “executable” flag set&amp;#039;&amp;#039; (see [[System#Changing file/directory ownership and permissions|here]] for details)&lt;br /&gt;
* &amp;#039;&amp;#039;have&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;as its very first line&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Usually, a Linux shell script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Within any shell script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;line)&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Use of blank lines as spacers is allowed.&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, composed of directives (each preceded by the keyword &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt;) with which the user specifies the values to be given to parameters&lt;br /&gt;
# [optionally] one or more &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
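As a concrete illustration, a minimal execution script might look like the sketch below (the QOS name &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;, the job name and the program name are illustrative):&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Minimal execution script sketch. The #SBATCH lines are SLURM directives,
# read by sbatch before the script body runs; all values are illustrative.
#SBATCH --job-name=my_test
#SBATCH --qos=nogpu
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00

srun ./my_program
```
&lt;br /&gt;
Such a script would be launched with &amp;lt;code&amp;gt;sbatch my_execution_script.sh&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;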
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
The template includes all the options [[#Using SLURM to run a container|already described above]], plus a few additional useful ones (for instance, those that enable SLURM to send email messages to the user in correspondence to events in the lifecycle of their job). Information about all the possible options can be found in [https://slurm.schedmd.com/sbatch.html SLURM&amp;#039;s own documentation].&lt;br /&gt;
&lt;br /&gt;
In the template below, &amp;#039;&amp;#039;&amp;#039;#SBATCH directives&amp;#039;&amp;#039;&amp;#039; are requests made to SLURM. Notice that, although #SBATCH directives have a leading &amp;quot;#&amp;quot;, they are &amp;#039;&amp;#039;not&amp;#039;&amp;#039; comments: just like the &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line at the beginning of a shell script, they start with &amp;quot;#&amp;quot; but are not comments.&lt;br /&gt;
&lt;br /&gt;
Other lines in the script that begin with &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; not followed by SBATCH are comments.&lt;br /&gt;
&lt;br /&gt;
For what concerns directives that ask for a given amount of a resource (including time): if they are missing from the execution script (or commented out), the job is assigned the default amount of that resource.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --nodes=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --ntasks=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --partition=jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --qos=&amp;lt;qos_name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --gres=&amp;lt;gpu_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --mem=&amp;lt;mem_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --cpus-per-task=&amp;lt;cpu_amount&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --output=./&amp;lt;filename&amp;gt;-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where the output of the job gets written (i.e., standard output gets redirected to this file). &amp;quot;%j&amp;quot; is replaced with the job ID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --error=./&amp;lt;filename&amp;gt;-error-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where any error messages generated by the job get written (i.e., standard error gets redirected to this file). &amp;quot;%j&amp;quot; is replaced with the job ID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --job-name=&amp;lt;name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
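&lt;br /&gt;
As a concrete illustration, below is a hypothetical filled-in version of the template. All values (QOS name, resource amounts, filenames, command) are examples only and must be replaced with values appropriate for your job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --partition=jobs&lt;br /&gt;
#SBATCH --qos=normal&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --cpus-per-task=4&lt;br /&gt;
#SBATCH --time=0-02:00:00&lt;br /&gt;
#SBATCH --output=./myjob-%j.out&lt;br /&gt;
#SBATCH --error=./myjob-error-%j.out&lt;br /&gt;
#SBATCH --job-name=myjob&lt;br /&gt;
&lt;br /&gt;
./my_program.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;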
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
== Key concept ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;The key concept about executing jobs on Mufasa is that [[System#Containers|all computation on Mufasa must occur within containers]]&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if the user has writing permission on them: e.g., the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
The system used by Mufasa to create and execute containers is &amp;#039;&amp;#039;&amp;#039;[[System#Singularity|Singularity]]&amp;#039;&amp;#039;&amp;#039;. This wiki includes [[Singularity|directions]] on preparing containers with Singularity.&lt;br /&gt;
&lt;br /&gt;
The container where a user job runs must contain all the libraries needed by the job. In fact (for maintainability and safety reasons) &amp;#039;&amp;#039;&amp;#039;no software and no libraries are installed on Mufasa 2.0&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== Interactive and non-interactive user jobs ==&lt;br /&gt;
&lt;br /&gt;
This section explains how to execute a user job contained in a container. It considers two types of user jobs:&lt;br /&gt;
;: Interactive user jobs&lt;br /&gt;
::: as [[#Interactive jobs|already explained]], these are jobs that require interaction with the user while they are running, via a bash shell running within the container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the container is in execution.&lt;br /&gt;
&lt;br /&gt;
;: Non-interactive user jobs&lt;br /&gt;
::: are the most common variety. The user prepares the container so that, once it is in execution, it autonomously puts the user&amp;#039;s jobs into execution. The user has no communication with the container while it is running. Executing the container and running the required programs within the container&amp;#039;s environment is done via [[#Non-interactive jobs|execution scripts]].&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run an interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The first step to run an interactive user job on Mufasa is to run the [[System#Containers|container]] where the job will take place. Each user is in charge of preparing the container(s) where the user&amp;#039;s jobs will be executed.&lt;br /&gt;
&lt;br /&gt;
In order to run a container via SLURM by hand, i.e. via an interactive shell, a user must first open the shell with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [general_SLURM_options] --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where [general_SLURM_options] are those [[#Options of srun and sbatch|already described above]].&lt;br /&gt;
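&lt;br /&gt;
For example, a (hypothetical) request for an interactive shell with one GPU, 16 GB of RAM and 4 CPUs for two hours might look like this; all resource values are illustrative only:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --partition=jobs --gres=gpu:1 --mem=16G --cpus-per-task=4 --time=0-02:00:00 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;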
&lt;br /&gt;
Then the user must run the container: this is done as follows.&lt;br /&gt;
&lt;br /&gt;
First, it is necessary to load the Singularity software module with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(if needed, the list of software modules available in the system can be obtained with command &amp;lt;code&amp;gt;module av&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Then, the user must use Singularity to run the container with command (see the [[Singularity|section about Singularity]] for further details)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which pulls the container from the specified repository and executes it. Possible values for &amp;lt;code&amp;gt;&amp;lt;repository&amp;gt;&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;lt;code&amp;gt;docker&amp;lt;/code&amp;gt; (Docker Hub)&lt;br /&gt;
:: &amp;lt;code&amp;gt;library&amp;lt;/code&amp;gt; (the Sylabs Container Library)&lt;br /&gt;
:: &amp;lt;code&amp;gt;path/to/container&amp;lt;/code&amp;gt; if the container is local, i.e. located in the filesystem of Mufasa&lt;br /&gt;
&lt;br /&gt;
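For example, the following command (the image name is illustrative only) pulls an Ubuntu image from Docker Hub and runs it:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run docker://ubuntu:22.04&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;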
As soon as the container is in execution, the terminal window used so far to interact with Mufasa becomes a shell &amp;#039;&amp;#039;in the container&amp;#039;&amp;#039;. This shell belongs to the software environment of the container, and the user can use it to interact with the container&amp;#039;s own software environment and filesystem. &lt;br /&gt;
&lt;br /&gt;
It is easy to tell whether a shell belongs to Mufasa or to the container, because in a container shell the system prompt becomes &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interaction between container filesystem and local filesystem === &lt;br /&gt;
&lt;br /&gt;
The filesystem inside the container and the local one, i.e. Mufasa&amp;#039;s, can interact. This means that the container can access the local filesystem to read and/or write files. However, the only parts of Mufasa&amp;#039;s filesystem that can be accessed by the container are those that the user running the container has access rights to.&lt;br /&gt;
&lt;br /&gt;
As a default, the user&amp;#039;s &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa is automatically mapped onto &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; in the filesystem of the container. Whatever is done in that container directory is actually applied to the local &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa. &lt;br /&gt;
&lt;br /&gt;
The mapping of the home directory does not need to be explicitly requested. However, if the user needs (in addition to the home directory) other parts of the local filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
&lt;br /&gt;
=== How to know if your shell is a SLURM job ===&lt;br /&gt;
To know if your shell is being run via SLURM, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it provides an output, your shell is a SLURM job and the output is the ID of the job. If it doesn&amp;#039;t provide any output, your shell is not running via SLURM.&lt;br /&gt;
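&lt;br /&gt;
A script can use the same variable to behave differently depending on whether it runs under SLURM. A minimal sketch:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
if [ -n &amp;quot;$SLURM_JOB_ID&amp;quot; ]; then&lt;br /&gt;
    echo &amp;quot;Running as SLURM job $SLURM_JOB_ID&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    echo &amp;quot;Not running via SLURM&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;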
&lt;br /&gt;
== Using SLURM to run a non-interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
When the user job to be executed in a container is non-interactive, the mechanism based on an &amp;#039;&amp;#039;execution script&amp;#039;&amp;#039; already described in [[#Non-interactive jobs|Non-interactive jobs]] is employed. The command to run the script (which in turn will run the container where the user job takes place) is therefore&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The general features of a SLURM execution script and the SBATCH directives used for generic jobs have [[#Non-interactive jobs|already been described]]. Here we focus, therefore, on the SBATCH directives specifically used when SLURM is used to run a non-interactive job within a container.&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
[[#Non-interactive jobs|#SBATCH directives already described above]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;module load amd/singularity&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt; &amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the last line of the script, &amp;lt;code&amp;gt;&amp;lt;command_to_run&amp;gt;&amp;lt;/code&amp;gt; is the command (e.g., the name of an executable script) of the program to be run inside the container, complete with its path within the container&amp;#039;s filesystem. Please refer to the [[Singularity|section about Singularity]] for details about its commands.&lt;br /&gt;
&lt;br /&gt;
The interactions between container filesystem and local filesystem in non-interactive jobs are exactly the same [[#Interaction between container filesystem and local filesystem|already described]] for interactive jobs. In particular, the user&amp;#039;s home directory is mapped by default onto the filesystem of the container.&lt;br /&gt;
&lt;br /&gt;
If, in addition to that, the user needs other parts of the filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command at the end of the script:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
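&lt;br /&gt;
Putting everything together, a complete (hypothetical) execution script for a non-interactive containerized job might look like the following. All names, paths, container images and resource amounts are examples only:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=jobs&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --mem=16G&lt;br /&gt;
#SBATCH --time=0-04:00:00&lt;br /&gt;
#SBATCH --output=./train-%j.out&lt;br /&gt;
&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
singularity run --bind /home/username/data:/data docker://pytorch/pytorch /home/username/train.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;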
&lt;br /&gt;
== Job output ==&lt;br /&gt;
&lt;br /&gt;
The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of Mufasa by the container where the computation takes place. &lt;br /&gt;
&lt;br /&gt;
As [[#Interaction between container filesystem and local filesystem|explained above]], Singularity includes a mechanism to mount a part of Mufasa&amp;#039;s own filesystem onto the container&amp;#039;s filesystem: when the job running within the container writes to this mounted part, it actually writes to Mufasa&amp;#039;s filesystem. This means that when the container ends its execution, its output files persist in Mufasa&amp;#039;s filesystem (usually in a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) and can be retrieved by the user at a later time.&lt;br /&gt;
&lt;br /&gt;
The same mechanism can be used to allow user jobs running into a container to read their input data from Mufasa&amp;#039;s filesystem (usually a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Cancelling completed jobs ==&lt;br /&gt;
&lt;br /&gt;
When a user process run via SLURM has completed its execution and is not needed anymore, it is important to [[User_Jobs#Canceling_a_job_with_scancel|close it with scancel]], especially if much time remains before the end of the execution time requested by the job.&lt;br /&gt;
&lt;br /&gt;
Cancelling a SLURM job makes the resources reserved by SLURM free again for other users, and thus speeds up the execution of the jobs still queued.&lt;br /&gt;
&lt;br /&gt;
Typically, one doesn&amp;#039;t know in advance how long a piece of code will take to complete its work. So please check from time to time whether your job has completed and, if it has while there is still time before the end of your SLURM job&amp;#039;s requested duration, just &amp;#039;&amp;#039;scancel&amp;#039;&amp;#039; the job. Other users will be grateful :-)&lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
= Detaching from a running job with &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an [[#Interactive and non-interactive user jobs|interactive user job]], the shell where the command is running must remain open: if it closes, the job terminates. That shell runs in the terminal of your own PC where the [[System#Accessing Mufasa|SSH connection to Mufasa]] exists.&lt;br /&gt;
&lt;br /&gt;
If you do not plan to keep the SSH connection to Mufasa open (for instance because you have to turn off or suspend your PC), there is a way to keep your interactive job alive. Namely, you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then &amp;#039;&amp;#039;detach&amp;#039;&amp;#039; from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). &lt;br /&gt;
&lt;br /&gt;
Once you have detached from the screen session, you can close the SSH connection to Mufasa without damage. When you need to reach your (still running) job again, you can open a new SSH connection to Mufasa and then &amp;#039;&amp;#039;reattach&amp;#039;&amp;#039; to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A typical use case for screen is to write your program in such a way that it prints progress messages as it goes on with its work. You can then check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.&lt;br /&gt;
&lt;br /&gt;
Basic usage of &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is explained below.&lt;br /&gt;
&lt;br /&gt;
== Creating a screen session, running a job in it, detaching from it ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell, while your process will go on running in the screen&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your running job&lt;br /&gt;
&lt;br /&gt;
== Reattaching to an active screen session ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# In the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
&lt;br /&gt;
== Closing (i.e. destroying) a screen session ==&lt;br /&gt;
&lt;br /&gt;
When you do not need a screen session anymore:&lt;br /&gt;
&lt;br /&gt;
# reattach to the active screen session as explained [[#Reattaching to an active screen session|above]]&lt;br /&gt;
# destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash), then confirming that you really want to proceed&lt;br /&gt;
&lt;br /&gt;
Of course, any program (including SLURM jobs) running within the screen gets terminated when the screen is destroyed.&lt;br /&gt;
&lt;br /&gt;
= Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; to reserve resources =&lt;br /&gt;
&lt;br /&gt;
== What is &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/salloc.html &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;] is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future.&lt;br /&gt;
&lt;br /&gt;
The typical use of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is to &amp;quot;book&amp;quot; an interactive session where the user enjoys &amp;#039;&amp;#039;&amp;#039;complete control of a set of resources&amp;#039;&amp;#039;&amp;#039;. The resources that are part of this set are chosen by the user. Within the &amp;quot;booked&amp;quot; session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM.&lt;br /&gt;
&lt;br /&gt;
More precisely:&lt;br /&gt;
* the user, using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, specifies what resources they need and the time when they will need them;&lt;br /&gt;
* when the delivery time comes, SLURM creates an interactive shell session for the user;&lt;br /&gt;
* within such a session, the user can use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources.&lt;br /&gt;
&lt;br /&gt;
Resource reservation using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is only possible if the request is made in advance with respect to the delivery time. The more in demand the resources that the user wants to reserve are, the earlier the request should be made to ensure that SLURM is able to fulfill it.&lt;br /&gt;
&lt;br /&gt;
When a user makes a request for resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, the request (called an &amp;#039;&amp;#039;&amp;#039;allocation&amp;#039;&amp;#039;&amp;#039;) gets added to SLURM&amp;#039;s job queue for the requested partition as a job in &amp;lt;code&amp;gt;pending&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of SLURM&amp;#039;s process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user.&lt;br /&gt;
&lt;br /&gt;
Until the delivery time specified by the user comes, the allocation remains in state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; state, the stronger this accumulation of priority: so, by requesting resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;&amp;#039;well in advance of the delivery time&amp;#039;&amp;#039;&amp;#039;, users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands use a similar syntax to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands. In particular, &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; lets a user specify what resources they need and -importantly- a &amp;#039;&amp;#039;&amp;#039;delivery time&amp;#039;&amp;#039;&amp;#039; for the requested resources (delivery time can also be specified with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, but in that case it is not very useful). &lt;br /&gt;
&lt;br /&gt;
The typical &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command has this form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc [general_SLURM_options] --begin=&amp;lt;time&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &lt;br /&gt;
&lt;br /&gt;
:; [general_SLURM_options]&lt;br /&gt;
:: represents the options already described in [[#Options of srun and sbatch|Options of srun and sbatch]]&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;--begin=&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the delivery time of the resources reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, according to the syntax described below. The delivery time must be a future time.&lt;br /&gt;
&lt;br /&gt;
=== Syntax of parameter &amp;lt;code&amp;gt;--begin&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
If the allocation is for the current day, you can specify &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; as hours and minutes in the form&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;HH:MM&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you want to specify a time of a different day, the form for &amp;lt;time&amp;gt; is &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM&amp;lt;/code&amp;gt; (optionally with seconds: &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM:SS&amp;lt;/code&amp;gt;), where the uppercase &amp;#039;T&amp;#039; separates date from time. &lt;br /&gt;
&lt;br /&gt;
It is also possible to specify &amp;lt;time&amp;gt; as relative to the current time, in one of the following forms:&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kminutes&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Khours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kdays&amp;lt;/code&amp;gt;&lt;br /&gt;
where K is a (positive) integer.&lt;br /&gt;
&lt;br /&gt;
Examples:&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=16:00&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1hours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1days&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=2030-01-20T12:34:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
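For instance, a (hypothetical) &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; request for one GPU and 32 GB of RAM, to be delivered at 4 PM of the current day, might look like this; all resource values are illustrative only:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc --partition=jobs --gres=gpu:1 --mem=32G --time=0-02:00:00 --begin=16:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;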
Note that Mufasa&amp;#039;s time zone is GMT, so &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; must be expressed in GMT as well. If you want to know Mufasa&amp;#039;s current time, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
date&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Thu Nov 10 16:43:30 UTC 2022&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
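&lt;br /&gt;
If you prefer to obtain a value ready to be used with &amp;lt;code&amp;gt;--begin&amp;lt;/code&amp;gt;, most Linux systems (including your own PC) can print the current GMT time in the required format with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
date -u +%Y-%m-%dT%H:%M&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;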
&lt;br /&gt;
== How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
In the typical scenario, the user of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; will make use of [[User_Jobs#Detaching from a running job with screen|screen]]. Command &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; creates a shell session (called &amp;quot;a screen&amp;quot;) that can be abandoned without closing it ([[#Creating_a_screen_session.2C_running_a_job_in_it.2C_detaching_from_it|detaching from the screen]]). It is then possible to reach the screen again at a later time ([[#Reattaching_to_an_active_screen_session|reattaching to the screen]]). This means that a user can create a screen, run &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; within it to create an allocation for time X, detach from the screen, and reattach to it just before time X to use the reserved resources from the interactive session created by &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
More precisely, the operations needed to do this are the following:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]].&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created run the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], specifying via its options the resources you need and the time at which you want them delivered.&lt;br /&gt;
# SLURM will respond with a message similar to &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Pending job allocation XXXX&amp;lt;/pre&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell.&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your resource allocation request.&lt;br /&gt;
# At the delivery time you specified in the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], connect to the login server with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you used &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;; as soon as SLURM provides you with the resources you reserved, the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; is replaced by the shell prompt.&lt;br /&gt;
# You are now in the interactive shell session you booked with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;. From here, you can run any programs you want, including &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&amp;lt;br&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! If the job reaches the time limit, it gets &amp;#039;&amp;#039;&amp;#039;forcibly terminated&amp;#039;&amp;#039;&amp;#039; by SLURM. Termination depends exclusively on the time limit: it occurs even if the end time of the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.)&lt;br /&gt;
# Once the interactive shell session is not needed anymore, cancel it by exiting from the session with &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;exit&amp;lt;/pre&amp;gt; (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.)&lt;br /&gt;
# You are now back to your screen. Destroy it by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a resource request made with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
To cancel a request for resources made as explained in [[#How to use salloc|How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;]], follow these steps:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen where you used command &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You should see the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; (if the allocation is still pending) or &amp;quot;&amp;#039;&amp;#039;salloc: job XXXX queued and waiting for resources&amp;#039;&amp;#039;&amp;quot; (if the allocation is done and waiting for its start time). Now just press &amp;#039;&amp;#039;&amp;#039;Ctrl + C&amp;#039;&amp;#039;&amp;#039;. This tells SLURM that you want to cancel your request for resources.&lt;br /&gt;
# SLURM will communicate the cancellation with message &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Job allocation XXXX has been revoked.&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;Ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with tools to inspect and manage jobs. While a [[Roles|Job User]] is able to see all users&amp;#039; jobs, they are only allowed to interact with their own.&lt;br /&gt;
&lt;br /&gt;
The main commands used to interact with jobs are &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/squeue.html &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to inspect the scheduling queues and &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/scancel.html &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to terminate queued or running jobs.&lt;br /&gt;
&lt;br /&gt;
== Inspecting jobs with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Running command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output comprises the following information:&lt;br /&gt;
&lt;br /&gt;
:; JOBID&lt;br /&gt;
:: Numerical identifier of the job assigned by SLURM&lt;br /&gt;
:: This identifier is used to intervene on the job, for instance with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: the partition that the job is run on&lt;br /&gt;
&lt;br /&gt;
:; NAME&lt;br /&gt;
:: the name assigned to the job; can be personalised using the &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; option&lt;br /&gt;
&lt;br /&gt;
:; USER&lt;br /&gt;
:: username of the user who launched the job&lt;br /&gt;
	&lt;br /&gt;
:; ST&lt;br /&gt;
:: job state (see [[SLURM#Job state|Job state]] for further information)&lt;br /&gt;
&lt;br /&gt;
:; TIME&lt;br /&gt;
:: time that has passed since the beginning of job execution&lt;br /&gt;
&lt;br /&gt;
:; NODES&lt;br /&gt;
:: number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)&lt;br /&gt;
&lt;br /&gt;
:; NODELIST (REASON)&lt;br /&gt;
:: name of the node(s) where the job is being executed: for Mufasa this is always &amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;, the name of the node corresponding to Mufasa.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To limit the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; to the jobs owned by user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt;, it can be used like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u &amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interpreting Job state as provided by &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; PENDING&lt;br /&gt;
:: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; RUNNING&lt;br /&gt;
:: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; SUSPENDED&lt;br /&gt;
:: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETING&lt;br /&gt;
:: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETED&lt;br /&gt;
:: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them.&lt;br /&gt;
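If you are only interested in jobs in a given state, the state codes above can be passed to the &amp;lt;code&amp;gt;-t&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;--states&amp;lt;/code&amp;gt; filter of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. A minimal sketch (replace &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; with your own username):&lt;br /&gt;

```shell
# Show only the pending (PD) jobs of a given user:
squeue -u <username> -t PD

# Show pending and running jobs together:
squeue -u <username> -t PD,R
```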
&lt;br /&gt;
== Knowing when jobs are expected to end or start ==&lt;br /&gt;
&lt;br /&gt;
If you are interested in understanding when jobs are expected to start or end, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -o &amp;quot;%5i %8u %10P %.2t |%19S |%.11L|&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID USER     PARTITION  ST |START_TIME          |  TIME_LEFT|&lt;br /&gt;
5307  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5308  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5296  cziyang  fat         R |2022-11-08T16:58:03 | 1-00:48:14|&lt;br /&gt;
5306  thuynh   fat         R |2022-11-10T08:13:30 | 2-16:03:41|&lt;br /&gt;
5297  gnannini fat         R |2022-11-08T17:55:54 | 1-01:46:05|&lt;br /&gt;
5336  ssaitta  gpu         R |2022-11-10T08:13:00 |    6:03:11|&lt;br /&gt;
5358  dmilesi  gpulong     R |2022-11-10T15:11:32 | 2-23:01:43|&lt;br /&gt;
5338  cziyang  gpulong     R |2022-11-10T09:45:01 | 1-17:35:12|&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;For running jobs (state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;):&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job started its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much remains of the running time requested by the job&lt;br /&gt;
&lt;br /&gt;
:;For pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;):&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job is expected to start its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much running time has been requested by the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Start and end times are forecasts based on the features of the jobs currently in the queues, and may change if running jobs end prematurely and/or if new jobs with higher priority are added to the queues. These times should therefore never be considered certain.&lt;br /&gt;
&lt;br /&gt;
If you simply want to know when pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) are expected to begin execution, use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists pending jobs in order of increasing START_TIME (the job on top is the one which will be run first). For each pending job the command provides an output similar to the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
 5090       fat training   thuynh PD 2022-10-27T09:28:01      1 (null)               (Resources)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Getting detailed information about a job ==&lt;br /&gt;
&lt;br /&gt;
If needed, complete information about a job (either pending or running) can be obtained using command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show job &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; is the number from the first column of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The output of this command is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JobId=65 JobName=test_script.sh&lt;br /&gt;
   UserId=gfontana(10003) GroupId=gfontana(10004) MCS_label=N/A&lt;br /&gt;
   Priority=14208 Nice=0 Account=admin QOS=nogpu&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:55 TimeLimit=01:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-11-06T10:31:10 EligibleTime=2025-11-06T10:31:10&lt;br /&gt;
   AccrueTime=2025-11-06T10:31:10&lt;br /&gt;
   StartTime=2025-11-06T10:31:10 EndTime=2025-11-06T11:31:10 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-11-06T10:31:10 Scheduler=Main&lt;br /&gt;
   Partition=jobs AllocNode:Sid=mufasa2-login:42020&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=gn01&lt;br /&gt;
   BatchHost=gn01&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)&lt;br /&gt;
   Command=./test_script.sh&lt;br /&gt;
   WorkDir=/home/gfontana&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In particular, the line beginning with &amp;#039;&amp;#039;&amp;quot;StartTime=&amp;quot;&amp;#039;&amp;#039; provides expected times for the start and end of job execution. As explained in [[User_Jobs#Knowing_when_jobs_are_expected_to_end_or_start|Knowing when jobs are expected to end or start]], start time is only a prediction and subject to change.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a job with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
It is possible to cancel a job using command &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;, either while it is waiting for execution or while it is running (in the latter case, you can choose which system signal to send to the process in order to terminate it).&lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The following are some examples of use of &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; adapted from [https://slurm.schedmd.com/scancel.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
removes queued job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; from the execution queue.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=TERM &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGTERM (request to stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=KILL &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGKILL (force stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --state=PENDING --user=&amp;lt;username&amp;gt; --partition=&amp;lt;partition_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
cancels all pending jobs belonging to user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; in partition &amp;lt;code&amp;gt;&amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Knowing what jobs you ran today ==&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct -X&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a list of all jobs run today by your user.&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2372</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=2372"/>
		<updated>2026-05-07T14:45:34Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Using SLURM to run an interactive job on Mufasa */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running jobs with SLURM =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must&amp;#039;&amp;#039;&amp;#039; use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM.&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa. This is a key difference between Mufasa 1.0 and [[System#Mufasa 2.0|Mufasa 2.0]].&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides two commands to run jobs, called [https://slurm.schedmd.com/srun.html srun] and [https://slurm.schedmd.com/sbatch.html sbatch]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch [options] &amp;lt;command_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In both cases, &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can be any Linux program (including shell scripts). By using &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, the  command or script specified by &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; (including any programs launched by it) are added to SLURM&amp;#039;s execution queues.&lt;br /&gt;
&lt;br /&gt;
The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for &amp;#039;&amp;#039;&amp;#039;interactive jobs&amp;#039;&amp;#039;&amp;#039;: i.e., processes that use the console to interact with their user during job execution. &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell: it simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; provides an additional possibility: &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; can in fact be an [[#Using execution scripts to run jobs|&amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039;]], i.e. a special (and SLURM-specific) type of Linux shell script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of the &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command. This is handy because it allows you to record the parameters in an execution script instead of having to type them on the command line when launching a job, which greatly reduces the possibility of mistakes. An execution script is also easy to keep and reuse.&lt;br /&gt;
&lt;br /&gt;
Immediately after a &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command is launched by a user, SLURM outputs a message informing the user that the job has been queued. The output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 queued and waiting for resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The shell is now locked while SLURM prepares the execution of the user program ([[#Detaching from a running job with screen|if you are using &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; you can detach from that shell and come back later]]). &lt;br /&gt;
&lt;br /&gt;
When SLURM is ready to run the program, it prints a message similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun: job 10849 has been allocated resources&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then executes the program.&lt;br /&gt;
&lt;br /&gt;
=== Options of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; commands is used to tell SLURM what resources the job needs in order to be executed and how much time it will need to complete its execution.&lt;br /&gt;
&lt;br /&gt;
As far as resources are concerned, the most important option is &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, which specifies the [[#SLURM Quality of Service (QOS)|SLURM QOS]] the job will use. A job run with a given QOS has access to all and only the resources available to that QOS. As a consequence, the options that define how many resources to assign to the job can only draw on resources available to the chosen QOS. Jobs that request resources not available to the chosen QOS do not get executed.&lt;br /&gt;
&lt;br /&gt;
If the user forgets to use option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt;, the job is run on the &amp;#039;&amp;#039;default qos&amp;#039;&amp;#039; (&amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt;) which has access to &amp;#039;&amp;#039;zero&amp;#039;&amp;#039; resources. Therefore it is always necessary to specify option &amp;lt;code&amp;gt;--qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; when launching a SLURM job on Mufasa.&lt;br /&gt;
&lt;br /&gt;
More generally, the most relevant among the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:;‑‑qos=&amp;lt;qos_name&amp;gt;&lt;br /&gt;
:: specifies the [[SLURM#SLURM Quality of Service (QOS)|SLURM QOS]] that the job will use. It is mandatory to specify one.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The chosen QOS limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is available to the chosen QOS.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;‑‑qos &amp;lt;qos_name&amp;gt;&amp;lt;/code&amp;gt; is used and options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task=&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt;) are omitted, the job is assigned the default amount of the resource (as defined by the chosen QOS). A notable exception concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;, which is always required (see below) if the job uses a QOS with access to GPUs.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; --job-name=&amp;lt;jobname&amp;gt;&lt;br /&gt;
:: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The default job name (i.e., the one assigned to the job when &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; is not used) is the executable program&amp;#039;s name.&lt;br /&gt;
&lt;br /&gt;
:;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
:: specifies what GPUs to assign to the job. &amp;lt;code&amp;gt;gpu_resources&amp;lt;/code&amp;gt; is a comma-delimited list where each element has the form &amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt; is one of the types of GPU available on Mufasa (see [[SLURM#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) and &amp;lt;code&amp;gt;&amp;lt;amount&amp;gt;&amp;lt;/code&amp;gt; is an integer between 1 and the number of GPUs of such type available to the partition. For instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:1,gpu:3g.20gb:1&amp;lt;/code&amp;gt;, corresponding to asking for one &amp;quot;full&amp;quot; GPU and 1 &amp;quot;small&amp;quot; GPU.&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is &amp;#039;&amp;#039;&amp;#039;mandatory&amp;#039;&amp;#039;&amp;#039; if the job is run with a QOS that allows access to the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
:: specifies the amount of RAM to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;‑‑cpus-per-task=&amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
:: specifies how many CPUs to assign to the job; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;duration&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the maximum time the job is allowed to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;duration&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;. When the time expires, the job (if still running) gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
:;‑‑pty&lt;br /&gt;
:: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_be_run_via_SLURM&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[#Interactive jobs|Interactive jobs]])&lt;br /&gt;
&lt;br /&gt;
Note that GPU resources (if needed) must always be requested explicitly. For instance, in order to execute program &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; which needs one GPU of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; with QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; we can use the SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
An &amp;#039;&amp;#039;&amp;#039;interactive job&amp;#039;&amp;#039;&amp;#039; is a process that uses the console to interact with its user during execution. Such a process is run manually by the user from a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e., a terminal session) provided by SLURM.&lt;br /&gt;
&lt;br /&gt;
In order to ask SLURM to schedule the execution of a shell where the user can subsequently run the interactive job, it is necessary to use option &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For instance, to ask SLURM to run a shell with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;, the user should use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By not specifying any other options, the user is telling SLURM that they want the shell spawned by SLURM to be provided with the default amount of resources associated with QOS &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt;. More generally, any combination of the other [[#Options of srun and sbatch|options of srun]] can be used together with &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Like every other job request to SLURM, the request to run a shell must be made from the [[System#Login server|login server]]. As soon as possible (i.e., as soon as the necessary resources are available), SLURM will open a bash shell in the same terminal that the user used to launch the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command, where the user will be able to run their interactive programs.&lt;br /&gt;
&lt;br /&gt;
To the user, this corresponds to the fact that the shell they were using to interact with the login server changes into a shell opened &amp;#039;&amp;#039;directly on Mufasa&amp;#039;&amp;#039;. This corresponds to the command prompt changing from&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2-login:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@mufasa2:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, the shell is the “base” one. If a number is printed, it is the SLURM job ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process.&lt;br /&gt;
&lt;br /&gt;
When the user does not need the SLURM-spawned shell anymore, they should close it with the same command used for any other Linux shell:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This frees the resources that were reserved for the interactive shell.&lt;br /&gt;
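Option &amp;lt;code&amp;gt;--pty&amp;lt;/code&amp;gt; combines with the resource options described in [[#Options of srun and sbatch|Options of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;]]. As an illustrative sketch (the QOS name and resource amounts here are examples: use values allowed by a QOS you actually have access to), an interactive shell with one GPU could be requested with&lt;br /&gt;

```shell
# Interactive shell with one 3g.20gb GPU, 4 CPUs, 32 GB of RAM and an
# 8-hour time limit. QOS name and resource amounts are illustrative.
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=4 --mem=32G --time=8:00:00 --pty /bin/bash
```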
&lt;br /&gt;
== Non-interactive jobs ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands are very complex, and it&amp;#039;s easy to forget an option or make mistakes while using them. For non-interactive jobs, there is a solution to this problem.&lt;br /&gt;
&lt;br /&gt;
When the user job is non-interactive, in fact, the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command can be substituted with a much simpler &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; command&amp;#039;&amp;#039;&amp;#039;. As [[#Running jobs with SLURM|already explained]], &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can make use of an &amp;#039;&amp;#039;&amp;#039;execution script&amp;#039;&amp;#039;&amp;#039; to specify all the parts of the command to be run via SLURM. So the command becomes&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux script that includes &amp;#039;&amp;#039;&amp;#039;SBATCH directives&amp;#039;&amp;#039;&amp;#039;. SBATCH directives are used to specify the values of the parameters that are otherwise set in the &amp;lt;code&amp;gt;[options]&amp;lt;/code&amp;gt; part of an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;Note on Linux shell scripts&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;#039;&amp;#039;A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;have the “executable” flag set&amp;#039;&amp;#039; (see [[System#Changing file/directory ownership and permissions|here]] for details)&lt;br /&gt;
* &amp;#039;&amp;#039;have&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;as its very first line&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Usually, a Linux shell script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Within any shell script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial&amp;#039;&amp;#039; &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;line)&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Use of blank lines as spacers is allowed.&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, composed of directives with which the user specifies the values to be given to parameters, each introduced by the keyword &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# [optionally] one or more &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
The template includes all the options [[#Using SLURM to run a container|already described above]], plus a few additional useful ones (for instance, those that make SLURM send email messages to the user when events occur in the lifecycle of their job). Information about all the available options can be found in [https://slurm.schedmd.com/sbatch.html SLURM&amp;#039;s own documentation].&lt;br /&gt;
&lt;br /&gt;
In the template below, &amp;#039;&amp;#039;&amp;#039;#SBATCH directives&amp;#039;&amp;#039;&amp;#039; are requests made to SLURM. Notice that, although #SBATCH directives have a leading &amp;quot;#&amp;quot;, they are &amp;#039;&amp;#039;not&amp;#039;&amp;#039; comments: just like the &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line at the beginning of a shell script, they start with &amp;quot;#&amp;quot; but are not comments.&lt;br /&gt;
&lt;br /&gt;
Other lines in the script that begin with &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; not followed by SBATCH are comments.&lt;br /&gt;
&lt;br /&gt;
As for directives that ask for a given amount of a resource (including time): if one of them is missing from the execution script (or commented out), the job is assigned the default amount of that resource.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --nodes=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --ntasks=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --partition=jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --qos=&amp;lt;qos_name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --gres=&amp;lt;gpu_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --mem=&amp;lt;mem_resources&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --cpus-per-task=&amp;lt;cpu_amount&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --output=./&amp;lt;filename&amp;gt;-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where the output of the job gets written (i.e., standard output gets redirected onto the file). &amp;quot;%j&amp;quot; is replaced by SLURM with the job ID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --error=./&amp;lt;filename&amp;gt;-error-%j.out&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the text file where any error messages generated by the job get written (i.e., standard error gets redirected onto the file). &amp;quot;%j&amp;quot; is replaced by SLURM with the job ID.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;SBATCH --job-name=&amp;lt;name&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
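As a concrete illustration, the template above might be filled in as follows. This is only a sketch: the QOS name, resource amounts, file names and the final command are all hypothetical values that must be adapted to your own job.

```shell
#!/bin/bash
# Illustrative execution script based on the template above.
# All values (QOS name, resource amounts, file names, command)
# are examples only and must be adapted to your own job.

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=jobs
#SBATCH --qos=normal
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4
#SBATCH --time=0-02:00:00
#SBATCH --output=./myjob-%j.out
#SBATCH --error=./myjob-error-%j.out
#SBATCH --job-name=myjob

# Command(s) to run with the resources requested above;
# "%j" in the file names is replaced by SLURM with the job ID.
srun ./my_program
```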
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
== Key concept ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;The key concept about executing jobs on Mufasa is that [[System#Containers|all computation on Mufasa must occur within containers]]&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if the user has writing permission on them: e.g., the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
The system used by Mufasa to create and execute containers is &amp;#039;&amp;#039;&amp;#039;[[System#Singularity|Singularity]]&amp;#039;&amp;#039;&amp;#039;. This wiki includes [[Singularity|directions]] on preparing containers with Singularity.&lt;br /&gt;
&lt;br /&gt;
The container where a user job runs must contain all the libraries needed by the job. In fact (for maintainability and safety reasons) &amp;#039;&amp;#039;&amp;#039;no software and no libraries are installed on Mufasa 2.0&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== Interactive and non-interactive user jobs ==&lt;br /&gt;
&lt;br /&gt;
This section explains how to execute a user job contained in a container. It considers two types of user jobs, i.e.:&lt;br /&gt;
;: Interactive user jobs&lt;br /&gt;
::: as [[#Interactive jobs|already explained]], these are jobs that require interaction with the user while they are running, via a bash shell running within the container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the container is in execution.&lt;br /&gt;
&lt;br /&gt;
;: Non-interactive user jobs&lt;br /&gt;
::: are the most common variety. The user prepares the container in such a way that, when in execution, the container autonomously puts the user&amp;#039;s jobs into execution. The user does not have any communication with the container while it is in execution. Executing the container and running the required programs within the container&amp;#039;s environment is done via [[#Interactive jobs|execution scripts]].&lt;br /&gt;
&lt;br /&gt;
== Using SLURM to run an interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The first step to run an interactive user job on Mufasa is to run the [[System#Containers|container]] where the job will take place. Each user is in charge of preparing the container(s) where the user&amp;#039;s jobs will be executed.&lt;br /&gt;
&lt;br /&gt;
In order to run a container via SLURM by hand, i.e. via an interactive shell, a user must first open the shell with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun [general_SLURM_options] --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where [general_SLURM_options] are those [[#Options of srun and sbatch|already described above]].&lt;br /&gt;
&lt;br /&gt;
Then the user must run the container: this is done as follows.&lt;br /&gt;
&lt;br /&gt;
First, it is necessary to load the Singularity software module with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
module load amd/singularity&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(if needed, the list of software modules available in the system can be obtained with command &amp;lt;code&amp;gt;module av&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Then, the user must use Singularity to run the container with command (see the [[Singularity|section about Singularity]] for further details)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which pulls the container from the specified repository and executes it. Possible values for &amp;lt;code&amp;gt;&amp;lt;repository&amp;gt;&amp;lt;/code&amp;gt; are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;lt;code&amp;gt;docker&amp;lt;/code&amp;gt; (Docker Hub)&lt;br /&gt;
:: &amp;lt;code&amp;gt;library&amp;lt;/code&amp;gt; (the Sylabs container library)&lt;br /&gt;
:: &amp;lt;code&amp;gt;path/to/container&amp;lt;/code&amp;gt; if the container is local, i.e. located in the filesystem of Mufasa&lt;br /&gt;
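For instance, pulling and running an image might look like the following sketch (the image names are arbitrary examples, not containers that actually exist on Mufasa):

```shell
# Illustrative only: run a container image pulled from Docker Hub.
# "ubuntu" is an arbitrary example image name.
singularity run docker://ubuntu

# Running a local container image file instead:
singularity run ./my_container.sif
```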
&lt;br /&gt;
As soon as the container is in execution, the terminal window used, so far, to interact with Mufasa becomes a shell &amp;#039;&amp;#039;in the container&amp;#039;&amp;#039;. This shell belongs to the software environment of the container, and the user can use it to interact with the container&amp;#039;s own software environment and filesystem. &lt;br /&gt;
&lt;br /&gt;
It is easy to tell whether a shell belongs to Mufasa or to the container, because in a container shell the system prompt becomes &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interaction between container filesystem and local filesystem === &lt;br /&gt;
&lt;br /&gt;
The filesystem inside the container and the local one, i.e. Mufasa&amp;#039;s, can interact. This means that the container can access the local filesystem to read and/or write files. However, the only parts of Mufasa&amp;#039;s filesystem that can be accessed by the container are those that the user running the container has access rights to.&lt;br /&gt;
&lt;br /&gt;
By default, the user&amp;#039;s &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa is automatically mapped onto &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; in the filesystem of the container. Whatever is done to that container directory is actually applied to the local &amp;lt;code&amp;gt;/home/username&amp;lt;/code&amp;gt; directory on Mufasa. &lt;br /&gt;
&lt;br /&gt;
The mapping of the home directory does not need to be explicitly requested. However, if the user needs (in addition to the home directory) other parts of the local filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
&lt;br /&gt;
=== How to know if your shell is running within SLURM ===&lt;br /&gt;
To know if your shell is a SLURM job and (if it is) the ID of the job, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it provides an output, your shell is a SLURM job and the output is the ID of the job. If it doesn&amp;#039;t provide any output, your shell is not running via SLURM.&lt;br /&gt;
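The same check can be scripted. This sketch prints an explicit message depending on whether the &lt;code&gt;SLURM_JOB_ID&lt;/code&gt; variable is set:

```shell
# Print whether the current shell is part of a SLURM job by
# checking the SLURM_JOB_ID environment variable.
if [ -n "$SLURM_JOB_ID" ]; then
    echo "This shell is SLURM job $SLURM_JOB_ID"
else
    echo "This shell is not running inside a SLURM job"
fi
```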
&lt;br /&gt;
== Using SLURM to run a non-interactive job on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
When the user job to be executed in a container is non-interactive, the mechanism based on an &amp;#039;&amp;#039;execution script&amp;#039;&amp;#039; already described in [[#Non-interactive jobs|Non-interactive jobs]] is employed. The command to run the script (which in turn runs the container where the user job takes place) is therefore&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;execution_script&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The general features of a SLURM execution script and the SBATCH directives used for generic jobs have [[#Non-interactive jobs|already been described]]. Here we focus, therefore, on the SBATCH directives specifically used when SLURM is used to run a non-interactive job within a container.&lt;br /&gt;
&lt;br /&gt;
Below is an &amp;#039;&amp;#039;&amp;#039;execution script template&amp;#039;&amp;#039;&amp;#039; to be copied and pasted into your own execution script text file. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
[[#Non-interactive jobs|#SBATCH directives already described above]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;module load amd/singularity&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;singularity run &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt; &amp;lt;command_to_run&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the last line of the script, &amp;lt;code&amp;gt;&amp;lt;command_to_run&amp;gt;&amp;lt;/code&amp;gt; is the command to be run inside the container (e.g., the name of an executable script), complete with its path within the container&amp;#039;s filesystem. Please refer to the [[Singularity|section about Singularity]] for details about its commands.&lt;br /&gt;
&lt;br /&gt;
The interactions between container filesystem and local filesystem in non-interactive jobs are exactly the same [[#Interaction between container filesystem and local filesystem|already described]] for interactive jobs. In particular, the user&amp;#039;s home directory is mapped by default onto the filesystem of the container.&lt;br /&gt;
&lt;br /&gt;
If, in addition to that, the user needs other parts of the filesystem of Mufasa to be mapped onto the container&amp;#039;s filesystem, this is possible by using this modified version of the &amp;lt;code&amp;gt;singularity run&amp;lt;/code&amp;gt; command at the end of the script:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;singularity run --bind &amp;lt;/path/to/local/directory&amp;gt;:&amp;lt;path/to/container/directory&amp;gt; &amp;lt;repository&amp;gt;://&amp;lt;name_of_container&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;&amp;lt;path/to/container/directory&amp;gt;&amp;lt;/code&amp;gt; does not exist in the container&amp;#039;s filesystem, it gets created by Singularity.&lt;br /&gt;
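Putting the pieces together, a complete non-interactive execution script might look like the following sketch. The partition, time, paths, image name and command are all hypothetical and must be replaced with your own values.

```shell
#!/bin/bash
# Illustrative non-interactive execution script: runs a user job
# inside a Singularity container. All values are examples only.

#SBATCH --partition=jobs
#SBATCH --time=0-01:00:00
#SBATCH --output=./myjob-%j.out
#SBATCH --error=./myjob-error-%j.out
#SBATCH --job-name=myjob

# Make the singularity command available
module load amd/singularity

# Run the container, mapping a local data directory into it,
# and execute the user program inside the container's filesystem.
singularity run --bind /home/username/data:/data docker://my_image /opt/run_job.sh
```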
&lt;br /&gt;
== Job output ==&lt;br /&gt;
&lt;br /&gt;
The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of Mufasa by the container where the computation takes place. &lt;br /&gt;
&lt;br /&gt;
As [[#Interaction between container filesystem and local filesystem|explained above]], Singularity includes a mechanism to map a part of Mufasa&amp;#039;s own filesystem onto the container&amp;#039;s filesystem: when the job running within the container writes to this mapped part, it actually writes to Mufasa&amp;#039;s filesystem. This means that when the container ends its execution, its output files persist in Mufasa&amp;#039;s filesystem (usually in a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) and can be retrieved by the user at a later time.&lt;br /&gt;
&lt;br /&gt;
The same mechanism can be used to allow user jobs running into a container to read their input data from Mufasa&amp;#039;s filesystem (usually a subdirectory of the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Cancelling completed jobs ==&lt;br /&gt;
&lt;br /&gt;
When a user process run via SLURM has completed its execution and is not needed anymore, it is important to [[User_Jobs#Canceling_a_job_with_scancel|close it with scancel]], especially if much time remains before the end of the execution time requested by the job.&lt;br /&gt;
&lt;br /&gt;
Cancelling a SLURM job makes the resources reserved by SLURM free again for other users, and thus speeds up the execution of the jobs still queued.&lt;br /&gt;
&lt;br /&gt;
Typically, one doesn&amp;#039;t know in advance how long a piece of code will take to complete its work. So please check from time to time whether your job has finished and, if there is still time left before your SLURM job&amp;#039;s requested duration ends, just &amp;#039;&amp;#039;scancel&amp;#039;&amp;#039; the job. Other users will be grateful :-)&lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
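In practice, the check-and-cancel routine amounts to two commands (the username and job ID below are illustrative placeholders):

```shell
# List your own jobs to find the ID of the one that has finished
# its useful work ("my_username" is an example placeholder).
squeue -u my_username

# Cancel the job, freeing its reserved resources ("1234" is an
# example job ID taken from the squeue output).
scancel 1234
```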
&lt;br /&gt;
= Detaching from a running job with &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an [[#Interactive and non-interactive user jobs|interactive user job]], the shell where the command is running must remain open: if it closes, the job terminates. That shell runs in the terminal of your own PC where the [[System#Accessing Mufasa|SSH connection to Mufasa]] exists.&lt;br /&gt;
&lt;br /&gt;
If you do not plan to keep the SSH connection to Mufasa open (for instance because you have to turn off or suspend your PC), there is a way to keep your interactive job alive. Namely, you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then &amp;#039;&amp;#039;detach&amp;#039;&amp;#039; from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). &lt;br /&gt;
&lt;br /&gt;
Once you have detached from the screen session, you can close the SSH connection to Mufasa without damage. When you need to reach your (still running) job again, you can open a new SSH connection to Mufasa and then &amp;#039;&amp;#039;reattach&amp;#039;&amp;#039; to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
A typical use case for &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is to write your program so that it prints progress messages as it goes on with its work. You can then check its advancement by periodically reattaching to the screen where the program is running and reading the messages it printed.&lt;br /&gt;
&lt;br /&gt;
Basic usage of &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is explained below.&lt;br /&gt;
&lt;br /&gt;
== Creating a screen session, running a job in it, detaching from it ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell, while your process will go on running in the screen&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your running job&lt;br /&gt;
&lt;br /&gt;
== Reattaching to an active screen session ==&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH&lt;br /&gt;
# In the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
&lt;br /&gt;
== Closing (i.e. destroying) a screen session ==&lt;br /&gt;
&lt;br /&gt;
When you do not need a screen session anymore:&lt;br /&gt;
&lt;br /&gt;
# reattach to the active screen session as explained [[#Reattaching to an active screen session|above]]&lt;br /&gt;
# destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash), then confirming that you really want to proceed&lt;br /&gt;
&lt;br /&gt;
Of course, any program (including SLURM jobs) running within the screen gets terminated when the screen is destroyed.&lt;br /&gt;
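The screen workflow described above can be summarised in a few commands. The session name used here is an optional illustrative addition (a bare &lt;code&gt;screen&lt;/code&gt; works exactly as in the steps above):

```shell
screen -S myjob      # create a screen session named "myjob" (name optional)
# ... launch your job with srun inside the session ...
# press ctrl+A then D to detach from the session

screen -ls           # list existing screen sessions
screen -r            # reattach ("screen -r myjob" if you have several)
# press ctrl+A then \ to destroy the session when done
```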
&lt;br /&gt;
= Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; to reserve resources =&lt;br /&gt;
&lt;br /&gt;
== What is &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/salloc.html &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;] is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future.&lt;br /&gt;
&lt;br /&gt;
The typical use of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is to &amp;quot;book&amp;quot; an interactive session where the user enjoys &amp;#039;&amp;#039;&amp;#039;complete control of a set of resources&amp;#039;&amp;#039;&amp;#039;. The resources that are part of this set are chosen by the user. Within the &amp;quot;booked&amp;quot; session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM.&lt;br /&gt;
&lt;br /&gt;
More precisely:&lt;br /&gt;
* the user, using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, specifies what resources they need and the time when they will need them;&lt;br /&gt;
* when the delivery time comes, SLURM creates an interactive shell session for the user;&lt;br /&gt;
* within that session, the user can use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources.&lt;br /&gt;
&lt;br /&gt;
Resource reservation using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; is only possible if the request is made in advance with respect to the delivery time. The more the resources that the user wants to reserve are in high demand, the earlier the request should be made to ensure that SLURM is able to fulfill it.&lt;br /&gt;
&lt;br /&gt;
When a user makes a request for resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, the request (called an &amp;#039;&amp;#039;&amp;#039;allocation&amp;#039;&amp;#039;&amp;#039;) gets added to the SLURM job queue of the relevant partition as a job in &amp;lt;code&amp;gt;pending&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of SLURM&amp;#039;s process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; actually corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user.&lt;br /&gt;
&lt;br /&gt;
Until the delivery time specified by the user comes, the allocation remains in state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; state, the stronger this accumulation of priority: so, by requesting resources with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;&amp;#039;well in advance of the delivery time&amp;#039;&amp;#039;&amp;#039;, users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; commands use a similar syntax to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands. In particular, &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; lets a user specify what resources they need and -importantly- a &amp;#039;&amp;#039;&amp;#039;delivery time&amp;#039;&amp;#039;&amp;#039; for the requested resources (delivery time can also be specified with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, but in that case it is not very useful). &lt;br /&gt;
&lt;br /&gt;
The typical &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command has this form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
salloc [general_SLURM_options] --begin=&amp;lt;time&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &lt;br /&gt;
&lt;br /&gt;
:; [general_SLURM_options]&lt;br /&gt;
:: represents the options already described in [[#Options of srun and sbatch|Options of srun and sbatch]]&lt;br /&gt;
&lt;br /&gt;
:;&amp;lt;nowiki&amp;gt;--begin=&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
:: specifies the delivery time of the resources reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;, according to the syntax described below. The delivery time must be a future time.&lt;br /&gt;
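For example, a request for resources to be delivered in 24 hours might look like this sketch (the resource values and the partition are illustrative only):

```shell
# Illustrative salloc request: reserve one GPU, 32 GB of RAM and
# 4 CPUs for delivery in 24 hours. All values are examples only.
salloc --partition=jobs --gres=gpu:1 --mem=32G --cpus-per-task=4 \
       --time=0-04:00:00 --begin=now+1days
```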
&lt;br /&gt;
=== Syntax of parameter &amp;lt;code&amp;gt;--begin&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
If the allocation is for the current day, you can specify &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; as hours and minutes in the form&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;HH:MM&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you want to specify a time of a different day, the form for &amp;lt;time&amp;gt; is &amp;lt;code&amp;gt;YYYY-MM-DDTHH:MM&amp;lt;/code&amp;gt;, where the uppercase &amp;#039;T&amp;#039; separates the date from the time (an optional seconds field can be appended, as in the last example below). &lt;br /&gt;
&lt;br /&gt;
It is also possible to specify &amp;lt;time&amp;gt; as relative to the current time, in one of the following forms:&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kminutes&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Khours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;now+Kdays&amp;lt;/code&amp;gt;&lt;br /&gt;
where K is a (positive) integer.&lt;br /&gt;
&lt;br /&gt;
Examples:&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=16:00&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1hours&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=now+1days&amp;lt;/code&amp;gt;&lt;br /&gt;
: &amp;lt;code&amp;gt;--begin=2030-01-20T12:34:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that Mufasa&amp;#039;s time zone is GMT, so &amp;lt;nowiki&amp;gt;&amp;lt;time&amp;gt;&amp;lt;/nowiki&amp;gt; must be expressed in GMT as well. If you want to know Mufasa&amp;#039;s current time, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
date&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Thu Nov 10 16:43:30 UTC 2022&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
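If you need Mufasa's current GMT time already formatted in the form accepted by &lt;code&gt;--begin&lt;/code&gt;, the &lt;code&gt;date&lt;/code&gt; command can produce it directly:

```shell
# Print the current UTC/GMT time in the YYYY-MM-DDTHH:MM form
# accepted by salloc's --begin option.
date -u +%Y-%m-%dT%H:%M
```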
&lt;br /&gt;
== How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
In the typical scenario, the user of &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; will make use of [[User_Jobs#Detaching from a running job with screen|screen]]. Command &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; creates a shell session (called &amp;quot;a screen&amp;quot;) that can be abandoned without closing it ([[#Creating_a_screen_session.2C_running_a_job_in_it.2C_detaching_from_it|detaching from the screen]]) and reached again at a later time ([[#Reattaching_to_an_active_screen_session|reattaching to the screen]]). This means that a user can create a screen, run &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; within it to create an allocation for time X, detach from the screen, and reattach to it just before time X to use the reserved resources from the interactive session created by &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
More precisely, the operations needed to do this are the following:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]].&lt;br /&gt;
# From the login server shell, run &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen&amp;lt;/pre&amp;gt;&lt;br /&gt;
# In the &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (&amp;quot;screen&amp;quot;) thus created run the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], specifying via its options the resources you need and the time at which you want them delivered.&lt;br /&gt;
# SLURM will respond with a message similar to &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Pending job allocation XXXX&amp;lt;/pre&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;: you will come back to the original login server shell.&lt;br /&gt;
# You can now close the SSH connection to the login server without damaging your resource allocation request.&lt;br /&gt;
# At the delivery time you specified in the [[#salloc commands|&amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; command]], connect to the login server with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You are now back in the screen where you used &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;; as soon as SLURM provides you with the resources you reserved, message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; changes to the shell prompt.&lt;br /&gt;
# You are now in the interactive shell session you booked with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;. From here, you can run any programs you want, including &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;.&amp;lt;br&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! Therefore, if the job reaches the time limit, it gets &amp;#039;&amp;#039;&amp;#039;forcibly terminated&amp;#039;&amp;#039;&amp;#039; by SLURM. Termination depends exclusively on the time limit: it occurs even if the end time for the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.)&lt;br /&gt;
# Once the interactive shell session is not needed anymore, cancel it by exiting from the session with &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;exit&amp;lt;/pre&amp;gt; (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.)&lt;br /&gt;
# You are now back to your screen. Destroy it by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
== Cancelling a resource request made with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
To cancel a request for resources made as explained in [[#How to use salloc|How to use &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt;]], follow these steps:&lt;br /&gt;
&lt;br /&gt;
# Connect to the [[System#Login server|login server]] with SSH.&lt;br /&gt;
# Once you are in the login server shell, reattach to the screen where you used command &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; with command &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;screen -r&amp;lt;/pre&amp;gt;&lt;br /&gt;
# You should see the message &amp;quot;&amp;#039;&amp;#039;salloc: Pending job allocation XXXX&amp;#039;&amp;#039;&amp;quot; (if the allocation is still pending) or &amp;quot;&amp;#039;&amp;#039;salloc: job XXXX queued and waiting for resources&amp;#039;&amp;#039;&amp;quot; (if the allocation is done and waiting for its start time). Now just press &amp;#039;&amp;#039;&amp;#039;Ctrl + C&amp;#039;&amp;#039;&amp;#039;. This tells SLURM that you want to cancel your request for resources.&lt;br /&gt;
# SLURM will communicate the cancellation with message &amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;salloc: Job allocation XXXX has been revoked.&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Destroy the screen by pressing &amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039; (i.e., backslash) to get back to the login server shell.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with tools to inspect and manage jobs. While a [[Roles|Job User]] is able to see all users&amp;#039; jobs, they are only allowed to interact with their own.&lt;br /&gt;
&lt;br /&gt;
The main commands used to interact with jobs are &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/squeue.html &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to inspect the scheduling queues and &amp;#039;&amp;#039;&amp;#039;[https://slurm.schedmd.com/scancel.html &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;]&amp;#039;&amp;#039;&amp;#039; to terminate queued or running jobs.&lt;br /&gt;
&lt;br /&gt;
== Inspecting jobs with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Running command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output comprises the following information:&lt;br /&gt;
&lt;br /&gt;
:; JOBID&lt;br /&gt;
:: Numerical identifier of the job assigned by SLURM&lt;br /&gt;
:: This identifier is used to intervene on the job, for instance with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: the partition that the job is run on&lt;br /&gt;
&lt;br /&gt;
:; NAME&lt;br /&gt;
:: the name assigned to the job; can be personalised using the &amp;lt;code&amp;gt;--job-name&amp;lt;/code&amp;gt; option&lt;br /&gt;
&lt;br /&gt;
:; USER&lt;br /&gt;
:: username of the user who launched the job&lt;br /&gt;
	&lt;br /&gt;
:; ST&lt;br /&gt;
:: job state (see [[SLURM#Job state|Job state]] for further information)&lt;br /&gt;
&lt;br /&gt;
:; TIME&lt;br /&gt;
:: time that has passed since the beginning of job execution&lt;br /&gt;
&lt;br /&gt;
:; NODES&lt;br /&gt;
:: number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)&lt;br /&gt;
&lt;br /&gt;
:; NODELIST (REASON)&lt;br /&gt;
:: name of the nodes where the job is being executed: for Mufasa it is always &amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;, which is the name of the node corresponding to Mufasa.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To limit the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; to the jobs owned by user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt;, it can be used like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u &amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
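&lt;br /&gt;
The ST column of this output can also be tallied with standard text tools. The sketch below counts jobs per state; it uses the sample output shown above in place of a live &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; pipe:&lt;br /&gt;
&lt;br /&gt;
```shell
# Count jobs per state (column 5, ST) in squeue-style output.
# The sample lines are taken from this page; on the server you would
# pipe the real command instead, e.g.: squeue -u $USER | awk ...
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
520 fat bash acasella R 2-04:10:25 1 gn01
523 fat bash amarzull R 1:30:35 1 gn01
522 gpu bash clena R 20:51:16 1 gn01'
state_counts=$(printf '%s\n' "$sample" | awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }')
echo "$state_counts"   # R 3
```
On Mufasa, the sample variable would simply be replaced by piping the real command into &amp;lt;code&amp;gt;awk&amp;lt;/code&amp;gt;.&lt;br /&gt;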
&lt;br /&gt;
=== Interpreting Job state as provided by &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; ===&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; PENDING&lt;br /&gt;
:: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; RUNNING&lt;br /&gt;
:: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; SUSPENDED&lt;br /&gt;
:: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETING&lt;br /&gt;
:: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;CD&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; COMPLETED&lt;br /&gt;
:: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them.&lt;br /&gt;
&lt;br /&gt;
== Knowing when jobs are expected to end or start ==&lt;br /&gt;
&lt;br /&gt;
If you are interested in understanding when jobs are expected to start or end, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -o &amp;quot;%5i %8u %10P %.2t |%19S |%.11L|&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID USER     PARTITION  ST |START_TIME          |  TIME_LEFT|&lt;br /&gt;
5307  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5308  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|&lt;br /&gt;
5296  cziyang  fat         R |2022-11-08T16:58:03 | 1-00:48:14|&lt;br /&gt;
5306  thuynh   fat         R |2022-11-10T08:13:30 | 2-16:03:41|&lt;br /&gt;
5297  gnannini fat         R |2022-11-08T17:55:54 | 1-01:46:05|&lt;br /&gt;
5336  ssaitta  gpu         R |2022-11-10T08:13:00 |    6:03:11|&lt;br /&gt;
5358  dmilesi  gpulong     R |2022-11-10T15:11:32 | 2-23:01:43|&lt;br /&gt;
5338  cziyang  gpulong     R |2022-11-10T09:45:01 | 1-17:35:12|&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:; For running jobs (state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt;)&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job started its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much remains of the running time requested by the job&lt;br /&gt;
&lt;br /&gt;
:; For pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;)&lt;br /&gt;
::column &amp;quot;START_TIME&amp;quot; tells you when the job is expected to start its execution&lt;br /&gt;
::column &amp;quot;TIME_LEFT&amp;quot; tells you how much running time has been requested by the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; Start and end times are forecasts based on the features of the jobs currently in the queues, and may change if running jobs end prematurely and/or if new jobs with higher priority are added to the queues. These times should therefore never be considered certain.&lt;br /&gt;
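&lt;br /&gt;
SLURM prints the &amp;quot;TIME&amp;quot; and &amp;quot;TIME_LEFT&amp;quot; columns in &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; notation. The helper below (a sketch that assumes all three &amp;#039;&amp;#039;h:m:s&amp;#039;&amp;#039; fields are present) converts such a value to plain seconds, e.g. to sort jobs by remaining time:&lt;br /&gt;
&lt;br /&gt;
```shell
# Convert a SLURM duration ([D-]HH:MM:SS, as in the TIME_LEFT column)
# to a number of seconds. Assumes the H, M and S fields are all present.
to_seconds() {
  local t=$1 d=0 h m s
  case $t in *-*) d=${t%%-*}; t=${t#*-};; esac   # split off the days part
  h=${t%%:*}                                     # hours: up to first ':'
  s=${t##*:}                                     # seconds: after last ':'
  m=${t#*:}; m=${m%:*}                           # minutes: the middle field
  echo $(( d*86400 + 10#$h*3600 + 10#$m*60 + 10#$s ))
}
to_seconds "3-00:00:00"   # 259200
to_seconds "6:03:11"      # 21791
```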
&lt;br /&gt;
If you simply want to know when pending jobs (state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;) are expected to begin execution, use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists pending jobs in order of increasing START_TIME (the job on top is the one which will be run first). For each pending job the command provides an output similar to the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
 5090       fat training   thuynh PD 2022-10-27T09:28:01      1 (null)               (Resources)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Getting detailed information about a job ==&lt;br /&gt;
&lt;br /&gt;
If needed, complete information about a job (either pending or running) can be obtained using command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show job &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; is the number from the first column of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;. The output of this command is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JobId=65 JobName=test_script.sh&lt;br /&gt;
   UserId=gfontana(10003) GroupId=gfontana(10004) MCS_label=N/A&lt;br /&gt;
   Priority=14208 Nice=0 Account=admin QOS=nogpu&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:55 TimeLimit=01:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2025-11-06T10:31:10 EligibleTime=2025-11-06T10:31:10&lt;br /&gt;
   AccrueTime=2025-11-06T10:31:10&lt;br /&gt;
   StartTime=2025-11-06T10:31:10 EndTime=2025-11-06T11:31:10 Deadline=N/A&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-11-06T10:31:10 Scheduler=Main&lt;br /&gt;
   Partition=jobs AllocNode:Sid=mufasa2-login:42020&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=gn01&lt;br /&gt;
   BatchHost=gn01&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   ReqTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   AllocTRES=cpu=1,mem=4G,node=1,billing=1&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)&lt;br /&gt;
   Command=./test_script.sh&lt;br /&gt;
   WorkDir=/home/gfontana&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In particular, the line beginning with &amp;#039;&amp;#039;&amp;quot;StartTime=&amp;quot;&amp;#039;&amp;#039; provides expected times for the start and end of job execution. As explained in [[User_Jobs#Knowing_when_jobs_are_expected_to_end_or_start|Knowing when jobs are expected to end or start]], start time is only a prediction and subject to change.&lt;br /&gt;
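&lt;br /&gt;
For scripting, the &amp;#039;&amp;#039;StartTime&amp;#039;&amp;#039;/&amp;#039;&amp;#039;EndTime&amp;#039;&amp;#039; pair can be extracted with standard text tools. The sketch below works on a sample line copied from the output above; on the server you would pipe &amp;lt;code&amp;gt;scontrol show job &amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; instead:&lt;br /&gt;
&lt;br /&gt;
```shell
# Extract the predicted start/end times from scontrol-style output.
# The sample line is copied from this page; in practice, pipe the real
# command output through the same grep/cut pipeline.
line='StartTime=2025-11-06T10:31:10 EndTime=2025-11-06T11:31:10 Deadline=N/A'
start=$(printf '%s' "$line" | grep -o 'StartTime=[^ ]*' | cut -d= -f2)
end=$(printf '%s' "$line" | grep -o 'EndTime=[^ ]*' | cut -d= -f2)
echo "start: $start"   # start: 2025-11-06T10:31:10
echo "end: $end"       # end: 2025-11-06T11:31:10
```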
&lt;br /&gt;
== Cancelling a job with &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
It is possible to cancel a job using command &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;, either while it is waiting for execution or while it is running (in the latter case, you can choose which system signal to send to the process in order to terminate it). &lt;br /&gt;
&lt;br /&gt;
Please note that [[System#Job priority|job priority]] for your user depends (also) on the overall duration of the jobs that you ran on Mufasa. Therefore, &amp;#039;&amp;#039;&amp;#039;cancelling jobs that are not needed anymore improves your future jobs&amp;#039; priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The following are some examples of use of &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; adapted from [https://slurm.schedmd.com/scancel.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
removes queued job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; from the execution queue.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=TERM &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGTERM (request to stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --signal=KILL &amp;lt;JOBID&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
terminates execution of job &amp;lt;code&amp;gt;&amp;lt;JOBID&amp;gt;&amp;lt;/code&amp;gt; with signal SIGKILL (force stop).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel --state=PENDING --user=&amp;lt;username&amp;gt; --partition=&amp;lt;partition_name&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
cancels all pending jobs belonging to user &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; in partition &amp;lt;code&amp;gt;&amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Knowing what jobs you ran today ==&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct -X&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a list of all jobs run today by your user.&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2371</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2371"/>
		<updated>2026-05-04T14:34:11Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising resource occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that the same user can submit and run. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** max 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
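&lt;br /&gt;
The &amp;lt;code&amp;gt;MaxTRES&amp;lt;/code&amp;gt; string itself can be unpacked into one limit per line for readability; a sketch using the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; value shown above:&lt;br /&gt;
&lt;br /&gt;
```shell
# Turn a comma-separated MaxTRES value (as printed by sacctmgr) into
# one resource limit per line. The value is copied from the gpulight row.
maxtres='cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G'
tres_lines=$(printf '%s\n' "$maxtres" | tr ',' '\n')
echo "$tres_lines"
```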
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich in resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant different levels of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * they can have a lower number of simultaneously running jobs&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing which GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
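&lt;br /&gt;
For example, such a GRES request could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command similar to the following sketch (the QOS, duration and memory values are purely illustrative, and must fit the limits of the QOS you choose):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=04:00:00 --mem=64G --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;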
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
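&lt;br /&gt;
If you prefer a single command, &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; should also accept multiple output fields at once, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which prints both lists together.&lt;br /&gt;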
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state: see below)&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. In a normally functioning SLURM system, the passage from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions, including the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests, can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
=&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
=&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;=&amp;gt;&amp;quot; the most relevant default values for Mufasa users, i.e.:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2370</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2370"/>
		<updated>2026-05-04T14:33:35Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Default values */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
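&lt;br /&gt;
For instance, a job expected to complete well within two hours could request a 2-hour slot via &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&amp;#039;s &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option, along the lines of the following sketch (the QOS and the script name are purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=02:00:00 ./my_experiment.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;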
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum duration of 12 hours&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
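&lt;br /&gt;
For illustration, a job compatible with these limits could be launched with a command like the following (the option values and the &amp;lt;code&amp;gt;--pty bash&amp;lt;/code&amp;gt; payload, which opens an interactive shell, are purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --time=06:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;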
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
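&lt;br /&gt;
To check how many submitted and running jobs you currently have, you can list your own jobs with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;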
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
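&lt;br /&gt;
If you want to see how these elements combine for jobs waiting in the queue, standard SLURM installations provide the &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command, which breaks down the priority of each pending job into its components:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;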
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time.&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
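&lt;br /&gt;
On standard SLURM installations, your current FairShare value can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sshare -U&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;-U&amp;lt;/code&amp;gt; option restricts the output to your own user.&lt;br /&gt;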
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the values &amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt;, giving the following resource names:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
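&lt;br /&gt;
For instance, an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command requesting these GPUs (the rest of the command line is illustrative) could look like&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 --qos=gpuheavy-20 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;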
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
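&lt;br /&gt;
The two lists can also be obtained side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;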
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
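&lt;br /&gt;
Since &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; is the default (and only) partition, specifying it is never necessary; should you want to do it anyway, &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; accept the standard option&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
--partition=jobs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;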
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state: see below)&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. In a normally functioning SLURM system, the passage from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions, including the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests, can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
=&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
=&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;=&amp;gt;&amp;quot; the most relevant default values for Mufasa users, i.e.:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2369</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2369"/>
		<updated>2026-05-04T14:32:06Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Partition availability */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
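&lt;br /&gt;
In practice, this means that a typical job submission specifies at least a QOS and a duration; for example (option values are purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=02:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;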
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Qualities of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
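&lt;br /&gt;
A job that fits within these limits could be launched, as a sketch, with a command like the following (the requested amounts may of course be lower than the QOS maxima):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=12:00:00 --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;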
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
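&lt;br /&gt;
The QOS can be specified either on the command line (with the &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;) or, for batch jobs, with &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directives at the top of the execution script; the values below are only illustrative:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --qos=gpu&lt;br /&gt;
#SBATCH --time=08:00:00&lt;br /&gt;
#SBATCH --job-name=my_experiment&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;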
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to let Mufasa users &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
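&lt;br /&gt;
As a sketch, a container build could be run under this QOS with a command along these lines (the file names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;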
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to give researchers wider access to Mufasa&amp;#039;s resources without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
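&lt;br /&gt;
To check how many jobs you currently have submitted, and which of them are running, you can use &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;, for instance:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER                # all your submitted jobs (running and queued)&lt;br /&gt;
squeue -u $USER -t RUNNING     # only your running jobs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;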
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;: recent resource use impacts it more than use farther in the past&lt;br /&gt;
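&lt;br /&gt;
Your current FairShare value can be inspected with SLURM&amp;#039;s &amp;lt;code&amp;gt;sshare&amp;lt;/code&amp;gt; command, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sshare -u $USER -o User,RawShares,RawUsage,FairShare&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;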
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it does not, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the GPU resource name takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
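&lt;br /&gt;
For example, such a request could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command as follows (the QOS here is chosen to match the requested GPUs; adapt both to your case):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=1-00:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;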
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated to a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
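&lt;br /&gt;
The two lists can also be printed side by side with a single command, which makes the comparison easier:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;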
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle - and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one used for jobs that do not request a specific partition.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
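&lt;br /&gt;
For instance, the jobs currently submitted to the &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; partition can be listed with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -p jobs -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;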
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state: see below)&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. In a normally functioning SLURM system, the passage from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the two lines most relevant to Mufasa users are highlighted with &amp;quot;-&amp;gt;&amp;quot;: they show the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2368</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2368"/>
		<updated>2026-05-04T14:28:02Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Partition availability */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their usage and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
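&lt;br /&gt;
As an illustration (not the only valid form), a job that fits within these limits could be launched with a command like the following, where &amp;lt;code&amp;gt;./my_script.sh&amp;lt;/code&amp;gt; is a hypothetical executable:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=06:00:00 --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;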
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each of which grants a different level of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
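&lt;br /&gt;
To check how close you are to these limits, you can list your own jobs in the SLURM queue. For instance:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER                        # all your submitted jobs (running + queued)&lt;br /&gt;
squeue -u $USER -h -t running | wc -l  # number of your currently running jobs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;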
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as many of Mufasa&amp;#039;s resources as possible free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
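&lt;br /&gt;
While a job is queued, the contribution of each of these elements to its priority can be inspected with SLURM&amp;#039;s &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command, and your own FairShare value with &amp;lt;code&amp;gt;sshare&amp;lt;/code&amp;gt; (the exact columns shown depend on the SLURM configuration):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l          # per-job breakdown of the priority factors (age, fairshare, QOS, ...)&lt;br /&gt;
sshare -u $USER   # your current FairShare value&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;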
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In SLURM-based systems like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
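&lt;br /&gt;
As a concrete (illustrative) example, this resource string is passed to SLURM via the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option, either on the command line or as an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive (&amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; is a hypothetical executable):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# on the command line:&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_program&lt;br /&gt;
&lt;br /&gt;
# as a directive in an execution script:&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;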
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
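&lt;br /&gt;
The two lists can also be obtained side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;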
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; (some of the resources are busy executing jobs while others are idle) and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; (all of the resources are in use)&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state: see below)&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2367</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2367"/>
		<updated>2026-05-04T14:27:25Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Partition availability */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their utilisation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
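&lt;br /&gt;
For instance (a minimal sketch: &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; is SLURM&amp;#039;s standard option for requesting a time slot, while the script name is just a placeholder), a 2-hour time slot can be requested with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --time=02:00:00 my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;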
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
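&lt;br /&gt;
For example (a sketch using SLURM&amp;#039;s standard &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; option; the script name is a placeholder), a CPU-only job could be submitted with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;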
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
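&lt;br /&gt;
Putting these limits together, a job that fits within &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; could be launched, for instance, with (a sketch: the options are standard SLURM ones, the script name is a placeholder)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=12:00:00 --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;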
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched with this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time.&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
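&lt;br /&gt;
To see how these elements combine for the jobs currently in the queue, you can use SLURM&amp;#039;s standard &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command, which breaks down the priority of each pending job into its components (the exact columns shown depend on the system&amp;#039;s configuration):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;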
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available resource names are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
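&lt;br /&gt;
This resource string is passed to SLURM via the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or of an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive. A hypothetical invocation (the QOS, duration and script name are illustrative, not prescribed):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical srun invocation asking for 2 GPUs of type 4g.20gb
# (the gpuheavy-20 QOS allows up to 2 such GPUs; my_job.sh is a placeholder)
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=04:00:00 ./my_job.sh
```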
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to choose a QOS associated with a type of GPU of which one or more units aren&amp;#039;t currently in use. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources available on Mufasa. Its output is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
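&lt;br /&gt;
This comparison can also be scripted. The following is a minimal sketch (not an official Mufasa tool) that subtracts the used counts from the totals, using the two example strings shown above:&lt;br /&gt;
&lt;br /&gt;
```shell
# Minimal sketch (not an official Mufasa tool): given the GRES and
# GRES_USED strings printed by sinfo, print how many GPUs of each
# type are currently free.
free_gpus() {
  total="$1"; used="$2"
  echo "$total" | tr ',' '\n' | while IFS=':' read -r _ type count; do
    # extract the used count for this GPU type; default to 0 if absent
    u=$(echo "$used" | tr ',' '\n' | grep "^gpu:$type:" | cut -d':' -f3 | cut -d'(' -f1)
    echo "$type: $((count - ${u:-0})) free"
  done
}

# Example with the outputs shown above
free_gpus "gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5" \
          "gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"
```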
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state: see below)&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (1 hour in the example above)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (4 GB in the example above)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2366</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2366"/>
		<updated>2026-05-04T14:26:14Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Partition availability */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their utilisation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs run by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
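&lt;br /&gt;
A job that fits within these limits might be launched as follows (a sketch: the script name is a placeholder, and the explicit flags simply make the requested resources visible):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical job within the gpulight limits: 1 small GPU, 2 CPUs,
# 64 GB of RAM, at most 12 hours (train.sh is a placeholder)
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 \
     --mem=64G --time=12:00:00 ./train.sh
```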
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
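&lt;br /&gt;
For instance, a build job might look like this (a sketch: the image and definition file names are placeholders, and the exact &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt; options you need are covered on the page linked above):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical container build under the build QOS
# (my_image.sif and my_image.def are placeholders)
srun --qos=build --time=01:00:00 singularity build my_image.sif my_image.def
```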
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time.&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it does not, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
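&lt;br /&gt;
As a concrete sketch of these guidelines (all values are purely illustrative and must be adapted to your own job), a CPU-only job that needs 4 cores, 32 GB of RAM and at most 8 hours could be requested as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --cpus-per-task=4 --mem=32G --time=08:00:00 ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;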
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the resource names take the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
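&lt;br /&gt;
For instance (a sketch: the rest of the command line is omitted and depends on your job), this request can be passed on the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or, equivalently, as a directive inside an execution script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;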
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is not currently in use. The command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
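&lt;br /&gt;
To ease the comparison, you can split both lists into one line per GPU type using standard shell tools (a sketch; the &amp;lt;code&amp;gt;-h&amp;lt;/code&amp;gt; option suppresses the header line):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -h -O Gres:100 | tr &amp;#039;,&amp;#039; &amp;#039;\n&amp;#039;&lt;br /&gt;
sinfo -h -O GresUsed:100 | tr &amp;#039;,&amp;#039; &amp;#039;\n&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;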
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2365</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2365"/>
		<updated>2026-05-04T14:25:20Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
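&lt;br /&gt;
The duration of the time slot is requested with the standard SLURM &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;; for instance (the value is purely illustrative, and the rest of the command line depends on your job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --time=0-06:00:00 ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;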
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default number of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
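&lt;br /&gt;
A job that uses the whole &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; allowance could therefore be requested as follows (a sketch: the actual command or script to execute is omitted and up to you):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --time=12:00:00 --gres=gpu:3g.20gb:1 ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;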
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
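&lt;br /&gt;
To quickly check whether your own user may use a specific QOS (here &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt;, as an example), the following sketch prints the QOS name if it is allowed and nothing otherwise (&amp;lt;code&amp;gt;-n&amp;lt;/code&amp;gt; suppresses the header lines):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr -n list association where user=$USER format=qos%-60 | grep -ow gpulight&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;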
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM, and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t design your code to exploit multiple CPUs, check whether it actually does! If it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
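As a concrete illustration of these guidelines, here is a hedged sketch of a modest request (the QOS, resource values and script name are placeholder assumptions, to be adapted to your actual job):&lt;br /&gt;

```shell
# Illustrative only: a short, single-CPU job on a low-powered GPU QOS.
# The QOS, resource values and script name are placeholders -- adapt them
# to your actual job. The command is only assembled and printed here, to
# show the option syntax.
cmd="srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=1 --mem=8G --time=02:00:00 ./my_job.sh"
echo "$cmd"
```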
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
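Building on this example, here is a sketch of how such a string is passed to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; via its &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option (the job script name is a placeholder, and the command is only printed, not executed):&lt;br /&gt;

```shell
# Sketch: pass the gres string to srun via its --gres option.
# The job script name is a placeholder; note that the QOS you choose must
# actually allow this many GPUs of this type. The command is only printed
# here, to show the syntax.
gres="gpu:4g.20gb:2"
cmd="srun --gres=$gres ./my_job.sh"
echo "$cmd"
```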
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
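This comparison can also be scripted. The sketch below is an illustration only: the two strings are copied from the sample outputs above (in real use you would capture them from &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;), and it subtracts the used count from the total count for each GPU type:&lt;br /&gt;

```shell
# Count free GPUs per type by comparing the GRES and GRES_USED strings.
# The two strings below are the sample outputs shown above; in real use,
# capture them with:  sinfo -h -O Gres:100  and  sinfo -h -O GresUsed:100
total='gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5'
used='gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)'

# count STRING TYPE -> number of GPUs of the given type listed in STRING
count() {
  echo "$1" | tr ',' '\n' | grep "^gpu:$2:" | sed -E 's/^gpu:[^:]+:([0-9]+).*/\1/'
}

for type in 40gb 4g.20gb 3g.20gb; do
  echo "$type: $(( $(count "$total" "$type") - $(count "$used" "$type") )) free"
done
```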
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2364</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2364"/>
		<updated>2026-05-04T14:25:08Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
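As an illustrative sketch (the image and definition file names are placeholders), a container build under this QOS could be launched with a command along these lines, staying within the limits of the QOS:&lt;br /&gt;

```shell
# Illustrative only: a Singularity image build run under the build QOS.
# File names are placeholders; the requested resources stay within the
# QOS limits shown above (cpu=2, mem=16G, MaxWall 02:00:00). The command
# is only printed here, to show its shape.
cmd="srun --qos=build --cpus-per-task=2 --mem=16G --time=02:00:00 singularity build my_image.sif my_image.def"
echo "$cmd"
```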
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students), and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each of which provides its members with a different level of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
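&lt;br /&gt;
To check how many jobs you currently have submitted (both running and queued), you can list them with SLURM&amp;#039;s standard &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;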
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. recent resource use has more impact on it than use farther in the past&lt;br /&gt;
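&lt;br /&gt;
If SLURM&amp;#039;s standard &amp;lt;code&amp;gt;sshare&amp;lt;/code&amp;gt; utility is available to your user, you can inspect your own FairShare value with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sshare -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;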
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does! If it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
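&lt;br /&gt;
As an illustration (the script name and the QOS here are placeholders: adapt them to your job), such a request could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;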
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
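&lt;br /&gt;
You can also print the two columns side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;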
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2363</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2363"/>
		<updated>2026-05-04T14:24:19Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) definitions, SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
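For instance, a job could request exactly the resources that &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; allows. The following is a minimal sketch of an [[User Jobs#Using execution scripts to run jobs|execution script]] (standard SLURM &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives; the program name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Sketch of an execution script using the gpulight QOS; the limits are
# taken from the sacctmgr output above. "./my_program" is a placeholder.
#SBATCH --qos=gpulight
#SBATCH --time=12:00:00        # <= MaxWall (12 hours)
#SBATCH --cpus-per-task=2      # <= cpu=2
#SBATCH --mem=64G              # <= mem=64G
#SBATCH --gres=gpu:3g.20gb:1   # the only GPU type gpulight can access

srun ./my_program
```
&lt;br /&gt;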
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so  SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to give researchers wider access to Mufasa&amp;#039;s resources without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as many of Mufasa&amp;#039;s resources as possible free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM, and less execution time.&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually uses them; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your Fairshare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
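The pattern can be sketched with a tiny shell helper (a hypothetical illustration, not a SLURM command) that assembles the request string from a type and a quantity:

```shell
# build_gres: print a SLURM-style GPU request string "gpu:<Type>:<Quantity>".
# Illustrative helper only; the actual request is passed to srun/sbatch.
build_gres() {
  printf 'gpu:%s:%s\n' "$1" "$2"
}

build_gres 4g.20gb 2   # prints: gpu:4g.20gb:2
```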
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a GPU type of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
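The comparison can also be scripted. The following sketch subtracts the used count from the total count for each GPU type, using the example strings above (the field layout is assumed from the outputs shown):

```shell
# Example GRES strings copied from the sinfo outputs above.
total='gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5'
used='gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)'

free_gpus() {
  # $1 = GRES (totals), $2 = GRES_USED; both comma-separated "gpu:Type:N" items.
  echo "$1" | tr ',' '\n' | while IFS=: read -r _ type n; do
    # Find the used count for this type (0 if the type does not appear).
    u=$(echo "$2" | tr ',' '\n' | grep -F ":${type}:" | sed -E 's/.*:([0-9]+)\(.*/\1/')
    echo "${type}: $((n - ${u:-0})) free"
  done
}

free_gpus "$total" "$used"
# -> 40gb: 1 free
# -> 4g.20gb: 3 free
# -> 3g.20gb: 2 free
```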
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Command &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; does not tell you about the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to a partition. This information can be obtained via [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant for Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2362</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2362"/>
		<updated>2026-05-04T14:23:03Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Qualities of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, the maximum duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
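:: For example, a MaxWall value of &amp;lt;code&amp;gt;1-12:00:00&amp;lt;/code&amp;gt; corresponds to a maximum duration of 1 day and 12 hours.&lt;br /&gt;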
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
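&lt;br /&gt;
A request compatible with these limits might look like the following (a sketch only: the script name is a placeholder, and the resource amounts are examples chosen within the QOS limits):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --time=08:00:00 ./train.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;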
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities other than building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
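&lt;br /&gt;
As a sketch (the image and recipe file names are placeholders), a container build job could be submitted like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;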
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each granting a different level of access to system resources. The idea behind these categories is to give researchers more access to Mufasa&amp;#039;s resources without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. recent resource use has a larger impact on it than resource use farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
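&lt;br /&gt;
As a sketch (the script name is a placeholder), this request could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or in an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an execution script as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 ./my_job.sh&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;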
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units aren&amp;#039;t currently in use. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
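&lt;br /&gt;
For instance, comparing the two example outputs above gives:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gpu:40gb      3 total - 2 in use = 1 free&lt;br /&gt;
gpu:4g.20gb   5 total - 2 in use = 3 free&lt;br /&gt;
gpu:3g.20gb   5 total - 3 in use = 2 free&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;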
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle - and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
Note that to inspect the &amp;#039;&amp;#039;jobs&amp;#039;&amp;#039; submitted to partition &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; you should use [[User Jobs#Inspecting jobs with squeue|command &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2361</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2361"/>
		<updated>2026-05-04T14:19:46Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising resource occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
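&lt;br /&gt;
For instance, a job that needs at most two hours of execution time could request its time slot via the &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; (a sketch: &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; is a placeholder for your own executable):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --time=02:00:00 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;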
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; definitions (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system administrators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
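&lt;br /&gt;
A job using &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; could therefore be launched with a command similar to the following (a sketch: the script name is a placeholder, and you may of course request less than these maxima):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;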
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always explicitly specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
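&lt;br /&gt;
To check how many jobs you currently have submitted or running (and therefore how close you are to these limits), you can use the standard SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists all of your jobs together with their state (&amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; = running, &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; = pending, i.e. queued).&lt;br /&gt;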
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM, and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your Fairshare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names (i.e., values of &amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;) are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
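&lt;br /&gt;
In practice, the resource specification is passed to SLURM through the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option, both on the command line and in execution scripts. For instance (a sketch: the script name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# on the command line&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 [other options] ./my_script.sh&lt;br /&gt;
&lt;br /&gt;
# in an execution script&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;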
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
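&lt;br /&gt;
The two pieces of information can also be requested together, since &amp;lt;code&amp;gt;sinfo -O&amp;lt;/code&amp;gt; accepts a comma-separated list of fields:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example outputs above, one GPU of type &amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt; (3 total, 2 in use), three of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; and two of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; are free.&lt;br /&gt;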
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The state of &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
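&lt;br /&gt;
As a convenience, durations in &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; format can be converted to plain hours with a throwaway shell helper (a sketch, not part of SLURM):&lt;br /&gt;
&lt;br /&gt;
```shell
# Throwaway helper (not provided by SLURM): convert a SLURM duration
# in [days-]hours:minutes:seconds format into a whole number of hours.
to_hours() {
  awk -v t="$1" 'BEGIN {
    d = 0
    if (t ~ /-/) { split(t, a, "-"); d = a[1]; t = a[2] }
    split(t, b, ":")
    print d * 24 + b[1]
  }'
}

to_hours 3-00:00:00   # prints 72 (the TIMELIMIT shown above)
to_hours 1:00:00      # prints 1  (the DEFAULTTIME shown above)
```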
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2360</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2360"/>
		<updated>2026-05-04T14:18:37Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
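&lt;br /&gt;
For instance, a 2-hour time slot can be requested via the &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option of an execution script (a minimal sketch; the payload and the QOS shown are illustrative):&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Minimal sketch of an execution script: the job asks for a 2-hour
# time slot and is killed by SLURM if it is still running after that.
# "my_experiment.sh" is a hypothetical payload, not an actual Mufasa file.
#SBATCH --time=02:00:00
#SBATCH --qos=nogpu

srun ./my_experiment.sh
```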
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, the maximum duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
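&lt;br /&gt;
A job that fits within these limits could be requested, for instance, with an execution script along these lines (a sketch; &amp;lt;code&amp;gt;train.py&amp;lt;/code&amp;gt; is a hypothetical program):&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Sketch of an execution script that stays within the gpulight limits.
# "train.py" is a hypothetical program, not an actual Mufasa file.
#SBATCH --qos=gpulight
#SBATCH --gres=gpu:3g.20gb:1   # at most 1 GPU of this type
#SBATCH --cpus-per-task=2      # at most 2 CPUs
#SBATCH --mem=32G              # must stay within the 64 GB cap
#SBATCH --time=08:00:00        # must stay within the 12-hour cap

srun python3 train.py
```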
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so  SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, provides only the resources needed for building operations; additionally, it has no access to GPUs and imposes a short maximum duration on jobs. This makes it unsuitable for computing activities other than building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
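&lt;br /&gt;
For instance, a container build under this QOS could be requested with an execution script like the following sketch (&amp;lt;code&amp;gt;image.sif&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;recipe.def&amp;lt;/code&amp;gt; are hypothetical file names):&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Sketch: building a Singularity image under the build QOS.
# "image.sif" and "recipe.def" are hypothetical file names.
#SBATCH --qos=build
#SBATCH --time=01:00:00   # the build QOS caps jobs at 2 hours

srun singularity build image.sif recipe.def
```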
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM, and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
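&lt;br /&gt;
The fading memory can be pictured as an exponential decay of past usage. The sketch below assumes a hypothetical half-life of 7 days (in SLURM the actual constant is the &amp;lt;code&amp;gt;PriorityDecayHalfLife&amp;lt;/code&amp;gt; configuration parameter; its value on Mufasa is not stated here):&lt;br /&gt;
&lt;br /&gt;
```shell
# Sketch of FairShare's fading memory: usage recorded AGE days ago
# contributes USAGE * 0.5^(AGE/HALFLIFE) to the current total.
# The 7-day half-life used below is an assumption for illustration only.
decayed_usage() {  # decayed_usage USAGE AGE_DAYS HALFLIFE_DAYS
  awk -v u="$1" -v d="$2" -v h="$3" 'BEGIN { print u * 0.5 ^ (d / h) }'
}

decayed_usage 100 0 7    # just recorded: full weight, prints 100
decayed_usage 100 7 7    # one half-life old: prints 50
decayed_usage 100 21 7   # three half-lives old: prints 12.5
```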
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
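&lt;br /&gt;
As an illustration, the request string can be composed mechanically (a throwaway helper, not part of SLURM):&lt;br /&gt;
&lt;br /&gt;
```shell
# Throwaway helper (not part of SLURM): compose the gres request string
# from a GPU type and a quantity.
gres_request() { printf 'gpu:%s:%s' "$1" "$2"; }

gres_request 4g.20gb 2   # prints gpu:4g.20gb:2
# The string would then be passed to SLURM, e.g. via the --gres option.
```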
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources possessed by Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
&lt;br /&gt;
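Since &amp;lt;code&amp;gt;sinfo -O&amp;lt;/code&amp;gt; accepts a comma-separated list of output fields, the two lists can also be printed together with a single command (a convenience sketch combining the two commands above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;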
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)&lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Specifically:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2359</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2359"/>
		<updated>2026-05-04T14:17:22Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* How to maximise the priority of your jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) definitions, SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
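For instance (a sketch: the script name is illustrative), the QOS is specified with the &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, or with the corresponding &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive inside an execution script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# on the command line:&lt;br /&gt;
srun --qos=nogpu ./my_job.sh&lt;br /&gt;
&lt;br /&gt;
# or, inside an execution script:&lt;br /&gt;
#SBATCH --qos=nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;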
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* the following access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
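A job that stays within these limits can be requested, for instance, as follows (a sketch: the command and the requested duration are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --time=06:00:00 ./my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Asking for less than the QOS maxima (here, 32 GB of RAM and 6 hours instead of 64 GB and 12 hours) also helps the [[#Job priority|priority]] of the job.&lt;br /&gt;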
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to force users to explicitly specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so  SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t design your code to exploit multiple CPUs, check that it does! If it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your Fairshare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: if you&amp;#039;re going to run a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]] before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
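The guidelines above can be sketched as a single conservative &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; request. All values here are illustrative, not recommendations; they simply stay well within the limits of the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; QOS:&lt;br /&gt;

```shell
# Illustrative conservative request: a light QOS, one CPU, modest RAM and a
# tight time limit all raise job priority. The command is built as a string
# here (and only printed) because it can't be executed outside Mufasa.
CMD="srun --qos=gpulight --cpus-per-task=1 --mem=8G --time=00:30:00 --pty bash"
echo "$CMD"
```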
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
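As an illustration, the request string can be composed programmatically and then passed to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; via its standard &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option (the QOS shown is just an example):&lt;br /&gt;

```shell
# Build a gres request string following the gpu:<Type>:<Quantity> syntax;
# the resulting value is what would be passed to srun/sbatch via --gres.
TYPE="4g.20gb"
QTY=2
GRES="gpu:${TYPE}:${QTY}"
echo "$GRES"
# Example use (not executed here; the QOS value is illustrative):
# srun --qos=gpuheavy-20 --gres="$GRES" --pty bash
```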
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
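For illustration only (this is not Mufasa&amp;#039;s actual file), a minimal &amp;lt;code&amp;gt;gres.conf&amp;lt;/code&amp;gt; that exposes GPUs through NVML auto-detection can be as simple as:&lt;br /&gt;

```
# /etc/slurm/gres.conf -- minimal illustrative sketch relying on NVML
AutoDetect=nvml
```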
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to choose a QOS associated with a GPU type of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
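A quick way to compare the two lists is to let the shell compute the difference. This sketch uses the example outputs shown above; on Mufasa you would substitute the live &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; values:&lt;br /&gt;

```shell
# Example GRES / GRES_USED strings (copied from the sinfo outputs above);
# for each GPU type, subtract the used count from the total to find free units.
TOTAL="gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5"
USED="gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"
for entry in $(echo "$TOTAL" | tr ',' ' '); do
  type=$(echo "$entry" | cut -d: -f2)
  tot=$(echo "$entry" | cut -d: -f3)
  used=$(echo "$USED" | grep -o "gpu:${type}:[0-9]*" | cut -d: -f3)
  echo "gpu:${type} free: $((tot - ${used:-0}))"
done
```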
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one used for jobs that do not request a specific partition.&lt;br /&gt;
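The &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; durations reported by &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; can be converted to seconds with a small helper; this is just a sketch, not part of SLURM:&lt;br /&gt;

```shell
# Convert a SLURM duration ("[days-]HH:MM:SS") to total seconds.
to_seconds() {
  t="$1"; days=0
  # Split off the optional "days-" prefix, then the H:M:S fields.
  case "$t" in *-*) days=${t%%-*}; t=${t#*-} ;; esac
  h=${t%%:*}; rest=${t#*:}
  m=${rest%%:*}; s=${rest#*:}
  echo $(( (days * 24 + h) * 3600 + m * 60 + s ))
}
to_seconds "1:00:00"      # the example DEFAULTTIME above
to_seconds "3-00:00:00"   # the example TIMELIMIT above
```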
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the two lines most relevant to Mufasa users are highlighted with &amp;quot;-&amp;gt;&amp;quot;: they contain the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2358</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2358"/>
		<updated>2026-05-04T14:17:12Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* How to maximise the priority of your jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through the &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) mechanism, SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to let Mufasa users &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time;&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;: resource use impacts it more if recent and less if further in the past.&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Suggestion: before choosing which GPU to request for a job, it&amp;#039;s a good idea to [[#Looking for unused GPUs|look for unused GPUs]]. Choosing a GPU that is currently idle should help your job get run sooner.&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the following GPU resource names are available:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
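For illustration only (the time limit and the interactive shell are placeholder choices, not recommendations), a request of this kind could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command as follows; the &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; QOS is used here because it is the one allowing two &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; GPUs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=02:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;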
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
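&lt;br /&gt;
As a convenience (this is just a combination of the two commands above, not a different tool), both columns can be requested with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example outputs above, one &amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt; GPU, three &amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt; GPUs and two &amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt; GPUs are currently unused.&lt;br /&gt;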
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users: they contain the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2357</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2357"/>
		<updated>2026-05-04T14:14:10Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Elements determining job priority */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
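For instance (the duration, the QOS and the script name are illustrative placeholders), a job expected to complete within 90 minutes could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpu --time=01:30:00 python my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;my_script.py&amp;lt;/code&amp;gt; is still running after 1 hour and 30 minutes, SLURM kills it.&lt;br /&gt;
&lt;br /&gt;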
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) mechanism, system configurators can assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
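As a sketch (the script name and the specific values are placeholders, chosen to stay within the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; limits above), such a job could be described by an [[User Jobs#Using execution scripts to run jobs|execution script]] like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --qos=gpulight&lt;br /&gt;
#SBATCH --cpus-per-task=2&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --gres=gpu:3g.20gb:1&lt;br /&gt;
#SBATCH --time=06:00:00&lt;br /&gt;
python my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;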
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students), and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM, and less execution time&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use has more impact on it if recent, less if farther in the past&lt;br /&gt;
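&lt;br /&gt;
You can inspect how these elements combine in practice with SLURM&amp;#039;s standard tools (a sketch; the exact columns depend on the system configuration):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l     # per-job priority breakdown: age, fairshare, job size, QOS factors&lt;br /&gt;
sshare -U    # current FairShare standing of your own user&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;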
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
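&lt;br /&gt;
As an illustrative sketch of these guidelines (&amp;lt;code&amp;gt;my_script.py&amp;lt;/code&amp;gt; is a placeholder), a CPU-only job that requests only what it needs could be launched as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --cpus-per-task=1 --mem=8G --time=02:00:00 python3 my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;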
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
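&lt;br /&gt;
In an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command, this specification is passed via the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option; for instance (the script name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=12:00:00 python3 train.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;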
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
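&lt;br /&gt;
The two outputs can also be combined into a single command that prints total and used GRES side by side:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;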
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the two lines most relevant to Mufasa users are highlighted with &amp;quot;-&amp;gt;&amp;quot;: they contain the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (in this example, 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (in this example, 4096 MB, i.e. 4 GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2356</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2356"/>
		<updated>2026-05-04T14:13:37Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Elements determining job priority */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
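&lt;br /&gt;
As a sketch, a request that stays within these limits (&amp;lt;code&amp;gt;my_script.py&amp;lt;/code&amp;gt; is a placeholder) could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 python3 my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;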
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;researcher&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|researcher or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
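&lt;br /&gt;
You can check how many of your own jobs are currently submitted (and which of them are running, shown with state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; in column &amp;quot;ST&amp;quot;) with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;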
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The main features of FairShare are:&lt;br /&gt;
* the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time.&lt;br /&gt;
* the FairShare mechanism has a &amp;quot;fading memory&amp;quot;, i.e. resource use weighs more if recent, less if farther in the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
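&lt;br /&gt;
For instance, an interactive test session that follows these guidelines could be requested with a command like the following (illustrative: adapt QOS, resources and duration to your actual needs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=1 --mem=8G --time=01:00:00 --gres=gpu:3g.20gb:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This asks for the least powerful GPU-enabled QOS, a single CPU, a modest amount of RAM and a short time slot, all of which help the job&amp;#039;s priority.&lt;br /&gt;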
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the full resource name takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
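&lt;br /&gt;
For example, to run a program with two GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; via &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, the request could look like this (illustrative: &amp;lt;code&amp;gt;my_script.py&amp;lt;/code&amp;gt; is a placeholder, and the QOS and duration must match your needs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=12:00:00 python3 my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;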
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources available on Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
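&lt;br /&gt;
The two lists can also be obtained side by side, which makes the comparison easier:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;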
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2355</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2355"/>
		<updated>2026-05-04T14:06:02Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Job priority */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
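&lt;br /&gt;
For instance, a CPU-only job could be launched with a command like the following (illustrative: &amp;lt;code&amp;gt;my_script.py&amp;lt;/code&amp;gt; is a placeholder, and the requested resources must fit your actual needs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --cpus-per-task=4 --mem=32G --time=08:00:00 python3 my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;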
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; definitions (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case the maximum duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
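As an illustration of the ''[days-]hours:minutes:seconds'' time format used by MaxWall, the following small shell sketch (not part of Mufasa's tooling) converts such a string into a number of seconds:

```shell
# Sketch: converting a SLURM time string in [days-]hours:minutes:seconds
# format (as shown in the MaxWall column) into a number of seconds.
to_seconds() {
  # Split on both "-" and ":"; 4 fields means a leading days component.
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4) print ((($1 * 24) + $2) * 60 + $3) * 60 + $4
    else         print (($1 * 60) + $2) * 60 + $3
  }'
}

to_seconds "3-00:00:00"   # 259200 (the nogpu limit: 3 days)
to_seconds "12:00:00"     # 43200  (the gpulight limit: 12 hours)
```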
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
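Putting these limits together, a &lt;code&gt;gpulight&lt;/code&gt; job could be launched with an &lt;code&gt;srun&lt;/code&gt; command like the sketch below; the option flags are standard SLURM ones, and the script name &lt;code&gt;my_script.sh&lt;/code&gt; is a placeholder:

```shell
# Sketch: an srun invocation that stays within the gpulight limits above
# (2 CPUs, 64 GB RAM, one gpu:3g.20gb GPU, 12 hours). my_script.sh is a
# placeholder for the actual workload.
cmd="srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 ./my_script.sh"
echo "$cmd"
```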
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each granting a different level of access to system resources. The idea behind these categories is to give researchers broader access to Mufasa&amp;#039;s resources without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * can have a higher number of jobs running at the same time&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * can have a lower number of jobs running at the same time&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
The goal of SLURM is to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to &amp;#039;&amp;#039;&amp;#039;encourage users not to ask for resources or execution time that their job doesn&amp;#039;t need&amp;#039;&amp;#039;&amp;#039;. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.&lt;br /&gt;
&lt;br /&gt;
Priority management in Mufasa is designed to set up a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039; where users, by carefully choosing what to ask for, obtain two results:&lt;br /&gt;
* they ensure that their job is executed as soon as possible;&lt;br /&gt;
* they leave as much as possible of Mufasa&amp;#039;s resources free for other users&amp;#039; jobs.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;: the influence of past resource usage decreases the further back in time it occurred&lt;br /&gt;
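The effect of the fading memory can be pictured as an exponential decay of past usage. The sketch below is an illustration only: the 7-day half-life is an assumption (SLURM's actual decay is governed by its &lt;code&gt;PriorityDecayHalfLife&lt;/code&gt; configuration parameter, whose value on Mufasa is not shown here):

```shell
# Illustration only: how a half-life decay discounts past resource usage.
# decayed = usage * 0.5^(age / half_life); the 7-day half-life is assumed.
decayed_usage() {
  awk -v u="$1" -v age="$2" -v hl=7 'BEGIN { printf "%.1f\n", u * 0.5 ^ (age / hl) }'
}

decayed_usage 100 0    # 100.0 : today's usage counts in full
decayed_usage 100 7    # 50.0  : week-old usage counts half
decayed_usage 100 14   # 25.0  : two-week-old usage counts a quarter
```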
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they finish (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
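The same composition can be sketched as a small shell helper; the function name &lt;code&gt;make_gres&lt;/code&gt; is hypothetical, and the accepted types mirror Mufasa's GPU types listed above:

```shell
# Sketch: composing a gres request string gpu:<Type>:<Quantity>.
# The accepted types mirror Mufasa's GPU types listed above.
make_gres() {
  case $1 in
    40gb|4g.20gb|3g.20gb) echo "gpu:$1:$2" ;;
    *) echo "unknown GPU type: $1" >&2; return 1 ;;
  esac
}

make_gres 4g.20gb 2   # gpu:4g.20gb:2
make_gres 40gb 1      # gpu:40gb:1
```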
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a GPU type of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2354</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2354"/>
		<updated>2026-05-04T13:56:14Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Job priority */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
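As an aside, the &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; format used for MaxWall (and for job durations in general) can be converted to seconds with a small shell helper like the one below. This is a sketch for illustration only, not part of Mufasa&amp;#039;s tooling:&lt;br /&gt;
&lt;br /&gt;
```shell
# Convert a SLURM duration such as "3-00:00:00" or "12:00:00" to seconds.
slurm_to_seconds() {
  local spec="$1" days=0
  case "$spec" in
    *-*) days=${spec%%-*}; spec=${spec#*-} ;;   # split off the optional "days-" part
  esac
  local h=${spec%%:*} rest=${spec#*:}
  local m=${rest%%:*} s=${rest#*:}
  # 10# forces base 10 so that leading zeros are not read as octal
  echo $(( (days * 24 + 10#$h) * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_to_seconds 3-00:00:00   # 259200 (the 3-day MaxWall of the nogpu QOS)
slurm_to_seconds 12:00:00     # 43200 (the 12-hour MaxWall of gpulight)
```
&lt;br /&gt;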
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
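For example, a job that fits within the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; limits above could be submitted with an execution script along these lines. This is a hypothetical sketch: the job name, time and memory values are illustrative, not prescriptions:&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
# Hypothetical execution script for a job using the gpulight QOS.
# The SBATCH values below are illustrative assumptions; stay within
# the QOS limits (max 2 CPUs, 64 GB RAM, 12 hours, 1 GPU of type 3g.20gb).
#SBATCH --job-name=myjob
#SBATCH --qos=gpulight
#SBATCH --gres=gpu:3g.20gb:1
#SBATCH --cpus-per-task=2
#SBATCH --time=06:00:00
#SBATCH --mem=32G

echo "job started on $(uname -n)"
# ...replace this echo with the actual computation...
```
&lt;br /&gt;
Note that the &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives are shell comments: the script runs unchanged outside SLURM, but only picks up the resource requests when submitted with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;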
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum job duration. This makes it unsuitable for computing activities other than building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to one of two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant different levels of access to system resources. The idea behind these categories is to give researchers broader access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. &lt;br /&gt;
&lt;br /&gt;
The order of the items in the job queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue according to descending priorities.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
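Putting it together, a complete &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; request for these 2 GPUs might look as follows. This is a sketch: the QOS, CPU, memory and time values are illustrative assumptions (chosen to fit the &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; limits shown earlier), and the command is printed rather than executed so that the example is self-contained:&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical srun invocation requesting 2 GPUs of type 4g.20gb
# through the gpuheavy-20 QOS (which allows gres/gpu:4g.20gb=2).
# The command is stored in an array and printed, not run, so the
# sketch also works on machines where SLURM is not installed.
cmd=(srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 \
     --cpus-per-task=4 --mem=64G --time=08:00:00 \
     --pty bash)
printf '%s ' "${cmd[@]}"; echo
```
&lt;br /&gt;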
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
prints a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
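&lt;br /&gt;
The comparison can also be automated. The snippet below is a sketch: it hard-codes the two example outputs above instead of calling &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;, so that it is self-contained; on Mufasa you would fill the two variables with live data as shown in the comments:&lt;br /&gt;
&lt;br /&gt;
```shell
# Count free GPUs per type by comparing the total and used GRES strings.
# The two strings below are the example outputs from this page; on Mufasa
# you would obtain them with:
#   total=$(sinfo -h -O Gres:100)
#   used=$(sinfo -h -O GresUsed:100)
total="gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5"
used="gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"

free_gpus() {
  local spec type have in_use
  for spec in ${1//,/ }; do          # iterate over "gpu:Type:Count" items
    type=${spec%:*}                  # e.g. gpu:40gb
    have=${spec##*:}                 # e.g. 3
    in_use=$(printf '%s' "$2" | grep -o "${type}:[0-9]*" | head -1 | grep -o '[0-9]*$')
    echo "${type}: $((have - ${in_use:-0})) free"
  done
}

free_gpus "$total" "$used"
# gpu:40gb: 1 free
# gpu:4g.20gb: 3 free
# gpu:3g.20gb: 2 free
```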
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one used for jobs that do not request a specific partition.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2353</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2353"/>
		<updated>2026-05-04T13:54:42Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Job priority */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their utilisation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
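&lt;br /&gt;
Purely as an illustration of the requests described above (the QOS, time slot and resource values here are assumptions for the example, not recommendations; see [[User Jobs]] for Mufasa&amp;#039;s canonical &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; invocations), a job request could be sketched as follows:&lt;br /&gt;
&lt;br /&gt;
```shell
# Sketch only: the flag values below are illustrative assumptions.
# Building the command line into a variable first makes it easy to
# review the QOS, time slot, RAM and GPU request before submitting.
cmd="srun --qos=gpulight --time=02:00:00 --mem=8G --gres=gpu:3g.20gb:1 --pty bash"
echo "$cmd"   # review, then run it (e.g. with: eval "$cmd")
```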
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: it means that the limit is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* the following access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
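&lt;br /&gt;
A &amp;lt;code&amp;gt;MaxTRES&amp;lt;/code&amp;gt; string can be unpacked mechanically. A minimal sketch (the string is copied from the &amp;lt;code&amp;gt;sacctmgr&amp;lt;/code&amp;gt; output above; the parsing itself is our own helper, not a SLURM tool):&lt;br /&gt;
&lt;br /&gt;
```shell
# Split the comma-separated MaxTRES string of the gpulight QOS
# (taken from the sacctmgr output above) into one limit per line.
maxtres="cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G"
echo "$maxtres" | tr ',' '\n'
# -> cpu=2
#    gres/gpu:3g.20gb=1
#    gres/gpu:40gb=0
#    gres/gpu:4g.20gb=0
#    mem=64G
```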
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: SLURM puts the job with the highest priority on top of the queue, where it is the first to be put into execution as soon as the resources it requires are available; all other jobs follow in order of descending priority.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
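&lt;br /&gt;
A request string in the &amp;lt;code&amp;gt;gpu:Type:Quantity&amp;lt;/code&amp;gt; form above can be taken apart with plain POSIX parameter expansion; a minimal sketch (our own helper, not SLURM tooling):&lt;br /&gt;
&lt;br /&gt;
```shell
# Sketch: split a gres request of the form gpu:<Type>:<Quantity>
# into its fields (the request string is the example from above).
req="gpu:4g.20gb:2"
name=${req%%:*}       # everything before the first ':'  -> gpu
rest=${req#*:}        # drop the leading 'gpu:'          -> 4g.20gb:2
gputype=${rest%:*}    # everything before the last ':'   -> 4g.20gb
qty=${rest##*:}       # everything after the last ':'    -> 2
echo "name=$name type=$gputype quantity=$qty"
# -> name=gpu type=4g.20gb quantity=2
```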
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a GPU type of which one or more units are not currently in use. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
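&lt;br /&gt;
The comparison can also be scripted. A sketch (our own helper; the two strings are hardcoded copies of the example &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; outputs above, and in practice you would substitute the live output of the two commands):&lt;br /&gt;
&lt;br /&gt;
```shell
# Sketch: compute the number of unused GPUs per type by subtracting
# the used count from the total count. The strings below are the
# example GRES and GRES_USED outputs shown above.
total="gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5"
used="gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"
for t in 40gb 4g.20gb 3g.20gb; do
  tot=$(echo "$total" | grep -o "gpu:$t:[0-9]*" | cut -d: -f3)
  use=$(echo "$used"  | grep -o "gpu:$t:[0-9]*" | cut -d: -f3)
  echo "$t: $((tot - use)) unused"
done
# -> 40gb: 1 unused
#    4g.20gb: 3 unused
#    3g.20gb: 2 unused
```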
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
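&lt;br /&gt;
Durations in the &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; format used by DEFAULTTIME and TIMELIMIT can be converted to plain seconds for scripting; a small sketch (our own helper, not part of SLURM):&lt;br /&gt;
&lt;br /&gt;
```shell
# Helper sketch (ours, not a SLURM tool): convert a duration in
# SLURM's [days-]hours:minutes:seconds format into plain seconds.
to_seconds() {
  echo "$1" | awk -F'[-:]' '{ if (NF == 4) print ((($1 * 24 + $2) * 60 + $3) * 60) + $4;
                              else         print (($1 * 60 + $2) * 60) + $3 }'
}
to_seconds 3-00:00:00   # TIMELIMIT of the jobs partition   -> 259200
to_seconds 1:00:00      # DEFAULTTIME of the jobs partition -> 3600
```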
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2352</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2352"/>
		<updated>2026-05-04T13:53:08Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, the maximum duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
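&lt;br /&gt;
As a purely illustrative example (the resource values below are hypothetical: adapt the GPU type, duration and command to your own job), an interactive job that stays within the limits of &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; could be launched with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 --mem=32G --time=06:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;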
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS will never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: higher-priority jobs sit closer to the top and therefore reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t design your code to exploit multiple CPUs, check whether it actually does! If it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
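For instance (a hypothetical invocation: adjust the QOS, duration and command to your own needs), this request could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=1-00:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;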
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
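&lt;br /&gt;
If you prefer, the two lists can also be obtained side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;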
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2351</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2351"/>
		<updated>2026-05-04T13:52:30Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
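As an illustration, a job that stays within these limits could be launched with a command along these lines (duration, memory and script name are placeholders to adapt to your job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 --mem=32G --time=06:00:00 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;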
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QoS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. At any time, the job on top of the queue is the first to be put into execution as soon as the resources it requires are available. The order of jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases the further back in time it occurred&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
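&lt;br /&gt;
Put together, these guidelines might translate into an execution script along the lines of the following sketch (QOS, duration, resource amounts and program name are placeholders to adapt to your own job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --qos=gpulight          # least powerful QOS that fits the job&lt;br /&gt;
#SBATCH --time=04:00:00         # worst-case estimate, no more&lt;br /&gt;
#SBATCH --cpus-per-task=1       # only ask for CPUs the code exploits&lt;br /&gt;
#SBATCH --mem=16G&lt;br /&gt;
#SBATCH --gres=gpu:3g.20gb:1&lt;br /&gt;
./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;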
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
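&lt;br /&gt;
As an example (the QOS and the command to run are placeholders to adapt to your own job), an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command requesting 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;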
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
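&lt;br /&gt;
Both pieces of information can also be obtained with a single command, which prints the two columns side by side:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;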
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2350</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2350"/>
		<updated>2026-05-04T13:52:08Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
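&lt;br /&gt;
For instance (the QOS, duration and program name are placeholders), a 4-hour time slot can be requested via the &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=04:00:00 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;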
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
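&lt;br /&gt;
As a sketch (the image and recipe file names are placeholders), a container build could be submitted with the &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;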
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;researcher&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QoS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue, and therefore the order in which they reach execution, is determined by their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., consumed fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
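In SLURM, elements like these are combined by the multifactor priority plugin into a single priority number: each factor is normalised to the range [0,1] and multiplied by a site-configured weight. The sketch below only illustrates the idea; the weights are invented for illustration and are &amp;#039;&amp;#039;not&amp;#039;&amp;#039; Mufasa&amp;#039;s actual configuration (which can be inspected with &amp;lt;code&amp;gt;sprio -w&amp;lt;/code&amp;gt;):&lt;br /&gt;

```shell
# Illustrative sketch of SLURM's multifactor priority formula:
#   priority = sum over factors of (weight * normalised_factor)
# The weights below are hypothetical, NOT Mufasa's real configuration.

priority() {
  # $1=age  $2=fairshare  $3=job_size  $4=qos  (each normalised to [0,1])
  awk -v age="$1" -v fs="$2" -v size="$3" -v qos="$4" 'BEGIN {
    w_age = 1000; w_fs = 10000; w_size = 1000; w_qos = 5000  # hypothetical weights
    printf "%.0f\n", w_age*age + w_fs*fs + w_size*size + w_qos*qos
  }'
}

priority 0.1 0.9 0.8 0.5   # small job, user with good FairShare -> 12400
priority 0.1 0.2 0.1 0.5   # large job, heavy user               -> 4700
```

Note how the frugal job of the lightly loaded user ends up with a much higher priority: asking for fewer CPUs and choosing a less powerful QOS raises the corresponding factors.&lt;br /&gt;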
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the values &amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt;, yielding the following resource names:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
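For illustration, a minimal execution script requesting two such GPUs might look as follows. The QOS, time and memory values are placeholders to adapt to your job, and &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; is hypothetical:&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --qos=gpuheavy-20       # a QOS granting access to 4g.20gb GPUs
#SBATCH --gres=gpu:4g.20gb:2    # two GPUs of type 4g.20gb
#SBATCH --time=12:00:00         # requested execution time (placeholder)
#SBATCH --mem=64G               # requested RAM (placeholder)

srun ./my_program               # ./my_program is a placeholder
```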
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
summarises all the GRES (i.e., GPU) resources available on Mufasa, producing an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
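The comparison can also be scripted. The snippet below is a hypothetical helper (not a tool provided by Mufasa) that counts the free GPUs of each type; here it runs on the two sample strings above, but on the live system you would feed it the outputs of the two &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; commands:&lt;br /&gt;

```shell
# Hypothetical helper: compare a GRES string with a GRES_USED string and
# print the number of free GPUs per type. The sample strings below are
# copied from the example sinfo outputs above.

free_gpus() {
  # $1 = total GRES string, $2 = GRES_USED string
  total="$1"; used="$2"
  echo "$total" | tr ',' '\n' | while IFS=':' read -r _ type count; do
    # extract the used count for the same type, dropping the "(IDX:...)" suffix
    u=$(echo "$used" | tr ',' '\n' | grep "^gpu:${type}:" \
        | sed -E 's/^gpu:[^:]+:([0-9]+).*/\1/')
    echo "gpu:${type} free=$(( count - ${u:-0} ))"
  done
}

free_gpus "gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5" \
          "gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"
# prints:
#   gpu:40gb free=1
#   gpu:4g.20gb free=3
#   gpu:3g.20gb free=2
```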
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
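The &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; format used by DEFAULTTIME and TIMELIMIT can be converted programmatically. The function below is a hypothetical helper, shown only to make the format explicit:&lt;br /&gt;

```shell
# Hypothetical converter for SLURM's [days-]hours:minutes:seconds time
# format (as seen in the DEFAULTTIME and TIMELIMIT columns) into seconds.

slurm_time_to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4) print $1*86400 + $2*3600 + $3*60 + $4  # days present
    else         print $1*3600  + $2*60   + $3          # no days part
  }'
}

slurm_time_to_seconds 1:00:00      # -> 3600
slurm_time_to_seconds 3-00:00:00   # -> 259200
```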
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (in the example, 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (in the example, 4 GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2349</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2349"/>
		<updated>2026-05-04T13:50:51Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
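The &amp;lt;code&amp;gt;MaxTRES&amp;lt;/code&amp;gt; string is easy to inspect programmatically. The sketch below is a hypothetical helper that extracts one limit from such a string; the sample string is the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; line from the output above (check the live &amp;lt;code&amp;gt;sacctmgr&amp;lt;/code&amp;gt; output for current values):&lt;br /&gt;

```shell
# Hypothetical helper: read one limit out of a MaxTRES string.
# The sample string below is copied from the gpulight line above.

tres_limit() {
  # $1 = MaxTRES string, $2 = resource name (e.g. cpu, mem, gres/gpu:3g.20gb)
  echo "$1" | tr ',' '\n' | awk -F'=' -v key="$2" '$1 == key { print $2 }'
}

gpulight="cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G"
tres_limit "$gpulight" "cpu"               # -> 2
tres_limit "$gpulight" "mem"               # -> 64G
tres_limit "$gpulight" "gres/gpu:3g.20gb"  # -> 1
```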
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QoS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
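&lt;br /&gt;
To check how many submitted jobs you currently have (and whether each is running or queued), you can list your own jobs with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;quot;ST&amp;quot; column shows the state of each job: &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; for running, &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; for pending (i.e., queued).&lt;br /&gt;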
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job at the top of the queue is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on each job&amp;#039;s &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; and determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
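&lt;br /&gt;
While your job waits in the queue, you can inspect the contribution of these elements to its priority with SLURM&amp;#039;s &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists, for each queued job, the weighted factors (age, fair share, job size, QOS) that make up its overall priority.&lt;br /&gt;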
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you did not explicitly design your code to exploit multiple CPUs, check whether it actually does; if it does not, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if they have become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
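&lt;br /&gt;
As an illustrative example of these guidelines (the script name is hypothetical), a job that needs a single CPU, a modest amount of RAM and one small GPU for a few hours could be launched with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=1 --mem=32G --time=04:00:00 --gres=gpu:3g.20gb:1 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
instead of requesting the most powerful QOS with the maximum duration.&lt;br /&gt;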
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
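&lt;br /&gt;
For instance, in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command line this request could appear as follows (assuming a QOS, such as &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt;, that grants access to two GPUs of this type):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;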
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
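&lt;br /&gt;
The two lists can also be obtained side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;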
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
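&lt;br /&gt;
When a partition is in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, the reason why the underlying nodes are unavailable (as recorded by the administrators) can be displayed with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -R&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;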
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2348</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2348"/>
		<updated>2026-05-04T13:50:29Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
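&lt;br /&gt;
A request that stays within these limits (the script name is hypothetical) could therefore be&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --time=12:00:00 --gres=gpu:3g.20gb:1 ./my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;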
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
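&lt;br /&gt;
For instance (image and definition file names are hypothetical), a container build could be run as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;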
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant different levels of access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;2 for&amp;#039;&amp;#039; &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; &amp;#039;&amp;#039;QoS&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. At any time, the job on top of the queue is the first to be put into execution as soon as the resources it requires become available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of each job, and defines the order in which jobs reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases the further back in time it occurred&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS that is compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
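&lt;br /&gt;
For example (with &amp;lt;code&amp;gt;12345&amp;lt;/code&amp;gt; as a placeholder job ID, to be replaced with the actual ID reported by &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;), you can list your own jobs and cancel one that is no longer needed with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --user=$USER&lt;br /&gt;
scancel 12345&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;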
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
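&lt;br /&gt;
For instance (a sketch, where &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; is a placeholder for your executable), a job asking for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; via the &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; QOS could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;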
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a GPU type of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
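&lt;br /&gt;
The two outputs can also be obtained side by side with a single command, which makes the comparison easier:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;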
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2347</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2347"/>
		<updated>2026-05-04T13:48:24Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
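For reference, a request that stays within these limits could look like this (a sketch: the option values are illustrative, and the interactive shell is just one possible use):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --time=06:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;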
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
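As a sketch (the &amp;lt;code&amp;gt;.sif&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;.def&amp;lt;/code&amp;gt; file names below are hypothetical), a container build could be launched under this QOS with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_image.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;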
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to give researchers greater access to Mufasa&amp;#039;s resources without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits&amp;lt;br/&amp;gt;(system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QOS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QOS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QOS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of the jobs, and defines the order in which each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
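The current state of the queue can be inspected with SLURM&amp;#039;s &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command; for instance (a sketch: the sort option orders jobs by decreasing priority):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --sort=-p -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;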
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
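The weighted factors that make up the priority of each pending job can usually be examined with SLURM&amp;#039;s &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command (the exact columns shown depend on the scheduler configuration):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;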
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the resource name takes one of the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
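In practice, such a specification is passed to SLURM via the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option. As an illustrative sketch (to be combined with whatever other options your job needs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or, as a directive in an execution script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;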
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources available on Mufasa. It produces an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
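The two lists can also be printed side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;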
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle - and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2346</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2346"/>
		<updated>2026-05-04T13:47:54Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) definitions, SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* the following access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
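&lt;br /&gt;
A job that fits within these limits could be launched, for instance, with an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command along the following lines (a sketch: &amp;lt;code&amp;gt;your_command&amp;lt;/code&amp;gt; is a placeholder, and the full option syntax is described in [[User Jobs]]):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 your_command&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;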
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities other than building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from &amp;quot;hoarding&amp;quot; system resources, and involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QOS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QOS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QOS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job at the top of the queue is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases as that usage recedes into the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they have finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
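&lt;br /&gt;
As a concrete illustration of the last guideline (a sketch: the job ID shown is hypothetical), you can list your own jobs and then cancel one that is no longer needed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER   # lists your running and queued jobs, with their JOBID&lt;br /&gt;
scancel 12345     # cancels the job whose JOBID is 12345&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;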
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
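&lt;br /&gt;
This resource string is passed to SLURM through the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option, both on the command line and in execution scripts (a sketch: &amp;lt;code&amp;gt;your_command&amp;lt;/code&amp;gt; and the other options are placeholders):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# on the srun command line:&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 [other options] your_command&lt;br /&gt;
&lt;br /&gt;
# as an SBATCH directive in an execution script:&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;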
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs. In the example outputs above, 1 of the 3 &amp;#039;&amp;#039;40gb&amp;#039;&amp;#039; GPUs, 3 of the 5 &amp;#039;&amp;#039;4g.20gb&amp;#039;&amp;#039; GPUs and 2 of the 5 &amp;#039;&amp;#039;3g.20gb&amp;#039;&amp;#039; GPUs are free.&lt;br /&gt;
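&lt;br /&gt;
The two field specifiers can also be combined in a single &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; invocation, which prints total and used GRES side by side:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;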
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not ask for a specific partition are run.&lt;br /&gt;
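&lt;br /&gt;
Since &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; is both the only partition and the default one, specifying it explicitly is normally unnecessary; if needed, it can be selected with the &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt; option (a sketch: &amp;lt;code&amp;gt;your_command&amp;lt;/code&amp;gt; and the other options are placeholders):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --partition=jobs [other options] your_command&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;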
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2345</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2345"/>
		<updated>2026-05-04T13:46:49Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Limits on jobs by the same user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
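&lt;br /&gt;
As a minimal sketch (values here are purely illustrative; the actual options for running jobs are described in [[User Jobs]]), the QOS and the requested time slot can be passed directly to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=02:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;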
&lt;br /&gt;
Mufasa sets limits on the number of jobs that a single user can submit and run. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case, job duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
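&lt;br /&gt;
A job that stays within these limits could be requested, for instance, with a command like the following (a sketch with illustrative values; see [[User Jobs]] for full directions):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --time=06:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;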
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;researcher&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. The idea behind these categories is to give researchers broader access to Mufasa&amp;#039;s resources without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs from a single user. These limits aim to prevent users from &amp;quot;hoarding&amp;quot; system resources, and involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of the jobs, and defines the order in which each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, it probably cannot: check whether it does and, if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
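&lt;br /&gt;
In an actual command this could appear, for example, as follows (a sketch: the chosen QOS must be one that allows the requested GPUs, here &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt;, which admits up to 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;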
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. The command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
lists all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
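&lt;br /&gt;
The two lists can also be obtained side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example outputs above, for instance, 1 GPU of type &amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt;, 3 of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; and 2 of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; are unused.&lt;br /&gt;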
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the lines most relevant to Mufasa users are highlighted with &amp;quot;-&amp;gt;&amp;quot;: they contain two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2344</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2344"/>
		<updated>2026-05-04T13:43:58Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* research users and students users */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but cannot intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their utilisation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: it is then determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
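&lt;br /&gt;
For instance, a job using &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; and requesting all the resources the QOS allows could be launched with a command along these lines (the script name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=12:00:00 --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;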
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to let Mufasa users &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
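&lt;br /&gt;
As an illustration (the image and definition file names are placeholders), a build job could be launched with a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=02:00:00 --cpus-per-task=2 --mem=16G singularity build my_image.sif my_image.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;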
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa&amp;#039;s resources, without preventing students from using the server.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user has launched, which may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job at the top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue, and therefore the order in which they reach execution, depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of each job.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
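&lt;br /&gt;
Putting the guidelines together, a small CPU-only job could be launched with a command like the following (the resource values and script name are only an example):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=00:30:00 --cpus-per-task=1 --mem=8G ./my_analysis.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;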
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
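&lt;br /&gt;
In an actual &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command this could appear, for example, as follows (the QOS is chosen here only because it allows 2 GPUs of this type, and the script name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;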
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
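&lt;br /&gt;
The two queries can also be combined into a single command, so that total and used GPUs appear side by side:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;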
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; (some of the resources are busy executing jobs while others are idle) and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; (all of the resources are in use)&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one used for jobs that do not ask for a specific partition.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Specifically:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2343</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2343"/>
		<updated>2026-05-04T13:43:02Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* research users and students users */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their usage and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
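For illustration (the script name is a placeholder, not a real Mufasa file), a job that specifies both a QOS and the duration of its time slot can be launched with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; like this:&lt;br /&gt;

```shell
# Hedged sketch: run a placeholder script with the "nogpu" QOS
# (described below) and a requested time slot of 2 hours.
srun --qos=nogpu --time=02:00:00 ./my_experiment.sh
```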
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
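For illustration, a request that stays within these &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; limits might look like this (the script name is a placeholder):&lt;br /&gt;

```shell
# Hedged sketch, within the gpulight limits listed above:
# 2 CPUs, 64 GB of RAM, 1 GPU of type 3g.20gb, 12 hours at most.
srun --qos=gpulight --cpus-per-task=2 --mem=64G \
     --gres=gpu:3g.20gb:1 --time=12:00:00 ./train_model.sh
```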
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * have access to a restricted set of [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits apply to:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user has launched and that may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QOSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QOS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QOS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QOS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QOSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
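To check how many running and queued jobs you currently have (and thus how close you are to these limits), you can use SLURM&amp;#039;s &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;

```shell
# List your own jobs: the state column shows RUNNING or PENDING
# (i.e., queued). Fields: job id, name, state, elapsed time, limit.
squeue -u $USER -o "%.10i %.20j %.8T %.10M %.10l"
```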
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: higher-priority jobs reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
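You can inspect the resulting priority of your own pending jobs (and its individual components, such as age, fair-share and QOS factors) with SLURM&amp;#039;s &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command:&lt;br /&gt;

```shell
# Show the priority components of your pending jobs;
# -l selects the long output format with one column per factor.
sprio -u $USER -l
```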
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
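The same syntax is used in &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directives. As a sketch (the QOS choice and the workload line are placeholders), an execution script requesting 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; could start like this:&lt;br /&gt;

```shell
#!/bin/bash
# Hedged sketch of an execution script; the workload line is a
# placeholder. The gres syntax matches the one described above.
#SBATCH --qos=gpuheavy-20
#SBATCH --gres=gpu:4g.20gb:2
#SBATCH --time=1-00:00:00
./run_training.sh
```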
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
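The comparison can also be scripted. The following is a minimal sketch (not an official Mufasa tool) that computes the number of free GPUs per type from the two strings, assuming the output formats shown above:&lt;br /&gt;

```shell
#!/bin/bash
# Minimal sketch: count free GPUs per type from the GRES and GRES_USED
# strings. On Mufasa the two strings would be obtained with:
#   total=$(sinfo -h -O Gres:100)
#   used=$(sinfo -h -O GresUsed:100)
# Here they are hard-coded to the example outputs shown above.
total="gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5"
used="gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"

free_gpus() {
    # For each gpu:<Type>:<Count> entry in the totals, strip the
    # "(IDX:...)" annotations from the used list, find the matching
    # type and print the difference.
    echo "$1" | tr ',' '\n' | while IFS=: read -r _ type count; do
        u=$(echo "$2" | sed 's/([^)]*)//g' | tr ',' '\n' \
            | grep "^gpu:$type:" | cut -d: -f3)
        echo "gpu:$type free: $((count - ${u:-0}))"
    done
}

free_gpus "$total" "$used"
# → gpu:40gb free: 1
#   gpu:4g.20gb free: 3
#   gpu:3g.20gb free: 2
```

In this sketch, the types with a free count greater than zero are the ones worth targeting with your QOS choice.&lt;br /&gt;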
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle - and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the two lines most relevant to Mufasa users are highlighted with &amp;quot;-&amp;gt;&amp;quot;: they contain two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2342</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2342"/>
		<updated>2026-05-04T13:41:59Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* researcher users and students users */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
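For example, the QOS is selected with the &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; (a sketch: &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; is a placeholder for the actual job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=01:00:00 my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;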
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; definitions (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* the following access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
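For illustration, a job that stays within these &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; limits could be launched with a command along these lines (a sketch: the flags shown are standard &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; options, and the interactive shell is just an example payload):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;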
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved to researchers (including Ph.D. students), and not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to one of two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant different levels of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa limits the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: higher-priority jobs reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
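&lt;br /&gt;
As a purely illustrative (hypothetical) example, an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command combining this syntax with a QOS that allows two GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; could look like the following (the script name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;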
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
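&lt;br /&gt;
The two queries can also be combined in a single invocation, so that total and used GRES are printed side by side:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:50,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;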
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2341</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2341"/>
		<updated>2026-05-04T13:41:48Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through its &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) mechanism, SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
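&lt;br /&gt;
As a hypothetical example (the script name is a placeholder), a job staying within these limits could be launched with a command such as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 --mem=64G --time=12:00:00 ./my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;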
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user has launched, which may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The position of each job in the queue depends on its &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when the job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you specifically designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if a bug makes them useless): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
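To put these guidelines into practice, here is a hypothetical &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; invocation for a small interactive single-GPU job; the QOS, duration and resource values are purely illustrative (adapt them to the real needs of your job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=02:00:00 --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;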
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In SLURM-based systems like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
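In practice, the &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; request is passed to SLURM via the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option. As a sketch (the QOS shown is an assumption: use one that actually grants the GPUs you are requesting), the request above could appear as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# on the command line:&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --pty bash&lt;br /&gt;
&lt;br /&gt;
# or, equivalently, as directives in an execution script:&lt;br /&gt;
#SBATCH --qos=gpuheavy-20&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;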
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
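&lt;br /&gt;
Since both lists come from &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;, you can also print them side by side with a single command (assuming your version of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; accepts a comma-separated field list for &amp;lt;code&amp;gt;-O&amp;lt;/code&amp;gt;, as recent SLURM versions do):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;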
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant for Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2340</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2340"/>
		<updated>2026-05-04T13:40:12Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Research users and students users */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
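A hypothetical job staying within these limits (the values shown are examples, not requirements) could therefore be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=08:00:00 --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;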
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
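As an illustration, a container build could be submitted with a command similar to the following, where &amp;lt;code&amp;gt;my_image.sif&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;my_recipe.def&amp;lt;/code&amp;gt; are placeholder file names and the requested resources match the limits of the &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 --cpus-per-task=2 --mem=16G singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;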
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#Research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs of a user. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of the jobs, and defines the order in which each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not request more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
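&lt;br /&gt;
As an illustration of these guidelines (the QOS, resource amounts and duration below are examples to adapt to your actual job, not prescriptions), a lean interactive job request could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=1 --mem=32G --gres=gpu:3g.20gb:1 --time=04:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Every request stays below the limits of the chosen QOS, and the duration is a worst-case estimate rather than the maximum allowed.&lt;br /&gt;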
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;Name:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Name&amp;lt;/code&amp;gt; is always &amp;lt;code&amp;gt;gpu&amp;lt;/code&amp;gt; in Mufasa. Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
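&lt;br /&gt;
As a sketch (only the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option is shown; a real command also needs the other options, such as the QOS), the request above appears in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command line or in an execution script as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 [other options] your_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;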
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
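&lt;br /&gt;
If you prefer, the two lists can be obtained side by side with a single command (both are standard &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; output fields):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;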
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant for Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2339</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2339"/>
		<updated>2026-05-04T13:39:29Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* Restricted QOSes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
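&lt;br /&gt;
The duration of the time slot is requested with the &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;. For instance, a hypothetical request for a 36-hour time slot (other options omitted) would be:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --time=1-12:00:00 [other options] your_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;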
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits to the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: it means that it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so  SLURM jobs launched using this QOS are executed quickly. &lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
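&lt;br /&gt;
As a sketch (the image and recipe file names are placeholders; see the page linked above for the exact build procedure), a container build job could be launched like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;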
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students) and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#Research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that a single user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
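&lt;br /&gt;
To check your own situation against these limits, you can use SLURM&amp;#039;s standard &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists all your submitted jobs, marking running ones with state &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; and queued ones with state &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt;.&lt;br /&gt;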
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of the jobs, and defines the order in which each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases as time passes&lt;br /&gt;
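&lt;br /&gt;
The actual values of these factors for the jobs currently in the queue can be inspected with SLURM&amp;#039;s standard &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which shows, for each queued job, the contribution of age, FairShare, job size and QOS to its overall priority. Your current FairShare value can be checked with &amp;lt;code&amp;gt;sshare -u $USER&amp;lt;/code&amp;gt;.&lt;br /&gt;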
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does! If it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
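&lt;br /&gt;
As a reminder of the last guideline, a job can be cancelled either by job ID or, for all your own jobs at once, by user name (the job ID below is just a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel 12345&lt;br /&gt;
scancel -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;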
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
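&lt;br /&gt;
As a sketch of how this syntax is used in practice, the following &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command requests one GPU of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; for an interactive shell (the QOS and duration shown are just example values):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuwide --gres=gpu:4g.20gb:1 --time=02:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;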
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs. In the example above, one &amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt; GPU, three &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; GPUs and two &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; GPUs are unused.&lt;br /&gt;
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one used for jobs that do not request a specific partition.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2338</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2338"/>
		<updated>2026-05-04T13:38:37Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* QOS restrictions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these values are not set: in that case they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
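&lt;br /&gt;
A job request that stays within these limits could therefore look like the following (all values are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=32G --gres=gpu:3g.20gb:1 --time=06:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;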
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== Restricted QOSes ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are reserved for researchers (including Ph.D. students), and are not available to M.Sc. students.&lt;br /&gt;
&lt;br /&gt;
See [[#Research users and students users|below]] to understand the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to one of two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each granting a different level of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job at the top of the queue is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
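&lt;br /&gt;
To see how SLURM combines these elements for the jobs currently in the queue, you can try the &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; command (part of standard SLURM; shown here as a sketch, since its availability and exact output depend on the scheduler configuration):&lt;br /&gt;
&lt;br /&gt;
```shell
# List the priority factors (age, fair-share, job size, QOS, ...) of all pending jobs
sprio -l

# Restrict the output to your own jobs
sprio -l -u $USER
```
&lt;br /&gt;
Note that jobs which are already running no longer appear in this list: &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; only reports pending jobs.&lt;br /&gt;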
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does! If it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs as soon as you no longer need them (e.g., because a bug made them useless): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
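&lt;br /&gt;
As an example of how this syntax is used in practice, the following &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; invocation sketches a request for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt;. The QOS and the script name are illustrative (the &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; QOS is the one whose limits allow 2 such GPUs):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical job launch: 2 GPUs of type 4g.20gb for one hour.
# './my_job.sh' is a placeholder for your own executable.
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=01:00:00 ./my_job.sh
```
&lt;br /&gt;
The equivalent directive in an execution script would be &amp;lt;code&amp;gt;#SBATCH --gres=gpu:4g.20gb:2&amp;lt;/code&amp;gt;.&lt;br /&gt;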
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
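&lt;br /&gt;
The comparison can also be scripted. The sketch below is a hypothetical helper (not part of Mufasa&amp;#039;s tooling) that subtracts the used counts from the totals; it is shown operating on the example strings above, but on Mufasa the two variables could be filled with &amp;lt;code&amp;gt;sinfo -h -O Gres:100&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sinfo -h -O GresUsed:100&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/sh
# Example GRES strings, taken from the sinfo outputs shown above
total="gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5"
used="gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"

# Print the number of free GPUs of each type
free_gpus() {
    # Drop the "(IDX:...)" annotations so the used string splits cleanly on commas
    used_clean=$(printf '%s' "$2" | sed 's/([^)]*)//g')
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r _ type count; do
        u=$(printf '%s\n' "$used_clean" | tr ',' '\n' | grep "^gpu:$type:" | sed 's/.*://')
        echo "gpu:$type free=$((count - ${u:-0}))"
    done
}

free_gpus "$total" "$used"
```
&lt;br /&gt;
For the example strings above, this prints one &amp;lt;code&amp;gt;free=&amp;lt;/code&amp;gt; count per GPU type (1 free &amp;lt;code&amp;gt;40gb&amp;lt;/code&amp;gt;, 3 free &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt;, 2 free &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt;).&lt;br /&gt;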
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one used by jobs that do not request a specific partition.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2337</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2337"/>
		<updated>2026-05-04T13:36:38Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* The build QOS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
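&lt;br /&gt;
Putting these limits together, a job that fits entirely within the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; QOS could be launched, for instance, as follows (the script name is a placeholder, and the exact &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; options your job needs may differ):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical launch within the gpulight limits:
# 2 CPUs, 64 GB of RAM, one 3g.20gb GPU, 12 hours at most
srun --qos=gpulight --cpus-per-task=2 --mem=64G \
     --gres=gpu:3g.20gb:1 --time=12:00:00 ./my_job.sh
```
&lt;br /&gt;
Requesting less than these maxima (e.g., a shorter &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt;) is allowed and improves the job&amp;#039;s priority.&lt;br /&gt;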
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum job duration. This makes it unsuitable for computing activities other than building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
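&lt;br /&gt;
As an illustration only, a container build job consistent with the limits of the &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS could be launched along the following lines, where the image and recipe file names are placeholders (see the page linked above for the actual build procedure):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --cpus-per-task=2 --mem=16G --time=02:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;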
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to one of two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each of which grants its members a different level of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user has launched, which may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are currently in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
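&lt;br /&gt;
To check how many jobs you currently have submitted, and whether each of them is running or queued, you can use the standard SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists only your own jobs: in its output, the &amp;quot;ST&amp;quot; column shows &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; for running jobs and &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; for pending (i.e., queued) jobs.&lt;br /&gt;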
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. At any time, the job at the top of the queue is the first to be put into execution as soon as the resources it requires are available. Jobs are ordered in the queue according to their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., consumed fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases as that usage recedes further into the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (quoting [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the identifier of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available identifiers are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
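&lt;br /&gt;
For instance, a sketch of an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command requesting one GPU of type &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; might look like the following, where the script name is a placeholder and the other values mirror the limits of the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; QOS:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 --mem=64G --time=12:00:00 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;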
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
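&lt;br /&gt;
The two lists can also be obtained side by side with a single command, by combining the two format specifiers:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;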
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the two lines most relevant to Mufasa users have been highlighted with &amp;quot;-&amp;gt;&amp;quot;: they contain the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2336</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2336"/>
		<updated>2026-05-04T13:36:13Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* The build QOS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; definitions (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* the following access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
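&lt;br /&gt;
For illustration (the command below is an example, not a prescription: &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; is a placeholder for your actual executable, and all options are standard SLURM ones), a job staying within the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; limits could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;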
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich in resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS explicitly when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.&lt;br /&gt;
&lt;br /&gt;
This QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to one of two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each granting its members a different level of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa limits the number of jobs that a single user can have. The limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
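&lt;br /&gt;
For instance (a purely illustrative command, where &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; is a placeholder for your executable), this request could appear in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; invocation using the &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; QOS, which allows up to 2 GPUs of this type:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;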
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
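&lt;br /&gt;
As a purely illustrative sketch (not Mufasa&amp;#039;s actual configuration file), a &amp;lt;code&amp;gt;gres.conf&amp;lt;/code&amp;gt; relying on NVML autodetection can be as simple as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# GPUs are detected automatically via the NVML library&lt;br /&gt;
AutoDetect=nvml&lt;br /&gt;
&amp;lt;/pre&amp;gt;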
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a GPU type of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
prints a summary of all the GRES (i.e., GPU) resources installed on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
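&lt;br /&gt;
The two pieces of information can also be obtained side by side with a single command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;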
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it is the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (in the example above, 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (in the example above, 4 GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2335</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2335"/>
		<updated>2026-05-04T13:34:19Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
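&lt;br /&gt;
For instance (an illustrative command, where &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; is a placeholder for your executable), a job expected to finish within two and a half hours can request that time slot with SLURM&amp;#039;s standard &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=02:30:00 my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;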
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, the maximum duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; for QOSes that allow access to GPUs, a job that does not explicitly request a GPU cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
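&lt;br /&gt;
For example, a job that stays within these limits could be launched with an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command along these lines (here &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; is a placeholder for your own executable; the option values are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=12:00:00 --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;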
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always explicitly specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to let Mufasa users &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so SLURM jobs launched with this QOS get executed quickly. On the other hand, this QOS provides very limited resources (though fully sufficient for building operations), no access to GPUs and a short maximum job duration: it is therefore not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that a single user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job at the top of the queue is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue is determined by their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: higher-priority jobs reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
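&lt;br /&gt;
As a sketch, the guidelines above translate into &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directives (in an [[User Jobs#Using execution scripts to run jobs|execution script]]) that request only what the job actually needs; the QOS, duration and resource values below are purely illustrative:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --qos=gpu            # least powerful QOS that fits the job&lt;br /&gt;
#SBATCH --time=04:00:00      # worst-case estimate, not the QOS maximum&lt;br /&gt;
#SBATCH --cpus-per-task=1    # the code is single-threaded&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --gres=gpu:3g.20gb:1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;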
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
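&lt;br /&gt;
In an actual &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command, this request is passed via the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option, e.g. (with &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; as a placeholder for your own executable, and the other options omitted for brevity):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:4g.20gb:2 [...] my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;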
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU for which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
lists all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
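&lt;br /&gt;
The two columns can also be requested with a single command, which prints them side by side:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;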
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant for Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (in the example above, 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (in the example above, 4 GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2334</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2334"/>
		<updated>2026-05-04T13:33:01Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case, it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
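As an illustration (a sketch, not a prescription: remember to ask only for what your job actually needs), a request that uses the full &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; allowance could look like the following &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; invocation, where &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; is a placeholder for your own program and the options are standard SLURM ones:

```shell
# Hypothetical sketch: request the full allowance of the gpulight QOS
# (2 CPUs, 64 GB RAM, one 3g.20gb GPU, 12 hours of wall-clock time).
# "my_script.sh" is a placeholder for your own program.
srun --qos=gpulight --cpus-per-task=2 --mem=64G \
     --gres=gpu:3g.20gb:1 --time=12:00:00 ./my_script.sh
```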
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to encourage users to use the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, any job submitted with this QOS will never be executed.&lt;br /&gt;
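For this reason, execution scripts should always name a usable QOS explicitly. A minimal sketch of an execution script header follows (the &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directives are standard SLURM; the resource values and &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; are placeholders, chosen here to fit within the &amp;lt;code&amp;gt;nogpu&amp;lt;/code&amp;gt; limits shown above):

```shell
#!/bin/bash
#SBATCH --qos=nogpu          # always name a usable QOS: "normal" provides no resources
#SBATCH --cpus-per-task=4    # placeholder values: ask only for what the job needs
#SBATCH --mem=32G
#SBATCH --time=06:00:00

./my_program                 # placeholder for the actual computation
```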
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, to allow SLURM jobs launched using this QOS to be executed quickly. On the other hand, this QOS has very limited resources (though fully sufficient for building operations), no access to GPUs and a short maximum job duration: so it is not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each granting a different level of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when each job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
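Putting the pieces together, such a request could be passed to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; as sketched below, using a QOS that allows two GPUs of that type (as shown above, &amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; caps &amp;lt;code&amp;gt;gres/gpu:4g.20gb&amp;lt;/code&amp;gt; at 2); &amp;lt;code&amp;gt;my_job.sh&amp;lt;/code&amp;gt; is a placeholder:

```shell
# Hypothetical sketch: ask for 2 GPUs of type 4g.20gb via the gpuheavy-20 QOS,
# whose limits allow up to two GPUs of this type. "my_job.sh" is a placeholder.
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_job.sh
```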
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU for which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
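The comparison can also be scripted. The sketch below parses the two example outputs shown above to compute how many units of each GPU type are free; on Mufasa you would instead fill the two variables from &amp;lt;code&amp;gt;sinfo -h -O Gres:100&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sinfo -h -O GresUsed:100&amp;lt;/code&amp;gt; (this assumes both lists report the GPU types in the same order, as in the outputs above):

```shell
# Sketch: count free GPUs per type by comparing GRES totals with GRES_USED.
# The two strings below are the example outputs from this page; on Mufasa, fill
# them with: total=$(sinfo -h -O Gres:100) and used=$(sinfo -h -O GresUsed:100)
total="gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5"
used="gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"

t1=$(mktemp); t2=$(mktemp)
# One "type:count" entry per line; strip the "(IDX:...)" annotations from GRES_USED
printf '%s\n' "$total" | tr ',' '\n' > "$t1"
printf '%s\n' "$used" | sed 's/([^)]*)//g' | tr ',' '\n' > "$t2"

# Pair total and used entries by line, then subtract the trailing counts
free_report=$(paste -d' ' "$t1" "$t2" | while read -r t u; do
  echo "${t%:*} free=$(( ${t##*:} - ${u##*:} ))"
done)
rm -f "$t1" "$t2"

echo "$free_report"
```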
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2333</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2333"/>
		<updated>2026-05-04T13:30:42Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** maximum 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
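The bullets above are simply a decoded MaxTRES string. As an illustration, here is a minimal Python sketch (the &amp;lt;code&amp;gt;parse_maxtres&amp;lt;/code&amp;gt; helper is hypothetical, not part of SLURM) that splits a MaxTRES string such as the one of &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; into per-resource limits:

```python
def parse_maxtres(maxtres):
    """Split a SLURM MaxTRES string into a {resource: limit} dict."""
    limits = {}
    for item in maxtres.split(","):
        # each item looks like "cpu=2" or "gres/gpu:3g.20gb=1"
        key, _, value = item.partition("=")
        limits[key] = value
    return limits

# MaxTRES of the gpulight QOS, as shown by sacctmgr above
gpulight = "cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G"
limits = parse_maxtres(gpulight)
print(limits["cpu"])               # 2 CPUs at most
print(limits["gres/gpu:3g.20gb"])  # 1 GPU of that type at most
```

The same helper works for any of the QOS rows in the &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; output.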
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to let Mufasa users &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, so that SLURM jobs launched with this QOS are executed quickly. On the other hand, this QOS has very limited (but fully sufficient for building operations) resources, no access to GPUs and a short maximum duration for jobs: it is therefore not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue is determined by their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: higher-priority jobs reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
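As a sketch of this syntax, the following Python helper (hypothetical, for illustration only; the GPU type list is the Mufasa complement described above) assembles and sanity-checks such a request string before it is passed to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;:

```python
# GPU types available on Mufasa, per the gres list above (assumption:
# this set is current; check sinfo if in doubt).
GPU_TYPES = ("40gb", "4g.20gb", "3g.20gb")

def gres_request(gpu_type, quantity):
    """Build the gpu:Type:Quantity string that SLURM expects for --gres."""
    if gpu_type not in GPU_TYPES:
        raise ValueError(f"unknown GPU type: {gpu_type}")
    return f"gpu:{gpu_type}:{quantity}"

print(gres_request("4g.20gb", 2))  # gpu:4g.20gb:2
```

A request built this way would then appear on the command line as, e.g., &amp;lt;code&amp;gt;--gres=gpu:4g.20gb:2&amp;lt;/code&amp;gt;.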
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name marks the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
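Both DEFAULTTIME and TIMELIMIT use the same &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; notation. A small Python sketch (hypothetical helper, shown only to make the notation concrete) converts such a string to seconds:

```python
def slurm_seconds(duration):
    """Convert a SLURM '[days-]hours:minutes:seconds' string to seconds."""
    days, _, clock = duration.rpartition("-")  # days part is optional
    h, m, s = (int(x) for x in clock.split(":"))
    return int(days or 0) * 86400 + h * 3600 + m * 60 + s

print(slurm_seconds("1:00:00"))     # 3600   (the DEFAULTTIME above)
print(slurm_seconds("3-00:00:00"))  # 259200 (the TIMELIMIT above)
```

So the &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt; partition defaults to 1 hour of runtime and caps jobs at 3 days.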
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, the two lines most relevant to Mufasa users are highlighted with &amp;quot;-&amp;gt;&amp;quot;: they contain the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2332</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2332"/>
		<updated>2026-05-04T13:30:04Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.&lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
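For instance (a sketch: the option values are purely illustrative), a user who wants an interactive shell under the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; QOS selects it with the &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; option:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --time=02:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;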
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this value is not set: in that case, the maximum duration is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* this access to GPUs:&lt;br /&gt;
** access to a maximum of 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
** no access to GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
** no access to GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, allowing SLURM jobs launched with this QOS to be executed quickly. On the other hand, this QOS has very limited resources (though fully sufficient for building operations), no access to GPUs and a short maximum job duration: it is therefore not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
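&lt;br /&gt;
For instance (a sketch: the image and recipe file names are hypothetical), a build job could be launched with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;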
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each of which grants its users a different level of access to system resources.&lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user has launched, which may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue is determined by their &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;: higher-priority jobs sit closer to the top and therefore reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;: the influence of past resource usage decreases as it recedes further into the past&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does; if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
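&lt;br /&gt;
Putting the guidelines together, a job request could look like this (a sketch: the QOS, resource values and script name are illustrative, to be replaced with what your job actually needs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpu --time=04:00:00 --cpus-per-task=4 --mem=32G --gres=gpu:3g.20gb:1 python my_script.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;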
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
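For instance (a sketch: the QOS and the rest of the command line are illustrative), an interactive job asking for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; could be launched with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;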
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa. It provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
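&lt;br /&gt;
The two lists can also be printed side by side with a single command (a convenience, not a requirement):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;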
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it is the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs that do not request a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Specifically:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2331</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2331"/>
		<updated>2026-05-04T13:27:38Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their usage and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Services&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOSes&amp;#039;&amp;#039;&amp;#039;), SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always &amp;#039;&amp;#039;&amp;#039;specify the QOS&amp;#039;&amp;#039;&amp;#039; that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated with the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* access to a maximum of 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always explicitly specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, a job run with this QOS would never actually be executed.&lt;br /&gt;
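&lt;br /&gt;
As a minimal sketch of how a QOS is selected, the &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; option of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; can be used; the resource values shown here are illustrative examples, not recommendations:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --cpus-per-task=4 --mem=32G --time=02:00:00 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The requested resources must stay within the MaxTRES limits of the chosen QOS, otherwise the job cannot be launched.&lt;br /&gt;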
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, to allow SLURM jobs launched using this QOS to be executed quickly. On the other hand, this QOS has very limited (but fully sufficient for building operations) resources, no access to GPUs and a short maximum duration for jobs: so it is not suitable for other computing activities.&lt;br /&gt;
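&lt;br /&gt;
As a sketch of its intended use (the image and recipe file names below are hypothetical), a container build could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=build --time=01:00:00 singularity build my_image.sif my_recipe.def&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;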
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide their members with different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa limits the number of jobs that each user can have. The limits concern:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; that SLURM assigns to each job.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases the further back in time it occurred&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually uses them; if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
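&lt;br /&gt;
Taken together, the guidelines above translate into invocations similar to the following sketch (the script name is hypothetical and the values are examples only):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=1 --mem=16G --time=04:00:00 --gres=gpu:3g.20gb:1 python train.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;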
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against.&amp;#039;&amp;#039;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Specifically, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
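&lt;br /&gt;
For example, in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command the request could look like the following sketch (&amp;lt;code&amp;gt;gpuheavy-20&amp;lt;/code&amp;gt; is the QOS that allows 2 GPUs of this type; the program name is hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 --time=1-00:00:00 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;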
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a GPU type of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
lists all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
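&lt;br /&gt;
Both fields can also be requested in a single command, which makes the comparison easier:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;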
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have used &amp;quot;-&amp;gt;&amp;quot; to highlight the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2330</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2330"/>
		<updated>2026-05-04T13:26:53Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
Through &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) definitions, SLURM lets system configurators assign a name to a set of related constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always specify the QOS that they want to use: this choice, in turn, determines what resources the job will be able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* access to a maximum of 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
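&lt;br /&gt;
As a sketch (where &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; is a hypothetical executable), a job that respects the limits of &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --gres=gpu:3g.20gb:1 --time=12:00:00 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
See [[User Jobs|User Jobs]] for the full syntax of &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;.&lt;br /&gt;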
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, any job submitted with this QOS will never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, to allow SLURM jobs launched using this QOS to be executed quickly. On the other hand, this QOS has very limited (though fully sufficient for building operations) resources, no access to GPUs and a short maximum duration for jobs: so it is not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs per user. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The position of each job in the queue depends on its &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039;, which therefore determines when the job reaches execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
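&lt;br /&gt;
The contribution of most of these elements to the priority of currently queued jobs can be inspected with SLURM&amp;#039;s standard &amp;lt;code&amp;gt;sprio&amp;lt;/code&amp;gt; utility (assuming it is available on Mufasa):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sprio -l&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which lists, for each queued job, the age, FairShare, job size and QOS components of its priority.&lt;br /&gt;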
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
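&lt;br /&gt;
As a sketch, you can list your own jobs and their IDs with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;, then cancel a job (here &amp;lt;code&amp;gt;jobid&amp;lt;/code&amp;gt; is a placeholder for the actual job ID):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --me&lt;br /&gt;
scancel &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;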
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM, like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the possible values of &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; yield the following resource names:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
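&lt;br /&gt;
As a sketch (where &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; is a hypothetical executable), this request appears in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command as the argument of the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;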
&lt;br /&gt;
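For instance, in a hypothetical execution script the same request could appear as an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive (the script content below is a sketch: the program name is a placeholder, and the time value is just an example):&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/bash
#SBATCH --gres=gpu:4g.20gb:2   # 2 GPUs of type 4g.20gb, using the syntax above
#SBATCH --time=01:00:00        # requested time slot (example value)
srun ./my_program              # placeholder for the actual program to run
```
&lt;br /&gt;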
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
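&lt;br /&gt;
The comparison can also be automated with a short script; a minimal sketch (the two GRES strings are hardcoded from the example outputs above — in practice you would capture them from &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
```python
import re

def parse_gres(s):
    """Parse a SLURM GRES string like 'gpu:40gb:3,gpu:4g.20gb:5' into {type: count}."""
    # Drop index annotations such as '(IDX:5,8)' that appear in GRES_USED
    s = re.sub(r"\(.*?\)", "", s)
    counts = {}
    for item in s.split(","):
        _, gpu_type, count = item.split(":")
        counts[gpu_type] = int(count)
    return counts

# Example strings taken from the sinfo outputs shown above
total = parse_gres("gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5")
used = parse_gres("gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)")

free = {t: total[t] - used.get(t, 0) for t in total}
print(free)  # {'40gb': 1, '4g.20gb': 3, '3g.20gb': 2}
```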
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
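&lt;br /&gt;
The &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039; format used by DEFAULTTIME and TIMELIMIT can be converted to plain seconds for comparison; a minimal illustration of the format (not a SLURM utility):&lt;br /&gt;
&lt;br /&gt;
```python
def slurm_time_to_seconds(t):
    """Convert a SLURM time string '[days-]hours:minutes:seconds' to seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-")   # optional leading 'days-' component
        days = int(d)
    hours, minutes, seconds = (int(x) for x in t.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(slurm_time_to_seconds("1:00:00"))     # 3600   (the partition's default time)
print(slurm_time_to_seconds("3-00:00:00"))  # 259200 (the partition's 3-day limit)
```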
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. the two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2329</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2329"/>
		<updated>2026-05-04T13:26:08Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM Quality of Service (QOS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
SLURM provides a tool called &amp;#039;&amp;#039;&amp;#039;Quality of Service&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;QOS&amp;#039;&amp;#039;&amp;#039;) to let system configurators give a name to a set of constraints.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, QOSes are used to define different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always specify the QOS that they want to use: this choice, in turn, determines what resources the job will be able to access and influences the [[#Job priority|priority]] of the job.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s QOSes and their features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* access to a maximum of 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
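&lt;br /&gt;
Put together, a job that fits within these limits could be launched with a command like the following (the program name is a placeholder; the options mirror the &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; limits above):&lt;br /&gt;
&lt;br /&gt;
```shell
srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 \
     --mem=64G --time=12:00:00 ./my_program
```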
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Job priority|Job priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, any job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, to allow SLURM jobs launched using this QOS to be executed quickly. On the other hand, this QOS has very limited resources (fully sufficient, however, for building operations), no access to GPUs and a short maximum duration for jobs: so it is not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, each of which provides its members with different access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs of a user. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of the jobs, and defines the order in which each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;less resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
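&lt;br /&gt;
SLURM combines these elements through its multifactor priority plugin, essentially as a weighted sum of normalised factors. The sketch below illustrates the idea only: the weights are invented for the example and are not Mufasa&amp;#039;s actual configuration:&lt;br /&gt;
&lt;br /&gt;
```python
def job_priority(qos_factor, size_factor, age_factor, fairshare_factor,
                 w_qos=1000, w_size=500, w_age=200, w_fairshare=800):
    """Simplified sketch of multifactor priority: each factor is normalised
    to [0, 1] and multiplied by a configurable weight.
    The weights here are illustrative, not Mufasa's real settings."""
    return (w_qos * qos_factor + w_size * size_factor
            + w_age * age_factor + w_fairshare * fairshare_factor)

# A small job that has waited a while outranks a large, freshly queued one
small_old = job_priority(qos_factor=0.8, size_factor=0.9,
                         age_factor=0.5, fairshare_factor=0.6)
large_new = job_priority(qos_factor=0.2, size_factor=0.1,
                         age_factor=0.0, fairshare_factor=0.6)
print(small_old > large_new)  # True
```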
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: If you didn&amp;#039;t explicitly design your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. More precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are the following:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
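&lt;br /&gt;
For illustration (a sketch only: the program name &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; is a hypothetical placeholder), the same &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; request can appear on an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command line or as an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# on the command line&lt;br /&gt;
srun --qos=gpuheavy-20 --gres=gpu:4g.20gb:2 ./my_program&lt;br /&gt;
&lt;br /&gt;
# as a directive in an execution script&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;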
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
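&lt;br /&gt;
To see both pieces of information at once, the two fields can be combined in a single &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; call:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;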
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2328</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2328"/>
		<updated>2026-05-04T13:21:08Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM in a nutshell */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
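&lt;br /&gt;
For instance (a sketch only: &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt; is a hypothetical placeholder), a job using the &amp;lt;code&amp;gt;gpu&amp;lt;/code&amp;gt; QOS with a 2-hour time slot and one &amp;lt;code&amp;gt;3g.20gb&amp;lt;/code&amp;gt; GPU could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpu --time=02:00:00 --gres=gpu:3g.20gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;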
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, the different Quality of Services (QOSes) defined by SLURM correspond to different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always specify the QOS that they want to use: this choice, in turn, determines what resources the job will be able to access.&lt;br /&gt;
&lt;br /&gt;
The list of Mufasa&amp;#039;s QOSes and their main features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* access to a maximum of 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich in resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, any job submitted with this QOS would never actually run.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, to allow SLURM jobs launched using this QOS to be executed quickly. On the other hand, this QOS has very limited resources (though fully sufficient for building operations), no access to GPUs and a short maximum duration for jobs: so it is not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which provide their members with different access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs by the same user. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of the jobs, and defines the order in which each job will reach execution.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;a lower number of CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used less resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage gets lower the farther it is from now&lt;br /&gt;
&lt;br /&gt;
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually does: if it doesn&amp;#039;t, do not ask for them&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when finished (or if they become useless due to a bug): your Fairshare will improve&lt;br /&gt;
|}&lt;br /&gt;
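&lt;br /&gt;
For instance, a request following these guidelines could look like the command below (a purely illustrative sketch: the QOS, resource amounts and duration must be adapted to the actual needs of your job):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=1 --mem=32G --time=06:00:00 --gres=gpu:3g.20gb:1 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;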
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], &amp;lt;code&amp;gt;Type&amp;lt;/code&amp;gt; takes the following values:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
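&lt;br /&gt;
As an illustration, the request &amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt; could appear in an execution script as an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive like the following (a minimal sketch: the job command is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --qos=gpuheavy-20&lt;br /&gt;
#SBATCH --gres=gpu:4g.20gb:2&lt;br /&gt;
&lt;br /&gt;
# placeholder: replace with the actual command of your job&lt;br /&gt;
./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;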
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which at least one unit is not currently in use. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the GRES (i.e., GPU) resources possessed by Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
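&lt;br /&gt;
The two lists can also be printed side by side with a single command, by combining the two format specifiers:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100,GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;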
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the lines most relevant to Mufasa users, i.e. two &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; which are applied to jobs that do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2327</id>
		<title>SLURM</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=SLURM&amp;diff=2327"/>
		<updated>2026-05-04T13:19:03Z</updated>

		<summary type="html">&lt;p&gt;GiulioFontana: /* SLURM in a nutshell */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of SLURM that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa &amp;#039;&amp;#039;&amp;#039;must use SLURM&amp;#039;&amp;#039;&amp;#039; to run resource-heavy processes, i.e. computing jobs that require one or more of the following:&lt;br /&gt;
* GPUs&lt;br /&gt;
* multiple CPUs&lt;br /&gt;
* powerful CPUs&lt;br /&gt;
* a significant amount of RAM&lt;br /&gt;
&lt;br /&gt;
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).&lt;br /&gt;
&lt;br /&gt;
= SLURM in a nutshell =&lt;br /&gt;
&lt;br /&gt;
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability. &lt;br /&gt;
&lt;br /&gt;
When a user runs a job, the job does not get executed immediately and is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039;. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the &amp;#039;&amp;#039;&amp;#039;[[#Job priority|priority]]&amp;#039;&amp;#039;&amp;#039; assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:&lt;br /&gt;
&lt;br /&gt;
;: &amp;#039;&amp;#039;&amp;#039;the greater the fraction of Mufasa&amp;#039;s overall resources that a job asks for, the lower the job&amp;#039;s priority will be&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
The priority mechanism is used to encourage users to use Mufasa&amp;#039;s resources in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;time&amp;#039;&amp;#039;&amp;#039; available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.&lt;br /&gt;
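&lt;br /&gt;
With &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, the duration of the time slot is requested with the &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; option, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;. For instance (the values and the program name are purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=nogpu --time=1-12:00:00 my_program    # requests a slot of 1 day and 12 hours&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;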
&lt;br /&gt;
In Mufasa 2.0 access to system resources is managed via SLURM&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]]&amp;#039;&amp;#039;&amp;#039; mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.&lt;br /&gt;
&lt;br /&gt;
= SLURM Quality of Service (QOS) =&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, the different Quality of Services (QOSes) defined by SLURM correspond to different levels of access to the server&amp;#039;s resources. When [[User Jobs|executing a job with SLURM]], a user must always specify the QOS that they want to use: this choice, in turn, determines what resources the job will be able to access.&lt;br /&gt;
&lt;br /&gt;
The list of Mufasa&amp;#039;s QOSes and their main features can be inspected with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          &lt;br /&gt;
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- &lt;br /&gt;
normal               0                                                                                                        &lt;br /&gt;
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G &lt;br /&gt;
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             &lt;br /&gt;
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             &lt;br /&gt;
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              &lt;br /&gt;
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              &lt;br /&gt;
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns of this output are the following:&lt;br /&gt;
&lt;br /&gt;
:; Name&lt;br /&gt;
:: name of the QOS&lt;br /&gt;
&lt;br /&gt;
:; Priority&lt;br /&gt;
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details&lt;br /&gt;
&lt;br /&gt;
:; MaxSubmit&lt;br /&gt;
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs&lt;br /&gt;
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.&lt;br /&gt;
&lt;br /&gt;
:; MaxWall&lt;br /&gt;
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
:: For some QOSes this limit is not set: in that case it is determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.&lt;br /&gt;
&lt;br /&gt;
:; MaxTRES&lt;br /&gt;
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] (&amp;quot;&amp;#039;&amp;#039;Trackable RESources&amp;#039;&amp;#039;&amp;quot;) available to a job using the QOS, where&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cpu=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of CPUs (i.e., processor cores) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;gres/&amp;#039;&amp;#039;gpu:Type&amp;#039;&amp;#039;=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum number of GPUs of class &amp;lt;code&amp;gt;&amp;#039;&amp;#039;Type&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; (see [[User Jobs#gres syntax|&amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax]]) is &amp;#039;&amp;#039;K&amp;#039;&amp;#039;&lt;br /&gt;
::: --&amp;gt; (for QOSes that allow access to GPUs) if not specified, the job cannot be launched&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;mem=&amp;#039;&amp;#039;K&amp;#039;&amp;#039;G&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; means that the maximum amount of system RAM is &amp;#039;&amp;#039;K&amp;#039;&amp;#039; GBytes&lt;br /&gt;
::: --&amp;gt; if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]&lt;br /&gt;
&lt;br /&gt;
For instance, QOS &amp;lt;code&amp;gt;gpulight&amp;lt;/code&amp;gt; provides jobs that use it with:&lt;br /&gt;
* priority tier equal to 8&lt;br /&gt;
* a maximum of 1 submitted job per user&lt;br /&gt;
* a maximum of 12 hours of duration&lt;br /&gt;
* a maximum of 2 CPUs&lt;br /&gt;
* a maximum of 64 GB of RAM&lt;br /&gt;
* access to a maximum of 1 GPU of type &amp;#039;&amp;#039;gpu:3g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:40gb&amp;#039;&amp;#039;&lt;br /&gt;
* no access to GPUs of type &amp;#039;&amp;#039;gpu:4g.20gb&amp;#039;&amp;#039;&lt;br /&gt;
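&lt;br /&gt;
As a purely illustrative sketch (the program name is a placeholder), a job requesting the full set of resources allowed by this QOS could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --qos=gpulight --cpus-per-task=2 --mem=64G --time=12:00:00 --gres=gpu:3g.20gb:1 my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;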
&lt;br /&gt;
As seen in the example output from &amp;lt;code&amp;gt;sacctmgr list qos&amp;lt;/code&amp;gt; above, each QOS has an associated &amp;#039;&amp;#039;&amp;#039;priority tier&amp;#039;&amp;#039;&amp;#039;. In Mufasa 2.0, priority tiers are used to steer users towards using the &amp;#039;&amp;#039;&amp;#039;least powerful QOS that is compatible with their needs&amp;#039;&amp;#039;&amp;#039;, where &amp;quot;powerful&amp;quot; means &amp;quot;rich with resources&amp;quot;. Powerful QOSes lower the priority of the jobs that use them, to encourage users to prefer less powerful QOSes.&lt;br /&gt;
&lt;br /&gt;
See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; QOS is the default one: it exists only to ensure that users always explicitly specify a QOS when running a job. Since &amp;lt;code&amp;gt;normal&amp;lt;/code&amp;gt; has zero priority and no resources, any job submitted with this QOS would never be executed.&lt;br /&gt;
&lt;br /&gt;
== The &amp;lt;code&amp;gt;build&amp;lt;/code&amp;gt; QOS ==&lt;br /&gt;
&lt;br /&gt;
This QOS is specifically designed to be used by Mufasa users to &amp;#039;&amp;#039;&amp;#039;build [[System#Containers|container images]]&amp;#039;&amp;#039;&amp;#039;. Its associated priority tier is very high, to allow SLURM jobs launched using this QOS to be executed quickly. On the other hand, this QOS has very limited resources (though fully sufficient for building operations), no access to GPUs and a short maximum duration for jobs: so it is not suitable for other computing activities.&lt;br /&gt;
&lt;br /&gt;
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.&lt;br /&gt;
&lt;br /&gt;
== QOS restrictions ==&lt;br /&gt;
&lt;br /&gt;
Some of the QOSes are not available to M.Sc. students. See [[#Research users and students users|Research users and students users]] to understand the differences between the two categories of users of Mufasa and to find out what access you have to resources.&lt;br /&gt;
&lt;br /&gt;
= Research users and students users =&lt;br /&gt;
&lt;br /&gt;
Users of Mufasa belong to two &amp;#039;&amp;#039;&amp;#039;user categories&amp;#039;&amp;#039;&amp;#039;, which grant their members different levels of access to system resources. &lt;br /&gt;
&lt;br /&gt;
User categories are:&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. academic personnel and Ph.D. students&lt;br /&gt;
::: * have access to all [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a higher &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is higher&lt;br /&gt;
&lt;br /&gt;
:: &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;, i.e. M.Sc. students&lt;br /&gt;
::: * do not have access to some [[#SLURM Quality of Service (QOS)|QOSes]]&lt;br /&gt;
::: * their jobs have a lower &amp;#039;&amp;#039;base priority&amp;#039;&amp;#039;&lt;br /&gt;
::: * the number of running jobs that the user can have is lower&lt;br /&gt;
&lt;br /&gt;
You can inspect the differences between &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association format=&amp;quot;account,priority,maxjobs&amp;quot; | grep -E &amp;#039;Account|research|students&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
   Account   Priority MaxJobs &lt;br /&gt;
  research          4       2 &lt;br /&gt;
  students          1       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know what limits apply to your own user, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacctmgr list association where user=$USER format=&amp;quot;user,priority,maxjobs,qos%-60&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
      User   Priority MaxJobs QOS                                                          &lt;br /&gt;
---------- ---------- ------- ------------------------------------------------------------ &lt;br /&gt;
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The list under &amp;quot;QOS&amp;quot; shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users can use all of them, while &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users can only access a subset of them.&lt;br /&gt;
&lt;br /&gt;
= Limits on jobs by the same user =&lt;br /&gt;
&lt;br /&gt;
Mufasa sets limits on the number of jobs that each user can have. Limits involve:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;submitted jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that the user launched and may be either running or queued&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;running jobs&amp;#039;&amp;#039;&amp;#039;, i.e. jobs that are in execution&lt;br /&gt;
&lt;br /&gt;
The following table summarises the limits that Mufasa sets on the number of jobs by the same user:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! &lt;br /&gt;
! number of running jobs&lt;br /&gt;
! number of submitted jobs&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | global limits (system-wide)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt; users&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the sum of the limits on submitted jobs set by the QoSes (below)&lt;br /&gt;
|-&lt;br /&gt;
! rowspan=&amp;quot;1&amp;quot; style=&amp;quot;text-align:center;&amp;quot; | limits for each QoS&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;not limited directly...&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;...but cannot exceed the global limit on running jobs (above) nor the QoS limit on submitted jobs (on the right)&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;2 for &amp;lt;code&amp;gt;gpuwide&amp;lt;/code&amp;gt; QoS&amp;#039;&amp;#039;&amp;#039;&amp;lt;br/&amp;gt;&amp;#039;&amp;#039;&amp;#039;1 for all other QoSes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Limits on the number of running jobs depend on the user category (either [[#Research users and students users|research or students]]) that the user belongs to; limits on the number of submitted jobs depend on the properties of the [[#SLURM Quality of Service (QOS)|SLURM QOSes]] used to launch them.&lt;br /&gt;
&lt;br /&gt;
= Job priority =&lt;br /&gt;
&lt;br /&gt;
Once the execution of a job has been requested, the job is not run immediately: it is instead &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the &amp;#039;&amp;#039;&amp;#039;priority&amp;#039;&amp;#039;&amp;#039; of each job: higher-priority jobs are closer to the top and therefore reach execution sooner.&lt;br /&gt;
&lt;br /&gt;
SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution.&lt;br /&gt;
&lt;br /&gt;
To achieve this goal, SLURM &amp;#039;&amp;#039;&amp;#039;encourages users to avoid asking for resources or execution time that their job does not need&amp;#039;&amp;#039;&amp;#039;. The more resources and the more time a job requests, the lower its priority in the execution queue will be.&lt;br /&gt;
&lt;br /&gt;
This mechanism creates a &amp;#039;&amp;#039;&amp;#039;virtuous cycle&amp;#039;&amp;#039;&amp;#039;. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.&lt;br /&gt;
&lt;br /&gt;
== Elements determining job priority ==&lt;br /&gt;
In Mufasa, the priority of a job is computed by SLURM according to the following elements:&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#Research users and students users|User category]]&amp;#039;&amp;#039;&amp;#039; (i.e., &amp;lt;code&amp;gt;research&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;students&amp;lt;/code&amp;gt;)&lt;br /&gt;
::: Used to provide higher priority to jobs run by &amp;#039;&amp;#039;&amp;#039;research personnel&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;[[#SLURM Quality of Service (QOS)|QOS]]&amp;#039;&amp;#039;&amp;#039; used by the job&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer resources&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Number of CPUs&amp;#039;&amp;#039;&amp;#039; requested by the job (also called &amp;quot;job size&amp;quot;)&lt;br /&gt;
::: Used to provide higher priority to jobs asking for &amp;#039;&amp;#039;&amp;#039;fewer CPUs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job duration&amp;#039;&amp;#039;&amp;#039;, i.e. the execution time requested by the job&lt;br /&gt;
::: Used to provide higher priority to &amp;#039;&amp;#039;&amp;#039;shorter jobs&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Job Age&amp;#039;&amp;#039;&amp;#039;, i.e. the time that the job has been waiting in the queue&lt;br /&gt;
::: Used to provide higher priority to jobs which have been &amp;#039;&amp;#039;&amp;#039;queued for a longer time&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;FairShare&amp;#039;&amp;#039;&amp;#039;, i.e. a factor computed by SLURM to balance use of the system by different users&lt;br /&gt;
::: Used to provide higher priority to jobs by users who &amp;#039;&amp;#039;&amp;#039;used Mufasa less than others&amp;#039;&amp;#039;&amp;#039; (i.e., used fewer resources: CPUs, GPUs, RAM, execution time)&lt;br /&gt;
::: FairShare has a &amp;quot;fading memory&amp;quot;, i.e. the influence of past resource usage decreases as that usage recedes further into the past&lt;br /&gt;
&lt;br /&gt;
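SLURM exposes the per-factor breakdown of each queued job's priority via the standard `sprio` command; assuming SLURM's client tools are in your PATH, you can check how these elements combine for jobs in the queue:

```
# Long listing: one row per queued job, with the contribution of
# each priority factor (AGE, FAIRSHARE, JOBSIZE, QOS, ...) shown
# in a separate column.
sprio -l

# Restrict the listing to your own pending jobs
sprio -l -u $USER
```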
== How to maximise the priority of your jobs ==&lt;br /&gt;
&lt;br /&gt;
Every time you run a SLURM job, follow these guidelines:&lt;br /&gt;
&lt;br /&gt;
:{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&lt;br /&gt;
; Choose the least powerful QOS compatible with the needs of your job&lt;br /&gt;
:: QOSes with access to fewer resources lead to higher priority&lt;br /&gt;
&lt;br /&gt;
; Only request CPUs that your job will actually use&lt;br /&gt;
:: Unless you designed your code to exploit multiple CPUs, check whether it actually uses them: if it doesn&amp;#039;t, do not ask for more than one&lt;br /&gt;
&lt;br /&gt;
; Do not request more time than your job needs to complete&lt;br /&gt;
:: Make a worst-case estimate and only ask for that duration&lt;br /&gt;
&lt;br /&gt;
; Test and debug your code using less powerful QOSes before running it on more powerful QOSes&lt;br /&gt;
:: Your test jobs will get a higher priority and your FairShare will improve&lt;br /&gt;
&lt;br /&gt;
; Cancel jobs when you don&amp;#039;t need them anymore&lt;br /&gt;
:: [[User_Jobs#Cancelling_a_job_with_scancel|Use scancel]] to delete your jobs when they are finished (or if a bug has made them useless): your FairShare will improve&lt;br /&gt;
|}&lt;br /&gt;
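As a sketch of how these guidelines translate into an actual request (the QOS name `gpulight` is one of the QOSes defined on Mufasa; the CPU, memory and time values below are purely illustrative and should be adapted to your job's real needs):

```
# Request a small interactive job: a low-powered QOS, 2 CPUs,
# 8 GB of RAM and a 2-hour time limit. Ask only for what the
# job actually needs: smaller requests mean higher priority.
srun --qos=gpulight --cpus-per-task=2 --mem=8G --time=02:00:00 --pty bash
```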
&lt;br /&gt;
= System resources subjected to limitations =&lt;br /&gt;
&lt;br /&gt;
In systems based on SLURM like Mufasa, &amp;#039;&amp;#039;&amp;#039;TRES (Trackable RESources)&amp;#039;&amp;#039;&amp;#039; are (from [https://slurm.schedmd.com/tres.html SLURM&amp;#039;s documentation]) &amp;quot;&amp;#039;&amp;#039;resources that can be tracked for usage or used to enforce limits against&amp;#039;&amp;#039;&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
TRES include CPUs, RAM and &amp;#039;&amp;#039;&amp;#039;GRES&amp;#039;&amp;#039;&amp;#039;. The last term stands for &amp;#039;&amp;#039;Generic RESources&amp;#039;&amp;#039; that a job may need for its execution. In Mufasa, the only &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; resources are the GPUs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; syntax ==&lt;br /&gt;
&lt;br /&gt;
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:Type&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], the available GPU resource names are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:40gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 40 Gbytes of RAM&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:4g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 4 compute units&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; for GPUs with 20 Gbytes of RAM and 3 compute units&lt;br /&gt;
&lt;br /&gt;
So, for instance,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:3g.20gb&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.&lt;br /&gt;
&lt;br /&gt;
When asking for a GRES resource (e.g., in an &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command or an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gpu:&amp;lt;Type&amp;gt;:&amp;lt;Quantity&amp;gt;&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;Quantity&amp;lt;/code&amp;gt; is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type &amp;lt;code&amp;gt;4g.20gb&amp;lt;/code&amp;gt; the syntax is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;gpu:4g.20gb:2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
SLURM&amp;#039;s &amp;#039;&amp;#039;generic resources&amp;#039;&amp;#039; are defined in &amp;lt;code&amp;gt;/etc/slurm/gres.conf&amp;lt;/code&amp;gt;. In order to make GPUs available to SLURM&amp;#039;s &amp;lt;code&amp;gt;gres&amp;lt;/code&amp;gt; management, Mufasa makes use of Nvidia&amp;#039;s [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM&amp;#039;s documentation].&lt;br /&gt;
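For instance, the `--gres` option of `srun` (or the corresponding `#SBATCH --gres=` directive of an execution script) accepts exactly this syntax; `./my_program` below is a placeholder for your own executable:

```
# Run a job with 2 GPUs of type 4g.20gb and a 1-hour time limit.
# './my_program' is a placeholder: replace it with your command.
srun --gres=gpu:4g.20gb:2 --time=01:00:00 ./my_program
```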
&lt;br /&gt;
== Looking for unused GPUs ==&lt;br /&gt;
&lt;br /&gt;
GPUs are usually the scarcest resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU of which one or more units are currently unused. This command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O Gres:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides a summary of all the Gres (i.e., GPU) resources available on Mufasa, with an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES                                                                                                &lt;br /&gt;
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To know which of the GPUs are currently in use, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -O GresUsed:100&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GRES_USED&lt;br /&gt;
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.&lt;br /&gt;
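The comparison can also be scripted. The snippet below is a minimal sketch: `free_gpus` is a hypothetical helper (not a Mufasa command), and the two strings are hard-coded copies of the example `sinfo` outputs shown above:

```shell
#!/bin/sh
# Hard-coded copies of the example outputs of
# 'sinfo -O Gres:100' and 'sinfo -O GresUsed:100'.
gres_total='gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5'
gres_used='gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)'

# For each 'gpu:Type:Count' entry in the GRES list, subtract the
# count of in-use GPUs of the same Type found in the GRES_USED list.
free_gpus() {
    echo "$1" | tr ',' '\n' | while IFS=: read -r name type total; do
        used=$(echo "$2" | tr ',' '\n' | grep ":${type}:" \
               | sed 's/.*:\([0-9]*\)(.*/\1/')
        echo "${type}: $((total - ${used:-0})) free"
    done
}

free_gpus "$gres_total" "$gres_used"
# Prints:
# 40gb: 1 free
# 4g.20gb: 3 free
# 3g.20gb: 2 free
```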
&lt;br /&gt;
= SLURM partitions =&lt;br /&gt;
&lt;br /&gt;
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. &lt;br /&gt;
&lt;br /&gt;
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].&lt;br /&gt;
&lt;br /&gt;
In Mufasa 2.0, there is a single SLURM partition, called &amp;lt;code&amp;gt;jobs&amp;lt;/code&amp;gt;, and all jobs run on it. The partition status of Mufasa can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%10P %5a %9T %11L %10l&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT &lt;br /&gt;
jobs*      up    idle      1:00:00     3-00:00:00&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where columns correspond to the following information:&lt;br /&gt;
&lt;br /&gt;
:; PARTITION&lt;br /&gt;
:: name of the partition; the asterisk indicates that it&amp;#039;s the default one&lt;br /&gt;
&lt;br /&gt;
:; AVAIL&lt;br /&gt;
:: state/availability of the partition: see [[#Partition availability|below]]&lt;br /&gt;
&lt;br /&gt;
:; STATE&lt;br /&gt;
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])&lt;br /&gt;
:: typical values are &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;mixed&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that some of the resources are busy executing jobs while others are idle, and &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;allocated&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; - meaning that all of the resources are in use&lt;br /&gt;
&lt;br /&gt;
:; DEFAULTTIME&lt;br /&gt;
:: default runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:; TIMELIMIT&lt;br /&gt;
:: maximum runtime of a job, in format &amp;#039;&amp;#039;[days-]hours:minutes:seconds&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; provides is the &amp;#039;&amp;#039;&amp;#039;availability&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. This is shown in column &amp;quot;AVAIL&amp;quot;. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is available&lt;br /&gt;
:: It&amp;#039;s possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Currently queued jobs will be executed as soon as resources allow&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is in the process of becoming unavailable (i.e., of entering the &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; state)&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: Currently running jobs will be completed&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; = the partition is unavailable&lt;br /&gt;
:: It&amp;#039;s not possible to launch jobs on the partition&lt;br /&gt;
:: There are no running jobs&lt;br /&gt;
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; state)&lt;br /&gt;
&lt;br /&gt;
When a partition goes from &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; no harm is done to running jobs. When a partition passes from any other state to &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt;, running jobs (if they exist) get killed. A partition in state &amp;lt;code&amp;gt;drain&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;down&amp;lt;/code&amp;gt; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;lt;code&amp;gt;up&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Default values ==&lt;br /&gt;
&lt;br /&gt;
The features of SLURM partitions can be inspected with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which provides an output similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PartitionName=jobs&lt;br /&gt;
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40&lt;br /&gt;
   AllocNodes=ALL Default=YES QoS=N/A&lt;br /&gt;
-&amp;gt; DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO&lt;br /&gt;
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED&lt;br /&gt;
   Nodes=gn01&lt;br /&gt;
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO&lt;br /&gt;
   OverTimeLimit=NONE PreemptMode=OFF&lt;br /&gt;
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE&lt;br /&gt;
   JobDefaults=(null)&lt;br /&gt;
-&amp;gt; DefMemPerNode=4096 MaxMemPerNode=UNLIMITED&lt;br /&gt;
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5&lt;br /&gt;
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the example, we have highlighted with &amp;quot;-&amp;gt;&amp;quot; the two lines most relevant to Mufasa users: they contain the &amp;#039;&amp;#039;&amp;#039;default values&amp;#039;&amp;#039;&amp;#039; that are applied to jobs which do not make explicit requests. Precisely:&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefaultTime&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;code&amp;gt;DefMemPerNode&amp;lt;/code&amp;gt;&lt;br /&gt;
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)&lt;/div&gt;</summary>
		<author><name>GiulioFontana</name></author>
	</entry>
</feed>