Difference between revisions of "SLURM"

From Mufasa (BioHPC)
 
(346 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page presents the features of SLURM that are most relevant to Mufasa's [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).
This page presents the features of SLURM that are most relevant to Mufasa's [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).


Users of Mufasa '''must use SLURM''' to run resource-heavy processes, i.e. computing jobs that require any of the following:
Users of Mufasa '''must use SLURM''' to run resource-heavy processes, i.e. computing jobs that require one or more of the following:
* GPUs
* GPUs
* multiple CPUs
* multiple CPUs
* a significant amount of RAM.
* powerful CPUs
* a significant amount of RAM


In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the [[System#Login server|login server]] virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).


= <span style="background:#FFFF00">SLURM in a nutshell</span> =
= SLURM in a nutshell =


Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.
Computation jobs on Mufasa need to be launched via [[System#The SLURM job scheduling system|SLURM]]. SLURM provides jobs with access to the [[#System resources subjected to limitations|physical resources]] of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.


When a user runs a job, the job does not get executed immediately and is instead ''queued''. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the '''[[#Job priority|priority]]''' assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule,
When a user runs a job, the job does not get executed immediately and is instead ''queued''. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the '''[[#Job priority|priority]]''' assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:
;: '''The greater the fraction of Mufasa's overall resources that a job asks for, the lower its priority'''.


The '''time''' available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.
;: '''the greater the fraction of Mufasa's overall resources that a job asks for, the lower the job's priority will be'''.


In Mufasa 2.0 (this was different in Mufasa 1.0) access to system resources is managed via SLURM's '''[[#Quality of Service|Quality of Service (QOS)]]''' mechanism. To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.
The priority mechanism is used to encourage users to use Mufasa's resources (i.e.: GPUs, CPUs, RAM, execution time) in an effective and equitable manner. This page includes a [[#How_to_maximise_the_priority_of_your_jobs|chart explaining how to maximise the priority of your jobs]].


= <span style="background:#FFFF00">Quality of Service (QOS)</span> =
The '''time''' available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, because SLURM kills it when the slot ends.


In SLURM, different Quality of Services (QOSes) define different levels of access to the server's resources. SLURM jobs must always specify the QOS that they use: this choice determines what resources the job can access.
In Mufasa 2.0 access to system resources is managed via SLURM's [[#SLURM Quality of Service (QOS)|Quality of Service (QOS)]] mechanism (Mufasa 1.0 used [[#SLURM_partitions|partitions]] instead). While [[User Jobs#Running jobs with SLURM|launching a processing job via SLURM]], the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.


The list of Mufasa's QOSes and their main features can be inspected with command
Mufasa sets limits on the number of jobs by the same user. This page includes a [[#Limits on jobs by the same user|table summarising such limits]].
 
= SLURM Quality of Service (QOS) =
 
Through '''Quality of Services''' ('''QOSes'''), SLURM lets system configurators assign a name to a set of related constraints.
 
In Mufasa 2.0, QOSes are used to define different levels of access to the server's resources. When [[User Jobs|executing a job with SLURM]], a user must always '''specify the QOS''' that their job will use: this choice, in turn, determines what resources the job is able to access and influences the [[#Job priority|priority]] of the job.
 
Mufasa's QOSes and their features can be inspected with command


<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
sacctmgr list qos format=name%-11,priority,MaxJobsPerUser,maxwall,maxtres%-80
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80,mintres%-18
</pre>
</pre>


Line 32: Line 40:


<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
Name          Priority MaxJobsPU     MaxWall MaxTRES                                                                           
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          MinTRES           
----------- ---------- --------- ----------- --------------------------------------------------------------------------------  
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- ------------------  
normal              0                                                                                                      
normal              0                                                                                                                          
nogpu                4        1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G  
nogpu                4        1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G                  
gpuheavy-20          1        1            cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             
gpuheavy-20          1        1            cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G            gres/gpu:4g.20gb=1
gpuheavy-40          1        1            cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             
gpuheavy-40          1        1            cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G            gres/gpu:40gb=1   
gpulight            8        1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G               
gpulight            8        1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              gres/gpu:3g.20gb=1
gpu                  2        1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G               
gpu                  2        1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              gres/gpu:3g.20gb=1
gpuwide              2        2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G</pre>
gpuwide              2        2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G             gres/gpu:4g.20gb=1
build              32        1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G
</pre>


The columns of this output are the following:
The columns of this output are the following:


; Name
:; Name
: name of the QOS
:: name of the QOS


; Priority
:; Priority
: priority tier associated to the QOS
:: priority tier associated to the QOS (higher value = higher priority): see [[#Job priority|Job priority]] for details


; MaxJobsPU
:; MaxSubmit
: maximum number of jobs from a single user that can be running with this QOS
:: maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs
: (note that there are also [[Research users and students users|other limitations]] on the number of running jobs by the same user)
:: See [[#Limits on jobs by the same user|Limits on jobs by the same user]] for an overview of the limits on jobs set by Mufasa.


; MaxWall
:; MaxWall
: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format ''[days-]hours:minutes:seconds''
:: maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format ''[days-]hours:minutes:seconds''
:: For some QOSes these are not set: it means that they are determined by the [[#SLURM partitions|partition]]. Partitions also define the [[#Default values|default duration]] of jobs.


; MaxTRES
:; MaxTRES
: maximum amount of resources ("''Trackable RESources''") available to a job using the QOS, where
:: amount of [[#System resources subjected to limitations|resources subjected to limitations]] ("''Trackable RESources''") available to a job using the QOS, where
: <code>'''cpu=''K'''''</code> means that the maximum number of processor cores is ''K''
:: <code>'''cpu=''K'''''</code> means that the maximum number of CPUs (i.e., processor cores) is ''K''
: <code>'''gres/''gpu:Type''=''K'''''</code> means that the maximum number of GPUs of class <code>''Type''</code> (see [[User Jobs#gres syntax|<code>gres</code> syntax]]) is ''K''
::: --> if not specified, the job gets the default amount of CPUs specified by the [[#SLURM partitions|partition]]
: <code>'''mem=''K''G'''</code> means that the maximum amount of system RAM is ''K'' GBytes
:: <code>'''gres/''gpu:Type''=''K'''''</code> means that the maximum number of GPUs of class <code>''Type''</code> (see [[User Jobs#gres syntax|<code>gres</code> syntax]]) is ''K''
::: --> (for QOSes that allow access to GPUs) if not specified, the job cannot be launched
:: <code>'''mem=''K''G'''</code> means that the maximum amount of system RAM is ''K'' GBytes
::: --> if not specified, the job gets the default amount of RAM specified by the [[#SLURM partitions|partition]]


For instance, QOS <code>gpulight</code> provides jobs that use it with:
:; MinTRES
* priority tier 8
:: minimum amount of [[#System resources subjected to limitations|resources subjected to limitations]] ("''Trackable RESources''") that a job using the QOS must request in order to actually get executed by SLURM.
* a maximum of 1 running job per user
:: (If your job does not actually need these resources, you've chosen the wrong QOS: you can use one with a higher priority.)
* a maximum of 12 hours of duration
* a maximum of 2 CPUs
* a maximum of 64 GB of RAM
* access to a maximum of 1 GPU of type ''gpu:3g.20gb''
* no access to GPUs of type ''gpu:40gb''
* no access to GPUs of type ''gpu:4g.20gb''
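
A <code>MaxTRES</code> value is a comma-separated list of <code>key=value</code> pairs, so individual limits can be picked out with standard shell tools. The following is a minimal sketch, not part of Mufasa's tooling: the helper name <code>tres_limit</code> is hypothetical, and the sample string is copied from the <code>gpulight</code> row of the example output.

```shell
# Sample MaxTRES value copied from the gpulight row of the example output.
maxtres='cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G'

# tres_limit KEY STRING (hypothetical helper): print the limit associated
# with KEY in a MaxTRES string, or nothing if KEY is absent.
tres_limit() {
    printf '%s\n' "$2" | tr ',' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

tres_limit cpu "$maxtres"                 # prints 2
tres_limit mem "$maxtres"                 # prints 64G
tres_limit gres/gpu:3g.20gb "$maxtres"    # prints 1
```

A value of 0 for a GPU type means that the QOS gives no access to GPUs of that type.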




The <code>normal</code> QOS is the one applied to jobs if no QOS is specified. <code>normal</code> provides no access at all to Mufasa's resources, so '''it is always necessary to specify a QOS''' (different from <code>normal</code>) when running a job via SLURM.
The <code>normal</code> QOS is the default one, and exists only to ensure that users always specify a QOS when running a job. Since <code>normal</code> has zero priority and no resources, a job run using this QOS would never be run.


; Important! Some of the QOSes may be available only to a subset of users. In Mufasa, such a limitation is associated with the [[#Research users and students users|category]] that users belong to.
The information provided by the <code>sacctmgr list qos</code> command above is summarised by the following table:


As seen in the example output from <code>sacctmgr list qos</code> above, each QOS has an associated '''priority tier'''. As a rule, the more powerful (i.e., rich with resources) a QOS is, the lower the priority of the jobs that use it. See [[#Priority|Priority]] to understand how priority affects the execution order of jobs in Mufasa 2.0.
{| class="wikitable" style="text-align:center;"
|-
! QOS
! Priority<br/>tier
! Max<br/>Submit
! MaxWall<br/>[h]
! max # of<br/>CPUs
! max RAM<br/>[GB]
! max # of<br/>3g.20GB<br/>GPUs
! max # of<br/>4g.20GB<br/>GPUs
! max # of<br/>40GB<br/>GPUs
! MinTRES<br/>(GPUs)
|-
! rowspan="1" style="text-align:center;" | build
| 32
| 1
| 2
| 2
| 16
| -
| -
| -
| -
|-
! rowspan="1" style="text-align:center;" | gpulight
| 8
| 1
| 12
| 2
| 64
| 1
| -
| -
| one 3g.20GB
|-
! rowspan="1" style="text-align:center;" | nogpu
| 4
| 1
| 72
| 16
| 128
| -
| -
| -
| -
|-
! rowspan="1" style="text-align:center;" | gpu
| 2
| 1
| 24
| 8
| 64
| 1
| -
| -
| one 3g.20GB
|-
! rowspan="1" style="text-align:center;" | gpuwide
| 2
| 2
| 24
| 8
| 64
| -
| 1
| -
| one 4g.20GB
|-
! rowspan="1" style="text-align:center;" | gpuheavy-20
| 1
| 1
| 72 (set by partition)
| 8
| 128
| -
| 2
| -
| one 4g.20GB
|-
! rowspan="1" style="text-align:center;" | gpuheavy-40
| 1
| 1
| 72 (set by partition)
| 8
| 128
| -
| -
| 1
| one 40GB
|}
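
The <code>MaxWall [h]</code> column of the table is the <code>MaxWall</code> value of the command output converted from the ''[days-]hours:minutes:seconds'' format to hours. A minimal sketch of that conversion follows; the function name <code>maxwall_to_hours</code> is hypothetical, and minutes and seconds are simply ignored, which is sufficient for the values shown on this page.

```shell
# maxwall_to_hours VALUE (hypothetical helper): convert a duration in
# [days-]hours:minutes:seconds format to whole hours.
# Minutes and seconds are ignored.
maxwall_to_hours() {
    value=$1
    days=0
    case $value in
        *-*) days=${value%%-*}; value=${value#*-} ;;   # split off the days part
    esac
    hours=${value%%:*}
    hours=${hours#0}    # drop one leading zero so "08" is not read as octal
    echo $(( days * 24 + ${hours:-0} ))
}

maxwall_to_hours 3-00:00:00   # prints 72 (nogpu)
maxwall_to_hours 1-00:00:00   # prints 24 (gpu, gpuwide)
maxwall_to_hours 12:00:00     # prints 12 (gpulight)
maxwall_to_hours 02:00:00     # prints 2  (build)
```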


The maximum amount of resources that a QOS has access to (available to the running jobs using the QOS, collectively) can be inspected with command
A key piece of information in the table above is the '''priority tier''' associated to each QOS.


<pre style="color: lightgrey; background: black;">
In Mufasa 2.0, priority tiers are used to encourage users to use the '''least powerful QOS that is compatible with their needs''', where "powerful" means "rich with resources". Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner. See [[#Job priority|Job priority]] for details about how priority affects the execution order of jobs in Mufasa 2.0.
sacctmgr list qos format=name%-11,grpTRES%-34
</pre>
 
which provides an output similar to
 
<pre style="color: lightgrey; background: black;">
Name        GrpTRES                           
----------- ----------------------------------
normal                                       
nogpu      cpu=48,mem=384G                   
gpuheavy-20 gres/gpu:4g.20gb=4               
gpuheavy-40 gres/gpu:40gb=3                   
gpulight    cpu=8,gres/gpu:3g.20gb=4,mem=256G 
gpu        cpu=24,gres/gpu:3g.20gb=3,mem=192G
gpuwide    cpu=40,gres/gpu:4g.20gb=5,mem=320G
</pre>


Note how the overall resources associated with the set of all QOSes exceed the overall available resources. With SLURM, multiple QOSes can be given access to the same physical resource (e.g., a CPU or a GPU), because SLURM guarantees that the overall request for resources from all running jobs does not exceed the overall availability of resources in the system. SLURM will only execute a job when all the resources requested by the job are free at that time.
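
To see how much the QOSes collectively lay claim to, the <code>GrpTRES</code> entries can be summed per resource. The sketch below does this for the example output above; the rows are pasted in by hand, the function name <code>sum_tres</code> is hypothetical, and the totals are not compared with Mufasa's actual hardware because those figures are not given in this section.

```shell
# Rows copied from the example GrpTRES output (name, then GrpTRES string).
grptres='nogpu cpu=48,mem=384G
gpuheavy-20 gres/gpu:4g.20gb=4
gpuheavy-40 gres/gpu:40gb=3
gpulight cpu=8,gres/gpu:3g.20gb=4,mem=256G
gpu cpu=24,gres/gpu:3g.20gb=3,mem=192G
gpuwide cpu=40,gres/gpu:4g.20gb=5,mem=320G'

# sum_tres KEY (hypothetical helper): add up the KEY=<number> entries of all rows.
sum_tres() {
    printf '%s\n' "$grptres" | awk -v key="$1" '
        {
            n = split($2, parts, ",")
            for (i = 1; i <= n; i++) {
                split(parts[i], kv, "=")
                if (kv[1] == key) total += kv[2]
            }
        }
        END { print total + 0 }'
}

sum_tres cpu                # prints 120
sum_tres gres/gpu:3g.20gb   # prints 7
sum_tres gres/gpu:4g.20gb   # prints 9
```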
== The <code>build</code> QOS ==


== Partitions ==
This QOS is specifically designed to be used by Mufasa users to '''quickly build [[System#Containers|container images]]'''. Its associated priority tier is very high: this, combined with the fact that jobs using this QOS require few resources, means that such jobs usually get executed very soon.


Since in Mufasa 2.0 access to resources is controlled via QOSes, partitions are not very relevant. Partitions are another mechanism provided by SLURM to create different levels of access to system resources.  
On the other hand, the limited resources and the lack of access to the GPUs make the <code>build</code> QOS unsuitable for other tasks.


In Mufasa 2.0, there is a single SLURM partition, called <code>jobs</code>, and all jobs run on it. The partition status of Mufasa can be inspected with
See [[Singularity#Building Singularity images|Building Singularity images]] for directions about building Singularity container images.


<pre style="color: lightgrey; background: black;">
== Restricted QOSes ==
sinfo
</pre>


which provides an output similar to the following:
In Mufasa, the most powerful QOSes are reserved for researchers (called <code>research</code> users: these include academic personnel and Ph.D. students). M.Sc. students, called <code>students</code> users, cannot use them when running jobs.


<pre style="color: lightgrey; background: black;">
See [[#research users and students users|below]] to understand how ''user categories'' work in Mufasa 2.0.
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
jobs*        up 3-00:00:00      1  idle gn01
</pre>


As explained, in Mufasa 2.0 partitions have little relevance, while QOSes are very relevant.
= <code>research</code> users and <code>students</code> users =


= <span style="background:#FFFF00">Research users and students users</span> =
Users of Mufasa belong to one of two '''user categories'''. Categories provide users with different levels of access to Mufasa's resources: the idea is to give researchers more access while still letting students use the server.


Users of Mufasa belong to two categories, which provide the users belonging to them with different access to system resources.
User categories in Mufasa are the following:
The categories are:


'''Research''' users, i.e. academic personnel and Ph.D. students
:: '''<code>research</code>''', i.e. academic personnel and Ph.D. students
* have access to all [[#Quality of Service (QOS)|QOSes]]
::: * can use all QOSes, including the [[#Restricted QOSes|restricted ones]]
* their jobs have a higher ''base priority''
::: * their jobs have a higher ''base priority''
::: * have a higher number of jobs that can be running at the same time


'''Students''' users, i.e. M.Sc. students
:: '''<code>students</code>''', i.e. M.Sc. students
* do not have access to QOS gpuheavy-20 and gpuheavy-40
::: * cannot use the [[#Restricted QOSes|restricted QOSes]]
* their jobs have a lower ''base priority''
::: * their jobs have a lower ''base priority''
::: * have a lower number of jobs that can be running at the same time


You can inspect the differences between researcher and student users with command
You can inspect the differences in priority and running jobs between <code>research</code> and <code>students</code> users with command


<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
sacctmgr list association format=account,priority,maxjobs,maxsubmit | grep -E 'Priority|research|students'
sacctmgr list association format="account,priority,maxjobs" | grep -E 'Account|research|students'
</pre>
</pre>


Line 143: Line 220:


<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
   Account  Priority MaxJobs MaxSubmit
   Account  Priority MaxJobs  
   research          4      2         4
   research          4      2  
   students          1      1         2
   students          1      1
</pre>
</pre>


This example output shows that the differences between research and students are the following:
To know what limits apply to your own user, use command


* '''base priority''' is 4 for jobs run by ''research'' users, while it is 1 for jobs run by ''students'' users
<pre style="color: lightgrey; background: black;">
* the '''number of running jobs''' is 2 for ''research'' users, while it is 1 for ''students'' users
sacctmgr list association where user=$USER format="user,priority,maxjobs,qos%-60"
* the '''number of submitted jobs''' (i.e., of jobs submitted to SLURM for execution, whether running or still queued) is 4 for ''research'' users, while it is 2 for ''students'' users
</pre>


= <span style="background:#FFFF00">Job priority</span> =
which provides an output similar to the following:


Once the execution of a job has been requested, the job is not run immediately: it is instead ''queued'' by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available. The order of the jobs in the queue depends on the '''priority''' of the jobs, and defines the order in which each job will reach execution.
<pre style="color: lightgrey; background: black;">
      User  Priority MaxJobs QOS                                                         
---------- ---------- ------- ------------------------------------------------------------
    preali          4      2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu
</pre>


SLURM is configured to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. This goal is achieved by '''encouraging users to avoid asking for resources or execution time that their job does not need'''.
The list under "QOS" shows what QOSes your user is allowed to use when [[User Jobs|running jobs]]. <code>research</code> users can use all of them, while <code>students</code> users can only access a subset of them.
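
Before launching a job it can be handy to check whether a given QOS appears in that list. The following is a minimal sketch: the function name <code>qos_allowed</code> is hypothetical, and the list is copied by hand from the example output above rather than read from SLURM.

```shell
# QOS list copied from the example output above.
allowed='build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu'

# qos_allowed NAME LIST (hypothetical helper): succeed if NAME is one of the
# comma-separated items in LIST. Commas are added around both strings so
# that partial names (e.g. "gpuheavy") do not match.
qos_allowed() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if qos_allowed gpuwide "$allowed"; then echo "gpuwide: allowed"; fi
if ! qos_allowed gpuheavy "$allowed"; then echo "gpuheavy: not in the list"; fi
```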


This mechanism creates a '''virtuous cycle'''. By carefully choosing what to ask for, a user ensures that their job will be executed as soon as possible; at the same time, users limiting their requests to what their jobs really need leave more resources available to other jobs in the queue, which will then be executed sooner.
= Limits on jobs by the same user =


=== Elements influencing job priority ===
Mufasa enforces limits on the number of jobs from a single user. Such limits aim at preventing users from "hogging" system resources, and apply to:
In Mufasa, the priority of a job is computed by SLURM according to the following elements:
* '''submitted jobs''', i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued
* '''running jobs''', i.e. jobs that are currently in execution


; Job duration, i.e. the execution time requested by the job
Limits on submitted jobs are set via [[#SLURM Quality of Service (QOS)|QOSes]], while limits on running jobs are set via [[#research users and students users|user category]].
: This is used to assign higher priority to shorter jobs


; Job size, i.e. the number of CPUs requested by the job
The following table summarises what limits exist:
: This is used to assign higher priority to jobs requiring fewer CPUs.


; Age, i.e. the length of time that the job has been waiting in the queue
{| class="wikitable" style="text-align:center;"
: This is used to increase the priority of a job the longer it has been waiting for execution.
|-
!
! on the number of <u>running</u> jobs<br/>by a single user
! on the number of <u>submitted</u> jobs<br/>by a single user
|-
! rowspan="1" style="text-align:center;" | global limits<br/>(system-wide)
| '''''2 for'' <code>research</code> ''users'''''<br/>'''''1 for'' <code>students</code> ''users'''''
| '''''not limited directly...'''''<br/>...but subject to limits set by individual QOSes (below)
|-
! rowspan="1" style="text-align:center;" | limits for<br/>individual<br/>QOSes
| '''''not limited directly...'''''<br/>...but subject to global limits (above)
| '''''2 for the'' <code>gpuwide</code> ''QOS'''''<br/>'''''1 for each of the other QOSes'''''
|}


; QOS (Quality Of Service), i.e. a factor associated to the resources requested by the job
= Job priority =
: This is used to implement two different mechanisms that influence job priority, i.e.:
:: - to assign higher priority to jobs run on less powerful partitions
:: - to assign higher priority to jobs run by researchers (e.g., Ph.D. students) with respect to jobs run by M.Sc. students


QOS is also used to set limits to the number of jobs by the same user that can be in execution at a given time, as well as the number of jobs by the same user that can be queued at a given time.
Once the execution of a job has been requested, the job is not run immediately: it is instead ''queued'' by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available.
It is possible to get a list of the QOS that are defined in SLURM with command


<pre style="color: lightgrey; background: black;">
The order of the items in the job queue depends on their '''priority'''. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.
sacctmgr list qos
</pre>


= System resources subjected to limitations =
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to '''encourage users not to ask for resources or execution time that their job doesn't need'''. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.


The hardware resources of Mufasa are limited. For this reason, some of them are subjected to limitations, i.e. (these are SLURM's own terms):
Priority management in Mufasa is designed to set up a '''virtuous cycle''' where users, by carefully choosing what to ask for, obtain two results:
* they ensure that their job is executed as soon as possible;
* they leave as much as possible of Mufasa's resources free for other users' jobs.


; cpu
== Elements determining job priority ==
: the number of processor cores that a job uses
In Mufasa, the priority of a job is computed by SLURM according to the following elements:


; mem
: '''[[#research users and students users|User category]]''' (i.e., <code>research</code> or <code>students</code>)
: the amount of RAM that a job uses
::: Used to provide higher priority to jobs run by '''research personnel'''


;gres
: '''[[#SLURM Quality of Service (QOS)|QOS]]''' used by the job
: the amount of ''generic resources'' that a job uses: in Mufasa, gres are '''GPUs'''
::: Used to provide higher priority to jobs asking for '''fewer resources'''


These are some of the '''TRES (Trackable RESources)''' defined by SLURM. From [https://slurm.schedmd.com/tres.html SLURM's documentation]: "''A TRES is a resource that can be tracked for usage or used to enforce limits against.''"
: '''Number of CPUs''' requested by the job (also called "job size")
::: Used to provide higher priority to jobs asking for '''a lower number of CPUs'''


SLURM provides jobs with access to resources only for a limited time: i.e.,
: '''Job duration''', i.e. the execution time requested by the job
; execution time
::: Used to provide higher priority to '''shorter jobs'''
: is itself a limited resource.


When a resource is limited, a job cannot use arbitrary quantities of it. On the contrary, the job must specify how much of the resource it requests. Resource requests are done either by running the job on a [[User Jobs#SLURM partitions|partition]] for which a default amount of resources has been defined, or via the options of the [[#Running_jobs_with_SLURM:_generalities|command]] used to launch the job via SLURM.
: '''Job Age''', i.e. the time that the job has been waiting in the queue
::: Used to provide higher priority to jobs which have been '''queued for a longer time'''


== <code>gres</code> syntax ==
: '''FairShare''', i.e. a factor computed by SLURM to balance use of the system by different users
::: Used to provide higher priority to jobs by users who '''used Mufasa's resources less than others''' (see [[#How FairShare works|below]])


Whenever it is necessary to specify the quantity of <code>gres</code>, i.e. generic resources, a special syntax must be used. In Mufasa <code>gres</code> resources are GPUs, so this syntax applies to GPUs. Number and type of Mufasa's GPUs is described [[System#CPUs and GPUs|here]].
=== How FairShare works ===
In Mufasa 2.0, the FairShare of each user is computed according to these rules:
# It considers the quantity used by (current and past) user jobs of:
#* '''CPUs''' - impact on FairShare is proportional to the number of CPUs used
#* '''RAM''' - impact on FairShare is proportional to the amount of RAM used
#* '''GPUs''' - impact on FairShare is proportional to the number of GPUs used
#** 40 GB GPUs have twice the impact on FairShare of 20 GB GPUs
# The impact on FairShare of any use of resources is '''proportional to its duration'''
#* example: using 32 GB of RAM for 48h has the same impact as using 64 GB of RAM for 24h
# FairShare has a "fading memory": i.e., resource use has less and less impact on FairShare the farther in the past it is
#* consequence: over time, the "history" of a user gets forgotten by FairShare
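
The proportionality rules above can be checked with a little arithmetic. The sketch below is only an illustration of "amount × duration × weight": it is not SLURM's actual FairShare formula, it does not model the fading memory, and the function name <code>usage_cost</code> and the weight values (2 for 40 GB GPUs, 1 otherwise) are assumptions derived from the rules listed here. Figures for different resource types are not meant to be compared with each other.

```shell
# usage_cost AMOUNT HOURS [WEIGHT] (hypothetical helper): illustrative
# "resource x time" figure; WEIGHT defaults to 1.
usage_cost() {
    echo $(( $1 * $2 * ${3:-1} ))
}

usage_cost 32 48      # 32 GB of RAM for 48 h            -> prints 1536
usage_cost 64 24      # 64 GB of RAM for 24 h            -> prints 1536
usage_cost 1 10 2     # one 40 GB GPU for 10 h, weight 2 -> prints 20
usage_cost 2 10       # two 20 GB GPUs for 10 h          -> prints 20
```

The first two calls reproduce the RAM example from the list: halving the amount while doubling the duration leaves the figure unchanged.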


The name of each GPU resource takes the form
== How to maximise the priority of your jobs ==


'''<code>Name:Type</code>'''
Every time you run a SLURM job, follow these guidelines:


where <code>Name</code> is '''<code>gpu</code>''' and <code>Type</code> takes the following values <span style="background:#00FF00">[to be updated]</span>:
:{|class="wikitable"
|
; Choose the least powerful QOS compatible with the needs of your job
:: QOSes with access to fewer resources have higher base priority


* '''<code>40gb</code>''' for GPUs with 40 Gbytes of onboard RAM
; Only request CPUs and RAM that your job really needs
* '''<code>20gb</code>''' for GPUs with 20 Gbytes of onboard RAM
:: Usually, code exploits multiple CPUs only if designed to do so: if unsure, only ask for 1 CPU
:: The fewer resources your job asks for, the less it will wait before they become available
:: Asking for fewer CPUs and/or less RAM improves your FairShare


So, for instance,
; Do not request more time than your job needs to complete
:: Make a worst-case estimate and only ask for that duration
:: Asking for less of Mufasa's time improves your FairShare


<code>gpu:20gb</code>
; Debug and test code using less powerful QOSes before running it with more powerful QOSes
:: Your test jobs will get a higher priority and your FairShare will improve


identifies the resource corresponding to GPUs with 20 GB of RAM.
; Cancel jobs that you don't need anymore
:: Always use [[User_Jobs#Cancelling_a_job_with_scancel|scancel]] to delete completed (or crashed) jobs: your FairShare will improve
|}


When asking for a <code>gres</code> resource (e.g., in an <code>srun</code> command or an <code>SBATCH</code> directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is
Protip: it's a good idea to [[User Jobs#Looking for unused GPUs|check for unused GPUs]] before choosing what to request. Requesting a GPU that is currently idle will help your job get executed sooner.


'''<code><Name>:<Type>:<Quantity></code>'''
= SLURM partitions =
 
where <code>Quantity</code> is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type <code>20gb</code> the syntax is


<code>gpu:20gb:2</code>
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via [[#SLURM Quality of Service (QOS)|QOSes]], partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)


SLURM's ''generic resources'' are defined in <code>/etc/slurm/gres.conf</code>. In order to make GPUs available to SLURM's <code>gres</code> management, Mufasa makes use of Nvidia's [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM's documentation].
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are [[#Default values|set by the partition]].


== Looking for unused GPUs ==
In Mufasa 2.0, there is a single SLURM partition, called <code>jobs</code>, and all jobs run on it. The state of <code>jobs</code> can be inspected with


GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to request a GPU that is not currently in use. This command
<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
sinfo -O Gres:100
sinfo -o "%10P %5a %9T %11L %10l"
</pre>
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following <span style="background:#00FF00">[to be updated]</span>:
<pre style="color: lightgrey; background: black;">
GRES                                                                                               
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)
</pre>
</pre>


To know which of the GPUs are currently in use, use command
which provides an output similar to the following:
 
<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
sinfo -O GresUsed:100
PARTITION  AVAIL STATE    DEFAULTTIME TIMELIMIT
</pre>
jobs*      up    idle      1:00:00    3-00:00:00</pre>
which provides an output similar to this:
<pre style="color: lightgrey; background: black;">
GRES_USED                                                                                         
gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6)
</pre>
By comparing the two lists (GRES and GRES_USED) in the examples above, you can see that in this example:


* the system has 2 40 GB GPUs, all of which are in use
where columns correspond to the following information:
* the system has 3 20 GB GPUs, of which one is not in use
* the system has 6 10 GB GPUs, of which 3 are not in use


= SLURM partitions =
:; PARTITION
:: name of the partition; the asterisk indicates that it's the default one


Execution queues for jobs in SLURM are called '''partitions'''. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command
:; AVAIL
:: state/availability of the partition: see [[#Partition availability|below]]


<pre style="color: lightgrey; background: black;">
:; STATE
sinfo
:: state (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes])
</pre>
:: typical values are '''<code>mixed</code>''' - meaning that some of the resources are busy executing jobs while others are idle, and '''<code>allocated</code>''' - meaning that all of the resources are in use


([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this <span style="background:#00FF00">[to be updated]</span>:
:; DEFAULTTIME
:: default runtime of a job, in format ''[days-]hours:minutes:seconds''


<pre style="color: lightgrey; background: black;">
:; TIMELIMIT
PARTITION  AVAIL  TIMELIMIT NODES  STATE NODELIST
:: maximum runtime of a job, in format ''[days-]hours:minutes:seconds''
debug*        up      20:00      1    mix gn01
small        up  12:00:00      1    mix gn01
normal        up 1-00:00:00      1    mix gn01
longnormal    up 3-00:00:00      1    mix gn01
gpu          up 1-00:00:00      1    mix gn01
gpulong      up 3-00:00:00      1    mix gn01
fat          up 3-00:00:00      1    mix gn01
</pre>


In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside "debug" indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified. On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to.
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.


The columns in the standard output of <code>sinfo</code> shown above correspond to the following information:
Command <code>sinfo</code> does not tell you about the ''jobs'' submitted to a partition. This information is obtained, instead, with [[User Jobs#Inspecting jobs with squeue|command <code>squeue</code>]].


; PARTITION
== Partition availability ==
: name of the partition


; AVAIL
The most important information that <code>sinfo</code> provides is the '''availability''' (also called '''state''') of partitions. This is shown in column "AVAIL". Possible partition states are:
: state/availability of the partition: see [[User Jobs#Partition availability|below]]


; TIMELIMIT
:'''<code>up</code>''' = the partition is available
: maximum runtime of a job allowed by the partition, in format ''[days-]hours:minutes:seconds''
:: Currently running jobs will be completed
:: It's possible to launch jobs on the partition
:: Queued jobs will be executed as soon as resources allow


; NODES
:'''<code>drain</code>''' = the partition is in the process of becoming unavailable (i.e., of entering the <code>down</code> state: see below)
: number of nodes available to jobs run on the partition: for Mufasa, this is always 1 since [[System#The SLURM job scheduling system|there is only 1 node in the computing cluster]]
:: Currently running jobs will be completed
:: It's not possible to launch jobs on the partition
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the <code>up</code> state)


; STATE
:'''<code>down</code>''' = the partition is unavailable
: state of the node (using [https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES these codes]); typical values are <code>mixed</code> - meaning that some of the resources of the node are busy executing jobs while others are free, and <code>allocated</code> - meaning that all of the resources of the node are busy
:: There are no running jobs
:: It's not possible to launch jobs on the partition
:: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the <code>up</code> state)


; NODELIST
When a partition goes from <code>up</code> to <code>drain</code> no harm is done to running jobs. In a normally functioning SLURM system, the passage from <code>up</code> or <code>drain</code> to <code>down</code> happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.
: list of nodes available to the partition: for Mufasa this field always contains <code>gn01</code> since [[System#The SLURM job scheduling system|Mufasa is the only node in the computing cluster]] <span style="background:#00FF00">[to be updated]</span>


For what concerns hardware resources (such as CPUs, GPUs and RAM) the amounts of each resource available to Mufasa's partitions are set by SLURM's accounting system, and are not visible to <code>sinfo</code>. See [[User Jobs#Partition features|Partition features]] for a description of these amounts.
A partition in state <code>drain</code> or <code>down</code> requires intervention by a [[Roles|Job Administrator]] to be restored to <code>up</code>.


== Partition features ==
== Default values ==


The output of <code>sinfo</code> ([[User Jobs#SLURM partitions|see above]]) provides a list of available partitions, but (except for time) it does not provide information about the amount of resources that a partition makes available to the user jobs which are run on it. The amount of resources is visible through command
The features of SLURM partitions, including the '''default values''' which are applied to jobs that do not make explicit requests, can be inspected with


<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
sacctmgr list qos format=name%-10,maxwall,maxtres%-64
scontrol show partition
</pre>
</pre>


which provides an output similar to the following <span style="background:#00FF00">[to be updated]</span>:
which provides an output similar to this:


<pre style="color: lightgrey; background: black;">
<pre style="color: lightgrey; background: black;">
Name          MaxWall MaxTRES                                                         
PartitionName=jobs
---------- ----------- ----------------------------------------------------------------
  AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40
normal      1-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G 
  AllocNodes=ALL Default=YES QoS=N/A
small        12:00:00 cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G    
=> DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
longnormal  3-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G 
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
gpu        1-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G                   
  Nodes=gn01
gpulong    3-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G                   
  PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
fat        3-00:00:00 cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G
  OverTimeLimit=NONE PreemptMode=OFF
  State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
  JobDefaults=(null)
=> DefMemPerNode=4096 MaxMemPerNode=UNLIMITED
  TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5
  TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g
</pre>
</pre>


Its elements are the following (for more information, see [https://slurm.schedmd.com/qos.html SLURM's documentation]):
In the example, we have highlighted with  '''=>'''  the most relevant default values for Mufasa users, i.e.:


; Name
;<code>DefaultTime</code>
: name of the partition
:: the default execution time assigned to a job run on the partition (e.g., 1 hour)


; MaxWall
;<code>DefMemPerNode</code>
: maximum wall clock duration of the jobs run on the partition (after which they are killed by SLURM), in format ''[days-]hours:minutes:seconds''
:: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)
 
= System resources subjected to limitations =
 
In systems based on SLURM like Mufasa, '''TRES (Trackable RESources)''' are (from [https://slurm.schedmd.com/tres.html SLURM's documentation]) "''resources that can be tracked for usage or used to enforce limits against.''"
 
TRES include CPUs, RAM and '''GRES'''. The last term stands for ''Generic RESources'' that a job may need for its execution. In Mufasa, the only <code>gres</code> resources are the GPUs.
 
== <code>gres</code> syntax ==


; MaxTRES
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form
: maximum amount of resources ("''Trackable RESources''") available to a job running on the partition, where
: <code>'''cpu=''K'''''</code> means that the maximum number of processor cores is ''K''
: <code>'''gres/''gpu:Type''=''K'''''</code> means that the maximum number of GPUs of class <code>''Type''</code> (see [[User Jobs#gres syntax|<code>gres</code> syntax]]) is ''K''
: <code>'''mem=''K''G'''</code> means that the maximum amount of system RAM is ''K'' GBytes


Note that there may be additional limits to the possibility to fully exploit the resources of a partition. For instance, there may be a cap on the maximum number of GPUs that can be used at the same time by a single job and/or a single user.
'''<code>gpu:Type</code>'''


=== Partitions of Mufasa 2.0 ===
Considering the [[System#CPUs and GPUs|GPU complement of Mufasa]], <code>Type</code> takes the following values:


The features of the SLURM partitions of Mufasa 2.0 are the following:
* '''<code>gpu:40gb</code>''' for GPUs with 40 Gbytes of RAM
* '''<code>gpu:4g.20gb</code>''' for GPUs with 20 Gbytes of RAM and 4 compute units
* '''<code>gpu:3g.20gb</code>''' for GPUs with 20 Gbytes of RAM and 3 compute units


{| class=wikitable
So, for instance,
!align="center"| Name of partition
!align="center"| example of use
!align="center"| max running jobs per user
!align="center"| GPU configurations available to each job
!align="center"| GPUs that the partition has access to
!align="center"| max CPUs per job
!align="center"| max RAM per job [GB]
!align="center"| max wall clock time per job [h]
!align="center"| default resources assigned to jobs
!align="center"| allowed users
|-
!align="center"| gpulight
|align="center"| debug code
|align="center"| 1
|align="center"| 1 x 20 GB
|align="center"| 4 x 20 GB
|align="center"| 2
|align="center"| 64
|align="center"| 6
|align="center"| <span style="background:#00FF00">to be specified</span>
|align="center"| researchers, students
|-
!align="center"| nogpu
|align="center"| tasks not requiring GPUs (in particular: not AI)
|align="center"| 1
|align="center"| -
|align="center"| none
|align="center"| 16
|align="center"| 128
|align="center"| 72
|align="center"| <span style="background:#00FF00">to be specified</span>
|align="center"| researchers, students
|-
!align="center"| gpu
|align="center"| AI: train an already debugged model
|align="center"| 1
|align="center"| 1 x 20 GB
|align="center"| 3 x 20 GB
|align="center"| 8
|align="center"| 64
|align="center"| 24
|align="center"| <span style="background:#00FF00">to be specified</span>
|align="center"| researchers, students
|-
!align="center"| gpuwide
|align="center"| AI: search for optimal hyperparameter values
|align="center"| 2
|align="center"| 1 x 20 GB
|align="center"| 5 x 20 GB
|align="center"| 8
|align="center"| 64
|align="center"| 24
|align="center"| <span style="background:#00FF00">to be specified</span>
|align="center"| researchers, students
|-
!align="center"| gpuheavy
|align="center"| AI: train an already optimised model
|align="center"| 1
|align="center"| 1 x 20 GB or 2 x 20 GB or 1 x 40 GB
|align="center"| 3 x 40 GB + 4 x 20 GB
|align="center"| 8
|align="center"| 128
|align="center"| 72
|align="center"| <span style="background:#00FF00">to be specified</span>
|align="center"| researchers
|}


Overall resources associated to the set of all partitions exceed overall available resources, as multiple partitions can be given access to the same resource (e.g., a CPU or a GPU). SLURM will only execute a job if all the resources requested by the job are not already in use at the time of request.
<code>gpu:3g.20gb</code>


== Partition availability ==
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.


An important information that ''sinfo'' provides (column "AVAIL") is the ''availability'' (also called ''state'') of partitions. Possible partition states are:
When asking for a GRES resource (e.g., in an <code>srun</code> command or an <code>SBATCH</code> directive of an [[User Jobs#Using execution scripts to run jobs|execution script]]), the syntax required by SLURM is


; up = the partition is available
'''<code>gpu:<Type>:<Quantity></code>'''
: Currently running jobs will be completed
: Currently queued jobs will be executed as soon as resources allow


; drain = the partition is in the process of becoming unavailable (i.e., to go in the ''down'' state)
where <code>Quantity</code> is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type <code>4g.20gb</code> the syntax is
: Currently running jobs will be completed
: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the ''up'' state)


; down = the partition is unavailable
<code>gpu:4g.20gb:2</code>
: There are no running jobs
: Queued jobs will be executed when the partition becomes available again (i.e. goes back to the ''up'' state)


When a partition goes from ''up'' to ''drain'' no harm is done to running jobs. When a partition passes from any other state to ''down'', running jobs (if they exist) get killed. A partition in state ''drain'' or ''down'' requires intervention by a [[Roles|Job Administrator]] to be restored to ''up''.
SLURM's ''generic resources'' are defined in <code>/etc/slurm/gres.conf</code>. In order to make GPUs available to SLURM's <code>gres</code> management, Mufasa makes use of Nvidia's [https://developer.nvidia.com/nvidia-management-library-nvml NVML library]. For additional information see [https://slurm.schedmd.com/gres.html SLURM's documentation].

Latest revision as of 13:36, 26 May 2026

This page presents the features of SLURM that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).

Users of Mufasa must use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:

  • GPUs
  • multiple CPUs
  • powerful CPUs
  • a significant amount of RAM

In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the login server virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).

SLURM in a nutshell

Computation jobs on Mufasa need to be launched via SLURM. SLURM provides jobs with access to the physical resources of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.

When a user runs a job, the job does not get executed immediately and is instead queued. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the priority assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:

the greater the fraction of Mufasa's overall resources that a job asks for, the lower the job's priority will be.

The priority mechanism is used to encourage users to use Mufasa's resources (i.e.: GPUs, CPUs, RAM, execution time) in an effective and equitable manner. This page includes a chart explaining how to maximise the priority of your jobs.

The time available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, because at the end of such slot it gets killed by SLURM.

In Mufasa 2.0 access to system resources is managed via SLURM's Quality of Service (QOS) mechanism (Mufasa 1.0 used partitions instead). While launching a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.

Mufasa sets limits to the number of jobs by the same user. This page includes a table summarising such limits.

SLURM Quality of Service (QOS)

Through its Quality of Service (QOS) mechanism, SLURM lets system configurators assign a name to a set of related constraints.

In Mufasa 2.0, QOSes are used to define different levels of access to the server's resources. When executing a job with SLURM, a user must always specify the QOS that their job will use: this choice, in turn, determines what resources the job is able to access and influences the priority of the job.

Mufasa's QOSes and their features can be inspected with command

sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80,mintres%-18

which provides an output similar to the following:

Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          MinTRES            
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- ------------------ 
normal               0                                                                                                                           
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G                    
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             gres/gpu:4g.20gb=1 
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             gres/gpu:40gb=1    
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              gres/gpu:3g.20gb=1 
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              gres/gpu:3g.20gb=1 
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              gres/gpu:4g.20gb=1 
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G

The columns of this output are the following:

Name
name of the QOS
Priority
priority tier associated to the QOS (higher value = higher priority): see Job priority for details
MaxSubmit
maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs
See Limits on jobs by the same user for an overview of the limits on jobs set by Mufasa.
MaxWall
maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format [days-]hours:minutes:seconds
For some QOSes these are not set: it means that they are determined by the partition. Partitions also define the default duration of jobs.
MaxTRES
amount of resources subjected to limitations ("Trackable RESources") available to a job using the QOS, where
cpu=K means that the maximum number of CPUs (i.e., processor cores) is K
--> if not specified, the job gets the default amount of CPUs specified by the partition
gres/gpu:Type=K means that the maximum number of GPUs of class Type (see gres syntax) is K
--> (for QOSes that allow access to GPUs) if not specified, the job cannot be launched
mem=KG means that the maximum amount of system RAM is K GBytes
--> if not specified, the job gets the default amount of RAM specified by the partition
MinTRES
minimum amount of resources subjected to limitations ("Trackable RESources") that a job using the QOS must request in order to actually get executed by SLURM.
(If your job does not actually need these resources, you've chosen the wrong QOS: you can use one with a higher priority.)
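
The ''[days-]hours:minutes:seconds'' format used by MaxWall can be converted to seconds with a few lines of shell, for example to compare the limits of different QOSes. The following is only a sketch: the function name <code>walltime_to_seconds</code> is made up for illustration, and the code assumes that hours, minutes and seconds are always present (shorter forms that SLURM may also print are not handled).

```shell
# walltime_to_seconds: hypothetical helper, not part of SLURM.
# Converts [days-]hours:minutes:seconds into a number of seconds.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }   # days-h:m:s
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }                # h:m:s
    '
}

walltime_to_seconds 3-00:00:00   # prints 259200 (72 hours)
walltime_to_seconds 12:00:00     # prints 43200 (12 hours)
```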


The normal QOS is the default one, and exists only to ensure that users always specify a QOS when running a job. Since normal has zero priority and no resources, a job submitted with this QOS would never be executed.

The information provided by the sacctmgr list qos command above is summarised by the following table:

QOS          Priority tier  MaxSubmit  MaxWall [h]            max # of CPUs  max RAM [GB]  max # of 3g.20GB GPUs  max # of 4g.20GB GPUs  max # of 40GB GPUs  MinTRES (GPUs)
build        32             1          2                      2              16            -                      -                      -                   -
gpulight     8              1          12                     2              64            1                      -                      -                   one 3g.20GB
nogpu        4              1          72                     16             128           -                      -                      -                   -
gpu          2              1          24                     8              64            1                      -                      -                   one 3g.20GB
gpuwide      2              2          24                     8              64            -                      1                      -                   one 4g.20GB
gpuheavy-20  1              1          72 (set by partition)  8              128           -                      2                      -                   one 4g.20GB
gpuheavy-40  1              1          72 (set by partition)  8              128           -                      -                      1                   one 40GB

A key piece of information in the table above is the priority tier associated to each QOS.

In Mufasa 2.0, priority tiers are used to encourage users to use the least powerful QOS that is compatible with their needs, where "powerful" means "rich with resources". Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner. See Job priority for details about how priority affects the execution order of jobs in Mufasa 2.0.
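
As an illustration of choosing a QOS and staying within its limits, the sketch below assembles an <code>srun</code> command line and prints it without running it. The helper <code>build_srun</code> and the program <code>./my_job.sh</code> are hypothetical; the options <code>--qos</code>, <code>--gres</code>, <code>--cpus-per-task</code>, <code>--mem</code> and <code>--time</code> are standard <code>srun</code> options, but the way jobs should actually be launched on Mufasa is described in the User Jobs page, not here.

```shell
# build_srun: hypothetical helper that only prints an srun command line.
# Arguments: QOS, GRES request (may be empty), CPUs, RAM, time, then the program to run.
build_srun() {
    qos=$1; gres=$2; cpus=$3; mem=$4; walltime=$5
    shift 5
    cmd="srun --qos=$qos"
    # QOSes without GPUs (e.g. nogpu) need no --gres option.
    if [ -n "$gres" ]; then
        cmd="$cmd --gres=$gres"
    fi
    echo "$cmd --cpus-per-task=$cpus --mem=$mem --time=$walltime $*"
}

# One 3g.20gb GPU, 2 CPUs, 16 GB of RAM for 2 hours: within the limits of gpulight.
build_srun gpulight gpu:3g.20gb:1 2 16G 02:00:00 ./my_job.sh
# prints: srun --qos=gpulight --gres=gpu:3g.20gb:1 --cpus-per-task=2 --mem=16G --time=02:00:00 ./my_job.sh
```

Requesting only what the job needs keeps it eligible for a QOS with a higher priority tier.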

The build QOS

This QOS is specifically designed to be used by Mufasa users to quickly build container images. Its associated priority tier is very high: this, combined with the fact that jobs using this QOS require few resources, means that such jobs usually get executed very soon.

On the other hand, the limited resources and the lack of access to the GPUs make the build QOS unsuitable for other tasks.

See Building Singularity images for directions about building Singularity container images.

Restricted QOSes

In Mufasa, the most powerful QOSes are reserved for researchers (called research users: these include academic personnel and Ph.D. students). M.Sc. students, called students users, cannot use them when running jobs.

See below to understand how user categories work in Mufasa 2.0.

research users and students users

Users of Mufasa belong to two user categories. Categories provide users with different levels of access to Mufasa's resources: the idea is to provide researchers with more access while still letting students use the server.

User categories in Mufasa are the following:

research, i.e. academic personnel and Ph.D. students
* can use all QOSes, including the restricted ones
* their jobs have a higher base priority
* have a higher number of jobs that can be running at the same time
students, i.e. M.Sc. students
* cannot use the restricted QOSes
* their jobs have a lower base priority
* have a lower number of jobs that can be running at the same time

You can inspect the differences in priority and running jobs between research and students users with command

sacctmgr list association format="account,priority,maxjobs" | grep -E 'Account|research|students'

which provides an output similar to the following:

   Account   Priority MaxJobs 
  research          4       2 
  students          1       1

To know what limits apply to your own user, use command

sacctmgr list association where user=$USER format="user,priority,maxjobs,qos%-60"

which provides an output similar to the following:

      User   Priority MaxJobs QOS                                                          
---------- ---------- ------- ------------------------------------------------------------ 
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu

The list under "QOS" shows what QOSes your user is allowed to use when running jobs. research users can use all of them, while students users can only access a subset of them.
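
A script can check whether a given QOS is in that list before submitting a job. The sketch below uses a hypothetical helper, <code>qos_allowed</code>, and the example list shown above; on the real system the list could be obtained with <code>sacctmgr -n -P list association where user=$USER format=qos</code> (the <code>-n</code> and <code>-P</code> options suppress the header and the column padding).

```shell
# qos_allowed LIST QOS: hypothetical helper.
# Succeeds if QOS appears in the comma-separated LIST.
qos_allowed() {
    # Wrapping both sides in commas avoids partial matches (e.g. "gpu" inside "gpuwide").
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example list copied from the output above.
allowed="build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu"

if qos_allowed "$allowed" gpuheavy-40; then
    echo "gpuheavy-40: allowed"
else
    echo "gpuheavy-40: not allowed"
fi
# prints: gpuheavy-40: allowed
```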

Limits on jobs by the same user

Mufasa enforces limits on the number of jobs from a single user. Such limits aim at preventing users from "hogging" system resources, and apply to:

  • submitted jobs, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued
  • running jobs, i.e. jobs that are currently in execution

Limits on submitted jobs are set via QOSes, while limits on running jobs are set via user category.

The following table summarises what limits exist:

                               on the number of running jobs by a single user              on the number of submitted jobs by a single user
global limits (system-wide)    2 for research users, 1 for students users                  not limited directly, but subject to limits set by individual QOSes (below)
limits for individual QOSes    not limited directly, but subject to global limits (above)  2 for the gpuwide QOS, 1 for each of the other QOSes
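
To see how close you are to these limits, you can count your own running and submitted jobs. The sketch below is one possible way to do it: the helper <code>count_jobs</code> is hypothetical and reads one job state per line, as printed by <code>squeue -h -u $USER -o %T</code>. Here it is fed two sample states instead of real <code>squeue</code> output.

```shell
# count_jobs: hypothetical helper.
# Reads one job state per line on stdin and counts running and submitted jobs.
count_jobs() {
    awk '
        $1 == "RUNNING" { running++ }
        NF              { submitted++ }   # every non-empty line is a submitted job
        END { printf "running=%d submitted=%d\n", running, submitted }
    '
}

# Sample input: one running job and one queued job.
printf 'RUNNING\nPENDING\n' | count_jobs
# prints: running=1 submitted=2

# On the real system (assumption: squeue lists only jobs that are still running or queued):
#   squeue -h -u "$USER" -o %T | count_jobs
```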

Job priority

Once the execution of a job has been requested, the job is not run immediately: it is instead queued by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available.

The order of the items in the job queue depends on their priority. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.

The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to encourage users not to ask for resources or execution time that their job doesn't need. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.

Priority management in Mufasa is designed to set up a virtuous cycle where users, by carefully choosing what to ask for, obtain two results:

  • they ensure that their job is executed as soon as possible;
  • they leave as much as possible of Mufasa's resources free for other users' jobs.

Elements determining job priority

In Mufasa, the priority of a job is computed by SLURM according to the following elements:

User category (i.e., research or students)
Used to provide higher priority to jobs run by research personnel
QOS used by the job
Used to provide higher priority to jobs asking for fewer resources
Number of CPUs requested by the job (also called "job size")
Used to provide higher priority to jobs asking for a lower number of CPUs
Job duration, i.e. the execution time requested by the job
Used to provide higher priority to shorter jobs
Job Age, i.e. the time that the job has been waiting in the queue
Used to provide higher priority to jobs which have been queued for a longer time
FairShare, i.e. a factor computed by SLURM to balance use of the system by different users
Used to provide higher priority to jobs by users who used Mufasa's resources less than others (see below)

How FairShare works

In Mufasa 2.0, the FairShare of each user is computed according to these rules:

  1. It considers the quantity used by (current and past) user jobs of:
    • CPUs - impact on FairShare is proportional to the number of CPUs used
    • RAM - impact on FairShare is proportional to the amount of RAM used
    • GPUs - impact on FairShare is proportional to the number of GPUs used
      • 40 GB GPUs have twice the impact on FairShare of 20 GB GPUs
  2. The impact on FairShare of any use of resources is proportional to its duration
    • example: using 32 GB of RAM for 48h has the same impact as using 64 GB of RAM for 24h
  3. FairShare has a "fading memory": i.e., resource use has less and less impact on FairShare the farther in the past it is
    • consequence: over time, the "history" of a user gets forgotten by FairShare
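
Rule 2 can be checked with a little arithmetic. The sketch below multiplies the amount of a resource by the hours of use; resource_hours is a hypothetical helper, and "GB-hours" is only an illustrative unit (the actual scale used by SLURM depends on Mufasa's configuration).

```shell
# Hypothetical helper: amount of resource x hours of use (rule 2).
resource_hours() {
    echo $(( $1 * $2 ))
}

a=$(resource_hours 32 48)   # 32 GB of RAM for 48 h
b=$(resource_hours 64 24)   # 64 GB of RAM for 24 h

echo "32 GB x 48 h = ${a} GB-hours"
echo "64 GB x 24 h = ${b} GB-hours"
```

Both lines report 1536 GB-hours, which is why the two uses weigh the same on FairShare.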

How to maximise the priority of your jobs

Every time you run a SLURM job, follow these guidelines:

Choose the least powerful QOS compatible with the needs of your job
QOSes with access to fewer resources have higher base priority
Only request CPUs and RAM that your job really needs
Usually, code exploits multiple CPUs only if designed to do so: if unsure, only ask for 1 CPU
The fewer resources your job asks for, the less it will wait before they become available
Asking for fewer CPUs and/or less RAM improves your FairShare
Do not request more time than your job needs to complete
Make a worst-case estimate and only ask for that duration
Asking for less of Mufasa's time improves your FairShare
Debug and test code using less powerful QOSes before running it with more powerful QOSes
Your test jobs will get a higher priority and your FairShare will improve
Cancel jobs that you don't need anymore
Always use scancel to delete completed (or crashed) jobs: your FairShare will improve
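
As an illustration of the guidelines above, here is a minimal sketch of an execution script that asks for very little. The job name and the requested values are hypothetical, and the script assumes that nogpu is the appropriate QOS for a job that needs no GPU.

```shell
#!/bin/bash
# Hypothetical job name
#SBATCH --job-name=small-test
# The only partition of Mufasa 2.0
#SBATCH --partition=jobs
# Assumption: the job needs no GPU, so it uses the nogpu QOS
#SBATCH --qos=nogpu
# Ask for 1 CPU unless the code is designed to exploit more
#SBATCH --cpus-per-task=1
# Only the RAM the job really needs (hypothetical value)
#SBATCH --mem=4G
# Worst-case estimate of the run time (hypothetical value)
#SBATCH --time=00:30:00

# SLURM sets SLURM_CPUS_PER_TASK inside the job; fall back to 1 elsewhere.
cpus="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${cpus} CPU(s)"
```

Such a script would be submitted with sbatch, e.g. sbatch small-test.sh (hypothetical file name).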

Protip: it's a good idea to check for unused GPUs before choosing what to request. Requesting a GPU that is currently idle will help your job get executed sooner.
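
One possible way to perform this check is sketched below. It assumes that the installed version of sinfo supports the Gres and GresUsed output fields; show_gpu_usage is a hypothetical helper name.

```shell
# Hypothetical helper: show configured and currently allocated GPUs per node.
show_gpu_usage() {
    if command -v sinfo >/dev/null 2>&1; then
        # Gres     = GPUs configured on the node
        # GresUsed = GPUs currently allocated to jobs
        sinfo -N -O "NodeHost:12,Gres:60,GresUsed:60"
    else
        echo "sinfo not found: run this where the SLURM client tools are installed" >&2
    fi
}

show_gpu_usage
```

A GPU type whose count under GresUsed is lower than its count under Gres has idle units.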

SLURM partitions

Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via QOSes, partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)

Note, however, that the default values for some features of SLURM jobs (e.g., duration) are set by the partition.

In Mufasa 2.0, there is a single SLURM partition, called jobs, and all jobs run on it. The state of the jobs partition can be inspected with

sinfo -o "%10P %5a %9T %11L %10l"

which provides an output similar to the following:

PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT 
jobs*      up    idle      1:00:00     3-00:00:00

where columns correspond to the following information:

PARTITION
name of the partition; the asterisk indicates that it's the default one
AVAIL
state/availability of the partition: see below
STATE
state of the partition's nodes, expressed with SLURM's node state codes
typical values are mixed, meaning that some of the resources are busy executing jobs while others are idle, and allocated, meaning that all of the resources are in use
DEFAULTTIME
default runtime of a job, in format [days-]hours:minutes:seconds
TIMELIMIT
maximum runtime of a job, in format [days-]hours:minutes:seconds

The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.

Command sinfo does not tell you about the jobs submitted to a partition. This information is obtained, instead, with command squeue.

Partition availability

The most important information that sinfo provides is the availability (also called state) of partitions. This is shown in column "AVAIL". Possible partition states are:

up = the partition is available
Currently running jobs will be completed
It's possible to launch jobs on the partition
Queued jobs will be executed as soon as resources allow
drain = the partition is in the process of becoming unavailable (i.e., of entering the down state: see below)
Currently running jobs will be completed
It's not possible to launch jobs on the partition
Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)
down = the partition is unavailable
There are no running jobs
It's not possible to launch jobs on the partition
Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)

When a partition goes from up to drain no harm is done to running jobs. In a normally functioning SLURM system, the passage from up or drain to down happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.

A partition in state drain or down requires intervention by a Job Administrator to be restored to up.
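
Before launching a job, the availability of the partition can be checked from a script. The following is a minimal sketch; partition_accepts_jobs is a hypothetical helper, and the value unknown is only a placeholder used when the SLURM tools are not installed.

```shell
# Hypothetical helper: succeed only when the availability string is "up".
partition_accepts_jobs() {
    [ "$1" = "up" ]
}

if command -v sinfo >/dev/null 2>&1; then
    # -h drops the header, -p selects the partition, %a prints the AVAIL column
    avail=$(sinfo -h -p jobs -o "%a" | head -n 1)
else
    avail="unknown"    # placeholder: SLURM client tools not installed here
fi

if partition_accepts_jobs "$avail"; then
    echo "Partition jobs is up: jobs can be launched"
else
    echo "Partition jobs is '${avail}': jobs cannot be launched now"
fi
```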

Default values

The features of SLURM partitions, including the default values which are applied to jobs that do not make explicit requests, can be inspected with

scontrol show partition

which provides an output similar to this:

PartitionName=jobs
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40
   AllocNodes=ALL Default=YES QoS=N/A
=> DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=gn01
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
=> DefMemPerNode=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g

In the example, we have highlighted with => the most relevant default values for Mufasa users, i.e.:

DefaultTime
the default execution time assigned to a job run on the partition (e.g., 1 hour)
DefMemPerNode
the default amount of RAM assigned to a job run on the partition (e.g., 4 GB)
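
These two values can also be extracted from the command output with standard text tools. The sketch below uses a hypothetical helper, partition_value, and tests it on sample lines copied from the output shown above.

```shell
# Hypothetical helper: print the value of one key found in
# `scontrol show partition` output read from standard input.
partition_value() {
    grep -o "$1=[^ ]*" | head -n 1 | cut -d= -f2-
}

# Sample lines copied from the example output (without the => markers)
sample='   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   DefMemPerNode=4096 MaxMemPerNode=UNLIMITED'

printf '%s\n' "$sample" | partition_value DefaultTime     # prints 01:00:00
printf '%s\n' "$sample" | partition_value DefMemPerNode   # prints 4096
```

On Mufasa, the same helper would read the live output instead, e.g. scontrol show partition jobs | partition_value DefaultTime.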

System resources subjected to limitations

In systems based on SLURM like Mufasa, TRES (Trackable RESources) are (from SLURM's documentation) "resources that can be tracked for usage or used to enforce limits against."

TRES include CPUs, RAM and GRES. The last term stands for Generic RESources that a job may need for its execution. In Mufasa, the only GRES are the GPUs.

gres syntax

To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Specifically, the name of each GPU resource takes the form

gpu:<Type>

Considering the GPU complement of Mufasa, the resource names are the following:

  • gpu:40gb for GPUs with 40 GB of RAM
  • gpu:4g.20gb for GPUs with 20 GB of RAM and 4 compute units
  • gpu:3g.20gb for GPUs with 20 GB of RAM and 3 compute units

So, for instance,

gpu:3g.20gb

identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.

When asking for a GRES resource (e.g., in an srun command or an SBATCH directive of an execution script), the syntax required by SLURM is

gpu:<Type>:<Quantity>

where Quantity is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type 4g.20gb the syntax is

gpu:4g.20gb:2
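
As a sketch of how such a request can be assembled and passed to SLURM, the hypothetical helper gres_spec below builds the string from a type and a quantity. The commented usage lines are assumptions: the QOS and the other options a real job needs are not shown.

```shell
# Hypothetical helper: build a GRES request string from GPU type and quantity.
gres_spec() {
    echo "gpu:$1:$2"
}

gres_spec 4g.20gb 2    # prints gpu:4g.20gb:2

# Possible uses (other options omitted):
#   srun --gres="$(gres_spec 40gb 1)" <other options> <command>
#   #SBATCH --gres=gpu:3g.20gb:1
```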

SLURM's generic resources are defined in /etc/slurm/gres.conf. In order to make GPUs available to SLURM's gres management, Mufasa makes use of Nvidia's NVML library. For additional information see SLURM's documentation.