Difference between revisions of "SLURM"
Latest revision as of 13:36, 26 May 2026
This page presents the features of SLURM that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).
Users of Mufasa must use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:
- GPUs
- multiple CPUs
- powerful CPUs
- a significant amount of RAM
In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the login server virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).
SLURM in a nutshell
Computation jobs on Mufasa need to be launched via SLURM. SLURM provides jobs with access to the physical resources of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.
When a user runs a job, the job does not get executed immediately and is instead queued. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is due to the priority assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:
- the greater the fraction of Mufasa's overall resources that a job asks for, the lower the job's priority will be.
The priority mechanism is used to encourage users to use Mufasa's resources (i.e.: GPUs, CPUs, RAM, execution time) in an effective and equitable manner. This page includes a chart explaining how to maximise the priority of your jobs.
The time available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, because at the end of such slot it gets killed by SLURM.
In Mufasa 2.0 access to system resources is managed via SLURM's Quality of Service (QOS) mechanism (Mufasa 1.0 used partitions instead). While launching a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.
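As a rough illustration of the two requirements above (a QOS and a time slot must always be specified), the following sketch assembles an srun command line. The helper name launch_cmd and the program my_job.sh are hypothetical; --qos, --time, --cpus-per-task and --mem are standard srun options. The helper only prints the command (a dry run), so it can be tried without submitting anything.

```shell
# Dry-run helper (hypothetical name): prints the srun command instead of running it.
# Usage: launch_cmd <qos> <time limit> [further srun options and the program to run]
launch_cmd() {
    qos="$1"
    time_limit="$2"
    shift 2
    # Drop "echo" to actually submit the job on a SLURM system.
    echo srun --qos="$qos" --time="$time_limit" "$@"
}

# A CPU-only job on the nogpu QOS, asking for a 2-hour time slot.
launch_cmd nogpu 02:00:00 --cpus-per-task=1 --mem=4G ./my_job.sh
```

Keeping the QOS and the time limit as mandatory positional arguments mirrors the rule that neither can be left out of a job request.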
Mufasa sets limits to the number of jobs by the same user. This page includes a table summarising such limits.
SLURM Quality of Service (QOS)
Through Quality of Service levels (QOSes), SLURM lets system configurators assign a name to a set of related constraints.
In Mufasa 2.0, QOSes are used to define different levels of access to the server's resources. When executing a job with SLURM, a user must always specify the QOS that their job will use: this choice, in turn, determines what resources the job is able to access and influences the priority of the job.
Mufasa's QOSes and their features can be inspected with command
sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80,mintres%-18
which provides an output similar to the following:
Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          MinTRES
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- ------------------
normal               0
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             gres/gpu:4g.20gb=1
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             gres/gpu:40gb=1
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              gres/gpu:3g.20gb=1
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              gres/gpu:3g.20gb=1
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              gres/gpu:4g.20gb=1
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G
The columns of this output are the following:
- Name
- name of the QOS
- Priority
- priority tier associated to the QOS (higher value = higher priority): see Job priority for details
- MaxSubmit
- maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs
- See Limits on jobs by the same user for an overview of the limits on jobs set by Mufasa.
- MaxWall
- maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format [days-]hours:minutes:seconds
- For some QOSes these are not set: it means that they are determined by the partition. Partitions also define the default duration of jobs.
- MaxTRES
- amount of resources subjected to limitations ("Trackable RESources") available to a job using the QOS, where
- cpu=K means that the maximum number of CPUs (i.e., processor cores) is K; if not specified, the job gets the default amount of CPUs specified by the partition
- gres/gpu:Type=K means that the maximum number of GPUs of class Type (see gres syntax) is K; for QOSes that allow access to GPUs, if not specified, the job cannot be launched
- mem=KG means that the maximum amount of system RAM is K GBytes; if not specified, the job gets the default amount of RAM specified by the partition
- MinTRES
- minimum amount of resources subjected to limitations ("Trackable RESources") that a job using the QOS must request in order to actually get executed by SLURM.
- (If your job does not actually need these resources, you've chosen the wrong QOS: you can use one with a higher priority.)
The normal QOS is the default one, and exists only to ensure that users always specify a QOS when running a job. Since normal has zero priority and no resources, a job submitted with this QOS would never be executed.
The information provided by the sacctmgr list qos command above is summarised by the following table:
| QOS | Priority tier | Max Submit | MaxWall [h] | max # of CPUs | max RAM [GB] | max # of 3g.20GB GPUs | max # of 4g.20GB GPUs | max # of 40GB GPUs | MinTRES (GPUs) |
|---|---|---|---|---|---|---|---|---|---|
| build | 32 | 1 | 2 | 2 | 16 | - | - | - | - |
| gpulight | 8 | 1 | 12 | 2 | 64 | 1 | - | - | one 3g.20GB |
| nogpu | 4 | 1 | 72 | 16 | 128 | - | - | - | - |
| gpu | 2 | 1 | 24 | 8 | 64 | 1 | - | - | one 3g.20GB |
| gpuwide | 2 | 2 | 24 | 8 | 64 | - | 1 | - | one 4g.20GB |
| gpuheavy-20 | 1 | 1 | 72 (set by partition) | 8 | 128 | - | 2 | - | one 4g.20GB |
| gpuheavy-40 | 1 | 1 | 72 (set by partition) | 8 | 128 | - | - | 1 | one 40GB |
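The MaxWall [h] column is the duration reported by sacctmgr, converted to hours. A minimal sketch of that conversion follows; the function name walltime_to_hours is hypothetical, and the code assumes the [days-]hours:minutes:seconds format described above, ignoring minutes and seconds.

```shell
# Convert a SLURM duration in the form [days-]hours:minutes:seconds to whole hours.
# Assumes exactly that format; minutes and seconds are ignored (truncated).
walltime_to_hours() {
    echo "$1" | awk '{
        days = 0
        rest = $0
        if (index(rest, "-") > 0) {      # optional "days-" prefix
            split(rest, parts, "-")
            days = parts[1]
            rest = parts[2]
        }
        split(rest, hms, ":")            # hms[1] holds the hours
        printf "%d\n", days * 24 + hms[1]
    }'
}

walltime_to_hours 3-00:00:00    # MaxWall of the nogpu QOS
walltime_to_hours 12:00:00      # MaxWall of the gpulight QOS
walltime_to_hours 02:00:00      # MaxWall of the build QOS
```

The three calls print 72, 12 and 2, matching the corresponding rows of the table.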
A key piece of information in the table above is the priority tier associated to each QOS.
In Mufasa 2.0, priority tiers are used to encourage users to use the least powerful QOS that is compatible with their needs, where "powerful" means "rich with resources". Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner. See Job priority for details about how priority affects the execution order of jobs in Mufasa 2.0.
The build QOS
This QOS is specifically designed to be used by Mufasa users to quickly build container images. Its associated priority tier is very high: this, combined with the fact that jobs using this QOS require few resources, means that such jobs usually get executed very soon.
On the other hand, the limited resources and the lack of access to the GPUs make the build QOS unsuitable for other tasks.
See Building Singularity images for directions about building Singularity container images.
Restricted QOSes
In Mufasa, the most powerful QOSes are reserved for researchers (called research users: these include academic personnel and Ph.D. students). M.Sc. students, called students users, cannot use them when running jobs.
See below to understand how user categories work in Mufasa 2.0.
research users and students users
Users of Mufasa belong to one of two user categories. Categories provide users with different access to Mufasa's resources, the idea being to provide researchers with more access while still letting students use the server.
User categories in Mufasa are the following:
- research, i.e. academic personnel and Ph.D. students
  - can use all QOSes, including the restricted ones
  - their jobs have a higher base priority
  - can have a higher number of jobs running at the same time
- students, i.e. M.Sc. students
  - cannot use the restricted QOSes
  - their jobs have a lower base priority
  - can have a lower number of jobs running at the same time
You can inspect the differences in priority and running jobs between research and students users with command
sacctmgr list association format="account,priority,maxjobs" | grep -E 'Account|research|students'
which provides an output similar to the following:
   Account   Priority MaxJobs
  research          4       2
  students          1       1
To know what limits apply to your own user, use command
sacctmgr list association where user=$USER format="user,priority,maxjobs,qos%-60"
which provides an output similar to the following:
User Priority MaxJobs QOS
---------- ---------- ------- ------------------------------------------------------------
preali 4 2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu
The list under "QOS" shows what QOSes your user is allowed to use when running jobs. research users can use all of them, while students users can only access a subset of them.
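A script can check whether a given QOS appears in that comma-separated list before submitting a job. The sketch below is only illustrative: the function name qos_allowed is hypothetical, and the sample list is copied from the output above rather than read from SLURM.

```shell
# Succeed if QOS $2 is one of the entries of the comma-separated list $1.
# Wrapping both sides in commas avoids partial matches (e.g. "gpu" inside "gpulight").
qos_allowed() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Sample list copied from the output above. On Mufasa it could be read with:
#   sacctmgr -n -P list association where user=$USER format=qos
my_qos="build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu"

if qos_allowed "$my_qos" gpuwide; then
    echo "gpuwide can be used"
else
    echo "gpuwide is not available to this user"
fi
```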
Limits on jobs by the same user
Mufasa enforces limits on the number of jobs from a single user. Such limits aim at preventing users from "hogging" system resources, and apply to:
- submitted jobs, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued
- running jobs, i.e. jobs that are currently in execution
Limits on submitted jobs are set via QOSes, while limits on running jobs are set via user category.
The following table summarises what limits exist:
| | on the number of running jobs by a single user | on the number of submitted jobs by a single user |
|---|---|---|
| global limits (system-wide) | 2 for research users, 1 for students users | not limited directly... but subject to limits set by individual QOSes (below) |
| limits for individual QOSes | not limited directly... but subject to global limits (above) | 2 for the gpuwide QOS, 1 for each of the other QOSes |
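To see how the limit on running jobs plays out, the following sketch counts the running jobs in a list of job states and compares the count with the MaxJobs value of the user category. The function name can_start_another is hypothetical and the input is sample text; on Mufasa the states could come from squeue -h -u "$USER" -o "%T", which prints one state per job.

```shell
# Decide whether a user could have one more job running.
# $1    = MaxJobs for the user's category (2 for research, 1 for students)
# stdin = one job state per line, e.g. from: squeue -h -u "$USER" -o "%T"
can_start_another() {
    running=$(awk '$0 == "RUNNING" { n++ } END { print n + 0 }')
    [ "$running" -lt "$1" ]
}

# Sample states standing in for real squeue output: one running job, one queued.
if printf 'RUNNING\nPENDING\n' | can_start_another 2; then
    echo "a further job can start as soon as resources allow"
else
    echo "the running-jobs limit has been reached"
fi
```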
Job priority
Once the execution of a job has been requested, the job is not run immediately: it is instead queued by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available.
The order of the items in the job queue depends on their priority. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.
The goal of SLURM is to maximise resource availability: i.e., to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to encourage users not to ask for resources or execution time that their job doesn't need. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.
Priority management in Mufasa is designed to set up a virtuous cycle where users, by carefully choosing what to ask for, obtain two results:
- they ensure that their job is executed as soon as possible;
- they leave as much as possible of Mufasa's resources free for other users' jobs.
Elements determining job priority
In Mufasa, the priority of a job is computed by SLURM according to the following elements:
- User category (i.e., research or students)
- Used to provide higher priority to jobs run by research personnel
- QOS used by the job
- Used to provide higher priority to jobs asking for fewer resources
- Number of CPUs requested by the job (also called "job size")
- Used to provide higher priority to jobs asking for a lower number of CPUs
- Job duration, i.e. the execution time requested by the job
- Used to provide higher priority to shorter jobs
- Job Age, i.e. the time that the job has been waiting in the queue
- Used to provide higher priority to jobs which have been queued for a longer time
- FairShare, i.e. a factor computed by SLURM to balance use of the system by different users
- Used to provide higher priority to jobs by users who used Mufasa's resources less than others (see below)
In Mufasa 2.0, the FairShare of each user is computed according to these rules:
- It considers the quantity used by (current and past) user jobs of:
- CPUs - impact on FairShare is proportional to the number of CPUs used
- RAM - impact on FairShare is proportional to the amount of RAM used
- GPUs - impact on FairShare is proportional to the number of GPUs used
- 40 GB GPUs have twice the impact on FairShare of 20 GB GPUs
- The impact on FairShare of any use of resources is proportional to its duration
- example: using 32 GB of RAM for 48h has the same impact as using 64 GB of RAM for 24h
- FairShare has a "fading memory": i.e., resource use has less and less impact on FairShare the farther in the past it is
- consequence: over time, the "history" of a user gets forgotten by FairShare
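The "amount times duration" rule in the RAM example above can be checked with a line of arithmetic. This is only an illustration of the proportionality: it is not SLURM's actual FairShare formula, which also weights the different resources and lets past usage fade. The function name resource_hours is hypothetical.

```shell
# Resource usage expressed as amount x duration (illustrative only: this is not
# SLURM's actual FairShare computation).
# $1 = amount of the resource (e.g. GB of RAM), $2 = duration in hours
resource_hours() {
    echo $(( $1 * $2 ))
}

resource_hours 32 48    # 32 GB of RAM for 48 h
resource_hours 64 24    # 64 GB of RAM for 24 h
```

Both calls print 1536 (GB x h), which is why the two usages weigh the same.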
How to maximise the priority of your jobs
Every time you run a SLURM job, follow these guidelines:
- Choose the least powerful QOS compatible with the needs of your job
  - QOSes with access to fewer resources have higher base priority
- Only request CPUs and RAM that your job really needs
  - Usually, code exploits multiple CPUs only if designed to do so: if unsure, only ask for 1 CPU
  - The fewer resources your job asks for, the less it will wait before they become available
  - Asking for fewer CPUs and/or less RAM improves your FairShare
- Do not request more time than your job needs to complete
  - Make a worst-case estimate and only ask for that duration
  - Asking for less of Mufasa's time improves your FairShare
- Debug and test code using less powerful QOSes before running it with more powerful QOSes
  - Your test jobs will get a higher priority and your FairShare will improve
- Cancel jobs that you don't need anymore
  - Always use scancel to delete completed (or crashed) jobs: your FairShare will improve
Protip: it's a good idea to check for unused GPUs before choosing what to request. Requesting a GPU that is currently idle will help your job get executed sooner.
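Putting the guidelines together, a test run might use an execution script like the one generated below. This is a sketch under assumptions: the helper name make_job_script and the program my_test.sh are hypothetical, the 1 CPU and 4G of RAM are example values, and the script targets a CPU-only QOS (a GPU QOS would also need a --gres line). The #SBATCH options shown are standard sbatch options.

```shell
# Print a minimal execution script for a CPU-only job (hypothetical helper).
# $1 = QOS, $2 = time limit, $3 = command to run
make_job_script() {
    cat <<EOF
#!/bin/bash
#SBATCH --qos=$1
#SBATCH --time=$2
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
$3
EOF
}

# Test run: a QOS without GPUs, a short worst-case duration, a single CPU.
make_job_script nogpu 00:30:00 ./my_test.sh
# To use it on Mufasa, redirect the output to a file and submit that file:
#   make_job_script nogpu 00:30:00 ./my_test.sh > job.sh && sbatch job.sh
```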
SLURM partitions
Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via QOSes, partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)
Note, however, that the default values for some features of SLURM jobs (e.g., duration) are set by the partition.
In Mufasa 2.0, there is a single SLURM partition, called jobs, and all jobs run on it. The state of jobs can be inspected with
sinfo -o "%10P %5a %9T %11L %10l"
which provides an output similar to the following:
PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT
jobs*      up    idle      1:00:00     3-00:00:00
where columns correspond to the following information:
- PARTITION
- name of the partition; the asterisk indicates that it's the default one
- AVAIL
- state/availability of the partition: see below
- STATE
- state (using these codes)
- typical values are mixed (some of the resources are busy executing jobs while others are idle) and allocated (all of the resources are in use)
- DEFAULTTIME
- default runtime of a job, in format [days-]hours:minutes:seconds
- TIMELIMIT
- maximum runtime of a job, in format [days-]hours:minutes:seconds
The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.
Command sinfo does not tell you about the jobs submitted to a partition. This information is obtained, instead, with command squeue.
Partition availability
The most important information that sinfo provides is the availability (also called state) of partitions. This is shown in column "AVAIL". Possible partition states are:
- up = the partition is available
  - Currently running jobs will be completed
  - It's possible to launch jobs on the partition
  - Queued jobs will be executed as soon as resources allow
- drain = the partition is in the process of becoming unavailable (i.e., of entering the down state: see below)
  - Currently running jobs will be completed
  - It's not possible to launch jobs on the partition
  - Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)
- down = the partition is unavailable
  - There are no running jobs
  - It's not possible to launch jobs on the partition
  - Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)
When a partition goes from up to drain no harm is done to running jobs. In a normally functioning SLURM system, the passage from up or drain to down happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.
A partition in state drain or down requires intervention by a Job Administrator to be restored to up.
Default values
The features of SLURM partitions, including the default values which are applied to jobs that do not make explicit requests, can be inspected with
scontrol show partition
which provides an output similar to this:
PartitionName=jobs
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40
   AllocNodes=ALL Default=YES QoS=N/A
=> DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=gn01
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
=> DefMemPerNode=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g
In the example, we have highlighted with => the most relevant default values for Mufasa users, i.e.:
- DefaultTime: the default execution time assigned to a job run on the partition (e.g., 1 hour)
- DefMemPerNode: the default amount of RAM assigned to a job run on the partition (e.g., 4GB)
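The two defaults can also be pulled out of the scontrol output by a script. The sketch below is a minimal illustration: the function name partition_default is hypothetical, and it works on sample text copied from the output above. Note that SLURM reports DefMemPerNode in megabytes, so 4096 corresponds to the 4GB mentioned above.

```shell
# Extract the value of one "Key=Value" field from scontrol-style output.
# $1 = key (e.g. DefaultTime); stdin = text in the format shown above
partition_default() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Sample text copied from the output above. On Mufasa one would pipe in
# the output of: scontrol show partition jobs
sample='QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO
JobDefaults=(null) DefMemPerNode=4096 MaxMemPerNode=UNLIMITED'

echo "$sample" | partition_default DefaultTime
echo "$sample" | partition_default DefMemPerNode
```

Splitting the text into one token per line before matching ensures that a key is only matched at the start of a field, so MaxMemPerNode is never mistaken for DefMemPerNode.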
System resources subjected to limitations
In systems based on SLURM like Mufasa, TRES (Trackable RESources) are (from SLURM's documentation) "resources that can be tracked for usage or used to enforce limits against."
TRES include CPUs, RAM and GRES. The last term stands for Generic RESources that a job may need for its execution. In Mufasa, the only gres resources are the GPUs.
gres syntax
To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form
gpu:Type
Considering the GPU complement of Mufasa, Type takes the following values, giving these resource names:
- gpu:40gb for GPUs with 40 Gbytes of RAM
- gpu:4g.20gb for GPUs with 20 Gbytes of RAM and 4 compute units
- gpu:3g.20gb for GPUs with 20 Gbytes of RAM and 3 compute units
So, for instance,
gpu:3g.20gb
identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.
When asking for a GRES resource (e.g., in an srun command or an SBATCH directive of an execution script), the syntax required by SLURM is
gpu:<Type>:<Quantity>
where Quantity is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type 4g.20gb the syntax is
gpu:4g.20gb:2
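A small helper can build such strings and reject GPU types that Mufasa does not have. This is a sketch only: the function name gres_spec and the program my_job.sh are hypothetical, while --gres is the standard srun/sbatch option that takes the resulting string.

```shell
# Build a GRES request string of the form gpu:<Type>:<Quantity>.
# Only the three GPU types listed above are accepted.
gres_spec() {
    case "$1" in
        40gb|4g.20gb|3g.20gb) ;;
        *) echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
    case "$2" in
        ''|*[!0-9]*) echo "quantity must be a non-negative integer: $2" >&2; return 1 ;;
    esac
    echo "gpu:$1:$2"
}

gres_spec 4g.20gb 2
# Possible use in a job request (not executed here):
#   srun --qos=gpuheavy-20 --time=01:00:00 --gres="$(gres_spec 4g.20gb 2)" ./my_job.sh
```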
SLURM's generic resources are defined in /etc/slurm/gres.conf. In order to make GPUs available to SLURM's gres management, Mufasa makes use of Nvidia's NVML library. For additional information see SLURM's documentation.