SLURM

From Mufasa (BioHPC)

Latest revision as of 16:34, 4 May 2026

This page presents the features of SLURM that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).

Users of Mufasa must use SLURM to run resource-heavy processes, i.e. computing jobs that require one or more of the following:

  • GPUs
  • multiple CPUs
  • powerful CPUs
  • a significant amount of RAM

In fact, only processes run via SLURM have access to all the resources of Mufasa. Processes run outside SLURM are executed by the login server virtual machine, which has minimal resources and no GPUs. Using SLURM is therefore the only way to execute resource-heavy jobs on Mufasa (this is a key difference between Mufasa 1.0 and Mufasa 2.0).

SLURM in a nutshell

Computation jobs on Mufasa need to be launched via SLURM. SLURM provides jobs with access to the physical resources of Mufasa, such as CPUs, GPUs and RAM. Thanks to SLURM, processing jobs share system resources, optimising their occupation and availability.

When a user runs a job, the job does not get executed immediately and is instead queued. SLURM executes jobs according to their order in the queue: the top job in the queue gets executed as soon as the necessary resources are available, while jobs lower in the queue wait longer. The position of a job in the queue is determined by the priority assigned to it by SLURM, with higher-priority jobs closer to the top. As a general rule:

the greater the fraction of Mufasa's overall resources that a job asks for, the lower the job's priority will be.

The priority mechanism is used to encourage users to use Mufasa's resources in an effective and equitable manner. This page includes a chart explaining how to maximise the priority of your jobs.

The time available to a job for its execution is controlled by SLURM. When a user requests execution of a job, they must specify the duration of the time slot that the job needs. The job must complete its execution before the end of the requested time slot, otherwise it gets killed by SLURM.

In Mufasa 2.0 access to system resources is managed via SLURM's Quality of Service (QOS) mechanism (Mufasa 1.0 used partitions instead). To launch a processing job via SLURM, the user must always specify the chosen QOS. QOSes differ in the set of resources that they provide access to because each of them is designed to fit a given type of job.

Mufasa sets limits on the number of jobs by the same user. This page includes a table summarising such limits.

SLURM Quality of Service (QOS)

Through Quality of Service levels (QOSes), SLURM lets system configurators assign a name to a set of related constraints.

In Mufasa 2.0, QOSes are used to define different levels of access to the server's resources. When executing a job with SLURM, a user must always specify the QOS that their job will use: this choice, in turn, determines what resources the job is able to access and influences the priority of the job.

Mufasa's QOSes and their features can be inspected with command

sacctmgr list qos format=name%-11,priority,MaxSubmit,maxwall,maxtres%-80

which provides an output similar to the following:

Name          Priority MaxSubmit     MaxWall MaxTRES                                                                          
----------- ---------- --------- ----------- -------------------------------------------------------------------------------- 
normal               0                                                                                                        
nogpu                4         1  3-00:00:00 cpu=16,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=128G 
gpuheavy-20          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=2,mem=128G             
gpuheavy-40          1         1             cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=1,gres/gpu:4g.20gb=0,mem=128G             
gpulight             8         1    12:00:00 cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              
gpu                  2         1  1-00:00:00 cpu=8,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G              
gpuwide              2         2  1-00:00:00 cpu=8,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=1,mem=64G              
build               32         1    02:00:00 cpu=2,gres/gpu:3g.20gb=0,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,gres/gpu=0,mem=16G

The columns of this output are the following:

Name
name of the QOS
Priority
priority tier associated with the QOS (higher value = higher priority): see Job priority for details
MaxSubmit
maximum number of jobs from a single user that can be submitted to SLURM with this QOS; submitted jobs include both running and queued jobs
See Limits on jobs by the same user for an overview of the limits on jobs set by Mufasa.
MaxWall
maximum wall clock duration of the jobs using the QOS (after which they are killed by SLURM), in format [days-]hours:minutes:seconds
For some QOSes this limit is not set: in that case the maximum duration is determined by the partition. Partitions also define the default duration of jobs.
MaxTRES
amount of resources subject to limitations ("Trackable RESources") available to a job using the QOS, where
cpu=K means that the maximum number of CPUs (i.e., processor cores) is K
--> if not specified, the job gets the default amount of CPUs specified by the partition
gres/gpu:Type=K means that the maximum number of GPUs of class Type (see gres syntax) is K
--> (for QOSes that allow access to GPUs) if not specified, the job cannot be launched
mem=KG means that the maximum amount of system RAM is K GBytes
--> if not specified, the job gets the default amount of RAM specified by the partition

For instance, QOS gpulight provides jobs that use it with:

  • priority tier equal to 8
  • a maximum of 1 submitted job per user
  • a maximum of 12 hours of duration
  • a maximum of 2 CPUs
  • a maximum of 64 GB of RAM
  • this access to GPUs:
    • max 1 GPU of type gpu:3g.20gb
    • no GPUs of type gpu:40gb
    • no GPUs of type gpu:4g.20gb
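
The reading of MaxTRES given above can be sketched as a small shell function. This is only an illustration: the function name explain_tres is hypothetical (it is not part of SLURM), and the interpretation of a bare gres/gpu entry as "GPUs of any type" is an assumption based on how the nogpu and build QOSes use it.

```shell
# Hypothetical helper: turn a MaxTRES string into one readable line per limit.
explain_tres() {
    # $1 = MaxTRES string, e.g. the one shown for gpulight
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= '
        $1 == "cpu"        { printf "max CPUs: %s\n", $2; next }
        $1 == "mem"        { printf "max RAM: %s\n", $2; next }
        $1 ~ /^gres\/gpu:/ { sub(/^gres\/gpu:/, "", $1)
                             printf "max GPUs of type %s: %s\n", $1, $2; next }
        $1 == "gres/gpu"   { printf "max GPUs of any type: %s\n", $2 }'
}

# MaxTRES of the gpulight QOS, copied from the example output
explain_tres "cpu=2,gres/gpu:3g.20gb=1,gres/gpu:40gb=0,gres/gpu:4g.20gb=0,mem=64G"
```

Applied to the gpulight string, the function prints the same limits listed in the bullets above, one per line.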

As seen in the example output from sacctmgr list qos above, each QOS has an associated priority tier. In Mufasa 2.0, priority tiers are used to encourage users to use the least powerful QOS that is compatible with their needs, where "powerful" means "rich with resources". Less powerful QOSes increase the priority of the jobs that use them, so these jobs tend to be executed sooner.

See Priority to understand how priority affects the execution order of jobs in Mufasa 2.0.

The normal QOS is the default one: it exists only to ensure that users always specify a QOS when running a job. Since normal has zero priority and no resources, a job submitted with this QOS would never be executed.

The build QOS

This QOS is specifically designed to be used by Mufasa users to build container images. Its associated priority tier is very high, so SLURM jobs launched using this QOS are executed quickly.

The build QOS, though, has resources that are strictly limited to those needed for building operations; additionally, it has no access to GPUs and a short maximum duration for jobs. This makes it unsuitable for computing activities different from building containers.

See Building Singularity images for directions about building Singularity container images.

Restricted QOSes

In Mufasa, the most powerful QOSes are reserved for researchers (including Ph.D. students), and are not available to M.Sc. students.

See below to understand the differences between research users and students users.

research users and students users

Users of Mufasa belong to two user categories, which provide the users belonging to them with different access to system resources. The idea behind these categories is to provide researchers with more access to Mufasa's resources, without preventing students from using the server.

User categories are:

research, i.e. academic personnel and Ph.D. students
* have access to all QOSes
* their jobs have a higher base priority
* the number of running jobs that the user can have is higher
students, i.e. M.Sc. students
* have access to a restricted set of QOSes
* their jobs have a lower base priority
* the number of running jobs that the user can have is lower

You can inspect the differences between research and students users with command

sacctmgr list association format="account,priority,maxjobs" | grep -E 'Account|research|students'

which provides an output similar to the following:

   Account   Priority MaxJobs 
  research          4       2 
  students          1       1

To know what limits apply to your own user, use command

sacctmgr list association where user=$USER format="user,priority,maxjobs,qos%-60"

which provides an output similar to the following:

      User   Priority MaxJobs QOS                                                          
---------- ---------- ------- ------------------------------------------------------------ 
    preali          4       2 build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu

The list under "QOS" shows what QOSes your user is allowed to use when running jobs. research users can use all of them, while students users can only access a subset of them.
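
Before submitting a job, it can be handy to check whether a given QOS appears in that list. The following is a minimal sketch: the function name qos_allowed is hypothetical, and the sacctmgr call in the comment assumes the -n (no header) and -P (parsable output) options to obtain the bare comma-separated list.

```shell
# Hypothetical helper: succeed if QOS $1 appears in the comma-separated list $2.
qos_allowed() {
    case ",$2," in
        *",$1,"*) return 0 ;;   # exact match between commas, so "gpu" does not match "gpulight"
        *)        return 1 ;;
    esac
}

# Sample list copied from the example output above.
# On Mufasa it could be obtained (assumption, not run here) with:
#   sacctmgr -n -P list association where user=$USER format=qos
my_qos="build,gpu,gpuheavy-20,gpuheavy-40,gpulight,gpuwide,nogpu"

if qos_allowed gpuwide "$my_qos"; then
    echo "gpuwide is available to this user"
fi
```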

Limits on jobs by the same user

Mufasa sets limits on the number of jobs from a single user. Such limits aim at preventing users from "hoarding" system resources, and apply to:

  • submitted jobs, i.e. jobs that the user asked SLURM to execute, each of which may currently be either running or queued
  • running jobs, i.e. jobs that are currently in execution

The following table summarises the limits that Mufasa sets on the number of jobs by the same user:

                      number of running jobs                number of submitted jobs
--------------------- ------------------------------------- ----------------------------------------
global limits         2 for research users                  not limited directly, but cannot exceed
(system-wide)         1 for students users                  the sum of the limits on submitted jobs
                                                            set by the QOSes (below)

limits for each QOS   not limited directly, but cannot      2 for gpuwide QOS
                      exceed the global limit on running    1 for all other QOSes
                      jobs (above) nor the QOS limit on
                      submitted jobs (on the right)

Limits on the number of running jobs depend on the user category (either research or students) that the user belongs to; limits on the number of submitted jobs depend on the properties of the SLURM QOSes used to launch them.
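
To see how close you are to these limits, you can count your own running and submitted jobs. The sketch below is only illustrative: the function name count_jobs is hypothetical, it considers only the RUNNING and PENDING states, and the squeue call in the comment assumes the -h (no header), -u (user) and -o "%T" (job state) options.

```shell
# Hypothetical helper: read one job state per line on standard input and
# print how many jobs are running and how many are submitted (running + queued).
count_jobs() {
    awk '
        $1 == "RUNNING" { running++ }
        $1 == "PENDING" { queued++ }
        END { printf "running=%d submitted=%d\n", running, running + queued }'
}

# On Mufasa the states could be obtained (assumption, not run here) with:
#   squeue -h -u "$USER" -o "%T" | count_jobs
# Here we use sample data: one running job and two queued jobs.
printf 'RUNNING\nPENDING\nPENDING\n' | count_jobs
```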

Job priority

Once the execution of a job has been requested, the job is not run immediately: it is instead queued by SLURM, together with all the other jobs awaiting execution. The job on top of the queue at any time is the first to be put into execution as soon as the resources it requires are available.

The order of the items in the job queue depends on their priority. The job with the highest priority is put by SLURM on top of the queue, while all other jobs are put in the queue in order of descending priority.

The goal of SLURM is to maximise resource availability, i.e. to ensure the shortest possible wait time before job execution. To achieve this goal, in Mufasa SLURM is configured to encourage users not to ask for resources or execution time that their job doesn't need. This is done via the priority mechanism: the more resources and/or time a job requests, the lower its priority will be, and the later it will be executed.

Priority management in Mufasa is designed to set up a virtuous cycle where users, by carefully choosing what to ask for, obtain two results:

  • they ensure that their job is executed as soon as possible;
  • they leave as much as possible of Mufasa's resources free for other users' jobs.

Elements determining job priority

In Mufasa, the priority of a job is computed by SLURM according to the following elements:

User category (i.e., research or students)
Used to provide higher priority to jobs run by research personnel
QOS used by the job
Used to provide higher priority to jobs asking for fewer resources
Number of CPUs requested by the job (also called "job size")
Used to provide higher priority to jobs asking for a lower number of CPUs
Job duration, i.e. the execution time requested by the job
Used to provide higher priority to shorter jobs
Job Age, i.e. the time that the job has been waiting in the queue
Used to provide higher priority to jobs which have been queued for a longer time
FairShare, i.e. a factor computed by SLURM to balance use of the system by different users
Used to provide higher priority to jobs by users who used Mufasa less than others

The main features of FairShare are:

  • the FairShare value is higher for users whose jobs used fewer CPUs and GPUs, less RAM and less execution time
  • the FairShare mechanism has a "fading memory", i.e. resource use has more impact on it if recent, less if farther in the past

How to maximise the priority of your jobs

Every time you run a SLURM job, follow these guidelines:

Choose the least powerful QOS compatible with the needs of your job
QOSes with access to fewer resources lead to higher priority
Only request CPUs that your job will actually use
Unless you explicitly designed your code to exploit multiple CPUs, check whether it actually does. If it doesn't, do not ask for them
Do not request more time than your job needs to complete
Make a worst-case estimate and only ask for that duration
Test and debug your code using less powerful QOSes before running it on more powerful QOSes
Your test jobs will get a higher priority and your FairShare will improve
Cancel jobs when you don't need them anymore
Use scancel to delete your jobs when finished (or if they become useless due to a bug): your FairShare will improve

Suggestion: if you're going to run a job, it's a good idea to look for unused GPUs before choosing what GPU to request. Choosing a GPU that is currently idle should help your job get run sooner.

System resources subjected to limitations

In systems based on SLURM like Mufasa, TRES (Trackable RESources) are, quoting SLURM's documentation, "resources that can be tracked for usage or used to enforce limits against."

TRES include CPUs, RAM and GRES. The last term stands for Generic RESources that a job may need for its execution. In Mufasa, the only gres resources are the GPUs.

gres syntax

To ask SLURM to assign GRES resources (i.e., GPUs) to a job, a special syntax must be used. Precisely, the name of each GPU resource takes the form

gpu:<Type>

Considering the GPU complement of Mufasa, the available GPU resources are:

  • gpu:40gb for GPUs with 40 Gbytes of RAM
  • gpu:4g.20gb for GPUs with 20 Gbytes of RAM and 4 compute units
  • gpu:3g.20gb for GPUs with 20 Gbytes of RAM and 3 compute units

So, for instance,

gpu:3g.20gb

identifies a resource corresponding to a GPU with 20 GB of RAM and 3 compute units.

When asking for a GRES resource (e.g., in an srun command or an SBATCH directive of an execution script), the syntax required by SLURM is

gpu:<Type>:<Quantity>

where Quantity is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type 4g.20gb the syntax is

gpu:4g.20gb:2
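
As a minimal sketch of how such a request string can be assembled, the shell function below builds it from a GPU type and a quantity, rejecting types that Mufasa does not have. The function name gres_request is hypothetical; the srun line in the comment is only an example of where the string would be used, with the QOS, duration and command chosen for illustration.

```shell
# Hypothetical helper: print a gres request string for GPU type $1 and
# quantity $2 (default 1); fail for types not listed on this page.
gres_request() {
    case "$1" in
        40gb|4g.20gb|3g.20gb) ;;
        *) echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
    printf 'gpu:%s:%s\n' "$1" "${2:-1}"
}

gres_request 4g.20gb 2

# Example of use in a job request (illustrative, not run here):
#   srun --qos=gpuheavy-20 --time=01:00:00 --gres="$(gres_request 4g.20gb 2)" hostname
```

The same string can be used in an SBATCH directive of an execution script, e.g. a line of the form #SBATCH --gres=gpu:4g.20gb:2.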

SLURM's generic resources are defined in /etc/slurm/gres.conf. In order to make GPUs available to SLURM's gres management, Mufasa makes use of Nvidia's NVML library. For additional information see SLURM's documentation.

Looking for unused GPUs

GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to use a QOS associated with a type of GPU that has one or more units currently idle. This command

sinfo -O Gres:100

provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides an output similar to the following:

GRES                                                                                                
gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5

To know which of the GPUs are currently in use, use command

sinfo -O GresUsed:100

which provides an output similar to this:

GRES_USED
gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)

By comparing the two lists (GRES and GRES_USED) you can easily spot unused GPUs.
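
The comparison can also be automated. The following is a sketch under stated assumptions: the function name free_gpus is hypothetical, it expects exactly one GRES line and one GRES_USED line (which matches a single-node system such as the one in the examples), and the sinfo calls in the comment assume the -h (no header) option.

```shell
# Hypothetical helper: given the GRES line ($1) and the GRES_USED line ($2),
# print the number of idle GPUs for each GPU type.
free_gpus() {
    printf '%s\n%s\n' "$1" "$2" |
    # drop the "(IDX:...)" parts (they contain commas) and any padding spaces
    sed 's/([^)]*)//g; s/[[:space:]]//g' |
    awk -F, '
        NR == 1 { n = NF
                  for (i = 1; i <= NF; i++) { split($i, p, ":"); type[i] = p[2]; total[p[2]] = p[3] } }
        NR == 2 { for (i = 1; i <= NF; i++) { split($i, p, ":"); used[p[2]] = p[3] } }
        END     { for (i = 1; i <= n; i++) printf "%s free=%d\n", type[i], total[type[i]] - used[type[i]] }'
}

# Sample data copied from the example outputs above.
# On Mufasa the two lines could be obtained (assumption, not run here) with:
#   free_gpus "$(sinfo -h -O Gres:100)" "$(sinfo -h -O GresUsed:100)"
free_gpus "gpu:40gb:3,gpu:4g.20gb:5,gpu:3g.20gb:5" \
          "gpu:40gb:2(IDX:0-1),gpu:4g.20gb:2(IDX:5,8),gpu:3g.20gb:3(IDX:3-4,6)"
```

With the sample data, the function reports 1 idle GPU of type 40gb, 3 of type 4g.20gb and 2 of type 3g.20gb.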

SLURM partitions

Partitions are another mechanism provided by SLURM to create different levels of access to system resources. Since in Mufasa 2.0 access to resources is controlled via QOSes, partitions are not very relevant. (This is one of the main differences between Mufasa 1.0 and Mufasa 2.0.)

Note, however, that the default values for some features of SLURM jobs (e.g., duration) are set by the partition.

In Mufasa 2.0, there is a single SLURM partition, called jobs, and all jobs run on it. The state of jobs can be inspected with

sinfo -o "%10P %5a %9T %11L %10l"

which provides an output similar to the following:

PARTITION  AVAIL STATE     DEFAULTTIME TIMELIMIT 
jobs*      up    idle      1:00:00     3-00:00:00

where columns correspond to the following information:

PARTITION
name of the partition; the asterisk indicates that it's the default one
AVAIL
state/availability of the partition: see below
STATE
state of the partition's resources, expressed with SLURM's state codes
typical values are mixed, meaning that some of the resources are busy executing jobs while others are idle, and allocated, meaning that all of the resources are in use
DEFAULTTIME
default runtime of a job, in format [days-]hours:minutes:seconds
TIMELIMIT
maximum runtime of a job, in format [days-]hours:minutes:seconds
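
When comparing durations (for instance, a requested time against TIMELIMIT), it can help to convert them to seconds. The function below is a minimal sketch: its name slurm_time_to_seconds is hypothetical, and it only handles the [days-]hours:minutes:seconds form used on this page, not the shorter forms that SLURM also accepts.

```shell
# Hypothetical helper: convert [days-]hours:minutes:seconds into seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        # an optional "days-" prefix comes before the clock part
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        split(t, f, ":")
        print days * 86400 + f[1] * 3600 + f[2] * 60 + f[3]
    }'
}

slurm_time_to_seconds 1:00:00      # DEFAULTTIME from the example
slurm_time_to_seconds 3-00:00:00   # TIMELIMIT from the example
```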

The asterisk at the end of the partition name indicates the default partition, i.e. the one on which jobs which do not ask for a specific partition are run.

Command sinfo does not tell you about the jobs submitted to a partition. This information is obtained, instead, with command squeue.

Partition availability

The most important information that sinfo provides is the availability (also called state) of partitions. This is shown in column "AVAIL". Possible partition states are:

up = the partition is available
Currently running jobs will be completed
It's possible to launch jobs on the partition
Queued jobs will be executed as soon as resources allow
drain = the partition is in the process of becoming unavailable (i.e., of entering the down state: see below)
Currently running jobs will be completed
It's not possible to launch jobs on the partition
Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)
down = the partition is unavailable
There are no running jobs
It's not possible to launch jobs on the partition
Queued jobs will be executed when the partition becomes available again (i.e. goes back to the up state)

When a partition goes from up to drain no harm is done to running jobs. In a normally functioning SLURM system, the passage from up or drain to down happens only when no jobs are running on the partition. If (e.g., due to a malfunction) the passage happens with jobs still running, they get killed.

A partition in state drain or down requires intervention by a Job Administrator to be restored to up.

Default values

The features of SLURM partitions, including the default values which are applied to jobs that do not make explicit requests, can be inspected with

scontrol show partition

which provides an output similar to this:

PartitionName=jobs
   AllowGroups=ALL AllowAccounts=ALL AllowQos=nogpu,gpulight,gpu,gpuwide,gpuheavy-20,gpuheavy-40
   AllocNodes=ALL Default=YES QoS=N/A
=> DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=gn01
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
=> DefMemPerNode=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=48,mem=1011435M,node=1,billing=49,gres/gpu=13,gres/gpu:3g.20gb=5,gres/gpu:40gb=3,gres/gpu:4g.20gb=5
   TRESBillingWeights=cpu=1.0,gres/gpu:3g.20gb=6.0,gres/gpu:4g.20gb=6.0,gres/gpu:40gb=6.0,mem=0.05g

In the example, we have highlighted with "=>" the most relevant default values for Mufasa users, i.e.:

DefaultTime
the default execution time assigned to a job run on the partition (e.g., 1 hour)
DefMemPerNode
the default amount of RAM assigned to a job run on the partition (e.g., 4GB)
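
Note that the "=>" markers are not part of the real output of scontrol. To pick out just the two default values, a filter like the following can be used; it is only a sketch, and the function name partition_defaults is hypothetical.

```shell
# Hypothetical helper: read the output of "scontrol show partition" on
# standard input and print only the DefaultTime and DefMemPerNode settings.
partition_defaults() {
    # put each Key=Value pair on its own line, then keep the two of interest
    tr -s ' ' '\n' | grep -E '^(DefaultTime|DefMemPerNode)='
}

# Sample lines copied from the example output above.
# On Mufasa one would run instead (not run here):
#   scontrol show partition jobs | partition_defaults
printf '%s\n' \
    '   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO' \
    '   DefMemPerNode=4096 MaxMemPerNode=UNLIMITED' | partition_defaults
```

In the example, DefMemPerNode=4096 corresponds to the 4 GB default mentioned above, since the value is expressed in megabytes.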