Difference between revisions of "User Jobs"
(44 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
This page presents the features of Mufasa that are most relevant to Mufasa's [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them). | This page presents the features of Mufasa that are most relevant to Mufasa's [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them). | ||
= System resources subjected to limitations = | = System resources subjected to limitations = | ||
Line 92: | Line 91: | ||
<pre style="color: lightgrey; background: black;"> | <pre style="color: lightgrey; background: black;"> | ||
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST | ||
debug | debug* up 20:00 1 mix gn01 | ||
small | small up 12:00:00 1 mix gn01 | ||
normal up 1-00:00:00 1 mix gn01 | normal up 1-00:00:00 1 mix gn01 | ||
longnormal up 3-00:00:00 1 mix gn01 | longnormal up 3-00:00:00 1 mix gn01 | ||
Line 101: | Line 100: | ||
</pre> | </pre> | ||
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside " | In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside "debug" indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified. (On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to.) | ||
The columns in the standard output of <code>sinfo</code> shown above correspond to the following information: | The columns in the standard output of <code>sinfo</code> shown above correspond to the following information: | ||
Line 151: | Line 150: | ||
<pre style="color: lightgrey; background: black;"> | <pre style="color: lightgrey; background: black;"> | ||
sacctmgr list qos format=name | sacctmgr list qos format=name%-10,maxwall,maxtres%-64 | ||
</pre> | </pre> | ||
which provides an output similar to the following: | |||
<pre style="color: lightgrey; background: black;"> | <pre style="color: lightgrey; background: black;"> | ||
Name | Name MaxWall MaxTRES | ||
normal | ---------- ----------- ---------------------------------------------------------------- | ||
small | normal 1-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G | ||
longnormal 3-00:00:00 | small 12:00:00 cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G | ||
gpu | longnormal 3-00:00:00 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G | ||
gpulong | gpu 1-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G | ||
fat | gpulong 3-00:00:00 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G | ||
fat 3-00:00:00 cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G | |||
</pre> | </pre> | ||
Line 197: | Line 179: | ||
: <code>'''gres/''Name:Type''=''K'''''</code> means that the maximum number of GPUs of class <code>''Name:Type''</code> (see [[User Jobs#gres syntax|<code>gres</code> syntax]]) is ''K'' | : <code>'''gres/''Name:Type''=''K'''''</code> means that the maximum number of GPUs of class <code>''Name:Type''</code> (see [[User Jobs#gres syntax|<code>gres</code> syntax]]) is ''K'' | ||
: <code>'''mem=''K''G'''</code> means that the maximum amount of system RAM is ''K'' GBytes | : <code>'''mem=''K''G'''</code> means that the maximum amount of system RAM is ''K'' GBytes | ||
Note that there may be additional limits to the possibility to fully exploit the resources of a partition. For instance, there may be a cap on the maximum number of GPUs that can be used at the same time by a single job and/or a single user. | |||
== Partition availability == | == Partition availability == | ||
Line 254: | Line 238: | ||
= Running jobs with SLURM: generalities = | = Running jobs with SLURM: generalities = | ||
'''''Note''': these are general considerations. See [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]] for instructions about running your own processing jobs on Mufasa.'' | '''''Note''': these are general considerations. See [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]] for instructions about running your own processing jobs on Mufasa.'' | ||
Line 328: | Line 313: | ||
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process. | If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process. | ||
== Other resources == | |||
The contents of this wiki are specifically tailored for users of Mufasa. They should include everything Mufasa users need to make good use of the machine. However, specific needs vary and advanced users may require advanced functionalities of SLURM that are not covered here. | |||
There are a lot of resources on the internet dealing with the execution of jobs using SLURM. Usually these have been published for the benefit of the users of a specific High Performance Computing system, so there's no guarantee that whatever they suggest will work on Mufasa. If you feel the need to look for external resources, you may start with [https://www.e4company.com/en/2021/01/creating-job-with-slurm-how-to-and-automation-examples/ this one], which has been prepared by the same people who built Mufasa. | |||
= Executing jobs on Mufasa = | = Executing jobs on Mufasa = | ||
Line 375: | Line 364: | ||
<pre style="color: lightgrey; background: black;"> | <pre style="color: lightgrey; background: black;"> | ||
srun [‑p <partition_name>] ‑‑container-image <container_path.sqsh> [--job-name=<jobname>] [‑‑no‑container‑entrypoint] ‑‑container‑mounts=<mufasa_dir>:<docker_dir> [‑‑gres=<gpu_resources>] [‑‑mem=<mem_resources>] [‑‑cpus‑per‑task=<cpu_amount>] [‑‑time=<duration>] ‑‑pty | srun [‑p <partition_name>] ‑‑container-image=<container_path.sqsh> [--job-name=<jobname>] [‑‑no‑container‑entrypoint] ‑‑container‑mounts=<mufasa_dir>:<docker_dir> [‑‑gres=<gpu_resources>] [‑‑mem=<mem_resources>] [‑‑cpus‑per‑task=<cpu_amount>] [‑‑time=<duration>] ‑‑pty <command_to_run_within_container> | ||
</pre> | </pre> | ||
Line 381: | Line 370: | ||
<pre style="color: lightgrey; background: black;"> | <pre style="color: lightgrey; background: black;"> | ||
srun [‑p <partition_name>] ‑‑container-image <container_path.sqsh> [--job-name=<jobname>] [‑‑no‑container‑entrypoint] ‑‑container‑mounts=<mufasa_dir>:<docker_dir> [‑‑gres=<gpu_resources>] [‑‑mem=<mem_resources>] [‑‑cpus‑per‑task=<cpu_amount>] [‑‑time=<duration>] <command_to_run_within_container> | srun [‑p <partition_name>] ‑‑container-image=<container_path.sqsh> [--job-name=<jobname>] [‑‑no‑container‑entrypoint] ‑‑container‑mounts=<mufasa_dir>:<docker_dir> [‑‑gres=<gpu_resources>] [‑‑mem=<mem_resources>] [‑‑cpus‑per‑task=<cpu_amount>] [‑‑time=<duration>] [<command_to_run_within_container>] | ||
</pre> | </pre> | ||
Line 389: | Line 378: | ||
:;‑p <partition_name> | :;‑p <partition_name> | ||
:: specifies the [[User Jobs#SLURM partitions|SLURM partition]] on which the job will be run. | :: specifies the [[User Jobs#SLURM partitions|SLURM partition]] on which the job will be run. If it is not specified, the ''default partition'' is used. | ||
:: ''Important! If <code>‑‑p <partition_name></code> is used, options that specify how many resources to assign to the job (such as <code>‑‑mem=<mem_resources></code>, <code>‑‑cpus‑per‑task=<cpu_amount></code> or <code>‑‑time=<duration></code>) can be omitted, greatly | :: ''Important! The chosen partition limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is allowed by the chosen partition.'' | ||
:: ''Important! If <code>‑‑p <partition_name></code> is used, options that specify how many resources to assign to the job (such as <code>‑‑mem=<mem_resources></code>, <code>‑‑cpus‑per‑task=<cpu_amount></code> or <code>‑‑time=<duration></code>) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception concerns option <code>‑‑gres=<gpu_resources></code>, which is always required (see below) if the job needs access to GPUs.'' | |||
:; --job-name=<jobname> | :; --job-name=<jobname> | ||
:: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with <code>squeue</code>. The default job name (i.e., the one assigned to the job when <code>--job-name</code> is not used) is the executable program's name. | :: Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with <code>squeue</code>. The default job name (i.e., the one assigned to the job when <code>--job-name</code> is not used) is the executable program's name. | ||
:;‑‑container-image <container_path.sqsh> | :;‑‑container-image=<container_path.sqsh> | ||
:: specifies the container to be run | :: specifies the container to be run | ||
Line 473: | Line 464: | ||
<nowiki>#</nowiki>----------------start of preamble---------------- | <nowiki>#</nowiki>----------------start of preamble---------------- | ||
'''<nowiki>#</nowiki> SBATCH ‑p <partition_name>''' | '''<nowiki>#</nowiki>SBATCH ‑p <partition_name>''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑container-image <container_path.sqsh>''' | '''<nowiki>#</nowiki>SBATCH ‑‑container-image=<container_path.sqsh>''' | ||
'''<nowiki>#</nowiki> SBATCH --job-name=<name>''' | '''<nowiki>#</nowiki>SBATCH --job-name=<name>''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑no‑container‑entrypoint''' | '''<nowiki>#</nowiki>SBATCH ‑‑no‑container‑entrypoint''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑container‑mounts=<mufasa_dir>:<docker_dir>''' | '''<nowiki>#</nowiki>SBATCH ‑‑container‑mounts=<mufasa_dir>:<docker_dir>''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑gres=<gpu_resources>''' | '''<nowiki>#</nowiki>SBATCH ‑‑gres=<gpu_resources>''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑mem=<mem_resources>''' | '''<nowiki>#</nowiki>SBATCH ‑‑mem=<mem_resources>''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑cpus-per-task=<cpu_amount>''' | '''<nowiki>#</nowiki>SBATCH ‑‑cpus-per-task=<cpu_amount>''' | ||
'''<nowiki>#</nowiki> SBATCH ‑‑time=<d-hh:mm:ss>''' | '''<nowiki>#</nowiki>SBATCH ‑‑time=<d-hh:mm:ss>''' | ||
: <nowiki>#</nowiki> The following directives (not described [[User Jobs#Using SLURM to run a Docker container|so far]]) activate SLURM's email notifications: | : <nowiki>#</nowiki> The following directives (not described [[User Jobs#Using SLURM to run a Docker container|so far]]) activate SLURM's email notifications: | ||
Line 495: | Line 486: | ||
: <nowiki>#</nowiki> the first specifies where they are sent; the following 3 set up notifications start/end/failure of job execution | : <nowiki>#</nowiki> the first specifies where they are sent; the following 3 set up notifications start/end/failure of job execution | ||
'''<nowiki>#</nowiki> SBATCH --mail-user <email_address>''' | '''<nowiki>#</nowiki>SBATCH --mail-user <email_address>''' | ||
'''<nowiki>#</nowiki> SBATCH --mail-type BEGIN''' | '''<nowiki>#</nowiki>SBATCH --mail-type BEGIN''' | ||
'''<nowiki>#</nowiki> SBATCH --mail-type END''' | '''<nowiki>#</nowiki>SBATCH --mail-type END''' | ||
'''<nowiki>#</nowiki> SBATCH --mail-type FAIL''' | '''<nowiki>#</nowiki>SBATCH --mail-type FAIL''' | ||
<nowiki>#</nowiki>----------------end of preamble---------------- | <nowiki>#</nowiki>----------------end of preamble---------------- | ||
Line 525: | Line 516: | ||
Also note that, once a Docker container launched with <code>srun</code> is in execution, its own bash shell is completely indistinguishable from the bash shell of Mufasa where the <code>srun</code> command that put the container in execution was issued. The two shells share the same terminal window. The only clue to the fact that you now are, in fact, in the container's shell may be the command prompt, which should now show your location as <code>/opt</code>. | Also note that, once a Docker container launched with <code>srun</code> is in execution, its own bash shell is completely indistinguishable from the bash shell of Mufasa where the <code>srun</code> command that put the container in execution was issued. The two shells share the same terminal window. The only clue to the fact that you now are, in fact, in the container's shell may be the command prompt, which should now show your location as <code>/opt</code>. | ||
= Detaching from a running job with <code>screen</code> = | = Detaching from a running job with <code>screen</code> = | ||
Line 540: | Line 530: | ||
== Creating a screen session, running a job in it, detaching from it == | == Creating a screen session, running a job in it, detaching from it == | ||
# Connect to Mufasa with SSH | # Connect to Mufasa with SSH | ||
# From the Mufasa shell, run <pre style="color: lightgrey; background: black;">screen</pre> | # From the Mufasa shell, run <pre style="color: lightgrey; background: black;">screen</pre> | ||
Line 547: | Line 538: | ||
== Reattaching to an active screen session == | == Reattaching to an active screen session == | ||
# Connect to Mufasa with SSH | # Connect to Mufasa with SSH | ||
# In the Mufasa shell, run <pre style="color: lightgrey; background: black;">screen -r</pre> | # In the Mufasa shell, run <pre style="color: lightgrey; background: black;">screen -r</pre> | ||
Line 552: | Line 544: | ||
== Closing (i.e. destroying) a screen session == | == Closing (i.e. destroying) a screen session == | ||
When you do not need a screen session anymore: | When you do not need a screen session anymore: | ||
Line 557: | Line 550: | ||
# destroy the screen by pressing '''ctrl + A''' followed by '''\''' (i.e., backslash) | # destroy the screen by pressing '''ctrl + A''' followed by '''\''' (i.e., backslash) | ||
Of course, any | Of course, any program running within the screen gets terminated when the screen is destroyed. | ||
= Using <code>salloc</code> to reserve resources = | = Using <code>salloc</code> to reserve resources = | ||
Line 564: | Line 556: | ||
== What is <code>salloc</code>? == | == What is <code>salloc</code>? == | ||
[https://slurm.schedmd.com/salloc.html <code>salloc</code>] is a SLURM command that allows a user to reserve a set of resources (e.g., a GPU) for a given time in the future. | [https://slurm.schedmd.com/salloc.html <code>salloc</code>] is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future. | ||
The typical use of <code>salloc</code> is to "book" interactive | The typical use of <code>salloc</code> is to "book" an interactive session where the user enjoys '''complete control of a set of resources'''. The resources that are part of this set are chosen by the user. Within the "booked" session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM. | ||
More precisely: | |||
* the user, using <code>salloc</code>, specifies what resources they need and the time when they will need them; | |||
* when the delivery time comes, SLURM creates an interactive shell session for the user; | |||
* within such session, the user can use <code>srun</code> and <code>sbatch</code> to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources. | |||
Resource reservation using <code>salloc</code> is only possible if the request is made in advance of the delivery time. The higher the demand for the resources that the user wants to reserve, the earlier the request should be made to ensure that SLURM is able to fulfill it. | |||
When a user makes a request with <code>salloc</code>, the request (called an '''allocation''') gets added to the job queue of SLURM of the requisite partition as a job in <code>pending</code> (<code>PD</code>) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of | When a user makes a request for resources with <code>salloc</code>, the request (called an '''allocation''') gets added to the job queue of SLURM of the requisite partition as a job in <code>pending</code> (<code>PD</code>) state (job states are described [[User_Jobs#Interpreting Job state as provided by squeue|here]]). Indeed, resource allocation is the first part of SLURM's process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using <code>salloc</code> actually corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user. | ||
Until the delivery time specified by the user comes, the allocation remains in state <code>PD</code>, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the | Until the delivery time specified by the user comes, the allocation remains in state <code>PD</code>, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the <code>PD</code> state, the stronger this accumulation of priority: so, by requesting resources with <code>salloc</code> '''well in advance of the delivery time''', users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended. | ||
== <code>salloc</code> commands == | == <code>salloc</code> commands == | ||
Line 585: | Line 578: | ||
<pre style="color: lightgrey; background: black;"> | <pre style="color: lightgrey; background: black;"> | ||
salloc [--job-name=<jobname>] [‑‑gres=<gpu_resources>] [‑‑mem=<mem_resources>] [‑‑cpus‑per‑task=<cpu_amount>] ‑‑time=<duration> --begin=<time> | salloc [-p <partition_name>] [--job-name=<jobname>] [‑‑gres=<gpu_resources>] [‑‑mem=<mem_resources>] [‑‑cpus‑per‑task=<cpu_amount>] ‑‑time=<duration> --begin=<time> | ||
</pre> | </pre> | ||
Line 591: | Line 584: | ||
Below, the elements of the command are explained. | Below, the elements of the command are explained. | ||
:;‑p <partition_name> | |||
:: specifies the [[User Jobs#SLURM partitions|SLURM partition]] on which the job will be run. If it is not specified, the ''default partition'' is used. | |||
:: ''Important! The chosen partition limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is allowed by the chosen partition.'' | |||
:: ''Important! If <code>‑‑p <partition_name></code> is used, options that specify how many resources to assign to the job (such as <code>‑‑mem=<mem_resources></code>, <code>‑‑cpus‑per‑task=<cpu_amount></code> or <code>‑‑time=<duration></code>) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception concerns option <code>‑‑gres=<gpu_resources></code>, which is always required (see below) if the job needs access to GPUs.'' | |||
:; --job-name=<jobname> | :; --job-name=<jobname> | ||
Line 612: | Line 612: | ||
:: specifies the delivery time of the resources reserved with <code>salloc</code>, according to the syntax described below. The delivery time must be a future time. | :: specifies the delivery time of the resources reserved with <code>salloc</code>, according to the syntax described below. The delivery time must be a future time. | ||
=== Syntax of parameter < | === Syntax of parameter <code>--begin</code> === | ||
If the allocation is for the current day, you can specify <nowiki><time></nowiki> as hours and minutes in the form | If the allocation is for the current day, you can specify <nowiki><time></nowiki> as hours and minutes in the form | ||
Line 646: | Line 646: | ||
== How to use <code>salloc</code> == | == How to use <code>salloc</code> == | ||
In the typical scenario, the user of <code>salloc</code> will make use of [User_Jobs#Detaching from a running job with screen|screen] | In the typical scenario, the user of <code>salloc</code> will make use of [[User_Jobs#Detaching from a running job with screen|screen]]. Command <code>screen</code> creates a shell session (called "a screen") that it is possible to abandon without closing it ("detaching from the screen"). It is then possible to reach again the screen at a later time ("reattaching to the screen"). This means that a user can create a screen, run <code>salloc</code> within it to create an allocation for time X, detach from the screen and reattach to it just before time X to use the reserved resources from the interactive session created by <code>salloc</code>. | ||
More precisely, the operations needed to do this are the following: | |||
# [[System#Accessing Mufasa|Connect to Mufasa with SSH]]. | # [[System#Accessing Mufasa|Connect to Mufasa with SSH]]. | ||
# From the Mufasa shell, run <pre style="color: lightgrey; background: black;">screen</pre> | # From the Mufasa shell, run <pre style="color: lightgrey; background: black;">screen</pre> | ||
# In the ''screen session'' ("screen") thus created run the [[ | # In the ''screen session'' ("screen") thus created run the [[User Jobs#salloc commands|<code>salloc</code> command]], specifying via its options the resources you need and the time at which you want them delivered. | ||
# SLURM will respond with a message similar to <pre style="color: lightgrey; background: black;">salloc: Pending job allocation XXXX</pre> | # SLURM will respond with a message similar to <pre style="color: lightgrey; background: black;">salloc: Pending job allocation XXXX</pre> | ||
# ''Detach'' from the screen by pressing '''''ctrl + A''''' followed by '''''D''''': you will come back to the original Mufasa shell. | # ''Detach'' from the screen by pressing '''''ctrl + A''''' followed by '''''D''''': you will come back to the original Mufasa shell. | ||
# You can now close the SSH connection to Mufasa without damaging your resource allocation request. | # You can now close the SSH connection to Mufasa without damaging your resource allocation request. | ||
# At the delivery time you specified in the [[ | # At the delivery time you specified in the [[User Jobs#salloc commands|<code>salloc</code> command]], connect to Mufasa with SSH. | ||
# Once you are in the Mufasa shell, reattach to the screen with command <pre style="color: lightgrey; background: black;">screen -r</pre> | # Once you are in the Mufasa shell, reattach to the screen with command <pre style="color: lightgrey; background: black;">screen -r</pre> | ||
# You are now back to the screen where you used <code>salloc</code>; as soon as SLURM provides to you with the resources you reserved, message "''salloc: Pending job allocation XXXX''" changes to the shell prompt. | # You are now back to the screen where you used <code>salloc</code>; as soon as SLURM provides to you with the resources you reserved, message "''salloc: Pending job allocation XXXX''" changes to the shell prompt. | ||
# You are now in the interactive shell session you booked with <code>salloc</code>. From here, you can run any programs you want, including <code>srun</code> and <code>sbatch</code>. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with <code>salloc</code>. | # You are now in the interactive shell session you booked with <code>salloc</code>. From here, you can run any programs you want, including <code>srun</code> and <code>sbatch</code>. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with <code>salloc</code>.<br>'''Important!''' Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! Therefore, if the job reaches the time limit, it gets '''forcibly terminated''' by SLURM. Termination depends exclusively on the time limit: so it occurs even if the end time for the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.) | ||
# Once the interactive shell session is not needed anymore, cancel it by exiting from the session with <pre style="color: lightgrey; background: black;">exit</pre> (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.) | # Once the interactive shell session is not needed anymore, cancel it by exiting from the session with <pre style="color: lightgrey; background: black;">exit</pre> (Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.) | ||
# You are now back to your screen. Destroy it by pressing '''ctrl + A''' followed by '''\''' (i.e., backslash) to get back to the Mufasa shell. | # You are now back to your screen. Destroy it by pressing '''ctrl + A''' followed by '''\''' (i.e., backslash) to get back to the Mufasa shell. | ||
== | == Cancelling a resource request made with <code>salloc</code> == | ||
To cancel a request for resources made as explained in [[User Jobs#How to use salloc|How to use <code>salloc</code>]], follow these steps: | To cancel a request for resources made as explained in [[User Jobs#How to use salloc|How to use <code>salloc</code>]], follow these steps: | ||
Line 669: | Line 669: | ||
# [[System#Accessing Mufasa|Connect to Mufasa with SSH]]. | # [[System#Accessing Mufasa|Connect to Mufasa with SSH]]. | ||
# Once you are in the Mufasa shell, reattach to the screen where you used command <code>salloc</code> with command <pre style="color: lightgrey; background: black;">screen -r</pre> | # Once you are in the Mufasa shell, reattach to the screen where you used command <code>salloc</code> with command <pre style="color: lightgrey; background: black;">screen -r</pre> | ||
# You should see the message "''salloc: Pending job allocation XXXX''". Now just press ''' | # You should see the message "''salloc: Pending job allocation XXXX''" (if the allocation is still pending) or "''salloc: job XXXX queued and waiting for resources''" (if the allocation is done and waiting for its start time). Now just press '''Ctrl + C'''. This communicates to SLURM your intention to cancel your request for resources. | ||
# SLURM will communicate the cancellation with message <pre style="color: lightgrey; background: black;">salloc: Job allocation XXXX has been revoked.</pre> | |||
# Destroy the screen by pressing '''ctrl + A''' followed by '''\''' (i.e., backslash) to get back to the Mufasa shell. | # Destroy the screen by pressing '''ctrl + A''' followed by '''\''' (i.e., backslash) to get back to the Mufasa shell. | ||
Revision as of 08:04, 12 April 2023
This page presents the features of Mufasa that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).
System resources subjected to limitations
The hardware resources of Mufasa are limited. For this reason, some of them are subjected to limitations, i.e. (these are SLURM's own terms):
- cpu
- the number of processor cores that a job can use
- mem
- the amount of RAM that a job can use
- gres
- the amount of generic resources that a job can use: in Mufasa, the only resources belonging to this set are the GPUs (the virtual GPUs defined by Nvidia MIG, not the physical GPUs)
These are some of the TRES (Trackable RESources) defined by SLURM. From SLURM's documentation: "A TRES is a resource that can be tracked for usage or used to enforce limits against."
SLURM provides jobs with access to resources only for a limited time: i.e., execution time is itself a limited resource.
When a resource is limited, a job cannot use arbitrary quantities of it. On the contrary, the job must specify how much of the resource it requests. Requests are done either by running the job on a partition for which a default amount of resources has been defined, or through the options of the srun command that executes the job via SLURM.
gres syntax
Whenever it is necessary to specify the quantity of gres, i.e. generic resources, a special syntax must be used. In Mufasa gres resources are GPUs, so this syntax applies to GPUs. The number and type of Mufasa's GPUs are described here.
The name of each GPU resource takes the form
Name:Type
where Name is gpu and Type takes the following values:
- 40gb
- for GPUs with 40 Gbytes of onboard RAM
- 20gb
- for GPUs with 20 Gbytes of onboard RAM
- 10gb
- for GPUs with 10 Gbytes of onboard RAM
So, for instance,
gpu:20gb
identifies the resource corresponding to GPUs with 20 GB of RAM. Of this resource Mufasa has a given number, of which a job can request to use some (or all).
When asking for a gres resource (e.g., in an srun command or an SBATCH directive of an execution script), the syntax required by SLURM is
<Name>:<Type>:<quantity>
where quantity is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type 20gb the syntax is
gpu:20gb:2
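The syntax above can be illustrated with a minimal Python sketch that assembles a request string from a GPU type and a quantity. The function name gres_request and the constant VALID_GPU_TYPES are hypothetical, introduced only for this example; they are not part of SLURM or of Mufasa's tooling.

```python
# Hypothetical helper, for illustration only (not part of SLURM or Mufasa).
VALID_GPU_TYPES = ("10gb", "20gb", "40gb")  # the Type values listed above


def gres_request(gpu_type, quantity):
    """Return a gres string of the form gpu:<Type>:<quantity>."""
    if gpu_type not in VALID_GPU_TYPES:
        raise ValueError(f"unknown GPU type: {gpu_type!r}")
    if not isinstance(quantity, int) or quantity < 1:
        raise ValueError("quantity must be a positive integer")
    return f"gpu:{gpu_type}:{quantity}"


print(gres_request("20gb", 2))  # gpu:20gb:2
```

The resulting string is what would follow the --gres= option of srun or of an SBATCH directive.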
SLURM's generic resources are defined in /etc/slurm/gres.conf. In order to make GPUs available to SLURM's gres management, Mufasa makes use of Nvidia's NVML library. For additional information see SLURM's documentation.
Looking for unused GPUs
GPUs are usually the most limited resource on Mufasa. So, if your job requires a GPU, the best way to get it executed quickly is to request a GPU that is not currently in use.
This command
sinfo -O Gres:100
provides a summary of all the Gres (i.e., GPU) resources possessed by Mufasa. It provides this output:
GRES gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)
To know which of the GPUs are currently in use, use command
sinfo -O GresUsed:100
which provides an output similar to this:
GRES_USED gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6)
By comparing the two lists (GRES and GRES_USED) above, you can see that at the moment:
- of the 2 40 GB GPUs, both are in use
- of the 3 20 GB GPUs, one is not in use
- of the 6 10 GB GPUs, 3 are not in use
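The comparison between the two lists can also be done programmatically. The following Python sketch takes the two output lines shown above as plain strings and computes how many GPUs of each type are free. The function names gpu_counts and free_gpus are hypothetical, and the parsing assumes that the entries always have the gpu:Type:count(...) shape seen in the sample output.

```python
import re

# Matches entries such as "gpu:20gb:3(S:0-1)" or "gpu:20gb:2(IDX:5,8)".
ENTRY = re.compile(r"gpu:(?P<type>[^:]+):(?P<count>\d+)")


def gpu_counts(sinfo_line):
    """Map each GPU type found in a GRES or GRES_USED line to its count."""
    return {m["type"]: int(m["count"]) for m in ENTRY.finditer(sinfo_line)}


def free_gpus(gres_line, gres_used_line):
    """Return, for each GPU type, the number of GPUs not currently in use."""
    total = gpu_counts(gres_line)
    used = gpu_counts(gres_used_line)
    return {gpu_type: n - used.get(gpu_type, 0) for gpu_type, n in total.items()}


gres = "GRES gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)"
gres_used = "GRES_USED gpu:40gb:2(IDX:0-1),gpu:20gb:2(IDX:5,8),gpu:10gb:3(IDX:3-4,6)"
print(free_gpus(gres, gres_used))  # {'40gb': 0, '20gb': 1, '10gb': 3}
```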
SLURM Partitions
Several execution queues for jobs have been defined on Mufasa. Such queues are called partitions in SLURM terminology. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command sinfo (link to SLURM docs) provides a list of available partitions. Its output is similar to this:
PARTITION   AVAIL  TIMELIMIT   NODES  STATE  NODELIST
debug*      up     20:00       1      mix    gn01
small       up     12:00:00    1      mix    gn01
normal      up     1-00:00:00  1      mix    gn01
longnormal  up     3-00:00:00  1      mix    gn01
gpu         up     1-00:00:00  1      mix    gn01
gpulong     up     3-00:00:00  1      mix    gn01
fat         up     3-00:00:00  1      mix    gn01
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside "debug" indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified. (On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to.)
The columns in the standard output of sinfo shown above correspond to the following information:
- PARTITION
- name of the partition
- AVAIL
- state/availability of the partition: see below
- TIMELIMIT
- maximum runtime of a job allowed by the partition
- NODES
- number of nodes available to jobs run on the partition: for Mufasa, this is always 1 since there is only 1 node in the computing cluster
- STATE
- state of the node (using these codes); typical values are mixed (shown as mix in the output above), meaning that some of the resources of the node are busy executing jobs while others are free, and allocated, meaning that all of the resources of the node are busy
- NODELIST
- list of nodes available to the partition: for Mufasa this field always contains gn01, since Mufasa is the only node in the computing cluster
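When comparing partitions it can help to turn TIMELIMIT values into a plain number of seconds. The snippet below is a sketch under stated assumptions: the function name timelimit_to_seconds is hypothetical, and it only handles the three layouts that appear in the output above (minutes:seconds, hours:minutes:seconds and days-hours:minutes:seconds).

```shell
# timelimit_to_seconds VALUE
# Hypothetical helper: converts "MM:SS", "HH:MM:SS" or "D-HH:MM:SS" to seconds.
# Any other layout is rejected with a non-zero exit status.
timelimit_to_seconds() {
  printf '%s\n' "$1" | awk '{
    days = 0; t = $0
    # Split off the optional "days-" prefix
    if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
    n = split(t, f, ":")
    if (n == 3)                   secs = f[1] * 3600 + f[2] * 60 + f[3]
    else if (n == 2 && days == 0) secs = f[1] * 60 + f[2]
    else                          exit 1
    print days * 86400 + secs
  }'
}

timelimit_to_seconds 20:00        # "debug" partition: prints 1200
timelimit_to_seconds 1-00:00:00   # "normal" partition: prints 86400
```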
One piece of information that the standard output of sinfo doesn't provide is whether there are partitions that can only be used by the root user of Mufasa. To know which partitions are root-only, you can use command
sinfo -o "%.10P %.4r"
Its output is
PARTITION   ROOT
debug*      no
small       no
normal      no
longnormal  no
gpu         no
gpulong     no
fat         no
and shows that on Mufasa no partitions are reserved for root.
As for hardware resources (such as CPUs, GPUs and RAM), the amounts of each resource available to Mufasa's partitions are set by SLURM's accounting system, and are not visible to sinfo. See Partition features for a description of these amounts.
Partition features
The output of sinfo (see above) provides a list of available partitions, but (except for time) it does not provide information about the amount of resources that a partition makes available to the user jobs which are run on it. The amount of resources is visible through command
sacctmgr list qos format=name%-10,maxwall,maxtres%-64
which provides an output similar to the following:
Name       MaxWall     MaxTRES
---------- ----------- ----------------------------------------------------------------
normal     1-00:00:00  cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G
small      12:00:00    cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G
longnormal 3-00:00:00  cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G
gpu        1-00:00:00  cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G
gpulong    3-00:00:00  cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G
fat        3-00:00:00  cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G
Its elements are the following (for more information, see SLURM's documentation):
- Name
- name of the partition
- MaxWall
- maximum wall clock duration of the jobs run on the partition (after which they are killed by SLURM), in format [days-]hours:minutes:seconds
- MaxTRES
- maximum amount of resources ("Trackable RESources") available to a job running on the partition, where
- cpu=K means that the maximum number of processor cores is K
- gres/Name:Type=K means that the maximum number of GPUs of class Name:Type (see gres syntax) is K
- mem=KG means that the maximum amount of system RAM is K GBytes
Note that there may be additional limits to the possibility to fully exploit the resources of a partition. For instance, there may be a cap on the maximum number of GPUs that can be used at the same time by a single job and/or a single user.
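To read a single limit out of a MaxTRES string, the comma-separated key=value pairs can be split with standard tools. This is a small sketch rather than an official utility: the function name tres_limit is hypothetical, and the sample string is the MaxTRES value of the "gpu" entry from the output above.

```shell
# tres_limit MAXTRES RESOURCE
# Hypothetical helper: prints the limit associated with RESOURCE in a
# comma-separated MaxTRES string, or nothing if the resource is not listed.
tres_limit() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

# MaxTRES of the "gpu" entry, copied from the output above
gpu_tres='cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G'
tres_limit "$gpu_tres" cpu             # prints 8
tres_limit "$gpu_tres" gres/gpu:20gb   # prints 2
tres_limit "$gpu_tres" mem             # prints 64G
```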
Partition availability
An important piece of information that sinfo provides (column "AVAIL") is the availability (also called state) of partitions. Possible partition states are:
- up
- The partition is available
- Running jobs will be completed
- Currently queued jobs will be executed as soon as resources allow
- drain
- The partition is in the process of becoming unavailable (down)
- Running jobs will be completed
- Queued jobs will be executed only when the partition becomes available again (up)
- down
- The partition is unavailable
- There are no running jobs
- Queued jobs will be executed only when the partition becomes available again (up)
When a partition passes from up to drain no harm is done to running jobs. When a partition passes from any other state to down, running jobs (if any) get killed.
A partition in state drain or down requires intervention by a Job Administrator to be restored to up. Jobs waiting for that partition are paused until the partition becomes available again.
Choosing the partition on which to run a job
When launching a job (as explained in Executing jobs on Mufasa) a user should select the partition that is most suitable for it according to the job's features. Launching a job on a partition avoids the need for the user to specify explicitly all of the resources that the job requires, relying instead (for unspecified resources) on the default amounts defined for the partition. Partition features explains how to find out how many of Mufasa's resources are associated to each partition.
Because selecting the right partition pre-defines the requirements of a job without the user having to specify them, partitions are very handy and help avoid mistakes. However, users can, if needed, change the resources requested by their jobs with respect to the default values associated to the chosen partition. Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. In any case, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job's requirements only for those resources that have an unsuitable default value.
Resource requests by the user launching a job can be both lower and higher than the default value of the partition for that resource. However, they cannot exceed the maximum value that the partition allows for requests of such resource, if set. If a user tries to run on a partition a job that requests a higher value of a resource than the partition‑specified maximum, the run command is refused.
Tips for partition choice
The larger the fraction of system resources that a job asks for, the heavier the job becomes for Mufasa's limited capabilities. Since SLURM prioritises lighter jobs over heavier ones (in order to maximise the number of completed jobs) it is a very bad idea for a user to ask for their job more resources than it actually needs: this, in fact, will have the effect of delaying (possibly for a long time) job execution. These are tips that you can use to guide partition choice for your job in order to get it executed quickly:
- use the least powerful partition that can support the job
- do not ask for more resources or time than needed
- prefer partitions without access to GPUs
- ask for GPUs that are currently not in use
User limitations on the use of resources
Mufasa is a shared machine, meaning that at any given time its resources subjected to limitations are split among all users who request them. This also means that there are limitations on the amount of resources that Mufasa can provide to a given user, whatever the amount of resources that the user requested.
Such limitations come from two sources.
The first source is the fact that each user job is associated to the SLURM partition on which it runs. So, each job can only access the specific subset of resources that are available to the partition.
The second source of limitations is applied by SLURM on a per-user basis. Mufasa is configured in such a way that:
- no more than 2 jobs per user can be running at the same time (note that, since each partition can execute only one job at any given time, the two jobs must make use of different partitions)
- if a user already has a running job, a second job from the same user is only put into execution if there are no requests from other users for the partition it is intended to be run on
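Before submitting a new job, it can be useful to check how many of your jobs are already running. The sketch below only covers the first rule above (the 2-job cap); the function name count_running is hypothetical, and the sample input stands in for the job states that squeue would print.

```shell
# count_running
# Hypothetical helper: reads job states (one per line) from standard input
# and prints how many of them are RUNNING.
count_running() {
  grep -c '^RUNNING$' || true   # grep exits non-zero when the count is 0
}

# Sample input; on Mufasa the states would presumably come from
#   squeue -h -u "$USER" -o '%T'
running=$(printf 'RUNNING\nPENDING\n' | count_running)

if [ "$running" -ge 2 ]; then
  echo "limit reached: a new job will wait in the queue"
else
  echo "running jobs: $running (limit is 2)"
fi
```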
Running jobs with SLURM: generalities
Note: these are general considerations. See Executing jobs on Mufasa for instructions about running your own processing jobs on Mufasa.
The commands that SLURM provides to run jobs are
srun [options] <command_to_be_run_via_SLURM>
and
sbatch [options] <command_to_be_run_via_SLURM>
(see SLURM documentation: srun, sbatch).
In both cases, <command_to_be_run_via_SLURM> can be any program or Linux shell script. By using srun or sbatch, the command or script specified by <command_to_be_run_via_SLURM> (including any programs launched by it) is added to SLURM's execution queues.
The main difference between srun and sbatch is that the first locks the shell from which it has been launched, so it is only really suitable for processes that use the console for interaction with their user. (You can, though, detach from that shell and come back later using screen.) sbatch, on the other hand, does not lock the shell and simply adds the job to the queue, but does not allow the user to interact with the process while it is running.
Additionally, with sbatch, <command_to_be_run_via_SLURM> can be an execution script, i.e. a special (and SLURM-specific) type of Linux shell script that includes SBATCH directives. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the [options] part of the sbatch command. This is handy because it allows the parameters to be written down in an execution script instead of being typed in the command line while launching a job, which greatly reduces the possibility of mistakes. Also, an execution script is easy to keep and reuse.
The [options] part of srun and sbatch commands is used to tell SLURM the conditions under which it has to execute the job; in particular, it is used to specify what system resources SLURM should reserve for the job.
A quick way to define the set of resources that a program will be provided with is to use SLURM partitions. This is done with option -p <partition_name>. This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign the job will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.
For instance, running
srun -p small ./my_program
makes SLURM run my_program on the partition named “small”. Running the program this way means that the resources associated to this partition will be available to it for use.
Running interactive jobs via SLURM
As explained, SLURM command srun is suitable for launching interactive user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a bash shell (i.e. a terminal session) with a command similar to
srun --pty /bin/bash
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)
exit
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and are not able to access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only access 2 CPUs). On the contrary, running programs with srun or sbatch ensures that they can access all the resources managed by SLURM.
GPU resources (if needed) must always be requested explicitly with parameter --gres=gpu:<10|20|40>gb:K, where K is an integer between 1 and the maximum number of GPUs of that type available to the partition (see gres syntax). For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command
srun --gres=gpu:10gb:1 --pty /bin/bash
and then run the interactive program from the newly opened shell.
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run /bin/bash on one of the available partitions. For instance, to run the shell on partition “small” the command is
srun -p small --pty /bin/bash
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as (SLURM ID xx) (where xx is the ID of the /bin/bash process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command
echo $SLURM_JOB_ID
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.
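The same check can be wrapped in a small function, for instance to be called from a script before starting heavy work. This is only a sketch: the function name shell_kind is hypothetical, and it relies solely on the SLURM_JOB_ID variable mentioned above.

```shell
# shell_kind
# Hypothetical helper: reports whether the current shell was started via SLURM,
# based on whether the SLURM_JOB_ID environment variable is set and non-empty.
shell_kind() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "SLURM shell (job ID ${SLURM_JOB_ID})"
  else
    echo "base shell"
  fi
}

shell_kind
```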
Other resources
The contents of this wiki are specifically tailored for users of Mufasa. They should include everything Mufasa users need to make good use of the machine. However, specific needs vary and advanced users may require advanced functionalities of SLURM that are not covered here.
There are a lot of resources on the internet dealing with the execution of jobs using SLURM. Usually these have been published for the benefit of the users of a specific High Performance Computing system, so there's no guarantee that whatever they suggest will work on Mufasa. If you feel the need to look for external resources, you may start with this one, which has been prepared by the same people who built Mufasa.
Executing jobs on Mufasa
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done.
Considering that all computation on Mufasa must occur within Docker containers, the jobs run by Mufasa users are always containers except for menial, non-computationally intensive jobs. This wiki includes directions about preparing Docker containers.
The process of launching a user job on Mufasa involves the following steps:
- Using SLURM to run a Docker container [for interactive and non-interactive user jobs]
- Launching a user job from within a Docker container [for interactive user jobs only]
Interactive and non-interactive user jobs
- Interactive user jobs
- are jobs that require interaction with the user while they are running, via a bash shell running within the Docker container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the Docker container is in execution.
- Non-interactive user jobs
- are the most common variety. The user prepares the Docker container in such a way that, when in execution, the container autonomously puts the user's jobs into execution. The user does not have any communication with the Docker container while it is in execution.
Both interactive and non-interactive user jobs can be run via a (quite complex) command directly issued from the terminal opened via SSH. To reduce the possibility of mistakes, it is usually preferable to define an execution script that takes care of launching the job.
Job output
The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of the Docker container.
As explained below, SLURM includes a mechanism to mount a part of Mufasa's own filesystem onto the container's filesystem: so when the job running within the container writes to this mounted part, it actually writes to Mufasa's filesystem. This means that when the Docker container ends its execution, its output files persist in Mufasa's filesystem (usually in a subdirectory of the user's own /home directory) and can be retrieved by the user at a later time.
The same mechanism can be used to allow user jobs running in a Docker container to read their input data from Mufasa's filesystem (usually a subdirectory of the user's own /home directory).
Using SLURM to run a Docker container
The first step to run a user job on Mufasa is to run the Docker container where the job will take place. A container is a “sandbox” containing the environment where the user's application operates. Parts of Mufasa's filesystem can be made visible (and writable, if they belong to the user's /home directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa's filesystem: for instance, to read data and write results. This wiki includes directions about preparing Docker containers.
Each user is in charge of preparing the Docker container(s) where the user's jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.
In order to run a Docker container via SLURM, a user must use a command similar to the following ones.
For interactive user jobs:
srun [-p <partition_name>] --container-image=<container_path.sqsh> [--job-name=<jobname>] [--no-container-entrypoint] --container-mounts=<mufasa_dir>:<docker_dir> [--gres=<gpu_resources>] [--mem=<mem_resources>] [--cpus-per-task=<cpu_amount>] [--time=<duration>] --pty <command_to_run_within_container>
For non-interactive user jobs:
srun [-p <partition_name>] --container-image=<container_path.sqsh> [--job-name=<jobname>] [--no-container-entrypoint] --container-mounts=<mufasa_dir>:<docker_dir> [--gres=<gpu_resources>] [--mem=<mem_resources>] [--cpus-per-task=<cpu_amount>] [--time=<duration>] [<command_to_run_within_container>]
The parts of the above commands within [square brackets] are optional.
Below, the elements of these commands are explained.
- -p <partition_name>
- specifies the SLURM partition on which the job will be run. If it is not specified, the default partition is used.
- Important! The chosen partition limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is allowed by the chosen partition.
- Important! If -p <partition_name> is used, options that specify how many resources to assign to the job (such as --mem=<mem_resources>, --cpus-per-task=<cpu_amount> or --time=<duration>) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception concerns option --gres=<gpu_resources>, which is always required (see below) if the job needs access to GPUs.
- --job-name=<jobname>
- specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with squeue. The default job name (i.e., the one assigned to the job when --job-name is not used) is the executable program's name.
- --container-image=<container_path.sqsh>
- specifies the container to be run
- --no-container-entrypoint
- specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is an element of a Docker container: a command that gets executed as soon as the container is in execution. Option --no-container-entrypoint is useful when, for some reason, the user does not want the entrypoint in the container to be run.
- --container-mounts=<mufasa_dir>:<docker_dir>
- specifies what parts of Mufasa's filesystem will be available within the container's filesystem, and where they will be mounted. This is necessary to let the container get input data from Mufasa and/or write output data to Mufasa. For instance, if <mufasa_dir>:<docker_dir> takes the value /home/mrossi:/data, this tells srun to mount Mufasa's directory /home/mrossi in position /data within the filesystem of the Docker container. When the Docker container reads or writes files in directory /data of its own (internal) filesystem, what actually happens is that files in /home/mrossi get manipulated instead. In this example, /home/mrossi is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.
- --gres=<gpu_resources>
- specifies what GPUs to assign to the container. <gpu_resources> is a comma-delimited list where each element has the form gpu:<Type>:<amount>, where <Type> is one of the types of GPU available on Mufasa (see gres syntax) and <amount> is an integer between 1 and the number of GPUs of such type available to the partition. For instance, <gpu_resources> may be gpu:40gb:1,gpu:10gb:3, corresponding to asking for one "full" GPU and 3 "small" GPUs.
- Important! The --gres parameter is mandatory if the job needs to use the system's GPUs. Unlike other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.
- --mem=<mem_resources>
- specifies the amount of RAM to assign to the container; for instance, <mem_resources> may be 200G
- --cpus-per-task=<cpu_amount>
- specifies how many CPUs to assign to the container; for instance, <cpu_amount> may be 2
- --time=<duration>
- specifies the maximum time allowed to the job to run, in the format days-hours:minutes:seconds, where days is optional; for instance, <duration> may be 72:00:00
- --pty
- specifies that the job will be interactive (this is necessary when <command_to_run_within_container> is /bin/bash: see Running interactive jobs via SLURM)
- <command_to_run_within_container>
- the command that will be put into execution within the Docker container as soon as the container is active. Note that this is mandatory for interactive user jobs and optional for non-interactive user jobs. If specified, this command will be executed in the environment created by Docker.
For interactive user jobs, a typical value for <command_to_run_within_container> is /bin/bash. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for <command_to_run_within_container> is python, which launches an interactive Python session from which the user will then run their job.
For non-interactive user jobs, using [command_to_run_within_container] is one of the two available methods to run the program(s) that the user wants to be executed within the Docker container. The other available method to run the user job(s) is to use the entrypoint of the container. The use of [command_to_run_within_container] is therefore optional.
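As an illustration, the snippet below assembles an srun command for an interactive container job and prints it for inspection. It is a sketch, not a recommended configuration: the image path is a made-up placeholder, the mount reuses the /home/mrossi:/data example from above, and the resource amounts are arbitrary values chosen to stay within the limits listed for partition "gpu" in Partition features.

```shell
# All values below are illustrative placeholders, not real paths on Mufasa
partition=gpu
image=/home/mrossi/my_container.sqsh   # hypothetical container image
mounts=/home/mrossi:/data              # <mufasa_dir>:<docker_dir>

# --pty followed by /bin/bash makes this an interactive job
cmd="srun -p $partition --container-image=$image --container-mounts=$mounts --gres=gpu:10gb:1 --mem=32G --cpus-per-task=4 --time=12:00:00 --pty /bin/bash"

# Print the command so that it can be checked before being run on Mufasa
echo "$cmd"
```

Building the command in a variable first makes it easy to review every option before the job is submitted.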
Using execution scripts to run jobs
The srun commands described in Using SLURM to run a Docker container are very complex, and it's easy to forget some option or make mistakes while using them. For non-interactive jobs, there is a solution to this problem.
When the user job is non-interactive, in fact, the srun command can be substituted with a much simpler sbatch command. As already explained, sbatch can make use of an execution script to specify all the parts of the command to be run via SLURM. So the command to run the Docker container where the user job will take place becomes
sbatch <execution_script>
An execution script is a special type of Linux script that includes SBATCH directives. SBATCH directives are used to specify the values of the parameters that are otherwise set in the [options] part of an srun command.
Note on Linux shell scripts
A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:
- have the “executable” flag set
- have #!/bin/bash as its very first line
Usually, a Linux shell script is given a name ending in .sh, such as my_execution_script.sh, but this is not mandatory.
Within any shell script, lines preceded by # are comments (with the notable exception of the initial #!/bin/bash line). Use of blank lines as spacers is allowed.
An execution script is a Linux shell script composed of two parts:
- a preamble, composed of directives with which the user specifies the values to be given to parameters, each preceded by the keyword SBATCH
- [optionally] one or more srun commands that launch jobs with SLURM using the parameter values specified in the preamble
The srun commands are optional because jobs can also be launched by the Docker container's own entrypoint.
Below is an execution script template to be copied and pasted into your own execution script text file.
The template includes all the options already described above, plus a few additional useful ones (for instance, those that enable SLURM to send email messages to the user in correspondence to events in the lifecycle of their job). Information about all the possible options can be found in [SLURM's own documentation].
All the SBATCH directives in the script template below use placeholder values. Keep only the directives that you need and replace the <...> placeholders with actual values; to disable a directive, comment it out by adding a second "#" in front of it (SLURM only interprets lines that start with "#SBATCH"). In the template, lines starting with "# " are explanatory comments.
#!/bin/bash
#----------------start of preamble----------------
#SBATCH -p <partition_name>
#SBATCH --container-image=<container_path.sqsh>
#SBATCH --job-name=<name>
#SBATCH --no-container-entrypoint
#SBATCH --container-mounts=<mufasa_dir>:<docker_dir>
#SBATCH --gres=<gpu_resources>
#SBATCH --mem=<mem_resources>
#SBATCH --cpus-per-task=<cpu_amount>
#SBATCH --time=<d-hh:mm:ss>
# The following directives (not described so far) activate SLURM's email notifications:
# the first specifies where they are sent; the following 3 set up notifications for start/end/failure of job execution
#SBATCH --mail-user <email_address>
#SBATCH --mail-type BEGIN
#SBATCH --mail-type END
#SBATCH --mail-type FAIL
#----------------end of preamble----------------
# srun <command_to_run_within_container>
# to run the user job, either uncomment (and personalise) the above srun command or use the entrypoint of the Docker container
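For illustration, here is how a filled-in execution script might be created and made executable. This is a sketch under stated assumptions: the image path, the job name and the program /data/my_program.py are all made-up placeholders, and the resource values were simply chosen to fit within partition "small".

```shell
# Write a minimal execution script to a file (all values are placeholders)
cat > my_execution_script.sh <<'EOF'
#!/bin/bash
#SBATCH -p small
#SBATCH --job-name=demo_job
#SBATCH --container-image=/home/mrossi/my_container.sqsh
#SBATCH --container-mounts=/home/mrossi:/data
#SBATCH --time=0-02:00:00
srun python3 /data/my_program.py
EOF

# A shell script must have the "executable" flag set
chmod +x my_execution_script.sh

echo "Submit with: sbatch my_execution_script.sh"
```

The script is written to a file rather than submitted directly, so it can be reviewed (and reused later) before running sbatch on it.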
Nvidia Pyxis
Some of the options described above are specifically dedicated to Docker containers: these are provided by the Nvidia Pyxis package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them.
More specifically, options --container-image, --no-container-entrypoint, --container-mounts are provided to srun by Pyxis.
See the Nvidia Pyxis github page for additional information about the options that it provides to srun.
Launching a user job from within a Docker container
For interactive user jobs, once the Docker container (run as explained here) is up and running, the user is dropped to the interactive environment specified by <command_to_run_within_container>. This interactive environment can be, for instance, a bash shell or an interactive Python console. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).
Please note that the interactive environment of the Docker container does not have any relation with Mufasa's system. The only contact point is the part of Mufasa's filesystem that has been grafted to the container's filesystem via the --container-mounts option of srun. In particular, none of the software packages (such as the Nvidia drivers) installed on Mufasa are available in the container, unless they have been installed in it at preparation time (as explained in Docker), or manually after the container is put in execution.
Also note that, once a Docker container launched with srun is in execution, its own bash shell is completely indistinguishable from the bash shell of Mufasa where the srun command that put the container in execution was issued. The two shells share the same terminal window. The only clue to the fact that you now are, in fact, in the container's shell may be the command prompt, which should now show your location as /opt.
Detaching from a running job with screen
A consequence of the way srun operates is that if you launch an interactive user job, the shell where the command is running must remain open: if it closes, the job terminates. That shell runs in the terminal of your own PC where the SSH connection to Mufasa exists.
If you do not plan to keep the SSH connection to Mufasa open (for instance because you have to turn off or suspend your PC), there is a way to keep your interactive job alive. Namely, you should use command srun inside a screen session (often simply called "a screen"), then detach from the screen (here is one of many tutorials about screen available online).
Once you have detached from the screen session, you can close the SSH connection to Mufasa without damage. When you need to reach your (still running) job again, you can open a new SSH connection to Mufasa and then reattach to the screen.
A use case for screen is writing your program in such a way that it prints progress messages as it goes on with its work. Then, you can check its progress by periodically reconnecting to the screen where the program is running and reading the messages it printed.
Basic usage of screen is explained below.
Creating a screen session, running a job in it, detaching from it
- Connect to Mufasa with SSH
- From the Mufasa shell, run screen
- In the screen session ("screen") thus created (it has the look of an empty shell), launch your job with srun
- Detach from the screen by pressing ctrl + A followed by D: you will come back to the original Mufasa shell, while your process will go on running in the screen
- You can now close the SSH connection to Mufasa without damaging your running job
Reattaching to an active screen session
- Connect to Mufasa with SSH
- In the Mufasa shell, run screen -r
- You are now back to the screen where you launched your job
Closing (i.e. destroying) a screen session
When you do not need a screen session anymore:
- reattach to the screen as explained above
- destroy the screen by pressing ctrl + A followed by \ (i.e., backslash)
Of course, any program running within the screen gets terminated when the screen is destroyed.
Using salloc to reserve resources
What is salloc?
salloc is a SLURM command that allows a user to reserve a set of resources (e.g., a 40 GB GPU) for a given time in the future.
The typical use of salloc is to "book" an interactive session where the user enjoys complete control of a set of resources. The resources that are part of this set are chosen by the user. Within the "booked" session, any job run by the user that relies on the reserved resources is immediately put into execution by SLURM.
More precisely:
- the user, using salloc, specifies what resources they need and the time when they will need them;
- when the delivery time comes, SLURM creates an interactive shell session for the user;
- within such session, the user can use srun and sbatch to run programs, enjoying full (i.e. not shared with anyone else) and instantaneous access to the resources.
Resource reservation using salloc is only possible if the request is made in advance of the delivery time. The higher the demand for the resources that the user wants to reserve, the earlier the request should be made to ensure that SLURM is able to fulfill it.
When a user makes a request for resources with salloc, the request (called an allocation) gets added to SLURM's job queue for the requested partition as a job in pending (PD) state (job states are described here). Indeed, resource allocation is the first part of SLURM's process of executing a user job, while the second part is running the program and letting it use the allocated resources. Using salloc actually corresponds to having SLURM perform the first part of the process (resource allocation) while leaving the second part (running programs) to the user.
Until the delivery time specified by the user comes, the allocation remains in state PD, and other jobs requesting the same resources, even if submitted later, are executed. While the request waits for the delivery time, however, it accumulates a priority that increases over time. The longer the allocation stays in the PD state, the stronger this accumulation of priority: so, by requesting resources with salloc well in advance of the delivery time, users can ensure that the resources they need will be ready for them at the requested delivery time, even if these resources are highly contended.
salloc
commands
salloc
commands use a similar syntax to srun
commands. In particular, salloc
lets a user specify what resources they need and -importantly- a delivery time for the requested resources (delivery time can also be specified with srun
, but in that case it is not very useful).
The typical salloc command has this form:
salloc [-p <partition_name>] [--job-name=<jobname>] [--gres=<gpu_resources>] [--mem=<mem_resources>] [--cpus-per-task=<cpu_amount>] --time=<duration> --begin=<time>
The parts of the above command within [square brackets] are optional.
Below, the elements of the command are explained.
- ‑p <partition_name>
- specifies the SLURM partition on which the job will be run. If it is not specified, the default partition is used.
- Important! The chosen partition limits the resources that can be requested, since it is not allowed to request resources (type or quantity) that exceed what is allowed by the chosen partition.
- Important! If -p <partition_name> is used, options that specify how many resources to assign to the job (such as --mem=<mem_resources>, --cpus-per-task=<cpu_amount> or --time=<duration>) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception concerns option --gres=<gpu_resources>, which is always required (see below) if the job needs access to GPUs.
- --job-name=<jobname>
- Specifies a name for the job corresponding to the resource allocation. The specified name will appear along with the JOBID number when querying running jobs on the system with squeue. The default job name (i.e., the one assigned to the job when --job-name is not used) is "interact".
- ‑‑gres=<gpu_resources>
- specifies what GPUs are requested. gpu_resources is a comma-delimited list where each element has the form gpu:<Type>:<amount>, where <Type> is one of the types of GPU available on Mufasa (see gres syntax) and <amount> is an integer between 1 and the number of GPUs of such type available to the partition. For instance, <gpu_resources> may be gpu:40gb:1,gpu:10gb:3.
- Important! The --gres parameter is mandatory if the job needs to use the system's GPUs. Unlike other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.
- ‑‑mem=<mem_resources>
- specifies the amount of RAM requested; for instance, <mem_resources> may be 200G
- ‑‑cpus-per-task=<cpu_amount>
- specifies how many CPUs are requested; for instance, <cpu_amount> may be 2
- ‑‑time=<duration>
- specifies the maximum time the job is allowed to run, in the format days-hours:minutes:seconds, where days is optional; for instance, <duration> may be 72:00:00. While the interactive session associated with the allocation is active, the user can decide to cancel the allocation at any time just by closing the session (e.g., with command exit for bash).
- --begin=<time>
- specifies the delivery time of the resources reserved with
salloc
, according to the syntax described below. The delivery time must be a future time.
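As an illustration of how the options above combine, the following sketch assembles a complete salloc command and prints it instead of running it. All the values (partition, job name, amounts of resources, times) are hypothetical and must be adapted to what the chosen partition actually allows; the value passed to --begin uses the syntax explained in the next section.

```shell
# Hypothetical values: adjust partition, resources and times to your needs.
partition="gpu"
job_name="my_booking"
gres="gpu:40gb:1"
mem="64G"
cpus=4
duration="04:00:00"
begin="now+1days"

# Assemble the command as a string so that it can be checked before use.
salloc_cmd="salloc -p $partition --job-name=$job_name --gres=$gres --mem=$mem --cpus-per-task=$cpus --time=$duration --begin=$begin"

# Print the command instead of running it; paste it into a Mufasa shell
# to actually submit the request.
echo "$salloc_cmd"
```

Printing the command first makes it easy to verify every option before the request enters the SLURM queue.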
Syntax of parameter --begin
If the allocation is for the current day, you can specify <time> as hours and minutes in the form
HH:MM
If you want to specify a time of a different day, the form for <time> is
YYYY-MM-DDTHH:MM:SS
It is also possible to specify <time> relative to the current time, as
now+Kminutes
now+Khours
now+Kdays
where K is a (positive) integer.
Examples:
--begin=16:00
--begin=now+1hours
--begin=now+1days
--begin=2030-01-20T12:34:00
Note that Mufasa's time zone is GMT, so <time> must be expressed in GMT as well. If you want to know Mufasa's current time, use command
date
It provides an output similar to the following:
Thu Nov 10 16:43:30 UTC 2022
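Since delivery times must be expressed in Mufasa's time zone, it can be convenient to let date compute them. The following is a minimal sketch, assuming GNU date (the -d option used here is not available in every implementation of date); the two-hour offset is just an example.

```shell
# Current time in UTC, in the same format accepted by --begin.
now_utc=$(date -u +%Y-%m-%dT%H:%M:%S)

# Time two hours from now, also in UTC (the -d option is specific to GNU date).
begin_utc=$(date -u -d "+2 hours" +%Y-%m-%dT%H:%M:%S)

echo "Current time (UTC):        $now_utc"
echo "Option to pass to salloc:  --begin=$begin_utc"
```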
How to use salloc
In the typical scenario, the user of salloc
will make use of screen. Command screen
creates a shell session (called "a screen") that it is possible to abandon without closing it ("detaching from the screen"). It is then possible to reach again the screen at a later time ("reattaching to the screen"). This means that a user can create a screen, run salloc
within it to create an allocation for time X, detach from the screen and reattach to it just before time X to use the reserved resources from the interactive session created by salloc
.
More precisely, the operations needed to do this are the following:
- Connect to Mufasa with SSH.
- From the Mufasa shell, run
screen
- In the screen session ("screen") thus created run the salloc command, specifying via its options the resources you need and the time at which you want them delivered.
- SLURM will respond with a message similar to
salloc: Pending job allocation XXXX
- Detach from the screen by pressing ctrl + A followed by D: you will come back to the original Mufasa shell.
- You can now close the SSH connection to Mufasa without damaging your resource allocation request.
- At the delivery time you specified in the salloc command, connect to Mufasa with SSH.
- Once you are in the Mufasa shell, reattach to the screen with command
screen -r
- You are now back to the screen where you used salloc; as soon as SLURM provides you with the resources you reserved, message "salloc: Pending job allocation XXXX" changes to the shell prompt.
- You are now in the interactive shell session you booked with salloc. From here, you can run any programs you want, including srun and sbatch. For the whole duration of the allocation, your programs have unrestricted use of all the resources you reserved with salloc.
Important! Any job run within the shell session is subject to the time limit (i.e., maximum duration) imposed by the partition it is running on! Therefore, if the job reaches the time limit, it gets forcibly terminated by SLURM. Termination depends exclusively on the time limit: so it occurs even if the end time for the allocation has not been reached yet. (Of course, the job also gets terminated if the allocation ends.)
- Once the interactive shell session is not needed anymore, cancel it by exiting from the session with
exit
(Note that if you get to the end of the time period you specified in your request without closing the shell session, SLURM does it for you, killing any programs still running.)
- You are now back to your screen. Destroy it by pressing ctrl + A followed by \ (i.e., backslash) to get back to the Mufasa shell.
Cancelling a resource request made with salloc
To cancel a request for resources made as explained in How to use salloc
, follow these steps:
- Connect to Mufasa with SSH.
- Once you are in the Mufasa shell, reattach to the screen where you used command salloc with command screen -r
- You should see the message "salloc: Pending job allocation XXXX" (if the allocation is still pending) or "salloc: job XXXX queued and waiting for resources" (if the allocation is done and waiting for its start time). Now just press Ctrl + C. This communicates to SLURM your intention to cancel your request for resources.
- SLURM will communicate the cancellation with message
salloc: Job allocation XXXX has been revoked.
- Destroy the screen by pressing ctrl + A followed by \ (i.e., backslash) to get back to the Mufasa shell.
Automatic job caching
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical and therefore relatively slow) HDDs where /home partitions reside, substituting such accesses with accesses to (solid-state and therefore much faster) SSDs.
Each time a job is run via SLURM, this is what happens automatically:
- Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user's own /home) to a cache space located on system SSDs
- Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files
- The executables create their output files in the cache space
- When the user jobs end, Mufasa copies the output files from the cache space back to the user's own /home
The whole process is completely transparent to the user. The user simply prepares the executable (or the execution script) in a subdirectory of their /home
directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of /home
, exactly as if the execution actually occurred there.
Important! The caching mechanism requires that during job execution the user does not modify the contents of the /home
subdirectory where executable and data were at execution time. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.
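The reason for this warning can be seen with a toy simulation of the copy-in / run / copy-back sequence. This is only a sketch: the two temporary directories stand in for the /home subdirectory and for the SSD cache, the file names are hypothetical, and the way Mufasa really performs the copies (including which files are copied back) is not shown here and may differ.

```shell
# Toy illustration only: temporary directories stand in for a /home
# subdirectory and for the SSD cache. Not Mufasa's actual mechanism.
home_dir=$(mktemp -d)
cache_dir=$(mktemp -d)

echo "original input" > "$home_dir/input.txt"

# 1. Job start: files are copied from /home to the cache.
cp -R "$home_dir/." "$cache_dir/"

# 2. The job runs on the cached copies and writes its output there.
tr 'a-z' 'A-Z' < "$cache_dir/input.txt" > "$cache_dir/output.txt"

# 3. Meanwhile, the user edits a file in the /home subdirectory.
echo "edited during the job" > "$home_dir/input.txt"

# 4. Job end: the cache contents are copied back, overwriting the edit.
cp -R "$cache_dir/." "$home_dir/"

final_input=$(cat "$home_dir/input.txt")
final_output=$(cat "$home_dir/output.txt")
rm -rf "$home_dir" "$cache_dir"

echo "input.txt after the job:  $final_input"
echo "output.txt after the job: $final_output"
```

In this simulation the edit made in step 3 is lost, because the copy-back in step 4 restores the version of the file that was cached at job start.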
Monitoring and managing jobs
SLURM provides Job Users with tools to inspect and manage jobs. While a Job User is able to see all users' jobs, they are only allowed to interact with their own.
The main commands used to interact with jobs are squeue
to inspect the scheduling queues and scancel
to terminate queued or running jobs.
Inspecting jobs with squeue
Running command
squeue
provides an output similar to the following:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  520       fat     bash acasella  R 2-04:10:25      1 gn01
  523       fat     bash amarzull  R    1:30:35      1 gn01
  522       gpu     bash    clena  R   20:51:16      1 gn01
This output comprises the following information:
- JOBID
- Numerical identifier of the job assigned by SLURM
- This identifier is used to intervene on the job, for instance with
scancel
- PARTITION
- the partition that the job is run on
- NAME
- the name assigned to the job; can be personalised using the
--job-name
option
- USER
- username of the user who launched the job
- ST
- job state (see Job state for further information)
- TIME
- time that has passed since the beginning of job execution
- NODES
- number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)
- NODELIST (REASON)
- name of the nodes where the job is being executed: for Mufasa it is always
gn01
, which is the name of the node corresponding to Mufasa.
To limit the output of squeue
to the jobs owned by user <username>
, it can be used like this:
squeue -u <username>
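The columns described above can also be used to filter the listing by fields other than the user. The sketch below applies awk to sample text copied from the squeue output shown earlier, so that it can be tried anywhere; on Mufasa the same awk filter can read directly from squeue. It assumes that job names contain no spaces, since awk splits columns on whitespace.

```shell
# Sample text copied from the squeue output shown above; on Mufasa the same
# filter can be fed by `squeue` instead of by this variable.
sample_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
520 fat bash acasella R 2-04:10:25 1 gn01
523 fat bash amarzull R 1:30:35 1 gn01
522 gpu bash clena R 20:51:16 1 gn01'

# Skip the header (NR > 1) and keep the rows whose PARTITION column is "fat",
# printing job ID, user and elapsed time.
fat_jobs=$(printf '%s\n' "$sample_output" | awk 'NR > 1 && $2 == "fat" { print $1, $4, $6 }')

echo "$fat_jobs"
```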
Interpreting Job state as provided by squeue
Jobs typically pass through several states in the course of their execution. Job state is shown in column "ST" of the output of squeue
as an abbreviated code (e.g., "R" for RUNNING).
The most relevant codes and states are the following:
- PD PENDING
- Job is awaiting resource allocation.
- R RUNNING
- Job currently has an allocation.
- S SUSPENDED
- Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
- CG COMPLETING
- Job is in the process of completing. Some processes on some nodes may still be active.
- CD COMPLETED
- Job has terminated all processes on all nodes with an exit code of zero.
Beyond these, there are other (less frequent) job states. The SLURM doc page for squeue
provides a complete list of them.
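When processing the output of squeue in a script, the abbreviated codes can be translated back into state names with a small helper. The function below is a hypothetical convenience (it is not part of SLURM) and only covers the states listed above.

```shell
# Translate the abbreviated code from the ST column into the state name.
# Only the states listed above are covered; anything else is reported as such.
job_state_name() {
    case "$1" in
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        CG) echo "COMPLETING" ;;
        CD) echo "COMPLETED" ;;
        *)  echo "other (see the squeue documentation)" ;;
    esac
}

job_state_name PD
job_state_name R
```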
Knowing when jobs are expected to end or start
If you are interested in understanding when jobs are expected to start or end, use command
squeue -o "%5i %8u %10P %.2t |%19S |%.11L|"
which provides an output similar to the following:
JOBID USER     PARTITION  ST |START_TIME          |  TIME_LEFT|
5307  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|
5308  thuynh   fat        PD |2022-11-11T17:55:54 | 3-00:00:00|
5296  cziyang  fat         R |2022-11-08T16:58:03 | 1-00:48:14|
5306  thuynh   fat         R |2022-11-10T08:13:30 | 2-16:03:41|
5297  gnannini fat         R |2022-11-08T17:55:54 | 1-01:46:05|
5336  ssaitta  gpu         R |2022-11-10T08:13:00 |    6:03:11|
5358  dmilesi  gpulong     R |2022-11-10T15:11:32 | 2-23:01:43|
5338  cziyang  gpulong     R |2022-11-10T09:45:01 | 1-17:35:12|
- For running jobs (state R)
- column "START_TIME" tells you when the job started its execution
- column "TIME_LEFT" tells you how much remains of the running time requested by the job
- For pending jobs (state PD)
- column "START_TIME" tells you when the job is expected to start its execution
- column "TIME_LEFT" tells you how much running time has been requested by the job
Important! Start and end times are forecasts based on the features of current jobs in the queues, and may change if running jobs end prematurely and/or if new jobs with higher priority are added to the queues. So these times should never be considered as certain.
If you simply want to know when pending jobs (state PD
) are expected to begin execution, use
squeue --start
which lists pending jobs in order of increasing START_TIME (the job on top is the one which will be run first). For each pending job the command provides an output similar to the example below:
JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES NODELIST(REASON)
 5090       fat training   thuynh PD 2022-10-27T09:28:01      1 (null)     (Resources)
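To compare or sum the durations shown in the TIME_LEFT column, it can help to convert them to seconds. The helper below is a hypothetical sketch (not a SLURM tool) that handles only the forms appearing in the outputs above: days-hours:minutes:seconds, hours:minutes:seconds and minutes:seconds.

```shell
# Convert a duration as printed by squeue ([days-]hours:minutes:seconds, or
# minutes:seconds) into a number of seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        # An optional "days-" prefix comes before the clock fields.
        dash = index(rest, "-")
        if (dash > 0) {
            days = substr(rest, 1, dash - 1)
            rest = substr(rest, dash + 1)
        }
        n = split(rest, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             { print "unsupported format" > "/dev/stderr"; exit 1 }
        print days * 86400 + secs
    }'
}

slurm_time_to_seconds "3-00:00:00"
slurm_time_to_seconds "6:03:11"
```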
Getting detailed information about a job
If needed, complete information about a job (either pending or running) can be obtained using command
scontrol show job <JOBID>
where <JOBID>
is the number from the first column of the output of squeue
. The output of this command is similar to the following:
JobId=936 JobName=bash
   UserId=acasella(1001) GroupId=acasella(1001) MCS_label=N/A
   Priority=7885 Nice=0 Account=research QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=03:21:59 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2022-02-08T11:57:24 EligibleTime=2022-02-08T11:57:24
   AccrueTime=Unknown
   StartTime=2022-02-08T11:57:24 EndTime=2022-02-11T11:57:24 Deadline=N/A
   PreemptEligibleTime=2022-02-08T11:57:24 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-02-08T11:57:24 Scheduler=Main
   Partition=fat AllocNode:Sid=rk018445:4034
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gn01
   BatchHost=gn01
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=128G,node=1,billing=8,gres/gpu:40gb=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=128G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/acasella
   Power=
   TresPerNode=gres:gpu:40gb:1
In particular, the line beginning with "StartTime=" provides expected times for the start and end of job execution. As explained in Knowing when jobs are expected to end or start, start time is only a prediction and subject to change.
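Since the output of scontrol is made of key=value pairs, single fields can be extracted with standard text tools. The sketch below works on an abridged sample copied from the output above, so that it can be tried anywhere; on Mufasa the same pipeline can be fed by scontrol show job <JOBID>.

```shell
# Abridged sample copied from the scontrol output shown above.
sample='JobState=RUNNING Reason=None Dependency=(null)
RunTime=03:21:59 TimeLimit=3-00:00:00 TimeMin=N/A
StartTime=2022-02-08T11:57:24 EndTime=2022-02-11T11:57:24 Deadline=N/A'

# Put each key=value pair on its own line, then keep the fields of interest.
times=$(printf '%s\n' "$sample" | tr ' ' '\n' | grep -E '^(JobState|StartTime|EndTime)=')

echo "$times"
```

Anchoring the pattern at the start of the line avoids matching fields whose names merely end with the same word.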
Canceling a job with scancel
It is possible to cancel a job using command scancel
, either while it is waiting for execution or when it is in execution (in this case you can choose what system signal to send the process in order to terminate it). The following are some examples of use of scancel
adapted from SLURM's documentation.
scancel <JOBID>
removes queued job <JOBID>
from the execution queue.
scancel --signal=TERM <JOBID>
terminates execution of job <JOBID>
with signal SIGTERM (request to stop).
scancel --signal=KILL <JOBID>
terminates execution of job <JOBID>
with signal SIGKILL (force stop).
scancel --state=PENDING --user=<username> --partition=<partition_name>
cancels all pending jobs belonging to user <username>
in partition <partition_name>
.
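Because cancelling a job cannot be undone, it may be prudent to check the command before running it. The wrapper below is a hypothetical sketch (not part of SLURM): it verifies that its argument looks like a numeric JOBID and then prints the scancel command rather than executing it. The job number used is only an example.

```shell
# Print (rather than run) the scancel command for a job, after checking that
# the argument looks like a numeric JOBID. Remove "echo" to really cancel.
cancel_job() {
    case "$1" in
        ''|*[!0-9]*) echo "not a valid JOBID: $1" >&2; return 1 ;;
    esac
    echo scancel --signal=TERM "$1"
}

cancel_job 5307
```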
Knowing what jobs you ran today
Command
sacct -X
provides a list of all jobs run today by your user.