User Jobs


This page presents the features of Mufasa that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).

Job Users are by necessity SLURM users (see The SLURM job scheduling system) so you may also want to read SLURM's own Quick Start User Guide.

System resources subject to usage limitations

The hardware resources of Mufasa are limited. For this reason, some of them are subject to usage limits. These resources are (in SLURM's own terms):

cpu
the number of processor cores that a job can use
mem
the amount of RAM that a job can use
gres
the amount of generic resources that a job can use: in Mufasa, the only resources belonging to this set are the GPUs

These are some of the TRES (Trackable RESources) defined by SLURM. From SLURM's documentation: "A TRES is a resource that can be tracked for usage or used to enforce limits against."

SLURM provides jobs with access to resources only for a limited time: i.e., execution time is itself a limited resource.

When a resource is limited, a job cannot use an arbitrary quantity of it: the job must specify how much of the resource it requests. Requests are made either by running the job on a partition for which default amounts of resources have been defined, or through the options of the srun command that executes the job via SLURM.
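For instance, both of the following are valid ways to request resources (./my_program is a placeholder for an actual executable):

# rely on the default resource amounts defined for the "small" partition
srun -p small ./my_program

# request specific amounts explicitly via srun options
srun --cpus-per-task 2 --mem=16G --time=12:00:00 ./my_program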

gres syntax

Whenever it is necessary to specify the quantity of gres, i.e. generic resources, a special syntax must be used. In Mufasa, gres resources are GPUs, so this syntax applies to GPUs. The number and types of Mufasa's GPUs are described here.

The name of each GPU resource takes the form

Name:Type

where Name is gpu and Type takes the following values:

  • 40gb for GPUs with 40 Gbytes of onboard RAM
  • 20gb for GPUs with 20 Gbytes of onboard RAM
  • 10gb for GPUs with 10 Gbytes of onboard RAM

So, for instance,

gpu:20gb

identifies the resource corresponding to GPUs with 20 GB of RAM. Mufasa has a given number of units of this resource, and a job can request to use some (or all) of them.

When asking for a gres resource (e.g., in an srun command or an SBATCH directive of an execution script), the syntax required by SLURM is

<Name>:<Type>:<quantity>

where quantity is an integer value specifying how many items of the resource are requested. So, for instance, to ask for 2 GPUs of type 20gb the syntax is

gpu:20gb:2

SLURM's generic resources are defined in /etc/slurm/gres.conf. In order to make GPUs available to SLURM's gres management, Mufasa makes use of Nvidia's NVML library. For additional information see SLURM's documentation.
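For instance, a hypothetical request for two GPUs of type 20gb in an srun command (the --gres option is described in detail below; ./my_gpu_program is a placeholder executable) looks like this:

srun --gres=gpu:20gb:2 ./my_gpu_program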

SLURM Partitions

Several execution queues for jobs have been defined on Mufasa. Such queues are called partitions in SLURM terminology. Each partition has features (in terms of resources available to the jobs on that queue) that make it suitable for a certain category of jobs. The SLURM command

sinfo

(link to SLURM docs) provides a list of available partitions. Its output is similar to this:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug         up   infinite      1    mix gn01
small*        up   12:00:00      1    mix gn01
normal        up 1-00:00:00      1    mix gn01
longnormal    up 3-00:00:00      1    mix gn01
gpu           up 1-00:00:00      1    mix gn01
gpulong       up 3-00:00:00      1    mix gn01
fat           up 3-00:00:00      1    mix gn01

In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside "small" indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.

On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to.

One piece of information that the standard output of sinfo does not provide is whether there are partitions that can only be used by the root user of Mufasa. In the example above, partition "debug" is for root users only. To find out which partitions are root-only, you can use the command

sinfo -o "%.10P %.5a %.11l %.6D %.6t %.8N %.4r"

Its output is

 PARTITION AVAIL   TIMELIMIT  NODES  STATE NODELIST ROOT
     debug    up    infinite      1    mix     gn01  yes
    small*    up    12:00:00      1    mix     gn01   no
    normal    up  1-00:00:00      1    mix     gn01   no
longnormal    up  3-00:00:00      1    mix     gn01   no
       gpu    up  1-00:00:00      1    mix     gn01   no
   gpulong    up  3-00:00:00      1    mix     gn01   no
       fat    up  3-00:00:00      1    mix     gn01   no

As far as hardware resources (such as CPUs, GPUs and RAM) are concerned, the amounts of each resource available to Mufasa's partitions are set by SLURM's accounting system, and are not visible to sinfo. See Partition features for a description of these amounts.

Partition features

The output of sinfo (see above) provides a list of available partitions, but (except for the time limit) does not provide information about the amount of resources that a partition makes available to the user jobs run on it. The amounts of resources can be seen with the command

sacctmgr list qos format=name,priority,maxtres,maxwall -p

which provides the following (very badly formatted) output:

Name|Priority|MaxTRES|MaxWall|
normal|200|cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G|1-00:00:00|
small|500|cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G|12:00:00|
longnormal|100|cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G|3-00:00:00|
gpu|200|cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G|1-00:00:00|
gpulong|100|cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G|3-00:00:00|
fat|50|cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G|3-00:00:00|

Using the sed Linux utility to make the output a bit more legible, the command becomes

sacctmgr list qos format=name,priority,maxtres,maxwall -p | sed 's/|/\t /g'

and provides an output similar to the following:

Name	 Priority	 MaxTRES	 MaxWall	 
normal	 200	 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G	 1-00:00:00	 
small	 500	 cpu=2,gres/gpu:10gb=1,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=16G	 12:00:00	 
longnormal	 100	 cpu=16,gres/gpu:10gb=0,gres/gpu:20gb=0,gres/gpu:40gb=0,mem=128G	 3-00:00:00	 
gpu	 200	 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G	 1-00:00:00	 
gpulong	 100	 cpu=8,gres/gpu:10gb=2,gres/gpu:20gb=2,mem=64G	 3-00:00:00	 
fat	 50	 cpu=32,gres/gpu:10gb=2,gres/gpu:20gb=2,gres/gpu:40gb=2,mem=256G	 3-00:00:00

Its elements are the following (for more information, see SLURM's documentation; a worked reading of one of the rows is given after the list):

Name
name of the partition
Priority
priority assigned to jobs run on the partition
MaxTRES
maximum amount of resources ("Trackable RESources") available to a job running on the partition, where
cpu=K means that the maximum number of processor cores is K
gres/Name:Type=K means that the maximum number of GPUs of class Name:Type (see gres syntax) is K
mem=KG means that the maximum amount of system RAM is K GBytes
MaxWall
maximum wall clock duration of the jobs run on the partition (after which they are killed by SLURM), in format [days-]hours:minutes:seconds
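
For example, the row for the "gpu" partition in the output above can be read as follows:

 cpu=8              at most 8 processor cores per job
 gres/gpu:10gb=2    at most 2 GPUs with 10 GB of onboard RAM
 gres/gpu:20gb=2    at most 2 GPUs with 20 GB of onboard RAM
 mem=64G            at most 64 GB of system RAM
 1-00:00:00         at most 1 day of wall-clock execution time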

Partition availability

An important piece of information that sinfo provides is the availability (also called state) of partitions. Possible partition states are:

up
the partition is available to be allocated work
drain
the partition is not available to be allocated work
down
the same as drain, but due to a failure: i.e., the partition suffered a disruption

A partition in state drain or down requires intervention by a Job Administrator to be restored to up. Jobs waiting for that partition remain on hold until the partition becomes available again.

Choosing the partition on which to run a job

When launching a job (as explained in Executing jobs on Mufasa) a user should select the partition that is most suitable for it according to the job's features. Launching a job on a partition avoids the need for the user to specify explicitly all of the resources that the job requires, relying instead (for unspecified resources) on the default amounts defined for the partition. Partition features explains how to find out how many of Mufasa's resources are associated to each partition.

By selecting the right partition for their job, a user can pre-define the job's requirements without having to specify them explicitly: this makes partitions very handy and avoids possible mistakes. However, users can, if needed, change the resources requested by their jobs with respect to the default values associated with the chosen partition. Any element of the default resource assignment provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. It still makes sense, though, to choose the most suitable partition for a job in the first place, and then to specify the job's requirements only for those resources whose default value is unsuitable.

Resource requests by the user launching a job can be either lower or higher than the partition's default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if one is set. If a user tries to run a job on a partition while requesting more of a resource than the partition-specified maximum, the run command is refused.
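For instance (using the limits listed in Partition features, with ./my_program as a placeholder executable), the first of the following commands is refused because the "gpu" partition allows at most cpu=8, while the second is accepted:

srun -p gpu --cpus-per-task 16 ./my_program
srun -p gpu --cpus-per-task 8 ./my_program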

In general, the larger the fraction of system resources that a job asks for, the heavier the job is for Mufasa's limited capabilities. Since SLURM prioritises lighter jobs over heavier ones (in order to maximise the number of completed jobs), it is a very bad idea to ask for more resources than a job actually needs: doing so will have the effect of delaying (possibly for a long time) the job's execution.

Running jobs with SLURM: generalities

Note: these are general considerations. See Executing jobs on Mufasa for instructions about running your own processing jobs on Mufasa.


The commands that SLURM provides to run jobs are

srun [options] <command_to_be_run_via_SLURM>

and

sbatch [options] <command_to_be_run_via_SLURM>

(see SLURM documentation: srun, sbatch).

In both cases, <command_to_be_run_via_SLURM> can be any program or Linux shell script. By using srun or sbatch, the command or script specified by <command_to_be_run_via_SLURM> (including any programs launched by it) is added to SLURM's execution queues.

The main difference between srun and sbatch is that the former locks the shell from which it has been launched, so it is really only suitable for processes that use the console to interact with their user. (You can, though, detach from that shell and come back later using screen.) sbatch, on the other hand, does not lock the shell and simply adds the job to the queue, but it does not allow the user to interact with the process while it is running.

Additionally, with sbatch <command_to_be_run_via_SLURM> can be an execution script, i.e. a special (and SLURM-specific) type of Linux shell script that includes SBATCH directives. SBATCH directives can be used to specify the values of some of the parameters that would otherwise have to be set using the [options] part of the sbatch command. This is handy because it allows the user to write the parameters down in an execution script instead of typing them on the command line when launching a job, which greatly reduces the possibility of mistakes. Also, an execution script is easy to keep and reuse.

The [options] part of srun and sbatch commands is used to tell SLURM the conditions under which it has to execute the job; in particular, it is used to specify what system resources SLURM should reserve for the job.

A quick way to define the set of resources that a program will be provided with is to use SLURM partitions. This is done with option -p <partition_name>. This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign the job will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.

For instance, running

srun -p small ./my_program

makes SLURM run my_program on the partition named “small”. Running the program this way means that the resources associated to this partition will be available to it for use.

Running interactive jobs via SLURM

As explained, SLURM command srun is suitable for launching interactive user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a bash shell (i.e. a terminal session) with a command similar to

srun --pty /bin/bash

and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)

exit

Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only use 2 CPUs). Running programs with srun or sbatch, instead, ensures that they can access all the resources managed by SLURM.

GPU resources (if needed) must always be requested explicitly with parameter --gres=gpu:<10|20|40>gb:K, where K is an integer between 1 and the maximum number of GPUs of that type available to the partition (see gres syntax). For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command

srun --gres=gpu:10gb:1 --pty /bin/bash

and then run the interactive program from the newly opened shell.
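
As a hypothetical end-to-end example (my_interactive_program.py is a placeholder for the user's own program), the whole sequence could look like this:

srun --gres=gpu:10gb:1 --pty /bin/bash    # open a bash shell via SLURM, with one 10 GB GPU
python3 my_interactive_program.py         # run the interactive program from the SLURM-spawned shell
exit                                      # close the SLURM-spawned shell when done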

An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run /bin/bash on one of the available partitions. For instance, to run the shell on partition “small” the command is

srun -p small --pty /bin/bash

Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as (SLURM ID xx) (where xx is the ID of the /bin/bash process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.

Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command

echo $SLURM_JOB_ID

If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.
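
For example (the job ID shown is purely illustrative):

# in the “base” shell:
echo $SLURM_JOB_ID
                      (empty output)
# in a shell run via SLURM:
echo $SLURM_JOB_ID
523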

Executing jobs on Mufasa

The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done.

Since all computation on Mufasa must occur within Docker containers, the jobs run by Mufasa users are always containers, except for menial, non-computationally intensive jobs.

The process of launching a user job on Mufasa involves the following steps:

Step 1 [for interactive and non-interactive user jobs] Use SLURM to run the Docker container where the job will take place
Step 2 [for interactive user jobs only] Launch the user job from within the container

Interactive and non-interactive user jobs

Interactive user jobs
are jobs that require interaction with the user while they are running, via a bash shell running within the Docker container. The shell is used to receive commands from the user and/or print output messages. For interactive user jobs, the job is usually launched manually by the user (with a command issued via the shell) after the Docker container is in execution.
Non-interactive user jobs
are the most common variety. The user prepares the Docker container in such a way that, when in execution, the container autonomously puts the user's jobs into execution. The user does not have any communication with the Docker container while it is in execution.

Both interactive and non-interactive user jobs can be run via a (very complex) command directly issued from the [[System#Accessing Mufasa|terminal opened via SSH]]. To reduce the possibility of mistakes, it is usually preferable to define an execution script that takes care of launching the job.

Job output

The whole point of running a user job is to collect its output. Usually, such output takes the form of one or more files generated within the filesystem of the Docker container.

As explained below, SLURM includes a mechanism to mount a part of Mufasa's own filesystem onto the container's filesystem: so when the job running within the container writes to this mounted part, it actually writes to Mufasa's filesystem. This means that when the Docker container ends its execution, its output files persist in Mufasa's filesystem (usually in a subdirectory of the user's own /home directory) and can be retrieved by the user at a later time.

The same mechanism can be used to allow user jobs running into a Docker container to read their input data from Mufasa's filesystem (usually a subdirectory of the user's own /home directory).

Using SLURM to run a Docker container

The first step to run a user job on Mufasa is to run the Docker container where the job will take place. A container is a “sandbox” containing the environment where the user's application operates. Parts of Mufasa's filesystem can be made visible (and writable, if they belong to the user's /home directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa's filesystem: for instance, to read data and write results.

Each user is in charge of preparing the Docker container(s) where the user's jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.

In order to run a Docker container via SLURM, a user must use a command similar to the following ones:

For interactive user jobs:

srun [-p <partition_name>] --container-image <container_path.sqsh> --job-name=<jobname> --no-container-entrypoint --container-mounts=<mufasa_dir>:<docker_dir> [--gres=<gpu_resources>] [--mem=<mem_resources>] [--cpus-per-task <cpu_amount>] [--time=<hh:mm:ss>] --pty <command_to_run_within_container>

For non-interactive user jobs:

srun [-p <partition_name>] --container-image <container_path.sqsh> --job-name=<jobname> --no-container-entrypoint --container-mounts=<mufasa_dir>:<docker_dir> [--gres=<gpu_resources>] [--mem=<mem_resources>] [--cpus-per-task <cpu_amount>] [--time=<hh:mm:ss>] [command_to_run_within_container]

Parts of the above commands within < > are mandatory, parts within [ ] are optional.

Below, the elements of these commands are explained.

‑p <partition_name>
specifies the SLURM partition on which the job will be run.
Important! If -p <partition_name> is used, options that specify how many resources to assign to the job (such as --mem=<mem_resources>, --cpus-per-task <cpu_number> or --time=<hh:mm:ss>) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception concerns option --gres=<gpu_resources>, which is always required (see below) if the job needs access to GPUs.
--job-name=<jobname>
Specifies a name for the job. The specified name will appear along with the JOBID number when querying running jobs on the system with squeue. The default job name (i.e., the one assigned to the job when --job-name is not used) is the executable program's name.
‑‑container-image <container_path.sqsh>
specifies the container to be run
‑‑no‑container‑entrypoint
specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is an element of a Docker container: a command that gets executed as soon as the container is in execution. Option --no-container-entrypoint is useful when, for some reason, the user does not want the container's entrypoint to be run.
‑‑container‑mounts=<mufasa_dir>:<docker_dir>
specifies what parts of Mufasa's filesystem will be available within the container's filesystem, and where they will be mounted. This is necessary to let the container get input data from Mufasa and/or write output data to Mufasa. For instance, if <mufasa_dir>:<docker_dir> takes the value /home/mrossi:/data this tells srun to mount Mufasa's directory /home/mrossi in position /data within the filesystem of the Docker container. When the docker container reads or writes files in directory /data of its own (internal) filesystem, what actually happens is that files in /home/mrossi get manipulated instead. /home/mrossi is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.
‑‑gres=<gpu_resources>
specifies what GPUs to assign to the container. gpu_resources is a comma-delimited list where each element has the form gpu:<Type>:<amount>, where <Type> is one of the types of GPU available on Mufasa (see gres syntax) and <amount> is an integer between 1 and the number of GPUs of such type available to the partition. For instance, <gpu_resources> may be gpu:40gb:1,gpu:10gb:3, corresponding to asking for one "full" GPU and 3 "small" GPUs.
Important! The ‑‑gres parameter is mandatory if the job needs to use the system's GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount), GPUs must always be explicitly requested.
‑‑mem=<mem_resources>
specifies the amount of RAM to assign to the container; for instance, <mem_resources> may be 200G
‑‑cpus-per-task <cpu_amount>
specifies how many CPUs to assign to the container; for instance, <cpu_amount> may be 2
‑‑time=<d-hh:mm:ss>
specifies the maximum time the job is allowed to run, in the format days-hours:minutes:seconds, where days is optional; for instance, <d-hh:mm:ss> may be 72:00:00
‑‑pty
specifies that the job will be interactive (this is necessary when <command_to_run_within_container> is /bin/bash: see Running interactive jobs via SLURM)
<command_to_run_within_container>, [command_to_run_within_container]
the command that will be put into execution within the Docker container as soon as it is operative. Note that this is mandatory for interactive user jobs and optional for non-interactive user jobs.
Important! This command will be run in the environment created by Docker.


For interactive user jobs, a typical value for <command_to_run_within_container> is /bin/bash. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for <command_to_run_within_container> is python, which launches an interactive Python session from which the user will then run their job.

For non-interactive user jobs, using [command_to_run_within_container] is one of the two available methods to run the program(s) that the user wants to be executed within the Docker container. The other available method to run the user job(s) is to use the entrypoint of the container. The use of [command_to_run_within_container] is therefore optional.
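
As a concrete, purely illustrative example of an interactive job (the container image path, mount directories, job name and resource amounts are placeholders to be adapted to your own case), the following command runs a bash shell inside a container on the "gpu" partition, with one 20 GB GPU and with Mufasa's /home/mrossi mounted on /data inside the container:

srun -p gpu --container-image /home/mrossi/containers/pytorch.sqsh \
     --job-name=mrossi_test --no-container-entrypoint \
     --container-mounts=/home/mrossi:/data \
     --gres=gpu:20gb:1 --pty /bin/bash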

Using execution scripts to run jobs

The srun commands described in Using SLURM to run a Docker container are very complex, and it's easy to forget some option or make mistakes while using them. For non-interactive jobs, there is a solution to this problem.

When the user job is non-interactive, in fact, the srun command can be substituted with a much simpler sbatch command. As already explained, sbatch can make use of an execution script to specify all the parts of the command to be run via SLURM. So the command to run the Docker container where the user job will take place becomes

sbatch <execution_script>

An execution script is a special type of Linux script that includes SBATCH directives. SBATCH directives are used to specify the values of the parameters that are otherwise set in the [options] part of an srun command.

Note on Linux shell scripts
A shell script is a text file that will be run by the bash shell. In order to be acceptable as a bash script, a text file must:
  • have the “executable” flag set
  • have #!/bin/bash as its very first line

Usually, a Linux shell script is given a name ending in .sh, such as my_execution_script.sh, but this is not mandatory.

Within any shell script, lines preceded by # are comments (with the notable exception of the initial #!/bin/bash line). Use of blank lines as spacers is allowed.

An execution script is a Linux shell script composed of two parts:

  1. a preamble, composed of SBATCH directives with which the user specifies the values to be given to parameters
  2. [optionally] one or more srun commands that launch jobs with SLURM using the parameter values specified in the preamble

The srun commands are optional because jobs can also be launched by the Docker container's own entrypoint.

Below is an execution script template to be copied and pasted into your own execution script text file.

The template includes all the options already described above, plus a few additional useful ones (for instance, those that make SLURM send email messages to the user when certain events occur in the lifecycle of their job). Information about all the possible options can be found in SLURM's own documentation.

All the SBATCH directives in the script template below are inactive: SLURM only reads directives from lines that begin exactly with #SBATCH, and the space after the leading "#" makes these lines plain comments. To activate a directive, remove that space so that the line starts with #SBATCH (and replace the placeholders between < > with actual values).

#!/bin/bash

#----------------start of preamble----------------

# SBATCH -p <partition_name>

# SBATCH --container-image <container_path.sqsh>

# SBATCH --job-name=<name>

# SBATCH --no-container-entrypoint

# SBATCH --container-mounts=<mufasa_dir>:<docker_dir>

# SBATCH --gres=<gpu_resources>

# SBATCH --mem=<mem_resources>

# SBATCH --cpus-per-task <cpu_amount>

# SBATCH --time=<d-hh:mm:ss>

# Note: the following SBATCH directives introduce new options not described before

# SBATCH --mail-user <email_address>

# this directive activates SLURM's email notifications and specifies the address that SLURM will send notifications to

# SBATCH --mail-type BEGIN

# SBATCH --mail-type END

# SBATCH --mail-type FAIL

# the above 3 directives tell SLURM to send email notifications when the job execution begins/ends/fails

#----------------end of preamble----------------

# srun <command_to_run_within_container>

# to run the user job, either uncomment (and personalise) the above srun command or use the entrypoint of the Docker container
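
For reference, the following is a purely illustrative filled-in version of the template above: partition, container image, mount paths, job name, resource amounts, email address and the final command are all placeholders to be adapted to your own case.

#!/bin/bash

#SBATCH -p gpu
#SBATCH --container-image /home/mrossi/containers/pytorch.sqsh
#SBATCH --job-name=mrossi_training
#SBATCH --container-mounts=/home/mrossi:/data
#SBATCH --gres=gpu:20gb:1
#SBATCH --mem=32G
#SBATCH --cpus-per-task 4
#SBATCH --time=1-00:00:00
#SBATCH --mail-user mario.rossi@example.com
#SBATCH --mail-type END

# launch the (hypothetical) non-interactive user job inside the container
srun python3 /data/train.py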

Nvidia Pyxis

Some of the options described above are specifically dedicated to Docker containers: these are provided by the Nvidia Pyxis package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them.

More specifically, options ‑‑container-image, ‑‑no‑container‑entrypoint, ‑‑container-mounts are provided to srun by Pyxis.

See the Nvidia Pyxis github page for additional information about the options that it provides to srun.

Launching a user job from within a Docker container

For interactive user jobs, once the Docker container (run as explained here) is up and running, the user is dropped to the interactive environment specified by <command_to_run_within_container>. This interactive environment can be, for instance, a bash shell or an interactive Python console. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).

Please note that the interactive environment of the Docker container does not have any relation with Mufasa's system. The only contact point is the part of Mufasa's filesystem that has been grafted to the container's filesystem via the ‑‑container‑mounts option of srun. In particular, none of the software packages (such as the Nvidia drivers) installed on Mufasa are available in the container, unless they have been installed in it at preparation time (as explained in Docker), or manually after the container is put in execution.

Also note that, once a Docker container launched with srun is in execution, its own bash shell is completely indistinguishable from the bash shell of Mufasa where the srun command that put the container in execution was issued. The two shells share the same terminal window. The only clue that you are now in the container's shell may be the command prompt, which should show your location as /opt.

Detaching from a running job with screen

A consequence of the way srun operates is that if you launch an interactive user job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell), you should use command srun inside a screen session (often simply called "a screen"), then detach from the screen (here is one of many tutorials about screen available online). Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the screen.

More specifically, to create a screen session and run a job in it:

  1. Connect to Mufasa with SSH
  2. From the Mufasa shell, run
    screen
  3. In the screen session thus created (it has the look of an empty shell), launch your job with srun
  4. Detach from the screen session by pressing ctrl + A followed by D: you will come back to the original Mufasa shell, while your process will go on running in the screen session
  5. You can now close the SSH connection to Mufasa without damaging your process

Later, when you are ready to resume contact with your running process:

  1. Connect to Mufasa with SSH
  2. In the Mufasa shell, run
    screen -r
  3. You are now back to the screen session where you launched your job
  4. When you do not need the screen containing your job anymore, destroy it by pressing (within the screen) ctrl + A followed by \ (i.e., backslash)

A use case for screen is writing your program in such a way that it prints progress advancement messages as it goes on with its work. Then, you can check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.
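
Putting the steps above together, a typical session could look like this (the srun command is just an example):

screen                           # create a screen session
srun -p small --pty /bin/bash    # launch the interactive job inside the screen
# press ctrl + A followed by D to detach; the SSH connection can now be closed
# ...later, after reconnecting to Mufasa via SSH:
screen -r                        # reattach to the screen session where the job is running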

Automatic job caching

When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical and therefore relatively slow) HDDs where /home partitions reside, replacing such accesses with accesses to (solid-state and therefore much faster) SSDs.

Each time a job is run via SLURM, this is what happens automatically:

  1. Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user's own /home) to a cache space located on system SSDs
  2. Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files
  3. The executables create their output files in the cache space
  4. When the user jobs end, Mufasa copies the output files from the cache space back to the user's own /home

The whole process is completely transparent to the user. The user simply prepares the executable (or the execution script) in a subdirectory of their /home directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of /home, exactly as if the execution actually occurred there.

Important! The caching mechanism requires that during job execution the user does not modify the contents of the /home subdirectory where executable and data were at execution time. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.

Monitoring and managing jobs

SLURM provides Job Users with tools to inspect and manage jobs. While a Job User is able to see all users' jobs, they are only allowed to interact with their own.

The main commands used to interact with jobs are squeue to inspect the scheduling queues and scancel to terminate queued or running jobs.

Inspecting jobs with squeue

Running command

squeue

provides an output similar to the following:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  520       fat     bash acasella  R 2-04:10:25      1 gn01
  523       fat     bash amarzull  R    1:30:35      1 gn01
  522       gpu     bash    clena  R   20:51:16      1 gn01

This output comprises the following information:

JOBID
Numerical identifier of the job assigned by SLURM
This identifier is used to intervene on the job, for instance with scancel
PARTITION
the partition that the job is run on
NAME
the name assigned to the job; can be personalised using the --job-name option
USER
username of the user who launched the job
ST
job state (see Job state for further information)
TIME
time that has passed since the beginning of job execution
NODES
number of nodes where the job is being executed (for Mufasa, this is always 1 as it is a single machine)
NODELIST (REASON)
name of the nodes where the job is being executed: for Mufasa it is always gn01, which is the name of the node corresponding to Mufasa.

To limit the output of squeue to the jobs owned by user <username>, it can be used like this:

squeue -u <username>

If needed, complete information about a job can be obtained using command

scontrol show job <JOBID>

where <JOBID> is the number from the first column of the output of squeue. The output of this command is similar to the following:

JobId=936 JobName=bash
   UserId=acasella(1001) GroupId=acasella(1001) MCS_label=N/A
   Priority=7885 Nice=0 Account=research QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=03:21:59 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2022-02-08T11:57:24 EligibleTime=2022-02-08T11:57:24
   AccrueTime=Unknown
   StartTime=2022-02-08T11:57:24 EndTime=2022-02-11T11:57:24 Deadline=N/A
   PreemptEligibleTime=2022-02-08T11:57:24 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-02-08T11:57:24 Scheduler=Main
   Partition=fat AllocNode:Sid=rk018445:4034
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gn01
   BatchHost=gn01
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=128G,node=1,billing=8,gres/gpu:40gb=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=128G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/acasella
   Power=
   TresPerNode=gres:gpu:40gb:1

Job state

Jobs typically pass through several states in the course of their execution. Job state is shown in column "ST" of the output of squeue as an abbreviated code (e.g., "R" for RUNNING).

The most relevant codes and states are the following:

PD PENDING
Job is awaiting resource allocation.
R RUNNING
Job currently has an allocation.
S SUSPENDED
Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
CG COMPLETING
Job is in the process of completing. Some processes on some nodes may still be active.
CD COMPLETED
Job has terminated all processes on all nodes with an exit code of zero.

Beyond these, there are other (less frequent) job states. The SLURM doc page for squeue provides a complete list of them.

Canceling a job with scancel

It is possible to cancel a job using command scancel, either while it is waiting for execution or when it is in execution (in this case you can choose what system signal to send the process in order to terminate it). The following are some examples of use of scancel adapted from SLURM's documentation.

scancel <JOBID>

removes queued job <JOBID> from the execution queue.

scancel --signal=TERM <JOBID>

terminates execution of job <JOBID> with signal SIGTERM (request to stop).

scancel --signal=KILL <JOBID>

terminates execution of job <JOBID> with signal SIGKILL (force stop).

scancel --state=PENDING --user=<username> --partition=<partition_name>

cancels all pending jobs belonging to user <username> in partition <partition_name>.
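
For example, a typical sequence to terminate one of your own running jobs could be the following (the username and job ID are purely illustrative):

squeue -u mrossi             # list your jobs and find the JOBID (e.g. 523)
scancel --signal=TERM 523    # ask SLURM to terminate job 523 with SIGTERM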