<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://biohpc.deib.polimi.it/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=10.79.2.181</id>
	<title>Mufasa (BioHPC) - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://biohpc.deib.polimi.it/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=10.79.2.181"/>
	<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=Special:Contributions/10.79.2.181"/>
	<updated>2026-05-10T18:15:33Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=Docker&amp;diff=416</id>
		<title>Docker</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=Docker&amp;diff=416"/>
		<updated>2022-01-26T13:06:23Z</updated>

		<summary type="html">&lt;p&gt;10.79.2.181: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[Under construction]&lt;/div&gt;</summary>
		<author><name>10.79.2.181</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=246</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=246"/>
		<updated>2022-01-18T16:08:44Z</updated>

		<summary type="html">&lt;p&gt;10.79.2.181: /* Launching a user job from within a Docker container */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of Mufasa that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Job Users are by necessity SLURM users (see [[System#The SLURM job scheduling system|The SLURM job scheduling system]]) so you may also want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
= SLURM Partitions =&lt;br /&gt;
&lt;br /&gt;
Several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039; in SLURM terminology. Each partition has features (in terms of resources available to the jobs on that queue) that make it suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug         up   infinite      1    mix gn01&lt;br /&gt;
small*        up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;small&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.&lt;br /&gt;
&lt;br /&gt;
On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to. A complete list of the features of each partition can be obtained with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo --Format=all&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but its output can be overwhelming. For instance, in the example above the output of &amp;lt;code&amp;gt;sinfo --Format=all&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A less comprehensive but more readable view of partition features can be obtained via a tailored &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command, i.e. one that only asks for the features that are most relevant to Mufasa users. An example of such a command is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a %.4c %.17B %.60G %.11l %.11L %.4r&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command produces output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
 PARTITION  AVAIL CPUS MAX_CPUS_PER_NODE                                                         GRES   TIMELIMIT DEFAULTTIME ROOT&lt;br /&gt;
     debug     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    infinite         n/a  yes&lt;br /&gt;
    small*     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    12:00:00       15:00   no&lt;br /&gt;
    normal     up   62                24        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
longnormal     up   62                24        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       gpu     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
   gpulong     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       fat     up   62                48        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns in this output correspond to the following information (from [https://slurm.schedmd.com/sinfo.html SLURM docs]), where the &amp;#039;&amp;#039;node&amp;#039;&amp;#039; is Mufasa:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
: %P Partition name followed by &amp;quot;*&amp;quot; for the default partition&lt;br /&gt;
&lt;br /&gt;
: %a State/availability of a partition&lt;br /&gt;
&lt;br /&gt;
: %c Number of CPUs per node&lt;br /&gt;
&lt;br /&gt;
: %B The max number of CPUs per node available to jobs in the partition&lt;br /&gt;
&lt;br /&gt;
: %G Generic resources (gres) associated with the nodes [&amp;#039;&amp;#039;for Mufasa these correspond to the [[System#Hardware|virtual GPUs defined with MIG]]&amp;#039;&amp;#039;]&lt;br /&gt;
&lt;br /&gt;
: %l Maximum time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: %L Default time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: %r Only user root may initiate jobs, &amp;quot;yes&amp;quot; or &amp;quot;no&amp;quot;&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the actual command, field identifiers &amp;lt;code&amp;gt;%...&amp;lt;/code&amp;gt; are preceded by width specifiers in the form &amp;lt;code&amp;gt;.N&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;N&amp;lt;/code&amp;gt; is a positive integer. The specifiers define how many characters to reserve for each field in the command output, and can be used to improve readability.&lt;br /&gt;
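&lt;br /&gt;
For instance (the field selection below is purely illustrative, not a recommended format), a narrower query showing only partition names and their time limits could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# reserve 12 characters for the partition name and 11 for each time column&lt;br /&gt;
sinfo -o &amp;quot;%.12P %.11l %.11L&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;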
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides to users is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions.&lt;br /&gt;
&lt;br /&gt;
For operational partitions, availability is &amp;#039;&amp;#039;up&amp;#039;&amp;#039;, meaning that the partition is available to be allocated work. A state/availability equal to &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; means that the partition is not available to be allocated work, while &amp;#039;&amp;#039;down&amp;#039;&amp;#039; means the same as &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; but also that the partition failed, i.e. that it suffered a disruption.&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039; state. Jobs waiting for that partition are paused.&lt;br /&gt;
&lt;br /&gt;
== Choosing the &amp;quot;right&amp;quot; partition ==&lt;br /&gt;
&lt;br /&gt;
When launching a job (as explained in [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]]) a user should select the partition that is most suitable for it according to the job&amp;#039;s features. Launching a job on a partition avoids the need for the user to explicitly specify all of the resources that the job requires, relying instead on the set of resources already defined for the partition. &lt;br /&gt;
&lt;br /&gt;
By selecting the right partition for their job, a user can pre-define the job&amp;#039;s requirements without having to specify them explicitly, which makes partitions very handy and avoids possible mistakes. However, users can, if needed, change the resources requested by their jobs with respect to the default values associated with the chosen partition.&lt;br /&gt;
&lt;br /&gt;
Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests by the user launching a job can be both lower and higher than the default value of the partition for that resource. However, they cannot exceed the maximum value that the partition allows for requests of such resource, if set. If a user tries to launch on a partition a job that requests a higher value of a resource than the partition‑specified maximum, the launch command is refused.&lt;br /&gt;
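&lt;br /&gt;
For instance (an illustrative sketch; the partition name and values are assumptions, not recommended settings), a job could be launched on the “normal” partition while overriding only its default memory and time limits:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# run on the normal partition, overriding its default memory and time limits&lt;br /&gt;
srun -p normal --mem=64G --time=0-06:00:00 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;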
&lt;br /&gt;
One of the most important resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined time duration. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done. &lt;br /&gt;
&lt;br /&gt;
Considering that [[System#Docker Containers|all computation on Mufasa must occur within Docker containers]], the jobs run by Mufasa users are always containers except for menial, non-computationally intensive jobs. The process of launching a user job on Mufasa involves two steps:&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
:; Step 1&lt;br /&gt;
:: [[User Jobs#Using SLURM to run a Docker container|Use SLURM to run the Docker container where the job will take place]]&lt;br /&gt;
&lt;br /&gt;
:; Step 2&lt;br /&gt;
:: [[User Jobs#Launching a user job from within a Docker container|Launch the job from within the Docker container]]&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
As an optional preparatory step, it is often useful to define an [[User Jobs#Using execution scripts to run jobs|execution script]] to simplify the launching process and reduce the possibility of mistakes.&lt;br /&gt;
&lt;br /&gt;
The commands that SLURM provides to run jobs are &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see SLURM documentation: [https://slurm.schedmd.com/srun.html srun], [https://slurm.schedmd.com/sbatch.html sbatch]). The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for processes that use the console to interact with their user; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell and simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
Among the options available for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, one of the most important is &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt; is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many GPUs the program requests for use. Since GPUs are the scarcest resources of Mufasa, this option must always be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
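&lt;br /&gt;
For instance (the program path below is a placeholder), a job that needs one GPU could be launched with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;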
&lt;br /&gt;
As [[User Jobs#SLURM Partitions|already explained]], a quick way to define the set of resources that a program will have access to is to use option &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, options that define how many resources to assign to the job, such as &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, will only be able to provide the job with resources that are available to the chosen partition. Jobs that request resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; on the partition named “small”. Running the program this way means that the resources associated with this partition will be available to it.&lt;br /&gt;
&lt;br /&gt;
= Using SLURM to run a Docker container =&lt;br /&gt;
&lt;br /&gt;
The first step to run a user job on Mufasa is to run the [[System#Docker Containers|Docker container]] where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p &amp;lt;partition_name&amp;gt; --container-image=&amp;lt;container_path.sqsh&amp;gt; --no-container-entrypoint --container-mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt; --gres=&amp;lt;gpu_resources&amp;gt; --mem=&amp;lt;mem_resources&amp;gt; --cpus-per-task &amp;lt;cpu_amount&amp;gt; --pty --time=&amp;lt;d-hh:mm:ss&amp;gt; &amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. These options are explained below.&lt;br /&gt;
&lt;br /&gt;
;-p &amp;lt;partition_name&amp;gt;&lt;br /&gt;
: specifies the resource partition on which the job will be run.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;--mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--cpus-per-task &amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;lt;code&amp;gt;--gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;: GPU resources, in fact, must always be explicitly requested with option &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt;, otherwise no access to GPUs is granted to the job.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑container-image=&amp;lt;container_path.sqsh&amp;gt;&lt;br /&gt;
: specifies the container to be run&lt;br /&gt;
&lt;br /&gt;
;‑‑no‑container‑entrypoint&lt;br /&gt;
: specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option ‑‑no‑container‑entrypoint is useful when the user is not sure of the effect of such a command.&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;lt;code&amp;gt;&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/code&amp;gt; takes the value &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; this tells srun to mount Mufasa&amp;#039;s directory &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; in position &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; within the filesystem of the Docker container. When the docker container reads or writes files in directory &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; of its own (internal) filesystem, what actually happens is that files in &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; get manipulated instead. &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&lt;br /&gt;
;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
: specifies what GPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:2&amp;lt;/code&amp;gt;, which corresponds to giving the job control of 2 entire large‑size GPUs.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must always be explicitly requested with &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
: specifies the amount of RAM to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑cpus-per-task &amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
: specifies how many CPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑pty&lt;br /&gt;
: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[User Jobs#Running interactive jobs via SLURM|Running interactive jobs via SLURM]])&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies the maximum time allowed to the job to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
: the executable that will be run within the Docker container as soon as it is operative. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; to launch non-interactive programs.&lt;br /&gt;
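&lt;br /&gt;
As a concrete illustration (all paths and resource values below are placeholders, not recommended settings), the following command runs a container image on the “gpu” partition, mounts the home directory of the user inside the container, requests one 10 GB GPU slice, and opens an interactive shell inside the container:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# illustrative values: adapt image path, mount points and resources to your case&lt;br /&gt;
srun -p gpu --container-image=/home/myuser/my_image.sqsh --no-container-entrypoint \&lt;br /&gt;
     --container-mounts=/home/myuser:/data --gres=gpu:10gb:1 --mem=32G \&lt;br /&gt;
     --cpus-per-task 4 --time=12:00:00 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;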
&lt;br /&gt;
== Nvidia Pyxis ==&lt;br /&gt;
&lt;br /&gt;
Some of the options described above are specifically dedicated to Docker containers: these are provided by the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. More specifically, options &amp;lt;code&amp;gt;‑‑container-image&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑no‑container‑entrypoint&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;‑‑container-mounts&amp;lt;/code&amp;gt; are provided to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; by Pyxis.&lt;br /&gt;
&lt;br /&gt;
= Launching a user job from within a Docker container =&lt;br /&gt;
&lt;br /&gt;
Once the Docker container (run as [[User Jobs#Using SLURM to run a Docker container|explained here]]) is up and running, the user is dropped to the interactive environment specified by &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
= Running interactive jobs via SLURM =&lt;br /&gt;
&lt;br /&gt;
As explained, SLURM command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with a command similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only access 2 CPUs). By contrast, running programs with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;. For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;lt;code&amp;gt;(SLURM ID xx)&amp;lt;/code&amp;gt; (where &amp;lt;code&amp;gt;xx&amp;lt;/code&amp;gt; is the ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
= Detach from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). You can then disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, to create a screen session and run a job in it:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* From the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* In the screen session thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen session with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, while your process will go on running in the screen session&lt;br /&gt;
* You can now close the SSH connection to Mufasa without damaging your process&lt;br /&gt;
&lt;br /&gt;
Later, when you are ready to resume contact with your running process:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* In the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -r&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* You are now back to the screen session where you launched your job&lt;br /&gt;
&lt;br /&gt;
* When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;X&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A typical use case for &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; is a program written so that it prints progress messages as it runs: you can then check its progress by periodically reconnecting to the screen session where the program is running and reading the messages it has printed.&lt;br /&gt;
&lt;br /&gt;
= Using execution scripts to run jobs =&lt;br /&gt;
&lt;br /&gt;
Previous Sections of this page explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line.&lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and, most importantly, can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each preceded by the keyword &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set&lt;br /&gt;
* have &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; as its very first line&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&lt;br /&gt;
&lt;br /&gt;
To execute the script, just open a terminal (such as the one provided by an SSH connection with Mufasa), write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press the &amp;lt;enter&amp;gt; key. The script is executed in the terminal, and any output (e.g., whatever is printed by any &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands in the script) is shown on the terminal.&lt;br /&gt;
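&lt;br /&gt;
For instance (the file name is just an example), the executable flag can be set and the script run like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# make the script executable (only needed once), then run it&lt;br /&gt;
chmod +x my_execution_script.sh&lt;br /&gt;
./my_execution_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;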
&lt;br /&gt;
Within a bash script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of an execution script (actual instructions are shown in bold; the rest are comments):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Note: these are examples. Put your own SBATCH directives below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --job-name=myjob&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; name assigned to the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --cpus-per-task=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of threads allocated to each task&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --mem-per-cpu=500M&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; amount of memory per CPU core&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --gres=gpu:1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of GPUs per node&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --partition=small&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the partition to run your jobs on&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --time=0-00:01:00&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; time assigned to your jobs to run (format: days-hours:minutes:seconds, with days optional)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------srun commands-----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Put your own srun command(s) below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------end of srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the example above shows, beyond the initial directive &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; the script includes a series of &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives used to specify parameter values, and finally one or more &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands that run the jobs. Any parameter accepted by commands &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can be used as an &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directive in an execution script.&lt;br /&gt;
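&lt;br /&gt;
For reference, the same example is reproduced below as it would appear in a plain text file (the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; line is a placeholder to be replaced with an actual command):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=myjob          # name assigned to the job&lt;br /&gt;
#SBATCH --cpus-per-task=1         # number of threads allocated to each task&lt;br /&gt;
#SBATCH --mem-per-cpu=500M        # amount of memory per CPU core&lt;br /&gt;
#SBATCH --gres=gpu:1              # number of GPUs per node&lt;br /&gt;
#SBATCH --partition=small         # partition to run the job on&lt;br /&gt;
#SBATCH --time=0-00:01:00         # time limit (days-hours:minutes:seconds)&lt;br /&gt;
&lt;br /&gt;
srun &amp;lt;command_to_run&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;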
&lt;br /&gt;
= Job caching =&lt;br /&gt;
&lt;br /&gt;
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical and therefore relatively slow) HDDs where &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; partitions reside, substituting such accesses with accesses to (solid-state and therefore much faster) SSDs.&lt;br /&gt;
&lt;br /&gt;
Each time a job is run via SLURM, this is what happens automatically:&lt;br /&gt;
&lt;br /&gt;
# Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;) to a cache space located on system SSDs&lt;br /&gt;
# Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files&lt;br /&gt;
# The executables create their output files in the cache space&lt;br /&gt;
# When the user jobs end, Mufasa copies the output files from the cache space back to the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares the executable (or the [[User Jobs#Using execution scripts to run jobs|execution script]]) in a subdirectory of their &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;, exactly as if the execution had actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The caching mechanism requires that &amp;#039;&amp;#039;during job execution&amp;#039;&amp;#039; the user does not modify the contents of the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; subdirectory where executable and data were at execution time. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/overview.html SLURM&amp;#039;s own overview]:&lt;br /&gt;
&lt;br /&gt;
“&amp;#039;&amp;#039;User tools include&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html (link to SLURM docs)] to initiate jobs, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/scancel.html (link to SLURM docs)] to terminate queued or running jobs,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sinfo.html (link to SLURM docs)] to report system status,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/squeue.html (link to SLURM docs)] to report the status of jobs [i.e. to inspect the scheduling queue], and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sacct.html (link to SLURM docs)] to get information about jobs and job steps that are running or have completed.&amp;#039;&amp;#039;”&lt;br /&gt;
&lt;br /&gt;
An example of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
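&lt;br /&gt;
For instance (assuming the job belongs to the user issuing the command), the first job in the list above could be cancelled by passing its JOBID to &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel 520&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;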
&lt;br /&gt;
== Job state ==&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
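To list only your own jobs, &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; can be restricted to a single user (the user name below is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u my_username&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;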
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
; PD PENDING&lt;br /&gt;
: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
; R RUNNING&lt;br /&gt;
: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
; S SUSPENDED&lt;br /&gt;
: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
; CG COMPLETING&lt;br /&gt;
: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
; CD COMPLETED&lt;br /&gt;
: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them, reported here for completeness:&lt;br /&gt;
&lt;br /&gt;
; BF BOOT_FAIL&lt;br /&gt;
: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). &lt;br /&gt;
&lt;br /&gt;
; CA CANCELLED&lt;br /&gt;
: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. &lt;br /&gt;
&lt;br /&gt;
; CF CONFIGURING&lt;br /&gt;
: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting). &lt;br /&gt;
&lt;br /&gt;
; DL DEADLINE&lt;br /&gt;
: Job terminated on deadline. &lt;br /&gt;
&lt;br /&gt;
; F FAILED&lt;br /&gt;
: Job terminated with non-zero exit code or other failure condition. &lt;br /&gt;
&lt;br /&gt;
; NF NODE_FAIL&lt;br /&gt;
: Job terminated due to failure of one or more allocated nodes. &lt;br /&gt;
&lt;br /&gt;
; OOM OUT_OF_MEMORY&lt;br /&gt;
: Job experienced out of memory error. &lt;br /&gt;
&lt;br /&gt;
; PR PREEMPTED&lt;br /&gt;
: Job terminated due to preemption. &lt;br /&gt;
&lt;br /&gt;
; RD RESV_DEL_HOLD&lt;br /&gt;
: Job is being held after requested reservation was deleted. &lt;br /&gt;
&lt;br /&gt;
; RF REQUEUE_FED&lt;br /&gt;
: Job is being requeued by a federation. &lt;br /&gt;
&lt;br /&gt;
; RH REQUEUE_HOLD&lt;br /&gt;
: Held job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RQ REQUEUED&lt;br /&gt;
: Completing job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RS RESIZING&lt;br /&gt;
: Job is about to change size. &lt;br /&gt;
&lt;br /&gt;
; RV REVOKED&lt;br /&gt;
: Sibling was removed from cluster due to other cluster starting the job. &lt;br /&gt;
&lt;br /&gt;
; SI SIGNALING&lt;br /&gt;
: Job is being signaled. &lt;br /&gt;
&lt;br /&gt;
; SE SPECIAL_EXIT&lt;br /&gt;
: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. &lt;br /&gt;
&lt;br /&gt;
; SO STAGE_OUT&lt;br /&gt;
: Job is staging out files. &lt;br /&gt;
&lt;br /&gt;
; ST STOPPED&lt;br /&gt;
: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. &lt;br /&gt;
&lt;br /&gt;
; TO TIMEOUT&lt;br /&gt;
: Job terminated upon reaching its time limit.&lt;/div&gt;</summary>
		<author><name>10.79.2.181</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=245</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=245"/>
		<updated>2022-01-18T16:08:15Z</updated>

		<summary type="html">&lt;p&gt;10.79.2.181: /* Launching a user job from within a Docker container */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of Mufasa that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Job Users are by necessity SLURM users (see [[System#The SLURM job scheduling system|The SLURM job scheduling system]]) so you may also want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
= SLURM Partitions =&lt;br /&gt;
&lt;br /&gt;
Several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039; in SLURM terminology. Each partition has features (in term of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug         up   infinite      1    mix gn01&lt;br /&gt;
small*        up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;small&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.&lt;br /&gt;
&lt;br /&gt;
On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to. A complete list of the features of each partition can be obtained with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo --Format=all&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but its output can be overwhelming. For instance, in the example above the output of &amp;lt;code&amp;gt;sinfo --Format=all&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A less comprehensive but more readable view of partition features can be obtained via a tailored &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command, i.e. one that only asks for the features that are most relevant to Mufasa users. An example of such command is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a %.4c %.17B %.60G %.11l %.11L %.4r&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Such command provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
 PARTITION  AVAIL CPUS MAX_CPUS_PER_NODE                                                         GRES   TIMELIMIT DEFAULTTIME ROOT&lt;br /&gt;
     debug     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    infinite         n/a  yes&lt;br /&gt;
    small*     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    12:00:00       15:00   no&lt;br /&gt;
    normal     up   62                24        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
longnormal     up   62                24        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       gpu     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
   gpulong     up   62         UNLIMITED        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       fat     up   62                48        gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns in this output correspond to the following information (from [https://slurm.schedmd.com/sinfo.html SLURM docs]), where the &amp;#039;&amp;#039;node&amp;#039;&amp;#039; is Mufasa:&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
: %P Partition name followed by &amp;quot;*&amp;quot; for the default partition&lt;br /&gt;
&lt;br /&gt;
: %a State/availability of a partition&lt;br /&gt;
&lt;br /&gt;
: %c Number of CPUs per node&lt;br /&gt;
&lt;br /&gt;
: %B The max number of CPUs per node available to jobs in the partition&lt;br /&gt;
&lt;br /&gt;
: %G Generic resources (gres) associated with the nodes [&amp;#039;&amp;#039;for Mufasa these correspond to the [[System#Hardware|virtual GPUs defined with MIG]]&amp;#039;&amp;#039;]&lt;br /&gt;
&lt;br /&gt;
: %l Maximum time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: %L Default time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: %r Only user root may initiate jobs, &amp;quot;yes&amp;quot; or &amp;quot;no&amp;quot;&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the actual command, field identifiers &amp;lt;code&amp;gt;%...&amp;lt;/code&amp;gt; are preceded by width specifiers in the form &amp;lt;code&amp;gt;.N&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;N&amp;lt;/code&amp;gt; is a positive integer. The specifiers define how many characters to reserve to each field in the command output, and can be used to help readability.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides to users is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions.&lt;br /&gt;
&lt;br /&gt;
For operational partitions, availability is &amp;#039;&amp;#039;up&amp;#039;&amp;#039;, meaning that the partition is available to be allocated work. A state/availability equal to &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; means that the partition is not available to be allocated work, while &amp;#039;&amp;#039;down&amp;#039;&amp;#039; means the same as &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; but also that the partition failed, i.e. that it suffered a disruption.&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039; state. Jobs waiting for that partition are paused.&lt;br /&gt;
&lt;br /&gt;
== Choosing the &amp;quot;right&amp;quot; partition ==&lt;br /&gt;
&lt;br /&gt;
When launching a job (as explained in [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]]) a user should select the partition that is most suitable for it according to the job&amp;#039;s features. Launching a job on a partition avoids the need for the user to explicitly specify all of the resources that the job requires, relying instead on the set of resources already defined for the partition.&lt;br /&gt;
&lt;br /&gt;
By selecting the right partition for their job, a user pre-defines the job&amp;#039;s requirements without having to specify them explicitly: this makes partitions very handy and helps avoid mistakes. However, users can, if needed, change the resources requested by their job with respect to the default values associated with the chosen partition.&lt;br /&gt;
&lt;br /&gt;
Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests by the user launching a job can be both lower and higher than the default value of the partition for that resource. However, they cannot exceed the maximum value that the partition allows for requests of such resource, if set. If a user tries to launch on a partition a job that requests a higher value of a resource than the partition‑specified maximum, the launch command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the most important resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined time duration. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
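As a purely illustrative sketch (the program name and the resource amounts are placeholders), a job could be run on the “normal” partition (using the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command described below) while overriding some of its defaults like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p normal --mem=64G --cpus-per-task 8 --time=0-02:00:00 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here the job asks the “normal” partition for 64 GB of RAM, 8 CPUs and a 2-hour time limit instead of the partition defaults; requests exceeding the partition maxima (e.g. more than 24 CPUs on “normal”) would be refused.&lt;br /&gt;
&lt;br /&gt;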
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done. &lt;br /&gt;
&lt;br /&gt;
Considering that [[System#Docker Containers|all computation on Mufasa must occur within Docker containers]], the jobs run by Mufasa users (except for menial, non-computationally-intensive ones) always run within containers. The process of launching a user job on Mufasa involves two steps:&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
:; Step 1&lt;br /&gt;
:: [[User Jobs#Using SLURM to run a Docker container|Use SLURM to run the Docker container where the job will take place]]&lt;br /&gt;
&lt;br /&gt;
:; Step 2&lt;br /&gt;
:: [[User Jobs#Launching a user job from within a Docker container|Launch the job from within the Docker container]]&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
As an optional preparatory step, it is often useful to define an [[User Jobs#Using execution scripts to run jobs|execution script]] to simplify the launching process and reduce the possibility of mistakes.&lt;br /&gt;
&lt;br /&gt;
The commands that SLURM provides to run jobs are &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see SLURM documentation: [https://slurm.schedmd.com/srun.html srun], [https://slurm.schedmd.com/sbatch.html sbatch]). The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is mainly suitable for processes that use the console to interact with their user; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell and simply adds the job to the queue, but it does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
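As a minimal sketch (the program name is illustrative), a non-interactive job can be queued with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; by wrapping a single command with its &amp;lt;code&amp;gt;--wrap&amp;lt;/code&amp;gt; option:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch -p small --wrap=&amp;quot;./my_program&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The shell is returned immediately and, by default, the job&amp;#039;s output is written to a file named &amp;lt;code&amp;gt;slurm-&amp;lt;jobid&amp;gt;.out&amp;lt;/code&amp;gt; in the directory from which the command was issued.&lt;br /&gt;
&lt;br /&gt;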
Among the options available for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, one of the most important is &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt; is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many GPUs the program requests for use. Since GPUs are the scarcest resources of Mufasa, this option must always be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
&lt;br /&gt;
As [[User Jobs#SLURM Partitions|already explained]], a quick way to define the set of resources that a program will have access to is to use option &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; on the partition named “small”. Running the program this way means that the resources associated to this partition will be available to it for use.&lt;br /&gt;
&lt;br /&gt;
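Similarly, partition choice and GPU requests can be combined. As an illustrative sketch (the program name is a placeholder), the following runs a program on the “gpu” partition with one 10 GB virtual GPU:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --gres=gpu:10gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;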
= Using SLURM to run a Docker container =&lt;br /&gt;
&lt;br /&gt;
The first step to run a user job on Mufasa is to run the [[System#Docker Containers|Docker container]] where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p &amp;lt;partition_name&amp;gt; --container-image=&amp;lt;container_path.sqsh&amp;gt; --no-container-entrypoint --container-mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt; --gres=&amp;lt;gpu_resources&amp;gt; --mem=&amp;lt;mem_resources&amp;gt; --cpus-per-task &amp;lt;cpu_amount&amp;gt; --pty --time=&amp;lt;hh:mm:ss&amp;gt; &amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. Below these options are explained.&lt;br /&gt;
&lt;br /&gt;
;-p &amp;lt;partition_name&amp;gt;&lt;br /&gt;
: specifies the resource partition on which the job will be run.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;--mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--cpus-per-task &amp;lt;cpu_number&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--time=&amp;lt;hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of that resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;lt;code&amp;gt;--gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;: GPU resources, in fact, must always be explicitly requested with option &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt;, otherwise no access to GPUs is granted to the job.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑container-image=&amp;lt;container_path.sqsh&amp;gt;&lt;br /&gt;
: specifies the container to be run&lt;br /&gt;
&lt;br /&gt;
;‑‑no‑container‑entrypoint&lt;br /&gt;
: specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option &amp;lt;code&amp;gt;--no-container-entrypoint&amp;lt;/code&amp;gt; is useful when the user is unsure of the effect of that command.&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;lt;code&amp;gt;&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/code&amp;gt; takes the value &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; this tells srun to mount Mufasa&amp;#039;s directory &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; in position &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; within the filesystem of the Docker container. When the docker container reads or writes files in directory &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; of its own (internal) filesystem, what actually happens is that files in &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; get manipulated instead. &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&lt;br /&gt;
;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
: specifies what GPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:2&amp;lt;/code&amp;gt;, which corresponds to giving the job control of 2 entire large-size GPUs.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must always be explicitly requested with &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
: specifies the amount of RAM to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑cpus-per-task &amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
: specifies how many CPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑pty&lt;br /&gt;
: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[User Jobs#Running interactive jobs via SLURM|Running interactive jobs via SLURM]])&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies the maximum time allowed to the job to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
: the executable that will be run within the Docker container as soon as it is operative. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; to launch non-interactive programs.&lt;br /&gt;
&lt;br /&gt;
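As an illustrative example (the container image path and the resource amounts are placeholders to be adapted to your own case; the mount follows the &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; example above), a complete command could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --container-image=/home/mrossi/containers/my_image.sqsh --no-container-entrypoint --container-mounts=/home/mrossi:/data --gres=gpu:20gb:1 --mem=64G --cpus-per-task 4 --pty --time=0-08:00:00 /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This opens an interactive bash shell inside the container, with Mufasa&amp;#039;s &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; visible as &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt;, one 20 GB virtual GPU, 64 GB of RAM and 4 CPUs, for at most 8 hours.&lt;br /&gt;
&lt;br /&gt;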
== Nvidia Pyxis ==&lt;br /&gt;
&lt;br /&gt;
Some of the options described above are specifically dedicated to Docker containers: these are provided by the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. More specifically, options &amp;lt;code&amp;gt;--container-image&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--no-container-entrypoint&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--container-mounts&amp;lt;/code&amp;gt; are provided to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; by Pyxis.&lt;br /&gt;
&lt;br /&gt;
= Launching a user job from within a Docker container =&lt;br /&gt;
&lt;br /&gt;
Once the Docker container (run as [[User Jobs#Using SLURM to run a Docker container|explained here]]) is up and running, usually the user is dropped to the interactive environment specified by &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
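For instance, if the container was run with &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; was mounted on &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt;, a (purely illustrative) Python job could be launched from the container&amp;#039;s shell like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
cd /data/my_experiment&lt;br /&gt;
python3 train.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;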
= Running interactive jobs via SLURM =&lt;br /&gt;
&lt;br /&gt;
As explained, SLURM command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with a command similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only access 2 CPUs). On the contrary, running programs with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;. For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;lt;code&amp;gt;(SLURM ID xx)&amp;lt;/code&amp;gt; (where &amp;lt;code&amp;gt;xx&amp;lt;/code&amp;gt; is the ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
= Detach from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, to create a screen session and run a job in it:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* From the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* In the screen session thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen session with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, while your process will go on running in the screen session&lt;br /&gt;
* You can now close the SSH connection to Mufasa without damaging your process&lt;br /&gt;
&lt;br /&gt;
Later, when you are ready to resume contact with your running process:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* In the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -r&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* You are now back to the screen session where you launched your job&lt;br /&gt;
&lt;br /&gt;
* When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;X&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A typical use case for screen is a program written so that it prints progress messages as it works: you can then check its progress by periodically reconnecting to the screen where the program is running and reading the messages it has printed.&lt;br /&gt;
&lt;br /&gt;
= Using execution scripts to run jobs =&lt;br /&gt;
&lt;br /&gt;
Previous Sections of this page explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line.&lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes configuration errors less likely and, most importantly, can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each on a line starting with the &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directive&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set&lt;br /&gt;
* have &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; as its very first line&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&lt;br /&gt;
&lt;br /&gt;
To execute the script, just open a terminal (such as the one provided by an SSH connection with Mufasa), write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press the &amp;lt;enter&amp;gt; key. The script is executed in the terminal, and any output (e.g., whatever is printed by any &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands in the script) is shown on the terminal.&lt;br /&gt;
&lt;br /&gt;
Within a bash script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of execution script (actual instructions are shown in bold; the rest are comments):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Note: these are examples. Put your own SBATCH directives below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --job-name=myjob&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; name assigned to the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --cpus-per-task=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of threads allocated to each task&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --mem-per-cpu=500M&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; amount of memory per CPU core&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --gres=gpu:1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of GPUs per node&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --partition=small&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the partition to run your jobs on&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --time=0-00:01:00&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; time assigned to your jobs to run (format: days-hours:minutes:seconds, with days optional)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------srun commands-----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Put your own srun command(s) below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------end of srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the example above shows, beyond the initial directive &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; the script includes a series of &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives used to specify parameter values, and finally one or more &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands that run the jobs. Any parameter accepted by commands &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can be used in an &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directive in an execution script. Note that &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives are read by SLURM only when the script is submitted with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;; if the script is executed directly, the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands still run via SLURM, but only with the options they specify on their own command line.&lt;br /&gt;
&lt;br /&gt;
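For instance, assuming the script above was saved as &amp;lt;code&amp;gt;my_execution_script.sh&amp;lt;/code&amp;gt;, it can be submitted to SLURM (so that the &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives in its preamble are applied) and then checked in the queue:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch my_execution_script.sh&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
See [[User Jobs#Monitoring and managing jobs|Monitoring and managing jobs]] for the tools available to inspect queued and running jobs.&lt;br /&gt;
&lt;br /&gt;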
= Job caching =&lt;br /&gt;
&lt;br /&gt;
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical and therefore relatively slow) HDDs where &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; partitions reside, replacing those accesses with accesses to (solid-state and therefore much faster) SSDs.&lt;br /&gt;
&lt;br /&gt;
Each time a job is run via SLURM, this is what happens automatically:&lt;br /&gt;
&lt;br /&gt;
# Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;) to a cache space located on system SSDs&lt;br /&gt;
# Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files&lt;br /&gt;
# The executables create their output files in the cache space&lt;br /&gt;
# When the user jobs end, Mufasa copies the output files from the cache space back to the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares the executable (or the [[User Jobs#Using execution scripts to run jobs|execution script]]) in a subdirectory of their &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The caching mechanism requires that &amp;#039;&amp;#039;during job execution&amp;#039;&amp;#039; the user does not modify the contents of the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; subdirectory where executable and data were at execution time. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/overview.html SLURM&amp;#039;s own overview]:&lt;br /&gt;
&lt;br /&gt;
“&amp;#039;&amp;#039;User tools include&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html (link to SLURM docs)] to initiate jobs, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/scancel.html (link to SLURM docs)] to terminate queued or running jobs,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sinfo.html (link to SLURM docs)] to report system status,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/squeue.html (link to SLURM docs)] to report the status of jobs [i.e. to inspect the scheduling queue], and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sacct.html (link to SLURM docs)] to get information about jobs and job steps that are running or have completed.&amp;#039;&amp;#039;”&lt;br /&gt;
&lt;br /&gt;
An example of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
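For instance (the job ID below is the one appearing in the sample output above, and is purely illustrative), a user can list only their own jobs, cancel one of them, and then retrieve its accounting information:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u &amp;lt;your_username_on_Mufasa&amp;gt;&lt;br /&gt;
scancel 523&lt;br /&gt;
sacct -j 523&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Remember that &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; only works on your own jobs.&lt;br /&gt;
&lt;br /&gt;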
== Job state ==&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
; PD PENDING&lt;br /&gt;
: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
; R RUNNING&lt;br /&gt;
: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
; S SUSPENDED&lt;br /&gt;
: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
; CG COMPLETING&lt;br /&gt;
: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
; CD COMPLETED&lt;br /&gt;
: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them, reported here for completeness:&lt;br /&gt;
&lt;br /&gt;
; BF BOOT_FAIL&lt;br /&gt;
: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). &lt;br /&gt;
&lt;br /&gt;
; CA CANCELLED&lt;br /&gt;
: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. &lt;br /&gt;
&lt;br /&gt;
; CF CONFIGURING&lt;br /&gt;
: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting). &lt;br /&gt;
&lt;br /&gt;
; DL DEADLINE&lt;br /&gt;
: Job terminated on deadline. &lt;br /&gt;
&lt;br /&gt;
; F FAILED&lt;br /&gt;
: Job terminated with non-zero exit code or other failure condition. &lt;br /&gt;
&lt;br /&gt;
; NF NODE_FAIL&lt;br /&gt;
: Job terminated due to failure of one or more allocated nodes. &lt;br /&gt;
&lt;br /&gt;
; OOM OUT_OF_MEMORY&lt;br /&gt;
: Job experienced out of memory error. &lt;br /&gt;
&lt;br /&gt;
; PR PREEMPTED&lt;br /&gt;
: Job terminated due to preemption. &lt;br /&gt;
&lt;br /&gt;
; RD RESV_DEL_HOLD&lt;br /&gt;
: Job is being held after requested reservation was deleted. &lt;br /&gt;
&lt;br /&gt;
; RF REQUEUE_FED&lt;br /&gt;
: Job is being requeued by a federation. &lt;br /&gt;
&lt;br /&gt;
; RH REQUEUE_HOLD&lt;br /&gt;
: Held job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RQ REQUEUED&lt;br /&gt;
: Completing job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RS RESIZING&lt;br /&gt;
: Job is about to change size. &lt;br /&gt;
&lt;br /&gt;
; RV REVOKED&lt;br /&gt;
: Sibling was removed from cluster due to other cluster starting the job. &lt;br /&gt;
&lt;br /&gt;
; SI SIGNALING&lt;br /&gt;
: Job is being signaled. &lt;br /&gt;
&lt;br /&gt;
; SE SPECIAL_EXIT&lt;br /&gt;
: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. &lt;br /&gt;
&lt;br /&gt;
; SO STAGE_OUT&lt;br /&gt;
: Job is staging out files. &lt;br /&gt;
&lt;br /&gt;
; ST STOPPED&lt;br /&gt;
: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. &lt;br /&gt;
&lt;br /&gt;
; TO TIMEOUT&lt;br /&gt;
: Job terminated upon reaching its time limit.&lt;/div&gt;</summary>
		<author><name>10.79.2.181</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=72</id>
		<title>System</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=72"/>
		<updated>2022-01-17T14:29:42Z</updated>

		<summary type="html">&lt;p&gt;10.79.2.181: /* File transfer */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Mufasa is a Linux server located in a server room managed by the [[Roles|System Administrators]]. [[Roles|Job Users]] and [[Roles|Job Administrators]] can only access Mufasa remotely. &lt;br /&gt;
&lt;br /&gt;
Remote access to Mufasa is performed using the [[System#Accessing Mufasa|SSH protocol]] for the execution of commands and the [[System#File transfer|SFTP protocol]] for the exchange of files. Once logged in, a user interacts with Mufasa via a terminal (text-based) interface.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
Mufasa is a server for massively parallel computation. Its main hardware components are:&lt;br /&gt;
&lt;br /&gt;
* 32-core, 64-thread AMD processor&lt;br /&gt;
* 1 TB RAM&lt;br /&gt;
* 9 TB of SSDs (for OS and execution cache)&lt;br /&gt;
* 28TB of HDDs (for user /home directories)&lt;br /&gt;
* 5 Nvidia A100 GPUs [based on the &amp;#039;&amp;#039;Ampere&amp;#039;&amp;#039; architecture]&lt;br /&gt;
* Linux Ubuntu operating system&lt;br /&gt;
&lt;br /&gt;
Usually each of these resources (e.g., a GPU) is not fully assigned to a single user or a single job. On the contrary, resources are shared among different users and processes in order to optimise their usage and availability.&lt;br /&gt;
&lt;br /&gt;
As far as GPUs are concerned, the 5 physical A100 GPUs are subdivided into “virtual” GPUs with different capabilities using Nvidia&amp;#039;s MIG system. From [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ MIG&amp;#039;s user guide]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;The Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In practice, MIG allows flexible partitioning of a very powerful (but single) GPU to create multiple virtual GPUs with different capabilities, that are then made available to users as if they were separate devices.&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;[https://developer.nvidia.com/nvidia-system-management-interface nvidia-smi]&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(“smi” stands for System Management Interface) provides an overview of the physical and virtual GPUs available to users in a system&amp;lt;ref&amp;gt;On Mufasa, this command may need to be launched via the SLURM job scheduling system (as explained in Part 2 of this document) in order to be able to access the GPUs.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
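As a minimal sketch, on Mufasa the command can be launched through SLURM (how to use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is explained in Part 2 of this document) so that it can see the GPUs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 nvidia-smi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;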
= Accessing Mufasa =&lt;br /&gt;
&lt;br /&gt;
User access to Mufasa is always remote and exploits the &amp;#039;&amp;#039;SSH&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure SHell&amp;#039;&amp;#039;) protocol. &lt;br /&gt;
&lt;br /&gt;
To open a remote connection to Mufasa, open a local terminal on your computer and, in it, run command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh &amp;lt;your_username_on_Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
For example, user mrossi may access Mufasa with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh mrossi@10.79.23.97&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Access via SSH works with Linux, MacOS and Windows 10 (and later) terminals. For Windows users, a handy alternative tool (also including an X server, required to run Linux programs with a graphical user interface on Mufasa) is [https://mobaxterm.mobatek.net/ MobaXterm].&lt;br /&gt;
&lt;br /&gt;
If you don&amp;#039;t have a user account on Mufasa, you first have to ask your supervisor for one. See [[System#Users and groups|Users and groups]] for more information about Mufasa&amp;#039;s users.&lt;br /&gt;
&lt;br /&gt;
As soon as you launch the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, you will be asked to type the password (i.e., the one of your user account on Mufasa). Once you provide the password, the local terminal on your computer becomes a remote terminal (a “remote shell”) through which you interact with Mufasa. The remote shell sports a &amp;#039;&amp;#039;command prompt&amp;#039;&amp;#039; such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;your_username_on_Mufasa&amp;gt;@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(&amp;#039;&amp;#039;rk018445&amp;#039;&amp;#039; is the Linux hostname of Mufasa). For instance, user mrossi will see a prompt similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
mrossi@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the remote shell, you can issue commands to Mufasa by typing them after the prompt, then pressing the &amp;#039;&amp;#039;enter&amp;#039;&amp;#039; key. Since Mufasa is a Linux server, it responds to all the standard Linux system commands such as &amp;lt;code&amp;gt;pwd&amp;lt;/code&amp;gt; (which prints the path to the current directory) or &amp;lt;code&amp;gt;cd &amp;lt;destination_dir&amp;gt;&amp;lt;/code&amp;gt; (which changes the current directory). On the internet you can find many tutorials about the Linux command line, such as [https://linuxcommand.org/index.php this one].&lt;br /&gt;
&lt;br /&gt;
To close the SSH session run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from the command prompt of the remote shell.&lt;br /&gt;
&lt;br /&gt;
== VPN ==&lt;br /&gt;
To be able to connect to Mufasa, your computer must belong to Polimi&amp;#039;s LAN. This happens either because the computer is physically located at Politecnico di Milano and connected via ethernet, or because you are using Polimi&amp;#039;s VPN to connect to its LAN from somewhere else (such as your home). In particular, using the VPN is the &amp;#039;&amp;#039;only&amp;#039;&amp;#039; way to use Mufasa from outside Polimi. See [https://intranet.deib.polimi.it/ita/vpn-wifi this DEIB webpage] for instructions about how to activate VPN access.&lt;br /&gt;
&lt;br /&gt;
== Timeout ==&lt;br /&gt;
&lt;br /&gt;
SSH sessions to Mufasa may be subject to an inactivity timeout: i.e., after a given period of inactivity the ssh session gets automatically closed. Users who need to be able to reconnect to the very same shell where they launched a program (for instance because their program is interactive or because it provides progress update messages) should use the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; command, as explained in [[User Jobs#Detach from a running job with screen|Detach from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;]].&lt;br /&gt;
&lt;br /&gt;
== Using SSH with graphics ==&lt;br /&gt;
&lt;br /&gt;
The standard form of the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, i.e. the one described above, should always be preferred. However, it only allows text communication with Mufasa. In special cases it may be necessary to remotely run (on Mufasa) Linux programs that have a graphical user interface. These programs require interaction with the X server of the remote user&amp;#039;s machine (which must use Linux as well). A special mode of operation of &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; is needed to enable this. This mode is engaged by running &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh -X &amp;lt;your_username_on_Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
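For example, user mrossi would run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh -X mrossi@10.79.23.97&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;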
= File transfer =&lt;br /&gt;
&lt;br /&gt;
Uploading files from local machine to Mufasa and downloading files from Mufasa onto local machines is done using the &amp;#039;&amp;#039;SFTP&amp;#039;&amp;#039; protocol (&amp;#039;&amp;#039;Secure File Transfer Protocol&amp;#039;&amp;#039;). &lt;br /&gt;
&lt;br /&gt;
Linux and MacOS users can directly use the &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; package, as explained (for instance) by [https://geekflare.com/sftp-command-examples/ this guide]. Windows users can interact with Mufasa via SFTP protocol using the [https://mobaxterm.mobatek.net/ MobaXterm] software package. MacOS users can interact with Mufasa via SFTP also with the [https://cyberduck.io/ Cyberduck] software package.&lt;br /&gt;
&lt;br /&gt;
For Linux and MacOS users, file transfer to/from Mufasa occurs via an &amp;#039;&amp;#039;interactive sftp shell&amp;#039;&amp;#039;, i.e. a remote shell very similar to the one described in [[System#Accessing Mufasa|Accessing Mufasa]].&lt;br /&gt;
The first thing to do is to open a terminal and run the following command (note the similarity to SSH connections):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp &amp;lt;your_username_on_Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will be asked your password. Once you provide it, you access an interactive sftp shell, where the command prompt takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From this shell you can run the commands to exchange files. Most of these commands have two forms: one to act on the remote machine (in this case, Mufasa) and one to act on the local machine (i.e. your own computer). To differentiate, the “local” versions usually have names that start with the letter “l” (lowercase L). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
cd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the remote machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
lcd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
get &amp;lt;file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to download (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;file&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the remote machine to the current directory of the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
put &amp;lt;file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to upload (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;file&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the local machine to the current directory of the remote machine.&lt;br /&gt;
&lt;br /&gt;
Naturally, a user can only upload files to directories where they have write permission (usually only their own /home directory and its subdirectories). Also, users can only download files from directories where they have read permission. (File permissions on Mufasa follow the standard Linux rules.)&lt;br /&gt;
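&lt;br /&gt;
As an illustrative example (directory and file names are placeholders), a session in which user mrossi uploads a data file and then downloads a results file might look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp mrossi@10.79.23.97&lt;br /&gt;
sftp&amp;gt; lcd /path/to/local/dir&lt;br /&gt;
sftp&amp;gt; cd /home/mrossi/experiments&lt;br /&gt;
sftp&amp;gt; put dataset.zip&lt;br /&gt;
sftp&amp;gt; get results.csv&lt;br /&gt;
sftp&amp;gt; exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;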
&lt;br /&gt;
= Docker containers =&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;As a general rule, all computation performed on Mufasa must occur within &amp;#039;&amp;#039;&amp;#039;[https://www.docker.com/ &amp;#039;&amp;#039;&amp;#039;Docker containers&amp;#039;&amp;#039;&amp;#039;]. This allows every user to configure their own execution environment without any risk of interfering with everyone else&amp;#039;s.&lt;br /&gt;
&lt;br /&gt;
From [https://docs.docker.com/get-started/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;#039;&amp;#039;A container is a sandboxed process on your machine that is isolated from all other processes on the host machine. When running a container, it uses an isolated filesystem. [containing] everything needed to run an application - all dependencies, configuration, scripts, binaries, etc. The image also contains other configuration for the container, such as environment variables, a default command to run, and other metadata.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Using Docker allows each user of Mufasa to build the software environment that their job(s) require. In particular, using Docker containers enables users to configure their own (containerized) system and install any required libraries on their own, without need to ask administrators to modify the configuration of Mufasa. As a consequence, users can freely experiment with their (containerized) system without risk to the work of other users and to the stability and reliability of Mufasa. In particular, containers allow users to run jobs that require multiple and/or obsolete versions of the same library.&lt;br /&gt;
&lt;br /&gt;
A large number of preconfigured Docker containers are already available, so users do not usually need to start from scratch in preparing the environment where their jobs will run on Mufasa. The official Docker container repository is [https://hub.docker.com/search?q=&amp;amp;type=image dockerhub].&lt;br /&gt;
&lt;br /&gt;
How to run Docker containers on Mufasa will be explained in Part 2 of this document.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-6&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;The SLURM job scheduling system ==&lt;br /&gt;
&lt;br /&gt;
Mufasa uses [https://slurm.schedmd.com/overview.html SLURM] to manage shared access to its resources. &amp;#039;&amp;#039;&amp;#039;Users of Mufasa must use SLURM to run and manage the jobs they run on the machine&amp;#039;&amp;#039;&amp;#039;&amp;lt;ref&amp;gt;It is possible for users to run jobs without using SLURM; however, running jobs this way is only intended for “housekeeping” activities and only provides access to a small subset of Mufasa&amp;#039;s resources. For instance, jobs run outside SLURM cannot access the GPUs, can only use a few processor cores, and can only access a small portion of RAM. Using SLURM is therefore necessary for any resource-intensive job.&lt;br /&gt;
&amp;lt;/ref&amp;gt;. From [https://slurm.schedmd.com/documentation.html SLURM&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
The use of a job scheduling system ensures that Mufasa&amp;#039;s resources are exploited in an efficient way. However, the fact that a schedule exists means that usually a job does not get immediately executed as soon as it is launched: instead, the job gets &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; and will be executed as soon as possible, according to the availability of resources in the machine.&lt;br /&gt;
&lt;br /&gt;
Useful references for SLURM users are the [https://slurm.schedmd.com/man_index.html collected man pages] and the [https://slurm.schedmd.com/pdfs/summary.pdf command overview].&lt;br /&gt;
&lt;br /&gt;
In order to let SLURM schedule job execution, before launching a job a user must specify what resources (such as RAM, processor cores, GPUs, ...) it requires. While managing process queues, SLURM will consider such requirements and match them with the available resources. As a consequence, resource-heavy jobs generally take longer to get executed, while less demanding jobs are usually put into execution quickly. On the other hand, processes that -while running- try to use more resources than they requested get killed by SLURM to avoid damaging other jobs.&lt;br /&gt;
&lt;br /&gt;
All in all, the take-away message is: &amp;#039;&amp;#039;consider carefully how many resources to request for your job&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
In Part 2 of this document it will be explained how resource requests can be greatly simplified by making use of predefined resource sets called &amp;#039;&amp;#039;SLURM partitions&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Users and groups =&lt;br /&gt;
&lt;br /&gt;
As already explained, only Mufasa users can access the machine and interact with it. Creation of new users is done by Job Administrators or by specially designated users within each research group.&lt;br /&gt;
&lt;br /&gt;
Mufasa usernames have the form &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;xyyy&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; (all lowercase) where &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;x&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; is the first letter of the first name and &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;yyy&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; is the complete surname. For instance, user Mario Rossi will be assigned user name &amp;#039;&amp;#039;mrossi&amp;#039;&amp;#039;. If multiple users with the same surname and first letter of the name exist, those created after the first are given usernames &amp;#039;&amp;#039;xyyy01&amp;#039;&amp;#039;, &amp;#039;&amp;#039;xyyy02&amp;#039;&amp;#039;, and so on.&lt;br /&gt;
&lt;br /&gt;
On Linux machines such as Mufasa, users belong to &amp;#039;&amp;#039;groups&amp;#039;&amp;#039;. On Mufasa, groups are used to identify the research group that a specific user is part of. Assignment of Mufasa&amp;#039;s users to groups follows these rules:&lt;br /&gt;
&lt;br /&gt;
* All users belong to group &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
* Additionally, each user must belong to &amp;#039;&amp;#039;one and only one&amp;#039;&amp;#039; of the following groups (in parentheses, the faculty member in charge of Mufasa for each group):&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nearmrs&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [https://nearlab.polimi.it/medical/ Medical Robotics Section of NearLab] (prof. De Momi);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nearnes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [https://nearlab.polimi.it/neuroengineering/ NeuroEngineering Section of NearLab] (prof. Ferrante);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;cartcas&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [http://www.cartcas.polimi.it/ CartCasLab] (prof. Cerveri);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;biomech&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [http://www.biomech.polimi.it/ Biomechanics Research Group] (prof. Votta);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;bio&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, for BioEngineering users not belonging to the research groups listed above.&lt;br /&gt;
&lt;br /&gt;
Users who are not Job Administrators but have been given the power to create users can do so with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo /opt/share/sbin/add_user.sh -u &amp;lt;user&amp;gt; -g users,&amp;lt;group&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;#039;&amp;#039;&amp;amp;lt;user&amp;amp;gt;&amp;#039;&amp;#039; is the username of the new user and &amp;#039;&amp;#039;&amp;amp;lt;group&amp;amp;gt;&amp;#039;&amp;#039; is one of the five research groups from the list above.&lt;br /&gt;
&lt;br /&gt;
For instance, in order to create a user on Mufasa for a person named Mario Rossi belonging to the NeuroEngineering Section of NearLab, the following command will be used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo /opt/share/sbin/add_user.sh -u mrossi -g users,nearnes&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
New users are created with a predefined password, which they will be asked to change at their first login. For security reasons, it is important that this first login occurs as soon as possible.&lt;/div&gt;</summary>
		<author><name>10.79.2.181</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=71</id>
		<title>System</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=71"/>
		<updated>2022-01-17T14:23:36Z</updated>

		<summary type="html">&lt;p&gt;10.79.2.181: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Mufasa is a Linux server located in a server room managed by the [[Roles|System Administrators]]. [[Roles|Job Users]] and [[Roles|Job Administrators]] can only access Mufasa remotely. &lt;br /&gt;
&lt;br /&gt;
Remote access to Mufasa is performed using the [[System#Accessing Mufasa|SSH protocol]] for the execution of commands and the [[System#File transfer|SFTP protocol]] for the exchange of files. Once logged in, a user interacts with Mufasa via a terminal (text-based) interface.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
Mufasa is a server for massively parallel computation. Its main hardware and software components are:&lt;br /&gt;
&lt;br /&gt;
* 32-core, 64-thread AMD processor&lt;br /&gt;
* 1 TB RAM&lt;br /&gt;
* 9 TB of SSDs (for OS and execution cache)&lt;br /&gt;
* 28 TB of HDDs (for user /home directories)&lt;br /&gt;
* 5 Nvidia A100 GPUs (based on the &amp;#039;&amp;#039;Ampere&amp;#039;&amp;#039; architecture)&lt;br /&gt;
* Linux Ubuntu operating system&lt;br /&gt;
&lt;br /&gt;
Usually each of these resources (e.g., a GPU) is not fully assigned to a single user or a single job. On the contrary, access to resources is shared among different users and processes in order to optimise their usage and availability.&lt;br /&gt;
&lt;br /&gt;
As far as GPUs are concerned, the 5 physical A100 GPUs are subdivided into “virtual” GPUs with different capabilities using Nvidia&amp;#039;s MIG system. From [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ MIG&amp;#039;s user guide]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;The Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In practice, MIG allows flexible partitioning of a single, very powerful GPU to create multiple virtual GPUs with different capabilities, which are then made available to users as if they were separate devices.&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;[https://developer.nvidia.com/nvidia-system-management-interface nvidia-smi]&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(“smi” stands for System Management Interface) provides an overview of the physical and virtual GPUs available to users in a system&amp;lt;ref&amp;gt;On Mufasa, this command may need to be launched via the SLURM job scheduling system (as explained in Part 2 of this document) in order to access the GPUs.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
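As a purely illustrative sketch, a GPU-enabled invocation through SLURM might look like the following; &amp;lt;code&amp;gt;--gres=gpu:1&amp;lt;/code&amp;gt; is a generic SLURM option for requesting one GPU, and the exact options to use on Mufasa are described in Part 2 of this document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 nvidia-smi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;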
= Accessing Mufasa =&lt;br /&gt;
&lt;br /&gt;
User access to Mufasa is always remote and exploits the &amp;#039;&amp;#039;SSH&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure SHell&amp;#039;&amp;#039;) protocol. &lt;br /&gt;
&lt;br /&gt;
To open a remote connection to Mufasa, open a local terminal on your computer and, in it, run command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh &amp;lt;your_username_on_Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For example, user mrossi may access Mufasa with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh mrossi@10.79.23.97&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Access via SSH works with Linux, macOS and Windows 10 (and later) terminals. For Windows users, a handy alternative tool is [https://mobaxterm.mobatek.net/ MobaXterm], which also includes an X server, required to run Linux programs with a graphical user interface on Mufasa.&lt;br /&gt;
&lt;br /&gt;
If you don&amp;#039;t have a user account on Mufasa, you first have to ask your supervisor for one. See [[System#Users and groups|Users and groups]] for more information about Mufasa&amp;#039;s users.&lt;br /&gt;
&lt;br /&gt;
As soon as you launch the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, you will be asked to type your password (i.e., the password of your user account on Mufasa). Once you provide it, the local terminal on your computer becomes a remote terminal (a “remote shell”) through which you interact with Mufasa. The remote shell shows a &amp;#039;&amp;#039;command prompt&amp;#039;&amp;#039; such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;your_username_on_Mufasa&amp;gt;@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(&amp;#039;&amp;#039;rk018445&amp;#039;&amp;#039; is the Linux hostname of Mufasa). For instance, user mrossi will see a prompt similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
mrossi@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the remote shell, you can issue commands to Mufasa by typing them after the prompt, then pressing the &amp;#039;&amp;#039;enter&amp;#039;&amp;#039; key. Since Mufasa is a Linux server, it responds to all the standard Linux commands such as &amp;lt;code&amp;gt;pwd&amp;lt;/code&amp;gt; (which prints the path to the current directory) or &amp;lt;code&amp;gt;cd &amp;lt;destination_dir&amp;gt;&amp;lt;/code&amp;gt; (which changes the current directory). On the internet you can find many tutorials about the Linux command line, such as [https://linuxcommand.org/index.php this one].&lt;br /&gt;
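&lt;br /&gt;
For example, a short interaction in the remote shell might look like this (the directory names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
mrossi@rk018445:~$ pwd&lt;br /&gt;
/home/mrossi&lt;br /&gt;
mrossi@rk018445:~$ cd my_project&lt;br /&gt;
mrossi@rk018445:~/my_project$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;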
&lt;br /&gt;
To close the SSH session run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from the command prompt of the remote shell.&lt;br /&gt;
&lt;br /&gt;
== VPN ==&lt;br /&gt;
To be able to connect to Mufasa, your computer must belong to Polimi&amp;#039;s LAN. This happens either because the computer is physically located at Politecnico di Milano and connected via ethernet, or because you are using Polimi&amp;#039;s VPN to connect to its LAN from somewhere else (such as your home). In particular, using the VPN is the &amp;#039;&amp;#039;only&amp;#039;&amp;#039; way to use Mufasa from outside Polimi. See [https://intranet.deib.polimi.it/ita/vpn-wifi this DEIB webpage] for instructions about how to activate VPN access.&lt;br /&gt;
&lt;br /&gt;
== Timeout ==&lt;br /&gt;
&lt;br /&gt;
SSH sessions to Mufasa may be subject to an inactivity timeout: after a given period of inactivity, the SSH session is automatically closed. Users who need to be able to reconnect to the very same shell where they launched a program (for instance because their program is interactive or because it provides progress update messages) should use the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; command, as explained in [[User Jobs#Using screen with srun]].&lt;br /&gt;
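&lt;br /&gt;
As a generic sketch of how &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; can be used (the session name and the script are hypothetical; the Mufasa-specific procedure is described in the page linked above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -S training        # start a named screen session&lt;br /&gt;
./run_experiment.sh       # launch your program inside it&lt;br /&gt;
# detach with Ctrl-A then D; the session keeps running&lt;br /&gt;
screen -r training        # reattach later, even from a new SSH login&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;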
&lt;br /&gt;
== Using SSH with graphics ==&lt;br /&gt;
&lt;br /&gt;
The standard form of the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, i.e. the one described above, should always be preferred. However, it only allows text communication with Mufasa. In special cases it may be necessary to remotely run, on Mufasa, Linux programs that have a graphical user interface. These programs require interaction with the X server of the user&amp;#039;s own machine (which must use Linux as well). A special mode of operation of &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; is needed to enable this; it is engaged by running&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh -X &amp;lt;your_username_on_Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
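&lt;br /&gt;
For example, user mrossi could open a graphics-enabled session and then launch a graphical program from the remote prompt (&amp;#039;&amp;#039;xclock&amp;#039;&amp;#039; is just a stand-in for any X11 application and may not be installed on Mufasa):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh -X mrossi@10.79.23.96&lt;br /&gt;
mrossi@rk018445:~$ xclock&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;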
&lt;br /&gt;
= File transfer =&lt;br /&gt;
&lt;br /&gt;
Uploading files from local machine to Mufasa and downloading files from Mufasa onto local machines is done using the &amp;#039;&amp;#039;SFTP&amp;#039;&amp;#039; protocol (&amp;#039;&amp;#039;Secure File Transfer Protocol&amp;#039;&amp;#039;). &lt;br /&gt;
&lt;br /&gt;
Linux and macOS users can directly use the &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; package, as explained (for instance) by [https://geekflare.com/sftp-command-examples/ this guide]. Windows users can interact with Mufasa via the SFTP protocol using the [https://mobaxterm.mobatek.net/ MobaXterm] software package.&lt;br /&gt;
&lt;br /&gt;
For Linux and macOS users, file transfer to/from Mufasa occurs via an &amp;#039;&amp;#039;interactive sftp shell&amp;#039;&amp;#039;, i.e. a remote shell very similar to the one described in [[System#Accessing Mufasa|Accessing Mufasa]]. &lt;br /&gt;
The first thing to do is to open a terminal and run the following command (note the similarity to SSH connections):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp &amp;lt;your_username_on_Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;Mufasa&amp;#039;s_IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
You will be asked for your password. Once you provide it, you access an interactive sftp shell, where the command prompt takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From this shell you can run the commands to exchange files. Most of these commands have two forms: one to act on the remote machine (in this case, Mufasa) and one to act on the local machine (i.e. your own computer). To differentiate, the “local” versions usually have names that start with the letter “l” (lowercase L). &lt;br /&gt;
&lt;br /&gt;
macOS users can also interact with Mufasa via SFTP using the [https://cyberduck.io/ Cyberduck] software package.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most basic &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; commands (to be issued from the sftp command prompt) are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;cd &amp;amp;lt;path&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;: change directory to &amp;amp;lt;path&amp;amp;gt; on the remote machine (i.e. Mufasa)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;lcd &amp;amp;lt;path&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;: change directory to &amp;amp;lt;path&amp;amp;gt; on the local machine (i.e. the user&amp;#039;s machine)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;get &amp;amp;lt;file&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;: download (i.e. copy) &amp;amp;lt;file&amp;amp;gt; from the current directory of the remote machine to the current directory of the local machine&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;put &amp;amp;lt;file&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;: upload (i.e. copy) &amp;amp;lt;file&amp;amp;gt; from the current directory of the local machine to the current directory of the remote machine&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;exit&amp;#039;&amp;#039;&amp;#039;: quit sftp&lt;br /&gt;
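&lt;br /&gt;
For example, a minimal upload/download session might look like this (all file and directory names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt; lcd ~/data&lt;br /&gt;
sftp&amp;gt; cd experiments&lt;br /&gt;
sftp&amp;gt; put dataset.csv&lt;br /&gt;
sftp&amp;gt; get results.log&lt;br /&gt;
sftp&amp;gt; exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;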
&lt;br /&gt;
Of course, a user can only upload files to directories where they have write permission (usually only their own /home directory and its subdirectories), and can only download files for which they have read permission.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Docker containers =&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;As a general rule, all computation performed on Mufasa must occur within [https://www.docker.com/ Docker containers]&amp;#039;&amp;#039;&amp;#039;. This allows every user to configure their own execution environment without any risk of interfering with anyone else&amp;#039;s.&lt;br /&gt;
&lt;br /&gt;
From [https://docs.docker.com/get-started/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;#039;&amp;#039;A container is a sandboxed process on your machine that is isolated from all other processes on the host machine. When running a container, it uses an isolated filesystem. [containing] everything needed to run an application - all dependencies, configuration, scripts, binaries, etc. The image also contains other configuration for the container, such as environment variables, a default command to run, and other metadata.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Using Docker allows each user of Mufasa to build the software environment that their job(s) require. In particular, using Docker containers enables users to configure their own (containerized) system and install any required libraries on their own, without needing to ask administrators to modify the configuration of Mufasa. As a consequence, users can freely experiment with their (containerized) system without putting at risk the work of other users or the stability and reliability of Mufasa. For instance, containers allow users to run jobs that require multiple and/or obsolete versions of the same library.&lt;br /&gt;
&lt;br /&gt;
A large number of preconfigured Docker containers are already available, so users do not usually need to start from scratch in preparing the environment where their jobs will run on Mufasa. The official Docker container repository is [https://hub.docker.com/search?q=&amp;amp;type=image dockerhub].&lt;br /&gt;
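&lt;br /&gt;
As a purely generic illustration (the image name is just an example of a preconfigured image from the repository mentioned above, and this is not the Mufasa-specific procedure, which goes through SLURM as explained in Part 2 of this document), pulling and testing such an image on a Docker-enabled machine might look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
docker pull pytorch/pytorch&lt;br /&gt;
docker run --rm -it pytorch/pytorch python --version&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;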
&lt;br /&gt;
How to run Docker containers on Mufasa will be explained in Part 2 of this document.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= &amp;lt;span id=&amp;quot;anchor-6&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;The SLURM job scheduling system =&lt;br /&gt;
&lt;br /&gt;
Mufasa uses [https://slurm.schedmd.com/overview.html SLURM] to manage shared access to its resources. &amp;#039;&amp;#039;&amp;#039;Users of Mufasa must use SLURM to run and manage their jobs on the machine&amp;#039;&amp;#039;&amp;#039;&amp;lt;ref&amp;gt;It is possible for users to run jobs without using SLURM; however, running jobs this way is only intended for “housekeeping” activities and only provides access to a small subset of Mufasa&amp;#039;s resources. For instance, jobs run outside SLURM cannot access the GPUs, can only use a few processor cores, and can only access a small portion of RAM. Using SLURM is therefore necessary for any resource-intensive job.&lt;br /&gt;
&amp;lt;/ref&amp;gt;. From [https://slurm.schedmd.com/documentation.html SLURM&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
The use of a job scheduling system ensures that Mufasa&amp;#039;s resources are used efficiently. However, the existence of a schedule means that a job is usually not executed as soon as it is launched: instead, the job gets &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; and is executed as soon as possible, according to the availability of resources on the machine.&lt;br /&gt;
&lt;br /&gt;
Useful references for SLURM users are the [https://slurm.schedmd.com/man_index.html collected man pages] and the [https://slurm.schedmd.com/pdfs/summary.pdf command overview].&lt;br /&gt;
&lt;br /&gt;
In order to let SLURM schedule job execution, before launching a job a user must specify what resources (such as RAM, processor cores, GPUs, ...) it requires. While managing process queues, SLURM considers such requirements and matches them with the available resources. As a consequence, resource-heavy jobs generally wait longer before execution, while less demanding jobs usually start quickly. On the other hand, processes that, while running, try to use more resources than they requested are killed by SLURM to avoid damaging other jobs.&lt;br /&gt;
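&lt;br /&gt;
As a purely illustrative sketch (the option values and the script name below are arbitrary assumptions; the Mufasa-specific way of requesting resources, based on partitions, is described in Part 2 of this document), a resource request made with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; could look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --cpus-per-task=4 --mem=16G --gres=gpu:1 --time=02:00:00 ./my_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;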
&lt;br /&gt;
All in all, the take-away message is: &amp;#039;&amp;#039;consider carefully how many resources to request for your job&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
Part 2 of this document explains how resource requests can be greatly simplified by using predefined resource sets called &amp;#039;&amp;#039;SLURM partitions&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Users and groups =&lt;br /&gt;
&lt;br /&gt;
As already explained, only Mufasa users can access the machine and interact with it. Creation of new users is done by Job Administrators or by specially designated users within each research group.&lt;br /&gt;
&lt;br /&gt;
Mufasa usernames have the form &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;xyyy&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; (all lowercase) where &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;x&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; is the first letter of the first name and &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;yyy&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; is the complete surname. For instance, user Mario Rossi will be assigned user name &amp;#039;&amp;#039;mrossi&amp;#039;&amp;#039;. If multiple users with the same surname and first letter of the name exist, those created after the first are given usernames &amp;#039;&amp;#039;xyyy01&amp;#039;&amp;#039;, &amp;#039;&amp;#039;xyyy02&amp;#039;&amp;#039;, and so on.&lt;br /&gt;
&lt;br /&gt;
On Linux machines such as Mufasa, users belong to &amp;#039;&amp;#039;groups&amp;#039;&amp;#039;. On Mufasa, groups are used to identify the research group that a specific user is part of. Assignment of Mufasa&amp;#039;s users to groups follows these rules:&lt;br /&gt;
&lt;br /&gt;
* All users belong to group &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
* Additionally, each user must belong to &amp;#039;&amp;#039;one and only one&amp;#039;&amp;#039; of the following groups (in parentheses, the faculty member in charge of Mufasa for each group):&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nearmrs&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [https://nearlab.polimi.it/medical/ Medical Robotics Section of NearLab] (prof. De Momi);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nearnes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [https://nearlab.polimi.it/neuroengineering/ NeuroEngineering Section of NearLab] (prof. Ferrante);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;cartcas&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [http://www.cartcas.polimi.it/ CartCasLab] (prof. Cerveri);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;biomech&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [http://www.biomech.polimi.it/ Biomechanics Research Group] (prof. Votta);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;bio&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, for BioEngineering users not belonging to the research groups listed above.&lt;br /&gt;
&lt;br /&gt;
Users who are not Job Administrators but have been given the power to create users can do so with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo /opt/share/sbin/add_user.sh -u &amp;lt;user&amp;gt; -g users,&amp;lt;group&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;#039;&amp;#039;&amp;amp;lt;user&amp;amp;gt;&amp;#039;&amp;#039; is the username of the new user and &amp;#039;&amp;#039;&amp;amp;lt;group&amp;amp;gt;&amp;#039;&amp;#039; is one of the five research groups from the list above.&lt;br /&gt;
&lt;br /&gt;
For instance, in order to create a user on Mufasa for a person named Mario Rossi belonging to the NeuroEngineering Section of NearLab, the following command will be used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: silver; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo /opt/share/sbin/add_user.sh -u mrossi -g users,nearnes&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
New users are created with a predefined password, which they will be asked to change at their first login. For security reasons, it is important that this first login occurs as soon as possible.&lt;/div&gt;</summary>
		<author><name>10.79.2.181</name></author>
	</entry>
</feed>