<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://biohpc.deib.polimi.it/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Admin</id>
	<title>Mufasa (BioHPC) - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://biohpc.deib.polimi.it/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Admin"/>
	<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=Special:Contributions/Admin"/>
	<updated>2026-05-03T19:48:26Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=1256</id>
		<title>System</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=1256"/>
		<updated>2023-04-19T15:47:12Z</updated>

		<summary type="html">&lt;p&gt;Admin: Update with new MIG configuration&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Mufasa is a Linux server located in a server room managed by the [[Roles|System Administrators]].&lt;br /&gt;
&lt;br /&gt;
[[Roles|Job Users]] and [[Roles|Job Administrators]] can only access Mufasa remotely. &lt;br /&gt;
&lt;br /&gt;
Remote access to Mufasa is performed using the [[System#Accessing Mufasa|SSH protocol]] for the execution of commands and the [[System#File transfer|SFTP protocol]] for the exchange of files. Once logged in, a user interacts with Mufasa via a terminal (text-based) interface.&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
[[File:hw.png|right|320px]]&lt;br /&gt;
Mufasa is a server for massively parallel computation. It has been set up and configured by [https://www.e4company.com/en/ E4 Computer Engineering] with the support of the [http://www.biomech.polimi.it/ Biomechanics Group], the [http://www.cartcas.polimi.it/ CartCasLab] laboratory and the [https://nearlab.polimi.it/ NearLab] laboratory.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s main hardware components are:&lt;br /&gt;
&lt;br /&gt;
* 2 AMD Epyc 7542 32-core, 64-thread processors (64 CPU cores total)&lt;br /&gt;
* 1 TB RAM&lt;br /&gt;
* 9 TB of SSDs (for OS and [[User Jobs#Automatic job caching|job caching]])&lt;br /&gt;
* 28TB of HDDs (for user &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directories)&lt;br /&gt;
* 5 Nvidia A100 GPUs [based on the &amp;#039;&amp;#039;Ampere&amp;#039;&amp;#039; architecture]&lt;br /&gt;
* [https://ubuntu.com/ Ubuntu Linux] operating system&lt;br /&gt;
&lt;br /&gt;
Usually each of these resources (e.g., a GPU) is not fully assigned to a single user or a single job. On the contrary, resources are shared among different users and processes in order to optimise their usage and availability. Most of the management of this sharing is done by [[System#The SLURM job scheduling system|SLURM]].&lt;br /&gt;
&lt;br /&gt;
== CPUs and GPUs ==&lt;br /&gt;
&lt;br /&gt;
Mufasa is fitted with two 32-core CPU, so the system has a total of 64 phyical CPUs (each of which can run 2 threads). Of the 64 CPUs, 2 are reserved for jobs run outside the [[System#The SLURM job scheduling system|SLURM job scheduling system]] (i.e., for low-power &amp;quot;housekeeping&amp;quot; tasks) while the remaining 62 are reserved for jobs run via SLURM.&lt;br /&gt;
&lt;br /&gt;
For what concerns GPUs, some of the 5 physical A100 processing cards (i.e., GPUs) are subdivided into “virtual” GPUs with different capabilities using [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ Nvidia&amp;#039;s MIG system]. Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
nvidia-smi -L&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an overview of the physical and virtual GPUs available to users in a system. (On Mufasa, this command may require to be launched in a bash shell via the SLURM job scheduling system (as explained in Section 2 of this document) in order to be able to access the GPUs.) The output of &amp;lt;code&amp;gt;nvidia-smi -L&amp;lt;/code&amp;gt; is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-a9f6e4f2-2877-8642-1802-5eeb3518d415)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-dd1ccc27-d106-5cd9-80f1-b6291f0d682d)&lt;br /&gt;
  MIG 3g.20gb     Device  1: (UUID: MIG-abe13a42-013b-5bef-aa5e-bbd268d72447)&lt;br /&gt;
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-5f28ca0a-5b2c-bfc7-5b9f-581b5ca1d110)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-07372a92-2e37-5ad6-b334-add0100cf5e3)&lt;br /&gt;
  MIG 3g.20gb     Device  1: (UUID: MIG-a704d927-7303-5077-ab7c-6ead57329233)&lt;br /&gt;
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-fb86701b-5781-b63c-5cda-911cff3a5edb)&lt;br /&gt;
GPU 3: NVIDIA A100-PCIE-40GB (UUID: GPU-bbeed512-ab4c-e984-cfea-8067c009a600)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-0d1232cd-6b37-5ac7-b00f-a9fdf6997b72)&lt;br /&gt;
  MIG 3g.20gb     Device  1: (UUID: MIG-bdbcf24a-a0aa-56fb-a7e4-fc18f17b7f24)&lt;br /&gt;
GPU 4: NVIDIA A100-PCIE-40GB (UUID: GPU-a9511357-2476-7ddf-c4c5-c90feb68acfd)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output shows that the physical Nvidia A100 GPUs installed on Mufasa have been so subdivided:&lt;br /&gt;
&lt;br /&gt;
* two of the physical GPUs (GPU 2 and GPU 4) have not been subdivided at all&lt;br /&gt;
* three of the physical GPUs (GPU 0, GPU 1 and GPU 3) have been subdivided into 2 virtual GPUs with 20 GB of RAM each&lt;br /&gt;
&lt;br /&gt;
Thanks to MIG, users can use all the GPUs listed above as if they were all physical devices installed on Mufasa, without having to worry (or even know) which actually are and which instead are virtual GPUs.&lt;br /&gt;
&lt;br /&gt;
All in all, then, users of Mufasa are provided with the following set of &amp;#039;&amp;#039;&amp;#039;8 GPUs&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
:; 2 GPUs with 40 GB of RAM each&lt;br /&gt;
:; 6 GPUs with 20 GB of RAM each&lt;br /&gt;
&lt;br /&gt;
How these devices are made available to Mufasa users is explained in [[User Jobs]].&lt;br /&gt;
&lt;br /&gt;
= Accessing Mufasa =&lt;br /&gt;
&lt;br /&gt;
User access to Mufasa is always remote and exploits the &amp;#039;&amp;#039;SSH&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure SHell&amp;#039;&amp;#039;) protocol. &lt;br /&gt;
&lt;br /&gt;
To open a remote connection to Mufasa, open a local terminal on your computer and, in it, run command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh &amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is the username on Mufasa of the user and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is one of the IP addresses of Mufasa, i.e. either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, user &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt; may access Mufasa with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;ssh mrossi@10.79.23.97&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Access via SSH works with Linux, MacOs and Windows 10 (and later) terminals. For Windows users, a handy alternative tool (also including an X server, required to run on Mufasa Linux programs with a graphical user interface) is [https://mobaxterm.mobatek.net/ MobaXterm].&lt;br /&gt;
&lt;br /&gt;
If you don&amp;#039;t have a user account on Mufasa, you first have to ask your supervisor for one. See [[Users]] for more information about Mufasa&amp;#039;s users.&lt;br /&gt;
&lt;br /&gt;
As soon as you launch the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, you will be asked to type the password (i.e., the one of your user account on Mufasa). Once you provide the password, the local terminal on your computer becomes a remote terminal (a “remote shell”) through which you interact with Mufasa. The remote shell sports a &amp;#039;&amp;#039;command prompt&amp;#039;&amp;#039; such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(&amp;#039;&amp;#039;rk018445&amp;#039;&amp;#039; is the Linux hostname of Mufasa). For instance, user &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt; will see a prompt similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;mrossi@rk018445:~$&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the remote shell, you can issue commands to Mufasa by typing them after the prompt, then pressing the &amp;#039;&amp;#039;enter&amp;#039;&amp;#039; key. Being Mufasa a Linux server, it will respond to all the standard Linux system commands such as &amp;lt;code&amp;gt;pwd&amp;lt;/code&amp;gt; (which prints the path to the current directory) or &amp;lt;code&amp;gt;cd &amp;lt;destination_dir&amp;gt;&amp;lt;/code&amp;gt; (which changes the current directory). On the internet you can find many tutorials about the Linux command line, such as [https://linuxcommand.org/index.php this one].&lt;br /&gt;
&lt;br /&gt;
To close the SSH session run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from the command prompt of the remote shell.&lt;br /&gt;
&lt;br /&gt;
== VPN ==&lt;br /&gt;
&lt;br /&gt;
To be able to connect to Mufasa, your computer must belong to Polimi&amp;#039;s LAN. This happens either because the computer is physically located at Politecnico di Milano and connected via ethernet, or because you are using Polimi&amp;#039;s VPN (Virtual Private Network) to connect to its LAN from somewhere else (such as your home). In particular, using the VPN is the &amp;#039;&amp;#039;only&amp;#039;&amp;#039; way to use Mufasa from outside Polimi. See [https://intranet.deib.polimi.it/ita/vpn-wifi this DEIB webpage] for instructions about how to activate VPN access.&lt;br /&gt;
&lt;br /&gt;
== SSH timeout ==&lt;br /&gt;
&lt;br /&gt;
SSH sessions to Mufasa may be subjected to an inactivity timeout: i.e., after a given inactivity period the ssh session gets automatically closed. Users who need to be able to reconnect to the very same shell where they launched a program (for instance because their program is interactive or because it provides progress update messages) should [[User Jobs#Detaching from a running job with screen|use the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; command]].&lt;br /&gt;
&lt;br /&gt;
== SSH and graphics ==&lt;br /&gt;
&lt;br /&gt;
The standard form of the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, i.e. the one described at the beginning of [[system#Accessing Mufasa|Accessing Mufasa]], should always be preferred. However, it only allows text communication with Mufasa. In special cases it may be necessary to remotely run (on Mufasa) Linux programs that have a graphical user interface. These programs require interaction with the X server of the remote user&amp;#039;s machine (which must use Linux as well). A special mode of operation of &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; is needed to enable this. This mode is engaged by running command &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt; like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; ssh -X &amp;lt;your username on Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s IP address&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File transfer ==&lt;br /&gt;
&lt;br /&gt;
Uploading files from local machine to Mufasa and downloading files from Mufasa onto local machines is done using the &amp;#039;&amp;#039;SFTP&amp;#039;&amp;#039; protocol (&amp;#039;&amp;#039;Secure File Transfer Protocol&amp;#039;&amp;#039;). &lt;br /&gt;
&lt;br /&gt;
Linux and MacOS users can directly use the &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; package, as explained (for instance) by [https://geekflare.com/sftp-command-examples/ this guide]. Windows users can interact with Mufasa via SFTP protocol using the [https://mobaxterm.mobatek.net/ MobaXterm] software package. MacOS users can interact with Mufasa via SFTP also with the [https://cyberduck.io/ Cyberduck] software package.&lt;br /&gt;
&lt;br /&gt;
For Linux and MacOS user, file transfer to/from Mufasa occurs via an &amp;#039;&amp;#039;interactive sftp shell&amp;#039;&amp;#039;, i.e. a remote shell very similar to the one one described in [[Accessing Mufasa|Accessing Mufasa]]. &lt;br /&gt;
The first thing to do is to open a terminal and run the following command (note the similarity to SSH connections):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp &amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is the username on Mufasa of the user, and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will be asked your password. Once you provide it, you access an interactive sftp shell, where the command prompt takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From this shell you can run the commands to exchange files. Most of these commands have two forms: one to act on the remote machine (in this case, Mufasa) and one to act on the local machine (i.e. your own computer). To differentiate, the “local” versions usually have names that start with the letter “l” (lowercase L). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
cd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the remote machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
lcd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
get &amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to download (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;filename&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the remote machine to the current directory of the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
put &amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to upload (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;filename&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the local machine to the current directory of the remote machine.&lt;br /&gt;
&lt;br /&gt;
Naturally, a user can only upload files to directories where they have write permission (usually only their own /home directory and its subdirectories). Also, users can only download files from directories where they have read permission. (File permission on Mufasa follow the standard Linux rules.)&lt;br /&gt;
&lt;br /&gt;
In addition to the terminal interface, users of Linux distributions based on Gnome (such as Ubuntu) can use a handy graphical tool to exchange files with Mufasa. In Gnome&amp;#039;s Nautilus file manager, write&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sftp://&amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
in the address bar of Nautilus, where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is your username on Mufasa and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;10.79.23.96&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;10.79.23.97&amp;lt;/code&amp;gt;. Nautilus becomes a graphical interface to Mufasa&amp;#039;s remote filesystem.&lt;br /&gt;
&lt;br /&gt;
= Using Mufasa =&lt;br /&gt;
&lt;br /&gt;
This section provide a brief guide to Mufasa users (especially those who are not experienced in the use of Linux and/or remote servers) about interacting with the system.&lt;br /&gt;
&lt;br /&gt;
== Storage spaces ==&lt;br /&gt;
&lt;br /&gt;
User jobs require storage of programs and data files. On Mufasa, the space available to users for data storage is the &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt; directory. &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt; contains three types of directories:&lt;br /&gt;
&lt;br /&gt;
; Personal directories&lt;br /&gt;
: Each user has a personal &amp;#039;&amp;#039;home directory&amp;#039;&amp;#039; where they can store their own files. The home directory is the one with the same name of the user. By default, only the owner of a home directory can access its contents.&lt;br /&gt;
&lt;br /&gt;
; Group directories&lt;br /&gt;
: Each research group has a common &amp;#039;&amp;#039;group directory&amp;#039;&amp;#039; where group members can store files that they share with other group members. The group directory is the one called &amp;lt;code&amp;gt;shared-&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is the corresponding [[Users#Group_names|user group]]. The owner of group directory is user &amp;lt;code&amp;gt;root&amp;lt;/code&amp;gt;, while group ownership is assigned to &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt;. On Mufasa, group directories have GUID activated. This means that any file or directory created inside &amp;lt;code&amp;gt;shared-&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; has group ownership assigned to &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt;: so editing permissions on the new file or directory extend to all group members.&lt;br /&gt;
&lt;br /&gt;
; The &amp;#039;&amp;#039;&amp;lt;code&amp;gt;shared-public&amp;lt;/code&amp;gt;&amp;#039;&amp;#039; directory&lt;br /&gt;
: This is a shared directory common to all users of Mufasa. Users that share files but do not belong to the same research group can use it to store their shared files.&lt;br /&gt;
&lt;br /&gt;
== Disk quotas ==&lt;br /&gt;
&lt;br /&gt;
On Mufasa, the directories in &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt; must be used as a temporary storage area for user programs and their data, limited to the execution period of the jobs that use the data. They are not intended for long-term storage. For this reason, disk usage is subjected to a quota system.&lt;br /&gt;
&lt;br /&gt;
=== User quotas ===&lt;br /&gt;
&lt;br /&gt;
Each user is assigned a &amp;#039;&amp;#039;disk quota&amp;#039;&amp;#039;, i.e. an amount of space that they can use before the user is blocked by the quota system. Note that the quota applies not only to the data created and/or uploaded by you as a user, but also to data created by programs run by your user.&lt;br /&gt;
&lt;br /&gt;
The quotas assigned to your user and the amount of it that you are currently using can be inspected with command &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output of &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt; is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Filesystem   space   quota   limit   grace   files   quota   limit   grace&lt;br /&gt;
 /dev/sdb1  11104K    100G    150G               1       0       0        &lt;br /&gt;
 /dev/sdc2   5552K    100G    150G              60       0       0        &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is a simple guide to the output of &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
:; Column &amp;quot;Filesystems&amp;quot;&lt;br /&gt;
:: identifies the filesystems where the user has been assigned a disk quota. On Mufasa, &amp;lt;code&amp;gt;/dev/sdb1&amp;lt;/code&amp;gt; is the SSD disk space used as [[User Jobs#Automatic job caching|cache space]], while &amp;lt;code&amp;gt;/dev/sdc2&amp;lt;/code&amp;gt; is the HDD space used for the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directories.&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;space&amp;quot; and &amp;quot;files&amp;quot;&lt;br /&gt;
:: tell the user how much of their quota they are actually using: the first in term of bytes, the second in term of number of files (more precisely, of &amp;#039;&amp;#039;inodes&amp;#039;&amp;#039;).&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;quota&amp;quot;&lt;br /&gt;
:: tell the user how much is their &amp;#039;&amp;#039;soft limit&amp;#039;&amp;#039;, in term of bytes and files respectively. If the value is 0, it means there is no limit.&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;limit&amp;quot;&lt;br /&gt;
:: tell the user how much is their &amp;#039;&amp;#039;hard limit&amp;#039;&amp;#039;, in term of bytes and files respectively. If the value is 0, it means there is no limit.&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;grace&amp;quot;&lt;br /&gt;
:: tell the user how long they are allowed to stay above their &amp;#039;&amp;#039;soft limit&amp;#039;&amp;#039;,  for what concerns bytes and files respectively. When these columns are empty (as in the example above) the user is not over quota.&lt;br /&gt;
&lt;br /&gt;
The meaning of &amp;#039;&amp;#039;&amp;#039;soft limit&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;hard limit&amp;#039;&amp;#039;&amp;#039; is the following. &lt;br /&gt;
&lt;br /&gt;
The hard limit cannot be exceeded. When a user reaches their hard limit, they cannot use any more disk space: for them, the filesystem behaves as if the disks are out of space. Disk writes will fail, temporary files will fail to be created, and the user will start to see warnings and errors while performing common tasks. The only disk operation allowed is file deletion.&lt;br /&gt;
&lt;br /&gt;
The soft limit is, as the word goes, softer. When a user exceeds it, they are not immediately prevented from using more disk space (provided that they stay below the hard limit). However, as the user goes beyond the soft limit, their &amp;#039;&amp;#039;&amp;#039;grace period&amp;#039;&amp;#039;&amp;#039; begins: i.e. a period within which the user must reduce their amount of data back to below the soft limit. During the grace period, the &amp;quot;grace&amp;quot; column(s) of the output of &amp;lt;code&amp;gt;quota&amp;lt;/code&amp;gt; show how much of the grace period remains to the user. If the user is still above their soft limit at the end of the grace period, the quota system will treat the soft limit as a hard limit: i.e. it will force the user to delete data until they are below the soft limit before they can write on disk again.&lt;br /&gt;
&lt;br /&gt;
In the output of &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt;, the grace columns are blank except when a soft limit has been exceeded.&lt;br /&gt;
&lt;br /&gt;
=== Group and project quotas ===&lt;br /&gt;
&lt;br /&gt;
While on Mufasa disk quotas are usually assigned &amp;#039;&amp;#039;per-user&amp;#039;&amp;#039;, the quota system also enables the setup of &amp;#039;&amp;#039;per-group&amp;#039;&amp;#039; quotas (i.e., limits to the disk space that, collectively, a group of users can use) and &amp;#039;&amp;#039;per-project&amp;#039;&amp;#039; quotas (i.e., limits to the amount of data that a specific directory and all its subdirectories can contain).&lt;br /&gt;
&lt;br /&gt;
A comprehensive view of the quota situation for one&amp;#039;s user and user groups is provided by command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
quotainfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For what concerns project quotas, on Mufasa they are applied to group directories in &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Finding out how much disk space you are using ==&lt;br /&gt;
&lt;br /&gt;
If your user is the owner of directory &amp;lt;code&amp;gt;/path/to/dir/&amp;lt;/code&amp;gt; you can find out how much disk space is used by the directory with command &amp;lt;code&amp;gt;du&amp;lt;/code&amp;gt; like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -sh /path/to/dir/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;-sh&amp;lt;/code&amp;gt; flag is used to ask for options &amp;lt;code&amp;gt;-s&amp;lt;/code&amp;gt; (which provides the overall size of the directory) and &amp;lt;code&amp;gt;-h&amp;lt;/code&amp;gt; (which provides &amp;#039;&amp;#039;human-readable&amp;#039;&amp;#039; values using measurement units such as K (KBytes), M (MBytes), G (GBytes)).&lt;br /&gt;
&lt;br /&gt;
In particular, you can find out how much disk space is used by your home directory with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -sh ~&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In fact, in Linux the symbol &amp;lt;code&amp;gt;~&amp;lt;/code&amp;gt; is shorthand for the path to the current user&amp;#039;s home directory. &lt;br /&gt;
&lt;br /&gt;
If you want a detailed summary of how much disk space is used by each item (i.e., subdirectory or file) in a directory you own, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -h /path/to/dir/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For instance, for user gfontana the output of&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -h ~&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
may be similar to the following&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gfontana@rk018445:~$ du -h ~&lt;br /&gt;
12K	/home/gfontana/.ssh&lt;br /&gt;
356K	/home/gfontana/.cache/gstreamer-1.0&lt;br /&gt;
5.0M	/home/gfontana/.cache/tracker&lt;br /&gt;
5.3M	/home/gfontana/.cache&lt;br /&gt;
  [...other similar lines...]&lt;br /&gt;
4.0K	/home/gfontana/.config/htop&lt;br /&gt;
32K	/home/gfontana/.config&lt;br /&gt;
8.0K	/home/gfontana/.slurm&lt;br /&gt;
6.3M	/home/gfontana&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hidden files and directories ===&lt;br /&gt;
&lt;br /&gt;
In Linux, directories and files with a leading &amp;quot;.&amp;quot; in their name are &amp;#039;&amp;#039;hidden&amp;#039;&amp;#039;. These do not appear in listings, such as the output of the &amp;lt;code&amp;gt;ls&amp;lt;/code&amp;gt; command, to avoid cluttering them up: however, they still occupy disk space. &lt;br /&gt;
&lt;br /&gt;
The output of command &amp;lt;code&amp;gt;du&amp;lt;/code&amp;gt;, however, also considers hidden elements and provides their size: therefore it can help you understand why the quota system says that you are using more disk space than reported by &amp;lt;code&amp;gt;ls&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Changing file/directory ownership and permissions ==&lt;br /&gt;
&lt;br /&gt;
Every file or directory in a Linux system is owned by both a user and a group. User and group ownerships are not connected, so a file can have as group owner a group that its user ownwer does not belong to.&lt;br /&gt;
&lt;br /&gt;
Being able to manipulate who owns a file and what permissions any user has on that file is often important in a multi-user system such as Mufasa. This is a recapitulation of the main Linux commands to manipulate file permissions. Key commands are&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;chown&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; to change ownership - user part&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;chgrp&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; to change ownership - group part&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;chmod&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; to change access permissions&lt;br /&gt;
&lt;br /&gt;
All three accept option &amp;lt;code&amp;gt;-R&amp;lt;/code&amp;gt; (uppercase) for recursive operation, so -if needed- you can change ownership and/or permissions of all contents of a directory and its subdirectories with a single command.&lt;br /&gt;
&lt;br /&gt;
The syntax of &amp;lt;code&amp;gt;chown&amp;lt;/code&amp;gt; commands is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chown &amp;lt;new_owner&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;new_owner&amp;gt;&amp;lt;/code&amp;gt; is the user part of the new file ownership.&lt;br /&gt;
&lt;br /&gt;
The syntax of &amp;lt;code&amp;gt;chgrp&amp;lt;/code&amp;gt; commands is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chgrp &amp;lt;new_group&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;new_owner&amp;gt;&amp;lt;/code&amp;gt; is the group part of the new file ownership.&lt;br /&gt;
&lt;br /&gt;
User and group ownership for a file can also be both changed at the same time with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chown &amp;lt;new_owner&amp;gt;:&amp;lt;new_group&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For what concerns &amp;lt;code&amp;gt;chmod&amp;lt;/code&amp;gt;, the easiest way to use it makes use of symbolic descriptions of the permissions. The format for this is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod [users]&amp;lt;+|-&amp;gt;&amp;lt;permissions&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;&amp;lt;path/to/file&amp;gt;&amp;lt;/code&amp;gt; is the file or directory that the change is applied to&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;[users]&amp;lt;/code&amp;gt; is &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;ugo&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; or a subset of it; the three letters correspond respectively:&lt;br /&gt;
:::to the &amp;#039;&amp;#039;&amp;#039;u&amp;#039;&amp;#039;&amp;#039;ser who owns &amp;lt;code&amp;gt;&amp;lt;path/to/file&amp;gt;&amp;lt;/code&amp;gt; (also used if &amp;lt;code&amp;gt;[users]&amp;lt;/code&amp;gt; is not specified)&lt;br /&gt;
:::to the &amp;#039;&amp;#039;&amp;#039;g&amp;#039;&amp;#039;&amp;#039;roup that owns &amp;lt;code&amp;gt;&amp;lt;path/to/file&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
:::to everyone else (&amp;#039;&amp;#039;&amp;#039;o&amp;#039;&amp;#039;&amp;#039;thers)&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;+&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;-&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; correspond to adding or removing permissions&lt;br /&gt;
:&amp;lt;code&amp;gt;&amp;lt;permissions&amp;gt;&amp;lt;/code&amp;gt; is &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;rwx&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; or a subset, corresponding to &amp;#039;&amp;#039;&amp;#039;r&amp;#039;&amp;#039;&amp;#039;ead, &amp;#039;&amp;#039;&amp;#039;w&amp;#039;&amp;#039;&amp;#039;rite and e&amp;#039;&amp;#039;&amp;#039;x&amp;#039;&amp;#039;&amp;#039;ecute permissions&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;w&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; permission have a different meaning for files and for directories.&lt;br /&gt;
&lt;br /&gt;
;For files:&lt;br /&gt;
: permission &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt; allows to read the contents of the file&lt;br /&gt;
: permission &amp;lt;code&amp;gt;w&amp;lt;/code&amp;gt; allows to change the contents of the file&lt;br /&gt;
: permission &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; allows to execute the file (provided that it is a program: e.g., a shell script)&lt;br /&gt;
&lt;br /&gt;
;For directories:&lt;br /&gt;
: permission &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt; allows to list the files within the directory&lt;br /&gt;
: permission &amp;lt;code&amp;gt;w&amp;lt;/code&amp;gt; allows to create, rename, or delete files within the directory&lt;br /&gt;
: permission &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; allows to enter the directory (i.e., &amp;lt;code&amp;gt;cd&amp;lt;/code&amp;gt; into it) and access its files&lt;br /&gt;
&lt;br /&gt;
For instance&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod g+rwx myfile.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
adds permission to read, write and execute myfile.txt to all the Linux users of the same group of the user that the file belongs to;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod go-x mydir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
takes away permission to enter directory &amp;lt;dirname&amp;gt; from everyone except the user who owns the directory.&lt;br /&gt;
&lt;br /&gt;
= Docker containers =&lt;br /&gt;
&lt;br /&gt;
[[File:262px-docker_logo_cropped.jpg|right|262px]]&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;As a general rule, all computation performed on Mufasa must occur within [https://www.docker.com/ Docker containers]&amp;#039;&amp;#039;&amp;#039;. From [https://docs.docker.com/get-started/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
“&amp;#039;&amp;#039;Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.&lt;br /&gt;
&lt;br /&gt;
Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host.&lt;br /&gt;
&lt;br /&gt;
A container is a sandboxed process on your machine that is isolated from all other processes on the host machine. When running a container, it uses an isolated filesystem. [containing] everything needed to run an application - all dependencies, configuration, scripts, binaries, etc. The image also contains other configuration for the container, such as environment variables, a default command to run, and other metadata.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Using Docker allows each user of Mufasa to build the software environment that their job(s) require. In particular, using Docker containers enables users to configure their own (containerized) system and install any required libraries on their own, without need to ask administrators to modify the configuration of Mufasa. As a consequence, users can freely experiment with their (containerized) system without risk to the work of other users and to the stability and reliability of Mufasa. In particular, containers allow users to run jobs that require multiple and/or obsolete versions of the same library.&lt;br /&gt;
&lt;br /&gt;
A large number of preconfigured Docker containers are already available, so users do not usually need to start from scratch in preparing the environment where their jobs will run on Mufasa. The official Docker container repository is [https://hub.docker.com/search?q=&amp;amp;type=image dockerhub].&lt;br /&gt;
&lt;br /&gt;
How to run Docker containers on Mufasa is explained in [[User Jobs|User Jobs]]. There is also a page of this wiki [[Docker|dedicated to the preparation of Docker containers]].&lt;br /&gt;
&lt;br /&gt;
= The SLURM job scheduling system =&lt;br /&gt;
&lt;br /&gt;
[[File:262px-Slurm logo.png|right|262px]]&lt;br /&gt;
Mufasa uses [https://slurm.schedmd.com/overview.html SLURM] (&amp;#039;&amp;#039;Slurm Workload Manager&amp;#039;&amp;#039;, formerly known as &amp;#039;&amp;#039;Simple Linux Utility for Resource Management&amp;#039;&amp;#039;) to manage shared access to its resources.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Users of Mufasa must use SLURM to run and manage all processing-heavy jobs they run on the machine&amp;#039;&amp;#039;&amp;#039;. It is possible for users to run jobs without using SLURM; however, running jobs run this way is only intended for “housekeeping” activities and only provides access to a small subset of Mufasa&amp;#039;s resources. For instance, jobs run outside SLURM cannot access the GPUs, can only use a few processor cores, can only access a small portion of RAM. Using SLURM is therefore necessary for any resource-intensive job.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/documentation.html SLURM&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The use of a job scheduling system such as SLURM ensures that Mufasa&amp;#039;s resources are exploited in an efficient way. The fact that a schedule exists means that usually a job does not get immediately executed as soon as it is launched: instead, the job gets &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; and will be executed as soon as possible, according to the availability of resources in the machine.&lt;br /&gt;
&lt;br /&gt;
Useful references for SLURM users are the [https://slurm.schedmd.com/man_index.html collected man pages] and the [https://slurm.schedmd.com/pdfs/summary.pdf command overview].&lt;br /&gt;
&lt;br /&gt;
SLURM is capable of managing complex computing systems composed of multiple &amp;#039;&amp;#039;&amp;#039;clusters&amp;#039;&amp;#039;&amp;#039; (i.e. sets) of machines, each comprising one &amp;#039;&amp;#039;&amp;#039;node&amp;#039;&amp;#039;&amp;#039; (i.e. machine) or more. The case of Mufasa is the simplest of all: &amp;#039;&amp;#039;Mufasa is the single node (called &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;) of a SLURM computing cluster composed of that single machine.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
In order to let SLURM schedule job execution, before launching a job a user must specify what resources (such as RAM, processor cores, GPUs, ...) it requires. In managing process queues, SLURM considers such requirements and matches them with available resources. As a consequence, resource-heavy jobs generally take longer before thet get executed, while less demanding jobs are usually put into execution quickly. Processes that -while they are running- try to use more resources than they requested at launch time get killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
All in all, the take-away message is: [[User Jobs#Choosing the partition on which to run a job|&amp;#039;&amp;#039;consider carefully how much of each resource to ask for your job&amp;#039;&amp;#039;]].&lt;br /&gt;
&lt;br /&gt;
In [[User Jobs]] it will be explained how the process of requesting resources is greatly simplified by making use of process queues with predefined resource allocations called [[User Jobs#SLURM Partitions|&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;]].&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=1255</id>
		<title>System</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=1255"/>
		<updated>2023-04-19T15:44:57Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Hardware */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Mufasa is a Linux server located in a server room managed by the [[Roles|System Administrators]].&lt;br /&gt;
&lt;br /&gt;
[[Roles|Job Users]] and [[Roles|Job Administrators]] can only access Mufasa remotely. &lt;br /&gt;
&lt;br /&gt;
Remote access to Mufasa is performed using the [[System#Accessing Mufasa|SSH protocol]] for the execution of commands and the [[System#File transfer|SFTP protocol]] for the exchange of files. Once logged in, a user interacts with Mufasa via a terminal (text-based) interface.&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
&lt;br /&gt;
[[File:hw.png|right|320px]]&lt;br /&gt;
Mufasa is a server for massively parallel computation. It has been set up and configured by [https://www.e4company.com/en/ E4 Computer Engineering] with the support of the [http://www.biomech.polimi.it/ Biomechanics Group], the [http://www.cartcas.polimi.it/ CartCasLab] laboratory and the [https://nearlab.polimi.it/ NearLab] laboratory.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s main hardware components are:&lt;br /&gt;
&lt;br /&gt;
* 2 AMD Epyc 7542 32-core, 64-thread processors (64 CPU cores total)&lt;br /&gt;
* 1 TB RAM&lt;br /&gt;
* 9 TB of SSDs (for OS and [[User Jobs#Automatic job caching|job caching]])&lt;br /&gt;
* 28TB of HDDs (for user &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directories)&lt;br /&gt;
* 5 Nvidia A100 GPUs [based on the &amp;#039;&amp;#039;Ampere&amp;#039;&amp;#039; architecture]&lt;br /&gt;
* [https://ubuntu.com/ Ubuntu Linux] operating system&lt;br /&gt;
&lt;br /&gt;
Usually each of these resources (e.g., a GPU) is not fully assigned to a single user or a single job. On the contrary, resources are shared among different users and processes in order to optimise their usage and availability. Most of the management of this sharing is done by [[System#The SLURM job scheduling system|SLURM]].&lt;br /&gt;
&lt;br /&gt;
== CPUs and GPUs ==&lt;br /&gt;
&lt;br /&gt;
Mufasa is fitted with two 32-core CPU, so the system has a total of 64 phyical CPUs (each of which can run 2 threads). Of the 64 CPUs, 2 are reserved for jobs run outside the [[System#The SLURM job scheduling system|SLURM job scheduling system]] (i.e., for low-power &amp;quot;housekeeping&amp;quot; tasks) while the remaining 62 are reserved for jobs run via SLURM.&lt;br /&gt;
&lt;br /&gt;
For what concerns GPUs, some of the 5 physical A100 processing cards (i.e., GPUs) are subdivided into “virtual” GPUs with different capabilities using [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ Nvidia&amp;#039;s MIG system]. Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
nvidia-smi -L&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
provides an overview of the physical and virtual GPUs available to users in a system. (On Mufasa, this command may require to be launched in a bash shell via the SLURM job scheduling system (as explained in Section 2 of this document) in order to be able to access the GPUs.) The output of &amp;lt;code&amp;gt;nvidia-smi -L&amp;lt;/code&amp;gt; is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-a9f6e4f2-2877-8642-1802-5eeb3518d415)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-abe13a42-013b-5bef-aa5e-bbd268d72447)&lt;br /&gt;
  MIG 2g.10gb     Device  1: (UUID: MIG-268c6b30-d10c-59db-babd-3eda7b89da34)&lt;br /&gt;
  MIG 2g.10gb     Device  2: (UUID: MIG-90e26aa7-cf69-5672-b758-419679238cd3)&lt;br /&gt;
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-5f28ca0a-5b2c-bfc7-5b9f-581b5ca1d110)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-07372a92-2e37-5ad6-b334-add0100cf5e3)&lt;br /&gt;
  MIG 2g.10gb     Device  1: (UUID: MIG-4ca248b0-ab87-5f91-a788-5fe169d0623e)&lt;br /&gt;
  MIG 2g.10gb     Device  2: (UUID: MIG-a93ffffb-9a0d-51d1-b9df-36bc624a2084)&lt;br /&gt;
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-fb86701b-5781-b63c-5cda-911cff3a5edb)&lt;br /&gt;
GPU 3: NVIDIA A100-PCIE-40GB (UUID: GPU-bbeed512-ab4c-e984-cfea-8067c009a600)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-bdbcf24a-a0aa-56fb-a7e4-fc18f17b7f24)&lt;br /&gt;
  MIG 2g.10gb     Device  1: (UUID: MIG-4c44132b-7499-562d-a85f-55a0a2cbb5ba)&lt;br /&gt;
  MIG 2g.10gb     Device  2: (UUID: MIG-fe354ead-4f87-53ab-9271-1d98190248f4)&lt;br /&gt;
GPU 4: NVIDIA A100-PCIE-40GB (UUID: GPU-a9511357-2476-7ddf-c4c5-c90feb68acfd)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output shows that the physical Nvidia A100 GPUs installed on Mufasa have been so subdivided:&lt;br /&gt;
&lt;br /&gt;
* two of the physical GPUs (GPU 2 and GPU 4) have not been subdivided at all&lt;br /&gt;
* three of the physical GPUs (GPU 0, GPU 1 and GPU 3) have been subdivided into 3 virtual GPUs each:&lt;br /&gt;
** one virtual GPU with 20 GB of RAM&lt;br /&gt;
** two virtual GPUs with 10 GB of RAM each&lt;br /&gt;
&lt;br /&gt;
Thanks to MIG, users can use all the GPUs listed above as if they were all physical devices installed on Mufasa, without having to worry (or even know) which actually are and which instead are virtual GPUs.&lt;br /&gt;
&lt;br /&gt;
All in all, then, users of Mufasa are provided with the following set of &amp;#039;&amp;#039;&amp;#039;11 GPUs&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
:; 2 GPUs with 40 GB of RAM each&lt;br /&gt;
:; 3 GPUs with 20 GB of RAM each&lt;br /&gt;
:; 6 GPUs with 10 GB of RAM each&lt;br /&gt;
&lt;br /&gt;
How these devices are made available to Mufasa users is explained in [[User Jobs]].&lt;br /&gt;
&lt;br /&gt;
= Accessing Mufasa =&lt;br /&gt;
&lt;br /&gt;
User access to Mufasa is always remote and exploits the &amp;#039;&amp;#039;SSH&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure SHell&amp;#039;&amp;#039;) protocol. &lt;br /&gt;
&lt;br /&gt;
To open a remote connection to Mufasa, open a local terminal on your computer and, in it, run command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh &amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is the username on Mufasa of the user and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is one of the IP addresses of Mufasa, i.e. either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, user &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt; may access Mufasa with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;ssh mrossi@10.79.23.97&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Access via SSH works with Linux, MacOs and Windows 10 (and later) terminals. For Windows users, a handy alternative tool (also including an X server, required to run on Mufasa Linux programs with a graphical user interface) is [https://mobaxterm.mobatek.net/ MobaXterm].&lt;br /&gt;
&lt;br /&gt;
If you don&amp;#039;t have a user account on Mufasa, you first have to ask your supervisor for one. See [[Users]] for more information about Mufasa&amp;#039;s users.&lt;br /&gt;
&lt;br /&gt;
As soon as you launch the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, you will be asked to type the password (i.e., the one of your user account on Mufasa). Once you provide the password, the local terminal on your computer becomes a remote terminal (a “remote shell”) through which you interact with Mufasa. The remote shell sports a &amp;#039;&amp;#039;command prompt&amp;#039;&amp;#039; such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(&amp;#039;&amp;#039;rk018445&amp;#039;&amp;#039; is the Linux hostname of Mufasa). For instance, user &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt; will see a prompt similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;mrossi@rk018445:~$&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the remote shell, you can issue commands to Mufasa by typing them after the prompt, then pressing the &amp;#039;&amp;#039;enter&amp;#039;&amp;#039; key. Being Mufasa a Linux server, it will respond to all the standard Linux system commands such as &amp;lt;code&amp;gt;pwd&amp;lt;/code&amp;gt; (which prints the path to the current directory) or &amp;lt;code&amp;gt;cd &amp;lt;destination_dir&amp;gt;&amp;lt;/code&amp;gt; (which changes the current directory). On the internet you can find many tutorials about the Linux command line, such as [https://linuxcommand.org/index.php this one].&lt;br /&gt;
&lt;br /&gt;
To close the SSH session run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from the command prompt of the remote shell.&lt;br /&gt;
&lt;br /&gt;
== VPN ==&lt;br /&gt;
&lt;br /&gt;
To be able to connect to Mufasa, your computer must belong to Polimi&amp;#039;s LAN. This happens either because the computer is physically located at Politecnico di Milano and connected via ethernet, or because you are using Polimi&amp;#039;s VPN (Virtual Private Network) to connect to its LAN from somewhere else (such as your home). In particular, using the VPN is the &amp;#039;&amp;#039;only&amp;#039;&amp;#039; way to use Mufasa from outside Polimi. See [https://intranet.deib.polimi.it/ita/vpn-wifi this DEIB webpage] for instructions about how to activate VPN access.&lt;br /&gt;
&lt;br /&gt;
== SSH timeout ==&lt;br /&gt;
&lt;br /&gt;
SSH sessions to Mufasa may be subjected to an inactivity timeout: i.e., after a given inactivity period the ssh session gets automatically closed. Users who need to be able to reconnect to the very same shell where they launched a program (for instance because their program is interactive or because it provides progress update messages) should [[User Jobs#Detaching from a running job with screen|use the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; command]].&lt;br /&gt;
&lt;br /&gt;
== SSH and graphics ==&lt;br /&gt;
&lt;br /&gt;
The standard form of the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, i.e. the one described at the beginning of [[system#Accessing Mufasa|Accessing Mufasa]], should always be preferred. However, it only allows text communication with Mufasa. In special cases it may be necessary to remotely run (on Mufasa) Linux programs that have a graphical user interface. These programs require interaction with the X server of the remote user&amp;#039;s machine (which must use Linux as well). A special mode of operation of &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; is needed to enable this. This mode is engaged by running command &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt; like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; ssh -X &amp;lt;your username on Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s IP address&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File transfer ==&lt;br /&gt;
&lt;br /&gt;
Uploading files from local machine to Mufasa and downloading files from Mufasa onto local machines is done using the &amp;#039;&amp;#039;SFTP&amp;#039;&amp;#039; protocol (&amp;#039;&amp;#039;Secure File Transfer Protocol&amp;#039;&amp;#039;). &lt;br /&gt;
&lt;br /&gt;
Linux and MacOS users can directly use the &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; package, as explained (for instance) by [https://geekflare.com/sftp-command-examples/ this guide]. Windows users can interact with Mufasa via SFTP protocol using the [https://mobaxterm.mobatek.net/ MobaXterm] software package. MacOS users can interact with Mufasa via SFTP also with the [https://cyberduck.io/ Cyberduck] software package.&lt;br /&gt;
&lt;br /&gt;
For Linux and MacOS user, file transfer to/from Mufasa occurs via an &amp;#039;&amp;#039;interactive sftp shell&amp;#039;&amp;#039;, i.e. a remote shell very similar to the one one described in [[Accessing Mufasa|Accessing Mufasa]]. &lt;br /&gt;
The first thing to do is to open a terminal and run the following command (note the similarity to SSH connections):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp &amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is the username on Mufasa of the user, and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will be asked your password. Once you provide it, you access an interactive sftp shell, where the command prompt takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From this shell you can run the commands to exchange files. Most of these commands have two forms: one to act on the remote machine (in this case, Mufasa) and one to act on the local machine (i.e. your own computer). To differentiate, the “local” versions usually have names that start with the letter “l” (lowercase L). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
cd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the remote machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
lcd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
get &amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to download (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;filename&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the remote machine to the current directory of the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
put &amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to upload (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;filename&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the local machine to the current directory of the remote machine.&lt;br /&gt;
&lt;br /&gt;
Naturally, a user can only upload files to directories where they have write permission (usually only their own /home directory and its subdirectories). Also, users can only download files from directories where they have read permission. (File permissions on Mufasa follow the standard Linux rules.)&lt;br /&gt;
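&lt;br /&gt;
As a simple sketch, the following hypothetical session (file and directory names are purely illustrative) changes to directory &amp;lt;code&amp;gt;results&amp;lt;/code&amp;gt; on Mufasa and to &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; on the local machine, downloads &amp;lt;code&amp;gt;output.log&amp;lt;/code&amp;gt; from Mufasa, uploads &amp;lt;code&amp;gt;run_experiment.sh&amp;lt;/code&amp;gt; to Mufasa, then closes the connection:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt; cd results&lt;br /&gt;
sftp&amp;gt; lcd /tmp&lt;br /&gt;
sftp&amp;gt; get output.log&lt;br /&gt;
sftp&amp;gt; put run_experiment.sh&lt;br /&gt;
sftp&amp;gt; exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;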
&lt;br /&gt;
In addition to the terminal interface, users of Linux distributions based on Gnome (such as Ubuntu) can use a handy graphical tool to exchange files with Mufasa. In Gnome&amp;#039;s Nautilus file manager, write&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sftp://&amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
in the address bar of Nautilus, where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is your username on Mufasa and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;10.79.23.96&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;10.79.23.97&amp;lt;/code&amp;gt;. Nautilus becomes a graphical interface to Mufasa&amp;#039;s remote filesystem.&lt;br /&gt;
&lt;br /&gt;
= Using Mufasa =&lt;br /&gt;
&lt;br /&gt;
This section provides a brief guide for Mufasa users (especially those who are not experienced in the use of Linux and/or remote servers) about interacting with the system.&lt;br /&gt;
&lt;br /&gt;
== Storage spaces ==&lt;br /&gt;
&lt;br /&gt;
User jobs require storage of programs and data files. On Mufasa, the space available to users for data storage is the &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt; directory. &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt; contains three types of directories:&lt;br /&gt;
&lt;br /&gt;
; Personal directories&lt;br /&gt;
: Each user has a personal &amp;#039;&amp;#039;home directory&amp;#039;&amp;#039; where they can store their own files. The home directory is the one with the same name as the user. By default, only the owner of a home directory can access its contents.&lt;br /&gt;
&lt;br /&gt;
; Group directories&lt;br /&gt;
: Each research group has a common &amp;#039;&amp;#039;group directory&amp;#039;&amp;#039; where group members can store files that they share with other group members. The group directory is the one called &amp;lt;code&amp;gt;shared-&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is the corresponding [[Users#Group_names|user group]]. The owner of a group directory is user &amp;lt;code&amp;gt;root&amp;lt;/code&amp;gt;, while group ownership is assigned to &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt;. On Mufasa, group directories have the SGID (set-group-ID) bit set. This means that any file or directory created inside &amp;lt;code&amp;gt;shared-&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; has group ownership assigned to &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt;: so editing permissions on the new file or directory extend to all group members (see the illustrative example after this list).&lt;br /&gt;
&lt;br /&gt;
; The &amp;#039;&amp;#039;&amp;lt;code&amp;gt;shared-public&amp;lt;/code&amp;gt;&amp;#039;&amp;#039; directory&lt;br /&gt;
: This is a shared directory common to all users of Mufasa. Users that share files but do not belong to the same research group can use it to store their shared files.&lt;br /&gt;
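&lt;br /&gt;
As an illustrative sketch (the group name &amp;lt;code&amp;gt;nearlab&amp;lt;/code&amp;gt; used here is hypothetical), the SGID bit of a group directory shows up as an &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; in the group permissions when the directory is listed with &amp;lt;code&amp;gt;ls -ld&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ls -ld /home/shared-nearlab&lt;br /&gt;
drwxrws--- 12 root nearlab 4096 Mar 10 10:32 /home/shared-nearlab&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Any file or subdirectory created inside such a directory automatically gets &amp;lt;code&amp;gt;nearlab&amp;lt;/code&amp;gt; as its group owner.&lt;br /&gt;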
&lt;br /&gt;
== Disk quotas ==&lt;br /&gt;
&lt;br /&gt;
On Mufasa, the directories in &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt; must be used as a temporary storage area for user programs and their data, limited to the execution period of the jobs that use the data. They are not intended for long-term storage. For this reason, disk usage is subjected to a quota system.&lt;br /&gt;
&lt;br /&gt;
=== User quotas ===&lt;br /&gt;
&lt;br /&gt;
Each user is assigned a &amp;#039;&amp;#039;disk quota&amp;#039;&amp;#039;, i.e. an amount of disk space that they can use before being blocked by the quota system. Note that the quota applies not only to the data you create and/or upload as a user, but also to data created by programs run by your user.&lt;br /&gt;
&lt;br /&gt;
The quota assigned to your user and how much of it you are currently using can be inspected with command &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output of &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt; is similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
Filesystem   space   quota   limit   grace   files   quota   limit   grace&lt;br /&gt;
 /dev/sdb1  11104K    100G    150G               1       0       0        &lt;br /&gt;
 /dev/sdc2   5552K    100G    150G              60       0       0        &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is a simple guide to the output of &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
:; Column &amp;quot;Filesystem&amp;quot;&lt;br /&gt;
:: identifies the filesystems where the user has been assigned a disk quota. On Mufasa, &amp;lt;code&amp;gt;/dev/sdb1&amp;lt;/code&amp;gt; is the SSD disk space used as [[User Jobs#Automatic job caching|cache space]], while &amp;lt;code&amp;gt;/dev/sdc2&amp;lt;/code&amp;gt; is the HDD space used for the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directories.&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;space&amp;quot; and &amp;quot;files&amp;quot;&lt;br /&gt;
:: tell the user how much of their quota they are actually using: the first in terms of bytes, the second in terms of number of files (more precisely, of &amp;#039;&amp;#039;inodes&amp;#039;&amp;#039;).&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;quota&amp;quot;&lt;br /&gt;
:: tell the user what their &amp;#039;&amp;#039;soft limit&amp;#039;&amp;#039; is, in terms of bytes and files respectively. If the value is 0, it means there is no limit.&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;limit&amp;quot;&lt;br /&gt;
:: tell the user what their &amp;#039;&amp;#039;hard limit&amp;#039;&amp;#039; is, in terms of bytes and files respectively. If the value is 0, it means there is no limit.&lt;br /&gt;
&lt;br /&gt;
:; Columns titled &amp;quot;grace&amp;quot;&lt;br /&gt;
:: tell the user how long they are allowed to stay above their &amp;#039;&amp;#039;soft limit&amp;#039;&amp;#039;, for bytes and files respectively. When these columns are empty (as in the example above) the user is not over quota.&lt;br /&gt;
&lt;br /&gt;
The meaning of &amp;#039;&amp;#039;&amp;#039;soft limit&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;hard limit&amp;#039;&amp;#039;&amp;#039; is the following. &lt;br /&gt;
&lt;br /&gt;
The hard limit cannot be exceeded. When a user reaches their hard limit, they cannot use any more disk space: for them, the filesystem behaves as if the disks are out of space. Disk writes will fail, temporary files will fail to be created, and the user will start to see warnings and errors while performing common tasks. The only disk operation allowed is file deletion.&lt;br /&gt;
&lt;br /&gt;
The soft limit is, as the name suggests, softer. When a user exceeds it, they are not immediately prevented from using more disk space (provided that they stay below the hard limit). However, as the user goes beyond the soft limit, their &amp;#039;&amp;#039;&amp;#039;grace period&amp;#039;&amp;#039;&amp;#039; begins: i.e. a period within which the user must reduce their amount of data back to below the soft limit. During the grace period, the &amp;quot;grace&amp;quot; column(s) of the output of &amp;lt;code&amp;gt;quota&amp;lt;/code&amp;gt; show how much of the grace period remains to the user. If the user is still above their soft limit at the end of the grace period, the quota system will treat the soft limit as a hard limit: i.e. it will force the user to delete data until they are below the soft limit before they can write on disk again.&lt;br /&gt;
&lt;br /&gt;
In the output of &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt;, the grace columns are blank except when a soft limit has been exceeded.&lt;br /&gt;
&lt;br /&gt;
=== Group and project quotas ===&lt;br /&gt;
&lt;br /&gt;
While on Mufasa disk quotas are usually assigned &amp;#039;&amp;#039;per-user&amp;#039;&amp;#039;, the quota system also enables the setup of &amp;#039;&amp;#039;per-group&amp;#039;&amp;#039; quotas (i.e., limits to the disk space that, collectively, a group of users can use) and &amp;#039;&amp;#039;per-project&amp;#039;&amp;#039; quotas (i.e., limits to the amount of data that a specific directory and all its subdirectories can contain).&lt;br /&gt;
&lt;br /&gt;
A comprehensive view of the quota situation for one&amp;#039;s user and user groups is provided by command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
quotainfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
On Mufasa, project quotas are applied to the group directories in &amp;lt;code&amp;gt;/home/&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Finding out how much disk space you are using ==&lt;br /&gt;
&lt;br /&gt;
If your user is the owner of directory &amp;lt;code&amp;gt;/path/to/dir/&amp;lt;/code&amp;gt; you can find out how much disk space is used by the directory with command &amp;lt;code&amp;gt;du&amp;lt;/code&amp;gt; like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -sh /path/to/dir/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;-sh&amp;lt;/code&amp;gt; flag is used to ask for options &amp;lt;code&amp;gt;-s&amp;lt;/code&amp;gt; (which provides the overall size of the directory) and &amp;lt;code&amp;gt;-h&amp;lt;/code&amp;gt; (which provides &amp;#039;&amp;#039;human-readable&amp;#039;&amp;#039; values using measurement units such as K (KBytes), M (MBytes), G (GBytes)).&lt;br /&gt;
&lt;br /&gt;
In particular, you can find out how much disk space is used by your home directory with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -sh ~&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In fact, in Linux the symbol &amp;lt;code&amp;gt;~&amp;lt;/code&amp;gt; is shorthand for the path to the current user&amp;#039;s home directory. &lt;br /&gt;
&lt;br /&gt;
If you want a detailed summary of how much disk space is used by each item (i.e., subdirectory or file) in a directory you own, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -h /path/to/dir/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For instance, for user gfontana the output of&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -h ~&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
may be similar to the following&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gfontana@rk018445:~$ du -h ~&lt;br /&gt;
12K	/home/gfontana/.ssh&lt;br /&gt;
356K	/home/gfontana/.cache/gstreamer-1.0&lt;br /&gt;
5.0M	/home/gfontana/.cache/tracker&lt;br /&gt;
5.3M	/home/gfontana/.cache&lt;br /&gt;
  [...other similar lines...]&lt;br /&gt;
4.0K	/home/gfontana/.config/htop&lt;br /&gt;
32K	/home/gfontana/.config&lt;br /&gt;
8.0K	/home/gfontana/.slurm&lt;br /&gt;
6.3M	/home/gfontana&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hidden files and directories ===&lt;br /&gt;
&lt;br /&gt;
In Linux, directories and files with a leading &amp;quot;.&amp;quot; in their name are &amp;#039;&amp;#039;hidden&amp;#039;&amp;#039;. These do not appear in listings, such as the output of the &amp;lt;code&amp;gt;ls&amp;lt;/code&amp;gt; command, to avoid cluttering them up: however, they still occupy disk space. &lt;br /&gt;
&lt;br /&gt;
The output of command &amp;lt;code&amp;gt;du&amp;lt;/code&amp;gt;, however, also considers hidden elements and provides their size: therefore it can help you understand why the quota system says that you are using more disk space than reported by &amp;lt;code&amp;gt;ls&amp;lt;/code&amp;gt;.&lt;br /&gt;
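&lt;br /&gt;
For instance, a quick way (these are standard Linux commands, not specific to Mufasa) to see the size of every item in your home directory, hidden items included, sorted from smallest to largest, is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
du -ah --max-depth=1 ~ | sort -h&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;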
&lt;br /&gt;
== Changing file/directory ownership and permissions ==&lt;br /&gt;
&lt;br /&gt;
Every file or directory in a Linux system is owned by both a user and a group. User and group ownership are not connected, so a file can have as its group owner a group that its user owner does not belong to.&lt;br /&gt;
&lt;br /&gt;
Being able to manipulate who owns a file and what permissions any user has on that file is often important in a multi-user system such as Mufasa. This is a recapitulation of the main Linux commands to manipulate file permissions. Key commands are&lt;br /&gt;
&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;chown&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; to change ownership - user part&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;chgrp&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; to change ownership - group part&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;chmod&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; to change access permissions&lt;br /&gt;
&lt;br /&gt;
All three accept option &amp;lt;code&amp;gt;-R&amp;lt;/code&amp;gt; (uppercase) for recursive operation, so, if needed, you can change ownership and/or permissions of all contents of a directory and its subdirectories with a single command.&lt;br /&gt;
&lt;br /&gt;
The syntax of &amp;lt;code&amp;gt;chown&amp;lt;/code&amp;gt; commands is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chown &amp;lt;new_owner&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;new_owner&amp;gt;&amp;lt;/code&amp;gt; is the user part of the new file ownership.&lt;br /&gt;
&lt;br /&gt;
The syntax of &amp;lt;code&amp;gt;chgrp&amp;lt;/code&amp;gt; commands is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chgrp &amp;lt;new_group&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;new_group&amp;gt;&amp;lt;/code&amp;gt; is the group part of the new file ownership.&lt;br /&gt;
&lt;br /&gt;
User and group ownership for a file can also be both changed at the same time with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chown &amp;lt;new_owner&amp;gt;:&amp;lt;new_group&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For what concerns &amp;lt;code&amp;gt;chmod&amp;lt;/code&amp;gt;, the easiest way to use it makes use of symbolic descriptions of the permissions. The format for this is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod [users]&amp;lt;+|-&amp;gt;&amp;lt;permissions&amp;gt; &amp;lt;path/to/file&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;&amp;lt;path/to/file&amp;gt;&amp;lt;/code&amp;gt; is the file or directory that the change is applied to&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;code&amp;gt;[users]&amp;lt;/code&amp;gt; is &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;ugo&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; or a subset of it; the three letters correspond respectively:&lt;br /&gt;
:::to the &amp;#039;&amp;#039;&amp;#039;u&amp;#039;&amp;#039;&amp;#039;ser who owns &amp;lt;code&amp;gt;&amp;lt;path/to/file&amp;gt;&amp;lt;/code&amp;gt; (if &amp;lt;code&amp;gt;[users]&amp;lt;/code&amp;gt; is omitted, the change applies to all three classes, subject to the &amp;lt;code&amp;gt;umask&amp;lt;/code&amp;gt;)&lt;br /&gt;
:::to the &amp;#039;&amp;#039;&amp;#039;g&amp;#039;&amp;#039;&amp;#039;roup that owns &amp;lt;code&amp;gt;&amp;lt;path/to/file&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
:::to everyone else (&amp;#039;&amp;#039;&amp;#039;o&amp;#039;&amp;#039;&amp;#039;thers)&lt;br /&gt;
:&amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;+&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;-&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; correspond to adding or removing permissions&lt;br /&gt;
:&amp;lt;code&amp;gt;&amp;lt;permissions&amp;gt;&amp;lt;/code&amp;gt; is &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;rwx&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039; or a subset, corresponding to &amp;#039;&amp;#039;&amp;#039;r&amp;#039;&amp;#039;&amp;#039;ead, &amp;#039;&amp;#039;&amp;#039;w&amp;#039;&amp;#039;&amp;#039;rite and e&amp;#039;&amp;#039;&amp;#039;x&amp;#039;&amp;#039;&amp;#039;ecute permissions&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;w&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; permissions have a different meaning for files and for directories.&lt;br /&gt;
&lt;br /&gt;
;For files:&lt;br /&gt;
: permission &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt; allows reading the contents of the file&lt;br /&gt;
: permission &amp;lt;code&amp;gt;w&amp;lt;/code&amp;gt; allows changing the contents of the file&lt;br /&gt;
: permission &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; allows executing the file (provided that it is a program: e.g., a shell script)&lt;br /&gt;
&lt;br /&gt;
;For directories:&lt;br /&gt;
: permission &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt; allows listing the files within the directory&lt;br /&gt;
: permission &amp;lt;code&amp;gt;w&amp;lt;/code&amp;gt; allows creating, renaming, or deleting files within the directory&lt;br /&gt;
: permission &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; allows entering the directory (i.e., &amp;lt;code&amp;gt;cd&amp;lt;/code&amp;gt; into it) and accessing its files&lt;br /&gt;
&lt;br /&gt;
For instance&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod g+rwx myfile.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
adds read, write and execute permissions on myfile.txt for the members of the group that owns the file;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod go-x mydir&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
removes permission to enter directory &amp;lt;code&amp;gt;mydir&amp;lt;/code&amp;gt; from the owning group and from all other users, i.e. from everyone except the user who owns the directory.&lt;br /&gt;
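&lt;br /&gt;
For example, to share a directory of results with your research group, the recursive forms of these commands can be combined (directory and group names here are purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chgrp -R nearlab ~/results&lt;br /&gt;
chmod -R g+rX ~/results&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The capital &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; grants execute permission only to directories (and to files that are already executable), so group members can enter the subdirectories and read the files without every file being made executable.&lt;br /&gt;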
&lt;br /&gt;
= Docker containers =&lt;br /&gt;
&lt;br /&gt;
[[File:262px-docker_logo_cropped.jpg|right|262px]]&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;As a general rule, all computation performed on Mufasa must occur within [https://www.docker.com/ Docker containers]&amp;#039;&amp;#039;&amp;#039;. From [https://docs.docker.com/get-started/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
“&amp;#039;&amp;#039;Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.&lt;br /&gt;
&lt;br /&gt;
Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host.&lt;br /&gt;
&lt;br /&gt;
A container is a sandboxed process on your machine that is isolated from all other processes on the host machine. When running a container, it uses an isolated filesystem. [containing] everything needed to run an application - all dependencies, configuration, scripts, binaries, etc. The image also contains other configuration for the container, such as environment variables, a default command to run, and other metadata.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Using Docker allows each user of Mufasa to build the software environment that their job(s) require. In particular, using Docker containers enables users to configure their own (containerized) system and install any required libraries on their own, without needing to ask administrators to modify the configuration of Mufasa. As a consequence, users can freely experiment with their (containerized) system without putting at risk the work of other users or the stability and reliability of Mufasa. For instance, containers allow users to run jobs that require multiple and/or obsolete versions of the same library.&lt;br /&gt;
&lt;br /&gt;
A large number of preconfigured Docker containers are already available, so users do not usually need to start from scratch in preparing the environment where their jobs will run on Mufasa. The official Docker container repository is [https://hub.docker.com/search?q=&amp;amp;type=image dockerhub].&lt;br /&gt;
&lt;br /&gt;
How to run Docker containers on Mufasa is explained in [[User Jobs|User Jobs]]. There is also a page of this wiki [[Docker|dedicated to the preparation of Docker containers]].&lt;br /&gt;
&lt;br /&gt;
= The SLURM job scheduling system =&lt;br /&gt;
&lt;br /&gt;
[[File:262px-Slurm logo.png|right|262px]]&lt;br /&gt;
Mufasa uses [https://slurm.schedmd.com/overview.html SLURM] (&amp;#039;&amp;#039;Slurm Workload Manager&amp;#039;&amp;#039;, formerly known as &amp;#039;&amp;#039;Simple Linux Utility for Resource Management&amp;#039;&amp;#039;) to manage shared access to its resources.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Users of Mufasa must use SLURM to run and manage all processing-heavy jobs they run on the machine&amp;#039;&amp;#039;&amp;#039;. It is possible for users to run jobs without using SLURM; however, running jobs this way is only intended for “housekeeping” activities and only provides access to a small subset of Mufasa&amp;#039;s resources. For instance, jobs run outside SLURM cannot access the GPUs, can only use a few processor cores, and can only access a small portion of RAM. Using SLURM is therefore necessary for any resource-intensive job.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/documentation.html SLURM&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The use of a job scheduling system such as SLURM ensures that Mufasa&amp;#039;s resources are exploited in an efficient way. The fact that a schedule exists means that a job is usually not executed immediately when it is launched: instead, the job gets &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; and will be executed as soon as possible, according to the availability of resources on the machine.&lt;br /&gt;
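&lt;br /&gt;
As a quick reference (these are standard SLURM commands; Mufasa-specific usage is covered in [[User Jobs]]), the state of the queue and of the available partitions can be inspected with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue            # list all queued and running jobs&lt;br /&gt;
squeue -u $USER   # list only your own jobs&lt;br /&gt;
sinfo             # list partitions and their current state&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;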
&lt;br /&gt;
Useful references for SLURM users are the [https://slurm.schedmd.com/man_index.html collected man pages] and the [https://slurm.schedmd.com/pdfs/summary.pdf command overview].&lt;br /&gt;
&lt;br /&gt;
SLURM is capable of managing complex computing systems composed of multiple &amp;#039;&amp;#039;&amp;#039;clusters&amp;#039;&amp;#039;&amp;#039; (i.e. sets) of machines, each comprising one &amp;#039;&amp;#039;&amp;#039;node&amp;#039;&amp;#039;&amp;#039; (i.e. machine) or more. The case of Mufasa is the simplest of all: &amp;#039;&amp;#039;Mufasa is the single node (called &amp;#039;&amp;#039;&amp;#039;&amp;lt;code&amp;gt;gn01&amp;lt;/code&amp;gt;&amp;#039;&amp;#039;&amp;#039;) of a SLURM computing cluster composed of that single machine.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
In order to let SLURM schedule job execution, before launching a job a user must specify what resources (such as RAM, processor cores, GPUs, ...) it requires. In managing process queues, SLURM considers such requirements and matches them with available resources. As a consequence, resource-heavy jobs generally take longer before they get executed, while less demanding jobs are usually put into execution quickly. Processes that, while they are running, try to use more resources than they requested at launch time get killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
All in all, the take-away message is: [[User Jobs#Choosing the partition on which to run a job|&amp;#039;&amp;#039;consider carefully how much of each resource to ask for your job&amp;#039;&amp;#039;]].&lt;br /&gt;
&lt;br /&gt;
In [[User Jobs]] it will be explained how the process of requesting resources is greatly simplified by making use of process queues with predefined resource allocations called [[User Jobs#SLURM Partitions|&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;]].&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=Docker&amp;diff=679</id>
		<title>Docker</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=Docker&amp;diff=679"/>
		<updated>2022-02-08T09:52:49Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Precondition */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page intends to provide a very simple guide to the creation of [[System#Docker containers|Docker containers]].&lt;br /&gt;
&lt;br /&gt;
Docker containers are important for Mufasa users since [[User Jobs#Executing jobs on Mufasa|user processes must always be run within Docker containers]].&lt;br /&gt;
&lt;br /&gt;
On Mufasa, Docker containers are run [[User_Jobs#Using_SLURM_to_run_a_Docker_container|via the SLURM scheduling system]].&lt;br /&gt;
&lt;br /&gt;
=Precondition=&lt;br /&gt;
&lt;br /&gt;
In order to prepare Docker containers for Mufasa, you will need to install on your own computer the following software packages:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;[https://docs.docker.com/get-docker/ Docker]&amp;#039;&amp;#039;&amp;#039;, i.e. the package necessary to create docker containers&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;[https://github.com/NVIDIA/enroot Nvidia enroot]&amp;#039;&amp;#039;&amp;#039;, i.e. the package used to store docker containers in the .sqsh (“squash”) format that can be run via SLURM&lt;br /&gt;
&lt;br /&gt;
The procedure for installation varies according to the operating system. For instance, on Ubuntu Linux installation is done with the commands&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo apt install docker.io&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo apt install enroot&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
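&lt;br /&gt;
After installation, you can check that both tools are available (the version numbers printed will of course depend on what you installed):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
docker --version&lt;br /&gt;
enroot version&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;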
&lt;br /&gt;
=Basic concepts=&lt;br /&gt;
From [https://docs.docker.com/get-started/overview/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Images&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;An image is a read-only template with instructions for creating a Docker container&amp;#039;&amp;#039;. &lt;br /&gt;
&lt;br /&gt;
Often, an image is based on another image, with some additional customization. For example, you may build an image which is based on the ubuntu image, but installs the Apache web server and your application, as well as the configuration details needed to make your application run.&lt;br /&gt;
&lt;br /&gt;
You might create your own images or you might only use those created by others and published in a registry. To build your own image, you create a Dockerfile with a simple syntax for defining the steps needed to create the image and run it. [...]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Containers&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;A container is a runnable instance of an image&amp;#039;&amp;#039;. [...]&lt;br /&gt;
&lt;br /&gt;
A container is defined by its image as well as any configuration options you provide to it when you create or start it.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A Docker image does not get modified when a container is created from it. When the container is run, the image is not modified either: the original file describing the container remains the same, and any changes apply only to the container instance being executed. &lt;br /&gt;
&lt;br /&gt;
Usually the container instance is kept read-only, i.e. the only part(s) of its internal filesystem that are writable are those specified with a WORKDIR directive (see [[Docker#Creating a Docker container image|below]]).&lt;br /&gt;
&lt;br /&gt;
As a consequence, installation of software libraries in the Docker container is better managed by running commands within the container itself when it gets executed, not by installing the libraries within the original Docker image: this way it is possible to change the version of the libraries without having to re-create the original image file every time a new version of the libraries is released.&lt;br /&gt;
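&lt;br /&gt;
For example, following this approach the script launched by the container&amp;#039;s entrypoint (see the ENTRYPOINT directive below) can itself install the required libraries before starting the actual job. A minimal, purely illustrative sketch of such a &amp;lt;code&amp;gt;run.sh&amp;lt;/code&amp;gt; script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# Install the Python libraries listed in requirements.txt, then run the job&lt;br /&gt;
pip install -r /opt/requirements.txt&lt;br /&gt;
python /opt/main.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;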
&lt;br /&gt;
=Creation of a Docker image=&lt;br /&gt;
&lt;br /&gt;
== Preparation ==&lt;br /&gt;
&lt;br /&gt;
The first step in the preparation of a new Docker image is to create a &amp;#039;&amp;#039;work directory&amp;#039;&amp;#039; where you will put all the elements to be used to create the image.&lt;br /&gt;
&lt;br /&gt;
Within such work directory, you will put:&lt;br /&gt;
&lt;br /&gt;
* subdirectories for all the things you need to create your container (e.g., a subdirectory “code” for the code of the job you will run on Mufasa)&lt;br /&gt;
&lt;br /&gt;
* a text file called &amp;#039;&amp;#039;&amp;#039;Dockerfile&amp;#039;&amp;#039;&amp;#039; (this is the full filename: it has no extension), where you specify how to create the container&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The Dockerfile contains directives that Docker uses to create the Docker container. Possible directives  are:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;code&amp;gt;FROM &amp;lt;container&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
: tells Docker that your container will be created on the basis of an already available container (created by you or by someone else; for instance, a container from dockerhub). This is useful because you can start from a base container that already includes the libraries you need (e.g., Pytorch or Tensorflow).&lt;br /&gt;
: &amp;lt;code&amp;gt;&amp;lt;container&amp;gt;&amp;lt;/code&amp;gt; is the name of the base container; this usually takes the form name:version (e.g. python:3.6)&lt;br /&gt;
: The FROM directive must be the first in the Dockerfile.&lt;br /&gt;
&lt;br /&gt;
: Example:&lt;br /&gt;
: &amp;lt;code&amp;gt;FROM python:3.6&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;code&amp;gt;WORKDIR &amp;lt;path/to/dir&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
: Sets the working directory inside the container&amp;#039;s filesystem. Subsequent directives (such as &amp;lt;code&amp;gt;COPY&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;RUN&amp;lt;/code&amp;gt;) and the entrypoint command are executed with this directory as their current directory, so it is the natural place in which to put the files (code, scripts, data) that the job needs (see the &amp;lt;code&amp;gt;COPY&amp;lt;/code&amp;gt; directive below).&lt;br /&gt;
&lt;br /&gt;
: Example:&lt;br /&gt;
: &amp;lt;code&amp;gt;WORKDIR /opt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;code&amp;gt;COPY &amp;lt;sourcedir&amp;gt; &amp;lt;destdir&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
: where &amp;lt;code&amp;gt;&amp;lt;destdir&amp;gt;&amp;lt;/code&amp;gt; is typically the directory specified by the WORKDIR directive or one of its subdirectories. This copies all contents of the directory in the host machine&amp;#039;s filesystem specified by &amp;lt;code&amp;gt;&amp;lt;sourcedir&amp;gt;&amp;lt;/code&amp;gt; to the container&amp;#039;s own directory specified by &amp;lt;code&amp;gt;&amp;lt;destdir&amp;gt;&amp;lt;/code&amp;gt;. Note that the syntax of Docker&amp;#039;s COPY directive is not the same as that of Linux&amp;#039;s &amp;lt;code&amp;gt;cp&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
: Example:&lt;br /&gt;
: &amp;lt;code&amp;gt;COPY ./code /opt&amp;lt;/code&amp;gt;&lt;br /&gt;
: copies all the files contained in &amp;lt;code&amp;gt;./code&amp;lt;/code&amp;gt; (i.e. subdirectory “code” of the directory where the Dockerfile resides) to &amp;lt;code&amp;gt;/opt&amp;lt;/code&amp;gt; in the container&amp;#039;s own filesystem&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: For instance, if the host machine&amp;#039;s filesystem is this (where &amp;lt;code&amp;gt;/home/my_user/for_Docker&amp;lt;/code&amp;gt; is the work directory)&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
: /&lt;br /&gt;
:: ...&lt;br /&gt;
:: /home&lt;br /&gt;
::: /my_user&lt;br /&gt;
:::: /for_Docker&lt;br /&gt;
::::: /code&lt;br /&gt;
:::::: main.py&lt;br /&gt;
:::::: requirements.txt&lt;br /&gt;
:::::: run.sh&lt;br /&gt;
::::: Dockerfile&lt;br /&gt;
:: ...&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
:the &amp;lt;code&amp;gt;COPY&amp;lt;/code&amp;gt; directive of the example copies files &amp;lt;code&amp;gt;main.py&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;requirements.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;run.sh&amp;lt;/code&amp;gt; from &amp;lt;code&amp;gt;/home/my_user/for_Docker/code&amp;lt;/code&amp;gt; into the &amp;lt;code&amp;gt;/opt&amp;lt;/code&amp;gt; directory of the filesystem internal to the Docker container.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;code&amp;gt;RUN &amp;lt;command&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
: where &amp;lt;code&amp;gt;&amp;lt;command&amp;gt;&amp;lt;/code&amp;gt; is any command you can issue via a bash shell. The commands specified by RUN directives are executed during image creation (i.e., when the image is built from the Dockerfile), not when the container is later run. They are executed by the root user of the image, i.e. they enjoy full administrative privileges: they can, for instance, install software packages.&lt;br /&gt;
: &amp;lt;code&amp;gt;&amp;lt;command&amp;gt;&amp;lt;/code&amp;gt; should not involve interaction with a user, since such interaction is not possible.&lt;br /&gt;
&lt;br /&gt;
: Example:&lt;br /&gt;
: &amp;lt;code&amp;gt;RUN pip install -r requirements.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
: pip is the program used to install Python libraries: here it is used to install all the libraries specified in an external file called &amp;lt;code&amp;gt;requirements.txt&amp;lt;/code&amp;gt; containing statements of the form&lt;br /&gt;
:: &amp;lt;code&amp;gt;&amp;lt;name_of_package&amp;gt;==&amp;lt;version&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
: For instance, such a file may contain the following lines:&lt;br /&gt;
:: &amp;lt;code&amp;gt;opencv-contrib-python==4.3.0.3&amp;lt;/code&amp;gt;&lt;br /&gt;
:: &amp;lt;code&amp;gt;opencv-python==4.3.0.36&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; &amp;lt;code&amp;gt;ENTRYPOINT [ &amp;quot;&amp;lt;command&amp;gt;&amp;quot;, &amp;quot;argument1&amp;quot;, &amp;quot;argument2&amp;quot;, &amp;quot;argument3&amp;quot;, ... ]&amp;lt;/code&amp;gt;&lt;br /&gt;
: where &amp;lt;code&amp;gt;&amp;lt;command&amp;gt;&amp;lt;/code&amp;gt; is a command and &amp;lt;code&amp;gt;&amp;quot;argumentk&amp;quot;&amp;lt;/code&amp;gt; are the arguments to be passed to it on the command line (this is the so-called &amp;#039;&amp;#039;exec form&amp;#039;&amp;#039; of ENTRYPOINT: the command and each of its arguments are written as separate quoted strings). So &amp;lt;code&amp;gt;&amp;quot;argumentk&amp;quot;&amp;lt;/code&amp;gt; is the &amp;#039;&amp;#039;k&amp;#039;&amp;#039;-th command line argument passed to &amp;lt;code&amp;gt;&amp;lt;command&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
: The “entrypoint”, specified by this directive, is the command that is executed as soon as the container is in execution. Typically the entrypoint command launches a bash shell and uses it to run a script.&lt;br /&gt;
: The Docker container remains in execution only until the entrypoint command is in execution. If the entrypoint terminates or fails, the container gets terminated as well.&lt;br /&gt;
&lt;br /&gt;
: Example:&lt;br /&gt;
: &amp;lt;code&amp;gt;ENTRYPOINT [ &amp;quot;/bin/bash&amp;quot;, &amp;quot;/opt/run.sh&amp;quot; ]&amp;lt;/code&amp;gt;&lt;br /&gt;
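&lt;br /&gt;
Putting the directives together, a minimal Dockerfile for the example work directory shown above might look like the following (a sketch, assuming that &amp;lt;code&amp;gt;run.sh&amp;lt;/code&amp;gt; launches the actual job, as in the previous examples):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
FROM python:3.6&lt;br /&gt;
WORKDIR /opt&lt;br /&gt;
COPY ./code /opt&lt;br /&gt;
RUN pip install -r requirements.txt&lt;br /&gt;
ENTRYPOINT [ &amp;quot;/bin/bash&amp;quot;, &amp;quot;/opt/run.sh&amp;quot; ]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;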
&lt;br /&gt;
==Creation of the Docker image==&lt;br /&gt;
&lt;br /&gt;
Once the Dockerfile is completed and all the material required by the image is in place in the work directory, it is time to create the image. The container image describes the container and can be subsequently used to create a container file: for instance, one formatted using the .sqsh format accepted by SLURM.&lt;br /&gt;
&lt;br /&gt;
To create the container image, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
docker build -t &amp;lt;name_of_image&amp;gt; .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;name_of_image&amp;gt;&amp;lt;/code&amp;gt; is the name to be assigned to the image. This name is arbitrary, but usually is structured like&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;name&amp;gt;:vXX&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;name&amp;gt;&amp;lt;/code&amp;gt; is any name and &amp;lt;code&amp;gt;XX&amp;lt;/code&amp;gt; is a version number. The “.” in the docker command tells docker that the components of the container are in the current directory.&lt;br /&gt;
&lt;br /&gt;
An example of command for the creation of an image is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;docker build -t docker_test:v1 .&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
During image creation, all the commands specified in the Dockerfile are executed (e.g., for the installation of libraries).&lt;br /&gt;
&lt;br /&gt;
==Image library==&lt;br /&gt;
&lt;br /&gt;
Docker maintains a local repository of (compressed) images that are available on the local computer (i.e. the one used for image creation). Every time a new image is created on the machine, it gets automatically added to the local repository.&lt;br /&gt;
&lt;br /&gt;
To get a list of available  images in the local repository, use command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
docker image list&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The local repository is in a system directory managed by Docker.&lt;br /&gt;
&lt;br /&gt;
In addition to local repositories, Docker allows access to remote repositories; the main one among these is [https://hub.docker.com/ Docker Hub].&lt;br /&gt;
&lt;br /&gt;
=Creation of a Docker container from an image=&lt;br /&gt;
In order to be run on Mufasa, Docker containers must take the form of a single &amp;lt;code&amp;gt;.sqsh&amp;lt;/code&amp;gt; compressed file (it&amp;#039;s pronounced &amp;quot;squash&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
Creation of a &amp;lt;code&amp;gt;.sqsh&amp;lt;/code&amp;gt; container file can be done from a local or remote image. Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
enroot import docker://&amp;lt;remote_container_image&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
creates a container file called &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;remote_container_image&amp;gt;.sqsh&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from a remote Docker image called &amp;lt;code&amp;gt;&amp;lt;remote_container_image&amp;gt;&amp;lt;/code&amp;gt; downloaded from the [https://hub.docker.com/ Dockerhub] remote image repository.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;enroot import docker://python:3.6&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To create a Docker container from a local Docker image (i.e., one stored in the local image library on your own computer) the command to use is instead&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
enroot import dockerd://&amp;lt;local_container_image&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(note that we are now using &amp;lt;code&amp;gt;import dockerd&amp;lt;/code&amp;gt; instead of &amp;lt;code&amp;gt;import docker&amp;lt;/code&amp;gt;). &lt;br /&gt;
&lt;br /&gt;
Running the command above creates a container file called &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;local_container_image&amp;gt;.sqsh&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from a local image called &amp;lt;code&amp;gt;&amp;lt;local_container_image&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;enroot import dockerd://docker_test:v1&amp;lt;/code&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=625</id>
		<title>System</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=625"/>
		<updated>2022-02-01T15:23:53Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Hardware */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Mufasa is a Linux server located in a server room managed by the [[Roles|System Administrators]]. [[Roles|Job Users]] and [[Roles|Job Administrators]] can only access Mufasa remotely. &lt;br /&gt;
&lt;br /&gt;
Remote access to Mufasa is performed using the [[System#Accessing Mufasa|SSH protocol]] for the execution of commands and the [[System#File transfer|SFTP protocol]] for the exchange of files. Once logged in, a user interacts with Mufasa via a terminal (text-based) interface.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Hardware =&lt;br /&gt;
[[File:hw.png|right|320px]]&lt;br /&gt;
Mufasa is a server for massively parallel computation. It has been set up and configured by [https://www.e4company.com/en/ E4 Computer Engineering] with the support of [https://nearlab.polimi.it/ NearLab], [http://www.biomech.polimi.it/ Biomechanics Group] and [http://www.cartcas.polimi.it/ CartCasLab].&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s main hardware components are:&lt;br /&gt;
&lt;br /&gt;
* 32-core, 64-thread AMD processor&lt;br /&gt;
* 1 TB RAM&lt;br /&gt;
* 9 TB of SSDs (for OS and execution cache)&lt;br /&gt;
* 28TB of HDDs (for user /home directories)&lt;br /&gt;
* 5 Nvidia A100 GPUs [based on the &amp;#039;&amp;#039;Ampere&amp;#039;&amp;#039; architecture]&lt;br /&gt;
* Linux Ubuntu operating system&lt;br /&gt;
&lt;br /&gt;
Usually each of these resources (e.g., a GPU) is not fully assigned to a single user or a single job. On the contrary, resources are shared among different users and processes in order to optimise their usage and availability.&lt;br /&gt;
&lt;br /&gt;
== CPUs and GPUs ==&lt;br /&gt;
&lt;br /&gt;
Mufasa is fitted with a 32-core CPU. Each core is able to run 2 threads in parallel, so the system has a total of 64 virtual CPUs. Of these, 2 are reserved for jobs run outside the [[System#The SLURM job scheduling system|SLURM job scheduling system]] (i.e., for low-power &amp;quot;housekeeping&amp;quot; tasks) while the remaining 62 are reserved for jobs run via SLURM.&lt;br /&gt;
&lt;br /&gt;
For what concerns GPUs, some of the 5 physical A100 processing cards (i.e., GPUs) are subdivided into “virtual” GPUs with different capabilities using [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ Nvidia&amp;#039;s MIG system]. From MIG&amp;#039;s user guide:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;The Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In practice, MIG allows flexible partitioning of a very powerful (but single) GPU to create multiple virtual GPUs with different capabilities, that are then made available to users as if they were separate devices.&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;nvidia-smi&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see [https://developer.nvidia.com/nvidia-system-management-interface Nvidia&amp;#039;s nvidia-smi documentation]) provides an overview of the physical and virtual GPUs available to users in a system (“smi” stands for System Management Interface). On Mufasa, this command may need to be launched via the [[System#The SLURM job scheduling system|SLURM job scheduling system]] in order to be able to access the GPUs. Its output (not reported here since it is quite extensive) is subdivided into three parts:&lt;br /&gt;
&lt;br /&gt;
* the first part describes the physical GPUs&lt;br /&gt;
* the second describes the virtual GPUs obtained by subdividing physical GPUs using MIG&lt;br /&gt;
* the third describes processes currently using the GPUs&lt;br /&gt;
&lt;br /&gt;
However, the most useful way to use &amp;lt;code&amp;gt;nvidia-smi&amp;lt;/code&amp;gt; for Mufasa users is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
nvidia-smi -L&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output of this command, in fact, shows the set of GPUs available in the system:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-a9f6e4f2-2877-8642-1802-5eeb3518d415)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-abe13a42-013b-5bef-aa5e-bbd268d72447)&lt;br /&gt;
  MIG 2g.10gb     Device  1: (UUID: MIG-268c6b30-d10c-59db-babd-3eda7b89da34)&lt;br /&gt;
  MIG 2g.10gb     Device  2: (UUID: MIG-90e26aa7-cf69-5672-b758-419679238cd3)&lt;br /&gt;
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-5f28ca0a-5b2c-bfc7-5b9f-581b5ca1d110)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-07372a92-2e37-5ad6-b334-add0100cf5e3)&lt;br /&gt;
  MIG 2g.10gb     Device  1: (UUID: MIG-4ca248b0-ab87-5f91-a788-5fe169d0623e)&lt;br /&gt;
  MIG 2g.10gb     Device  2: (UUID: MIG-a93ffffb-9a0d-51d1-b9df-36bc624a2084)&lt;br /&gt;
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-fb86701b-5781-b63c-5cda-911cff3a5edb)&lt;br /&gt;
GPU 3: NVIDIA A100-PCIE-40GB (UUID: GPU-bbeed512-ab4c-e984-cfea-8067c009a600)&lt;br /&gt;
  MIG 3g.20gb     Device  0: (UUID: MIG-bdbcf24a-a0aa-56fb-a7e4-fc18f17b7f24)&lt;br /&gt;
  MIG 2g.10gb     Device  1: (UUID: MIG-4c44132b-7499-562d-a85f-55a0a2cbb5ba)&lt;br /&gt;
  MIG 2g.10gb     Device  2: (UUID: MIG-fe354ead-4f87-53ab-9271-1d98190248f4)&lt;br /&gt;
GPU 4: NVIDIA A100-PCIE-40GB (UUID: GPU-a9511357-2476-7ddf-c4c5-c90feb68acfd)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As &amp;lt;code&amp;gt;nvidia-smi -L&amp;lt;/code&amp;gt; shows, the physical Nvidia A100 GPUs installed on Mufasa have been subdivided as follows:&lt;br /&gt;
&lt;br /&gt;
* two of the physical GPUs (GPU 2 and GPU 4) have not been subdivided at all&lt;br /&gt;
* three of the physical GPUs (GPU 0, GPU 1 and GPU 3) have been subdivided into 3 virtual GPUs each:&lt;br /&gt;
** one virtual GPU with 20 GB of RAM&lt;br /&gt;
** two virtual GPUs with 10 GB of RAM each&lt;br /&gt;
&lt;br /&gt;
Thanks to MIG, users can use all the GPUs listed above as if they were all physical devices installed on Mufasa, without having to worry (or even know) about which are physical GPUs and which are virtual ones.&lt;br /&gt;
&lt;br /&gt;
All in all, then, users of Mufasa are provided with the following set of &amp;#039;&amp;#039;&amp;#039;11 GPUs&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2 GPUs with 40 GB of RAM each&amp;#039;&amp;#039;&amp;#039; &lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;3 GPUs with 20 GB of RAM each&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;6 GPUs with 10 GB of RAM each&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
How these devices are made available to Mufasa users is explained in [[User Jobs]].&lt;br /&gt;
&lt;br /&gt;
= Accessing Mufasa =&lt;br /&gt;
&lt;br /&gt;
User access to Mufasa is always remote and exploits the &amp;#039;&amp;#039;SSH&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure SHell&amp;#039;&amp;#039;) protocol. &lt;br /&gt;
&lt;br /&gt;
To open a remote connection to Mufasa, open a local terminal on your computer and, in it, run command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
ssh &amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the user&amp;#039;s username on Mufasa and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is one of the IP addresses of Mufasa, i.e. either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, user &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt; may access Mufasa with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;ssh mrossi@10.79.23.97&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Access via SSH works with Linux, MacOS and Windows 10 (and later) terminals. For Windows users, a handy alternative tool (which also includes an X server, required to run Linux programs with a graphical user interface on Mufasa) is [https://mobaxterm.mobatek.net/ MobaXterm].&lt;br /&gt;
&lt;br /&gt;
If you don&amp;#039;t have a user account on Mufasa, you first have to ask your supervisor for one. See [[System#Users and groups|Users and groups]] for more information about Mufasa&amp;#039;s users.&lt;br /&gt;
&lt;br /&gt;
As soon as you launch the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, you will be asked to type the password (i.e., the one of your user account on Mufasa). Once you provide the password, the local terminal on your computer becomes a remote terminal (a “remote shell”) through which you interact with Mufasa. The remote shell sports a &amp;#039;&amp;#039;command prompt&amp;#039;&amp;#039; such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;username&amp;gt;@rk018445:~$&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(&amp;#039;&amp;#039;rk018445&amp;#039;&amp;#039; is the Linux hostname of Mufasa). For instance, user &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt; will see a prompt similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;mrossi@rk018445:~$&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the remote shell, you can issue commands to Mufasa by typing them after the prompt, then pressing the &amp;#039;&amp;#039;enter&amp;#039;&amp;#039; key. Since Mufasa is a Linux server, it will respond to all the standard Linux system commands such as &amp;lt;code&amp;gt;pwd&amp;lt;/code&amp;gt; (which prints the path to the current directory) or &amp;lt;code&amp;gt;cd &amp;lt;destination_dir&amp;gt;&amp;lt;/code&amp;gt; (which changes the current directory). On the internet you can find many tutorials about the Linux command line, such as [https://linuxcommand.org/index.php this one].&lt;br /&gt;
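&lt;br /&gt;
A few illustrative examples of such commands (the path shown is hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
pwd                    # print the path of the current directory&lt;br /&gt;
ls -l                  # list the contents of the current directory&lt;br /&gt;
cd /home/&amp;lt;username&amp;gt;    # change to your home directory&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;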
&lt;br /&gt;
To close the SSH session run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
from the command prompt of the remote shell.&lt;br /&gt;
&lt;br /&gt;
== VPN ==&lt;br /&gt;
To be able to connect to Mufasa, your computer must belong to Polimi&amp;#039;s LAN. This happens either because the computer is physically located at Politecnico di Milano and connected via ethernet, or because you are using Polimi&amp;#039;s VPN (Virtual Private Network) to connect to its LAN from somewhere else (such as your home). In particular, using the VPN is the &amp;#039;&amp;#039;only&amp;#039;&amp;#039; way to use Mufasa from outside Polimi. See [https://intranet.deib.polimi.it/ita/vpn-wifi this DEIB webpage] for instructions about how to activate VPN access.&lt;br /&gt;
&lt;br /&gt;
== SSH timeout ==&lt;br /&gt;
&lt;br /&gt;
SSH sessions to Mufasa may be subjected to an inactivity timeout: i.e., after a given inactivity period the ssh session gets automatically closed. Users who need to be able to reconnect to the very same shell where they launched a program (for instance because their program is interactive or because it provides progress update messages) should [[User Jobs#Detaching from a running job with screen|use the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; command]].&lt;br /&gt;
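&lt;br /&gt;
A minimal sketch of how &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; is typically used for this purpose (the session name is arbitrary):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -S mysession   # create a named session and work inside it&lt;br /&gt;
                      # press Ctrl-a d to detach, leaving the session running&lt;br /&gt;
screen -r mysession   # reattach to the session after logging in again&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;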
&lt;br /&gt;
== SSH and graphics ==&lt;br /&gt;
&lt;br /&gt;
The standard form of the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, i.e. the one described at the beginning of [[system#Accessing Mufasa|Accessing Mufasa]], should always be preferred. However, it only allows text communication with Mufasa. In special cases it may be necessary to remotely run (on Mufasa) Linux programs that have a graphical user interface. These programs require interaction with the X server of the remote user&amp;#039;s machine (which must use Linux as well). A special mode of operation of &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; is needed to enable this. This mode is engaged by running command &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt; like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; ssh -X &amp;lt;your username on Mufasa&amp;gt;@&amp;lt;Mufasa&amp;#039;s IP address&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= File transfer =&lt;br /&gt;
&lt;br /&gt;
Uploading files from local machine to Mufasa and downloading files from Mufasa onto local machines is done using the &amp;#039;&amp;#039;SFTP&amp;#039;&amp;#039; protocol (&amp;#039;&amp;#039;Secure File Transfer Protocol&amp;#039;&amp;#039;). &lt;br /&gt;
&lt;br /&gt;
Linux and MacOS users can directly use the &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; package, as explained (for instance) by [https://geekflare.com/sftp-command-examples/ this guide]. Windows users can interact with Mufasa via SFTP protocol using the [https://mobaxterm.mobatek.net/ MobaXterm] software package. MacOS users can interact with Mufasa via SFTP also with the [https://cyberduck.io/ Cyberduck] software package.&lt;br /&gt;
&lt;br /&gt;
For Linux and MacOS users, file transfer to/from Mufasa occurs via an &amp;#039;&amp;#039;interactive sftp shell&amp;#039;&amp;#039;, i.e. a remote shell very similar to the one described in [[Accessing Mufasa|Accessing Mufasa]]. &lt;br /&gt;
The first thing to do is to open a terminal and run the following command (note the similarity to SSH connections):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp &amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the user&amp;#039;s username on Mufasa, and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will be asked your password. Once you provide it, you access an interactive sftp shell, where the command prompt takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From this shell you can run the commands to exchange files. Most of these commands have two forms: one to act on the remote machine (in this case, Mufasa) and one to act on the local machine (i.e. your own computer). To differentiate, the “local” versions usually have names that start with the letter “l” (lowercase L). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
cd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the remote machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
lcd &amp;lt;path&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to change directory to &amp;lt;code&amp;gt;&amp;lt;path&amp;gt;&amp;lt;/code&amp;gt; on the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
get &amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to download (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;filename&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the remote machine to the current directory of the local machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
put &amp;lt;filename&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
to upload (i.e. copy) &amp;lt;code&amp;gt;&amp;lt;filename&amp;gt;&amp;lt;/code&amp;gt; from the current directory of the local machine to the current directory of the remote machine.&lt;br /&gt;
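&lt;br /&gt;
As an illustration, below is a hypothetical sftp session that uses these commands to download a result file and upload a data file (directory and file names are examples only):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sftp&amp;gt; cd experiments/run01&lt;br /&gt;
sftp&amp;gt; lcd /tmp&lt;br /&gt;
sftp&amp;gt; get results.csv&lt;br /&gt;
sftp&amp;gt; put new_data.csv&lt;br /&gt;
sftp&amp;gt; exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This session moves to &amp;lt;code&amp;gt;experiments/run01&amp;lt;/code&amp;gt; on Mufasa and to &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; on the local machine, downloads &amp;lt;code&amp;gt;results.csv&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt;, uploads &amp;lt;code&amp;gt;new_data.csv&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;experiments/run01&amp;lt;/code&amp;gt;, and closes the connection.&lt;br /&gt;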
&lt;br /&gt;
Naturally, a user can only upload files to directories where they have write permission (usually only their own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and its subdirectories). Also, users can only download files from directories where they have read permission. (File permissions on Mufasa follow the standard Linux rules.)&lt;br /&gt;
&lt;br /&gt;
In addition to the terminal interface, users of Linux distributions based on Gnome (such as Ubuntu) can use a handy graphical tool to exchange files with Mufasa. In Gnome&amp;#039;s Nautilus file manager, write&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sftp://&amp;lt;username&amp;gt;@&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
in the address bar, where &amp;lt;code&amp;gt;username&amp;lt;/code&amp;gt; is your username on Mufasa and &amp;lt;code&amp;gt;&amp;lt;IP_address&amp;gt;&amp;lt;/code&amp;gt; is either &amp;lt;code&amp;gt;10.79.23.96&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;10.79.23.97&amp;lt;/code&amp;gt;. Nautilus becomes a graphical interface to Mufasa&amp;#039;s remote filesystem.&lt;br /&gt;
&lt;br /&gt;
= Docker containers =&lt;br /&gt;
&lt;br /&gt;
[[File:262px-docker_logo_cropped.jpg|right|262px]]&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;As a general rule, all computation performed on Mufasa must occur within [https://www.docker.com/ Docker containers]&amp;#039;&amp;#039;&amp;#039;. From [https://docs.docker.com/get-started/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
“&amp;#039;&amp;#039;Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.&lt;br /&gt;
&lt;br /&gt;
Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host.&lt;br /&gt;
&lt;br /&gt;
A container is a sandboxed process on your machine that is isolated from all other processes on the host machine. When running a container, it uses an isolated filesystem. [containing] everything needed to run an application - all dependencies, configuration, scripts, binaries, etc. The image also contains other configuration for the container, such as environment variables, a default command to run, and other metadata.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Using Docker allows each user of Mufasa to build the software environment that their job(s) require. In particular, using Docker containers enables users to configure their own (containerized) system and install any required libraries on their own, without needing to ask administrators to modify the configuration of Mufasa. As a consequence, users can freely experiment with their (containerized) system without risk to the work of other users or to the stability and reliability of Mufasa. For instance, containers allow users to run jobs that require multiple and/or obsolete versions of the same library.&lt;br /&gt;
&lt;br /&gt;
A large number of preconfigured Docker containers are already available, so users do not usually need to start from scratch in preparing the environment where their jobs will run on Mufasa. The official Docker container repository is [https://hub.docker.com/search?q=&amp;amp;type=image dockerhub].&lt;br /&gt;
&lt;br /&gt;
How to run Docker containers on Mufasa is explained in [[User Jobs|User Jobs]]. See [[Docker|Docker]] for directions about preparing Docker containers.&lt;br /&gt;
&lt;br /&gt;
= The SLURM job scheduling system =&lt;br /&gt;
&lt;br /&gt;
[[File:262px-Slurm logo.png|right|262px]]&lt;br /&gt;
Mufasa uses [https://slurm.schedmd.com/overview.html SLURM] (&amp;#039;&amp;#039;Slurm Workload Manager&amp;#039;&amp;#039;, formerly known as &amp;#039;&amp;#039;Simple Linux Utility for Resource Management&amp;#039;&amp;#039;) to manage shared access to its resources.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Users of Mufasa must use SLURM to run and manage all processing-heavy jobs they run on the machine&amp;#039;&amp;#039;&amp;#039;. It is possible for users to run jobs without using SLURM; however, running jobs this way is only intended for &amp;quot;housekeeping&amp;quot; activities and only provides access to a small subset of Mufasa&amp;#039;s resources. For instance, jobs run outside SLURM cannot access the GPUs, can only use a few processor cores, and can only access a small portion of RAM. Using SLURM is therefore necessary for any resource-intensive job.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/documentation.html SLURM&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The use of a job scheduling system such as SLURM ensures that Mufasa&amp;#039;s resources are exploited in an efficient way. The fact that a schedule exists means that usually a job does not get immediately executed as soon as it is launched: instead, the job gets &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; and will be executed as soon as possible, according to the availability of resources in the machine.&lt;br /&gt;
&lt;br /&gt;
Useful references for SLURM users are the [https://slurm.schedmd.com/man_index.html collected man pages] and the [https://slurm.schedmd.com/pdfs/summary.pdf command overview].&lt;br /&gt;
&lt;br /&gt;
In order to let SLURM schedule job execution, before launching a job a user must specify what resources (such as RAM, processor cores, GPUs, ...) it requires. In managing process queues, SLURM considers such requirements and matches them with available resources. As a consequence, resource-heavy jobs generally take longer before they get executed, while less demanding jobs are usually put into execution quickly. Processes that -while they are running- try to use more resources than they requested at launch time get killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
All in all, the take-away message is: &amp;#039;&amp;#039;consider carefully how much of each resource to ask for your job&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
In [[User Jobs]] it will be explained how the process of requesting resources is greatly simplified by making use of process queues with predefined resource allocations called [[User Jobs#SLURM Partitions|&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;]].&lt;br /&gt;
&lt;br /&gt;
= Users and groups =&lt;br /&gt;
&lt;br /&gt;
Only Mufasa users (i.e., people with a user account on Mufasa) can access the machine and interact with it. Creation of new users is done by Job Administrators or by specially designated users within each research group.&lt;br /&gt;
&lt;br /&gt;
Mufasa usernames have the form &amp;lt;code&amp;gt;xyyy&amp;lt;/code&amp;gt; (all lowercase), where &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; is the first letter of the first name of the person, and &amp;lt;code&amp;gt;yyy&amp;lt;/code&amp;gt; is their complete surname. For instance, a person called Mario Rossi will be assigned username &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt;. If multiple users with the same surname &amp;#039;&amp;#039;and&amp;#039;&amp;#039; first letter of the first name exist, those created after the very first one are given usernames including a two-digit counter: &amp;lt;code&amp;gt;mrossi&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;mrossi01&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;mrossi02&amp;lt;/code&amp;gt; and so on.&lt;br /&gt;
&lt;br /&gt;
On Linux machines such as Mufasa, users belong to &amp;#039;&amp;#039;groups&amp;#039;&amp;#039;. On Mufasa, groups are used to identify the research group that a specific user is part of. Assignment of Mufasa&amp;#039;s users to groups follows these rules:&lt;br /&gt;
&lt;br /&gt;
* All users corresponding to people belong to group &amp;lt;code&amp;gt;users&amp;lt;/code&amp;gt;&lt;br /&gt;
* Additionally, each user must belong to &amp;#039;&amp;#039;one and only one&amp;#039;&amp;#039; of the following groups (within brackets is the name of the faculty member in charge of Mufasa for each group):&lt;br /&gt;
** &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;cartcas&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;, i.e. [http://www.cartcas.polimi.it/ CartCasLab] (prof. Cerveri);&lt;br /&gt;
** &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;biomech&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;, i.e. [http://www.biomech.polimi.it/ Biomechanics Research Group] (prof. Votta);&lt;br /&gt;
** &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;nearmrs&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;, i.e. [https://nearlab.polimi.it/medical/ Medical Robotics Section of NearLab] (prof. De Momi);&lt;br /&gt;
** &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;nearnes&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;, i.e. [https://nearlab.polimi.it/neuroengineering/ NeuroEngineering Section of NearLab] (prof. Ferrante);&lt;br /&gt;
** &amp;lt;code&amp;gt;&amp;#039;&amp;#039;&amp;#039;bio&amp;#039;&amp;#039;&amp;#039;&amp;lt;/code&amp;gt;, for BioEngineering users not belonging to any of the research groups listed above.&lt;br /&gt;
&lt;br /&gt;
Mufasa users who have the power to create new users do so with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sudo  /opt/share/sbin/add_user.sh -u &amp;lt;user&amp;gt; -g users,&amp;lt;group&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;user&amp;gt;&amp;lt;/code&amp;gt; is the username of the new user and &amp;lt;code&amp;gt;&amp;lt;group&amp;gt;&amp;lt;/code&amp;gt; is one of the research groups from the list above.&lt;br /&gt;
&lt;br /&gt;
For instance, in order to create a user on Mufasa for a person named Mario Rossi belonging to CartCasLab, the following command will be used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sudo  /opt/share/sbin/add_user.sh -u mrossi -g users,cartcas&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At first login, new users will be asked to change the password initially assigned to them. For security reasons, it is important that such first login occurs as soon as possible after user creation.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=327</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=327"/>
		<updated>2022-01-20T18:41:00Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Scratch Storage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of Mufasa that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Job Users are by necessity SLURM users (see [[System#The SLURM job scheduling system|The SLURM job scheduling system]]) so you may also want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
= SLURM Partitions =&lt;br /&gt;
&lt;br /&gt;
Several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039; in SLURM terminology. Each partition has features (in term of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug         up   infinite      1    mix gn01&lt;br /&gt;
small*        up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;small&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.&lt;br /&gt;
&lt;br /&gt;
On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to. A complete list of the features of each partition can be obtained with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo --Format=all&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but its output can be overwhelming. For instance, in the example above the output of &amp;lt;code&amp;gt;sinfo --Format=all&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A less comprehensive but more readable view of partition features can be obtained via a tailored &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command, i.e. one that only asks for the features that are most relevant to Mufasa users. An example of such a command is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a %.4c %.17B %.54G %.11l %.11L %.4r&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
 PARTITION  AVAIL CPUS MAX_CPUS_PER_NODE                                                   GRES   TIMELIMIT DEFAULTTIME ROOT&lt;br /&gt;
     debug     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    infinite         n/a  yes&lt;br /&gt;
    small*     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    12:00:00       15:00   no&lt;br /&gt;
    normal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
longnormal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       gpu     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
   gpulong     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       fat     up   62                48  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns in this output correspond to the following information (from [https://slurm.schedmd.com/sinfo.html SLURM docs]), where the &amp;#039;&amp;#039;node&amp;#039;&amp;#039; is Mufasa:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
: %P Partition name followed by &amp;quot;*&amp;quot; for the default partition&lt;br /&gt;
: %a State/availability of a partition&lt;br /&gt;
: %c Number of CPUs per node [&amp;#039;&amp;#039;for Mufasa these are [[System#CPUs and GPUs|the 64 CPUs minus the 2 dedicated to non-SLURM jobs]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %B The max number of CPUs per node available to jobs in the partition&lt;br /&gt;
: %G Generic resources (gres) associated with the nodes [&amp;#039;&amp;#039;for Mufasa these correspond to the [[System#CPUs and GPUs|virtual GPUs defined with MIG]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %l Maximum time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %L Default time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %r Only user root may initiate jobs, &amp;quot;yes&amp;quot; or &amp;quot;no&amp;quot;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command used, field identifiers &amp;lt;code&amp;gt;%...&amp;lt;/code&amp;gt; have been preceded by width specifiers in the form &amp;lt;code&amp;gt;.N&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;N&amp;lt;/code&amp;gt; is a positive integer. The specifiers define how many characters to reserve for each field in the command output, and are used to increase readability.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
An important piece of information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
; up&lt;br /&gt;
: the partition is available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; drain&lt;br /&gt;
: the partition is not available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; down&lt;br /&gt;
: the same as &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; but the partition failed: i.e., it suffered a disruption&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039;. Jobs waiting on such a partition remain paused until the partition becomes available again.&lt;br /&gt;
&lt;br /&gt;
== Choosing the partition on which to run a job ==&lt;br /&gt;
&lt;br /&gt;
When launching a job (as explained in [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]]) a user should select the partition that is most suitable for it according to the job&amp;#039;s features. Launching a job on a partition avoids the need for the user to specify explicitly all of the resources that the job requires, relying instead (for unspecified resources) on the default amounts defined for the partition.&lt;br /&gt;
&lt;br /&gt;
Partitions are very handy because, by selecting the right partition for their job, a user can pre-define the job&amp;#039;s requirements without having to specify them explicitly, which also helps avoid mistakes. However, users can -if needed- change the resources requested by their jobs with respect to the default values associated with the chosen partition. Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. Still, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests made when launching a job can be either lower or higher than the partition&amp;#039;s default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if such a maximum is set. If a user tries to run a job on a partition while requesting more of a resource than the partition‑specified maximum, the run command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the most important resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined maximum. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
An interesting part of the output of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; ([[User Jobs#SLURM Partitions|see above]]) is the one concerning the GPUs, since GPUs are usually the least plentiful resource in a system such as Mufasa. For all partitions this part of the output is identical and equal to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The string above means that, whatever the partition on which it is run, a job can request up to two 40 GB GPUs, up to three 20 GB GPUs and up to six 10 GB GPUs (even all together): i.e., [[System#CPUs and GPUs|the complete set of 11 GPUs available in the system]]. In other words, at the moment none of the partitions defined on Mufasa sets a limit on the maximum number of GPUs that jobs are allowed to request.&lt;br /&gt;
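&lt;br /&gt;
For example, a job needing one 40 GB GPU and two 10 GB GPUs could request them by passing a comma-separated list to the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option described in [[User Jobs#Using SLURM to run a Docker container|Using SLURM to run a Docker container]]. A sketch (the rest of the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; command is omitted):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:40gb:1,gpu:10gb:2 ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;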
&lt;br /&gt;
Of course, the larger the fraction of system resources that a job asks for, the heavier the job becomes for Mufasa&amp;#039;s limited capabilities. Since SLURM prioritises lighter jobs over heavier ones (in order to maximise the number of completed jobs), it is a very bad idea for a user to request more resources for their job than it actually needs: this will have the effect of delaying (possibly for a long time) job execution.&lt;br /&gt;
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done. &lt;br /&gt;
&lt;br /&gt;
Considering that [[System#Docker Containers|all computation on Mufasa must occur within Docker containers]], the jobs run by Mufasa users are always containers except for menial, non-computationally intensive jobs. The process of launching a user job on Mufasa involves two steps:&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
:; Step 1&lt;br /&gt;
:: [[User Jobs#Using SLURM to run a Docker container|Use SLURM to run the Docker container where the job will take place]]&lt;br /&gt;
&lt;br /&gt;
:; Step 2&lt;br /&gt;
:: [[User Jobs#Launching a user job from within a Docker container|Launch the job from within the Docker container]]&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
As an optional preparatory step, it is often useful to define an [[User Jobs#Using execution scripts to run jobs|execution script]] to simplify the launching process and reduce the possibility of mistakes.&lt;br /&gt;
&lt;br /&gt;
The commands that SLURM provides to run jobs are &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see SLURM documentation: [https://slurm.schedmd.com/srun.html srun], [https://slurm.schedmd.com/sbatch.html sbatch]). The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the first locks the shell from which it has been launched, so it is only really suitable for processes that use the console for interaction with their user; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell and simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
Among the options available for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, one of the most important is &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt; is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many of the GPUs the program requests for use. Since GPUs are the scarcest resources of Mufasa, this option must always be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
&lt;br /&gt;
As [[User Jobs#SLURM Partitions|already explained]], a quick way to define the set of resources that a program will have access to is to use option &amp;lt;code&amp;gt;-p &amp;lt;partition name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as ‑‑gres=gpu:K, will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; on the partition named “small”. Running the program this way means that the resources associated to this partition will be available to it for use.&lt;br /&gt;
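&lt;br /&gt;
For comparison, a similar job could be queued without locking the shell by using &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. In the hypothetical sketch below, &amp;lt;code&amp;gt;my_script.sh&amp;lt;/code&amp;gt; is assumed to be an executable script; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; returns immediately after queueing it, and by default the job&amp;#039;s output is written to a file named &amp;lt;code&amp;gt;slurm-&amp;lt;jobid&amp;gt;.out&amp;lt;/code&amp;gt; in the directory the command was launched from:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch -p small ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;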
&lt;br /&gt;
= Using SLURM to run a Docker container =&lt;br /&gt;
&lt;br /&gt;
The first step to run a user job on Mufasa is to run the [[System#Docker Containers|Docker container]] where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun ‑p &amp;lt;partition_name&amp;gt; ‑‑container-image=&amp;lt;container_path.sqsh&amp;gt; ‑‑no‑container‑entrypoint ‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt; ‑‑gres=&amp;lt;gpu_resources&amp;gt; ‑‑mem=&amp;lt;mem_resources&amp;gt; ‑‑cpus‑per‑task &amp;lt;cpu_amount&amp;gt; ‑‑pty ‑‑time=&amp;lt;hh:mm:ss&amp;gt; &amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. These options are explained below.&lt;br /&gt;
&lt;br /&gt;
;‑p &amp;lt;partition_name&amp;gt;&lt;br /&gt;
: specifies the resource partition on which the job will be run.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;‑p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task &amp;lt;cpu_number&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;: GPU resources, in fact, must always be explicitly requested with option &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;, otherwise no access to GPUs is granted to the job.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑container-image=&amp;lt;container_path.sqsh&amp;gt;&lt;br /&gt;
: specifies the container to be run&lt;br /&gt;
&lt;br /&gt;
;‑‑no‑container‑entrypoint&lt;br /&gt;
: specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option ‑‑no‑container‑entrypoint is useful when the user is not sure of the effect of such command.&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;lt;code&amp;gt;&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/code&amp;gt; takes the value &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; this tells srun to mount Mufasa&amp;#039;s directory &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; in position &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; within the filesystem of the Docker container. When the docker container reads or writes files in directory &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; of its own (internal) filesystem, what actually happens is that files in &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; get manipulated instead. &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&lt;br /&gt;
;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
: specifies what GPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:2&amp;lt;/code&amp;gt;, which corresponds to giving the job control of 2 entire large‑size GPUs.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must always be explicitly requested with &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
: specifies the amount of RAM to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑cpus-per-task &amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
: specifies how many CPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑pty&lt;br /&gt;
: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[User Jobs#Running interactive jobs via SLURM|Running interactive jobs via SLURM]])&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies the maximum time allowed to the job to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
: the executable that will be run within the Docker container as soon as it is operative. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; to launch non-interactive programs.&lt;br /&gt;
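&lt;br /&gt;
Putting the options together, a hypothetical invocation could look like the sketch below. All values are examples only: the container image path, the mounted directory and the resource amounts must be adapted to the actual job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --container-image=/home/mrossi/containers/pytorch.sqsh --no-container-entrypoint \&lt;br /&gt;
     --container-mounts=/home/mrossi:/data --gres=gpu:10gb:1 --mem=32G \&lt;br /&gt;
     --cpus-per-task 2 --pty --time=12:00:00 /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This drops the user into a bash shell inside the container, from which the actual job can then be launched as described in the next sections.&lt;br /&gt;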
&lt;br /&gt;
== Nvidia Pyxis ==&lt;br /&gt;
&lt;br /&gt;
Some of the options described above are specifically dedicated to Docker containers: these are provided by the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. More specifically, options &amp;lt;code&amp;gt;‑‑container-image&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑no‑container‑entrypoint&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑container-mounts&amp;lt;/code&amp;gt; are provided to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; by Pyxis.&lt;br /&gt;
&lt;br /&gt;
= Launching a user job from within a Docker container =&lt;br /&gt;
&lt;br /&gt;
Once the Docker container (run as [[User Jobs#Using SLURM to run a Docker container|explained here]]) is up and running, the user is dropped to the interactive environment specified by &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
= Running interactive jobs via SLURM =&lt;br /&gt;
&lt;br /&gt;
As explained, SLURM command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with a command similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and are not able to access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only access 2 CPUs). On the contrary, running programs with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;. For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;lt;code&amp;gt;(SLURM ID xx)&amp;lt;/code&amp;gt; (where &amp;lt;code&amp;gt;xx&amp;lt;/code&amp;gt; is the ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
= Detaching from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, to create a screen session and run a job in it:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* From the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* In the screen session thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen session with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, while your process will go on running in the screen session&lt;br /&gt;
* You can now close the SSH connection to Mufasa without damaging your process&lt;br /&gt;
&lt;br /&gt;
Later, when you are ready to resume contact with your running process:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* In the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -r&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* You are now back to the screen session where you launched your job&lt;br /&gt;
&lt;br /&gt;
* When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A use case for screen is writing your program in such a way that it prints progress advancement messages as it goes on with its work. Then, you can check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.&lt;br /&gt;
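&lt;br /&gt;
As a compact sketch, the whole workflow (with a hypothetical interactive job run on partition “small”) looks like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen                             # create a screen session&lt;br /&gt;
srun -p small --pty /bin/bash      # inside the screen, launch the job via SLURM&lt;br /&gt;
# press ctrl + A followed by D to detach; the SSH connection can now be closed&lt;br /&gt;
&lt;br /&gt;
# ... later, after reconnecting to Mufasa via SSH:&lt;br /&gt;
screen -r                          # reattach to the screen session&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;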
&lt;br /&gt;
= Using execution scripts to run jobs =&lt;br /&gt;
&lt;br /&gt;
Previous Sections of this page explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line.&lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and -most importantly- can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each preceded by the keyword &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set&lt;br /&gt;
* have &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; as its very first line&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&lt;br /&gt;
&lt;br /&gt;
To execute the script, just open a terminal (such as the one provided by an SSH connection with Mufasa), write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press the &amp;lt;enter&amp;gt; key. The script is executed in the terminal, and any output (e.g., whatever is printed by any &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands in the script) is shown on the terminal.&lt;br /&gt;
&lt;br /&gt;
Within a bash script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of execution script (actual instructions are shown in bold; the rest are comments):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Note: these are examples. Put your own SBATCH directives below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --job-name=myjob&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; name assigned to the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --cpus-per-task=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of threads allocated to each task&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --mem-per-cpu=500M&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; amount of memory per CPU core&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --gres=gpu:10gb:1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of GPUs per node&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --partition=small&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the partition to run your jobs on&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --time=0-00:01:00&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; time assigned to your jobs to run (format: days-hours:minutes:seconds, with days optional)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------srun commands-----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Put your own srun command(s) below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------end of srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the example above shows, beyond the initial directive &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; the script includes a series of &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives used to specify parameter values, and finally one or more &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands that run the jobs. Any parameter accepted by commands &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can be used as a &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directive in an execution script.&lt;br /&gt;
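&lt;br /&gt;
As a compact, copy-ready sketch, a complete execution script using the same directives (and assuming a hypothetical program &amp;lt;code&amp;gt;./my_program&amp;lt;/code&amp;gt;; real jobs will typically run a Docker container as shown above) would look like this when viewed as a plain file:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=myjob&lt;br /&gt;
#SBATCH --cpus-per-task=1&lt;br /&gt;
#SBATCH --mem-per-cpu=500M&lt;br /&gt;
#SBATCH --gres=gpu:10gb:1&lt;br /&gt;
#SBATCH --partition=small&lt;br /&gt;
#SBATCH --time=0-00:01:00&lt;br /&gt;
&lt;br /&gt;
srun ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;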
&lt;br /&gt;
= Scratch Storage =&lt;br /&gt;
&lt;br /&gt;
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to execute accesses to the (mechanical and therefore relatively slow) HDDs where &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; partitions reside, substituting them with accesses to (solid-state and therefore much faster) SSDs.&lt;br /&gt;
&lt;br /&gt;
Each time a job is run via SLURM, this is what happens automatically:&lt;br /&gt;
&lt;br /&gt;
# Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;) to a cache space located on system SSDs&lt;br /&gt;
# Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files&lt;br /&gt;
# The executables create their output files in the cache space&lt;br /&gt;
# When the user jobs end, Mufasa copies the output files from the cache space back to the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares the executable (or the [[User Jobs#Using execution scripts to run jobs|execution script]]) in a subdirectory of their &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The caching mechanism requires that &amp;#039;&amp;#039;during job execution&amp;#039;&amp;#039; the user does not modify the contents of the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; subdirectory where the executable and data were located when the job was launched. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/overview.html SLURM&amp;#039;s own overview]:&lt;br /&gt;
&lt;br /&gt;
“&amp;#039;&amp;#039;User tools include&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html (link to SLURM docs)] to initiate jobs, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/scancel.html (link to SLURM docs)] to terminate queued or running jobs,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sinfo.html (link to SLURM docs)] to report system status,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/squeue.html (link to SLURM docs)] to report the status of jobs [i.e. to inspect the scheduling queue], and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sacct.html (link to SLURM docs)] to get information about jobs and job steps that are running or have completed.&amp;#039;&amp;#039;”&lt;br /&gt;
&lt;br /&gt;
An example of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
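To list only your own jobs, or to cancel one of them, the standard &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; options can be used as follows (the username and job ID below are taken from the example output above; remember that a Job User can only cancel their own jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u acasella&lt;br /&gt;
scancel 520&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;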
== Job state ==&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
; PD PENDING&lt;br /&gt;
: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
; R RUNNING&lt;br /&gt;
: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
; S SUSPENDED&lt;br /&gt;
: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
; CG COMPLETING&lt;br /&gt;
: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
; CD COMPLETED&lt;br /&gt;
: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them, reported here for completeness:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
; BF BOOT_FAIL&lt;br /&gt;
: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). &lt;br /&gt;
&lt;br /&gt;
; CA CANCELLED&lt;br /&gt;
: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. &lt;br /&gt;
&lt;br /&gt;
; CF CONFIGURING&lt;br /&gt;
: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting). &lt;br /&gt;
&lt;br /&gt;
; DL DEADLINE&lt;br /&gt;
: Job terminated on deadline. &lt;br /&gt;
&lt;br /&gt;
; F FAILED&lt;br /&gt;
: Job terminated with non-zero exit code or other failure condition. &lt;br /&gt;
&lt;br /&gt;
; NF NODE_FAIL&lt;br /&gt;
: Job terminated due to failure of one or more allocated nodes. &lt;br /&gt;
&lt;br /&gt;
; OOM OUT_OF_MEMORY&lt;br /&gt;
: Job experienced out of memory error. &lt;br /&gt;
&lt;br /&gt;
; PR PREEMPTED&lt;br /&gt;
: Job terminated due to preemption. &lt;br /&gt;
&lt;br /&gt;
; RD RESV_DEL_HOLD&lt;br /&gt;
: Job is being held after requested reservation was deleted. &lt;br /&gt;
&lt;br /&gt;
; RF REQUEUE_FED&lt;br /&gt;
: Job is being requeued by a federation. &lt;br /&gt;
&lt;br /&gt;
; RH REQUEUE_HOLD&lt;br /&gt;
: Held job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RQ REQUEUED&lt;br /&gt;
: Completing job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RS RESIZING&lt;br /&gt;
: Job is about to change size. &lt;br /&gt;
&lt;br /&gt;
; RV REVOKED&lt;br /&gt;
: Sibling was removed from cluster due to other cluster starting the job. &lt;br /&gt;
&lt;br /&gt;
; SI SIGNALING&lt;br /&gt;
: Job is being signaled. &lt;br /&gt;
&lt;br /&gt;
; SE SPECIAL_EXIT&lt;br /&gt;
: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. &lt;br /&gt;
&lt;br /&gt;
; SO STAGE_OUT&lt;br /&gt;
: Job is staging out files. &lt;br /&gt;
&lt;br /&gt;
; ST STOPPED&lt;br /&gt;
: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. &lt;br /&gt;
&lt;br /&gt;
; TO TIMEOUT&lt;br /&gt;
: Job terminated upon reaching its time limit.&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=326</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=326"/>
		<updated>2022-01-20T18:40:29Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Job caching */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of Mufasa that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Job Users are by necessity SLURM users (see [[System#The SLURM job scheduling system|The SLURM job scheduling system]]) so you may also want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
= SLURM Partitions =&lt;br /&gt;
&lt;br /&gt;
Several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039; in SLURM terminology. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug         up   infinite      1    mix gn01&lt;br /&gt;
small*        up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;small&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.&lt;br /&gt;
&lt;br /&gt;
On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to. A complete list of the features of each partition can be obtained with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo --Format=all&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but its output can be overwhelming. For instance, in the example above the output of &amp;lt;code&amp;gt;sinfo --Format=all&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A less comprehensive but more readable view of partition features can be obtained via a tailored &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command, i.e. one that only asks for the features that are most relevant to Mufasa users. An example of such a command is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a %.4c %.17B %.54G %.11l %.11L %.4r&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Such a command provides an output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
 PARTITION  AVAIL CPUS MAX_CPUS_PER_NODE                                                   GRES   TIMELIMIT DEFAULTTIME ROOT&lt;br /&gt;
     debug     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    infinite         n/a  yes&lt;br /&gt;
    small*     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    12:00:00       15:00   no&lt;br /&gt;
    normal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
longnormal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       gpu     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
   gpulong     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       fat     up   62                48  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns in this output correspond to the following information (from [https://slurm.schedmd.com/sinfo.html SLURM docs]), where the &amp;#039;&amp;#039;node&amp;#039;&amp;#039; is Mufasa:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
: %P Partition name followed by &amp;quot;*&amp;quot; for the default partition&lt;br /&gt;
: %a State/availability of a partition&lt;br /&gt;
: %c Number of CPUs per node [&amp;#039;&amp;#039;for Mufasa these are [[System#CPUs and GPUs|the 64 CPUs minus the 2 dedicated to non-SLURM jobs]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %B The max number of CPUs per node available to jobs in the partition&lt;br /&gt;
: %G Generic resources (gres) associated with the nodes [&amp;#039;&amp;#039;for Mufasa these correspond to the [[System#CPUs and GPUs|virtual GPUs defined with MIG]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %l Maximum time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %L Default time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %r Only user root may initiate jobs, &amp;quot;yes&amp;quot; or &amp;quot;no&amp;quot;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command used, field identifiers &amp;lt;code&amp;gt;%...&amp;lt;/code&amp;gt; have been preceded by width specifiers in the form &amp;lt;code&amp;gt;.N&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;N&amp;lt;/code&amp;gt; is a positive integer. The specifiers define how many characters to reserve for each field in the command output, and are used to increase readability.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
An important piece of information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
; up&lt;br /&gt;
: the partition is available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; drain&lt;br /&gt;
: the partition is not available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; down&lt;br /&gt;
: the same as &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; but the partition failed: i.e., it suffered a disruption&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039;. Jobs waiting for that partition remain paused until the partition becomes available again.&lt;br /&gt;
&lt;br /&gt;
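A quick way to check partition availability is to ask &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; for the partition name and state columns only, reusing the &amp;lt;code&amp;gt;-o&amp;lt;/code&amp;gt; format specifiers introduced above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;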
== Choosing the partition on which to run a job ==&lt;br /&gt;
&lt;br /&gt;
When launching a job (as explained in [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]]), a user should select the partition that is most suitable for it, based on the job&amp;#039;s features. Launching a job on a partition avoids the need for the user to specify explicitly all of the resources that the job requires, relying instead (for unspecified resources) on the default amounts defined for the partition.&lt;br /&gt;
&lt;br /&gt;
The fact that, by selecting the right partition for their job, a user can pre-define the job&amp;#039;s requirements without having to specify them explicitly makes partitions very handy and reduces the possibility of mistakes. If needed, however, users can change the resources requested by their jobs with respect to the default values associated with the chosen partition: any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. Still, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources whose default value is unsuitable.&lt;br /&gt;
&lt;br /&gt;
Resource requests by the user launching a job can be either lower or higher than the partition&amp;#039;s default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if one is set. If a user tries to run on a partition a job that requests more of a resource than the partition-specified maximum, the run command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the most important resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined maximum duration. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
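For instance, a command like the following (the program name is a placeholder) would run a job on the “normal” partition while overriding some of its defaults, requesting 8 CPUs, 64 GB of RAM and a 6-hour time limit:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p normal --cpus-per-task=8 --mem=64G --time=0-06:00:00 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Resources not mentioned explicitly fall back to the partition defaults; GPUs are the exception, since (as explained below) they are assigned only if explicitly requested with &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;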
An interesting part of the output of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; ([[User Jobs#SLURM Partitions|see above]]) is the one concerning the GPUs, since GPUs are usually the scarcest resource in a system such as Mufasa. For all partitions this part of the output is identical and equal to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The string above means that, whatever the partition on which it is run, a job can request up to two 40 GB GPUs, up to three 20 GB GPUs and up to six 10 GB GPUs (even all together): i.e., [[System#CPUs and GPUs|the complete set of 11 GPUs available in the system]]. In other words, at the moment none of the partitions defined on Mufasa sets a limit on the maximum number of GPUs that jobs are allowed to request.&lt;br /&gt;
&lt;br /&gt;
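For instance, a single job could request one GPU of each size by combining the three types into a comma-separated &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; list (a sketch; the program name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --gres=gpu:40gb:1,gpu:20gb:1,gpu:10gb:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;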
Of course, the larger the fraction of system resources that a job asks for, the heavier the job becomes for Mufasa&amp;#039;s limited capabilities. Since SLURM prioritises lighter jobs over heavier ones (in order to maximise the number of completed jobs), it is a very bad idea for a user to ask for more resources than their job actually needs: doing so will have the effect of delaying job execution, possibly for a long time.&lt;br /&gt;
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done. &lt;br /&gt;
&lt;br /&gt;
Considering that [[System#Docker Containers|all computation on Mufasa must occur within Docker containers]], the jobs run by Mufasa users always take the form of containers, except for menial, non-computationally-intensive tasks. The process of launching a user job on Mufasa involves two steps:&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
:; Step 1&lt;br /&gt;
:: [[User Jobs#Using SLURM to run a Docker container|Use SLURM to run the Docker container where the job will take place]]&lt;br /&gt;
&lt;br /&gt;
:; Step 2&lt;br /&gt;
:: [[User Jobs#Launching a user job from within a Docker container|Launch the job from within the Docker container]]&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
As an optional preparatory step, it is often useful to define an [[User Jobs#Using execution scripts to run jobs|execution script]] to simplify the launching process and reduce the possibility of mistakes.&lt;br /&gt;
&lt;br /&gt;
The commands that SLURM provides to run jobs are &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see SLURM documentation: [https://slurm.schedmd.com/srun.html srun], [https://slurm.schedmd.com/sbatch.html sbatch]). The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the former locks the shell from which it has been launched, so it is only really suitable for processes that use the console to interact with their user; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell: it simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
&lt;br /&gt;
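As a minimal illustration of the difference (the script name is a placeholder), the first command below keeps the shell busy until the program ends, while the second returns immediately after queueing the job and, by default, writes the job&amp;#039;s output to a file named &amp;lt;code&amp;gt;slurm-&amp;lt;jobid&amp;gt;.out&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_script.sh&lt;br /&gt;
sbatch -p small ./my_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;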
Among the options available for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, one of the most important is &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt; is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many of the GPUs the program requests for use. Since GPUs are the scarcest resource on Mufasa, this option must always be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
&lt;br /&gt;
As [[User Jobs#SLURM Partitions|already explained]], a quick way to define the set of resources that a program will have access to is to use option &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, will only be able to provide the job with resources that are available to the chosen partition. Jobs that request resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; on the partition named “small”. Running the program this way means that the resources associated with this partition will be available to it.&lt;br /&gt;
&lt;br /&gt;
= Using SLURM to run a Docker container =&lt;br /&gt;
&lt;br /&gt;
The first step to run a user job on Mufasa is to run the [[System#Docker Containers|Docker container]] where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p &amp;lt;partition_name&amp;gt; --container-image=&amp;lt;container_path.sqsh&amp;gt; --no-container-entrypoint --container-mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt; --gres=&amp;lt;gpu_resources&amp;gt; --mem=&amp;lt;mem_resources&amp;gt; --cpus-per-task &amp;lt;cpu_amount&amp;gt; --pty --time=&amp;lt;d-hh:mm:ss&amp;gt; &amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. These options are explained below.&lt;br /&gt;
&lt;br /&gt;
;-p &amp;lt;partition_name&amp;gt;&lt;br /&gt;
: specifies the resource partition on which the job will be run.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;--mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--cpus-per-task &amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;lt;code&amp;gt;--gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;: GPU resources, in fact, must always be explicitly requested with option &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt;, otherwise no access to GPUs is granted to the job.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑container-image=&amp;lt;container_path.sqsh&amp;gt;&lt;br /&gt;
: specifies the container to be run&lt;br /&gt;
&lt;br /&gt;
;‑‑no‑container‑entrypoint&lt;br /&gt;
: specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option &amp;lt;code&amp;gt;--no-container-entrypoint&amp;lt;/code&amp;gt; is useful when the user is not sure of the effect of that command.&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;lt;code&amp;gt;&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/code&amp;gt; takes the value &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; this tells srun to mount Mufasa&amp;#039;s directory &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; in position &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; within the filesystem of the Docker container. When the docker container reads or writes files in directory &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; of its own (internal) filesystem, what actually happens is that files in &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; get manipulated instead. &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&lt;br /&gt;
;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
: specifies what GPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:2&amp;lt;/code&amp;gt;, which corresponds to giving the job control of 2 entire large-size GPUs.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must always be explicitly requested with &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
: specifies the amount of RAM to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑cpus-per-task &amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
: specifies how many CPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑pty&lt;br /&gt;
: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[User Jobs#Running interactive jobs via SLURM|Running interactive jobs via SLURM]])&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies the maximum time allowed to the job to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
: the executable that will be run within the Docker container as soon as it is operative. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; to launch non-interactive programs.&lt;br /&gt;
&lt;br /&gt;
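Putting the options together, a complete invocation might look like the following sketch (the container image path, mount point and resource amounts are examples, not references to actual files available on Mufasa):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --container-image=/home/mrossi/pytorch.sqsh --no-container-entrypoint --container-mounts=/home/mrossi:/data --gres=gpu:10gb:1 --mem=64G --cpus-per-task 4 --pty --time=0-02:00:00 /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once the container is up, the user is left in a bash shell inside it, with Mufasa&amp;#039;s &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; visible as &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;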
== Nvidia Pyxis ==&lt;br /&gt;
&lt;br /&gt;
Some of the options described above are specifically dedicated to Docker containers: they are provided by the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. More specifically, options &amp;lt;code&amp;gt;--container-image&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--no-container-entrypoint&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--container-mounts&amp;lt;/code&amp;gt; are provided to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; by Pyxis.&lt;br /&gt;
&lt;br /&gt;
= Launching a user job from within a Docker container =&lt;br /&gt;
&lt;br /&gt;
Once the Docker container (run as [[User Jobs#Using SLURM to run a Docker container|explained here]]) is up and running, the user is dropped to the interactive environment specified by &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
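For instance, continuing the sketch above, where &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is mounted as &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; and the interactive environment is a bash shell, the user might launch a (hypothetical) training script with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
cd /data/my_project&lt;br /&gt;
python3 train.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;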
= Running interactive jobs via SLURM =&lt;br /&gt;
&lt;br /&gt;
As explained, SLURM command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with a command similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only use 2 CPUs). On the contrary, running programs with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with option &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;. For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;lt;code&amp;gt;(SLURM ID xx)&amp;lt;/code&amp;gt; (where &amp;lt;code&amp;gt;xx&amp;lt;/code&amp;gt; is the ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
= Detach from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell), you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, to create a screen session and run a job in it:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* From the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* In the screen session thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen session with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, while your process will go on running in the screen session&lt;br /&gt;
* You can now close the SSH connection to Mufasa without damaging your process&lt;br /&gt;
&lt;br /&gt;
Later, when you are ready to resume contact with your running process:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* In the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -r&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* You are now back to the screen session where you launched your job&lt;br /&gt;
&lt;br /&gt;
* When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A typical use case for &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; is a program written so that it prints progress messages as it goes on with its work: you can then check its advancement by periodically reconnecting to the screen session where the program is running and reading the messages it has printed.&lt;br /&gt;
&lt;br /&gt;
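If you keep several jobs running this way, it can help to give each screen session a name, which is a standard feature of &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; (the session name below is just an example):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -S myjob    # create a named session, then launch the job with srun inside it&lt;br /&gt;
screen -r myjob    # later, reattach to that specific session&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;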
= Using execution scripts to run jobs =&lt;br /&gt;
&lt;br /&gt;
Previous Sections of this page explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line.&lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and, most importantly, can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each on a line preceded by the &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directive&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set&lt;br /&gt;
* have &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; as its very first line&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&lt;br /&gt;
&lt;br /&gt;
To execute the script, just open a terminal (such as the one provided by an SSH connection with Mufasa), write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press the &amp;lt;enter&amp;gt; key. The script is executed in the terminal, and any output (e.g., whatever is printed by any &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands in the script) is shown on the terminal.&lt;br /&gt;
&lt;br /&gt;
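For instance, using the example name mentioned above, the executable flag can be set once with &amp;lt;code&amp;gt;chmod&amp;lt;/code&amp;gt;, after which the script can be run from the directory that contains it:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
chmod +x my_execution_script.sh&lt;br /&gt;
./my_execution_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;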
Within a bash script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of execution script (actual instructions are shown in bold; the rest are comments):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Note: these are examples. Put your own SBATCH directives below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --job-name=myjob&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; name assigned to the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --cpus-per-task=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of threads allocated to each task&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --mem-per-cpu=500M&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; amount of memory per CPU core&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --gres=gpu:10gb:1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of GPUs per node&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --partition=small&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the partition to run your jobs on&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --time=0-00:01:00&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; time assigned to your jobs to run (format: days-hours:minutes:seconds, with days optional)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------srun commands-----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Put your own srun command(s) below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------end of srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the example above shows, beyond the initial directive &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; the script includes a series of &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directives used to specify parameter values, and finally one or more &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands that run the jobs. Any parameter accepted by commands &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can be used as an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive in an execution script.&lt;br /&gt;
&lt;br /&gt;
= Use of Scratch Storage =&lt;br /&gt;
&lt;br /&gt;
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical and therefore relatively slow) HDDs where &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; partitions reside, substituting those accesses with accesses to (solid-state and therefore much faster) SSDs.&lt;br /&gt;
&lt;br /&gt;
Each time a job is run via SLURM, this is what happens automatically:&lt;br /&gt;
&lt;br /&gt;
# Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;) to a cache space located on system SSDs&lt;br /&gt;
# Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files&lt;br /&gt;
# The executables create their output files in the cache space&lt;br /&gt;
# When the user jobs end, Mufasa copies the output files from the cache space back to the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares the executable (or the [[User Jobs#Using execution scripts to run jobs|execution script]]) in a subdirectory of their &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The caching mechanism requires that &amp;#039;&amp;#039;during job execution&amp;#039;&amp;#039; the user does not modify the contents of the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; subdirectory where the executable and data were located when the job was launched. Any such change will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/overview.html SLURM&amp;#039;s own overview]:&lt;br /&gt;
&lt;br /&gt;
“&amp;#039;&amp;#039;User tools include&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html (link to SLURM docs)] to initiate jobs, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/scancel.html (link to SLURM docs)] to terminate queued or running jobs,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sinfo.html (link to SLURM docs)] to report system status,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/squeue.html (link to SLURM docs)] to report the status of jobs [i.e. to inspect the scheduling queue], and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sacct.html (link to SLURM docs)] to get information about jobs and job steps that are running or have completed.&amp;#039;&amp;#039;”&lt;br /&gt;
&lt;br /&gt;
An example of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Job state ==&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
; PD PENDING&lt;br /&gt;
: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
; R RUNNING&lt;br /&gt;
: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
; S SUSPENDED&lt;br /&gt;
: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
; CG COMPLETING&lt;br /&gt;
: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
; CD COMPLETED&lt;br /&gt;
: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
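These codes can also be used to filter the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; through its standard &amp;lt;code&amp;gt;--states&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;-t&amp;lt;/code&amp;gt;) option, for instance to list only pending and running jobs, or only your own running jobs (the username is a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --states=PENDING,RUNNING&lt;br /&gt;
squeue -u &amp;lt;username&amp;gt; -t RUNNING&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;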
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them, reported here for completeness:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
; BF BOOT_FAIL&lt;br /&gt;
: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). &lt;br /&gt;
&lt;br /&gt;
; CA CANCELLED&lt;br /&gt;
: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. &lt;br /&gt;
&lt;br /&gt;
; CF CONFIGURING&lt;br /&gt;
: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting). &lt;br /&gt;
&lt;br /&gt;
; DL DEADLINE&lt;br /&gt;
: Job terminated on deadline. &lt;br /&gt;
&lt;br /&gt;
; F FAILED&lt;br /&gt;
: Job terminated with non-zero exit code or other failure condition. &lt;br /&gt;
&lt;br /&gt;
; NF NODE_FAIL&lt;br /&gt;
: Job terminated due to failure of one or more allocated nodes. &lt;br /&gt;
&lt;br /&gt;
; OOM OUT_OF_MEMORY&lt;br /&gt;
: Job experienced out of memory error. &lt;br /&gt;
&lt;br /&gt;
; PR PREEMPTED&lt;br /&gt;
: Job terminated due to preemption. &lt;br /&gt;
&lt;br /&gt;
; RD RESV_DEL_HOLD&lt;br /&gt;
: Job is being held after requested reservation was deleted. &lt;br /&gt;
&lt;br /&gt;
; RF REQUEUE_FED&lt;br /&gt;
: Job is being requeued by a federation. &lt;br /&gt;
&lt;br /&gt;
; RH REQUEUE_HOLD&lt;br /&gt;
: Held job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RQ REQUEUED&lt;br /&gt;
: Completing job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RS RESIZING&lt;br /&gt;
: Job is about to change size. &lt;br /&gt;
&lt;br /&gt;
; RV REVOKED&lt;br /&gt;
: Sibling was removed from cluster due to other cluster starting the job. &lt;br /&gt;
&lt;br /&gt;
; SI SIGNALING&lt;br /&gt;
: Job is being signaled. &lt;br /&gt;
&lt;br /&gt;
; SE SPECIAL_EXIT&lt;br /&gt;
: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. &lt;br /&gt;
&lt;br /&gt;
; SO STAGE_OUT&lt;br /&gt;
: Job is staging out files. &lt;br /&gt;
&lt;br /&gt;
; ST STOPPED&lt;br /&gt;
: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. &lt;br /&gt;
&lt;br /&gt;
; TO TIMEOUT&lt;br /&gt;
: Job terminated upon reaching its time limit.&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=325</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=325"/>
		<updated>2022-01-20T18:35:17Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Using execution scripts to run jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of Mufasa that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Job Users are by necessity SLURM users (see [[System#The SLURM job scheduling system|The SLURM job scheduling system]]) so you may also want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
= SLURM Partitions =&lt;br /&gt;
&lt;br /&gt;
Several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039; in SLURM terminology. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug         up   infinite      1    mix gn01&lt;br /&gt;
small*        up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;small&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.&lt;br /&gt;
&lt;br /&gt;
On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to. A complete list of the features of each partition can be obtained with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo --Format=all&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but its output can be overwhelming. For instance, in the example above the output of &amp;lt;code&amp;gt;sinfo --Format=all&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A less comprehensive but more readable view of partition features can be obtained via a tailored &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command, i.e. one that only asks for the features that are most relevant to Mufasa users. An example of such a command is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a %.4c %.17B %.54G %.11l %.11L %.4r&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command provides output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
 PARTITION  AVAIL CPUS MAX_CPUS_PER_NODE                                                   GRES   TIMELIMIT DEFAULTTIME ROOT&lt;br /&gt;
     debug     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    infinite         n/a  yes&lt;br /&gt;
    small*     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    12:00:00       15:00   no&lt;br /&gt;
    normal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
longnormal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       gpu     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
   gpulong     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       fat     up   62                48  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns in this output correspond to the following information (from [https://slurm.schedmd.com/sinfo.html SLURM docs]), where the &amp;#039;&amp;#039;node&amp;#039;&amp;#039; is Mufasa:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
: %P Partition name followed by &amp;quot;*&amp;quot; for the default partition&lt;br /&gt;
: %a State/availability of a partition&lt;br /&gt;
: %c Number of CPUs per node [&amp;#039;&amp;#039;for Mufasa these are [[System#CPUs and GPUs|the 64 CPUs minus the 2 dedicated to non-SLURM jobs]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %B The max number of CPUs per node available to jobs in the partition&lt;br /&gt;
: %G Generic resources (gres) associated with the nodes [&amp;#039;&amp;#039;for Mufasa these correspond to the [[System#CPUs and GPUs|virtual GPUs defined with MIG]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %l Maximum time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %L Default time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %r Only user root may initiate jobs, &amp;quot;yes&amp;quot; or &amp;quot;no&amp;quot;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command used, field identifiers &amp;lt;code&amp;gt;%...&amp;lt;/code&amp;gt; have been preceded by width specifiers in the form &amp;lt;code&amp;gt;.N&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;N&amp;lt;/code&amp;gt; is a positive integer. The specifiers define how many characters to reserve for each field in the command output, and are used to increase readability.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
An important piece of information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
; up&lt;br /&gt;
: the partition is available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; drain&lt;br /&gt;
: the partition is not available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; down&lt;br /&gt;
: the same as &amp;#039;&amp;#039;drain&amp;#039;&amp;#039;, but due to a failure: i.e., the partition suffered a disruption&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039;. Jobs waiting for that partition remain paused until the partition becomes available again.&lt;br /&gt;
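&lt;br /&gt;
For a quick check of partition availability only, a narrower &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; format string can be used (a minimal sketch built from the same format codes documented above); it prints just the PARTITION and AVAIL columns:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.12P %.6a&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;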
&lt;br /&gt;
== Choosing the partition on which to run a job ==&lt;br /&gt;
&lt;br /&gt;
When launching a job (as explained in [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]]) a user should select the partition that is most suitable for it according to the job&amp;#039;s features. Launching a job on a partition avoids the need for the user to specify explicitly all of the resources that the job requires, relying instead (for unspecified resources) on the default amounts defined for the partition.&lt;br /&gt;
&lt;br /&gt;
The fact that by selecting the right partition for their job a user can pre-define the job&amp;#039;s requirements without having to specify them makes partitions very handy, and avoids possible mistakes. However, users can, if needed, change the resources requested by their jobs with respect to the default values associated with the chosen partition. Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests by the user launching a job can be either lower or higher than the partition&amp;#039;s default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if one is set. If a user tries to run a job on a partition while requesting more of a resource than the partition-specified maximum, the run command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the most important resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined maximum. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
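&lt;br /&gt;
As an illustration of these rules (a sketch with a hypothetical program name), on the “normal” partition shown above a job may ask for less time than the 1-day limit, while a request exceeding the partition&amp;#039;s 24-CPU maximum is refused:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
# accepted: 2 hours and 8 CPUs are within the limits of &amp;quot;normal&amp;quot;&lt;br /&gt;
srun -p normal --time=02:00:00 --cpus-per-task=8 ./my_program&lt;br /&gt;
&lt;br /&gt;
# refused: 32 CPUs exceed the 24-CPU maximum of &amp;quot;normal&amp;quot;&lt;br /&gt;
srun -p normal --cpus-per-task=32 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;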
&lt;br /&gt;
An interesting part of the output of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; ([[User Jobs#SLURM Partitions|see above]]) is the one concerning the GPUs, since GPUs are usually the least plentiful resource in a system such as Mufasa. For all partitions this part of the output is identical:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The string above means that, whatever the partition on which it is run, a job can request up to two 40 GB GPUs, up to three 20 GB GPUs and up to six 10 GB GPUs (even all together): i.e., [[System#CPUs and GPUs|the complete set of 11 GPUs available in the system]]. In other words, at the moment none of the partitions defined on Mufasa limits the maximum number of GPUs that jobs may request.&lt;br /&gt;
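&lt;br /&gt;
For instance (a sketch with a hypothetical program name; the &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; option is described in detail further down), a job run on the “gpu” partition could request two of the 10 GB virtual GPUs as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --gres=gpu:10gb:2 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;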
&lt;br /&gt;
Of course, the larger the fraction of system resources that a job asks for, the heavier the job becomes for Mufasa&amp;#039;s limited capabilities. Since SLURM prioritises lighter jobs over heavier ones (in order to maximise the number of completed jobs) it is a very bad idea for a user to request more resources for their job than it actually needs: this, in fact, will have the effect of delaying (possibly for a long time) job execution.&lt;br /&gt;
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done. &lt;br /&gt;
&lt;br /&gt;
Considering that [[System#Docker Containers|all computation on Mufasa must occur within Docker containers]], the jobs run by Mufasa users always take the form of containers, except for menial, non-computationally intensive tasks. The process of launching a user job on Mufasa involves two steps:&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
:; Step 1&lt;br /&gt;
:: [[User Jobs#Using SLURM to run a Docker container|Use SLURM to run the Docker container where the job will take place]]&lt;br /&gt;
&lt;br /&gt;
:; Step 2&lt;br /&gt;
:: [[User Jobs#Launching a user job from within a Docker container|Launch the job from within the Docker container]]&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
As an optional preparatory step, it is often useful to define an [[User Jobs#Using execution scripts to run jobs|execution script]] to simplify the launching process and reduce the possibility of mistakes.&lt;br /&gt;
&lt;br /&gt;
The commands that SLURM provides to run jobs are &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see SLURM documentation: [https://slurm.schedmd.com/srun.html srun], [https://slurm.schedmd.com/sbatch.html sbatch]). The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the first locks the shell from which it has been launched, so it is only really suitable for processes that use the console to interact with their user; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell and simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
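&lt;br /&gt;
For example (a minimal sketch assuming a hypothetical batch script name), a non-interactive job can be queued with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;; its console output is written by default to a file named &amp;lt;code&amp;gt;slurm-&amp;lt;jobid&amp;gt;.out&amp;lt;/code&amp;gt; in the directory from which the command was issued:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch -p normal ./my_batch_script.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;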
&lt;br /&gt;
Among the options available for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, one of the most important is &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt; is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many GPUs the program requests for use. Since GPUs are the scarcest resources of Mufasa, this option must always be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
&lt;br /&gt;
As [[User Jobs#SLURM Partitions|already explained]], a quick way to define the set of resources that a program will have access to is to use option &amp;lt;code&amp;gt;-p &amp;lt;partition name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as --gres=gpu:K, will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; on the partition named “small”. Running the program this way means that the resources associated to this partition will be available to it for use.&lt;br /&gt;
&lt;br /&gt;
= Using SLURM to run a Docker container =&lt;br /&gt;
&lt;br /&gt;
The first step to run a user job on Mufasa is to run the [[System#Docker Containers|Docker container]] where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p &amp;lt;partition_name&amp;gt; --container-image=&amp;lt;container_path.sqsh&amp;gt; --no-container-entrypoint --container-mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt; --gres=&amp;lt;gpu_resources&amp;gt; --mem=&amp;lt;mem_resources&amp;gt; --cpus-per-task &amp;lt;cpu_amount&amp;gt; --pty --time=&amp;lt;hh:mm:ss&amp;gt; &amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. Below these options are explained.&lt;br /&gt;
&lt;br /&gt;
;-p &amp;lt;partition_name&amp;gt;&lt;br /&gt;
: specifies the resource partition on which the job will be run.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;‑‑mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑cpus‑per‑task &amp;lt;cpu_number&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;‑‑time=&amp;lt;hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;lt;code&amp;gt;‑‑gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;: GPU resources, in fact, must always be explicitly requested with option &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;, otherwise no access to GPUs is granted to the job.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑container-image=&amp;lt;container_path.sqsh&amp;gt;&lt;br /&gt;
: specifies the container to be run&lt;br /&gt;
&lt;br /&gt;
;‑‑no‑container‑entrypoint&lt;br /&gt;
: specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option ‑‑no‑container‑entrypoint is useful when the user is not sure of the effect of such a command.&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;lt;code&amp;gt;&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/code&amp;gt; takes the value &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; this tells srun to mount Mufasa&amp;#039;s directory &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; in position &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; within the filesystem of the Docker container. When the docker container reads or writes files in directory &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; of its own (internal) filesystem, what actually happens is that files in &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; get manipulated instead. &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&lt;br /&gt;
;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
: specifies what GPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:2&amp;lt;/code&amp;gt;, which corresponds to giving the job control of 2 entire large‑size GPUs.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must always be explicitly requested with &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
: specifies the amount of RAM to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑cpus-per-task &amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
: specifies how many CPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑pty&lt;br /&gt;
: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[User Jobs#Running interactive jobs via SLURM|Running interactive jobs via SLURM]])&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies the maximum time the job is allowed to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
: the executable that will be run within the Docker container as soon as it is operative. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; to launch non-interactive programs.&lt;br /&gt;
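&lt;br /&gt;
Putting the pieces together, here is a sketch of a complete command (all paths and amounts are hypothetical examples, not prescribed values). It runs a container image stored in the user&amp;#039;s home, mounts the user&amp;#039;s home inside the container, requests one 20 GB GPU, 64 GB of RAM, 4 CPUs and 12 hours of run time on the “gpu” partition, and opens an interactive shell inside the container:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p gpu --container-image=/home/mrossi/containers/pytorch.sqsh --no-container-entrypoint --container-mounts=/home/mrossi:/data --gres=gpu:20gb:1 --mem=64G --cpus-per-task 4 --pty --time=12:00:00 /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;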
&lt;br /&gt;
== Nvidia Pyxis ==&lt;br /&gt;
&lt;br /&gt;
Some of the options described above are specifically dedicated to Docker containers: these are provided by the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. More specifically, options &amp;lt;code&amp;gt;‑‑container-image&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;‑‑no‑container‑entrypoint&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;‑‑container-mounts&amp;lt;/code&amp;gt; are provided to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; by Pyxis.&lt;br /&gt;
&lt;br /&gt;
= Launching a user job from within a Docker container =&lt;br /&gt;
&lt;br /&gt;
Once the Docker container (run as [[User Jobs#Using SLURM to run a Docker container|explained here]]) is up and running, the user is dropped to the interactive environment specified by &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
= Running interactive jobs via SLURM =&lt;br /&gt;
&lt;br /&gt;
As explained, SLURM command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with a command similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only access 2 CPUs). On the contrary, running programs with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;. For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;lt;code&amp;gt;(SLURM ID xx)&amp;lt;/code&amp;gt; (where &amp;lt;code&amp;gt;xx&amp;lt;/code&amp;gt; is the ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
= Detach from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, to create a screen session and run a job in it:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* From the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* In the screen session thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen session with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, while your process will go on running in the screen session&lt;br /&gt;
* You can now close the SSH connection to Mufasa without damaging your process&lt;br /&gt;
&lt;br /&gt;
Later, when you are ready to resume contact with your running process:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* In the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -r&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* You are now back to the screen session where you launched your job&lt;br /&gt;
&lt;br /&gt;
* When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A typical use case for screen is a program written so that it prints progress messages as it works. You can then check its progress by periodically reconnecting to the screen where the program is running and reading the messages it has printed.&lt;br /&gt;
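&lt;br /&gt;
The whole workflow, condensed into the commands involved (a sketch: the session name and the &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; options are hypothetical examples):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -S myjob                  # create a named screen session&lt;br /&gt;
srun -p small --pty /bin/bash    # inside the screen, run the job via SLURM&lt;br /&gt;
# detach with ctrl + A followed by D, then close the SSH connection&lt;br /&gt;
&lt;br /&gt;
# later, after reconnecting to Mufasa via SSH:&lt;br /&gt;
screen -ls                       # list existing screen sessions&lt;br /&gt;
screen -r myjob                  # reattach to the session&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;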
&lt;br /&gt;
= Using execution scripts to run jobs =&lt;br /&gt;
&lt;br /&gt;
Previous Sections of this page explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line.&lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and, most importantly, can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each preceded by the keyword &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set&lt;br /&gt;
* have &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; as its very first line&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&lt;br /&gt;
&lt;br /&gt;
To execute the script, just open a terminal (such as the one provided by an SSH connection with Mufasa), write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press the &amp;lt;enter&amp;gt; key. The script is executed in the terminal, and any output (e.g., whatever is printed by any &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands in the script) is shown on the terminal.&lt;br /&gt;
&lt;br /&gt;
Within a bash script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line). &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; lines are also comments as far as bash is concerned, but SLURM reads them when the script is submitted with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;. Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of an execution script (actual instructions are shown in bold; the rest are comments):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Note: these are examples. Put your own SBATCH directives below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --job-name=myjob&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; name assigned to the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --cpus-per-task=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of CPUs allocated to each task&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --mem-per-cpu=500M&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; amount of memory per CPU core&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --gres=gpu:10gb:1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; type and number of GPUs requested (here, one 10 GB GPU)&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --partition=small&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the partition to run your jobs on&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --time=0-00:01:00&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; time assigned to your jobs to run (format: days-hours:minutes:seconds, with days optional)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------srun commands-----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Put your own srun command(s) below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------end of srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the example above shows, beyond the initial directive &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; the script includes a series of &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directives used to specify parameter values, and finally one or more &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands that run the jobs. Any parameter accepted by commands &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can be used as an &amp;lt;code&amp;gt;SBATCH&amp;lt;/code&amp;gt; directive in an execution script.&lt;br /&gt;
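&lt;br /&gt;
As a concrete illustration, this is a sketch of what such a script might look like as a plain text file (job name, container image and paths are hypothetical examples):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
#SBATCH --job-name=myjob&lt;br /&gt;
#SBATCH --cpus-per-task=1&lt;br /&gt;
#SBATCH --mem-per-cpu=500M&lt;br /&gt;
#SBATCH --gres=gpu:10gb:1&lt;br /&gt;
#SBATCH --partition=small&lt;br /&gt;
#SBATCH --time=0-00:01:00&lt;br /&gt;
&lt;br /&gt;
# run the job: a container that mounts the user home directory and executes a hypothetical Python script&lt;br /&gt;
srun --container-image=/home/mrossi/containers/pytorch.sqsh --container-mounts=/home/mrossi:/data python /data/train.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;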
&lt;br /&gt;
= Job caching =&lt;br /&gt;
&lt;br /&gt;
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to execute accesses to the (mechanical and therefore relatively slow) HDDs where &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; partitions reside, substituting them with accesses to (solid-state and therefore much faster) SSDs.&lt;br /&gt;
&lt;br /&gt;
Each time a job is run via SLURM, this is what happens automatically:&lt;br /&gt;
&lt;br /&gt;
# Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;) to a cache space located on system SSDs&lt;br /&gt;
# Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files&lt;br /&gt;
# The executables create their output files in the cache space&lt;br /&gt;
# When the user jobs end, Mufasa copies the output files from the cache space back to the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares the executable (or the [[User Jobs#Using execution scripts to run jobs|execution script]]) in a subdirectory of their &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The caching mechanism requires that &amp;#039;&amp;#039;during job execution&amp;#039;&amp;#039; the user does not modify the contents of the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; subdirectory where executable and data were at execution time. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/overview.html SLURM&amp;#039;s own overview]:&lt;br /&gt;
&lt;br /&gt;
“&amp;#039;&amp;#039;User tools include&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html (link to SLURM docs)] to initiate jobs, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/scancel.html (link to SLURM docs)] to terminate queued or running jobs,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sinfo.html (link to SLURM docs)] to report system status,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/squeue.html (link to SLURM docs)] to report the status of jobs [i.e. to inspect the scheduling queue], and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sacct.html (link to SLURM docs)] to get information about jobs and job steps that are running or have completed.&amp;#039;&amp;#039;”&lt;br /&gt;
&lt;br /&gt;
An example of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
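&lt;br /&gt;
For example (the job IDs and user names refer to the sample &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; output above), a user can restrict the view to their own jobs and cancel one of them:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u amarzull    # show only the jobs belonging to user amarzull&lt;br /&gt;
scancel 523           # cancel job 523 (users can only cancel their own jobs)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;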
&lt;br /&gt;
== Job state ==&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
; PD PENDING&lt;br /&gt;
: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
; R RUNNING&lt;br /&gt;
: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
; S SUSPENDED&lt;br /&gt;
: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
; CG COMPLETING&lt;br /&gt;
: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
; CD COMPLETED&lt;br /&gt;
: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
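&lt;br /&gt;
For instance, &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; can be filtered by state; the following lists only pending and running jobs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue --states=PD,R&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;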
&lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them, reported here for completeness:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
; BF BOOT_FAIL&lt;br /&gt;
: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). &lt;br /&gt;
&lt;br /&gt;
; CA CANCELLED&lt;br /&gt;
: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. &lt;br /&gt;
&lt;br /&gt;
; CF CONFIGURING&lt;br /&gt;
: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting). &lt;br /&gt;
&lt;br /&gt;
; DL DEADLINE&lt;br /&gt;
: Job terminated on deadline. &lt;br /&gt;
&lt;br /&gt;
; F FAILED&lt;br /&gt;
: Job terminated with non-zero exit code or other failure condition. &lt;br /&gt;
&lt;br /&gt;
; NF NODE_FAIL&lt;br /&gt;
: Job terminated due to failure of one or more allocated nodes. &lt;br /&gt;
&lt;br /&gt;
; OOM OUT_OF_MEMORY&lt;br /&gt;
: Job experienced out of memory error. &lt;br /&gt;
&lt;br /&gt;
; PR PREEMPTED&lt;br /&gt;
: Job terminated due to preemption. &lt;br /&gt;
&lt;br /&gt;
; RD RESV_DEL_HOLD&lt;br /&gt;
: Job is being held after requested reservation was deleted. &lt;br /&gt;
&lt;br /&gt;
; RF REQUEUE_FED&lt;br /&gt;
: Job is being requeued by a federation. &lt;br /&gt;
&lt;br /&gt;
; RH REQUEUE_HOLD&lt;br /&gt;
: Held job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RQ REQUEUED&lt;br /&gt;
: Completing job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RS RESIZING&lt;br /&gt;
: Job is about to change size. &lt;br /&gt;
&lt;br /&gt;
; RV REVOKED&lt;br /&gt;
: Sibling was removed from cluster due to other cluster starting the job. &lt;br /&gt;
&lt;br /&gt;
; SI SIGNALING&lt;br /&gt;
: Job is being signaled. &lt;br /&gt;
&lt;br /&gt;
; SE SPECIAL_EXIT&lt;br /&gt;
: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. &lt;br /&gt;
&lt;br /&gt;
; SO STAGE_OUT&lt;br /&gt;
: Job is staging out files. &lt;br /&gt;
&lt;br /&gt;
; ST STOPPED&lt;br /&gt;
: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. &lt;br /&gt;
&lt;br /&gt;
; TO TIMEOUT&lt;br /&gt;
: Job terminated upon reaching its time limit.&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=324</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=324"/>
		<updated>2022-01-20T18:22:27Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Detach from a running job with screen */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page presents the features of Mufasa that are most relevant to Mufasa&amp;#039;s [[Roles|Job Users]]. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Job Users are by necessity SLURM users (see [[System#The SLURM job scheduling system|The SLURM job scheduling system]]) so you may also want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
= SLURM Partitions =&lt;br /&gt;
&lt;br /&gt;
Several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;&amp;#039;partitions&amp;#039;&amp;#039;&amp;#039; in SLURM terminology. Each partition has features (in term of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
([https://slurm.schedmd.com/sinfo.html link to SLURM docs]) provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
debug         up   infinite      1    mix gn01&lt;br /&gt;
small*        up   12:00:00      1    mix gn01&lt;br /&gt;
normal        up 1-00:00:00      1    mix gn01&lt;br /&gt;
longnormal    up 3-00:00:00      1    mix gn01&lt;br /&gt;
gpu           up 1-00:00:00      1    mix gn01&lt;br /&gt;
gpulong       up 3-00:00:00      1    mix gn01&lt;br /&gt;
fat           up 3-00:00:00      1    mix gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. The asterisk beside &amp;quot;small&amp;quot; indicates that this is the default partition, i.e. the one that SLURM selects to run a job when no partition has been specified.&lt;br /&gt;
&lt;br /&gt;
On Mufasa, partition names have been chosen to reflect the type of job that they are dedicated to. A complete list of the features of each partition can be obtained with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo --Format=all&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but its output can be overwhelming. For instance, in the example above the output of &amp;lt;code&amp;gt;sinfo --Format=all&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1 &lt;br /&gt;
up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A less comprehensive but more readable view of partition features can be obtained via a tailored &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command, i.e. one that only asks for the features that are most relevant to Mufasa users. An example of such a command is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo -o &amp;quot;%.10P %.6a %.4c %.17B %.54G %.11l %.11L %.4r&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command provides output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
 PARTITION  AVAIL CPUS MAX_CPUS_PER_NODE                                                   GRES   TIMELIMIT DEFAULTTIME ROOT&lt;br /&gt;
     debug     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    infinite         n/a  yes&lt;br /&gt;
    small*     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)    12:00:00       15:00   no&lt;br /&gt;
    normal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
longnormal     up   62                24  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       gpu     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  1-00:00:00       15:00   no&lt;br /&gt;
   gpulong     up   62         UNLIMITED  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
       fat     up   62                48  gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)  3-00:00:00     1:00:00   no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The columns in this output correspond to the following information (from [https://slurm.schedmd.com/sinfo.html SLURM docs]), where the &amp;#039;&amp;#039;node&amp;#039;&amp;#039; is Mufasa:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
: %P Partition name followed by &amp;quot;*&amp;quot; for the default partition&lt;br /&gt;
: %a State/availability of a partition&lt;br /&gt;
: %c Number of CPUs per node [&amp;#039;&amp;#039;for Mufasa these are [[System#CPUs and GPUs|the 64 CPUs minus the 2 dedicated to non-SLURM jobs]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %B The max number of CPUs per node available to jobs in the partition&lt;br /&gt;
: %G Generic resources (gres) associated with the nodes [&amp;#039;&amp;#039;for Mufasa these correspond to the [[System#CPUs and GPUs|virtual GPUs defined with MIG]]&amp;#039;&amp;#039;]&lt;br /&gt;
: %l Maximum time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %L Default time for any job in the format &amp;quot;days-hours:minutes:seconds&amp;quot;&lt;br /&gt;
: %r Only user root may initiate jobs, &amp;quot;yes&amp;quot; or &amp;quot;no&amp;quot;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; command used, field identifiers &amp;lt;code&amp;gt;%...&amp;lt;/code&amp;gt; have been preceded by width specifiers in the form &amp;lt;code&amp;gt;.N&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;N&amp;lt;/code&amp;gt; is a positive integer. The specifiers define how many characters to reserve for each field in the command output, and are used to increase readability.&lt;br /&gt;
&lt;br /&gt;
== Partition availability ==&lt;br /&gt;
&lt;br /&gt;
An important piece of information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides is the &amp;#039;&amp;#039;availability&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;state&amp;#039;&amp;#039;) of partitions. Possible partition states are:&lt;br /&gt;
&lt;br /&gt;
; up&lt;br /&gt;
: the partition is available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; drain&lt;br /&gt;
: the partition is not available to be allocated work&lt;br /&gt;
&lt;br /&gt;
; down&lt;br /&gt;
: the same as &amp;#039;&amp;#039;drain&amp;#039;&amp;#039;, but due to a failure: i.e., the partition suffered a disruption&lt;br /&gt;
&lt;br /&gt;
A partition in state &amp;#039;&amp;#039;drain&amp;#039;&amp;#039; or &amp;#039;&amp;#039;down&amp;#039;&amp;#039; requires intervention by a [[Roles|Job Administrator]] to be restored to &amp;#039;&amp;#039;up&amp;#039;&amp;#039;. Jobs waiting for that partition remain paused until the partition becomes available again.&lt;br /&gt;
&lt;br /&gt;
== Choosing the partition on which to run a job ==&lt;br /&gt;
&lt;br /&gt;
When launching a job (as explained in [[User Jobs#Executing jobs on Mufasa|Executing jobs on Mufasa]]) a user should select the partition that is most suitable for it according to the job&amp;#039;s features. Launching a job on a partition avoids the need for the user to specify explicitly all of the resources that the job requires, relying instead (for unspecified resources) on the default amounts defined for the partition.&lt;br /&gt;
&lt;br /&gt;
The fact that by selecting the right partition for their job a user can pre-define the job&amp;#039;s requirements without having to specify them makes partitions very handy, and avoids possible mistakes. However, users can, if needed, change the resources requested by their jobs with respect to the default values associated with the chosen partition. Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job, so users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests by the user launching a job can be either lower or higher than the partition&amp;#039;s default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if one is set. If a user tries to run a job on a partition while requesting more of a resource than the partition-specified maximum, the run command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the most important resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined maximum. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
An interesting part of the output of &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt; ([[User Jobs#SLURM Partitions|see above]]) is the one concerning the GPUs, since GPUs are usually the least plentiful resource in a system such as Mufasa. For all partitions this part of the output is identical:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The string above means that, whatever the partition on which it is run, a job can request up to two 40 GB GPUs, up to three 20 GB GPUs and up to six 10 GB GPUs (even all together): i.e., [[System#CPUs and GPUs|the complete set of 11 GPUs available in the system]]. In other words, at the moment none of the partitions defined on Mufasa limits the maximum number of GPUs that jobs may request.&lt;br /&gt;
&lt;br /&gt;
Of course, the larger the fraction of system resources that a job asks for, the heavier the job becomes for Mufasa&amp;#039;s limited capabilities. Since SLURM prioritises lighter jobs over heavier ones (in order to maximise the number of completed jobs) it is a very bad idea for a user to request more resources for their job than it actually needs: this, in fact, will have the effect of delaying (possibly for a long time) job execution.&lt;br /&gt;
&lt;br /&gt;
= Executing jobs on Mufasa =&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation for Mufasa users: what follows explains how it is done. &lt;br /&gt;
&lt;br /&gt;
Considering that [[System#Docker Containers|all computation on Mufasa must occur within Docker containers]], the jobs run by Mufasa users are always containers except for menial, non-computationally intensive jobs. The process of launching a user job on Mufasa involves two steps:&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
:; Step 1&lt;br /&gt;
:: [[User Jobs#Using SLURM to run a Docker container|Use SLURM to run the Docker container where the job will take place]]&lt;br /&gt;
&lt;br /&gt;
:; Step 2&lt;br /&gt;
:: [[User Jobs#Launching a user job from within a Docker container|Launch the job from within the Docker container]]&lt;br /&gt;
----&lt;br /&gt;
----&lt;br /&gt;
As an optional preparatory step, it is often useful to define an [[User Jobs#Using execution scripts to run jobs|execution script]] to simplify the launching process and reduce the possibility of mistakes.&lt;br /&gt;
&lt;br /&gt;
The commands that SLURM provides to run jobs are &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch &amp;lt;options&amp;gt; &amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(see SLURM documentation: [https://slurm.schedmd.com/srun.html srun], [https://slurm.schedmd.com/sbatch.html sbatch]). The main difference between &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; is that the first locks the shell from which it has been launched, so it is only really suitable for processes that use the console to interact with their user; &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, on the other hand, does not lock the shell and simply adds the job to the queue, but does not allow the user to interact with the process while it is running.&lt;br /&gt;
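&lt;br /&gt;
As a minimal sketch of the non-interactive workflow (the script name is a placeholder), a job wrapped in an [[User Jobs#Using execution scripts to run jobs|execution script]] can be queued with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; and its position in the queue then checked with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sbatch ./my_execution_script.sh&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; prints the ID assigned to the job, which can then be looked up in the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;.&lt;br /&gt;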
&lt;br /&gt;
Among the options available for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, one of the most important is &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt; is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many GPUs the program requests for use. Since GPUs are the scarcest resource of Mufasa, this option must always be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
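&lt;br /&gt;
For instance, to run a program without selecting a partition while still ensuring that it gets access to one GPU:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;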
&lt;br /&gt;
As [[User Jobs#SLURM Partitions|already explained]], a quick way to define the set of resources that a program will have access to is to use option &amp;lt;code&amp;gt;-p &amp;lt;partition name&amp;gt;&amp;lt;/code&amp;gt;.&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;, will only be able to provide the job with resources that are available to the chosen partition. Jobs that request resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; on the partition named “small”. Running the program this way means that the resources associated to this partition will be available to it for use.&lt;br /&gt;
&lt;br /&gt;
= Using SLURM to run a Docker container =&lt;br /&gt;
&lt;br /&gt;
The first step to run a user job on Mufasa is to run the [[System#Docker Containers|Docker container]] where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p &amp;lt;partition_name&amp;gt; --container-image=&amp;lt;container_path.sqsh&amp;gt; --no-container-entrypoint --container-mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt; --gres=&amp;lt;gpu_resources&amp;gt; --mem=&amp;lt;mem_resources&amp;gt; --cpus-per-task &amp;lt;cpu_amount&amp;gt; --pty --time=&amp;lt;d-hh:mm:ss&amp;gt; &amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. Below these options are explained.&lt;br /&gt;
&lt;br /&gt;
;-p &amp;lt;partition_name&amp;gt;&lt;br /&gt;
: specifies the resource partition on which the job will be run.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! If &amp;lt;code&amp;gt;-p &amp;lt;partition_name&amp;gt;&amp;lt;/code&amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;lt;code&amp;gt;--mem=&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--cpus-per-task &amp;lt;cpu_number&amp;gt;&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;lt;code&amp;gt;--gres=&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt;: GPU resources, in fact, must always be explicitly requested with option &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt;, otherwise no access to GPUs is granted to the job.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑container-image=&amp;lt;container_path.sqsh&amp;gt;&lt;br /&gt;
: specifies the container to be run&lt;br /&gt;
&lt;br /&gt;
;‑‑no‑container‑entrypoint&lt;br /&gt;
: specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option ‑‑no‑container‑entrypoint is useful when the user is not sure of the effect of such command.&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑container‑mounts=&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;lt;code&amp;gt;&amp;lt;mufasa_dir&amp;gt;:&amp;lt;docker_dir&amp;gt;&amp;lt;/code&amp;gt; takes the value &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; this tells srun to mount Mufasa&amp;#039;s directory &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; in position &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; within the filesystem of the Docker container. When the docker container reads or writes files in directory &amp;lt;code&amp;gt;/data&amp;lt;/code&amp;gt; of its own (internal) filesystem, what actually happens is that files in &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; get manipulated instead. &amp;lt;code&amp;gt;/home/mrossi&amp;lt;/code&amp;gt; is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&lt;br /&gt;
;‑‑gres=&amp;lt;gpu_resources&amp;gt;&lt;br /&gt;
: specifies what GPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;gpu_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;gpu:40gb:2&amp;lt;/code&amp;gt;, which corresponds to giving the job control of 2 entire large-size GPUs.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Important! The &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must always be explicitly requested with &amp;lt;code&amp;gt;‑‑gres&amp;lt;/code&amp;gt;.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
;‑‑mem=&amp;lt;mem_resources&amp;gt;&lt;br /&gt;
: specifies the amount of RAM to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;mem_resources&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;200G&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑cpus-per-task &amp;lt;cpu_amount&amp;gt;&lt;br /&gt;
: specifies how many CPUs to assign to the container; for instance, &amp;lt;code&amp;gt;&amp;lt;cpu_amount&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;‑‑pty&lt;br /&gt;
: specifies that the job will be interactive (this is necessary when &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;: see [[User Jobs#Running interactive jobs via SLURM|Running interactive jobs via SLURM]])&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;nowiki&amp;gt;‑‑time=&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
: specifies the maximum time allowed to the job to run, in the format &amp;lt;code&amp;gt;days-hours:minutes:seconds&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;days&amp;lt;/code&amp;gt; is optional; for instance, &amp;lt;code&amp;gt;&amp;lt;d-hh:mm:ss&amp;gt;&amp;lt;/code&amp;gt; may be &amp;lt;code&amp;gt;72:00:00&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;&amp;lt;command_to_run_within_container&amp;gt;&lt;br /&gt;
: the executable that will be run within the Docker container as soon as it is operative. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt;. This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt; to launch non-interactive programs.&lt;br /&gt;
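&lt;br /&gt;
Putting the options together, a complete invocation could look like the following sketch (the container image path, partition, GPU type and resource amounts are illustrative placeholders; the mount is the &amp;lt;code&amp;gt;/home/mrossi:/data&amp;lt;/code&amp;gt; example used above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --container-image=/home/mrossi/my_container.sqsh --no-container-entrypoint --container-mounts=/home/mrossi:/data --gres=gpu:10gb:1 --mem=32G --cpus-per-task 2 --pty --time=0-04:00:00 /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once the container&amp;#039;s shell prompt appears, the job itself is launched as described in [[User Jobs#Launching a user job from within a Docker container|Launching a user job from within a Docker container]].&lt;br /&gt;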
&lt;br /&gt;
== Nvidia Pyxis ==&lt;br /&gt;
&lt;br /&gt;
Some of the options described above are specifically dedicated to Docker containers: these are provided by the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package that has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. More specifically, options &amp;lt;code&amp;gt;--container-image&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--no-container-entrypoint&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--container-mounts&amp;lt;/code&amp;gt; are provided to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; by Pyxis.&lt;br /&gt;
&lt;br /&gt;
= Launching a user job from within a Docker container =&lt;br /&gt;
&lt;br /&gt;
Once the Docker container (run as [[User Jobs#Using SLURM to run a Docker container|explained here]]) is up and running, the user is dropped to the interactive environment specified by &amp;lt;code&amp;gt;&amp;lt;command_to_run_within_container&amp;gt;&amp;lt;/code&amp;gt;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
= Running interactive jobs via SLURM =&lt;br /&gt;
&lt;br /&gt;
As explained, SLURM command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with a command similar to&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run (as with any other shell)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Of course, also the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can be used to run programs: however, programs launched this way are not being run via SLURM and are not able to access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them, and they can only access 2 CPUs). On the contrary, running programs with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter &amp;lt;code&amp;gt;--gres=gpu:K&amp;lt;/code&amp;gt;. For instance, in order to run an interactive program which needs one GPU we may first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun --gres=gpu:1 --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun -p small --pty /bin/bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;lt;code&amp;gt;(SLURM ID xx)&amp;lt;/code&amp;gt; (where &amp;lt;code&amp;gt;xx&amp;lt;/code&amp;gt; is the ID of the &amp;lt;code&amp;gt;/bin/bash&amp;lt;/code&amp;gt; process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM-run one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or one run via SLURM is to execute command&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
echo $SLURM_JOB_ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
= Detach from a running job with &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; =&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should use command &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; inside a &amp;#039;&amp;#039;screen session&amp;#039;&amp;#039; (often simply called &amp;quot;a screen&amp;quot;), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;lt;code&amp;gt;screen&amp;lt;/code&amp;gt; available online). Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, to create a screen session and run a job in it:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* From the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* In the screen session thus created (it has the look of an empty shell), launch your job with &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;Detach&amp;#039;&amp;#039; from the screen session with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, while your process will go on running in the screen session&lt;br /&gt;
* You can now close the SSH connection to Mufasa without damaging your process&lt;br /&gt;
&lt;br /&gt;
Later, when you are ready to resume contact with your running process:&lt;br /&gt;
&lt;br /&gt;
* Connect to Mufasa with SSH&lt;br /&gt;
* In the Mufasa shell, run&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -r&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* You are now back to the screen session where you launched your job&lt;br /&gt;
&lt;br /&gt;
* When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;\&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A use case for screen is writing your program in such a way that it prints progress advancement messages as it goes on with its work. Then, you can check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.&lt;br /&gt;
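&lt;br /&gt;
The whole cycle can be condensed as in the following sketch (the session name &amp;lt;code&amp;gt;myjob&amp;lt;/code&amp;gt; and the partition are placeholders; &amp;lt;code&amp;gt;screen -S&amp;lt;/code&amp;gt; simply gives the session a name, which makes it easier to reattach to the right one if you have several):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
screen -S myjob                  # create a screen session named myjob&lt;br /&gt;
srun -p small --pty /bin/bash    # inside the screen: run the job (or a shell) via SLURM&lt;br /&gt;
# detach with ctrl+A followed by D, then close the SSH connection if you wish&lt;br /&gt;
screen -r myjob                  # later, after reconnecting via SSH: reattach to the session&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;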
&lt;br /&gt;
= Using execution scripts to run jobs =&lt;br /&gt;
&lt;br /&gt;
Previous Sections of this page explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line.&lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and -most importantly- can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each on a line beginning with the keyword &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt;&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that launch jobs with SLURM using the parameter values specified in the preamble&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set&lt;br /&gt;
* have &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; as its very first line&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh,&amp;#039;&amp;#039; such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;, but this is not mandatory.&lt;br /&gt;
&lt;br /&gt;
To execute the script, just open a terminal (such as the one provided by an SSH connection with Mufasa), write the scripts&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press the &amp;lt;enter&amp;gt; key. The script is executed in the terminal, and any output (e.g., whatever is printed by any &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands in the script) is shown on the terminal.&lt;br /&gt;
&lt;br /&gt;
Within a bash script, lines preceded by &amp;lt;code&amp;gt;#&amp;lt;/code&amp;gt; are comments (with the notable exception of the initial &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; line). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of execution script (actual instructions are shown in bold; the rest are comments):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------start of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Note: these are examples. Put your own SBATCH directives below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --job-name=myjob&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; name assigned to the job&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --cpus-per-task=1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of threads allocated to each task&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --mem-per-cpu=500M&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; amount of memory per CPU core&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --gres=gpu:1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; number of GPUs per node&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --partition=small&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; the partition to run your jobs on&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;#SBATCH --time=0-00:01:00&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; time assigned to your jobs to run (format: days-hours:minutes:seconds, with days optional)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt;----------------end of preamble----------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------srun commands-----------------&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; Put your own srun command(s) below&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;srun ...&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;#&amp;lt;/nowiki&amp;gt; ----------------end of srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As the example above shows, beyond the initial directive &amp;lt;code&amp;gt;#!/bin/bash&amp;lt;/code&amp;gt; the script includes a series of &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives used to specify parameter values, and finally one or more &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands that run the jobs. Any parameter accepted by commands &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; can be used in a &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directive in an execution script.&lt;br /&gt;
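&lt;br /&gt;
For convenience, the same example is condensed below into a single copy-and-paste-able file (a sketch: the job name, the resource amounts and the &amp;lt;code&amp;gt;my_program&amp;lt;/code&amp;gt; executable are placeholders to be replaced with your own):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=myjob&lt;br /&gt;
#SBATCH --cpus-per-task=1&lt;br /&gt;
#SBATCH --mem-per-cpu=500M&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --partition=small&lt;br /&gt;
#SBATCH --time=0-00:01:00&lt;br /&gt;
&lt;br /&gt;
# launch the actual job (placeholder executable)&lt;br /&gt;
srun ./my_program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;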
&lt;br /&gt;
= Job caching =&lt;br /&gt;
&lt;br /&gt;
When a job is run via SLURM (with or without an execution script), Mufasa exploits a (fully transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to execute accesses to the (mechanical and therefore relatively slow) HDDs where &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; partitions reside, substituting them with accesses to (solid-state and therefore much faster) SSDs.&lt;br /&gt;
&lt;br /&gt;
Each time a job is run via SLURM, this is what happens automatically:&lt;br /&gt;
&lt;br /&gt;
# Mufasa temporarily copies code and associated data from the directory where the executables are located (in the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;) to a cache space located on system SSDs&lt;br /&gt;
# Mufasa launches the cached copy of the user executables, using the cached copies of the data as its input files&lt;br /&gt;
# The executables create their output files in the cache space&lt;br /&gt;
# When the user jobs end, Mufasa copies the output files from the cache space back to the user&amp;#039;s own &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares the executable (or the [[User Jobs#Using execution scripts to run jobs|execution script]]) in a subdirectory of their &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; directory and runs the job. When job execution is complete, the user finds their output data in the origin subdirectory of &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt;, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The caching mechanism requires that &amp;#039;&amp;#039;during job execution&amp;#039;&amp;#039; the user does not modify the contents of the &amp;lt;code&amp;gt;/home&amp;lt;/code&amp;gt; subdirectory where the executable and data were located when the job was launched. Any such change, in fact, will be overwritten by Mufasa at the end of the execution, when files are copied back from the caching space.&lt;br /&gt;
&lt;br /&gt;
= Monitoring and managing jobs =&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From [https://slurm.schedmd.com/overview.html SLURM&amp;#039;s own overview]:&lt;br /&gt;
&lt;br /&gt;
“&amp;#039;&amp;#039;User tools include&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
srun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html (link to SLURM docs)] to initiate jobs, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/scancel.html (link to SLURM docs)] to terminate queued or running jobs,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sinfo.html (link to SLURM docs)] to report system status,&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/squeue.html (link to SLURM docs)] to report the status of jobs [i.e. to inspect the scheduling queue], and&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/sacct.html (link to SLURM docs)] to get information about jobs and job steps that are running or have completed.&amp;#039;&amp;#039;”&lt;br /&gt;
&lt;br /&gt;
An example of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
  520       fat     bash acasella  R 2-04:10:25      1 gn01&lt;br /&gt;
  523       fat     bash amarzull  R    1:30:35      1 gn01&lt;br /&gt;
  522       gpu     bash    clena  R   20:51:16      1 gn01&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
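&lt;br /&gt;
For instance, with the queue shown above, user &amp;lt;code&amp;gt;amarzull&amp;lt;/code&amp;gt; could cancel their own job (and only their own) by passing its JOBID to &amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre style=&amp;quot;color: lightgrey; background: black;&amp;quot;&amp;gt;&lt;br /&gt;
scancel 523&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;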
&lt;br /&gt;
== Job state ==&lt;br /&gt;
&lt;br /&gt;
Jobs typically pass through several states in the course of their execution. Job state is shown in column &amp;quot;ST&amp;quot; of the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; as an abbreviated code (e.g., &amp;quot;R&amp;quot; for RUNNING).&lt;br /&gt;
&lt;br /&gt;
The most relevant codes and states are the following:&lt;br /&gt;
&lt;br /&gt;
; PD PENDING&lt;br /&gt;
: Job is awaiting resource allocation. &lt;br /&gt;
&lt;br /&gt;
; R RUNNING&lt;br /&gt;
: Job currently has an allocation.&lt;br /&gt;
&lt;br /&gt;
; S SUSPENDED&lt;br /&gt;
: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. &lt;br /&gt;
 &lt;br /&gt;
; CG COMPLETING&lt;br /&gt;
: Job is in the process of completing. Some processes on some nodes may still be active. &lt;br /&gt;
&lt;br /&gt;
; CD COMPLETED&lt;br /&gt;
: Job has terminated all processes on all nodes with an exit code of zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Beyond these, there are other (less frequent) job states. [https://slurm.schedmd.com/squeue.html The SLURM doc page for &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;] provides a complete list of them, reported here for completeness:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&lt;br /&gt;
; BF BOOT_FAIL&lt;br /&gt;
: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). &lt;br /&gt;
&lt;br /&gt;
; CA CANCELLED&lt;br /&gt;
: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. &lt;br /&gt;
&lt;br /&gt;
; CF CONFIGURING&lt;br /&gt;
: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting). &lt;br /&gt;
&lt;br /&gt;
; DL DEADLINE&lt;br /&gt;
: Job terminated on deadline. &lt;br /&gt;
&lt;br /&gt;
; F FAILED&lt;br /&gt;
: Job terminated with non-zero exit code or other failure condition. &lt;br /&gt;
&lt;br /&gt;
; NF NODE_FAIL&lt;br /&gt;
: Job terminated due to failure of one or more allocated nodes. &lt;br /&gt;
&lt;br /&gt;
; OOM OUT_OF_MEMORY&lt;br /&gt;
: Job experienced out of memory error. &lt;br /&gt;
&lt;br /&gt;
; PR PREEMPTED&lt;br /&gt;
: Job terminated due to preemption. &lt;br /&gt;
&lt;br /&gt;
; RD RESV_DEL_HOLD&lt;br /&gt;
: Job is being held after requested reservation was deleted. &lt;br /&gt;
&lt;br /&gt;
; RF REQUEUE_FED&lt;br /&gt;
: Job is being requeued by a federation. &lt;br /&gt;
&lt;br /&gt;
; RH REQUEUE_HOLD&lt;br /&gt;
: Held job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RQ REQUEUED&lt;br /&gt;
: Completing job is being requeued. &lt;br /&gt;
&lt;br /&gt;
; RS RESIZING&lt;br /&gt;
: Job is about to change size. &lt;br /&gt;
&lt;br /&gt;
; RV REVOKED&lt;br /&gt;
: Sibling was removed from cluster due to other cluster starting the job. &lt;br /&gt;
&lt;br /&gt;
; SI SIGNALING&lt;br /&gt;
: Job is being signaled. &lt;br /&gt;
&lt;br /&gt;
; SE SPECIAL_EXIT&lt;br /&gt;
: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value. &lt;br /&gt;
&lt;br /&gt;
; SO STAGE_OUT&lt;br /&gt;
: Job is staging out files. &lt;br /&gt;
&lt;br /&gt;
; ST STOPPED&lt;br /&gt;
: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. &lt;br /&gt;
&lt;br /&gt;
; TO TIMEOUT&lt;br /&gt;
: Job terminated upon reaching its time limit.&lt;br /&gt;
&amp;lt;/small&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=5</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=5"/>
		<updated>2021-12-22T13:56:59Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= &amp;lt;span id=&amp;quot;anchor-8&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Mufasa for Job Users =&lt;br /&gt;
&lt;br /&gt;
This section briefly presents the features of SLURM that are most relevant to Mufasa&amp;#039;s Job Users: i.e., people who need to run their own jobs on Mufasa. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Since Job Users are by necessity SLURM users (see the following Sections for details), you may want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-9&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Partitions ==&lt;br /&gt;
&lt;br /&gt;
Via SLURM, several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;partitions&amp;#039;&amp;#039; in SLURM terminology. Each partition has specific features that make it suitable for the type of jobs it is dedicated to. SLURM command [https://slurm.schedmd.com/sinfo.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “mid”, “longmid”, “fat”. On Mufasa, partition names usually make reference to the features of the job: for instance, partition “debug” is used for test jobs. The asterisk after the name of partition “small” marks it as the default partition, i.e. the one on which jobs are launched if no partition is specified.&lt;br /&gt;
&lt;br /&gt;
When launching a job, users may exploit partitions by selecting the most suitable one and specifying that their job must be run on that partition. This avoids the need for the user to specify the amount of each resource that the job requires, since a set of resources has already been defined for each partition. The difference between partitions lies in the default amount of resources that they assign to processes. The fact that by selecting the right partition for their job a user can pre-define the requirements of the job without having to specify them makes partitions very handy, and avoids possible mistakes. A complete description of the default amount of resources that the partitions assign to their jobs can be obtained using SLURM command &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sinfo --Format=All&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; (an example is shown below)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Partition defaults are defined by Job Administrators and cannot be modified by Job Users. Users can, however, select the partitions on which each of their jobs is launched, and ‑if needed‑ change the resource requested by their jobs wrt the default values associated to such partitions.&lt;br /&gt;
&lt;br /&gt;
Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job. Therefore users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests by the user launching a job can be lower or higher than the partition&amp;#039;s default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if defined. For each resource, the maximum value is an additional parameter of the partition that System Administrators have the possibility of specifying. If a user tries to launch on a partition a job that requests a higher value of a resource than the partition-specified maximum, the launch command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined time duration. As with any other resource provided by a partition, this duration takes the default value unless the user specifies a different value. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-10&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Partition availability ===&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides about a partition is its &amp;#039;&amp;#039;partition state&amp;#039;&amp;#039;, i.e. its &amp;#039;&amp;#039;availability&amp;#039;&amp;#039;. Partition state is shown in column &amp;#039;&amp;#039;AVAIL&amp;#039;&amp;#039; (note that there is also another column named &amp;#039;&amp;#039;STATE&amp;#039;&amp;#039;: it provides, instead, the state of the &amp;#039;&amp;#039;node(s)&amp;#039;&amp;#039;, i.e. the machine(s), providing resources to the partition).&lt;br /&gt;
&lt;br /&gt;
The standard value for partition state/availability is &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;up&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, as in the example above, meaning that the partition is available for jobs. If the availability of a partition is stated as &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;down&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;drain&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, all jobs waiting for that partition are paused and the intervention of a Job Administrator is required to restore the partition&amp;#039;s operation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-11&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Executing jobs on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to make it execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation that users will perform on Mufasa: this section explains how it is done. Considering that all computation run on Mufasa must occur within Docker containers, the processes run by Mufasa users are always containers except for menial, non-computationally intensive jobs.&lt;br /&gt;
&lt;br /&gt;
The process of launching user jobs requires two steps:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Step 1: use SLURM to run the Docker container where the job will take place&amp;#039;&amp;#039;&amp;#039;;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Step 2: launch the user job from within the Docker container&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
These steps are described in the following sections of this document.&lt;br /&gt;
&lt;br /&gt;
An optional (but recommended) operation is to &amp;#039;&amp;#039;&amp;#039;use an execution script&amp;#039;&amp;#039;&amp;#039; to manage the launching process. How to do this is described below, by a specific section of this document.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-12&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Step 1: using SLURM to run a Docker container ===&lt;br /&gt;
&lt;br /&gt;
As explained above, the first step to run a user job on Mufasa is to run the Docker container where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s /home directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
srun ‑‑p &amp;amp;lt;partition_name&amp;amp;gt; ‑‑container-image=&amp;amp;lt;container_path.sqsh&amp;amp;gt; ‑‑no‑container‑entrypoint ‑‑container‑mounts=&amp;amp;lt;mufasa_dir&amp;amp;gt;:&amp;amp;lt;docker_dir&amp;amp;gt; ‑‑gres=&amp;amp;lt;gpu_resources&amp;amp;gt; ‑‑mem=&amp;amp;lt;mem_resources&amp;amp;gt; ‑‑cpus‑per‑task &amp;amp;lt;cpu_amount&amp;amp;gt; ‑‑pty ‑‑time=&amp;amp;lt;hh:mm:ss&amp;amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;amp;lt;command_to_run_within_container&amp;amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will now decompose this command into its constituent parts.&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] is one of SLURM&amp;#039;s commands to run jobs (see below for an alternative command, &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;). The following sections will provide additional details about &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and other ways to run jobs via SLURM.&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. Some of the options are specifically dedicated to Docker containers&amp;lt;ref&amp;gt;To facilitate the execution of Docker containers, the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. Options &amp;#039;&amp;#039;‑‑container-image&amp;#039;&amp;#039;, &amp;#039;&amp;#039;‑‑no‑container‑entrypoint&amp;#039;&amp;#039;, &amp;#039;&amp;#039;‑‑container-mounts &amp;#039;&amp;#039;are provided to &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; by Pyxis.&lt;br /&gt;
&amp;lt;/ref&amp;gt;. Below is a description of the options:&lt;br /&gt;
&lt;br /&gt;
‑‑p &amp;amp;lt;partition_name&amp;amp;gt; specifies the resource partition on which the job will be run.&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; If -p &amp;amp;lt;partition_name&amp;amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;#039;&amp;#039;--mem=&amp;amp;lt;mem_resources&amp;amp;gt;, --cpus-per-task &amp;amp;lt;cpu_number&amp;amp;gt; &amp;#039;&amp;#039;or &amp;#039;&amp;#039;--time=&amp;amp;lt;hh:mm:ss&amp;amp;gt;&amp;#039;&amp;#039;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;#039;&amp;#039;--gres=&amp;amp;lt;gpu_resources&amp;amp;gt;&amp;#039;&amp;#039;: GPU resources, in fact, must always be explicitly requested with option &amp;#039;&amp;#039;--gres&amp;#039;&amp;#039;, otherwise no access to GPUs is granted to the job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑container-image=&amp;amp;lt;container_path.sqsh&amp;amp;gt; specifies the container to be run&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑no‑container‑entrypoint specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option ‑‑no‑container‑entrypoint is useful when the user is not sure of the effect of such command.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑container‑mounts=&amp;amp;lt;mufasa_dir&amp;amp;gt;:&amp;amp;lt;docker_dir&amp;amp;gt; specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;amp;lt;mufasa_dir&amp;amp;gt;:&amp;amp;lt;docker_dir&amp;amp;gt; takes the value /home/mrossi:/data this tells srun to mount Mufasa&amp;#039;s directory /home/mrossi in position /data within the filesystem of the Docker container. When the docker container reads or writes files in directory /data of its own (internal) filesystem, what actually happens is that files in /home/mrossi get manipulated instead. /home/mrossi is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;‑‑gres=&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;amp;lt;gpu_resources&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; &amp;#039;&amp;#039;&amp;#039;specifies what GPUs to assign to the container; for instance, &amp;#039;&amp;#039;&amp;amp;lt;gpus&amp;amp;gt;&amp;#039;&amp;#039; may be &amp;#039;&amp;#039;gpu:40gb:2&amp;#039;&amp;#039;, that corresponds to giving the job control to 2 entire large‑size GPUs.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The &amp;#039;&amp;#039;‑‑gres&amp;#039;&amp;#039; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must &amp;#039;&amp;#039;always&amp;#039;&amp;#039; be explicitly requested with &amp;#039;&amp;#039;‑‑gres&amp;#039;&amp;#039;.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑mem=&amp;amp;lt;mem_resources&amp;amp;gt; specifies the amount of RAM to assign to the container; for instance, &amp;amp;lt;mem_resources&amp;amp;gt; may be 200G&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑cpus-per-task &amp;amp;lt;cpu_amount&amp;amp;gt; specifies how many CPUs to assign to the container; for instance, &amp;amp;lt;cpu_amount&amp;amp;gt; may be 2&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑pty specifies that the job will be interactive (this is necessary when &amp;amp;lt;command_to_run_within_container&amp;amp;gt; is /bin/bash)&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑time=&amp;amp;lt;hh:mm:ss&amp;amp;gt; specifies the maximum time allowed to the job to run, in the format hours:minutes:seconds; for instance, &amp;amp;lt;hh:mm:ss&amp;amp;gt; may be 72:00:00&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;amp;lt;command_to_run_within_container&amp;amp;gt; the executable that will be run within the Docker container as soon as it is operative. A typical value for &amp;amp;lt;command_to_run_within_container&amp;amp;gt; is /bin/bash . This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;amp;lt;command_to_run_within_container&amp;amp;gt; is &amp;#039;&amp;#039;python&amp;#039;&amp;#039;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;amp;lt;command_to_run_within_container&amp;amp;gt; to launch non-interactive programs.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-13&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Step 2: launching a user job from within a Docker container ===&lt;br /&gt;
&lt;br /&gt;
Once the container is up and running, usually the user is dropped to the interactive environment specified by &amp;#039;&amp;#039;&amp;amp;lt;command_to_run_within_container&amp;amp;gt;&amp;#039;&amp;#039;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-14&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Using SLURM to run jobs: additional information ==&lt;br /&gt;
&lt;br /&gt;
In SLURM, jobs are launched using commands [https://slurm.schedmd.com/srun.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] (for interactive programs) or [https://slurm.schedmd.com/sbatch.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] (for non-interactive ones). The preceding sections illustrated the use of &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; that is most important to Mufasa&amp;#039;s users: i.e., to run a Docker container; this section will provide a broader overview of their use.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s Job Users do not need to know the contents of this section in order to use the machine. These contents are provided to enhance the user&amp;#039;s knowledge of SLURM and its usage, but are optional.&lt;br /&gt;
&lt;br /&gt;
In the following, we provide more general information about SLURM commands &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;. The main difference between them is that &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; locks the shell from which it has been launched, so it is only really suitable for processes that use the console for interaction with their user; &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;, on the contrary, does not lock the shell and simply adds the job to the queue.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-15&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Basic &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039; syntax ===&lt;br /&gt;
&lt;br /&gt;
The basic syntax of an &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; command (the one of an &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039; command is similar) is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun &amp;amp;lt;options&amp;amp;gt; &amp;amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Among the options, one of the most important is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;--gres=gpu:K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where K is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many of the GPUs the program requests for use. Since GPUs are the most scarce resources of Mufasa, this option must &amp;#039;&amp;#039;always&amp;#039;&amp;#039; be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
&lt;br /&gt;
A quick way to define the set of resources that a program will have access to is to use option&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;-p &amp;amp;lt;partition name&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as --&amp;#039;&amp;#039;gres=gpu:K&amp;#039;&amp;#039;, will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun -p small ./my_program&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;#039;&amp;#039;my_program&amp;#039;&amp;#039; on the partition called “small”. Running the program this way means that the resources associated to this partition will be available to it for use.&lt;br /&gt;
&lt;br /&gt;
If I don&amp;#039;t want to run &amp;#039;&amp;#039;my_program&amp;#039;&amp;#039; on a partition but still want to ensure that it gets access to one GPU to operate correctly, I will need to specify this in the &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; command as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun --gres=gpu:1 ./my_program&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-16&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Running interactive jobs via SLURM ===&lt;br /&gt;
&lt;br /&gt;
As explained, &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun --pty /bin/bash&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run &amp;#039;&amp;#039;exit&amp;#039;&amp;#039; (as with any other shell).&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and therefore cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them). On the contrary, running programs with &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;--gres=gpu:K&amp;#039;&amp;#039;. For instance, to run an interactive program which needs one GPU, I will first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun --gres=gpu:1 --pty /bin/bash&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;#039;&amp;#039;/bin/bash&amp;#039;&amp;#039; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun -p small --pty /bin/bash&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;#039;&amp;#039;(SLURM ID xx)&amp;#039;&amp;#039; (where &amp;#039;&amp;#039;xx&amp;#039;&amp;#039; is the ID of the /bin/bash process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or a new one run via SLURM is to run command&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;echo $SLURM_JOB_ID&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-17&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Using &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; with &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; ===&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should run command &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; inside a &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; available online), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;. Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, the succession of operations is:&lt;br /&gt;
&lt;br /&gt;
# From the Mufasa shell, run &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;screen&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
# In the screen thus created (it has the look of an empty shell), launch your job with &amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&lt;br /&gt;
# Detach from the screen with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, and your process will go on running in the screen&lt;br /&gt;
# Close the SSH session to Mufasa&lt;br /&gt;
# (later) To resume contact with your running process, connect to Mufasa with SSH &lt;br /&gt;
# In the Mufasa shell, run &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;screen -r&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
# When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;X&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A use case for screen is writing your program in such a way that it prints progress advancement messages as it goes on with its work. Then, you can check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.&lt;br /&gt;
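&lt;br /&gt;
A minimal sketch of this workflow, assuming a hypothetical program &amp;#039;&amp;#039;my_program&amp;#039;&amp;#039; that needs one GPU (names and resource amounts are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;screen&amp;#039;&amp;#039; (create a new screen and enter it)&lt;br /&gt;
&amp;#039;&amp;#039;srun --gres=gpu:1 ./my_program&amp;#039;&amp;#039; (launch the job via SLURM from inside the screen)&lt;br /&gt;
&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;D&amp;#039;&amp;#039; (detach: the job keeps running inside the screen)&lt;br /&gt;
&amp;#039;&amp;#039;exit&amp;#039;&amp;#039; (close the SSH session)&lt;br /&gt;
&amp;#039;&amp;#039;screen -r&amp;#039;&amp;#039; (after reconnecting to Mufasa via SSH, reattach to the screen and read the messages printed so far)&lt;br /&gt;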
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-18&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Using execution scripts to wrap user jobs ==&lt;br /&gt;
&lt;br /&gt;
Sections 2.2 and 2.3 explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line. Each parameter value is provided to SLURM by including an argument such as&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;--parameter_name=parameter_value&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
into the command line. &lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and, most importantly, can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each on a line starting with the keyword &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039;;&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that use SLURM to run jobs, using the parameter values specified by the preamble.&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set;&lt;br /&gt;
* have “&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;” as its very first line.&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh&amp;#039;&amp;#039;, such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;. To execute the script, just open a terminal, write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press &amp;amp;lt;&amp;#039;&amp;#039;enter&amp;#039;&amp;#039;&amp;amp;gt;. Within a bash script, lines preceded by “&amp;#039;&amp;#039;#&amp;#039;&amp;#039;” are comments (with the notable exceptions of the initial “&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;” line and of the &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directives, which are read by SLURM). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of an execution script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#!/bin/bash&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# ----------------preamble----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# Note: the values below are examples. Put your own #SBATCH directives here&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --job-name=myjob          # name assigned to the job&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --cpus-per-task=1         # number of threads allocated to each task&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --mem-per-cpu=500M        # amount of memory per CPU core&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --gres=gpu:1              # number of GPUs per node&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --partition=small         # the partition to run your jobs in&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --time=0-00:01:00         # time assigned to your jobs to run (format: day-hour:min:sec)&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# ----------------srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# Put your own srun command(s) below: see Section 2.2&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;srun ...&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
As the example above shows, beyond the initial directive “&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;” the script includes a series of &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directives used to specify parameter values, followed by one or more &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; commands that run the jobs. Any parameter accepted by commands &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039; can be used in an &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directive in an execution script.&lt;br /&gt;
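&lt;br /&gt;
As a usage sketch (the script name is the one used in the example above): the script must be made executable once, after which it can be launched from a terminal; if you want SLURM itself to read the &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directives and queue the whole script as a batch job, it can also be submitted with &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;chmod +x my_execution_script.sh&amp;#039;&amp;#039; (set the “executable” flag; needed only once)&lt;br /&gt;
&amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039; (run the script from the current directory)&lt;br /&gt;
&amp;#039;&amp;#039;sbatch my_execution_script.sh&amp;#039;&amp;#039; (submit the script to SLURM as a non-interactive batch job)&lt;br /&gt;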
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-19&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Job caching ===&lt;br /&gt;
&lt;br /&gt;
When a Job User runs a job via SLURM (with or without an execution script), Mufasa exploits a transparent caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical, slow) HDDs where /home partitions reside, replacing such accesses with accesses to the (solid-state, fast) SSDs.&lt;br /&gt;
&lt;br /&gt;
Precisely, each time a job is run via SLURM Mufasa:&lt;br /&gt;
&lt;br /&gt;
# temporarily copies code and associated data from the user&amp;#039;s own /home partition to a cache space located on system SSDs;&lt;br /&gt;
# runs the user job from the SSDs, using the copy of the data on the SSD as input;&lt;br /&gt;
# creates the output file(s) on the SSDs;&lt;br /&gt;
# when the job ends, copies the output files from the SSDs to the user&amp;#039;s own /home partition.&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares executable and data in their /home folder, then runs the job (possibly via an execution script). When job execution ends, the user finds their output data in the /home folder, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-20&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Monitoring and managing jobs ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From SLURM&amp;#039;s overview (the links point to the appropriate URLs in SLURM&amp;#039;s online documentation): “User tools include [https://slurm.schedmd.com/srun.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to initiate jobs, [https://slurm.schedmd.com/scancel.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;scancel&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to terminate queued or running jobs, [https://slurm.schedmd.com/sinfo.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to report system status, [https://slurm.schedmd.com/squeue.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;squeue&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to report the status of jobs [i.e. to inspect the scheduling queue], and [https://slurm.schedmd.com/sacct.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sacct&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to get information about jobs and job steps that are running or have completed.”&lt;br /&gt;
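&lt;br /&gt;
As a brief sketch of how these tools are typically used from a Mufasa shell (the job ID 1234 is illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;squeue&amp;#039;&amp;#039; (list the jobs currently queued or running, with their job IDs)&lt;br /&gt;
&amp;#039;&amp;#039;sacct -j 1234&amp;#039;&amp;#039; (show accounting information about job 1234)&lt;br /&gt;
&amp;#039;&amp;#039;scancel 1234&amp;#039;&amp;#039; (terminate job 1234, provided it is one of your own jobs)&lt;br /&gt;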
&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=4</id>
		<title>User Jobs</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=User_Jobs&amp;diff=4"/>
		<updated>2021-12-22T13:56:43Z</updated>

		<summary type="html">&lt;p&gt;Admin: Creata pagina con &amp;quot;= &amp;lt;span id=&amp;quot;anchor-8&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;2. Mufasa for Job Users =  This section briefly presents the features of SLURM that are most relevant to Mufasa&amp;#039;s Job Users: i.e., people who need...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= &amp;lt;span id=&amp;quot;anchor-8&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;2. Mufasa for Job Users =&lt;br /&gt;
&lt;br /&gt;
This section briefly presents the features of SLURM that are most relevant to Mufasa&amp;#039;s Job Users: i.e., people who need to run their own jobs on Mufasa. Job Users can submit jobs for execution, cancel their own jobs, and see other users&amp;#039; jobs (but not intervene on them).&lt;br /&gt;
&lt;br /&gt;
Since Job Users are by necessity SLURM users (see the following Sections for details), you may want to read [https://slurm.schedmd.com/quickstart.html SLURM&amp;#039;s own Quick Start User Guide].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-9&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Partitions ==&lt;br /&gt;
&lt;br /&gt;
Via SLURM, several execution queues for jobs have been defined on Mufasa. Such queues are called &amp;#039;&amp;#039;partitions&amp;#039;&amp;#039; in SLURM terminology. Each partition has specific features that make it suitable for the type of jobs it is dedicated to. SLURM command [https://slurm.schedmd.com/sinfo.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] provides a list of available partitions. Its output is similar to this:&lt;br /&gt;
&lt;br /&gt;
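A sketch of what this output may look like (the node name and time limits below are illustrative placeholders, not Mufasa&amp;#039;s actual values):&lt;br /&gt;
&lt;br /&gt;
 PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST&lt;br /&gt;
 debug         up    1:00:00      1   idle  mufasa&lt;br /&gt;
 small*        up    6:00:00      1   idle  mufasa&lt;br /&gt;
 normal        up 1-00:00:00      1   idle  mufasa&lt;br /&gt;
 longnormal    up 3-00:00:00      1   idle  mufasa&lt;br /&gt;
 mid           up 1-00:00:00      1   idle  mufasa&lt;br /&gt;
 longmid       up 3-00:00:00      1   idle  mufasa&lt;br /&gt;
 fat           up 1-00:00:00      1   idle  mufasa&lt;br /&gt;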
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “mid”, “longmid”, “fat”. On Mufasa, partition names usually refer to the features of the jobs they are meant for: for instance, partition “debug” is used for test jobs. The asterisk after the name of partition “small” marks it as the default partition, i.e. the one on which jobs are launched if no partition is specified.&lt;br /&gt;
&lt;br /&gt;
When launching a job, users may exploit partitions by selecting the most suitable one and specifying that their job must be run on that partition. This avoids the need for the user to specify the amount of each resource that the job requires, since a set of resources has already been defined for each partition. The difference between partitions lies in the default amount of resources that they assign to processes. The fact that, by selecting the right partition for their job, a user can pre-define the requirements of the job without having to specify them makes partitions very handy and avoids possible mistakes. A complete description of the default amount of resources that the partitions assign to their jobs can be obtained using SLURM command &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sinfo --Format=All&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; (an example is shown below).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Partition defaults are defined by Job Administrators and cannot be modified by Job Users. Users can, however, select the partitions on which each of their jobs is launched, and, if needed, change the resources requested by their jobs with respect to the default values associated to such partitions.&lt;br /&gt;
&lt;br /&gt;
Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job. Therefore users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job&amp;#039;s requirements only for those resources that have an unsuitable default value.&lt;br /&gt;
&lt;br /&gt;
Resource requests made by the user launching a job can be either lower or higher than the partition&amp;#039;s default value for that resource. However, they cannot exceed the maximum value that the partition allows for requests of that resource, if one is defined. For each resource, the maximum value is an additional parameter of the partition that System Administrators have the possibility of specifying. If a user tries to launch on a partition a job that requests more of a resource than the partition‑specified maximum, the launch command is refused.&lt;br /&gt;
&lt;br /&gt;
One of the resources provided to jobs by partitions is &amp;#039;&amp;#039;time&amp;#039;&amp;#039;, in the sense that a job is permitted to run for no longer than a predefined time duration. As with any other resource provided by a partition, this duration takes the default value unless the user specifies a different value. Jobs that exceed their allotted time are killed by SLURM.&lt;br /&gt;
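&lt;br /&gt;
For instance (a minimal sketch, with an illustrative program name and duration), a job can be given two hours instead of the partition default with:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun -p small --time=02:00:00 ./my_program&amp;#039;&amp;#039;&lt;br /&gt;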
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-10&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Partition availability ===&lt;br /&gt;
&lt;br /&gt;
The most important information that &amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039; provides about a partition is its &amp;#039;&amp;#039;partition state&amp;#039;&amp;#039;, i.e. its &amp;#039;&amp;#039;availability&amp;#039;&amp;#039;. Partition state is shown in column &amp;#039;&amp;#039;AVAIL&amp;#039;&amp;#039; (note that there is also another column named &amp;#039;&amp;#039;STATE&amp;#039;&amp;#039;: it provides, instead, the state of the &amp;#039;&amp;#039;node(s)&amp;#039;&amp;#039;, i.e. the machine(s), providing resources to the partition).&lt;br /&gt;
&lt;br /&gt;
The standard value for partition state/availability is &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;up&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, as in the example above, meaning that the partition is available for jobs. If the availability of a partition is stated as &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;down&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;drain&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, all jobs waiting for that partition are paused and the intervention of a Job Administrator is required to restore the partition&amp;#039;s operation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-11&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Executing jobs on Mufasa ==&lt;br /&gt;
&lt;br /&gt;
The main reason for a user to interact with Mufasa is to make it execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation that users will perform on Mufasa: this section explains how it is done. Since all computation on Mufasa must occur within Docker containers, the processes run by Mufasa users are always containers, except for menial, non-computationally-intensive tasks.&lt;br /&gt;
&lt;br /&gt;
The process of launching user jobs requires two steps:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Step 1: use SLURM to run the Docker container where the job will take place&amp;#039;&amp;#039;&amp;#039;;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Step 2: launch the user job from within the Docker container&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
These steps are described in the following sections of this document.&lt;br /&gt;
&lt;br /&gt;
An optional (but recommended) operation is to &amp;#039;&amp;#039;&amp;#039;use an execution script&amp;#039;&amp;#039;&amp;#039; to manage the launching process. How to do this is described in a dedicated section below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-12&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Step 1: using SLURM to run a Docker container ===&lt;br /&gt;
&lt;br /&gt;
As explained above, the first step to run a user job on Mufasa is to run the Docker container where the job will take place. A container is a “sandbox” containing the environment where the user&amp;#039;s application operates. Parts of Mufasa&amp;#039;s filesystem can be made visible (and writable, if they belong to the user&amp;#039;s /home directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa&amp;#039;s filesystem: for instance, to read data and write results.&lt;br /&gt;
&lt;br /&gt;
Each user is in charge of preparing the Docker container(s) where the user&amp;#039;s jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.&lt;br /&gt;
&lt;br /&gt;
In order to run a Docker container via SLURM, a user must use a command similar to the following:&lt;br /&gt;
&lt;br /&gt;
srun ‑p &amp;amp;lt;partition_name&amp;amp;gt; ‑‑container-image=&amp;amp;lt;container_path.sqsh&amp;amp;gt; ‑‑no‑container‑entrypoint ‑‑container‑mounts=&amp;amp;lt;mufasa_dir&amp;amp;gt;:&amp;amp;lt;docker_dir&amp;amp;gt; ‑‑gres=&amp;amp;lt;gpu_resources&amp;amp;gt; ‑‑mem=&amp;amp;lt;mem_resources&amp;amp;gt; ‑‑cpus‑per‑task &amp;amp;lt;cpu_amount&amp;amp;gt; ‑‑pty ‑‑time=&amp;amp;lt;hh:mm:ss&amp;amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;amp;lt;command_to_run_within_container&amp;amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will now decompose this command into its constituent parts.&lt;br /&gt;
&lt;br /&gt;
[https://slurm.schedmd.com/srun.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] is one of SLURM&amp;#039;s commands to run jobs (see Section 2.3 for an alternative command, &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;). The following sections will provide additional details about &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and other ways to run jobs via SLURM.&lt;br /&gt;
&lt;br /&gt;
All parts of the command above that come after &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; are options that specify what to execute and how. Some of the options are specifically dedicated to Docker containers&amp;lt;ref&amp;gt;To facilitate the execution of Docker containers, the [https://github.com/NVIDIA/pyxis Nvidia Pyxis] package has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users (i.e., those that are not administrators of Mufasa) to execute containers and run commands within them. Options &amp;#039;&amp;#039;‑‑container-image&amp;#039;&amp;#039;, &amp;#039;&amp;#039;‑‑no‑container‑entrypoint&amp;#039;&amp;#039;, &amp;#039;&amp;#039;‑‑container-mounts &amp;#039;&amp;#039;are provided to &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; by Pyxis.&lt;br /&gt;
&amp;lt;/ref&amp;gt;. Below is a description of the options:&lt;br /&gt;
&lt;br /&gt;
‑p &amp;amp;lt;partition_name&amp;amp;gt; specifies the resource partition on which the job will be run.&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; If ‑p &amp;amp;lt;partition_name&amp;amp;gt; is used, options that specify how many resources to assign to the job (such as &amp;#039;&amp;#039;‑‑mem=&amp;amp;lt;mem_resources&amp;amp;gt;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;‑‑cpus‑per‑task &amp;amp;lt;cpu_amount&amp;amp;gt;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;‑‑time=&amp;amp;lt;hh:mm:ss&amp;amp;gt;&amp;#039;&amp;#039;) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of the resource (as defined by the chosen partition). A notable exception to this rule concerns option &amp;#039;&amp;#039;‑‑gres=&amp;amp;lt;gpu_resources&amp;amp;gt;&amp;#039;&amp;#039;: GPU resources, in fact, must always be explicitly requested with option &amp;#039;&amp;#039;‑‑gres&amp;#039;&amp;#039;, otherwise no access to GPUs is granted to the job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑container-image=&amp;amp;lt;container_path.sqsh&amp;amp;gt; specifies the container to be run&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑no‑container‑entrypoint specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option ‑‑no‑container‑entrypoint is useful when the user is not sure of the effect of such command.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑container‑mounts=&amp;amp;lt;mufasa_dir&amp;amp;gt;:&amp;amp;lt;docker_dir&amp;amp;gt; specifies what parts of Mufasa&amp;#039;s filesystem will be available within the container&amp;#039;s filesystem, and where they will be mounted; for instance, if &amp;amp;lt;mufasa_dir&amp;amp;gt;:&amp;amp;lt;docker_dir&amp;amp;gt; takes the value /home/mrossi:/data this tells srun to mount Mufasa&amp;#039;s directory /home/mrossi in position /data within the filesystem of the Docker container. When the docker container reads or writes files in directory /data of its own (internal) filesystem, what actually happens is that files in /home/mrossi get manipulated instead. /home/mrossi is the only part of the filesystem of Mufasa that is visible to, and changeable by, the Docker container.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑gres=&amp;amp;lt;gpu_resources&amp;amp;gt; specifies what GPUs to assign to the container; for instance, &amp;amp;lt;gpu_resources&amp;amp;gt; may be &amp;#039;&amp;#039;gpu:40gb:2&amp;#039;&amp;#039;, which corresponds to giving the job control of 2 entire large‑size GPUs.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;#039;&amp;#039;&amp;#039;Important!&amp;#039;&amp;#039;&amp;#039; The &amp;#039;&amp;#039;‑‑gres&amp;#039;&amp;#039; parameter is mandatory if the job needs to use the system&amp;#039;s GPUs. Differently from other resources (where unspecified requests lead to the assignment of a default amount of the resource), GPUs must &amp;#039;&amp;#039;always&amp;#039;&amp;#039; be explicitly requested with &amp;#039;&amp;#039;‑‑gres&amp;#039;&amp;#039;.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑mem=&amp;amp;lt;mem_resources&amp;amp;gt; specifies the amount of RAM to assign to the container; for instance, &amp;amp;lt;mem_resources&amp;amp;gt; may be 200G&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑cpus-per-task &amp;amp;lt;cpu_amount&amp;amp;gt; specifies how many CPUs to assign to the container; for instance, &amp;amp;lt;cpu_amount&amp;amp;gt; may be 2&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑pty specifies that the job will be interactive (this is necessary when &amp;amp;lt;command_to_run_within_container&amp;amp;gt; is /bin/bash)&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;‑‑time=&amp;amp;lt;hh:mm:ss&amp;amp;gt; specifies the maximum time allowed to the job to run, in the format hours:minutes:seconds; for instance, &amp;amp;lt;hh:mm:ss&amp;amp;gt; may be 72:00:00&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;amp;lt;command_to_run_within_container&amp;amp;gt; the executable that will be run within the Docker container as soon as it is operative. A typical value for &amp;amp;lt;command_to_run_within_container&amp;amp;gt; is /bin/bash . This instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value for &amp;amp;lt;command_to_run_within_container&amp;amp;gt; is &amp;#039;&amp;#039;python&amp;#039;&amp;#039;, which launches an interactive Python session from which the user will then run their job. It is also possible to use &amp;amp;lt;command_to_run_within_container&amp;amp;gt; to launch non-interactive programs.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
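&lt;br /&gt;
Putting the pieces together, here is a sketch of a complete command that instantiates the template above (it reuses the /home/mrossi example directory and the example resource amounts given above; the image name is invented):&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun -p small --container-image=/home/mrossi/my_container.sqsh --no-container-entrypoint --container-mounts=/home/mrossi:/data --gres=gpu:1 --mem=200G --cpus-per-task 2 --pty --time=72:00:00 /bin/bash&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
This opens an interactive bash shell inside the container, with Mufasa&amp;#039;s directory /home/mrossi mounted as /data, one GPU, 200G of RAM and 2 CPUs, for at most 72 hours.&lt;br /&gt;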
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-13&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Step 2: launching a user job from within a Docker container ===&lt;br /&gt;
&lt;br /&gt;
Once the container is up and running, usually the user is dropped to the interactive environment specified by &amp;#039;&amp;#039;&amp;amp;lt;command_to_run_within_container&amp;amp;gt;&amp;#039;&amp;#039;. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).&lt;br /&gt;
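&lt;br /&gt;
For instance (the script name is hypothetical), from a bash shell opened inside the container in Step 1 one could simply run:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;python /data/train_model.py&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where /data is the container-side mount point chosen in Step 1.&lt;br /&gt;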
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-14&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Using SLURM to run jobs: additional information ==&lt;br /&gt;
&lt;br /&gt;
In SLURM, jobs are launched using commands [https://slurm.schedmd.com/srun.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] (for interactive programs) or [https://slurm.schedmd.com/sbatch.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] (for non-interactive ones). The preceding sections illustrated the use of &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; that is most important to Mufasa&amp;#039;s users, i.e. running a Docker container; this section provides a broader overview of the use of both commands.&lt;br /&gt;
&lt;br /&gt;
Mufasa&amp;#039;s Job Users do not need to know the contents of this section in order to use the machine. These contents are provided to enhance the user&amp;#039;s knowledge of SLURM and its usage, but are optional.&lt;br /&gt;
&lt;br /&gt;
In the following, we provide more general information about SLURM commands &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;. The main difference between them is that &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; locks the shell from which it has been launched, so it is only really suitable for processes that use the console for interaction with their user; &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039;, on the contrary, does not lock the shell and simply adds the job to the queue.&lt;br /&gt;
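&lt;br /&gt;
As a quick illustration of this difference (the script and program names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;sbatch my_execution_script.sh&amp;#039;&amp;#039; returns immediately, after printing the ID assigned to the queued job, while&lt;br /&gt;
&amp;#039;&amp;#039;srun ./my_program&amp;#039;&amp;#039; keeps the shell busy until &amp;#039;&amp;#039;my_program&amp;#039;&amp;#039; terminates.&lt;br /&gt;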
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-15&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Basic &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039; syntax ===&lt;br /&gt;
&lt;br /&gt;
The basic syntax of an &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; command (the one of an &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039; command is similar) is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun &amp;amp;lt;options&amp;amp;gt; &amp;amp;lt;path_of_the_program_to_be_run_via_SLURM&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Among the options, one of the most important is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;--gres=gpu:K&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where K is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many of the GPUs the program requests for use. Since GPUs are the most scarce resources of Mufasa, this option must &amp;#039;&amp;#039;always&amp;#039;&amp;#039; be explicitly specified when running a job that requires GPUs.&lt;br /&gt;
&lt;br /&gt;
A quick way to define the set of resources that a program will have access to is to use option&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;-p &amp;amp;lt;partition_name&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as &amp;#039;&amp;#039;--gres=gpu:K&amp;#039;&amp;#039;, can only provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.&lt;br /&gt;
&lt;br /&gt;
For instance, running&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun -p small ./my_program&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
makes SLURM run &amp;#039;&amp;#039;my_program&amp;#039;&amp;#039; on the partition called “small”. Running the program this way means that the resources associated to this partition will be available to it for use.&lt;br /&gt;
&lt;br /&gt;
If I don&amp;#039;t want to run &amp;#039;&amp;#039;my_program&amp;#039;&amp;#039; on a partition but still want to ensure that it gets access to one GPU to operate correctly, I need to specify this in the &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; command as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun --gres=gpu:1 ./my_program&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-16&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Running interactive jobs via SLURM ===&lt;br /&gt;
&lt;br /&gt;
As explained, &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; is suitable for launching &amp;#039;&amp;#039;interactive&amp;#039;&amp;#039; user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a &amp;#039;&amp;#039;bash shell&amp;#039;&amp;#039; (i.e. a terminal session) with&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun --pty /bin/bash&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run &amp;#039;&amp;#039;exit&amp;#039;&amp;#039; (as with any other shell).&lt;br /&gt;
&lt;br /&gt;
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and therefore cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them). On the contrary, running programs with &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; ensures that they can access all the resources managed by SLURM.&lt;br /&gt;
&lt;br /&gt;
As usual, GPU resources (if needed) must always be requested explicitly with parameter&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;--gres=gpu:K&amp;#039;&amp;#039;. For instance, to run an interactive program which needs one GPU, I will first run a bash shell via SLURM with command&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun --gres=gpu:1 --pty /bin/bash&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
and then run the interactive program from the newly opened shell.&lt;br /&gt;
&lt;br /&gt;
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run &amp;#039;&amp;#039;/bin/bash&amp;#039;&amp;#039; on one of the available partitions. For instance, to run the shell on partition “small” the command is&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun -p small --pty /bin/bash&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as &amp;#039;&amp;#039;(SLURM ID xx)&amp;#039;&amp;#039; (where &amp;#039;&amp;#039;xx&amp;#039;&amp;#039; is the ID of the /bin/bash process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM one.&lt;br /&gt;
&lt;br /&gt;
Another way to know if the current shell is the “base” shell or a new one run via SLURM is to run command&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;echo $SLURM_JOB_ID&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-17&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Using &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; with &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; ===&lt;br /&gt;
&lt;br /&gt;
A consequence of the way &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should run command &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; inside a &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; ([https://linuxize.com/post/how-to-use-linux-screen/ here] is one of many tutorials about &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; available online), then detach from the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;. Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
More specifically, the succession of operations is:&lt;br /&gt;
&lt;br /&gt;
# From the Mufasa shell, run &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;screen&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
# In the screen thus created (it has the look of an empty shell), launch your job with &amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&lt;br /&gt;
# Detach from the screen with &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;D&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;: you will come back to the original Mufasa shell, and your process will go on running in the screen&lt;br /&gt;
# Close the SSH session to Mufasa&lt;br /&gt;
# (later) To resume contact with your running process, connect to Mufasa with SSH &lt;br /&gt;
# In the Mufasa shell, run &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;screen -r&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
# You are now back to the screen where you launched your job&lt;br /&gt;
# When you do not need the screen containing your job anymore, destroy it by using (within the screen) &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ctrl + A&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;X&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
A use case for screen is writing your program in such a way that it prints progress advancement messages as it goes on with its work. Then, you can check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-18&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Using execution scripts to wrap user jobs ==&lt;br /&gt;
&lt;br /&gt;
Sections 2.2 and 2.3 explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters directly on the command line. Each parameter value is provided to SLURM by including an argument such as&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;--parameter_name=parameter_value&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
into the command line. &lt;br /&gt;
&lt;br /&gt;
In general, though, it is preferable to wrap the commands that run jobs into &amp;#039;&amp;#039;execution scripts&amp;#039;&amp;#039;. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and, most importantly, can be reused for other jobs.&lt;br /&gt;
&lt;br /&gt;
An execution script is a Linux shell script composed of two parts:&lt;br /&gt;
&lt;br /&gt;
# a &amp;#039;&amp;#039;&amp;#039;preamble&amp;#039;&amp;#039;&amp;#039;, where the user specifies the values to be given to parameters, each on a line starting with the keyword &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039;;&lt;br /&gt;
# one or more &amp;#039;&amp;#039;&amp;#039;srun commands&amp;#039;&amp;#039;&amp;#039; that use SLURM to run jobs, using the parameter values specified by the preamble.&lt;br /&gt;
&lt;br /&gt;
An execution script is a special type of Linux &amp;#039;&amp;#039;bash script&amp;#039;&amp;#039;. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:&lt;br /&gt;
&lt;br /&gt;
* have the “executable” flag set;&lt;br /&gt;
* have “&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;” as its very first line.&lt;br /&gt;
&lt;br /&gt;
Usually, a Linux bash script is given a name ending in &amp;#039;&amp;#039;.sh&amp;#039;&amp;#039;, such as &amp;#039;&amp;#039;my_execution_script.sh&amp;#039;&amp;#039;. To execute the script, just open a terminal, write the script&amp;#039;s full path (e.g., &amp;#039;&amp;#039;./my_execution_script.sh&amp;#039;&amp;#039;) and press &amp;amp;lt;&amp;#039;&amp;#039;enter&amp;#039;&amp;#039;&amp;amp;gt;. Within a bash script, lines preceded by “&amp;#039;&amp;#039;#&amp;#039;&amp;#039;” are comments (with the notable exceptions of the initial “&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;” line and of the &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directives, which are read by SLURM). Use of blank lines as spacers is allowed.&lt;br /&gt;
&lt;br /&gt;
Below is an example of an execution script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#!/bin/bash&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# ----------------preamble----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# Note: the values below are examples. Put your own #SBATCH directives here&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --job-name=myjob          # name assigned to the job&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --cpus-per-task=1         # number of threads allocated to each task&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --mem-per-cpu=500M        # amount of memory per CPU core&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --gres=gpu:1              # number of GPUs per node&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --partition=small         # the partition to run your jobs in&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;#SBATCH --time=0-00:01:00         # time assigned to your jobs to run (format: day-hour:min:sec)&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# ----------------srun commands-----------------&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;# Put your own srun command(s) below: see Section 2.2&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;srun ...&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
As the example above shows, beyond the initial directive “&amp;#039;&amp;#039;#!/bin/bash&amp;#039;&amp;#039;” the script includes a series of &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directives used to specify parameter values, followed by one or more &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; commands that run the jobs. Any parameter accepted by commands &amp;#039;&amp;#039;srun&amp;#039;&amp;#039; and &amp;#039;&amp;#039;sbatch&amp;#039;&amp;#039; can be used in an &amp;#039;&amp;#039;#SBATCH&amp;#039;&amp;#039; directive in an execution script.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== &amp;lt;span id=&amp;quot;anchor-19&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Job caching ===&lt;br /&gt;
&lt;br /&gt;
When a Job User runs a job via SLURM (with or without an execution script), Mufasa exploits a transparent caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical, slow) HDDs where /home partitions reside, replacing such accesses with accesses to the (solid-state, fast) SSDs.&lt;br /&gt;
&lt;br /&gt;
Precisely, each time a job is run via SLURM Mufasa:&lt;br /&gt;
&lt;br /&gt;
# temporarily copies code and associated data from the user&amp;#039;s own /home partition to a cache space located on system SSDs;&lt;br /&gt;
# runs the user job from the SSDs, using the copy of the data on the SSD as input;&lt;br /&gt;
# creates the output file(s) on the SSDs;&lt;br /&gt;
# when the job ends, copies the output files from the SSDs to the user&amp;#039;s own /home partition.&lt;br /&gt;
&lt;br /&gt;
The whole process is completely transparent to the user. The user simply prepares executable and data in their /home folder, then runs the job (possibly via an execution script). When job execution ends, the user finds their output data in the /home folder, exactly as if the execution actually occurred there.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-20&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Monitoring and managing jobs ==&lt;br /&gt;
&lt;br /&gt;
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users&amp;#039; jobs, they are only allowed to modify the condition of their own jobs.&lt;br /&gt;
&lt;br /&gt;
From SLURM&amp;#039;s overview (the links point to the appropriate URLs in SLURM&amp;#039;s online documentation): “User tools include [https://slurm.schedmd.com/srun.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;srun&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to initiate jobs, [https://slurm.schedmd.com/scancel.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;scancel&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to terminate queued or running jobs, [https://slurm.schedmd.com/sinfo.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sinfo&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to report system status, [https://slurm.schedmd.com/squeue.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;squeue&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to report the status of jobs [i.e. to inspect the scheduling queue], and [https://slurm.schedmd.com/sacct.html &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sacct&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;] to get information about jobs and job steps that are running or have completed.”&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=3</id>
		<title>System</title>
		<link rel="alternate" type="text/html" href="https://biohpc.deib.polimi.it/index.php?title=System&amp;diff=3"/>
		<updated>2021-12-22T13:53:20Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= &amp;lt;span id=&amp;quot;anchor-1&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;System =&lt;br /&gt;
&lt;br /&gt;
Mufasa is a Linux server located in a server room managed by the System Administrators. Job Administrators and Job Users can only access Mufasa remotely. Section 1 provides a brief description of the system and of the ways to interact with it.&lt;br /&gt;
&lt;br /&gt;
Remote access to Mufasa is performed using the SSH protocol for the execution of commands (see Section 1.2) and the SFTP protocol for the exchange of files (see Section 1.3). Once logged in, a user interacts with Mufasa via a terminal (text-based) interface.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-2&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Hardware ==&lt;br /&gt;
&lt;br /&gt;
Mufasa is a server for massively parallel computation. Its main hardware components are:&lt;br /&gt;
&lt;br /&gt;
* 32-core, 64-thread AMD processor&lt;br /&gt;
* 1 TB RAM&lt;br /&gt;
* 9 TB of SSDs (for OS and execution cache)&lt;br /&gt;
* 28TB of HDDs (for user /home directories)&lt;br /&gt;
* 5 Nvidia A100 GPUs [based on the &amp;#039;&amp;#039;Ampere&amp;#039;&amp;#039; architecture]&lt;br /&gt;
* Linux Ubuntu operating system&lt;br /&gt;
&lt;br /&gt;
Usually each of these resources (e.g., a GPU) is not fully assigned to a single user or a single job. On the contrary, resources are shared among different users and processes in order to optimise their usage and availability.&lt;br /&gt;
&lt;br /&gt;
For what concerns GPUs, the 5 physical A100 GPUs are subdivided into “virtual” GPUs with different capabilities using Nvidia&amp;#039;s MIG system. From [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ MIG&amp;#039;s user guide]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;The Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
In practice, MIG allows flexible partitioning of a very powerful (but single) GPU to create multiple virtual GPUs with different capabilities, that are then made available to users as if they were separate devices.&lt;br /&gt;
&lt;br /&gt;
Command&lt;br /&gt;
&lt;br /&gt;
[https://developer.nvidia.com/nvidia-system-management-interface &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nvidia-smi&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;]&lt;br /&gt;
&lt;br /&gt;
(“smi” stands for System Management Interface) provides an overview of the physical and virtual GPUs available to users in a system&amp;lt;ref&amp;gt;On Mufasa, this command may require to be launched via the SLURM job scheduling system (as explained in Section 2 of this document) in order to be able to access the GPUs.&lt;br /&gt;
&amp;lt;/ref&amp;gt;.&lt;br /&gt;
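&lt;br /&gt;
For instance, a minimal sketch of how the command can be launched through SLURM (requesting one GPU so that the command can see it):&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun --gres=gpu:1 nvidia-smi&amp;#039;&amp;#039;&lt;br /&gt;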
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-3&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Accessing Mufasa ==&lt;br /&gt;
&lt;br /&gt;
User access to Mufasa is always remote and exploits the &amp;#039;&amp;#039;SSH&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure SHell&amp;#039;&amp;#039;) protocol. To open a remote connection to Mufasa, open a local terminal on your computer and, in it, run command&amp;lt;ref&amp;gt;Linux, macOS and Windows 10 (and later) terminals can be used. All, in fact, include the required SSH client. A handy alternative tool for Windows (also including an X server, required to run, on Mufasa, Linux programs with a graphical user interface) is [https://mobaxterm.mobatek.net/ MobaXterm].&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;ssh &amp;amp;lt;your_username_on_Mufasa&amp;amp;gt;@&amp;amp;lt;Mufasa&amp;#039;s_IP_address&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;#039;&amp;#039;&amp;amp;lt;Mufasa&amp;#039;s_IP_address&amp;amp;gt;&amp;#039;&amp;#039; is any of the following two addresses:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;10.79.23.96&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;10.79.23.97&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
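For example, a hypothetical user with username &amp;#039;&amp;#039;mrossi&amp;#039;&amp;#039; (see the “Users and groups” section below) would connect with:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;ssh mrossi@10.79.23.96&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;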
If you don&amp;#039;t have a user configured on Mufasa, you first have to ask your supervisor for one. Information about the creation of users is provided in the “Users and groups” section below.&lt;br /&gt;
&lt;br /&gt;
In order to connect to Mufasa your computer must belong to Polimi&amp;#039;s LAN, either because it is physically located at Politecnico di Milano, or because you are using Polimi&amp;#039;s VPN. Ask your supervisor about the VPN if you need to connect to Mufasa from non-Polimi locations, such as your home.&lt;br /&gt;
&lt;br /&gt;
As soon as you launch the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, you will be asked to type the password (i.e. the password of your user account on Mufasa). Once the password has been provided, the local terminal on your computer becomes a remote terminal (a “remote shell”) through which you interact with Mufasa&amp;lt;ref&amp;gt;The standard form of the &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; command, i.e. the one described above, should always be preferred. In special cases it may be necessary to remotely run (on Mufasa) Linux programs that have a graphical user interface. These programs require interaction with an X server running on the user&amp;#039;s own machine, and a special mode of operation of &amp;#039;&amp;#039;ssh&amp;#039;&amp;#039; is needed to enable this. This mode is engaged by running the command like this:&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;ssh -X &amp;amp;lt;your username on Mufasa&amp;amp;gt;@&amp;amp;lt;Mufasa&amp;#039;s IP address&amp;amp;gt;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/ref&amp;gt;. The shell sports a command prompt such as&lt;br /&gt;
&lt;br /&gt;
&amp;amp;lt;your_username_on_Mufasa&amp;amp;gt;@rk018445:~$&lt;br /&gt;
&lt;br /&gt;
(&amp;#039;&amp;#039;rk018445&amp;#039;&amp;#039; is the Linux hostname of Mufasa). You can issue commands to Mufasa by typing them after the prompt, then pressing the &amp;#039;&amp;#039;enter&amp;#039;&amp;#039; key. Since Mufasa is a Linux server, it responds to all the standard Linux system commands, such as &amp;#039;&amp;#039;pwd&amp;#039;&amp;#039; (which prints the path to the current directory) or &amp;#039;&amp;#039;cd &amp;amp;lt;destination_dir&amp;amp;gt;&amp;#039;&amp;#039; (which changes the current directory). On the internet you can find many tutorials about the Linux command line: for instance [https://linuxcommand.org/index.php this one].&lt;br /&gt;
&lt;br /&gt;
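As an illustration (the username and directory names are hypothetical, and assume the usual &amp;#039;&amp;#039;/home/&amp;amp;lt;username&amp;amp;gt;&amp;#039;&amp;#039; layout), a short interaction could look like this:&lt;br /&gt;
&lt;br /&gt;
mrossi@rk018445:~$ pwd&lt;br /&gt;
/home/mrossi&lt;br /&gt;
mrossi@rk018445:~$ cd datasets&lt;br /&gt;
mrossi@rk018445:~/datasets$ pwd&lt;br /&gt;
/home/mrossi/datasets&lt;br /&gt;
&lt;br /&gt;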
To close the SSH session, just run&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;exit&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
from the command prompt of the remote shell.&lt;br /&gt;
&lt;br /&gt;
SSH sessions to Mufasa are subject to an inactivity timeout: after a given period during which no interaction between the user and Mufasa occurs, the ssh session is automatically closed and a new one must be opened in order to continue working. Users who need to be able to reconnect to the very same shell where they launched a program (for instance because their program is interactive or because it prints progress update messages) should use the &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; utility, as explained later in this document.&lt;br /&gt;
&lt;br /&gt;
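For reference, a minimal &amp;#039;&amp;#039;screen&amp;#039;&amp;#039; workflow (the session name &amp;#039;&amp;#039;mysession&amp;#039;&amp;#039; is an arbitrary example; more details are given later in this document) is:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;screen -S mysession&amp;#039;&amp;#039;: open a new named session, inside which you can launch your program&lt;br /&gt;
* press &amp;#039;&amp;#039;Ctrl-a&amp;#039;&amp;#039; followed by &amp;#039;&amp;#039;d&amp;#039;&amp;#039;: detach from the session, leaving the program running&lt;br /&gt;
* &amp;#039;&amp;#039;screen -r mysession&amp;#039;&amp;#039;: reattach to the session, even from a new SSH connection&lt;br /&gt;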
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-4&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;File transfer ==&lt;br /&gt;
&lt;br /&gt;
Uploading files from a local machine to Mufasa and downloading files from Mufasa to a local machine is done using &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; (&amp;#039;&amp;#039;Secure File Transfer Protocol&amp;#039;&amp;#039;).&lt;br /&gt;
&lt;br /&gt;
For this, Linux and MacOS users can directly use the &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; command-line program, as explained (for instance) in [https://geekflare.com/sftp-command-examples/ this guide]. In order to access Mufasa for file transfer, the first thing to do is to run the following command (note the similarity to SSH connections):&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;sftp &amp;amp;lt;your_username_on_Mufasa&amp;amp;gt;@&amp;amp;lt;Mufasa&amp;#039;s_IP_address&amp;amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
You will be asked for your password. Once you provide it, you access (via the terminal) an interactive sftp shell, where the command prompt takes the form&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;sftp&amp;amp;gt;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
You can run the required &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; commands from this shell. Most of these commands have two forms: one to act on the remote machine (i.e. Mufasa) and one to act on the local machine (i.e. the user&amp;#039;s computer). To differentiate, the “local” versions usually have names that start with the letter “l” (lowercase L). &lt;br /&gt;
&lt;br /&gt;
MacOS users can also interact with Mufasa via SFTP using the [https://cyberduck.io/ Cyberduck] software package.&lt;br /&gt;
&lt;br /&gt;
Windows users can interact with Mufasa via the SFTP protocol using the [https://mobaxterm.mobatek.net/ MobaXterm] software package.&lt;br /&gt;
&lt;br /&gt;
The most basic &amp;#039;&amp;#039;sftp&amp;#039;&amp;#039; commands (to be issued from the sftp command prompt) are:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;cd &amp;amp;lt;path&amp;amp;gt;&amp;#039;&amp;#039;: change directory to &amp;amp;lt;path&amp;amp;gt; on the remote machine (i.e. Mufasa)&lt;br /&gt;
* &amp;#039;&amp;#039;lcd &amp;amp;lt;path&amp;amp;gt;&amp;#039;&amp;#039;: change directory to &amp;amp;lt;path&amp;amp;gt; on the local machine (i.e. the user&amp;#039;s machine)&lt;br /&gt;
* &amp;#039;&amp;#039;get &amp;amp;lt;file&amp;amp;gt;&amp;#039;&amp;#039;: download (i.e. copy) &amp;amp;lt;file&amp;amp;gt; from the current directory of the remote machine to the current directory of the local machine&lt;br /&gt;
* &amp;#039;&amp;#039;put &amp;amp;lt;file&amp;amp;gt;&amp;#039;&amp;#039;: upload (i.e. copy) &amp;amp;lt;file&amp;amp;gt; from the current directory of the local machine to the current directory of the remote machine&lt;br /&gt;
* &amp;#039;&amp;#039;exit&amp;#039;&amp;#039;: quit sftp&lt;br /&gt;
&lt;br /&gt;
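For illustration, a complete session in which the hypothetical user &amp;#039;&amp;#039;mrossi&amp;#039;&amp;#039; uploads one file and downloads another (all file and directory names are made up) could look like this:&lt;br /&gt;
&lt;br /&gt;
sftp mrossi@10.79.23.96&lt;br /&gt;
sftp&amp;amp;gt; lcd results&lt;br /&gt;
sftp&amp;amp;gt; cd experiments&lt;br /&gt;
sftp&amp;amp;gt; put train_log.txt&lt;br /&gt;
sftp&amp;amp;gt; get output.csv&lt;br /&gt;
sftp&amp;amp;gt; exit&lt;br /&gt;
&lt;br /&gt;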
Of course, a user can only upload files to directories where they have write permission (usually only their own /home directory and its subdirectories), and can only download files for which they have read permission.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-5&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Docker containers ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;As a general rule, all computation performed on Mufasa must occur within &amp;#039;&amp;#039;&amp;#039;[https://www.docker.com/ &amp;#039;&amp;#039;&amp;#039;Docker containers&amp;#039;&amp;#039;&amp;#039;]. This allows every user to configure their own execution environment without any risk of interfering with anyone else&amp;#039;s.&lt;br /&gt;
&lt;br /&gt;
From [https://docs.docker.com/get-started/ Docker&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host.&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;#039;&amp;#039;A container is a sandboxed process on your machine that is isolated from all other processes on the host machine. When running a container, it uses an isolated filesystem. [containing] everything needed to run an application - all dependencies, configuration, scripts, binaries, etc. The image also contains other configuration for the container, such as environment variables, a default command to run, and other metadata.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Using Docker allows each user of Mufasa to build the software environment that their job(s) require. In particular, using Docker containers enables users to configure their own (containerized) system and install any required libraries on their own, without needing to ask administrators to modify the configuration of Mufasa. As a consequence, users can freely experiment with their (containerized) system without putting at risk the work of other users or the stability and reliability of Mufasa. For instance, containers allow users to run jobs that require multiple and/or obsolete versions of the same library.&lt;br /&gt;
&lt;br /&gt;
A large number of preconfigured Docker images are already available, so users do not usually need to start from scratch when preparing the environment where their jobs will run on Mufasa. The official Docker image repository is [https://hub.docker.com/search?q=&amp;amp;type=image Docker Hub].&lt;br /&gt;
&lt;br /&gt;
How to run Docker containers on Mufasa will be explained in Part 2 of this document.&lt;br /&gt;
&lt;br /&gt;
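As a generic preview (the Mufasa-specific procedure is the one described in Part 2, and the &amp;#039;&amp;#039;ubuntu:22.04&amp;#039;&amp;#039; image is just an illustrative choice), pulling an image from Docker Hub and opening a shell inside a container based on it looks like this:&lt;br /&gt;
&lt;br /&gt;
docker pull ubuntu:22.04&lt;br /&gt;
docker run --rm -it ubuntu:22.04 bash&lt;br /&gt;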
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-6&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;The SLURM job scheduling system ==&lt;br /&gt;
&lt;br /&gt;
Mufasa uses [https://slurm.schedmd.com/overview.html SLURM] to manage shared access to its resources. &amp;#039;&amp;#039;&amp;#039;Users of Mufasa must use SLURM to run and manage the jobs they run on the machine&amp;#039;&amp;#039;&amp;#039;&amp;lt;ref&amp;gt;It is possible for users to run jobs without using SLURM; however, running jobs this way is only intended for “housekeeping” activities and only provides access to a small subset of Mufasa&amp;#039;s resources. For instance, jobs run outside SLURM cannot access the GPUs, can only use a few processor cores, and can only access a small portion of RAM. Using SLURM is therefore necessary for any resource-intensive job.&lt;br /&gt;
&amp;lt;/ref&amp;gt;. From [https://slurm.schedmd.com/documentation.html SLURM&amp;#039;s documentation]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“&amp;#039;&amp;#039;Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.&amp;#039;&amp;#039;”&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
The use of a job scheduling system ensures that Mufasa&amp;#039;s resources are exploited in an efficient way. However, the fact that a schedule exists means that usually a job does not get immediately executed as soon as it is launched: instead, the job gets &amp;#039;&amp;#039;queued&amp;#039;&amp;#039; and will be executed as soon as possible, according to the availability of resources in the machine.&lt;br /&gt;
&lt;br /&gt;
Useful references for SLURM users are the [https://slurm.schedmd.com/man_index.html collected man pages] and the [https://slurm.schedmd.com/pdfs/summary.pdf command overview].&lt;br /&gt;
&lt;br /&gt;
In order to let SLURM schedule job execution, before launching a job a user must specify what resources (such as RAM, processor cores, GPUs, ...) it requires. While managing process queues, SLURM considers such requirements and matches them with the available resources. As a consequence, resource-heavy jobs generally take longer to reach execution, while less demanding jobs are usually put into execution quickly. On the other hand, processes that, while running, try to use more resources than they requested are killed by SLURM to avoid damaging other jobs.&lt;br /&gt;
&lt;br /&gt;
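As an illustration only (the exact options and the partition names available on Mufasa are described in Part 2 of this document), a hypothetical interactive job asking for 2 processor cores, 8 GB of RAM, one GPU and one hour of run time could be requested with:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;srun --cpus-per-task=2 --mem=8G --gres=gpu:1 --time=01:00:00 --pty bash&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;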
All in all, the take-away message is: &amp;#039;&amp;#039;consider carefully how many resources to request for your job&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
Part 2 of this document explains how resource requests can be greatly simplified by using predefined resource sets called &amp;#039;&amp;#039;SLURM partitions&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;span id=&amp;quot;anchor-7&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;Users and groups ==&lt;br /&gt;
&lt;br /&gt;
As already explained, only Mufasa users can access the machine and interact with it. Creation of new users is done by Job Administrators or by specially designated users within each research group.&lt;br /&gt;
&lt;br /&gt;
Mufasa usernames have the form &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;xyyy&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; (all lowercase), where &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;x&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; is the first letter of the first name and &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;yyy&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039; is the complete surname. For instance, user Mario Rossi will be assigned the username &amp;#039;&amp;#039;mrossi&amp;#039;&amp;#039;. If multiple users with the same surname and first letter of the name exist, those created after the first are given usernames &amp;#039;&amp;#039;xyyy01&amp;#039;&amp;#039;, &amp;#039;&amp;#039;xyyy02&amp;#039;&amp;#039;, and so on.&lt;br /&gt;
&lt;br /&gt;
On Linux machines such as Mufasa, users belong to &amp;#039;&amp;#039;groups&amp;#039;&amp;#039;. On Mufasa, groups are used to identify the research group that a specific user is part of. Assignment of Mufasa&amp;#039;s users to groups follows these rules:&lt;br /&gt;
&lt;br /&gt;
* All users belong to group &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;users&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
* Additionally, each user must belong to &amp;#039;&amp;#039;one and only one&amp;#039;&amp;#039; of the following groups (in parentheses is the faculty member in charge of Mufasa for each group):&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nearmrs&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [https://nearlab.polimi.it/medical/ Medical Robotics Section of NearLab] (prof. De Momi);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;nearnes&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [https://nearlab.polimi.it/neuroengineering/ NeuroEngineering Section of NearLab] (prof. Ferrante);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;cartcas&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [http://www.cartcas.polimi.it/ CartCasLab] (prof. Cerveri);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;biomech&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, i.e. [http://www.biomech.polimi.it/ Biomechanics Research Group] (prof. Votta);&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;bio&amp;#039;&amp;#039;&amp;#039;&amp;#039;&amp;#039;, for BioEngineering users not belonging to the research groups listed above.&lt;br /&gt;
&lt;br /&gt;
Users who are not Job Administrators but have been given the power to create users can do so with the command&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;sudo /opt/share/sbin/add_user.sh -u &amp;amp;lt;user&amp;amp;gt; -g users,&amp;amp;lt;group&amp;amp;gt;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
where &amp;#039;&amp;#039;&amp;amp;lt;user&amp;amp;gt;&amp;#039;&amp;#039; is the username of the new user and &amp;#039;&amp;#039;&amp;amp;lt;group&amp;amp;gt;&amp;#039;&amp;#039; is one of the 5 research groups from the list above.&lt;br /&gt;
&lt;br /&gt;
For instance, in order to create a user on Mufasa for a person named Mario Rossi belonging to the NeuroEngineering Section of NearLab, the following command will be used:&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;sudo /opt/share/sbin/add_user.sh -u mrossi -g users,nearnes&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
New users are created with a predefined password, which they will be asked to change at their first login. For security reasons, it is important that this first login occurs as soon as possible.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>