User Jobs
This page presents the features of Mufasa that are most relevant to Mufasa's Job Users. Job Users can submit jobs for execution, cancel their own jobs, and see other users' jobs (but not intervene on them).
Job Users are by necessity SLURM users (see The SLURM job scheduling system) so you may also want to read SLURM's own Quick Start User Guide.
Partitions
Several execution queues for jobs have been defined on Mufasa. Such queues are called partitions in SLURM terminology. Each partition has features (in terms of resources available to the jobs on that queue) that make the partition suitable for a certain category of jobs. SLURM command
sinfo
(link to SLURM docs) provides a list of available partitions. Its output is similar to this:
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
debug      up     infinite    1      mix    gn01
small*     up     12:00:00    1      mix    gn01
normal     up     1-00:00:00  1      mix    gn01
longnormal up     3-00:00:00  1      mix    gn01
gpu        up     1-00:00:00  1      mix    gn01
gpulong    up     3-00:00:00  1      mix    gn01
fat        up     3-00:00:00  1      mix    gn01
In this example, available partitions are named “debug”, “small”, “normal”, “longnormal”, “gpu”, “gpulong”, “fat”. On Mufasa, partition names make reference to the features of the jobs that the partition has been set up for: for instance, partition “debug” is used for test jobs, while partition "gpu" is for GPU-intensive jobs. The asterisk after the name of partition “small” marks it as the default partition, i.e. the one on which jobs are launched if no partition is specified.
The complete list of the resources assigned to each partition (i.e. those available to jobs on that partition) can be inspected with command
sinfo --Format=all
In this example, the output of this command is the following (blank lines added for clarity):
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS

up|(null)|62|0|852393|(null)|all|NO|infinite|1027000|rk018445|rk018445|1|yes|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |n/a |GANG,SUSPEND |gn01 |3.13 |debug |debug |all |mixed |Unknown |N/A |2 |31 |1

up|(null)|62|0|852393|(null)|all|FORCE:2|12:00:00|1027000|rk018445|rk018445|0|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |small* |small |all |mixed |Unknown |N/A |2 |31 |1

up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|10|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |normal |normal |all |mixed |Unknown |N/A |2 |31 |1

up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|100|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |24 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |longnormal |longnormal |all |mixed |Unknown |N/A |2 |31 |1

up|(null)|62|0|852393|(null)|all|FORCE:2|1-00:00:00|1027000|rk018445|rk018445|25|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |15:00 |GANG,SUSPEND |gn01 |3.13 |gpu |gpu |all |mixed |Unknown |N/A |2 |31 |1

up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|125|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |UNLIMITED |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |gpulong |gpulong |all |mixed |Unknown |N/A |2 |31 |1

up|(null)|62|0|852393|(null)|all|FORCE:2|3-00:00:00|1027000|rk018445|rk018445|200|no|1-infinite|mix|Unknown|21.08.2|1|2:31:1|1/0 |48 |16/46/0/62 |1 |none |1/0/0/1 |gpu:40gb:2(S:0-1),gpu:20gb:3(S:0-1),gpu:10gb:6(S:0-1) |Unknown |1 |1:00:00 |GANG,SUSPEND |gn01 |3.13 |fat |fat |all |mixed |Unknown |N/A |2 |31 |1
A less comprehensive but more readable view of partition features can be obtained by asking sinfo to print only a selection of fields instead of all of them.
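For example, the following command (the specific selection of fields is only an illustration, not part of Mufasa's configuration) prints a compact table with each partition's availability, time limit, per-node CPU cap, memory and GPU resources:
sinfo --Format=partitionname,available,time,maxcpuspernode,memory,gres:60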
When launching a job, users may exploit partitions by selecting the most suitable one and specifying that their job must be run on it. This avoids the need to specify the amount of each resource that the job requires, since a default set of resources has already been defined for each partition: the main difference between partitions is precisely in the default amount of resources that they assign to jobs. By selecting the right partition for their job, a user can thus pre-define the job's requirements without having to spell them out, which makes partitions very handy and reduces the chance of mistakes. A complete description of the default amount of resources that the partitions assign to their jobs can be obtained using SLURM command sinfo --Format=all (an example is shown above).
Partition defaults are defined by Job Administrators and cannot be modified by Job Users. Users can, however, select the partition on which each of their jobs is launched and, if needed, change the resources requested by their jobs with respect to the default values associated with that partition.
Any element of the default assignment of resources provided by a specific partition can be overridden by specifying an option when launching the job. Therefore users are not forced to accept the default value. However, it makes sense to choose the most suitable partition for a job in the first place, and then to specify the job's requirements only for those resources that have an unsuitable default value.
Resource requests by the user launching a job can be either lower or higher than the partition's default value for that resource. However, they cannot exceed the maximum value that the partition allows for that resource, if one is defined: for each resource, the maximum value is an additional parameter of the partition that System Administrators can specify. If a user tries to launch on a partition a job that requests more of a resource than the partition-specified maximum, the launch command is refused.
One of the resources provided to jobs by partitions is time, in the sense that a job is permitted to run for no longer than a predefined time duration. As with any other resource provided by a partition, this duration takes the default value unless the user specifies a different value. Jobs that exceed their allotted time are killed by SLURM.
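As an illustration (the partition name and amounts are examples only; the srun command itself is described in detail in the next sections), a job could be launched on partition “normal” while overriding its default time limit and memory assignment as follows:
srun -p normal --time=02:00:00 --mem=32G ./my_program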
Partition availability
The most important information that sinfo provides about a partition is its partition state, i.e. its availability. Partition state is shown in column AVAIL (note that there is also another column named STATE: it provides, instead, the state of the node(s), i.e. the machine(s), providing resources to the partition).
The standard value for partition state/availability is up, as in the example above, meaning that the partition is available for jobs. If the availability of a partition is stated as down or drain, all jobs waiting for that partition are paused and the intervention of a Job Administrator is required to restore the partition's operation.
Executing jobs on Mufasa
The main reason for a user to interact with Mufasa is to make it execute jobs that require resources not available to standard desktop-class machines. Therefore, launching jobs is the most important operation that users will perform on Mufasa: this section explains how it is done. Considering that all computation run on Mufasa must occur within Docker containers, the processes run by Mufasa users are always containers except for menial, non-computationally intensive jobs.
The process of launching user jobs requires two steps:
Step 1: use SLURM to run the Docker container where the job will take place;
Step 2: launch the user job from within the Docker container.
These steps are described in the following sections of this document.
An optional (but recommended) operation is to use an execution script to manage the launching process. How to do this is described in a dedicated section below.
Step 1: using SLURM to run a Docker container
As explained above, the first step to run a user job on Mufasa is to run the Docker container where the job will take place. A container is a “sandbox” containing the environment where the user's application operates. Parts of Mufasa's filesystem can be made visible (and writable, if they belong to the user's /home directory) to the environment of the container. This allows the containerized user application to read from, and write to, Mufasa's filesystem: for instance, to read data and write results.
Each user is in charge of preparing the Docker container(s) where the user's jobs will be executed. In most situations the user can simply select a suitable ready-made container from the many which are already available for use.
In order to run a Docker container via SLURM, a user must use a command similar to the following:
srun -p <partition_name> --container-image=<container_path.sqsh> --no-container-entrypoint --container-mounts=<mufasa_dir>:<docker_dir> --gres=<gpu_resources> --mem=<mem_resources> --cpus-per-task <cpu_amount> --pty --time=<hh:mm:ss>
<command_to_run_within_container>
We will now decompose this command into its constituent parts.
srun is one of SLURM's commands to run jobs (an alternative command, sbatch, is described further below). The following sections will provide additional details about srun and other ways to run jobs via SLURM.
All parts of the command above that come after srun are options that specify what to execute and how. Some of the options are specifically dedicated to Docker containers. (To facilitate the execution of Docker containers, the Nvidia Pyxis package has been installed on Mufasa as an adjunct to SLURM. Pyxis allows unprivileged users, i.e. those that are not administrators of Mufasa, to execute containers and run commands within them. Options --container-image, --no-container-entrypoint and --container-mounts are provided to srun by Pyxis.) Below is a description of the options:
-p <partition_name> specifies the resource partition on which the job will be run.
Important! If -p <partition_name> is used, options that specify how many resources to assign to the job (such as --mem=<mem_resources>, --cpus-per-task <cpu_amount> or --time=<hh:mm:ss>) can be omitted, greatly simplifying the command. If an explicit amount is not requested for a given resource, the job is assigned the default amount of that resource (as defined by the chosen partition). A notable exception to this rule concerns option --gres=<gpu_resources>: GPU resources must always be explicitly requested with option --gres, otherwise no access to GPUs is granted to the job.
--container-image=<container_path.sqsh> specifies the container to be run.
--no-container-entrypoint specifies that the entrypoint defined in the container image should not be executed (ENTRYPOINT in the Dockerfile that defines the container). The entrypoint is a command that gets executed as soon as the container is run: option --no-container-entrypoint is useful when the user is not sure of the effect of that command.
--container-mounts=<mufasa_dir>:<docker_dir> specifies what parts of Mufasa's filesystem will be available within the container's filesystem, and where they will be mounted; for instance, if <mufasa_dir>:<docker_dir> takes the value /home/mrossi:/data, this tells srun to mount Mufasa's directory /home/mrossi at /data within the filesystem of the Docker container. When the Docker container reads or writes files in its own (internal) directory /data, what actually happens is that files in /home/mrossi get manipulated instead. In this example, /home/mrossi is the only part of Mufasa's filesystem that is visible to, and changeable by, the Docker container.
--gres=<gpu_resources> specifies what GPUs to assign to the container; for instance, <gpu_resources> may be gpu:40gb:2, which corresponds to giving the job control of 2 entire large-size GPUs.
Important! The --gres parameter is mandatory if the job needs to use the system's GPUs. Unlike other resources (where an unspecified request leads to the assignment of a default amount of the resource), GPUs must always be explicitly requested with --gres.
--mem=<mem_resources> specifies the amount of RAM to assign to the container; for instance, <mem_resources> may be 200G.
--cpus-per-task <cpu_amount> specifies how many CPUs to assign to the container; for instance, <cpu_amount> may be 2.
--pty specifies that the job will be interactive (this is necessary when <command_to_run_within_container> is /bin/bash).
--time=<hh:mm:ss> specifies the maximum time the job is allowed to run, in the format hours:minutes:seconds; for instance, <hh:mm:ss> may be 72:00:00.
<command_to_run_within_container> is the executable that will be run within the Docker container as soon as it is up. A typical value is /bin/bash: this instructs srun to open an interactive shell session (i.e. a command-line terminal interface) within the container, from which the user will then run their job. Another typical value is python, which launches an interactive Python session from which the user will then run their job. It is also possible to use <command_to_run_within_container> to launch non-interactive programs.
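Putting the options together, a complete invocation might look like the following (purely illustrative: the image path, mount point and resource amounts are placeholders to be replaced with your own values):
srun -p gpu --container-image=/home/mrossi/my_container.sqsh --no-container-entrypoint --container-mounts=/home/mrossi:/data --gres=gpu:10gb:1 --mem=64G --cpus-per-task 4 --pty --time=12:00:00 /bin/bash
This runs the container on partition “gpu” with one 10 GB GPU, 64 GB of RAM and 4 CPUs for at most 12 hours, and opens an interactive bash shell inside it.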
Step 2: launching a user job from within a Docker container
Once the container is up and running, usually the user is dropped to the interactive environment specified by <command_to_run_within_container>. This interactive environment can be, for instance, a bash shell or the interactive Python mode. Once inside the interactive environment, the user can simply run the required program in the usual way (depending on the type of environment).
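For instance, assuming the container was launched with /home/mrossi mounted on /data (as in the example above) and that <command_to_run_within_container> was /bin/bash, a hypothetical job could be started from the container's shell with:
cd /data
python3 my_training_script.py
Here my_training_script.py is a placeholder for the user's own program; any output it writes under /data ends up in /home/mrossi on Mufasa.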
Using SLURM to run jobs: additional information
In SLURM, jobs are launched using commands srun (for interactive programs) or sbatch (for non-interactive ones). The preceding sections illustrated the use of srun that is most important to Mufasa's users, i.e. running a Docker container; this section provides a broader overview of both commands.
Mufasa's Job Users do not need to know the contents of this section in order to use the machine. These contents are provided to enhance the user's knowledge of SLURM and its usage, but are optional.
In the following, we provide more general information about SLURM commands srun and sbatch. The main difference between them is that srun locks the shell from which it has been launched, so it is only really suitable for processes that use the console for interaction with their user; sbatch, on the contrary, does not lock the shell and simply adds the job to the queue.
Basic srun and sbatch syntax
The basic syntax of an srun command (the one of an sbatch command is similar) is
srun <options> <path_of_the_program_to_be_run_via_SLURM>
Among the options, one of the most important is
--gres=gpu:K
where K is an integer between 1 and the maximum number of GPUs available in the server (5 for Mufasa). This option specifies how many of the GPUs the program requests for use. Since GPUs are the most scarce resources of Mufasa, this option must always be explicitly specified when running a job that requires GPUs.
A quick way to define the set of resources that a program will have access to is to use option
-p <partition_name>
This option specifies that SLURM will run the program on a specific partition, and therefore that it will have access to all and only the resources available to that partition. As a consequence, all options that define how many resources to assign to the job, such as --gres=gpu:K, will only be able to provide the job with resources that are available to the chosen partition. Jobs that require resources that are not available to the chosen partition do not get executed.
For instance, running
srun -p small ./my_program
makes SLURM run my_program on the partition called “small”. Running the program this way means that the resources associated to this partition will be available to it for use.
If, instead, I do not want to specify a partition for my_program but still want to ensure that it gets access to one GPU to operate correctly, I need to request the GPU explicitly in the srun command, as follows:
srun --gres=gpu:1 ./my_program
Running interactive jobs via SLURM
As explained, srun is suitable for launching interactive user jobs, i.e. jobs that use the terminal output and the keyboard to exchange information with a human user. If a user needs this type of interaction, they must run a bash shell (i.e. a terminal session) with
srun --pty /bin/bash
and subsequently use the bash shell to run the interactive program. To close the SLURM-spawned bash shell, run exit (as with any other shell).
Of course, the “base” shell (i.e. the one that opens when an SSH connection to Mufasa is established) can also be used to run programs: however, programs launched this way are not run via SLURM and therefore cannot access most of the resources of the machine (in particular, there is no way to make GPUs accessible to them). On the contrary, running programs with srun ensures that they can access all the resources managed by SLURM.
As usual, GPU resources (if needed) must always be requested explicitly with parameter
--gres=gpu:K. For instance, to run an interactive program which needs one GPU, I will first run a bash shell via SLURM with command
srun --gres=gpu:1 --pty /bin/bash
and then run the interactive program from the newly opened shell.
An alternative to explicitly specifying what resources to assign to the bash shell run via SLURM is to run /bin/bash on one of the available partitions. For instance, to run the shell on partition “small” the command is
srun -p small --pty /bin/bash
Mufasa is configured to show, as part of the command prompt of a bash shell run via SLURM, a message such as (SLURM ID xx) (where xx is the ID of the /bin/bash process within SLURM). When you see this message, you know that the bash shell you are interacting with is a SLURM one.
Another way to know if the current shell is the “base” shell or a new one run via SLURM is to run command
echo $SLURM_JOB_ID
If no number gets printed, this means that the shell is the “base” one. If a number is printed, it is the SLURM job ID of the /bin/bash process.
Using screen with srun
A consequence of the way srun operates is that if you launch an interactive job but do not plan to keep the SSH connection to Mufasa open (or if you fear that the timeout on SSH connections will cut your contact with the shell) you should use command srun inside a screen (here is one of many tutorials about screen available online), then detach from the screen. Now you can disconnect from Mufasa; when you need to reach your job again, you can reopen an SSH connection to Mufasa and then reconnect to the screen.
More specifically, the succession of operations is:
- From the Mufasa shell, run screen
- In the screen thus created (it has the look of an empty shell), launch your job with srun
- Detach from the screen with ctrl + A followed by D: you will come back to the original Mufasa shell, and your process will go on running in the screen
- Close the SSH session to Mufasa
- (later) To resume contact with your running process, connect to Mufasa with SSH
- In the Mufasa shell, run screen -r
- You are now back to the screen where you launched your job
- When you do not need the screen containing your job anymore, destroy it by using (within the screen) ctrl + A followed by K
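In terms of commands, the sequence above boils down to something like the following (the srun line is only an example):
screen                           # create a new screen
srun -p small --pty /bin/bash    # launch the job inside the screen
# detach with ctrl + A followed by D, then close the SSH session
# ... later, from a new SSH session to Mufasa ...
screen -r                        # reattach to the screen where the job is running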
A use case for screen is writing your program in such a way that it prints progress advancement messages as it goes on with its work. Then, you can check its advancement by periodically reconnecting to the screen where the program is running and reading the messages it printed.
Using execution scripts to wrap user jobs
The preceding sections explained how to use SLURM to run user jobs directly, i.e. by specifying the value of SLURM parameters on the command line. Each parameter value is provided to SLURM by including an argument such as
--parameter_name=parameter_value
into the command line.
In general, though, it is preferable to wrap the commands that run jobs into execution scripts. An execution script makes specifying all required parameters easier, makes errors in configuring such parameters less likely, and -most importantly- can be reused for other jobs.
An execution script is a Linux shell script composed of two parts:
- a preamble where the user specifies the values to be given to parameters, each on a line introduced by the keyword #SBATCH;
- one or more srun commands that use SLURM to run jobs, using the parameter values specified by the preamble.
An execution script is a special type of Linux bash script. A bash script is a file that is intended to be run by the bash command interpreter. In order to be acceptable as a bash script, a text file must:
- have the “executable” flag set;
- have “#!/bin/bash” as its very first line.
Usually, a Linux bash script is given a name ending in .sh, such as my_execution_script.sh. To execute the script, just open a terminal, write the script's full path (e.g., ./my_execution_script.sh) and press <enter>. Within a bash script, lines preceded by “#” are comments (with the notable exception of the initial “#!/bin/bash” line and of #SBATCH directives). Use of blank lines as spacers is allowed.
Below is an example of an execution script:
#!/bin/bash
# ----------------preamble----------------
# Note: these are examples. Put your own #SBATCH directives below
#SBATCH --job-name=myjob
# name assigned to the job
#SBATCH --cpus-per-task=1
# number of threads allocated to each task
#SBATCH --mem-per-cpu=500M
# amount of memory per CPU core
#SBATCH --gres=gpu:1
# number of GPUs per node
#SBATCH --partition=small
# the partition to run your jobs in
#SBATCH --time=0-00:01:00
# time assigned to your jobs to run (format: day-hour:min:sec)
# ----------------srun commands-----------------
# Put your own srun command(s) below: see the section "Step 1: using SLURM to run a Docker container"
srun ...
As the example above shows, beyond the initial “#!/bin/bash” line the script includes a series of #SBATCH directives used to specify parameter values, and finally one or more srun commands that run the jobs. Any parameter accepted by commands srun and sbatch can be used in an #SBATCH directive in an execution script.
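As a purely illustrative sketch (the container image path, mount point and resource amounts are placeholders), an execution script wrapping the Docker-based srun command of Step 1 might look like this:
#!/bin/bash
# ----------------preamble----------------
#SBATCH --job-name=myjob
#SBATCH --partition=gpu
#SBATCH --time=0-12:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
# ----------------srun command-----------------
# Placeholder image and paths; /home/mrossi is mounted on /data inside the container.
# GPUs must be requested explicitly with --gres (see Step 1).
srun --container-image=/home/mrossi/my_container.sqsh \
     --container-mounts=/home/mrossi:/data \
     --gres=gpu:10gb:1 \
     python3 /data/my_training_script.py
The backslashes simply continue the single srun command across several lines.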
Job caching
When a Job User runs a job via SLURM (with or without an execution script), Mufasa exploits a (transparent) caching mechanism to speed up its execution. The speedup is obtained by removing the need for the running job to access the (mechanical, slow) HDDs where /home partitions reside, substituting such accesses with accesses to (solid-state, fast) SSDs.
More precisely, each time a job is run via SLURM, Mufasa:
- temporarily copies code and associated data from the user's own /home partition to a cache space located on system SSDs;
- runs the user job from the SSDs, using the copy of the data on the SSD as input;
- creates the output file(s) on the SSDs;
- when the job ends, copies the output files from the SSDs to the user's own /home partition.
The whole process is completely transparent to the user. The user simply prepares executable and data in their /home folder, then runs the job (possibly via an execution script). When job execution ends, the user finds their output data in the /home folder, exactly as if the execution actually occurred there.
Monitoring and managing jobs
SLURM provides Job Users with several tools to inspect and manage jobs. While a Job User is able to inspect all users' jobs, they are only allowed to modify the condition of their own jobs.
From SLURM's overview (the links point to the appropriate URLs in SLURM's online documentation): “User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs [i.e. to inspect the scheduling queue], and sacct to get information about jobs and job steps that are running or have completed.”
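For instance (the job ID 1234 is just a placeholder):
squeue                 # list all queued and running jobs
squeue -u $USER        # list only your own jobs
scancel 1234           # cancel your own job with SLURM ID 1234
sacct -j 1234          # accounting information about job 1234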