Advanced SLURM usage and Multiple GPU jobs
==========================================

Handling preemption
-------------------

.. _advanced_preemption:

On the Apuana cluster, jobs can preempt one-another depending on their priority
(unkillable>high>low) (See the `Slurm documentation
<https://slurm.schedmd.com/preempt.html>`_)

The default preemption mechanism is to kill and re-queue the job automatically
without any notice. To allow a different preemption mechanism, every partition
have been duplicated (i.e. have the same characteristics as their counterparts)
allowing a **120sec** grace period before killing your job *but don't requeue
it automatically*: those partitions are referred by the suffix: ``-grace``
(``main-grace, long-grace, main-cpu-grace, long-cpu-grace``).

When using a partition with a grace period, a series of signals consisting of
first ``SIGCONT`` and ``SIGTERM`` then ``SIGKILL`` will be sent to the SLURM
job.  It's good practice to catch those signals using the Linux ``trap`` command
to properly terminate a job and save what's necessary to restart the job.  On
each cluster, you'll be allowed a *grace period* before SLURM actually kills
your job (``SIGKILL``).

The easiest way to handle preemption is by trapping the ``SIGTERM`` signal

.. code-block:: console

    #SBATCH --ntasks=1
    #SBATCH ....

    exit_script() {
        echo "Preemption signal, saving myself"
        trap - SIGTERM # clear the trap
        # Optional: sends SIGTERM to child/sub processes
        kill -- -$$
    }

    trap exit_script SIGTERM

    # The main script part
    python3 my_script


.. note::
    | **Requeuing**:
    | The Slurm scheduler on the cluster does not allow a grace period before
    | preempting a job while requeuing it automatically, therefore your job will
    | be cancelled at the end of the grace period.
    | To automatically requeue it, you can just add the ``sbatch`` command inside
    | your ``exit_script`` function.


Packing jobs
------------

Sharing a GPU between processes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``srun``, when used in a batch job is responsible for starting tasks on the
allocated resources (see srun) SLURM batch script

.. code-block:: console

    #SBATCH --ntasks-per-node=2
    #SBATCH --output=myjob_output_wrapper.out
    #SBATCH --ntasks=2
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=18G
    srun -l --output=myjob_output_%t.out python script args

This will run Python 2 times, each process with 4 CPUs with the same arguments
``--output=myjob_output_%t.out`` will create 2 output files appending the task
id (``%t``) to the filename and 1 global log file for things happening outside
the ``srun`` command.

Knowing that, if you want to have 2 different arguments to the Python program,
you can use a multi-prog configuration file: ``srun -l --multi-prog silly.conf``

.. code-block:: console

    0  python script firstarg
    1  python script secondarg

Or by specifying a range of tasks

.. code-block:: console

    0-1  python script %t

%t being the taskid that your Python script will parse.  Note the ``-l`` on the
``srun`` command: this will prepend each line with the taskid (0:, 1:)

Sharing a node with multiple GPU 1process/GPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On Digital Research Alliance of Canada, several nodes, especially nodes with
``largeGPU`` (P100) are reserved for jobs requesting the whole node, therefore
packing multiple processes in a single job can leverage faster GPU.

If you want different tasks to access different GPUs in a single allocation you
need to create an allocation requesting a whole node and using ``srun`` with a
subset of those resources (1 GPU).

Keep in mind that every resource not specified on the ``srun`` command while
inherit the global allocation specification so you need to split each resource
in a subset (except --cpu-per-task which is a per-task requirement)

Each ``srun`` represents a job step (``%s``).

Example for a GPU node with 24 cores and 4 GPUs and 128G of RAM
Requesting 1 task per GPU

.. code-block:: console

    #!/bin/bash
    #SBATCH --nodes=1-1
    #SBATCH --ntasks-per-node=4
    #SBATCH --output=myjob_output_wrapper.out
    #SBATCH --gres=gpu:4
    #SBATCH --cpus-per-task=6
    srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive --multi-prog python script args1 &
    srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive --multi-prog python script args2 &
    srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive --multi-prog python script args3 &
    srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive --multi-prog python script args4 &
    wait

This will create 4 output files:

- JOBID-step-0.out
- JOBID-step-1.out
- JOBID-step-2.out
- JOBID-step-3.out


Sharing a node with multiple GPU & multiple processes/GPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Combining both previous sections, we can create a script requesting a whole node
with four GPUs, allocating 1 GPU per ``srun`` and sharing each GPU with multiple
processes

Example still with a 24 cores/4 GPUs/128G RAM
Requesting 2 tasks per GPU

.. code-block:: console

    #!/bin/bash
    #SBATCH --nodes=1-1
    #SBATCH --ntasks-per-node=8
    #SBATCH --output=myjob_output_wrapper.out
    #SBATCH --gres=gpu:4
    #SBATCH --cpus-per-task=3
    srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
    srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
    srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
    srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
    wait

``--exclusive`` is important to specify subsequent step/srun to bind to different cpus.

This will produce 8 output files, 2 for each step:

- JOBID-step-0-task-0.out
- JOBID-step-0-task-1.out
- JOBID-step-1-task-0.out
- JOBID-step-1-task-1.out
- JOBID-step-2-task-0.out
- JOBID-step-2-task-1.out
- JOBID-step-3-task-0.out
- JOBID-step-3-task-1.out

Running ``nvidia-smi`` in silly.conf, while parsing the output, we can see 4
GPUs allocated and 2 tasks per GPU

.. code-block:: console

    $ cat JOBID-step-* | grep Tesla
    0: |   0  Tesla P100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
    1: |   0  Tesla P100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
    0: |   0  Tesla P100-PCIE...  On   | 00000000:83:00.0 Off |                    0 |
    1: |   0  Tesla P100-PCIE...  On   | 00000000:83:00.0 Off |                    0 |
    0: |   0  Tesla P100-PCIE...  On   | 00000000:82:00.0 Off |                    0 |
    1: |   0  Tesla P100-PCIE...  On   | 00000000:82:00.0 Off |                    0 |
    0: |   0  Tesla P100-PCIE...  On   | 00000000:03:00.0 Off |                    0 |
    1: |   0  Tesla P100-PCIE...  On   | 00000000:03:00.0 Off |                    0 |