User’s guide¶

…or IDT’s list of opinionated howtos

This section seeks to provide users of the Apuana infrastructure with practical knowledge, tips and tricks and example commands.

Running your code¶

SLURM commands guide¶

Basic Usage¶

The SLURM documentation provides extensive information on the available commands to query the cluster status or submit jobs.

Below are some basic examples of how to use SLURM.

Submitting jobs¶

Batch job¶

In order to submit a batch job, you have to create a script containing the main command(s) you would like to execute on the allocated resources/nodes.

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=job_output.txt
#SBATCH --error=job_error.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem=100Gb

module load python/3.5
python my_script.py

Your job script is then submitted to SLURM with sbatch:

$ sbatch job_script
sbatch: Submitted batch job 4323674

The working directory of the job will be the one from which you executed sbatch.

Tip

Slurm directives can be specified on the command line alongside sbatch or inside the job script with a line starting with #SBATCH.
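As a minimal sketch of the two styles, the snippet below writes a small job script and then overrides one of its directives from the command line (options given to sbatch on the command line take precedence over #SBATCH lines in the script). The temporary file, the job content and the time values are illustrative assumptions, and the submission only happens if sbatch is available.

```shell
#!/bin/bash
# Write a small job script to a temporary file (illustrative content).
job_script="$(mktemp)"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time=10:00
#SBATCH --mem=1G
echo "Running on $(hostname)"
EOF

# Same script, but request 30 minutes instead of 10 from the command line.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --time=30:00 "$job_script"
else
    echo "sbatch not found; would run: sbatch --time=30:00 $job_script"
fi
```

Keeping the defaults in the script and overriding only what changes between runs avoids maintaining several near-identical job scripts.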

Interactive job¶

Workload managers usually run batch jobs so that you do not have to watch their progression and so that the scheduler can run them as soon as resources are available. If you want to get access to a shell while leveraging cluster resources, you can submit an interactive job, where the main executable is a shell, with the srun or salloc commands:

salloc

This will start an interactive job on the first node available with the default resources set in SLURM (1 task/1 CPU). srun accepts the same arguments as sbatch, with the exception that the environment is not passed.

Tip

To pass your current environment to an interactive job, add --preserve-env to srun.

salloc can also be used. Without further arguments it is mostly a wrapper around srun, but it gives more flexibility if, for example, you want to get an allocation on multiple nodes.

Job submission arguments¶

In order to accurately select the resources for your job, several arguments are available. The most important ones are:

Argument	Description
-n, --ntasks=<number>	The number of tasks in your script, usually 1
-c, --cpus-per-task=<ncpus>	The number of cores for each task
-t, --time=<time>	Time requested for your job
--mem=<size[units]>	Memory requested for all your tasks
--gres=<list>	Select generic resources such as GPUs for your job: `--gres=gpu:GPU_MODEL`

Tip

Always consider requesting an adequate amount of resources, no more than your job needs, to improve the scheduling of your job (small jobs always run first).

Checking job status¶

To display the jobs currently in the queue, use squeue. To list only your jobs, type:

$ squeue -u $USER
JOBID   USER          NAME    ST  START_TIME         TIME NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) COMMENT
133     my_username   myjob   R   2019-03-28T18:33   0:50     1    2        N/A   7000M node1 (None) (null)

Note

Each user can have at most 1000 jobs submitted to the system at any given time from a given association (MaxSubmitJobs=1000). If this limit is reached, new submission requests will be denied until existing jobs in this association complete.
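To see roughly how close you are to that limit, you can count your queued jobs. The sketch below is an approximation under the assumption that every line printed by squeue counts towards the limit; the variable names are illustrative, and the count falls back to 0 when squeue is not available.

```shell
#!/bin/bash
# Per-user limit quoted in the note above.
max_submit_jobs=1000
user_name="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line, -r prints one line per job array element.
    n_jobs="$(squeue -u "$user_name" -h -r | wc -l)"
else
    n_jobs=0   # squeue is unavailable outside the cluster
fi

remaining=$(( max_submit_jobs - n_jobs ))
echo "$n_jobs job(s) in the queue for $user_name, $remaining submission(s) left"
```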

Removing a job¶

To cancel your job, simply use scancel:

scancel 4323674

Partitioning¶

Since we don’t have many GPUs on the cluster, resources must be shared as fairly as possible. The --partition=/-p flag of SLURM allows you to set the priority you need for a job. Each job assigned with a priority can preempt jobs with a lower priority: unkillable > main > long. Once preempted, your job is killed without notice and is automatically re-queued on the same partition until resources are available. (To leverage a different preemption mechanism, see the Handling preemption section.)

Flag	Max Resource Usage	Max Time	Note
--partition=unkillable	6 CPUs, mem=32G, 1 GPU	2 days
--partition=unkillable-cpu	2 CPUs, mem=16G	2 days	CPU-only jobs
--partition=short-unkillable	24 CPUs, mem=128G, 4 GPUs	3 hours (!)	Large but short jobs
--partition=main	8 CPUs, mem=48G, 2 GPUs	5 days
--partition=main-cpu	8 CPUs, mem=64G	5 days	CPU-only jobs
--partition=long	no limit of resources	7 days
--partition=long-cpu	no limit of resources	7 days	CPU-only jobs

Warning

Historically, before the 2022 introduction of CPU-only nodes (e.g. the cn-f series), CPU jobs ran side-by-side with the GPU jobs on GPU nodes. To prevent them obstructing any GPU job, they were always lowest-priority and preemptible. This was implemented by automatically assigning them to one of the now-obsolete partitions cpu_jobs, cpu_jobs_low or cpu_jobs_low-grace.

Do not use these partition names anymore. Prefer the *-cpu partition names defined above.

For backwards-compatibility purposes, the legacy partition names are translated to their effective equivalent long-cpu, but they will eventually be removed entirely.

Note

As a convenience, should you request the unkillable, main or long partition for a CPU-only job, the partition will be translated to its -cpu equivalent automatically.

For instance, to request an unkillable job with 1 GPU, 4 CPUs, 10G of RAM and 12h of computation do:

sbatch --gres=gpu:1 -c 4 --mem=10G -t 12:00:00 --partition=unkillable <job.sh>

You can also make it an interactive job using salloc:

salloc --gres=gpu:1 -c 4 --mem=10G -t 12:00:00 --partition=unkillable

The Mila cluster has many different types of nodes/GPUs. To request a specific type of node/GPU, you can add specific feature requirements to your job submission command.

To access those special nodes you need to request them explicitly by adding the flag --constraint=<name>. The full list of nodes in the Mila cluster can be found in the Node profile description.

Example:

To request a machine with 2 GPUs using NVLink, you can use

sbatch -c 4 --gres=gpu:2 --constraint=nvlink <job.sh>

Feature	Particularities
12GB/16GB/24GB/32GB/48GB	Request a specific amount of GPU memory
volta/turing/ampere	Request a specific GPU architecture
nvlink	Machine with GPUs using the NVLink interconnect technology

Information on partitions/nodes¶

sinfo provides most of the information about available nodes and partitions/queues to submit jobs to.

A partition is a group of nodes that usually share similar features. Job limits can be applied to a partition, and these override the values requested for a job (e.g. max time, max CPUs, etc.).

To display available partitions, simply use

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
batch     up     infinite     2 alloc  node[1,3,5-9]
batch     up     infinite     6 idle   node[10-15]
cpu       up     infinite     6 idle   cpu_node[1-15]
gpu       up     infinite     6 idle   gpu_node[1-15]

To display available nodes and their status, you can use

$ sinfo -N -l
NODELIST    NODES PARTITION STATE  CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
node[1,3,5-9]   2 batch     allocated 2    246    16000     0  (null)   (null)
node[2,4]       2 batch     drain     2    246    16000     0  (null)   (null)
node[10-15]     6 batch     idle      2    246    16000     0  (null)   (null)
...

And to get statistics on a running or terminated job, use sacct with the fields you want to display:

$ sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,workdir -u $USER
    User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed   NNodes      NCPUS        NodeList              WorkDir
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- --------------------
my_usern+ 2398         run_extra+      batch    RUNNING 130-05:00+ 2019-03-27T18:33:43             Unknown 1-01:07:54        1         16 node9           /home/mila/my_usern+
my_usern+ 2399         run_extra+      batch    RUNNING 130-05:00+ 2019-03-26T08:51:38             Unknown 2-10:49:59        1         16 node9           /home/mila/my_usern+

Or to get the list of all your previous jobs, use the --start=YYYY-MM-DD flag. You can check sacct(1) for further information about additional time formats.

sacct -u $USER --start=2019-01-01
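Rather than typing a date by hand, the start date can be computed. The following is a small sketch that lists the jobs of the last seven days; it assumes GNU date (for the -d option) and a shortened, illustrative field list, and it only calls sacct when the command is available.

```shell
#!/bin/bash
# Start date one week back, in the YYYY-MM-DD form expected by --start
# (date -d is the GNU coreutils syntax).
start_date="$(date -d '7 days ago' +%Y-%m-%d)"
user_name="${USER:-$(id -un)}"

if command -v sacct >/dev/null 2>&1; then
    sacct -u "$user_name" --start="$start_date" --format=JobID,JobName,State,Elapsed
else
    echo "sacct not found; would query jobs since $start_date"
fi
```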

scontrol can be used to provide specific information on a job (currently running or recently terminated):

$ scontrol show job 43123
JobId=43123 JobName=python_script.py
UserId=my_username(1500000111) GroupId=student(1500000000) MCS_label=N/A
Priority=645895 Nice=0 Account=my_username QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=3 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=2-10:41:57 TimeLimit=130-05:00:00 TimeMin=N/A
SubmitTime=2019-03-26T08:47:17 EligibleTime=2019-03-26T08:49:18
AccrueTime=2019-03-26T08:49:18
StartTime=2019-03-26T08:51:38 EndTime=2019-08-03T13:51:38 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-03-26T08:49:18
Partition=slurm_partition AllocNode:Sid=login-node-1:14586
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node2
BatchHost=node2
NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=32000M,node=1,billing=3
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=16 MinMemoryNode=32000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
WorkDir=/home/mila/my_username
StdErr=/home/mila/my_username/slurm-43123.out
StdIn=/dev/null
StdOut=/home/mila/my_username/slurm-43123.out
Power=

Or to get more information on a node and its resources:

$ scontrol show node node9
NodeName=node9 Arch=x86_64 CoresPerSocket=4
CPUAlloc=16 CPUTot=16 CPULoad=1.38
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=10.252.232.4 NodeHostName=mila20684000000 Port=0 Version=18.08
OS=Linux 4.15.0-1036 #38-Ubuntu SMP Fri Dec 7 02:47:47 UTC 2018
RealMemory=32000 AllocMem=32000 FreeMem=23262 Sockets=2 Boards=1
State=ALLOCATED+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurm_partition
BootTime=2019-03-26T08:50:01 SlurmdStartTime=2019-03-26T08:51:15
CfgTRES=cpu=16,mem=32000M,billing=3
AllocTRES=cpu=16,mem=32000M
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Useful Commands¶

salloc: Get an interactive job and give you a shell. (ssh like) CPU only
salloc --gres=gpu:1 -c 2 --mem=12000: Get an interactive job with one GPU, 2 CPUs and 12000 MB RAM
sbatch: start a batch job (same options as salloc)
sattach --pty <jobid>.0: Re-attach a dropped interactive job
sinfo: status of all nodes
sinfo -Ogres:27,nodelist,features -tidle,mix,alloc: List GPU type and FEATURES that you can request
savail: (Custom) List available gpu
scancel <jobid>: Cancel a job
squeue: summary status of all active jobs
squeue -u $USER: summary status of all YOUR active jobs
squeue -j <jobid>: summary status of a specific job
squeue -Ojobid,name,username,partition,state,timeused,nodelist,gres,tres: status of all jobs including requested resources (see the SLURM squeue doc for all output options)
scontrol show job <jobid>: Detailed status of a running job
sacct -j <job_id> -o NodeList: Get the node where a finished job ran
sacct -u $USER -S <start_time> -E <stop_time>: Find info about old jobs
sacct -oJobID,JobName,User,Partition,Node,State: List of current and recent jobs

Special GPU requirements¶

Specific GPU architecture and memory can be easily requested through the --gres flag by using either

--gres=gpu:architecture:number
--gres=gpu:memory:number
--gres=gpu:model:number

Example:

To request 1 GPU with at least 16GB of memory use

sbatch -c 4 --gres=gpu:16gb:1 <job.sh>

The full list of GPUs and their features can be accessed here.

Example script¶

Here is an sbatch script that follows good practices on the Mila cluster:

#!/bin/bash

#SBATCH --partition=unkillable                           # Ask for unkillable job
#SBATCH --cpus-per-task=2                                # Ask for 2 CPUs
#SBATCH --gres=gpu:1                                     # Ask for 1 GPU
#SBATCH --mem=10G                                        # Ask for 10 GB of RAM
#SBATCH --time=3:00:00                                   # The job will run for 3 hours
#SBATCH -o /network/scratch/<u>/<username>/slurm-%j.out  # Write the log on scratch

# 1. Load the required modules
module --quiet load anaconda/3

# 2. Load your environment
conda activate "<env_name>"

# 3. Copy your dataset on the compute node
cp /network/datasets/<dataset> $SLURM_TMPDIR

# 4. Launch your job, tell it to save the model in $SLURM_TMPDIR
#    and look for the dataset into $SLURM_TMPDIR
python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR

# 5. Copy whatever you want to save on $SCRATCH
cp $SLURM_TMPDIR/<to_save> /network/scratch/<u>/<username>/

Portability concerns and solutions¶

When working on a software project, it is important to be aware of all the software and libraries the project relies on and to list them explicitly and under a version control system in such a way that they can easily be installed and made available on different systems. The upsides are significant:

Easily install and run on the cluster
Ease of collaboration
Better reproducibility

To achieve this, try to always keep in mind the following aspects:

Versions: For each dependency, make sure you have some record of the specific version you are using during development. That way, in the future, you will be able to reproduce the original environment which you know to be compatible. Indeed, the more time passes, the more likely it is that newer versions of some dependency have breaking changes. The pip freeze command can create such a record for Python dependencies.
Isolation: Ideally, each of your software projects should be isolated from the others. What this means is that updating the environment for project A should not update the environment for project B. That way, you can freely install and upgrade software and libraries for the former without worrying about breaking the latter (which you might not notice until weeks later, the next time you work on project B!). Isolation can be achieved using Python virtual environments and containers.
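As a minimal sketch of the version-recording step, the commands below save the packages of the currently active Python environment to a file that can be placed under version control. The file name requirements.txt is a common convention rather than a requirement, and python3 is assumed to be the interpreter of the environment you want to record.

```shell
#!/bin/bash
# File name is a common convention, not a requirement.
req_file="requirements.txt"

# Record the exact versions installed in the currently active environment.
python3 -m pip freeze > "$req_file"
echo "Recorded $(wc -l < "$req_file") pinned package(s) in $req_file"

# Later, inside a fresh virtual environment, the same versions can be restored with:
#   python3 -m pip install -r requirements.txt
```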

Managing your environments¶

Virtual environments¶

A virtual environment in Python is a local, isolated environment in which you can install or uninstall Python packages without interfering with the global environment (or other virtual environments). It usually lives in a directory (location varies depending on whether you use venv, conda or poetry). In order to use a virtual environment, you have to activate it. Activating an environment essentially sets environment variables in your shell so that:

python points to the right Python version for that environment (different virtual environments can use different versions of Python!)
python looks for packages in the virtual environment
pip install installs packages into the virtual environment
Any shell commands installed via pip install are made available
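The effect of activation can be observed directly. The sketch below creates a throwaway environment with venv, activates it and prints where python now resolves to; the directory name demo_env is arbitrary, and --without-pip is used only to keep the demonstration quick.

```shell
#!/bin/bash
# Throwaway environment in a temporary directory (the name demo_env is arbitrary).
env_dir="$(mktemp -d)/demo_env"
# --without-pip keeps this demonstration quick; omit it for a real environment.
python3 -m venv --without-pip "$env_dir"

. "$env_dir/bin/activate"        # same as: source "$env_dir/bin/activate"
active_env="$VIRTUAL_ENV"
active_python="$(command -v python)"
echo "VIRTUAL_ENV: $active_env"
echo "python:      $active_python"

deactivate                       # restores the previous PATH
```

After activation, python is found inside the environment's bin directory; deactivate undoes the change for the current shell.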

To run experiments within a virtual environment, you can simply activate it in the script given to sbatch.

Pip/Virtualenv¶

Pip is the preferred package manager for Python and each cluster provides several Python versions through the associated module which comes with pip. In order to install new packages, you will first have to create a personal space for them to be stored. The preferred solution (as it is the preferred solution on Digital Research Alliance of Canada clusters) is to use virtual environments.

First, load the Python module you want to use:

module load python/3.8

Then, create a virtual environment in your home directory:

python -m venv $HOME/<env>

Where <env> is the name of your environment. Finally, activate the environment:

source $HOME/<env>/bin/activate

You can now install any Python package you wish using the pip command, e.g. pytorch:

pip install torch torchvision

Or Tensorflow:

pip install tensorflow

Conda¶

Another solution for Python is to use miniconda or anaconda, which are also available through the module command. (The use of Conda is not recommended on Digital Research Alliance of Canada clusters due to the availability of custom-built packages for pip.)

$ module load miniconda/3
[=== Module miniconda/3 loaded ===]
To enable conda environment functions, first use:

To create an environment (see here for details) using a specific Python version, you may write:

conda create -n <env> python=3.9

Where <env> is the name of your environment. You can now activate it by doing:

conda activate <env>

You are now ready to install any Python package you want in this environment. For instance, to install PyTorch, you can find the Conda command for any version you want on PyTorch’s website, e.g.:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

If you make a lot of environments and install/uninstall a lot of packages, it can be good to periodically clean up Conda’s cache:

conda clean --all

Using Modules¶

A lot of software, such as Python and Conda, is already compiled and available on the cluster through the module command and its sub-commands. In particular, if you wish to use Python 3.7 you can simply do:

module load python/3.7

The module command¶

For a list of available modules, simply use:

$ module avail
--------------------------------------------------------------------------------------------------------------- Global Aliases ---------------------------------------------------------------------------------------------------------------
    cuda/10.0 -> cudatoolkit/10.0    cuda/9.2      -> cudatoolkit/9.2                                 pytorch/1.4.1       -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.1    tensorflow/1.15 -> python/3.7/tensorflow/1.15
    cuda/10.1 -> cudatoolkit/10.1    mujoco-py     -> python/3.7/mujoco-py/2.0                        pytorch/1.5.0       -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0    tensorflow/2.2  -> python/3.7/tensorflow/2.2
    cuda/10.2 -> cudatoolkit/10.2    mujoco-py/2.0 -> python/3.7/mujoco-py/2.0                        pytorch/1.5.1       -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1
    cuda/11.0 -> cudatoolkit/11.0    pytorch       -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1    tensorflow          -> python/3.7/tensorflow/2.2
    cuda/9.0  -> cudatoolkit/9.0     pytorch/1.4.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.0    tensorflow-cpu/1.15 -> python/3.7/tensorflow/1.15

--------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Core ---------------------------------------------------------------------------------------------------
    Mila       (S,L)    anaconda/3 (D)    go/1.13.5        miniconda/2        mujoco/1.50        python/2.7    python/3.6        python/3.8           singularity/3.0.3    singularity/3.2.1    singularity/3.5.3 (D)
    anaconda/2          go/1.12.4         go/1.14   (D)    miniconda/3 (D)    mujoco/2.0  (D)    python/3.5    python/3.7 (D)    singularity/2.6.1    singularity/3.1.1    singularity/3.4.2

------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Compiler -------------------------------------------------------------------------------------------------
    python/3.7/mujoco-py/2.0

--------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Cuda ---------------------------------------------------------------------------------------------------
    cuda/10.0/cudnn/7.3        cuda/10.0/nccl/2.4         cuda/10.1/nccl/2.4     cuda/11.0/nccl/2.7        cuda/9.0/nccl/2.4     cudatoolkit/9.0     cudatoolkit/10.1        cudnn/7.6/cuda/10.0/tensorrt/7.0
    cuda/10.0/cudnn/7.5        cuda/10.1/cudnn/7.5        cuda/10.2/cudnn/7.6    cuda/9.0/cudnn/7.3        cuda/9.2/cudnn/7.6    cudatoolkit/9.2     cudatoolkit/10.2        cudnn/7.6/cuda/10.1/tensorrt/7.0
    cuda/10.0/cudnn/7.6 (D)    cuda/10.1/cudnn/7.6 (D)    cuda/10.2/nccl/2.7     cuda/9.0/cudnn/7.5 (D)    cuda/9.2/nccl/2.4     cudatoolkit/10.0    cudatoolkit/11.0 (D)    cudnn/7.6/cuda/9.0/tensorrt/7.0

------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Pytorch --------------------------------------------------------------------------------------------------
    python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.4.1    python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.1 (D)    python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0
    python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.0    python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.1        python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 (D)

------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Tensorflow ------------------------------------------------------------------------------------------------
    python/3.7/tensorflow/1.15    python/3.7/tensorflow/2.0    python/3.7/tensorflow/2.2 (D)

Modules can be loaded using the load command:

module load <module>

To search for a module or a software, use the command spider:

module spider search_term

For example, by default python2 refers to the OS-shipped installation of Python 2.7 and python3 to Python 3.6. If you want to use Python 3.7, you can type:

module load python/3.7

Available Software¶

Modules are divided into 5 main sections:

Section	Description
Core	Base interpreter and software (Python, go, etc…)
Compiler	Interpreter-dependent software (see the note below)
Cuda	Toolkits, cudnn and related libraries
Pytorch/Tensorflow	Pytorch/TF built with a specific Cuda/Cudnn version for Mila’s GPUs (see the related paragraph)

Note

Modules which are nested (../../..) usually depend on other software/modules loaded alongside the main module. There is no need to load the dependent software yourself: the complex naming scheme allows automatic detection of the dependent module(s).

For example, loading cudnn/7.6/cuda/9.0/tensorrt/7.0 will load cudnn/7.6 and cuda/9.0 alongside.

python/3.X is a particular dependency which can be served through python/3.X or anaconda/3 and is not automatically loaded, to let users pick their favorite flavor.

Default package location¶

Python by default uses the user site package first and packages provided by module last to not interfere with your installation. If you want to skip packages installed in your site-packages folder (in your /home directory), you have to start Python with the -s flag.

To check which package is loaded at import, you can print package.__file__ to get the full path of the package.

Example:

$ module load pytorch/1.5.0
$ python -c 'import torch;print(torch.__file__)'
/home/mila/my_home/.local/lib/python3.7/site-packages/torch/__init__.py   <== package from your own site-package

Now with the -s flag:

$ module load pytorch/1.5.0
$ python -s -c 'import torch;print(torch.__file__)'
/cvmfs/ai.mila.quebec/apps/x86_64/debian/pytorch/python3.7-cuda10.1-cudnn7.6-v1.5.0/lib/python3.7/site-packages/torch/__init__.py

On using containers¶

Another option for creating portable code is to use containers on clusters.

Containers are a popular approach to deploying applications by packaging a lot of the required dependencies together. The most popular tool for this is Docker, but Docker cannot be used on the Mila cluster (nor on the other clusters from Digital Research Alliance of Canada).

One popular mechanism for containerisation on a computational cluster is called Singularity. This is the recommended approach for running containers on the Mila cluster. See section Singularity for more details.

Singularity¶

Overview¶

What is Singularity?¶

Running Docker on SLURM is a security problem (e.g. running as root, being able to mount any directory). The alternative is to use Singularity, which is a popular solution in the world of HPC.

There is a good level of compatibility between Docker and Singularity, and we can find many exaggerated claims about being able to convert containers from Docker to Singularity without any friction. Oftentimes, Docker images from DockerHub are 100% compatible with Singularity, and they can indeed be used without friction, but things get messy when we try to convert our own Docker build files to Singularity recipes.

Links to official documentation¶

official Singularity user guide (this is the one you will use most often)
official Singularity admin guide

Overview of the steps used in practice¶

Most often, the process to create and use a Singularity container is:

on your Linux computer (at home or work)
- select a Docker image from DockerHub (e.g. pytorch/pytorch)
- make a recipe file for Singularity that starts with that DockerHub image
- build the recipe file, thus creating the image file (e.g. my-pytorch-image.sif)
- test your Singularity container before sending it over to the cluster
- rsync -av my-pytorch-image.sif <login-node>:Documents/my-singularity-images
on the login node for that cluster
- queue your jobs with sbatch ...
- (note that your jobs will copy over the my-pytorch-image.sif to $SLURM_TMPDIR and will then launch Singularity with that image)
- do something else while you wait for them to finish
- queue more jobs with the same my-pytorch-image.sif, reusing it many times over
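The job-side part of these steps could look like the sketch below. The image location matches the rsync destination used above; main.py, the requested resources and the fallback to a temporary directory outside of SLURM are assumptions made for illustration.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2
#SBATCH --mem=10G
#SBATCH --time=3:00:00

# Location used in the rsync step above; main.py is a hypothetical entry point.
image_name="my-pytorch-image.sif"
image_src="$HOME/Documents/my-singularity-images/$image_name"

# $SLURM_TMPDIR is set inside a job; fall back to a temp dir elsewhere.
work_dir="${SLURM_TMPDIR:-$(mktemp -d)}"

if [ -f "$image_src" ]; then
    # Copy the image to node-local storage once, then run from there.
    cp "$image_src" "$work_dir/"
    singularity exec --nv "$work_dir/$image_name" python main.py
else
    echo "Image not found at $image_src; nothing to run." >&2
fi
```

Copying the image once per job and reading it from node-local storage avoids repeated reads over the network file system while the container runs.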

In the following sections you will find specific examples or tips to accomplish in practice the steps highlighted above.

Nope, not on MacOS¶

Singularity does not work on MacOS, as of the time of this writing in 2021. Docker does not actually run natively on MacOS either, but there Docker silently installs a virtual machine running Linux, which makes it a pleasant experience, and the user does not need to care about the details of how Docker does it.

Given its origins in HPC, Singularity does not provide that kind of seamless experience on MacOS, even though it’s technically possible to run it inside a Linux virtual machine on MacOS.

Where to build images¶

Building Singularity images is a rather heavy task, which can take 20 minutes if you have a lot of steps in your recipe. This makes it a bad task to run on the login nodes of our clusters, especially if it needs to be run regularly.

On the Mila cluster, we are lucky to have unrestricted internet access on the compute nodes, which means that anyone can request an interactive CPU node (no need for GPU) and build their images there without problem.

Warning

Do not build Singularity images from scratch every time you run a job in a large batch. This would be a colossal waste of GPU time as well as internet bandwidth. If you set up your workflow properly (e.g. using bind paths for your code and data), you can spend months reusing the same Singularity image my-pytorch-image.sif.

Building the containers¶

Building a container is like creating a new environment except that containers are much more powerful since they are self-contained systems. With singularity, there are two ways to build containers.

The first way is to do it by yourself. It’s like when you get a new Linux laptop and don’t really know what you need: if you see that something is missing, you install it. Here you can get a vanilla container with Ubuntu, called a sandbox; you log in and install each package yourself. This procedure can take time but will allow you to understand how things work and what you need. This is recommended if you need to figure out how things will be compiled or if you want to install packages on the fly. We’ll refer to this procedure as singularity sandboxes.

The second way is for when you know what you want: you write a list of everything you need, you send it to singularity and it installs everything for you. Those lists are called singularity recipes.

First way: Build and use a sandbox¶

You might ask yourself: On which machine should I build a container?

First of all, you need to choose where you’ll build your container. This operation requires a lot of memory and CPU.

Warning

Do NOT build containers on any login node!

(Recommended for beginners) If you need to use apt-get, you should build the container on your laptop with sudo privileges. You’ll only need to install singularity on your laptop. Windows/Mac users can look there and Ubuntu/Debian users can use directly:
sudo apt-get install singularity-container
If you can’t install singularity on your laptop and you don’t need apt-get, you can reserve a cpu node on the Mila cluster to build your container.

In this case, in order to avoid too much I/O over the network, you should define the singularity cache locally:

export SINGULARITY_CACHEDIR=$SLURM_TMPDIR

If you can’t install singularity on your laptop and you want to use apt-get, you can use singularity-hub to build your containers and read the section on recipes (Second way: Use recipes).

Download containers from the web¶

Fortunately, you may not need to create containers from scratch, as many have already been built for the most common deep learning software. You can find most of them on dockerhub.

Go on dockerhub and select the container you want to pull.

For example, to get a PyTorch image with GPU support (replace runtime with devel if you need the full Cuda toolkit):

singularity pull docker://pytorch/pytorch:1.0.1-cuda10.0-cudnn7-runtime

Or the latest TensorFlow:

singularity pull docker://tensorflow/tensorflow:latest-gpu-py3

Currently the pulled image pytorch.simg or tensorflow.simg is read-only, meaning that you won’t be able to install anything on it. From now on, PyTorch will be taken as the example. If you use TensorFlow, simply replace every occurrence of pytorch with tensorflow.

How to add or install stuff in a container¶

The first step is to transform your read-only container pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg into a writable version that will allow you to add packages.

Warning

Depending on the version of singularity you are using, singularity will build a container with the extension .simg or .sif. If you’re using .sif files, replace every occurrence of .simg with .sif.

Tip

If you want to use apt-get, you have to prefix the following commands with sudo.

This command will create a writable image in the folder pytorch.

singularity build --sandbox pytorch pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg

Then you’ll need the following command to open a shell inside the container.

singularity shell --writable -H $HOME:/home pytorch

Once you get into the container, you can use pip and install anything you need (Or with apt-get if you built the container with sudo).

Warning

Singularity mounts your home folder, so if you install things into the $HOME of your container, they will be installed in your real $HOME!

You should install your stuff in /usr/local instead.

Creating useful directories¶

One of the benefits of containers is that you’ll be able to use them across different clusters. However for each cluster the datasets and experiments folder location can be different. In order to be invariant to those locations, we will create some useful mount points inside the container:

mkdir /dataset
mkdir /tmp_log
mkdir /final_log

From now on, you won’t need to worry about where to pick up your dataset when you write your code. Your dataset will always be in /dataset, independently of the cluster you are using.
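At run time, these mount points are filled by binding host directories onto them with the --bind option of singularity. The sketch below shows one possible mapping; the three host-side paths are hypothetical and should be adapted to each cluster, and the command is only printed when singularity or the pytorch sandbox is not present.

```shell
#!/bin/bash
# Hypothetical host-side locations; adapt them to each cluster.
host_dataset="/network/datasets/my_dataset"
host_tmp_log="${SLURM_TMPDIR:-/tmp}"
host_final_log="$HOME/logs"

# Collect the bind options as positional parameters (host_path:container_path).
set -- --bind "$host_dataset:/dataset" \
       --bind "$host_tmp_log:/tmp_log" \
       --bind "$host_final_log:/final_log"

if command -v singularity >/dev/null 2>&1 && [ -d pytorch ]; then
    singularity exec --nv "$@" pytorch python YOUR_CODE.py
else
    echo "Would run: singularity exec --nv $* pytorch python YOUR_CODE.py"
fi
```

Only the host side of each pair changes from one cluster to another; the code inside the container keeps reading from /dataset and writing to /tmp_log and /final_log.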

Testing¶

If you have some code that you want to test before finalizing your container, you have two choices. You can either log into your container and run Python code inside it with:

singularity shell --nv pytorch

Or you can execute your command directly with

singularity exec --nv pytorch python YOUR_CODE.py

Tip

--nv allows the container to use GPUs. You don’t need this if you don’t plan to use a GPU.

Warning

Don’t forget to clear the cache of the packages you installed in the containers.

Creating a new image from the sandbox¶

Once everything you need is installed inside the container, you need to convert it back to a read-only singularity image with:

singularity build pytorch_final.simg pytorch

Second way: Use recipes¶

A singularity recipe is a file including specifics about installation software, environment variables, files to add, and container metadata. It is a starting point for designing any custom container. Instead of pulling a container and installing your packages manually, you can specify in this file the packages you want and then build your container from this file.

Here is a toy example of a singularity recipe installing some stuff:

################# Header: Define the base system you want to use ################
# Reference of the kind of base you want to use (e.g., docker, debootstrap, shub).
Bootstrap: docker
# Select the docker image you want to use (Here we choose tensorflow)
From: tensorflow/tensorflow:latest-gpu-py3

################# Section: Defining the system #################################
# Commands in the %post section are executed within the container.
%post
        echo "Installing Tools with apt-get"
        apt-get update
        apt-get install -y cmake libcupti-dev libyaml-dev wget unzip
        apt-get clean
        echo "Installing things with pip"
        pip install tqdm
        echo "Creating mount points"
        mkdir /dataset
        mkdir /tmp_log
        mkdir /final_log


# Environment variables that should be sourced at runtime.
%environment
        # use bash as default shell
        SHELL=/bin/bash
        export SHELL

A recipe file contains two parts: the header and the sections. In the header, you specify which base system you want to use; it can be any docker or singularity container. In the sections, you list the things you want to install (in the %post section) or the environment variables that must be set at each run (in the %environment section). For a more detailed description, please look at the singularity documentation.

In order to build a singularity container from a singularity recipe file, you should use:

sudo singularity build <NAME_CONTAINER> <YOUR_RECIPE_FILES>

Warning

You always need to use sudo when you build a container from a recipe. As there is no access to sudo on the cluster, you need a personal computer or singularity hub to build a container.

Build recipe on singularity hub¶

Singularity hub allows users to build containers from recipes directly on singularity-hub’s cloud, meaning that you don’t need to build containers yourself. You need to register on singularity-hub and link your singularity-hub account to your GitHub account, then:

Create a new GitHub repository.

Add a collection on singularity-hub and select the GitHub repository you created.

Clone the GitHub repository on your computer.
$ git clone <url>

Write the singularity recipe and save it as a file named Singularity.

Add Singularity to git, commit and push to the master branch.
$ git add Singularity
$ git commit
$ git push origin master

At this point, robots from singularity-hub will build the container for you. You will be able to download your container from the website or directly with:

singularity pull shub://<github_username>/<repository_name>

Example: Recipe with OpenAI gym, MuJoCo and Miniworld¶

Here is an example of how you can use a singularity recipe to install complex environments such as OpenAI gym, MuJoCo and Miniworld on a PyTorch based container. In order to use MuJoCo, you’ll need to copy the key stored on the Mila cluster in /ai/apps/mujoco/license/mjkey.txt to your current directory.

# This is a singularity recipe that sets up a full Gym install with test dependencies
Bootstrap: docker

# Here we'll build our container upon the pytorch container
From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

# Now we'll copy the mjkey file located in the current directory inside the container's root
# directory
%files
        mjkey.txt

# Then we put everything we need to install
%post
        export PATH=$PATH:/opt/conda/bin
        apt -y update && \
        apt install -y keyboard-configuration && \
        apt install -y \
        python3-dev \
        python-pyglet \
        python3-opengl \
        libhdf5-dev \
        libjpeg-dev \
        libboost-all-dev \
        libsdl2-dev \
        libosmesa6-dev \
        patchelf \
        ffmpeg \
        xvfb \
        libhdf5-dev \
        openjdk-8-jdk \
        wget \
        git \
        unzip && \
        apt clean && \
        rm -rf /var/lib/apt/lists/*
        pip install h5py

        # Download Gym and MuJoCo
        mkdir /Gym && cd /Gym
        git clone https://github.com/openai/gym.git || true && \
        mkdir /Gym/.mujoco && cd /Gym/.mujoco
        wget https://www.roboti.us/download/mjpro150_linux.zip  && \
        unzip mjpro150_linux.zip && \
        wget https://www.roboti.us/download/mujoco200_linux.zip && \
        unzip mujoco200_linux.zip && \
        mv mujoco200_linux mujoco200

        # Export global environment variables
        export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
        export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
        cp /mjkey.txt /Gym/.mujoco/mjkey.txt
        # Install Python dependencies
        wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt
        pip install -r requirements.txt
        # Install Gym and MuJoCo
        cd /Gym/gym
        pip install -e '.[all]'
        # Change permission to use mujoco_py as non sudoer user
        chmod -R 777 /opt/conda/lib/python3.6/site-packages/mujoco_py/
        pip install --upgrade minerl

# Export global environment variables
%environment
        export SHELL=/bin/sh
        export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
        export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
        export PATH=/Gym/gym/.tox/py3/bin:$PATH

%runscript
        exec /bin/sh "$@"

Here is the same recipe but written for TensorFlow:

# This is a singularity recipe that sets up a full Gym install with test dependencies
Bootstrap: docker

# Here we'll build our container upon the tensorflow container
From: tensorflow/tensorflow:latest-gpu-py3

# Now we'll copy the mjkey file located in the current directory inside the container's root
# directory
%files
        mjkey.txt

# Then we put everything we need to install
%post
        apt -y update && \
        apt install -y keyboard-configuration && \
        apt install -y \
        python3-setuptools \
        python3-dev \
        python-pyglet \
        python3-opengl \
        libjpeg-dev \
        libboost-all-dev \
        libsdl2-dev \
        libosmesa6-dev \
        patchelf \
        ffmpeg \
        xvfb \
        wget \
        git \
        unzip && \
        apt clean && \
        rm -rf /var/lib/apt/lists/*

        # Download Gym and MuJoCo
        mkdir /Gym && cd /Gym
        git clone https://github.com/openai/gym.git || true && \
        mkdir /Gym/.mujoco && cd /Gym/.mujoco
        wget https://www.roboti.us/download/mjpro150_linux.zip  && \
        unzip mjpro150_linux.zip && \
        wget https://www.roboti.us/download/mujoco200_linux.zip && \
        unzip mujoco200_linux.zip && \
        mv mujoco200_linux mujoco200

        # Export global environment variables
        export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
        export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
        cp /mjkey.txt /Gym/.mujoco/mjkey.txt

        # Install Python dependencies
        wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt
        pip install -r requirements.txt
        # Install Gym and MuJoCo
        cd /Gym/gym
        pip install -e '.[all]'
        # Change permission to use mujoco_py as non sudoer user
        chmod -R 777 /usr/local/lib/python3.5/dist-packages/mujoco_py/

        # Then install miniworld
        cd /usr/local/
        git clone https://github.com/maximecb/gym-miniworld.git
        cd gym-miniworld
        pip install -e .

# Export global environment variables
%environment
        export SHELL=/bin/bash
        export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
        export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
        export PATH=/Gym/gym/.tox/py3/bin:$PATH

%runscript
        exec /bin/bash "$@"

Keep in mind that those environment variables are sourced at runtime and not at build time. This is why you should also define them in the %post section, since they are required to install MuJoCo.

Using containers on clusters¶

How to use containers on clusters¶

On every cluster with Slurm, datasets and intermediate results should go in $SLURM_TMPDIR while the final experiment results should go in $SCRATCH. In order to use the container you built, you need to copy it to the cluster you want to use.

Warning

You should always store your container in $SCRATCH!

Then reserve a node with srun/sbatch, copy the container and your dataset to the node given by SLURM (i.e. in $SLURM_TMPDIR) and execute the code <YOUR_CODE> within the container <YOUR_CONTAINER> with:

singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ $SLURM_TMPDIR/<YOUR_CONTAINER> python <YOUR_CODE>

Remember that /dataset, /tmp_log and /final_log were created in the previous section. Each time we use singularity, we explicitly tell it, with the option -B, to mount $SLURM_TMPDIR on the cluster’s node at the folder /dataset inside the container, so that each dataset downloaded by PyTorch in /dataset will be available in $SLURM_TMPDIR.

This will allow us to have code and scripts that are invariant to the cluster environment. The option -H specifies what will be the container’s home. For example, if you have your code in $HOME/Project12345/Version35/ you can specify -H $HOME/Project12345/Version35:/home, so that the container will only have access to the code inside Version35.

If you want to run multiple commands inside the container you can use:

singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ \
    -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ \
    $SLURM_TMPDIR/<YOUR_CONTAINER> bash -c 'pwd && ls && python <YOUR_CODE>'
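If you launch the container from a Python driver script rather than from the shell, the same command line can be assembled as an argument list. This is only a sketch: the `singularity_exec_argv` helper is a hypothetical name, and it simply reproduces the -H and -B options shown above.

```python
import os
import shlex


def singularity_exec_argv(container, command, home=None, tmpdir=None,
                          scratch=None, gpu=True):
    """Build the argv list for the `singularity exec` call shown above.

    Paths default to $HOME, $SLURM_TMPDIR and $SCRATCH when not given.
    """
    home = home or os.environ.get("HOME", "")
    tmpdir = tmpdir or os.environ.get("SLURM_TMPDIR", "")
    scratch = scratch or os.environ.get("SCRATCH", "")

    argv = ["singularity", "exec"]
    if gpu:
        argv.append("--nv")  # expose the GPUs to the container
    argv += [
        "-H", f"{home}:/home",
        "-B", f"{tmpdir}:/dataset/",
        "-B", f"{tmpdir}:/tmp_log/",
        "-B", f"{scratch}:/final_log/",
        os.path.join(tmpdir, container),
    ]
    return argv + list(command)


if __name__ == "__main__":
    argv = singularity_exec_argv("my_container.sif", ["python", "train.py"])
    # Print the command instead of running it; use subprocess.run(argv,
    # check=True) on a compute node to actually execute it.
    print(shlex.join(argv))
```

Passing a list to subprocess.run avoids shell quoting issues when paths contain spaces.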

Example: Interactive case (srun/salloc)¶

Once you get an interactive session with SLURM, copy <YOUR_CONTAINER> and <YOUR_DATASET> to $SLURM_TMPDIR

# 0. Get an interactive session
$ srun --gres=gpu:1 --pty bash
# 1. Copy your container on the compute node
$ rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
# 2. Copy your dataset on the compute node
$ rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR

Then use singularity shell to get a shell inside the container

# 3. Get a shell in your environment
$ singularity shell --nv \
        -H $HOME:/home \
        -B $SLURM_TMPDIR:/dataset/ \
        -B $SLURM_TMPDIR:/tmp_log/ \
        -B $SCRATCH:/final_log/ \
        $SLURM_TMPDIR/<YOUR_CONTAINER>

# 4. Execute your code
<Singularity_container>$ python <YOUR_CODE>

or use singularity exec to execute <YOUR_CODE>.

# 3. Execute your code
$ singularity exec --nv \
        -H $HOME:/home \
        -B $SLURM_TMPDIR:/dataset/ \
        -B $SLURM_TMPDIR:/tmp_log/ \
        -B $SCRATCH:/final_log/ \
        $SLURM_TMPDIR/<YOUR_CONTAINER> \
        python <YOUR_CODE>

You can also create the following alias to make your life easier.

alias my_env='singularity exec --nv \
        -H $HOME:/home \
        -B $SLURM_TMPDIR:/dataset/ \
        -B $SLURM_TMPDIR:/tmp_log/ \
        -B $SCRATCH:/final_log/ \
        $SLURM_TMPDIR/<YOUR_CONTAINER>'

This will allow you to run any code with:

my_env python <YOUR_CODE>

Example: sbatch case¶

You can also create an sbatch script:

#!/bin/bash
#SBATCH --cpus-per-task=6         # Ask for 6 CPUs
#SBATCH --gres=gpu:1              # Ask for 1 GPU
#SBATCH --mem=10G                 # Ask for 10 GB of RAM
#SBATCH --time=0:10:00            # The job will run for 10 minutes

# 1. Copy your container on the compute node
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
# 2. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
# 3. Executing your code with singularity
singularity exec --nv \
        -H $HOME:/home \
        -B $SLURM_TMPDIR:/dataset/ \
        -B $SLURM_TMPDIR:/tmp_log/ \
        -B $SCRATCH:/final_log/ \
        $SLURM_TMPDIR/<YOUR_CONTAINER> \
        python "<YOUR_CODE>"
# 4. Copy whatever you want to save on $SCRATCH
rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH

Issue with PyBullet and OpenGL libraries¶

If you are running certain gym environments that require pyglet, you may encounter a problem when running your singularity instance with the Nvidia drivers using the --nv flag. This happens because the --nv flag also provides the OpenGL libraries:

libGL.so.1 => /.singularity.d/libs/libGL.so.1
libGLX.so.0 => /.singularity.d/libs/libGLX.so.0

If you don’t experience those problems with pyglet, you probably don’t need to address this. Otherwise, you can resolve those problems by running apt-get install -y libosmesa6-dev mesa-utils mesa-utils-extra libgl1-mesa-glx, and then making sure that your LD_LIBRARY_PATH points to those libraries before the ones in /.singularity.d/libs.

%environment
        # ...
        export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/mesa:$LD_LIBRARY_PATH

Apuana cluster¶

On the Apuana cluster, $SCRATCH is not yet defined. You should put the experiment results you want to keep in /network/scratch/<u>/<username>/. In order to use the sbatch script above and to match the variable names of other cluster environments, you can define $SCRATCH as an alias for /network/scratch/<u>/<username> with:

echo "export SCRATCH=/network/scratch/${USER:0:1}/$USER" >> ~/.bashrc

Then, you can follow the general procedure explained above.
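The same fallback can be applied inside Python code that has to run both on clusters where $SCRATCH exists and on Apuana. This is a sketch under the path layout described above; `scratch_dir` is a hypothetical helper name.

```python
import getpass
import os


def scratch_dir(env=None, user=None):
    """Return $SCRATCH if it is set, else /network/scratch/<u>/<username>."""
    env = os.environ if env is None else env
    if env.get("SCRATCH"):
        return env["SCRATCH"]
    user = user or getpass.getuser()
    # <u> is the first letter of the username, as in ${USER:0:1}
    return f"/network/scratch/{user[0]}/{user}"


if __name__ == "__main__":
    print(scratch_dir())
```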

Sharing Data with ACLs¶

Regular permission bits are extremely blunt tools: they control access through only three sets of bits, for the owning user, the owning group and all others. Therefore, access is either too narrow (0700 allows access only by oneself) or too wide (0770 gives all permissions to everyone in the same group, and 0777 to literally everyone).

ACLs (Access Control Lists) are an expansion of the permissions bits that allow more fine-grained, granular control of accesses to a file. They can be used to permit specific users access to files and folders even if conservative default permissions would have denied them such access.

As an illustrative example, to use ACLs to allow $USER (oneself) to share with $USER2 (another person) a “playground” folder hierarchy in Mila’s scratch filesystem at a location

/network/scratch/${USER:0:1}/$USER/X/Y/Z/...

in a safe and secure fashion that allows both users to read, write, execute, search and delete each others’ files:

1. Grant oneself permissions to access any future files/folders created by the other (or oneself)

(-d renders this permission a “default” / inheritable one)

setfacl -Rdm user:${USER}:rwx  /network/scratch/${USER:0:1}/$USER/X/Y/Z/

Note

The importance of doing this seemingly-redundant step first is that files and folders are always owned by only one person, almost always their creator (the UID will be the creator’s, the GID typically as well). If that user is not yourself, you will not have access to those files unless the other person specifically gives them to you, or these files inherited a default ACL allowing you full access.

This is the inherited, default ACL serving that purpose.

2. Grant the other permission to access any future files/folders created by the other (or oneself)

(-d renders this permission a “default” / inheritable one)

setfacl -Rdm user:${USER2}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/

3. Grant the other permission to access any existing files/folders created by oneself.

Such files and folders were created before the new default ACLs were added above and thus did not inherit them from their parent folder at the moment of their creation.

setfacl -Rm  user:${USER2}:rwx /network/scratch/${USER:0:1}/$USER/X/Y/Z/

Note

The purpose of granting permissions first for future files and then for existing files is to prevent a race condition whereby after the first setfacl command the other person could create files to which the second setfacl command does not apply.

4. Grant the other permission to search through one’s hierarchy down to the shared location in question.

Non-recursive (!!!!)
May also grant :rx in the unlikely event that others listing your folders on the
path is not troublesome, or is even desirable.

setfacl -m   user:${USER2}:x   /network/scratch/${USER:0:1}/$USER/X/Y/
setfacl -m   user:${USER2}:x   /network/scratch/${USER:0:1}/$USER/X/
setfacl -m   user:${USER2}:x   /network/scratch/${USER:0:1}/$USER/

Note

In order to access a file, all folders from the root (/) down to the parent folder in question must be searchable (+x) by the concerned user. This is already the case for all users for folders such as /, /network and /network/scratch, but users must explicitly grant access to some or all users either through base permissions or by adding ACLs, for at least /network/scratch/${USER:0:1}/$USER, $HOME and subfolders.

To bluntly allow all users to search through a folder (think twice!), the following command can be used:

chmod a+x /network/scratch/${USER:0:1}/$USER/
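The four steps above can be sketched as a small generator that returns the setfacl commands in the right order for any shared folder. The `sharing_commands` function and its `stop_at` parameter are hypothetical names introduced for illustration; the sketch only builds the commands and does not run them.

```python
from pathlib import PurePosixPath


def sharing_commands(shared_dir, owner, other, stop_at):
    """Return the setfacl commands for steps 1 to 4, in order.

    shared_dir: folder to share, e.g. /network/scratch/a/alice/X/Y/Z
    stop_at:    topmost folder the owner controls,
                e.g. /network/scratch/a/alice
    """
    shared = PurePosixPath(shared_dir)
    top = PurePosixPath(stop_at)
    if top not in shared.parents:
        raise ValueError(f"{top} is not a parent of {shared}")

    cmds = [
        # Steps 1 and 2: default (inheritable) ACLs first, for future files
        ["setfacl", "-Rdm", f"user:{owner}:rwx", str(shared)],
        ["setfacl", "-Rdm", f"user:{other}:rwx", str(shared)],
        # Step 3: then the files that already exist
        ["setfacl", "-Rm", f"user:{other}:rwx", str(shared)],
    ]
    # Step 4: non-recursive search (x) permission on every parent folder,
    # walking up from the shared folder until stop_at is reached
    for parent in shared.parents:
        cmds.append(["setfacl", "-m", f"user:{other}:x", str(parent)])
        if parent == top:
            break
    return cmds


if __name__ == "__main__":
    for cmd in sharing_commands("/network/scratch/a/alice/X/Y/Z",
                                "alice", "bob", "/network/scratch/a/alice"):
        print(" ".join(cmd))
```

Printing the commands before running them makes it easy to check that step 4 touches only the folders on the path and is never recursive.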

Note

For more information on setfacl and path resolution/access checking, consider the following documentation viewing commands:

man setfacl
man path_resolution

Viewing and Verifying ACLs¶

getfacl /path/to/folder/or/file
# file: somedir/
# owner: lisa
# group: staff
# flags: -s-
user::rwx
user:joe:rwx               #effective:r-x
group::rwx                 #effective:r-x
group:cool:r-x
mask::r-x
other::r-x
default:user::rwx
default:user:joe:rwx       #effective:r-x
default:group::r-x
default:mask::r-x
default:other::---

Note

For more information on getfacl, see:

man getfacl
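The #effective comments in the getfacl output come from the mask entry: named users and group entries are limited to the permissions also present in the mask, while the owning user and other entries are not. A minimal sketch of that rule, with `effective_permissions` as a hypothetical helper name:

```python
def effective_permissions(entry, mask):
    """Return the permissions an ACL entry actually grants.

    entry: a getfacl line such as "user:joe:rwx" or "user::rwx"
    mask:  the permissions of the mask entry, such as "r-x"
    """
    tag, qualifier, perms = entry.split(":")
    # The mask limits named users and all group entries; the owning user
    # (user::) and other:: entries are not affected by it.
    masked = tag == "group" or (tag == "user" and qualifier != "")
    if not masked:
        return perms
    # Keep a permission only when the mask has it in the same position
    return "".join(p if p == m else "-" for p, m in zip(perms, mask))


if __name__ == "__main__":
    for entry in ["user::rwx", "user:joe:rwx", "group::rwx", "group:cool:r-x"]:
        print(entry, "->", effective_permissions(entry, "r-x"))
```

This is why user:joe:rwx shows up as effectively r-x in the example: the mask r-x removes the write permission.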

Advanced SLURM usage and Multiple GPU jobs¶

Handling preemption¶

On the Apuana cluster, jobs can preempt one another depending on their priority (unkillable > high > low). See the Slurm documentation.

The default preemption mechanism is to kill and re-queue the job automatically without any notice. To allow a different preemption mechanism, every partition has been duplicated (i.e. the duplicate has the same characteristics as its counterpart) with a 120-second grace period before your job is killed, but without automatic requeuing. Those partitions are identified by the suffix -grace (main-grace, long-grace, main-cpu-grace, long-cpu-grace).

When using a partition with a grace period, a series of signals consisting of first SIGCONT and SIGTERM, then SIGKILL, will be sent to the SLURM job. It’s good practice to catch those signals using the Linux trap command to properly terminate a job and save what’s necessary to restart the job. On each cluster, you’ll be allowed a grace period before SLURM actually kills your job (SIGKILL).

The easiest way to handle preemption is by trapping the SIGTERM signal:

#SBATCH --ntasks=1
#SBATCH ....

exit_script() {
    echo "Preemption signal, saving myself"
    trap - SIGTERM # clear the trap
    # Optional: sends SIGTERM to child/sub processes
    kill -- -$$
}

trap exit_script SIGTERM

# The main script part
python3 my_script

Note

Requeuing:
The Slurm scheduler on the cluster does not allow a grace period before
preempting a job while requeuing it automatically, therefore your job will
be cancelled at the end of the grace period.
To automatically requeue it, you can just add the sbatch command inside
your exit_script function.
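The shell trap can be complemented by a handler inside the Python process, so that the training loop itself notices the signal and saves its state before the grace period ends. This is a sketch, assuming SIGTERM reaches the Python process (for example through the kill -- -$$ line above); `PreemptionFlag`, `train` and `save_checkpoint` are hypothetical names.

```python
import signal


class PreemptionFlag:
    """Record that SIGTERM arrived so the training loop can stop cleanly."""

    def __init__(self):
        self.requested = False
        # Must be called from the main thread of the process
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the real work happens in the main loop
        self.requested = True


def train(steps, flag, save_checkpoint):
    """Hypothetical loop: run until done or until preemption is requested.

    Returns the step at which the loop stopped.
    """
    for step in range(steps):
        # ... one training step would go here ...
        if flag.requested:
            save_checkpoint(step)
            return step
    return steps
```

Setting a flag in the handler and checking it between steps keeps the checkpoint consistent, since the save never interrupts a step halfway through.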

Packing jobs¶

Multiple Nodes¶

Data Parallel¶

Request 3 nodes with at least 4 GPUs each.

#!/bin/bash

# Number of Nodes
#SBATCH --nodes=3

# Number of tasks. 3 (1 per node)
#SBATCH --ntasks=3

# Number of GPU per node
#SBATCH --gres=gpu:4
#SBATCH --gpus-per-node=4

# 16 CPUs per node
#SBATCH --cpus-per-gpu=4

# 16 GB per node (4 GB per GPU)
#SBATCH --mem=16G

# we need all nodes to be ready at the same time
#SBATCH --wait-all-nodes=1

# Total resources:
#   CPU: 16 * 3 = 48
#   RAM: 16 * 3 = 48 GB
#   GPU:  4 * 3 = 12

# Setup our rendez-vous point
RDV_ADDR=$(hostname)
WORLD_SIZE=$SLURM_JOB_NUM_NODES
# -----

srun -l torchrun \
    --nproc_per_node=$SLURM_GPUS_PER_NODE\
    --nnodes=$WORLD_SIZE\
    --rdzv_id=$SLURM_JOB_ID\
    --rdzv_backend=c10d\
    --rdzv_endpoint=$RDV_ADDR\
    training_script.py

Below is an outline of what a multi-node PyTorch trainer could look like.

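import os

import torch
import torch.distributed as dist
from torch.distributed.elastic.utils.data import ElasticDistributedSampler
from torch.utils.data import DataLoader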

class Trainer:
    def __init__(self):
        self.local_rank = None
        self.chk_path = ...
        self.model = ...

    @property
    def device_id(self):
        return self.local_rank

    def load_checkpoint(self, path=None):
        # Remember the path; later calls without an argument reuse it
        if path is not None:
            self.chk_path = path
        # ...

    def should_checkpoint(self):
        # Note: only one worker saves its weights
        return self.global_rank == 0 and self.local_rank == 0

    def save_checkpoint(self):
        if self.chk_path is None:
            return

        # Save your states here
        # Note: you should save the weights of self.model not ddp_model
        # ...

    def initialize(self):
        self.global_rank = int(os.environ.get("RANK", -1))
        self.local_rank = int(os.environ.get("LOCAL_RANK", -1))

        assert self.global_rank >= 0, 'Global rank should be set (Only Rank 0 can save checkpoints)'
        assert self.local_rank >= 0, 'Local rank should be set'

        # Use "nccl" for GPU training, or "gloo" for CPU-only jobs
        dist.init_process_group(backend="nccl")

    def sync_weights(self, resuming=False):
        if resuming:
            # in the case of resuming all workers need to load the same checkpoint
            self.load_checkpoint()

            # Wait for everybody to finish loading the checkpoint
            dist.barrier()
            return

        # Make sure all workers have the same initial weights
        # This makes the leader save his weights
        if self.should_checkpoint():
            self.save_checkpoint()

        # All workers wait for the leader to finish
        dist.barrier()

        # All followers load the leader's weights
        if not self.should_checkpoint():
            self.load_checkpoint()

        # Leader waits for the follower to load the weights
        dist.barrier()

    def dataloader(self, dataset, batch_size):
        train_sampler = ElasticDistributedSampler(dataset)
        train_loader = DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=4,
            pin_memory=True,
            sampler=train_sampler,
        )
        return train_loader

    def train_step(self, batch):
        # Your batch processing step here
        # ...
        pass

    def train(self, dataset, batch_size):
        self.sync_weights()

        ddp_model = torch.nn.parallel.DistributedDataParallel(
            self.model,
            device_ids=[self.device_id],
            output_device=self.device_id
        )

        loader = self.dataloader(dataset, batch_size)

        for epoch in range(100):
            for batch in iter(loader):
                self.train_step(batch)

                if self.should_checkpoint():
                    self.save_checkpoint()

def main():
    # Placeholders: set these for your own experiment
    path = ...
    dataset = ...
    batch_size = ...

    trainer = Trainer()
    trainer.load_checkpoint(path)
    trainer.initialize()

    trainer.train(dataset, batch_size)

Note

To bypass the Python GIL (Global Interpreter Lock), PyTorch spawns one process for each GPU. In the example above, this means at least 12 processes are spawned, at least 4 on each node.
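To make the process count concrete, the sketch below enumerates the workers for 3 nodes with 4 processes each. It assumes that each node receives a contiguous block of global ranks, which is the usual layout; the actual ordering of the nodes is decided by the rendezvous at launch time. `world_layout` is a hypothetical helper name.

```python
def world_layout(nnodes, nproc_per_node):
    """List (global_rank, node_rank, local_rank) for every worker process.

    Assumes ranks are assigned in contiguous blocks, one block per node.
    """
    return [
        (node * nproc_per_node + local, node, local)
        for node in range(nnodes)
        for local in range(nproc_per_node)
    ]


if __name__ == "__main__":
    for global_rank, node_rank, local_rank in world_layout(3, 4):
        print(f"RANK={global_rank} node={node_rank} LOCAL_RANK={local_rank}")
```

In the trainer outline, only the worker with global rank 0 and local rank 0 writes checkpoints.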

Frequently asked questions (FAQs)¶

Connection/SSH issues¶

I’m getting `connection refused` while trying to connect to a login node¶

Login nodes are protected against brute force attacks and might ban your IP if they detect too many connections/failures. You will be automatically unbanned after 1 hour. For any further problem, please submit a support ticket.

Shell issues¶

How do I change my shell ?¶

By default you will be assigned /bin/bash as a shell. If you would like to change to another one, please submit a support ticket.

SLURM issues¶

How can I get an interactive shell on the cluster ?¶

Use salloc [--slurm_options] without any executable at the end of the command; this will launch your default shell in an interactive session. Remember that an interactive session is bound to the login node where you start it, so you risk losing your job if the login node becomes unreachable.

How can I reset my cluster password ?¶

To reset your password, please submit a support ticket.

Warning: your cluster password is the same as your Google Workspace account. So, after reset, you must use the new password for all your Google services.

srun: error: --mem and --mem-per-cpu are mutually exclusive¶

You can safely ignore this message: salloc sets a default memory flag in case you don’t provide one.

How can I see where and if my jobs are running ?¶

Use squeue -u YOUR_USERNAME to see all your job status and locations. To get more info on a running job, try scontrol show job #JOBID

Unable to allocate resources: Invalid account or account/partition combination specified¶

Chances are your account is not set up properly. You should submit a support ticket.

How do I cancel a job?¶

To cancel a specific job, use scancel #JOBID
To cancel all your jobs (running and pending), use scancel -u YOUR_USERNAME
To cancel all your pending jobs only, use scancel -t PD

How can I access a node on which one of my jobs is running ?¶

You can ssh into a node on which you have a job running. Your ssh connection will be adopted by your job, i.e. if your job finishes, your ssh connection will be automatically terminated. In order to connect to a node, you need to have password-less ssh, either with a key present in your home or with an ssh-agent. You can generate a key on the login node like this:

ssh-keygen  # press ENTER three times to accept the defaults
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh

I’m getting `Permission denied (publickey)` while trying to connect to a node¶

See previous question

Where do I put my data during a job ?¶

Your /home as well as the datasets are on shared file systems. It is recommended to copy them to $SLURM_TMPDIR to process them faster and leverage the higher-speed local drives. If you run a low priority job subject to preemption, it’s better to save any output you want to keep on the shared file systems, because $SLURM_TMPDIR is deleted at the end of each job.
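The copy can also be done from Python at the start of a job. This is a sketch: `stage_to_local` is a hypothetical helper, and the fallback to the system temporary directory when SLURM_TMPDIR is not set is an assumption made so that the code also runs outside a job.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_local(src, env=None):
    """Copy a file or folder into $SLURM_TMPDIR and return the new path.

    Falls back to the system temp directory when SLURM_TMPDIR is not set.
    """
    env = os.environ if env is None else env
    local_root = Path(env.get("SLURM_TMPDIR", tempfile.gettempdir()))
    src = Path(src)
    dst = local_root / src.name
    if src.is_dir():
        # dirs_exist_ok lets a restarted job reuse a partially staged copy
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)
    return dst
```

Remember to copy anything you want to keep back to the shared file system before the job ends, since the local copy disappears with the job.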

slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup¶

You exceeded the amount of memory allocated to your job: either you did not request enough memory or you have a memory leak in your process. Try increasing the amount of memory requested with --mem= or --mem-per-cpu=.

fork: retry: Resource temporarily unavailable¶

You exceeded the limit of 2000 tasks/PIDs in your job. It probably means there is an issue with a sub-process spawning too many processes in your script. For any help with your software, please submit a support ticket.

PyTorch issues¶

I randomly get `INTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":263`¶

You are using PyTorch 1.10.x and hitting #67864, for which the solution is PR #72232, merged in PyTorch 1.11.x. For an immediate fix, consider the following compilable Gist: hack.cpp. Compile the patch to hack.so and then export LD_PRELOAD=/absolute/path/to/hack.so before executing the Python process that imports torch from a broken PyTorch 1.10.

For Hydra users who are using the submitit launcher plug-in, the env_set key cannot be used to set LD_PRELOAD in the environment as it does so too late at runtime. The dynamic loader reads LD_PRELOAD only once and very early during the startup of any process, before the variable can be set from inside the process. The hack must therefore be injected using the setup key in Hydra YAML config file:

hydra:
    launcher:
        setup:
            - export LD_PRELOAD=/absolute/path/to/hack.so
