.. _Using containers: Using containers on clusters ---------------------------- How to use containers on clusters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ On every cluster with Slurm, datasets and intermediate results should go in ``$SLURM_TMPDIR`` while the final experiment results should go in ``$SCRATCH``. In order to use the container you built, you need to copy it on the cluster you want to use. .. warning:: You should always store your container in $SCRATCH ! Then reserve a node with srun/sbatch, copy the container and your dataset on the node given by SLURM (i.e in ``$SLURM_TMPDIR``) and execute the code ```` within the container ```` with: .. code-block:: console singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ $SLURM_TMPDIR/ python Remember that ``/dataset``, ``/tmp_log`` and ``/final_log`` were created in the previous section. Now each time, we'll use singularity, we are explicitly telling it to mount ``$SLURM_TMPDIR`` on the cluster's node in the folder ``/dataset`` inside the container with the option ``-B`` such that each dataset downloaded by PyTorch in ``/dataset`` will be available in ``$SLURM_TMPDIR``. This will allow us to have code and scripts that are invariant to the cluster environment. The option ``-H`` specify what will be the container's home. For example, if you have your code in ``$HOME/Project12345/Version35/`` you can specify ``-H $HOME/Project12345/Version35:/home``, thus the container will only have access to the code inside ``Version35``. If you want to run multiple commands inside the container you can use: .. code-block:: console singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ \ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ \ $SLURM_TMPDIR/ bash -c 'pwd && ls && python ' Example: Interactive case (srun/salloc) """"""""""""""""""""""""""""""""""""""" Once you get an interactive session with SLURM, copy ```` and ```` to ``$SLURM_TMPDIR`` .. code-block:: console # 0. Get an interactive session $ srun --gres=gpu:1 # 1. Copy your container on the compute node $ rsync -avz $SCRATCH/ $SLURM_TMPDIR # 2. Copy your dataset on the compute node $ rsync -avz $SCRATCH/ $SLURM_TMPDIR Then use ``singularity shell`` to get a shell inside the container .. code-block:: console # 3. Get a shell in your environment $ singularity shell --nv \ -H $HOME:/home \ -B $SLURM_TMPDIR:/dataset/ \ -B $SLURM_TMPDIR:/tmp_log/ \ -B $SCRATCH:/final_log/ \ $SLURM_TMPDIR/ .. code-block:: console # 4. Execute your code $ python **or** use ``singularity exec`` to execute ````. .. code-block:: console # 3. Execute your code $ singularity exec --nv \ -H $HOME:/home \ -B $SLURM_TMPDIR:/dataset/ \ -B $SLURM_TMPDIR:/tmp_log/ \ -B $SCRATCH:/final_log/ \ $SLURM_TMPDIR/ \ python You can create also the following alias to make your life easier. .. code-block:: console alias my_env='singularity exec --nv \ -H $HOME:/home \ -B $SLURM_TMPDIR:/dataset/ \ -B $SLURM_TMPDIR:/tmp_log/ \ -B $SCRATCH:/final_log/ \ $SLURM_TMPDIR/' This will allow you to run any code with: .. code-block:: console my_env python Example: sbatch case """""""""""""""""""" You can also create a ``sbatch`` script: .. code-block:: console #!/bin/bash #SBATCH --cpus-per-task=6 # Ask for 6 CPUs #SBATCH --gres=gpu:1 # Ask for 1 GPU #SBATCH --mem=10G # Ask for 10 GB of RAM #SBATCH --time=0:10:00 # The job will run for 10 minutes # 1. Copy your container on the compute node rsync -avz $SCRATCH/ $SLURM_TMPDIR # 2. Copy your dataset on the compute node rsync -avz $SCRATCH/ $SLURM_TMPDIR # 3. Executing your code with singularity singularity exec --nv \ -H $HOME:/home \ -B $SLURM_TMPDIR:/dataset/ \ -B $SLURM_TMPDIR:/tmp_log/ \ -B $SCRATCH:/final_log/ \ $SLURM_TMPDIR/ \ python "" # 4. Copy whatever you want to save on $SCRATCH rsync -avz $SLURM_TMPDIR/ $SCRATCH Issue with PyBullet and OpenGL libraries """""""""""""""""""""""""""""""""""""""" If you are running certain gym environments that require ``pyglet``, you may encounter a problem when running your singularity instance with the Nvidia drivers using the ``--nv`` flag. This happens because the ``--nv`` flag also provides the OpenGL libraries: .. code-block:: console libGL.so.1 => /.singularity.d/libs/libGL.so.1 libGLX.so.0 => /.singularity.d/libs/libGLX.so.0 If you don't experience those problems with ``pyglet``, you probably don't need to address this. Otherwise, you can resolve those problems by ``apt-get install -y libosmesa6-dev mesa-utils mesa-utils-extra libgl1-mesa-glx``, and then making sure that your ``LD_LIBRARY_PATH`` points to those libraries before the ones in ``/.singularity.d/libs``. .. code-block:: console %environment # ... export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/mesa:$LD_LIBRARY_PATH Apuana cluster """""""""""" On the Apuana cluster ``$SCRATCH`` is not yet defined, you should add the experiment results you want to keep in ``/network/scratch///``. In order to use the sbatch script above and to match other cluster environment's names, you can define ``$SCRATCH`` as an alias for ``/network/scratch//`` with: .. code-block:: console echo "export SCRATCH=/network/scratch/${USER:0:1}/$USER" >> ~/.bashrc Then, you can follow the general procedure explained above.