Run PyTorch in Docker With CUDA Without Pulling Your Hair Out

Posted on Oct 21, 2023

This week I’ve been working on containerizing some PyTorch-based software. Below are some notes on how to make this process more pleasant and relaxing by avoiding a few pitfalls.

I used Paperspace Machines, specifically their ML-in-a-Box, an Ubuntu 20.04-based image that ships with Docker, CUDA, and NVIDIA Docker already installed. If your system is missing one or several of these components, let me point you to this hopefully evergreen SO thread.

Without further ado, let’s get into it.

Know your CUDA

Let’s start by checking your CUDA version. You do so by running the following command:

nvcc --version

In my case, I got 11.7. This is NOT the latest version, and hopefully yours is more up to date. I hope Paperspace is going to roll an updated version out sometime soon (see this GH issue).
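If you want just the version number in a script, it can be parsed out of the `nvcc` output. A quick sketch; the hard-coded sample line below mimics what `nvcc --version` prints on my machine, so the snippet is self-contained, but in practice you would pipe `nvcc --version` in directly:

```shell
# Extract the "release X.Y" part of `nvcc --version` output.
NVCC_LINE='Cuda compilation tools, release 11.7, V11.7.99'
CUDA_VERSION=$(echo "$NVCC_LINE" | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
echo "$CUDA_VERSION"   # 11.7
```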

Pick your base Docker image

Now that we know which CUDA version we are running, head to nvidia/cuda on Docker Hub and pick an image that works for you. I went with nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04. Note that this image targets CUDA 11.7. If your CUDA is more recent, you will need a correspondingly newer image.

So far, my Dockerfile looks as follows:

FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
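Filling this out a bit, a minimal Dockerfile for a PyTorch app might look like the sketch below. This is just one way to set things up; the base image ships the CUDA toolkit but no Python tooling, so that has to be installed on top:

```Dockerfile
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04

# The base image has CUDA and cuDNN but no Python, so install it here.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .
```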

Grab your Pytorch

You might have run into some cryptic errors like

OSError: /home/user/miniconda/lib/python3.9/site-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

Or something like

Traceback (most recent call last):
  File "/home/user/ai-voice-cloning/./src/main.py", line 11, in <module>
    from utils import *
  File "/home/user/ai-voice-cloning/src/utils.py", line 29, in <module>
    import torchaudio
  File "/home/user/miniconda/lib/python3.9/site-packages/torchaudio/__init__.py", line 1, in <module>
    from . import (  # noqa: F401
  File "/home/user/miniconda/lib/python3.9/site-packages/torchaudio/_extension/__init__.py", line 45, in <module>
    _load_lib("libtorchaudio")
  File "/home/user/miniconda/lib/python3.9/site-packages/torchaudio/_extension/utils.py", line 64, in _load_lib
    torch.ops.load_library(path)
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/_ops.py", line 643, in load_library
    ctypes.CDLL(path)
  File "/home/user/miniconda/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.12: cannot open shared object file: No such file or directory

As far as I can tell, this happens when you install PyTorch packages that were not built against your CUDA version. To target the proper binaries when installing packages, you can use the following command:

RUN pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
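One way to catch a mismatch at image build time rather than at run time is a quick import check right after the install. A sketch, assuming the CUDA 11.7 wheels from the command above; `torch.version.cuda` reports the CUDA version the wheel was built against:

```Dockerfile
# Fail the build early if torch/torchaudio can't load or the wheels
# were built against a different CUDA version than expected.
RUN python -c "import torch, torchaudio; assert torch.version.cuda == '11.7', torch.version.cuda"
```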

To pick the right version for your setup, head to the PyTorch installation section and use the commands in the Linux and Windows section for the corresponding PyTorch and CUDA version. There is also an interactive Install PyTorch section on the PyTorch landing page for the latest versions of PyTorch and CUDA.
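The `cu117` suffix in the wheel names above is just the CUDA version with the dot stripped, so if you script your builds you can derive the wheel index URL from the version you found earlier. A small sketch (this naming pattern holds for the versions I have used; double-check against the PyTorch install page for yours):

```shell
CUDA_VERSION=11.7                       # e.g. taken from `nvcc --version`
CUDA_TAG="cu$(echo "$CUDA_VERSION" | tr -d '.')"
INDEX_URL="https://download.pytorch.org/whl/${CUDA_TAG}"
echo "$INDEX_URL"   # https://download.pytorch.org/whl/cu117
```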

Run your container

There are two things you want to pay attention to when running your container: targeting GPUs and adjusting the shared memory available to Docker:

docker run --gpus all --shm-size 8G -it IMAGE_NAME

By default, a Docker container gets 64 MB of shared memory, and it’s very likely that you will hit this limit pretty fast if you are running anything serious inside your container.

Recap

I hope you find this useful, and in case you’ve got some gems to share, please chip in here.