Devcontainers Diminish Dependency Difficulties in Deep Learning

Friends don’t let friends develop in unreproducible environments.

2022-01-20

3762 words ~15 minute read

This post walks you through the basics of using Docker (optionally with VScode Devcontainers) to create reproducible Deep Learning project environments. I first give some motivation for why this is a good idea, then explain the design principles I believe projects should follow to improve environment reproducibility. I then briefly describe the Docker concepts relevant to deep learning, explain how Dockerfiles work, and introduce VScode devcontainers, which add some useful features on top. The post finishes with some miscellaneous gotchas, limitations, and tips.

This post is accompanied by a repo containing the examples shown below.

Why Containerise Development Environments?

The environments we use for GPU-accelerated deep learning research are fragile. They rely on a web of dependencies between the CUDA toolkit, GPU driver, OS, Python version, and Python dependencies. The current standard practice is to use Python virtual environments and a requirements.txt file to manage dependencies, but this only pins the versions of the Python packages used and does nothing to manage the rest of the technology stack we are building on. The result is environments that break easily and are hard to reproduce. How many hours do you think you spend each year fixing broken environments? How many times have you struggled to reproduce papers, even when the authors were kind enough to provide a requirements file? Not convinced? Try this: delete the project you're currently working on and try to set it up again on a coworker's machine. How long will it take before you become productive on the new machine?

The above figure shows the technology stack that deep learning projects typically sit on, demonstrating how virtual environments alone are hardly enough to ensure full reproducibility. Conda seems to be a solution to this problem - and it definitely helps - but it comes with a few key drawbacks: Conda environments are not portable across operating systems, and they also do not manage OS packages. This is problematic because it is common to include shell scripts with your project to automatically download datasets and do any other setup. Even if this is not a big deal to you, there is one major drawback to using Conda: lack of incremental buy-in. If you use Conda, you lock yourself and anyone who wishes to reproduce your project into using it. Although it is a fairly popular tool and many researchers know how to use it, I don’t feel comfortable limiting the set of researchers who can reproduce my work to those who are proficient in a non-standard technology.

Containers are a great solution to this problem. They can be thought of as a lightweight alternative to virtual machines and allow the entire technology stack for a project to be reproduced - they also do not preclude you from using any of the other technologies we have mentioned so far. The best solution I have found is to use Docker containers + VScode devcontainer features + Python virtualenv (inside the container). This setup even allows people to reproduce the entire development environment in one click! This article is an introduction to the ideas behind this.

Devcontainers Suggested Structure

Before diving into the details of how containers work, I think it's best to first demonstrate how this new workflow fits into your repository. The only change you need to make to your current project is to add a .devcontainer directory containing two files: a Dockerfile, which tells Docker how to build the image (specifying the OS, CUDA version, Python version, etc.), and a devcontainer.json file, which tells VScode how to create and attach to the container, along with any tooling you want available in the project. Both are explained further in the next sections.

An example file tree for your project is shown below:

your_brilliant_project/
    ├── .devcontainer/
    │    ├── Dockerfile
    │    └── devcontainer.json
    ├── src/
    │    ├── main.py
    │    └── other_source_code.py
    ├── LICENSE
    ├── README.md
    └── requirements.txt

The key principle behind this file structure is incremental buy-in. For example, someone who does not use Docker can ignore the .devcontainer directory and install dependencies in a virtual environment in the usual way; someone who uses Docker but not VScode can build an image/container from the dockerfile manually; and someone who uses Docker and VScode can install and run the entire project in one click. When you set up the project like this, installation becomes very smooth; below is an example set of instructions for installation, taken from the README of one of my projects.

Installation

This project includes a requirements.txt file, as well as a Dockerfile and devcontainer.json. This enables two methods for installation.

If you have Docker and VScode (with the Remote Development extension pack) installed, you can reproduce the entire development environment - including the OS, Python version, CUDA version, and dependencies - by simply running Remote-Containers: Clone Repository in Container Volume from the command palette (alternatively, clone the repository normally and run Remote-Containers: Open Folder in Container). This is the easiest way to install the project. If you use Docker but not VScode, feel free to build from the Dockerfile directly, although some minor tweaks may be necessary.

The Docker route requires a GPU driver capable of CUDA 11.3 - you can check this by running nvidia-smi and confirming that the reported CUDA Version is 11.3 or greater.

If you are not using Docker: clone the repository, then create and activate a Python virtual environment and install the dependencies in the usual way. Make sure you are using Python 3.8 - this is the version the project is built on.

Depending on your CUDA driver capabilities / CUDA toolkit version, you may have to reinstall the deep learning libraries with versions suited to your setup. Instructions can be found here for PyTorch, JAX, and TensorFlow.
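
For reference, this route might look something like the sketch below (assuming a Linux machine with Python 3.8 installed; the repository URL is a placeholder):

# clone and enter the project (placeholder URL)
git clone https://github.com/<your-username>/your_brilliant_project.git
cd your_brilliant_project
# create and activate a virtual environment with Python 3.8
python3.8 -m venv .venv
source .venv/bin/activate
# install the pinned dependencies
pip install --upgrade pip
pip install -r requirements.txt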

Notice that there is still one thing that the user must check - their GPU capability. Unfortunately, there is no simple way to make all code run on all GPUs because some hardware simply has different capabilities. Generally, I would recommend picking a minimum CUDA version that you want your project to support (it is common to pick either 10.2 or 11.3) and using your Dockerfile to enforce this (I'll explain how later). You can then guarantee that anyone whose hardware supports this version can reproduce your project exactly - anyone whose hardware doesn't can fall back to a plain virtual environment and is no worse off than they would have been otherwise. This is the only compatibility question the user needs to think about, and containers make it explicit.

Anatomy of Docker and Dockerfiles

Docker is the most popular system for building containers and is the one we focus on here. We will use Nvidia's container runtime, which makes GPUs visible inside Docker containers. To install Docker with the Nvidia container runtime, follow the official instructions here.
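
Once everything is installed, you can sanity-check that containers can actually see your GPUs with something like the command below (the exact image tag is unimportant as long as your driver supports its CUDA version; Nvidia removes old tags from time to time, so check what is currently available):

# run nvidia-smi inside a throwaway CUDA container
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi

If this prints the same nvidia-smi table you see on the host, the Nvidia container runtime is working.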

A Dockerfile tells Docker how to build an 'image', and an image is used to create containers. Building an image can take a few minutes (especially if you have lots of requirements), but once built it is cached locally and containers can be spun up from it very quickly. There are plenty of resources explaining the details and syntax of Dockerfiles, so I won't go into them here. Instead, I provide an example Dockerfile for deep learning projects below and annotate it with the key ideas.

# Pull from base image (1)
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

ARG PYTHON_VERSION=3.8

# Install system packages (2)
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    build-essential \
    cmake \
    git \
    wget \
    unzip \
    python${PYTHON_VERSION} \
    python${PYTHON_VERSION}-dev \
    python${PYTHON_VERSION}-venv \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

ARG USERNAME=vscode
ARG USER_UID=1000
ARG USER_GID=1000

# Create the user (3)
RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && apt-get update \
    && apt-get install -y sudo \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME

# Create virtual environment and add to path (4)
ENV VIRTUAL_ENV=/opt/venv
RUN python${PYTHON_VERSION} -m venv $VIRTUAL_ENV && chmod -R a+rwX $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install requirements (5)
COPY requirements.txt /tmp/pip-tmp/
RUN pip3 --disable-pip-version-check --no-cache-dir install -r /tmp/pip-tmp/requirements.txt \
    && rm -rf /tmp/pip-tmp

1. Most Dockerfiles start by pulling from a base image. This one comes from Nvidia, includes CUDA 11.3.1 and cuDNN 8, and is built on Ubuntu 20.04. Note that the image itself builds fine without a GPU, but if your GPU driver does not support CUDA 11.3.1, the Nvidia runtime will refuse to start containers created from it. Nvidia provide lots of other base images, so you can pick a different one if you want to support an older/newer minimum CUDA capability (an example of swapping the base image is given after this list).

2. This block installs system packages. We install Python 3.8 along with some other useful packages such as git, wget, and unzip. Any other OS packages you want baked into the image can be added here.

3. Create a non-root user for the container. It is good practice to work inside the container as a non-root user rather than as root - files created or edited in bind-mounted directories as root end up owned by root on your host filesystem, which makes a mess of permissions.

4. Create a Python virtual environment and add it to the PATH - this is equivalent to activating it. Again, it is good practice to use a virtualenv inside the container even if it is not strictly necessary. Some people even use Anaconda inside containers, although I think that is overkill (and it has the disadvantage of locking your users into Anaconda). The chmod loosens permissions so that the non-root user can install additional packages while the container is running.

5. Install requirements from requirements.txt.
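
As mentioned in point 1, targeting a different minimum CUDA version is just a matter of swapping the base image. For example, to support older GPUs you could use something like the line below - check the tags currently available on Docker Hub, since Nvidia periodically removes old ones:

# example alternative base image for a CUDA 10.2 minimum
FROM nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04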

Note: most Dockerfiles end with a CMD instruction, which specifies the command to run when the container starts - this could launch the application or simply start bash. It is omitted here because VScode devcontainers do not need it (VScode overrides the startup command and keeps the container running itself). If you want to use the Dockerfile without VScode devcontainers, you will need to add one, as sketched below.
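
For example, you could append CMD ["bash"] to the end of the Dockerfile and then, from the project root (so that requirements.txt is in the build context), build and run it manually with something like the following - the image name and mount paths are placeholders, so adjust them to your project:

# build the image from the project root, using the Dockerfile in .devcontainer/
docker build -f .devcontainer/Dockerfile -t my-project .
# start an interactive container with GPUs and the project mounted in
docker run -it --rm --gpus all --user vscode \
    -v "$(pwd)":/workspace -w /workspace my-project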

VScode Devcontainers

Docker containers are great for ensuring a consistent environment but there’s an issue with using them on their own - Docker is mostly designed for deployment, not development. You could spin up a container, SSH into it, mount your data and source code, forward any necessary ports, and do your development through the terminal but that would not be fun. Fortunately, VScode has a brilliant feature called Devcontainers which does those steps for you and also attaches a VScode window to the container. This allows you to develop seamlessly using the remote development features of VScode and gives you the same experience as any local project. To use this, simply ensure you have the Remote Development Extension Pack installed. The VScode docs are very good and are worth reading but the main thing to understand is that you can use a devcontainer.json file to tell VScode how to build the container - an annotated example file is shown below, and the reference docs can be found here:

{
  "name": "Deep Learning GPU: CUDA 11.3",
  // Build args (1)
  "build": {
    "dockerfile": "Dockerfile",
    "context": "..",
    "args": {
      "PYTHON_VERSION": "3.8"
    }
  },
  // Run args (2)
  "runArgs": ["--gpus=all", "--privileged"],
  // Mounts (3)
  "mounts": [
    "source=/vol/biodata/data,target=${containerWorkspaceFolder}/mounted-data,type=bind"
  ],
  // settings for the vscode workspace (can also be set when it's running)
  "settings": {
    // This is the venv path set in the Dockerfile
    "python.defaultInterpreterPath": "/opt/venv/bin/python"
  },
  // Extensions to preinstall in the container (you can install more when it's running)
  "extensions": [
    "ms-python.python",
    "ms-python.vscode-pylance",
    "github.copilot",
    "github.vscode-pull-request-github",
    "njpwerner.autodocstring"
  ],
  "features": {
    "github-cli": "latest"
  },
  "containerUser": "vscode", // we created this user in the Dockerfile
  "shutdownAction": "none" // don't stop container on exit
}

1. Behind the scenes, VScode simply builds an image from the Dockerfile by running docker build; these are the arguments passed to that command. Here, we give the path to the Dockerfile, the build context (set to .. so that the rest of the project - including requirements.txt - is visible to Docker), and a PYTHON_VERSION argument that overrides the default value in the Dockerfile. A rough equivalent of the underlying commands is sketched after this list.

2. Similarly, VScode starts the container with docker run; these are extra arguments for that command. We set --gpus=all to make all available GPUs visible to the container, and we also run with --privileged - this is not always necessary, but it works around an occasional issue where the GPUs are not visible inside the container.

3. Mount any necessary files/data to the container (you do not need to mount the source code of the project because VScode does that automatically). In this example, we mount the biodata/data/ datasets to a directory in the workspace called mounted-data/.
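
For reference, the configuration above corresponds very roughly to commands like the ones below, run from the project root. This is a simplified sketch - VScode adds its own labels, workspace mount, and bookkeeping - and the image name is arbitrary, with /workspaces/your_brilliant_project standing in for ${containerWorkspaceFolder}:

# roughly what VScode does when building the image
docker build -f .devcontainer/Dockerfile --build-arg PYTHON_VERSION=3.8 -t my-devcontainer .
# roughly what VScode does when starting the container
docker run -it --gpus=all --privileged --user vscode \
    --mount type=bind,source=/vol/biodata/data,target=/workspaces/your_brilliant_project/mounted-data \
    my-devcontainer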

Notice also that you can specify any settings and extensions you want set up in the workspace inside the container. These are not strictly necessary, because you can always change settings and extensions while the container is running (i.e. in the normal way through the VScode UI). The nice thing is that anyone opening the project can reproduce all of your tooling with a setup that suits the project's needs. Companies appear to be seeing the benefits of this too, and have started using devcontainers to speed up onboarding.

A nice feature of VScode devcontainers is one-click installation of projects: simply run Remote-Containers: Clone Repository in Container Volume from the command palette (alternatively, clone the repository normally and run Remote-Containers: Open Folder in Container) and you're done! There is a minor difference between these two methods - the first stores your source code in a Docker volume (essentially a filesystem that only Docker can access), whereas the second keeps the code in your local filesystem and bind-mounts it into the container. On Linux there is generally not much difference between the two, but on Windows/MacOS volumes are noticeably more performant.

Miscellaneous: Tips & Tricks, Gotchas, Limitations, and Workarounds

One tip worth highlighting: pin the exact, CUDA-specific builds of your deep learning libraries in requirements.txt, along with the extra find-links URLs that pip needs to resolve those wheels. For example:

-f https://storage.googleapis.com/jax-releases/jax_releases.html
-f https://download.pytorch.org/whl/cu113/torch_stable.html
jax==0.2.26
jaxlib==0.1.75+cuda11.cudnn82
torch==1.10.0+cu113

Note that this is not a container-specific issue - pinning exact GPU builds like this is something you should do in all your projects.
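
Inside the environment (container or not), a quick sanity check that the GPU builds were actually installed and can see your hardware is something like:

# check the installed PyTorch build, its CUDA version, and GPU visibility
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# check which devices JAX can see
python -c "import jax; print(jax.devices())"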