Adding NVIDIA GPU Support to Docker Model Runner

Enable GPU acceleration for Docker Model Runner with NVIDIA CUDA support

Docker Model Runner is Docker’s official tool for running AI models locally, but enabling NVIDIA GPU acceleration requires specific configuration.

Unlike standard docker run commands, docker model run doesn’t support --gpus or -e flags, so GPU support must be configured at the Docker daemon level and during runner installation.
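
For example, the first command below requests the GPU per container with --gpus, while docker model run offers no such flag (the model name is simply the one used later in this guide):

# Standard containers: GPU access is requested per container
docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubi8 nvidia-smi

# Not supported: docker model run has no --gpus (or -e) flag
# docker model run --gpus all ai/qwen3:14B-Q6_K "who are you?"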

If you’re looking for an alternative LLM hosting solution with easier GPU configuration, consider Ollama, which has built-in GPU support and simpler setup. However, Docker Model Runner offers better integration with Docker’s ecosystem and OCI artifact distribution.

Docker Model Runner with NVIDIA GPU support (header image generated with the Flux 1 dev AI model).

Prerequisites

Before configuring GPU support, ensure you have:

  • An NVIDIA GPU with the NVIDIA drivers installed on the host
  • The NVIDIA Container Toolkit, which provides nvidia-container-runtime
  • Docker Engine with the Docker Model Runner plugin (the docker model CLI)

Verify your GPU is accessible:

nvidia-smi

Test Docker GPU access:

docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubi8 nvidia-smi

For more Docker commands and configuration options, see our Docker Cheatsheet.

Step 1: Configure Docker Daemon for NVIDIA Runtime

Docker Model Runner requires the NVIDIA runtime to be set as the default runtime in the Docker daemon configuration.

Find NVIDIA Container Runtime Path

First, locate where nvidia-container-runtime is installed:

which nvidia-container-runtime

This typically outputs /usr/bin/nvidia-container-runtime. Note this path for the next step.

Configure Docker Daemon

Create or update /etc/docker/daemon.json to set NVIDIA as the default runtime:

sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

Important: If which nvidia-container-runtime returned a different path, update the "path" value in the JSON configuration accordingly.
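
Alternatively, recent versions of the NVIDIA Container Toolkit ship an nvidia-ctk helper that can write this daemon configuration for you (a convenience sketch; verify the resulting daemon.json afterwards):

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default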

Restart Docker Service

Apply the configuration by restarting Docker:

sudo systemctl restart docker
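
Restarting the daemon briefly stops running containers (unless live-restore is enabled in daemon.json), so pick a convenient moment. You can confirm Docker is back up with:

sudo systemctl is-active docker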

Verify Configuration

Confirm the NVIDIA runtime is configured:

docker info | grep -i runtime

You should see Default Runtime: nvidia in the output.
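
The output should look roughly like the following; the exact list of runtimes varies by installation:

 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia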

Step 2: Install Docker Model Runner with GPU Support

Docker Model Runner must be installed or reinstalled with explicit GPU support. The runner container itself needs to be the CUDA-enabled version.

Stop Current Runner (if running)

If Docker Model Runner is already installed, stop it first:

docker model stop-runner

Install/Reinstall with CUDA Support

Install or reinstall Docker Model Runner with CUDA GPU support:

docker model reinstall-runner --gpu cuda

This command:

  • Pulls the CUDA-enabled version (docker/model-runner:latest-cuda) instead of the CPU-only version
  • Configures the runner container to use NVIDIA runtime
  • Enables GPU acceleration for all models

Note: If you’ve already installed Docker Model Runner without GPU support, you must reinstall it with the --gpu cuda flag. Configuring the Docker daemon alone is not enough; the runner container itself must be the CUDA-enabled image.
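
To confirm the reinstall picked up the CUDA image, you can check which image the runner container was created from (docker-model-runner is the container name used elsewhere in this guide; the tag may differ on your system):

docker inspect docker-model-runner --format '{{.Config.Image}}'

This should print docker/model-runner:latest-cuda rather than the CPU-only image.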

Available GPU Backends

Docker Model Runner supports multiple GPU backends:

  • cuda - NVIDIA CUDA (most common for NVIDIA GPUs)
  • rocm - AMD ROCm (for AMD GPUs)
  • musa - Moore Threads MUSA
  • cann - Huawei CANN
  • auto - Automatic detection (default, may not work correctly)
  • none - CPU only

For NVIDIA GPUs, always use --gpu cuda explicitly.

Step 3: Verify GPU Access

After installation, verify that Docker Model Runner can access your GPU.

Check Runner Container GPU Access

Test GPU access from within the Docker Model Runner container:

docker exec docker-model-runner nvidia-smi

This should display your GPU information, confirming the container has GPU access.
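
For a more compact check, nvidia-smi can be asked for just the fields of interest using its standard query options:

docker exec docker-model-runner nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv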

Check Runner Status

Verify Docker Model Runner is running:

docker model status

You should see the runner is active with llama.cpp support.

Step 4: Test Model with GPU

Run a model and verify it’s using the GPU.

Run a Model

Start a model inference:

docker model run ai/qwen3:14B-Q6_K "who are you?"
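
If you omit the prompt, docker model run should drop you into an interactive chat session with the model (type /bye to exit):

docker model run ai/qwen3:14B-Q6_K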

Verify GPU Usage in Logs

Check the Docker Model Runner logs for GPU confirmation:

docker model logs | grep -i cuda

You should see messages indicating GPU usage:

  • using device CUDA0 (NVIDIA GeForce RTX 4080) - GPU device detected
  • offloaded 41/41 layers to GPU - Model layers loaded on GPU
  • CUDA0 model buffer size = 10946.13 MiB - GPU memory allocation
  • CUDA0 KV buffer size = 640.00 MiB - Key-value cache on GPU
  • CUDA0 compute buffer size = 306.75 MiB - Compute buffer on GPU

Monitor GPU Usage

In another terminal, monitor GPU usage in real-time:

nvidia-smi -l 1

You should see GPU memory usage and utilization increase when the model is running.
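
If you prefer a narrower view than the full nvidia-smi table, the standard query options report just utilization and memory once per second:

nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1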

For more advanced GPU monitoring options and tools, see our guide on GPU monitoring applications in Linux / Ubuntu.

Troubleshooting

Model Still Using CPU

If the model is still running on CPU:

  1. Verify Docker daemon configuration:

    docker info | grep -i runtime
    

    Should show Default Runtime: nvidia

  2. Check runner container runtime:

    docker inspect docker-model-runner | grep -A 2 '"Runtime"'
    

    Should show "Runtime": "nvidia"

  3. Reinstall runner with GPU support:

    docker model reinstall-runner --gpu cuda
    
  4. Check logs for errors:

    docker model logs | tail -50
    

GPU Not Detected

If GPU is not detected:

  1. Verify NVIDIA Container Toolkit is installed (an installation sketch follows after this list):

    dpkg -l | grep nvidia-container-toolkit
    
  2. Test GPU access with standard Docker:

    docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubi8 nvidia-smi
    

    For troubleshooting Docker issues, refer to our Docker Cheatsheet.

  3. Check NVIDIA drivers:

    nvidia-smi
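
If the toolkit is missing, install it from NVIDIA’s package repository. The following is a sketch for Debian/Ubuntu and assumes the NVIDIA apt repository is already configured; see the NVIDIA Container Toolkit documentation for other distributions:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker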
    

Performance Issues

If GPU performance is poor:

  1. Check GPU utilization:

    nvidia-smi
    

    Look for high GPU utilization percentage

  2. Verify model layers are on GPU:

    docker model logs | grep "offloaded.*layers to GPU"
    

    All layers should be offloaded to GPU

  3. Check for memory issues:

    nvidia-smi
    

    Ensure GPU memory isn’t exhausted

Best Practices

  1. Always specify GPU backend explicitly: Use --gpu cuda instead of --gpu auto for NVIDIA GPUs to ensure correct configuration.

  2. Verify configuration after changes: Always check docker info | grep -i runtime after modifying Docker daemon settings.

  3. Monitor GPU usage: Use nvidia-smi to monitor GPU memory and utilization during model inference. For more advanced monitoring tools, see our guide on GPU monitoring applications in Linux / Ubuntu.

  4. Check logs regularly: Review docker model logs to ensure models are using GPU acceleration.

  5. Use appropriate model sizes: Ensure your GPU has sufficient memory for the model. Use quantized models (Q4, Q5, Q6, Q8) for better GPU memory efficiency. For help choosing the right GPU for your AI workloads, see our guide on Comparing NVidia GPU specs suitability for AI. A rough sizing example follows below.
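
As a rough back-of-the-envelope check (an approximation, not an exact formula): the weights of a quantized model need about parameters × bits-per-weight / 8 bytes of GPU memory, plus room for the KV cache and compute buffers. For the 14B Q6_K model used above (~6.6 bits per weight), that works out to roughly 11 GB, which lines up with the 10946 MiB CUDA0 model buffer reported in the logs. You can compare such estimates against the size reported for each local model by:

docker model list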