Docker Model Runner Cheatsheet: Commands & Examples

Quick reference for Docker Model Runner commands

Docker Model Runner (DMR) is Docker’s official solution for running AI models locally, introduced in April 2025. This cheatsheet provides a quick reference for all essential commands, configurations, and best practices.

Installation

Docker Desktop

Enable Docker Model Runner through the GUI:

  1. Open Docker Desktop
  2. Go to Settings → AI tab
  3. Click Enable Docker Model Runner
  4. Restart Docker Desktop
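
Once enabled, confirm the CLI plugin is available from a terminal (docker model status reports whether the runner backend is active):

# Verify the plugin and runner status
docker model version
docker model status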

Docker Engine (Linux)

Install the plugin package:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker-model-plugin

# Fedora/RHEL
sudo dnf install docker-model-plugin

# Arch Linux
sudo pacman -S docker-model-plugin

Verify installation:

docker model --help
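
As a quick smoke test, pull a small model from Docker's ai/ namespace and run a one-off prompt (ai/smollm2 is used here purely as a lightweight example):

# Pull a small model and ask it a one-off question
docker model pull ai/smollm2
docker model run ai/smollm2 "Say hello in one sentence."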

Core Commands

Pulling Models

Pull pre-packaged models from Docker Hub:

# Basic pull
docker model pull ai/llama2

# Pull specific version
docker model pull ai/llama2:7b-q4

# Pull from custom registry
docker model pull myregistry.com/models/mistral:latest

# List available models in a namespace
docker search ai/

Running Models

Start a model with automatic API serving:

# Basic run (interactive)
docker model run ai/llama2 "What is Docker?"

# Run as service (background)
docker model run -d --name my-llm ai/llama2

# Run with custom port
docker model run -p 8080:8080 ai/llama2

# Run with GPU specification
docker model run --gpus 0,1 ai/llama2

# Run with memory limit
docker model run --memory 8g ai/llama2

# Run with environment variables
docker model run -e MODEL_CONTEXT=4096 ai/llama2

# Run with volume mount for persistent data
docker model run -v model-data:/data ai/llama2

Listing Models

View downloaded and running models:

# List all downloaded models
docker model ls

# List running models
docker model ps

# List with detailed information
docker model ls --all --format json

# Filter by name
docker model ls --filter "name=llama"

Stopping Models

Stop running model instances:

# Stop specific model
docker model stop my-llm

# Stop all running models
docker model stop $(docker model ps -q)

# Stop with timeout
docker model stop --time 30 my-llm

Removing Models

Delete models from local storage:

# Remove specific model
docker model rm ai/llama2

# Remove with force (even if running)
docker model rm -f ai/llama2

# Remove unused models
docker model prune

# Remove all models
docker model rm $(docker model ls -q)

Packaging Custom Models

Create OCI Artifact from GGUF

Package your own GGUF models:

# Basic packaging
docker model package --gguf /path/to/model.gguf myorg/mymodel:latest

# Package with metadata
docker model package \
  --gguf /path/to/model.gguf \
  --label "description=Custom Llama model" \
  --label "version=1.0" \
  myorg/mymodel:v1.0

# Package and push in one command
docker model package --gguf /path/to/model.gguf --push myorg/mymodel:latest

# Package with custom context size
docker model package \
  --gguf /path/to/model.gguf \
  --context 8192 \
  myorg/mymodel:latest
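
After packaging, it is worth sanity-checking the artifact locally before publishing, reusing the ls, inspect, and run commands from above:

# Confirm the model is registered, then inspect and test it
docker model ls
docker model inspect myorg/mymodel:latest
docker model run myorg/mymodel:latest "Test prompt"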

Publishing Models

Push models to registries:

# Login to Docker Hub
docker login

# Push to Docker Hub
docker model push myorg/mymodel:latest

# Push to private registry
docker login myregistry.com
docker model push myregistry.com/models/mymodel:latest

# Tag and push
docker model tag mymodel:latest myorg/mymodel:v1.0
docker model push myorg/mymodel:v1.0

API Usage

OpenAI-Compatible Endpoints

Docker Model Runner automatically exposes OpenAI-compatible APIs:

# Start model with API
docker model run -d -p 8080:8080 --name llm ai/llama2

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Text generation
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Once upon a time",
    "max_tokens": 100
  }'

# Streaming response
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'

# List available models via API
curl http://localhost:8080/v1/models

# Model information
curl http://localhost:8080/v1/models/llama2
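
A simple readiness check is to poll the models endpoint until it answers; the jq filter below (jq also appears in the Bash example later) prints just the model IDs:

# Readiness check: print the IDs of the models the server reports
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'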

Docker Compose Configuration

Basic Compose File

version: '3.8'

services:
  llm:
    image: docker-model-runner
    model: ai/llama2:7b-q4
    ports:
      - "8080:8080"
    environment:
      - MODEL_CONTEXT=4096
      - MODEL_TEMPERATURE=0.7
    volumes:
      - model-data:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  model-data:
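
Bring the stack up in the background and confirm the API answers on the mapped port:

# Start the service and probe the API
docker compose up -d
curl -s http://localhost:8080/v1/models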

Multi-Model Setup

version: '3.8'

services:
  llama:
    image: docker-model-runner
    model: ai/llama2
    ports:
      - "8080:8080"
    
  mistral:
    image: docker-model-runner
    model: ai/mistral
    ports:
      - "8081:8080"
    
  embedding:
    image: docker-model-runner
    model: ai/nomic-embed-text
    ports:
      - "8082:8080"

For more advanced Docker Compose configurations and commands, see our Docker Compose Cheatsheet covering networking, volumes, and orchestration patterns.

Environment Variables

Configure model behavior with environment variables:

# Context window size
MODEL_CONTEXT=4096

# Temperature (0.0-1.0)
MODEL_TEMPERATURE=0.7

# Top-p sampling
MODEL_TOP_P=0.9

# Top-k sampling
MODEL_TOP_K=40

# Maximum tokens
MODEL_MAX_TOKENS=2048

# Number of GPU layers
MODEL_GPU_LAYERS=35

# Batch size
MODEL_BATCH_SIZE=512

# Thread count (CPU)
MODEL_THREADS=8

# Enable verbose logging
MODEL_VERBOSE=true

# API key for authentication
MODEL_API_KEY=your-secret-key

Run with environment variables:

docker model run \
  -e MODEL_CONTEXT=8192 \
  -e MODEL_TEMPERATURE=0.8 \
  -e MODEL_API_KEY=secret123 \
  ai/llama2

GPU Configuration

Automatic GPU Detection

DMR automatically detects and uses available GPUs:

# Use all GPUs
docker model run --gpus all ai/llama2

# Use specific GPU
docker model run --gpus 0 ai/llama2

# Use multiple specific GPUs
docker model run --gpus 0,1,2 ai/llama2

# GPU with memory limit
docker model run --gpus all --memory 16g ai/llama2

CPU-Only Mode

Force CPU inference even when a GPU is available:

docker model run --no-gpu ai/llama2

Multi-GPU Tensor Parallelism

Distribute large models across GPUs:

docker model run \
  --gpus all \
  --tensor-parallel 2 \
  ai/llama2-70b

Inspection and Debugging

View Model Details

# Inspect model configuration
docker model inspect ai/llama2

# View model layers
docker model history ai/llama2

# Check model size and metadata
docker model inspect --format='{{.Size}}' ai/llama2

Logs and Monitoring

# View model logs
docker model logs llm

# Follow logs in real-time
docker model logs -f llm

# View last 100 lines
docker model logs --tail 100 llm

# View logs with timestamps
docker model logs -t llm

Performance Stats

# Resource usage
docker model stats

# Specific model stats
docker model stats llm

# Stats in JSON format
docker model stats --format json

Networking

Exposing APIs

# Default port (8080)
docker model run -p 8080:8080 ai/llama2

# Custom port
docker model run -p 3000:8080 ai/llama2

# Bind to specific interface
docker model run -p 127.0.0.1:8080:8080 ai/llama2

# Multiple ports
docker model run -p 8080:8080 -p 9090:9090 ai/llama2

Network Configuration

# Create custom network
docker network create llm-network

# Run model on custom network
docker model run --network llm-network --name llm ai/llama2

# Connect to existing network
docker model run --network host ai/llama2
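
On a user-defined network, other containers can reach the model by its container name without a published port. A minimal check using the public curlimages/curl image (assuming the model serves on 8080 inside the network, as in the examples above):

# Call the model container by name from another container
docker run --rm --network llm-network curlimages/curl \
  -s http://llm:8080/v1/models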

Security

Access Control

# Run with API key authentication
docker model run \
  -e MODEL_API_KEY=my-secret-key \
  ai/llama2

# Use with authentication
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "messages": [...]}'

Registry Authentication

# Login to private registry
docker login myregistry.com -u username -p password

# Pull from private registry
docker model pull myregistry.com/private/model:latest

# Read the password or token from stdin instead of the command line
docker login myregistry.com -u username --password-stdin < token.txt

Best Practices

Model Selection

# Use quantized models for faster inference
docker model pull ai/llama2:7b-q4     # 4-bit quantization
docker model pull ai/llama2:7b-q5     # 5-bit quantization
docker model pull ai/llama2:7b-q8     # 8-bit quantization

# Check model variants
docker search ai/llama2

Resource Management

# Set memory limits
docker model run --memory 8g --memory-swap 16g ai/llama2

# Set CPU limits
docker model run --cpus 4 ai/llama2

# Limit GPU memory
docker model run --gpus all --gpu-memory 8g ai/llama2

Health Checks

# Run with health check
docker model run \
  --health-cmd "curl -f http://localhost:8080/health || exit 1" \
  --health-interval 30s \
  --health-timeout 10s \
  --health-retries 3 \
  ai/llama2

Production Orchestration

For production deployments with Kubernetes, Docker Model Runner containers can be orchestrated using standard Kubernetes manifests. Define deployments with resource limits, autoscaling, and load balancing. For comprehensive Kubernetes command reference and deployment patterns, check our Kubernetes Cheatsheet.
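
A minimal sketch of what llm-deployment.yaml might contain is shown below; the image name and resource values are illustrative placeholders carried over from the Compose examples, not an official manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
        - name: llm
          image: docker-model-runner   # placeholder image, as in the Compose examples
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: 8Gi
              nvidia.com/gpu: 1        # requires the NVIDIA device plugin on the cluster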

# Example: Deploy to Kubernetes cluster
kubectl apply -f llm-deployment.yaml

# Scale deployment
kubectl scale deployment llm --replicas=3

# Expose as service
kubectl expose deployment llm --type=LoadBalancer --port=8080

Troubleshooting

Common Issues

Model won’t start:

# Check available disk space
df -h

# View detailed error logs
docker model logs --tail 50 llm

# Verify GPU availability
nvidia-smi  # For NVIDIA GPUs

Out of memory errors:

# Use smaller quantized model
docker model pull ai/llama2:7b-q4

# Reduce context size
docker model run -e MODEL_CONTEXT=2048 ai/llama2

# Limit batch size
docker model run -e MODEL_BATCH_SIZE=256 ai/llama2

Slow inference:

# Check GPU usage
docker model stats llm

# Ensure GPU is being used
docker model logs llm | grep -i gpu

# Increase GPU layers
docker model run -e MODEL_GPU_LAYERS=40 ai/llama2

Diagnostic Commands

# System information
docker model system info

# Disk usage
docker model system df

# Clean up unused resources
docker model system prune

# Full cleanup (remove all models)
docker model system prune -a

Integration Examples

Python Integration

import openai

# Configure client for Docker Model Runner
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # DMR doesn't require key by default
)

# Chat completion
response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
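
# If you also run an embedding model (such as the ai/nomic-embed-text service
# from the Compose example), the same client pattern works against the
# embeddings endpoint. The port and model name below are assumptions that
# follow that example and may differ in your setup.
emb_client = openai.OpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed"
)
embedding = emb_client.embeddings.create(
    model="nomic-embed-text",
    input="Docker Model Runner runs models locally."
)
print(len(embedding.data[0].embedding))  # vector dimensionality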

Bash Script

#!/bin/bash

# Start model if not running
if ! docker model ps | grep -q "llm"; then
    docker model run -d --name llm -p 8080:8080 ai/llama2
    echo "Waiting for model to start..."
    sleep 10
fi

# Make API call
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "'"$1"'"}]
  }' | jq -r '.choices[0].message.content'
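
Save this as, say, ask.sh, make it executable, and pass the prompt as the first argument:

chmod +x ask.sh
./ask.sh "What is Docker Model Runner?"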

Node.js Integration

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'http://localhost:8080/v1',
    apiKey: 'not-needed'
});

async function chat(message) {
    const completion = await client.chat.completions.create({
        model: 'llama2',
        messages: [{ role: 'user', content: message }]
    });
    
    return completion.choices[0].message.content;
}

// Usage
const response = await chat('What is Docker Model Runner?');
console.log(response);
