Docker Model Runner Cheatsheet: Commands & Examples

Quick reference for Docker Model Runner commands

Docker Model Runner (DMR) is Docker’s official solution for running AI models locally, introduced in April 2025. This cheatsheet provides a quick reference for all essential commands, configurations, and best practices.

Installation

Docker Desktop

Enable Docker Model Runner through the GUI:

  1. Open Docker Desktop
  2. Go to Settings → AI tab
  3. Click Enable Docker Model Runner
  4. Restart Docker Desktop
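
Once enabled, confirm the CLI plugin is available from a terminal (docker model status reports whether the runner backend is active):

# Verify the plugin and runner status
docker model version
docker model status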

Docker Engine (Linux)

Install the plugin package:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker-model-plugin

# Fedora/RHEL
sudo dnf install docker-model-plugin

# Arch Linux
sudo pacman -S docker-model-plugin

Verify installation:

docker model --help
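
As a quick smoke test, pull a small model from Docker's ai/ namespace and run a one-off prompt (ai/smollm2 is used here purely as a lightweight example):

# Pull a small model and ask it a one-off question
docker model pull ai/smollm2
docker model run ai/smollm2 "Say hello in one sentence."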

Core Commands

Pulling Models

Pull pre-packaged models from Docker Hub:

# Basic pull
docker model pull ai/llama2

# Pull specific version
docker model pull ai/llama2:7b-q4

# Pull from custom registry
docker model pull myregistry.com/models/mistral:latest

# List available models in a namespace
docker search ai/

Running Models

Start a model with automatic API serving:

# Basic run (interactive)
docker model run ai/llama2 "What is Docker?"

# Run as service (background)
docker model run -d --name my-llm ai/llama2

# Run with custom port
docker model run -p 8080:8080 ai/llama2

# Run with GPU specification
docker model run --gpus 0,1 ai/llama2

# Run with memory limit
docker model run --memory 8g ai/llama2

# Run with environment variables
docker model run -e MODEL_CONTEXT=4096 ai/llama2

# Run with volume mount for persistent data
docker model run -v model-data:/data ai/llama2

Listing Models

View downloaded and running models:

# List all downloaded models
docker model ls

# List running models
docker model ps

# List with detailed information
docker model ls --all --format json

# Filter by name
docker model ls --filter "name=llama"

Stopping Models

Stop running model instances:

# Stop specific model
docker model stop my-llm

# Stop all running models
docker model stop $(docker model ps -q)

# Stop with timeout
docker model stop --time 30 my-llm

Removing Models

Delete models from local storage:

# Remove specific model
docker model rm ai/llama2

# Remove with force (even if running)
docker model rm -f ai/llama2

# Remove unused models
docker model prune

# Remove all models
docker model rm $(docker model ls -q)

Packaging Custom Models

Create OCI Artifact from GGUF

Package your own GGUF models:

# Basic packaging
docker model package --gguf /path/to/model.gguf myorg/mymodel:latest

# Package with metadata
docker model package \
  --gguf /path/to/model.gguf \
  --label "description=Custom Llama model" \
  --label "version=1.0" \
  myorg/mymodel:v1.0

# Package and push in one command
docker model package --gguf /path/to/model.gguf --push myorg/mymodel:latest

# Package with custom context size
docker model package \
  --gguf /path/to/model.gguf \
  --context 8192 \
  myorg/mymodel:latest
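
After packaging, it is worth sanity-checking the artifact locally before publishing, reusing the ls, inspect, and run commands from above:

# Confirm the model is registered, then inspect and test it
docker model ls
docker model inspect myorg/mymodel:latest
docker model run myorg/mymodel:latest "Test prompt"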

Publishing Models

Push models to registries:

# Login to Docker Hub
docker login

# Push to Docker Hub
docker model push myorg/mymodel:latest

# Push to private registry
docker login myregistry.com
docker model push myregistry.com/models/mymodel:latest

# Tag and push
docker model tag mymodel:latest myorg/mymodel:v1.0
docker model push myorg/mymodel:v1.0

API Usage

OpenAI-Compatible Endpoints

Docker Model Runner automatically exposes OpenAI-compatible APIs:

# Start model with API
docker model run -d -p 8080:8080 --name llm ai/llama2

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Text generation
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Once upon a time",
    "max_tokens": 100
  }'

# Streaming response
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'

# List available models via API
curl http://localhost:8080/v1/models

# Model information
curl http://localhost:8080/v1/models/llama2
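
A simple readiness check is to poll the models endpoint until it answers; the jq filter below (jq also appears in the Bash example later) prints just the model IDs:

# Readiness check: print the IDs of the models the server reports
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'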

Docker Compose Configuration

Basic Compose File

version: '3.8'

services:
  llm:
    image: docker-model-runner
    model: ai/llama2:7b-q4
    ports:
      - "8080:8080"
    environment:
      - MODEL_CONTEXT=4096
      - MODEL_TEMPERATURE=0.7
    volumes:
      - model-data:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  model-data:
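
Bring the stack up in the background and confirm the API answers on the mapped port:

# Start the service and probe the API
docker compose up -d
curl -s http://localhost:8080/v1/models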

Multi-Model Setup

version: '3.8'

services:
  llama:
    image: docker-model-runner
    model: ai/llama2
    ports:
      - "8080:8080"
    
  mistral:
    image: docker-model-runner
    model: ai/mistral
    ports:
      - "8081:8080"
    
  embedding:
    image: docker-model-runner
    model: ai/nomic-embed-text
    ports:
      - "8082:8080"

For more advanced Docker Compose configurations and commands, see our Docker Compose Cheatsheet covering networking, volumes, and orchestration patterns.

Environment Variables

Configure model behavior with environment variables:

# Context window size
MODEL_CONTEXT=4096

# Temperature (0.0-1.0)
MODEL_TEMPERATURE=0.7

# Top-p sampling
MODEL_TOP_P=0.9

# Top-k sampling
MODEL_TOP_K=40

# Maximum tokens
MODEL_MAX_TOKENS=2048

# Number of GPU layers
MODEL_GPU_LAYERS=35

# Batch size
MODEL_BATCH_SIZE=512

# Thread count (CPU)
MODEL_THREADS=8

# Enable verbose logging
MODEL_VERBOSE=true

# API key for authentication
MODEL_API_KEY=your-secret-key

Run with environment variables:

docker model run \
  -e MODEL_CONTEXT=8192 \
  -e MODEL_TEMPERATURE=0.8 \
  -e MODEL_API_KEY=secret123 \
  ai/llama2

GPU Configuration

Automatic GPU Detection

DMR automatically detects and uses available GPUs:

# Use all GPUs
docker model run --gpus all ai/llama2

# Use specific GPU
docker model run --gpus 0 ai/llama2

# Use multiple specific GPUs
docker model run --gpus 0,1,2 ai/llama2

# GPU with memory limit
docker model run --gpus all --memory 16g ai/llama2

CPU-Only Mode

Force CPU inference even when a GPU is available:

docker model run --no-gpu ai/llama2

Multi-GPU Tensor Parallelism

Distribute large models across GPUs:

docker model run \
  --gpus all \
  --tensor-parallel 2 \
  ai/llama2-70b

Inspection and Debugging

View Model Details

# Inspect model configuration
docker model inspect ai/llama2

# View model layers
docker model history ai/llama2

# Check model size and metadata
docker model inspect --format='{{.Size}}' ai/llama2

Logs and Monitoring

# View model logs
docker model logs llm

# Follow logs in real-time
docker model logs -f llm

# View last 100 lines
docker model logs --tail 100 llm

# View logs with timestamps
docker model logs -t llm

Performance Stats

# Resource usage
docker model stats

# Specific model stats
docker model stats llm

# Stats in JSON format
docker model stats --format json

Networking

Exposing APIs

# Default port (8080)
docker model run -p 8080:8080 ai/llama2

# Custom port
docker model run -p 3000:8080 ai/llama2

# Bind to specific interface
docker model run -p 127.0.0.1:8080:8080 ai/llama2

# Multiple ports
docker model run -p 8080:8080 -p 9090:9090 ai/llama2

Network Configuration

# Create custom network
docker network create llm-network

# Run model on custom network
docker model run --network llm-network --name llm ai/llama2

# Connect to existing network
docker model run --network host ai/llama2
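
On a user-defined network, other containers can reach the model by its container name without a published port. A minimal check using the public curlimages/curl image (assuming the model serves on 8080 inside the network, as in the examples above):

# Call the model container by name from another container
docker run --rm --network llm-network curlimages/curl \
  -s http://llm:8080/v1/models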

Security

Access Control

# Run with API key authentication
docker model run \
  -e MODEL_API_KEY=my-secret-key \
  ai/llama2

# Use with authentication
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "messages": [...]}'

Registry Authentication

# Login to private registry
docker login myregistry.com -u username -p password

# Pull from private registry
docker model pull myregistry.com/private/model:latest

# Read the password or token from stdin instead of the command line
docker login myregistry.com -u username --password-stdin < token.txt

Best Practices

Model Selection

# Use quantized models for faster inference
docker model pull ai/llama2:7b-q4     # 4-bit quantization
docker model pull ai/llama2:7b-q5     # 5-bit quantization
docker model pull ai/llama2:7b-q8     # 8-bit quantization

# Check model variants
docker search ai/llama2

Resource Management

# Set memory limits
docker model run --memory 8g --memory-swap 16g ai/llama2

# Set CPU limits
docker model run --cpus 4 ai/llama2

# Limit GPU memory
docker model run --gpus all --gpu-memory 8g ai/llama2

Health Checks

# Run with health check
docker model run \
  --health-cmd "curl -f http://localhost:8080/health || exit 1" \
  --health-interval 30s \
  --health-timeout 10s \
  --health-retries 3 \
  ai/llama2

Production Orchestration

For production deployments with Kubernetes, Docker Model Runner containers can be orchestrated using standard Kubernetes manifests. Define deployments with resource limits, autoscaling, and load balancing. For comprehensive Kubernetes command reference and deployment patterns, check our Kubernetes Cheatsheet.
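
A minimal sketch of what llm-deployment.yaml might contain is shown below; the image name and resource values are illustrative placeholders carried over from the Compose examples, not an official manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
        - name: llm
          image: docker-model-runner   # placeholder image, as in the Compose examples
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: 8Gi
              nvidia.com/gpu: 1        # requires the NVIDIA device plugin on the cluster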

# Example: Deploy to Kubernetes cluster
kubectl apply -f llm-deployment.yaml

# Scale deployment
kubectl scale deployment llm --replicas=3

# Expose as service
kubectl expose deployment llm --type=LoadBalancer --port=8080

Troubleshooting

Common Issues

Model won’t start:

# Check available disk space
df -h

# View detailed error logs
docker model logs --tail 50 llm

# Verify GPU availability
nvidia-smi  # For NVIDIA GPUs

Out of memory errors:

# Use smaller quantized model
docker model pull ai/llama2:7b-q4

# Reduce context size
docker model run -e MODEL_CONTEXT=2048 ai/llama2

# Limit batch size
docker model run -e MODEL_BATCH_SIZE=256 ai/llama2

Slow inference:

# Check GPU usage
docker model stats llm

# Ensure GPU is being used
docker model logs llm | grep -i gpu

# Increase GPU layers
docker model run -e MODEL_GPU_LAYERS=40 ai/llama2

Diagnostic Commands

# System information
docker model system info

# Disk usage
docker model system df

# Clean up unused resources
docker model system prune

# Full cleanup (remove all models)
docker model system prune -a

Integration Examples

Python Integration

import openai

# Configure client for Docker Model Runner
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # DMR doesn't require key by default
)

# Chat completion
response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
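
# If you also run an embedding model (such as the ai/nomic-embed-text service
# from the Compose example), the same client pattern works against the
# embeddings endpoint. The port and model name below are assumptions that
# follow that example and may differ in your setup.
emb_client = openai.OpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed"
)
embedding = emb_client.embeddings.create(
    model="nomic-embed-text",
    input="Docker Model Runner runs models locally."
)
print(len(embedding.data[0].embedding))  # vector dimensionality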

Bash Script

#!/bin/bash

# Start model if not running
if ! docker model ps | grep -q "llm"; then
    docker model run -d --name llm -p 8080:8080 ai/llama2
    echo "Waiting for model to start..."
    sleep 10
fi

# Make API call
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "'"$1"'"}]
  }' | jq -r '.choices[0].message.content'
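
Save this as, say, ask.sh, make it executable, and pass the prompt as the first argument:

chmod +x ask.sh
./ask.sh "What is Docker Model Runner?"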

Node.js Integration

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'http://localhost:8080/v1',
    apiKey: 'not-needed'
});

async function chat(message) {
    const completion = await client.chat.completions.create({
        model: 'llama2',
        messages: [{ role: 'user', content: message }]
    });
    
    return completion.choices[0].message.content;
}

// Usage
const response = await chat('What is Docker Model Runner?');
console.log(response);
