Docker Model Runner: Context Size Config Guide
Configure context sizes in Docker Model Runner with workarounds
Configuring context sizes in Docker Model Runner is more complex than it should be.
While the context_size parameter exists in docker-compose configuration, it’s often ignored by the docker/model-runner:latest-cuda image, which hardcodes a 4096-token context size. This guide explores the limitations and provides practical workarounds.
Understanding the Problem
When using Docker Model Runner with docker-compose, you might configure context size like this:
services:
  llm:
    image: docker/model-runner:latest-cuda
    models:
      - llm_model

models:
  llm_model:
    model: ai/gemma3-qat:4B
    context_size: 10240
However, checking the logs reveals the actual context size being used:
docker compose logs 2>&1 | grep -i "n_ctx"
You’ll see output like:
llamaCppArgs: [... --ctx-size 4096 ...]
llama_context: n_ctx = 4096
The docker/model-runner:latest-cuda image is hardcoding --ctx-size 4096 when calling llama.cpp, completely ignoring your context_size: 10240 configuration.
Why This Happens
The context size (n_ctx) is set at model initialization time in llama.cpp, the underlying inference engine that Docker Model Runner uses. This happens during the model’s context construction phase, before any API requests are processed. Docker Model Runner’s compose integration appears to have a bug where it doesn’t properly pass the context_size parameter from the models section to the underlying llama.cpp process. Instead, it defaults to 4096 tokens regardless of your configuration.
This limitation means that even though Docker Compose recognizes the context_size parameter in your YAML configuration, the docker/model-runner:latest-cuda image doesn’t respect it when constructing the llama.cpp command-line arguments. The hardcoded --ctx-size 4096 flag takes precedence over any configuration you specify.
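If you want to see this for yourself, you can inspect the llama.cpp process inside the running container. This is a rough check, assuming the compose service is named llm as in the examples in this guide; the loop over /proc avoids depending on ps being installed in the image:
# Print the command lines of all processes in the runner container
# and look for the --ctx-size flag passed to llama.cpp
docker compose exec llm sh -c \
  'for p in /proc/[0-9]*/cmdline; do tr "\0" " " < "$p"; echo; done' | grep -i "ctx-size"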
Workarounds and Solutions
What can you do? The first three methods below work to some degree, but each has limitations. Method 1 is an ad-hoc fix that works for direct API calls. Method 2 bakes the context size into the model itself. Method 3 requires containerizing your own application and adding it to the composition; that’s closer to a production setup.
Method 1: Use docker model configure (Limited)
You can configure context size using the Docker Model CLI, which stores the configuration in Docker’s model metadata:
docker model configure --context-size=10000 ai/gemma3-qat:4B
This command updates the model’s configuration, but the implementation has significant limitations. The configuration is stored but not always applied correctly.
Limitations:
- The setting is not applied when using docker model run directly; it only takes effect via curl to the API endpoint
- You cannot use docker model run after configuring; it will ignore the configuration
- The configuration is ignored when using docker-compose with the docker/model-runner:latest-cuda image
- The configuration may be lost when the model is updated or pulled again
This method works best for testing with direct API calls, but isn’t suitable for production deployments using docker-compose.
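For a quick sanity check with direct API calls, you can send a request to the Model Runner endpoint and then look at the logs for the n_ctx value actually used. This is a sketch, not a definitive recipe: it assumes host-side TCP access is enabled on port 12434 and that your CLI version provides docker model logs; adjust the URL and port to your setup.
# Query the OpenAI-compatible endpoint directly
# (port 12434 and the /engines path are assumptions; adjust to your setup)
curl -s http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/gemma3-qat:4B", "messages": [{"role": "user", "content": "Say OK"}]}'
# Then check which context size the runner actually used
# (requires a CLI version that provides "docker model logs")
docker model logs 2>&1 | grep -i "n_ctx" | tail -5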
Method 2: Package Your Own Model
The most reliable way to set a custom context size is to package your own model with the desired context size using docker model package. This bakes the context size into the model’s metadata at packaging time:
docker model package \
  --gguf /path/to/model.gguf \
  --context-size 10240 \
  --name my-model:custom-context
This creates a new OCI artifact (similar to a Docker image) with the context size permanently configured. The packaged model can then be pushed to Docker Hub or any OCI-compliant registry and pulled like any other Docker Model Runner model.
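Once packaged, the model is distributed like any other OCI artifact. A minimal sketch, assuming a hypothetical Docker Hub namespace yourname (note that the --name you choose at packaging time should include the namespace if you intend to push):
# Push the packaged model (the "yourname" namespace is hypothetical)
docker model push yourname/my-model:custom-context
# On the deployment machine, pull it back
docker model pull yourname/my-model:custom-context
# Then reference it in docker-compose.yml:
#   models:
#     llm_model:
#       model: yourname/my-model:custom-context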
However, this approach requires:
- Access to the original GGUF model file (the quantized format used by llama.cpp)
- Repackaging every time you want to change context size, which can be time-consuming
- Managing your own model registry or Docker Hub account
- Understanding of the Docker Model Runner packaging workflow
This method is best suited for production environments where you need consistent, reproducible context sizes across deployments.
Method 3: Docker Compose
This is currently broken for the docker/model-runner:latest-cuda image, but the same compose syntax may still be useful when the model is consumed by your own application image in the composition. :)
While the syntax exists in docker-compose.yml:
services:
  llm:
    image: docker/model-runner:latest-cuda
    models:
      - llm_model

models:
  llm_model:
    model: ai/gemma3-qat:4B
    context_size: 10240
This doesn’t work - the context_size parameter is recognized by docker-compose but not applied. The model still uses 4096 tokens.
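As a side note, you can confirm that Compose itself parses the parameter; the problem is downstream, in how the runner image builds the llama.cpp arguments. A quick check (assuming your compose version renders the models element):
# The rendered configuration still contains context_size, which shows that
# Compose accepts the syntax even though the runner image ignores it
docker compose config | grep -B3 "context_size"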
Method 4: Environment Variables (Also Broken)
Attempting to use MODEL_CONTEXT environment variable:
services:
  llm:
    image: docker/model-runner:latest-cuda
    environment:
      - MODEL_CONTEXT=10240
This also doesn’t work - the environment variable is not respected when using docker-compose.
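A quick way to confirm that the variable reaches the container but has no effect (the service name llm matches the compose example above):
# The variable is visible inside the container...
docker compose exec llm sh -c 'echo "MODEL_CONTEXT=$MODEL_CONTEXT"'
# ...but llama.cpp is still started with --ctx-size 4096
docker compose logs 2>&1 | grep "llamaCppArgs" | tail -1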
Verifying Context Size
To check what context size is actually being used, examine the logs:
# Check llama.cpp arguments
docker compose logs 2>&1 | grep "llamaCppArgs"
# Check actual context size
docker compose logs 2>&1 | grep -i "n_ctx" | tail -10
You’ll see output like:
llamaCppArgs: [-ngl 999 --metrics --model /models/... --ctx-size 4096 ...]
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
If you see n_ctx = 4096 despite configuring a different value, your configuration is being ignored.
Testing Context Size
To verify whether your context size configuration is actually being applied, you need to test with prompts that exceed the default 4096-token limit. Here’s a practical bash script (with a bit of embedded Python) that tests whether your context size configuration is working:
#!/bin/bash
MODEL="ai/gemma3-qat:4B"
PORT=8085
export MODEL

# Generate a prompt that should exceed 4096 tokens
python3 -c "print('test ' * 5000)" > large_prompt.txt

# Build the JSON request body
python3 << 'PYTHON' > request.json
import json
import os

with open('large_prompt.txt', 'r') as f:
    large_prompt = f.read().strip()

request = {
    "model": os.environ.get("MODEL", "ai/gemma3-qat:4B"),
    "messages": [{
        "role": "user",
        "content": large_prompt
    }]
}
print(json.dumps(request))
PYTHON

# Send the request to the OpenAI-compatible endpoint
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json > response.json

# Check the token usage reported by the server
python3 << 'PYTHON'
import json

with open('response.json') as f:
    r = json.load(f)

if 'usage' in r:
    print(f"Prompt tokens: {r['usage']['prompt_tokens']}")
    if r['usage']['prompt_tokens'] > 4096:
        print("✅ Context window is larger than 4096!")
    else:
        print("⚠️ Context window appears to be limited to 4096")
else:
    # An error response here usually means the prompt exceeded the context window
    print("No usage field in response:", r)
PYTHON
Alternative Solutions
If you need more flexible context size configuration, consider these alternatives:
- Ollama - An alternative LLM hosting solution that provides better control over context sizes and simpler configuration. Ollama allows you to specify context size per model and doesn’t have the same docker-compose limitations (see the sketch after this list).
- Docker Model Runner vs Ollama comparison - A detailed comparison of both solutions, including context size configuration capabilities, performance, and when to choose each platform.
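For comparison, here is a minimal sketch of how a per-model context size is set in Ollama, via the num_ctx parameter in a Modelfile (the base model tag and derived model name are just examples):
# Create a derived Ollama model with a 10240-token context window
cat > Modelfile <<'EOF'
FROM gemma3:4b
PARAMETER num_ctx 10240
EOF
ollama create gemma3-10k -f Modelfile
ollama run gemma3-10k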
Related Resources
Docker Model Runner
- Docker Model Runner Cheatsheet - Complete command reference with examples for all Docker Model Runner operations
- Adding NVIDIA GPU Support to Docker Model Runner - Step-by-step guide for enabling GPU acceleration
- Docker Model Runner vs Ollama: Which to Choose?
Docker and Infrastructure
- Docker Cheatsheet - Essential Docker commands for container management
- Docker Compose Cheatsheet - Complete guide to Docker Compose configuration and commands
Alternative LLM Solutions
- Ollama Cheatsheet - Alternative LLM hosting solution with built-in GPU support and simpler context size configuration
- Integrating Ollama with Python
Official Documentation
- Docker Model Runner Documentation - Official Docker documentation for Model Runner
- llama.cpp Context Configuration - Underlying inference engine documentation
Conclusion
Configuring context sizes in Docker Model Runner is currently problematic when using docker-compose. While the configuration syntax exists in the Docker Compose specification, it’s not properly implemented in the docker/model-runner:latest-cuda image, which hardcodes a 4096-token context size regardless of your configuration.
The most reliable workaround is to package your own models with the desired context size using docker model package, though this adds complexity to your workflow and requires access to the original GGUF model files. Alternatively, you can use docker model configure for direct API access, but this doesn’t work with docker-compose deployments.
For most use cases, the default 4096-token context size is sufficient for typical conversational AI applications. If you need larger context windows or more flexible configuration, consider using Ollama as an alternative, which provides better control over context sizes without the docker-compose limitations.
You can still optimize VRAM usage through other means like model quantization (Q4, Q6, Q8) and GPU layer configuration (MODEL_GPU_LAYERS), which are more effective for reducing memory consumption than context size adjustments anyway.
For more details on GPU optimization and VRAM management, see our guide on configuring NVIDIA GPU support.