Docker Model Runner: Context Size Config Guide
Configure context sizes in Docker Model Runner with workarounds
Configuring context sizes in Docker Model Runner is more complex than it should be.
While the context_size parameter exists in docker-compose configuration, it’s often ignored by the docker/model-runner:latest-cuda image, which hardcodes a 4096-token context size. This guide explores the limitations and provides practical workarounds.
Understanding the Problem
When using Docker Model Runner with docker-compose, you might configure context size like this:
services:
  llm:
    image: docker/model-runner:latest-cuda
    models:
      - llm_model

models:
  llm_model:
    model: ai/gemma3-qat:4B
    context_size: 10240
However, checking the logs reveals the actual context size being used:
docker compose logs 2>&1 | grep -i "n_ctx"
You’ll see output like:
llamaCppArgs: [... --ctx-size 4096 ...]
llama_context: n_ctx = 4096
The docker/model-runner:latest-cuda image is hardcoding --ctx-size 4096 when calling llama.cpp, completely ignoring your context_size: 10240 configuration.
Why This Happens
The context size (n_ctx) is set at model initialization time in llama.cpp, the underlying inference engine that Docker Model Runner uses. This happens during the model’s context construction phase, before any API requests are processed. Docker Model Runner’s compose integration appears to have a bug where it doesn’t properly pass the context_size parameter from the models section to the underlying llama.cpp process. Instead, it defaults to 4096 tokens regardless of your configuration.
This limitation means that even though Docker Compose recognizes the context_size parameter in your YAML configuration, the docker/model-runner:latest-cuda image doesn’t respect it when constructing the llama.cpp command-line arguments. The hardcoded --ctx-size 4096 flag takes precedence over any configuration you specify.
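If you want to see this for yourself, you can inspect the llama.cpp process inside the running container. This is a rough check, assuming the compose service is named llm as in the examples in this guide; the loop over /proc avoids depending on ps being installed in the image:
# Print the command lines of all processes in the runner container
# and look for the --ctx-size flag passed to llama.cpp
docker compose exec llm sh -c \
  'for p in /proc/[0-9]*/cmdline; do tr "\0" " " < "$p"; echo; done' | grep -i "ctx-size"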
Workarounds and Solutions
What can you do? The first three methods below work to some degree, but each has limitations. Method 1 is an ad-hoc fix that works for direct API calls. Method 2 bakes the context size into the model itself. Method 3 requires containerizing your own application and adding it to the composition; that’s closer to a production setup.
Method 1: Use docker model configure (Limited)
You can configure context size using the Docker Model CLI, which stores the configuration in Docker’s model metadata:
docker model configure --context-size=10000 ai/gemma3-qat:4B
This command updates the model’s configuration, but the implementation has significant limitations. The configuration is stored but not always applied correctly.
Limitations:
- The setting is not applied when using docker model run directly; it only takes effect via curl to the API endpoint
- You cannot use docker model run after configuring; it will ignore the configuration
- The configuration is ignored when using docker-compose with the docker/model-runner:latest-cuda image
- The configuration may be lost when the model is updated or pulled again
This method works best for testing with direct API calls, but isn’t suitable for production deployments using docker-compose.
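For a quick sanity check with direct API calls, you can send a request to the Model Runner endpoint and then look at the logs for the n_ctx value actually used. This is a sketch, not a definitive recipe: it assumes host-side TCP access is enabled on port 12434 and that your CLI version provides docker model logs; adjust the URL and port to your setup.
# Query the OpenAI-compatible endpoint directly
# (port 12434 and the /engines path are assumptions; adjust to your setup)
curl -s http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/gemma3-qat:4B", "messages": [{"role": "user", "content": "Say OK"}]}'
# Then check which context size the runner actually used
# (requires a CLI version that provides "docker model logs")
docker model logs 2>&1 | grep -i "n_ctx" | tail -5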
Method 2: Package Your Own Model
The most reliable way to set a custom context size is to package your own model with the desired context size using docker model package. This bakes the context size into the model’s metadata at packaging time:
docker model package \
  --gguf /path/to/model.gguf \
  --context-size 10240 \
  --name my-model:custom-context
This creates a new OCI artifact (similar to a Docker image) with the context size permanently configured. The packaged model can then be pushed to Docker Hub or any OCI-compliant registry and pulled like any other Docker Model Runner model.
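Once packaged, the model is distributed like any other OCI artifact. A minimal sketch, assuming a hypothetical Docker Hub namespace yourname (note that the --name you choose at packaging time should include the namespace if you intend to push):
# Push the packaged model (the "yourname" namespace is hypothetical)
docker model push yourname/my-model:custom-context
# On the deployment machine, pull it back
docker model pull yourname/my-model:custom-context
# Then reference it in docker-compose.yml:
#   models:
#     llm_model:
#       model: yourname/my-model:custom-context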
However, this approach requires:
- Access to the original GGUF model file (the quantized format used by llama.cpp)
- Repackaging every time you want to change context size, which can be time-consuming
- Managing your own model registry or Docker Hub account
- Understanding of the Docker Model Runner packaging workflow
This method is best suited for production environments where you need consistent, reproducible context sizes across deployments.
Method 3: Docker Compose
This is currently broken for the docker/model-runner:latest-cuda image, but the same compose syntax may still be useful when the model is consumed by your own application image in the composition. :)
While the syntax exists in docker-compose.yml:
services:
  llm:
    image: docker/model-runner:latest-cuda
    models:
      - llm_model

models:
  llm_model:
    model: ai/gemma3-qat:4B
    context_size: 10240
This doesn’t work - the context_size parameter is recognized by docker-compose but not applied. The model still uses 4096 tokens.
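As a side note, you can confirm that Compose itself parses the parameter; the problem is downstream, in how the runner image builds the llama.cpp arguments. A quick check (assuming your compose version renders the models element):
# The rendered configuration still contains context_size, which shows that
# Compose accepts the syntax even though the runner image ignores it
docker compose config | grep -B3 "context_size"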
Method 4: Environment Variables (Also Broken)
Attempting to use MODEL_CONTEXT environment variable:
services:
  llm:
    image: docker/model-runner:latest-cuda
    environment:
      - MODEL_CONTEXT=10240
This also doesn’t work - the environment variable is not respected when using docker-compose.
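A quick way to confirm that the variable reaches the container but has no effect (the service name llm matches the compose example above):
# The variable is visible inside the container...
docker compose exec llm sh -c 'echo "MODEL_CONTEXT=$MODEL_CONTEXT"'
# ...but llama.cpp is still started with --ctx-size 4096
docker compose logs 2>&1 | grep "llamaCppArgs" | tail -1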
Verifying Context Size
To check what context size is actually being used, examine the logs:
# Check llama.cpp arguments
docker compose logs 2>&1 | grep "llamaCppArgs"
# Check actual context size
docker compose logs 2>&1 | grep -i "n_ctx" | tail -10
You’ll see output like:
llamaCppArgs: [-ngl 999 --metrics --model /models/... --ctx-size 4096 ...]
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
If you see n_ctx = 4096 despite configuring a different value, your configuration is being ignored.
Testing Context Size
To verify whether your context size configuration is actually being applied, you need to test with prompts that exceed the default 4096-token limit. Here’s a practical bash script (with a bit of embedded Python) that tests whether your context size configuration is working:
#!/bin/bash
MODEL="ai/gemma3-qat:4B"
PORT=8085
export MODEL

# Generate a prompt that should exceed 4096 tokens
python3 -c "print('test ' * 5000)" > large_prompt.txt

# Build the JSON request body
python3 << 'PYTHON' > request.json
import json
import os

with open('large_prompt.txt', 'r') as f:
    large_prompt = f.read().strip()

request = {
    "model": os.environ.get("MODEL", "ai/gemma3-qat:4B"),
    "messages": [{
        "role": "user",
        "content": large_prompt
    }]
}
print(json.dumps(request))
PYTHON

# Send the request to the OpenAI-compatible endpoint
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json > response.json

# Check the token usage reported by the server
python3 << 'PYTHON'
import json

with open('response.json') as f:
    r = json.load(f)

if 'usage' in r:
    print(f"Prompt tokens: {r['usage']['prompt_tokens']}")
    if r['usage']['prompt_tokens'] > 4096:
        print("✅ Context window is larger than 4096!")
    else:
        print("⚠️ Context window appears to be limited to 4096")
else:
    # An error response here usually means the prompt exceeded the context window
    print("No usage field in response:", r)
PYTHON
Alternative Solutions
If you need more flexible context size configuration, consider these alternatives:
- Ollama - An alternative LLM hosting solution that provides better control over context sizes and simpler configuration. Ollama allows you to specify context size per model and doesn’t have the same docker-compose limitations (see the sketch after this list).
- Docker Model Runner vs Ollama comparison - A detailed comparison of both solutions, including context size configuration capabilities, performance, and when to choose each platform.
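For comparison, here is a minimal sketch of how a per-model context size is set in Ollama, via the num_ctx parameter in a Modelfile (the base model tag and derived model name are just examples):
# Create a derived Ollama model with a 10240-token context window
cat > Modelfile <<'EOF'
FROM gemma3:4b
PARAMETER num_ctx 10240
EOF
ollama create gemma3-10k -f Modelfile
ollama run gemma3-10k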
Related Resources
Docker Model Runner
- Docker Model Runner Cheatsheet - Complete command reference with examples for all Docker Model Runner operations
- Adding NVIDIA GPU Support to Docker Model Runner - Step-by-step guide for enabling GPU acceleration
- Docker Model Runner vs Ollama: Which to Choose?
Docker and Infrastructure
- Docker Cheatsheet - Essential Docker commands for container management
- Docker Compose Cheatsheet - Complete guide to Docker Compose configuration and commands
Alternative LLM Solutions
- Ollama Cheatsheet - Alternative LLM hosting solution with built-in GPU support and simpler context size configuration
- Integrating Ollama with Python
Official Documentation
- Docker Model Runner Documentation - Official Docker documentation for Model Runner
- llama.cpp Context Configuration - Underlying inference engine documentation
Conclusion
Configuring context sizes in Docker Model Runner is currently problematic when using docker-compose. While the configuration syntax exists in the Docker Compose specification, it’s not properly implemented in the docker/model-runner:latest-cuda image, which hardcodes a 4096-token context size regardless of your configuration.
The most reliable workaround is to package your own models with the desired context size using docker model package, though this adds complexity to your workflow and requires access to the original GGUF model files. Alternatively, you can use docker model configure for direct API access, but this doesn’t work with docker-compose deployments.
For most use cases, the default 4096-token context size is sufficient for typical conversational AI applications. If you need larger context windows or more flexible configuration, consider using Ollama as an alternative, which provides better control over context sizes without the docker-compose limitations.
You can still optimize VRAM usage through other means like model quantization (Q4, Q6, Q8) and GPU layer configuration (MODEL_GPU_LAYERS), which are more effective for reducing memory consumption than context size adjustments anyway.
For more details on GPU optimization and VRAM management, see our guide on configuring NVIDIA GPU support.