How Ollama Handles Parallel Requests
Configuring Ollama for parallel request execution.
When the Ollama server receives two requests at the same time, its behavior depends on its configuration and available system resources.
Concurrent Request Handling
- Parallel Processing: Ollama supports concurrent processing of requests. If the system has enough available memory (RAM for CPU inference, VRAM for GPU inference), multiple models can be loaded at once, and each loaded model can handle several requests in parallel. This is controlled by the environment variable OLLAMA_NUM_PARALLEL, which sets the maximum number of parallel requests each model can process simultaneously. By default this is 4 (or 1, depending on memory availability), but it can be adjusted, as in the sketch after this list.
- Batching: When multiple requests for the same model arrive simultaneously, Ollama batches them and processes them together, so both requests are handled in parallel and users see responses streaming back at the same time. The server does not intentionally wait to fill a batch; processing starts as soon as requests are available.
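A minimal sketch of raising the parallelism limit, assuming a local install where the server is started by hand rather than as a system service (the variable must be visible to the `ollama serve` process, not to the client):

```bash
# Allow each loaded model to serve up to 4 requests concurrently.
# This must be set in the environment of the server process.
export OLLAMA_NUM_PARALLEL=4
ollama serve
```

If Ollama runs as a managed service instead, the same variable goes into the service environment (see the systemd example later in this article).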
Queuing and Limits
- Queuing: If the number of concurrent requests exceeds the configured parallelism (e.g., more than OLLAMA_NUM_PARALLEL requests for a model), additional requests are queued. The queue operates in first-in, first-out (FIFO) order.
- Queue Limits: The maximum number of queued requests is controlled by OLLAMA_MAX_QUEUE (default: 512). If the queue is full, new requests receive a 503 error indicating the server is overloaded.
- Model Loading: The number of different models that can be loaded at the same time is controlled by OLLAMA_MAX_LOADED_MODELS. If a request requires loading a new model and memory is insufficient, Ollama will unload idle models to make room, and the request is queued until the model is loaded. The models currently held in memory can be inspected as shown below.
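To see which models are resident in memory at any moment (and therefore whether a new request will trigger a load or an unload), the `ollama ps` command or the `/api/ps` endpoint can be used. A small sketch, assuming the server listens on the default port 11434:

```bash
# Models currently loaded, their size, and when they will be unloaded
ollama ps

# The same information over the REST API
curl -s http://localhost:11434/api/ps
```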
Example Scenario
If two requests for the same model arrive at the same time and the server’s parallelism is set to at least 2, both requests will be processed together in a batch, and both users will receive responses concurrently. If parallelism is set to 1, one request is processed immediately, and the other is queued until the first finishes.
If the requests are for different models and there is enough memory, both models can be loaded and the requests handled in parallel. If not, one model may need to be unloaded, and the request will be queued.
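The same-model scenario is easy to reproduce from a shell. The sketch below (the model name and prompts are purely illustrative) fires two non-streaming requests at the same model in the background and waits for both; with a parallelism of at least 2 they complete concurrently, while with a parallelism of 1 the second only starts once the first has finished:

```bash
# Two concurrent generate requests against the same model.
# "llama2" is a placeholder for any model you have pulled.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sea salty?", "stream": false}' &
wait  # block until both background requests have returned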
Summary Table
| Scenario | Result |
|---|---|
| Two requests, same model, enough parallelism | Both processed together in parallel (batched) |
| Two requests, same model, parallelism = 1 | One processed immediately; the second is queued until the first completes |
| Two requests, different models, enough memory | Both models loaded; requests handled in parallel |
| Two requests, different models, not enough memory | One request queued until memory is available or a model is unloaded |
In summary, Ollama is designed to handle multiple simultaneous requests efficiently, provided the server is configured for concurrency and has sufficient resources. Otherwise, requests are queued and processed in order.
Memory Insufficiency Handling
When Ollama encounters insufficient memory to handle incoming requests, it employs a combination of queuing mechanisms and resource management strategies to maintain stability:
Request Queuing
- New requests are placed in a FIFO (First-In, First-Out) queue when memory cannot be immediately allocated.
- The queue size is controlled by OLLAMA_MAX_QUEUE (default: 512 requests).
- If the queue reaches capacity, new requests receive 503 “Server Overloaded” errors (a client-side retry pattern is sketched after this list).
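Clients can treat the 503 as a back-pressure signal and retry after a short pause. A minimal sketch, with the retry interval and attempt count chosen arbitrarily rather than taken from any Ollama default:

```bash
# Retry a generate request while the server reports 503 (queue full).
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /tmp/ollama_response.json -w '%{http_code}' \
    http://localhost:11434/api/generate \
    -d '{"model": "llama2", "prompt": "Hello", "stream": false}')
  if [ "$status" != "503" ]; then
    cat /tmp/ollama_response.json   # success or a non-overload error
    break
  fi
  echo "Server overloaded (503), retrying in 2s..." >&2
  sleep 2
done
```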
Model Management
- Active models may be unloaded from memory when they become idle to free resources for queued requests.
- The number of concurrently loaded models is limited by OLLAMA_MAX_LOADED_MODELS (default: 3×GPU count or 3 for CPU).
Memory Optimization
- Ollama attempts to batch-process requests for the same model to maximize memory efficiency.
- For GPU inference, performance is best when the entire model fits in VRAM; if it does not, layers spill over to the CPU, which is significantly slower.
Failure Scenarios
Critical Memory Exhaustion: When even queued requests exceed available resources, Ollama may:
- Page to disk (severely degrading performance)
- Return “out of memory” errors
- Crash the model instance in extreme cases
Configuration Controls

| Setting | Purpose | Default Value |
|---|---|---|
| OLLAMA_MAX_QUEUE | Maximum queued requests | 512 |
| OLLAMA_NUM_PARALLEL | Parallel requests per loaded model | 4 (or 1 when memory is limited) |
| OLLAMA_MAX_LOADED_MODELS | Maximum concurrently loaded models | 3 × GPU count, or 3 for CPU inference |
Administrators should monitor memory usage and adjust these parameters to match their hardware. Handling memory shortfalls becomes especially important when running larger models (7B+ parameters) or serving many concurrent requests. The settings can be made persistent as shown below.
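On Linux installs that register the standard `ollama` systemd service, a common way to make these settings persistent is a service override; a sketch, with example values rather than recommendations:

```bash
# Add environment overrides for the Ollama systemd service.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_QUEUE=256"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF

# Reload systemd and restart the server so the new limits take effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```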
Ollama Optimization Strategies
Enable GPU acceleration with export OLLAMA_CUDA=1 and set CPU threads via export OLLAMA_NUM_THREADS=8.
Hardware Enhancements
- RAM: 32GB+ for 13B models, 64GB+ for 70B models
- Storage: NVMe SSDs for faster model loading/swapping
- GPU: NVIDIA RTX 3080/4090 with 16GB+ VRAM for larger models
Operational Strategies
- Batch Requests: Process multiple queries simultaneously to amortize memory overhead
- Automatic Model Unloading: Let Ollama purge idle models from memory once they are no longer in use
- Cache Frequently Used Models: Keep common models memory-resident via keep_alive (see the sketch after this list)
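Both idle unloading and keeping a model resident are driven by the keep_alive setting, either per request in the API body or globally via OLLAMA_KEEP_ALIVE in the server environment. A small sketch with an illustrative model name:

```bash
# Keep the model loaded for 30 minutes after this request finishes.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Hi", "stream": false, "keep_alive": "30m"}'

# Unload the model immediately after it responds.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Hi", "stream": false, "keep_alive": 0}'

# Server-wide default, set in the environment of the Ollama server process.
export OLLAMA_KEEP_ALIVE=30m
```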
Monitoring & Troubleshooting
- Use nvidia-smi (GPU) and htop (CPU/RAM) to identify bottlenecks (see the sketch after this list)
- For memory errors:
  - Switch to quantized models
  - Reduce concurrent requests
  - Increase swap space
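A quick way to correlate Ollama's own view of loaded models with system-level memory pressure, assuming an NVIDIA GPU and standard Linux tooling:

```bash
# What Ollama has loaded and how much memory each model occupies
ollama ps

# GPU memory and utilization, refreshed every 2 seconds
watch -n 2 nvidia-smi

# CPU and system RAM usage
htop
```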
Example optimization workflow:
```bash
# Limit loaded models and parallel requests
# (set in the environment of the Ollama server before it starts)
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4

# Use a quantized model with GPU acceleration
export OLLAMA_CUDA=1
ollama run llama2:7b-q4_0
# Inside the interactive session, shrink the context window to save memory:
#   /set parameter num_ctx 2048
```
These adjustments can reduce memory usage by 30-60% while maintaining response quality, which is particularly beneficial when running multiple models or handling high request volumes.
Ollama: Batching Requests vs Parallel Execution
Batching in Ollama refers to the practice of grouping multiple incoming requests together and processing them as a unit. This allows for more efficient use of computational resources, especially when running on hardware that benefits from parallelized operations (such as GPUs).
When multiple requests for the same model arrive simultaneously, Ollama can process them together in a batch if memory allows. This increases throughput and can reduce latency for each request, as the model can leverage optimized matrix operations over the batch.
Batching is particularly effective when requests are similar in size and complexity, as this allows for better hardware utilization.
Parallel execution in Ollama means handling multiple requests at the same time, either for the same model or for different models, depending on available memory and configuration.
Ollama supports two levels of parallelism:
- Multiple Model Loading: If enough memory is available, several models can be loaded and serve requests simultaneously.
- Parallel Requests per Model: Each loaded model can process several requests in parallel, controlled by the OLLAMA_NUM_PARALLEL setting (default is 1 or 4, depending on memory).
When requests exceed the parallelism limit, they are queued (FIFO) up to OLLAMA_MAX_QUEUE.
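The second level of parallelism, serving different models side by side, requires OLLAMA_MAX_LOADED_MODELS and available memory to accommodate both. A sketch using two illustrative model names:

```bash
# Server side (before startup): allow two models to stay resident.
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve &
sleep 2  # give the server a moment to start (sketch only)

# Client side: query two different models at the same time.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Summarize TCP in one line.", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Summarize UDP in one line.", "stream": false}' &
wait
```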
Takeaway
Ollama leverages both batching and parallel execution to process multiple requests efficiently. Batching groups requests for simultaneous processing, while parallel execution allows multiple requests (or models) to run concurrently. Both methods depend on system memory and are configurable for optimal performance.