How Ollama Handles Parallel Requests
Configuring Ollama for parallel request execution.
When the Ollama server receives two requests at the same time, its behavior depends on its configuration and available system resources.
Concurrent Request Handling
- Parallel Processing: Ollama supports concurrent processing of requests. If the system has enough available memory (RAM for CPU inference, VRAM for GPU inference), multiple models can be loaded at once, and each loaded model can handle several requests in parallel. This is controlled by the environment variable OLLAMA_NUM_PARALLEL, which sets the maximum number of parallel requests each model can process simultaneously. By default this is 4 (or 1, depending on memory availability), but it can be adjusted, as in the sketch after this list.
- Batching: When multiple requests for the same model arrive simultaneously, Ollama batches them and processes them together, so both requests are handled in parallel and users see responses streaming back at the same time. The server does not intentionally wait to fill a batch; processing starts as soon as requests are available.
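A minimal sketch of raising the parallelism limit, assuming a local install where the server is started by hand rather than as a system service (the variable must be visible to the `ollama serve` process, not to the client):

```bash
# Allow each loaded model to serve up to 4 requests concurrently.
# This must be set in the environment of the server process.
export OLLAMA_NUM_PARALLEL=4
ollama serve
```

If Ollama runs as a managed service instead, the same variable goes into the service environment (see the systemd example later in this article).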
Queuing and Limits
- Queuing: If the number of concurrent requests exceeds the configured parallelism (e.g., more than OLLAMA_NUM_PARALLEL requests for a model), additional requests are queued. The queue operates in first-in, first-out (FIFO) order.
- Queue Limits: The maximum number of queued requests is controlled by OLLAMA_MAX_QUEUE (default: 512). If the queue is full, new requests receive a 503 error indicating the server is overloaded.
- Model Loading: The number of different models that can be loaded at the same time is controlled by OLLAMA_MAX_LOADED_MODELS. If a request requires loading a new model and memory is insufficient, Ollama will unload idle models to make room, and the request is queued until the model is loaded. The models currently held in memory can be inspected as shown below.
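To see which models are resident in memory at any moment (and therefore whether a new request will trigger a load or an unload), the `ollama ps` command or the `/api/ps` endpoint can be used. A small sketch, assuming the server listens on the default port 11434:

```bash
# Models currently loaded, their size, and when they will be unloaded
ollama ps

# The same information over the REST API
curl -s http://localhost:11434/api/ps
```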
Example Scenario
If two requests for the same model arrive at the same time and the server’s parallelism is set to at least 2, both requests will be processed together in a batch, and both users will receive responses concurrently. If parallelism is set to 1, one request is processed immediately, and the other is queued until the first finishes.
If the requests are for different models and there is enough memory, both models can be loaded and the requests handled in parallel. If not, one model may need to be unloaded, and the request will be queued.
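The same-model scenario is easy to reproduce from a shell. The sketch below (the model name and prompts are purely illustrative) fires two non-streaming requests at the same model in the background and waits for both; with a parallelism of at least 2 they complete concurrently, while with a parallelism of 1 the second only starts once the first has finished:

```bash
# Two concurrent generate requests against the same model.
# "llama2" is a placeholder for any model you have pulled.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sea salty?", "stream": false}' &
wait  # block until both background requests have returned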
Summary Table
| Scenario | Result |
|---|---|
| Two requests, same model, enough parallelism | Both processed together in parallel (batched) |
| Two requests, same model, parallelism = 1 | One processed immediately; the second is queued until the first completes |
| Two requests, different models, enough memory | Both models loaded; requests handled in parallel |
| Two requests, different models, not enough memory | One request queued until memory is available or a model is unloaded |
In summary, Ollama is designed to handle multiple simultaneous requests efficiently, provided the server is configured for concurrency and has sufficient resources. Otherwise, requests are queued and processed in order.
Memory Insufficiency Handling
When Ollama encounters insufficient memory to handle incoming requests, it employs a combination of queuing mechanisms and resource management strategies to maintain stability:
Request Queuing
- New requests are placed in a FIFO (First-In, First-Out) queue when memory cannot be immediately allocated.
- The queue size is controlled by OLLAMA_MAX_QUEUE (default: 512 requests).
- If the queue reaches capacity, new requests receive 503 “Server Overloaded” errors (a client-side retry pattern is sketched after this list).
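Clients can treat the 503 as a back-pressure signal and retry after a short pause. A minimal sketch, with the retry interval and attempt count chosen arbitrarily rather than taken from any Ollama default:

```bash
# Retry a generate request while the server reports 503 (queue full).
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /tmp/ollama_response.json -w '%{http_code}' \
    http://localhost:11434/api/generate \
    -d '{"model": "llama2", "prompt": "Hello", "stream": false}')
  if [ "$status" != "503" ]; then
    cat /tmp/ollama_response.json   # success or a non-overload error
    break
  fi
  echo "Server overloaded (503), retrying in 2s..." >&2
  sleep 2
done
```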
Model Management
- Active models may be unloaded from memory when they become idle to free resources for queued requests.
- The number of concurrently loaded models is limited by OLLAMA_MAX_LOADED_MODELS (default: 3×GPU count or 3 for CPU).
Memory Optimization
- Ollama attempts to batch-process requests for the same model to maximize memory efficiency.
- For GPU inference, performance is best when the entire model fits in VRAM; if it does not, layers spill over to the CPU, which is significantly slower.
Failure Scenarios
Critical Memory Exhaustion: When even queued requests exceed available resources, Ollama may:
- Page to disk (severely degrading performance)
- Return “out of memory” errors
- Crash the model instance in extreme cases
Configuration Controls

| Setting | Purpose | Default Value |
|---|---|---|
| OLLAMA_MAX_QUEUE | Maximum queued requests | 512 |
| OLLAMA_NUM_PARALLEL | Parallel requests per loaded model | 4 (or 1 when memory is limited) |
| OLLAMA_MAX_LOADED_MODELS | Maximum concurrently loaded models | 3 × GPU count, or 3 for CPU inference |
Administrators should monitor memory usage and adjust these parameters to match their hardware. Handling memory shortfalls becomes especially important when running larger models (7B+ parameters) or serving many concurrent requests. The settings can be made persistent as shown below.
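On Linux installs that register the standard `ollama` systemd service, a common way to make these settings persistent is a service override; a sketch, with example values rather than recommendations:

```bash
# Add environment overrides for the Ollama systemd service.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_QUEUE=256"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF

# Reload systemd and restart the server so the new limits take effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```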
Ollama Optimization Strategies
Enable GPU acceleration with export OLLAMA_CUDA=1 and set CPU threads via export OLLAMA_NUM_THREADS=8.
Hardware Enhancements
- RAM: 32GB+ for 13B models, 64GB+ for 70B models
- Storage: NVMe SSDs for faster model loading/swapping
- GPU: NVIDIA RTX 3080/4090 with 16GB+ VRAM for larger models
Operational Strategies
- Batch Requests: Process multiple queries simultaneously to amortize memory overhead
- Automatic Model Unloading: Let Ollama purge idle models from memory once they are no longer in use
- Cache Frequently Used Models: Keep common models memory-resident via keep_alive (see the sketch after this list)
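Both idle unloading and keeping a model resident are driven by the keep_alive setting, either per request in the API body or globally via OLLAMA_KEEP_ALIVE in the server environment. A small sketch with an illustrative model name:

```bash
# Keep the model loaded for 30 minutes after this request finishes.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Hi", "stream": false, "keep_alive": "30m"}'

# Unload the model immediately after it responds.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Hi", "stream": false, "keep_alive": 0}'

# Server-wide default, set in the environment of the Ollama server process.
export OLLAMA_KEEP_ALIVE=30m
```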
Monitoring & Troubleshooting
- Use nvidia-smi (GPU) and htop (CPU/RAM) to identify bottlenecks (see the sketch after this list)
- For memory errors:
  - Switch to quantized models
  - Reduce concurrent requests
  - Increase swap space
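A quick way to correlate Ollama's own view of loaded models with system-level memory pressure, assuming an NVIDIA GPU and standard Linux tooling:

```bash
# What Ollama has loaded and how much memory each model occupies
ollama ps

# GPU memory and utilization, refreshed every 2 seconds
watch -n 2 nvidia-smi

# CPU and system RAM usage
htop
```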
Example optimization workflow:
```bash
# Limit loaded models and parallel requests
# (set in the environment of the Ollama server before it starts)
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4

# Use a quantized model with GPU acceleration
export OLLAMA_CUDA=1
ollama run llama2:7b-q4_0
# Inside the interactive session, shrink the context window to save memory:
#   /set parameter num_ctx 2048
```
These adjustments can reduce memory usage by 30-60% while maintaining response quality, which is particularly beneficial when running multiple models or handling high request volumes.
Ollama: Batching Requests vs Parallel Execution
Batching in Ollama refers to the practice of grouping multiple incoming requests together and processing them as a unit. This allows for more efficient use of computational resources, especially when running on hardware that benefits from parallelized operations (such as GPUs).
When multiple requests for the same model arrive simultaneously, Ollama can process them together in a batch if memory allows. This increases throughput and can reduce latency for each request, as the model can leverage optimized matrix operations over the batch.
Batching is particularly effective when requests are similar in size and complexity, as this allows for better hardware utilization.
Parallel execution in Ollama means handling multiple requests at the same time, either for the same model or for different models, depending on available memory and configuration.
Ollama supports two levels of parallelism:
- Multiple Model Loading: If enough memory is available, several models can be loaded and serve requests simultaneously.
- Parallel Requests per Model: Each loaded model can process several requests in parallel, controlled by the OLLAMA_NUM_PARALLEL setting (default is 1 or 4, depending on memory).
When requests exceed the parallelism limit, they are queued (FIFO) up to OLLAMA_MAX_QUEUE.
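The second level of parallelism, serving different models side by side, requires OLLAMA_MAX_LOADED_MODELS and available memory to accommodate both. A sketch using two illustrative model names:

```bash
# Server side (before startup): allow two models to stay resident.
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve &
sleep 2  # give the server a moment to start (sketch only)

# Client side: query two different models at the same time.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Summarize TCP in one line.", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Summarize UDP in one line.", "stream": false}' &
wait
```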
Takeaway
Ollama leverages both batching and parallel execution to process multiple requests efficiently. Batching groups requests for simultaneous processing, while parallel execution allows multiple requests (or models) to run concurrently. Both methods depend on system memory and are configurable for optimal performance.