Can Ollama handle parallel requests?

Yes. Ollama can process multiple requests at once via batching and parallel execution. The number that run in parallel per model is set by OLLAMA_NUM_PARALLEL (by default Ollama picks 4 or 1 depending on available memory). Extra requests are queued, up to OLLAMA_MAX_QUEUE.

How does Ollama queue requests when limits are reached?

When concurrent requests exceed the parallelism limit, Ollama queues them in FIFO order. Queue size is capped by OLLAMA_MAX_QUEUE (default 512). If the queue is full, new requests get a 503 Server Overloaded response.

How many models can Ollama load at once?

OLLAMA_MAX_LOADED_MODELS controls this (default: 3 × the number of GPUs, or 3 for CPU inference). If a new model is needed and memory is low, Ollama can unload idle models; the request waits in the queue until the model is loaded.

What causes Ollama 503 or out-of-memory errors?

A 503 usually means the request queue is full (OLLAMA_MAX_QUEUE); raise that limit only if the system can handle the extra load. Out-of-memory errors happen when VRAM or RAM is exhausted. To avoid them, reduce concurrent requests or use smaller or quantized models.

Where can I find more on LLM performance and benchmarks?

Our LLM Performance hub covers throughput vs latency, VRAM limits, parallel requests, and benchmarks across runtimes and hardware.

How Ollama Handles Parallel Requests

Understand Ollama concurrency, queueing, and how to tune OLLAMA_NUM_PARALLEL for stable parallel requests.


This guide explains how Ollama handles parallel requests (concurrency, queuing, and resource limits), and how to tune it using the OLLAMA_NUM_PARALLEL environment variable (and related knobs).


For more on throughput, latency, VRAM, and benchmarks across runtimes and hardware, see LLM Performance: Benchmarks, Bottlenecks & Optimization.


Concurrent Request Handling

Parallel Processing: Ollama supports concurrent processing of requests. If the system has enough available memory (RAM for CPU inference, VRAM for GPU inference), multiple models can be loaded at once, and each loaded model can handle several requests in parallel. The per-model limit is set by the environment variable OLLAMA_NUM_PARALLEL. By default, Ollama picks 4 or 1 depending on available memory, but the value can be set explicitly.
Batching: When multiple requests for the same model arrive simultaneously, Ollama batches them and processes them together. The requests are handled in parallel, and users see their responses streaming back at the same time. The server does not intentionally wait to fill a batch; processing starts as soon as requests are available.

Queuing and Limits

Queuing: If the number of concurrent requests exceeds the configured parallelism (e.g., more than OLLAMA_NUM_PARALLEL requests for a model), additional requests are queued. The queue operates in a first-in, first-out (FIFO) manner.
Queue Limits: The maximum number of queued requests is controlled by OLLAMA_MAX_QUEUE (default: 512). If the queue is full, new requests receive a 503 error indicating the server is overloaded.
Model Loading: The number of different models that can be loaded at the same time is controlled by OLLAMA_MAX_LOADED_MODELS. If a request requires loading a new model and memory is insufficient, Ollama will unload idle models to make room, and the request will be queued until the model is loaded.

Example Scenario

If two requests for the same model arrive at the same time and the server’s parallelism is set to at least 2, both requests will be processed together in a batch, and both users will receive responses concurrently. If parallelism is set to 1, one request is processed immediately, and the other is queued until the first finishes.

If the requests are for different models and there is enough memory, both models can be loaded and the requests handled in parallel. If not, one model may need to be unloaded, and the request will be queued.

Summary Table

Scenario	Result
Two requests, same model, enough parallelism	Both processed together in parallel (batched)
Two requests, same model, parallelism=1	One processed, second queued until first completes
Two requests, different models, enough memory	Both models loaded, requests handled in parallel
Two requests, different models, not enough memory	One queued until memory is available or a model is unloaded

In summary, Ollama is designed to handle multiple simultaneous requests efficiently, provided the server is configured for concurrency and has sufficient resources. Otherwise, requests are queued and processed in order.
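
The queuing rules above can be sketched as a small shell function. This is a toy model under stated assumptions, not Ollama's actual scheduler: it looks at one burst of simultaneous requests for a single, already-loaded model, and the function name admit_burst and its argument order are made up for illustration.

```shell
# Split a burst of simultaneous requests for one model into
# running / queued / rejected, using the two limits described above.
# Hypothetical helper: admit_burst REQUESTS [NUM_PARALLEL] [MAX_QUEUE]
admit_burst() {
  requests=$1
  parallel=${2:-4}      # stands in for OLLAMA_NUM_PARALLEL
  max_queue=${3:-512}   # stands in for OLLAMA_MAX_QUEUE

  # Up to `parallel` requests start immediately.
  running=$requests
  if [ "$running" -gt "$parallel" ]; then running=$parallel; fi

  # The rest wait in the FIFO queue, up to `max_queue`.
  waiting=$((requests - running))
  queued=$waiting
  if [ "$queued" -gt "$max_queue" ]; then queued=$max_queue; fi

  # Anything beyond the queue limit would receive a 503.
  rejected=$((waiting - queued))
  echo "running=$running queued=$queued rejected=$rejected"
}

admit_burst 2 1 512     # running=1 queued=1 rejected=0
admit_burst 600 4 512   # running=4 queued=512 rejected=84
```

The second call shows why a large burst can produce 503 responses even with the default queue size.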

Memory Insufficiency Handling

When Ollama encounters insufficient memory to handle incoming requests, it employs a combination of queuing mechanisms and resource management strategies to maintain stability:

Request Queuing

New requests are placed in a FIFO (First-In, First-Out) queue when memory cannot be immediately allocated.
The queue size is controlled by OLLAMA_MAX_QUEUE (default: 512 requests).
If the queue reaches capacity, new requests receive 503 “Server Overloaded” errors.

Model Management

Loaded models may be unloaded once they become idle, freeing resources for queued requests.
The number of concurrently loaded models is limited by OLLAMA_MAX_LOADED_MODELS (default: 3×GPU count or 3 for CPU).

Memory Optimization

Ollama attempts to batch requests for the same model to make better use of memory.
For GPU inference, a new model must fit completely in VRAM before it can be loaded alongside models that are already resident; otherwise an idle model has to be unloaded first.

Failure Scenarios

Critical Memory Exhaustion: When demand still exceeds available memory after queuing and unloading, Ollama may:

Page to disk (severely degrading performance)
Return “out of memory” errors
Crash the model instance in extreme cases

Configuration Controls

Setting	Purpose	Default Value
OLLAMA_MAX_QUEUE	Maximum queued requests	512
OLLAMA_NUM_PARALLEL	Parallel requests per loaded model	4 (or 1 if limited)
OLLAMA_MAX_LOADED_MODELS	Maximum concurrently loaded models	3×GPU count or 3

Administrators should monitor memory usage and adjust these parameters to match their hardware. Careful memory handling matters most when running larger models (7B+ parameters) or serving many concurrent requests.

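Ollama optimization strategies

Ollama uses a supported GPU automatically when one is detected. OLLAMA_CUDA and OLLAMA_NUM_THREADS are not among Ollama's documented environment variables, so do not rely on them. To control CPU threads, set the num_thread model option instead (for example, via the options field of an API request).

Hardware Enhancements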

RAM: 32GB+ for 13B models, 64GB+ for 70B models
Storage: NVMe SSDs for faster model loading/swapping
GPU: an NVIDIA card with 16GB+ VRAM (for example, an RTX 4090 with 24GB) for larger models

Operational Strategies

Batch Requests: Process multiple queries simultaneously to amortize memory overhead
Automatic Model Unloading: Lets Ollama purge idle models from memory
Cache Frequently Used Models: Keep common models memory-resident
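
Keeping a model memory-resident is usually done through the keep_alive setting. The following is a minimal sketch, assuming a server on the default address http://localhost:11434; the model name llama3.2 is a placeholder, and the helper names keep_alive_body, pin_model, and unload_model are hypothetical.

```shell
OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}   # assumed default address

# Build the JSON body for a load/unload call: a generate request with no prompt.
keep_alive_body() {
  printf '{"model": "%s", "keep_alive": %s}' "$1" "$2"
}

# Load a model and keep it resident (keep_alive -1 means "do not expire").
pin_model() {
  curl -s -d "$(keep_alive_body "$1" -1)" "$OLLAMA_URL/api/generate"
}

# Unload a model immediately (keep_alive 0).
unload_model() {
  curl -s -d "$(keep_alive_body "$1" 0)" "$OLLAMA_URL/api/generate"
}

# Show the request body without contacting a server.
keep_alive_body llama3.2 -1
# To apply against a running server: pin_model llama3.2
```

A server-wide default can also be set with the OLLAMA_KEEP_ALIVE environment variable before starting ollama serve.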

Monitoring & Troubleshooting

Use nvidia-smi (GPU) and htop (CPU/RAM) to identify bottlenecks
For memory errors:
Switch to quantized models
Reduce concurrent requests
Increase swap space
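
To put a number on VRAM pressure while testing, the nvidia-smi query output can be reduced to a percentage. This is a small sketch; vram_pct is a hypothetical helper, and it assumes the "used, total" CSV layout produced by the query shown in the comment.

```shell
# Convert "used, total" lines (MiB) into a whole-number percentage, one line per GPU.
vram_pct() {
  awk -F', *' 'NF == 2 && $2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# On a machine with an NVIDIA GPU:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_pct
# To see which models are loaded and how they are split between CPU and GPU:
#   ollama ps

# Offline example with sample values:
printf '6144, 8192\n' | vram_pct   # prints 75
```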

Example optimization workflow:

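# Limit loaded models and parallel requests (set these before starting the server)
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
ollama serve

# In another terminal: run a 4-bit quantized model (exact tag is an assumption; check the model library)
ollama run llama2:7b-chat-q4_0
# Inside the interactive session, shrink the context window:
# /set parameter num_ctx 2048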

These adjustments can reduce memory usage by roughly 30-60%, usually with little loss in response quality; the exact savings depend on the model, quantization level, and context size. They help most when running multiple models or handling high request volumes.

`OLLAMA_NUM_PARALLEL` environment variable

OLLAMA_NUM_PARALLEL controls how many requests Ollama will execute in parallel. If you send multiple requests to the same Ollama server, this setting largely decides whether they run concurrently or queue up.

Higher values can increase throughput if you have enough CPU/GPU/VRAM, but may increase latency and memory pressure.
Lower values reduce contention and can improve stability, but requests will queue more often.
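
One reason higher values raise memory pressure: according to Ollama's FAQ, parallel processing for a model multiplies the context allocation by the number of parallel requests. A quick arithmetic sketch follows; effective_context is an illustrative name, and real memory use also depends on the model's per-token cache size, which is not modelled here.

```shell
# Total context (in tokens) the server allocates for one model:
# parallel slots multiplied by the per-request context length.
effective_context() {
  num_parallel=$1
  context_length=$2
  echo $((num_parallel * context_length))
}

effective_context 4 2048   # 8192
effective_context 1 2048   # 2048
```

Dropping from 4 parallel requests to 1 cuts the context allocation to a quarter in this example, which is why lowering OLLAMA_NUM_PARALLEL is a common first response to VRAM shortages.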

How to set `OLLAMA_NUM_PARALLEL`

Linux / macOS (in the shell that starts the server):

export OLLAMA_NUM_PARALLEL=2
ollama serve

One-off run (prefix just for this command):

OLLAMA_NUM_PARALLEL=2 ollama serve

Docker (example):

docker run --rm -e OLLAMA_NUM_PARALLEL=2 -p 11434:11434 ollama/ollama
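
When Ollama runs as a systemd service, variables exported in a shell do not reach it; they have to be set on the unit. The sketch below assumes the unit is named ollama.service (as created by the official Linux install script) and writes a standard systemd drop-in. The helper names dropin_text and apply_dropin are hypothetical.

```shell
# Print a systemd drop-in that sets OLLAMA_NUM_PARALLEL for the service.
dropin_text() {
  printf '[Service]\nEnvironment="OLLAMA_NUM_PARALLEL=%s"\n' "$1"
}

# Install the drop-in and restart the service (needs root via sudo).
# Assumes the unit is called ollama.service.
apply_dropin() {
  dir=/etc/systemd/system/ollama.service.d
  sudo mkdir -p "$dir" &&
    dropin_text "$1" | sudo tee "$dir/override.conf" >/dev/null &&
    sudo systemctl daemon-reload &&
    sudo systemctl restart ollama
}

# Preview the drop-in without changing anything:
dropin_text 2
# To apply it: apply_dropin 2
```

For the macOS desktop app, the equivalent is launchctl setenv OLLAMA_NUM_PARALLEL 2 followed by restarting the application.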

How to choose a value

Start with 1–2 for a single GPU / limited VRAM, then increase gradually while watching:

GPU VRAM usage (OOM / evictions)
CPU usage and load average
p95 latency of your typical requests
error rate / timeouts
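
The last two signals can be measured with a simple load probe. This is a rough sketch, assuming a server on http://localhost:11434 and a placeholder model name (llama3.2); fire_requests and p95 are hypothetical helpers, and a proper load-testing tool will give more reliable numbers.

```shell
OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}   # assumed default address
MODEL=${MODEL:-llama3.2}                           # placeholder model name

# Send N identical requests at once; print "<http_code> <seconds>" per request.
fire_requests() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
      -H 'Content-Type: application/json' \
      -d "{\"model\": \"$MODEL\", \"prompt\": \"Say hi\", \"stream\": false}" \
      "$OLLAMA_URL/api/generate" &
    i=$((i + 1))
  done
  wait   # block until every background curl has finished
}

# Nearest-rank 95th percentile of the numbers on stdin (one per line).
p95() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR > 0) print v[int((NR * 95 + 99) / 100)] }'
}

# Example (needs a running server):
#   fire_requests 8 | awk '{ print $2 }' | p95       # p95 latency in seconds
#   fire_requests 8 | awk '{ print $1 }' | sort | uniq -c   # count of each status code
```

Run the probe at each candidate OLLAMA_NUM_PARALLEL value with the same request count, and compare the p95 and the number of non-200 codes.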

For CLI usage, see the Ollama CLI commands cheatsheet, which includes command examples for ollama serve, ollama ps, and ollama run.

Quick tuning recipes

Stability-first

OLLAMA_NUM_PARALLEL=1
Use smaller / quantized models
Prefer shorter context sizes

Throughput-first

OLLAMA_NUM_PARALLEL=2 (or higher if you have headroom)
Consider request batching at the client layer
Ensure sufficient VRAM and CPU threads

“I run out of VRAM when two requests arrive”

Reduce OLLAMA_NUM_PARALLEL
Use a more aggressively quantized model
Reduce context length / max tokens

Troubleshooting

Symptoms that `OLLAMA_NUM_PARALLEL` is too high

Requests fail intermittently under load
GPU OOM / model unloading happens frequently
Latency spikes when the second request arrives

Symptoms that `OLLAMA_NUM_PARALLEL` is too low

CPU/GPU is underutilized
Queueing delays dominate total response time

Tip: If you also control your client, add retries with jitter and keep-alive connections. Many “Ollama is slow” issues are really queueing + connection overhead.
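
A retry loop with jitter might look like the sketch below. It assumes the server is at http://localhost:11434 and retries only on 503 (queue full); backoff_delay and post_with_retry are hypothetical helpers. Because each curl call is a separate process, this sketch does not reuse connections; keep-alive requires a long-lived HTTP client in your application.

```shell
OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}   # assumed default address

# Full-jitter backoff: a random whole number of seconds in
# [0, min(cap, base * 2^attempt)].
backoff_delay() {
  attempt=$1
  base=${2:-1}
  cap=${3:-30}
  max=$base
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$max" -lt "$cap" ]; do
    max=$((max * 2))
    i=$((i + 1))
  done
  if [ "$max" -gt "$cap" ]; then max=$cap; fi
  # Two random bytes from the kernel give a value in 0..65535.
  rand=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  echo $((rand % (max + 1)))
}

# POST a JSON body to /api/generate, retrying only when the server answers 503.
# Prints the final HTTP status code; the response body is discarded in this sketch.
post_with_retry() {
  body=$1
  tries=${2:-5}
  try=0
  while [ "$try" -lt "$tries" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H 'Content-Type: application/json' -d "$body" "$OLLAMA_URL/api/generate")
    if [ "$code" != "503" ]; then
      echo "$code"
      return 0
    fi
    sleep "$(backoff_delay "$try")"
    try=$((try + 1))
  done
  echo "503"
  return 1
}

# Example (needs a running server; model name is a placeholder):
#   post_with_retry '{"model": "llama3.2", "prompt": "Say hi", "stream": false}'
```

The random delay spreads retries out so that clients rejected together do not all return at the same moment and refill the queue.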

Ollama: Batching Requests vs Parallel Execution

Batching in Ollama refers to the practice of grouping multiple incoming requests together and processing them as a unit. This allows for more efficient use of computational resources, especially when running on hardware that benefits from parallelized operations (such as GPUs).

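When multiple requests for the same model arrive simultaneously, Ollama can process them together in a batch if memory allows. This increases throughput and shortens the total time requests spend waiting compared with handling them one after another, because the model can apply optimized matrix operations across the whole batch. An individual request may still take somewhat longer than it would on an idle server.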

Batching is particularly effective when requests are similar in size and complexity, as this allows for better hardware utilization.

Parallel execution in Ollama means handling multiple requests at the same time, either for the same model or for different models, depending on available memory and configuration.

Ollama supports two levels of parallelism:

Multiple Model Loading: If enough memory is available, several models can be loaded and serve requests simultaneously.
Parallel Requests per Model: Each loaded model can process several requests in parallel, controlled by the OLLAMA_NUM_PARALLEL setting (default is 1 or 4, depending on memory).

When requests exceed the parallelism limit, they are queued (FIFO) up to OLLAMA_MAX_QUEUE.

Takeaway

Ollama leverages both batching and parallel execution to process multiple requests efficiently. Batching groups requests for simultaneous processing, while parallel execution allows multiple requests (or models) to run concurrently. Both methods depend on system memory and are configurable for optimal performance.

For more benchmarks, concurrency tuning, and performance guidance, check our LLM Performance: Benchmarks, Bottlenecks & Optimization hub.