Test: How Ollama is using Intel CPU Performance and Efficient Cores


I’ve got a theory to test: would utilising ALL cores of an Intel CPU raise the speed of LLMs? It’s been bugging me that the new Gemma 3 27B model (gemma3:27b, 17 GB on Ollama) doesn’t fit into the 16 GB of VRAM on my GPU and partially runs on the CPU.

For more on throughput, latency, VRAM, and benchmarks across runtimes and hardware, see LLM Performance: Benchmarks, Bottlenecks & Optimization.

To be precise,

ollama ps

is showing

gemma3:27b    a418f5838eaf    22 GB    29%/71% CPU/GPU

That split doesn’t look terrible, but it only describes how the layers are divided. The actual load is GPU: 28%, CPU: 560%. Yes, several CPU cores are in use.

[Image: a portrait of a llama and flying CPUs]

And here is the idea:

What if we push Ollama to use ALL Intel CPU cores, both the Performance and the Efficient kind?

OLLAMA_NUM_THREADS config param

Ollama has an environment-variable config parameter, OLLAMA_NUM_THREADS, which is supposed to tell Ollama how many threads (and, accordingly, cores) it should utilise.

I tried to restrict it to 3 cores first:

sudo xed /etc/systemd/system/ollama.service

# in the [Service] section, add the line
# Environment="OLLAMA_NUM_THREADS=3"

sudo systemctl daemon-reload
sudo systemctl restart ollama

and it didn’t work.

Ollama was still using ~560% of the CPU when running the Gemma 3 27B model.

Bad luck.

num_thread Call option

Let’s try passing the num_thread option in the API call instead:

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 8
}
}'  | jq .

The result:

  • CPU usage: 585%
  • GPU usage: 25%
  • GPU power: 67w
  • Performance eval: 6.5 tokens/sec
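Side note: the tokens/sec figure can be derived straight from the API response. Ollama returns eval_count (tokens generated) and eval_duration (nanoseconds) in the /api/generate response, so jq can compute it. A minimal sketch with a made-up sample response:

```shell
# eval_count / eval_duration are real Ollama response fields;
# the numbers below are made up for illustration.
resp='{"eval_count":130,"eval_duration":20000000000}'

# tokens/sec = eval_count / eval_duration (ns) * 1e9
echo "$resp" | jq '.eval_count / .eval_duration * 1e9'
```

On a live call, pipe the curl output through the same filter: `curl -s ... | jq '.eval_count / .eval_duration * 1e9'`.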

Now let’s double the thread count, telling Ollama to use a mix of Performance and Efficient cores:

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 16
}
}'  | jq .

The result:

  • CPU usage: 1030%
  • GPU usage: 26%
  • GPU power: 70w
  • Performance eval: 7.4 t/s

Good! Performance increased by ~14%!
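A quick sanity check of that percentage, from 6.5 to 7.4 tokens/sec:

```shell
# Relative speedup of 7.4 t/s over the 6.5 t/s baseline, in percent
awk 'BEGIN { printf "%.1f%%\n", (7.4 / 6.5 - 1) * 100 }'
```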

Now let’s go extreme! All physical cores, go!

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 20
}
}'  | jq .

The result:

  • CPU usage: 1250%
  • GPU usage: 10-26% (unstable)
  • GPU power: 67w
  • Performance eval: 6.9 t/s

OK, now we see a performance drop. Let’s try 8 Performance + 4 Efficient cores:

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 12
}
}'  | jq .

The result:

  • CPU usage: 801%
  • GPU usage: 27% (unstable)
  • GPU power: 70w
  • Performance eval: 7.1 t/s

Somewhere in between.
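The manual runs above could be scripted. Here is a sketch of a small sweep over num_thread values, assuming a local Ollama at the default port (it skips when the server isn’t reachable):

```shell
# Sweep several num_thread values against a local Ollama and report tokens/sec.
# Assumes the default endpoint http://localhost:11434 and the gemma3:27b model.
URL=http://localhost:11434

if ! curl -sf "$URL/api/version" > /dev/null 2>&1; then
  echo "ollama not reachable, skipping benchmark"
else
  for n in 8 12 16 20; do
    tps=$(curl -s "$URL/api/generate" -d "{
      \"model\": \"gemma3:27b\",
      \"prompt\": \"Why is the blue sky blue?\",
      \"stream\": false,
      \"options\": { \"num_thread\": $n }
    }" | jq '.eval_count / .eval_duration * 1e9')
    echo "num_thread=$n -> $tps t/s"
  done
fi
```

Each iteration runs the same prompt, so the per-setting numbers are roughly comparable; for anything serious you’d want several runs per setting.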

For comparison, let’s run Gemma 3 12B: it is less smart than Gemma 3 27B, but fits into GPU VRAM nicely.

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:12b-it-qat",  
"prompt": "Why is the blue sky blue?",  
"stream": false
}'  | jq .

The result:

  • CPU usage: 106%
  • GPU usage: 94% (unstable)
  • GPU power: 225w
  • Performance eval: 61.1 t/s

Now that is what I call performance. Gemma 3 27B is smarter than 12B, but not 10 times smarter!
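For the record, the speed gap between the fully-on-GPU 12B run and the best partially-offloaded 27B run:

```shell
# 61.1 t/s (12B fully on GPU) vs 7.4 t/s (27B, num_thread=16)
awk 'BEGIN { printf "%.1fx\n", 61.1 / 7.4 }'
```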

Conclusion

If an LLM doesn’t fit into GPU VRAM and Ollama offloads some layers onto the CPU:

  • We can increase LLM performance by ~10-14% by providing the num_thread parameter.
  • The performance drop caused by offloading is much larger and is not compensated by this increase.
  • It’s better to have a more powerful GPU with more VRAM. An RTX 3090 (24 GB) beats an RTX 5080 (16 GB) here, though I don’t have either.

For more benchmarks, CPU/GPU tuning, and performance guidance, check our LLM Performance: Benchmarks, Bottlenecks & Optimization hub.