Test: How Ollama Uses Intel CPU Performance and Efficient Cores
Ollama on Intel CPUs: Efficient vs Performance cores
I’ve got a theory to test: would utilising ALL cores of an Intel CPU raise the speed of LLMs? It’s been bugging me that the new Gemma 3 27B model (gemma3:27b, 17 GB on ollama) doesn’t fit into the 16 GB of VRAM on my GPU and partially runs on the CPU.
For more on throughput, latency, VRAM, and benchmarks across runtimes and hardware, see LLM Performance: Benchmarks, Bottlenecks & Optimization.
To be precise,
ollama ps
is showing
gemma3:27b a418f5838eaf 22 GB 29%/71% CPU/GPU
It doesn’t look terrible, but this is only the layer split. The actual load is GPU: 28%, CPU: 560%. Yes, several cores are used.

And here is the idea: what if we push Ollama to use ALL Intel CPU cores, both the Performance and the Efficient kind?
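Before testing, it helps to know how many cores of each kind we actually have. On Linux, hybrid Intel CPUs expose the core types through sysfs on recent kernels (5.16+); a sketch, with the cpulist parsing done by a small helper of my own:

```shell
# Count Performance and Efficient cores on a hybrid Intel CPU (Linux).
# The sysfs paths below exist on hybrid CPUs with recent kernels; on other
# systems, fall back to inspecting the topology with lscpu.

# Expand a kernel cpulist such as "0-15" or "0-15,20-23" into a CPU count.
count_cpus() {
  echo "$1" | tr ',' '\n' | awk -F- '{ if (NF == 2) n += $2 - $1 + 1; else n += 1 } END { print n }'
}

if [ -f /sys/devices/cpu_core/cpus ]; then
  echo "P-core CPUs (incl. hyperthreads): $(count_cpus "$(cat /sys/devices/cpu_core/cpus)")"
  echo "E-core CPUs: $(count_cpus "$(cat /sys/devices/cpu_atom/cpus)")"
else
  echo "No hybrid-core sysfs entries found; check lscpu output instead."
fi
```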
OLLAMA_NUM_THREADS config param
Ollama has an environment variable, OLLAMA_NUM_THREADS, which is supposed to tell Ollama how many threads (and, accordingly, how many cores) it should utilise.
I tried to restrict it to 3 cores first:
sudo xed /etc/systemd/system/ollama.service
# put OLLAMA_NUM_THREADS=3 as
# Environment="OLLAMA_NUM_THREADS=3"
sudo systemctl daemon-reload
sudo systemctl restart ollama
and it didn’t work: Ollama was still using ~560% of CPU when running the Gemma 3 27B model. Bad luck. (As far as I can tell, OLLAMA_NUM_THREADS is not among Ollama’s documented server settings, which would explain why it is ignored.)
num_thread Call option
Let’s try passing num_thread as a request option instead:
curl http://localhost:11434/api/generate -d '
{
"model": "gemma3:27b",
"prompt": "Why is the blue sky blue?",
"stream": false,
"options":{
"num_thread": 8
}
}' | jq .
The result:
- CPU usage: 585%
- GPU usage: 25%
- GPU power: 67w
- Performance eval: 6.5 tokens/sec
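The eval rate here isn’t guesswork: the /api/generate response itself carries eval_count (number of generated tokens) and eval_duration (in nanoseconds), so tokens/sec can be computed with jq. A sketch on a sample payload:

```shell
# tokens/sec = eval_count / eval_duration(ns) * 1e9
# sample fields from a generate response; in practice, pipe the curl output here
response='{"eval_count": 130, "eval_duration": 20000000000}'
echo "$response" | jq '.eval_count / .eval_duration * 1e9'   # → 6.5
```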
Now let’s double the thread count, telling Ollama to use a mix of Performance and Efficient cores:
curl http://localhost:11434/api/generate -d '
{
"model": "gemma3:27b",
"prompt": "Why is the blue sky blue?",
"stream": false,
"options":{
"num_thread": 16
}
}' | jq .
The result:
- CPU usage: 1030%
- GPU usage: 26%
- GPU power: 70w
- Performance eval: 7.4 t/s
Good! Performance increased by ~14%!
Now let’s go extreme! All physical cores, go!
curl http://localhost:11434/api/generate -d '
{
"model": "gemma3:27b",
"prompt": "Why is the blue sky blue?",
"stream": false,
"options":{
"num_thread": 20
}
}' | jq .
The result:
- CPU usage: 1250%
- GPU usage: 10-26% (unstable)
- GPU power: 67w
- Performance eval: 6.9 t/s
OK, now we see some performance drop. Let’s try 8 Performance + 4 Efficient cores:
curl http://localhost:11434/api/generate -d '
{
"model": "gemma3:27b",
"prompt": "Why is the blue sky blue?",
"stream": false,
"options":{
"num_thread": 12
}
}' | jq .
The result:
- CPU usage: 801%
- GPU usage: 27% (unstable)
- GPU power: 70w
- Performance eval: 7.1 t/s
Somewhere in between.
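Instead of editing the curl call by hand for each run, the payloads can be generated and the whole sweep scripted. A sketch (the payload helper is my own; the server is assumed at the default localhost:11434, so the actual call is left commented out):

```shell
# Build a /api/generate payload for a given num_thread value.
payload() {
  jq -n --argjson n "$1" '{
    model: "gemma3:27b",
    prompt: "Why is the blue sky blue?",
    stream: false,
    options: { num_thread: $n }
  }'
}

# Sweep the thread counts tried above; uncomment the curl line to run for real.
for n in 8 12 16 20; do
  echo "--- num_thread=$n ---"
  payload "$n"
  # payload "$n" | curl -s http://localhost:11434/api/generate -d @- \
  #   | jq '.eval_count / .eval_duration * 1e9'
done
```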
For comparison, let’s run Gemma 3 12B; it is less smart than Gemma 3 27B, but it fits into the GPU VRAM nicely.
curl http://localhost:11434/api/generate -d '
{
"model": "gemma3:12b-it-qat",
"prompt": "Why is the blue sky blue?",
"stream": false
}' | jq .
The result:
- CPU usage: 106%
- GPU usage: 94% (unstable)
- GPU power: 225w
- Performance eval: 61.1 t/s
That is what we call performance! Gemma 3 27B is smarter than 12B, but not 10 times smarter!
Conclusion
If an LLM doesn’t fit into GPU VRAM and Ollama offloads some layers onto the CPU:
- We can increase LLM performance by 10-14% by providing the num_thread parameter.
- The performance drop caused by offloading is much higher and is not compensated by this increase.
- It is better to have a GPU with more VRAM: an RTX 3090 is better here than an RTX 5080, though I don’t have either of these…
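If the num_thread tweak helps on a given machine, it doesn’t have to be passed on every request: it can be baked into a model variant via a Modelfile. A sketch (the gemma3-27b-t16 name is arbitrary, and whether your Ollama version honours num_thread from a Modelfile PARAMETER line may vary):

```
FROM gemma3:27b
PARAMETER num_thread 16
```

Then ollama create gemma3-27b-t16 -f Modelfile builds the variant, and ollama run gemma3-27b-t16 uses it with the fixed thread count.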