Test: How Ollama Uses Intel CPU Performance and Efficient Cores
Ollama on Intel CPU: Efficient vs Performance cores
I've got a theory to test: would utilising ALL cores of an Intel CPU raise the speed of LLMs? It bugs me that the new Gemma 3 27B model (gemma3:27b, 17GB on ollama) doesn't fit into the 16GB VRAM of my GPU and partially runs on the CPU.
To be precise,
ollama ps
is showing
gemma3:27b a418f5838eaf 22 GB 29%/71% CPU/GPU
It doesn't look terrible, but that is only the layer split. The actual load is GPU: 28%, CPU: 560%. Yes, several CPU cores are busy.
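These numbers are easy to watch with standard tools, for example:
# CPU load of the Ollama processes (560% means ~5.6 cores busy)
top -d 1 -p $(pgrep -d, -f ollama)
# GPU utilisation and power draw, refreshed every second
nvidia-smi dmon -s pu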
And here is the idea:
What if we push Ollama to use ALL Intel CPU cores - both the Performance and the Efficient kind?
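To see how many cores of each kind the CPU has, lscpu is enough; on hybrid Intel chips the P-cores report a higher maximum frequency than the E-cores:
# list logical CPUs with their max frequency;
# P-cores show a higher MAXMHZ than E-cores
lscpu --all --extended
# total number of logical CPUs
nproc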
OLLAMA_NUM_THREADS config param
Ollama has an environment variable config parameter, OLLAMA_NUM_THREADS, which is supposed to tell Ollama how many threads (and, accordingly, cores) it should utilise.
I tried to restrict it to 3 cores first:
sudo xed /etc/systemd/system/ollama.service
# add under the [Service] section:
# Environment="OLLAMA_NUM_THREADS=3"
sudo systemctl daemon-reload
sudo systemctl restart ollama
and it didn't work: Ollama was still using ~560% of CPU when running the Gemma 3 27B LLM.
Bad luck.
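To rule out a typo, it is worth checking that the variable actually reached the service:
# should print Environment=OLLAMA_NUM_THREADS=3 if the override was picked up
systemctl show ollama --property=Environment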
The num_thread call option
Let's try passing the thread count as a request option instead:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 8
  }
}' | jq .
The result:
- CPU usage: 585%
- GPU usage: 25%
- GPU power: 67w
- Performance eval: 6.5 tokens/sec
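The tokens/sec number is not guesswork: the /api/generate response contains eval_count and eval_duration (in nanoseconds) fields, so the eval rate can be computed right in the pipe:
# eval rate in tokens per second
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": { "num_thread": 8 }
}' | jq '.eval_count / .eval_duration * 1e9'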
Now let's double the thread count, telling Ollama to use a mix of Performance and Efficient cores:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 16
  }
}' | jq .
The result:
- CPU usage: 1030%
- GPU usage: 26%
- GPU power: 70w
- Performance eval: 7.4 t/s
Good! Performance increased by ~14%!
Now let's go extreme! All physical cores go!
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 20
  }
}' | jq .
The result:
- CPU usage: 1250%
- GPU usage: 10-26% (unstable)
- GPU power: 67w
- Performance eval: 6.9 t/s
OK, now we see some performance drop. Let's try 12 threads, i.e. 8 Performance + 4 Efficient cores:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 12
  }
}' | jq .
The result:
- CPU usage: 801%
- GPU usage: 27% (unstable)
- GPU power: 70w
- Performance eval: 7.1 t/s
Somewhere in between.
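A caveat: num_thread only sets how many threads the model runner spawns; the OS scheduler still decides whether they land on P-cores or E-cores. To really pin Ollama to the P-cores, CPU affinity is worth experimenting with. A sketch, assuming the P-core hyperthreads are logical CPUs 0-15 (check the real IDs with lscpu --all --extended):
# apply affinity to all threads (-a) of every running ollama process
for pid in $(pgrep -f ollama); do
  sudo taskset -acp 0-15 "$pid"
done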
For comparison, here is Gemma 3 12B: it is less smart than Gemma 3 27B, but fits into the GPU VRAM nicely.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b-it-qat",
  "prompt": "Why is the blue sky blue?",
  "stream": false
}' | jq .
The result:
- CPU usage: 106%
- GPU usage: 94% (unstable)
- GPU power: 225w
- Performance eval: 61.1 t/s
That is what we call performance. Gemma 3 27B is smarter than 12B, but not 10 times smarter!
Conclusion
If an LLM doesn't fit into GPU VRAM and some layers are offloaded by Ollama onto the CPU:
- We can increase the LLM performance by ~10-14% by providing the num_thread parameter (it can also be baked into the model, see the sketch after this list).
- The performance drop caused by offloading is much higher and is not compensated by this increase: 6.5-7.4 t/s partially offloaded vs 61.1 t/s fully on the GPU.
- Better to have a more powerful GPU with more VRAM. An RTX 3090 (24GB) beats an RTX 5080 (16GB) for this workload, though I don't have either of these…
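If a thread count works well for you, it can be baked into a derived model with a Modelfile, so it doesn't have to be passed with every request. A sketch; gemma3-27b-t16 is just an arbitrary name for the new model:
# create a Modelfile that pins num_thread for this model
cat > Modelfile <<'EOF'
FROM gemma3:27b
PARAMETER num_thread 16
EOF
ollama create gemma3-27b-t16 -f Modelfile
# then use it as usual
ollama run gemma3-27b-t16 "Why is the blue sky blue?"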