Test: How Ollama is using Intel CPU Performance and Efficient Cores


I’ve got a theory to test: would utilising ALL cores of an Intel CPU raise the speed of LLMs? It’s been bugging me that the new Gemma 3 27B model (gemma3:27b, 17 GB on Ollama) doesn’t fit into the 16 GB of VRAM on my GPU and partially runs on the CPU.

For more on throughput, latency, VRAM, and benchmarks across runtimes and hardware, see LLM Performance: Benchmarks, Bottlenecks & Optimization.

To be precise,

ollama ps

is showing

gemma3:27b    a418f5838eaf    22 GB    29%/71% CPU/GPU

That split doesn’t look terrible, but it only describes how the layers are divided. The actual load is GPU: 28%, CPU: 560%. Yes, several CPU cores are in use.

[Image: a portrait of a llama and flying CPUs]

And here is the idea:

What if we push Ollama to use ALL Intel CPU cores, both the Performance and the Efficient kind?

OLLAMA_NUM_THREADS config param

Ollama has an environment-variable config parameter, OLLAMA_NUM_THREADS, which is supposed to tell Ollama how many threads (and, accordingly, cores) it should utilise.

I tried to restrict it to 3 cores first:

sudo xed /etc/systemd/system/ollama.service

# in the [Service] section, add the line
# Environment="OLLAMA_NUM_THREADS=3"

sudo systemctl daemon-reload
sudo systemctl restart ollama

and it didn’t work.

Ollama was still using ~560% of the CPU when running the Gemma 3 27B model.

Bad luck.

num_thread Call option

Let’s try passing the num_thread option in the API call instead:

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 8
}
}'  | jq .

The result:

  • CPU usage: 585%
  • GPU usage: 25%
  • GPU power: 67w
  • Performance eval: 6.5 tokens/sec
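Side note: the tokens/sec figure can be derived straight from the API response. Ollama returns eval_count (tokens generated) and eval_duration (nanoseconds) in the /api/generate response, so jq can compute it. A minimal sketch with a made-up sample response:

```shell
# eval_count / eval_duration are real Ollama response fields;
# the numbers below are made up for illustration.
resp='{"eval_count":130,"eval_duration":20000000000}'

# tokens/sec = eval_count / eval_duration (ns) * 1e9
echo "$resp" | jq '.eval_count / .eval_duration * 1e9'
```

On a live call, pipe the curl output through the same filter: `curl -s ... | jq '.eval_count / .eval_duration * 1e9'`.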

Now let’s double the thread count, telling Ollama to use a mix of Performance and Efficient cores:

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 16
}
}'  | jq .

The result:

  • CPU usage: 1030%
  • GPU usage: 26%
  • GPU power: 70w
  • Performance eval: 7.4 t/s

Good! Performance increased by ~14%!
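A quick sanity check of that percentage, from 6.5 to 7.4 tokens/sec:

```shell
# Relative speedup of 7.4 t/s over the 6.5 t/s baseline, in percent
awk 'BEGIN { printf "%.1f%%\n", (7.4 / 6.5 - 1) * 100 }'
```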

Now let’s go extreme! All physical cores, go!

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 20
}
}'  | jq .

The result:

  • CPU usage: 1250%
  • GPU usage: 10-26% (unstable)
  • GPU power: 67w
  • Performance eval: 6.9 t/s

OK, now we see a performance drop. Let’s try 8 Performance + 4 Efficient cores:

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:27b",  
"prompt": "Why is the blue sky blue?",  
"stream": false,
"options":{
  "num_thread": 12
}
}'  | jq .

The result:

  • CPU usage: 801%
  • GPU usage: 27% (unstable)
  • GPU power: 70w
  • Performance eval: 7.1 t/s

Somewhere in between.
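The manual runs above could be scripted. Here is a sketch of a small sweep over num_thread values, assuming a local Ollama at the default port (it skips when the server isn’t reachable):

```shell
# Sweep several num_thread values against a local Ollama and report tokens/sec.
# Assumes the default endpoint http://localhost:11434 and the gemma3:27b model.
URL=http://localhost:11434

if ! curl -sf "$URL/api/version" > /dev/null 2>&1; then
  echo "ollama not reachable, skipping benchmark"
else
  for n in 8 12 16 20; do
    tps=$(curl -s "$URL/api/generate" -d "{
      \"model\": \"gemma3:27b\",
      \"prompt\": \"Why is the blue sky blue?\",
      \"stream\": false,
      \"options\": { \"num_thread\": $n }
    }" | jq '.eval_count / .eval_duration * 1e9')
    echo "num_thread=$n -> $tps t/s"
  done
fi
```

Each iteration runs the same prompt, so the per-setting numbers are roughly comparable; for anything serious you’d want several runs per setting.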

For comparison, let’s run Gemma 3 12B: it is less smart than Gemma 3 27B, but fits into GPU VRAM nicely.

curl http://localhost:11434/api/generate -d '
{  
"model": "gemma3:12b-it-qat",  
"prompt": "Why is the blue sky blue?",  
"stream": false
}'  | jq .

The result:

  • CPU usage: 106%
  • GPU usage: 94% (unstable)
  • GPU power: 225w
  • Performance eval: 61.1 t/s

Now that is what I call performance. Gemma 3 27B is smarter than 12B, but not 10 times smarter!
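For the record, the speed gap between the fully-on-GPU 12B run and the best partially-offloaded 27B run:

```shell
# 61.1 t/s (12B fully on GPU) vs 7.4 t/s (27B, num_thread=16)
awk 'BEGIN { printf "%.1fx\n", 61.1 / 7.4 }'
```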

Conclusion

If an LLM doesn’t fit into GPU VRAM and Ollama offloads some layers onto the CPU:

  • We can increase LLM performance by ~10-14% by providing the num_thread parameter.
  • The performance drop caused by offloading is much larger and is not compensated by this increase.
  • It’s better to have a more powerful GPU with more VRAM. An RTX 3090 (24 GB) beats an RTX 5080 (16 GB) here, though I don’t have either.

For more benchmarks, CPU/GPU tuning, and performance guidance, check our LLM Performance: Benchmarks, Bottlenecks & Optimization hub.