Test: How Ollama Uses Intel CPU Performance and Efficient Cores

Ollama on Intel CPU: Efficient vs Performance cores


I’ve got a theory to test: would utilising ALL cores of an Intel CPU raise the speed of LLMs? It’s been bugging me that the new Gemma 3 27B model (gemma3:27b, 17 GB on Ollama) doesn’t fit into the 16 GB of VRAM on my GPU and partially runs on the CPU.

To be precise,

ollama ps

shows

gemma3:27b    a418f5838eaf    22 GB    29%/71% CPU/GPU

It doesn’t look terrible at first glance, but that figure is only the layer split. The actual load is GPU: 28%, CPU: 560%. Yes, several CPU cores are busy.
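A quick way to watch this yourself (a sketch assuming an NVIDIA GPU and standard Linux tools):

# one snapshot of CPU usage for anything ollama-related
top -bn1 | grep -i ollama

# GPU utilisation and power draw, refreshed every second
watch -n 1 nvidia-smi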

The portrait of Llama and flying CPUs

And here is the idea:

What if we push Ollama to use ALL Intel CPU cores, both the Performance and the Efficient kind?

OLLAMA_NUM_THREADS config param

Ollama has an environment-variable config parameter, OLLAMA_NUM_THREADS, which is supposed to tell Ollama how many threads (and, accordingly, cores) it should utilise.

I tried restricting it to 3 cores first:

sudo xed /etc/systemd/system/ollama.service

# in the [Service] section, add:
# Environment="OLLAMA_NUM_THREADS=3"

sudo systemctl daemon-reload
sudo systemctl restart ollama

and it didn’t work.

Ollama was still using ~560% of CPU when running the Gemma 3 27B model.

Bad luck.
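Before moving on, it’s worth checking whether the variable even reached the process. A hedged sketch, assuming both the systemd unit and the binary are named ollama:

# what Environment= settings systemd applied to the unit
systemctl show ollama --property=Environment

# whether the running server actually has the variable
sudo cat /proc/"$(pgrep -x ollama | head -n1)"/environ | tr '\0' '\n' | grep OLLAMA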

num_thread call option

Let’s try passing it per request via the API instead:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 8
  }
}' | jq .

The result:

  • CPU usage: 585%
  • GPU usage: 25%
  • GPU power: 67w
  • Performance eval: 6.5 tokens/sec
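The tokens/sec figure comes from the eval_count and eval_duration fields of the /api/generate response (durations are reported in nanoseconds), so jq can compute it directly, for example:

curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": { "num_thread": 8 }
}' | jq '{tokens_per_sec: (.eval_count / .eval_duration * 1e9)}'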

Now let’s double the thread count, telling Ollama to use a mix of Performance and Efficient cores:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 16
  }
}' | jq .

The result:

  • CPU usage: 1030%
  • GPU usage: 26%
  • GPU power: 70w
  • Performance eval: 7.4 t/s

Good! Performance increased by ~14%!
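Before pushing further, it helps to check how many Performance and Efficient cores (and threads) the CPU actually exposes; lscpu is enough for that (on hybrid Intel parts the P-cores usually show two logical CPUs per core and a higher MAXMHZ than the E-cores):

# logical CPU -> physical core mapping with max clocks
lscpu --all --extended=CPU,CORE,MAXMHZ

# summary counts
lscpu | grep -E '^CPU\(s\)|Core\(s\) per socket|Thread\(s\) per core'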

Now let’s go extreme! All physical cores, go!

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 20
  }
}' | jq .

The result:

  • CPU usage: 1250%
  • GPU usage: 10-26% (unstable)
  • GPU power: 67w
  • Performance eval: 6.9 t/s

OK, now we see a performance drop. Let’s try 8 Performance + 4 Efficient cores:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Why is the blue sky blue?",
  "stream": false,
  "options": {
    "num_thread": 12
  }
}' | jq .

The result:

  • CPU usage: 801%
  • GPU usage: 27% (unstable)
  • GPU power: 70w
  • Performance eval: 7.1 t/s

Somewhere in between.
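One caveat: num_thread only sets how many threads the runner spawns; it does not control which cores the Linux scheduler puts them on, so “8 Performance + 4 Efficient” is really just “12 threads somewhere”. If you want to force threads onto specific cores, CPU affinity is the tool. A hedged sketch: the 0-15 range assumes the P-core hyper-threads are enumerated first (verify with lscpu), and the runner process name differs between Ollama versions:

# pin the model runner to logical CPUs 0-15 while it is running
sudo taskset -cp 0-15 "$(pgrep -f 'ollama runner' | head -n1)"

# or persistently for the whole service, in the systemd unit's [Service] section:
# CPUAffinity=0-15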

For comparison, let’s run Gemma 3 12B (the QAT build, gemma3:12b-it-qat). It is less smart than Gemma 3 27B, but it fits into GPU VRAM nicely.

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b-it-qat",
  "prompt": "Why is the blue sky blue?",
  "stream": false
}' | jq .

The result:

  • CPU usage: 106%
  • GPU usage: 94% (unstable)
  • GPU power: 225w
  • Performance eval: 61.1 t/s

Now that is what we call performance. Gemma 3 27B is smarter than 12B, but not 10 times smarter!
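You can confirm that nothing is offloaded this time with ollama ps; when the model fits entirely into VRAM, the processor column should read something like 100% GPU (the exact column layout may differ between Ollama versions):

ollama ps

# expected, roughly:
# gemma3:12b-it-qat    <id>    <size>    100% GPU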

Conclusion

If an LLM doesn’t fit into GPU VRAM and Ollama offloads some layers onto the CPU:

  • We can increase LLM performance by roughly 10-14% by passing the num_thread parameter (a Modelfile sketch to make this persistent follows below).
  • The performance drop caused by offloading is much larger and is not compensated by this increase.
  • It’s better to have a more powerful GPU with more VRAM. For this workload an RTX 3090 (24 GB) beats an RTX 5080 (16 GB), though I don’t have either of these.
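If the num_thread tweak helps on your setup, it can be baked into a derived model via a Modelfile, so you don’t have to pass options on every request. A sketch, assuming your Ollama version accepts PARAMETER num_thread (it maps to the same runtime option used in the curl calls above):

# create a variant of gemma3:27b that always runs with 16 threads
cat > Modelfile.gemma3-27b-16t <<'EOF'
FROM gemma3:27b
PARAMETER num_thread 16
EOF

ollama create gemma3-27b-16t -f Modelfile.gemma3-27b-16t
ollama run gemma3-27b-16t "Why is the blue sky blue?"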