16 GB VRAM LLM benchmarks with llama.cpp (speed and context)

Here I am comparing the speed of several LLMs running on a GPU with 16 GB of VRAM, and choosing the best one for self-hosting.

I ran these LLMs on llama.cpp with 19K, 32K, and 64K token context windows.

In this post I am recording my attempts to squeeze as much speed as possible out of this setup.

LLM speed comparison table (tokens per second and VRAM)

| Model | Size, GB | 19K VRAM, GB | 19K GPU/CPU, % | 19K t/s | 32K VRAM, GB | 32K GPU/CPU, % | 32K t/s | 64K VRAM, GB | 64K GPU/CPU, % | 64K t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-UD-IQ3_S | 13.6 | 14.3 | 93/100 | 136.4 | 14.6 | 93/100 | 138.5 | 14.9 | 88/115 | 136.8 |
| Qwen3.5-27B-UD-IQ3_XXS | 11.5 | 12.9 | 98/100 | 45.3 | 13.7 | 98/100 | 45.1 | 14.7 | 45/410 | 22.7 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 44.7 | 14.7 | 30/470 | 22.3 | 14.7 | 30/480 | 21.8 | 14.7 | 28/490 | 21.5 |
| nvidia Nemotron-Cascade-2-30B IQ4_XS | 18.2 | 14.6 | 60/305 | 115.8 | 14.7 | 57/311 | 113.6 | 14.7 | 55/324 | 103.4 |
| gemma-4-26B-A4B-it-UD-IQ4_XS | 13.4 | 14.7 | 95/100 | 121.7 | 14.9 | 95/115 | 114.9 | 14.9 | 75/190 | 96.1 |
| gemma-4-31B-it-UD-IQ3_XXS | 11.8 | 14.8 | 68/287 | 29.2 | 14.8 | 41/480 | 18.4 | 14.8 | 18/634 | 8.1 |
| GLM-4.7-Flash-IQ4_XS | 16.3 | 15.0 | 66/240 | 91.8 | 14.9 | 62/262 | 86.1 | 14.9 | 53/313 | 72.5 |
| GLM-4.7-Flash-REAP-23B IQ4_XS | 12.6 | 13.7 | 92/100 | 122.0 | 14.4 | 95/102 | 123.2 | 14.9 | 71/196 | 97.1 |

19K, 32K, and 64K are the context sizes.

The GPU/CPU columns show GPU and CPU load. If you see a low GPU number in these columns, it means the model is running mostly on the CPU and cannot reach any decent speed on this hardware. That pattern matches what people see when too little of the model fits on the GPU or when the context pushes work back to the host.

About llama.cpp, LLM performance, OpenCode and other comparisons

If you want install paths, llama-cli and llama-server examples, and the flags that matter for VRAM and tokens per second (context size, batching, -ngl), start with llama.cpp Quickstart with CLI and Server.
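
As a minimal sketch of what that looks like in practice (the model path is a placeholder, and the values are not the exact ones behind the tables above):

```bash
# Sketch of a llama-server launch on a 16 GB card.
# -c sets the context window in tokens, -ngl how many layers go to the GPU
# (a large value like 99 simply means "offload all layers that exist"),
# -b the logical batch size for prompt processing.
llama-server -m ./models/some-model-IQ4_XS.gguf -c 32768 -ngl 99 -b 2048 --port 8080
```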

For the broader performance picture (throughput versus latency, VRAM limits, parallel requests, and how benchmarks fit together across hardware and runtimes), see LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization.

The quality of the responses is analysed in other articles, for instance:

I ran a similar test for LLMs on Ollama: Best LLMs for Ollama on 16GB VRAM GPU.

Why context length changes tokens per second

As you move from 19K to 32K or 64K tokens, the KV cache grows and VRAM pressure rises. Some rows show a big drop in tokens per second at 64K while others stay flat, which is the signal to revisit quants, context limits, or layer offload rather than assuming the model is “slow” in general.
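
To see why, here is a back-of-the-envelope estimate of the KV-cache size for an illustrative model (the layer and head dimensions are made up for the example, not taken from any model in the table, and an unquantized fp16 cache is assumed):

```bash
# Rough fp16 KV-cache size: 2 (K and V) * n_layers * n_kv_heads * head_dim
# * 2 bytes per element, multiplied by the number of cached tokens.
n_layers=32; n_kv_heads=8; head_dim=128; bytes_per_elem=2
for ctx in 19456 32768 65536; do
  kv_bytes=$(( 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx ))
  echo "ctx=${ctx} tokens -> KV cache ~$(( kv_bytes / 1024 / 1024 )) MiB"
done
```

With these example dimensions the cache alone grows from roughly 2.4 GB at 19K to about 8 GB at 64K, and every gigabyte it takes is a gigabyte no longer available for model weights on the GPU.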

The models and quants I have chosen to test are the ones I would actually run myself, to see whether they give a good cost/benefit gain on this hardware. So no q8 quants with 200K context here :)

GPU/CPU is the load in percent, measured with nvitop.

When llama.cpp automatically decides how many layers to offload to the GPU, it tries to keep about 1 GB of VRAM free. The number of offloaded layers can be set manually with the -ngl command-line parameter, but I am not fine-tuning it here. The point to remember is that if there is a significant performance drop when increasing the context window from 32K to 64K, we can try to recover speed at 64K by tuning the number of offloaded layers.
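
If you do want to fine-tune it, the change is just an explicit -ngl on the command line; a sketch (the model path and the layer count are placeholders, not values I have validated):

```bash
# At 64K context, set -ngl explicitly and adjust it up or down to find the
# fastest GPU/CPU split for your model instead of relying on the default.
llama-cli -m ./models/some-model.gguf -c 65536 -ngl 40 -p "Hello" -n 64
```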

Test hardware and llama.cpp setup

I tested the LLM speed on a PC with this config:

  • CPU: Intel i7-14700
  • RAM: 64 GB DDR5-6000 (2x32 GB)
  • GPU: RTX 4080 (16 GB VRAM)
  • Ubuntu with NVIDIA drivers
  • llama.cpp / llama-cli, with no explicit -ngl (offloaded layer count) specified
  • Initial VRAM usage before starting llama-cli: 300 MB (checked as shown below)
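
One way to check that baseline before a run (nvitop shows the same numbers interactively):

```bash
# Report VRAM in use before launching llama-cli / llama-server.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```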

Extra runs at 128K context (Qwen3.5 27B and 122B)

| Model | 128K GPU/CPU, % | 128K t/s |
|---|---|---|
| Qwen3.5-27B-UD-IQ3_XXS | 16/625 | 9.6 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 27/496 | 19.2 |

Takeaways for 16 GB VRAM builds

  • My current favorite, Qwen3.5-27B-UD-IQ3_XXS, looks good at its sweet spot of 50K context (I am getting approximately 36 t/s).
  • Qwen3.5-122B-A10B-UD-IQ3_XXS overtakes the Qwen3.5 27B performance-wise at contexts above 64K.
  • I can push Qwen3.5-35B-A3B-UD-IQ3_S to a 100K-token context, and it still fits into VRAM, so there is no performance drop.
  • I will not use gemma-4-31B on 16 GB VRAM, but gemma-4-26B might be good enough; I still need to test it.
  • I also need to test how well Nemotron Cascade 2 and GLM-4.7 Flash REAP 23B hold up: will they beat the Qwen3.5-35B q3? I doubt it, but I might test to confirm the suspicion.