16 GB VRAM LLM benchmarks with llama.cpp (speed and context)

Here I am comparing the speed of several LLMs running on a GPU with 16 GB of VRAM, and choosing the best one for self-hosting.

I ran these LLMs on llama.cpp with 19K, 32K, and 64K token context windows.

In this post I am recording my attempts to squeeze as much speed as possible out of this setup.

LLM speed comparison table (tokens per second and VRAM)

| Model | Size, GB | 19K VRAM, GB | 19K GPU/CPU, % | 19K t/s | 32K VRAM, GB | 32K GPU/CPU, % | 32K t/s | 64K VRAM, GB | 64K GPU/CPU, % | 64K t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-UD-IQ3_S | 13.6 | 14.3 | 93/100 | 136.4 | 14.6 | 93/100 | 138.5 | 14.9 | 88/115 | 136.8 |
| Qwen3.5-27B-UD-IQ3_XXS | 11.5 | 12.9 | 98/100 | 45.3 | 13.7 | 98/100 | 45.1 | 14.7 | 45/410 | 22.7 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 44.7 | 14.7 | 30/470 | 22.3 | 14.7 | 30/480 | 21.8 | 14.7 | 28/490 | 21.5 |
| nvidia Nemotron-Cascade-2-30B IQ4_XS | 18.2 | 14.6 | 60/305 | 115.8 | 14.7 | 57/311 | 113.6 | 14.7 | 55/324 | 103.4 |
| gemma-4-26B-A4B-it-UD-IQ4_XS | 13.4 | 14.7 | 95/100 | 121.7 | 14.9 | 95/115 | 114.9 | 14.9 | 75/190 | 96.1 |
| gemma-4-31B-it-UD-IQ3_XXS | 11.8 | 14.8 | 68/287 | 29.2 | 14.8 | 41/480 | 18.4 | 14.8 | 18/634 | 8.1 |
| GLM-4.7-Flash-IQ4_XS | 16.3 | 15.0 | 66/240 | 91.8 | 14.9 | 62/262 | 86.1 | 14.9 | 53/313 | 72.5 |
| GLM-4.7-Flash-REAP-23B IQ4_XS | 12.6 | 13.7 | 92/100 | 122.0 | 14.4 | 95/102 | 123.2 | 14.9 | 71/196 | 97.1 |

19K, 32K, and 64K are the context sizes.

The GPU/CPU columns show GPU and CPU load. If you see a low GPU number in these columns, it means the model is running mostly on the CPU and cannot reach any decent speed on this hardware. That pattern matches what people see when too little of the model fits on the GPU or when the context pushes work back to the host.

About llama.cpp, LLM performance, OpenCode and other comparisons

If you want install paths, llama-cli and llama-server examples, and the flags that matter for VRAM and tokens per second (context size, batching, -ngl), start with llama.cpp Quickstart with CLI and Server.
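
As a minimal sketch of what that looks like in practice (the model path is a placeholder, and the values are not the exact ones behind the tables above):

```bash
# Sketch of a llama-server launch on a 16 GB card.
# -c sets the context window in tokens, -ngl how many layers go to the GPU
# (a large value like 99 simply means "offload all layers that exist"),
# -b the logical batch size for prompt processing.
llama-server -m ./models/some-model-IQ4_XS.gguf -c 32768 -ngl 99 -b 2048 --port 8080
```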

For the broader performance picture (throughput versus latency, VRAM limits, parallel requests, and how benchmarks fit together across hardware and runtimes), see LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization.

The quality of the responses is analysed in other articles, for instance:

I ran a similar test for LLMs on Ollama: Best LLMs for Ollama on 16GB VRAM GPU.

Why context length changes tokens per second

As you move from 19K to 32K or 64K tokens, the KV cache grows and VRAM pressure rises. Some rows show a big drop in tokens per second at 64K while others stay flat, which is the signal to revisit quants, context limits, or layer offload rather than assuming the model is “slow” in general.
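
To see why, here is a back-of-the-envelope estimate of the KV-cache size for an illustrative model (the layer and head dimensions are made up for the example, not taken from any model in the table, and an unquantized fp16 cache is assumed):

```bash
# Rough fp16 KV-cache size: 2 (K and V) * n_layers * n_kv_heads * head_dim
# * 2 bytes per element, multiplied by the number of cached tokens.
n_layers=32; n_kv_heads=8; head_dim=128; bytes_per_elem=2
for ctx in 19456 32768 65536; do
  kv_bytes=$(( 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx ))
  echo "ctx=${ctx} tokens -> KV cache ~$(( kv_bytes / 1024 / 1024 )) MiB"
done
```

With these example dimensions the cache alone grows from roughly 2.4 GB at 19K to about 8 GB at 64K, and every gigabyte it takes is a gigabyte no longer available for model weights on the GPU.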

The models and quants I have chosen to test are the ones I would actually run myself, to see whether they give a good cost/benefit gain on this hardware. So no q8 quants with 200K context here :)

GPU/CPU is the load in percent, measured with nvitop.

When llama.cpp automatically decides how many layers to offload to the GPU, it tries to keep about 1 GB of VRAM free. The number of offloaded layers can be set manually with the -ngl command-line parameter, but I am not fine-tuning it here. The point to remember is that if there is a significant performance drop when increasing the context window from 32K to 64K, we can try to recover speed at 64K by tuning the number of offloaded layers.
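
If you do want to fine-tune it, the change is just an explicit -ngl on the command line; a sketch (the model path and the layer count are placeholders, not values I have validated):

```bash
# At 64K context, set -ngl explicitly and adjust it up or down to find the
# fastest GPU/CPU split for your model instead of relying on the default.
llama-cli -m ./models/some-model.gguf -c 65536 -ngl 40 -p "Hello" -n 64
```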

Test hardware and llama.cpp setup

I tested the LLM speed on a PC with this config:

  • CPU: Intel i7-14700
  • RAM: 64 GB DDR5-6000 (2x32 GB)
  • GPU: RTX 4080 (16 GB VRAM)
  • Ubuntu with NVIDIA drivers
  • llama.cpp / llama-cli, with no explicit -ngl (offloaded layer count) specified
  • Initial VRAM usage before starting llama-cli: 300 MB (checked as shown below)
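
One way to check that baseline before a run (nvitop shows the same numbers interactively):

```bash
# Report VRAM in use before launching llama-cli / llama-server.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```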

Extra runs at 128K context (Qwen3.5 27B and 122B)

| Model | 128K GPU/CPU, % | 128K t/s |
|---|---|---|
| Qwen3.5-27B-UD-IQ3_XXS | 16/625 | 9.6 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 27/496 | 19.2 |

Takeaways for 16 GB VRAM builds

  • My current favorite, Qwen3.5-27B-UD-IQ3_XXS, looks good at its sweet spot of 50K context (I am getting approximately 36 t/s).
  • Qwen3.5-122B-A10B-UD-IQ3_XXS overtakes the Qwen3.5 27B performance-wise at contexts above 64K.
  • I can push Qwen3.5-35B-A3B-UD-IQ3_S to a 100K-token context, and it still fits into VRAM, so there is no performance drop.
  • I will not use gemma-4-31B on 16 GB VRAM, but gemma-4-26B might be good enough; I still need to test it.
  • I also need to test how well Nemotron Cascade 2 and GLM-4.7 Flash REAP 23B hold up: will they beat the Qwen3.5-35B q3? I doubt it, but I might test to confirm the suspicion.