Which LLM is fastest on a 16GB VRAM GPU with Ollama?

GPT-OSS 20B achieved the highest speed at 139.93 tokens/sec while fitting entirely in 16GB VRAM. It runs 100% on GPU without CPU offloading, making it ideal for speed-critical applications.

What happens when an LLM exceeds 16GB VRAM?

Ollama automatically offloads model layers to system RAM and CPU. This significantly reduces performance—for example, Mistral Small 3.2 24B drops to 18.51 tokens/sec when 18% of layers run on CPU.

How does context size affect VRAM usage in Ollama?

Larger context windows require more VRAM for the KV cache. Using 19K context, a model that fits in VRAM with 4K context may need CPU offloading. Reduce context size if you need to maximize GPU utilization.
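To make the context-size effect concrete, here is a minimal sketch of the usual KV cache estimate for a standard transformer. The model shape (40 layers, 8 KV heads, head size 128, 16-bit cache) is a hypothetical example, not the configuration of any model in this benchmark, and architectures with sliding-window or hybrid attention will behave differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size for a standard transformer.

    Two tensors (K and V) are cached per layer, each holding
    n_kv_heads * head_dim values for every token in the context.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

for ctx in (4096, 19000):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows from roughly 0.6 GiB at 4K context to about 2.9 GiB at 19K, which is enough to push a model that barely fits over the 16GB limit.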

Is Qwen3 14B good for a 16GB GPU?

Yes. Qwen3 14B uses only 12GB VRAM and runs 100% on GPU at 61.85 tokens/sec. It offers excellent instruction-following and fits comfortably in 16GB with room for larger context sizes.

Should I use larger models with CPU offloading or smaller models fully on GPU?

For interactive use cases, smaller models running 100% on GPU are usually better. The speed penalty from CPU offloading is substantial—GPT-OSS 120B at 12.64 tokens/sec feels sluggish compared to GPT-OSS 20B at 139.93 tokens/sec.

Where can I find more LLM performance benchmarks and optimization guides?

Our LLM Performance hub covers throughput vs latency, VRAM limits, parallel requests, memory allocation, and benchmarks across runtimes and hardware.

How does VRAM usage relate to token speed in Ollama?

Models that fit entirely in VRAM avoid CPU offloading and run much faster. The LLM Performance guide explains VRAM limits and how they affect inference speed.

Comparing LLMs performance on Ollama on 16GB VRAM GPU

LLM speed test on RTX 4080 with 16GB VRAM


Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark reveals exactly what one can expect from 14 popular LLMs on Ollama on an RTX 4080.

With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference. For more on LLM performance—throughput vs latency, VRAM limits, parallel requests, and benchmarks across runtimes—see LLM Performance: Benchmarks, Bottlenecks & Optimization.

LLM performance on Ollama - reranking cockroaches

TL;DR

Here is the updated comparison table of LLM performance on an RTX 4080 16GB with Ollama 0.17.7. Update (2026-03-09): added the Qwen 3.5 9b, 9b-q8_0, 27b and 35b models.

Model	RAM+VRAM Used	CPU/GPU Split	Tokens/sec
gpt-oss:20b	14 GB	100% GPU	139.93
qwen3.5:9b	9.3 GB	100% GPU	90.89
ministral-3:14b	13 GB	100% GPU	70.13
qwen3:14b	12 GB	100% GPU	61.85
qwen3.5:9b-q8_0	13 GB	100% GPU	61.22
qwen3-coder:30b	20 GB	25%/75% CPU/GPU	57.17
qwen3-vl:30b-a3b	22 GB	30%/70% CPU/GPU	50.99
glm-4.7-flash	21 GB	27%/73% CPU/GPU	33.86
nemotron-3-nano:30b	25 GB	38%/62% CPU/GPU	32.77
qwen3.5:35b	27 GB	43%/57% CPU/GPU	20.66
devstral-small-2:24b	19 GB	18%/82% CPU/GPU	18.67
mistral-small3.2:24b	19 GB	18%/82% CPU/GPU	18.51
gpt-oss:120b	66 GB	78%/22% CPU/GPU	12.64
qwen3.5:27b	24 GB	43%/57% CPU/GPU	6.48

Key insight: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec—an 11x speed difference.

Test Hardware Setup

The benchmark was conducted on the following system:

GPU: NVIDIA RTX 4080 with 16GB VRAM
CPU: Intel Core i7-14700 (8 P-cores + 12 E-cores)
RAM: 64GB DDR5-6000

This represents a common high-end consumer configuration for local LLM inference. The 16GB VRAM is the critical constraint—it determines which models run entirely on GPU versus requiring CPU offloading.

Understanding how Ollama uses Intel CPU cores becomes important when models exceed VRAM capacity, as CPU performance directly impacts offloaded layer inference speed.

Purpose of This Benchmark

The primary goal was measuring inference speed under realistic conditions. I already knew from experience that Mistral Small 3.2 24B excels at language quality while Qwen3 14B offers superior instruction-following for my specific use cases.

This benchmark answers the practical question: How fast can each model generate text, and what’s the speed penalty for exceeding VRAM limits?

The test parameters were:

Context size: 19,000 tokens. This is the average value in my Generate requests.
Prompt: “compare weather and climate between capital cities of australia”
Metric: eval rate (tokens per second during generation)
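The runs in this post were done through the interactive CLI. As an alternative, the same parameters could be sent to Ollama's HTTP API; the sketch below assumes a server on the default local port, and the helper names (`build_payload`, `eval_rate`, `run_once`) are made up for this example. The API reports `eval_duration` in nanoseconds.

```python
import json
import urllib.request

# Default local endpoint; adjust if your server listens elsewhere (assumption).
OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "compare weather and climate between capital cities of australia"


def build_payload(model, prompt=PROMPT, num_ctx=19000):
    """Request body mirroring the test parameters above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return a single JSON object that includes timing stats
        "options": {"num_ctx": num_ctx},
    }


def eval_rate(stats):
    """Tokens per second from eval_count and eval_duration (nanoseconds)."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def run_once(model):
    """Send one generate request to a running Ollama server and return its JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (needs a running Ollama server with the model pulled):
#   print(f"{eval_rate(run_once('gpt-oss:20b')):.2f} tokens/s")
```

Results from the API may differ slightly from the CLI numbers reported here, since generation length varies between runs.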

Ollama Installation and Version

The initial tests used Ollama version 0.15.2, the latest release at the time of testing. I later re-ran the benchmark on Ollama 0.17.7 to add the Qwen3.5 models. For a complete reference of Ollama commands used in this benchmark, see the Ollama cheatsheet.

As a quick recap, install Ollama on Linux:

curl -fsSL https://ollama.com/install.sh | sh

Verify installation:

ollama --version

If you need to store models on a different drive due to space constraints, check out how to move Ollama models to a different drive.

Models Tested

The following models were benchmarked, in roughly alphabetical order:

Model	Parameters	Quantization	Notes
devstral-small-2:24b	24B	Q4_K_M	Code-focused
glm-4.7-flash	30B	Q4_K_M	Thinking model
gpt-oss:20b	20B	Q4_K_M	Fastest overall
gpt-oss:120b	120B	Q4_K_M	Largest tested
ministral-3:14b	14B	Q4_K_M	Mistral’s efficient model
mistral-small3.2:24b	24B	Q4_K_M	Strong language quality
nemotron-3-nano:30b	30B	Q4_K_M	NVIDIA’s offering
qwen3:14b	14B	Q4_K_M	Best instruction-following
qwen3.5:9b	9B	Q4_K_M	Fast, fully GPU
qwen3.5:9b-q8_0	9B	Q8_0	Higher quality, fully GPU
qwen3.5:27b	27B	Q4_K_M	Excellent quality, slow on Ollama
qwen3-vl:30b-a3b	30B	Q4_K_M	Vision-capable
qwen3-coder:30b	30B	Q4_K_M	Code-focused
qwen3.5:35b	35B	Q4_K_M	Good coding capabilities

To download any model:

ollama pull gpt-oss:20b
ollama pull qwen3:14b

Understanding CPU Offloading

When a model’s memory requirements exceed available VRAM, Ollama automatically distributes model layers between GPU and system RAM. The output shows this as a percentage split like “18%/82% CPU/GPU”.

This has massive performance implications. Each token generation requires data transfer between CPU and GPU memory—a bottleneck that compounds with every layer offloaded to CPU.
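A rough way to anticipate the split is to compare the size Ollama reports with the available VRAM. The sketch below computes a simple lower bound on the CPU share; it is only an estimate under the assumption that all 16GB are usable, and it is not how Ollama itself schedules layers. The sizes and reported splits are taken from the results table.

```python
def min_cpu_share(model_gb, vram_gb=16.0):
    """Smallest fraction of the model that cannot fit in VRAM."""
    if model_gb <= vram_gb:
        return 0.0
    return (model_gb - vram_gb) / model_gb


# model: (size reported by Ollama in GB, CPU share reported by Ollama in %)
reported = {
    "mistral-small3.2:24b": (19, 18),
    "qwen3-coder:30b": (20, 25),
    "qwen3.5:35b": (27, 43),
    "gpt-oss:120b": (66, 78),
}

for name, (size_gb, cpu_pct) in reported.items():
    bound = min_cpu_share(size_gb) * 100
    print(f"{name}: at least {bound:.0f}% on CPU, Ollama reported {cpu_pct}%")
```

In each case the reported CPU share sits a few points above the bound, which would be consistent with part of the VRAM being reserved for other uses, though that explanation is an assumption.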

The pattern is clear from our results:

100% GPU models: 61-140 tokens/sec
70-82% GPU models: 19-57 tokens/sec
57-62% GPU models: 6-33 tokens/sec
22% GPU (mostly CPU): 12.6 tokens/sec

This explains why a 20B parameter model can outperform a 120B model by 11x in practice. If you’re planning to serve multiple concurrent requests, understanding how Ollama handles parallel requests becomes essential for capacity planning.
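The grouping can be recomputed directly from the results table. This is a small sketch with the table values copied in by hand; the three bucket names and their thresholds are arbitrary choices for illustration.

```python
# (model, GPU share in %, tokens/sec) copied from the results table
RESULTS = [
    ("gpt-oss:20b", 100, 139.93),
    ("qwen3.5:9b", 100, 90.89),
    ("ministral-3:14b", 100, 70.13),
    ("qwen3:14b", 100, 61.85),
    ("qwen3.5:9b-q8_0", 100, 61.22),
    ("qwen3-coder:30b", 75, 57.17),
    ("qwen3-vl:30b-a3b", 70, 50.99),
    ("glm-4.7-flash", 73, 33.86),
    ("nemotron-3-nano:30b", 62, 32.77),
    ("qwen3.5:35b", 57, 20.66),
    ("devstral-small-2:24b", 82, 18.67),
    ("mistral-small3.2:24b", 82, 18.51),
    ("gpt-oss:120b", 22, 12.64),
    ("qwen3.5:27b", 57, 6.48),
]


def bucket(gpu_pct):
    """Coarse grouping; the thresholds are arbitrary choices for this sketch."""
    if gpu_pct == 100:
        return "fully on GPU"
    if gpu_pct >= 50:
        return "mostly on GPU"
    return "mostly on CPU"


def speed_ranges(results):
    """Map each bucket to its (slowest, fastest) tokens/sec."""
    ranges = {}
    for _, gpu_pct, tps in results:
        key = bucket(gpu_pct)
        lo, hi = ranges.get(key, (tps, tps))
        ranges[key] = (min(lo, tps), max(hi, tps))
    return ranges


for name, (lo, hi) in speed_ranges(RESULTS).items():
    print(f"{name}: {lo:.2f}-{hi:.2f} tokens/sec")
```

The partially offloaded group spans a wide range (qwen3.5:27b is the slowest model overall despite a 57% GPU share), so the GPU share alone does not determine speed.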

Detailed Benchmark Results

Models Running 100% on GPU

GPT-OSS 20B — The Speed Champion

ollama run gpt-oss:20b --verbose
/set parameter num_ctx 19000

NAME           SIZE     PROCESSOR    CONTEXT
gpt-oss:20b    14 GB    100% GPU     19000

eval count:           2856 token(s)
eval duration:        20.410517947s
eval rate:            139.93 tokens/s

At 139.93 tokens/sec, GPT-OSS 20B is the clear winner for speed-critical applications. It uses only 14GB of VRAM, leaving headroom for larger context windows or other GPU workloads.
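The eval rate printed by `--verbose` is simply the eval count divided by the eval duration. As a sanity check, the sketch below parses the duration strings shown in this post and recomputes the rate; `parse_duration` and `rate_from_cli` are helper names invented for this example, and only hour, minute and second units are handled.

```python
import re


def parse_duration(text):
    """Convert a duration such as '1m12.239164004s' or '20.41s' to seconds.

    Sub-second units (for example '512ms') are not handled and raise ValueError.
    """
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def rate_from_cli(eval_count, eval_duration):
    """Tokens per second, as printed on the 'eval rate' line."""
    return eval_count / parse_duration(eval_duration)


# Values from the GPT-OSS 20B run above and the GLM 4.7 Flash run later in this post.
print(f"{rate_from_cli(2856, '20.410517947s'):.2f} tokens/s")
print(f"{rate_from_cli(2446, '1m12.239164004s'):.2f} tokens/s")
```

Both recomputed values match the rates Ollama printed (139.93 and 33.86 tokens/s).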

Qwen3 14B — Excellent Balance

ollama run qwen3:14b --verbose
/set parameter num_ctx 19000

NAME         SIZE     PROCESSOR    CONTEXT
qwen3:14b    12 GB    100% GPU     19000

eval count:           3094 token(s)
eval duration:        50.020594575s
eval rate:            61.85 tokens/s

Qwen3 14B offers the best instruction-following in my experience, with a comfortable 12GB memory footprint. At 61.85 tokens/sec, it’s responsive enough for interactive use.

For developers integrating Qwen3 into applications, see LLM Structured Output with Ollama and Qwen3 for extracting structured JSON responses.

Ministral 3 14B — Fast and Compact

ollama run ministral-3:14b --verbose
/set parameter num_ctx 19000

NAME               SIZE     PROCESSOR    CONTEXT
ministral-3:14b    13 GB    100% GPU     19000

eval count:           1481 token(s)
eval duration:        21.11734277s
eval rate:            70.13 tokens/s

Mistral’s smaller model delivers 70.13 tokens/sec while fitting entirely in VRAM. A solid choice when you need Mistral-family quality at maximum speed.

qwen3.5:9b - quick and new

ollama run qwen3.5:9b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME          ID              SIZE      PROCESSOR    CONTEXT
qwen3.5:9b    6488c96fa5fa    9.3 GB    100% GPU     19000

eval count:           3802 token(s)
eval duration:        41.830174597s
eval rate:            90.89 tokens/s

qwen3.5:9b-q8_0 - q8 quant

The Q8_0 quant lowers qwen3.5:9b throughput by about a third compared to the Q4_K_M quant (61.22 vs 90.89 tokens/sec).

ollama run qwen3.5:9b-q8_0 --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME               ID              SIZE     PROCESSOR    CONTEXT
qwen3.5:9b-q8_0    441ec31e4d2a    13 GB    100% GPU     19000

eval count:           3526 token(s)
eval duration:        57.595540159s
eval rate:            61.22 tokens/s

Models Requiring CPU Offloading

qwen3-coder:30b - fastest of the 30B models, being text-only

ollama run qwen3-coder:30b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME               ID              SIZE     PROCESSOR          CONTEXT
qwen3-coder:30b    06c1097efce0    20 GB    25%/75% CPU/GPU    19000

eval count:           559 token(s)
eval duration:        9.77768875s
eval rate:            57.17 tokens/s

Qwen3-VL 30B — Best Partially-Offloaded Performance

ollama run qwen3-vl:30b-a3b-instruct --verbose
/set parameter num_ctx 19000

NAME                         SIZE     PROCESSOR          CONTEXT
qwen3-vl:30b-a3b-instruct    22 GB    30%/70% CPU/GPU    19000

eval count:           1450 token(s)
eval duration:        28.439319709s
eval rate:            50.99 tokens/s

Despite 30% of layers on CPU, Qwen3-VL maintains 50.99 tokens/sec, only about 17% behind the slowest fully-GPU result in this test (61.22 tokens/sec). The vision capability adds versatility for multimodal tasks.

Mistral Small 3.2 24B — Quality vs Speed Trade-off

ollama run mistral-small3.2:24b --verbose
/set parameter num_ctx 19000

NAME                    SIZE     PROCESSOR          CONTEXT
mistral-small3.2:24b    19 GB    18%/82% CPU/GPU    19000

eval count:           831 token(s)
eval duration:        44.899859038s
eval rate:            18.51 tokens/s

Mistral Small 3.2 offers superior language quality but pays a steep speed penalty. At 18.51 tokens/sec, it feels noticeably slower for interactive chat. Worth it for tasks where quality matters more than latency.

GLM 4.7 Flash — MoE Thinking Model

ollama run glm-4.7-flash --verbose
/set parameter num_ctx 19000

NAME                 SIZE     PROCESSOR          CONTEXT
glm-4.7-flash        21 GB    27%/73% CPU/GPU    19000

eval count:           2446 token(s)
eval duration:        1m12.239164004s
eval rate:            33.86 tokens/s

GLM 4.7 Flash is a 30B-A3B Mixture of Experts model—30B total parameters with only 3B active per token. As a “thinking” model, it generates internal reasoning before responses. The 33.86 tokens/sec includes both thinking and output tokens. Despite CPU offloading, the MoE architecture keeps it reasonably fast.

qwen3.5:35b - New model with decent self-hosted performance

ollama run qwen3.5:35b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME           ID              SIZE     PROCESSOR          CONTEXT
qwen3.5:35b    4af949f8bdf0    27 GB    43%/57% CPU/GPU    19000

eval count:           3418 token(s)
eval duration:        2m45.458926548s
eval rate:            20.66 tokens/s

GPT-OSS 120B — The Heavy Hitter

ollama run gpt-oss:120b --verbose
/set parameter num_ctx 19000

NAME            SIZE     PROCESSOR          CONTEXT
gpt-oss:120b    66 GB    78%/22% CPU/GPU    19000

eval count:           5008 token(s)
eval duration:        6m36.168233066s
eval rate:            12.64 tokens/s

Running a 120B model on 16GB VRAM is technically possible but painful. With 78% on CPU, the 12.64 tokens/sec makes interactive use frustrating. Better suited for batch processing where latency doesn’t matter.
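To put these rates in terms of waiting time, here is a back-of-the-envelope sketch. The 500-token reply length is a hypothetical figure, and the estimate covers generation only, ignoring prompt processing and model load time.

```python
def seconds_to_generate(n_tokens, tokens_per_sec):
    """Generation time only; prompt processing and model load are ignored."""
    return n_tokens / tokens_per_sec


ANSWER_TOKENS = 500  # hypothetical reply length

# Rates taken from the benchmark results in this post.
RATES = [
    ("gpt-oss:20b", 139.93),
    ("mistral-small3.2:24b", 18.51),
    ("gpt-oss:120b", 12.64),
]

for model, rate in RATES:
    wait = seconds_to_generate(ANSWER_TOKENS, rate)
    print(f"{model}: ~{wait:.0f} s for {ANSWER_TOKENS} tokens")
```

At these rates the same reply takes roughly 4 seconds on GPT-OSS 20B and about 40 seconds on GPT-OSS 120B.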

qwen3.5:27b - Smart but slow on Ollama

ollama run qwen3.5:27b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME           ID              SIZE     PROCESSOR          CONTEXT
qwen3.5:27b    193ec05b1e80    24 GB    43%/57% CPU/GPU    19000

eval count:           3370 token(s)
eval duration:        8m40.087510281s
eval rate:            6.48 tokens/s

I tested qwen3.5:27b with OpenCode and came away with a very high opinion of it. It is very capable and knowledgeable, with really good tool calling, though it is slow on my machine on Ollama. I have tried other LLM self-hosting platforms and got much higher speeds. I believe it’s time to let Ollama go. I will write about it a bit later.

Practical Recommendations

For Interactive Chat

Use models that fit 100% in VRAM:

GPT-OSS 20B — Maximum speed (139.93 t/s)
Ministral 3 14B — Good speed with Mistral quality (70.13 t/s)
Qwen3 14B — Best instruction-following (61.85 t/s)

For a better chat experience, consider Open-Source Chat UIs for local Ollama.

For Batch Processing

Again, this is on my equipment with 16GB VRAM.

When speed is less critical:

Mistral Small 3.2 24B — Superior language quality
Qwen3-VL 30B — Vision + text capability

When speed is not critical at all:

Qwen3.5:35b - Good coding capabilities
Qwen3.5:27b - Extremely good, but slow on Ollama. I have had considerable success hosting this model on llama.cpp, though.

For Development and Coding

If you’re building applications with Ollama:

Alternative Hosting Options

If Ollama’s limitations concern you (see Ollama enshittification concerns), explore other options in the Local LLM Hosting Guide or compare Docker Model Runner vs Ollama.

Conclusion

With 16GB VRAM, you can run capable LLMs at impressive speeds—if you choose wisely. The key findings:

Stay within VRAM limits for interactive use. A 20B model at 140 tokens/sec beats a 120B model at 12 tokens/sec for most practical purposes.
GPT-OSS 20B wins on pure speed, but Qwen3 14B offers the best balance of speed and capability for instruction-following tasks.
CPU offloading works but expect 3-10x slowdowns. Acceptable for batch processing, frustrating for chat.
Context size matters. The 19K context used here increases VRAM usage significantly. Reduce context for better GPU utilization.

For AI-powered search combining local LLMs with web results, see self-hosting Perplexica with Ollama.

To explore more benchmarks, VRAM and throughput trade-offs, and performance tuning across Ollama and other runtimes, check our LLM Performance: Benchmarks, Bottlenecks & Optimization hub.