Comparing LLM performance on Ollama on a 16GB VRAM GPU

LLM speed test on RTX 4080 with 16GB VRAM

Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows what to expect from 14 popular LLMs running on Ollama on an RTX 4080.

With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference. For more on LLM performance—throughput vs latency, VRAM limits, parallel requests, and benchmarks across runtimes—see LLM Performance: Benchmarks, Bottlenecks & Optimization.

TL;DR

Here is the updated comparison of LLM performance on an RTX 4080 16GB with Ollama 0.17.7. The 2026-03-09 update added the Qwen 3.5 9b, 9b-q8_0, 27b and 35b models:

Model                 RAM+VRAM Used  CPU/GPU Split    Tokens/sec
gpt-oss:20b           14 GB          100% GPU         139.93
qwen3.5:9b            9.3 GB         100% GPU         90.89
ministral-3:14b       13 GB          100% GPU         70.13
qwen3:14b             12 GB          100% GPU         61.85
qwen3.5:9b-q8_0       13 GB          100% GPU         61.22
qwen3-coder:30b       20 GB          25%/75% CPU/GPU  57.17
qwen3-vl:30b-a3b      22 GB          30%/70% CPU/GPU  50.99
glm-4.7-flash         21 GB          27%/73% CPU/GPU  33.86
nemotron-3-nano:30b   25 GB          38%/62% CPU/GPU  32.77
qwen3.5:35b           27 GB          43%/57% CPU/GPU  20.66
devstral-small-2:24b  19 GB          18%/82% CPU/GPU  18.67
mistral-small3.2:24b  19 GB          18%/82% CPU/GPU  18.51
gpt-oss:120b          66 GB          78%/22% CPU/GPU  12.64
qwen3.5:27b           24 GB          43%/57% CPU/GPU  6.48

Key insight: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec—an 11x speed difference.
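
The 11x figure is simply the ratio of the two eval rates. A minimal Python sketch of that arithmetic, using numbers copied from the table above (the function name is illustrative only, not part of any tool):

```python
def speed_ratio(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster the first eval rate is than the second."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


# Eval rates (tokens/sec) copied from the comparison table.
GPT_OSS_20B_TPS = 139.93
GPT_OSS_120B_TPS = 12.64

ratio = speed_ratio(GPT_OSS_20B_TPS, GPT_OSS_120B_TPS)
print(f"gpt-oss:20b is about {ratio:.1f}x faster than gpt-oss:120b")
```

The same helper works for any pair of rows, for example the Q4 and Q8 variants of qwen3.5:9b.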

Test Hardware Setup

The benchmark was conducted on the following system:

  • GPU: NVIDIA RTX 4080 with 16GB VRAM
  • CPU: Intel Core i7-14700 (8 P-cores + 12 E-cores)
  • RAM: 64GB DDR5-6000

This represents a common high-end consumer configuration for local LLM inference. The 16GB VRAM is the critical constraint—it determines which models run entirely on GPU versus requiring CPU offloading.

Understanding how Ollama uses Intel CPU cores becomes important when models exceed VRAM capacity, as CPU performance directly impacts offloaded layer inference speed.

Purpose of This Benchmark

The primary goal was measuring inference speed under realistic conditions. I already knew from experience that Mistral Small 3.2 24B excels at language quality while Qwen3 14B offers superior instruction-following for my specific use cases.

This benchmark answers the practical question: How fast can each model generate text, and what’s the speed penalty for exceeding VRAM limits?

The test parameters were:

  • Context size: 19,000 tokens, the average in my Generate requests.
  • Prompt: “compare weather and climate between capital cities of australia”
  • Metric: eval rate (tokens per second during generation)
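
The eval rate is the eval count divided by the eval duration, both of which Ollama prints in verbose mode. As a rough sketch, the rate can be recomputed from those two values; the helper names below are my own, and the parser only handles the hour/minute/second forms that appear in this article's output (for example "6m36.168233066s"):

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '6m36.168233066s' or '20.41s' to seconds.

    Only h/m/s components are supported; anything else (e.g. '500ms')
    raises ValueError instead of being silently misread.
    """
    match = re.fullmatch(
        r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text.strip()
    )
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def eval_rate(eval_count: int, eval_duration: str) -> float:
    """Tokens generated per second of generation time."""
    return eval_count / parse_duration(eval_duration)


# Values taken from the GPT-OSS runs reported later in this article.
print(f"{eval_rate(2856, '20.410517947s'):.2f} tokens/s")
print(f"{eval_rate(5008, '6m36.168233066s'):.2f} tokens/s")
```

Recomputing the rate this way is a quick check that a copied result is internally consistent.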

Ollama Installation and Version

The original tests used Ollama version 0.15.2, the latest release at the time of testing. I later re-ran the benchmark on Ollama 0.17.7 to add the Qwen3.5 models. For a complete reference of Ollama commands used in this benchmark, see the Ollama cheatsheet.

As a quick recap, install Ollama on Linux:

curl -fsSL https://ollama.com/install.sh | sh

Verify installation:

ollama --version

If you need to store models on a different drive due to space constraints, check out how to move Ollama models to a different drive.

Models Tested

The following models were benchmarked, listed roughly alphabetically:

Model                 Parameters  Quantization  Notes
devstral-small-2:24b  24B         Q4_K_M        Code-focused
glm-4.7-flash         30B         Q4_K_M        Thinking model
gpt-oss:20b           20B         Q4_K_M        Fastest overall
gpt-oss:120b          120B        Q4_K_M        Largest tested
ministral-3:14b       14B         Q4_K_M        Mistral’s efficient model
mistral-small3.2:24b  24B         Q4_K_M        Strong language quality
nemotron-3-nano:30b   30B         Q4_K_M        NVIDIA’s offering
qwen3:14b             14B         Q4_K_M        Best instruction-following
qwen3.5:9b            9B          Q4_K_M        Fast, fully GPU
qwen3.5:9b-q8_0       9B          Q8_0          Higher quality, fully GPU
qwen3.5:27b           27B         Q4_K_M        Excellent quality, slow on Ollama
qwen3-vl:30b-a3b      30B         Q4_K_M        Vision-capable
qwen3-coder:30b       30B         Q4_K_M        Code-focused
qwen3.5:35b           35B         Q4_K_M        Good coding capabilities

To download any model:

ollama pull gpt-oss:20b
ollama pull qwen3:14b

Understanding CPU Offloading

When a model’s memory requirements exceed available VRAM, Ollama automatically distributes model layers between GPU and system RAM. The output shows this as a percentage split like “18%/82% CPU/GPU”.
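
A back-of-the-envelope way to anticipate the split is to compare the total footprint with the VRAM size. The sketch below is only a lower bound under the assumption that all 16GB is usable by the model; it is not how Ollama actually places layers, and the reported CPU share is typically a few points higher because some VRAM is reserved for other uses:

```python
def min_cpu_share(total_gb: float, vram_gb: float = 16.0) -> float:
    """Lower bound on the fraction of the footprint that must sit in system RAM."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return max(0.0, 1.0 - vram_gb / total_gb)


# (model, footprint in GB, CPU % reported by Ollama) from the results table.
REPORTED = [
    ("gpt-oss:20b", 14, 0),
    ("mistral-small3.2:24b", 19, 18),
    ("qwen3-coder:30b", 20, 25),
    ("gpt-oss:120b", 66, 78),
]

for model, size_gb, reported_cpu in REPORTED:
    estimate = min_cpu_share(size_gb) * 100
    print(f"{model:<22} estimated >= {estimate:4.0f}% CPU, reported {reported_cpu}%")
```

For the rows above, the estimate lands within a few percentage points of what Ollama reports, which makes it a handy first check before pulling a large model.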

This has massive performance implications. Each token generation requires data transfer between CPU and GPU memory—a bottleneck that compounds with every layer offloaded to CPU.

The pattern is clear from our results:

  • 100% GPU models: 61-140 tokens/sec
  • 70-82% GPU models: 19-57 tokens/sec
  • 57-62% GPU models: 6-33 tokens/sec
  • 22% GPU (mostly CPU): 12.6 tokens/sec

This explains why a 20B parameter model can outperform a 120B model by 11x in practice. If you’re planning to serve multiple concurrent requests, understanding how Ollama handles parallel requests becomes essential for capacity planning.

Detailed Benchmark Results

Models Running 100% on GPU

GPT-OSS 20B — The Speed Champion

ollama run gpt-oss:20b --verbose
/set parameter num_ctx 19000

NAME           SIZE     PROCESSOR    CONTEXT
gpt-oss:20b    14 GB    100% GPU     19000

eval count:           2856 token(s)
eval duration:        20.410517947s
eval rate:            139.93 tokens/s

At 139.93 tokens/sec, GPT-OSS 20B is the clear winner for speed-critical applications. It uses only 14GB of VRAM, leaving headroom for larger context windows or other GPU workloads.
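
The numbers in this article come from interactive `ollama run --verbose` sessions. If you would rather script a similar measurement, one option is Ollama's REST API: a non-streaming `/api/generate` response includes `eval_count` and `eval_duration` (in nanoseconds). The following is a sketch under that assumption, not the method used for this benchmark; the function names are my own, and it expects a server on the default local port:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model: str, prompt: str, num_ctx: int = 19000) -> dict:
    """Request body for a single non-streaming generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response: dict) -> float:
    """Eval rate from a response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_benchmark(model: str, prompt: str, num_ctx: int = 19000,
                  url: str = OLLAMA_URL) -> float:
    """Send one request to a running Ollama server and return tokens/sec."""
    data = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Example call (needs a running Ollama server with the model pulled):
# run_benchmark("gpt-oss:20b",
#               "compare weather and climate between capital cities of australia")
```

A single run is noisy, so averaging several calls per model would give steadier figures than one interactive session.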

Qwen3 14B — Excellent Balance

ollama run qwen3:14b --verbose
/set parameter num_ctx 19000

NAME         SIZE     PROCESSOR    CONTEXT
qwen3:14b    12 GB    100% GPU     19000

eval count:           3094 token(s)
eval duration:        50.020594575s
eval rate:            61.85 tokens/s

Qwen3 14B offers the best instruction-following in my experience, with a comfortable 12GB memory footprint. At 61.85 tokens/sec, it’s responsive enough for interactive use.

For developers integrating Qwen3 into applications, see LLM Structured Output with Ollama and Qwen3 for extracting structured JSON responses.

Ministral 3 14B — Fast and Compact

ollama run ministral-3:14b --verbose
/set parameter num_ctx 19000

NAME               SIZE     PROCESSOR    CONTEXT
ministral-3:14b    13 GB    100% GPU     19000

eval count:           1481 token(s)
eval duration:        21.11734277s
eval rate:            70.13 tokens/s

Mistral’s smaller model delivers 70.13 tokens/sec while fitting entirely in VRAM. A solid choice when you need Mistral-family quality at maximum speed.

qwen3.5:9b - quick and new

ollama run qwen3.5:9b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME          ID              SIZE      PROCESSOR    CONTEXT
qwen3.5:9b    6488c96fa5fa    9.3 GB    100% GPU     19000

eval count:           3802 token(s)
eval duration:        41.830174597s
eval rate:            90.89 tokens/s

qwen3.5:9b-q8_0 - q8 quant

The Q8 quant cuts qwen3.5:9b throughput by roughly a third compared with Q4 (61.22 vs 90.89 tokens/sec).

ollama run qwen3.5:9b-q8_0 --verbose
/set parameter num_ctx 19000

compare weather and climate between capital cities of australia
NAME               ID              SIZE     PROCESSOR    CONTEXT
qwen3.5:9b-q8_0    441ec31e4d2a    13 GB    100% GPU     19000

eval count:           3526 token(s)
eval duration:        57.595540159s
eval rate:            61.22 tokens/s

Models Requiring CPU Offloading

qwen3-coder:30b - fastest of the 30B models, being text-only

ollama run qwen3-coder:30b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME               ID              SIZE     PROCESSOR          CONTEXT
qwen3-coder:30b    06c1097efce0    20 GB    25%/75% CPU/GPU    19000

eval count:           559 token(s)
eval duration:        9.77768875s
eval rate:            57.17 tokens/s

Qwen3-VL 30B — Best Partially-Offloaded Performance

ollama run qwen3-vl:30b-a3b-instruct --verbose
/set parameter num_ctx 19000

NAME                         SIZE     PROCESSOR          CONTEXT
qwen3-vl:30b-a3b-instruct    22 GB    30%/70% CPU/GPU    19000

eval count:           1450 token(s)
eval duration:        28.439319709s
eval rate:            50.99 tokens/s

Despite 30% of layers on CPU, Qwen3-VL maintains 50.99 tokens/sec, not far behind the slowest 100% GPU models in this test (about 61 tokens/sec). The vision capability adds versatility for multimodal tasks.

Mistral Small 3.2 24B — Quality vs Speed Trade-off

ollama run mistral-small3.2:24b --verbose
/set parameter num_ctx 19000

NAME                    SIZE     PROCESSOR          CONTEXT
mistral-small3.2:24b    19 GB    18%/82% CPU/GPU    19000

eval count:           831 token(s)
eval duration:        44.899859038s
eval rate:            18.51 tokens/s

Mistral Small 3.2 offers superior language quality but pays a steep speed penalty. At 18.51 tokens/sec, it feels noticeably slower for interactive chat. Worth it for tasks where quality matters more than latency.

GLM 4.7 Flash — MoE Thinking Model

ollama run glm-4.7-flash --verbose
/set parameter num_ctx 19000

NAME                 SIZE     PROCESSOR          CONTEXT
glm-4.7-flash        21 GB    27%/73% CPU/GPU    19000

eval count:           2446 token(s)
eval duration:        1m12.239164004s
eval rate:            33.86 tokens/s

GLM 4.7 Flash is a 30B-A3B Mixture of Experts model—30B total parameters with only 3B active per token. As a “thinking” model, it generates internal reasoning before responses. The 33.86 tokens/sec includes both thinking and output tokens. Despite CPU offloading, the MoE architecture keeps it reasonably fast.

qwen3.5:35b - New model with decent self-hosted performance

ollama run qwen3.5:35b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME           ID              SIZE     PROCESSOR          CONTEXT
qwen3.5:35b    4af949f8bdf0    27 GB    43%/57% CPU/GPU    19000

eval count:           3418 token(s)
eval duration:        2m45.458926548s
eval rate:            20.66 tokens/s

GPT-OSS 120B — The Heavy Hitter

ollama run gpt-oss:120b --verbose
/set parameter num_ctx 19000

NAME            SIZE     PROCESSOR          CONTEXT
gpt-oss:120b    66 GB    78%/22% CPU/GPU    19000

eval count:           5008 token(s)
eval duration:        6m36.168233066s
eval rate:            12.64 tokens/s

Running a 120B model on 16GB VRAM is technically possible but painful. With 78% on CPU, the 12.64 tokens/sec makes interactive use frustrating. Better suited for batch processing where latency doesn’t matter.

qwen3.5:27b - Smart but slow on Ollama

ollama run qwen3.5:27b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia

NAME           ID              SIZE     PROCESSOR          CONTEXT
qwen3.5:27b    193ec05b1e80    24 GB    43%/57% CPU/GPU    19000

eval count:           3370 token(s)
eval duration:        8m40.087510281s
eval rate:            6.48 tokens/s

I have tested qwen3.5:27b with OpenCode and came away with a very good opinion of it. It is very capable and knowledgeable, with really good tool calling, though it is slow on my machine on Ollama. I have tried other LLM self-hosting platforms and got much higher speeds. I believe it’s time to let Ollama go. I will write about it a bit later.

Practical Recommendations

For Interactive Chat

Use models that fit 100% in VRAM:

  1. GPT-OSS 20B — Maximum speed (139.93 t/s)
  2. Ministral 3 14B — Good speed with Mistral quality (70.13 t/s)
  3. Qwen3 14B — Best instruction-following (61.85 t/s)
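
To see every candidate rather than just my top three, the results table can be filtered by footprint and sorted by speed. This is a small sketch that uses "footprint at most 16 GB" as a stand-in for "runs 100% on GPU", which happens to hold for every row in this benchmark but is an assumption in general:

```python
VRAM_GB = 16

# (model, RAM+VRAM used in GB, tokens/sec) copied from the results table.
RESULTS = [
    ("gpt-oss:20b", 14, 139.93),
    ("qwen3.5:9b", 9.3, 90.89),
    ("ministral-3:14b", 13, 70.13),
    ("qwen3:14b", 12, 61.85),
    ("qwen3.5:9b-q8_0", 13, 61.22),
    ("qwen3-coder:30b", 20, 57.17),
    ("qwen3-vl:30b-a3b", 22, 50.99),
    ("glm-4.7-flash", 21, 33.86),
    ("nemotron-3-nano:30b", 25, 32.77),
    ("qwen3.5:35b", 27, 20.66),
    ("devstral-small-2:24b", 19, 18.67),
    ("mistral-small3.2:24b", 19, 18.51),
    ("gpt-oss:120b", 66, 12.64),
    ("qwen3.5:27b", 24, 6.48),
]


def fits_in_vram(results, vram_gb=VRAM_GB):
    """Rows whose footprint fits in VRAM, fastest first."""
    fitting = [row for row in results if row[1] <= vram_gb]
    return sorted(fitting, key=lambda row: row[2], reverse=True)


for model, size_gb, tps in fits_in_vram(RESULTS):
    print(f"{model:<18} {size_gb:>5} GB  {tps:>7.2f} t/s")
```

Speed is only one axis, of course; the ranking above says nothing about answer quality.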

For a better chat experience, consider Open-Source Chat UIs for local Ollama.

For Batch Processing

Again, this applies to my hardware: 16GB VRAM.

When speed is less critical:

  • Mistral Small 3.2 24B — Superior language quality
  • Qwen3-VL 30B — Vision + text capability

When speed is not critical at all:

  • Qwen3.5:35b - Good coding capabilities
  • Qwen3.5:27b - Extremely good, but slow on Ollama. I have had good results hosting this model on llama.cpp, though.

For Development and Coding

If you’re building applications with Ollama:

Alternative Hosting Options

If Ollama’s limitations concern you (see Ollama enshittification concerns), explore other options in the Local LLM Hosting Guide or compare Docker Model Runner vs Ollama.

Conclusion

With 16GB VRAM, you can run capable LLMs at impressive speeds—if you choose wisely. The key findings:

  1. Stay within VRAM limits for interactive use. A 20B model at 140 tokens/sec beats a 120B model at 12 tokens/sec for most practical purposes.

  2. GPT-OSS 20B wins on pure speed, but Qwen3 14B offers the best balance of speed and capability for instruction-following tasks.

  3. CPU offloading works but expect 3-10x slowdowns. Acceptable for batch processing, frustrating for chat.

  4. Context size matters. The 19K context used here increases VRAM usage significantly. Reduce context for better GPU utilization.
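
To get a feel for why context size eats VRAM, here is a rough estimate of key/value cache size for a transformer with grouped-query attention. The formula is the usual textbook one; the layer count, KV head count and head dimension below are hypothetical illustration values, not measurements of any model in this benchmark, and a real runtime may quantize or allocate the cache differently:

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Hypothetical architecture, chosen only for illustration (16-bit cache values).
HYPOTHETICAL_MODEL = {"n_layers": 40, "n_kv_heads": 8, "head_dim": 128}

for num_ctx in (4096, 8192, 19000):
    gib = kv_cache_bytes(num_ctx, **HYPOTHETICAL_MODEL) / 2**30
    print(f"num_ctx={num_ctx:>6}: about {gib:.2f} GiB of KV cache")
```

The cache grows linearly with `num_ctx`, so under these assumptions dropping from 19,000 to 8,192 tokens would free well over a gigabyte for model layers.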

For AI-powered search combining local LLMs with web results, see self-hosting Perplexica with Ollama.

To explore more benchmarks, VRAM and throughput trade-offs, and performance tuning across Ollama and other runtimes, check our LLM Performance: Benchmarks, Bottlenecks & Optimization hub.
