Comparing LLM performance on Ollama on a 16GB VRAM GPU

LLM speed test on RTX 4080 with 16GB VRAM

Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows exactly what you can expect from 9 popular LLMs running under Ollama on an RTX 4080.

With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference.

TL;DR

Here is the comparison table of LLM performance on an RTX 4080 (16GB VRAM) with Ollama 0.15.2:

Model                   RAM+VRAM Used   CPU/GPU Split   Tokens/sec
gpt-oss:20b             14 GB           100% GPU        139.93
ministral-3:14b         13 GB           100% GPU        70.13
qwen3:14b               12 GB           100% GPU        61.85
qwen3-vl:30b-a3b        22 GB           30%/70%         50.99
glm-4.7-flash           21 GB           27%/73%         33.86
nemotron-3-nano:30b     25 GB           38%/62%         32.77
devstral-small-2:24b    19 GB           18%/82%         18.67
mistral-small3.2:24b    19 GB           18%/82%         18.51
gpt-oss:120b            66 GB           78%/22%         12.64

Key insight: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec—an 11x speed difference.

Test Hardware Setup

The benchmark was conducted on the following system:

  • GPU: NVIDIA RTX 4080 with 16GB VRAM
  • CPU: Intel Core i7-14700 (8 P-cores + 12 E-cores)
  • RAM: 64GB DDR5-6000

This represents a common high-end consumer configuration for local LLM inference. The 16GB VRAM is the critical constraint—it determines which models run entirely on GPU versus requiring CPU offloading.

Understanding how Ollama uses Intel CPU cores becomes important when models exceed VRAM capacity, as CPU performance directly impacts offloaded layer inference speed.
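
A quick way to see where a model actually landed is to check while it is loaded. These are generic checks rather than part of the benchmark itself: ollama ps reports the loaded model, its size and the CPU/GPU split (the same PROCESSOR figures quoted in the results below), while nvidia-smi shows live VRAM usage.

# with a model loaded in another terminal:
ollama ps       # loaded models, their size and CPU/GPU split
nvidia-smi      # actual VRAM usage on the GPU side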

Purpose of This Benchmark

The primary goal was measuring inference speed under realistic conditions. I already knew from experience that Mistral Small 3.2 24B excels at language quality while Qwen3 14B offers superior instruction-following for my specific use cases.

This benchmark answers the practical question: How fast can each model generate text, and what’s the speed penalty for exceeding VRAM limits?

The test parameters were:

  • Context size: 19,000 tokens
  • Prompt: “compare weather and climate between capital cities of australia”
  • Metric: eval rate (tokens per second during generation)
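
Each figure in this post comes from the verbose output of ollama run, but the same numbers can be scripted through Ollama’s HTTP API. A minimal sketch, assuming the default port and an already-pulled model; num_ctx mirrors the 19,000-token context, and jq recomputes the eval rate from the returned counters:

curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "compare weather and climate between capital cities of australia",
  "stream": false,
  "options": { "num_ctx": 19000 }
}' > response.json

# eval_duration is reported in nanoseconds, so tokens/sec = eval_count / (eval_duration / 1e9)
jq -r '"\(.eval_count / (.eval_duration / 1000000000)) tokens/s"' response.json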

Ollama Installation and Version

All tests used Ollama version 0.15.2, the latest release at the time of testing. For a complete reference of Ollama commands used in this benchmark, see the Ollama cheatsheet.

To install Ollama on Linux:

curl -fsSL https://ollama.com/install.sh | sh

Verify installation:

ollama --version

If you need to store models on a different drive due to space constraints, check out how to move Ollama models to a different drive.

Models Tested

The following models were benchmarked:

Model                   Parameters   Quantization   Notes
gpt-oss:20b             20B          Q4_K_M         Fastest overall
gpt-oss:120b            120B         Q4_K_M         Largest tested
qwen3:14b               14B          Q4_K_M         Best instruction-following
qwen3-vl:30b-a3b        30B          Q4_K_M         Vision-capable
ministral-3:14b         14B          Q4_K_M         Mistral’s efficient model
mistral-small3.2:24b    24B          Q4_K_M         Strong language quality
devstral-small-2:24b    24B          Q4_K_M         Code-focused
glm-4.7-flash           30B          Q4_K_M         Thinking model
nemotron-3-nano:30b     30B          Q4_K_M         NVIDIA’s offering

To download any model:

ollama pull gpt-oss:20b
ollama pull qwen3:14b
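
To fetch the whole test set in one go, a simple loop over the tags listed above works (tags occasionally change between releases, so adjust if a pull fails):

for model in gpt-oss:20b gpt-oss:120b qwen3:14b qwen3-vl:30b-a3b \
  ministral-3:14b mistral-small3.2:24b devstral-small-2:24b \
  glm-4.7-flash nemotron-3-nano:30b; do
  ollama pull "$model"
done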

Understanding CPU Offloading

When a model’s memory requirements exceed available VRAM, Ollama automatically distributes model layers between GPU and system RAM. The output shows this as a percentage split like “18%/82% CPU/GPU”.

This has massive performance implications. Every generated token must pass through the CPU-resident layers, where system RAM bandwidth and CPU compute are far slower than the GPU, and activations have to cross the PCIe bus on every step. The penalty grows with each additional layer offloaded to the CPU.

The pattern is clear from our results:

  • 100% GPU models: 61-140 tokens/sec
  • 62-82% GPU models: 18-51 tokens/sec
  • 22% GPU (mostly CPU): 12.6 tokens/sec

This explains why a 20B parameter model can outperform a 120B model by 11x in practice. If you’re planning to serve multiple concurrent requests, understanding how Ollama handles parallel requests becomes essential for capacity planning.
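
The split itself can be nudged rather than just observed. Ollama exposes a num_gpu option (the number of layers placed on the GPU), and shrinking the context also frees VRAM for more layers. Here is a sketch in the same REPL style used below; if your build does not accept num_gpu via /set parameter, the same option can be passed in the API options field instead:

ollama run mistral-small3.2:24b --verbose
# inside the REPL: reduce the context, or cap how many layers go to the GPU
/set parameter num_ctx 8192
/set parameter num_gpu 30

Pushing num_gpu higher than your VRAM allows can fail with out-of-memory errors, so raise it gradually and watch nvidia-smi.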

Detailed Benchmark Results

Models Running 100% on GPU

GPT-OSS 20B — The Speed Champion

ollama run gpt-oss:20b --verbose
/set parameter num_ctx 19000

NAME           SIZE     PROCESSOR    CONTEXT
gpt-oss:20b    14 GB    100% GPU     19000

eval count:           2856 token(s)
eval duration:        20.410517947s
eval rate:            139.93 tokens/s

At 139.93 tokens/sec, GPT-OSS 20B is the clear winner for speed-critical applications. It uses only 14GB of VRAM, leaving headroom for larger context windows or other GPU workloads.

Qwen3 14B — Excellent Balance

ollama run qwen3:14b --verbose
/set parameter num_ctx 19000

NAME         SIZE     PROCESSOR    CONTEXT
qwen3:14b    12 GB    100% GPU     19000

eval count:           3094 token(s)
eval duration:        50.020594575s
eval rate:            61.85 tokens/s

Qwen3 14B offers the best instruction-following in my experience, with a comfortable 12GB memory footprint. At 61.85 tokens/sec, it’s responsive enough for interactive use.

For developers integrating Qwen3 into applications, see LLM Structured Output with Ollama and Qwen3 for extracting structured JSON responses.

Ministral 3 14B — Fast and Compact

ollama run ministral-3:14b --verbose
/set parameter num_ctx 19000

NAME               SIZE     PROCESSOR    CONTEXT
ministral-3:14b    13 GB    100% GPU     19000

eval count:           1481 token(s)
eval duration:        21.11734277s
eval rate:            70.13 tokens/s

Mistral’s smaller model delivers 70.13 tokens/sec while fitting entirely in VRAM. A solid choice when you need Mistral-family quality at maximum speed.

Models Requiring CPU Offloading

Qwen3-VL 30B — Best Partially-Offloaded Performance

ollama run qwen3-vl:30b-a3b-instruct --verbose
/set parameter num_ctx 19000

NAME                         SIZE     PROCESSOR          CONTEXT
qwen3-vl:30b-a3b-instruct    22 GB    30%/70% CPU/GPU    19000

eval count:           1450 token(s)
eval duration:        28.439319709s
eval rate:            50.99 tokens/s

Despite 30% of its layers running on the CPU, Qwen3-VL holds 50.99 tokens/sec, not far behind the slowest fully GPU-resident model in this test. The a3b in its tag marks it as a Mixture of Experts model with roughly 3B active parameters (the same design discussed for GLM 4.7 Flash below), which softens the offloading penalty. The vision capability adds versatility for multimodal tasks.

Mistral Small 3.2 24B — Quality vs Speed Trade-off

ollama run mistral-small3.2:24b --verbose
/set parameter num_ctx 19000

NAME                    SIZE     PROCESSOR          CONTEXT
mistral-small3.2:24b    19 GB    18%/82% CPU/GPU    19000

eval count:           831 token(s)
eval duration:        44.899859038s
eval rate:            18.51 tokens/s

Mistral Small 3.2 offers superior language quality but pays a steep speed penalty. At 18.51 tokens/sec, it feels noticeably slower for interactive chat. Worth it for tasks where quality matters more than latency.

GLM 4.7 Flash — MoE Thinking Model

ollama run glm-4.7-flash --verbose
/set parameter num_ctx 19000

NAME                 SIZE     PROCESSOR          CONTEXT
glm-4.7-flash        21 GB    27%/73% CPU/GPU    19000

eval count:           2446 token(s)
eval duration:        1m12.239164004s
eval rate:            33.86 tokens/s

GLM 4.7 Flash is a 30B-A3B Mixture of Experts model—30B total parameters with only 3B active per token. As a “thinking” model, it generates internal reasoning before responses. The 33.86 tokens/sec includes both thinking and output tokens. Despite CPU offloading, the MoE architecture keeps it reasonably fast.
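
A rough back-of-envelope estimate (mine, not a measured figure) shows why: at roughly 5 bits per weight after Q4_K_M quantization, each token only has to read the active experts’ weights rather than all 30B parameters:

$$\text{weight bytes per token} \approx N_{\text{active}} \cdot \frac{\text{bits per weight}}{8} \approx 3\times10^{9} \cdot 0.6 \approx 1.8\ \text{GB},$$

compared with about $24\times10^{9} \cdot 0.6 \approx 14$ GB per token for the dense 24B models, which is why the partially offloaded MoE still comfortably outruns them here.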

GPT-OSS 120B — The Heavy Hitter

ollama run gpt-oss:120b --verbose
/set parameter num_ctx 19000

NAME            SIZE     PROCESSOR          CONTEXT
gpt-oss:120b    66 GB    78%/22% CPU/GPU    19000

eval count:           5008 token(s)
eval duration:        6m36.168233066s
eval rate:            12.64 tokens/s

Running a 120B model on 16GB VRAM is technically possible but painful. With 78% on CPU, the 12.64 tokens/sec makes interactive use frustrating. Better suited for batch processing where latency doesn’t matter.

Practical Recommendations

For Interactive Chat

Use models that fit 100% in VRAM:

  1. GPT-OSS 20B — Maximum speed (139.93 t/s)
  2. Ministral 3 14B — Good speed with Mistral quality (70.13 t/s)
  3. Qwen3 14B — Best instruction-following (61.85 t/s)
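
One extra knob for chat: by default Ollama unloads an idle model after about five minutes, so the next message pays the model load time again. On a systemd install, the documented OLLAMA_KEEP_ALIVE variable extends that window (1h here is just an example value):

sudo systemctl edit ollama
# add in the editor that opens:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=1h"
sudo systemctl restart ollama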

For a better chat experience, consider Open-Source Chat UIs for local Ollama.

For Batch Processing

When speed is less critical:

  • Mistral Small 3.2 24B — Superior language quality
  • Qwen3-VL 30B — Vision + text capability
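
Since latency matters less in batch mode, a plain non-interactive loop is usually enough: ollama run accepts the prompt as an argument and exits once the answer is written. A minimal sketch (the prompts/ directory and file naming are illustrative):

# run every prompt file through the model and save each answer next to it
for f in prompts/*.txt; do
  ollama run mistral-small3.2:24b "$(cat "$f")" > "${f%.txt}.out"
done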

For Development and Coding

If you’re building applications with Ollama, Devstral Small 2 24B is the code-focused model in this lineup, though CPU offloading holds it to 18.67 tokens/sec on this hardware. For fast iteration, Qwen3 14B at 61.85 tokens/sec is the more comfortable default, and it pairs well with structured JSON output.
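
As a concrete starting point, the API can be asked for JSON output directly, which pairs well with the structured-output approach linked earlier. A minimal sketch against the local API (prompt and fields are illustrative):

curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "List Australian state capitals as JSON with fields state and capital.",
  "format": "json",
  "stream": false
}' | jq -r '.response'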

Alternative Hosting Options

If Ollama’s limitations concern you (see Ollama enshittification concerns), explore other options in the Local LLM Hosting Guide or compare Docker Model Runner vs Ollama.

Conclusion

With 16GB VRAM, you can run capable LLMs at impressive speeds—if you choose wisely. The key findings:

  1. Stay within VRAM limits for interactive use. A 20B model at 140 tokens/sec beats a 120B model at 12 tokens/sec for most practical purposes.

  2. GPT-OSS 20B wins on pure speed, but Qwen3 14B offers the best balance of speed and capability for instruction-following tasks.

  3. CPU offloading works but expect 3-10x slowdowns. Acceptable for batch processing, frustrating for chat.

  4. Context size matters. The 19K context used here increases VRAM usage significantly. Reduce context for better GPU utilization.

For AI-powered search combining local LLMs with web results, see self-hosting Perplexica with Ollama.
