Comparing LLM performance on Ollama on a 16GB VRAM GPU
LLM speed test on an RTX 4080 with 16GB VRAM
Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows exactly what to expect from 9 popular LLMs running under Ollama on an RTX 4080.
With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference.

TL;DR
Here is the comparison table of LLM performance on RTX 4080 16GB with Ollama 0.15.2:
| Model | RAM+VRAM Used | CPU/GPU Split | Tokens/sec |
|---|---|---|---|
| gpt-oss:20b | 14 GB | 100% GPU | 139.93 |
| ministral-3:14b | 13 GB | 100% GPU | 70.13 |
| qwen3:14b | 12 GB | 100% GPU | 61.85 |
| qwen3-vl:30b-a3b | 22 GB | 30%/70% | 50.99 |
| glm-4.7-flash | 21 GB | 27%/73% | 33.86 |
| nemotron-3-nano:30b | 25 GB | 38%/62% | 32.77 |
| devstral-small-2:24b | 19 GB | 18%/82% | 18.67 |
| mistral-small3.2:24b | 19 GB | 18%/82% | 18.51 |
| gpt-oss:120b | 66 GB | 78%/22% | 12.64 |
Key insight: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec—an 11x speed difference.
Test Hardware Setup
The benchmark was conducted on the following system:
- GPU: NVIDIA RTX 4080 with 16GB VRAM
- CPU: Intel Core i7-14700 (8 P-cores + 12 E-cores)
- RAM: 64GB DDR5-6000
This represents a common high-end consumer configuration for local LLM inference. The 16GB VRAM is the critical constraint—it determines which models run entirely on GPU versus requiring CPU offloading.
Understanding how Ollama uses Intel CPU cores becomes important when models exceed VRAM capacity, as CPU performance directly impacts offloaded layer inference speed.
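If you want to confirm how much of the 16GB is actually occupied while a model is loaded, a standard nvidia-smi query is enough. This is just a quick sanity check, not part of the benchmark itself:

```bash
# Quick sanity check of VRAM usage while a model is loaded
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```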
Purpose of This Benchmark
The primary goal was measuring inference speed under realistic conditions. I already knew from experience that Mistral Small 3.2 24B excels at language quality while Qwen3 14B offers superior instruction-following for my specific use cases.
This benchmark answers the practical question: How fast can each model generate text, and what’s the speed penalty for exceeding VRAM limits?
The test parameters were:
- Context size: 19,000 tokens
- Prompt: “compare weather and climate between capital cities of australia”
- Metric: eval rate (tokens per second during generation)
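The per-model numbers below come from interactive ollama run --verbose sessions. If you prefer to reproduce a run non-interactively, a rough equivalent is to call Ollama's /api/generate endpoint and compute the eval rate from the eval_count and eval_duration fields it returns. This is only a sketch, assuming Ollama listens on the default port 11434 and jq is installed:

```bash
# Non-interactive equivalent of one benchmark run: call /api/generate
# with the same prompt and context size, then compute tokens/sec
# (eval_duration is reported in nanoseconds)
curl -s http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "compare weather and climate between capital cities of australia",
  "stream": false,
  "options": { "num_ctx": 19000 }
}' | jq '{model, eval_count, eval_duration_ns: .eval_duration,
          tokens_per_sec: (.eval_count / (.eval_duration / 1e9))}'
```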
Ollama Installation and Version
All tests used Ollama version 0.15.2, the latest release at the time of testing. For a complete reference of Ollama commands used in this benchmark, see the Ollama cheatsheet.
To install Ollama on Linux:
curl -fsSL https://ollama.com/install.sh | sh
Verify installation:
ollama --version
If you need to store models on a different drive due to space constraints, check out how to move Ollama models to a different drive.
Models Tested
The following models were benchmarked:
| Model | Parameters | Quantization | Notes |
|---|---|---|---|
| gpt-oss:20b | 20B | Q4_K_M | Fastest overall |
| gpt-oss:120b | 120B | Q4_K_M | Largest tested |
| qwen3:14b | 14B | Q4_K_M | Best instruction-following |
| qwen3-vl:30b-a3b | 30B | Q4_K_M | Vision-capable |
| ministral-3:14b | 14B | Q4_K_M | Mistral’s efficient model |
| mistral-small3.2:24b | 24B | Q4_K_M | Strong language quality |
| devstral-small-2:24b | 24B | Q4_K_M | Code-focused |
| glm-4.7-flash | 30B | Q4_K_M | Thinking model |
| nemotron-3-nano:30b | 30B | Q4_K_M | NVIDIA’s offering |
To download any model:
ollama pull gpt-oss:20b
ollama pull qwen3:14b
Understanding CPU Offloading
When a model’s memory requirements exceed available VRAM, Ollama automatically distributes model layers between GPU and system RAM. The output shows this as a percentage split like “18%/82% CPU/GPU”.
This has massive performance implications. Each token generation requires data transfer between CPU and GPU memory—a bottleneck that compounds with every layer offloaded to CPU.
The pattern is clear from our results:
- 100% GPU models: 61-140 tokens/sec
- 62-82% GPU models: 19-51 tokens/sec
- 22% GPU (mostly CPU): 12.6 tokens/sec
This explains why a 20B parameter model can outperform a 120B model by 11x in practice. If you’re planning to serve multiple concurrent requests, understanding how Ollama handles parallel requests becomes essential for capacity planning.
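You can check the split for any loaded model with ollama ps (the PROCESSOR column in the outputs below comes from the same place). To observe the offloading penalty directly, the num_gpu option limits how many layers Ollama places on the GPU; the value below is only an illustration, not a tuning recommendation:

```bash
# Show loaded models with their size and CPU/GPU processor split
ollama ps

# Force fewer GPU layers for a single request to observe the slowdown
# from extra CPU offloading (num_gpu value is an example, adjust per model)
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "short test prompt",
  "stream": false,
  "options": { "num_gpu": 20, "num_ctx": 19000 }
}' | jq '.eval_count / (.eval_duration / 1e9)'
```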
Detailed Benchmark Results
Models Running 100% on GPU
GPT-OSS 20B — The Speed Champion
ollama run gpt-oss:20b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
gpt-oss:20b 14 GB 100% GPU 19000
eval count: 2856 token(s)
eval duration: 20.410517947s
eval rate: 139.93 tokens/s
At 139.93 tokens/sec, GPT-OSS 20B is the clear winner for speed-critical applications. It uses only 14GB of VRAM, leaving headroom for larger context windows or other GPU workloads.
Qwen3 14B — Excellent Balance
ollama run qwen3:14b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
qwen3:14b 12 GB 100% GPU 19000
eval count: 3094 token(s)
eval duration: 50.020594575s
eval rate: 61.85 tokens/s
Qwen3 14B offers the best instruction-following in my experience, with a comfortable 12GB memory footprint. At 61.85 tokens/sec, it’s responsive enough for interactive use.
For developers integrating Qwen3 into applications, see LLM Structured Output with Ollama and Qwen3 for extracting structured JSON responses.
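As a minimal sketch of that approach, Ollama's chat API accepts a format field that constrains the reply to valid JSON; the prompt and fields here are placeholders, not part of the benchmark:

```bash
# Ask Qwen3 for a JSON answer and constrain the output format
# (the city and field names are only placeholders)
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3:14b",
  "stream": false,
  "format": "json",
  "messages": [{
    "role": "user",
    "content": "Return JSON with keys city and avg_summer_temp_c for Canberra."
  }]
}' | jq -r '.message.content'
```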
Ministral 3 14B — Fast and Compact
ollama run ministral-3:14b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
ministral-3:14b 13 GB 100% GPU 19000
eval count: 1481 token(s)
eval duration: 21.11734277s
eval rate: 70.13 tokens/s
Mistral’s smaller model delivers 70.13 tokens/sec while fitting entirely in VRAM. A solid choice when you need Mistral-family quality at maximum speed.
Models Requiring CPU Offloading
Qwen3-VL 30B — Best Partially-Offloaded Performance
ollama run qwen3-vl:30b-a3b-instruct --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
qwen3-vl:30b-a3b-instruct 22 GB 30%/70% CPU/GPU 19000
eval count: 1450 token(s)
eval duration: 28.439319709s
eval rate: 50.99 tokens/s
Despite 30% of layers on CPU, Qwen3-VL maintains 50.99 tokens/sec—faster than some 100% GPU models. The vision capability adds versatility for multimodal tasks.
Mistral Small 3.2 24B — Quality vs Speed Trade-off
ollama run mistral-small3.2:24b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
mistral-small3.2:24b 19 GB 18%/82% CPU/GPU 19000
eval count: 831 token(s)
eval duration: 44.899859038s
eval rate: 18.51 tokens/s
Mistral Small 3.2 offers superior language quality but pays a steep speed penalty. At 18.51 tokens/sec, it feels noticeably slower for interactive chat. Worth it for tasks where quality matters more than latency.
GLM 4.7 Flash — MoE Thinking Model
ollama run glm-4.7-flash --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
glm-4.7-flash 21 GB 27%/73% CPU/GPU 19000
eval count: 2446 token(s)
eval duration: 1m12.239164004s
eval rate: 33.86 tokens/s
GLM 4.7 Flash is a 30B-A3B Mixture of Experts model—30B total parameters with only 3B active per token. As a “thinking” model, it generates internal reasoning before responses. The 33.86 tokens/sec includes both thinking and output tokens. Despite CPU offloading, the MoE architecture keeps it reasonably fast.
GPT-OSS 120B — The Heavy Hitter
ollama run gpt-oss:120b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
gpt-oss:120b 66 GB 78%/22% CPU/GPU 19000
eval count: 5008 token(s)
eval duration: 6m36.168233066s
eval rate: 12.64 tokens/s
Running a 120B model on 16GB VRAM is technically possible but painful. With 78% on CPU, the 12.64 tokens/sec makes interactive use frustrating. Better suited for batch processing where latency doesn’t matter.
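For batch work like that, a simple loop over a prompt file is enough. The sketch below assumes one prompt per line in a hypothetical prompts.txt and appends one JSON response per line to answers.jsonl:

```bash
# Batch processing: one prompt per line in prompts.txt,
# one JSON response per line in answers.jsonl (latency is irrelevant here)
while IFS= read -r prompt; do
  curl -s http://localhost:11434/api/generate -d "$(jq -n \
    --arg model "gpt-oss:120b" --arg prompt "$prompt" \
    '{model: $model, prompt: $prompt, stream: false,
      options: {num_ctx: 19000}}')"
done < prompts.txt > answers.jsonl
```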
Practical Recommendations
For Interactive Chat
Use models that fit 100% in VRAM:
- GPT-OSS 20B — Maximum speed (139.93 t/s)
- Ministral 3 14B — Good speed with Mistral quality (70.13 t/s)
- Qwen3 14B — Best instruction-following (61.85 t/s)
For a better chat experience, consider Open-Source Chat UIs for local Ollama.
For Batch Processing
When speed is less critical:
- Mistral Small 3.2 24B — Superior language quality
- Qwen3-VL 30B — Vision + text capability
For Development and Coding
If you’re building applications with Ollama, Devstral Small 2 24B is the code-focused model in this lineup, though at 18.67 t/s it suits asynchronous tasks better than interactive completion. For integration examples, see Integrating Ollama with Python: REST API and Python Client Examples and Go SDKs for Ollama.
Alternative Hosting Options
If Ollama’s limitations concern you (see Ollama enshittification concerns), explore other options in the Local LLM Hosting Guide or compare Docker Model Runner vs Ollama.
Conclusion
With 16GB VRAM, you can run capable LLMs at impressive speeds—if you choose wisely. The key findings:
- Stay within VRAM limits for interactive use. A 20B model at 140 tokens/sec beats a 120B model at 12 tokens/sec for most practical purposes.
- GPT-OSS 20B wins on pure speed, but Qwen3 14B offers the best balance of speed and capability for instruction-following tasks.
- CPU offloading works, but expect 3-10x slowdowns. Acceptable for batch processing, frustrating for chat.
- Context size matters. The 19K context used here increases VRAM usage significantly. Reduce context for better GPU utilization (see the example after this list).
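Reducing context uses the same mechanism as the benchmark setup: /set parameter num_ctx inside the Ollama REPL, or options.num_ctx over the API. The 8192 below is only an example value:

```bash
# Inside the Ollama REPL: shrink the context window for the session
# /set parameter num_ctx 8192

# Over the API: request a smaller context window per call (example value)
curl -s http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.2:24b",
  "prompt": "compare weather and climate between capital cities of australia",
  "stream": false,
  "options": { "num_ctx": 8192 }
}' | jq '.eval_count / (.eval_duration / 1e9)'
```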
For AI-powered search combining local LLMs with web results, see self-hosting Perplexica with Ollama.
Useful Links
Internal Resources
- Ollama cheatsheet: Most useful Ollama commands
- How Ollama Handles Parallel Requests
- How Ollama is using Intel CPU Performance and Efficient Cores
- How to Move Ollama Models to Different Drive or Folder
- LLM Structured Output on Ollama, Qwen3 & Python or Go
- Self-hosting Perplexica - with Ollama
- Open-Source Chat UIs for LLMs on Local Ollama Instances
- First Signs of Ollama Enshittification
- Docker Model Runner vs Ollama: Which to Choose?
- Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More
- Integrating Ollama with Python: REST API and Python Client Examples
- Go SDKs for Ollama - comparison with examples