Comparing LLM Performance on Ollama with a 16GB VRAM GPU
LLM speed test on an RTX 4080 with 16GB VRAM
Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows exactly what to expect from 14 popular LLMs running on Ollama on an RTX 4080.
With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference. For more on LLM performance—throughput vs latency, VRAM limits, parallel requests, and benchmarks across runtimes—see LLM Performance: Benchmarks, Bottlenecks & Optimization.

TL;DR
Here is the updated comparison table of LLM performance on an RTX 4080 16GB with Ollama 0.17.7 (2026-03-09: added the Qwen 3.5 9B, 9B-Q8, 27B, and 35B models):
| Model | RAM+VRAM Used | CPU/GPU Split | Tokens/sec |
|---|---|---|---|
| gpt-oss:20b | 14 GB | 100% GPU | 139.93 |
| qwen3.5:9b | 9.3 GB | 100% GPU | 90.89 |
| ministral-3:14b | 13 GB | 100% GPU | 70.13 |
| qwen3:14b | 12 GB | 100% GPU | 61.85 |
| qwen3.5:9b-q8_0 | 13 GB | 100% GPU | 61.22 |
| qwen3-coder:30b | 20 GB | 25%/75% CPU/GPU | 57.17 |
| qwen3-vl:30b-a3b | 22 GB | 30%/70% CPU/GPU | 50.99 |
| glm-4.7-flash | 21 GB | 27%/73% CPU/GPU | 33.86 |
| nemotron-3-nano:30b | 25 GB | 38%/62% CPU/GPU | 32.77 |
| qwen3.5:35b | 27 GB | 43%/57% CPU/GPU | 20.66 |
| devstral-small-2:24b | 19 GB | 18%/82% CPU/GPU | 18.67 |
| mistral-small3.2:24b | 19 GB | 18%/82% CPU/GPU | 18.51 |
| gpt-oss:120b | 66 GB | 78%/22% CPU/GPU | 12.64 |
| qwen3.5:27b | 24 GB | 43%/57% CPU/GPU | 6.48 |
Key insight: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec—an 11x speed difference.
Test Hardware Setup
The benchmark was conducted on the following system:
- GPU: NVIDIA RTX 4080 with 16GB VRAM
- CPU: Intel Core i7-14700 (8 P-cores + 12 E-cores)
- RAM: 64GB DDR5-6000
This represents a common high-end consumer configuration for local LLM inference. The 16GB VRAM is the critical constraint—it determines which models run entirely on GPU versus requiring CPU offloading.
Understanding how Ollama uses Intel CPU cores becomes important when models exceed VRAM capacity, as CPU performance directly impacts offloaded layer inference speed.
Purpose of This Benchmark
The primary goal was measuring inference speed under realistic conditions. I already knew from experience that Mistral Small 3.2 24B excels at language quality while Qwen3 14B offers superior instruction-following for my specific use cases.
This benchmark answers the practical question: How fast can each model generate text, and what’s the speed penalty for exceeding VRAM limits?
The test parameters were:
- Context size: 19,000 tokens. This is the average value in my generate requests.
- Prompt: “compare weather and climate between capital cities of australia”
- Metric: eval rate (tokens per second during generation)
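Each run prints these stats when invoked with `--verbose`. As a small sketch (the log content below is a sample pasted from the GPT-OSS 20B run, not a live invocation), the eval rate can be pulled out of a saved run log like this:

```shell
# Sketch: extract the eval rate from saved `ollama run --verbose` output.
# Sample log content; real runs print these stats at the end of generation.
cat > sample_run.log <<'EOF'
eval count:            2856 token(s)
eval duration:         20.410517947s
eval rate:             139.93 tokens/s
EOF
# Pull out just the tokens/sec figure
grep 'eval rate' sample_run.log | awk '{print $3}'
# → 139.93
rm sample_run.log
```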
Ollama Installation and Version
All tests originally used Ollama version 0.15.2, the latest release at the time of testing; the benchmark was later re-run on Ollama v0.17.7 to add the Qwen3.5 models. For a complete reference of Ollama commands used in this benchmark, see the Ollama cheatsheet.
As a quick recap, to install Ollama on Linux:
curl -fsSL https://ollama.com/install.sh | sh
Verify installation:
ollama --version
If you need to store models on a different drive due to space constraints, check out how to move Ollama models to a different drive.
Models Tested
The following models were benchmarked, in alphabetical order:
| Model | Parameters | Quantization | Notes |
|---|---|---|---|
| devstral-small-2:24b | 24B | Q4_K_M | Code-focused |
| glm-4.7-flash | 30B | Q4_K_M | Thinking model |
| gpt-oss:20b | 20B | Q4_K_M | Fastest overall |
| gpt-oss:120b | 120B | Q4_K_M | Largest tested |
| ministral-3:14b | 14B | Q4_K_M | Mistral’s efficient model |
| mistral-small3.2:24b | 24B | Q4_K_M | Strong language quality |
| nemotron-3-nano:30b | 30B | Q4_K_M | NVIDIA’s offering |
| qwen3:14b | 14B | Q4_K_M | Best instruction-following |
| qwen3.5:9b | 9B | Q4_K_M | Fast, fully GPU |
| qwen3.5:9b-q8_0 | 9B | Q8_0 | Higher quality, fully GPU |
| qwen3.5:27b | 27B | Q4_K_M | Excellent quality, slow on Ollama |
| qwen3-vl:30b-a3b | 30B | Q4_K_M | Vision-capable |
| qwen3-coder:30b | 30B | Q4_K_M | Code-focused |
| qwen3.5:35b | 35B | Q4_K_M | Good coding capabilities |
To download any model:
ollama pull gpt-oss:20b
ollama pull qwen3:14b
Understanding CPU Offloading
When a model’s memory requirements exceed available VRAM, Ollama automatically distributes model layers between GPU and system RAM. The output shows this as a percentage split like “18%/82% CPU/GPU”.
This has massive performance implications. Each token generation requires data transfer between CPU and GPU memory—a bottleneck that compounds with every layer offloaded to CPU.
The pattern is clear from our results:
- 100% GPU models: 61-140 tokens/sec
- 70-82% GPU models: 19-51 tokens/sec
- 22% GPU (mostly CPU): 12.6 tokens/sec
This explains why a 20B parameter model can outperform a 120B model by 11x in practice. If you’re planning to serve multiple concurrent requests, understanding how Ollama handles parallel requests becomes essential for capacity planning.
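As a rough back-of-envelope (the model size comes from the table above; the 15 GB usable-VRAM figure is an assumption, a 16 GB card minus runtime overhead), the fraction of a model that fits in VRAM roughly predicts the split Ollama reports:

```shell
# Back-of-envelope: what fraction of a model fits in VRAM?
# vram_gb=15 is an assumption (16 GB card minus runtime overhead).
model_gb=20   # e.g. qwen3-coder:30b at Q4_K_M, from the table above
vram_gb=15
awk -v m="$model_gb" -v v="$vram_gb" 'BEGIN {
  pct = (v >= m) ? 100 : v / m * 100
  printf "~%.0f%% of layers on GPU\n", pct
}'
# → ~75% of layers on GPU
```

That lines up with the 25%/75% CPU/GPU split Ollama reports for qwen3-coder:30b, though the real layer assignment also depends on context size and per-layer memory.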
Detailed Benchmark Results
Models Running 100% on GPU
GPT-OSS 20B — The Speed Champion
ollama run gpt-oss:20b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
gpt-oss:20b 14 GB 100% GPU 19000
eval count: 2856 token(s)
eval duration: 20.410517947s
eval rate: 139.93 tokens/s
At 139.93 tokens/sec, GPT-OSS 20B is the clear winner for speed-critical applications. It uses only 14GB of VRAM, leaving headroom for larger context windows or other GPU workloads.
Qwen3 14B — Excellent Balance
ollama run qwen3:14b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
qwen3:14b 12 GB 100% GPU 19000
eval count: 3094 token(s)
eval duration: 50.020594575s
eval rate: 61.85 tokens/s
Qwen3 14B offers the best instruction-following in my experience, with a comfortable 12GB memory footprint. At 61.85 tokens/sec, it’s responsive enough for interactive use.
For developers integrating Qwen3 into applications, see LLM Structured Output with Ollama and Qwen3 for extracting structured JSON responses.
Ministral 3 14B — Fast and Compact
ollama run ministral-3:14b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
ministral-3:14b 13 GB 100% GPU 19000
eval count: 1481 token(s)
eval duration: 21.11734277s
eval rate: 70.13 tokens/s
Mistral’s smaller model delivers 70.13 tokens/sec while fitting entirely in VRAM. A solid choice when you need Mistral-family quality at maximum speed.
qwen3.5:9b - quick and new
ollama run qwen3.5:9b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia
NAME ID SIZE PROCESSOR CONTEXT
qwen3.5:9b 6488c96fa5fa 9.3 GB 100% GPU 19000
eval count: 3802 token(s)
eval duration: 41.830174597s
eval rate: 90.89 tokens/s
qwen3.5:9b-q8_0 - q8 quant
This Q8 quant pushes qwen3.5:9b performance down by about a third compared to Q4.
ollama run qwen3.5:9b-q8_0 --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia
NAME ID SIZE PROCESSOR CONTEXT
qwen3.5:9b-q8_0 441ec31e4d2a 13 GB 100% GPU 19000
eval count: 3526 token(s)
eval duration: 57.595540159s
eval rate: 61.22 tokens/s
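From the two eval rates above (90.89 vs 61.22 tokens/sec), the Q8 slowdown works out to about a third:

```shell
# Relative slowdown of q8_0 versus q4, from the measured eval rates
awk 'BEGIN { printf "%.0f%% slower\n", (1 - 61.22 / 90.89) * 100 }'
# → 33% slower
```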
Models Requiring CPU Offloading
qwen3-coder:30b - fastest of the 30B set, thanks to being text-only
ollama run qwen3-coder:30b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia
NAME ID SIZE PROCESSOR CONTEXT
qwen3-coder:30b 06c1097efce0 20 GB 25%/75% CPU/GPU 19000
eval count: 559 token(s)
eval duration: 9.77768875s
eval rate: 57.17 tokens/s
Qwen3-VL 30B — Best Partially-Offloaded Performance
ollama run qwen3-vl:30b-a3b-instruct --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
qwen3-vl:30b-a3b-instruct 22 GB 30%/70% CPU/GPU 19000
eval count: 1450 token(s)
eval duration: 28.439319709s
eval rate: 50.99 tokens/s
Despite 30% of layers on CPU, Qwen3-VL maintains 50.99 tokens/sec, not far behind the slowest fully-GPU models. The vision capability adds versatility for multimodal tasks.
Mistral Small 3.2 24B — Quality vs Speed Trade-off
ollama run mistral-small3.2:24b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
mistral-small3.2:24b 19 GB 18%/82% CPU/GPU 19000
eval count: 831 token(s)
eval duration: 44.899859038s
eval rate: 18.51 tokens/s
Mistral Small 3.2 offers superior language quality but pays a steep speed penalty. At 18.51 tokens/sec, it feels noticeably slower for interactive chat. Worth it for tasks where quality matters more than latency.
GLM 4.7 Flash — MoE Thinking Model
ollama run glm-4.7-flash --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
glm-4.7-flash 21 GB 27%/73% CPU/GPU 19000
eval count: 2446 token(s)
eval duration: 1m12.239164004s
eval rate: 33.86 tokens/s
GLM 4.7 Flash is a 30B-A3B Mixture of Experts model—30B total parameters with only 3B active per token. As a “thinking” model, it generates internal reasoning before responses. The 33.86 tokens/sec includes both thinking and output tokens. Despite CPU offloading, the MoE architecture keeps it reasonably fast.
qwen3.5:35b - New model with decent self-hosted performance
ollama run qwen3.5:35b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia
NAME ID SIZE PROCESSOR CONTEXT
qwen3.5:35b 4af949f8bdf0 27 GB 43%/57% CPU/GPU 19000
eval count: 3418 token(s)
eval duration: 2m45.458926548s
eval rate: 20.66 tokens/s
GPT-OSS 120B — The Heavy Hitter
ollama run gpt-oss:120b --verbose
/set parameter num_ctx 19000
NAME SIZE PROCESSOR CONTEXT
gpt-oss:120b 66 GB 78%/22% CPU/GPU 19000
eval count: 5008 token(s)
eval duration: 6m36.168233066s
eval rate: 12.64 tokens/s
Running a 120B model on 16GB VRAM is technically possible but painful. With 78% on CPU, the 12.64 tokens/sec makes interactive use frustrating. Better suited for batch processing where latency doesn’t matter.
qwen3.5:27b - Smart but slow on Ollama
ollama run qwen3.5:27b --verbose
/set parameter num_ctx 19000
compare weather and climate between capital cities of australia
NAME ID SIZE PROCESSOR CONTEXT
qwen3.5:27b 193ec05b1e80 24 GB 43%/57% CPU/GPU 19000
eval count: 3370 token(s)
eval duration: 8m40.087510281s
eval rate: 6.48 tokens/s
I have tested qwen3.5:27b with OpenCode and came away with a very good opinion of it: very capable, knowledgeable, with really good tool calling, though it is slow on my machine under Ollama. I have tried other LLM self-hosting platforms and got much higher speeds. I believe it’s time to let Ollama go; I will write about that a bit later.
Practical Recommendations
For Interactive Chat
Use models that fit 100% in VRAM:
- GPT-OSS 20B — Maximum speed (139.93 t/s)
- Ministral 3 14B — Good speed with Mistral quality (70.13 t/s)
- Qwen3 14B — Best instruction-following (61.85 t/s)
For a better chat experience, consider Open-Source Chat UIs for local Ollama.
For Batch Processing
Again, this is on my hardware with 16GB VRAM.
When speed is less critical:
- Mistral Small 3.2 24B — Superior language quality
- Qwen3-VL 30B — Vision + text capability
When speed is not critical at all:
- Qwen3.5:35b - Good coding capabilities
- Qwen3.5:27b - Extremely good, but slow on Ollama. I have had quite a bit of success hosting this model on llama.cpp, though.
For Development and Coding
If you’re building applications with Ollama, the code-focused models in this benchmark are qwen3-coder:30b (57.17 t/s) and devstral-small-2:24b (18.67 t/s); on this hardware, qwen3-coder is by far the faster of the two.
Alternative Hosting Options
If Ollama’s limitations concern you (see Ollama enshittification concerns), explore other options in the Local LLM Hosting Guide or compare Docker Model Runner vs Ollama.
Conclusion
With 16GB VRAM, you can run capable LLMs at impressive speeds—if you choose wisely. The key findings:
- Stay within VRAM limits for interactive use. A 20B model at 140 tokens/sec beats a 120B model at 12 tokens/sec for most practical purposes.
- GPT-OSS 20B wins on pure speed, but Qwen3 14B offers the best balance of speed and capability for instruction-following tasks.
- CPU offloading works, but expect 3-10x slowdowns. Acceptable for batch processing, frustrating for chat.
- Context size matters. The 19K context used here increases VRAM usage significantly. Reduce context for better GPU utilization.
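To get a feel for why the 19K context is expensive, here is an order-of-magnitude KV-cache estimate. The layer count, KV-head count, and head dimension below are illustrative assumptions, not the actual dimensions of any model in the table:

```shell
# Rough fp16 KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * 2 bytes per value * context tokens. All dimensions are illustrative.
awk 'BEGIN {
  layers = 40; kv_heads = 8; head_dim = 128; ctx = 19000
  bytes = 2 * layers * kv_heads * head_dim * 2 * ctx
  printf "%.1f GB\n", bytes / (1024 ^ 3)
}'
# → 2.9 GB
```

A few GB of cache on top of the model weights is enough to push a borderline model off the GPU, which is why halving the context can flip a model from partial offload to 100% GPU.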
For AI-powered search combining local LLMs with web results, see self-hosting Perplexica with Ollama.
To explore more benchmarks, VRAM and throughput trade-offs, and performance tuning across Ollama and other runtimes, check our LLM Performance: Benchmarks, Bottlenecks & Optimization hub.
Useful Links
Internal Resources
- Ollama cheatsheet: Most useful Ollama commands
- How Ollama Handles Parallel Requests
- How Ollama is using Intel CPU Performance and Efficient Cores
- Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More