Comparison: Qwen3:30b vs GPT-OSS:20b
Comparing the speed, parameters, and performance of these two models
Here is a comparison between Qwen3:30b and GPT-OSS:20b, focusing on instruction following, performance, specs, and speed:
Architecture and Parameters
Feature | Qwen3:30b-instruct | GPT-OSS:20b |
---|---|---|
Total Parameters | 30.5 billion | 21 billion |
Activated Parameters | ~3.3 billion | ~3.6 billion |
Number of Layers | 48 | 24 |
MoE Experts per Layer | 128 (8 active per token) | 32 (4 active per token) |
Attention Mechanism | Grouped Query Attention (32Q / 4KV) | Grouped Multi-Query Attention (64Q / 8KV) |
Context Window | 32,768 native; up to 262,144 extended | 128,000 tokens |
Tokenizer | BPE-based, 151,936 vocabulary | GPT-based, ≈ 200k vocabulary |
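The activated-parameter rows above follow from the MoE configuration: only the routed-to experts run per token, plus the shared (attention/embedding) weights. A rough sketch of that arithmetic, where the shared-parameter figures are assumptions chosen to illustrate the split, not official numbers:

```python
# Why a ~30B MoE model activates only ~3.3B parameters per token:
# total = shared weights (attention, embeddings) + expert weights,
# but each token only passes through the active experts.
# shared_b values below are illustrative assumptions.

def active_params(total_b, shared_b, experts, active_experts):
    """Activated parameters (billions) per token for a MoE model."""
    expert_b = total_b - shared_b           # parameters living in experts
    per_expert_b = expert_b / experts       # each expert's share
    return shared_b + per_expert_b * active_experts

# Qwen3:30b-a3b: 128 experts, 8 active; assume ~1.5B shared params
qwen_active = active_params(total_b=30.5, shared_b=1.5, experts=128, active_experts=8)
# GPT-OSS:20b: 32 experts, 4 active; assume ~1.2B shared params
gpt_active = active_params(total_b=21.0, shared_b=1.2, experts=32, active_experts=4)

print(f"Qwen3 active ≈ {qwen_active:.1f}B, GPT-OSS active ≈ {gpt_active:.1f}B")
```

With those assumed shared sizes, the results land near the ~3.3B and ~3.6B figures in the table, which is the point: activated compute scales with active experts, not total parameter count.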
Instruction Following
- Qwen3:30b-instruct is optimized for instruction following with strong human preference alignment. It excels in creative writing, role-playing, multi-turn dialogues, and multilingual instruction following. This variant is fine-tuned specifically to provide more natural, controlled, and engaging responses aligned with user instructions.
- GPT-OSS:20b supports instruction following but is generally rated slightly behind Qwen3:30b-instruct in nuanced instruction tuning. It provides comparable function calling, structured output, and reasoning modes but may lag in conversational alignment and creative dialogue.
Performance and Efficiency
- Qwen3:30b-instruct excels in mathematical reasoning, coding, complex logical tasks, and multilingual scenarios covering 119 languages and dialects. Its “thinking” mode allows enhanced reasoning but comes at higher memory costs.
- GPT-OSS:20b achieves performance comparable to OpenAI’s o3-mini model. It uses fewer layers with fewer, larger experts per layer, and native MXFP4 quantization enables efficient inference on consumer hardware with lower memory requirements (~16 GB, versus more for Qwen3).
- GPT-OSS is approximately 33% more memory efficient and faster on certain hardware setups, especially on consumer GPUs, but Qwen3 often provides better alignment and reasoning depth, especially on complex use cases.
- Qwen3 has a longer available extended context length option (up to 262,144 tokens) compared to GPT-OSS 128,000 tokens, benefiting tasks requiring very long context comprehension.
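The memory gap above can be ballparked from precision alone. MXFP4 stores 4-bit elements plus a shared 8-bit scale per 32-value block (~4.25 bits/weight); the 4.5 bits/weight for a generic 4-bit Qwen3 quant is an assumption for the sketch. KV cache and activations are ignored:

```python
# Back-of-envelope weight memory at different precisions.
# MXFP4 ≈ 4.25 bits/weight (4-bit values + one 8-bit scale per 32-block);
# the 4.5 bits/weight for a generic 4-bit quant is an assumption.

def weight_gb(params_b, bits_per_weight):
    """Approximate weight storage in GB for params_b billion parameters."""
    return params_b * bits_per_weight / 8

print(f"gpt-oss:20b @ MXFP4 : {weight_gb(21.0, 4.25):.1f} GB")
print(f"gpt-oss:20b @ bf16  : {weight_gb(21.0, 16.0):.1f} GB")
print(f"qwen3:30b @ ~4-bit  : {weight_gb(30.5, 4.5):.1f} GB")
```

The ~11 GB MXFP4 estimate is consistent with gpt-oss:20b fitting fully in 16 GB of VRAM in the speed tests later in this post, while the ~17 GB estimate for qwen3:30b matches it spilling partly to CPU.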
Usage Recommendation
- Choose Qwen3:30b-instruct for use cases demanding superior instruction following, creative generation, multilingual support, and complex reasoning.
- Choose GPT-OSS:20b if memory efficiency, inference speed on consumer hardware, and competitive baseline performance with fewer parameters are the priorities.
This comparison highlights Qwen3:30b-instruct as a deeper, more capable model with advanced instruction tuning, while GPT-OSS:20b offers a more compact, efficient alternative with competitive performance on standard benchmarks.
Benchmark scores directly comparing Qwen3:30b-instruct and GPT-OSS:20b on instruction following and key benchmarks (MMLU, LMEval, HumanEval) are not readily available. However, based on published multilingual and multitask benchmark reports:
MMLU (Massive Multitask Language Understanding)
Detailed numbers are hard to find; in broad terms:
- Qwen3 series models, especially at 30B scale and above, demonstrate strong MMLU scores typically exceeding 89%, indicating very competitive knowledge comprehension and reasoning capabilities across 57 diverse domains.
- GPT-OSS:20b also performs well on MMLU benchmarks but usually scores lower than larger Qwen models due to smaller parameter count and less instruction fine-tuning emphasis.
LMEval (Language Model Evaluation Toolkit)
Not much detail is available at the moment:
- Qwen3 models show significant improvement in reasoning and code-related tasks within LMEval, with enhanced scores on logic, math reasoning, and general capabilities.
- GPT-OSS:20b provides robust baseline performance on LMEval but generally lags behind Qwen3:30b-instruct on advanced reasoning and instruction following subtasks.
HumanEval (Code Generation Benchmark)
Not much data here either; in general:
- Qwen3:30b-instruct exhibits strong performance on multilingual code generation benchmarks like HumanEval-XL, supporting over 20 programming languages and providing superior cross-lingual code generation accuracy.
- GPT-OSS:20b, while competitive, performs somewhat lower than Qwen3:30b-instruct in HumanEval benchmarks, especially in multilingual and multi-language programming contexts due to less extensive multilingual training.
Summary Table (approximate trends from the literature):
Benchmark | Qwen3:30b-instruct | GPT-OSS:20b | Notes |
---|---|---|---|
MMLU Accuracy | ~89-91% | ~80-85% | Qwen3 stronger in broad knowledge and reasoning |
LMEval Scores | High, advanced reasoning & code | Moderate, baseline reasoning | Qwen3 excels in math and logic |
HumanEval | High multilingual code gen performance | Moderate | Qwen3 better in cross-lingual code generation |
If exact benchmark numbers are needed, specialized large-scale multilingual benchmarks such as P-MMEval and HumanEval-XL, referenced in recent research papers, provide detailed scores for models including Qwen3 and comparable GPT-OSS variants, but these are not yet available for direct side-by-side retrieval.
Qwen3:30b and GPT-OSS:20b Speed Comparison
On my hardware (16 GB VRAM), running Qwen3:30b and GPT-OSS:20b with a 4,096-token context window produces:
- qwen3:30b-a3b => 45.68 tokens/s
- gpt-oss:20b => 129.52 tokens/s
For comparison, I also tested qwen3:14b and gpt-oss:120b:
- qwen3:14b => 60.12 tokens/s
- gpt-oss:120b => 12.87 tokens/s
With longer context windows, speed will drop further; for qwen3:30b-a3b probably much further. Again, that’s on my PC. Technical details from the verbose output and the allocated memory are below; commands to try:
- ollama run qwen3:30b-a3b --verbose "describe weather difference between state capitals in australia"
- ollama ps (shows memory allocation at the 4K context)
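The reported eval rates are simply eval count divided by eval duration; a quick sanity check on the numbers, with token counts and durations copied from the runs in this post:

```python
# Recompute ollama's "eval rate" from its raw counts and durations.
# Values copied from the verbose runs below (durations in seconds;
# gpt-oss:120b's 3m59.592764501s = 239.592764501s).

def rate(tokens, seconds):
    """Generation speed in tokens per second."""
    return tokens / seconds

runs = {
    "qwen3:30b-a3b": (1188, 26.007424856),
    "gpt-oss:20b":   (2268, 17.510262884),
    "qwen3:14b":     (2214, 36.828268069),
    "gpt-oss:120b":  (3084, 239.592764501),
}
for name, (tok, sec) in runs.items():
    print(f"{name}: {rate(tok, sec):.2f} tokens/s")
```

The recomputed figures match ollama’s reported 45.68, 129.52, 60.12, and 12.87 tokens/s to two decimal places.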
qwen3:30b-a3b
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:30b-a3b 19e422b02313 20 GB 23%/77% CPU/GPU 4096 4 minutes from now
total duration: 28.151133548s
load duration: 1.980696196s
prompt eval count: 16 token(s)
prompt eval duration: 162.58803ms
prompt eval rate: 98.41 tokens/s
eval count: 1188 token(s)
eval duration: 26.007424856s
eval rate: 45.68 tokens/s
qwen3:30b-thinking
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:30b-thinking ad815644918f 20 GB 23%/77% CPU/GPU 4096 4 minutes from now
total duration: 1m8.317354579s
load duration: 1.984986882s
prompt eval count: 18 token(s)
prompt eval duration: 219.657034ms
prompt eval rate: 81.95 tokens/s
eval count: 2722 token(s)
eval duration: 1m6.11230524s
eval rate: 41.17 tokens/s
gpt-oss:20b
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:20b aa4295ac10c3 14 GB 100% GPU 4096 4 minutes from now
total duration: 31.505397616s
load duration: 13.744361948s
prompt eval count: 75 token(s)
prompt eval duration: 249.363069ms
prompt eval rate: 300.77 tokens/s
eval count: 2268 token(s)
eval duration: 17.510262884s
eval rate: 129.52 tokens/s
qwen3:14b
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:14b bdbd181c33f2 10 GB 100% GPU 4096 4 minutes from now
total duration: 36.902729562s
load duration: 38.669074ms
prompt eval count: 18 token(s)
prompt eval duration: 35.321423ms
prompt eval rate: 509.61 tokens/s
eval count: 2214 token(s)
eval duration: 36.828268069s
eval rate: 60.12 tokens/s
gpt-oss:120b
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:120b f7f8e2f8f4e0 65 GB 78%/22% CPU/GPU 4096 2 minutes from now
49GB RAM + 14.4GB VRAM
total duration: 3m59.967272019s
load duration: 76.758783ms
prompt eval count: 75 token(s)
prompt eval duration: 297.312854ms
prompt eval rate: 252.26 tokens/s
eval count: 3084 token(s)
eval duration: 3m59.592764501s
eval rate: 12.87 tokens/s
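The 78%/22% CPU/GPU split that `ollama ps` reports for gpt-oss:120b follows directly from the sizes involved; a quick estimate, taking the 65 GB model size and 14.4 GB of usable VRAM from the output above:

```python
# Why gpt-oss:120b crawls on a 16 GB card: most of the 65 GB model
# must live in system RAM. Sizes taken from the `ollama ps` output.
total_gb = 65.0    # reported model size
vram_gb = 14.4     # VRAM actually used on the 16 GB card

gpu_share = vram_gb / total_gb
print(f"GPU share: {gpu_share:.0%}, CPU share: {1 - gpu_share:.0%}")
```

That ~22% GPU share matches the reported split, and explains the 12.87 tokens/s: generation is bottlenecked by the ~78% of weights streamed from system RAM.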
Qwen3:30b variants
There are three variants of the qwen3:30b model available: qwen3:30b, qwen3:30b-instruct, and qwen3:30b-thinking.
Key Differences & Recommendations
- qwen3:30b-instruct is best for conversations where user instructions, clarity, and natural dialogue are prioritized.
- qwen3:30b is the general foundation, suitable if both instruction following and tool usage are important across diverse tasks.
- qwen3:30b-thinking excels when deep reasoning, mathematics, and coding are the main focus. It outperforms the others in tasks that measure logical/mathematical rigor but isn’t necessarily better for creative writing or casual conversations.
Direct Benchmark Comparison
Model | Reasoning (AIME25) | Coding (LiveCodeBench) | General Knowledge (MMLU Redux) | Speed & Context | Ideal Use Case |
---|---|---|---|---|---|
qwen3:30b | 70.9 | 57.4 | 89.5 | 256K tokens; Fast | General language/agents/multilingual |
qwen3:30b-instruct | N/A (expected to be close to 30b) | N/A | ~Same as 30b | 256K tokens | Instruction following, alignment |
qwen3:30b-thinking | 85.0 | 66.0 | 91.4 | 256K tokens | Math, code, reasoning, long docs |