Comparison: Qwen3:30b vs GPT-OSS:20b

Comparing the speed, parameters, and performance of these two models

Here is a comparison of Qwen3:30b and GPT-OSS:20b, focusing on instruction following, performance, specs, and speed:

Architecture and Parameters

| Feature | Qwen3:30b-instruct | GPT-OSS:20b |
|---|---|---|
| Total parameters | 30.5 billion | 21 billion |
| Activated parameters | ~3.3 billion | ~3.6 billion |
| Number of layers | 48 | 24 |
| MoE experts per layer | 128 (8 active per token) | 32 (4 active per token) |
| Attention mechanism | Grouped Query Attention (32Q/4KV) | Grouped Multi-Query Attention (64Q/8KV) |
| Context window | 32,768 native; up to 262,144 extended | 128,000 tokens |
| Tokenizer | BPE-based, 151,936 vocabulary | GPT-based, ≈200k vocabulary |
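The activated parameter counts fall out of the MoE arithmetic: only a handful of experts run per token, on top of the always-on attention and embedding weights. A rough back-of-the-envelope sketch in Python illustrates this; the ~1.5B shared-parameter figure is an illustrative assumption, not a published number:

```python
# Rough estimate of activated parameters in a sparse MoE model.
# The shared-parameter figure (attention, embeddings, router) is an
# illustrative assumption, not an official number from either model card.

def activated_params(total, shared, experts, active):
    """Parameters touched per token: shared path plus the routed experts."""
    per_expert = (total - shared) / experts   # assume equal-sized experts
    return shared + active * per_expert

# Qwen3:30b-a3b: 30.5B total parameters, 128 experts, 8 active per token.
print(f"{activated_params(30.5e9, 1.5e9, 128, 8) / 1e9:.1f}B")  # -> 3.3B
```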

Instruction Following

  • Qwen3:30b-instruct is optimized for instruction following with strong human preference alignment. It excels in creative writing, role-playing, multi-turn dialogues, and multilingual instruction following. This variant is fine-tuned specifically to provide more natural, controlled, and engaging responses aligned with user instructions.
  • GPT-OSS:20b supports instruction following but is generally rated slightly behind Qwen3:30b-instruct in nuanced instruction tuning. It provides comparable function calling, structured output, and reasoning modes but may lag in conversational alignment and creative dialogue.

Performance and Efficiency

  • Qwen3:30b-instruct excels in mathematical reasoning, coding, complex logical tasks, and multilingual scenarios covering 119 languages and dialects. Its “thinking” mode allows enhanced reasoning but comes at higher memory costs.
  • GPT-OSS:20b achieves performance comparable to OpenAI’s o3-mini model. It uses fewer layers with fewer but larger experts per layer, and its native MXFP4 quantization enables efficient inference on consumer hardware with lower memory requirements (~16 GB, versus more for Qwen3; see the memory sketch after this list).
  • GPT-OSS is approximately 33% more memory efficient and faster on certain hardware setups, especially on consumer GPUs, but Qwen3 often provides better alignment and reasoning depth, especially on complex use cases.
  • Qwen3 has a longer available extended context length option (up to 262,144 tokens) compared to GPT-OSS 128,000 tokens, benefiting tasks requiring very long context comprehension.
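As a rough illustration of that memory gap, here is a small sketch estimating weight-only memory for a 21B-parameter model under a few formats. The bits-per-weight values are approximations (MXFP4 packs 4-bit values plus per-block scales, so its effective rate sits slightly above 4 bits), and KV cache and runtime overhead come on top:

```python
# Weight-only memory estimate for a 21B-parameter model.
# Bits-per-weight values are approximations, not exact format specs.

FORMATS = {"fp16": 16.0, "q8_0": 8.5, "mxfp4": 4.25}

def weight_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, bits in FORMATS.items():
    print(f"gpt-oss:20b @ {name}: {weight_gb(21e9, bits):.1f} GB")
# mxfp4 lands near ~11 GB of weights, which is why the model (plus KV
# cache and overhead) fits comfortably inside a 16 GB GPU.
```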

Usage Recommendation

  • Choose Qwen3:30b-instruct for use cases demanding superior instruction following, creative generation, multilingual support, and complex reasoning.
  • Choose GPT-OSS:20b if memory efficiency, inference speed on consumer hardware, and competitive baseline performance with fewer parameters are the priorities.

This comparison highlights Qwen3:30b-instruct as a deeper, more capable model with advanced instruction tuning, while GPT-OSS:20b offers a more compact, efficient alternative with competitive performance on standard benchmarks.

Benchmark scores directly comparing Qwen3:30b-instruct and GPT-OSS:20b on instruction following and key performance parameters (MMLU, LMEval, HumanEval) are not readily available. However, based on existing published multilingual and multitask benchmark reports:

MMLU (Massive Multitask Language Understanding)

Detailed scores are hard to find; in broad terms:

  • Qwen3 series models, especially at 30B scale and above, demonstrate strong MMLU scores typically exceeding 89%, indicating very competitive knowledge comprehension and reasoning capabilities across 57 diverse domains.
  • GPT-OSS:20b also performs well on MMLU benchmarks but usually scores lower than larger Qwen models due to smaller parameter count and less instruction fine-tuning emphasis.

LMEval (Language Model Evaluation Toolkit)

Not many details are available at the moment:

  • Qwen3 models show significant improvement in reasoning and code-related tasks within LMEval, with enhanced scores on logic, math reasoning, and general capabilities.
  • GPT-OSS:20b provides robust baseline performance on LMEval but generally lags behind Qwen3:30b-instruct on advanced reasoning and instruction following subtasks.

HumanEval (Code Generation Benchmark)

Not much data here either; in broad terms:

  • Qwen3:30b-instruct exhibits strong performance on multilingual code generation benchmarks like HumanEval-XL, supporting over 20 programming languages and providing superior cross-lingual code generation accuracy.
  • GPT-OSS:20b, while competitive, performs somewhat lower than Qwen3:30b-instruct in HumanEval benchmarks, especially in multilingual and multi-language programming contexts due to less extensive multilingual training.

| Benchmark | Qwen3:30b-instruct | GPT-OSS:20b | Notes |
|---|---|---|---|
| MMLU accuracy | ~89-91% | ~80-85% | Qwen3 stronger in broad knowledge and reasoning |
| LMEval scores | High; advanced reasoning & code | Moderate; baseline reasoning | Qwen3 excels in math and logic |
| HumanEval | High multilingual code-gen performance | Moderate | Qwen3 better in cross-lingual code generation |

If exact benchmark numbers are needed, specialized multilingual large-scale benchmarks such as P-MMEval and HumanEval-XL, referenced in recent research papers, provide detailed scores for models including Qwen3 and comparable GPT-OSS variants, but those scores are not yet published in a form that allows direct side-by-side retrieval.

Qwen3:30b and GPT-OSS:20b Speed Comparison

On my hardware (16 GB VRAM) I ran Qwen3:30b and GPT-OSS:20b with a 4K (4,096-token) context window, and they produced:

  • qwen3:30b-a3b => 45.68 tokens/s
  • gpt-oss:20b => 129.52 tokens/s

For comparison, I also tested qwen3:14b and gpt-oss:120b:

  • qwen3:14b => 60.12 tokens/s
  • gpt-oss:120b => 12.87 tokens/s

With longer context windows the speed will drop, in the case of qwen3:30b-a3b probably sharply. Again, that is on my PC. Technical details taken from the verbose output, along with the allocated memory, are shown below; commands to try:

  • ollama run qwen3:30b-a3b --verbose "describe the weather difference between state capitals in Australia"
  • ollama ps to show the memory allocation at the 4K context
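If you want to collect those numbers programmatically instead of reading them off the terminal, a minimal Python sketch along these lines should work; it assumes ollama is on your PATH and relies on the fact that the --verbose stats are printed to stderr:

```python
# Run a prompt through Ollama and extract the generation speed from the
# --verbose stats, which Ollama writes to stderr.
import re
import subprocess

result = subprocess.run(
    ["ollama", "run", "qwen3:30b-a3b", "--verbose",
     "describe the weather difference between state capitals in Australia"],
    capture_output=True, text=True,
)

# Anchor at line start so we match "eval rate" and not "prompt eval rate".
match = re.search(r"^eval rate:\s+([\d.]+) tokens/s",
                  result.stderr, re.MULTILINE)
if match:
    print(f"generation speed: {match.group(1)} tokens/s")
```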

qwen3:30b-a3b

NAME             ID              SIZE     PROCESSOR          CONTEXT    UNTIL
qwen3:30b-a3b    19e422b02313    20 GB    23%/77% CPU/GPU    4096       4 minutes from now
total duration:       28.151133548s
load duration:        1.980696196s
prompt eval count:    16 token(s)
prompt eval duration: 162.58803ms
prompt eval rate:     98.41 tokens/s
eval count:           1188 token(s)
eval duration:        26.007424856s
eval rate:            45.68 tokens/s

qwen3:30b-thinking

NAME                  ID              SIZE     PROCESSOR          CONTEXT    UNTIL
qwen3:30b-thinking    ad815644918f    20 GB    23%/77% CPU/GPU    4096       4 minutes from now
total duration:       1m8.317354579s
load duration:        1.984986882s
prompt eval count:    18 token(s)
prompt eval duration: 219.657034ms
prompt eval rate:     81.95 tokens/s
eval count:           2722 token(s)
eval duration:        1m6.11230524s
eval rate:            41.17 tokens/s

gpt-oss:20b

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    aa4295ac10c3    14 GB    100% GPU     4096       4 minutes from now
total duration:       31.505397616s
load duration:        13.744361948s
prompt eval count:    75 token(s)
prompt eval duration: 249.363069ms
prompt eval rate:     300.77 tokens/s
eval count:           2268 token(s)
eval duration:        17.510262884s
eval rate:            129.52 tokens/s

qwen3:14b

NAME         ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
qwen3:14b    bdbd181c33f2    10 GB    100% GPU     4096       4 minutes from now    
total duration:       36.902729562s
load duration:        38.669074ms
prompt eval count:    18 token(s)
prompt eval duration: 35.321423ms
prompt eval rate:     509.61 tokens/s
eval count:           2214 token(s)
eval duration:        36.828268069s
eval rate:            60.12 tokens/s

gpt-oss:120b

NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:120b    f7f8e2f8f4e0    65 GB    78%/22% CPU/GPU    4096       2 minutes from now
(49 GB RAM + 14.4 GB VRAM)
total duration:       3m59.967272019s
load duration:        76.758783ms
prompt eval count:    75 token(s)
prompt eval duration: 297.312854ms
prompt eval rate:     252.26 tokens/s
eval count:           3084 token(s)
eval duration:        3m59.592764501s
eval rate:            12.87 tokens/s
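As a sanity check, the eval rate Ollama reports is simply the eval count divided by the eval duration; the numbers below are lifted from the outputs above:

```python
# Reproduce Ollama's reported eval rates from the raw counts and durations.
runs = {
    "qwen3:30b-a3b": (1188, 26.007),    # eval count (tokens), eval duration (s)
    "gpt-oss:20b":   (2268, 17.510),
    "qwen3:14b":     (2214, 36.828),
    "gpt-oss:120b":  (3084, 239.593),
}
for model, (tokens, seconds) in runs.items():
    print(f"{model}: {tokens / seconds:.2f} tokens/s")
```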

Qwen3:30b variants

There are three variants of the qwen3:30b model available: qwen3:30b, qwen3:30b-instruct, and qwen3:30b-thinking.

Key Differences & Recommendations

  • qwen3:30b-instruct is best for conversations where user instructions, clarity, and natural dialogue are prioritized.
  • qwen3:30b is the general foundation, suitable if both instruction following and tool usage are important across diverse tasks.
  • qwen3:30b-thinking excels when deep reasoning, mathematics, and coding are the main focus. It outperforms the others in tasks that measure logical/mathematical rigor but isn’t necessarily better for creative writing or casual conversations.

Direct Benchmark Comparison

| Model | Reasoning (AIME25) | Coding (LiveCodeBench) | General Knowledge (MMLU Redux) | Speed & Context | Ideal Use Case |
|---|---|---|---|---|---|
| qwen3:30b | 70.9 | 57.4 | 89.5 | 256K tokens; fast | General language/agents/multilingual |
| qwen3:30b-instruct | N/A (expected close to 30b) | N/A | ~same as 30b | 256K tokens | Instruction following, alignment |
| qwen3:30b-thinking | 85.0 | 66.0 | 91.4 | 256K tokens | Math, code, reasoning, long docs |