Agentic LLM Inference Parameters Reference for Qwen 3.6 and Gemma 4

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).

It sits alongside the broader LLM performance engineering hub and works best with a clear LLM hosting and serving story: throughput and scheduling still dominate when the model is starved, but unstable sampling burns retries and output tokens before the GPU does.

This page consolidates:

  • vendor recommended parameters
  • embedded defaults from GGUF and APIs
  • real-world community findings
  • agentic workflow optimizations

Right now it is focused on:

  • Qwen 3.6 (dense and MoE)
  • Gemma 4 (dense and MoE)

If you run terminal agents such as OpenCode, pair this reference with local LLM behavior in OpenCode so workload-level results and sampler defaults stay aligned.

The goal is simple:

Provide a single place to configure models for agent loops, coding, and multi-step reasoning.


TL;DR Reference Table: All Models (Agentic Defaults)

Model              Mode                temp  top_p  top_k  presence_penalty
Qwen 3.6 27B       thinking (general)  1.0   0.95   20     0.0
Qwen 3.6 27B       coding              0.6   0.95   20     0.0
Qwen 3.6 35B MoE   thinking            1.0   0.95   20     1.5
Qwen 3.6 35B MoE   coding              0.6   0.95   20     0.0
Gemma 4 31B        general             1.0   0.95   64     0.0
Gemma 4 31B        coding              1.2   0.95   65     0.0
Gemma 4 26B MoE    general             1.0   0.95   64     0.0
Gemma 4 26B MoE    coding              1.2   0.95   65     0.0
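The table above can be captured as a small lookup so agents pull a preset instead of hardcoding numbers. The model keys here are illustrative identifiers, not official model IDs; the parameter names match OpenAI-compatible APIs.

```python
# Agentic sampling presets mirroring the reference table above.
# Model keys are illustrative; rename to match your serving stack.
AGENTIC_PRESETS = {
    ("qwen-3.6-27b", "general"):     dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=0.0),
    ("qwen-3.6-27b", "coding"):      dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
    ("qwen-3.6-35b-moe", "general"): dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
    ("qwen-3.6-35b-moe", "coding"):  dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
    ("gemma-4-31b", "general"):      dict(temperature=1.0, top_p=0.95, top_k=64, presence_penalty=0.0),
    ("gemma-4-31b", "coding"):       dict(temperature=1.2, top_p=0.95, top_k=65, presence_penalty=0.0),
    ("gemma-4-26b-moe", "general"):  dict(temperature=1.0, top_p=0.95, top_k=64, presence_penalty=0.0),
    ("gemma-4-26b-moe", "coding"):   dict(temperature=1.2, top_p=0.95, top_k=65, presence_penalty=0.0),
}

def sampling_params(model: str, task: str) -> dict:
    """Return a copy of the preset for (model, task); raises KeyError if unknown."""
    return dict(AGENTIC_PRESETS[(model, task)])
```

Returning a copy keeps callers from mutating the shared preset table between agent steps.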

What “Agentic Inference” Actually Means

Most parameter guides assume:

  • chat
  • single-shot completion
  • human interaction

Agentic systems are different.

They require:

  • multi-step reasoning
  • tool calling
  • consistent outputs
  • low error propagation

This changes tuning priorities.

Core shift

Use case  Priority
Chat      natural language quality
Creative  diversity
Agentic   consistency + reasoning stability

Qwen 3.6 Tuning

Dense vs MoE matters

Qwen is one of the few families where:

MoE requires different penalties

Dense (27B)

  • stable
  • predictable
  • no routing complexity

Recommended:

  • presence_penalty = 0.0

MoE (35B-A3B)

  • expert routing per token
  • risk of repetition loops

Recommended:

  • presence_penalty = 1.5 (general)
  • 0.0 for coding

Why this matters

MoE models can get stuck reusing the same experts.

Presence penalty helps:

  • diversify token paths
  • improve reasoning exploration
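The dense-vs-MoE rule above is easy to encode as a guard in a config layer. A minimal sketch; the function name is mine, not part of any Qwen API:

```python
def qwen_presence_penalty(is_moe: bool, task: str) -> float:
    """Presence penalty per the dense-vs-MoE rule above.

    MoE variants get 1.5 for general/thinking work to break
    expert-routing repetition loops; coding always stays at 0.0
    so tool calls and formatting stay deterministic.
    """
    if task == "coding":
        return 0.0
    return 1.5 if is_moe else 0.0
```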

Qwen Agentic Coding Setup

This is where most people get it wrong.

Correct setup

  • temperature = 0.6
  • top_p = 0.95
  • top_k = 20
  • presence_penalty = 0.0
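The setup above maps directly onto a chat-completions request. The model ID and messages are placeholders; note that top_k is not part of the core OpenAI schema, but servers such as vLLM accept it as an extension field.

```python
import json

# Chat-completions payload for a Qwen coding agent using the setup above.
# Model name and messages are placeholders for your own deployment.
payload = {
    "model": "qwen-3.6-27b",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a coding agent. Respond with tool calls only."},
        {"role": "user", "content": "Rename function foo to bar across the repo."},
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,              # extension field; supported by vLLM and similar servers
    "presence_penalty": 0.0,
}
body = json.dumps(payload)    # ready to POST to an OpenAI-compatible endpoint
```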

Why low temperature works

Coding agents need:

  • deterministic outputs
  • repeatable tool calls
  • stable formatting

Higher temperature:

  • breaks JSON
  • introduces hallucinated APIs
  • increases retries

Gemma 4 Tuning

Gemma behaves differently.

No official defaults

  • model cards are empty
  • configs are implicit
  • real tuning comes from:
    • Google AI Studio
    • GGUF defaults
    • community benchmarks

The Counter-Intuitive Finding

Gemma 4 performs better with higher temperature.

Observed behavior

Temp        Result
0.5         poor reasoning
1.0         stable baseline
1.2 to 1.5  best coding performance

This contradicts standard advice.
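The safest way to confirm this on your own workload is a small temperature sweep. In this sketch, `run_eval` is a stand-in for whatever benchmark you already have (pass rate, tool-call validity, retry count):

```python
def sweep_temperature(run_eval, temps=(0.5, 1.0, 1.2, 1.5), trials=3):
    """Average run_eval(temperature) over several trials per setting.

    run_eval(t) -> float score; plug in your own agent benchmark.
    Returns (best_temperature, {temperature: mean_score}).
    """
    results = {}
    for t in temps:
        scores = [run_eval(t) for _ in range(trials)]
        results[t] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results
```

Multiple trials per setting matter here: a single run at high temperature is noisy by construction.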


Why high temperature works here

Hypothesis:

  • training distribution favors exploration
  • reasoning mode depends on diversity
  • model compensates for lack of explicit chain-of-thought control

Result:

higher temperature improves solution search space


Gemma Agentic Coding Setup

Recommended:

  • temperature = 1.2
  • top_p = 0.95
  • top_k = 65
  • penalties = 0.0
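If you serve Gemma through Ollama, these settings go in the request's "options" object rather than at the top level. The model tag below is a placeholder for whatever you have pulled locally; in Ollama, a repeat_penalty of 1.0 means the penalty is disabled.

```python
import json

# Ollama /api/chat request for a Gemma coding agent with the settings above.
request = {
    "model": "gemma-4:31b",       # placeholder tag for your local pull
    "messages": [
        {"role": "user", "content": "Write a unit test for parse_config()."},
    ],
    "options": {
        "temperature": 1.2,
        "top_p": 0.95,
        "top_k": 65,
        "repeat_penalty": 1.0,    # 1.0 = neutral, per the "penalties = 0.0" rule
    },
    "stream": False,
}
body = json.dumps(request)
```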

Important

Do not blindly apply the traditional "low temperature for code" rule.

Gemma is an exception.


Thinking Mode and Agent Systems

Both Qwen and Gemma support reasoning modes.

Why it matters

Agent loops require:

  • intermediate reasoning
  • error recovery
  • multi-step planning

Practical rule

Always enable thinking mode for:

  • coding agents
  • tool use
  • multi-step tasks
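Qwen3-style models emit their reasoning inside <think>...</think> tags. In an agent loop you typically want to keep that text for logging but strip it before the tool-call parser sees the reply. A minimal sketch, assuming the Qwen tag convention carries over:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate reasoning from the actionable reply.

    Returns (thinking, answer): thinking is the concatenated contents
    of any <think> blocks (useful for logs and debugging), answer is
    the text the tool parser should actually consume.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", raw, re.DOTALL)
    answer = THINK_RE.sub("", raw).strip()
    return "\n".join(t.strip() for t in thoughts), answer
```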

Parameter Strategy by Use Case

Coding agents

  • prioritize determinism
  • minimize penalties
  • stable sampling

Reasoning agents

  • moderate temperature
  • allow exploration
  • preserve structure

Tool calling

  • strict formatting
  • low randomness
  • consistent token patterns

Schema and JSON tooling are orthogonal to logits; combine these sampling rules with structured output patterns for Ollama and Qwen3 so validators see fewer retries.
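A pattern that pairs well with the sampling rules above: on a JSON parse failure, retry with the sampler tightened instead of re-asking blind. Here `call_model` is a stand-in for your inference client, not a real library function:

```python
import json

def json_with_backoff(call_model, prompt, temperature=0.6, max_retries=3):
    """Ask for JSON; on a parse failure, retry with temperature halved.

    call_model(prompt, temperature=...) -> str is a stand-in for your
    client. Returns the parsed object or raises ValueError after
    max_retries failed attempts.
    """
    t = temperature
    for _ in range(max_retries):
        raw = call_model(prompt, temperature=t)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            t = max(t / 2, 0.0)  # tighten sampling before retrying
    raise ValueError("model never produced valid JSON")
```

Server-side structured output (grammars, JSON schema modes) is still the stronger guarantee; this loop is the fallback when you only control the client.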


Vendor Defaults vs Reality

Vendor defaults are:

  • safe
  • generic
  • not optimized

Community findings often show:

  • better performance
  • task-specific tuning
  • architecture-aware adjustments

Example

Gemma:

  • official: no guidance
  • community: high temperature improves coding

Qwen:

  • official: inconsistent sections
  • community: standardized values converge

Practical Deployment Notes

Under concurrency, queueing and memory splits interact with retries as much as sampling does; read how Ollama handles parallel requests alongside the presets above.

Ollama

  • works well for both families
  • verify GPU compatibility
  • defaults may differ from reference

vLLM

  • supports advanced sampling
  • stable for production
  • use explicit parameters

llama.cpp

  • requires explicit sampler ordering (the --samplers flag)
  • pass --jinja so modern chat templates are applied correctly
  • an incorrect sampler chain silently reduces output quality

Key Takeaways

  • there is no universal parameter set
  • architecture matters more than model size
  • agentic systems require different tuning than chat
  • community benchmarks are often ahead of vendors

Final Opinion

Most parameter guides are outdated.

They assume:

  • chat use
  • low temperature for code
  • static configurations

Modern models break those assumptions.

If you are building agentic systems:

treat inference tuning as a first-class system design problem

Not a config file.


Future Direction

This reference will evolve into:

  • per-model deep dives
  • agent-specific configs
  • benchmarking-backed tuning

Because:

inference is where model capability becomes system performance
