Agentic LLM Inference Parameters Reference for Qwen 3.6 and Gemma 4

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).

It sits alongside the broader LLM performance engineering hub and works best with a clear LLM hosting and serving story: throughput and scheduling still dominate when the model is starved, but unstable sampling burns retries and output tokens before the GPU does.

This page consolidates:

  • vendor recommended parameters
  • embedded defaults from GGUF and APIs
  • real-world community findings
  • agentic workflow optimizations

Right now it is focused on:

  • Qwen 3.6 (dense and MoE)
  • Gemma 4 (dense and MoE)

If you run terminal agents such as OpenCode, pair this reference with local LLM behavior in OpenCode so workload-level results and sampler defaults stay aligned.

The goal is simple:

Provide a single place to configure models for agent loops, coding, and multi-step reasoning.


TL;DR Reference Table: All Models (Agentic Defaults)

Model              Mode                temp  top_p  top_k  presence_penalty
Qwen 3.6 27B       thinking (general)  1.0   0.95   20     0.0
Qwen 3.6 27B       coding              0.6   0.95   20     0.0
Qwen 3.6 35B MoE   thinking            1.0   0.95   20     1.5
Qwen 3.6 35B MoE   coding              0.6   0.95   20     0.0
Gemma 4 31B        general             1.0   0.95   64     0.0
Gemma 4 31B        coding              1.2   0.95   65     0.0
Gemma 4 26B MoE    general             1.0   0.95   64     0.0
Gemma 4 26B MoE    coding              1.2   0.95   65     0.0
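The table above can be captured as a small lookup so agents pull a preset instead of hardcoding numbers. The model keys here are illustrative identifiers, not official model IDs; the parameter names match OpenAI-compatible APIs.

```python
# Agentic sampling presets mirroring the reference table above.
# Model keys are illustrative; rename to match your serving stack.
AGENTIC_PRESETS = {
    ("qwen-3.6-27b", "general"):     dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=0.0),
    ("qwen-3.6-27b", "coding"):      dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
    ("qwen-3.6-35b-moe", "general"): dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
    ("qwen-3.6-35b-moe", "coding"):  dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
    ("gemma-4-31b", "general"):      dict(temperature=1.0, top_p=0.95, top_k=64, presence_penalty=0.0),
    ("gemma-4-31b", "coding"):       dict(temperature=1.2, top_p=0.95, top_k=65, presence_penalty=0.0),
    ("gemma-4-26b-moe", "general"):  dict(temperature=1.0, top_p=0.95, top_k=64, presence_penalty=0.0),
    ("gemma-4-26b-moe", "coding"):   dict(temperature=1.2, top_p=0.95, top_k=65, presence_penalty=0.0),
}

def sampling_params(model: str, task: str) -> dict:
    """Return a copy of the preset for (model, task); raises KeyError if unknown."""
    return dict(AGENTIC_PRESETS[(model, task)])
```

Returning a copy keeps callers from mutating the shared preset table between agent steps.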

What “Agentic Inference” Actually Means

Most parameter guides assume:

  • chat
  • single-shot completion
  • human interaction

Agentic systems are different.

They require:

  • multi-step reasoning
  • tool calling
  • consistent outputs
  • low error propagation

This changes tuning priorities.

Core shift

Use case  Priority
Chat      natural language quality
Creative  diversity
Agentic   consistency + reasoning stability

Qwen 3.6 Tuning

Dense vs MoE matters

Qwen is one of the few families where:

MoE requires different penalties

Dense (27B)

  • stable
  • predictable
  • no routing complexity

Recommended:

  • presence_penalty = 0.0

MoE (35B-A3B)

  • expert routing per token
  • risk of repetition loops

Recommended:

  • presence_penalty = 1.5 (general)
  • 0.0 for coding

Why this matters

MoE models can get stuck reusing the same experts.

Presence penalty helps:

  • diversify token paths
  • improve reasoning exploration
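The dense-vs-MoE rule above is easy to encode as a guard in a config layer. A minimal sketch; the function name is mine, not part of any Qwen API:

```python
def qwen_presence_penalty(is_moe: bool, task: str) -> float:
    """Presence penalty per the dense-vs-MoE rule above.

    MoE variants get 1.5 for general/thinking work to break
    expert-routing repetition loops; coding always stays at 0.0
    so tool calls and formatting stay deterministic.
    """
    if task == "coding":
        return 0.0
    return 1.5 if is_moe else 0.0
```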

Qwen Agentic Coding Setup

This is where most people get it wrong.

Correct setup

  • temperature = 0.6
  • top_p = 0.95
  • top_k = 20
  • presence_penalty = 0.0
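The setup above maps directly onto a chat-completions request. The model ID and messages are placeholders; note that top_k is not part of the core OpenAI schema, but servers such as vLLM accept it as an extension field.

```python
import json

# Chat-completions payload for a Qwen coding agent using the setup above.
# Model name and messages are placeholders for your own deployment.
payload = {
    "model": "qwen-3.6-27b",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a coding agent. Respond with tool calls only."},
        {"role": "user", "content": "Rename function foo to bar across the repo."},
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,              # extension field; supported by vLLM and similar servers
    "presence_penalty": 0.0,
}
body = json.dumps(payload)    # ready to POST to an OpenAI-compatible endpoint
```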

Why low temperature works

Coding agents need:

  • deterministic outputs
  • repeatable tool calls
  • stable formatting

Higher temperature:

  • breaks JSON
  • introduces hallucinated APIs
  • increases retries

Gemma 4 Tuning

Gemma behaves differently.

No official defaults

  • model cards are empty
  • configs are implicit
  • real tuning comes from:
    • Google AI Studio
    • GGUF defaults
    • community benchmarks

The Counter-Intuitive Finding

Gemma 4 performs better with higher temperature.

Observed behavior

Temp        Result
0.5         poor reasoning
1.0         stable baseline
1.2 to 1.5  best coding performance

This contradicts standard advice.
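The safest way to confirm this on your own workload is a small temperature sweep. In this sketch, `run_eval` is a stand-in for whatever benchmark you already have (pass rate, tool-call validity, retry count):

```python
def sweep_temperature(run_eval, temps=(0.5, 1.0, 1.2, 1.5), trials=3):
    """Average run_eval(temperature) over several trials per setting.

    run_eval(t) -> float score; plug in your own agent benchmark.
    Returns (best_temperature, {temperature: mean_score}).
    """
    results = {}
    for t in temps:
        scores = [run_eval(t) for _ in range(trials)]
        results[t] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results
```

Multiple trials per setting matter here: a single run at high temperature is noisy by construction.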


Why high temperature works here

Hypothesis:

  • training distribution favors exploration
  • reasoning mode depends on diversity
  • model compensates for lack of explicit chain-of-thought control

Result:

higher temperature improves solution search space


Gemma Agentic Coding Setup

Recommended:

  • temperature = 1.2
  • top_p = 0.95
  • top_k = 65
  • penalties = 0.0
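If you serve Gemma through Ollama, these settings go in the request's "options" object rather than at the top level. The model tag below is a placeholder for whatever you have pulled locally; in Ollama, a repeat_penalty of 1.0 means the penalty is disabled.

```python
import json

# Ollama /api/chat request for a Gemma coding agent with the settings above.
request = {
    "model": "gemma-4:31b",       # placeholder tag for your local pull
    "messages": [
        {"role": "user", "content": "Write a unit test for parse_config()."},
    ],
    "options": {
        "temperature": 1.2,
        "top_p": 0.95,
        "top_k": 65,
        "repeat_penalty": 1.0,    # 1.0 = neutral, per the "penalties = 0.0" rule
    },
    "stream": False,
}
body = json.dumps(request)
```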

Important

Do not blindly apply the traditional "low temperature for code" rule.

Gemma is an exception.


Thinking Mode and Agent Systems

Both Qwen and Gemma support reasoning modes.

Why it matters

Agent loops require:

  • intermediate reasoning
  • error recovery
  • multi-step planning

Practical rule

Always enable thinking mode for:

  • coding agents
  • tool use
  • multi-step tasks
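Qwen3-style models emit their reasoning inside <think>...</think> tags. In an agent loop you typically want to keep that text for logging but strip it before the tool-call parser sees the reply. A minimal sketch, assuming the Qwen tag convention carries over:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate reasoning from the actionable reply.

    Returns (thinking, answer): thinking is the concatenated contents
    of any <think> blocks (useful for logs and debugging), answer is
    the text the tool parser should actually consume.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", raw, re.DOTALL)
    answer = THINK_RE.sub("", raw).strip()
    return "\n".join(t.strip() for t in thoughts), answer
```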

Parameter Strategy by Use Case

Coding agents

  • prioritize determinism
  • minimize penalties
  • stable sampling

Reasoning agents

  • moderate temperature
  • allow exploration
  • preserve structure

Tool calling

  • strict formatting
  • low randomness
  • consistent token patterns

Schema and JSON tooling are orthogonal to logits; combine these sampling rules with structured output patterns for Ollama and Qwen3 so validators see fewer retries.
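A pattern that pairs well with the sampling rules above: on a JSON parse failure, retry with the sampler tightened instead of re-asking blind. Here `call_model` is a stand-in for your inference client, not a real library function:

```python
import json

def json_with_backoff(call_model, prompt, temperature=0.6, max_retries=3):
    """Ask for JSON; on a parse failure, retry with temperature halved.

    call_model(prompt, temperature=...) -> str is a stand-in for your
    client. Returns the parsed object or raises ValueError after
    max_retries failed attempts.
    """
    t = temperature
    for _ in range(max_retries):
        raw = call_model(prompt, temperature=t)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            t = max(t / 2, 0.0)  # tighten sampling before retrying
    raise ValueError("model never produced valid JSON")
```

Server-side structured output (grammars, JSON schema modes) is still the stronger guarantee; this loop is the fallback when you only control the client.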


Vendor Defaults vs Reality

Vendor defaults are:

  • safe
  • generic
  • not optimized

Community findings often show:

  • better performance
  • task-specific tuning
  • architecture-aware adjustments

Example

Gemma:

  • official: no guidance
  • community: high temperature improves coding

Qwen:

  • official: inconsistent sections
  • community: standardized values converge

Practical Deployment Notes

Under concurrency, queueing and memory splits interact with retries as much as sampling does; read how Ollama handles parallel requests alongside the presets above.

Ollama

  • works well for both families
  • verify GPU compatibility
  • defaults may differ from reference

vLLM

  • supports advanced sampling
  • stable for production
  • use explicit parameters

llama.cpp

  • requires explicit sampler ordering (the --samplers flag)
  • pass --jinja so modern chat templates are applied correctly
  • an incorrect sampler chain silently reduces output quality

Key Takeaways

  • there is no universal parameter set
  • architecture matters more than model size
  • agentic systems require different tuning than chat
  • community benchmarks are often ahead of vendors

Final Opinion

Most parameter guides are outdated.

They assume:

  • chat use
  • low temperature for code
  • static configurations

Modern models break those assumptions.

If you are building agentic systems:

treat inference tuning as a first-class system design problem

Not a config file.


Future Direction

This reference will evolve into:

  • per-model deep dives
  • agent-specific configs
  • benchmarking-backed tuning

Because:

inference is where model capability becomes system performance
