Does Oh My Opencode actually improve coding results compared to vanilla OpenCode?

It depends on the task. A community benchmark across 120 plus agent and model combinations found OMO Ultrawork scored 69.2 percent versus 73.1 percent for plain OpenCode on a standard coding eval, while taking 10 minutes longer and making 3.5 times more requests. The overhead pays off on large multi-file tasks where parallelism matters, but adds cost and complexity without benefit for single-file edits or tasks that fit in one context window.

What is the Ultrawork keyword and when should I use it?

Ultrawork or ulw is the trigger that activates full multi-agent orchestration in Oh My Opencode. Without it, Sisyphus often skips delegation and just reads files directly. Use it for large refactors, full-stack features, and complex multi-step tasks. Skip it for anything a single focused prompt would handle in under 30 minutes.

How much does Oh My Opencode add to token costs at startup?

A lot more than you might expect. OMO injects many tools and MCPs into the context window on startup, so even a simple Hello World message consumes 15000 to 25000 tokens before any work happens. Factor this into pricing estimates, especially if you are on a per-token plan.

Is there a risk of runaway billing with Oh My Opencode?

Yes, and it has happened to real users. A confirmed March 2026 bug let a Gemini subagent run 809 tool-call turns over 3.5 hours with no circuit breaker, producing a 438 dollar bill where only 276 was displayed in OpenCode. The root causes were silent category-based model routing, no per-subagent step limit, and a missing cost tier in the Gemini pricing snapshot. A fix is in progress. Until it ships, cap your google provider concurrency to 1.

Does OMO work well without a Claude subscription?

Partially. Sisyphus and the planning agents are tuned for Claude-family models and their performance degrades without one. However the Kimi K2.5 paid tier at under a dollar per month and OpenCode Go at ten dollars per month provide Claude-class orchestration quality at low cost and cover most of the fallback chain. Hephaestus requires GPT-5.3 Codex regardless.

What is a lighter alternative to full Oh My Opencode?

Two community options exist. Oh My Opencode Slim trims the feature set while keeping the key hooks and background execution. Oh My Pi focuses on background task execution and hooks with a minimal prompt surface. Both score better than full OMO on simpler tasks that do not need the full agent team.

Oh My Opencode Review: Honest Results, Billing Risks, and When It's Worth It

What actually happens when you run Ultrawork.

Page content

Oh My Opencode promises a “virtual AI dev team” — Sisyphus orchestrating specialists, tasks running in parallel, and the magic ultrawork keyword activating all of it.

That promise holds up for the right workload. For the wrong one, it adds cost and complexity without improving results. This article covers what actually happened in hands-on testing, and what the broader community has found after months of real usage.

lazy agents working in parallel

If you are new to the stack,

start with the OpenCode quickstart to get the base agent running,
then the Oh My Opencode quickstart
and then the specialised agents deep dive

This article assumes you are already familiar with the system and want to know whether it actually performs. For the wider picture of how OMO fits into the AI coding toolchain, see the AI developer tools overview.

Oh My Opencode Performance: Hands-on Testing Results

I ran the same tests I use for bare OpenCode — the same LLM benchmark tasks I ran against OpenCode with local Ollama and llama.cpp models. This time on the Big Pickle model (GLM 4.6 via OpenCode Zen — the free tier).

The short version: mixed results, and the failure mode was instructive.

Why Sisyphus Skips Delegation Without an Explicit Ultrawork Prompt

The first thing I hit was that Sisyphus did not behave like an orchestrator without explicit prompting. Even with the ulw prefix, he started doing all the work himself — reading files directly, writing code, skipping the research and planning phases entirely. No delegation to Oracle, no parallel Explore runs, no Prometheus interview.

To actually trigger orchestration I had to be explicit in the prompt:

ulw research the codebase first,
create a detailed plan,
delegate implementation tasks,
and only do direct work yourself when delegation is unreasonable

Once I did that, the system behaved as advertised. But the default behaviour without that framing was vanilla OpenCode with a heavier context window. The ultrawork keyword alone is not a guarantee of delegation — it is a prerequisite for it.

Test 1: IndexNow CLI Tool — Where Oh My Opencode Orchestration Pays Off

The task was to build a command-line tool that submits URLs to the IndexNow API for search engine indexing. This is a reasonably scoped multi-step task: read the API spec, design the CLI, implement, test.

With Oh My Opencode and explicit orchestration prompting, the result was noticeably better than bare OpenCode. The final tool correctly extracted URLs from sitemap.xml and sent them in batches, in addition to supporting standard CLI options. The Librarian agent pulling the IndexNow documentation contributed context that the bare-model run missed.

oh my opencode indexnow session log

This is the kind of task OMO is built for: research + implementation + some design decisions. The overhead paid off here.

Test 2: Page Migration Mapping — Same Result, Triple the Tokens

The second test was a document analysis task: sort out where website pages should be migrated based on their slugs and a migration policy document.

The result was identical to the bare OpenCode run — same accuracy, same structure, same decisions. But the OMO run consumed approximately three times the tokens.

This is a single-context-window reasoning task. There is no parallelism to exploit, no specialist knowledge to pull in from a subagent. The orchestration layer added overhead without adding value. For this class of task, vanilla OpenCode is the better choice.

Model Selection: Why Weaker Models Struggle with Orchestration Overhead

Running on Big Pickle (a free-tier model) exposed something the community has also noticed: weaker models are more sensitive to harness quality than strong models. Claude Opus is resilient enough to produce good output even with a heavy orchestration layer around it. A smaller model like Big Pickle benefits more from a clean, focused prompt than from a complex multi-agent setup that adds noise to the context.

If you are running OMO on budget models, start simpler. Use orchestration for research-heavy tasks where Librarian and Explore genuinely add information. Avoid it for pure reasoning tasks where the model just needs clear input and space to think.

Oh My Opencode Community Findings: Benchmarks, Billing, and Real-World Caveats

Before going all-in, it is worth knowing what the community has found through real usage — both the wins and the failure modes.

Oh My Opencode Benchmark: 69% vs 73% Pass Rate, 3× More Requests Than Vanilla OpenCode

One community member ran a systematic benchmark across 120+ agent/model combinations and published the results. With OMO Ultrawork enabled, the pass rate on their coding eval was 69.2% — against 73.1% for plain OpenCode without OMO. The OMO run took 10 minutes longer (55 vs 45 minutes) and made 96 requests instead of 27.

Their conclusion: “it’s literally just Opus with more steps.” Opus is particularly resilient to harness differences — it delivers strong results regardless of what is around it. Weaker models showed more sensitivity to the harness, but not necessarily in OMO’s favour.

This does not mean OMO is useless. For large multi-file tasks, parallel background agents, and complex engineering workflows, the overhead pays off. But for coding tasks that fit in a single context window, vanilla OpenCode with a good model will often match or beat a full orchestration stack.

Startup Token Overhead: 15,000–25,000 Tokens Before Any Work Begins

A recurring community complaint is that OMO injects many tools and MCPs at startup. A simple “Hello world” message can consume 15,000–25,000 tokens just from the context window setup before any actual work happens. The maintainer is aware of this and working on deferred tool loading, but as of early 2026 it is a real cost to factor into pricing estimates.

The $350 Gemini infinite loop — and what to do about it

In March 2026, a confirmed bug (issue #2571, labelled will-fix) caused a user to be billed ~$438 in a single afternoon. Three separate problems compounded:

No circuit breaker: OMO had no max-step limit per subagent. A Gemini model got stuck in a git diff / read verification loop and ran 809 consecutive turns over 3.5 hours without stopping.
Silent model routing: The user’s parent session was on GPT-5.4. But a delegated Compose UI task was routed to Gemini 3.1 Pro via category routing (visual-engineering) — a model the user never intentionally selected and had no visibility into until digging through the SQLite database after the fact.
Incorrect cost display: OpenCode’s pricing snapshot for gemini-3.1-pro-preview was missing the >200K context pricing tier. Google charges 2× for contexts over 200K tokens, but OpenCode calculated everything at the base rate. The displayed cost was less than half the actual Google bill.

A community comment summed it up: “Gemini constantly spins into loops for me, hence I rarely use it as a non-read-only model.”

A fix (PR #2590) is in progress — adding a configurable max step limit and loop detection. Until it ships, protect yourself:

Audit your category config. If any category maps to Gemini (including visual-engineering by default), every background task in that category uses Gemini — silently — even when your foreground session is on a different model.
Set explicit providerConcurrency limits for Gemini in background_task. Keeping it at 1 limits the blast radius.
Check your actual provider billing dashboards, not just OpenCode’s displayed cost, especially for Gemini.

{
  "background_task": {
    "providerConcurrency": {
      "google": 1   // hard cap until the circuit breaker ships
    }
  }
}

Ultrawork Delegation Requires the Keyword — It Is Not Automatic

Early users found that without ultrawork, the main agent often does not delegate to specialist subagents at all — it just starts calling read and grep directly. The maintainer acknowledged this early on: “it seems difficult to achieve calling the appropriate agent solely through system prompt instructions without explicit prompts.”

The ultrawork keyword is what reliably triggers orchestration. Without it, you are often running vanilla OpenCode with a heavier context window. This matches what I found in my own testing.

Lighter Alternatives to OMO: OMO Slim and Oh-My-Pi

If you want the background execution hooks and key OMO improvements without the full agent orchestration overhead, oh-my-opencode-slim is a community fork that trims the feature set. Some users have also moved to oh-my-pi, which focuses on background task execution and hooks while keeping the prompt surface minimal.

One user put it well: “I like OMO for its hooks and background task execution. I think OMO slim is trying to optimise the wrong things. The base OpenCode prompts plus background workers and hooks that auto-prompt the model to continue when it decides to stop would be enough for me.”

The right choice depends on your task profile. Large, multi-step engineering work justifies full OMO. For daily shorter tasks, a leaner harness often produces better results with less cost.

When Oh My Opencode Genuinely Outperforms: Real User Results

To balance the caveats, here is what users report as OMO’s clearest wins:

8,000 ESLint warnings cleared in a day — parallel Explore agents scanning the codebase while worker agents execute fixes simultaneously
45k-line Tauri app converted to a SaaS web app overnight — Prometheus interview mode produced a detailed plan, Ralph Loop executed it start to finish
Full-stack features implemented end-to-end without the user touching the keyboard beyond the initial prompt
Architecture reviews on inherited codebases — Oracle’s read-only advisory role surfaces problems without accidentally making changes

The common thread: tasks that benefit from parallelism and have clear acceptance criteria that Prometheus can verify. Tasks where a single focused model would do fine get little benefit from the orchestration overhead.

Oh My Opencode vs Vanilla OpenCode: When to Use Each

Oh My Opencode is not universally better than vanilla OpenCode. It is a power multiplier for a specific class of work — and overhead for everything else.

Use the full OMO stack (with ultrawork) when:

The task spans multiple files and layers
You need research, planning, implementation, and verification as distinct phases
You benefit from parallel background agents (Explore, Librarian) gathering context while workers run
The scope is large enough that you would otherwise babysit the agent through multiple sequential prompts

Use vanilla OpenCode (or a leaner harness) when:

The task fits in a single context window
The problem is pure reasoning without external research
You are running budget models — they benefit more from a clean focused prompt than from orchestration complexity
You want predictable billing without category-routing surprises

The billing risk is real and underreported. Until the circuit breaker lands, treat OMO with Gemini as requiring active monitoring — not as a “fire and forget” system. For everything else, the system is genuinely impressive when pointed at the right problem.

Sources

Community discussions and issues referenced in this article:

r/opencodeCLI — Oh my opencode vs GSD vs others vs Claude CLI vs Kilo — user comparison of orchestration tools and OMO’s real-world feel vs alternatives
r/opencodeCLI — I tested Opencode on 9 MCP tools, Firecrawl Skills + CLI and Oh My Opencode — systematic benchmark of 120+ agent/model combinations, including the 69.2% vs 73.1% pass rate data
GitHub issue #2571 — Subagent burned $350 in 3.5 hours with ZERO safeguard — detailed incident report: silent model routing, infinite Gemini loop, cost display showing half the real bill
GitHub issue #37 — Share your opinion — maintainer-opened feedback thread, early delegation problems, community use cases
GitHub issue #74 — Memory system opinions — token overhead discussion, AGENTS.md workflow patterns, community memory system proposals
oh-my-opencode-slim — community fork with trimmed feature set for lighter use cases
oh-my-pi — alternative harness focused on background execution and hooks with minimal prompt surface