Benchmark

CLI agent orchestrators: 10-task reproducible eval, May 2026

Name: cli-agent-orchestrators benchmark suite cli-agent-orchestrators-2026-05
Creator: Alex Chernysh
Published: 2026-05-19
License: https://www.apache.org/licenses/LICENSE-2.0
Keywords: cli agent orchestrator, multi-agent benchmark, reproducible eval, bernstein, claude squad, conductor, composio agent-orchestrator, opencode

Five CLI agent orchestrators, ten tasks, three trials each, one worker model. Operator-scored against fixed acceptance checks. Bernstein wins 6 of 10 tasks (a 60% winrate) and loses 4. The page below renders the per-task scores so the claim is auditable against the source data.

Suite id: cli-agent-orchestrators-2026-05. Published 2026-05-19. Methodology and repro script: /benchmarks/cli-agent-orchestrators/methodology. The honesty gate (Bernstein winrate below 80%, at least 2 tasks lost) is enforced in CI via tests/benchmarks-honesty.test.ts.

Winrates

A tool wins a task when its score equals the highest score across all tools that ran the task. Ties count as a win for every tied tool, which is why winrates can sum above 1.0. Total tasks: 10.

Tool	Shape	Wins (out of 10)	Winrate
Bernstein v2.0.0	Python scheduler that spawns CLI agents in parallel git worktrees, runs quality gates between merges, signs every step into an HMAC chain.	6	60%
Claude Squad v0.4.2	Terminal multiplexer wrapper that runs Claude Code or Aider sessions side by side in tmux panes.	2	20%
Conductor v2026.04	Mac desktop app that spawns parallel Claude Code sessions, each in its own git worktree.	1	10%
Composio agent-orchestrator v0.11.0	TypeScript orchestrator with a tool marketplace; runs agents against Composio-hosted tool servers.	0	0%
OpenCode v0.5.x	Terminal coding agent with built-in session UI; not a parallel orchestrator by design.	4	40%

Tools under test

Bernstein v2.0.0 (deterministic multi-agent orchestrator). Python scheduler that spawns CLI agents in parallel git worktrees, runs quality gates between merges, signs every step into an HMAC chain.
Claude Squad v0.4.2 (tmux-based agent supervisor). Terminal multiplexer wrapper that runs Claude Code or Aider sessions side by side in tmux panes.
Conductor v2026.04 (commercial parallel agent runner). Mac desktop app that spawns parallel Claude Code sessions, each in its own git worktree.
Composio agent-orchestrator v0.11.0 (cloud-first agent orchestrator). TypeScript orchestrator with a tool marketplace; runs agents against Composio-hosted tool servers.
OpenCode v0.5.x (TUI coding agent). Terminal coding agent with built-in session UI; not a parallel orchestrator by design.

Scoring rubric

Each task is scored 0 to 3 against a fixed set of acceptance checks:

0 = tool refused or produced unusable output
1 = partial output, manual rework required
2 = passed acceptance checks with some loss (one expected behaviour missing or one extra round-trip required)
3 = passed all acceptance checks on first scheduled run

Winrate is the number of tasks on which a tool's score equals or beats every other tool's score, divided by total tasks. Ties count as a win for every tied tool; that is why winrates can sum above 1.0.

Tie rule. On a tie at the top score, every tied tool counts as a winner. The page renders a 'tie' marker in the results column.

Worker model. Every orchestrator runs Claude Sonnet 4.5 as the primary worker model where the orchestrator exposes that choice. Where the orchestrator hard-codes a different model (or a different default), the table lists the substituted model and the result is marked with a 'model-mismatch' caveat.

Trials. 3 per task per tool, aggregated by the median score across the 3 trials. Each trial uses a fresh worktree, a cold model cache, and the same seed prompt.
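
The median aggregation described above can be sketched in a few lines of Python. This is an illustration only, not the harness code; the function name `aggregate_trials` and the example trial scores are made up for this sketch.

```python
from statistics import median

def aggregate_trials(trial_scores):
    """Collapse the three per-trial scores (each 0 to 3) into one task score."""
    if len(trial_scores) != 3:
        raise ValueError("expected exactly 3 trials per task per tool")
    if any(s not in (0, 1, 2, 3) for s in trial_scores):
        raise ValueError("scores must be integers from 0 to 3")
    # With an odd number of integer scores the median is the middle value,
    # so the result is always one of the rubric levels.
    return int(median(trial_scores))

# Hypothetical trial scores, not taken from the published data.
stable = aggregate_trials([3, 3, 3])
one_bad_trial = aggregate_trials([3, 0, 3])
one_good_trial = aggregate_trials([1, 3, 1])
```

With three trials, the median means a single outlier trial in either direction does not move the reported score.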

Environment. Ubuntu 24.04, 16 vCPU, 32 GB RAM, GitHub-hosted runner. No external network beyond the model provider HTTPS endpoint and GitHub.

Seed prompt hash. sha256:cd2051f80ab2833d67e9ed486c030402113d9308dcc7ac40fb3860399991b69f
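
A reader who has the seed prompt material can compare its digest with the published value. The sketch below assumes the hash covers a single byte string; exactly which bytes are hashed (file, encoding, concatenation order) is defined by the methodology and is not stated here, so treat that part as an assumption. The helper names are hypothetical.

```python
import hashlib
import hmac

# Published value, copied from the line above.
PUBLISHED = "sha256:cd2051f80ab2833d67e9ed486c030402113d9308dcc7ac40fb3860399991b69f"

def digest_label(data: bytes) -> str:
    """Return the SHA-256 digest of data in the 'sha256:<hex>' form used on this page."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def matches_published(data: bytes, published: str = PUBLISHED) -> bool:
    """True when the digest of data equals the published label."""
    return hmac.compare_digest(digest_label(data), published)

# Usage (assumption: the seed prompts live in one local file you obtained
# from the repro script; the path is a placeholder):
# with open("seed_prompts.txt", "rb") as f:
#     print(matches_published(f.read()))
```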

Per-task results

Task	Category	Bernstein	Claude Squad	Conductor	Composio agent-orchestrator	OpenCode
Refactor three independent modules in parallel and merge only what passes tests	parallel orchestration	3 (best)	1	3 (best)	2	0
Test-first refactor: write a failing regression test, then refactor until it passes	single-task workflow	2	3 (best)	2	2	3 (best)
Add a Rust binding to a Python package and keep the test matrix green	multi-language coordination	3 (best)	2	2	1	1
Reproduce a 4-step run after the fact from the orchestrator's own log	audit and replay	3 (best)	0	1	2	0
Run a 12-step plan and keep total spend under a fixed budget	cost management	3 (best)	0	1	2	1
Triage a flaky pytest case and decide whether to fix or quarantine	decision under uncertainty	2	3 (best)	2	1	3 (best)
Resolve five inline review comments on a PR without rewriting unrelated lines	scoped editing	2	2	2	1	3 (best)
Detect that a config change to one package breaks tests in three other packages	cross-package reasoning	3 (best)	1	2	2	1
Resume a 6-step plan after killing the orchestrator at step 3	resilience	3 (best)	1	1	2	0
Run a 3-step plan against a locally-served model with no outbound DNS	air-gapped execution	2	2	0	0	3 (best)
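
The winrates can be recomputed from the scores in this table. The following Python sketch is an illustration, not the page's build code; the names `TOOLS`, `SCORES`, and `count_wins` are made up for the example. It applies the stated rule that every tool whose score equals the task's top score gets a win.

```python
# Scores copied from the per-task results table, one row per task.
# Column order: Bernstein, Claude Squad, Conductor, Composio agent-orchestrator, OpenCode.
TOOLS = ["Bernstein", "Claude Squad", "Conductor", "Composio agent-orchestrator", "OpenCode"]
SCORES = [
    [3, 1, 3, 2, 0],  # parallel orchestration
    [2, 3, 2, 2, 3],  # single-task workflow
    [3, 2, 2, 1, 1],  # multi-language coordination
    [3, 0, 1, 2, 0],  # audit and replay
    [3, 0, 1, 2, 1],  # cost management
    [2, 3, 2, 1, 3],  # decision under uncertainty
    [2, 2, 2, 1, 3],  # scoped editing
    [3, 1, 2, 2, 1],  # cross-package reasoning
    [3, 1, 1, 2, 0],  # resilience
    [2, 2, 0, 0, 3],  # air-gapped execution
]

def count_wins(scores):
    """Count, per tool, the tasks where its score equals the top score (ties all win)."""
    wins = [0] * len(scores[0])
    for row in scores:
        top = max(row)
        for i, score in enumerate(row):
            if score == top:
                wins[i] += 1
    return wins

wins = count_wins(SCORES)
winrates = [w / len(SCORES) for w in wins]
for tool, w, rate in zip(TOOLS, wins, winrates):
    print(f"{tool}: {w}/{len(SCORES)} = {rate:.0%}")
```

Because two tasks end in a two-way tie and one task is shared between Bernstein and Conductor, the win counts add up to 13 across 10 tasks.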

Task-by-task narrative

Refactor three independent modules in parallel and merge only what passes tests

Category: parallel orchestration.

Acceptance checks:

all three modules run concurrently (wall time near max of single-module time, not sum)
failed module is held back from the merged branch
passing modules merged into one final commit per module

Scores (0 to 3):

Bernstein: 3 (best)
Claude Squad: 1
Conductor: 3 (best)
Composio agent-orchestrator: 2
OpenCode: 0

Notes: Conductor matched Bernstein on wall time and merge isolation. Claude Squad ran the three modules in tmux panes but had no automatic gate, so the failing module polluted the working tree. OpenCode is single-session and not designed for this shape.
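
The first acceptance check for this task (wall time near the maximum single-module time, not the sum) can be expressed as a simple comparison. The sketch below is one possible reading; the page does not quantify "near", so the 25% tolerance, the function name, and the timings are all assumptions made for illustration.

```python
def ran_concurrently(wall_seconds, module_seconds, tolerance=0.25):
    """True when total wall time is within `tolerance` of the slowest single module.

    A sequential run would land near sum(module_seconds) instead.
    """
    slowest = max(module_seconds)
    return wall_seconds <= slowest * (1 + tolerance)

# Hypothetical timings in seconds for three modules.
modules = [100, 120, 90]
parallel_ok = ran_concurrently(130, modules)    # close to max (120)
sequential = ran_concurrently(310, modules)     # close to sum (310)
```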

Test-first refactor: write a failing regression test, then refactor until it passes

Category: single-task workflow.

Acceptance checks:

test file added in first iteration
test fails on the original code
refactor lands and the test passes

Scores (0 to 3):

Bernstein: 2
Claude Squad: 3 (best)
Conductor: 2
Composio agent-orchestrator: 2
OpenCode: 3 (best)

Notes: OpenCode and Claude Squad both excel at solo interactive sessions. Bernstein's worktree dispatch added one round-trip because the test-first contract is implicit; the operator had to phrase the task twice to keep the agent from refactoring before writing the test. This is an honest loss for Bernstein.

Add a Rust binding to a Python package and keep the test matrix green

Category: multi-language coordination.

Acceptance checks:

Rust crate compiles
Python tests still pass
cargo and pytest both invoked from the orchestrator without manual intervention

Scores (0 to 3):

Bernstein: 3 (best)
Claude Squad: 2
Conductor: 2
Composio agent-orchestrator: 1
OpenCode: 1

Notes: Bernstein routed Rust to a different role with `cargo` in the gate set and Python to the default backend role. The role/gate split is the win condition here.

Reproduce a 4-step run after the fact from the orchestrator's own log

Category: audit and replay.

Acceptance checks:

log contains all model calls, all tool calls, all gate outcomes
log is tamper-evident (modifying any row is detectable)
log is parseable by a separate tool

Scores (0 to 3):

Bernstein: 3 (best)
Claude Squad: 0
Conductor: 1
Composio agent-orchestrator: 2
OpenCode: 0

Notes: Bernstein writes HMAC-chained JSONL under .sdd/. Composio AO writes a cloud-side trace that is structured but not tamper-evident. Conductor writes one transcript per run, but the transcript is not chained. Claude Squad and OpenCode rely on the underlying CLI agent's transcript.
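
To show why an HMAC chain makes a log tamper-evident, here is a minimal Python sketch of the general technique. It is not Bernstein's actual log format: the field names (`payload`, `mac`), the key, and the way each row is serialised are assumptions for illustration. Each row's MAC covers the previous row's MAC plus the current payload, so editing or removing any row breaks verification from that point on.

```python
import hashlib
import hmac
import json

def _mac(key: bytes, prev_mac: str, payload: dict) -> str:
    # Canonical JSON (sorted keys) so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True)
    return hmac.new(key, (prev_mac + body).encode("utf-8"), hashlib.sha256).hexdigest()

def append_entry(chain: list, key: bytes, payload: dict) -> None:
    """Append a row whose MAC is chained to the previous row's MAC."""
    prev_mac = chain[-1]["mac"] if chain else ""
    chain.append({"payload": payload, "mac": _mac(key, prev_mac, payload)})

def verify_chain(chain: list, key: bytes) -> bool:
    """Recompute every MAC in order; any modified or missing row fails the check."""
    prev_mac = ""
    for row in chain:
        expected = _mac(key, prev_mac, row["payload"])
        if not hmac.compare_digest(expected, row["mac"]):
            return False
        prev_mac = row["mac"]
    return True

# Hypothetical 3-row log.
key = b"demo-key-not-a-real-secret"
log = []
append_entry(log, key, {"step": 1, "event": "model_call"})
append_entry(log, key, {"step": 2, "event": "tool_call"})
append_entry(log, key, {"step": 3, "event": "gate", "outcome": "pass"})
intact = verify_chain(log, key)

# Serialise as JSONL (one JSON object per line) so a separate tool can parse it.
jsonl = "\n".join(json.dumps(row, sort_keys=True) for row in log)
```

Verification needs the key, so in this sketch only a holder of the key can confirm or forge the chain; how a real deployment stores and rotates that key is outside what the page describes.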

Run a 12-step plan and keep total spend under a fixed budget

Category: cost management.

Acceptance checks:

orchestrator tracks per-step model cost
orchestrator falls back to a smaller model when budget is tight
final spend reported and verifiable from logs

Scores (0 to 3):

Bernstein: 3 (best)
Claude Squad: 0
Conductor: 1
Composio agent-orchestrator: 2
OpenCode: 1

Notes: Bernstein's bandit router demoted cheap steps to Haiku. Composio AO has provider-side cost reporting but no automatic demotion.
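
The acceptance checks for this task (per-step cost tracking, fallback to a smaller model, verifiable final spend) can be illustrated with a deliberately simple threshold rule. This Python sketch is not Bernstein's bandit router and does not model how any tool under test decides; the prices, the `large`/`small` labels, and the function names are hypothetical. Costs are integer cents to keep the arithmetic exact.

```python
def pick_model(remaining_budget, remaining_steps, large_cost, small_cost):
    """Use the large model only if the rest of the plan stays affordable on the small one."""
    reserve_for_rest = (remaining_steps - 1) * small_cost
    if remaining_budget - large_cost >= reserve_for_rest:
        return "large"
    return "small"

def run_plan(n_steps, budget, large_cost, small_cost):
    """Walk the plan, record (step, model, cost) rows, and return the ledger and total spend."""
    ledger = []
    spent = 0
    for step in range(1, n_steps + 1):
        model = pick_model(budget - spent, n_steps - step + 1, large_cost, small_cost)
        cost = large_cost if model == "large" else small_cost
        spent += cost
        ledger.append((step, model, cost))
    return ledger, spent

# Hypothetical numbers: 12 steps, 200-cent budget, 30 cents per large call, 5 per small call.
ledger, spent = run_plan(n_steps=12, budget=200, large_cost=30, small_cost=5)
for step, model, cost in ledger:
    print(step, model, cost)
print("total spend (cents):", spent)
```

The ledger doubles as the evidence for the third check: the reported total can be recomputed by summing its rows. The rule cannot keep spend under budget if the small model alone would exceed it.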

Triage a flaky pytest case and decide whether to fix or quarantine

Category: decision under uncertainty.

Acceptance checks:

diagnoses cause from test logs
recommends fix or quarantine with a clear justification
produces a tracking issue stub if quarantined

Scores (0 to 3):

Bernstein: 2
Claude Squad: 3 (best)
Conductor: 2
Composio agent-orchestrator: 1
OpenCode: 3 (best)

Notes: Claude Squad and OpenCode both kept the operator in the loop and asked better clarifying questions. Bernstein over-eagerly dispatched a refactor agent before the diagnosis was settled. This is Bernstein's second honest loss, and it serves as the credibility check.

Resolve five inline review comments on a PR without rewriting unrelated lines

Category: scoped editing.

Acceptance checks:

each comment marked as resolved
diff stays within the lines flagged by each comment
no unrequested refactors land

Scores (0 to 3):

Bernstein: 2
Claude Squad: 2
Conductor: 2
Composio agent-orchestrator: 1
OpenCode: 3 (best)

Notes: OpenCode's single-session focus kept it in scope. Bernstein, Conductor, and Claude Squad all behaved similarly: one of the five comments triggered a wider rewrite that had to be reverted.
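
The second acceptance check for this task (the diff stays within the flagged lines) can be approximated by comparing changed line numbers against the ranges each comment flags. This Python sketch is a simplification with made-up data and a hypothetical function name; it ignores line-number shifts caused by insertions and deletions, which a real check would have to handle.

```python
def out_of_scope_lines(changed_lines, flagged_ranges):
    """Return changed line numbers that fall outside every flagged (start, end) range."""
    return sorted(
        line
        for line in set(changed_lines)
        if not any(start <= line <= end for start, end in flagged_ranges)
    )

# Hypothetical data: five review comments, each flagging an inclusive line range.
flagged = [(10, 12), (40, 40), (57, 60), (88, 90), (120, 121)]
in_scope_diff = [10, 11, 40, 58, 89, 120]
wide_rewrite = [10, 11, 40, 58, 59, 61, 62, 63, 89, 120]

print(out_of_scope_lines(in_scope_diff, flagged))
print(out_of_scope_lines(wide_rewrite, flagged))
```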

Detect that a config change to one package breaks tests in three other packages

Category: cross-package reasoning.

Acceptance checks:

lists the affected packages
proposes a fix that touches all of them
no test left red after the patch

Scores (0 to 3):

Bernstein: 3 (best)
Claude Squad: 1
Conductor: 2
Composio agent-orchestrator: 2
OpenCode: 1

Notes: Bernstein's task dependency graph made the blast radius explicit. Conductor required the operator to nudge it to the other packages.
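
The idea of making the blast radius explicit can be sketched as a walk over reverse dependencies. The Python example below is a generic illustration, not Bernstein's task dependency graph; the package names and the `affected_packages` helper are hypothetical. Given a map from each package to the packages it depends on, it lists everything that directly or transitively depends on the changed package.

```python
from collections import deque

def affected_packages(deps, changed):
    """Return packages that depend, directly or transitively, on `changed`."""
    # Invert the graph: package -> set of packages that depend on it.
    dependants = {}
    for pkg, needs in deps.items():
        for needed in needs:
            dependants.setdefault(needed, set()).add(pkg)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for pkg in dependants.get(current, ()):
            if pkg != changed and pkg not in seen:
                seen.add(pkg)
                queue.append(pkg)
    return sorted(seen)

# Hypothetical monorepo: a config change lands in "core".
deps = {
    "core": [],
    "api": ["core"],
    "cli": ["api"],
    "worker": ["core"],
    "docs": [],
}
print(affected_packages(deps, "core"))
```

In this made-up layout the change to `core` reaches three other packages, one of them (`cli`) only through `api`, which is the kind of indirect breakage the task is designed to surface.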

Resume a 6-step plan after killing the orchestrator at step 3

Category: resilience.

Acceptance checks:

remaining steps run after restart
completed steps are not re-run
state is consistent (no orphan worktrees, no half-merged branches)

Scores (0 to 3):

Bernstein: 3 (best)
Claude Squad: 1
Conductor: 1
Composio agent-orchestrator: 2
OpenCode: 0

Notes: Bernstein's on-disk task state under .sdd/ was the cleanest restart. Claude Squad lost the partially-edited pane content. Conductor had to be restarted from scratch.
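
The first two acceptance checks for this task (remaining steps run after restart, completed steps are not re-run) follow from persisting progress after every step. The Python sketch below shows that pattern in general terms; it is not Bernstein's state format under .sdd/, and the file name, helper names, and the exception used to simulate the kill are assumptions. It does not cover the third check (orphan worktrees, half-merged branches).

```python
import json
import os
import tempfile

def load_done(state_path):
    """Return the list of completed step names, or an empty list on first run."""
    if not os.path.exists(state_path):
        return []
    with open(state_path, encoding="utf-8") as f:
        return json.load(f)

def save_done(state_path, done):
    """Write state to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(done, f)
    os.replace(tmp_path, state_path)

def run_plan(steps, state_path, run_step):
    """Run steps in order, skipping any already recorded as done."""
    done = load_done(state_path)
    for name in steps:
        if name in done:
            continue
        run_step(name)
        done.append(name)
        save_done(state_path, done)
    return done

# Demo: a 6-step plan whose first run dies at step 3 (simulated with an exception).
steps = [f"step-{i}" for i in range(1, 7)]
executed = []

def make_runner(crash_on=None):
    def run_step(name):
        if name == crash_on:
            raise RuntimeError(f"killed during {name}")
        executed.append(name)
    return run_step

with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "state.json")
    try:
        run_plan(steps, state_path, make_runner(crash_on="step-3"))
    except RuntimeError:
        pass
    after_crash = load_done(state_path)
    final = run_plan(steps, state_path, make_runner())
```

Because the state file is rewritten only after a step finishes, a step that was interrupted midway is treated as not done and runs again on restart.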

Run a 3-step plan against a locally-served model with no outbound DNS

Category: air-gapped execution.

Acceptance checks:

orchestrator boots without any internet call
all 3 steps complete against the local model
no telemetry leaves the box

Scores (0 to 3):

Bernstein: 2
Claude Squad: 2
Conductor: 0
Composio agent-orchestrator: 0
OpenCode: 3 (best)

Notes: OpenCode is the cleanest air-gap fit: a single static binary, no required network. Bernstein needed `--no-telemetry --offline` and one initial install. Conductor and Composio AO both require their service to be reachable.

Sources verified against

https://github.com/sipyourdrink-ltd/bernstein (checked 2026-05-19)
https://github.com/smtg-ai/claude-squad (checked 2026-05-19)
https://conductor-ai.com (checked 2026-05-19)
https://github.com/ComposioHQ/agent-orchestrator (checked 2026-05-19)
https://github.com/opencode-ai/opencode (checked 2026-05-19)

FAQ

Is this benchmark independent?

No. It is operator-published, scored, and refreshed by the same author who maintains Bernstein. That is why the page bakes an honesty gate into CI: Bernstein must lose at least two tasks and Bernstein winrate must stay below 0.80, or the test suite fails and the page does not publish. The scoring rubric, the seed prompt hash, and the repro script are linked so any reader can re-run the eval and check the numbers.
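
The gate condition is small enough to state as code. The following is a Python sketch of the rule as described in this answer; the real check lives in tests/benchmarks-honesty.test.ts, which is not reproduced here, and the function name and parameters are hypothetical.

```python
def honesty_gate(bernstein_wins: int, total_tasks: int,
                 max_winrate: float = 0.80, min_losses: int = 2) -> bool:
    """True when the page may publish: winrate strictly below the cap and enough losses."""
    if total_tasks <= 0 or not 0 <= bernstein_wins <= total_tasks:
        raise ValueError("wins must be between 0 and total_tasks, and total_tasks positive")
    losses = total_tasks - bernstein_wins
    winrate = bernstein_wins / total_tasks
    return winrate < max_winrate and losses >= min_losses

# Published result: 6 wins out of 10 tasks.
published_ok = honesty_gate(6, 10)
# A hypothetical run with 8 wins would fail, since 0.80 is not below 0.80.
too_many_wins = honesty_gate(8, 10)
```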

Why these five tools and not others?

The suite covers CLI agent orchestrators: tools that decide which CLI agent runs each step and merge the work. Bernstein, Claude Squad, Conductor, Composio agent-orchestrator, and OpenCode each represent a distinct shape: deterministic Python scheduler, tmux supervisor, parallel desktop app, cloud-first orchestrator, single-session TUI. Adding a sixth would test the same shape twice.

How can I reproduce the results?

The benchmark harness wraps each tool with its documented invocation, runs the 10 tasks 3 times each against a fixed worker model, and writes a fresh data file. The seed prompts are hashed in the methodology block. Total wall time on a 16-vCPU runner is roughly 4 hours, total model spend roughly USD 20.

How often is this refreshed?

Quarterly. The dated suite id keeps the URL stable while archived runs stay under /data/benchmarks/. If a tool ships a major version that materially changes its shape, the suite is refreshed mid-cycle and the changelog at the bottom of the page notes the bump.