Methodology

CLI agent orchestrators benchmark methodology

How the 10-task eval at /benchmarks/cli-agent-orchestrators is scored, run, and audited. The rubric is fixed up-front and is the same across every refresh; the seed prompt hash is the contract.

Suite id: cli-agent-orchestrators-2026-05. Published 2026-05-19.

Scoring rubric

Each task is scored 0 to 3 against a fixed set of acceptance checks. 0 = tool refused or produced unusable output. 1 = partial output, manual rework required. 2 = passed acceptance checks with some loss (one expected behaviour missing or one extra round-trip required). 3 = passed all acceptance checks on first scheduled run. Winrate is the count of tasks where a tool's score equals or beats every other tool divided by total tasks. Ties count as a win for every tied tool; that is why winrates can sum above 1.0.

When two or more tools tie at the top score on a task, the page renders a 'tie' marker in the results column.
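The winrate rule can be sketched in a few lines. This is a minimal illustration, not the code behind the page: the `TaskScores` shape, the `winrate` function, and the tool names `alpha` and `beta` are hypothetical, and the scores are invented to show how ties push the sum of winrates above 1.0.

```typescript
// Scores for one task: tool name -> score in 0..3.
// The shape and the tool names below are illustrative assumptions.
type TaskScores = Record<string, number>;

// A tool wins a task when its score equals the highest score on that task,
// so every tool tied at the top is counted as a winner.
function winrate(tool: string, tasks: TaskScores[]): number {
  if (tasks.length === 0) return 0;
  let wins = 0;
  for (const scores of tasks) {
    const top = Math.max(...Object.values(scores));
    if (scores[tool] === top) wins += 1;
  }
  return wins / tasks.length;
}

// Hypothetical scores for two tools over four tasks.
const exampleTasks: TaskScores[] = [
  { alpha: 3, beta: 3 }, // tie at the top: both tools win
  { alpha: 2, beta: 3 },
  { alpha: 3, beta: 1 },
  { alpha: 3, beta: 3 }, // tie again
];

const alphaRate = winrate("alpha", exampleTasks); // wins 3 of 4 tasks
const betaRate = winrate("beta", exampleTasks); // wins 3 of 4 tasks
```

Both hypothetical tools end up at 0.75, so the two winrates sum to 1.5. That is the expected effect of counting each tie as a win for every tied tool.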

Worker model

Every orchestrator runs Claude Sonnet 4.5 as the primary worker model where the orchestrator exposes that choice. Where the orchestrator hard-codes a different model (or a different default), the table lists the substituted model and the result is marked with a 'model-mismatch' caveat.

The point of fixing the model is to isolate orchestrator effects. A different model would shift every row of the table uniformly; running the eval at two model tiers doubles the cost and rarely changes the relative ranking, so the suite picks one tier and labels it clearly.

Trials and aggregation

Each (task, tool) pair runs 3 times. The final cell on the results table is the median score across those 3 trials. Each trial uses a fresh worktree, a cold model cache, and the same seed prompt.

Three is the smallest trial count at which a single flaky-model outlier cannot set the median on its own: the reported value always sits at or between the other two scores. The data file ships the median, not the per-trial scores, because three integers in 0..3 are too small a sample to support inferential statistics; reporting them as if they were a distribution would be cargo-cult science.
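A small sketch of the aggregation step shows why one bad trial does not decide the cell. The function name `medianScore` and the sample trial scores are assumptions for illustration; the actual aggregation code in the repo may differ.

```typescript
// Median of an odd-length list of integer scores (here: three trials).
function medianScore(trials: number[]): number {
  if (trials.length === 0 || trials.length % 2 === 0) {
    throw new Error("expected an odd, non-zero number of trials");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...trials].sort((a, b) => a - b);
  return sorted[(sorted.length - 1) / 2] as number;
}

// Hypothetical trial scores for one (task, tool) cell.
const oneFlakyTrial = medianScore([3, 0, 3]); // a single outlier is ignored
const twoBadTrials = medianScore([0, 3, 0]); // two low trials do set the cell
const mixedTrials = medianScore([2, 3, 1]);
```

With one flaky trial the cell stays at 3; it takes two low trials out of three to pull the reported score down.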

Environment

Ubuntu 24.04, 16 vCPU, 32 GB RAM, GitHub-hosted runner. No external network beyond the model provider HTTPS endpoint and GitHub.

The suite explicitly avoids local-laptop runs: laptop power management, network jitter, and IDE background processes are too noisy to reproduce across operators. A GitHub-hosted runner is the cheapest clean environment.

Seed prompt hash

The exact wording of every task prompt is committed under scripts/benchmark-prompts/ and hashed into the data file under methodology.seedPromptHash. Editing a prompt changes the hash, which is why a refresh that touches prompts bumps the suite id.

Current hash: sha256:cd2051f80ab2833d67e9ed486c030402113d9308dcc7ac40fb3860399991b69f.
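The page does not say how the prompt files are combined before hashing, so the following is only a sketch under stated assumptions: prompts are hashed in sorted filename order, with each filename and body separated by a NUL byte. The function `hashPrompts` and the sample filenames and prompt texts are hypothetical; it will not reproduce the published hash unless the real script happens to use the same scheme.

```typescript
import { createHash } from "node:crypto";

// Assumed scheme (not confirmed by the source): feed each prompt's filename
// and body into one SHA-256 digest, in sorted filename order, with NUL
// separators so that adjacent fields cannot run together.
function hashPrompts(prompts: Record<string, string>): string {
  const digest = createHash("sha256");
  for (const name of Object.keys(prompts).sort()) {
    digest.update(name, "utf8");
    digest.update("\0");
    digest.update(prompts[name] ?? "", "utf8");
    digest.update("\0");
  }
  return `sha256:${digest.digest("hex")}`;
}

// Hypothetical prompt files, keyed by filename.
const original = hashPrompts({
  "task-01.md": "Fix the failing test.",
  "task-02.md": "Add a CLI flag.",
});
const reordered = hashPrompts({
  "task-02.md": "Add a CLI flag.",
  "task-01.md": "Fix the failing test.",
});
const edited = hashPrompts({
  "task-01.md": "Fix the failing tests.", // one character added
  "task-02.md": "Add a CLI flag.",
});
```

Under this scheme the hash is independent of the order in which files are read, and any edit to any prompt, however small, yields a different value. That is the property the suite relies on when it treats the hash as a contract.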

Reproduction script

The script at scripts/run-benchmark.mjs in the bernstein-landing repo runs the suite end-to-end against a local checkout of each tool. It:

  • installs each tool at the version pinned in the data file
  • seeds three sandbox git repos per task
  • spawns the orchestrator with the same prompt, three trials per (task, tool) cell
  • writes per-trial transcripts under tmp/benchmark-runs/
  • computes median scores and writes a fresh data file with a new suiteId

The operator then reviews each transcript by hand before opening a refresh PR. Scoring is not automated: a model will not consistently score "0 = refused" the same way an operator will.

Honesty gate

The page is operator-published, which is a conflict of interest. The CI test at tests/benchmarks-honesty.test.ts guards against the obvious failure mode (a refresh that hands Bernstein every task) with two assertions:

  • Bernstein must lose at least 2 of the 10 tasks (loss = score strictly below the top score on the task)
  • Bernstein winrate (ties counted as wins) must stay below 0.80

If either assertion fails, the build fails and the data file does not deploy. A refresh that genuinely wants to publish a winrate of 0.80 or higher has to change the rubric and document the change, not just edit scores.
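The two assertions can be sketched as a single check. This is an illustration of the rule as described, not the contents of tests/benchmarks-honesty.test.ts; the names `GateScores`, `GateResult`, and `checkHonestyGate`, the tool key `other`, and the sample scores are all assumptions.

```typescript
// Scores for one task: tool name -> score in 0..3 (illustrative shape).
type GateScores = Record<string, number>;

interface GateResult {
  losses: number;
  winrate: number;
  failures: string[];
}

// Applies both gate rules: at least `minLosses` tasks scored strictly below
// the top score, and a winrate (ties counted as wins) below `maxWinrate`.
function checkHonestyGate(
  tool: string,
  tasks: GateScores[],
  minLosses = 2,
  maxWinrate = 0.8,
): GateResult {
  let losses = 0;
  for (const scores of tasks) {
    const own = scores[tool];
    if (own === undefined) throw new Error(`no score recorded for ${tool}`);
    const top = Math.max(...Object.values(scores));
    if (own < top) losses += 1;
  }
  const winrate = tasks.length === 0 ? 0 : (tasks.length - losses) / tasks.length;
  const failures: string[] = [];
  if (losses < minLosses) failures.push(`only ${losses} losses, need ${minLosses}`);
  if (winrate >= maxWinrate) failures.push(`winrate ${winrate} is not below ${maxWinrate}`);
  return { losses, winrate, failures };
}

// Hypothetical 10-task data set in which the tool wins the first `wins` tasks.
function sampleTasks(wins: number): GateScores[] {
  return Array.from({ length: 10 }, (_, i) =>
    i < wins ? { bernstein: 3, other: 2 } : { bernstein: 1, other: 3 },
  );
}

const eightWins = checkHonestyGate("bernstein", sampleTasks(8));
const sevenWins = checkHonestyGate("bernstein", sampleTasks(7));
```

Because every task is either a win or a loss under these definitions, 8 wins out of 10 satisfies the loss rule but lands exactly on 0.80 and fails the winrate rule, while 7 wins passes both. On a 10-task suite the winrate bound is therefore the tighter of the two.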

What the rubric does not cover

The eval is narrow on purpose. Documented exclusions:

  • Code quality. Acceptance checks are pass/fail; readability is not graded. A tool that ships ugly-but-passing code beats a tool that ships pretty-but-broken code.
  • Setup ergonomics. Install time and onboarding flow are not in the rubric. A tool that requires 30 minutes to install but then runs the task cleanly scores the same as a one-line install.
  • Community size. Stars, forks, and Discord activity are not scored. The suite reads only the running tool, not its ecosystem.
  • License. Scoring is license-blind. Commercial and OSS tools are scored the same.

Readers who care about those dimensions should pair this page with the per-tool comparison pages at /compare and /vs.

Refresh cadence

Quarterly. The suite id encodes the refresh date so a cited URL never silently changes content. Mid-cycle refreshes only happen if a tool ships a major version that materially changes its shape (for example: a single-agent tool ships parallel execution). When that happens the changelog at the bottom of the main page notes the bump.
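As an illustration of how a suite id can carry its refresh date, the sketch below parses an id of the form name-YYYY-MM. That format is inferred from the single published id (cli-agent-orchestrators-2026-05) and is an assumption; `SuiteId` and `parseSuiteId` are hypothetical names.

```typescript
interface SuiteId {
  suite: string;
  year: number;
  month: number;
}

// Splits "<suite-name>-YYYY-MM" into its parts; returns null for anything
// that does not match the assumed pattern.
function parseSuiteId(id: string): SuiteId | null {
  const match = /^(.+)-(\d{4})-(\d{2})$/.exec(id);
  if (match === null) return null;
  const [, suite, year, month] = match;
  if (suite === undefined || year === undefined || month === undefined) return null;
  const monthNumber = Number(month);
  if (monthNumber < 1 || monthNumber > 12) return null;
  return { suite, year: Number(year), month: monthNumber };
}

const parsed = parseSuiteId("cli-agent-orchestrators-2026-05");
const rejected = parseSuiteId("cli-agent-orchestrators");
```

A citation that records the full suite id therefore pins both the suite and the refresh it came from.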

Sources

Every tool's documented invocation, version pin, and installation command was checked against the upstream repo on the date in the data file: