Bernstein benchmarks

This page covers three benchmarks, each reproducible and each honest about its limits: a head-to-head quality eval against four other CLI agent orchestrators, a simulated parallel-scheduling speedup model, and a community cost benchmark you can add your own run to. Every number below is either computed from a published data file or quoted from the repo, and the page tells you which.

This is not a leaderboard sheet where Bernstein wins everything. The head-to-head suite is gated in CI so that Bernstein must lose at least two tasks and stay under an 80% winrate; the scheduling figures are a labelled simulation, not measured runs.

Head-to-head: CLI agent orchestrators

A reproducible 10-task eval compares Bernstein against four other CLI agent orchestrators on a fixed worker model, three trials each, operator-scored against fixed acceptance checks. Bernstein loses 4 of 10 tasks, so its winrate is 60%. Ties count as a win for every tied tool, which is the honest way to score a shared top result.
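The tie rule is easy to get wrong, so here is a minimal Python sketch of it. It is not the repo's scoring code: the function name, the tool names, and the scores are all made up for illustration, and it assumes per-task scores can be compared exactly (for example, integer check counts).

```python
from typing import Dict


def winrate_with_ties(scores: Dict[str, Dict[str, float]], tool: str) -> float:
    """Fraction of tasks where `tool` holds the top score.

    A tie at the top counts as a win for every tied tool, so winrates
    across tools can add up to more than 1.0.
    """
    if not scores:
        raise ValueError("need at least one task")
    wins = 0
    for by_tool in scores.values():
        best = max(by_tool.values())
        if by_tool[tool] == best:
            wins += 1
    return wins / len(scores)


# Hypothetical scores, NOT the published results.
example_scores = {
    "task-a": {"bernstein": 3, "other-1": 3, "other-2": 1},  # tie at the top
    "task-b": {"bernstein": 2, "other-1": 3, "other-2": 0},
    "task-c": {"bernstein": 3, "other-1": 1, "other-2": 2},
    "task-d": {"bernstein": 0, "other-1": 0, "other-2": 2},
}

for name in ("bernstein", "other-1", "other-2"):
    print(name, winrate_with_ties(example_scores, name))
```

In this toy data the tie on task-a is credited to both tied tools, which is why the three winrates sum to more than 1.0.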

Read the full per-task scores or the methodology and repro script. The honesty gate lives in tests/benchmarks-honesty.test.ts.
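The actual gate is the TypeScript test named above. The following is only a Python sketch of the two conditions this page describes (at least two losses, winrate under 80%), not a port of that file; the function name and parameters are hypothetical.

```python
def passes_honesty_gate(
    wins: int,
    total: int,
    min_losses: int = 2,
    max_winrate_pct: int = 80,
) -> bool:
    """Return True when the result is 'honest enough' under the two rules.

    Rule 1: the tool must lose at least `min_losses` tasks.
    Rule 2: the winrate must be strictly below `max_winrate_pct` percent.
    Integer arithmetic avoids floating-point edge cases at exactly 80%.
    """
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total, with total > 0")
    losses = total - wins
    return losses >= min_losses and wins * 100 < max_winrate_pct * total


# The published result on this page: 6 wins out of 10 tasks.
print(passes_honesty_gate(wins=6, total=10))   # True
# Exactly 80% is not "under 80%", so this fails the gate.
print(passes_honesty_gate(wins=8, total=10))   # False
```

With 10 tasks, the strict 80% bound is the tighter of the two rules: 8 wins satisfies the two-loss rule but not the winrate rule.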

Parallel scheduling speedup (simulation)

This is a simulation, not measured agent runs. It models a greedy list scheduler over 10 tasks with realistic dependency graphs, so treat it as a capacity-planning estimate rather than a leaderboard claim.

Configuration	Avg speedup vs single agent	Cost effect
3 agents	1.78x	model mixing reduces cost ~23%
5 agents	2.18x	same mix, more parallel slots
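For readers unfamiliar with greedy list scheduling, here is a small self-contained Python sketch of the idea: whenever an agent is free, it starts the first ready task in list order. This is an illustration under assumptions, not the model in benchmarks/run_benchmark.py. The function name, task names, durations, and dependency graph are invented, so it will not reproduce the 1.78x or 2.18x figures.

```python
import heapq
from typing import Dict, List


def makespan(durations: Dict[str, float], deps: Dict[str, List[str]], agents: int) -> float:
    """Total wall-clock time for a greedy list schedule on `agents` workers."""
    order = {task: i for i, task in enumerate(durations)}  # list priority
    pending = {task: len(deps.get(task, [])) for task in durations}
    dependents: Dict[str, List[str]] = {task: [] for task in durations}
    for task, prerequisites in deps.items():
        for pre in prerequisites:
            dependents[pre].append(task)

    ready = [(order[t], t) for t in durations if pending[t] == 0]
    heapq.heapify(ready)
    running: list = []  # heap of (finish_time, priority, task)
    now, free, done = 0.0, agents, 0

    while done < len(durations):
        # Fill every free agent with the highest-priority ready task.
        while ready and free > 0:
            _, task = heapq.heappop(ready)
            heapq.heappush(running, (now + durations[task], order[task], task))
            free -= 1
        if not running:
            raise ValueError("nothing can run: dependency cycle or zero agents")
        # Advance to the next completion and release everything ending then.
        now = running[0][0]
        while running and running[0][0] == now:
            _, _, finished = heapq.heappop(running)
            free += 1
            done += 1
            for child in dependents[finished]:
                pending[child] -= 1
                if pending[child] == 0:
                    heapq.heappush(ready, (order[child], child))
    return now


# Hypothetical 5-task graph: d needs a and b, e needs c.
toy_durations = {"a": 4, "b": 3, "c": 2, "d": 3, "e": 2}
toy_deps = {"d": ["a", "b"], "e": ["c"]}

single = makespan(toy_durations, toy_deps, agents=1)
for k in (3, 5):
    print(f"{k} agents: {single / makespan(toy_durations, toy_deps, k):.2f}x")
```

Speedup here is the single-agent makespan divided by the k-agent makespan. Once the dependency chain becomes the bottleneck, extra agents stop helping, which is why the gain from adding agents flattens out.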

Full task table, scheduling model, and cost model are in docs/benchmarks/BENCHMARKS.md. Reproduce locally with uv run python benchmarks/run_benchmark.py.

Community cost benchmark

What does a Bernstein session actually cost on your setup? This benchmark collects real bernstein cost --json runs so you can sanity-check the bill before you install, across different models and hardware rather than from one machine.

To add a row, run bernstein cost --json after a session and submit the task count, model breakdown, total cost, cost-per-task, wall-clock time, and hardware. Two paths:

Comment your output on the cost-benchmark issue (#787).
Open a PR adding a row to docs/benchmarks/BENCHMARKS.md.
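If you want to sanity-check a run before submitting it, the arithmetic between the listed fields can be sketched in Python as below. The class and field names are hypothetical and are not the keys that bernstein cost --json emits. The sketch also assumes that the per-model costs add up to the total, which may not hold for every setup.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class CostRun:
    """Hypothetical container for the fields a submission asks for."""
    task_count: int
    model_costs: Dict[str, float]  # cost per model, same currency as total
    total_cost: float
    wall_clock_s: float
    hardware: str

    def cost_per_task(self) -> float:
        if self.task_count <= 0:
            raise ValueError("task_count must be positive")
        return self.total_cost / self.task_count

    def breakdown_matches_total(self, tol: float = 0.01) -> bool:
        # Assumption: the model breakdown is expected to sum to the total.
        return math.isclose(sum(self.model_costs.values()), self.total_cost, abs_tol=tol)


# Made-up numbers for illustration only.
run = CostRun(
    task_count=8,
    model_costs={"model-a": 1.20, "model-b": 0.40},
    total_cost=1.60,
    wall_clock_s=540.0,
    hardware="example laptop",
)
print(f"cost per task: {run.cost_per_task():.2f}")
print("breakdown consistent:", run.breakdown_matches_total())
```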

Submitted runs are published with attribution to the setup they came from. The schema is whatever bernstein cost --json emits, so there is nothing to fill in by hand.