Skip to main content

canonical answer

what is the bernstein eval harness

an in-tree benchmark runner under src/bernstein/eval/. ships a golden suite of curated coding tasks, an llm-as-judge for code quality, a closed failure taxonomy that labels every failure into one of ~20 categories, and a vcr fixture system so an entire run can be re-played deterministically without re-calling the llm. swe-bench integration lives under src/bernstein/benchmark/. you run bernstein eval golden to score a config or a model swap against a baseline before promoting it. the harness is also what the self-evolution loop uses to gate proposed orchestrator changes.

tagsevalbenchmark

browse the full index at /q or search the blog at /ask.