These are the on-prem / regulated-deployment notes for 1.10 — mTLS cluster mode, signed lineage records, air-gapped install, and a capability gate against the lethal-trifecta exfiltration class. If you're looking for "how to install it on my laptop," that's the curl-pipe install post instead. A laptop tool and an on-prem deployment have almost nothing in common: the first answers to the developer who started it; the second answers to a security architect, a compliance officer, a network team that hates outbound traffic, and a procurement reviewer with a checklist. The batch below is what it took to stop pretending those were the same thing.
the questions
Customers who want the orchestrator inside their perimeter ask a fairly predictable sequence of questions. How do nodes authenticate when the network isn't yours. How do we prove, six months later, which agent wrote which line. How do we install when the box has no outbound internet. What stops a clever prompt from chaining a database read into a public webhook.
We had partial answers to all four. None were the answer you'd hand a security review without flinching. The five PRs that landed today close that gap by replacing "we sort of do that" with concrete, demonstrable behaviour. None of these are revolutionary on their own. The cumulative effect is a 1.10 build we're willing to drop into a regulated customer's box and walk through with their auditor.
cluster mode without ambient trust
Cluster mode used to assume the network was safe. Acceptable for a developer machine talking to itself, unacceptable everywhere else. Worker–central traffic now runs over mTLS by default, with cert issuance done locally and pinned to the cluster's own CA.
bernstein cluster bootstrap-ca --out .sdd/cluster/ca/
bernstein cluster issue-cert --role worker --node-id worker-01 \
--out .sdd/cluster/worker-01/
bernstein cluster start --tls .sdd/cluster/worker-01/

The CLI generates a private CA, issues short-lived node certs with role and node-id baked in, refuses connections whose chain doesn't terminate at the cluster CA (PR #1019). Rotation is a re-issue, not a rebuild. The central trusts any cert with a valid chain and a non-expired NotAfter. There is no shared symmetric secret to leak.
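The central's side of that trust rule is small enough to sketch with Python's ssl module. This is a sketch under assumed file paths (the central's own cert locations are illustrative), not the shipped code:

import ssl

def central_tls_context(
        ca_file=".sdd/cluster/ca/ca.pem",           # the pinned cluster CA
        cert_file=".sdd/cluster/central/cert.pem",  # illustrative path
        key_file=".sdd/cluster/central/key.pem"):   # illustrative path
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)  # only chains ending at this CA pass
    ctx.verify_mode = ssl.CERT_REQUIRED        # no client cert, no connection
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx

Expiry is enforced by the handshake itself; the role and node-id baked into the verified cert are what authorization decisions would read afterwards.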
Underneath, we replaced the in-process happy-path tests with a real two-process harness: central and worker as separate subprocess.Popen invocations, walked through six chaos scenarios — worker SIGKILL mid-task, central restart with in-flight claims, network partition, token expiry across a claim boundary, two workers racing for the same task, and certificate revocation (PR #1020). Bugs the in-process harness silently swallowed — claim re-entry, token clock skew, an off-by-one in the partition healer — now surface in seconds. CI runs the matrix on every push.
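The shape of one scenario, with the central's start flag and the timings as placeholders:

import signal, subprocess, time

def test_worker_sigkill_mid_task():
    # Two real processes; in-process fakes can't reproduce half-written
    # claims or tokens that outlive their issuer.
    central = subprocess.Popen(
        ["bernstein", "cluster", "start", "--central"])  # flag illustrative
    worker = subprocess.Popen(
        ["bernstein", "cluster", "start", "--tls", ".sdd/cluster/worker-01/"])
    try:
        time.sleep(5)                       # let the worker claim a task
        worker.send_signal(signal.SIGKILL)  # die mid-task; no cleanup path runs
        time.sleep(10)                      # claim TTL expires; healer re-queues
        # a real test asserts the re-queue here via the central's status API
    finally:
        central.terminate()
        central.wait()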
Operators get five new Prometheus counters and gauges (bernstein_cluster_claims_total, bernstein_cluster_token_rejections_total, bernstein_cluster_partition_seconds, bernstein_cluster_central_restarts_total, bernstein_cluster_workers_active) and six audit event types covering token issuance, claim transitions, certificate rotation, and disconnects. Grafana dashboard JSON ships in observability/dashboards/cluster.json (PR #1021). Plain dashboard, accurate fields.
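For reference, the shapes those metrics would take if declared with prometheus_client — the label dimensions here are our illustration, not the shipped schema:

from prometheus_client import Counter, Gauge

claims = Counter("bernstein_cluster_claims_total",
                 "Task claims processed", ["outcome"])           # label assumed
token_rejections = Counter("bernstein_cluster_token_rejections_total",
                           "Claim tokens rejected", ["reason"])  # label assumed
partition_seconds = Gauge("bernstein_cluster_partition_seconds",
                          "Duration of the current partition")
central_restarts = Counter("bernstein_cluster_central_restarts_total",
                           "Central restarts observed")
workers_active = Gauge("bernstein_cluster_workers_active",
                       "Workers with a live mTLS session")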
For deployments that can't expose a port to the worker fleet, there's a tested pattern using Cloudflare Tunnel for the central edge plus Tailscale for the worker mesh. Documented end-to-end. A nightly CI job stands up the topology in ephemeral containers and runs a smoke task through it (PR #1024). The pattern works without modifying firewall rules at the customer site, which is usually the difference between "scheduled for next quarter" and "deploy this week."
What's still missing: MESH and HIERARCHICAL coordinator topologies are stubs. Multi-tenant isolation inside a single cluster — separate quotas, audit chains, model budgets per tenant — is deferred to 1.11. If you need either today, run one cluster per tenant. We're naming the limit out loud because pretending it isn't there is how trust evaporates on the second deployment.
lineage that survives a regulator
Auditing a code change six months after the fact is a problem of information loss. By the time anyone asks "which agent wrote this," the prompt is gone, the cost ledger has rolled, the producing model version has been replaced. The new lineage subsystem keeps the answer.
Every agent write emits a signed lineage record linking the output (file path, byte range, sha256) to its inputs (the files the agent read), producer (agent id, role, model, effort), prompt SHA, cost, token count, wall-clock timestamp. Records chain via HMAC the same way the existing audit log does, so tampering with one record invalidates the chain past it.
bernstein lineage src/auth/middleware.py:74
# written by: backend / claude-sonnet / effort=high
# prompt sha: 3f9a…b421 (template: roles/backend.md@v17)
# inputs: src/auth/__init__.py src/auth/jwt.py tests/test_auth.py
# producer: session 7c4f1a3b9d22, task #412
# cost: $0.0214 tokens: 11,983
# signed: ed25519 (cluster-key) customer-key: aporia-prod-2026-05

Schema v2 adds two fields: a regulatory_class tag (pii, phi, dora, nis2, none) inferred from the file's policy zone, plus a customer-key signature attached after the cluster-key signature so customers can revoke trust without re-issuing the cluster CA. The combination is what makes a DORA or NIS2 evidence package mechanical to assemble: filter by regulatory_class, walk the chain, hand the bundle to the auditor.
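The chaining rule is the part worth being precise about. A sketch of verification, assuming each record carries its HMAC in a mac field (the field name is illustrative):

import hashlib, hmac, json

def first_broken(records, key: bytes):
    # Each record's HMAC covers the previous record's HMAC plus its own
    # canonical payload, so editing record N breaks N and everything after it.
    prev = b""
    for i, rec in enumerate(records):
        body = json.dumps({k: v for k, v in rec.items() if k != "mac"},
                          sort_keys=True, separators=(",", ":")).encode()
        want = hmac.new(key, prev + body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(want, rec["mac"]):
            return i       # index of the first tampered record
        prev = want.encode()
    return None            # chain intact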
The janitor verifies the chain on every gate run. A tamper hit is forwarded to a configurable SIEM webhook along with the broken record and the surrounding window. The default is loud. We'd rather wake an operator than miss a forged signature.
What's still missing: the customer-key signing path uses ed25519 in software. FIPS-140 hardware keys are on the 1.11 roadmap. If FIPS-140 is a hard procurement requirement, this release won't clear it.
distribution without outbound internet
The first sovereign deployment we did, the customer's box could not reach pypi.org, github.com, or any registry we'd ever heard of. Install procedure was an engineer carrying a USB drive through a security checkpoint. Not a problem we wanted to solve twice.
bernstein wheelhouse build produces a self-contained tarball: pinned wheels for the orchestrator and every transitive dependency, embedded model weights for offline classifiers, a manifest with sha256 per file, a detached GPG signature over the manifest. bernstein wheelhouse verify checks both the signature and every per-file hash before any installer logic runs.
# on a connected build host
bernstein wheelhouse build --out bernstein-1.10.0-airgap.tar.gz \
--sign-with [email protected]
# on the customer box
bernstein wheelhouse verify bernstein-1.10.0-airgap.tar.gz
bernstein wheelhouse install bernstein-1.10.0-airgap.tar.gz \
--target /opt/bernstein
bernstein --profile airgap doctor airgap

--profile airgap flips the orchestrator into explicit egress-deny: any code path that opens a socket to a non-loopback, non-cluster address fails closed with an error naming the offender. doctor airgap runs ten self-checks — DNS lookup for a poisoned hostname, plaintext HTTP attempt, MCP reachability, model-weight integrity — and returns a single pass/fail line that procurement can paste into a runbook.
This was the piece we expected to be smallest and that ate the most time. Deciding what counts as "egress" inside a complex Python process is a research project. Deciding what to do when a transitive dependency tries to phone home for telemetry is a policy question. We landed on "fail closed, name the caller, document the override" because every other choice creates a quiet failure mode.
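"Fail closed, name the caller" is concrete enough to sketch. A deliberately simplified guard at the socket layer — the real profile also has to cover UDP, UNIX sockets, and the async stack, and the cluster range here is invented:

import ipaddress, socket, traceback

ALLOWED = [ipaddress.ip_network("127.0.0.0/8"),
           ipaddress.ip_network("10.20.0.0/16")]  # loopback + cluster, illustrative

_connect = socket.socket.connect

def guarded_connect(self, address):
    host = address[0]
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A hostname at connect() time is not trusted: resolve and re-check.
        ip = ipaddress.ip_address(socket.gethostbyname(host))
    if not any(ip in net for net in ALLOWED):
        caller = traceback.extract_stack()[-2]  # fail closed, name the offender
        raise PermissionError(
            f"egress denied: {host} from {caller.filename}:{caller.lineno}")
    return _connect(self, address)

socket.socket.connect = guarded_connect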
a capability gate against the lethal trifecta
The lethal trifecta — private data, untrusted input, external communication — is the prompt-injection escape hatch every multi-agent system eventually trips over. An agent that reads a customer's database, ingests a webhook body crafted by an attacker, and is allowed to call out to a public URL has, by construction, an exfil path. The mitigation in the literature is to refuse the chain, not the individual capabilities.
Tools and adapters now declare capability tags in their manifest:
# src/bernstein/adapters/postgres_query.py
CAPABILITIES = frozenset({"PRIVATE_DATA"})
# src/bernstein/adapters/webhook_ingest.py
CAPABILITIES = frozenset({"UNTRUSTED_INPUT"})
# src/bernstein/adapters/http_post.py
CAPABILITIES = frozenset({"EXTERNAL_COMM"})At spawn, the orchestrator unions the tags of every tool the prospective agent has access to. If the union contains all three of PRIVATE_DATA, UNTRUSTED_INPUT, and EXTERNAL_COMM, the spawn fails with a refusal naming the offending capability set and the tools that contributed to each tag. Operators can override per-task with --allow-trifecta, which is logged to the audit chain and surfaced in the lineage record.
The gate catches the architectural mistake — assembling a tool belt that shouldn't exist together — before any prompt runs. It cannot prevent a capable insider from manually wiring around it. The default failure mode is now refusal, which is the right default for a tool you don't fully control.
what this batch isn't
A few things are honestly not done. MESH/HIERARCHICAL cluster topologies are stubs. FIPS-140 hardware keys for the lineage signer are on the 1.11 roadmap, not this release. Multi-tenant isolation inside a single cluster is deferred. The capability matrix covers the trifecta and nothing else; finer-grained gates like "no PII into models without a BAA" are next quarter's work. None of these are blockers for the deployments we have lined up. All of them will be eventually.
This isn't a "version 2.0." It's the unglamorous list a procurement reviewer asks about before the technical evaluation has even started. Cluster auth, signed audit, offline install, capability isolation. Table stakes for getting the orchestrator dropped onto a regulated customer's box instead of staying a thing that runs on someone's laptop.
We picked the laptop for two years because the regulated box is mostly paperwork. The batch shipped today is the paperwork.
If you got here from the README and want the codebase view, GitHub is the canonical place. If the regulated-deployment angle is what you came for, open an issue describing the air-gap or compliance gap you're hitting; the next batch is shaped by what blocks real deployments.