Evaluation is the contract for shipping

If a system cannot be measured, it should not merge. The eval suite, traces, and replay corpus that gate every system we hand over.

  • Field note
  • January 22, 2026
  • 6 min read

The most expensive sentence in an AI engagement is 'it worked when I tried it.' The second is 'we will add evaluation later.' Both are usually said in good faith; both usually mean the team has not yet drawn the line between a successful demo and a shippable system.

Our shipping bar is simple: every system leaves with a replayable evaluation suite, a baseline measurement on that suite, and a documented merge gate that points at the suite. If any of those three is missing, the engagement is not done — even if the surface is live.

What a real eval suite contains

An eval suite worth shipping has four properties. They are unglamorous, easy to skip individually, and load-bearing as a set.

  • A held-out corpus the model has not seen during development — sourced from real traffic, not synthetic.
  • Labels or judgements applied by someone with domain authority, not by the model itself and not by the engineer who wrote the prompt.
  • A metric that maps to the production failure mode you actually fear, not to a paper benchmark you read about.
  • A way to run the suite from a single command, in CI, on every change to the retrieval, the prompt, or the model.

If it cannot be measured, it does not merge.
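
To make the last property concrete, here is a minimal sketch of what a single-command run can look like. The corpus path, field names, and run_system() entry point are assumptions for illustration, and exact-match accuracy stands in for whichever metric the engagement actually fears:

```python
# run_evals.py: a minimal sketch of the single-command property. The corpus
# path, field names, and run_system() entry point are illustrative, not the
# actual engagement tooling; exact match stands in for the real metric.
import json
from pathlib import Path

from my_system import run_system  # hypothetical entry point for the system under test

CORPUS = Path("eval/held_out.jsonl")  # held-out cases drawn from real traffic


def score(corpus_path: Path) -> float:
    """Run the system over every held-out case and return exact-match accuracy."""
    cases = [json.loads(line) for line in corpus_path.read_text().splitlines() if line.strip()]
    hits = sum(run_system(case["input"]) == case["label"] for case in cases)
    return hits / len(cases)


if __name__ == "__main__":
    print(f"eval accuracy: {score(CORPUS):.3f}")
```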

Replay, gates, and the merge bar

We pair the eval suite with a replay corpus drawn from production traffic. Where the eval suite tells us whether a change is on bar, replay tells us whether it shifts behaviour on real cases — including cases nobody thought to write down. A regression on replay that does not show up in the eval suite is itself a finding: the suite has a blind spot.
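
A replay pass can be as small as a diff over production cases. The sketch below assumes a replay_corpus.jsonl of real traffic and hypothetical run_current() and run_candidate() entry points for the live build and the proposed change; any case whose output shifts is surfaced for review:

```python
# replay_diff.py: a sketch of the replay pass, assuming a replay_corpus.jsonl of
# real production cases and hypothetical run_current()/run_candidate() entry
# points for the live build and the proposed change.
import json
from pathlib import Path

from my_system import run_candidate, run_current  # hypothetical: new and live builds

REPLAY = Path("eval/replay_corpus.jsonl")  # real traffic, no labels required


def behaviour_shifts(replay_path: Path) -> list[dict]:
    """Return every replay case where the candidate's output differs from the live build."""
    shifts = []
    for line in replay_path.read_text().splitlines():
        if not line.strip():
            continue
        case = json.loads(line)
        before, after = run_current(case["input"]), run_candidate(case["input"])
        if before != after:
            shifts.append({"id": case["id"], "before": before, "after": after})
    return shifts


if __name__ == "__main__":
    shifted = behaviour_shifts(REPLAY)
    print(f"{len(shifted)} replay cases changed behaviour")  # each one goes to written review
```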

The merge bar is quantitative and visible. A change that drops the eval suite below baseline does not merge. A change that holds the suite but shifts replay behaviour gets a written review before it merges. Either way, the decision rests on a number, not on a feeling.
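
One way to express that gate as a CI step, assuming the suite and replay jobs each write a small JSON result (the baseline figure, file paths, and keys below are illustrative):

```python
# merge_gate.py: one way to express the merge bar as a CI step. The baseline
# figure, file paths, and JSON keys are illustrative; the real gate reads
# whatever the eval and replay jobs emit.
import json
import sys
from pathlib import Path

BASELINE = 0.87  # baseline recorded on the suite at handover (illustrative number)

suite_score = json.loads(Path("out/eval_score.json").read_text())["accuracy"]
replay_shifts = json.loads(Path("out/replay_shifts.json").read_text())["count"]

if suite_score < BASELINE:
    print(f"BLOCK: suite at {suite_score:.3f}, below baseline {BASELINE:.3f}")
    sys.exit(1)  # below baseline: does not merge
if replay_shifts > 0:
    print(f"REVIEW: suite holds at {suite_score:.3f}, but {replay_shifts} replay cases shifted")
    sys.exit(2)  # on bar but behaviour shifted: written review before merge
print(f"PASS: suite at {suite_score:.3f}, no replay shift")
```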

We walk through one such suite in our engagement note on a real-time risk triage agent — the agent only merged to primary after holding bar on a held-out month of triaged transactions, with a two-week shadow run gating the cutover.

Who owns the suite after handover

The eval suite is the most under-owned artifact in most AI engagements. Engineers feel responsible for the system; product feels responsible for the outcome; nobody feels responsible for the contract that connects them. We close the gap explicitly: the named handover owner is responsible for keeping the eval suite current as the work evolves, with a documented rhythm — usually a quarterly review and a refresh whenever the upstream model or domain shifts.

When a team adopts this discipline, the conversation around AI shipping changes. 'Is it good enough?' becomes 'is it on bar?' That shift is small in writing and large in operation — it is the difference between a system that holds up under traffic and a demo that moved into production.