Real-time risk triage agent
Replaced a stale rules engine with a traced, eval-gated agent loop. Manual review queue shrank; reviewer confidence rose.
- Series B fintech
- Agents
- 10 weeks
- Agent loop
- Policy guardrails
- OTel tracing
- Eval harness
Best fit when.
- An existing rules or policy engine is becoming impossible to reason about and a parallel-run cutover is on the table.
- Decisions must remain explainable to compliance, reviewers, or regulators — not just to engineers.
- There is a non-trivial latency budget the agent must hold, and you have a recent baseline of decisions to seed an eval suite.
What was happening.
A long-tail rules engine made decisions on transactions flagged by an upstream classifier. Rule churn had outpaced the ownership model — analysts could no longer reason about why a transaction surfaced for review. The team needed an agent loop that could explain its decisions in the same language reviewers used, while keeping the existing policy controls intact.
What we were holding to.
- Every decision had to map to a policy clause that compliance had already approved.
- Latency budget was set by the upstream service: a hard ceiling per transaction.
- The existing rules engine had to remain authoritative until the agent cleared a parallel-run window.
How we built it.
Policy-bounded tool surface
Rather than letting the model reason freely over policy text, the agent calls into a small set of typed tools — each one wrapping a policy primitive. The tool surface is the policy contract; the model picks the route, not the rule.
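A minimal sketch of what that tool surface can look like. Every name here — the tools, the clause IDs, the thresholds — is illustrative, not the client's actual policy; the point is the shape: each tool wraps one approved primitive and returns the clause it resolved to, so every decision maps back to the policy contract.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class ToolResult:
    decision: str       # e.g. "flag" or "clear"
    policy_clause: str  # clause ID compliance has already approved

def velocity_check(txn: dict) -> ToolResult:
    # Hypothetical primitive: flag accounts exceeding a daily count.
    if txn["daily_txn_count"] > 20:
        return ToolResult("flag", "AML-4.2")
    return ToolResult("clear", "AML-4.2")

def sanctions_screen(txn: dict) -> ToolResult:
    # Hypothetical primitive: exact-match screen against a blocklist.
    blocked = {"ACME-SHELL-CO"}
    if txn["counterparty"] in blocked:
        return ToolResult("flag", "SANC-1.1")
    return ToolResult("clear", "SANC-1.1")

# The model selects a route (a tool name); it never writes the rule itself.
TOOLS: Dict[str, Callable[[dict], ToolResult]] = {
    "velocity_check": velocity_check,
    "sanctions_screen": sanctions_screen,
}

def run_tool(name: str, txn: dict) -> ToolResult:
    if name not in TOOLS:
        # An unknown name is an error, not an invitation to improvise.
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](txn)
```

Because the registry is closed and every `ToolResult` carries a clause ID, a reviewer never has to trust free-form model reasoning — only the route it chose.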
Eval-gated parallel run
We built a replayable eval suite from a held-out month of triaged transactions before the agent shipped. The agent ran in shadow mode for two weeks, with disagreements routed to a review queue. The merge-to-primary gate was a quantitative bar on that suite, not a launch date.
Tracing wired before the first request
OpenTelemetry spans cover the model call, every tool invocation, and the policy resolution path. Reviewers can pull a trace from any decision and walk the agent's reasoning step by step — without reading model output.
This is the eval-suite discipline we describe in our writing on evaluation as the contract for shipping: the merge bar was a number, and the agent stayed in shadow mode until the number was met.
What we left with the client.
- Typed tool contracts versioned alongside policy changes.
- Replayable eval suite wired into CI with a documented merge bar.
- Tracing dashboards and alert routes provisioned in the client's tenant.
- Decision log and runbook owned by a named platform engineer.