Agent Reliability

Catch the regression, and the change that caused it

When your AI agent gets worse in production, Tessary tells you exactly what caused it, fast enough to fix it before it does damage. Detection reads your live traffic; attribution names the change, whether it shipped in a PR or never touched your repo. The same attribution, run before the merge, flags the next risky PR, so either way you get the cause, not just the score that moved.

Find what broke your agent

See how it works

Detects: The regression in production, as a trend across grader verdicts.
Attributes: The failing quality dimension, traced back to the change that caused it.
Gates: The next PR: a trouble report posts and a GitHub Action can hold the merge.

The Problem

The diff that broke production was already merged when you found out.

The pain is not that evals are hard to write. It is that the agent layer moves constantly: a release bundles a prompt edit, a code change, and a dependency bump into one diff, and breakage rarely announces which of them caused it. A score that moved is easy to see. The change that moved it is the work.

0.8%

of requests were misrouted on August 5, 2025, the first day of a routing bug Anthropic later documented in a public postmortem. The drop started small.

Anthropic engineering postmortem, 2025

16%

of Sonnet 4 requests were misrouted at the worst-impacted hour on August 31, almost a month later. The postmortem is blunt: "The evaluations we ran simply didn't capture the degradation users were reporting."

Anthropic engineering postmortem, 2025

How It Works

Detect the drop, name the cause, gate the next change

Tessary watches the agent in production, traces a drop to the change behind it, and runs the same check on the next PR before it merges.

Watch the agent in production
Push traces over OTLP with the TypeScript or Python SDK, or pull a slice from Langfuse or Braintrust. Tessary runs graders (natural-language checks and extracted tool errors) over that traffic, so what gets scored is what your users actually send.
Catch the drop as a trend
Graders score the live traffic and Tessary watches how their verdicts move over time. A regression surfaces as a shift in the trend, visible before a support ticket names it.
Trace it to the change that caused it
A failing grader names the quality dimension that broke. Tessary traces that dimension back through the agent chain to the change behind it: a prompt edit, a tool update, a dependency bump, an upstream agent whose output shifted, or a model provider update that never touched your repo.
Run the same check before the next merge
The same attribution engine runs before the merge. A diff classifier reads the PR, maps the touched surface to the failure families it endangers (tool-call errors, hallucinated citations, refusal spikes), and scores it against what broke last time. A trouble report posts on the PR and the GitHub Action can hold the merge.
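The pre-merge step above can be pictured with a small sketch. It is illustrative only and is not Tessary's classifier: the path patterns, the rule table, the function names, and the hold threshold are all assumptions made for the example. It maps the files a PR touches to the failure families they endanger, then scores each family by how often it broke before.

```python
from fnmatch import fnmatch

# Hypothetical rule table: which touched surface endangers which failure family.
# The patterns and the pairings are assumptions for this sketch.
SURFACE_RULES = [
    ("prompts/*", ["refusal spikes", "hallucinated citations"]),
    ("tools/*", ["tool-call errors"]),
    ("retrieval/*", ["hallucinated citations"]),
]


def families_at_risk(changed_paths):
    """Return the failure families endangered by a set of changed file paths."""
    families = set()
    for path in changed_paths:
        for pattern, endangered in SURFACE_RULES:
            if fnmatch(path, pattern):
                families.update(endangered)
    return families


def risk_report(changed_paths, past_breaks, hold_threshold=2):
    """Score each endangered family by its count of earlier regressions.

    past_breaks maps a family name to how many times it broke before
    (an assumed input). The merge is held when any endangered family
    has broken at least hold_threshold times.
    """
    scores = {f: past_breaks.get(f, 0) for f in families_at_risk(changed_paths)}
    hold = any(count >= hold_threshold for count in scores.values())
    return {"scores": scores, "hold_merge": hold}


# A PR that swaps the retrieval client and edits the README.
report = risk_report(
    ["retrieval/client.py", "README.md"],
    {"hallucinated citations": 2, "tool-call errors": 0},
)
```

Here only the retrieval change matches a rule, so the report carries one family and sets hold_merge because that family broke twice before. A real classifier would read the diff itself rather than file paths alone.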

What You Get

From a moving score to the change behind it

What a catch looks like

A PR swaps the retrieval tool client. Risk routing flags the citation-accuracy family, which broke twice in the last quarter. The trouble report posts on the PR and the merge holds until someone looks.

A gate that lives in CI

The trouble report posts as a PR comment, in the place you already review code. One boundary to know: a gate only fires on changes that ship through a PR, so production detection stays on for everything that does not.

Grounded in production traffic

Graders run on real traces, ingested over OTLP or pulled from Langfuse or Braintrust, not on synthetic prompts that miss the inputs that actually break. Langfuse and Braintrust hold the traces and track the scores; Tessary adds which change moved them.

Lineage for every change, diff or not

Commit-SHA lineage ties each run to the deploy that was live, which rules the diff in or out. Attribution also covers the causes that never appear in a diff at all.
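As a rough illustration of how lineage can rule a diff in or out, the sketch below looks up which commit was live when a run happened and checks whether a suspect commit had shipped before the drop began. The deploy log, the SHAs, and the function names are hypothetical and do not describe Tessary's data model.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical deploy log, sorted by time: (deployed_at, commit_sha).
DEPLOYS = [
    (datetime(2025, 1, 10, 9, 0, tzinfo=timezone.utc), "a1b2c3d"),
    (datetime(2025, 1, 14, 16, 30, tzinfo=timezone.utc), "e4f5a6b"),
    (datetime(2025, 1, 20, 11, 15, tzinfo=timezone.utc), "c7d8e9f"),
]


def live_sha(run_time, deploys=DEPLOYS):
    """Return the commit SHA that was deployed when a run happened, or None."""
    times = [deployed_at for deployed_at, _ in deploys]
    index = bisect_right(times, run_time)
    return deploys[index - 1][1] if index else None


def diff_ruled_in(drop_started, suspect_sha, deploys=DEPLOYS):
    """A diff can only explain a drop if it shipped at or before the drop began."""
    for deployed_at, sha in deploys:
        if sha == suspect_sha:
            return deployed_at <= drop_started
    return False  # unknown SHA: never deployed, so it cannot be the cause


drop = datetime(2025, 1, 16, 8, 0, tzinfo=timezone.utc)
sha_at_drop = live_sha(drop)                  # commit live when the drop began
late_diff = diff_ruled_in(drop, "c7d8e9f")    # shipped after the drop started
```

A suspect commit that shipped after the drop began is ruled out, which points the search at causes that never appear in a diff.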

Where the graders come from

Graders drafted from observed behavior, sharpening as bugs are fixed

FAQ

Questions about regression detection

Graders do not need to be perfect. An imperfect grader misfires at a stable rate, so the trend across its verdicts stays meaningful: a failure rate that moves from 3% to 11% signals a real change even if some individual verdicts are wrong. Tessary alerts on that movement, not on any single score.
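To make the 3% to 11% example concrete, here is a minimal sketch of one standard way to test whether a failure rate moved: a two-proportion z-test. This is a swapped-in textbook technique, not a description of how Tessary alerts, and the window size of 500 verdicts per period is an assumption.

```python
from math import erfc, sqrt


def rate_shift(fail_before, n_before, fail_after, n_after):
    """Two-proportion z-test on grader failure counts from two time windows.

    Returns (z, two_sided_p_value). A large |z| means the change in
    failure rate is unlikely to be noise from individual wrong verdicts.
    """
    rate_before = fail_before / n_before
    rate_after = fail_after / n_after
    pooled = (fail_before + fail_after) / (n_before + n_after)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no detectable shift
    z = (rate_after - rate_before) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


# Assumed windows of 500 verdicts each: 3% failures before, 11% after.
z, p = rate_shift(15, 500, 55, 500)
```

With these assumed counts the z-score comes out above 4, so the shift stands out even though some individual verdicts in both windows are wrong. The comparison relies on the point made above: the grader's misfire rate stays roughly the same across both windows.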

An upstream agent's output shifts, a tool starts returning different data, or the model provider updates the model underneath you. Nothing in your repo changed, so no CI gate could have fired. Production detection catches the drop as a trend, and attribution traces it back through the agent chain to the node that moved. This is why the pre-deploy check is the second act, not the whole product.

It is a different step. Langfuse and Braintrust hold your traces, run evaluators, and track scores well, and Tessary can pull a slice from either as a source. What they tell you is that a score moved. Tessary tells you why, and which change moved it.

Tests check the cases you wrote. The pre-deploy check scores the PR against graders that watch your live traffic, so it covers the inputs users actually send. And because plenty of regressions arrive without a PR, the production side keeps watching after the merge.

No. Tessary drafts graders from your agent's observed behavior, and you correct them rather than write them from scratch. The intent those graders capture lives outside the agent's code and sharpens as bugs are fixed and existing golden datasets are folded in.

Get Started

Find the change behind the next regression

The next time a grader trend drops, you get the change that caused it, not just the alert, and the next risky PR gets a trouble report before it merges. No credit card required to start.

Find what broke your agent

Bring your own provider keys. Connect a trace source in minutes.
