Regression detection

Catch the regression before your users do

An agent breaks in unexpected ways from changes that look unrelated: a refactor, a model version bump, an infra update. Tessary judges your agent’s output against a baseline after every change, so a regression shows up as a failed grader on a real trace instead of a support ticket a week later.

Catches: Regressions from prompt edits, model bumps, code refactors, and infra updates.
Baseline: Every change is judged against your product’s last known-good behavior.
Signal: A failed grader on a real trace, not a user complaint three weeks later.

The Problem

Agents break from changes nobody connected to the agent.

The pain is not that evals are hard to write. It is that the agent layer moves constantly, and breakage rarely announces which change caused it. Static checks that someone has to maintain go stale, so teams ship and find out later.

Unrelated changes break the agent in unexpected ways: a refactor, a dependency bump, or a model version that shifts behavior on inputs you never touched.

~1 day of engineering time per change today, spent uploading traces, reading them by hand, and spot-checking outputs.

Reactive: the first signal that a change broke something is usually a user complaint or a support ticket, after the change has already shipped.

How It Works

Baseline, then judge every change against it

Regression detection is the narrowest, most concrete use of the eval pipeline: a baseline of correct behavior, re-checked on every change.

  1. Establish a baseline from real traffic

    Tessary synthesizes graders from your codebase and calibrates them on your production traces. That baseline captures what correct behavior looks like for each call your agent makes.

  2. Run graders on every change

    After a prompt edit, a model bump, a refactor, or an infra update, the same graders judge the agent’s output against the baseline. Coverage spans every call site, not just the line you edited.

  3. See the regression as a failed verdict

    When behavior drifts, the relevant grader fails and the regression shows up in the run results, attributed to the call site and failure mode, before it reaches a user.
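The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not Tessary's API: `Trace`, `run_graders`, `find_regressions`, the two toy graders, and both agents are hypothetical names invented for the example. It assumes a grader is a function that returns pass or fail for one output, and that a regression is a check that passed at baseline and fails after the change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; these are not Tessary's API.
@dataclass(frozen=True)
class Trace:
    trace_id: str
    call_site: str  # where in the product the model was called
    input: str

Grader = Callable[[str], bool]  # judges one output: True means pass
Agent = Callable[[Trace], str]  # produces an output for one trace
Verdicts = Dict[Tuple[str, str], bool]  # (trace_id, grader name) -> passed


def run_graders(agent: Agent, traces: List[Trace], graders: Dict[str, Grader]) -> Verdicts:
    """Run every grader on the agent's output for every trace."""
    verdicts: Verdicts = {}
    for trace in traces:
        output = agent(trace)
        for name, grade in graders.items():
            verdicts[(trace.trace_id, name)] = grade(output)
    return verdicts


def find_regressions(baseline: Verdicts, candidate: Verdicts) -> List[Tuple[str, str]]:
    """Return checks that passed at baseline and fail (or are missing) now."""
    return sorted(
        key for key, passed in baseline.items()
        if passed and not candidate.get(key, False)
    )


# Toy data standing in for production traces and synthesized graders.
traces = [
    Trace("t1", "summarize", "Order #12 arrived late"),
    Trace("t2", "summarize", "Refund request"),
]
graders = {
    "non_empty": lambda out: bool(out.strip()),
    "under_80_chars": lambda out: len(out) <= 80,
}

def known_good_agent(trace: Trace) -> str:
    return f"Summary: {trace.input}"

def changed_agent(trace: Trace) -> str:
    # Stands in for any change (prompt, model, refactor) that makes output longer.
    return f"Summary: {trace.input}. " + "Additional detail. " * 5

baseline = run_graders(known_good_agent, traces, graders)   # step 1
candidate = run_graders(changed_agent, traces, graders)     # step 2
regressions = find_regressions(baseline, candidate)         # step 3
print(regressions)
```

Comparing verdicts rather than raw outputs is the design choice that matters here: the changed agent's text differs on every trace, but only the checks that flipped from pass to fail are reported.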

What You Get

Regression signal tied to a real failure mode

Change-type agnostic

Prompt edits, model version bumps, orchestration changes, code refactors, and infra updates are all judged the same way: against the baseline.

Per-call-site coverage

Graders attach to each place your product calls a model, so when a change in one part of the system breaks behavior in another, the grader at the affected call site fails.

Grounded in production traces

Regression checks run against real spans pulled from Braintrust or Langfuse, not synthetic prompts that miss the inputs that actually break the agent.

Verdict per failure mode

Each regression is tied to a specific failure mode in your taxonomy, so you know what broke and why it matters, not just that a number moved.
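One way to picture a verdict tied to a call site and a failure mode is a small grouping step over grader results. The sketch below is an assumption-laden illustration, not Tessary's data model: the `Verdict` fields, the `summarize_failures` helper, and the call-site and failure-mode names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical record shape; the real verdict format is not specified here.
@dataclass(frozen=True)
class Verdict:
    call_site: str      # where the product called the model
    failure_mode: str   # entry in the failure taxonomy the grader checks
    passed: bool


def summarize_failures(verdicts: Iterable[Verdict]) -> List[Tuple[str, str, int]]:
    """Count failed verdicts per (call site, failure mode), most frequent first."""
    counts = Counter(
        (v.call_site, v.failure_mode) for v in verdicts if not v.passed
    )
    # Sort by descending count, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(site, mode, n) for (site, mode), n in ordered]


verdicts = [
    Verdict("checkout.reply", "dropped_constraint", False),
    Verdict("checkout.reply", "dropped_constraint", False),
    Verdict("search.rank", "format_break", False),
    Verdict("search.rank", "format_break", True),
]
summary = summarize_failures(verdicts)
for site, mode, n in summary:
    print(f"{site}: {mode} failed on {n} trace(s)")
```

Grouping by the pair rather than reporting a single pass rate keeps the answer to "what broke and where" in the result itself.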

FAQ

Questions about regression detection

What counts as a regression?
Any change in your agent’s behavior that makes a previously acceptable output fail a grader. That includes wrong answers, dropped constraints, format breaks, and policy violations, caught against a baseline rather than a fixed threshold someone has to maintain.

How does it catch regressions from changes that are not prompt edits?
Graders run on the agent’s actual output, not on the diff. A code refactor, a dependency bump, or a model version change all flow through the same call sites, so if any of them shifts behavior, the relevant grader fails the same way it would for a prompt regression.

Do I have to write the graders myself?
No. Tessary reads your codebase and synthesizes the graders and the failure taxonomy. You curate them by accepting, rejecting, or editing, and the baseline updates as you confirm new correct behavior.

How do I run it on production traffic?
Connect Braintrust or Langfuse as a trace source. Each run pulls real spans and judges them, so regression detection reflects the inputs your users actually send.

How long does it take to get started?
The plugin produces a first set of graders in minutes. With traces connected, the goal is catching your first real regression within the first week.

Get Started

Ship the change. Catch the regression.

Connect your repo, establish a baseline from your traces, and judge every change against it. Catch the next regression before your users do.

Start with your repo

Bring your own provider keys. The graders are yours to keep.