Catch the regression before your users do
An agent breaks in unexpected ways from changes that look unrelated: a refactor, a model version bump, an infra update. Tessary judges your agent’s output against a baseline after every change, so a regression shows up as a failed grader on a real trace instead of a support ticket a week later.
- Catches
- Regressions from prompt edits, model bumps, code refactors, and infra updates.
- Baseline
- Every change is judged against your product’s last known-good behavior.
- Signal
- A failed grader on a real trace, not a user complaint three weeks later.
Agents break from changes nobody connected to the agent.
The pain is not that evals are hard to write. It is that the agent layer moves constantly, and breakage rarely announces which change caused it. Static checks that someone has to maintain go stale, so teams ship and find out later.
- Unrelated
changes break the agent in unexpected ways: a refactor, a dependency bump, or a model version that shifts behavior on inputs you never touched
- ~1 day
of engineering time per change today, spent uploading traces, reading them by hand, and spot-checking outputs
- Reactive
the first signal a change broke something is usually a user complaint or a support ticket, after it already shipped
Baseline, then judge every change against it
Regression detection is the narrowest, most concrete use of the eval pipeline: a baseline of correct behavior, re-checked on every change.
Establish a baseline from real traffic
Tessary synthesizes graders from your codebase and calibrates them on your production traces. That baseline captures what correct behavior looks like for each call your agent makes.
Run graders on every change
After a prompt edit, a model bump, a refactor, or an infra update, the same graders judge the agent’s output against the baseline. Coverage spans every call site, not just the line you edited.
See the regression as a failed verdict
When behavior drifts, the relevant grader fails and the regression shows up in the run results, attributed to the call site and failure mode, before it reaches a user.
Regression signal tied to a real failure mode
Change-type agnostic
Prompt edits, model version bumps, orchestration changes, code refactors, and infra updates are all judged the same way: against the baseline.
Per-call-site coverage
Graders attach to each place your product calls a model, so a change in one part of the system surfaces breakage in another.
Grounded in production traces
Regression checks run against real spans pulled from Braintrust or Langfuse, not synthetic prompts that miss the inputs that actually break.
Verdict per failure mode
Each regression is tied to a specific failure mode in your taxonomy, so you know what broke and why it matters, not just that a number moved.
Questions about regression detection
Related pages
Evals for agentic products
The overview: synthesized eval pipelines that evolve with your agent.
LLM evals that evolve
Why auto-evolving graders beat static golden datasets.
Evals for the CTO
For the person who owns agent quality and gets paged when it breaks.
The agent evals SDK
Wire graders into your agentic workflow and run them on every change.
Ship the change. Catch the regression.
Connect your repo, establish a baseline from your traces, and judge every change against it. Catch the next regression before your users do.