Agent Reliability

LLM evals that evolve with your agent

A golden dataset is a snapshot from launch week; two refactors later it describes a product that no longer exists. Tessary drafts graders from what your agent actually does in production, so you correct them rather than write them from scratch, and they sharpen as bugs are fixed and existing golden sets fold in. When a grader trend drops, Tessary traces the drop to the change that caused it.


Drafted: Graders start as a reading of the agent's observed behavior; you correct them rather than write them.
Sharpening: Intent lives outside the code and improves as bugs are fixed and golden sets fold in.
Payoff: When a grader trend drops, Tessary traces it to the change that caused it.

The Problem

A golden dataset describes the product you shipped last quarter.

The eval playbook most teams inherited assumes the thing under test changes slowly. Agents do not: prompts, models, and orchestration shift weekly, and a dataset that lives off to the side of the code falls behind it.

52.4% of 1,340 respondents run offline evaluations on test sets at all. Nearly half of teams are not following the playbook they inherited.

LangChain, State of Agent Engineering, 2025

Agent framework adoption nearly doubled in a year, from over 9% of orgs in early 2025 to almost 18% by early 2026. The stack under an agent moves fast, and a hand-maintained dataset falls behind it.

Datadog, State of AI Engineering, 2026

How It Works

Draft, correct, run on live traffic, trace the cause

Graders start from observed behavior and sharpen over time. When one catches a drop, the cause comes with it.

Land production traffic in one place
Push traces over OTLP in gen_ai OpenTelemetry conventions, or pull a slice from Langfuse or Braintrust. Sessions, turns, tool calls, and messages land in one model, the same one your graders run against.
Correct drafted graders instead of writing evals
Tessary drafts graders from the agent's observed behavior, and you correct them rather than write them from scratch. Need one more? Describe the failure in plain language and it becomes a lightweight classifier or a regex check; tool errors are extracted automatically.
One set of graders for tests and live traffic
The grader that judges a test sample is the same grader that flags live traffic, so there is no second eval system drifting out of sync with monitoring. The place you run evals is the place production gets judged.
When a trend drops, trace the cause
A failing grader names the quality dimension that broke, and Tessary follows it back through the agent chain to the change that moved it. Commit lineage rules the diff in or out, and the causes that never appear in a diff, like a provider model update or an upstream agent's shift, get traced too.
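As a rough illustration of "one set of graders for tests and live traffic", here is a minimal sketch of a regex check that judges both kinds of input and emits the same kind of record for each. The `RegexGrader` and `Verdict` names, the `source` field, and the traceback pattern are assumptions made up for this example; they are not Tessary's API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    grader: str
    source: str  # "test" or "live"; a hypothetical field for this sketch
    passed: bool


class RegexGrader:
    """Hypothetical grader: fails any agent message that matches a pattern."""

    def __init__(self, name: str, pattern: str) -> None:
        self.name = name
        self._pattern = re.compile(pattern)

    def judge(self, message: str, source: str) -> Verdict:
        matched = self._pattern.search(message) is not None
        return Verdict(grader=self.name, source=source, passed=not matched)


# Illustrative check: the agent should never show a raw Python traceback to a user.
grader = RegexGrader("no_raw_traceback", r"Traceback \(most recent call last\)")

# The same grader object judges a test sample and a live message.
test_verdict = grader.judge("Your refund has been issued.", source="test")
live_verdict = grader.judge("Traceback (most recent call last): KeyError", source="live")
```

Because both verdicts come from the same object and share one record type, there is nothing to reconcile between an eval run and a monitoring alert.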

What You Get

Checks that track the product instead of describing an old version of it

Graders on live traffic, not a frozen sample

A grader runs on the traffic your users actually send, so it reflects what the agent does today. The golden sets you already have are not wasted; they fold in as one more input to intent.

Drafted from behavior, corrected by you

Intent starts as a reading of the agent's current behavior and is stored apart from its code. It gets sharper with every fixed bug and shared edge case, and it takes far less effort than writing and maintaining evals by hand.

One model for monitoring and evals

A verdict on a test sample and a verdict on live traffic are the same kind of record, so nothing needs reconciling between an eval suite and a monitoring stack.

Change lineage, diff or not

Commit-SHA lineage ties each run to the deploy that was live, which rules the diff in or out. It is one input to attribution, not the whole answer; plenty of drops arrive with no deploy behind them.
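To make "rules the diff in or out" concrete, here is a minimal sketch under the assumption that each run records the commit SHA that was live. The `Run` type, the `diff_is_candidate` helper, and the SHA values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    commit_sha: str      # the deploy that was live during this run
    failure_rate: float  # share of verdicts that failed


def diff_is_candidate(before: Run, after: Run) -> bool:
    """True when a different commit was live, so the diff stays in scope."""
    return before.commit_sha != after.commit_sha


baseline = Run(commit_sha="a1b2c3d", failure_rate=0.03)
after_deploy = Run(commit_sha="e4f5a6b", failure_rate=0.11)
no_deploy = Run(commit_sha="a1b2c3d", failure_rate=0.11)

# A new SHA rules the diff in; an unchanged SHA rules it out,
# which points toward causes such as a provider model update.
ruled_in = diff_is_candidate(baseline, after_deploy)
ruled_out = not diff_is_candidate(baseline, no_deploy)
```

Ruling the diff in only makes it a candidate; as the text says, lineage is one input to attribution and not the whole answer.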

Detection that survives imperfect graders

An imperfect grader misfires at a roughly stable rate, so its errors weigh equally on both sides of a comparison, and a failure rate that moves from 3% to 11% signals a real change. Tessary reads that movement across verdicts over time, not any single score.
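One way to read a shift like 3% to 11% is a two-proportion z-test over two windows of verdicts. This is a generic statistical sketch, not a description of how Tessary detects drops; the window size of 500 verdicts and the threshold of 3.0 are assumptions chosen for the example.

```python
import math


def failure_rate_shift(fail_before: int, n_before: int,
                       fail_after: int, n_after: int) -> float:
    """Two-proportion z-score for the change in failure rate between windows."""
    p_before = fail_before / n_before
    p_after = fail_after / n_after
    pooled = (fail_before + fail_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_err == 0:
        return 0.0  # no failures (or only failures) in both windows
    return (p_after - p_before) / std_err


THRESHOLD = 3.0  # assumed cut-off; tune to your tolerance for false alarms

# 15 of 500 verdicts failed before (3%), 55 of 500 after (11%).
z_real = failure_rate_shift(15, 500, 55, 500)
# 15 of 500 before (3%), 17 of 500 after (3.4%): ordinary wobble.
z_noise = failure_rate_shift(15, 500, 17, 500)

real_change = z_real > THRESHOLD
noise_only = abs(z_noise) <= THRESHOLD
```

A stable misfire rate lifts both windows by about the same amount, which is why comparing windows is more robust than trusting any single verdict.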

Works with your existing stack

Push over OTLP in gen_ai conventions, or pull a slice from Langfuse or Braintrust on demand. They hold the traces and track the scores; Tessary adds the step after: which change moved them.
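For orientation, here is a minimal sketch of what one model call might look like as an OTLP/JSON trace payload carrying `gen_ai` attributes. The attribute names follow the OpenTelemetry GenAI semantic conventions as commonly published, which are still evolving, so check the current spec before relying on them. The service name, model name, and token counts are placeholders, and the payload is only built here, not sent, because the ingest endpoint is not given in this page.

```python
import json
import secrets
import time


def attr(key: str, value) -> dict:
    """Encode one attribute in OTLP/JSON form (strings and ints only)."""
    if isinstance(value, int):
        return {"key": key, "value": {"intValue": str(value)}}
    return {"key": key, "value": {"stringValue": str(value)}}


def build_payload(model: str, input_tokens: int, output_tokens: int) -> dict:
    start = time.time_ns()
    span = {
        "traceId": secrets.token_hex(16),  # 32 hex characters
        "spanId": secrets.token_hex(8),    # 16 hex characters
        "name": f"chat {model}",
        "kind": 3,  # SPAN_KIND_CLIENT
        "startTimeUnixNano": str(start),
        "endTimeUnixNano": str(start + 1_000_000),
        "attributes": [
            attr("gen_ai.operation.name", "chat"),
            attr("gen_ai.request.model", model),
            attr("gen_ai.usage.input_tokens", input_tokens),
            attr("gen_ai.usage.output_tokens", output_tokens),
        ],
    }
    return {
        "resourceSpans": [{
            # "support-agent" is a placeholder service name.
            "resource": {"attributes": [attr("service.name", "support-agent")]},
            "scopeSpans": [{"scope": {"name": "example"}, "spans": [span]}],
        }]
    }


payload = build_payload("example-model", input_tokens=120, output_tokens=45)
body = json.dumps(payload)  # would be POSTed to an OTLP/HTTP traces endpoint
```

In practice an OpenTelemetry SDK and exporter would produce this for you; the hand-built dictionary is only meant to show the shape of the data.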

The payoff: When a grader trend drops, Tessary names the change that caused it

FAQ

Questions about evals that evolve

The graders get smarter, not just re-tagged to a commit. Intent starts as a reading of the agent's observed behavior and is stored apart from its code. Each fixed bug, shared edge case, and folded-in golden dataset sharpens it, so the checks track what the agent is supposed to do now, with far less work than maintaining an eval suite by hand.

A golden dataset is a snapshot from when you built it. Once the product moves, it stops matching what the agent actually does. Graders run on live traffic instead, so they reflect what the agent does now, and an existing golden set is folded in as an input rather than left to go stale on the side.

Mostly you do not. Tessary drafts graders from observed behavior and you correct the drafts. When you want a specific check, describe the failure mode in plain language and it becomes a lightweight classifier or a regex check. Tool errors are extracted automatically.

They are the same system. One grader produces both the eval verdicts and the production ones, so there is no separate suite drifting away from what monitoring watches. When a verdict trend moves in production, that same grader history is what attribution reads to find the change behind it.

No. It works across model providers, and you can push traces over OTLP or pull a slice from Langfuse or Braintrust. If you already run one of those, keep it; the traces stay where they are, and Tessary reads them to answer the question they leave open: which change moved the score.


Get Started

Put graders on your live traffic

Tessary drafts the graders, you correct them, and they sharpen from there. The next time quality drops, you get the change that caused it, not just a score that moved. No credit card required to start.

Works across model providers. Push over OTLP or connect Langfuse or Braintrust.