Why do agent evals go stale?

Agent evals go stale because they capture a snapshot of the product at write time while agents change continuously. Prompt edits, code refactors, model version bumps, and shifts in what a tool returns all change what the agent does without triggering an eval update. The eval suite ends up grading a version of the product that no longer exists.

How do I know if my eval suite is stale?

A stale eval suite consistently passes while production issues surface through customer escalations. Specific signals: graders that reference behavior or output patterns no longer present in the current agent, rubrics written against an older prompt version, and no new test cases added since the eval suite was first written.

How much engineering time does manual eval maintenance take?

For teams maintaining eval suites manually, each significant agent change carries a real recurring cost: reviewing production traces, deciding which graders need updating, and running a calibration pass. That cost compounds as the agent evolves and as the gap between the eval suite and production behavior widens.

How can an eval suite evolve automatically with the agent?

An eval suite stays current when it is synthesized from the codebase and production traces rather than written once and maintained manually. When synthesis re-runs after a change, new call sites, failure modes, and graders are generated to match the current product. Curation decisions persist across re-syntheses, so the team's judgment compounds without ongoing maintenance work.

Why Agent Evals Go Stale and Miss the Regressions That Matter

Q: What kinds of changes cause agent regressions that static evals miss?

Code changes outside the prompt layer are the most common: a refactor that moves logic into a tool call, or a model version bump that shifts output format behavior. Some causes never appear in a diff at all. A tool API starts returning a new response schema, or an upstream agent's output shifts. A frozen eval suite catches none of these, because the graders were written before the changes happened.

Galileo’s State of AI Evaluation Engineering Report, based on surveys of more than 500 AI practitioners, found that 84.9% of organizations experienced an AI incident within six months of launch. Only 51.7% consistently added those failures back to their eval suite.

Eval staleness follows the same pattern, but more quietly. The eval suite reports a pass. Quality is already drifting.

Why Eval Suites Go Stale After Launch

Most eval suites are written in the weeks before first deployment. They capture the call sites, the failure modes, and the grading rubrics as of that day. Then the agent ships. Prompts get edited, code around the agent changes, the model version gets bumped, orchestration logic shifts. The eval suite does not follow.

User behavior shifts over time, requirements evolve, and new failure modes appear that the original evals did not anticipate. The suite needs to keep pace, but none of these shifts triggers an update.

Three months after launch, the eval suite is a snapshot from a product several versions back. A team can run the suite, see all green, and ship a change that regresses a call site the suite no longer models correctly. A passing eval score says the agent is doing what it used to do. It does not say the agent is doing the right thing. The cost shows up later: trace review after a customer complaint, a day spent rebuilding graders to reflect what the product actually does now.

How Changes Outside the Prompt Break Eval Coverage

Prompt changes are at least visible. Someone edits the system message, knows a change happened, and may think to update the graders.

Changes outside the prompt are not. A refactor that moves business logic into a tool call, or a model version bump that shifts output format behavior: neither triggers an eval update by convention. And some causes never appear in a diff at all. A tool API starts returning a new response schema, an upstream agent’s output shifts, or the provider updates the model underneath you. The graders were written against the agent as it existed at launch, and nothing about these changes tells them to update.

The most concrete version of this: a refactor that has nothing to do with the prompt breaks a flow the eval suite has no grader for. Nothing flags it, because the graders do not cover the new code path. CIQ describes this pattern: the model functions correctly, but the behavior the workflow depends on has shifted. Drift becomes apparent only at the customer escalation, the parsing failure, or the compliance flag, after the damage is already done.
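One way to surface a changed tool response before it reaches a customer is to record the shape of each tool response when the graders are written, then compare live responses against that record. The following is a minimal sketch, assuming tool responses are JSON-like dicts. `shape_of` and `schema_drift` are hypothetical helper names for illustration, not part of any library or product.

```python
from typing import Any


def shape_of(value: Any, prefix: str = "") -> dict[str, str]:
    """Flatten a JSON-like value into {dotted.path: type name}.

    Lists and scalars are treated as leaves; only dicts are descended into.
    """
    if isinstance(value, dict) and value:
        shape: dict[str, str] = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            shape.update(shape_of(child, path))
        return shape
    return {prefix: type(value).__name__}


def schema_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report fields added, removed, or changed in type since the baseline."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "retyped": sorted(k for k in shared if baseline[k] != current[k]),
    }


# Hypothetical tool responses: one captured at launch, one seen in production.
at_launch = {"order": {"id": 17, "total": 42.5}, "status": "shipped"}
in_production = {"order": {"id": "17", "total": 42.5, "currency": "USD"}}

drift = schema_drift(shape_of(at_launch), shape_of(in_production))
```

A non-empty `drift` does not say the agent regressed. It says the graders were written against a response shape that no longer exists, which is a reason to review them.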

What Keeps an Eval Suite Current Without Manual Maintenance

The maintenance problem is structural: any approach that requires a human to decide which evals need updating after a code refactor, a prompt edit, or a model version bump will fall behind. The agent changes faster than the decision loop.

One approach: convert production traces that fail an online scorer into eval cases automatically, so the suite grows from real failures rather than a frozen snapshot. Eval coverage tracks the product because it is fed by the product.
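A minimal sketch of that conversion, assuming each trace carries its input, its output, and an online scorer result between 0 and 1. The `Trace` and `EvalCase` types, the `traces_to_eval_cases` function, and the 0.5 pass threshold are all hypothetical and stand in for whatever your tracing and eval tooling provides.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trace:
    trace_id: str
    call_site: str
    agent_input: str
    agent_output: str
    score: float  # online scorer result; assumed to lie in [0, 1]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    call_site: str
    agent_input: str
    failing_output: str  # kept as a known-bad example for the grader
    source_trace_id: str


def traces_to_eval_cases(
    traces: Iterable[Trace],
    existing_ids: frozenset[str] = frozenset(),
    pass_threshold: float = 0.5,  # assumed cutoff, tune per scorer
) -> list[EvalCase]:
    """Turn failing traces into eval cases, skipping inputs already covered."""
    seen = set(existing_ids)
    cases: list[EvalCase] = []
    for trace in traces:
        if trace.score >= pass_threshold:
            continue  # the scorer passed this trace
        # Identify a case by call site and input so repeats collapse to one.
        key = f"{trace.call_site}\n{trace.agent_input}".encode()
        case_id = hashlib.sha256(key).hexdigest()[:12]
        if case_id in seen:
            continue
        seen.add(case_id)
        cases.append(
            EvalCase(case_id, trace.call_site, trace.agent_input,
                     trace.agent_output, trace.trace_id)
        )
    return cases
```

Deduplicating on call site plus input keeps one noisy failure from flooding the suite. Passing the IDs of cases already in the suite as `existing_ids` makes repeated runs idempotent.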

Tessary’s approach goes a step further. The eval pipeline is synthesized from the codebase and production traces, then re-synthesized as the product changes, so regression detection runs against graders that reflect the current agent, not the one from launch day. The graders start as a reading of the agent’s observed behavior, so the team corrects drafts rather than writing checks from scratch. Curation decisions persist across re-syntheses: accept, reject, or edit a grader once, and that judgment carries forward, a record of intent that sharpens with every bug fixed. The evals evolve with the product rather than requiring someone to audit what changed after every deploy.
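The article does not describe how curation decisions are stored, so the following is only an illustration of the general idea: decisions are keyed by a stable grader identity and reapplied to each freshly synthesized set. `Grader`, `Decision`, `apply_curation`, and the key format are hypothetical, not Tessary's API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Grader:
    key: str     # stable identity, assumed here to be "<call site>:<dimension>"
    rubric: str


@dataclass(frozen=True)
class Decision:
    action: str                          # "accept", "reject", or "edit"
    edited_rubric: Optional[str] = None  # used only when action == "edit"


def apply_curation(
    synthesized: list[Grader], decisions: dict[str, Decision]
) -> list[Grader]:
    """Reapply earlier human decisions to a freshly synthesized grader set."""
    kept: list[Grader] = []
    for grader in synthesized:
        decision = decisions.get(grader.key)
        if decision is None or decision.action == "accept":
            kept.append(grader)  # new or previously accepted
        elif decision.action == "edit":
            # Keep the human-edited rubric; fall back to the draft if none given.
            kept.append(replace(grader, rubric=decision.edited_rubric or grader.rubric))
        elif decision.action == "reject":
            continue  # rejected once, stays rejected
        else:
            raise ValueError(f"unknown curation action: {decision.action!r}")
    return kept
```

Graders without a recorded decision pass through as drafts, so a re-synthesis only asks the team to review what is new.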

A current suite is still only half the answer. Score-tracking platforms like Langfuse and Braintrust will faithfully report that a number moved on whatever suite you maintain; the question of what caused the move stays with you. When a re-synthesized grader flags a drop, Tessary reads the trend across its verdicts (an imperfect grader misfires at a stable rate, so a moving rate marks a real change) and traces the failing dimension back through the agent’s chain to the change that caused it: the prompt edit, the refactor, the model version bump, or the tool that started returning different data. The answer to “what changed” comes from the system, in time to fix it before it does real damage.
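The reasoning about a stable misfire rate can be made concrete with a standard two-proportion z-test: compare a grader's failure rate in a baseline window against a recent window and ask whether the difference is larger than sampling noise. This is a textbook technique used here for illustration, not a description of how Tessary computes trends. The function names and the z threshold of 3.0 are assumptions.

```python
import math


def failure_rate_z(
    base_fail: int, base_total: int, recent_fail: int, recent_total: int
) -> float:
    """Two-proportion z-statistic for recent failure rate vs. baseline rate."""
    if base_total <= 0 or recent_total <= 0:
        raise ValueError("both windows need at least one verdict")
    p_base = base_fail / base_total
    p_recent = recent_fail / recent_total
    pooled = (base_fail + recent_fail) / (base_total + recent_total)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / base_total + 1 / recent_total)
    )
    if std_err == 0:
        return 0.0  # every verdict agrees in both windows, so the rates match
    return (p_recent - p_base) / std_err


def rate_moved(
    base_fail: int,
    base_total: int,
    recent_fail: int,
    recent_total: int,
    z_threshold: float = 3.0,  # assumed cutoff; stricter means fewer alerts
) -> bool:
    """True when the failure rate shifted by more than sampling noise explains."""
    z = failure_rate_z(base_fail, base_total, recent_fail, recent_total)
    return abs(z) >= z_threshold
```

A grader that fails 50 of 1,000 baseline verdicts and 12 of the most recent 200 is behaving as it always has. The same grader failing 40 of the most recent 200 has moved well past noise and points to a change worth tracing.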

When the model version bumps next quarter, or when the orchestration layer gets refactored, does the eval suite update automatically or does someone spend an afternoon deciding what changed and rebuilding graders?

Getting started takes a repo connection and no credit card. The next time a score drops, you get the cause, not just the number.
