agent evals · regression detection · eval pipeline · agentic products

Agent Evals Maintenance Regression: Why Static Suites Drift

By Akhil Varma

Short answer

Agent evals go stale because eval suites capture a snapshot of the product at write time, while agents change continuously. Every prompt edit, code refactor, model version bump, or infra change shifts what the agent does without updating the graders that judge it. Coverage drifts until the eval score reflects a product that no longer exists.

Galileo’s State of AI Evaluation Engineering Report, based on surveys of more than 500 AI practitioners, found that 84.9% of organizations experienced an AI incident within six months of launch. Only 51.7% consistently added those failures back to their eval suite.

Eval maintenance regression follows the same pattern, but more quietly: the eval suite reports a pass while quality is already drifting.

Why Agent Evals Maintenance Regression Happens After Launch

Most eval suites are written in the weeks before first deployment. They capture the call sites, the failure modes, and the grading rubrics as of that day. Then the agent ships. Prompts get edited, code around the agent changes, the model version gets bumped, orchestration logic shifts. The eval suite does not follow.

User behavior shifts over time, requirements evolve, and new failure modes appear that the original evals did not anticipate. The suite needs to keep pace, but nothing triggers an update when this happens.

Three months after launch, the eval suite is a snapshot from a product several versions back. A team can run the suite, see all green, and ship a change that regresses a call site the suite no longer models correctly. The cost shows up later: trace review after a customer complaint, a day spent rebuilding graders to reflect what the product actually does now.

How Code Changes Outside the Prompt Break Eval Coverage

Prompt changes are at least visible. Someone edits the system message, knows a change happened, and may think to update the graders.

Code changes outside the prompt are not. A refactor that moves business logic into a tool call, an infrastructure update that changes how spans are batched, a model version bump that shifts output format behavior: none of these trigger an eval update by convention. The graders were written against the agent as it existed at launch. They did not update when prompts, code, or the model changed.
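One way to make these silent changes visible is to record a fingerprint of the agent alongside each grader and compare it against the current agent. The sketch below is a minimal illustration under assumed names: `AgentSnapshot`, `fingerprint`, and `stale_graders` are hypothetical, and a real agent would likely need more inputs in the snapshot (orchestration code version, retrieval config, and so on).

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSnapshot:
    """Hypothetical summary of what a grader implicitly depends on."""
    system_prompt: str
    model_version: str
    tool_schemas: tuple[str, ...]  # serialized tool definitions (assumed format)


def fingerprint(snapshot: AgentSnapshot) -> str:
    """Hash the snapshot so a change to prompt, model, or tools changes the value."""
    payload = json.dumps(
        {
            "system_prompt": snapshot.system_prompt,
            "model_version": snapshot.model_version,
            "tool_schemas": sorted(snapshot.tool_schemas),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stale_graders(grader_fingerprints: dict[str, str], current: AgentSnapshot) -> list[str]:
    """Return names of graders that were recorded against a different agent state."""
    now = fingerprint(current)
    return sorted(name for name, fp in grader_fingerprints.items() if fp != now)


# Illustrative data: the model version was bumped after launch.
launch = AgentSnapshot("You are a support agent.", "model-v1", ("lookup_order",))
today = AgentSnapshot("You are a support agent.", "model-v2", ("lookup_order",))

recorded = {
    "refund_policy_grader": fingerprint(launch),  # written at launch
    "tone_grader": fingerprint(today),            # reviewed after the bump
}
print(stale_graders(recorded, today))  # ['refund_policy_grader']
```

A mismatch does not prove a grader is wrong. It only flags that the grader was last reviewed against an agent that has since changed.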

The most concrete version of this is a refactor that has nothing to do with the prompt but breaks a flow the eval suite has no grader for. Nothing flags it, because the graders do not cover the new code path. CIQ describes this pattern: the model functions correctly, but the behavior the workflow depends on has shifted. Drift becomes apparent only at the customer escalation, the parsing failure, or the compliance flag, after the damage is done.
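A coverage gap of this kind can be surfaced by comparing the call sites that appear in production traces with the call sites the graders target. The following is a rough sketch, assuming a hypothetical trace shape in which each trace has a list of `spans` and each span carries a `call_site` label; real tracing formats will differ.

```python
from collections import Counter


def uncovered_call_sites(traces: list[dict], grader_targets: set[str]) -> list[tuple[str, int]]:
    """Count production spans per call site and return those with no grader, busiest first."""
    seen = Counter(span["call_site"] for trace in traces for span in trace["spans"])
    gaps = [(site, count) for site, count in seen.items() if site not in grader_targets]
    return sorted(gaps, key=lambda item: (-item[1], item[0]))


# Illustrative data: a refactor moved refund logic into a new tool call.
traces = [
    {"spans": [{"call_site": "plan"}, {"call_site": "tools.issue_refund"}]},
    {"spans": [{"call_site": "plan"}, {"call_site": "tools.issue_refund"}]},
    {"spans": [{"call_site": "plan"}, {"call_site": "summarize"}]},
]
grader_targets = {"plan", "summarize"}  # what the launch-day suite covers

print(uncovered_call_sites(traces, grader_targets))  # [('tools.issue_refund', 2)]
```

Sorting by span count puts the busiest ungraded code path first, which is usually where a missing grader costs the most.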

Why Passing Eval Scores Coexist With Production Regressions

Passing eval scores coexist with production regressions when the graders are calibrated to an earlier product state. A passing eval score says the agent is doing what it used to do. It does not say the agent is doing the right thing.

Stable test cases provide consistent regression detection. Stale cases that no longer represent real usage patterns create false confidence. A team that runs evals regularly and sees green can still be shipping regressions that their graders are not positioned to catch.

What Keeps an Eval Suite Current Without Manual Maintenance

The maintenance problem is structural: any approach that requires a human to decide which evals need updating after a code refactor, a prompt edit, or a model version bump will fall behind. The agent changes faster than the decision loop.

One approach: convert production traces that fail an online scorer into eval cases automatically, so the suite grows from real failures rather than a frozen snapshot. Eval coverage tracks the product because it is fed by the product.
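As a minimal sketch of that conversion, assuming a hypothetical trace shape (`input` and `output` keys) and a scorer that returns a score plus a reason, failing traces can be deduplicated and appended to the suite. The names `EvalCase`, `harvest_failures`, and `toy_scorer` are illustrative and do not come from any particular library.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    observed_output: str   # the failing output, kept for reviewer context
    failure_reason: str


def case_id_for(user_input: str) -> str:
    """Stable ID from the normalized input, so repeated failures collapse into one case."""
    normalized = " ".join(user_input.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def harvest_failures(
    traces: list[dict],
    scorer: Callable[[dict], tuple[float, str]],
    suite: dict[str, EvalCase],
    threshold: float = 0.5,
) -> int:
    """Add one eval case per distinct failing input; return how many were added."""
    added = 0
    for trace in traces:
        score, reason = scorer(trace)
        if score >= threshold:
            continue  # trace passed the online scorer
        cid = case_id_for(trace["input"])
        if cid in suite:
            continue  # this failure is already represented in the suite
        suite[cid] = EvalCase(cid, trace["input"], trace["output"], reason)
        added += 1
    return added


def toy_scorer(trace: dict) -> tuple[float, str]:
    """Stand-in scorer (assumption): fail any output that does not look like JSON."""
    ok = trace["output"].strip().startswith("{")
    return (1.0, "ok") if ok else (0.0, "output is not JSON")


traces = [
    {"input": "Cancel my order", "output": '{"action": "cancel"}'},
    {"input": "Where is my refund?", "output": "Sure! Your refund is on the way."},
    {"input": "where is  my refund?", "output": "Your refund was issued."},
]
suite: dict[str, EvalCase] = {}
print(harvest_failures(traces, toy_scorer, suite))  # 1
```

The two refund traces differ only in casing and spacing, so they collapse into a single case. A harvested case records what went wrong, not what the right answer is, so an expected output or rubric still has to be supplied before the case can grade anything.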

Tessary’s approach goes a step further. The eval pipeline is synthesized from the codebase and production traces, then re-synthesized as the product changes. Regression detection runs against graders that reflect the current agent, not the one from launch day. Curation decisions persist across re-syntheses: accept, reject, or edit a grader once, and that judgment carries forward. The evals evolve with the product rather than requiring someone to audit what changed after every deploy.
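How curation decisions could carry across re-syntheses can be illustrated with a small sketch. This is not Tessary's implementation, which the article does not describe. It assumes each synthesized grader has a stable ID, and that decisions are stored by that ID and reapplied to every new run. All names here are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Grader:
    grader_id: str  # assumed stable across runs, e.g. call site plus failure mode
    rubric: str


@dataclass(frozen=True)
class Decision:
    action: str                          # "accept", "reject", or "edit"
    edited_rubric: Optional[str] = None  # used only when action == "edit"


def apply_curation(synthesized: list[Grader], decisions: dict[str, Decision]) -> list[Grader]:
    """Reapply stored human decisions to a freshly synthesized list of graders."""
    kept = []
    for grader in synthesized:
        decision = decisions.get(grader.grader_id)
        if decision is None or decision.action == "accept":
            kept.append(grader)  # new or accepted graders pass through unchanged
        elif decision.action == "edit":
            # Keep the human-edited rubric; fall back to the synthesized one if none was stored.
            kept.append(replace(grader, rubric=decision.edited_rubric or grader.rubric))
        elif decision.action != "reject":
            raise ValueError(f"unknown curation action: {decision.action!r}")
        # Rejected graders are dropped.
    return kept


decisions = {
    "refund.amount": Decision("edit", "Refund must equal the order total."),
    "tone.formal": Decision("reject"),
}
new_run = [
    Grader("refund.amount", "Refund amount looks plausible."),
    Grader("tone.formal", "Reply is formal."),
    Grader("tools.lookup", "Tool is called with a valid order id."),
]
result = apply_curation(new_run, decisions)
print([g.grader_id for g in result])  # ['refund.amount', 'tools.lookup']
```

One open design question in a scheme like this is whether a stored edit should still win when the underlying call site has changed substantially. The sketch always prefers the human edit.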

When the model version bumps next quarter, or when the orchestration layer gets refactored, does the eval suite update automatically or does someone spend an afternoon deciding what changed and rebuilding graders?

Getting started takes a repo connection, no credit card. Try Tessary Evals

Frequently asked questions

Why do agent evals go stale?
Agent evals go stale because they capture a snapshot of the product at write time while agents change continuously. Prompt edits, code refactors, model version bumps, and infra changes all shift what the agent does without triggering an eval update. The eval suite ends up grading a version of the product that no longer exists.

What kinds of changes cause agent regressions that static evals miss?
Code changes outside the prompt layer are the most common. A refactor that moves logic into a tool call, a model version bump that shifts output format behavior, or an infrastructure update that changes how spans are batched can all produce regressions a frozen eval suite won't catch, because the graders were written before these changes happened.

How do I know if my eval suite is stale?
A stale eval suite consistently passes while production issues surface through customer escalations. Specific signals: graders that reference behavior or output patterns no longer present in the current agent, rubrics written against an older prompt version, and no new test cases added since the eval suite was first written.

How much engineering time does manual eval maintenance take?
For teams maintaining eval suites manually, the cost is roughly one engineering day per significant agent change: reviewing production traces, deciding which graders need updating, and running a calibration pass. This compounds as the agent evolves and as the gap between the eval suite and production behavior widens.

How can an eval suite evolve automatically with the agent?
An eval suite stays current when it is synthesized from the codebase and production traces rather than written once and maintained manually. When synthesis re-runs after a change, new call sites, failure modes, and graders are generated to match the current product. Curation decisions persist across re-syntheses, so the team's judgment compounds without ongoing maintenance work.

Written by

Akhil Varma · Founder, Tessary

Akhil builds Tessary — AI personas that run real-browser usability tests on B2B SaaS products. Previously shipped product at multiple early-stage startups; writes about usability testing, AI personas, and the economics of B2B research.