agent evals, regression testing, LLM engineering

Agent Regressions from Non-Prompt Code Changes

By Akhil Varma · June 3, 2026

Short answer

Agent regressions from non-prompt code changes happen when a library update, schema change, or call chain refactor alters what the model receives or how its output is consumed. Prompt-only evals miss these. Graders catch them by evaluating the full call site on each run, not just the prompt, and the failing quality dimension can then be traced back through the chain to the specific change that caused it.

Anthropic’s April 2026 engineering postmortem documented two infrastructure changes that degraded Claude Code for weeks. Neither touched a prompt. A session caching bug cleared reasoning context every turn instead of once. A configuration change dropped the default reasoning effort from high to medium. Neither change appeared as a measurable failure in the existing evals, and users experienced both as the model getting worse.

Part of why both stayed invisible is that nothing in the traces captured session state or the reasoning-effort setting. A gap like that is informative: when a trace goes dark partway to a cause, the gap itself is the next node to instrument, and the next regression is easier to trace.
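As a minimal sketch of what instrumenting that gap could look like, the snippet below records the reasoning-effort setting and the amount of retained session context on every model-call span, then compares the values seen before and after a deploy. The span layout and the names `record_model_call`, `setting_drift`, `reasoning_effort`, and `retained_turns` are assumptions made for illustration; they are not taken from the postmortem or from any tracing library.

```python
def record_model_call(trace, *, call_site, prompt, output,
                      reasoning_effort, retained_turns):
    """Append one model-call span, including settings that live outside the prompt."""
    span = {
        "call_site": call_site,
        "prompt": prompt,
        "output": output,
        # Hypothetical attributes: the effort setting in force for this call and
        # how many prior turns of reasoning context were still available.
        "reasoning_effort": reasoning_effort,
        "retained_turns": retained_turns,
    }
    trace.append(span)
    return span


def setting_drift(baseline_spans, current_spans,
                  keys=("reasoning_effort", "retained_turns")):
    """Return the recorded settings whose observed values differ between deploys."""
    drift = {}
    for key in keys:
        before = {span.get(key) for span in baseline_spans}
        after = {span.get(key) for span in current_spans}
        if before != after:
            drift[key] = (sorted(before, key=str), sorted(after, key=str))
    return drift


baseline, current = [], []
record_model_call(baseline, call_site="plan_step", prompt="Plan the fix.",
                  output="...", reasoning_effort="high", retained_turns=5)
record_model_call(current, call_site="plan_step", prompt="Plan the fix.",
                  output="...", reasoning_effort="medium", retained_turns=0)

drift = setting_drift(baseline, current)
print(drift)
```

The prompt is the same in both spans; only the recorded settings differ, which is the kind of difference a trace without those attributes cannot show.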

This is the class of problem: agent regression from non-prompt code changes.

Why Non-Prompt Code Changes Cause Agent Regressions

A prompt can stay identical across two deploys and the agent can still break. The cause is upstream from the prompt.

A library update alters the schema of what your code passes to the model. A refactored call chain changes what context reaches the prompt assembly step. A dependency update shifts the format of a tool response the model is expected to parse. An infrastructure change modifies how much reasoning context is preserved across turns.

In each case, the model is receiving different inputs or producing outputs that land in different parsing code. The prompt is not the variable. The surrounding code is.
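A small illustration of this, under assumed names: the prompt template below never changes, but a renamed field in a tool response changes what the context assembly step puts into it. `PROMPT_TEMPLATE`, `assemble_context`, and the `status` / `order_status` fields are hypothetical and stand in for whatever your own code assembles.

```python
import hashlib
import json

PROMPT_TEMPLATE = "Answer the customer using the order data below.\n{context}"


def assemble_context(tool_response: dict) -> str:
    """Context assembly written against the old tool schema (expects 'status')."""
    # .get() hides the rename: a missing field silently becomes "unknown".
    return json.dumps({"order_status": tool_response.get("status", "unknown")})


def fingerprint(text: str) -> str:
    """Short stable hash, handy for comparing assembled inputs across deploys."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


old_response = {"order_id": 17, "status": "shipped"}        # before the library update
new_response = {"order_id": 17, "order_status": "shipped"}  # after: field renamed

old_input = PROMPT_TEMPLATE.format(context=assemble_context(old_response))
new_input = PROMPT_TEMPLATE.format(context=assemble_context(new_response))

# The template is byte-for-byte the same in both deploys; the model input is not.
print("inputs differ:", fingerprint(old_input) != fingerprint(new_input))
print(new_input)
```

A diff of the prompt template shows nothing here. Only the assembled input, as captured at call time, reveals that the model is now told the order status is unknown.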

This is why the regression detection problem for agentic products is broader than most eval tooling assumes. Eval pipelines built around prompt-to-output comparison cover one slice of the failure surface. They leave non-prompt code changes uncovered by design.

Why Prompt-Only Evals Miss Non-Prompt Regressions

Standard eval workflows compare outputs against expected results, given the same prompts. That approach catches prompt regressions: if you change the prompt and the behavior degrades, the eval surfaces it.

It does not catch non-prompt regressions.

When a call chain is refactored, the prompt fed to the model might be identical to what ran last week. The surrounding infrastructure is different. An eval that only looks at prompt-to-output consistency does not see the refactor as a variable. It sees the same prompt and an output that looks plausible, without knowing that the context assembly step now produces different content.

Manual trace review has the same gap. An engineer reviewing traces after a library update will compare outputs to expectations. Without knowing the tool response schema changed, they have no baseline for what correct looks like given the new inputs. They are reviewing outputs in isolation from the code change that produced them.

How Graders Catch Agent Regressions from Code Changes

A grader evaluates a specific failure mode at a specific call site. It runs against production traces, which means it operates on real inputs as the system assembled them at call time, not on synthetic prompts constructed outside the production environment.

When a library update changes a tool response schema, the grader sees the new schema in the trace. A grader checking whether an expected field is present in the tool response will fail if the field name changed. A grader checking whether the model output references a retrieved document will fail if the retrieval layer stopped returning it. These graders catch the regression not by comparing to the old prompt, but by checking what the model did with what it received.

This is why trace-grounded graders detect non-prompt regressions where prompt comparisons do not: they evaluate the full call, including the surrounding code’s contribution to the inputs.
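The two graders described above can be sketched as plain functions over a recorded span. This is a minimal sketch, assuming a simplified `Span` record; the field names (`call_site`, `tool_response`, `retrieved_doc_ids`, `model_output`) and both grader functions are hypothetical, not the schema of any particular tracing or eval product.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One recorded model call, as the system assembled it at call time."""
    call_site: str
    tool_response: dict
    retrieved_doc_ids: list = field(default_factory=list)
    model_output: str = ""


def grade_required_field(span: Span, field_name: str) -> bool:
    """Pass only if the tool response still carries the expected field."""
    return field_name in span.tool_response


def grade_cites_retrieved_doc(span: Span) -> bool:
    """Pass only if the output references at least one retrieved document."""
    if not span.retrieved_doc_ids:
        # Nothing was retrieved, so the output cannot be grounded in a document.
        return False
    return any(doc_id in span.model_output for doc_id in span.retrieved_doc_ids)


before = Span("answer_order_question", {"status": "shipped"},
              ["doc-42"], "Per doc-42, your order has shipped.")
after = Span("answer_order_question", {"order_status": "shipped"},
             [], "Your order status is unknown.")

for label, span in (("before", before), ("after", after)):
    print(label,
          grade_required_field(span, "status"),
          grade_cites_retrieved_doc(span))
```

Neither grader looks at the prompt. Both read what the surrounding code delivered and what the model did with it, which is why they move when the code changes and the prompt does not.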

Braintrust’s CI/CD evaluation guide recommends running graders after every deploy, not only when prompts change. That cadence gets you detection, not attribution: it tells you a score moved after the deploy, not which change moved it. A real release often bundles a prompt tweak, a dependency bump, and a refactor into one deploy, so a flipped grader still leaves every change in that release as a suspect. The next section is about closing that gap.

Running Graders After Every Deploy, Not Just Prompt Changes

The practical implication is that graders need to run on production traces after every deploy, regardless of whether a prompt changed.

A working setup connects your trace source, runs the eval suite against spans from the latest deploy, and compares pass and fail counts to the prior deploy baseline. Read the movement in the fail rate rather than any single flipped verdict: a grader that captures intent imperfectly misfires at a roughly steady rate, so a fail rate that holds at 4% for weeks and then climbs to 12% after a deploy signals a real regression, one to investigate before more traffic runs through it.
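One way to separate a real shift from a steady misfire rate is a two-proportion z-test on the fail rates of the two deploys. The sketch below uses the 4% and 12% figures from the paragraph above; the sample size of 500 spans per deploy, the threshold of 2.0, and the function names are assumptions chosen for illustration, and a simpler fixed-margin comparison may be enough in practice.

```python
import math


def fail_rate(verdicts):
    """Share of failing verdicts; each verdict is True for pass, False for fail."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    return sum(1 for passed in verdicts if not passed) / len(verdicts)


def regression_signal(baseline, current, z_threshold=2.0):
    """Compare one grader's fail rate across two deploys with a two-proportion z-test."""
    n_base, n_cur = len(baseline), len(current)
    p_base, p_cur = fail_rate(baseline), fail_rate(current)
    pooled = (p_base * n_base + p_cur * n_cur) / (n_base + n_cur)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_cur))
    z = (p_cur - p_base) / std_err if std_err > 0 else 0.0
    return {
        "baseline_fail_rate": p_base,
        "current_fail_rate": p_cur,
        "z": z,
        "regressed": z > z_threshold,  # one-sided: only a rise counts
    }


# 500 graded spans per deploy (an assumed volume).
prior_deploy = [False] * 20 + [True] * 480    # 4% fail
quiet_deploy = [False] * 22 + [True] * 478    # 4.4% fail, within noise
broken_deploy = [False] * 60 + [True] * 440   # 12% fail

print(regression_signal(prior_deploy, quiet_deploy)["regressed"])
print(regression_signal(prior_deploy, broken_deploy)["regressed"])
```

The threshold trades false alarms against missed regressions, and low-traffic call sites need more spans before a shift of this size becomes distinguishable from noise.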

Investigation is where the manual version stalls, because the deploy that moved the rate rarely contains one change. This is the attribution step: when a grader starts failing, it names the quality dimension that broke, and Tessary traces that dimension back through the agent’s chain (the tool response, the context assembly step, the dependency that changed) to the specific change that caused it. A regression to investigate becomes a regression with a named cause. The same engine can run against a change before it merges; pre-deploy failure prediction covers that case.
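To make the idea of tracing back through the chain concrete, here is a generic first-divergence check. It is not Tessary's implementation, only a sketch under stated assumptions: each deploy's trace is reduced to an ordered list of `(step_name, payload)` pairs running from upstream to downstream, and two payloads count as matching when their structure (keys and value types) matches. The names `shape` and `first_divergent_step` and the step names are hypothetical.

```python
def shape(value):
    """Reduce a payload to its structure: keys and value types, not the values."""
    if isinstance(value, dict):
        return {key: shape(val) for key, val in sorted(value.items())}
    if isinstance(value, list):
        # Summarize a list by the shape of its first element.
        return [shape(value[0])] if value else []
    return type(value).__name__


def first_divergent_step(baseline_steps, current_steps):
    """Return the most upstream step whose payload shape changed, or None."""
    # zip() stops at the shorter chain; a change in chain length needs its own check.
    for (base_name, base_payload), (cur_name, cur_payload) in zip(
            baseline_steps, current_steps):
        if base_name != cur_name or shape(base_payload) != shape(cur_payload):
            return cur_name
    return None


baseline_chain = [
    ("tool_response", {"order_id": 17, "status": "shipped"}),
    ("context_assembly", {"order_status": "shipped"}),
    ("model_output", {"text": "Your order has shipped."}),
]
current_chain = [
    ("tool_response", {"order_id": 17, "order_status": "shipped"}),
    ("context_assembly", {"order_status": "unknown"}),
    ("model_output", {"text": "Your order status is unknown."}),
]

print(first_divergent_step(baseline_chain, current_chain))
```

In this example the context assembly payload has the same shape in both deploys and only its value differs, so a shape comparison points at the tool response, the step where the structure first changed. Mapping that step to a specific commit or dependency bump is a further step this sketch does not attempt.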

One maintenance problem remains: keeping graders pointed at the right call sites as the codebase changes. When the call chain is refactored, graders that referenced the old structure become stale, and the suite keeps reporting pass on a code path it is no longer correctly checking. Tessary drafts graders from the agent’s observed behavior; you correct them rather than write them. It re-drafts them as the code changes, so coverage tracks the current call structure rather than the one that existed at launch.
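A cheap guard against silent staleness is a coverage check: flag any grader whose target call site no longer appears in the latest deploy's spans, since a grader that matches nothing also fails nothing. The sketch below assumes each grader declares one call-site name and each span records the call site it came from; `find_stale_graders` and all names in the example are hypothetical, and this is a generic check rather than a description of how Tessary re-drafts graders.

```python
def find_stale_graders(grader_call_sites, spans):
    """Return graders whose declared call site appears in none of the given spans."""
    observed_sites = {span["call_site"] for span in spans}
    return sorted(
        grader_name
        for grader_name, call_site in grader_call_sites.items()
        if call_site not in observed_sites
    )


graders = {
    "tool_response_has_status": "fetch_order",
    "output_cites_retrieved_doc": "answer_order_question",
}

# After a refactor, the order lookup moved to a call site with a new name.
latest_spans = [
    {"call_site": "orders.lookup"},
    {"call_site": "answer_order_question"},
]

print(find_stale_graders(graders, latest_spans))
```

A grader reported here should be treated as uncovered rather than passing until it is pointed at the call site that replaced the old one.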

The maintenance story matters, but it is not the point of the tool. The point is the question this article started with: when your agent gets worse after a deploy and no prompt changed, Tessary tells you which library update, schema change, or refactor caused it, fast enough to fix it before it does damage.

Connect your traces and find out which change in your last deploy moved your scores. No credit card needed to start.

Frequently asked questions

What is an agent regression from a non-prompt code change?

An agent regression from a non-prompt code change is a degradation in agent behavior caused by a code change outside the prompt layer. Common examples include library updates that change tool response schemas, call chain refactors that alter context assembly, and infrastructure changes that modify reasoning budgets or session state. The prompt stays identical, but the model receives different inputs or its outputs hit different downstream code.

Why do my evals miss regressions caused by code changes?

Prompt-only evals compare outputs to expected results given the same prompts. When a code change causes a regression, the prompt fed to the model is identical to what ran before. The eval sees the same prompt and a plausible-looking output, without knowing that the context assembly, tool response schema, or call chain changed upstream. No comparison to the old prompt will surface a change that happened elsewhere in the system.

How do I find which code change caused an agent regression?

Detection alone leaves the question open, because a deploy usually ships several changes at once and a flipped grader leaves all of them as suspects. Attribution is the step after detection. When a grader starts failing, it names the quality dimension that broke, and that dimension gets traced back step by step, from the failing output through the layers that fed it, until the trace reaches the change responsible, whether that is a library update, a schema change, or a refactor.

How do grader-based evals detect non-prompt code change regressions?

Graders evaluate specific failure modes at specific call sites against production traces. A grader checking whether an expected field is present in a tool response will fail if a library update changed the field name. A grader checking whether output references a retrieved document will fail if the retrieval layer stopped returning it. Because graders run against real spans, they capture what the full system delivered to the model at call time, not just what the prompt said.

How often should I run my eval suite after code changes?

The eval suite should run after every deploy, not only when prompts change. Connect your trace source to run graders against spans from each deploy and compare pass and fail counts to the prior baseline. A grader whose fail rate climbs clearly above its baseline after a non-prompt code change indicates a regression, and the next step is tracing which change in that deploy caused it before more traffic runs through it.

How is an agent eval suite different from a unit test for the same code change?

Unit tests verify that code does what it is supposed to do in isolation. An agent eval suite verifies that the model behaves correctly given what the code delivers to it. A unit test might pass after a schema change because the code returns the new schema correctly, while an eval grader fails because the model produces incorrect output given a schema it was not calibrated for. Both are needed; they test different layers.

Written by

Akhil Varma · Founder, Tessary

Akhil builds Tessary — AI personas that run real-browser usability tests on B2B SaaS products. Previously shipped product at multiple early-stage startups; writes about usability testing, AI personas, and the economics of B2B research.