Agent Regressions from Non-Prompt Code Changes
Short answer
Agent regressions from non-prompt code changes happen when a library update, schema change, or call chain refactor alters what the model receives or how its output is consumed. Prompt-only evals miss these. Graders catch them by evaluating the full call site on each run, not just the prompt.
Anthropic’s April 2026 engineering postmortem documented two infrastructure changes that degraded Claude Code for weeks. Neither touched a prompt. A session caching bug cleared reasoning context every turn instead of once. A configuration change dropped the default reasoning effort from high to medium. Neither change appeared as a measurable failure in the existing evals, and users experienced both as the model getting worse.
This is the class of problem: agent regression from non-prompt code changes.
Why Non-Prompt Code Changes Cause Agent Regressions
A prompt can stay identical across two deploys and the agent can still break. The cause is upstream from the prompt.
A library update alters the schema of what your code passes to the model. A refactored call chain changes what context reaches the prompt assembly step. A dependency update shifts the format of a tool response the model is expected to parse. An infrastructure change modifies how much reasoning context is preserved across turns.
In each case, the model is receiving different inputs or producing outputs that land in different parsing code. The prompt is not the variable. The surrounding code is.
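A minimal sketch of how this happens, assuming a hypothetical context assembly function and a tool response whose `summary` field a library update renames to `abstract`. The template, function name, and field names are illustrative and do not come from any real library.

```python
# Hypothetical prompt template: it is byte-for-byte identical across both deploys.
PROMPT_TEMPLATE = (
    "Answer using the document below.\n\n"
    "Document: {document}\n\n"
    "Question: {question}"
)


def assemble_model_input(tool_response: dict, question: str) -> str:
    """Build the text the model actually receives (illustrative only)."""
    # .get() with a default hides a missing key: if the field is renamed
    # upstream, the document silently becomes an empty string.
    document = tool_response.get("summary", "")
    return PROMPT_TEMPLATE.format(document=document, question=question)


question = "How long do refunds take?"

# Before the library update: the tool returns the field the code expects.
before = assemble_model_input({"summary": "Refunds take 5 days."}, question)

# After the update: same template, same code path, renamed field.
after = assemble_model_input({"abstract": "Refunds take 5 days."}, question)

print(before == after)  # False, even though the prompt template never changed
```

Nothing in this sketch raises an error, so a test suite that only checks for exceptions would still pass after the rename.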
This is why the regression detection problem for agentic products is broader than most eval tooling assumes. Eval pipelines built around prompt-to-output comparison cover one slice of the failure surface. They leave non-prompt code changes uncovered by design.
Why Prompt-Only Evals Miss Non-Prompt Regressions
Standard eval workflows compare outputs against expected results, given the same prompts. That approach catches prompt regressions: if you change the prompt and the behavior degrades, the eval surfaces it.
It does not catch non-prompt regressions.
When a call chain is refactored, the prompt fed to the model might be identical to what ran last week. The surrounding infrastructure is different. An eval that only looks at prompt-to-output consistency does not see the refactor as a variable. It sees the same prompt and an output that looks plausible, without knowing that the context assembly step now produces different content.
Manual trace review has the same gap. An engineer reviewing traces after a library update will compare outputs to expectations. Without knowing the tool response schema changed, they have no baseline for what correct looks like given the new inputs. They are reviewing outputs in isolation from the code change that produced them.
How Graders Catch Agent Regressions from Code Changes
A grader evaluates a specific failure mode at a specific call site. It runs against production traces, which means it operates on real inputs as the system assembled them at call time, not on synthetic prompts constructed outside the production environment.
When a library update changes a tool response schema, the grader sees the new schema in the trace. A grader checking whether an expected field is present in the tool response will fail if the field name changed. A grader checking whether the model output references a retrieved document will fail if the retrieval layer stopped returning it. These graders catch the regression not by comparing to the old prompt, but by checking what the model did with what it received.
This is why trace-grounded graders detect non-prompt regressions where prompt comparisons do not: they evaluate the full call, including the surrounding code’s contribution to the inputs.
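The two checks described above can be sketched as plain functions over a trace span. The `Span` shape and every field name here are assumptions made for illustration, not the schema of any particular tracing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical record of one model call as it ran in production."""
    call_site: str
    tool_response: dict
    retrieved_doc_ids: list = field(default_factory=list)
    model_output: str = ""


def grade_expected_field(span: Span, field_name: str) -> bool:
    """Pass only if the tool response still carries the expected field."""
    return field_name in span.tool_response


def grade_cites_retrieved_doc(span: Span) -> bool:
    """Pass only if the output mentions at least one retrieved document id.

    If retrieval returned nothing, any() over an empty list is False,
    so the grader fails rather than passing vacuously.
    """
    return any(doc_id in span.model_output for doc_id in span.retrieved_doc_ids)


# A span captured after a library update renamed "status" to "state"
# and the retrieval layer stopped returning documents.
span_after_update = Span(
    call_site="support_agent.lookup_order",
    tool_response={"state": "shipped"},
    retrieved_doc_ids=[],
    model_output="Your order is on its way.",
)

print(grade_expected_field(span_after_update, "status"))  # False
print(grade_cites_retrieved_doc(span_after_update))       # False
```

Neither grader looks at the prompt. Both read only what the system delivered to the model and what the model returned.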
Braintrust’s CI/CD evaluation guide shows the same principle: run graders after every deploy, not only when prompts change.
Running Graders After Every Deploy, Not Just Prompt Changes
The practical implication is that graders need to run on production traces after every deploy, regardless of whether a prompt changed.
A working setup connects your trace source, runs the eval suite against spans from the latest deploy, and compares pass and fail counts to the prior deploy baseline. Any grader that flips from pass to fail after a non-prompt code change indicates a regression to investigate before more traffic runs through the affected call site.
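One way to sketch the baseline comparison, assuming grader results arrive as `(grader_name, passed)` pairs. The 0.9 pass-rate threshold that defines a "flip" is an arbitrary assumption; a real setup would choose its own cutoff per grader.

```python
from collections import defaultdict


def summarize(results):
    """Count passes and fails per grader from (grader_name, passed) pairs."""
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for name, passed in results:
        counts[name]["pass" if passed else "fail"] += 1
    return dict(counts)


def pass_rate(count):
    total = count["pass"] + count["fail"]
    return count["pass"] / total if total else 0.0


def flipped_graders(baseline, current, threshold=0.9):
    """Return graders that met the threshold on the prior deploy but not now.

    The threshold is an assumed cutoff, not a recommended value.
    """
    flipped = []
    for name, base_count in baseline.items():
        cur_count = current.get(name)
        if cur_count is None:
            continue  # grader produced no results on this deploy
        if pass_rate(base_count) >= threshold and pass_rate(cur_count) < threshold:
            flipped.append(name)
    return sorted(flipped)


prior_deploy = summarize(
    [("has_status_field", True)] * 10 + [("cites_retrieved_doc", True)] * 10
)
latest_deploy = summarize(
    [("has_status_field", False)] * 10 + [("cites_retrieved_doc", True)] * 10
)

print(flipped_graders(prior_deploy, latest_deploy))  # ['has_status_field']
```

Graders with no results on the latest deploy are skipped here instead of being counted as flips, because missing coverage is a different problem from a failing check.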
The part that does not scale is manually maintaining which graders cover which call sites as the codebase changes. When the call chain is refactored, graders that referenced the old structure become stale. If nobody notices, the suite keeps reporting pass on a code path the grader is no longer correctly checking.
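A simple guard against this kind of staleness is to flag graders whose target call site no longer appears in spans from the latest deploy. The sketch below assumes each grader is registered against a call-site name and that spans carry the same names. Both are hypothetical conventions, and the check only catches renamed or removed call sites, not graders whose logic has drifted.

```python
def find_stale_graders(grader_call_sites, observed_call_sites):
    """Return graders whose registered call site matched no recent span.

    grader_call_sites: {grader_name: call_site_name}  (assumed registry shape)
    observed_call_sites: set of call-site names seen in the latest deploy
    """
    return sorted(
        name
        for name, site in grader_call_sites.items()
        if site not in observed_call_sites
    )


# Hypothetical registry written before a refactor moved the order lookup.
grader_call_sites = {
    "has_order_status": "support_agent.lookup_order",
    "cites_retrieved_doc": "support_agent.answer",
}

# Call sites actually seen in spans after the refactor.
observed_call_sites = {"support_agent.answer", "orders.fetch_status"}

stale = find_stale_graders(grader_call_sites, observed_call_sites)
print(stale)  # ['has_order_status']
```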
Tessary synthesizes graders from your codebase and re-runs synthesis as the code changes, so the eval suite tracks the current call structure rather than the one that existed at launch. When a call chain is refactored, re-synthesis updates the graders to reflect the new call site. Coverage stays current without a manual maintenance step.
Frequently asked questions
- What is an agent regression from a non-prompt code change?
- An agent regression from a non-prompt code change is a degradation in agent behavior caused by a code change outside the prompt layer. Common examples include library updates that change tool response schemas, call chain refactors that alter context assembly, and infrastructure changes that modify reasoning budgets or session state. The prompt stays identical, but the model receives different inputs or its outputs hit different downstream code.
- Why do my evals miss regressions caused by code changes?
- Prompt-only evals compare outputs to expected results given the same prompts. When a code change causes a regression, the prompt fed to the model is identical to what ran before. The eval sees the same prompt and a plausible-looking output, without knowing that the context assembly, tool response schema, or call chain changed upstream. No comparison to the old prompt will surface a change that happened elsewhere in the system.
- What types of code changes most commonly cause agent regressions?
- The most common sources are library updates that alter tool response schemas, database schema changes that affect what data the model ingests, call chain refactors that change context assembly, dependency updates that shift response formats, and infrastructure changes to reasoning budget defaults or session handling. Any change that modifies what the model receives or how its outputs are consumed can cause a regression without touching the prompt.
- How do grader-based evals detect non-prompt code change regressions?
- Graders evaluate specific failure modes at specific call sites against production traces. A grader checking whether an expected field is present in a tool response will fail if a library update changed the field name. A grader checking whether output references a retrieved document will fail if the retrieval layer stopped returning it. Because graders run against real spans, they capture what the full system delivered to the model at call time, not just what the prompt said.
- How often should I run my eval suite after code changes?
- The eval suite should run after every deploy, not only when prompts change. Connect your trace source to run graders against spans from each deploy and compare pass and fail counts to the prior baseline. Any grader that flips from pass to fail after a non-prompt code change indicates a regression worth investigating before more traffic runs through it.
- How is an agent eval suite different from a unit test for the same code change?
- Unit tests verify that code does what it is supposed to do in isolation. An agent eval suite verifies that the model behaves correctly given what the code delivers to it. A unit test might pass after a schema change because the code returns the new schema correctly, while an eval grader fails because the model produces incorrect output given a schema it was not calibrated for. Both are needed; they test different layers.
Written by
Akhil Varma · Founder, Tessary
Akhil builds Tessary — AI personas that run real-browser usability tests on B2B SaaS products. Previously shipped product at multiple early-stage startups; writes about usability testing, AI personas, and the economics of B2B research.