Why do golden datasets go stale for LLM evaluation?

Golden datasets go stale because they capture the product as it was when the dataset was built. As prompts change, models get updated, and code evolves, the dataset no longer reflects real production behavior. Graders running against a stale dataset can pass while the live product has already drifted.

What do production traces catch that golden datasets miss?

Production traces surface edge cases that were not anticipated when the golden dataset was built. Real traffic includes unusual user inputs, provider drift, and behavior changes from code updates that pre-collected datasets cannot contain. Graders running on traces also catch regressions from non-prompt changes because they operate on spans the production system actually generated.

How does trace-grounded evaluation work in practice?

A trace-grounded eval pipeline samples production spans from your trace source, runs graders against each span, and records a pass or fail verdict per grader. Regressions appear as a grader that was passing starting to fail after a deploy. Because the traces come from live traffic, the test surface reflects what the agent is actually doing today, and the failing grader identifies which quality dimension broke so it can be traced back to the change that caused it.

Can I use both golden datasets and production traces for LLM evals?

Yes. Golden datasets are useful for calibration and for testing specific regression cases you want to ensure never reappear. Production traces are better for ongoing coverage. Teams that use both typically maintain a small golden set for known regression checks and run graders continuously against sampled production spans for coverage of real traffic.

What is dataset drift in LLM evaluation?

Dataset drift is the growing gap between what a golden dataset contains and what the production system actually does. As the product evolves, the inputs, outputs, and failure modes in the dataset no longer represent real traffic. Graders calibrated to the old dataset continue to pass while the production system behaves differently.

Production Traces vs Golden Dataset LLM Evals

What is the difference between production traces and golden datasets for LLM evals?

A golden dataset is a curated, static collection of inputs and expected outputs used to test model behavior. Production traces are spans captured from live traffic. Golden datasets go stale as the product changes; production traces are always current because they come from the system as it runs today.

Anthropic’s engineering postmortem describes a routing bug that ran from early August into September 2025 and misrouted Sonnet 4 requests: 0.8% of them on August 5, rising to 16% in the worst-impacted hour on August 31. “The evaluations we ran simply didn’t capture the degradation users were reporting,” the report states. One structural cause: privacy controls blocked engineers from examining production interactions, so evaluations ran without access to real traffic.

The real difference between production traces and golden-dataset LLM evals is when the evaluation data was collected. A golden dataset captures the product at the time the dataset was built. Traces capture it as it runs today.

Why Golden Datasets Lose Coverage Over Time

A golden dataset captures the agent as it was when the dataset was curated, not as it operates after months of prompt edits, model version bumps, and code changes. Graders calibrated against that snapshot continue reporting pass against inputs that no longer represent the live product.

Hamel Husain’s eval engineering FAQ names the outcome: “Teams that skip the practice of regularly sampling production queries and adding interesting ones to their dataset will wake up three months later with a stale eval suite.” The issue is not that the dataset becomes wrong about a single item but that coverage drifts away from what the agent actually does.

Research on benchmark staleness puts numbers on the mechanism. A study published in October 2025 (arXiv:2510.07238) found that between 24% and 64% of time-sensitive evaluation samples across major benchmarks were factually outdated. In that setting, graders penalize models for giving the currently correct answer because the dataset was built when a different answer was correct.

The gap compounds as time passes. A team running weekly evals on a frozen golden dataset can build genuine confidence in a score that stopped tracking production quality months earlier.

What Production Traces Surface That Golden Datasets Don’t

Production traces surface edge cases that did not exist when the golden dataset was built. Real traffic includes inputs users actually send, behavioral shifts from model version updates, and regressions from code changes outside the prompt layer. No pre-collected dataset can contain failure modes that arrive after the dataset was frozen.

Digital Applied’s agent observability guide describes what happens without production trace integration: “your eval dataset slowly drifts away from what users actually do, and your green CI stops meaning anything.” It documents a case in which a system showed essentially perfect metrics for three months while the judge’s agreement with domain-expert review sat at a Cohen’s kappa of 0.31, well below the 0.6 threshold for usable evaluation. The team saw green while its graders had stopped tracking the quality that domain experts could verify.
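To make the kappa figure concrete, here is a minimal sketch of how Cohen's kappa could be computed for an LLM judge against expert labels. The labels below are made up for illustration and are not data from the guide. The point is that raw agreement can look tolerable while kappa, which discounts agreement expected by chance, stays low.

```python
from collections import Counter


def cohens_kappa(judge: list[str], expert: list[str]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(j == e for j, e in zip(judge, expert)) / n
    # Chance agreement: computed from each rater's own label frequencies.
    judge_counts, expert_counts = Counter(judge), Counter(expert)
    labels = judge_counts.keys() | expert_counts.keys()
    p_e = sum(judge_counts[l] * expert_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts: the judge passes 9 of 10 spans, the expert only 6.
judge_labels = ["pass"] * 9 + ["fail"]
expert_labels = ["pass"] * 6 + ["fail"] * 4

raw_agreement = sum(j == e for j, e in zip(judge_labels, expert_labels)) / 10
kappa = cohens_kappa(judge_labels, expert_labels)
print(f"raw agreement {raw_agreement:.2f}, kappa {kappa:.2f}")
# raw agreement 0.70, kappa 0.29
```

A judge that says "pass" almost every time agrees with the expert on most passing items by default, which is why the 70% raw agreement here shrinks to a kappa of about 0.29.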

Imperfect graders are unavoidable. What keeps detection reliable is reading the trend across grader verdicts over time rather than trusting any single score: a grader that captures intent imperfectly misfires at a roughly stable rate, so a sustained move in its trend tracks a real change in the product.
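One way to read a trend rather than a single score is to compare a grader's pass rate in the most recent window of verdicts against the window before it. The sketch below assumes verdicts arrive as an ordered list of booleans; the window size and tolerance are illustrative values, not recommendations.

```python
def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of passing verdicts in a non-empty list."""
    return sum(verdicts) / len(verdicts)


def sustained_drop(verdicts: list[bool], window: int = 50,
                   tolerance: float = 0.10) -> bool:
    """True if the latest window's pass rate fell more than `tolerance`
    below the window immediately before it."""
    if window < 1 or len(verdicts) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = pass_rate(verdicts[-2 * window:-window])
    recent = pass_rate(verdicts[-window:])
    return baseline - recent > tolerance


# An imperfect grader that misfires on about 1 span in 10, before and after.
stable = [i % 10 != 0 for i in range(100)]
# The same grader, but the last 50 verdicts fail about 3 times in 10.
regressed = stable[:50] + [i % 10 >= 3 for i in range(50)]

print(sustained_drop(stable))     # False
print(sustained_drop(regressed))  # True
```

The stable grader is wrong 10% of the time in both windows, so nothing is flagged. Only the shift in its failure rate counts as a signal.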

The structural reason traces catch more is that graders running on production spans see what the full system delivered to the model at call time. A grader checking for a specific field in a tool response fails if a library update changed that field name, even when no prompt changed. Traces from the deploy that introduced that library update contain inputs reflecting the new schema. A frozen dataset built before the update contains no entry for this failure mode.
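The field-name case can be sketched in a few lines. The span shape and the `order_id` / `orderId` names below are hypothetical, chosen only to show a grader that fails after a dependency renames a field while the prompt stays untouched.

```python
def grade_has_field(span: dict, field: str) -> bool:
    """Pass if the tool response recorded on the span contains `field`."""
    response = span.get("tool_response") or {}
    return field in response


# Hypothetical spans captured before and after a library update that
# renamed `order_id` to `orderId` in the tool's response.
span_before_deploy = {"tool_response": {"order_id": "A-17", "status": "shipped"}}
span_after_deploy = {"tool_response": {"orderId": "A-17", "status": "shipped"}}

print(grade_has_field(span_before_deploy, "order_id"))  # True
print(grade_has_field(span_after_deploy, "order_id"))   # False
```

A dataset frozen before the update would only ever contain spans shaped like the first one, so this grader would keep passing against it.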

How Production Trace Graders Work Compared to Golden Datasets

A trace-grounded eval pipeline samples production spans from a connected trace source, runs graders against each span, and records a pass or fail verdict per grader. A regression surfaces when a grader that was passing starts failing after a deploy. That flip names which dimension of quality broke, and it is where most tools stop. The next step is tracing that dimension back through the agent’s chain to the change that caused it: a prompt edit, a dependency bump, a tool returning different data, or a provider model update that never appeared in your repo.
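The sample, grade, and compare loop can be sketched as follows. This is an illustration under stated assumptions, not any vendor's API: spans are plain dicts with an `output` key, graders are functions returning a boolean, and a grader counts as "passing" for a deploy when its pass rate on the sample meets an assumed 0.9 bar.

```python
import random
from typing import Callable

Span = dict                      # hypothetical span shape: {"output": str, ...}
Grader = Callable[[Span], bool]  # True means the span passes this grader


def grade_sample(spans: list[Span], graders: dict[str, Grader],
                 sample_size: int, seed: int = 0) -> dict[str, float]:
    """Sample spans, run every grader on each one, return pass rate per grader."""
    if not spans or sample_size < 1:
        raise ValueError("need at least one span and a positive sample size")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    sample = rng.sample(spans, min(sample_size, len(spans)))
    return {
        name: sum(grader(span) for span in sample) / len(sample)
        for name, grader in graders.items()
    }


def flipped_graders(before: dict[str, float], after: dict[str, float],
                    passing: float = 0.9) -> list[str]:
    """Graders that met the passing bar before the deploy and miss it after."""
    return [
        name for name, rate in before.items()
        if rate >= passing and after.get(name, 0.0) < passing
    ]


graders = {
    "non_empty_answer": lambda span: bool(span["output"].strip()),
    "cites_source": lambda span: "source:" in span["output"],
}

# Made-up spans standing in for traffic sampled before and after a deploy.
spans_before = [{"output": f"Answer {i}. source: doc-{i}"} for i in range(20)]
spans_after = [{"output": f"Answer {i}."} for i in range(20)]

rates_before = grade_sample(spans_before, graders, sample_size=10)
rates_after = grade_sample(spans_after, graders, sample_size=10)
print(flipped_graders(rates_before, rates_after))  # ['cites_source']
```

Keeping one named grader per quality dimension is what lets the flip point at a dimension (here, citation) instead of a single aggregate score.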

This is the structural difference from evaluating against a golden dataset. Production traces carry a built-in freshness guarantee: each run pulls spans from the system as it exists after the latest deploy. A golden dataset requires a deliberate update cycle to maintain current coverage, and that cycle typically falls behind the rate of product change.

A golden dataset you already maintain is not wasted in this model. It gets folded in as calibration and as known-regression checks, part of the stored record of what the agent is supposed to do, while sampled traces carry the ongoing coverage. That record lives outside the agent’s code and sharpens as bugs are fixed and edge cases are added.

Tessary synthesizes an eval pipeline from your codebase and runs graders against production traces from Braintrust or Langfuse, which is how teams start evals without a labeled dataset. And when a grader starts failing, it tells you which change caused it. Braintrust and Langfuse hold the traces and track the scores; Tessary answers why a score moved.

Point Tessary at your Braintrust or Langfuse trace source and the first graders run on real production spans before you pay anything.
