llm evals · production traces · golden dataset · regression detection · eval pipeline

Production Traces vs Golden Dataset LLM Evals

By Akhil Varma

Short answer

Running LLM graders against a golden dataset means grading the product as it was when the dataset was built. Production traces eliminate dataset drift because the evaluation data is always current: every span is from real traffic, edge cases arrive automatically, and coverage evolves as the product does.

Anthropic’s engineering postmortem on incidents that began in August 2025 describes a routing bug that sent some Sonnet 4 requests to the wrong servers for weeks, affecting 16% of Sonnet 4 requests at the worst-impacted hour. “The evaluations we ran simply didn’t capture the degradation users were reporting,” the report states. One structural cause: privacy controls limited engineers’ access to production interactions, so evaluations ran with little visibility into real traffic.

The difference between production-trace evals and golden-dataset evals is when the evaluation data was collected. A golden dataset captures the product at the time the dataset was built. Traces capture it as it runs today.

Why Golden Datasets Lose Coverage Over Time

A golden dataset captures the agent as it was when the dataset was curated, not as it operates after months of prompt edits, model version bumps, and code changes. Graders calibrated against that snapshot continue to report passes on inputs that no longer represent the live product.

Hamel Husain’s eval engineering FAQ names the outcome: “Teams that skip the practice of regularly sampling production queries and adding interesting ones to their dataset will wake up three months later with a stale eval suite.” The issue is not that the dataset becomes wrong about a single item but that coverage drifts away from what the agent actually does.

Research on benchmark staleness puts numbers on the mechanism. A study published in October 2025 (arXiv:2510.07238) found that between 24% and 64% of time-sensitive evaluation samples across major benchmarks were factually outdated. In that setting, graders penalize models for giving the currently correct answer because the dataset was built when a different answer was correct.

The gap compounds as time passes. A team running weekly evals on a frozen golden dataset can build genuine confidence in a score that stopped tracking production quality months earlier.
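One rough way to put a number on coverage drift is to compare what the golden dataset contains with what recent traffic contains. The sketch below assumes every golden item and every sampled production request has already been tagged with an intent label; the labels and the `uncovered_traffic_share` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter


def uncovered_traffic_share(golden_intents, production_intents):
    """Fraction of production requests whose intent never appears in the golden set."""
    if not production_intents:
        return 0.0
    covered = set(golden_intents)
    counts = Counter(production_intents)
    uncovered = sum(n for intent, n in counts.items() if intent not in covered)
    return uncovered / len(production_intents)


# Hypothetical intent labels: the golden set was frozen before two features shipped.
golden = ["refund", "order_status", "refund", "cancel"]
production = [
    "refund", "order_status", "invoice_export", "invoice_export",
    "cancel", "bulk_update", "refund", "invoice_export",
]

share = uncovered_traffic_share(golden, production)
print(f"{share:.0%} of sampled production traffic has no golden-set counterpart")
```

A rising share over successive weeks suggests the frozen dataset is covering less and less of what users actually send, even if every grader still passes.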

What Production Traces Surface That Golden Datasets Don’t

Production traces surface edge cases that did not exist when the golden dataset was built. Real traffic includes inputs users actually send, behavioral shifts from model version updates, and regressions from code changes outside the prompt layer. No pre-collected dataset can contain failure modes that arrive after the dataset was frozen.

Digital Applied’s agent observability guide for 2026 describes what happens without production trace integration: “your eval dataset slowly drifts away from what users actually do, and your green CI stops meaning anything.” They document a real case in which a system showed essentially perfect metrics for three months while the judge’s actual agreement with domain-expert review sat at a Cohen’s kappa of 0.31, well below the 0.6 threshold for usable evaluation. The team was seeing green while their graders had stopped tracking quality that domain experts could verify.
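Cohen's kappa measures agreement between two raters after correcting for the agreement expected by chance, which is why it can be low even when raw agreement looks acceptable. The sketch below computes it from scratch for an LLM judge and a domain expert; the verdict lists are made-up illustration data, not figures from the case above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on ten sampled spans: the judge passes almost everything.
judge = ["pass"] * 9 + ["fail"]
expert = ["pass"] * 6 + ["fail"] * 4

kappa = cohens_kappa(judge, expert)
print(f"raw agreement: 70%, kappa: {kappa:.2f}")  # kappa: 0.29
```

Here the judge and the expert agree on 7 of 10 spans, yet kappa is about 0.29 because a judge that passes nearly everything agrees with anyone fairly often by chance.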

The structural reason traces catch more is that graders running on production spans see what the full system delivered to the model at call time. A grader checking for a specific field in a tool response fails if a library update changed that field name, even when no prompt changed. Traces from the deploy that introduced that library update contain inputs reflecting the new schema. A frozen dataset built before the update contains no entry for this failure mode.
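The field-name scenario can be sketched with a single grader. The span layout, the `order_id` field, and the deploy labels are all assumptions made for illustration; real spans from a tracing tool will have their own structure.

```python
def grade_tool_response_has_field(span, field="order_id"):
    """Pass only if the tool response recorded on the span carries the expected field."""
    tool_response = span.get("tool_response", {})
    return "pass" if field in tool_response else "fail"


# Hypothetical spans: a library update in deploy v42 renamed order_id to orderId.
span_before_update = {
    "deploy": "v41",
    "tool_response": {"order_id": "A-1001", "status": "shipped"},
}
span_after_update = {
    "deploy": "v42",
    "tool_response": {"orderId": "A-1002", "status": "shipped"},
}

print(grade_tool_response_has_field(span_before_update))  # pass
print(grade_tool_response_has_field(span_after_update))   # fail
```

No prompt changed between the two spans. The grader fails on the second one only because it ran against what the updated system actually produced, which a dataset frozen at v41 would never contain.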

How Production Trace Graders Work Compared to Golden Datasets

A trace-grounded eval pipeline samples production spans from a connected trace source, runs graders against each span, and records a pass or fail verdict per grader. A regression surfaces when a grader that was passing starts failing after a deploy. The test surface reflects what the agent is doing after each change because the spans come from that change’s production traffic.

This is the structural difference from evaluating against a golden dataset. Production traces carry a built-in freshness guarantee: each run pulls spans from the system as it exists after the latest deploy. A golden dataset requires a deliberate update cycle to maintain current coverage, and that cycle typically falls behind the rate of product change.
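The pipeline described above can be sketched in a few functions: sample spans, run every grader on every span, then compare pass rates across deploys. This is a minimal illustration, not Tessary's implementation. The spans are an in-memory list standing in for a trace source, and the span fields, grader names, and the `min_drop` threshold are assumptions.

```python
import random
from collections import defaultdict


def sample_spans(spans, k, seed=0):
    """Take up to k spans at random; a seeded generator keeps runs repeatable."""
    rng = random.Random(seed)
    return list(spans) if len(spans) <= k else rng.sample(spans, k)


def run_graders(spans, graders):
    """Record one pass/fail verdict per (span, grader) pair."""
    return [
        {"deploy": span["deploy"], "grader": name, "passed": bool(grader(span))}
        for span in spans
        for name, grader in graders.items()
    ]


def pass_rates(verdicts):
    """Aggregate verdicts into a pass rate per (deploy, grader)."""
    totals = defaultdict(lambda: [0, 0])  # (deploy, grader) -> [passed, total]
    for v in verdicts:
        cell = totals[(v["deploy"], v["grader"])]
        cell[0] += v["passed"]
        cell[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


def find_regressions(rates, before, after, min_drop=0.2):
    """Graders whose pass rate fell by at least min_drop between two deploys."""
    regressed = []
    for (deploy, grader), rate in rates.items():
        baseline = rates.get((before, grader))
        if deploy == after and baseline is not None and baseline - rate >= min_drop:
            regressed.append(grader)
    return sorted(regressed)


# Hypothetical graders and spans; deploy v42 stopped citing sources.
graders = {
    "non_empty_answer": lambda s: bool(s["output"].strip()),
    "cites_source": lambda s: "[source]" in s["output"],
}
spans = (
    [{"deploy": "v41", "output": "Refund issued. [source]"} for _ in range(5)]
    + [{"deploy": "v42", "output": "Refund issued."} for _ in range(5)]
)

verdicts = run_graders(sample_spans(spans, k=50), graders)
regressions = find_regressions(pass_rates(verdicts), before="v41", after="v42")
print(regressions)  # ['cites_source']
```

Comparing pass rates rather than single verdicts is a design choice in this sketch: one failing span is noise, while a grader that drops from passing to failing across a deploy's sampled traffic is the regression signal.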

Tessary synthesizes an eval pipeline from your codebase and runs graders against production traces from Braintrust or Langfuse. Regression detection runs against real traffic spans, so the test surface reflects the product as it runs after each change, not as it was when the eval suite was written.

Point Tessary at your Braintrust or Langfuse trace source and the first graders run on real production spans before you pay anything.


Frequently asked questions

What is the difference between production traces and golden datasets for LLM evals?
A golden dataset is a curated, static collection of inputs and expected outputs used to test model behavior. Production traces are spans captured from live traffic. Golden datasets go stale as the product changes; production traces are always current because they come from the system as it runs today.

Why do golden datasets go stale for LLM evaluation?
Golden datasets go stale because they capture the product as it was when the dataset was built. As prompts change, models get updated, and code evolves, the dataset no longer reflects real production behavior. Graders running against a stale dataset can pass while the live product has already drifted.

What do production traces catch that golden datasets miss?
Production traces surface edge cases that were not anticipated when the golden dataset was built. Real traffic includes unusual user inputs, provider drift, and behavior changes from code updates that pre-collected datasets cannot contain. Graders running on traces also catch regressions from non-prompt changes because they operate on spans the production system actually generated.

How does trace-grounded evaluation work in practice?
A trace-grounded eval pipeline samples production spans from your trace source, runs graders against each span, and records a pass or fail verdict per grader. A regression appears when a grader that was passing starts to fail after a deploy. Because the traces come from live traffic, the test surface reflects what the agent is actually doing today.

Can I use both golden datasets and production traces for LLM evals?
Yes. Golden datasets are useful for calibration and for testing specific regression cases you want to ensure never reappear. Production traces are better for ongoing coverage. Teams that use both typically maintain a small golden set for known regression checks and run graders continuously against sampled production spans for coverage of real traffic.

What is dataset drift in LLM evaluation?
Dataset drift is the growing gap between what a golden dataset contains and what the production system actually does. As the product evolves, the inputs, outputs, and failure modes in the dataset no longer represent real traffic. Graders calibrated to the old dataset continue to pass while the production system behaves differently.

Written by Akhil Varma · Founder, Tessary

Akhil builds Tessary — AI personas that run real-browser usability tests on B2B SaaS products. Previously shipped product at multiple early-stage startups; writes about usability testing, AI personas, and the economics of B2B research.