From Tessary
Notes on evaluating agents.
Regression detection, failure taxonomies, and keeping LLM evals current as your agent evolves.
Agent Regressions from Non-Prompt Code Changes
Regressions caused by non-prompt code changes are the class of failures that prompt-only evals miss. Here is how grader-based evals detect them before users notice.
Agent Evals Maintenance Regression: Why Static Suites Drift
Why static eval suites go stale after every code change, and how to keep coverage current as an agent evolves.