Grade the session, not each turn
Per-turn metrics measure response quality but miss whether the agent completed the task. Tessary runs session-level graders against your production traffic, so task completion failures surface before they pile up for your users.
Passing turns, failing tasks
In February 2026, Langfuse documented a specific failure: an agent completed every individual CLI tool call with acceptable quality scores but never finished the overall task. The failure was invisible in per-turn metrics and only appeared at the session level. That case is not unusual. The same pattern appears in customer support agents, coding assistants, and document-processing pipelines.
Per-turn
scores passed for every step in the failed session
Session-level
grading was required to detect that the task was never completed
Per-turn metrics answer the wrong question
Most eval setups check whether each reply is relevant, grounded, and well-formed. That question is answerable per turn but wrong for agents working across multiple turns: it measures output quality, not task completion.
- 0 of N
per-turn failures needed for a session to fail: an agent can score well on every individual turn while the overall task remains unfinished
LangChain evaluation documentation
- 3 failure modes
that per-turn scoring cannot detect: context loss across turns, tool call loops with no forward progress, and sessions that stop short of the goal state
- 1 question
session-level graders answer that per-turn metrics cannot: did the agent complete the task the session was opened for?
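As a rough illustration of the figures above, the sketch below shows a session in which every turn clears a per-turn threshold while two session-level checks still fail. It is a minimal sketch, not Tessary's grader: the `Turn` fields, the 0.7 threshold, the `max_repeats` limit, and the `reached_goal` flag are all hypothetical names chosen for the example. Context loss is left out because detecting it requires comparing meaning across turns.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    tool_call: str      # hypothetical: tool name plus arguments, serialized
    turn_score: float   # hypothetical per-turn quality score in [0, 1]


def per_turn_pass(turns, threshold=0.7):
    """True when every turn clears the per-turn quality threshold."""
    return all(t.turn_score >= threshold for t in turns)


def has_tool_loop(turns, max_repeats=3):
    """True when the same tool call appears max_repeats times in a row."""
    run = 1
    for prev, cur in zip(turns, turns[1:]):
        run = run + 1 if cur.tool_call == prev.tool_call else 1
        if run >= max_repeats:
            return True
    return False


def grade_session(turns, reached_goal):
    """Return session-level failure reasons; an empty list means pass."""
    reasons = []
    if has_tool_loop(turns):
        reasons.append("tool_call_loop")
    if not reached_goal:
        reasons.append("stopped_short_of_goal")
    return reasons


session = [
    Turn("ls ./build", 0.9),
    Turn("ls ./build", 0.9),
    Turn("ls ./build", 0.85),
    Turn("cat README.md", 0.95),
]
print(per_turn_pass(session))                      # True
print(grade_session(session, reached_goal=False))  # ['tool_call_loop', 'stopped_short_of_goal']
```

Zero per-turn failures, two session-level failures: the per-turn check has no way to see either one because neither is a property of a single response.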
Per-turn vs session-level eval
Per-turn evaluation is not wrong. It measures real qualities, but at the wrong unit: the individual response rather than the session. Session-level graders answer the question that matters for agents that work across multiple turns.
Per-turn evaluation
Scores each response in isolation
- Grades relevance, groundedness, and coherence per response
- Cannot detect context loss between turns
- Misses tool call loops that never advance the task
- No signal when the session ends without reaching the goal
Session-level evaluation
Grades whether the agent completed the task
- Assesses the full sequence: turns, tool calls, and results as a unit
- Detects context loss that persists across turns
- Flags tool call loops with no forward progress toward the goal
- Catches sessions that end before reaching the intended outcome
Four steps to session-level coverage
Tessary ingests production traffic, builds the agent-native session model, and runs outcome graders against the same data structure as your turn-level signals.
Connect your source
Pull sessions from Langfuse or Braintrust, or push directly via OTLP. No re-instrumentation required if your agent is already sending traces to an existing source.
Build the session model
Tessary assembles the agent-native model across your production traffic: sessions, turns, tool calls, and observations as first-class nodes in the same graph as your verdicts.
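One way to picture a model in which sessions, turns, tool calls, observations, and verdicts share a graph is the sketch below. It is an assumption-laden illustration, not Tessary's schema: the `Node` class, its fields, and the `walk` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str  # hypothetical kinds: session, turn, tool_call, observation, verdict
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


session = Node("s1", "session")
turn = session.add(Node("t1", "turn"))
call = turn.add(Node("c1", "tool_call"))
call.add(Node("o1", "observation"))
session.add(Node("v1", "verdict"))  # the session-level verdict lives in the same graph

kinds = [n.kind for n in walk(session)]
print(kinds)  # ['session', 'turn', 'tool_call', 'observation', 'verdict']
```

Because the verdict is just another node, a grader that walks the session sees turn-level and session-level results through the same traversal.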
Define outcome graders
Describe the task completion criterion in natural language. The signal engine translates it to a grader that runs against session records from your production traffic, not synthetic examples.
Receive session-level alerts
When graders detect task completion failures on live traffic, alerts route to Slack, Sentry, Linear, PagerDuty, or a webhook. Every failure links to the commit SHA that introduced it.
Assessing any single turn in isolation misses whether the agent actually solved the user's problem. Engineers reach for single-response metrics because they map to familiar LLM benchmarks, not because they fit agent workflows that span multiple turns.
Session-level signals on live traffic
Tessary's session graders run against production sessions, not synthetic examples. Every signal connects to the commit that changed session outcomes.
Task completion grading on production traffic
Catches failures that per-turn scoring misses, using the same sessions users are already generating. No synthetic test suite required.
Outcome-vs-output distinction
Graders check whether the session goal was reached, not whether each reply scored well individually. The distinction is what makes session-level evaluation meaningful.
Commit-SHA lineage
Every verdict links to the exact deploy that changed session outcomes. Any session failure maps directly to the commit that introduced it.
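The idea of tying session outcomes to a commit can be sketched as a comparison of completion rates across deploys. The function below is a simplified illustration under stated assumptions: each session is a hypothetical `(commit_sha, completed)` pair, deploy order is supplied by the caller, and the 0.2 drop threshold is arbitrary. It does not describe how Tessary attributes failures.

```python
def first_regressing_commit(sessions, deploy_order, min_drop=0.2):
    """Return the first commit whose completion rate fell by at least
    min_drop relative to the previous deploy, or None if there is none."""
    totals = {}
    for sha, completed in sessions:
        done, count = totals.get(sha, (0, 0))
        totals[sha] = (done + int(completed), count + 1)

    prev_rate = None
    for sha in deploy_order:
        if sha not in totals:
            continue  # no traffic recorded for this deploy
        done, count = totals[sha]
        rate = done / count
        if prev_rate is not None and prev_rate - rate >= min_drop:
            return sha
        prev_rate = rate
    return None


# Hypothetical data: (commit SHA, whether the session completed its task)
sessions = [
    ("a1b2c3", True), ("a1b2c3", True), ("a1b2c3", True), ("a1b2c3", False),
    ("d4e5f6", True), ("d4e5f6", False), ("d4e5f6", False), ("d4e5f6", False),
]
print(first_regressing_commit(sessions, ["a1b2c3", "d4e5f6"]))  # d4e5f6
```

Here completion falls from 0.75 to 0.25 between the two deploys, so the second commit is reported.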
Langfuse and Braintrust connectors
Import production sessions from your existing source without changing your instrumentation. Set up an upstream pull connector in minutes.
Agent-native data model
Sessions, turns, tool calls, and observations are first-class nodes. A verdict is a node in the same graph, so outcome graders run against the same structure as turn-level signals.
Regression detection integration
When a session failure reveals a behavioral gap, convert that session record into a grader that runs on every subsequent deploy. Session-level signals connect directly to regression detection.
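Turning a failed session into a reusable check might look like the sketch below. It assumes a hypothetical session record shaped as a dict with `task`, `tool_calls`, and `missing_tool_call` keys; the real record format and grader interface are not described here, so treat every name as illustrative.

```python
def grader_from_session(failed_session):
    """Build a grader that flags sessions repeating the recorded failure:
    same task, with the required tool call still missing."""
    task = failed_session["task"]
    required = failed_session["missing_tool_call"]

    def grader(session):
        if session["task"] != task:
            return "skip"  # a different task, so this grader does not apply
        return "pass" if required in session["tool_calls"] else "fail"

    return grader


# Hypothetical failed session: the refund was looked up but never issued.
failed = {
    "task": "refund_order",
    "tool_calls": ["lookup_order"],
    "missing_tool_call": "issue_refund",
}
check_refund = grader_from_session(failed)

print(check_refund(failed))  # fail
print(check_refund({"task": "refund_order",
                    "tool_calls": ["lookup_order", "issue_refund"]}))  # pass
print(check_refund({"task": "change_address", "tool_calls": []}))     # skip
```

Running the resulting grader on sessions from each later deploy would show whether the same gap has reappeared.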
Grade sessions, not just turns.
Wire your agent sessions into Tessary and run session-outcome graders against your production traffic. Connect at no cost. No billing setup required.