Agent Evals

Grade the session, not each turn

Per-turn metrics measure response quality but miss whether the agent completed the task. Tessary runs session-level graders against your production traffic, so task completion failures surface before they pile up for your users.

Grade your first agent session
See the evals SDK

The signal

Passing turns, failing tasks

In February 2026, Langfuse documented a specific failure: an agent completed every individual CLI tool call with acceptable quality scores but never finished the overall task. The failure was invisible in per-turn metrics and only appeared at the session level. That case is not unusual. The same pattern appears in customer support agents, coding assistants, and document-processing pipelines.

Per-turn scores passed for every step in the failed session.

Session-level grading was required to detect that the task was never completed.

The Problem

Per-turn metrics answer the wrong question

Most eval setups check whether each reply is relevant, grounded, and well-formed. That question is answerable per turn but wrong for agents working across multiple turns: it measures output quality, not task completion.

0 of N

per-turn failures needed for a session to fail: an agent can score well on every individual turn while the overall task remains unfinished

LangChain evaluation documentation

3 failure modes

that per-turn scoring cannot detect: context loss across turns, tool call loops with no forward progress, and sessions that stop short of the goal state

1 question

session-level graders answer that per-turn metrics cannot: did the agent complete the task the session was opened for?

Two approaches

Per-turn vs session-level eval

Per-turn evaluation is not wrong. It measures the right thing at the wrong unit. Session-level graders answer the question that matters for agents that work across multiple turns.

Per-turn evaluation

Scores each response in isolation

Grades relevance, groundedness, and coherence per response
Cannot detect context loss between turns
Misses tool call loops that never advance the task
No signal when the session ends without reaching the goal

Tessary

Session-level evaluation

Grades whether the agent completed the task

Assesses the full sequence: turns, tool calls, and results as a unit
Detects context loss that persists across turns
Flags tool call loops with no forward progress toward the goal
Catches sessions that end before reaching the intended outcome
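
A tool call loop with no forward progress is one failure mode that a session-level check can catch mechanically. The sketch below is illustrative only and is not Tessary's grader: the `ToolCall` shape, the `has_tool_loop` function, and the `max_repeats` threshold are all assumptions made for the example. It flags a session when the identical call (same tool, same arguments, same result) repeats back to back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical record shape, assumed for this sketch.
    name: str
    arguments: str  # serialized arguments, e.g. a JSON string
    result: str


def has_tool_loop(calls: list[ToolCall], max_repeats: int = 3) -> bool:
    """Return True if an identical call occurs max_repeats times in a row.

    Assumes max_repeats >= 2. Two calls are identical when name,
    arguments, and result all match (dataclass equality).
    """
    run = 1
    for prev, cur in zip(calls, calls[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeats:
            return True
    return False


# Each call here could score fine on its own; the loop only shows up
# when the calls are read as a sequence.
stuck = [ToolCall("run_cli", '{"cmd": "build"}', "exit 1")] * 3
advancing = [
    ToolCall("run_cli", '{"cmd": "build"}', "exit 1"),
    ToolCall("run_cli", '{"cmd": "fix"}', "exit 0"),
    ToolCall("run_cli", '{"cmd": "build"}', "exit 0"),
]
print(has_tool_loop(stuck))      # True
print(has_tool_loop(advancing))  # False
```

Exact-match repetition is a deliberately simple heuristic. A production grader would likely also need to handle near-duplicate calls and longer cycles.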

How It Works

Four steps to session-level coverage

Tessary ingests production traffic, builds the agent-native session model, and runs outcome graders against the same data structure as your turn-level signals.

1. Connect your source
Pull sessions from Langfuse or Braintrust, or push directly via OTLP. No re-instrumentation required if your agent is already sending traces to an existing source.
2. Build the session model
Tessary assembles the agent-native model across your production traffic: sessions, turns, tool calls, and observations as first-class nodes in the same graph as your verdicts.
3. Define outcome graders
Describe the task completion criterion in natural language. The signal engine translates it to a grader that runs against session records from your production traffic, not synthetic examples.
4. Receive session-level alerts
When graders detect task completion failures on live traffic, alerts route to Slack, Sentry, Linear, PagerDuty, or a webhook. Every failure links to the commit SHA that introduced it.

Why this gap exists

Assessing any single turn in isolation misses whether the agent actually solved the user's problem. Engineers reach for single-response metrics because they map to familiar LLM benchmarks, not because they fit agent workflows that span multiple turns.

LangChain evaluation documentation
LLM Evals resource

What you get

Session-level signals on live traffic

Tessary's session graders run against production sessions, not synthetic examples. Every signal connects to the commit that changed session outcomes.

Task completion grading on production traffic

Catches failures that per-turn scoring misses, using the same sessions users are already generating. No synthetic test suite required.

Outcome-vs-output distinction

Graders check whether the session goal was reached, not whether each reply scored well individually. The distinction is what makes session-level evaluation meaningful.

Commit-SHA lineage

Every verdict links to the exact deploy that changed session outcomes. Any session failure maps directly to the commit that introduced it.

Langfuse and Braintrust connectors

Import production sessions from your existing source without changing your instrumentation. Connect that source as an upstream pull connector in minutes.

Agent-native data model

Sessions, turns, tool calls, and observations are first-class nodes. A verdict is a node in the same graph, so outcome graders run against the same structure as turn-level signals.
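
To make the "same graph" idea concrete, here is a minimal sketch of such a model. It is not Tessary's schema: the `Node` and `SessionGraph` classes, the `kind` strings, and the field names are all hypothetical. The point it illustrates is that a verdict is stored exactly like any other node, so it can attach to a whole session or to a single turn.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical node shape, assumed for this sketch.
    id: str
    kind: str  # "session", "turn", "tool_call", "observation", or "verdict"
    parent_id: Optional[str] = None
    attrs: dict = field(default_factory=dict)


class SessionGraph:
    """Stores every node kind, verdicts included, in one structure."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> Node:
        self.nodes[node.id] = node
        return node

    def children(self, parent_id: str, kind: Optional[str] = None) -> list[Node]:
        """Return direct children of a node, optionally filtered by kind."""
        return [
            n for n in self.nodes.values()
            if n.parent_id == parent_id and (kind is None or n.kind == kind)
        ]


graph = SessionGraph()
graph.add(Node("s1", "session"))
graph.add(Node("t1", "turn", parent_id="s1"))
graph.add(Node("c1", "tool_call", parent_id="t1"))
graph.add(Node("o1", "observation", parent_id="c1"))
# A turn-level verdict and a session-level verdict live side by side.
graph.add(Node("v1", "verdict", parent_id="t1", attrs={"passed": True}))
graph.add(Node("v2", "verdict", parent_id="s1", attrs={"passed": False}))

print([n.id for n in graph.children("s1", kind="verdict")])  # ['v2']
```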

Regression detection integration

When a session failure reveals a behavioral gap, convert that session record into a grader that runs on every subsequent deploy. Session-level signals connect directly to regression detection.

FAQ

Questions about session-level evals

Multi-turn agent evaluation measures whether an LLM agent completed its task across an entire conversation, not just whether individual responses were high quality. It requires session-level graders that assess the full sequence of turns, tool calls, and results as a unit, rather than scoring each response in isolation.

Per-turn evaluation scores each response individually for relevance, groundedness, and coherence. Session-level evaluation scores whether the agent achieved the goal the session was opened for. An agent can score well on every turn while still failing to complete the task, which only session-level grading detects.

An outcome grader checks whether the session goal was reached, not whether each reply sounded appropriate. It inspects the full session record for indicators like task completion state, whether the agent closed the goal loop, or whether it exited without reaching the intended result. This differs from output graders, which rate individual response quality.
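
The difference between the two grader types fits in a few lines. This is a sketch under stated assumptions, not the grader Tessary generates: `SessionRecord`, `per_turn_pass`, `outcome_pass`, the 0.7 threshold, and the state names are hypothetical, and it assumes the session record exposes a final task state.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    # Hypothetical record shape, assumed for this sketch.
    turn_scores: list[float]  # one quality score per turn, in [0, 1]
    final_state: str          # task state when the session ended


def per_turn_pass(record: SessionRecord, threshold: float = 0.7) -> bool:
    """Output grading: did every individual reply score well?"""
    return all(score >= threshold for score in record.turn_scores)


def outcome_pass(record: SessionRecord, goal_state: str) -> bool:
    """Outcome grading: did the session end in the goal state?"""
    return record.final_state == goal_state


# Every turn scores well, yet the session stopped short of the goal.
record = SessionRecord(
    turn_scores=[0.9, 0.85, 0.92],
    final_state="awaiting_confirmation",
)
print(per_turn_pass(record))                  # True
print(outcome_pass(record, "refund_issued"))  # False
```

Comparing a final state string is the simplest possible criterion. A grader derived from a natural-language description would inspect more of the session record than this.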

Per-turn metrics evaluate each response in isolation. A multi-turn agent can produce coherent, well-scored replies at every step while losing context across turns, looping on a failed tool call, or simply stopping short of completing the task. Those failure modes are only visible when you evaluate the full session sequence as a unit.

Connect your existing Langfuse or Braintrust source as an upstream pull connector. Production sessions land in the agent-native model as first-class Session, Turn, Trace, and Observation nodes. Session-level graders run against those nodes directly, so you do not need to re-instrument your agent or change how you collect traces.

Get started

Grade sessions, not just turns.

Wire your agent sessions into Tessary and run session-outcome graders against your production traffic. Connect at no cost. No billing setup required.

Grade your first agent session
No re-instrumentation required if your agent is already on Langfuse or Braintrust.