Agent evals SDK

Graders that live in your agent, not a dashboard

Evals belong in the codebase, so they evolve with the product instead of going stale in a separate tool. The SDK captures traces from your running agent, generates eval cases from real traffic, and runs graders on every deploy or PR, across frameworks and model providers.

Captures
Traces from your agent automatically, in your own environment.
Generates
Eval cases from real production traffic, not hand-written fixtures.
Runs
On every deploy or PR, so changes are checked before they ship.

The Problem

Agentic workflows break in places single-output checks miss.

An agent is a chain of model calls across some framework, talking to some provider. Hand-written fixtures drift, and single-output assertions miss the failures that only appear between steps. Eval cases need to come from the real traffic the agent handles.

Multi-call

Agentic workflows chain several model calls, and some failures appear only in the handoff between them, where single-output checks cannot see them.

Hand-written

Fixtures and one-off scripts drift from how the agent actually behaves and rarely cover the inputs that break it.

Many stacks

Agents are built on different frameworks and providers, so eval tooling that assumes one stack does not fit.

How It Works

Instrument, generate, gate, and keep it current

The SDK turns real traffic into a working eval pipeline and wires it into the workflow you already use to ship.

  1. Instrument lightly

    Add the SDK to the agent. It captures traces from real traffic automatically and needs only minimal instrumentation, because a long integration is what kills adoption.

  2. Generate graders from traces

    Tessary uses the captured traffic to identify call sites and failure modes and to synthesize graders, including checks that span multiple calls in a chain rather than a single output.

  3. Gate every change

    Run the graders on every deploy or PR. A change that shifts behavior fails a grader before it reaches production, instead of after.

  4. Keep it current

    As the agent evolves, the SDK keeps capturing traffic and the pipeline re-synthesizes, so coverage tracks the agent rather than the version you instrumented on day one.
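To make the first step concrete, here is a minimal sketch of what lightweight instrumentation can look like: a decorator that records each call in an agent run as a span. It is an illustration only, not the SDK's API. The names `Span`, `Trace`, and `traced` are hypothetical, and the two agent steps are stand-ins for real model calls.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Span:
    """One recorded call: which step ran, what went in, what came out."""
    name: str
    inputs: dict[str, Any]
    output: Any
    duration_s: float


@dataclass
class Trace:
    """The ordered spans of a single agent run."""
    spans: list[Span] = field(default_factory=list)


def traced(trace: Trace, name: str) -> Callable:
    """Hypothetical decorator: append a Span to `trace` on every call."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:
            start = time.perf_counter()
            output = fn(**kwargs)
            elapsed = time.perf_counter() - start
            trace.spans.append(Span(name, dict(kwargs), output, elapsed))
            return output
        return wrapper
    return decorator


trace = Trace()


@traced(trace, "plan")
def plan(question: str) -> str:
    # Stand-in for a model call that drafts a plan.
    return f"look up: {question}"


@traced(trace, "answer")
def answer(plan_text: str) -> str:
    # Stand-in for a model call that consumes the plan.
    return plan_text.upper()


result = answer(plan_text=plan(question="refund policy"))
```

After one run, `trace.spans` holds the `plan` span followed by the `answer` span, each with its inputs and output. That ordered record is the raw material a grader needs to check the handoff between steps.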

What You Get

Eval infrastructure that fits an agent codebase

Lightweight instrumentation

Enough to capture quality traces without a heavy integration project. Time-to-value is the constraint that matters.

Trace-driven eval cases

Eval cases are generated from real production traffic, so they exercise the inputs your agent actually receives.
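As a rough sketch of the idea, recorded traffic can be reduced to a set of eval cases by collapsing near-duplicate inputs. The record shape, the `EvalCase` type, and `cases_from_traffic` are assumptions made for illustration; the actual generation step is not described here and is likely more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """An input the agent really received, plus what it answered."""
    input_text: str
    observed_output: str


def cases_from_traffic(records: list[dict]) -> list[EvalCase]:
    """Keep the first record for each distinct input.

    Inputs are compared after lowercasing and collapsing whitespace,
    so trivially different phrasings map to one case.
    """
    seen: set[str] = set()
    cases: list[EvalCase] = []
    for record in records:
        key = " ".join(record["input"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        cases.append(EvalCase(record["input"], record["output"]))
    return cases


# Hypothetical captured traffic.
traffic = [
    {"input": "Where is my order?", "output": "It ships tomorrow."},
    {"input": "where is  my ORDER?", "output": "It ships tomorrow."},
    {"input": "Cancel my subscription", "output": "Done."},
]
cases = cases_from_traffic(traffic)
```

Here the three records reduce to two cases, because the first two inputs differ only in case and spacing.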

Multi-call chain coverage

Graders can span a chain of calls, catching failures that only show up in the handoff between steps.
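A small example of a handoff failure, under assumed data shapes: an answer step cites a document that the retrieval step never returned. The answer can look well formed on its own, so only a grader that reads the whole chain catches it. The step names, the dictionary layout, and `grade_citation_handoff` are hypothetical.

```python
def grade_citation_handoff(chain: list[dict]) -> tuple[bool, str]:
    """Pass only if every cited document was retrieved earlier in the chain."""
    retrieved: set[str] = set()
    for step in chain:
        if step["name"] == "retrieve":
            retrieved.update(step["output"])
        elif step["name"] == "answer":
            missing = [
                doc_id
                for doc_id in step["output"]["citations"]
                if doc_id not in retrieved
            ]
            if missing:
                return False, f"answer cites documents never retrieved: {missing}"
    return True, "all citations trace back to a retrieve step"


good_chain = [
    {"name": "retrieve", "output": ["doc-1", "doc-2"]},
    {"name": "answer", "output": {"text": "See the policy.", "citations": ["doc-2"]}},
]

bad_chain = [
    {"name": "retrieve", "output": ["doc-1", "doc-2"]},
    {"name": "answer", "output": {"text": "See the policy.", "citations": ["doc-9"]}},
]

good_result = grade_citation_handoff(good_chain)
bad_result = grade_citation_handoff(bad_chain)
```

The two answer outputs have the same shape, so a check on the final output alone could not tell them apart. The chain grader fails the second because `doc-9` never appeared in a retrieval step.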

Deploy and PR gates

Run graders in your existing workflow so regressions are caught at the change, not in production.
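One common way to wire a gate into a deploy or PR workflow is a script whose exit code is nonzero when any grader fails, which most CI systems treat as a failed check. The sketch below assumes a simple case and grader shape. `run_gate` and the `non_empty` grader are hypothetical, not the SDK's interface.

```python
from typing import Callable

# A grader takes one eval case and reports pass (True) or fail (False).
Grader = Callable[[dict], bool]


def run_gate(cases: list[dict], graders: dict[str, Grader]) -> int:
    """Run every grader on every case and return a process exit code."""
    failures: list[tuple[str, str]] = []
    for case in cases:
        for name, grader in graders.items():
            if not grader(case):
                failures.append((name, case["id"]))
    for name, case_id in failures:
        print(f"FAIL {name} on case {case_id}")
    return 1 if failures else 0


# Hypothetical cases and a single trivial grader.
cases = [
    {"id": "c1", "output": "Your refund was issued."},
    {"id": "c2", "output": ""},
]
graders = {"non_empty": lambda case: bool(case["output"].strip())}

# In a CI script this value would be passed to sys.exit().
exit_code = run_gate(cases, graders)
```

With these inputs the gate reports one failure, for case `c2`, and returns 1. Passing that value to `sys.exit()` is what blocks the change in a typical pipeline.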

Framework and provider agnostic

Built to work across agent frameworks and multiple model providers, not a single stack.

Transparent and portable

Every grader is readable, runs on your own keys, and stays with you if you leave.

FAQ

Questions about the agent evals SDK

What does the agent evals SDK do?
It instruments your agent to capture production traces automatically, then feeds that traffic into Tessary so the eval pipeline is generated and calibrated from real usage. From there, graders can run on every deploy or PR.

How much instrumentation does it need?
As little as possible. Long SDK integrations kill adoption, so the design target is minimal instrumentation that still captures traces of high enough quality to synthesize and calibrate graders.

Can graders cover multi-call chains?
Yes. Agentic workflows chain several model calls, and some failures only appear in the handoff between them. Graders can span a detected chain, not just a single call’s output.

Which frameworks and model providers does it work with?
The SDK is built to work across frameworks and multiple model providers rather than assuming one stack, so it fits agents built on different tooling.

Can graders run on every deploy or pull request?
Yes. The graders are meant to run in your existing workflow, including on every deploy or pull request, so a change that breaks behavior is caught before it ships.

Get Started

Put the evals where the agent lives.

Instrument your agent, generate graders from real traffic, and gate every change. Coverage that evolves with the codebase instead of decaying in a dashboard.

Start with your repo

Bring your own provider keys. The graders are yours to keep.