Graders that live in your agent, not a dashboard
Evals belong in the codebase, where they evolve with the product, rather than in a separate tool that goes stale. The SDK captures traces from your running agent, generates eval cases from real traffic, and runs graders on every deploy or PR, across frameworks and model providers.
- Captures traces from your agent automatically, in your own environment.
- Generates eval cases from real production traffic, not hand-written fixtures.
- Runs on every deploy or PR, so changes are checked before they ship.
Agentic workflows break in places single-output checks miss.
An agent is a chain of model calls across some framework, talking to some provider. Hand-written fixtures drift, and single-output assertions miss the failures that only appear between steps. Eval cases need to come from the real traffic the agent handles.
- Multi-call agentic workflows chain several model calls, and some failures appear only in the handoff between them, where single-output checks miss them.
- Hand-written fixtures and one-off scripts drift from how the agent actually behaves and rarely cover the inputs that break it.
- Many stacks: agents are built on different frameworks and providers, so eval tooling that assumes one stack does not fit.
Instrument, generate, gate, and keep it current
The SDK turns real traffic into a working eval pipeline and wires it into the workflow you already use to ship.
Instrument lightly
Add the SDK to the agent. It captures traces from real traffic automatically, with minimal instrumentation, because a long integration is what kills adoption.
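To make "minimal instrumentation" concrete, here is a rough sketch of trace capture using a plain Python decorator. It is not the SDK's API: the names `traced`, `TRACES`, `plan`, `answer`, and `run_agent` are all hypothetical, and the two "model calls" are stand-in functions. It shows only the general idea of recording each step's inputs, output, and latency as the agent runs.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical in-memory trace store, for illustration only.
TRACES: list[dict[str, Any]] = []


def traced(step: str) -> Callable:
    """Record the inputs, output, and latency of one step in the agent."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACES.append({
                "step": step,
                "args": args,
                "kwargs": kwargs,
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("plan")
def plan(question: str) -> str:
    # Stand-in for a first model call.
    return f"search: {question}"


@traced("answer")
def answer(plan_text: str) -> str:
    # Stand-in for a second model call that consumes the first one's output.
    return plan_text.removeprefix("search: ").upper()


def run_agent(question: str) -> str:
    return answer(plan(question))
```

Calling `run_agent("refund policy")` appends two entries to `TRACES`, one per step, in call order. The agent code changes by one decorator per call site, which is the kind of footprint a light integration aims for.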
Generate graders from traces
Tessary uses the captured traffic to map call sites, identify failure modes, and synthesize graders, including checks that span multiple calls in a chain rather than a single output.
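As an illustration of a check that spans multiple calls, the sketch below grades the handoff between a retrieval step and the final answer. It is a hand-written example under assumed names, not a grader produced by Tessary: `Step`, `GradeResult`, `grade_handoff`, the `retrieve` step name, and the `ORD-123` ID format are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    name: str    # hypothetical step label, e.g. "retrieve" or "answer"
    output: str  # text the step produced


@dataclass
class GradeResult:
    passed: bool
    reason: str


# Assumed ID format, used only for this example.
ORDER_ID = re.compile(r"\bORD-\d+\b")


def grade_handoff(trace: list[Step]) -> GradeResult:
    """Fail when the final answer cites an order ID that no retrieval step returned."""
    if not trace:
        return GradeResult(False, "empty trace")
    retrieved: set[str] = set()
    for step in trace:
        if step.name == "retrieve":
            retrieved.update(ORDER_ID.findall(step.output))
    cited = set(ORDER_ID.findall(trace[-1].output))
    invented = sorted(cited - retrieved)
    if invented:
        return GradeResult(False, f"answer cites IDs not retrieved: {invented}")
    return GradeResult(True, "all cited IDs were retrieved")
```

A single-output assertion on the final answer would accept "ORD-102 has shipped." as well-formed. This grader fails it when retrieval only returned `ORD-101`, because it compares one step's output against another's.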
Gate every change
Run the graders on every deploy or PR. A change that shifts behavior fails a grader before it reaches production, instead of after.
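One way a gate like this could be wired up is a function that runs every grader over every trace and returns a process exit code. This is a minimal sketch under assumptions: `gate`, `Grader`, and `non_empty_answer` are hypothetical names, and the trace shape (a list of dicts with an `output` key) is invented for the example.

```python
import sys
from typing import Callable

# Hypothetical grader signature: takes one trace, returns (passed, reason).
Grader = Callable[[list[dict]], tuple[bool, str]]


def gate(traces: list[list[dict]], graders: dict[str, Grader]) -> int:
    """Return 0 if every grader passes on every trace, otherwise 1."""
    failures = []
    for index, trace in enumerate(traces):
        for name, grader in graders.items():
            passed, reason = grader(trace)
            if not passed:
                failures.append(f"trace {index}: {name}: {reason}")
    # Report every failure, not just the first, so one run shows the full picture.
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0


def non_empty_answer(trace: list[dict]) -> tuple[bool, str]:
    # Toy grader for illustration: the last step must produce some text.
    last = trace[-1]["output"] if trace else ""
    if last.strip():
        return True, "ok"
    return False, "final output is empty"
```

In a CI job or PR check, the script would end with `sys.exit(gate(traces, graders))`, so a failing grader fails the build and blocks the change from shipping.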
Keep it current
As the agent evolves, the SDK keeps capturing traffic and the pipeline re-synthesizes, so coverage tracks the agent as it is today rather than the version you instrumented on day one.
Eval infrastructure that fits an agent codebase
Lightweight instrumentation
Enough instrumentation to capture useful traces without a heavy integration project. Time-to-value is the constraint that matters.
Trace-driven eval cases
Eval cases are generated from real production traffic, so they exercise the inputs your agent actually receives.
Multi-call chain coverage
Graders can span a chain of calls, catching failures that only show up in the handoff between steps.
Deploy and PR gates
Run graders in your existing workflow so regressions are caught at the change, not in production.
Framework and provider agnostic
Built to work across agent frameworks and multiple model providers, not a single stack.
Transparent and portable
Every grader is readable, runs on your own keys, and stays with you if you leave.
Related pages
Evals for agentic products
The overview: synthesized eval pipelines that evolve with your agent.
Regression detection
Catch breakage from any change, judged against a baseline.
LLM evals that evolve
Why auto-evolving graders beat static golden datasets.
Evals for the CTO
For the person who owns agent quality and gets paged when it breaks.
Put the evals where the agent lives.
Instrument your agent, generate graders from real traffic, and gate every change. Coverage that evolves with the codebase instead of decaying in a dashboard.