Agent Evals CI Pipeline: What to Run on Every PR

By Akhil Varma

Short answer

Wire agent evals into CI/CD using a two-tier structure: run a fast 10-20 case subset per affected route on every PR, with a 5-minute target and a hard 8-minute ceiling, and defer the full suite to nightly batches. Set absolute score floors before enabling the gate, and scope graders to only the routes a PR actually touches.

A November 2025 industry survey of 1,340 AI engineering teams found that 89% already instrument observability for their production agents, but only 52.4% run offline evals. The gap between those two numbers is where the agent evals CI pipeline belongs: teams have traces but no quality gate on the deploy path.

How to Structure an Agent Evals CI Pipeline

The core structure is two-tier. Every PR triggers a fast subset; the full eval suite runs on a schedule.

For the PR gate, run a golden subset of 10-20 cases per affected route. Compose it as roughly 60% happy path, 20% edge cases, and 20% split between historical failures and refusal checks. Run only the routes the PR actually touches: a change to the retrieval layer does not need to re-run graders for the summarization flow. DigitalApplied’s 2026 eval methodology guide describes the standard pattern: the agent executes against a fixed dataset, the grader scores outputs, and the build passes if the aggregate meets the threshold.
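The composition and pass rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the category labels, the `build_golden_subset` and `gate_passes` names, the even split of the last 20% between historical failures and refusals, and the default 0.85 threshold are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical category labels. The 60/20/20 mix follows the text above;
# splitting the last 20% evenly between failures and refusals is an assumption.
MIX = {
    "happy_path": 0.60,
    "edge_case": 0.20,
    "historical_failure": 0.10,
    "refusal": 0.10,
}

def build_golden_subset(cases, size=20, seed=0):
    """Pick up to `size` cases for one route, honoring the category mix."""
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    rng = random.Random(seed)  # fixed seed keeps the subset stable across PRs
    subset = []
    for category, share in MIX.items():
        quota = round(size * share)
        pool = by_category[category]
        subset.extend(rng.sample(pool, min(quota, len(pool))))
    return subset

def gate_passes(scores, threshold=0.85):
    """The build passes when the mean grader score meets the threshold."""
    if not scores:
        return False  # no scores means nothing was verified
    return sum(scores) / len(scores) >= threshold
```

A fixed seed is used so that a score change between two PRs reflects the code change rather than a different sample of cases.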

Target under 5 minutes to verdict on the PR gate, with a hard 8-minute timeout. Use classifier-based or deterministic graders here rather than frontier-model judges. An LLM-as-judge call adds latency and cost that a classifier cascade does not.
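One way to picture a deterministic grader running under the hard timeout is the sketch below. The grader logic, the case shape (`input`, `expected`, `tool`), and the `run_pr_gate` name are hypothetical. The loop only checks the clock between cases, so it will not interrupt a single agent call that hangs; a CI-level job timeout would still be needed for that.

```python
import time

def exact_tool_call_grader(output, expected):
    """Deterministic check: 1.0 if the agent called the expected tool, else 0.0."""
    return 1.0 if output.get("tool") == expected.get("tool") else 0.0

def run_pr_gate(cases, run_agent, grader, timeout_s=480, clock=time.monotonic):
    """Score cases in order, stopping once the hard timeout (8 minutes) is exceeded.

    `clock` is injectable so the timeout path can be tested without waiting.
    """
    start = clock()
    scores = []
    for case in cases:
        if clock() - start > timeout_s:
            return {"verdict": "timeout", "scores": scores}
        output = run_agent(case["input"])
        scores.append(grader(output, case["expected"]))
    return {"verdict": "complete", "scores": scores}
```

Returning the partial scores on timeout keeps the report useful even when the gate does not finish.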

The full suite belongs in a nightly batch: 500-2,000 examples per route, LLM-judge scoring across all rubrics, comparison against a versioned dataset. Run it after each deploy window, not only after prompt changes. Code refactors and infrastructure updates cause regressions that a prompt-only trigger misses.

How to Set Pass/Fail Thresholds Without Blocking Every Deploy

Start in reporting mode: collect automated scores alongside human judgments for several weeks, tune thresholds to match the human signal, then flip the gate to blocking. Hardcoding thresholds before calibrating the judges produces false alarms and trains the team to ignore them.

Common absolute floor values in practice: groundedness around 0.85, context relevance at 0.80, completeness at 0.75. DigitalApplied’s methodology puts 0.85 as “common, tunable by risk tolerance” for aggregate scores.
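A floor check that supports both reporting mode and blocking mode might look like the following. The floor values are the ones quoted above; the rubric keys, the `check_floors` name, and the exit-code convention are assumptions for illustration.

```python
# Floors from the text above; tune them against human judgments before blocking.
FLOORS = {"groundedness": 0.85, "context_relevance": 0.80, "completeness": 0.75}

def check_floors(mean_scores, floors=FLOORS, mode="report"):
    """Return (exit_code, failures).

    In "report" mode the exit code is always 0, so the build never fails
    while thresholds are still being calibrated. In "block" mode any rubric
    below its floor yields exit code 1.
    """
    failures = {
        rubric: (mean_scores[rubric], floor)
        for rubric, floor in floors.items()
        if rubric in mean_scores and mean_scores[rubric] < floor
    }
    exit_code = 1 if (failures and mode == "block") else 0
    return exit_code, failures
```

Flipping the gate from reporting to blocking is then a one-word configuration change rather than a code change.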

A statistical regression gate layers on top of the absolute floors. Rather than asking “did this grader drop below 0.85?”, it asks “did this PR cause a statistically significant drop from the last baseline?”

The mechanism: run Welch’s t-test on the per-example score arrays for continuous rubrics and a two-proportion z-test for binary rubrics. Fail the build when p < 0.05 and the effect size exceeds the rubric’s noise floor. This structure catches the small-but-consistent regressions that absolute thresholds alone miss.

One calibration check before gating on any LLM judge: compute Cohen’s kappa between the judge’s scores and domain-expert labels on a held-out set. DigitalApplied’s guide specifies that a kappa below 0.41 means the judge’s agreement with humans is weak enough that CI verdicts from it are not reliable. Fix the judge before enabling the gate.
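As a rough sketch of that calibration check, Cohen's kappa for two raters over categorical labels can be computed directly. The function names and the way the 0.41 cutoff is applied are assumptions for the example; the labels can be any hashable values, such as pass/fail or rubric buckets.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two raters over the same held-out examples."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Observed agreement: share of examples where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

def judge_is_reliable(kappa, minimum=0.41):
    """Apply the cutoff from the text: below 0.41, do not gate on this judge."""
    return kappa >= minimum
```

For instance, a judge that agrees with experts on 8 of 10 balanced binary labels has observed agreement 0.8 against chance agreement 0.5, giving a kappa of 0.6.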

Keeping Eval Run Time and Cost Inside the CI Budget

Route scoping is the most effective lever for keeping run time down. A PR that touches one tool call out of five flows does not need to evaluate all five. Map the PR diff to its affected call sites before dispatching the eval run. The same principle governs cost: run only the graders relevant to the touched routes, and use classifier-based checks for the PR gate rather than frontier-model judges.
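A simple version of the diff-to-route mapping is sketched below. The glob patterns, route names, and the rule that a change to shared code triggers every route are hypothetical; the list of changed files would typically come from something like `git diff --name-only` against the base branch.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from file globs to agent routes; adapt to your repo layout.
ROUTE_PATTERNS = {
    "retrieval": ["src/retrieval/*", "prompts/retrieval_*.txt"],
    "summarization": ["src/summarize/*", "prompts/summarize_*.txt"],
    "shared": ["src/common/*"],
}

def affected_routes(changed_files, patterns=ROUTE_PATTERNS):
    """Return the sorted list of routes whose evals should run for this diff."""
    touched = {
        route
        for route, globs in patterns.items()
        for path in changed_files
        if any(fnmatchcase(path, glob) for glob in globs)
    }
    if "shared" in touched:
        # Shared code can affect every route, so fall back to running them all.
        return sorted(route for route in patterns if route != "shared")
    return sorted(touched)
```

An empty result means the diff touched no mapped call site, and the PR gate can skip the eval run entirely.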

A useful ceiling from a 2026 agent observability guide: keep judge costs below 10-15% of production LLM costs. If eval costs approach 25%, the evaluation architecture needs redesigning rather than just cutting example counts.
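The budget rule above reduces to a ratio check. The sketch below is one reading of it: the status labels and the choice to treat the 15% and 25% figures as hard cutoffs are assumptions for illustration.

```python
def judge_cost_status(judge_cost, production_llm_cost):
    """Classify eval spend relative to production LLM spend (same currency and period)."""
    if production_llm_cost <= 0:
        raise ValueError("production_llm_cost must be positive")
    ratio = judge_cost / production_llm_cost
    if ratio >= 0.25:
        return "redesign"      # architecture problem, not an example-count problem
    if ratio > 0.15:
        return "over_budget"   # above the 10-15% ceiling
    return "ok"
```

With $1,200 of monthly judge spend against $10,000 of production LLM spend, the ratio is 12% and the status is "ok".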

Connecting a CI Gate to Your Trace Source

Connect your Langfuse or Braintrust trace source; the regression detection engine scores spans from each new deploy, so graders stay calibrated to what users are currently sending.

The eval suite needs inputs from current traffic, not a fixed dataset written before the last three deploys. Each span in the connected trace source reflects what the agent is doing today, so graders run against the version users are actually hitting.

The connected gate reads the PR diff, maps the touched call sites to the failure signal families they most affect, and posts a pre-deploy report on the PR before the merge. A GitHub Action can gate the merge or run in report-only mode. Commit-SHA lineage ties each verdict to the exact change that caused it.


Frequently asked questions

What should I run in my agent evals CI pipeline on every PR?
Run a fast subset of 10-20 core cases per affected route, covering happy paths, edge cases, and historical failures. Use classifier-based or deterministic graders on the PR gate; reserve expensive LLM-as-judge scoring for nightly batches. Target under 5 minutes to verdict, with a hard 8-minute timeout.
How do I set pass/fail thresholds for agent evals in CI without blocking every deploy?
Start in reporting mode: collect automated scores alongside human judgments before making the gate blocking. Common starting floors are 0.85 for groundedness and 0.80 for context relevance. Then layer a statistical regression gate on top, failing only when a grader drops with statistical significance above the rubric's noise floor.
How do I keep agent eval run time under the CI ceiling?
Scope the eval run to routes the PR actually touches, not the full agent. Keep the PR gate to 10-20 examples per affected route, and use cheap classifier-based checks rather than frontier-model judges. Defer the full suite of 500-2,000 examples per route to a nightly batch.
What is the cost of running agent evals in CI per PR?
A PR gate covering 100-200 examples with classifier-based graders costs a fraction of a dollar per run. A full cascade across many rubrics runs a few dollars. Frontier LLM judge calls add roughly $0.015 each; treat them as triggered escalation rather than a default step in every PR gate.
Why don't most engineering teams have an agent evals CI gate yet?
Most teams have observability, but far fewer run automated evaluations. A November 2025 industry survey of 1,340 AI engineering teams found that 89% instrument observability for their production agents, but only 52.4% run any offline evals. Traces do not automatically become quality gates.

Written by

Akhil Varma · Founder, Tessary

Akhil builds Tessary — AI personas that run real-browser usability tests on B2B SaaS products. Previously shipped product at multiple early-stage startups; writes about usability testing, AI personas, and the economics of B2B research.