Usability Testing for AI Features in B2B SaaS

The standard usability test script assumes the same input produces the same output. AI features do not work that way. Two users asking a copilot the same question can get different answers, the result shifts with prior context, and the interface rarely signals what the model knows. Usability testing for AI features needs different tasks, different personas, and different signals from the script most B2B SaaS teams already use.

The Maze Future of User Research 2026 report says 69% of product and research teams now use AI in at least some of their research workflows. The static-flow protocol most of those teams kept on the shelf does not surface the failure modes that show up once an AI feature is in front of a real user.

Why the standard script breaks

A test script written for an onboarding form expects deterministic outputs. The participant clicks Save, the same modal opens, and the script knows what to ask next. AI features remove that anchor. The output varies, the surface area is conversational, and there is no fixed end state to evaluate against. A script that worked for a settings page returns thin notes and confident-sounding screenshots that do not say much.
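The contrast can be sketched in a few lines of Python. This is an illustration only: save_settings, fake_copilot, and the "mentions Billing" rubric are hypothetical stand-ins, not part of any real product or tool. The point it shows is that a deterministic step can be checked against one expected value, while a variable output can only be checked against a rubric across several runs.

```python
import random


def save_settings(form: dict) -> str:
    """Deterministic step: the same input always gives the same result."""
    return "saved" if form.get("name") else "error: name required"


def fake_copilot(question: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an AI feature: the answer varies per run."""
    answers = [
        "Open Settings > Billing and choose Export.",
        "You can export invoices from the Billing page.",
        "I can help with that. Which workspace do you mean?",
    ]
    return rng.choice(answers)


def meets_rubric(answer: str) -> bool:
    """Assumed rubric: a useful answer points the user at the Billing area."""
    return "billing" in answer.lower()


def rubric_pass_rate(question: str, runs: int, seed: int = 0) -> float:
    """Share of runs whose answer satisfies the rubric."""
    rng = random.Random(seed)
    results = [meets_rubric(fake_copilot(question, rng)) for _ in range(runs)]
    return sum(results) / runs
```

A script built around save_settings can name the next screen in advance. A script built around fake_copilot cannot, so the task has to be phrased around the goal and the rubric rather than around a specific response.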

The User Interviews 2025 Research Budget Report puts the practical constraint in numbers: 29% of research teams operate on under $25,000 a year. A two-week recruit and a $200 incentive per participant can consume most of that budget for a single study. Many teams skip the AI feature test for the iteration and ship on assumption, which leaves activation drop-off and support tickets to do the work the test was supposed to do.

Three failure modes worth scripting separately

Copilots, smart search, and summarization fail differently. Running one task list across all three misses the friction that drives users to abandon the feature.

Copilots: capability boundaries and trust

The most common copilot failure is not knowing what to ask. The user invokes the assistant for something it cannot do, gets a confident response that misses the goal, and stops trying. A second pattern: the response is useful, but the user verifies it manually anyway, which makes the copilot a slower path than the manual one.

What to look for: does the persona find the invocation surface without prompting? After the response lands, does the persona act on it or open a second tab to check it? At what point in the session does the persona stop asking the copilot and go back to the menu?

Smart search: relevance and mental model

Smart search fails when the ranking does not match how the user thinks about the query. Semantic ranking can put a correct result on top that the user still reads as wrong. The user rewrites the query a few times, then falls back to manual filters.

What to look for: does the persona understand why the top result is the top result? Do they reformulate when results look wrong, or do they abandon search and navigate by tree?

Summarization: verification behavior

Summarization fails quietly. The user reads the summary, treats it as the source, and acts on it. If the summary drops a clause or paraphrases inaccurately, the user does not catch it unless the interface gives them a reason to look.

What to look for: does the persona open the underlying document after reading the summary? What cue, if any, prompts that check?
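One way to keep the three task lists separate is to store the "what to look for" questions per feature type, so a session for one feature never inherits another feature's checklist. The sketch below is a minimal illustration; the FeatureType enum and checklist_for helper are hypothetical names, and the checklist wording is condensed from the questions above.

```python
from enum import Enum


class FeatureType(Enum):
    COPILOT = "copilot"
    SMART_SEARCH = "smart_search"
    SUMMARIZATION = "summarization"


# Observation prompts condensed from the "what to look for" questions.
OBSERVATION_CHECKLISTS: dict[FeatureType, list[str]] = {
    FeatureType.COPILOT: [
        "Finds the invocation surface without prompting",
        "Acts on the response, or opens a second tab to check it",
        "Point in the session where the persona goes back to the menu",
    ],
    FeatureType.SMART_SEARCH: [
        "Understands why the top result is the top result",
        "Reformulates when results look wrong, or abandons search",
    ],
    FeatureType.SUMMARIZATION: [
        "Opens the underlying document after reading the summary",
        "Cue, if any, that prompts that check",
    ],
}


def checklist_for(feature: FeatureType) -> list[str]:
    """Return a copy of the checklist so callers cannot mutate the original."""
    return list(OBSERVATION_CHECKLISTS[feature])
```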

How to configure personas for AI feature testing

For AI features, the failure mode tracks the user’s prior knowledge as much as their goal. A generic persona returns generic friction. The persona context that matters here goes beyond role: it includes what the user already knows about the AI and how much they expect from it.

For a copilot, configure the persona’s familiarity with what the AI can do. A new user invokes a copilot differently than someone who has used it for a month. Both surface real failures, but different ones.

For smart search, configure the query habit. A user who defaults to exact-phrase queries reads semantic ranking as broken. A user who writes natural language queries does not.

For summarization, configure skepticism level. A persona who trusts AI outputs by default behaves differently than a persona who is cautious about automated content.
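The three dimensions above can be written down as a small persona record. The sketch below is a hypothetical structure for planning purposes, not the configuration format of any particular tool; the class names, field names, and two-level scales are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class AIFamiliarity(Enum):
    NEW_USER = "new_user"
    EXPERIENCED = "experienced"  # e.g. has used the copilot for a month


class QueryHabit(Enum):
    EXACT_PHRASE = "exact_phrase"
    NATURAL_LANGUAGE = "natural_language"


class Skepticism(Enum):
    TRUSTS_BY_DEFAULT = "trusts_by_default"
    CAUTIOUS = "cautious"


@dataclass(frozen=True)
class AIPersona:
    role: str
    goal: str
    ai_familiarity: AIFamiliarity
    query_habit: QueryHabit
    skepticism: Skepticism


def persona_variants(role: str, goal: str) -> list[AIPersona]:
    """Enumerate every combination of the three AI-specific dimensions."""
    return [
        AIPersona(role, goal, familiarity, habit, skepticism)
        for familiarity, habit, skepticism in product(
            AIFamiliarity, QueryHabit, Skepticism
        )
    ]
```

With two levels per dimension this yields eight variants for one role and goal. In practice a team would likely pick the two or three combinations that match the feature under test rather than run all of them.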

For the underlying persona work, see How to Write Usability Test Personas for B2B SaaS Products.

Two signals that do not appear in static-flow tests

Trust calibration. Is the persona placing the right amount of trust in the AI output? Over-trust looks like acting on a wrong answer without checking. Under-trust looks like ignoring a correct answer because it reads as suspicious. Both are usability failures, and both can show up in the same session.
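As a rough illustration, the two trust failures can be tagged from three observations per interaction. The AIInteraction record and trust_signal function below are hypothetical; they assume a researcher has judged after the session whether each answer was correct.

```python
from dataclasses import dataclass


@dataclass
class AIInteraction:
    answer_was_correct: bool      # judged by the researcher after the session
    persona_acted_on_it: bool     # did the persona use the answer?
    persona_verified_first: bool  # did the persona check it before acting?


def trust_signal(interaction: AIInteraction) -> str:
    """Tag an interaction with the trust failure it shows, if any."""
    if (
        not interaction.answer_was_correct
        and interaction.persona_acted_on_it
        and not interaction.persona_verified_first
    ):
        return "over-trust"   # acted on a wrong answer without checking
    if interaction.answer_was_correct and not interaction.persona_acted_on_it:
        return "under-trust"  # ignored a correct answer
    return "none"             # neither failure observed in this interaction
```

Tagging each interaction separately, rather than the session as a whole, makes it visible when both failures occur in the same session.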

Error recovery. When the AI returns a poor response, what does the persona do next? Retry with a different input? Switch to a manual path? Abandon the task? The recovery move is the one that decides whether the feature has a second chance with this user.
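The recovery move can be logged as one of the three options named above and summarized across sessions. This is a minimal sketch with hypothetical names (RecoveryMove, recovery_breakdown), assuming one recorded move per poor response.

```python
from collections import Counter
from enum import Enum


class RecoveryMove(Enum):
    RETRY = "retry with a different input"
    MANUAL_PATH = "switch to a manual path"
    ABANDON = "abandon the task"


def recovery_breakdown(moves: list[RecoveryMove]) -> dict[str, float]:
    """Share of each recovery move across all observed poor responses."""
    counts = Counter(moves)
    total = len(moves)
    return {
        move.name: (counts[move] / total if total else 0.0)
        for move in RecoveryMove
    }
```

A high share of retries suggests the feature still gets a second chance; a high share of manual-path or abandon moves suggests it does not.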

Static-flow protocols do not produce variable outputs, so these signals never come up. A moderated session with a real participant could surface them, but not on a sprint cadence.

Usability testing for AI features without a recruiting cycle

Tessary runs domain-aware AI personas on a Figma prototype or a live URL in a real browser. You configure the persona’s AI familiarity, query habit, and skepticism, and the persona walks the feature and returns structured findings: where it hesitated, where it verified, where it gave up. Findings come back in minutes, which lets the test run at the start of the sprint instead of after the support tickets arrive.
