How to Analyze Usability Test Results for B2B SaaS Teams

By Akhil Varma

Short answer

Analyzing usability test results has three steps: rate each issue on Nielsen's 0-to-4 severity scale, distinguish patterns (the same issue hitting a majority of participants, such as two of three) from outliers, and write findings with three fields per issue: what happened, severity and frequency, and a specific fix. That format is what engineering teams can act on without a research background.

Jakob Nielsen published the severity rating scale for usability problems in 1994 after observing that evaluators reviewing the same session regularly disagreed on which findings warranted a fix. That disagreement is still the default outcome when teams analyze session results without a shared framework.

Analyzing usability test results is mostly a prioritization problem, not a data problem. The observations are usually there. What tends to be missing is the structure for moving from raw notes to a priority list.

Three steps handle most of the analysis work: rating each issue by severity, separating patterns from one-off observations, and writing a summary the engineering team can act on without a research background.

How to analyze usability test results: start with severity

Nielsen's severity rating framework, documented by Nielsen Norman Group (NN/g), uses a 0 to 4 scale: 0 for observations that are not real problems, 1 for cosmetic issues, 2 for minor friction, 3 for significant friction that affects task completion regularly, and 4 for blockers that prevent task completion.

The scale is useful because it forces a decision per observation rather than a feeling. “Participant seemed confused on the confirmation screen” is not a severity rating. “Participant could not complete checkout without assistance: severity 4” is. Going through the full observation list and assigning a number to each one converts a notes document into a priority queue.
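As a rough illustration of that conversion, here is a minimal Python sketch. The `Severity` names follow this article's wording of the scale, and the observations are invented for the example rather than taken from a real session.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The 0-to-4 scale, labeled with this article's wording."""
    NOT_A_PROBLEM = 0
    COSMETIC = 1
    MINOR = 2
    SIGNIFICANT = 3
    BLOCKER = 4


# Hypothetical observations: (note, severity assigned during review).
observations = [
    ("Tooltip text is truncated on the plan picker", Severity.COSMETIC),
    ("Participant could not complete checkout without assistance", Severity.BLOCKER),
    ("Participant re-read the confirmation screen twice before continuing", Severity.MINOR),
]

# Sorting highest severity first turns the notes list into a priority queue.
priority_queue = sorted(observations, key=lambda obs: obs[1], reverse=True)

for note, severity in priority_queue:
    print(f"{int(severity)} ({severity.name}): {note}")
```

The point of the enum is that every observation has to carry one of five values, so "seemed confused" cannot enter the queue without a number attached.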

For B2B products specifically, adding a business relevance column helps. An issue on the user permissions screen that affects admin setup ranks higher than the same-severity issue on a rarely visited billing page. Frequency, impact, and product importance all belong in the rating.
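One way to picture the business relevance column is as a tie-breaker in the sort. The sketch below assumes a 1-to-3 relevance score, which is a made-up convention for this example; the `RatedIssue` type and the sample issues are hypothetical too.

```python
from dataclasses import dataclass


@dataclass
class RatedIssue:
    description: str
    severity: int            # 0 to 4 on the severity scale
    business_relevance: int  # assumed scale: 1 (rarely visited) to 3 (core flow)


issues = [
    RatedIssue("Invoice download link is hard to find on the billing page", 3, 1),
    RatedIssue("Admin cannot tell which role grants edit access", 3, 3),
    RatedIssue("Label is misaligned on the profile form", 1, 2),
]

# Severity decides the order; business relevance breaks ties between
# issues that share a severity.
ranked = sorted(
    issues,
    key=lambda issue: (issue.severity, issue.business_relevance),
    reverse=True,
)

for issue in ranked:
    print(issue.severity, issue.business_relevance, issue.description)
```

Both severity 3 issues stay above the cosmetic one, but the permissions issue moves ahead of the billing issue because it sits on a core admin flow.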

How to tell a pattern from a one-off finding

Two of three participants hitting the same issue qualifies as a pattern; a single participant is an outlier. Counting before putting something on the priority list is the discipline that converts a debrief discussion into a ranked queue. A single participant who missed a navigation label is worth noting. Two out of three who missed the same label is worth fixing.

The threshold scales with the number of participants. With three participants, two out of three qualifies; with five, three out of five. The exact number matters less than applying it consistently. One-off observations belong in an appendix.
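The counting rule is small enough to write down. This sketch reads the two examples above (two of three, three of five) as "more than half", which is an interpretation rather than a rule stated in the text, and it treats groups of fewer than three participants as too small to confirm a pattern. The function names are hypothetical.

```python
def pattern_threshold(participants: int) -> int:
    """Smallest count that is a strict majority of the participant group.

    Assumption: "2 of 3" and "3 of 5" are read as "more than half".
    Adjust this if your team uses a different rule.
    """
    if participants < 1:
        raise ValueError("participants must be at least 1")
    return participants // 2 + 1


def is_pattern(hits: int, participants: int) -> bool:
    """True when enough participants hit the same issue to call it a pattern."""
    # Groups smaller than three are treated as directional only.
    if participants < 3:
        return False
    return hits >= pattern_threshold(participants)


print(pattern_threshold(3))  # 2
print(pattern_threshold(5))  # 3
print(is_pattern(1, 3))      # False
```

Whatever threshold a team picks, putting it in one function is a cheap way to apply it the same way in every debrief.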

One exception is worth noting for B2B products: a single participant can surface a critical issue if they closely match your primary user type. A finding from a participant who does not represent your core user stays in the appendix. A finding from someone who closely matches your ideal customer profile (ICP) warrants a second look before it gets filed as an outlier.
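Sketched as a triage rule, the exception looks something like the following. The `triage` function, its return labels, and the majority threshold are all assumptions made for illustration; the only logic taken from the text is that a solo finding from a well-matched participant gets a second look instead of going straight to the appendix.

```python
def triage(hits: int, participants: int, solo_matches_icp: bool = False) -> str:
    """Route a finding to the priority list, a second look, or the appendix."""
    # Assumed threshold: a strict majority (2 of 3, 3 of 5).
    majority = participants // 2 + 1
    if participants >= 3 and hits >= majority:
        return "priority list"
    # The exception: one participant, but a close match to the core user.
    if hits == 1 and solo_matches_icp:
        return "second look"
    return "appendix"


print(triage(hits=2, participants=3))                         # priority list
print(triage(hits=1, participants=3, solo_matches_icp=True))  # second look
print(triage(hits=1, participants=3))                         # appendix
```

Note that "second look" is a prompt for human judgment, not an automatic promotion to the priority list.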

This is why session design matters as much as session count. Recruiting and running sessions when your users are hard to find is a different problem than analysis. But a poorly matched participant makes analysis harder because you cannot apply the pattern threshold cleanly.

How to write a findings summary engineers can use without reading the full report

The format that moves fastest into a sprint has three fields per finding: what happened (a single observable fact, not an interpretation), severity and frequency (how bad, how often), and a recommended fix (specific enough to act on).

“Navigation unclear” does not meet that bar. “Three of four participants clicked Settings looking for the billing page, which is currently under Account. Recommended: add a redirect or move billing to Settings. Severity 3” does.
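For teams that keep findings in a script or spreadsheet export, the three fields map onto a small record. This is a minimal sketch: the `Finding` type, its field names, and the `to_ticket_line` format are hypothetical, and the sample values reuse the billing example above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    what_happened: str    # one observable fact, not an interpretation
    severity: int         # 0 to 4 on the severity scale
    hits: int             # participants who hit the issue
    participants: int     # participants in the round
    recommended_fix: str  # specific enough to act on

    def to_ticket_line(self) -> str:
        """Render the finding as a single line for a sprint ticket."""
        return (
            f"[Severity {self.severity}, {self.hits}/{self.participants}] "
            f"{self.what_happened} Recommended: {self.recommended_fix}"
        )


finding = Finding(
    what_happened=(
        "Participants clicked Settings looking for the billing page, "
        "which is currently under Account."
    ),
    severity=3,
    hits=3,
    participants=4,
    recommended_fix="add a redirect or move billing to Settings.",
)

print(finding.to_ticket_line())
```

Making every field required is the useful part: a finding like "Navigation unclear" cannot be recorded without also stating a severity, a count, and a fix.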

The summary does not need to explain research methodology. Engineers do not need to know how many sessions ran or what the task script looked like. They need to know what is broken, how broken it is, and what to do about it. That is a short document with a severity column.

One thing that shortens this step considerably: starting from structured output rather than raw session notes. Tessary returns findings in that format automatically. Each session produces severity-rated findings with screenshots at each hesitation point and step-by-step reasoning traces that show what the persona saw and how it interpreted the interface. The write-up is mostly done before the debrief starts.

Frequently asked questions

How do you prioritize usability test findings?
Use a severity scale (0 to 4, from Nielsen's framework) to rate each issue by how much it blocks task completion, then weight by frequency (how many participants hit the same problem). Issues that score high on both go to the top of the sprint backlog. For B2B products, add a business relevance column: a blocker on a core flow ranks above the same-severity issue on a rarely visited screen.
What is a severity rating in usability testing?
A severity rating measures how serious a usability problem is. Nielsen's 0-to-4 scale is the standard: 0 means the finding is not a real issue, 4 means the issue prevents task completion and must be fixed before launch. Severity is a combination of frequency (how often users hit the issue), impact (how hard it is to work around), and persistence (whether users keep running into it even after they know about it).
How many participants do you need to identify a usability pattern?
With three participants, two out of three hitting the same issue is enough to call it a pattern. With five participants, three out of five. With fewer than three, findings are directional but not confirmable. One-off observations often reflect individual context rather than a real flow problem and belong in an appendix, not the priority list. The exception: a solo finding from a participant who closely matches your ICP warrants a second look.
What should a usability findings summary include?
Three fields per finding: what happened (a specific observable fact, not an interpretation), severity and frequency (from your rating scale), and a recommended fix specific enough for an engineer to act on. 'Navigation unclear' is not a finding. 'Three of four participants clicked Settings looking for billing, which is under Account. Recommended: add a redirect or move billing to Settings. Severity 3' is.
How do you explain usability findings to stakeholders who did not watch the sessions?
Skip the methodology. Stakeholders who did not watch the sessions do not need to know the participant count or the task structure. They need to know what is broken, how often, and what it costs to leave unfixed. A one-page summary with three to five critical findings and a recommended fix for each is more useful than a 40-slide deck. Link each finding to sprint-visible work.
How does Tessary structure usability test results?
Tessary returns findings as a prioritized list: each issue has a severity rating, a screenshot showing exactly where the persona hesitated or failed, and a reasoning trace explaining what the persona saw and how it interpreted the interface. The structure is designed so a team can move from session output to sprint ticket without an intermediate synthesis step.

Written by

Akhil Varma · Founder, Tessary

Akhil builds Tessary — AI personas that run real-browser usability tests on B2B SaaS products. Previously shipped product at multiple early-stage startups; writes about usability testing, AI personas, and the economics of B2B research.