Usability Testing AI-Generated Code Before You Open a PR

A feature built in two hours with Cursor compiles, lints, and renders in local dev. Tests pass. Nobody has tried to use it. The PR description is the next thing on the list, and the only validation between “it works” and “merge it” is whoever clicks through it during review. That gap is the whole subject of usability testing AI-generated code.

The gap was always there. AI coding tools made it bigger by compressing the build step. Microsoft’s FY26 Q2 earnings put GitHub Copilot at 4.7 million paid subscribers as of January 2026. Bloomberg reported Cursor at 1 million daily active users by April 2025. The build half of the loop sped up. The validation half stayed where it was.

What AI coding tools actually skip

The model writes code that runs. It does not simulate someone using the code. A 2025 arXiv survey of bugs in AI-generated code finds 60% of faults are silent logic failures: the program executes, the output is wrong, and only an actual run with an actual user reveals it.

Accessibility shows the same pattern. A 2024 ACM Web for All paper found that 84% of AI-generated websites contained accessibility violations by default, including missing ARIA attributes and broken keyboard navigation. The defaults are not safe, and they ship with the feature unless someone checks.
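A few of these violations can be spotted statically before anyone opens the page. The sketch below is a rough heuristic, not an accessibility audit: it uses Python's standard `html.parser` to flag images without `alt`, buttons without an accessible name, and clickable `div` or `span` elements that cannot receive keyboard focus. The class name, function name, and the three rules are assumptions chosen for illustration, and the check will miss cases such as a button labeled only by an image.

```python
from html.parser import HTMLParser


class A11ySmokeCheck(HTMLParser):
    """Flags a few statically detectable accessibility gaps (heuristic only)."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self._open_button = None  # [has_aria_label, collected_text] while inside <button>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and "alt" not in a:
            self.issues.append("img without alt text")
        if tag == "button":
            has_label = bool(a.get("aria-label") or a.get("aria-labelledby"))
            self._open_button = [has_label, ""]
        # A click handler on a non-focusable element cannot be reached by keyboard.
        if tag in ("div", "span") and "onclick" in a and "tabindex" not in a:
            self.issues.append(f"clickable <{tag}> not reachable by keyboard")

    def handle_data(self, data):
        if self._open_button is not None:
            self._open_button[1] += data

    def handle_endtag(self, tag):
        if tag == "button" and self._open_button is not None:
            has_label, text = self._open_button
            if not has_label and not text.strip():
                self.issues.append("button without accessible name")
            self._open_button = None


def a11y_issues(html: str) -> list[str]:
    """Return the list of issues found in an HTML fragment."""
    checker = A11ySmokeCheck()
    checker.feed(html)
    checker.close()
    return checker.issues


if __name__ == "__main__":
    sample = '<button><svg></svg></button><img src="chart.png"><div onclick="save()">Save</div>'
    for issue in a11y_issues(sample):
        print(issue)
```

A check like this only narrows the list. Keyboard order, focus visibility, and screen-reader behavior still need someone to walk the page.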

Four things tend to be wrong in AI-built UI, and none of them are caught by tests:

Empty states. The model built against sample data, so the zero-data view is broken or confusing.
Task flow. The screens are coherent in isolation, but completing a real goal requires backtracking.
Copy. Button labels and error messages read like placeholders because they are placeholders.
Domain context. A control that makes sense to the model means something different to a procurement lead, a clinician, or a security admin.

What “tested” usually means before the PR opens

Most teams do not have a formal usability step between local dev and code review. The closest thing is informal: ping a designer in Slack, or wait for a reviewer to notice. UserTesting’s State of UX survey reports that 47% of researchers cite recruiting as the hardest phase of a study, which is why a formal study almost never runs on a single feature inside a sprint. The cost-benefit does not work for anything smaller than a quarterly research project.

So the AI-built feature ships through the same review channel as a hand-written feature, except now there are more of them, written faster, with less of the developer’s own usage feedback baked in. Reviewers catch what they catch. The rest reaches users.

A workflow for usability testing AI-generated code before the PR

The point of this step is not statistical confidence. It is one structured pass against the flow before code review starts.

Deploy to a preview URL or staging. The persona needs an interactive page, not a screenshot. Vercel preview URLs and similar branch deploys are enough.
Configure a persona that matches your actual user. For a settings flow inside a developer tool: a senior backend engineer at a mid-sized SaaS company, manages integrations, has used similar tools, moderate patience for new interfaces. The closer the persona context is to the real user, the more useful the friction signal.
Write the task as user intent. “Connect a webhook integration and verify it is active” beats “test the settings page.” A task in user language exposes flow breaks. A task in feature language tends to confirm what the developer already knew.
Read the findings before writing the PR description. Look for where the persona hesitated, which label caused a re-read, and which path it tried first and abandoned. Fix the obvious ones. Note the rest in the PR for the reviewer.
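The persona and task from the steps above can be written down as a small structure, which makes them easy to review and reuse. The following is a minimal sketch under stated assumptions: `Persona`, `UsabilityRun`, `task_warnings`, the word lists, and the example URL are all hypothetical names made up for illustration and do not describe any particular tool's API. The `task_warnings` helper is a crude heuristic for spotting a task written in feature language instead of user intent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    role: str
    company: str
    responsibilities: str
    prior_experience: str
    patience: str


@dataclass(frozen=True)
class UsabilityRun:
    preview_url: str  # must point at an interactive deploy, not a screenshot
    persona: Persona
    task: str


# Hypothetical word lists; tune them to your own product vocabulary.
VAGUE_VERBS = {"test", "check", "try", "explore"}
FEATURE_TERMS = {"page", "screen", "component", "view"}


def task_warnings(task: str) -> list[str]:
    """Flag tasks phrased in feature language rather than user intent."""
    words = [w.strip(".,").lower() for w in task.split()]
    warnings = []
    if words and words[0] in VAGUE_VERBS:
        warnings.append("starts with a vague verb; name the user's goal instead")
    if FEATURE_TERMS & set(words):
        warnings.append("names a UI surface; describe the outcome the user wants")
    return warnings


run = UsabilityRun(
    preview_url="https://my-feature-branch.example.com",  # placeholder URL
    persona=Persona(
        role="Senior backend engineer",
        company="Mid-sized SaaS company",
        responsibilities="Manages integrations",
        prior_experience="Has used similar tools",
        patience="Moderate patience for new interfaces",
    ),
    task="Connect a webhook integration and verify it is active",
)

if __name__ == "__main__":
    print(task_warnings(run.task))
    print(task_warnings("Test the settings page"))
```

Keeping the task next to the persona in one record also gives the reviewer the exact wording that produced the findings.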

The pass takes minutes. The PR description ends up writing itself, because the persona run names the parts of the flow worth flagging.

What the findings actually catch

Three patterns show up most often in AI-built UI when a persona walks the flow:

Navigation friction. The persona looks for a control where it expects one and does not find it. The screens were built in the order the model generated them, which is not always the order a user would scan.

Empty-state breakage. Findings tied to a zero-data state are the clearest signal that the generated code was tested against sample data only. These tend to be quick fixes once surfaced and very expensive once a real user hits them.
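The fix is usually an explicit zero-data branch. As a minimal sketch, assume a hypothetical `render_webhook_list` function that turns a list of webhook records into display text; the function name, the record fields, and the empty-state copy are invented for illustration.

```python
def render_webhook_list(webhooks: list[dict]) -> str:
    """Render webhook rows, with an explicit message for the zero-data case."""
    if not webhooks:
        # Without this branch the view would render as a blank area.
        return "No webhooks yet. Add one to start receiving events."
    rows = []
    for hook in webhooks:
        status = "active" if hook["active"] else "inactive"
        rows.append(f"{hook['name']} ({status})")
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_webhook_list([]))
    print(render_webhook_list([{"name": "deploys", "active": True}]))
```

Running the view once with an empty list, as the last lines do, is the cheapest way to confirm the branch exists.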

Copy hesitation. The persona spends time on a label or button that should be self-explanatory. Placeholder copy reads fine to the developer who saw the model write it. It reads as ambiguous to a user encountering it cold.

For a wider view of engineer-led usability work past the pre-PR step, the engineer usability testing guide covers the flow from staging through post-launch.

Where this fits and where it does not

A persona run gives directional evidence on a single flow in minutes. It does not stand in for a moderated session with a real user on a sensitive workflow, and it does not replace the parts of code review that are about correctness, security, or architecture. What it does is move the question of “is this navigable” from the reviewer’s gut feel to a structured artifact attached to the PR.
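One way to produce that artifact is to turn the findings into a short block of text for the PR description. The sketch below assumes a hypothetical `Finding` record and output layout; neither reflects a specific tool's export format.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str     # e.g. "navigation", "empty-state", "copy"
    observation: str  # what the persona did or hesitated on
    fixed: bool       # True if addressed before opening the PR


def pr_usability_section(findings: list[Finding]) -> str:
    """Format findings as plain text, split into fixed and still-open items."""
    fixed = [f for f in findings if f.fixed]
    still_open = [f for f in findings if not f.fixed]
    lines = ["Usability pass (persona run)"]
    lines.append(f"Fixed before review: {len(fixed)}")
    lines += [f"- [{f.category}] {f.observation}" for f in fixed]
    lines.append(f"Open, for the reviewer: {len(still_open)}")
    lines += [f"- [{f.category}] {f.observation}" for f in still_open]
    return "\n".join(lines)


if __name__ == "__main__":
    print(pr_usability_section([
        Finding("empty-state", "Webhook list was blank with no data", fixed=True),
        Finding("copy", "Re-read the 'Apply' button label twice", fixed=False),
    ]))
```

Separating fixed items from open ones tells the reviewer where to spend attention.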

The honest version of the pitch is narrower than the marketing version. On flow-level questions, persona findings agree most of the time with what a moderated session would surface. They are weaker on lived-experience questions, on emotional response, and on anything that requires the user’s own data to test. Use them for the iteration loop. Schedule the human session for the rest.

Try it on the next AI-built feature

Paste the preview URL, configure a persona that matches the user, and read the findings before you open the PR. Tessary is free to start, no credit card.
