The Structural Failure of "Build First, Test Later"
In web service and application development, many projects treat user testing as a final step: once the UI is polished and the features work, users interact with the finished product to check for problems. This sequence feels intuitive, but from a cost-structure perspective it is the least efficient approach possible.
In one contracted development project, a beta version built over five months was shown to five real users for the first time. Within ten minutes, a fundamental problem in the navigation hierarchy became clear: users could not understand where they were in the system and could not reach the pages they needed. This required a full information architecture redesign, adding three weeks of unplanned development.
In another case, a checkout flow for an e-commerce site was tested after implementation, revealing that users consistently became confused by the order of fields in the input form. The sequence of "billing address" and "shipping address" was reversed relative to users' natural thought flow. This was a problem that could have been discovered in five minutes of wireframe testing, but fixing it after implementation cost half a day of engineering time.
The core of this structure is the principle that the cost of correcting a design decision increases exponentially with each later phase. The cost of changing a paper sketch and the cost of changing a live system differ by orders of magnitude. The value of integrating user testing early in the process is not to "discover problems" — it is to break design hypotheses at the point when doing so costs the least.
Designing a User Test: What to Ask, Whom to Ask, and How
The quality of a user test is not determined by the number of participants — it is determined by the precision of the test plan. Research from Nielsen Norman Group has shown that approximately 85% of major usability problems can be discovered with just five participants. What matters is the design of "whom to ask, what to ask, and how."
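The five-participant figure comes from Nielsen's problem-discovery model: if each participant independently surfaces a given problem with probability L (roughly 0.31 in Nielsen's aggregated data), the expected share of problems found by n participants is 1 − (1 − L)^n. A minimal sketch, with the parameter value taken from Nielsen's published estimate:

```python
# Nielsen's problem-discovery model: expected share of usability problems
# found by n participants, where L is the probability that one participant
# surfaces a given problem (L ~= 0.31 in Nielsen's aggregated study data).

def proportion_found(n_participants: int, l: float = 0.31) -> float:
    """Expected share of problems discovered by n independent participants."""
    return 1 - (1 - l) ** n_participants

if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(f"{n} participants -> {proportion_found(n):.0%} of problems")
```

With the default L, five participants land at roughly 84–85%, which is where the widely quoted figure originates; the marginal yield of each additional participant drops quickly after that.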
Defining the Objective
The first step in a test plan is to define in a single sentence what the test will reveal. "Confirming usability" does not function as an objective. State the design hypothesis to be validated concretely: for example, "Can users correctly understand the account type options and make a selection during the new registration flow?" Tests with vague objectives produce scattered observations, and the interpretation of the resulting data lacks consistency. Projects that end up asking "what did we actually learn?" after testing usually have insufficient definition at this stage.
Task Design
The core of user testing lies in task design — the instructions given to participants for what to actually do. The most important thing to avoid in task design is leading the participant.
A bad task: "Please purchase a product." If the word "purchase" matches a button label in the system, finding that button is not a useful test.
A good task: "Please choose one thing you'd like as a treat for yourself on your birthday and arrange to have it delivered to your home." By providing context aligned with the user's real motivation, this naturally draws out a sequence of behaviors — searching, comparing, adding to cart, entering an address.
Tasks should be written as narrative scenarios (stories) reflecting real-world usage contexts, and should not contain any proprietary names or UI element labels from within the system.
Selecting Participants
Participants should be selected based on the persona. A selection of "anyone is fine" produces feedback that feels real to no one. For B2B services or tools intended for specialized roles in particular, work experience and job title strongly influence behavior patterns — so the precision of participant attributes directly affects the reliability of validation.
Using acquaintances or internal staff as test participants requires caution: it is difficult to create the psychological safety needed for honest reactions. Participants recruited through the client's personal contacts, in particular, tend to suppress critical feedback.
Matching Prototype Fidelity to Test Precision
For user testing to work effectively, the fidelity of the prototype used in testing must be aligned with the type of design hypothesis being examined. High-fidelity prototypes elicit reactions to visual details, while sometimes obscuring structural problems.
Low-Fidelity Prototypes (Paper / Sketches)
Best suited for validating information architecture, navigation structure, and content priority. Paper sketches or hand-drawn wireframes are sufficient, enabling validation at the stage where modification costs are minimal.
Tests at this stage should focus on two questions: "Can users navigate in a flow consistent with their mental model?" and "Does the content categorization align with users' cognitive structure?"
Mid-Fidelity Prototypes (Wireframes / Clickable)
Best suited for flow validation and confirming major interactions. Using prototype functions in tools like Figma or Adobe XD, users interact with a clickable version to identify points of confusion or misalignment in understanding.
At this stage, recording "task completion rate" and "number of clicks and time to task completion" is important for quantifying the impact of design complexity on behavior.
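The metrics above can be aggregated with very little tooling. A minimal sketch (the record fields and helper names are illustrative assumptions, not a standard schema):

```python
# Aggregating per-participant task metrics from a moderated test:
# completion, click count, and seconds to finish one task.
from dataclasses import dataclass
from statistics import median

@dataclass
class TaskResult:
    participant: str
    completed: bool
    clicks: int
    seconds: float

def summarize(results: list[TaskResult]) -> dict:
    """Completion rate over all participants; click/time medians over completers."""
    completed = [r for r in results if r.completed]
    return {
        "completion_rate": len(completed) / len(results),
        "median_clicks": median(r.clicks for r in completed),
        "median_seconds": median(r.seconds for r in completed),
    }

results = [
    TaskResult("P1", True, 12, 95.0),
    TaskResult("P2", True, 7, 61.5),
    TaskResult("P3", False, 20, 180.0),
    TaskResult("P4", True, 9, 74.0),
    TaskResult("P5", True, 15, 120.0),
]
print(summarize(results))
```

Medians are used rather than means because with five participants a single outlier would dominate an average; failed attempts are excluded from the time and click medians so that abandonment does not masquerade as speed.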
High-Fidelity Prototypes (Near-Implementation State)
Tests conducted when color, typography, and copywriting are close to final are suited to evaluating visual impressions and validating fine-grained interactions. However, there is little room for fundamental design changes at this stage, so the types of problems that can be addressed are limited.
The higher the prototype fidelity, the more participant reactions drift toward visual evaluation. Task design must keep observation anchored on "can the user accomplish the intended task?" rather than "is the design attractive?"
Running the Test and Observing: The Moderator's Role
During test execution, the role of the moderator (facilitator) can be summarized in three principles: don't give away answers, don't impose interpretations, and don't fill silences.
The Think-Aloud Protocol
Asking participants to "please say out loud what you're thinking as you interact" — the think-aloud protocol — is a standard user testing technique. Making internal thought visible allows observation not just of "where someone got stuck" but "why they got stuck."
Participants unfamiliar with think-aloud tend to fall silent. It is effective to begin with a practice task (using a website unrelated to the test subject) before starting the actual test.
Controlling Moderator Intervention
The trap moderators most commonly fall into is the urge to help. When a participant is struggling, offering a hint destroys the opportunity to observe how a real user fares with the system unaided. The threshold for intervention should be defined clearly: intervene only when the participant's ability to continue the test itself is impaired.
Use neutral probes to draw out participant interpretation: "What did you think that meant?" "What were you expecting to see on that screen?" Statements that provide answers — "You should have pressed that button" — are strictly prohibited, as they have a significant effect on subsequent tasks.
Designing the Recording
Ideal test recording combines three layers: a screen recording with the participant's face, audio, and observers' field notes. Recorded data can be reviewed repeatedly, but field notes written in real time capture contextual information about what was happening at each moment, an important complement to the video.
When multiple observers are present, divide observation perspectives in advance. Assigning different observers to "flow of interaction," "verbal output," and "changes in emotional expression" reduces missed observations.
Analyzing Results and Feeding Back into Design
The analysis phase after testing is a step that many teams undervalue. The simple approach of "listing the problem areas and fixing them" addresses surface symptoms and misses underlying design problems.
Extracting Insights via Affinity Analysis
Write each observation recorded during testing on a separate sticky note (or online whiteboard card), then group them by theme and behavior pattern through affinity analysis.
The important move in grouping is inferring from "what happened (phenomenon)" to "why it happened (cause)." "The user could not find the button" is a phenomenon. "The visual weight of the CTA button is weak compared to surrounding elements" is the cause. "The design decision to prioritize the visual prominence of the conversion point was not executed consistently" is the flaw in the design hypothesis. Analyzing through all three layers keeps fixes from being isolated patches to single screens and prevents similar problems from recurring elsewhere.
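The three analysis layers can be made concrete as a record shape for each affinity cluster. A minimal sketch, assuming hypothetical field names (the example content is taken from the button-visibility case above):

```python
# One affinity cluster: raw observations grouped together, plus the
# three analysis layers (phenomenon -> cause -> design-hypothesis flaw).
from dataclasses import dataclass

@dataclass
class AffinityCluster:
    observations: list[str]      # raw sticky notes: what was seen/heard
    phenomenon: str              # layer 1: the observable pattern
    cause: str                   # layer 2: why it happened
    hypothesis_flaw: str         # layer 3: which design decision failed

cluster = AffinityCluster(
    observations=[
        "P2 scrolled past the CTA twice",
        "P4 clicked the logo while looking for 'buy'",
        "P5 said 'where do I actually order?'",
    ],
    phenomenon="Users could not find the purchase button",
    cause="The CTA's visual weight is weak relative to surrounding elements",
    hypothesis_flaw=(
        "Prioritizing the conversion point's visual prominence "
        "was not executed consistently"
    ),
)
print(cluster.phenomenon)
```

Forcing every cluster to fill all three fields is the point of the structure: a cluster that has observations but no articulated cause or hypothesis flaw is analysis left unfinished.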
Severity Assessment and Prioritization
Assess the severity of identified problems. Using Nielsen's severity scale as a reference, evaluate on three axes: "frequency of occurrence (how many out of how many participants)," "impact on task completion (did it prevent completion, or only slow it down)," and "ease of workaround (could the user find an alternative path)."
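One way to make the three axes comparable across problems is to fold them into a single score. The weights below are illustrative assumptions for a sketch, not part of Nielsen's published scale:

```python
# Hypothetical severity score combining the three assessment axes.
# Scale points and multipliers are illustrative, not a standard.

def severity(frequency: float, blocks_completion: bool, has_workaround: bool) -> float:
    """frequency: share of participants affected (0.0-1.0)."""
    impact = 3 if blocks_completion else 1        # blocked task weighs more
    workaround = 1 if has_workaround else 2       # no escape route weighs more
    return frequency * impact * workaround

issues = {
    "address fields reversed": severity(0.8, False, True),   # annoying, recoverable
    "navigation dead end":     severity(0.6, True, False),   # blocks the task
}
for name, score in sorted(issues.items(), key=lambda kv: -kv[1]):
    print(f"{score:.1f}  {name}")
```

Note how the ranking comes out: the dead end affects fewer participants but blocks completion with no workaround, so it outranks the more frequent but recoverable field-order problem, which matches the intent of the three-axis assessment.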
Work from high-severity problems first, but also consider the tradeoff between "cost to fix" and "scope of impact." Fundamental structural design problems with broad impact become the highest priority for the next iteration, even if their cost to fix is high.
Connecting to the Next Iteration
Document analysis results under four headings: "findings," "hypotheses about root causes," "candidate remediation approaches," and "hypotheses to validate in the next test." Testing is not something that concludes in a single round — it is one part of the design → test → improve → test cycle. Defining what the next test will verify maintains the continuity of validation.
Test Design and Responsibility Sharing Between Clients and Contractors
User testing is neither something contractors should execute alone, nor something clients should judge alone. Clarifying the respective roles of client and contractor in each stage — test design, execution, and integration of results — raises the quality of the project as a whole.
The Client's Role
Clients are in the strongest position to contribute to participant recruitment. They are the ones with access to existing and potential customers. A division of labor that works well in practice is: client and contractor jointly define participant criteria, the client handles identifying and contacting candidates, and the contractor manages test logistics.
Prioritizing the problems revealed by test results also requires business judgment. Context like "this feature prioritizes long-term LTV over short-term conversion" is information only the client possesses. It is important to design a process that incorporates client judgment into the interpretation of test results.
The Contractor's Role
Contractors contribute expertise in test planning, moderation, and analysis. Task design and maintaining moderator neutrality in particular are technical domains where accumulated experience directly affects quality.
When reporting test results, findings (what happened) and interpretations (why it is believed to have happened) should be presented separately. A report that mixes interpretation with observation narrows the client's space to apply their own business judgment, making alignment more difficult.
Aligning on Test Budget and Resource Estimates
The cost of running a user test follows this structure: test design (0.5–1 day) + participant recruitment (preparation 1–2 weeks in advance) + execution (60–90 minutes per participant × 5 participants) + analysis and reporting (1–2 days).
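Translating those ranges into person-days makes the estimate concrete. A rough sketch using the figures from the text (the 8-hour workday and the upper-bound 90-minute session are assumptions; recruitment lead time runs in parallel rather than adding hands-on days):

```python
# Rough hands-on effort estimate for one round of testing,
# using the ranges stated in the text.
sessions = 5
session_hours = 1.5                      # 90 minutes per participant (upper bound)
design_days = (0.5, 1.0)                 # test design
analysis_days = (1.0, 2.0)               # analysis and reporting
execution_days = sessions * session_hours / 8   # assuming an 8-hour workday

low = design_days[0] + execution_days + analysis_days[0]
high = design_days[1] + execution_days + analysis_days[1]
print(f"hands-on effort: {low:.1f}-{high:.1f} person-days "
      f"(plus 1-2 weeks of recruitment lead time)")
```

The takeaway is that a full round is on the order of two to four person-days of hands-on work, small enough to absorb into a project estimate as a standing QA line item rather than a negotiable extra.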
Which party bears this cost should be clarified at the proposal and estimation stage, before the project begins. When tests are proposed as a separate budget item, there is a risk that the client will later decide to cut them to reduce costs. Including testing costs in the project estimate from the start as a QA phase is the rational approach for both parties.
User testing is not a procedure for "checking whether something is easy to use" — it is a design act of "verifying that what is being built is worth building." Teams that integrate testing early in the process, and consistently feed discoveries into the next iteration, rarely experience large-scale rework after launch. What makes the difference is not technical capability, but the timing of validation and the precision of its design.
References
Nielsen Norman Group, "Usability Testing 101" (2019)
Jakob Nielsen, "Why You Only Need to Test with 5 Users," Nielsen Norman Group (2000)
Nielsen Norman Group, "Writing Tasks for Qualitative and Quantitative Usability Studies" (2023)