Guide
Game playtesting and QA explained
Harbor Arcade's Coin Foundry build looked ready: crisp art, a tight upgrade loop, and a tutorial that the team had replayed a hundred times. Then six strangers sat down for a 45-minute playtest. Four never found the furnace toggle because it sat behind a prop with no highlight. Two thought the “smelt” button was decorative. Two soft-locked by selling their last ore before buying fuel. None of this appeared in the bug tracker because internal QA knew the layout by heart.
Playtesting is structured observation of real players using an unfinished build. It is not a substitute for automated tests and not a popularity poll; it is the fastest way to learn whether your core loop reads clearly to someone who has never seen your game design document (GDD).
This guide covers playtest types, how to run sessions without biasing players, bug triage by player impact, a Harbor Arcade Coin Foundry worked example, a playtest decision table, common pitfalls, and a production checklist. It complements game analytics and software testing fundamentals without replacing either.
Playtesting vs QA vs analytics
These three feedback loops overlap but answer different questions:
- Playtesting — qualitative: Can a newcomer understand goals, controls, and failure? Where do they hesitate, rage-quit, or misread UI?
- QA (quality assurance) — systematic verification: repro steps, regression passes, platform certification checklists, soak tests for crashes and memory leaks.
- Analytics — quantitative at scale: funnel drop-offs, retention cohorts, economy sinks. Needs instrumentation; playtests often run before telemetry is trustworthy.
A healthy pipeline uses playtests early and often (when designs are still cheap to change), QA gates on every milestone build, and analytics after wider release. Skipping playtests and relying only on team dogfooding is how a studio ships tutorials that only veterans can parse.
Types of playtests
Match the session format to what you need to learn this week:
Vertical slice / milestone playtest
A polished slice of one level, biome, or feature set. Goal: validate that the intended experience lands — pacing, difficulty, readability. Best when art and audio are representative enough that feedback is about design, not placeholder shock.
Usability and FTUE playtest
Focus on first-time user experience: control discovery, HUD comprehension, goal clarity. Often limited to 15–30 minutes. Pairs directly with tutorial design iteration.
Balance and economy playtest
Longer sessions with players who understand basics. Observe currency flow, upgrade pacing, and whether dominant strategies emerge. Export session logs when possible; cross-check with balancing spreadsheets.
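One way to read currency flow out of an exported session log is to total what each currency gained and lost. The sketch below assumes a hypothetical log format of `(currency, delta)` pairs; the function name `currency_flow` and the sample values are illustrative and not taken from any real build.

```python
from collections import Counter


def currency_flow(events):
    """Summarise earned, spent, and net amounts per currency.

    events: iterable of (currency, delta) pairs, where a positive delta is a
    source (mining, selling) and a negative delta is a sink (upgrades, fuel).
    This event shape is an assumption made for the example.
    """
    earned = Counter()
    spent = Counter()
    for currency, delta in events:
        if delta >= 0:
            earned[currency] += delta
        else:
            spent[currency] += -delta
    currencies = sorted(set(earned) | set(spent))
    return {
        c: {"earned": earned[c], "spent": spent[c], "net": earned[c] - spent[c]}
        for c in currencies
    }


# Made-up events for illustration only.
flow = currency_flow([("coins", 50), ("coins", -30), ("ore", 12), ("ore", -12)])
```

A net that keeps growing across a session suggests sinks are too weak; a net stuck near zero suggests upgrades are priced out of reach. Either reading should be checked against the balancing spreadsheet rather than trusted on its own.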
Soak and stability QA
Engineers leave builds running overnight or across a platform matrix (PC, Steam Deck, mobile tier). This finds memory leaks, desync, and edge-case crashes that short usability sessions miss.
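A soak run is easier to judge when memory samples are reduced to a single growth rate. The following is a minimal sketch, assuming the build already reports memory use as `(hours_elapsed, memory_mb)` samples; the function name and the sample numbers are hypothetical.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory use over time, in MB per hour.

    samples: list of (hours_elapsed, memory_mb) pairs collected during a soak run.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


# Made-up overnight samples: memory climbs steadily, which hints at a leak.
slope = growth_per_hour([(0, 500), (1, 510), (2, 520), (3, 530)])
```

A slope near zero after warm-up is what a healthy build tends to show. What counts as "too steep" depends on the platform's memory budget, so any threshold is a team decision rather than something this sketch can supply.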
Compliance and certification QA
Platform holders (Sony, Microsoft, Nintendo, Apple) publish technical requirement checklists. Treat these as non-negotiable gates before submission, separate from creative playtests.
Running a session: prep, observe, debrief
Before the session
- Write one primary research question (“Do players understand the crafting loop without text?”) and two secondary questions.
- Recruit target players, not only friends who know your genre. Include at least one person outside your core audience if you want accessibility signal.
- Prepare a build with logging: timestamped checkpoints, optional screen capture (with consent), and a known version hash on the title screen.
- Script a neutral intro: explain controls at a high level only if the game is not meant to teach them; otherwise say “figure it out as you would at home.”
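The logging item in the list above can be as small as a JSON-lines writer. This is a sketch under stated assumptions: the `SessionLog` class, its field names, and the checkpoint names are all hypothetical, and a real build would hook this into the engine's own event system.

```python
import io
import json
import time


class SessionLog:
    """Writes one JSON object per line so notes can be matched to video later."""

    def __init__(self, stream, build_hash, participant_id, clock=time.monotonic):
        self._stream = stream
        self._clock = clock
        self._start = clock()
        self._build_hash = build_hash
        self._participant_id = participant_id

    def checkpoint(self, name, **details):
        record = {
            "t": round(self._clock() - self._start, 2),  # seconds since session start
            "build": self._build_hash,
            "participant": self._participant_id,
            "checkpoint": name,
            **details,
        }
        self._stream.write(json.dumps(record) + "\n")
        return record


# Example with an in-memory stream; a real session would write to a file.
buffer = io.StringIO()
log = SessionLog(buffer, build_hash="abc1234", participant_id="P03")
log.checkpoint("tutorial_start")
log.checkpoint("furnace_opened", attempts=3)
lines = buffer.getvalue().splitlines()
```

Stamping every record with the build hash means a note can still be traced to the right build after several iterations, and using a participant code instead of a name keeps the log easier to share.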
During observation
Two common protocols:
- Think-aloud — player narrates confusion aloud. Rich qualitative data; can slow pacing and feel unnatural in action games.
- Silent play + post interview — observer takes notes on stuck points, deaths, and UI mis-clicks; debrief afterward with open questions (“What were you trying to do when you opened the map?”).
Observers stay quiet unless the player is hard-blocked for more than a few minutes. Helping too early destroys the signal. Record behavior (paused 40 seconds at inventory) separately from interpretation (probably did not see stack limit).
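Keeping behavior and interpretation in separate fields makes that discipline hard to skip. A minimal sketch, assuming a hypothetical `Observation` record; the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    timestamp_s: int          # seconds into the session
    behavior: str             # what the observer actually saw
    interpretation: str = ""  # the observer's guess, kept separate; may stay empty


def format_timestamp(seconds):
    """Render seconds as MM:SS so notes line up with the recording."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


note = Observation(
    timestamp_s=754,
    behavior="paused 40 seconds at inventory",
    interpretation="probably did not see stack limit",
)
print(format_timestamp(note.timestamp_s), "|", note.behavior)
```

Leaving `interpretation` empty during the session and filling it in at the debrief is one way to stop early guesses from hardening into findings.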
After the session
Within 24 hours, cluster notes into themes: clarity, difficulty, technical blockers, delight moments. Translate themes into tickets with severity, repro steps, and a link to timestamped video if available. Share a one-page summary with design, engineering, and production — not a 40-slide deck nobody reads.
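The clustering step can be done on paper, but a few lines of code keep it repeatable across sessions. The sketch below assumes each note has already been tagged with one of the four themes named above; `cluster_by_theme` and the sample notes are hypothetical.

```python
from collections import defaultdict

THEMES = ("clarity", "difficulty", "technical", "delight")


def cluster_by_theme(notes):
    """Group (theme, note) pairs and return themes ordered by how often they came up."""
    clusters = defaultdict(list)
    for theme, text in notes:
        if theme not in THEMES:
            raise ValueError(f"unknown theme: {theme}")
        clusters[theme].append(text)
    # Most frequent themes first; ties keep first-seen order because sorted() is stable.
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


# Made-up notes for illustration only.
notes = [
    ("clarity", "P2 walked past the furnace twice"),
    ("technical", "audio stutter on handheld"),
    ("clarity", "P5 did not notice the smelt button"),
    ("delight", "P1 laughed at the first ingot animation"),
]
summary = cluster_by_theme(notes)
```

The ordered result gives the one-page summary its structure: lead with the most frequent theme, and keep the delight notes so redesigns do not remove what already works.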
Bug triage: severity vs priority
Playtests surface both crashes and “feel” issues. Use a simple matrix so engineering does not fix cosmetic typos while players soft-lock:
| Severity | Player impact | Example |
|---|---|---|
| S1 — Blocker | Cannot progress or data loss | Save corrupts; required quest NPC falls through floor |
| S2 — Major | Core loop broken or severely degraded | Multiplayer desync every match; economy exploit duplicates currency |
| S3 — Minor | Workaround exists; occasional annoyance | Tooltip overlaps button on 1280×720; rare audio pop |
| S4 — Trivial | Cosmetic or polish | Typo in credits; z-fighting on distant prop |
Priority adds schedule context: an S3 on the main menu before launch may be P1; an S2 in a bonus mode post-launch may be P3. Playtest findings that affect level readability or pacing are often classified as design debt, not bugs — but they still need owners and sprint slots.
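The split between severity and priority can be made explicit in the work queue by sorting on priority first and severity second. A sketch under stated assumptions: the `Ticket` fields are hypothetical, and the P1 assigned to the soft-lock is an illustrative choice, not something the matrix dictates.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    title: str
    severity: int  # 1 = S1 blocker ... 4 = S4 trivial
    priority: int  # 1 = P1 (do first); higher numbers can wait


def triage_order(tickets):
    """Order work by schedule priority, breaking ties by player impact."""
    return sorted(tickets, key=lambda t: (t.priority, t.severity))


tickets = [
    Ticket("Bonus mode desync", severity=2, priority=3),
    Ticket("Main menu tooltip overlap", severity=3, priority=1),
    Ticket("Ore sale soft-lock", severity=1, priority=1),
]
ordered = triage_order(tickets)
```

In this ordering the S3 main menu issue sits ahead of the S2 bonus mode issue, matching the example in the paragraph above, while the S1 soft-lock still leads among the P1 tickets.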
Worked example: Harbor Arcade Coin Foundry playtest
Harbor's team scheduled a 45-minute usability playtest on a vertical slice: mine ore, smelt ingots, buy furnace upgrades, reach tier-2 tools. Six participants (three genre fans, three casual mobile players), silent-play protocol, screen capture with consent.
Findings:
- Four of six missed the furnace interaction prompt (S2 clarity) — fixed by moving the prompt to the HUD and adding a one-time camera pan.
- Two sold all ore before purchasing fuel, soft-locking progress (S1) — fixed with a minimum ore reserve and a shop warning modal.
- Genre fans finished in 28 minutes; casuals averaged 41 with two asking if the session was “over” at tier-1 (pacing too slow for FTUE) — design ticket to front-load one dramatic upgrade.
- Zero crashes; one audio stutter on Steam Deck (S3, logged for soak QA).
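The soft-lock fix in the findings above (a minimum ore reserve plus a shop warning) might look like the guard below. This is only a sketch: the guide does not describe Harbor's actual rule, so the condition "no fuel owned and the sale would drop ore below the reserve" is an assumption, as are the function name and parameters.

```python
def check_ore_sale(ore_owned, ore_to_sell, fuel_owned, min_ore_reserve=1):
    """Return (allowed, warning) for a proposed ore sale.

    Assumed rule: while the player owns no fuel, a sale may not leave them
    with less than min_ore_reserve ore. The warning text is what a shop
    modal could display when the sale is refused.
    """
    if ore_to_sell < 0 or ore_to_sell > ore_owned:
        raise ValueError("invalid sale amount")
    remaining = ore_owned - ore_to_sell
    if fuel_owned == 0 and remaining < min_ore_reserve:
        return False, f"Keep at least {min_ore_reserve} ore until you have bought fuel."
    return True, None


blocked = check_ore_sale(ore_owned=3, ore_to_sell=3, fuel_owned=0)
allowed = check_ore_sale(ore_owned=3, ore_to_sell=2, fuel_owned=0)
```

Whatever the real rule is, the fixed build needs a re-test with new players: a guard that prevents the soft-lock can still confuse people if the warning reads as an error.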
The debrief produced four tickets, with fixes merged before the next public demo. Total calendar cost: one evening, pizza, and build prep. Compare that to a week of forum posts after launch asking “how do I smelt?”
Playtest type decision table
| When you need… | Run this | Typical length | Who attends |
|---|---|---|---|
| Validate core loop fun | Vertical slice playtest | 30–60 min | Designer observer + note-taker |
| Fix tutorial and HUD confusion | FTUE usability session | 15–30 min | UX + design; no engineers coaching |
| Tune economy and difficulty | Balance playtest + telemetry export | 2+ hours or multiple sessions | Systems designer + analyst |
| Ship on console / mobile store | Certification QA checklist | Days–weeks per platform | QA lead + build engineer |
| Find crashes at scale | Soak test + automated regression | Overnight / CI nightly | Engineering + QA automation |
| Measure live changes | A/B test in production | Weeks (statistical power) | Live ops + data; see A/B testing guide |
Common pitfalls
- Testing only insiders — teammates unconsciously skip broken tutorials because they know the shortcut.
- Leading the player — “Try clicking the furnace” destroys the usability signal you came for.
- No written goals — sessions become vague hangouts; nothing ships to the backlog.
- Confusing opinion with behavior — “I hate green UI” is preference; “I could not find the health bar” is actionable.
- Ignoring positive signal — note what players do unprompted (experimenting with combos) and protect those moments in redesign.
- One playtest per milestone — a single six-person session is a directional hint, not proof. Iterate and re-test.
- Skipping accessibility observers — color-only cues, tiny text, and no remapping show up only when diverse players participate; see game accessibility.
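The point that a single six-person session is only a directional hint can be made concrete with a confidence interval. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95% confidence); treating "4 of 6 missed the prompt" as a simple proportion from independent players is a simplifying assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(4, 6)
print(f"{low:.0%} to {high:.0%}")
```

For 4 of 6 the interval runs from roughly 30% to 90%, so the session shows the problem is real but not how widespread it is. That is enough to justify a fix and a re-test, and not enough to quote as a rate.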
Production checklist
- Define one primary research question and success criteria before recruiting.
- Use a build with version ID, logging, and known-scope limitations documented.
- Recruit players outside the dev team; mix genre familiarity levels.
- Obtain recording consent; store data per your privacy policy (GDPR where it applies, plus COPPA if participants are under 13).
- Pick think-aloud or silent-play protocol and stick to it for the whole session.
- Take timestamped notes on behavior, not just post-game opinions.
- Debrief within 24 hours; cluster themes and file tickets with severity.
- Separate design clarity issues from engineering bugs in the tracker.
- Re-test after fixes; close the loop before calling the milestone done.
- Pair qualitative playtests with quantitative retention metrics once the build is public.
Key takeaways
- Playtesting finds confusion QA cannot — experts blind themselves to onboarding gaps.
- Behavior beats opinions — watch where players stall, not only what they say they liked.
- Match format to question — FTUE, balance, and certification need different session designs.
- Triage by player impact — soft-locks outrank shader z-fighting every time.
- Iterate in small loops — one playtest is a start; fixed builds plus re-tests are how quality compounds.
Related reading
- Game tutorial and onboarding explained — FTUE patterns that playtests most often stress-test
- Game analytics and player retention explained — quantitative funnel data after wider release
- Game balancing explained — spreadsheets and telemetry that complement balance playtests
- Software testing fundamentals explained — regression, unit, and integration testing alongside play sessions