
Game playtesting and QA explained

Harbor Arcade's Coin Foundry build looked ready: crisp art, a tight upgrade loop, and a tutorial the team had replayed a hundred times. Then six strangers sat down for a 45-minute playtest. Four never found the furnace toggle because it sat behind a prop with no highlight. Two thought the “smelt” button was decorative. Two soft-locked by selling their last ore before buying fuel. None of this appeared in the bug tracker because internal QA knew the layout by heart.

Playtesting is structured observation of real players using an unfinished build. It is not a substitute for automated tests and not a popularity poll; it is the fastest way to learn whether your core loop reads clearly to someone who has never seen your game design document (GDD). This guide covers playtest types, how to run sessions without biasing players, bug triage by player impact, a Harbor Arcade Coin Foundry worked example, a playtest decision table, common pitfalls, and a production checklist. It complements game analytics and software testing fundamentals without replacing either.

Playtesting vs QA vs analytics

These three feedback loops overlap but answer different questions:

Playtesting — qualitative: Can a newcomer understand goals, controls, and failure? Where do they hesitate, rage-quit, or misread UI?
QA (quality assurance) — systematic verification: repro steps, regression passes, platform certification checklists, soak tests for crashes and memory leaks.
Analytics — quantitative at scale: funnel drop-offs, retention cohorts, economy sinks. Needs instrumentation; playtests often run before telemetry is trustworthy.

A healthy pipeline uses playtests early and often (when designs are still cheap to change), QA gates on every milestone build, and analytics after wider release. Skipping playtests and relying only on team dogfooding is how teams ship tutorials that only veterans can parse.

Types of playtests

Match the session format to what you need to learn this week:

Vertical slice / milestone playtest

A polished slice of one level, biome, or feature set. Goal: validate that the intended experience lands — pacing, difficulty, readability. Best when art and audio are representative enough that feedback is about design, not placeholder shock.

Usability and FTUE playtest

Focus on first-time user experience: control discovery, HUD comprehension, goal clarity. Often limited to 15–30 minutes. Pairs directly with tutorial design iteration.

Balance and economy playtest

Longer sessions with players who understand basics. Observe currency flow, upgrade pacing, and whether dominant strategies emerge. Export session logs when possible; cross-check with balancing spreadsheets.
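As a rough illustration of the cross-check, the sketch below totals earned and spent currency per player from exported session events. The event shape (player, minute, currency delta) and the sample values are hypothetical; a real export will look different.

```python
from collections import defaultdict

# Hypothetical exported events: (player_id, minute, currency_delta).
# Positive deltas are income (sources); negative deltas are spending (sinks).
events = [
    ("p1", 2, +50), ("p1", 5, -30), ("p1", 9, +80), ("p1", 12, -100),
    ("p2", 3, +50), ("p2", 4, -50), ("p2", 15, +20),
]

def currency_flow(events):
    """Return {player: {"earned": int, "spent": int, "balance": int}}."""
    totals = defaultdict(lambda: {"earned": 0, "spent": 0, "balance": 0})
    for player, _minute, delta in events:
        row = totals[player]
        if delta >= 0:
            row["earned"] += delta
        else:
            row["spent"] += -delta
        row["balance"] += delta
    return dict(totals)

flow = currency_flow(events)
for player, row in sorted(flow.items()):
    print(player, row)
```

A player whose balance sits near zero for most of a session may be starved by the economy; one whose balance only grows may have nothing worth buying. Either pattern is a prompt to revisit the balancing spreadsheet, not a verdict on its own.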

Soak and stability QA

Engineers leave builds running overnight or across a platform matrix (PC, Steam Deck, mobile tier). This finds memory leaks, desync, and edge-case crashes that short usability sessions miss.
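One simple way to read an overnight memory log is to fit a straight line through the samples and look at the slope. The sketch below does this with a plain least-squares fit; the sample readings and the 5 MB/hour alert threshold are made-up assumptions, and real leak hunting usually needs a profiler as well.

```python
# Hypothetical soak samples: (hours since start, resident memory in MB).
samples = [(0, 900), (2, 940), (4, 985), (6, 1020), (8, 1060)]

# Assumed alert threshold; pick one that suits your platform budget.
LEAK_THRESHOLD_MB_PER_HOUR = 5.0

def memory_slope(samples):
    """Least-squares slope of memory (MB) against time (hours)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den

slope = memory_slope(samples)
suspect = slope > LEAK_THRESHOLD_MB_PER_HOUR
print(f"growth: {slope:.1f} MB/hour, suspected leak: {suspect}")
```

A steady upward slope over many hours is the signal worth a ticket; a one-time jump after a level load usually is not.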

Compliance and certification QA

Platform holders (Sony, Microsoft, Nintendo, Apple) publish technical requirement checklists. Treat these as non-negotiable gates before submission, separate from creative playtests.

Running a session: prep, observe, debrief

Before the session

Write one primary research question (“Do players understand the crafting loop without text?”) and two secondary questions.
Recruit target players, not only friends who know your genre. Include at least one person outside your core audience if you want accessibility signal.
Prepare a build with logging: timestamped checkpoints, screen capture (with participant consent), and a known version hash on the title screen.
Script a neutral intro: explain controls at a high level only if the game is not meant to teach them; otherwise say “figure it out as you would at home.”
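The logging part of that prep can be very small. Below is a minimal sketch of a session log that stamps each checkpoint with elapsed time and the build's version hash. The class name, field names, and the placeholder hash are all hypothetical, and a real game would hook this into its own engine events.

```python
import json
import time

class SessionLog:
    """Collects timestamped checkpoints for one playtest session."""

    def __init__(self, build_hash, participant_id, clock=time.monotonic):
        self.build_hash = build_hash          # same ID shown on the title screen
        self.participant_id = participant_id  # anonymized label, not a real name
        self._clock = clock
        self._start = clock()
        self.entries = []

    def checkpoint(self, name, **details):
        """Record a named checkpoint with seconds elapsed since session start."""
        entry = {
            "t": round(self._clock() - self._start, 1),
            "build": self.build_hash,
            "participant": self.participant_id,
            "checkpoint": name,
            **details,
        }
        self.entries.append(entry)
        return entry

    def to_json_lines(self):
        """Serialize entries as one JSON object per line for later analysis."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.entries)

# Placeholder values for illustration only.
log = SessionLog(build_hash="a1b2c3d", participant_id="P03")
log.checkpoint("tutorial_start")
log.checkpoint("furnace_opened", attempts=3)
print(log.to_json_lines())
```

Stamping every entry with the build hash means notes from different sessions can still be matched to the right build after fixes land.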

During observation

Two common protocols:

Think-aloud — player narrates confusion aloud. Rich qualitative data; can slow pacing and feel unnatural in action games.
Silent play + post interview — observer takes notes on stuck points, deaths, and UI mis-clicks; debrief afterward with open questions (“What were you trying to do when you opened the map?”).

Observers stay quiet unless the player is hard-blocked for more than a few minutes. Helping too early destroys the signal. Record behavior (paused 40 seconds at inventory) separately from interpretation (probably did not see stack limit).
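One way to enforce that separation is to give behavior and interpretation their own fields in the note template. The sketch below is a hypothetical note structure, plus a helper for the intervention rule; the 180-second cutoff is an assumption standing in for "a few minutes", and the timestamp is invented.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff for "hard-blocked"; the guide only says "a few minutes".
HARD_BLOCK_SECONDS = 180

@dataclass
class Observation:
    t_seconds: int                        # time since session start
    behavior: str                         # what the observer actually saw
    interpretation: Optional[str] = None  # observer's guess, kept separate

def should_intervene(stuck_seconds: int, limit: int = HARD_BLOCK_SECONDS) -> bool:
    """True only once the player has been stuck longer than the cutoff."""
    return stuck_seconds > limit

note = Observation(
    t_seconds=754,  # invented timestamp
    behavior="Paused 40 seconds at inventory",
    interpretation="Probably did not see stack limit",
)
print(note.behavior, "|", note.interpretation)
print(should_intervene(40), should_intervene(240))
```

Leaving the interpretation field empty during the session and filling it in at the debrief is a reasonable habit: the behavior stays trustworthy even if the guess turns out wrong.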

After the session

Within 24 hours, cluster notes into themes: clarity, difficulty, technical blockers, delight moments. Translate themes into tickets with severity, repro steps, and a link to timestamped video if available. Share a one-page summary with design, engineering, and production — not a 40-slide deck nobody reads.
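Clustering can start as a simple grouping pass before anyone opens the tracker. The sketch below groups tagged notes by theme and ranks themes by how many distinct participants hit them. The notes and participant labels are invented for illustration, and tagging each note with a theme is assumed to have been done by hand.

```python
from collections import defaultdict

# Hypothetical tagged notes: (participant, theme, note text).
notes = [
    ("P1", "clarity", "Walked past furnace twice"),
    ("P2", "clarity", "Clicked smelt label, not button"),
    ("P2", "technical", "Audio stutter on load"),
    ("P3", "clarity", "Asked where the furnace was"),
    ("P3", "delight", "Laughed at ingot animation"),
]

def cluster_by_theme(notes):
    """Group notes by theme and count distinct participants per theme."""
    clusters = defaultdict(lambda: {"participants": set(), "notes": []})
    for participant, theme, text in notes:
        clusters[theme]["participants"].add(participant)
        clusters[theme]["notes"].append(text)
    # Most widely shared themes first; ties broken alphabetically.
    return sorted(
        ((theme, len(c["participants"]), c["notes"]) for theme, c in clusters.items()),
        key=lambda row: (-row[1], row[0]),
    )

summary = cluster_by_theme(notes)
for theme, count, texts in summary:
    print(f"{theme}: {count} participant(s), {len(texts)} note(s)")
```

Counting participants rather than raw notes keeps one talkative player from dominating the summary.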

Bug triage: severity vs priority

Playtests surface both crashes and “feel” issues. Use a simple matrix so engineering does not fix cosmetic typos while players soft-lock:

Severity	Player impact	Example
S1 — Blocker	Cannot progress or data loss	Save corrupts; required quest NPC falls through floor
S2 — Major	Core loop broken or severely degraded	Multiplayer desync every match; economy exploit duplicates currency
S3 — Minor	Workaround exists; occasional annoyance	Tooltip overlaps button on 1280×720; rare audio pop
S4 — Trivial	Cosmetic or polish	Typo in credits; z-fighting on distant prop

Priority adds schedule context: an S3 on the main menu before launch may be P1; an S2 in a bonus mode post-launch may be P3. Playtest findings that affect level readability or pacing are often classified as design debt, not bugs — but they still need owners and sprint slots.
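To make the severity and priority split concrete, here is a small sketch that derives a priority from severity plus two pieces of schedule context. The adjustment rule is an illustrative heuristic invented for this example, not a standard; it is tuned only so that the two cases described above come out as P1 and P3.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    title: str
    severity: int       # 1 = blocker ... 4 = trivial, as in the table above
    all_players: bool   # True if every player hits this area (e.g. main menu)
    pre_launch: bool    # True if the build has not shipped yet

def priority(ticket: Ticket) -> int:
    """Illustrative heuristic: start at severity, then adjust for schedule context."""
    p = ticket.severity
    if ticket.all_players and ticket.pre_launch:
        p -= 2  # highly visible and still fixable before ship
    elif not ticket.all_players and not ticket.pre_launch:
        p += 1  # optional content in a live game can wait
    return max(1, min(4, p))

tickets = [
    Ticket("Tooltip overlaps button on main menu", 3, all_players=True, pre_launch=True),
    Ticket("Desync in bonus mode", 2, all_players=False, pre_launch=False),
    Ticket("Typo in credits", 4, all_players=False, pre_launch=True),
]
for t in sorted(tickets, key=lambda t: (priority(t), t.severity)):
    print(f"P{priority(t)} S{t.severity} {t.title}")
```

Keeping severity as stored data and priority as a derived value means a schedule change re-ranks the backlog without anyone editing the original impact assessment.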

Worked example: Harbor Arcade Coin Foundry playtest

Harbor's team scheduled a 45-minute usability playtest on a vertical slice: mine ore, smelt ingots, buy furnace upgrades, reach tier-2 tools. Six participants (three genre fans, three casual mobile players), silent-play protocol, screen capture with consent.

Findings:

Four of six missed the furnace interaction prompt (S2 clarity) — fixed by moving the prompt to the HUD and adding a one-time camera pan.
Two sold all ore before purchasing fuel, soft-locking progress (S1) — fixed with a minimum ore reserve and a shop warning modal.
Genre fans finished in 28 minutes; casuals averaged 41 with two asking if the session was “over” at tier-1 (pacing too slow for FTUE) — design ticket to front-load one dramatic upgrade.
Zero crashes; one audio stutter on Steam Deck (S3, logged for soak QA).

The debrief produced four tickets, all fixed and merged before the next public demo. Total calendar cost: one evening, pizza, and build prep. Compare that to a week of forum posts after launch asking “how do I smelt?”

Playtest type decision table

When you need…	Run this	Typical length	Who attends
Validate core loop fun	Vertical slice playtest	30–60 min	Designer observer + note-taker
Fix tutorial and HUD confusion	FTUE usability session	15–30 min	UX + design; no engineers coaching
Tune economy and difficulty	Balance playtest + telemetry export	2+ hours or multiple sessions	Systems designer + analyst
Ship on console / mobile store	Certification QA checklist	Days–weeks per platform	QA lead + build engineer
Find crashes at scale	Soak test + automated regression	Overnight / CI nightly	Engineering + QA automation
Measure live changes	A/B test in production	Weeks (statistical power)	Live ops + data; see A/B testing guide

Common pitfalls

Testing only insiders — teammates unconsciously skip broken tutorials because they know the shortcut.
Leading the player — “Try clicking the furnace” destroys the usability signal you came for.
No written goals — sessions become vague hangouts; nothing ships to the backlog.
Confusing opinion with behavior — “I hate green UI” is preference; “I could not find the health bar” is actionable.
Ignoring positive signal — note what players do unprompted (experimenting with combos) and protect those moments in redesign.
One playtest per milestone — a single six-person session is a directional hint, not proof. Iterate and re-test.
Skipping accessibility observers — color-only cues, tiny text, and no remapping show up only when diverse players participate; see game accessibility.
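To see why a single six-person session is a directional hint rather than proof, it helps to put an interval around a result such as "four of six missed the prompt". The sketch below uses the Wilson score interval, a standard method for small-sample proportions; choosing it here, and the 95% level, are assumptions for illustration.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion; z = 1.96 is about 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(4, 6)
print(f"4 of 6 missed the prompt: plausible range {low:.0%} to {high:.0%}")
```

With six players the plausible range runs from roughly 30% to 90%, which is enough to justify a fix but far too wide to quote as a measured rate. A re-test after the fix narrows it.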

Production checklist

Define one primary research question and success criteria before recruiting.
Use a build with version ID, logging, and known-scope limitations documented.
Recruit players outside the dev team; mix genre familiarity levels.
Obtain recording consent; store data per privacy policy (GDPR/COPPA if minors).
Pick think-aloud or silent-play protocol and stick to it for the whole session.
Take timestamped notes on behavior, not just post-game opinions.
Debrief within 24 hours; cluster themes and file tickets with severity.
Separate design clarity issues from engineering bugs in the tracker.
Re-test after fixes; close the loop before calling the milestone done.
Pair qualitative playtests with quantitative retention metrics once the build is public.

Key takeaways

Playtesting finds confusion QA cannot — experts blind themselves to onboarding gaps.
Behavior beats opinions — watch where players stall, not only what they say they liked.
Match format to question — FTUE, balance, and certification need different session designs.
Triage by player impact — soft-locks outrank shader z-fighting every time.
Iterate in small loops — one playtest is a start; fixed builds plus re-tests are how quality compounds.