Property-based testing explained

Harbor Commerce's fee calculator passed every hand-written test case, including zero-dollar carts, single-line orders, and a 10% promotional discount. Production still saw a refund bug: when three stackable percentage discounts applied in a different order than QA imagined, the total fee went negative by four cents. The engineer who fixed it did not add a fourth example; they wrote a property: “for any valid cart, computed fees are non-negative and sum to at most the subtotal.” A generator produced ten thousand random carts in two seconds and found the ordering edge case immediately.

That is property-based testing (PBT): asserting invariants over generated inputs instead of locking behavior to a fixed table of examples. Popularized by Haskell's QuickCheck and available today in Python (Hypothesis), JavaScript (fast-check), Java (jqwik), and Rust (proptest), PBT complements unit and integration tests where combinatorial edge cases hide.

This guide explains properties vs examples, generators and shrinking, writing good invariants, framework patterns, a Harbor Commerce fee calculator worked example, an approach decision table, common pitfalls, and a production checklist.

Examples vs properties

Example-based tests (the default in pytest and most frameworks) say: “given input X, expect output Y.” They are precise, readable, and fast to write. Their weakness is selection bias: you test the cases you thought of, which skew toward happy paths and round numbers.

Property-based tests say: “for all inputs matching constraint C, invariant I must hold.” The framework generates many inputs satisfying C, runs I, and on failure shrinks the input to a minimal counterexample you can paste into a regression test.
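
To make the contrast concrete, here is a minimal sketch that tests the same function both ways. The `dedupe` helper is a hypothetical stand-in chosen for illustration; it does not come from the Harbor Commerce codebase.

```python
from hypothesis import given, strategies as st


def dedupe(items):
    """Remove duplicates while keeping first-seen order (illustrative helper)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def test_dedupe_example():
    # Example-based: one hand-picked input, one expected output.
    assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]


@given(st.lists(st.integers()))
def test_dedupe_properties(items):
    # Property-based: invariants that must hold for every generated list.
    result = dedupe(items)
    assert len(result) == len(set(result))  # no duplicates remain
    assert set(result) == set(items)        # nothing lost, nothing invented
    assert dedupe(result) == result         # a second pass changes nothing
```

The example test pins one documented case; the property test says nothing about any single output but constrains every output.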

Common property shapes

  • Round-trip: decode(encode(x)) == x for serializers, compressors, and codecs.
  • Idempotence: normalize(normalize(s)) == normalize(s) for slugifiers, Unicode normalizers, and fee rounding.
  • Commutativity / associativity: order of independent operations should not change results (when your domain claims it should not).
  • Oracle comparison: optimized path matches naive reference implementation on small inputs.
  • Metamorphic relations: if you sort twice, order is unchanged; if you add a constant to all inputs, ranking is unchanged.
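
Four of these shapes can be sketched with Hypothesis against small stand-in functions. Everything named here (`collapse_whitespace`, `sum_of_squares_fast`, `ranking`) is a hypothetical helper invented for the illustration; only the `json` module and the Hypothesis calls are real APIs.

```python
import json

from hypothesis import given, strategies as st

# Values that JSON can represent exactly; floats are left out to avoid NaN.
json_scalars = (
    st.none()
    | st.booleans()
    | st.integers(min_value=-2**63, max_value=2**63 - 1)
    | st.text()
)


@given(st.dictionaries(st.text(), json_scalars))
def test_round_trip(payload):
    # Round-trip: decoding what we encoded returns the original value.
    assert json.loads(json.dumps(payload)) == payload


def collapse_whitespace(s):
    """Replace every run of whitespace with one space and trim the ends."""
    return " ".join(s.split())


@given(st.text())
def test_idempotence(s):
    once = collapse_whitespace(s)
    assert collapse_whitespace(once) == once


def sum_of_squares_fast(n):
    return n * (n + 1) * (2 * n + 1) // 6  # closed form


def sum_of_squares_naive(n):
    return sum(i * i for i in range(n + 1))  # slow but obviously correct


@given(st.integers(min_value=0, max_value=500))
def test_oracle(n):
    assert sum_of_squares_fast(n) == sum_of_squares_naive(n)


def ranking(scores):
    """Indices of scores, lowest first; ties keep their input order."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


@given(st.lists(st.integers()), st.integers())
def test_metamorphic_shift(scores, shift):
    # Adding the same constant to every score must not change the ranking.
    assert ranking([s + shift for s in scores]) == ranking(scores)
```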

PBT does not replace examples for documented business rules (“California charges 7.25% state tax on shipping”). It excels at structural correctness and “this should never happen” guards on pure functions and deterministic services.

Generators, strategies, and shrinking

A generator (Hypothesis calls them strategies) produces random values obeying constraints. Built-ins cover integers, text, datetimes, and collections; you compose them for domain types.

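# Python (Hypothesis)
from hypothesis import given, strategies as st

# compute_platform_fee is the function under test; import it from your own pricing module.

# Subtotals in cents, from $0.00 up to $10,000.00.
@given(st.integers(min_value=0, max_value=1_000_000))
def test_fee_non_negative_on_cents_subtotal(subtotal_cents):
    fee = compute_platform_fee(subtotal_cents)
    assert fee >= 0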

Shrinking is PBT's debugging superpower. When an assertion fails on a 47-field cart with exotic Unicode SKUs, the framework simplifies: drop fields, shorten strings, lower magnitudes — until it finds a small counterexample like subtotal=3, discounts=[50, 50] that still fails. Without shrinking, CI logs are unreadable noise.
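
The idea behind shrinking can be illustrated with a toy greedy shrinker in plain Python. This is a deliberately simplified sketch, not the algorithm Hypothesis or any other framework actually uses; the `shrink` function and the failing predicate are hypothetical.

```python
def shrink(value, still_fails):
    """Greedily simplify a list of non-negative ints while the failure persists."""
    improved = True
    while improved:
        improved = False
        # Step 1: try dropping one element at a time.
        for i in range(len(value)):
            candidate = value[:i] + value[i + 1:]
            if still_fails(candidate):
                value, improved = candidate, True
                break
        if improved:
            continue
        # Step 2: try lowering one element (to zero, to half, or by one).
        for i, v in enumerate(value):
            for smaller in (0, v // 2, v - 1):
                if 0 <= smaller < v:
                    candidate = value[:i] + [smaller] + value[i + 1:]
                    if still_fails(candidate):
                        value, improved = candidate, True
                        break
            if improved:
                break
    return value


def total_discount_over_100(percents):
    """Stand-in for a failing property: stacked discounts exceed 100%."""
    return sum(percents) > 100


noisy_failure = [37, 80, 12, 95, 60]
minimal = shrink(noisy_failure, total_discount_over_100)
print(minimal)  # a two-element list that still fails the property
```

The result is a local minimum: removing either element or lowering either value by one makes the failure disappear, which is what makes a shrunk counterexample easy to reason about.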

Custom strategies for domain types

Model your domain explicitly instead of hoping random dicts are valid:

line_items = st.lists(
    st.builds(
        LineItem,
        sku=st.text(min_size=1, max_size=12),
        qty=st.integers(min_value=1, max_value=99),
        unit_price_cents=st.integers(min_value=0, max_value=500_00),
    ),
    min_size=1,
    max_size=8,
)

Use assume() to discard invalid draws (e.g. carts over credit limit) but prefer tight strategies — heavy filtering wastes iterations and slows CI. For regex parsers, generate strings from the grammar or use library helpers rather than arbitrary text.
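
A self-contained version of the strategy above might look like the following sketch. The `LineItem` dataclass and the SKU pattern (three letters, a hyphen, four digits) are assumptions made for the example; the source does not specify Harbor Commerce's real SKU format.

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st


@dataclass(frozen=True)
class LineItem:
    sku: str
    qty: int
    unit_price_cents: int


# Assumed SKU grammar, generated directly instead of filtering arbitrary text.
skus = st.from_regex(r"[A-Z]{3}-[0-9]{4}", fullmatch=True)

line_items = st.lists(
    st.builds(
        LineItem,
        sku=skus,
        qty=st.integers(min_value=1, max_value=99),
        unit_price_cents=st.integers(min_value=0, max_value=50_000),
    ),
    min_size=1,
    max_size=8,
)


@given(line_items)
def test_generated_carts_are_valid(items):
    # Every draw is valid by construction, so no assume() calls are needed.
    assert 1 <= len(items) <= 8
    for item in items:
        assert len(item.sku) == 8
        assert 1 <= item.qty <= 99
        assert 0 <= item.unit_price_cents <= 50_000
```

Because the regex strategy only emits matching strings, no iterations are spent on draws that a filter would throw away.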

Framework landscape

Language        | Library    | Notes
----------------|------------|------------------------------------------------------------------
Python          | Hypothesis | First-class pytest plugin; @given on test functions; rich strategies
JavaScript / TS | fast-check | Works with Vitest/Jest; model-based commands for stateful APIs
Haskell         | QuickCheck | Original; strong type-driven generators
Java            | jqwik      | JUnit 5 integration; @ForAll parameters
Rust            | proptest   | Macro-driven; persists failing seeds in proptest-regressions

Configure example count and deadlines per environment: 100 iterations locally, 1,000 in nightly CI, fewer on slow integration tests. Hypothesis stores a .hypothesis/examples database so known failures replay deterministically. Pair PBT with CI gates that fail on regression seeds.
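
In Hypothesis, per-environment budgets are usually expressed as settings profiles. The sketch below would typically live in a conftest.py; the profile names and the HYPOTHESIS_PROFILE environment variable are hypothetical choices, while `register_profile` and `load_profile` are real Hypothesis calls.

```python
import os

from hypothesis import settings

# Fast feedback while developing.
settings.register_profile("dev", max_examples=100)
# Deeper search in nightly CI; deadline=None turns off the per-example time limit.
settings.register_profile("nightly", max_examples=1_000, deadline=None)
# Slow integration tests get a small budget.
settings.register_profile("integration", max_examples=20, deadline=None)

# HYPOTHESIS_PROFILE is an assumed variable name; default to the dev profile.
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```

With pytest, the Hypothesis plugin can also select a registered profile from the command line via `--hypothesis-profile=nightly`.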

Worked example: Harbor Commerce fee calculator

Harbor Commerce charges a platform fee as basis points on post-discount subtotal, with a minimum one-cent fee and half-up rounding to cents. Stackable percentage discounts apply sequentially. The bug: three 50% discounts did not floor at zero subtotal before fee calculation.

The team added properties alongside existing pytest examples:

  1. Non-negativity: fee and final total never negative.
  2. Bounded fee: fee ≤ post-discount subtotal.
  3. Discount monotonicity: adding a non-negative discount never increases total paid.
  4. Reference oracle: for carts with ≤ 3 lines and subtotal < $200, match a slow decimal reference implementation.

from hypothesis import given

# carts() and discount_lists() are the team's domain strategies;
# price_cart() is the function under test.
@given(cart=carts(), discount_stack=discount_lists(max_size=4))
def test_fee_never_exceeds_subtotal(cart, discount_stack):
    subtotal, fee, total = price_cart(cart, discount_stack)
    # Assert instead of assume(): filtering out negative subtotals would hide the bug.
    assert subtotal >= 0
    assert fee >= 0
    assert total >= 0
    assert fee <= subtotal

Hypothesis shrank a failing case to two 50% discounts on a one-cent line item. Engineers froze it as a named example test and fixed rounding order. Properties stayed for regression; the example documents the incident for code review.
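
The following is a simplified, runnable sketch of the same idea, not Harbor Commerce's actual code. It assumes a 2.50% fee rate, collapses the cart to a single subtotal in cents, charges no fee on a zero subtotal, and caps the fee at the subtotal so the bounded-fee property can hold alongside the one-cent minimum. All function names are hypothetical.

```python
from hypothesis import given, strategies as st

FEE_BPS = 250  # assumed rate: 2.50% of the post-discount subtotal


def apply_discounts(subtotal_cents, percents):
    """Apply each percentage discount to the running subtotal, in order."""
    running = subtotal_cents
    for pct in percents:
        discount = (running * pct + 50) // 100  # half-up, integer cents
        running = max(0, running - discount)    # floor before the next step
    return running


def platform_fee(subtotal_cents):
    """Basis-point fee, half-up, at least one cent, never above the subtotal."""
    if subtotal_cents == 0:
        return 0  # assumption: nothing to charge on an empty subtotal
    fee = (subtotal_cents * FEE_BPS + 5_000) // 10_000
    return min(subtotal_cents, max(1, fee))


def price_cart(subtotal_cents, percents):
    discounted = apply_discounts(subtotal_cents, percents)
    fee = platform_fee(discounted)
    return discounted, fee, discounted + fee


subtotals = st.integers(min_value=0, max_value=1_000_000)
discount_stacks = st.lists(st.integers(min_value=0, max_value=100), max_size=4)


@given(subtotals, discount_stacks)
def test_fee_is_non_negative_and_bounded(subtotal, stack):
    discounted, fee, total = price_cart(subtotal, stack)
    assert discounted >= 0 and fee >= 0 and total >= 0
    assert fee <= discounted


@given(subtotals, discount_stacks, st.integers(min_value=0, max_value=100))
def test_extra_discount_never_raises_total(subtotal, stack, extra):
    _, _, before = price_cart(subtotal, stack)
    _, _, after = price_cart(subtotal, stack + [extra])
    assert after <= before


def test_regression_two_half_discounts_on_one_cent():
    # The shrunk counterexample, frozen as a plain example test.
    assert price_cart(1, [50, 50]) == (0, 0, 0)
```

Working in integer cents keeps the rounding rule explicit and avoids floating-point drift; the frozen regression test documents the incident while the two properties keep searching for new cases.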

When to use PBT vs other techniques

Situation                                   | Prefer                                         | Why
--------------------------------------------|------------------------------------------------|---------------------------------------------
Pure function with clear invariants         | Property-based                                 | High bug yield per line of test code
Documented regulatory rule with fixed rates | Example-based                                  | Auditors want explicit cases, not randomness
Stateful service / database                 | Example integration + optional model-based PBT | Setup cost is high; start with examples
Security parser / crypto                    | PBT + dedicated fuzz harness                   | PBT covers the input space; chaos and fault injection cover runtime failures
UI browser flows                            | Playwright E2E examples                        | PBT rarely pays off on DOM workflows
Flaky external APIs                         | Contract tests with recorded fixtures          | Random live calls are nondeterministic

Common pitfalls

  • Properties that are always true — tautologies like assert isinstance(result, dict) waste CI time; assert meaningful relations.
  • Over-broad generators — mostly invalid inputs filtered by assume(); tighten strategies instead.
  • Ignoring flaky seeds — disable shrinking or example DB and you lose reproducibility; commit regression seeds.
  • Testing the mock — properties on code that only calls a stubbed SDK verify nothing about production.
  • Replacing all examples — keep explicit cases for regressions, specs, and onboarding; add properties for coverage gaps.
  • Unbounded runtime — cap examples and wall-clock per test; PBT on hot paths in every PR can slow merges.
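
Three of these pitfalls are easy to see side by side. The sketch below uses a hypothetical `halve` function; it contrasts a filtered strategy with one that is valid by construction, a tautological assertion with a meaningful one, and shows an explicit per-test budget through the real `settings` decorator.

```python
from hypothesis import given, settings, strategies as st


def halve(n):
    """Integer halving, used here as a stand-in for code under test."""
    return n // 2


# Over-broad: roughly half of all draws are discarded by the filter.
evens_filtered = st.integers(min_value=0, max_value=10_000).filter(
    lambda n: n % 2 == 0
)

# Tight: every draw is even by construction, so nothing is discarded.
evens_built = st.integers(min_value=0, max_value=5_000).map(lambda n: n * 2)


@settings(max_examples=200, deadline=200)  # cap example count and ms per example
@given(evens_built)
def test_halving_an_even_number_is_reversible(n):
    assert isinstance(halve(n), int)  # weak: nearly any implementation passes
    assert halve(n) * 2 == n          # meaningful: ties the output to the input


@given(evens_filtered)
def test_filtered_strategy_draws_the_same_kind_of_value(n):
    assert n % 2 == 0
```

Both strategies cover the same values; the mapped one simply spends no iterations on rejected draws.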

Production checklist

  • Identify 2–3 invariants for each high-risk pure function (money, authz, serialization).
  • Define domain strategies that only emit valid structures.
  • Start with 50–100 examples per property locally; tune CI budgets.
  • On failure, save the shrunk counterexample as a named example test.
  • Wire Hypothesis/fast-check into existing pytest/Vitest suites — no separate runner.
  • Exclude nondeterministic tests (live network, wall-clock races) from PBT.
  • Document which properties map to which business guarantees in the module docstring.
  • Review PRs for properties that are too weak to catch real bugs.
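
One way to keep a shrunk counterexample next to its property in Hypothesis is the `@example` decorator, which always runs the pinned input before any generated ones. The sketch below uses a hypothetical `discounted_subtotal` function and the two-discount case from the worked example.

```python
from hypothesis import example, given, strategies as st


def discounted_subtotal(subtotal_cents, percents):
    """Hypothetical pricing step: sequential discounts, floored at zero."""
    running = subtotal_cents
    for pct in percents:
        running = max(0, running - (running * pct + 50) // 100)
    return running


@given(
    subtotal_cents=st.integers(min_value=0, max_value=1_000_000),
    percents=st.lists(st.integers(min_value=0, max_value=100), max_size=4),
)
@example(subtotal_cents=1, percents=[50, 50])  # pinned shrunk counterexample
def test_discounted_subtotal_stays_in_range(subtotal_cents, percents):
    result = discounted_subtotal(subtotal_cents, percents)
    assert 0 <= result <= subtotal_cents
```

A separate, named example test is still worth keeping when the incident needs to be visible in code review; the decorator only guarantees that the input is replayed on every run.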

Key takeaways

  • Properties express rules — examples document instances; both belong in the suite.
  • Generators encode domain knowledge — tighter strategies mean faster, clearer failures.
  • Shrinking is the debugger — minimal counterexamples turn PBT from noise into actionable bugs.
  • Target pure logic first — parsers, pricing, serializers, and ranking algorithms see the highest ROI.
  • Freeze regressions as examples — PBT finds bugs; example tests lock the fix.
