Property-based testing explained
Harbor Commerce's fee calculator passed every hand-written test case — including zero-dollar carts, single-line orders, and a 10% promotional discount. Production still saw a refund bug: when three stackable percentage discounts applied in a different order than QA imagined, the total fee went negative by four cents. The engineer who fixed it did not add a fourth example; they wrote a property: “for any valid cart, computed fees are non-negative and sum to at most the subtotal.” A generator produced ten thousand random carts in two seconds and found the ordering edge case immediately. That is property-based testing (PBT) — asserting invariants over generated inputs instead of locking behavior to a fixed table of examples. Popularized by Haskell's QuickCheck and available today in Python (Hypothesis), JavaScript (fast-check), Java (jqwik), and Rust (proptest), PBT complements unit and integration tests where combinatorial edge cases hide. This guide explains properties vs examples, generators and shrinking, writing good invariants, framework patterns, a Harbor Commerce fee calculator worked example, an approach decision table, common pitfalls, and a production checklist.
Examples vs properties
Example-based tests (the default in pytest and most frameworks) say: “given input X, expect output Y.” They are precise, readable, and fast to write. Their weakness is selection bias: you test the cases you thought of, which skew toward happy paths and round numbers.
Property-based tests say: “for all inputs matching constraint C, invariant I must hold.” The framework generates many inputs satisfying C, runs I, and on failure shrinks the input to a minimal counterexample you can paste into a regression test.
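As a rough illustration of the difference, the sketch below places one example-based test next to a hand-rolled property check that uses only Python's standard library. The names dedupe_sorted, check_property, and output_is_sorted_and_unique are invented for this illustration; a real framework replaces the random loop with smarter generation and adds shrinking.

```python
import random


def dedupe_sorted(xs):
    """Hypothetical function under test: sorted list with duplicates removed."""
    return sorted(set(xs))


# Example-based: one chosen input, one expected output.
def test_dedupe_example():
    assert dedupe_sorted([3, 1, 3, 2]) == [1, 2, 3]


# Property-based, hand-rolled: try many random lists against an invariant.
def check_property(prop, trials=500, seed=0):
    """Return the first random input for which prop is False, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not prop(xs):
            return xs
    return None


def output_is_sorted_and_unique(xs):
    out = dedupe_sorted(xs)
    strictly_increasing = all(a < b for a, b in zip(out, out[1:]))
    same_members = set(out) == set(xs)
    return strictly_increasing and same_members
```

Calling check_property(output_is_sorted_and_unique) returns None when the invariant holds for every generated list; a failing property returns the offending input instead.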
Common property shapes
- Round-trip — decode(encode(x)) == x for serializers, compressors, and codecs.
- Idempotence — normalize(normalize(s)) == normalize(s) for slugifiers, Unicode normalizers, and fee rounding.
- Commutativity / associativity — order of independent operations should not change results (when your domain claims it should not).
- Oracle comparison — optimized path matches naive reference implementation on small inputs.
- Metamorphic relations — if you sort twice, order is unchanged; if you add a constant to all inputs, ranking is unchanged.
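Three of these shapes might look like the following in Hypothesis. This is a minimal sketch: normalize is a hypothetical whitespace normalizer written for the illustration, and base64 stands in for whatever codec your code actually uses.

```python
import base64

from hypothesis import given, strategies as st


# Round-trip: decoding an encoded value gives back the original bytes.
@given(st.binary())
def test_base64_round_trip(data):
    assert base64.b64decode(base64.b64encode(data)) == data


def normalize(s):
    """Hypothetical normalizer: trim and collapse runs of whitespace."""
    return " ".join(s.split())


# Idempotence: normalizing twice equals normalizing once.
@given(st.text())
def test_normalize_idempotent(s):
    assert normalize(normalize(s)) == normalize(s)


# Metamorphic relation: adding a constant to every score keeps the ranking.
@given(st.lists(st.integers()), st.integers())
def test_ranking_unchanged_by_shift(scores, shift):
    def ranking(values):
        # Indices ordered by value; Python's sort is stable, so ties keep order.
        return sorted(range(len(values)), key=lambda i: values[i])

    assert ranking(scores) == ranking([x + shift for x in scores])
```

Each test takes no expected output at all; the assertion is a relation that must hold for whatever input the strategy produces.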
PBT does not replace examples for documented business rules (“California charges 7.25% state tax on shipping”). It excels at structural correctness and “this should never happen” guards on pure functions and deterministic services.
Generators, strategies, and shrinking
A generator (Hypothesis calls them strategies) produces random values obeying constraints. Built-ins cover integers, text, datetimes, and collections; you compose them for domain types.
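# Python (Hypothesis)
from hypothesis import given, strategies as st

# compute_platform_fee is the function under test, imported from application code.
# Subtotals range from $0.00 to $10,000.00, expressed in cents.
@given(st.integers(min_value=0, max_value=10_000_00))
def test_fee_non_negative_on_cents_subtotal(subtotal_cents):
    fee = compute_platform_fee(subtotal_cents)
    assert fee >= 0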
Shrinking is PBT's debugging superpower. When an assertion fails on a 47-field cart with exotic Unicode SKUs, the framework simplifies the input (dropping fields, shortening strings, lowering magnitudes) until it finds a small counterexample, such as subtotal=3, discounts=[50, 50], that still fails. Without shrinking, CI logs are unreadable noise.
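To show the idea behind shrinking, here is a deliberately naive greedy shrinker in plain Python. It is not how Hypothesis or any other library shrinks internally; shrink_list and the fails predicate are hypothetical names, and the "bug" is an invented rule that the code under test breaks whenever two discounts of 50 or more are present.

```python
def shrink_list(xs, fails):
    """Greedily simplify xs (non-negative ints) while fails(xs) stays True."""
    assert fails(xs), "start from a failing input"
    improved = True
    while improved:
        improved = False
        # Step 1: try dropping one element at a time.
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if fails(candidate):
                xs, improved = candidate, True
                break
        if improved:
            continue
        # Step 2: try lowering one element (halve it, then step down by one).
        for i, x in enumerate(xs):
            for smaller in (x // 2, x - 1):
                if 0 <= smaller < x:
                    candidate = xs[:i] + [smaller] + xs[i + 1:]
                    if fails(candidate):
                        xs, improved = candidate, True
                        break
            if improved:
                break
    return xs


def fails(discounts):
    """Invented failure condition: at least two discounts of 50 or more."""
    return sum(1 for d in discounts if d >= 50) >= 2


print(shrink_list([73, 12, 50, 9, 88, 41], fails))  # prints [50, 50]
```

The six-element input collapses to the two values that matter, which is the kind of counterexample you can read at a glance and paste into a regression test.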
Custom strategies for domain types
Model your domain explicitly instead of hoping random dicts are valid:
# LineItem is the application's own line-item class.
line_items = st.lists(
    st.builds(
        LineItem,
        sku=st.text(min_size=1, max_size=12),
        qty=st.integers(min_value=1, max_value=99),
        unit_price_cents=st.integers(min_value=0, max_value=500_00),
    ),
    min_size=1,
    max_size=8,
)
Use assume() to discard invalid draws (e.g. carts over credit limit), but prefer tight strategies: heavy filtering wastes iterations and slows CI. For regex parsers, generate strings from the grammar or use library helpers rather than arbitrary text.
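One way to keep strategies tight in Hypothesis is to build valid values by construction with st.composite, and to generate grammar-conforming strings with st.from_regex. The sketch below assumes a made-up SKU format and a hypothetical price_ranges strategy; neither comes from a real codebase.

```python
import re

from hypothesis import given, strategies as st

SKU_PATTERN = r"[A-Z]{3}-[0-9]{4}"  # hypothetical SKU format


@st.composite
def price_ranges(draw):
    """Build (low, high) with low <= high by construction.

    Drawing two independent integers and calling assume(low <= high)
    would throw away roughly half of the generated pairs.
    """
    low = draw(st.integers(min_value=0, max_value=100_00))
    high = draw(st.integers(min_value=low, max_value=100_00))
    return low, high


@given(price_ranges())
def test_range_is_ordered(bounds):
    low, high = bounds
    assert 0 <= low <= high <= 100_00


# fullmatch=True asks for strings that match the whole pattern.
@given(st.from_regex(SKU_PATTERN, fullmatch=True))
def test_generated_skus_match_grammar(sku):
    assert re.fullmatch(SKU_PATTERN, sku)
```

Because every draw is already valid, the full example budget goes to exercising the code under test rather than to rejected inputs.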
Framework landscape
| Language | Library | Notes |
|---|---|---|
| Python | Hypothesis | First-class pytest plugin; @given on test functions; rich strategies |
| JavaScript / TS | fast-check | Works with Vitest/Jest; model-based commands for stateful APIs |
| Haskell | QuickCheck | Original; strong type-driven generators |
| Java | jqwik | JUnit 5 integration; @ForAll parameters |
| Rust | proptest | Macro-driven; persists failing seeds in proptest-regressions |
Configure example count and deadlines per environment: 100 iterations locally, 1,000 in nightly CI, fewer on slow integration tests. Hypothesis stores a .hypothesis/examples database so known failures replay deterministically. Pair PBT with CI gates that fail on regression seeds.
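In Hypothesis, per-environment budgets can be expressed as settings profiles, typically registered in conftest.py. The profile names and the HYPOTHESIS_PROFILE environment variable below are conventions assumed for this sketch, and the test at the end is only a placeholder for a slow integration property.

```python
import os
from datetime import timedelta

from hypothesis import given, settings, strategies as st

# Assumed profile names; pick whatever matches your pipeline.
settings.register_profile("dev", max_examples=100)
settings.register_profile("nightly", max_examples=1000, deadline=None)

# Assumed convention: CI exports HYPOTHESIS_PROFILE=nightly.
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))


# A per-test override for a slow path: fewer examples, looser deadline.
@settings(max_examples=20, deadline=timedelta(seconds=2))
@given(st.integers())
def test_slow_integration_path(n):
    assert int(str(n)) == n  # placeholder round-trip property
```

Keeping the budgets in one place means a developer's local run and the nightly job execute the same properties with different depth.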
Worked example: Harbor Commerce fee calculator
Harbor Commerce charges a platform fee as basis points on post-discount subtotal, with a minimum one-cent fee and half-up rounding to cents. Stackable percentage discounts apply sequentially. The bug: three 50% discounts did not floor at zero subtotal before fee calculation.
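The pricing rule can be sketched as follows. This is not Harbor Commerce's code: price_cents, round_half_up, and the 250 basis point default are hypothetical, and the sketch assumes the one-cent minimum applies only to positive subtotals, since a minimum fee on a zero-cent cart would exceed the subtotal.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(amount):
    """Round a non-negative Decimal number of cents to a whole cent, halves up."""
    return int(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def price_cents(gross_cents, discount_percents, fee_bps=250):
    """Return (post_discount_subtotal, fee), both in integer cents.

    discount_percents are whole percentages in 0..100, applied in order.
    fee_bps is an assumed rate; 250 basis points is 2.5%.
    """
    subtotal = gross_cents
    for pct in discount_percents:
        subtotal = round_half_up(Decimal(subtotal) * (100 - pct) / 100)
        subtotal = max(subtotal, 0)  # floor at zero before computing the fee
    if subtotal == 0:
        return 0, 0  # assumption: no minimum fee when nothing is charged
    fee = max(1, round_half_up(Decimal(subtotal) * fee_bps / 10_000))
    return subtotal, fee


print(price_cents(10_000, [10]))  # prints (9000, 225)
print(price_cents(199, [50]))     # prints (100, 3)
```

Rounding after every discount step is one plausible reading of "apply sequentially"; rounding once at the end would give different cents on some carts, which is exactly the kind of ambiguity a property exposes.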
The team added properties alongside existing pytest examples:
- Non-negativity — fee and final total never negative.
- Bounded fee — fee ≤ post-discount subtotal.
- Discount monotonicity — adding a non-negative discount never increases total paid.
- Reference oracle — for carts with ≤ 3 lines and subtotal < $200, match a slow decimal reference implementation.
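from hypothesis import given

# carts() and discount_lists() are the team's domain strategies;
# price_cart is the code under test.
@given(cart=carts(), discount_stack=discount_lists(max_size=4))
def test_fee_never_exceeds_subtotal(cart, discount_stack):
    subtotal, fee, total = price_cart(cart, discount_stack)
    # Assert instead of assume(): filtering on an output would silently
    # discard the very cases where the subtotal goes negative.
    assert subtotal >= 0
    assert fee >= 0
    assert total >= 0
    assert fee <= subtotal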
Hypothesis shrank a failing case to two 50% discounts on a one-cent line item. Engineers froze it as a named example test and fixed rounding order. Properties stayed for regression; the example documents the incident for code review.
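Freezing a counterexample can be done in two complementary ways: a plain named test that documents the incident, and Hypothesis's @example decorator, which replays a specific input on every run of the property. In this sketch discounted_cents is a simplified hypothetical stand-in for the real pricing code, and the test names are invented.

```python
from hypothesis import example, given, strategies as st


def discounted_cents(cents, percents):
    """Hypothetical stand-in: apply whole-percent discounts in order."""
    for pct in percents:
        cents = cents * (100 - pct) // 100  # integer math, truncating
    return cents


# Named example test: readable in code review, tied to the incident.
def test_two_half_discounts_on_one_cent_item():
    assert discounted_cents(1, [50, 50]) >= 0


# The property keeps running on random carts, and always replays the
# shrunk counterexample first.
@given(
    cents=st.integers(min_value=0, max_value=500_00),
    percents=st.lists(st.integers(min_value=0, max_value=100), max_size=4),
)
@example(cents=1, percents=[50, 50])
def test_discounted_total_stays_in_range(cents, percents):
    assert 0 <= discounted_cents(cents, percents) <= cents
```

The named test survives even if the property is later rewritten, while the pinned example guarantees the property itself never stops covering the original failure.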
When to use PBT vs other techniques
| Situation | Prefer | Why |
|---|---|---|
| Pure function with clear invariants | Property-based | High bug yield per line of test code |
| Documented regulatory rule with fixed rates | Example-based | Auditors want explicit cases, not randomness |
| Stateful service / database | Example integration + optional model-based PBT | Setup cost is high; start with examples |
| Security parser / crypto | PBT + dedicated fuzz harness | See chaos and fault injection for runtime failures; use PBT for the input space |
| UI browser flows | Playwright E2E examples | PBT rarely pays off on DOM workflows |
| Flaky external APIs | Contract tests with recorded fixtures | Random live calls are nondeterministic |
Common pitfalls
- Properties that are always true — tautologies like assert isinstance(result, dict) waste CI time; assert meaningful relations.
- Over-broad generators — mostly invalid inputs filtered by assume(); tighten strategies instead.
- Ignoring flaky seeds — disable shrinking or the example DB and you lose reproducibility; commit regression seeds.
- Testing the mock — properties on code that only calls a stubbed SDK verify nothing about production.
- Replacing all examples — keep explicit cases for regressions, specs, and onboarding; add properties for coverage gaps.
- Unbounded runtime — cap examples and wall-clock per test; PBT on hot paths in every PR can slow merges.
Production checklist
- Identify 2–3 invariants for each high-risk pure function (money, authz, serialization).
- Define domain strategies that only emit valid structures.
- Start with 50–100 examples per property locally; tune CI budgets.
- On failure, save the shrunk counterexample as a named example test.
- Wire Hypothesis/fast-check into existing pytest/Vitest suites — no separate runner.
- Exclude nondeterministic tests (live network, wall-clock races) from PBT.
- Document which properties map to which business guarantees in the module docstring.
- Review PRs for properties that are too weak to catch real bugs.
Key takeaways
- Properties express rules — examples document instances; both belong in the suite.
- Generators encode domain knowledge — tighter strategies mean faster, clearer failures.
- Shrinking is the debugger — minimal counterexamples turn PBT from noise into actionable bugs.
- Target pure logic first — parsers, pricing, serializers, and ranking algorithms see the highest ROI.
- Freeze regressions as examples — PBT finds bugs; example tests lock the fix.
Related reading
- Software testing fundamentals explained — test pyramid, mocks, and when each layer pays off
- pytest fundamentals explained — fixtures, parametrize, and Hypothesis integration
- Vitest fundamentals explained — fast-check pairing for TypeScript unit tests
- Regular expressions explained — why parsers benefit from generated inputs