
Data contracts explained

The checkout team renames customer_id to buyer_uuid in the orders table. Nobody tells the analytics pipeline. Revenue dashboards show zero for three days until someone notices the join broke. That failure mode is common when teams share data through implicit assumptions instead of explicit agreements.

A data contract is a formal, versioned promise from a producer (the team that writes the data) to every consumer (downstream pipelines, services, and analysts) about structure, semantics, freshness, and how changes will be communicated.

This guide covers contract anatomy, schema formats, compatibility rules, enforcement with registries and CI, a worked example on the Harbor orders topic, a decision table comparing contracts with the alternatives, common pitfalls, and a production checklist. Pair contracts with solid ETL/ELT pipeline hygiene and change data capture patterns when events leave the operational database.

What a data contract contains

A contract is more than a JSON file in a repo. Mature organizations treat it as a product interface with four layers:

  • Schema — field names, types, nullability, enums, nested structures. Machine-readable in Avro, Protobuf, JSON Schema, or SQL DDL.
  • Semantics — what each field means: units (cents vs dollars), timezone (UTC vs local), identifier stability (UUID v4 vs sequential int), and business rules (refunds appear as negative amounts).
  • SLAs — freshness (events arrive within 5 minutes), completeness (no more than 0.01% nulls on required keys), availability (topic published 99.9% of the time), and retention (90 days hot, 7 years archive).
  • Ownership — producer team, on-call rotation, breaking-change notice period (e.g., 30 days), and a changelog linked to schema versions.

Without semantics and SLAs, you only catch type errors — not the dashboard that silently doubles revenue because someone switched currency fields from cents to dollars without a version bump.
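
As a rough illustration, the four layers can live together in one versioned object. The sketch below is hypothetical: the class names, attribute names, and team names are assumptions rather than a standard contract format, and the SLA numbers are the examples from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: str        # schema layer: machine-checkable type
    nullable: bool
    meaning: str     # semantics layer: units, timezone, business rules


@dataclass(frozen=True)
class Sla:
    freshness_minutes: int     # events arrive within this many minutes
    max_null_fraction: float   # completeness on required keys
    availability_pct: float
    retention_days_hot: int


@dataclass(frozen=True)
class Ownership:
    producer_team: str
    on_call: str
    breaking_change_notice_days: int


@dataclass(frozen=True)
class DataContract:
    name: str
    version: str
    fields: tuple
    sla: Sla
    ownership: Ownership

    def undocumented_fields(self):
        """Return names of fields that have a type but no stated meaning."""
        return [f.name for f in self.fields if not f.meaning.strip()]


contract = DataContract(
    name="orders.created",
    version="1.0.0",
    fields=(
        FieldSpec("order_id", "string", False, "Opaque order identifier"),
        FieldSpec("amount_cents", "long", False, "Total in cents; refunds are negative"),
        FieldSpec("currency", "string", False, ""),  # semantics missing on purpose
    ),
    sla=Sla(freshness_minutes=5, max_null_fraction=0.0001,
            availability_pct=99.9, retention_days_hot=90),
    ownership=Ownership("checkout", "checkout-oncall", 30),
)

print(contract.undocumented_fields())  # ['currency']
```

A check like undocumented_fields could run in review to flag fields that validate by type but say nothing about meaning.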

Why contracts beat tribal knowledge

Microservices and Kafka topics decouple teams precisely so they can deploy independently. That decoupling breaks down when consumers parse payloads by convention. Contracts restore safety without re-coupling release cycles:

  • Detect breaking changes before deploy — CI fails if a pull request removes a required field consumers depend on.
  • Document the interface — new engineers onboard from the contract, not from grepping three repos.
  • Enable schema evolution — additive changes ship weekly; destructive changes follow a published migration path.
  • Assign accountability — the producer owns backward compatibility; consumers own adapting within the notice window.

In data-mesh terminology, contracts are how a domain exposes shareable data products. Even without adopting mesh org structure, the interface discipline pays off anywhere more than one team reads the same stream.

Schema formats and where they fit

Avro and Protobuf (streaming events)

Row-oriented event buses favor Avro (common with Confluent Schema Registry) or Protobuf (gRPC services, many internal pipelines). Protobuf identifies fields by number, so readers skip unknown fields and a field can be renamed without changing the wire format. Avro matches fields by name and relies on defaults and aliases to resolve added or renamed fields when compatibility rules are configured. Payloads are compact because schemas live in a registry instead of being repeated in every message.

JSON Schema (APIs and semi-structured logs)

REST webhooks, SaaS exports, and document stores often standardize on JSON Schema. Validation libraries exist in every language. Downsides: larger payloads, weaker evolution story unless you discipline optional fields and forbid silent type changes.

SQL DDL and dbt models (warehouse tables)

Analytics contracts can be expressed as versioned SQL views or dbt models with column-level tests (unique, not_null, accepted_values). Schema changes flow through database migration tooling with review gates — the warehouse equivalent of a registry check.
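
To make the column-level tests concrete, here is a minimal sketch in the same spirit, using Python's built-in sqlite3 instead of dbt. Each check is a query that counts offending rows. The table, the sample rows, and the accepted currency list are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, currency TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("o-1", "USD"), ("o-2", "EUR"), ("o-2", "USD"), (None, "XXX")],
)

# Each check counts violations; zero means the test passes.
CHECKS = {
    "order_id not_null": "SELECT COUNT(*) FROM orders WHERE order_id IS NULL",
    "order_id unique": (
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "WHERE order_id IS NOT NULL GROUP BY order_id HAVING COUNT(*) > 1)"
    ),
    "currency accepted_values": (
        "SELECT COUNT(*) FROM orders WHERE currency NOT IN ('USD', 'EUR')"
    ),
}


def run_checks(connection):
    """Return {check_name: number_of_violations} for every check."""
    return {name: connection.execute(sql).fetchone()[0]
            for name, sql in CHECKS.items()}


results = run_checks(conn)
failed = [name for name, violations in results.items() if violations > 0]
print(failed)
```

With the sample rows above, all three checks report a violation, which is the signal a review gate would block on.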

Compatibility modes: backward, forward, full

Schema registries classify each new version against the previous one:

  • Backward compatible — new schema can read old data. Safe changes: add optional fields, add fields with defaults, remove fields consumers never read (if you track usage). Required for consumers upgrading before producers.
  • Forward compatible — old schema can read new data. Safe changes: add fields old readers ignore. Required for producers deploying before consumers.
  • Fully compatible — both directions work. Strictest; limits you to adding or removing optional fields that have defaults.
  • Breaking — rename without alias, change type, tighten nullability, or remove a field something still reads. Requires a new topic, major version bump, or coordinated dual-write period.

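Pick a default policy per domain. Event streams often default to backward compatibility (it is also Confluent Schema Registry's default), which lets upgraded consumers keep reading older records. If producers deploy first and old consumers must keep working, require forward or full compatibility instead. Internal batch files sometimes use explicit version folders (orders/v2/) when breaking changes are cheaper than compatibility gymnastics.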
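
The modes can be sketched as a small classifier. This is a simplified model, not a registry's actual algorithm: schemas are plain dicts of field name to type and optional default, and type promotion, aliases, and nested records are ignored. The names can_read and classify are hypothetical.

```python
def can_read(reader, writer):
    """True if the reader schema can decode data written with the writer schema.

    Simplified Avro-style rule: every reader field must either exist in the
    writer with the same type or carry a default. Extra writer fields are ignored.
    """
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


def classify(old, new):
    """Classify a new schema version against the previous one."""
    backward = can_read(reader=new, writer=old)  # new schema reads old data
    forward = can_read(reader=old, writer=new)   # old schema reads new data
    if backward and forward:
        return "full"
    if backward:
        return "backward"
    if forward:
        return "forward"
    return "breaking"


v1 = {"order_id": {"type": "string"}, "customer_id": {"type": "string"}}
candidates = {
    "add optional field": {**v1, "guest_email": {"type": "string", "default": None}},
    "add required field": {**v1, "payment_method": {"type": "string"}},
    "remove field": {"order_id": {"type": "string"}},
    "rename field": {"order_id": {"type": "string"},
                     "buyer_uuid": {"type": "string"}},
}

for label, new_schema in candidates.items():
    print(f"{label}: {classify(v1, new_schema)}")
```

Under this model a rename looks like a removal plus a required addition, which is why it fails in both directions.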

Enforcement: registries, CI, and runtime gates

Schema registry

Kafka-centric stacks register Avro/Protobuf/JSON schemas centrally. Producers submit new versions, and the registry rejects incompatible ones, so a producer using the registry's serializer fails before records in the new shape reach the topic. Consumers read the schema ID carried with each message (in Confluent's wire format, a short prefix on the serialized payload) and deserialize with the matching version, with no hard-coded field lists in application code.
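
For instance, Confluent's serializers frame each message value with a magic byte (0) and a 4-byte big-endian schema ID ahead of the serialized payload. The sketch below packs and unpacks that prefix with the standard library. The schema ID 42, the placeholder payload, and the SCHEMA_CACHE dict are hypothetical; a real client library handles framing and registry lookups for you.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def frame(schema_id, payload):
    """Prefix a serialized payload with the magic byte and schema ID."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def unframe(message):
    """Split a framed message into (schema_id, payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message too short to carry a schema ID")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


# Hypothetical local cache of schemas already fetched from the registry.
SCHEMA_CACHE = {42: "OrderCreated v1"}

message = frame(42, b"<avro bytes>")
schema_id, payload = unframe(message)
print(SCHEMA_CACHE[schema_id], payload)
```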

CI contract tests

Store contract files in git. On every pull request:

  • Diff new schema against main; fail on breaking changes unless labeled major-bump.
  • Run consumer fixture tests — sample payloads from the last 30 days must still parse.
  • Generate human-readable changelog entries for Slack or docs.
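
The first and last steps above could be sketched as follows. This is an assumption-heavy illustration: schemas are plain dicts, the function names are hypothetical, and the policy treats every removal as breaking, which is stricter than a registry's backward mode.

```python
def diff_schemas(old, new):
    """Return (breaking, additive) lists of human-readable change descriptions."""
    breaking, additive = [], []
    for name in old:
        if name not in new:
            breaking.append(f"removed field '{name}'")
        elif old[name]["type"] != new[name]["type"]:
            breaking.append(
                f"changed type of '{name}' from "
                f"{old[name]['type']} to {new[name]['type']}"
            )
    for name in new:
        if name not in old:
            if "default" in new[name]:
                additive.append(f"added optional field '{name}'")
            else:
                breaking.append(f"added required field '{name}' without a default")
    return breaking, additive


def check_pull_request(main_schema, pr_schema, labels):
    """Return (exit_code, changelog); a non-zero exit code fails the CI job."""
    breaking, additive = diff_schemas(main_schema, pr_schema)
    changelog = [f"BREAKING: {c}" for c in breaking]
    changelog += [f"ADDITIVE: {c}" for c in additive]
    if breaking and "major-bump" not in labels:
        return 1, changelog
    return 0, changelog


main_schema = {"order_id": {"type": "string"}, "customer_id": {"type": "string"}}
pr_schema = {"order_id": {"type": "string"},
             "guest_email": {"type": "string", "default": None}}

exit_code, changelog = check_pull_request(main_schema, pr_schema, labels=set())
print(exit_code)
print("\n".join(changelog))
```

Here the pull request drops customer_id without the major-bump label, so the check returns a failing exit code along with changelog lines ready to post.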

Runtime validation

Producers validate outbound records against the contract before publish. Ingestion jobs reject or quarantine records that fail (dead-letter queue with alert). This catches bugs the registry cannot see — wrong enum values, SLA violations, or business-rule failures.
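
A minimal sketch of that gate, assuming records are plain dicts and the dead-letter queue is an in-memory list. The accepted currency set and the function names are hypothetical.

```python
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical enum from the contract


def validate(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for key in ("order_id", "customer_id"):
        value = record.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"{key} must be a non-empty string")
    # type(...) is int also rejects booleans, which isinstance would accept.
    if type(record.get("amount_cents")) is not int:
        errors.append("amount_cents must be an integer number of cents")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency is not an accepted value")
    return errors


def route(records):
    """Split records into (publishable, dead_letter) according to validate()."""
    publishable, dead_letter = [], []
    for record in records:
        errors = validate(record)
        if errors:
            dead_letter.append({"record": record, "errors": errors})
        else:
            publishable.append(record)
    return publishable, dead_letter


batch = [
    {"order_id": "o-1", "customer_id": "c-9", "amount_cents": 1250, "currency": "USD"},
    {"order_id": "o-2", "customer_id": "c-3", "amount_cents": 12.5, "currency": "usd"},
]
ok, quarantined = route(batch)
print(len(ok), len(quarantined))
```

Keeping the violation list next to the quarantined record makes the alert actionable without replaying the payload.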

Worked example: Harbor orders topic evolution

Harbor Market publishes an orders.created Avro topic. Version 1 schema:

{
  "type": "record",
  "name": "OrderCreated",
  "fields": [
    { "name": "order_id", "type": "string" },
    { "name": "customer_id", "type": "string" },
    { "name": "amount_cents", "type": "long" },
    { "name": "currency", "type": "string" },
    { "name": "created_at", "type": { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}

Product wants guest checkout. Version 2 adds an optional guest_email (nullable string with a null default) and keeps all v1 fields, so it is backward compatible and the registry approves it. Analytics adds a fraud model that needs payment_method. Version 3 adds it as an optional field with default "unknown", which is still backward compatible.
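
Readers on the newer schema can still decode v1 records because each added field carries a default. The sketch below imitates that resolution step in plain Python; a real Avro deserializer derives it from the writer and reader schemas. The null default for guest_email is an assumption, and read_as_v3 is a hypothetical helper.

```python
# Fields added after v1, mapped to the default a v3 reader fills in.
V3_ADDED_FIELDS = {
    "guest_email": None,          # added in v2 (assumed default: null)
    "payment_method": "unknown",  # added in v3
}


def read_as_v3(record):
    """Fill defaults for fields the writer's schema did not have."""
    resolved = dict(record)  # copy so the original record is untouched
    for name, default in V3_ADDED_FIELDS.items():
        resolved.setdefault(name, default)
    return resolved


v1_record = {
    "order_id": "o-1",
    "customer_id": "c-9",
    "amount_cents": 1250,
    "currency": "USD",
    "created_at": 1700000000000,
}

resolved = read_as_v3(v1_record)
print(resolved["guest_email"], resolved["payment_method"])
```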

Six months later, the identity team merges customers and guests into buyer_uuid. Renaming customer_id is a breaking change, so the contract owner:

  1. Announces deprecation in #data-contracts with 30-day notice.
  2. Ships v4 that adds buyer_uuid while keeping customer_id populated as an alias (dual-write).
  3. Updates consumers to read buyer_uuid with fallback to customer_id.
  4. After adoption metrics hit 100%, ships v5 removing customer_id — flagged as breaking; only consumers on v4+ survive.

Without the contract and registry, step 4 would have shipped on a Friday deploy and broken four downstream jobs silently.
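
Steps 3 and 4 might look like this on the consumer and producer side. It is a sketch under assumptions: events are plain dicts, and the consumer inventory (names and the schema version each one reads with) is hypothetical.

```python
def buyer_key(event):
    """Prefer buyer_uuid (present from v4); fall back to customer_id."""
    return event.get("buyer_uuid") or event["customer_id"]


def adoption_rate(consumer_versions, minimum_version=4):
    """Fraction of known consumers reading with at least minimum_version."""
    if not consumer_versions:
        return 1.0  # no registered consumers, so nothing can break
    upgraded = sum(1 for v in consumer_versions.values() if v >= minimum_version)
    return upgraded / len(consumer_versions)


# Hypothetical consumer inventory: consumer name -> schema version it reads with.
consumers = {"revenue-dashboard": 4, "fraud-model": 4, "finance-export": 3}

old_event = {"order_id": "o-1", "customer_id": "c-9"}
new_event = {"order_id": "o-2", "customer_id": "c-9", "buyer_uuid": "b-77"}

print(buyer_key(old_event))
print(buyer_key(new_event))

safe_to_ship_v5 = adoption_rate(consumers) == 1.0
print(safe_to_ship_v5)  # False: finance-export still reads with v3
```

Gating the removal on the inventory rather than on a calendar date is what keeps the last straggler from breaking.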

Decision table: contracts vs alternatives

Approach | Best when | Trade-off
Formal data contract + registry | Multiple independent teams consume the same event stream or table | Upfront schema design; registry ops overhead
Shared monolith database | One team, one deploy, few readers | Couples releases; does not scale org-wide
Ad hoc JSON + documentation | Prototype or <2 consumers, low change rate | Silent breakage as soon as a third team joins
Versioned API (REST/gRPC) only | Request/response sync access | Does not govern async pipelines or warehouse loads
New topic per breaking change | Rare radical redesigns, few long-lived consumers | Topic sprawl; consumers must cut over explicitly

Common pitfalls

  • Schema without semantics — types validate but meaning drifts (currency units, timezone, ID format).
  • Breaking changes smuggled as patches — tightening string to enum rejects historical records.
  • No consumer registry — you cannot know if removing a field is safe without tracking who reads it.
  • Contracts owned by consumers — producers must own the interface; consumers propose changes via RFC.
  • Ignoring SLAs — perfect schema with six-hour lag still breaks real-time fraud rules.
  • One global compatibility policy — batch archives and live RPC streams have different tolerance for breakage.
  • Manual registry edits — bypassing CI invites production incidents; all changes through pull requests.
  • Contract drift from reality — producers deploy code that skips validation; schedule contract audits against sampled payloads.

Production checklist

  • Every shared dataset has a named producer owner and on-call contact.
  • Schema file lives in version control with compatibility mode documented.
  • CI rejects backward-incompatible changes unless explicitly approved.
  • Sample production payloads tested against schema weekly.
  • Semantic definitions (units, timezones, ID types) written in plain English.
  • Freshness and completeness SLAs defined with alerting thresholds.
  • Breaking-change process published: notice period, dual-write, cutover metrics.
  • Dead-letter queue for records failing runtime validation.
  • Consumer inventory maintained — know who breaks if a field disappears.
  • Changelog communicated in a channel consumers actually read.

Key takeaways

  • Data contracts formalize what producers guarantee and what consumers may assume.
  • Compatibility rules (backward, forward, full) decide whether a schema change can ship independently.
  • Registries and CI catch breaking changes before they reach production traffic.
  • Semantics and SLAs matter as much as column types — most silent bugs are meaning drift, not parse errors.
  • Invest in contracts wherever more than one team depends on the same moving data.
