Flow matching explained

Harbor Media's marketing asset pipeline ran Stable Diffusion 2.1 with 50 DDIM steps per hero banner. Quality was acceptable, but batch jobs queued for hours on a single A10. Migrating to a rectified-flow checkpoint cut inference to 8 Euler steps with no measurable FID regression on their internal eval set. The underlying idea is flow matching: instead of learning to reverse a long stochastic noise schedule, the model learns a velocity field that transports samples along nearly straight paths from a simple source distribution (Gaussian noise) to the data manifold (images, audio spectrograms, molecular conformations). Sampling becomes integrating an ordinary differential equation (ODE) in few steps — often faster and more stable than classical diffusion denoising loops.

Flow matching unifies several recent generative advances: rectified flow (Liu, Gong, and Liu), conditional flow matching (Lipman et al.), optimal-transport-flavored training objectives, and the training objective behind Stable Diffusion 3's multimodal diffusion transformer. This guide defines the continuous normalizing flow viewpoint, walks through the conditional flow matching loss, explains rectified flow and reflow distillation, connects flow matching to score-based diffusion SDEs, covers ODE solvers and step-count trade-offs, documents the Harbor Media refactor, compares techniques in a decision table, lists common pitfalls, and provides a production checklist alongside our GAN guide and VAE guide.

From diffusion SDEs to velocity fields

Classical diffusion models define a forward process that gradually adds noise and train a network to predict noise, score, or clean data at each discrete timestep. In the continuous-time limit, that process is described by a stochastic differential equation (SDE). Song et al. showed you can also sample via a probability-flow ODE that shares the same marginals but removes stochasticity — turning generation into deterministic integration.
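For reference, the two equations this paragraph alludes to can be written out as follows. The symbols f (drift), g (diffusion coefficient), w (Wiener process), and p_t (marginal density at time t) follow the notation of Song et al. and are not used elsewhere in this guide. Note that in that convention time runs from data toward noise, the opposite of the t = 0 (noise) to t = 1 (data) convention used for flows below.

```latex
% Forward noising SDE
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Probability-flow ODE: deterministic, with the same marginals p_t
\frac{\mathrm{d}x}{\mathrm{d}t} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)
```

The score term \nabla_x \log p_t(x) is what a diffusion network estimates; sampling integrates the ODE backward in time from noise to data.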

Flow matching takes a more direct route: specify an explicit time-dependent vector field v_θ(x, t) such that integrating from t = 0 (noise) to t = 1 (data) produces samples from your target distribution. You do not need to simulate the forward noising process during training. Instead, you construct conditional paths between individual noise-data pairs and regress the network onto the analytic velocity of those paths.
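In symbols, the setup above can be sketched as below. The density p_t is notation introduced here for illustration; the guide itself only names the field v_θ(x, t).

```latex
% Generation: integrate the learned field from noise (t = 0) to data (t = 1)
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t), \qquad x_0 \sim \mathcal{N}(0, I)

% Continuity equation: how the field moves the sample density p_t over time
\frac{\partial p_t(x)}{\partial t} + \nabla_x \cdot \big( p_t(x)\, v_\theta(x, t) \big) = 0
```

Training succeeds when p_0 is the Gaussian source and p_1 is close to the data distribution.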

The payoff is training simplicity and sampling speed. Conditional flow matching (CFM) samples a data point x₁, draws noise x₀ ~ N(0, I), picks a time t ~ U(0, 1), builds an intermediate state x_t along a chosen bridge (linear interpolation is the common default), and minimizes

|| v_θ(x_t, t) - (x₁ - x₀) ||²

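when the bridge is the straight line x_t = (1 - t)x₀ + t x₁, whose time derivative is exactly x₁ - x₀. The loss is averaged over t, x₀, and x₁. There is no adversarial min-max game as in GANs and no variational bound as in VAEs, just supervised regression on vector fields. Lipman et al. show that this conditional objective has the same gradients as regressing onto the marginal velocity field, so its minimizer defines a valid generative flow under mild regularity conditions.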
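A minimal NumPy sketch of one training batch for the linear bridge above. The function names cfm_batch and cfm_loss are hypothetical, and the network itself is left out: v_pred stands for whatever v_θ(x_t, t) returns in your framework.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build regression inputs and targets for the straight-line bridge."""
    # Noise endpoint x0 ~ N(0, I), one per data sample.
    x0 = rng.standard_normal(x1.shape)
    # One time t ~ U(0, 1) per sample, shaped to broadcast over feature axes.
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    # Intermediate state on the bridge x_t = (1 - t) x0 + t x1.
    xt = (1.0 - t) * x0 + t * x1
    # Analytic velocity of that bridge, the regression target.
    target = x1 - x0
    return xt, t, target

def cfm_loss(v_pred, target):
    """Squared L2 norm per sample, averaged over the batch."""
    diff = (v_pred - target).reshape(target.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

In a real training loop, v_pred would come from the model evaluated at (xt, t), and the loss would be computed in an autodiff framework rather than NumPy.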

Rectified flow and reflow distillation

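Straight bridges are elegant, but naively learned flows can still develop curved transport. Because noise and data are paired at random, the straight conditional bridges cross one another, and the learned field averages over them, so the resulting trajectories bend and need many ODE steps. Rectified flow iteratively “straightens” the field: train a flow, generate pairs (z, x) by integrating the learned ODE from noise z to sample x, then retrain on new straight bridges between those paired endpoints. Each reflow round reduces trajectory curvature so Euler integration with 1–8 steps remains accurate.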
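The data side of one reflow round can be sketched as below, assuming the trained flow is available as a callable velocity(x, t). All function names are hypothetical, and retraining the network on the new targets is omitted.

```python
import numpy as np

def euler_integrate(velocity, z, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(z, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

def make_reflow_pairs(velocity, num_pairs, dim, steps, rng):
    """Draw noise z and push it through the current flow to get its paired x."""
    z = rng.standard_normal((num_pairs, dim))
    x = euler_integrate(velocity, z, steps)
    return z, x

def reflow_targets(z, x, rng):
    """Straight bridges between the coupled endpoints, plus their velocities."""
    t = rng.uniform(0.0, 1.0, size=(z.shape[0], 1))
    xt = (1.0 - t) * z + t * x
    return xt, t, x - z
```

The only difference from ordinary conditional flow matching is the coupling: each z is paired with the x it actually flows to, rather than with an independent data sample.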

This is the practical trick behind one-step or few-step image generators marketed as distilled diffusion. You are not magically removing compute; you are reshaping the transport map so cheap solvers work. Harbor Media's migration used a publicly released rectified-flow image checkpoint; their engineering work was eval harnesses (FID, CLIP score, human preference on brand palettes), scheduler wiring in their ComfyUI batch worker, and fallback to a 20-step diffusion checkpoint when prompts hit rare failure modes (fine typography, exact logo geometry).

For text-to-image, flow models still rely on the same conditioning stack as latent diffusion: a pretrained text encoder (CLIP, T5, or both in SD3) injects cross-attention or joint-attention context into a U-Net or DiT backbone. Flow matching changes the training objective and the time parameterization, not the entire multimodal architecture.

ODE sampling: solvers, step counts, and guidance

At inference, generation solves dx/dt = v_θ(x, t) from t = 0 to 1. First-order Euler is the default: x ← x + Δt · v_θ(x, t). Heun and midpoint methods spend a second function evaluation per step but improve accuracy when curvature remains. Adaptive Runge-Kutta solvers (Dormand-Prince, exposed as dopri5 in torchdiffeq and RK45 in SciPy) help research prototypes; production image pipelines usually fix a small step budget for predictable latency.
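A minimal fixed-step sampler covering the Euler and Heun updates described above. It assumes model is any callable (x, t) that returns a velocity array; the signature is illustrative rather than taken from a specific library.

```python
import math
import numpy as np

def sample(model, z, steps, method="euler"):
    """Integrate dx/dt = model(x, t) from t = 0 to t = 1 on a uniform grid."""
    x = np.array(z, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = model(x, t)
        if method == "euler":
            # First order: one model evaluation per step.
            x = x + dt * v
        elif method == "heun":
            # Second order: Euler predictor, then average the two slopes.
            x_pred = x + dt * v
            v_next = model(x_pred, t + dt)
            x = x + 0.5 * dt * (v + v_next)
        else:
            raise ValueError(f"unknown method: {method}")
    return x
```

On the deliberately curved toy field v(x, t) = x starting from x = 1, the exact answer at t = 1 is e. Eight Heun steps land closer to it than eight Euler steps, at the cost of twice as many model evaluations. On a perfectly straight field the two methods agree, which is why rectified checkpoints get away with Euler.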

Classifier-free guidance carries over from diffusion: during training, randomly drop text conditioning; at sample time, combine conditional and unconditional velocities: v = v_uncond + w (v_cond - v_uncond). For w between 0 and 1 this interpolates between the two; for w > 1, the usual setting, it extrapolates past the conditional velocity. Guidance scale w trades prompt adherence against diversity and artifact risk. Flow checkpoints are not immune to oversaturated textures or limb duplication at high w, so tune per model card.
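Both halves of classifier-free guidance fit in a few lines. This is a sketch with hypothetical helper names; null_cond stands for whatever "empty prompt" embedding a given checkpoint was trained with.

```python
import numpy as np

def maybe_drop_condition(cond, null_cond, drop_prob, rng):
    """Training side: replace the text condition with the null one at random."""
    return null_cond if rng.random() < drop_prob else cond

def guided_velocity(v_cond, v_uncond, w):
    """Sampling side: v = v_uncond + w * (v_cond - v_uncond)."""
    return v_uncond + w * (v_cond - v_uncond)
```

With w = 0 the result is the unconditional velocity, with w = 1 the conditional one, and larger w pushes further along their difference. Each guided step needs two model evaluations (or one batched evaluation of both branches), which belongs in any latency budget.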

Step count vs quality follows a knee curve. Harbor Media measured FID flattening between 6 and 12 Euler steps on their eval set; they chose 8 as the production default with a “quality mode” at 16 for print-resolution crops. Log solver choice, step count, guidance, and seed alongside every asset for reproducibility.
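One lightweight way to keep the logging advice above honest is a small immutable record serialized next to each asset. The class and field names below are hypothetical and should be adapted to your own asset store.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SampleRecord:
    """Everything needed to regenerate one asset (illustrative fields)."""
    asset_id: str
    checkpoint_revision: str
    solver: str
    steps: int
    guidance_scale: float
    seed: int

def to_log_line(record):
    """Serialize with sorted keys so log lines diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one such line per generated image makes it possible to replay any asset later with the same solver, step count, guidance scale, and seed.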

Harbor Media refactor: from DDIM loops to flow ODEs

Before refactor, the pipeline was: prompt template → SD 2.1 latent U-Net → 50 DDIM steps → VAE decode → brand color LUT. Bottlenecks were GPU queue depth and inconsistent step-to-step latency (scheduler overhead dominated at small batch sizes).

After refactor:

Checkpoint swap — rectified-flow SD-family weights with matched VAE; kept text encoder frozen initially to reduce regression risk.
Solver module — unified interface: sample(model, z, steps, method='euler') shared with a legacy diffusion fallback.
Eval gate — nightly FID/CLIP on 2k held-out prompts; auto-rollback if FID drifts > 3% week-over-week.
Latency SLO — p95 single 1024×1024 image < 4.5s on A10 at 8 steps (down from 18s).
Failure routing — heuristic prompt classifier (logos, dense text, faces) escalates to 20-step diffusion or human review.
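The eval gate and failure routing items can be sketched as below. The keyword lists, route names, and the reading of "drift" as a relative FID increase are all assumptions made for illustration, not Harbor Media's actual implementation.

```python
import re

FLOW_ROUTE = "flow_euler_8_steps"        # hypothetical route names
FALLBACK_ROUTE = "diffusion_20_steps"

# Hypothetical keyword lists for the three failure classes named above.
RISK_TERMS = {
    "logos": {"logo", "logos", "wordmark"},
    "dense_text": {"typography", "headline", "caption", "lettering"},
    "faces": {"face", "faces", "portrait"},
}

def should_rollback(fid_last_week, fid_this_week, max_relative_increase=0.03):
    """True when FID got worse (higher) by more than the allowed fraction."""
    change = (fid_this_week - fid_last_week) / fid_last_week
    return change > max_relative_increase

def route_prompt(prompt):
    """Return (route, matched failure classes) using whole-word matching."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    hits = sorted(label for label, terms in RISK_TERMS.items() if tokens & terms)
    return (FALLBACK_ROUTE if hits else FLOW_ROUTE), hits
```

A keyword heuristic like this is crude and will miss paraphrases. It is cheap to run on every prompt, and the matched classes can be logged to decide later whether a learned classifier is worth building.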

The lesson: flow matching is as much an ops upgrade as a research novelty. Without eval hooks and fallback paths, a faster model that fails on 5% of brand-critical prompts is worse than a slow reliable one.

Technique decision table

Goal	Prefer flow matching	Prefer classic diffusion	Prefer GAN / VAE
Few-step image generation on GPU	Rectified flow + Euler (8–16 steps)	Requires distillation or fast samplers (LCM, SD Turbo)	GANs fast but mode-collapse risk
Maximum open-weight ecosystem	Growing (SD3, Flux-class)	Largest (SD 1.5/2.x, ControlNet stacks)	StyleGAN for faces; niche elsewhere
Stable training on small data	CFM regression is mild; still needs data volume	Well-understood schedules and augments	VAEs good for representation; GANs brittle
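Exact likelihood / compression	Computable via the ODE change-of-variables formula but costly; not the primary metric	Score-based bounds exist but unused in practice	VAE ELBO is an explicit lower bound, not an exact likelihood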
Video / 3D generative models	Active research (straight flows in latent space-time)	Current production default (Sora-class denoisers)	Limited

Common pitfalls

Assuming 1-step models generalize — aggressive reflow can collapse diversity; always track FID and precision/recall or human eval.
Mismatched VAE and flow checkpoint — latent scaling constants differ across families; decode artifacts are not solver bugs.
Ignoring curvature after fine-tune — LoRA adapters trained on diffusion checkpoints do not automatically transfer to flow checkpoints, and fine-tuning a rectified flow can reintroduce curved trajectories; retrain or re-distill flows after domain adaptation.
Using diffusion schedulers verbatim — DDIM timestep spacing is wrong for learned ODE fields; use flow-native time grids.
Guidance scale copy-paste — optimal w for flow models differs from SD 1.5; re-tune per checkpoint.
No deterministic regression tests — float kernels and solver order change pixels; hash outputs in CI with pinned seeds and cudnn flags.
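For the last pitfall, a regression test can hash the decoded output for a pinned seed and compare it with a stored digest. In this sketch fake_sampler is a stand-in for a real pipeline call, and output_digest is a hypothetical helper.

```python
import hashlib
import numpy as np

def output_digest(image):
    """SHA-256 over dtype, shape, and raw bytes of an image array."""
    arr = np.ascontiguousarray(image)
    header = f"{arr.dtype.str}|{arr.shape}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()

def fake_sampler(seed, shape=(4, 4, 3)):
    """Stand-in for the real pipeline: deterministic given the seed."""
    rng = np.random.default_rng(seed)
    return (rng.random(shape) * 255).astype(np.uint8)
```

Bit-exact hashes are brittle across GPU models, driver versions, and kernel choices, so a digest check is best pinned to one CI environment. A tolerance-based comparison against a stored reference image is a reasonable complement.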

Production checklist

Confirm checkpoint family (rectified flow vs diffusion) before wiring schedulers.
Implement Euler baseline; benchmark Heun at fixed step budgets on your eval set.
Log t grid, step count, guidance, seed, and checkpoint revision per sample.
Pair flow weights with the correct VAE and text encoders from the model card.
Build FID/CLIP (or task-specific) nightly regression against a frozen prompt suite.
Define fallback path (higher steps or diffusion checkpoint) for failure-class prompts.
Document VRAM at target resolution; fewer flow steps cut latency, but peak memory still scales with U-Net/DiT width.
Validate classifier-free guidance dropout rate matches training assumptions.
After LoRA or fine-tune, re-measure step-quality knee; curvature may return.
Version solver code; ODE implementations differ in terminal time handling (t=1 inclusive vs exclusive).
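The terminal-time point in the last item is easy to see on a uniform grid. The helpers below are hypothetical; they show that a left-endpoint Euler loop with N steps evaluates the model at N times and never at t = 1, even though the grid has N + 1 edges.

```python
import numpy as np

def time_grid(steps, include_terminal=True):
    """Uniform grid on [0, 1]: steps + 1 edges, or steps edges without t = 1."""
    edges = np.linspace(0.0, 1.0, steps + 1)
    return edges if include_terminal else edges[:-1]

def euler_eval_times(steps):
    """Times at which Euler calls the model, and the step size used at each."""
    edges = time_grid(steps)
    return edges[:-1], np.diff(edges)
```

A solver that evaluates at the right endpoint, or that adds a corrector evaluation at t + Δt as Heun does, will query the model at t = 1. It is worth checking that the checkpoint was trained on times that close to the data end.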

Key takeaways

Flow matching trains a velocity field that transports noise to data along explicit paths, often via simple conditional regression instead of score matching at every noise level.
Rectified flow straightens transport trajectories through reflow rounds, enabling accurate sampling with 1–8 Euler steps.
Sampling integrates the learned velocity-field ODE; solver choice and step count define the latency-quality trade-off in production.
Harbor Media's refactor shows the win is end-to-end: checkpoint, solver, eval gates, and fallback routing — not just swapping weights.
Flow matching complements rather than replaces diffusion literacy; many pipelines will run both depending on prompt class and quality tier.