Guide

Consistency models explained

Harbor Media's asset pipeline already ran rectified-flow checkpoints at eight Euler steps for final hero banners. Designers still waited on preview grids: eight steps times twenty layout variants meant minutes per iteration. Distilling the same teacher into a latent consistency model (LCM) cut draft thumbnails to two steps with acceptable composition fidelity. Final renders still use the eight-step teacher; previews became cheap enough to run interactively in the browser.

Consistency models, introduced by Yang Song and colleagues, attack the core bottleneck of diffusion sampling: you normally need dozens of sequential network evaluations to walk from noise to a clean image. A consistency model learns a single function f(x_t, t) that maps any point along a denoising trajectory directly to the endpoint x₀. At inference, one or two forward passes can suffice. This guide defines the consistency property, contrasts training from scratch with distillation from a pretrained teacher, explains latent consistency models (LCM) alongside related few-step checkpoints such as SDXL-Turbo, covers multi-step “consistency sampling” for quality recovery, documents the Harbor Media refactor, compares techniques in a decision table, lists common pitfalls, and provides a production checklist to use alongside our knowledge distillation guide and model distillation guide.

The consistency property

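Diffusion and flow models define a family of noisy states {x_t} indexed by time t in [0, T] (or [0, 1]); in practice the lower end is a small positive noise level ε rather than exactly 0. Along a fixed probability-flow ODE (PF-ODE) trajectory, which is deterministic once its starting point is chosen, every x_t should collapse to the same clean sample x₀ when fully denoised. The consistency property encodes that idea directly: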

f(x_t, t) = x₀ for all t on the same trajectory.

A neural network f_θ approximates this mapping. The boundary condition f(x, ε) = x at the smallest noise level ε pins the function to the identity at the data end; implementations usually enforce it exactly through the parameterization rather than hoping the network learns it. At the highest noise level, f(x_T, T) jumps from pure Gaussian noise to a plausible image in one evaluation, with no iterative scheduler loop.
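One common way to make the boundary condition hold by construction is the skip-connection form f_θ(x, t) = c_skip(t)·x + c_out(t)·F_θ(x, t), with c_skip(ε) = 1 and c_out(ε) = 0. The sketch below assumes that form with σ_data = 0.5 and ε = 0.002; `raw_network` is a hypothetical stand-in for the real network F_θ, not anything from a specific codebase.

```python
import math

SIGMA_DATA = 0.5  # assumed standard deviation of the data
EPS = 0.002       # assumed smallest noise level


def c_skip(t: float) -> float:
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t: float) -> float:
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)


def raw_network(x: float, t: float) -> float:
    """Hypothetical placeholder for the free-form network F_theta."""
    return math.tanh(3.0 * x) - 0.1 * t


def consistency_fn(x: float, t: float) -> float:
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t)."""
    return c_skip(t) * x + c_out(t) * raw_network(x, t)


# At t = EPS the network output is multiplied by zero, so the input passes through.
boundary_output = consistency_fn(1.2345, EPS)
print(boundary_output)  # 1.2345
```

Because the identity at ε comes from the coefficients, it holds for any network weights, including an untrained network at step zero.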

Why this differs from standard diffusion

Diffusion UNet: Predicts noise, score, or velocity at one timestep; you chain 20–50 calls with a scheduler.
Consistency model: Predicts the final clean sample from the current noisy state; one call can finish generation.
Flow matching: Integrates a velocity field over several ODE steps; paths are straighter than DDPM but still multi-step unless distilled further.
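The difference in call counts can be seen on a toy one-dimensional problem. The sketch below assumes data drawn from N(0, σ_data²), for which both the ideal denoiser and the trajectory endpoint have closed forms; these closed forms stand in for trained networks, and the geometric step grid is an assumption rather than any particular scheduler.

```python
import math

SIGMA_DATA, EPS, T_MAX = 0.5, 0.002, 80.0  # assumed toy constants


def denoiser(x: float, t: float) -> float:
    """Ideal denoiser E[x0 | x_t] when x0 ~ N(0, SIGMA_DATA^2)."""
    return SIGMA_DATA**2 / (SIGMA_DATA**2 + t**2) * x


def euler_sample(x_start: float, num_steps: int) -> tuple[float, int]:
    """Integrate the PF-ODE dx/dt = (x - denoiser(x, t)) / t from T_MAX down to EPS."""
    # Geometrically spaced noise levels (an assumption; real schedulers vary).
    ts = [T_MAX * (EPS / T_MAX) ** (i / num_steps) for i in range(num_steps + 1)]
    x, calls = x_start, 0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        slope = (x - denoiser(x, t_cur)) / t_cur  # one network call per step
        calls += 1
        x += (t_next - t_cur) * slope
    return x, calls


def consistency_fn(x: float, t: float) -> float:
    """Closed-form trajectory endpoint for the same toy distribution: one call."""
    return x * math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)


x_T = 40.0  # arbitrary starting point at the highest noise level
x_multi, multi_calls = euler_sample(x_T, num_steps=50)
x_single = consistency_fn(x_T, T_MAX)
print(f"Euler: {x_multi:.4f} after {multi_calls} calls")
print(f"Consistency: {x_single:.4f} after 1 call")
```

The two results agree only up to the Euler discretization error; a learned consistency model adds its own approximation error on top.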

Consistency models sit at the extreme of the speed–quality frontier: maximal step reduction, with quality recovered partly by distilling a strong multi-step teacher rather than learning from scratch alone.

Training: consistency distillation vs consistency training

Two training recipes exist; distillation is the one most production deployments use.

Consistency distillation (CD)

Start from a pretrained diffusion or flow teacher that already generates high-quality samples with many steps, and keep its weights frozen (using the teacher's EMA weights if available). Noise a training sample to x_t, then run one ODE-solver step with the teacher along its PF-ODE to obtain x_t' at the adjacent, lower-noise timestep t'. Train the student so:

f_θ(x_t, t) ≈ f_θ⁻(x_t', t')

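where f_θ⁻ is an exponential-moving-average copy of the student (the target network), treated as a constant when computing gradients. The student learns to match its own one-step prediction at t to the EMA prediction one step closer to data at t'. After convergence, the student inherits the teacher's data distribution while supporting few-step inference. This is the practical path behind LCM and LCM-LoRA. SDXL-Turbo and SD-Turbo reach similar step counts through adversarial diffusion distillation, a related teacher-student method that adds a discriminator loss.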
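A single distillation step can be sketched on a toy one-dimensional problem. Everything here is an illustrative assumption: the data is N(0, σ_data²), the teacher is the ideal denoiser for that distribution, the solver is one Euler step, and `student` is a hypothetical one-parameter model whose exact solution is θ = 1. Real training backpropagates through a network instead.

```python
import math
import random

SIGMA_DATA, EPS = 0.5, 0.002  # assumed toy constants


def teacher_denoiser(x: float, t: float) -> float:
    """Ideal denoiser for x0 ~ N(0, SIGMA_DATA^2); stands in for the frozen teacher."""
    return SIGMA_DATA**2 / (SIGMA_DATA**2 + t**2) * x


def teacher_ode_step(x: float, t_from: float, t_to: float) -> float:
    """One Euler step of the teacher's PF-ODE from t_from to the lower t_to."""
    slope = (x - teacher_denoiser(x, t_from)) / t_from
    return x + (t_to - t_from) * slope


def student(x: float, t: float, theta: float) -> float:
    """Hypothetical one-parameter student; student(x, EPS, theta) == x for any theta."""
    shrink = math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)
    return x * shrink**theta


def cd_loss(theta, theta_ema, x0, noise, t_hi, t_lo):
    """Squared gap between the student at t_hi and the EMA target at t_lo."""
    x_hi = x0 + t_hi * noise                     # forward-noised sample
    x_lo = teacher_ode_step(x_hi, t_hi, t_lo)    # teacher moves it one step to data
    target = student(x_lo, t_lo, theta_ema)      # constant target, no gradient
    return (student(x_hi, t_hi, theta) - target) ** 2


def ema_update(theta_ema: float, theta: float, decay: float = 0.95) -> float:
    """Move the target parameters slowly toward the online student."""
    return decay * theta_ema + (1.0 - decay) * theta


rng = random.Random(0)
x0, noise = rng.gauss(0.0, SIGMA_DATA), rng.gauss(0.0, 1.0)
loss_exact = cd_loss(1.0, 1.0, x0, noise, t_hi=2.0, t_lo=1.9)  # consistent student
loss_off = cd_loss(0.5, 1.0, x0, noise, t_hi=2.0, t_lo=1.9)    # inconsistent student
new_ema = ema_update(theta_ema=1.0, theta=0.5)
print(loss_exact, loss_off, new_ema)
```

The consistent student leaves only the solver's discretization error in the loss, while the mismatched one is penalized; the EMA update is what keeps the target from chasing the student too quickly.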

Consistency training (CT)

Train from scratch without a teacher by enforcing the consistency property on trajectories estimated from the forward noising process. CT is more research-facing: sample x₀ from data, add the same noise draw at two adjacent levels to get x_t and x_t', and penalize disagreement between f_θ(x_t, t) and f_θ⁻(x_t', t'). The boundary condition at t ≈ 0, usually built into the parameterization rather than added as a separate reconstruction loss, anchors the solution so the network cannot collapse to a constant. CT can reach competitive FID on CIFAR-10 but demands careful noise schedules and longer training than distilling an existing SD checkpoint.
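The main mechanical difference from distillation is how the lower-noise point is built. A minimal sketch, assuming the noising convention x_t = x₀ + t·z used in the original consistency-training recipe (the function name `ct_pair` is hypothetical):

```python
import random


def ct_pair(x0: float, t_hi: float, t_lo: float, rng: random.Random) -> tuple[float, float]:
    """Build the two noisy inputs for one consistency-training example.

    The same noise draw z is reused at both levels, so the pair approximates
    two points on one trajectory without calling a teacher.
    """
    z = rng.gauss(0.0, 1.0)
    return x0 + t_hi * z, x0 + t_lo * z


x_hi, x_lo = ct_pair(0.3, t_hi=2.0, t_lo=1.9, rng=random.Random(0))
print(x_hi, x_lo)
```

In distillation the second point would come from a teacher solver step instead; here the shared noise plays that role, which is why CT needs no pretrained model.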

Latent consistency models (LCM)

Full-resolution pixel consistency is expensive. LCM applies consistency distillation in the VAE latent space of Stable Diffusion: the teacher is a standard SD UNet operating on 64×64 latents (for 512×512 images); the student learns two- to four-step latent denoising. Decode with the frozen VAE. Public few-step checkpoints ship as drop-in replacements (LCM, and the adversarially distilled SDXL-Turbo and SD-Turbo) or as LoRA adapters (LCM-LoRA). Guidance is baked into distillation: classifier-free guidance scale is fixed at training time, which limits runtime CFG tuning.
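The saving from working in latent space is easy to quantify. The sketch below assumes the usual Stable Diffusion VAE layout (8× spatial downsampling, 4 latent channels); `latent_shape` is a hypothetical helper, not a library function.

```python
def latent_shape(height: int, width: int, downsample: int = 8, channels: int = 4) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent tensor for an image size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be a multiple of the VAE downsample factor")
    return channels, height // downsample, width // downsample


c, h, w = latent_shape(512, 512)
pixel_values = 3 * 512 * 512   # RGB image
latent_values = c * h * w      # tensor the UNet actually denoises
print((c, h, w), pixel_values // latent_values)  # (4, 64, 64) 48
```

Each UNet call therefore touches 48 times fewer values than a pixel-space model at the same resolution, which is what makes distilling at SD-class resolutions affordable.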

Inference: one-step, few-step, and consistency sampling

Pure one-step inference (t = T → 0 in a single jump) is fastest but can blur fine detail or drift on out-of-distribution prompts. Production systems usually use:

Two to four steps: Schedule a short decreasing sequence T = t₁ > t₂ > … > 0. Apply f at t₁ to get a clean estimate, re-noise that estimate to the next level t₂, apply f again, and repeat (consistency sampling). Harbor Media's preview pipeline uses two steps.
Teacher-student cascade: LCM draft at two steps, then img2img refine with the eight-step flow teacher at low denoise strength for finals.
Step distillation hybrids: Combine LCM with speculative-style draft-and-verify when batch size is one and latency dominates.
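Multi-step consistency sampling alternates a denoise call with a re-noise to the next level. The sketch below runs it on a toy one-dimensional problem where the data is N(0, σ_data²) and the consistency function is known in closed form; that closed form stands in for a trained student, and the constants and the second noise level of 1.0 are assumptions.

```python
import math
import random

SIGMA_DATA, EPS, T_MAX = 0.5, 0.002, 80.0  # assumed toy constants


def consistency_fn(x: float, t: float) -> float:
    """Closed-form endpoint map for x0 ~ N(0, SIGMA_DATA^2); stands in for f_theta."""
    return x * math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_sample(taus: list[float], rng: random.Random) -> float:
    """Sample with len(taus) model calls; taus must decrease and start at T_MAX."""
    x = consistency_fn(rng.gauss(0.0, T_MAX), taus[0])  # first jump from pure noise
    for tau in taus[1:]:
        # Re-noise the clean estimate up to level tau, then jump back to data.
        x_noisy = x + math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_noisy, tau)
    return x


rng = random.Random(0)
two_step = [consistency_sample([T_MAX, 1.0], rng) for _ in range(20000)]
sample_std = math.sqrt(sum(v * v for v in two_step) / len(two_step))
print(f"std of two-step samples: {sample_std:.3f} (data std is {SIGMA_DATA})")
```

With an exact consistency function the extra step changes nothing statistically; with a learned, imperfect one, each re-noise and jump gives the model another chance to correct errors from the previous jump.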

UNet time grows roughly linearly with step count, while text encoding and VAE decode are fixed costs per image. Measure on your resolution and batch size: a 1024×1024 SDXL LCM at two steps can still miss a real-time latency budget on consumer GPUs if the VAE decode is not batched.
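A simple budget model shows why cutting steps does not cut latency proportionally. The timings below are hypothetical placeholders, not measurements; substitute profiled numbers from your own hardware.

```python
def latency_ms(steps: int, unet_ms_per_step: int, text_encoder_ms: int, vae_decode_ms: int) -> int:
    """End-to-end latency: fixed costs plus a per-step UNet cost."""
    return text_encoder_ms + steps * unet_ms_per_step + vae_decode_ms


# Hypothetical timings for illustration only.
HYPOTHETICAL = dict(unet_ms_per_step=60, text_encoder_ms=15, vae_decode_ms=120)

budget = {steps: latency_ms(steps, **HYPOTHETICAL) for steps in (1, 2, 4, 8)}
for steps, total in budget.items():
    print(f"{steps} step(s): {total} ms")
```

Under these assumed numbers, going from eight steps (615 ms) to two (255 ms) is a 2.4× speedup rather than 4×, because text encoding and VAE decode account for more than half of the two-step total.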

Harbor Media LCM thumbnail refactor

Harbor's marketing team generates hero banners, social crops, and email headers from shared prompt templates. After the rectified-flow migration, final quality was stable at eight steps, but the review UI still felt sluggish.

Teacher lock — Kept the eight-step rectified-flow UNet as the quality teacher; froze weights and logged PF-ODE trajectories on 50k internal prompt pairs.
LCM distillation — Trained a latent consistency student with CD for 40k steps on 512×512 latents; CFG scale fixed at 7.5 to match production prompts.
Preview tier — Browser UI calls the two-step LCM for grid previews (4×5 variants); designers discard 80% before any eight-step render.
Final tier — Selected previews upscale and refine with the teacher at denoise 0.35; composition anchors from LCM reduce teacher step count sensitivity.
Eval gate — Internal CLIP-score and human pairwise tests; LCM previews within 4% win-rate of teacher-only drafts at 12× lower preview cost.
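The tiering above can be written down as a small configuration plus a UNet-call count for one review round. This is an illustrative sketch: the `Tier` class and function names are hypothetical, the counts ignore any classifier-free-guidance doubling, and the final tier is counted at its full eight steps as an upper bound, since how many steps run at denoise 0.35 depends on the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    steps: int
    denoise_strength: float  # 1.0 means generating from pure noise


PREVIEW = Tier("lcm_preview", steps=2, denoise_strength=1.0)
FINAL = Tier("teacher_final", steps=8, denoise_strength=0.35)


def unet_calls_per_round(rows: int, cols: int, keep_fraction: float) -> dict[str, int]:
    """Count UNet calls for one grid review, with and without the preview tier."""
    variants = rows * cols
    finals = round(variants * keep_fraction)
    return {
        "preview_calls": variants * PREVIEW.steps,
        "final_calls_upper_bound": finals * FINAL.steps,
        "teacher_only_calls": variants * FINAL.steps,
    }


# A 4x5 grid where designers discard 80% of previews before final rendering.
calls = unet_calls_per_round(rows=4, cols=5, keep_fraction=0.2)
print(calls)
```

For the 4×5 grid this gives 40 preview calls plus at most 32 final calls, against 160 calls if every variant went through the eight-step teacher.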

Outcome: median design iteration dropped from 11 minutes to 90 seconds. Final banner quality unchanged. Lesson: consistency distillation is a tiering tool — cheap drafts, expensive finals — not always a full replacement for multi-step teachers.

Technique decision table

Approach	Typical steps	Best when	Watch out for
DDPM / DDIM diffusion	20–50	Maximum quality, research baselines, fine detail	Slow; scheduler tuning; high GPU cost per image
Flow matching / rectified flow	4–16	Balanced quality and speed; SD3-class models	Still multi-step; needs ODE solver choice
Consistency distillation (LCM)	1–4	Interactive previews, mobile, high-volume drafts	Fixed CFG; detail loss at 1 step; teacher dependency
Consistency training (scratch)	1–2	No teacher available; small datasets (CIFAR-scale)	Hard at 1024px; long training; schedule sensitivity
GAN	1	Ultra-low latency, fixed resolution	Mode collapse; limited diversity vs diffusion
Cascade (LCM + teacher)	2 + 4–8	Production UIs with draft/final tiers	Two models to ship; alignment between tiers

Common pitfalls

Expecting one-step to match fifty-step FID — Use LCM for previews or latency-critical paths; keep a teacher for finals.
Distilling a weak teacher — CD inherits teacher ceilings; garbage in, garbage out.
Changing CFG at inference — LCM/Turbo bake guidance into weights; arbitrary CFG breaks composition.
Skipping EMA target network — Training destabilizes without f_θ⁻; follow reference hyperparameters.
Pixel-space consistency at 1K+ — Latent CD is standard for SD-class resolutions; pixel CT does not scale cheaply.
Ignoring VAE decode cost — Two-step UNet plus full decode can dominate latency on small GPUs.
Prompt distribution shift — Distill on prompts matching production; anime-heavy LCM fails on photoreal product shots.
No pairwise human eval — FID misses layout and text fidelity; run side-by-side tests on real briefs.
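For the pairwise tests, a win rate on its own says little without an uncertainty estimate. The sketch below computes a win rate with a Wilson score interval; the vote counts are hypothetical and ties are assumed to be excluded beforehand.

```python
import math


def win_rate_with_interval(wins: int, losses: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win rate, lower bound, upper bound) using the Wilson score interval."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


# Hypothetical tally: student preview preferred in 92 of 200 side-by-side votes.
rate, low, high = win_rate_with_interval(wins=92, losses=108)
print(f"win rate {rate:.2f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the interval still contains 0.5, the test has not shown that raters prefer either model, and more comparisons on real briefs are needed before treating a gap as real.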

Production checklist

Lock a strong multi-step teacher before distillation; log trajectories on production prompts.
Choose latent vs pixel CD based on resolution and existing VAE availability.
Fix CFG scale during distillation to match intended inference prompts.
Train with EMA target network; use reference learning rate and timestep spacing.
Benchmark 1-, 2-, and 4-step FID/CLIP-score on an internal holdout set.
Run human pairwise eval on composition, not just aggregate metrics.
Ship preview tier (LCM) and final tier (teacher) with clear UI labeling.
Profile end-to-end latency including VAE decode and text encoder.
Version-control LoRA adapters separately from base UNet weights.
Plan fallback to teacher-only path if LCM fails on edge-case prompts.

Key takeaways

Consistency models learn f(x_t, t) that maps any point on a denoising trajectory directly to the clean sample x₀.
Consistency distillation from a strong diffusion or flow teacher is the practical path to LCM, LCM-LoRA, and two-step preview pipelines; SDXL-Turbo reaches similar step counts with a related adversarial distillation method.
One-step inference is fastest but blurrier; two to four consistency-sampling steps recover most teacher quality at a fraction of GPU cost.
Harbor Media's cascade uses LCM for interactive drafts and the eight-step rectified-flow teacher for final banners.
Match technique to tier: LCM for latency-sensitive previews, multi-step flow or diffusion for maximum fidelity.