Guide
Consistency models explained
Harbor Media's asset pipeline already ran rectified-flow checkpoints at eight Euler steps for final hero banners. Designers still waited on preview grids: eight steps times twenty layout variants meant minutes per iteration. Distilling the same teacher into a latent consistency model (LCM) cut draft thumbnails to two steps with acceptable composition fidelity. Final renders still use the eight-step teacher; previews became cheap enough to run interactively in the browser.
Consistency models, introduced by Yang Song and colleagues, attack the core bottleneck of diffusion sampling: you normally need dozens of sequential network evaluations to walk from noise to a clean image. A consistency model learns a single function f(x_t, t) that maps any point along a denoising trajectory directly to the endpoint x_0. At inference, one or two forward passes can suffice. This guide defines the consistency property, contrasts training from scratch with distillation from a pretrained teacher, explains LCM and SDXL-Turbo-style latent variants, covers multi-step “consistency sampling” for quality recovery, documents the Harbor Media refactor, compares techniques in a decision table, lists common pitfalls, and provides a production checklist alongside our knowledge distillation guide and model distillation guide.
The consistency property
Diffusion and flow models define a family of noisy states {x_t} indexed by time t in [0, T] (or [0, 1]). Along a fixed probability-flow ODE (PF-ODE) trajectory, every x_t should collapse to the same clean sample x_0 when fully denoised. The consistency property encodes that idea directly:
f(x_t, t) = x_0 for all t on the same trajectory.
A neural network f_θ approximates this mapping. The boundary condition f(x, ε) ≈ x at the smallest noise level ε keeps the mapping close to the identity at the data end. At the highest noise level, f(x_T, T) jumps from pure Gaussian noise to a plausible image in one evaluation, with no iterative scheduler loop required.
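The property is easiest to see on a case with a closed-form answer. The sketch below assumes one-dimensional Gaussian data and a variance-exploding noising process in which the noise standard deviation equals t; under those assumptions the PF-ODE trajectories can be written down exactly. The constants `MU`, `SIGMA_DATA`, and `EPS` are illustrative choices, not values from this guide. As in practice, the function maps to the trajectory point at the smallest noise level `EPS`, which stands in for the clean sample.

```python
import math

# Assumed toy setup: data ~ N(MU, SIGMA_DATA^2), noise std at time t equals t.
MU, SIGMA_DATA, EPS = 2.0, 0.5, 0.002


def trajectory_point(c, t):
    """Point at noise level t on the PF-ODE trajectory labelled by constant c."""
    return MU + c * math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_fn(x, t):
    """Exact consistency function for this toy case: (x_t, t) -> trajectory point at EPS."""
    scale = math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)
    return MU + (x - MU) * scale


# Every point on one trajectory should map to the same output.
c = 1.3
outputs = [consistency_fn(trajectory_point(c, t), t) for t in (EPS, 0.5, 5.0, 80.0)]
print(outputs)
```

The four printed values agree up to floating-point error, and `consistency_fn(x, EPS)` returns `x`, which is the boundary condition. A trained network only approximates this behaviour.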
Why this differs from standard diffusion
- Diffusion UNet: Predicts noise, score, or velocity at one timestep; you chain 20–50 calls with a scheduler.
- Consistency model: Predicts the final clean sample from the current noisy state; one call can finish generation.
- Flow matching: Integrates a velocity field over several ODE steps; paths are straighter than DDPM but still multi-step unless distilled further.
Consistency models sit at the extreme of the speed–quality frontier: maximal step reduction, with quality recovered partly by distilling a strong multi-step teacher rather than learning from scratch alone.
Training: consistency distillation vs consistency training
Two training recipes dominate production deployments.
Consistency distillation (CD)
Start from a pretrained diffusion or flow teacher that already generates high-quality samples with many steps. Freeze the teacher (or use its EMA weights). Build a trajectory pair (x_t, x_t') at adjacent timesteps t > t' along the teacher's PF-ODE: noise a data sample to x_t, then take one ODE-solver step with the teacher to reach x_t'. Train the student so:
f_θ(x_t, t) ≈ f_θ⁻(x_t', t')
where f_θ⁻ is an exponential-moving-average copy of the student (the target network). The student learns to match its own prediction at t to the EMA prediction one step closer to data at t'.
After convergence, the student inherits the teacher's data distribution while supporting few-step inference. This is the practical path behind LCM and LCM-LoRA; SDXL-Turbo reaches similar step counts with a related recipe, adversarial diffusion distillation.
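One distillation step can be sketched on the same kind of toy problem. The code below assumes one-dimensional Gaussian data, a teacher whose score is known analytically, a single Euler step as the ODE solver, and a squared-error distance. All names (`teacher_score`, `cd_loss`, `ema_update`) are hypothetical and come from no particular library.

```python
import math

# Assumed toy setup: data ~ N(MU, SIGMA_DATA^2), noise std at time t equals t.
MU, SIGMA_DATA, EPS = 0.0, 1.0, 0.002


def teacher_score(x, t):
    """Analytic score of the noisy marginal N(MU, SIGMA_DATA^2 + t^2)."""
    return -(x - MU) / (SIGMA_DATA**2 + t**2)


def teacher_euler_step(x, t_hi, t_lo):
    """One Euler step of the PF-ODE dx/dt = -t * score(x, t), from t_hi down to t_lo."""
    drift = -t_hi * teacher_score(x, t_hi)
    return x + (t_lo - t_hi) * drift


def exact_consistency(x, t):
    """Closed-form consistency function for this toy case."""
    scale = math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)
    return MU + (x - MU) * scale


def cd_loss(student, target, x0, noise, t_hi, t_lo):
    """Squared disagreement between student at t_hi and EMA target at t_lo."""
    x_hi = x0 + t_hi * noise                     # noise the data sample
    x_lo = teacher_euler_step(x_hi, t_hi, t_lo)  # teacher moves one step toward data
    return (student(x_hi, t_hi) - target(x_lo, t_lo)) ** 2


def ema_update(target_params, student_params, decay=0.95):
    """Move target-network parameters toward the student's."""
    return [decay * tp + (1.0 - decay) * sp
            for tp, sp in zip(target_params, student_params)]


good = cd_loss(exact_consistency, exact_consistency, 0.5, 1.0, t_hi=1.0, t_lo=0.9)
bad = cd_loss(lambda x, t: x, exact_consistency, 0.5, 1.0, t_hi=1.0, t_lo=0.9)
print(good, bad)
```

A student that already satisfies the consistency property scores near zero (the residual is Euler discretization error), while an identity student is penalized heavily. A real run would average this loss over batches and backpropagate through the student only.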
Consistency training (CT)
Train from scratch without a teacher by enforcing the consistency property on synthetic trajectories built from the forward noising process. CT is more research-facing: sample x_0 from data, noise it to x_t and x_t' at adjacent noise levels using the same noise sample, and penalize disagreement between f_θ(x_t, t) and f_θ⁻(x_t', t'), plus a boundary constraint at t ≈ 0 that keeps the output equal to the input. CT can reach competitive FID on CIFAR-10 but demands careful noise schedules and longer training than distilling an existing SD checkpoint.
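Two ingredients of consistency training are easy to sketch: building the pair from a shared noise sample, and holding the boundary at the smallest noise level. For the boundary, the sketch uses a skip-connection parameterization with the coefficient formulas I recall from the original consistency models paper; treat the exact formulas and the constants `SIGMA_DATA` and `EPS` as assumptions to check against your reference implementation.

```python
import math
import random

SIGMA_DATA, EPS = 0.5, 0.002  # assumed values


def c_skip(t):
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    """Weight on the raw network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_output(raw_net, x, t):
    """Wrap any raw network so that the output equals x at t = EPS."""
    return c_skip(t) * x + c_out(t) * raw_net(x, t)


def ct_pair(x0, t_hi, t_lo, rng):
    """Noise one data sample to two adjacent levels with the SAME noise draw."""
    z = rng.gauss(0.0, 1.0)
    return x0 + t_hi * z, x0 + t_lo * z


def arbitrary_net(x, t):
    return 123.0  # stand-in for an untrained network


print(consistency_output(arbitrary_net, 0.7, EPS))
print(ct_pair(1.0, 2.0, 1.0, random.Random(0)))
```

Because the boundary is built into the parameterization, it holds for any network weights, including at initialization.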
Latent consistency models (LCM)
Full-resolution pixel consistency is expensive. LCM applies consistency distillation in the VAE latent space of Stable Diffusion: the teacher is a standard SD UNet operating on 64×64 latents, and the student learns two- to four-step latent denoising. Decode with the frozen VAE. Public few-step checkpoints ship as drop-in replacements or LoRA adapters: LCM and LCM-LoRA use consistency distillation, while SD-Turbo and SDXL-Turbo use the related adversarial diffusion distillation. Guidance is baked in during distillation, so the classifier-free guidance scale is fixed at training time, which limits runtime CFG tuning.
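The saving from working in latent space is simple arithmetic. The sketch below assumes the Stable Diffusion v1-style VAE layout (8× spatial downsampling, 4 latent channels); the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent tensor shape (C, H, W) for an image of the given pixel size."""
    return (channels, height // downsample, width // downsample)


def element_count(shape):
    """Number of scalar values in a tensor of this shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)       # RGB image
latent = latent_shape(512, 512)   # what the UNet actually denoises
ratio = element_count(pixel_shape) / element_count(latent)
print(latent, ratio)
```

Under these assumptions a 512×512 image becomes a 4×64×64 latent, so each network evaluation touches 48 times fewer values than it would in pixel space.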
Inference: one-step, few-step, and consistency sampling
Pure one-step inference (t = T → 0 in a single jump) is fastest but can
blur fine detail or drift on out-of-distribution prompts. Production systems usually use:
- Two to four steps: Schedule a short decreasing sequence T = t_1 > t_2 > … > 0; apply f at each level, re-noise lightly between steps (consistency sampling) or chain predictions. Harbor Media's preview pipeline uses two steps.
- Teacher-student cascade: LCM draft at two steps, then img2img refine with the eight-step flow teacher at low denoise strength for finals.
- Step distillation hybrids: Combine LCM with speculative-style draft-and-verify when batch size is one and latency dominates.
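The re-noise-then-predict loop behind few-step consistency sampling can be sketched as follows. A closed-form consistency function for one-dimensional Gaussian data stands in for the trained network, and the constants and two-level schedule are illustrative assumptions rather than Harbor Media's settings.

```python
import math
import random
import statistics

# Assumed toy setup: data ~ N(MU, SIGMA_DATA^2), noise std at time t equals t.
MU, SIGMA_DATA, EPS, T_MAX = 2.0, 0.5, 0.002, 80.0


def consistency_fn(x, t):
    """Stand-in for a trained model: exact consistency function for Gaussian data."""
    scale = math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)
    return MU + (x - MU) * scale


def consistency_sample(f, timesteps, rng):
    """Multi-step consistency sampling over a decreasing list of noise levels."""
    # Step 1: jump from pure noise at the highest level straight to a clean estimate.
    x = f(rng.gauss(0.0, timesteps[0]), timesteps[0])
    for t in timesteps[1:]:
        # Re-noise the estimate to the next (lower) level, then predict again.
        noisy = x + math.sqrt(t**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = f(noisy, t)
    return x


rng = random.Random(0)
samples = [consistency_sample(consistency_fn, [T_MAX, 1.0], rng) for _ in range(2000)]
print(statistics.mean(samples), statistics.pstdev(samples))
```

With the two-level schedule `[T_MAX, 1.0]`, the sample mean and standard deviation land close to the data values `MU` and `SIGMA_DATA`. With a learned model, the extra step mainly corrects errors made by the first jump.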
UNet time grows roughly linearly with step count, while text-encoder and VAE decode costs are paid once per image. Measure on your resolution and batch size: a 1024×1024 SDXL LCM at two steps can still miss an interactive latency budget on consumer GPUs if the VAE decode is not batched.
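A back-of-the-envelope latency model shows why fixed costs matter more once the step count drops. The millisecond figures below are made-up placeholders, not measurements; substitute numbers profiled on your own hardware.

```python
def image_latency_ms(steps, unet_ms_per_step, vae_decode_ms, text_encoder_ms=0.0):
    """End-to-end latency: per-step UNet cost plus one-off encoder and decoder costs."""
    return text_encoder_ms + steps * unet_ms_per_step + vae_decode_ms


def unet_share(steps, unet_ms_per_step, vae_decode_ms, text_encoder_ms=0.0):
    """Fraction of total latency spent in the UNet."""
    total = image_latency_ms(steps, unet_ms_per_step, vae_decode_ms, text_encoder_ms)
    return steps * unet_ms_per_step / total


# Hypothetical profile: 40 ms per UNet step, 120 ms VAE decode, 10 ms text encoder.
two_step = image_latency_ms(2, 40.0, 120.0, 10.0)
eight_step = image_latency_ms(8, 40.0, 120.0, 10.0)
print(two_step, eight_step, unet_share(2, 40.0, 120.0, 10.0))
```

In this hypothetical profile, cutting steps from eight to two reduces UNet work by 4× but total latency by only about 2×, because the decode now dominates.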
Harbor Media LCM thumbnail refactor
Harbor's marketing team generates hero banners, social crops, and email headers from shared prompt templates. After the rectified-flow migration, final quality was stable at eight steps, but the review UI still felt sluggish.
- Teacher lock — Kept the eight-step rectified-flow UNet as the quality teacher; froze weights and logged PF-ODE trajectories on 50k internal prompt pairs.
- LCM distillation — Trained a latent consistency student with CD for 40k steps on 512×512 latents; CFG scale fixed at 7.5 to match production prompts.
- Preview tier — Browser UI calls the two-step LCM for grid previews (4×5 variants); designers discard 80% before any eight-step render.
- Final tier — Selected previews upscale and refine with the teacher at denoise 0.35; composition anchors from LCM reduce teacher step count sensitivity.
- Eval gate — Internal CLIP-score and human pairwise tests; LCM previews within 4% win-rate of teacher-only drafts at 12× lower preview cost.
Outcome: median design iteration dropped from 11 minutes to 90 seconds. Final banner quality unchanged. Lesson: consistency distillation is a tiering tool — cheap drafts, expensive finals — not always a full replacement for multi-step teachers.
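The tiering effect can be illustrated by counting UNet evaluations for one review grid. This sketch uses the grid size, step counts, and discard rate quoted above, and assumes each finalist costs a full eight-step teacher pass. It counts evaluations only, so it is not meant to reproduce the measured cost or time figures.

```python
def unet_evals_teacher_only(variants, teacher_steps):
    """Every variant in the grid is rendered with the multi-step teacher."""
    return variants * teacher_steps


def unet_evals_cascade(variants, preview_steps, keep_fraction, teacher_steps):
    """All variants get a cheap preview; only the kept fraction gets a teacher pass."""
    finalists = round(variants * keep_fraction)
    return variants * preview_steps + finalists * teacher_steps


grid = 4 * 5  # 4x5 preview grid
baseline = unet_evals_teacher_only(grid, teacher_steps=8)
cascade = unet_evals_cascade(grid, preview_steps=2, keep_fraction=0.2, teacher_steps=8)
print(baseline, cascade)
```

Under these assumptions the cascade needs 72 evaluations against 160 for the teacher-only grid, and the gap widens as designers discard a larger share of previews.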
Technique decision table
| Approach | Typical steps | Best when | Watch out for |
|---|---|---|---|
| DDPM / DDIM diffusion | 20–50 | Maximum quality, research baselines, fine detail | Slow; scheduler tuning; high GPU cost per image |
| Flow matching / rectified flow | 4–16 | Balanced quality and speed; SD3-class models | Still multi-step; needs ODE solver choice |
| Consistency distillation (LCM) | 1–4 | Interactive previews, mobile, high-volume drafts | Fixed CFG; detail loss at 1 step; teacher dependency |
| Consistency training (scratch) | 1–2 | No teacher available; small datasets (CIFAR-scale) | Hard at 1024px; long training; schedule sensitivity |
| GAN | 1 | Ultra-low latency, fixed resolution | Mode collapse; limited diversity vs diffusion |
| Cascade (LCM + teacher) | 2 + 4–8 | Production UIs with draft/final tiers | Two models to ship; alignment between tiers |
Common pitfalls
- Expecting one-step to match fifty-step FID: Use LCM for previews or latency-critical paths; keep a teacher for finals.
- Distilling a weak teacher: CD inherits teacher ceilings; garbage in, garbage out.
- Changing CFG at inference: LCM/Turbo bake guidance into weights; arbitrary CFG breaks composition.
- Skipping EMA target network: Training destabilizes without f_θ⁻; follow reference hyperparameters.
- Pixel-space consistency at 1K+: Latent CD is standard for SD-class resolutions; pixel CT does not scale cheaply.
- Ignoring VAE decode cost: Two-step UNet plus full decode can dominate latency on small GPUs.
- Prompt distribution shift: Distill on prompts matching production; anime-heavy LCM fails on photoreal product shots.
- No pairwise human eval: FID misses layout and text fidelity; run side-by-side tests on real briefs.
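For the pairwise evaluation in the last item, a win rate can be computed from side-by-side votes. The vote labels and the convention of counting a tie as half a win are assumptions made for this sketch, not a description of Harbor Media's protocol.

```python
def win_rate(votes, candidate="lcm"):
    """Share of comparisons won by `candidate`; a 'tie' counts as half a win."""
    if not votes:
        raise ValueError("need at least one vote")
    wins = sum(1.0 for v in votes if v == candidate)
    ties = sum(0.5 for v in votes if v == "tie")
    return (wins + ties) / len(votes)


# Hypothetical votes from four side-by-side comparisons on real briefs.
votes = ["lcm", "teacher", "tie", "teacher"]
print(win_rate(votes))
```

A win rate near 0.5 means raters could not reliably tell the two tiers apart for the attribute being judged.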
Production checklist
- Lock a strong multi-step teacher before distillation; log trajectories on production prompts.
- Choose latent vs pixel CD based on resolution and existing VAE availability.
- Fix CFG scale during distillation to match intended inference prompts.
- Train with EMA target network; use reference learning rate and timestep spacing.
- Benchmark 1-, 2-, and 4-step FID/CLIP-score on an internal holdout set.
- Run human pairwise eval on composition, not just aggregate metrics.
- Ship preview tier (LCM) and final tier (teacher) with clear UI labeling.
- Profile end-to-end latency including VAE decode and text encoder.
- Version-control LoRA adapters separately from base UNet weights.
- Plan fallback to teacher-only path if LCM fails on edge-case prompts.
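The fallback item can be sketched as a small routing function. Everything here is hypothetical: `lcm_render`, `teacher_render`, and `quality_gate` stand in for whatever callables your pipeline exposes, and catching `RuntimeError` is only one possible failure signal.

```python
def render_preview(prompt, lcm_render, teacher_render, quality_gate):
    """Try the cheap LCM tier first; fall back to the teacher if it fails or is rejected."""
    try:
        image = lcm_render(prompt)
        if quality_gate(image):
            return image, "lcm"
    except RuntimeError:
        pass  # e.g. an out-of-memory or backend error in the preview tier
    return teacher_render(prompt), "teacher"


# Stub renderers for illustration only.
def ok_lcm(prompt):
    return "draft:" + prompt


def failing_lcm(prompt):
    raise RuntimeError("preview tier unavailable")


def teacher(prompt):
    return "final:" + prompt


print(render_preview("banner", ok_lcm, teacher, lambda img: True))
print(render_preview("banner", failing_lcm, teacher, lambda img: True))
```

Returning the tier name with the image makes it straightforward to label previews and finals in the UI and to log how often the fallback fires.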
Key takeaways
- Consistency models learn f(x_t, t) that maps any point on a denoising trajectory directly to the clean sample x_0.
- Consistency distillation from a strong diffusion or flow teacher is the practical path to LCM-style checkpoints and two-step preview pipelines; SDXL-Turbo reaches similar step counts via adversarial diffusion distillation.
- One-step inference is fastest but blurrier; two to four consistency-sampling steps recover most teacher quality at a fraction of GPU cost.
- Harbor Media's cascade uses LCM for interactive drafts and the eight-step rectified-flow teacher for final banners.
- Match technique to tier: LCM for latency-sensitive previews, multi-step flow or diffusion for maximum fidelity.
Related reading
- Diffusion models explained — DDPM, schedulers, and the teachers consistency models distill
- Flow matching explained — rectified flow, ODE sampling, and SD3 velocity fields
- Knowledge distillation explained — teacher-student training patterns beyond generative models
- LLM model distillation explained — compressing large models for faster inference