Guide
Weights & Biases fundamentals explained
A computer-vision team trains twelve ResNet variants overnight. Each engineer logs to a different spreadsheet; confusion matrices live in Slack threads; nobody can reproduce the 94.2% validation accuracy from Tuesday. Weights & Biases (W&B) is a cloud-first experiment platform that unifies training runs into searchable projects with live dashboards for metrics, images, hyperparameters, and system utilization. Unlike file-based trackers, W&B optimizes for collaborative visibility — parallel coordinates plots, run comparison tables, and shareable report pages that research teams actually open. It pairs naturally with PyTorch, Hugging Face Transformers, and Optuna sweeps. This guide covers projects and runs, config and logging APIs, Artifacts and Model Registry, Sweeps, a Harbor Analytics defect classifier worked example, a tooling decision table versus MLflow, common pitfalls, and a practitioner checklist — building on our broader MLOps overview.
What W&B is and when to adopt it
W&B is a SaaS platform (with optional self-hosted deployment) for logging, comparing, and governing machine learning experiments. The Python SDK wraps training loops with a few lines; the web UI renders time-series metrics, media panels, and hyperparameter sweeps without custom dashboard code.
Core primitives
- Entity — your organization or personal account namespace.
- Project — a bucket for related runs (e.g. harbor-defect-classifier).
- Run — one training execution with config, metrics, logs, and artifacts.
- Artifact — versioned datasets, model checkpoints, or evaluation outputs with lineage.
- Sweep — managed hyperparameter search coordinating many runs from a YAML config.
W&B shines when teams need rich media logging (image grids, attention maps, 3D point clouds), real-time collaboration on shared dashboards, and low-friction sweep orchestration without standing up a separate scheduler. Choose MLflow instead when policy requires fully self-hosted tracking with no external SaaS, or when Databricks-native registry integration is mandatory.
Projects, runs, and the logging API
Initialize a run at the top of your training script:
import wandb
# Keep the returned run object; later snippets call methods on it
run = wandb.init(
    project="harbor-defect-classifier",
    name="resnet50-baseline",
    config={"lr": 3e-4, "batch_size": 32, "epochs": 50},
)
# ... training loop: compute loss and auc for each epoch ...
wandb.log({"train_loss": loss, "val_auc": auc}, step=epoch)
wandb.finish()
The config dict is stored as the run's hyperparameters and appears in the
UI's parallel-coordinates view. Log metrics with wandb.log; each key
becomes a chartable time series. Pass step explicitly when logging
multiple metrics per epoch so charts align correctly.
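As a minimal sketch of the per-epoch pattern described above, the helper below averages mini-batch losses and makes one logging call per epoch with an explicit step. The names log_epoch and RecordingRun are hypothetical; RecordingRun only stands in for the object returned by wandb.init() so the example runs without a W&B account.

```python
from statistics import fmean


class RecordingRun:
    """Hypothetical stand-in for a W&B run; it only records log() calls."""

    def __init__(self):
        self.history = []

    def log(self, data, step=None):
        self.history.append((step, dict(data)))


def log_epoch(run, epoch, batch_losses, val_auc):
    """Log one row per epoch so train and validation metrics share a step."""
    metrics = {"train_loss": fmean(batch_losses), "val_auc": val_auc}
    run.log(metrics, step=epoch)
    return metrics


run = RecordingRun()
log_epoch(run, epoch=0, batch_losses=[0.9, 0.7], val_auc=0.81)
log_epoch(run, epoch=1, batch_losses=[0.6, 0.4], val_auc=0.88)
```

With the real SDK, passing the object returned by wandb.init(...) in place of RecordingRun() should behave the same way, since its log method also takes a dict and a step keyword.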
Config updates and grouping
Call wandb.config.update({"lr": new_lr}, allow_val_change=True) only for
values that truly change mid-run (learning-rate schedules); overwriting an
existing key requires allow_val_change=True. Group related runs by passing
group="sweep-2026-06-09" or job_type="train" to wandb.init so the UI
clusters variants. Tags like dataset=v4 enable filtered views across
hundreds of historical runs.
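One way to keep grouping and tagging consistent across scripts is to build the wandb.init arguments in a single place. The sketch below assumes a hypothetical helper, build_init_kwargs, and a placeholder commit hash; only group, job_type, and tags are real wandb.init parameters.

```python
def build_init_kwargs(project, group, job_type, dataset_version, git_commit):
    """Assemble keyword arguments for wandb.init (hypothetical helper)."""
    return {
        "project": project,
        "group": group,
        "job_type": job_type,
        # Tags are plain strings; the key=value look is only a naming convention.
        "tags": [f"dataset={dataset_version}", f"commit={git_commit[:7]}"],
    }


kwargs = build_init_kwargs(
    project="harbor-defect-classifier",
    group="sweep-2026-06-09",
    job_type="train",
    dataset_version="v4",
    git_commit="0123456789abcdef",  # placeholder commit hash
)
```

The result can be unpacked as wandb.init(**kwargs), which keeps every training script on the same naming scheme.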
Rich media logging
W&B’s differentiator is first-class media:
wandb.log({
    "predictions": wandb.Image(img, caption="epoch 10"),
    "confusion_matrix": wandb.plot.confusion_matrix(
        probs=None, y_true=labels, preds=preds, class_names=classes
    ),
    "gradients": wandb.Histogram(grad_norms),
})
Log attention heatmaps, audio spectrograms, or point-cloud renders during validation epochs. The UI overlays runs side-by-side — invaluable when debugging why one augmentation pipeline generalizes and another memorizes.
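Logging every validation image is rarely useful; a common approach is to log only a handful of mistakes. The sketch below picks misclassified samples and builds captions for them. The function name, class names, and label arrays are illustrative assumptions, not part of the W&B API.

```python
def pick_misclassified(labels, preds, class_names, limit=8):
    """Return (index, caption) pairs for the first `limit` wrong predictions."""
    picked = []
    for i, (y, p) in enumerate(zip(labels, preds)):
        if len(picked) >= limit:
            break
        if y != p:
            picked.append((i, f"true={class_names[y]} pred={class_names[p]}"))
    return picked


classes = ["ok", "bridge", "cold_joint"]  # illustrative class names
labels = [0, 1, 2, 1, 0]
preds = [0, 2, 2, 1, 1]
samples = pick_misclassified(labels, preds, classes, limit=2)
```

Each pair could then be turned into wandb.Image(images[i], caption=caption), where images is whatever array of validation crops the training loop already holds, and the resulting list logged under one key.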
Artifacts: versioned datasets and model lineage
Artifacts track immutable versions of inputs and outputs with dependency graphs. Log a training dataset once:
artifact = wandb.Artifact("defect-images", type="dataset")
artifact.add_dir("./data/v4/train", name="train")
wandb.log_artifact(artifact)
Link a model checkpoint as a dependent artifact:
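# run is the object returned by wandb.init(...); the dataset dependency
# comes from run.use_artifact(...) called earlier in the same run
model_art = wandb.Artifact("resnet50-champion", type="model")
model_art.add_file("checkpoint.pt")
run.log_artifact(model_art)
model_art.wait()  # optional: block until the upload is committed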
The UI renders lineage: checkpoint v3 was trained on dataset v4 produced by preprocessing run abc123. This closes audit gaps that plain metric logging leaves open — critical for regulated manufacturing QA like Harbor Analytics’ defect pipeline.
Using artifacts in downstream runs
artifact = run.use_artifact("harbor-defect-classifier/defect-images:v4")
dataset_path = artifact.download()
use_artifact declares the dependency before training starts, so the lineage
graph records which dataset version each model consumed and a promotion check
can reject a model retrained on an unapproved dataset version.
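A promotion check of that kind can be sketched as a plain function over artifact references. Everything here is an assumption made for illustration: the allow-list, the function names, and the choice to treat a missing version as latest.

```python
def parse_artifact_ref(ref):
    """Split 'project/name:version' (or 'name:version') into (name, version)."""
    path, _, version = ref.partition(":")
    name = path.rsplit("/", 1)[-1]
    return name, version or "latest"  # assumed default when no version is given


def dataset_is_approved(ref, approved):
    """True when the referenced dataset version is in the approved set."""
    name, version = parse_artifact_ref(ref)
    return version in approved.get(name, set())


APPROVED = {"defect-images": {"v4"}}  # hypothetical allow-list
ok = dataset_is_approved("harbor-defect-classifier/defect-images:v4", APPROVED)
blocked = dataset_is_approved("harbor-defect-classifier/defect-images:v3", APPROVED)
```

A check like this would typically run in CI before a registry alias is moved, using the dataset reference recorded in the run's lineage.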
Sweeps: managed hyperparameter search
W&B Sweeps coordinate distributed hyperparameter search without writing a custom scheduler. Define a sweep config:
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2, "distribution": "log_uniform_values"},
        "weight_decay": {"values": [0, 1e-4, 1e-3]},
        "augment_strength": {"values": ["light", "medium", "heavy"]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="harbor-defect-classifier")
Launch agents on multiple GPUs:
wandb.agent(sweep_id, function=train, count=50)
The sweep controller allocates trials using the method set in the config:
Bayesian optimization (bayes, as above), grid search, or random search. Each
agent run receives its suggested values through wandb.config. Consider Optuna
when you need define-by-run conditional search spaces or tight integration with
non-W&B storage backends; many teams run Optuna studies and log each trial as
a W&B run for visualization.
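To make the lr search space above concrete, the sketch below draws values whose logarithm is uniform between the bounds, which is what a log-uniform range over 1e-5 to 1e-2 means. It is plain random sampling written for illustration, not the sweep controller's Bayesian procedure, and sample_log_uniform is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw x in [low, high] with log(x) uniformly distributed."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(1000)]

# Each decade (1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2) should get about a third.
below_1e_4 = sum(d < 1e-4 for d in draws) / len(draws)
```

The design point is that small learning rates get as much attention as large ones; a linear uniform range over the same bounds would place almost every draw above 1e-3.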
Model Registry and deployment hooks
W&B Model Registry links champion artifacts to lifecycle stages (staging, production, archived). After logging a model artifact:
run.link_artifact(
    artifact=model_art,
    target_path="harbor-defect-classifier/production",
)
Registry entries carry aliases such as champion and challenger, similar to
MLflow's model aliases (W&B references them as name:alias rather than MLflow's
name@alias). Deployment integrations (SageMaker, Vertex AI, custom webhooks)
react to registry events to trigger canary rollouts. For teams already on MLflow
registry, W&B can coexist as the visualization layer while MLflow handles
promotion; avoid duplicating registry truth in two systems without sync
automation.
Reports and collaboration
Reports are shareable notebook-style pages embedding live charts from selected runs. Pin a report to a Slack channel for weekly model review instead of exporting static PNGs. Set run privacy at the project level when working with sensitive data; enterprise tiers add SSO and audit logs.
Framework integrations and autolog
W&B integrates with major ML stacks through thin wrappers:
- PyTorch Lightning — WandbLogger (passed as the Trainer's logger) logs metrics, gradients, and learning rates automatically.
- Hugging Face Transformers — report_to="wandb" in TrainingArguments logs loss curves and evaluation metrics per checkpoint.
- Keras / TensorFlow — WandbCallback on model.fit.
- scikit-learn — wrap estimators or use wandb.sklearn.plot_classifier helpers.
- Ultralytics YOLO — built-in wandb=True flag for detection training dashboards.
wandb.watch(model, log="all", log_freq=100) streams weight and
gradient histograms (log="gradients" records gradients only). This is powerful
for debugging vanishing gradients, but expensive on large models. Disable it or
raise log_freq in production training.
Worked example: Harbor Analytics defect classifier
Harbor Analytics inspects PCB solder joints from camera feeds. The vision team retrains a ResNet classifier weekly on augmented image batches, comparing augmentation policies and class-weighting strategies.
Project setup
- Project: harbor-defect-classifier under entity harbor-analytics
- Dataset artifact: defect-images:v4 (42k labeled crops)
- Sweep: Bayesian search over lr, weight decay, augment strength (50 trials)
- Primary metric: val_auc on held-out production-line validation set
Each sweep agent calls run.use_artifact("defect-images:v4"), trains
for 30 epochs, and logs per-epoch val_auc, sample misclassified images
via wandb.Image, and GPU utilization via
wandb.log({"gpu_mem_gb": ...}). After the sweep, the lead filters
runs with val_auc > 0.94 and reviews confusion-matrix panels. Run
sweep-champion-17 logs resnet50-champion:v1 to the
registry staging alias. Shadow deployment on the factory edge server compares
staging against production for seven days before promotion.
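The filtering step in this workflow can be sketched offline as a small function over run summaries. The run names and scores below are made up purely for illustration and are not results from Harbor Analytics; select_candidates is a hypothetical helper.

```python
def select_candidates(runs, metric="val_auc", threshold=0.94):
    """Keep runs whose metric exceeds the threshold, best first."""
    kept = [r for r in runs if r.get(metric, float("-inf")) > threshold]
    return sorted(kept, key=lambda r: r[metric], reverse=True)


# Made-up summaries for illustration only.
runs = [
    {"name": "sweep-a", "val_auc": 0.931},
    {"name": "sweep-b", "val_auc": 0.952},
    {"name": "sweep-c", "val_auc": 0.947},
    {"name": "sweep-d"},  # crashed before logging val_auc
]
shortlist = [r["name"] for r in select_candidates(runs)]
```

Runs that never logged the metric are dropped rather than raising, which matters when some sweep trials crash early.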
This workflow mirrors the Optuna + MLflow pattern in our hyperparameter tuning guide, but W&B replaces separate sweep UI and spreadsheet review with one linked dashboard.
Tooling decision table
| Need | W&B fit | Alternative |
|---|---|---|
| Collaborative experiment dashboards | Strong default | Neptune, Comet |
| Rich image/audio/3D logging | Strong | TensorBoard (local) |
| Self-hosted, air-gapped only | Enterprise self-host | MLflow, ClearML |
| Open-source model registry on-prem | Partial | MLflow Registry |
| Define-by-run conditional HPO | Via Optuna bridge | Optuna native |
| Git-native dataset versioning | Artifacts (complement) | DVC |
| LLM trace debugging | W&B Weave (emerging) | LangSmith, Phoenix |
| Zero external SaaS policy | Weak (cloud default) | MLflow file store |
Common pitfalls
- Logging every mini-batch — floods the UI and slows training. Log per epoch unless debugging a specific instability.
- Forgetting wandb.finish() — leaves zombie runs marked “running” in the dashboard. Use try/finally or context managers.
- API keys in committed code — use wandb login locally and WANDB_API_KEY in CI secrets only.
- Massive unversioned artifacts — logging 10 GB checkpoints every epoch exhausts storage quotas. Log final champion only or use reference links.
- Duplicate registry systems — promoting models in both W&B and MLflow without sync creates conflicting “production” versions.
- Offline runs never synced — wandb.init(mode="offline") requires explicit wandb sync before results appear for the team.
- PII in logged images — manufacturing photos may contain employee badges or serial numbers. Scrub or blur before wandb.Image upload.
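The try/finally advice for wandb.finish() can be sketched as a small context manager. managed_run and FakeRun are hypothetical names; FakeRun only mimics a run's finish method so the control flow can be shown without the SDK, and exit_code is assumed to match the keyword that the real finish accepts.

```python
from contextlib import contextmanager


@contextmanager
def managed_run(init_fn, **init_kwargs):
    """Start a run and always close it, marking it failed if training raises."""
    run = init_fn(**init_kwargs)
    try:
        yield run
    except BaseException:
        run.finish(exit_code=1)  # non-zero exit code flags the run as failed
        raise
    else:
        run.finish()


class FakeRun:
    """Hypothetical stand-in for a W&B run, used only to show the control flow."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.finished = False
        self.exit_code = None

    def finish(self, exit_code=None):
        self.finished = True
        self.exit_code = exit_code


captured = {}
try:
    with managed_run(FakeRun, project="harbor-defect-classifier") as run:
        captured["run"] = run
        raise RuntimeError("simulated training crash")
except RuntimeError:
    pass
```

In real code, managed_run(wandb.init, project=...) would be the equivalent call. The SDK's run object can also be used directly in a with statement, which may be the simpler choice when no extra cleanup is needed.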
Practitioner checklist
- Create one project per model family; use consistent run naming conventions.
- Log hyperparameters via wandb.config at init — never only as metrics.
- Version datasets and models as Artifacts with explicit dependency links.
- Hold out a final test set that no sweep or run selection ever sees.
- Set sweep early_terminate (Hyperband) for long training jobs.
- Tag runs with git commit, data snapshot ID, and author email.
- Review sample predictions visually before promoting registry aliases.
- Configure storage retention policies for artifact versions in enterprise settings.
- Wire registry webhooks to deployment CI instead of manual UI promotion.
- Document whether W&B or MLflow is the single source of registry truth.
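For the early_terminate item in the checklist, a sweep config with Hyperband stopping might look like the sketch below. The min_iter value is an illustrative choice, not a recommendation, and the parameter block is trimmed to one entry for brevity.

```python
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2, "distribution": "log_uniform_values"},
    },
    # Hyperband early stopping; min_iter=3 is an illustrative choice.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}
```

Assuming val_auc is logged once per epoch, a setting like this would let each trial run for at least a few epochs before the controller considers stopping it.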
Key takeaways
- W&B centers on collaborative, media-rich experiment dashboards accessible from any browser.
- Artifacts provide versioned lineage between datasets, preprocessing, and model checkpoints.
- Sweeps run Bayesian, grid, or random search across distributed agents with minimal boilerplate.
- Model Registry stages champions for staging and production with deployment integrations.
- Pair W&B with Optuna or MLflow when you need conditional search spaces or self-hosted registry governance.
Related reading
- MLflow fundamentals explained — self-hosted tracking and registry alternative to W&B
- Optuna fundamentals explained — define-by-run hyperparameter search often logged to W&B
- Hyperparameter tuning explained — search theory behind W&B Sweeps and Optuna
- MLOps explained — deployment, monitoring, and governance beyond experiment tracking