
Weights & Biases fundamentals explained

A computer-vision team trains twelve ResNet variants overnight. Each engineer logs to a different spreadsheet; confusion matrices live in Slack threads; nobody can reproduce the 94.2% validation accuracy from Tuesday. Weights & Biases (W&B) is a cloud-first experiment platform that unifies training runs into searchable projects with live dashboards for metrics, images, hyperparameters, and system utilization. Unlike file-based trackers, W&B optimizes for collaborative visibility: parallel coordinates plots, run comparison tables, and shareable report pages that research teams actually open. It pairs naturally with PyTorch, Hugging Face Transformers, and Optuna studies. This guide covers projects and runs, config and logging APIs, Artifacts and Model Registry, Sweeps, framework integrations, a Harbor Analytics defect classifier worked example, a tooling decision table versus MLflow, common pitfalls, and a practitioner checklist. It builds on our broader MLOps overview.

What W&B is and when to adopt it

W&B is a SaaS platform (with optional self-hosted deployment) for logging, comparing, and governing machine learning experiments. The Python SDK wraps training loops with a few lines; the web UI renders time-series metrics, media panels, and hyperparameter sweeps without custom dashboard code.

Core primitives

Entity — your organization or personal account namespace.
Project — a bucket for related runs (e.g. harbor-defect-classifier).
Run — one training execution with config, metrics, logs, and artifacts.
Artifact — versioned datasets, model checkpoints, or evaluation outputs with lineage.
Sweep — managed hyperparameter search coordinating many runs from a YAML config.

W&B shines when teams need rich media logging (image grids, attention maps, 3D point clouds), real-time collaboration on shared dashboards, and low-friction sweep orchestration without standing up a separate scheduler. Choose MLflow instead when policy requires fully self-hosted tracking with no external SaaS, or when Databricks-native registry integration is mandatory.

Projects, runs, and the logging API

Initialize a run at the top of your training script:

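import wandb

# wandb.init returns a Run object; keep it for the artifact calls later on
run = wandb.init(
    project="harbor-defect-classifier",
    name="resnet50-baseline",
    config={"lr": 3e-4, "batch_size": 32, "epochs": 50},
)
for epoch in range(run.config["epochs"]):
    # ... train and validate here; loss and auc come from your own loop ...
    wandb.log({"train_loss": loss, "val_auc": auc}, step=epoch)
wandb.finish()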

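The config dict records the run's hyperparameters, which appear in the UI parallel-coordinates view. Config is meant for inputs fixed at launch; changing an existing key later requires an explicit override. Log metrics with wandb.log; keys become chartable time series. Pass step explicitly when logging per epoch so charts align, and keep step values non-decreasing, because W&B ignores data logged to a step earlier than the current one.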
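The per-epoch logging pattern above can be sketched without a W&B account. This is a minimal illustration, not part of the W&B SDK: log_epoch and fake_log are hypothetical names, and fake_log stands in for wandb.log, which accepts the same (data, step=...) call shape.

```python
from statistics import fmean
from typing import Callable, Iterable


def log_epoch(log_fn: Callable[..., None], epoch: int,
              batch_losses: Iterable[float], val_auc: float) -> dict:
    """Collapse one epoch of mini-batch losses into a single logged row."""
    row = {"train_loss": fmean(batch_losses), "val_auc": val_auc}
    # One call per epoch with an explicit step keeps both series on the same x-axis.
    log_fn(row, step=epoch)
    return row


# Stand-in for wandb.log so the sketch runs offline (hypothetical helper).
captured = []


def fake_log(row, step):
    captured.append((step, row))


log_epoch(fake_log, epoch=0, batch_losses=[0.9, 0.7, 0.8], val_auc=0.81)
log_epoch(fake_log, epoch=1, batch_losses=[0.6, 0.5, 0.4], val_auc=0.88)
```

In a real script, passing wandb.log as log_fn should give one chart point per epoch for each key.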

Config updates and grouping

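Call wandb.config.update({"lr": new_lr}, allow_val_change=True) only when an input genuinely changes mid-run; without allow_val_change=True, overwriting an existing key raises an error. Values that follow a schedule, such as a decaying learning rate, are better logged as metrics with wandb.log. Group related runs by passing group="sweep-2026-06-09" or job_type="train" to wandb.init so the UI clusters variants. Tags like dataset=v4 enable filtered views across hundreds of historical runs.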
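One way to keep grouping and tagging consistent is to build the wandb.init keyword arguments in a single place. The helper below is a hypothetical sketch (build_init_kwargs is not a W&B function); it only assumes that wandb.init accepts project, group, job_type, and tags.

```python
def build_init_kwargs(project: str, dataset_version: str,
                      sweep_date: str, job_type: str = "train") -> dict:
    """Return keyword arguments for wandb.init with consistent grouping metadata."""
    return {
        "project": project,
        "group": f"sweep-{sweep_date}",          # clusters runs from the same sweep
        "job_type": job_type,                    # separates train runs from eval runs
        "tags": [f"dataset={dataset_version}"],  # enables filtered views in the UI
    }


kwargs = build_init_kwargs("harbor-defect-classifier", "v4", "2026-06-09")
# In a training script: run = wandb.init(**kwargs)
```

Centralizing these values avoids the slow drift in naming that makes filters unreliable after a few hundred runs.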

Rich media logging

W&B’s differentiator is first-class media:

wandb.log({
    "predictions": wandb.Image(img, caption="epoch 10"),
    "confusion_matrix": wandb.plot.confusion_matrix(
        probs=None, y_true=labels, preds=preds, class_names=classes
    ),
    "gradients": wandb.Histogram(grad_norms)
})

Log attention heatmaps, audio spectrograms, or point-cloud renders during validation epochs. The UI overlays runs side-by-side — invaluable when debugging why one augmentation pipeline generalizes and another memorizes.
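Logging every validation image is rarely useful; a common approach is to upload only a capped sample of misclassified items. The sketch below shows the selection step in plain Python. misclassified_indices is a hypothetical helper, and the labels and predictions are made-up toy values.

```python
def misclassified_indices(labels, preds, limit=16):
    """Return indices where prediction and label disagree, capped at `limit`."""
    wrong = [i for i, (y, p) in enumerate(zip(labels, preds)) if y != p]
    return wrong[:limit]


labels = [0, 1, 1, 0, 2, 2]
preds = [0, 1, 0, 0, 2, 1]

idx = misclassified_indices(labels, preds, limit=4)
captions = [f"true={labels[i]} pred={preds[i]}" for i in idx]
# Each selected sample would then be wrapped as
# wandb.Image(images[i], caption=captions[k]) before calling wandb.log,
# where `images` is your own validation batch (not defined here).
```

Capping the count keeps media panels readable and limits upload volume per epoch.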

Artifacts: versioned datasets and model lineage

Artifacts track immutable versions of inputs and outputs with dependency graphs. Log a training dataset once:

artifact = wandb.Artifact("defect-images", type="dataset")
artifact.add_dir("./data/v4/train", name="train")
wandb.log_artifact(artifact)

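Log the model checkpoint from the same run that consumed the dataset, so W&B records it as a downstream artifact:

model_art = wandb.Artifact("resnet50-champion", type="model")
model_art.add_file("checkpoint.pt")
run.log_artifact(model_art)  # run is the object returned by wandb.init()
model_art.wait()  # optional: block until the upload has finished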

The UI renders lineage: checkpoint v3 was trained on dataset v4 produced by preprocessing run abc123. This closes audit gaps that plain metric logging leaves open — critical for regulated manufacturing QA like Harbor Analytics’ defect pipeline.

Using artifacts in downstream runs

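artifact = run.use_artifact("harbor-defect-classifier/defect-images:v4")
dataset_path = artifact.download()

use_artifact records the dependency before training starts, so the lineage graph shows exactly which dataset version each model consumed. W&B does not block promotion by itself; a reviewer or CI check can use that lineage to reject a model retrained on an unapproved dataset version.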
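A dataset approval check of that kind could look like the following sketch. Everything here is an assumption for illustration: APPROVED_DATASETS is a hypothetical allowlist, and parse_artifact_ref and is_approved are not W&B functions. The code only parses the reference string that would be passed to use_artifact.

```python
# Hypothetical allowlist: artifact name -> approved version aliases.
APPROVED_DATASETS = {"defect-images": {"v4"}}


def parse_artifact_ref(ref: str) -> tuple:
    """Split 'project/name:alias' or 'name:alias' into (name, alias)."""
    path, sep, alias = ref.rpartition(":")
    if not sep or not path or not alias:
        raise ValueError(f"expected '<name>:<version>', got {ref!r}")
    name = path.split("/")[-1]  # drop any entity/project prefix
    return name, alias


def is_approved(ref: str, approved=APPROVED_DATASETS) -> bool:
    """True only when the reference pins an explicitly approved version."""
    name, alias = parse_artifact_ref(ref)
    return alias in approved.get(name, set())


ok = is_approved("harbor-defect-classifier/defect-images:v4")
stale = is_approved("defect-images:v3")
floating = is_approved("defect-images:latest")  # floating aliases are rejected
```

Rejecting floating aliases such as latest forces training scripts to pin the exact version a reviewer signed off on.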

Sweeps: managed hyperparameter search

W&B Sweeps coordinate distributed hyperparameter search without writing a custom scheduler. Define a sweep config:

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2, "distribution": "log_uniform_values"},
        "weight_decay": {"values": [0, 1e-4, 1e-3]},
        "augment_strength": {"values": ["light", "medium", "heavy"]}
    }
}
sweep_id = wandb.sweep(sweep_config, project="harbor-defect-classifier")

Launch agents on multiple GPUs:

wandb.agent(sweep_id, function=train, count=50)

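The sweep controller allocates trials according to the configured method: bayes (Bayesian optimization), grid, or random. The method key is required, so state it explicitly rather than relying on a default. Each agent run receives the suggested values via wandb.config. Run one agent process per GPU or machine to parallelize the search, and note that count caps the runs a single agent executes, not the sweep as a whole. Consider Optuna when you need define-by-run conditional search spaces or tight integration with non-W&B storage backends; many teams run Optuna studies and log each trial as a W&B run for visualization.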
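To see why the lr range is declared log-uniform rather than uniform, the sketch below draws values whose logarithm is uniform over the same bounds. It is an illustration of the distribution only, not W&B's sampler, and the bayes method chooses points adaptively rather than independently. sample_log_uniform is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw x so that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(1000)]

# The range spans three decades; two of them lie below 1e-3,
# so roughly two thirds of the draws should fall there.
share_below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
```

A plain uniform draw over the same bounds would place about 90% of trials above 1e-3, leaving the small learning rates almost unexplored.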

Model Registry and deployment hooks

W&B Model Registry collects champion artifacts into registered models and marks lifecycle position with aliases such as staging or production. After logging a model artifact:

run.link_artifact(
    artifact=model_art,
    # Path format depends on your registry setup; "model-registry/<name>" is the classic form
    target_path="model-registry/harbor-defect-classifier",
    aliases=["production"],
)

Registry versions carry aliases (for example champion and challenger), similar to MLflow model aliases. W&B automations can call a webhook when a version is linked or an alias changes, which a deployment pipeline on SageMaker, Vertex AI, or custom infrastructure can use to trigger canary rollouts. For teams already on MLflow registry, W&B can coexist as the visualization layer while MLflow handles promotion; avoid duplicating registry truth in two systems without sync automation.

Reports and collaboration

Reports are shareable notebook-style pages embedding live charts from selected runs. Pin a report to a Slack channel for weekly model review instead of exporting static PNGs. Set run privacy at the project level when working with sensitive data; enterprise tiers add SSO and audit logs.

Framework integrations and autolog

W&B integrates with major ML stacks through thin wrappers:

PyTorch Lightning: WandbLogger, passed as the Trainer's logger, records whatever the module logs; gradients need logger.watch(model) and learning rates need a LearningRateMonitor callback.
Hugging Face Transformers: report_to="wandb" in TrainingArguments logs loss curves and evaluation metrics at each logging and evaluation step.
Keras / TensorFlow: WandbCallback on model.fit.
scikit-learn: wrap estimators or use wandb.sklearn.plot_classifier helpers.
Ultralytics YOLO: W&B logging is switched on through the Ultralytics settings (wandb=True) in recent versions, giving detection training dashboards.

wandb.watch(model, log="gradients", log_freq=100) streams gradient histograms every 100 batches; pass log="all" to include parameter histograms as well. This is powerful for debugging vanishing gradients but expensive on large models, so disable it or raise log_freq in production training.

Worked example: Harbor Analytics defect classifier

Harbor Analytics inspects PCB solder joints from camera feeds. The vision team retrains a ResNet classifier weekly on augmented image batches, comparing augmentation policies and class-weighting strategies.

Project setup

Project: harbor-defect-classifier under entity harbor-analytics
Dataset artifact: defect-images:v4 (42k labeled crops)
Sweep: Bayesian search over lr, weight decay, augment strength (50 trials)
Primary metric: val_auc on held-out production-line validation set

Each sweep agent calls run.use_artifact("defect-images:v4"), trains for 30 epochs, and logs per-epoch val_auc, sample misclassified images via wandb.Image, and GPU memory via wandb.log({"gpu_mem_gb": ...}) alongside the system metrics W&B collects automatically. After the sweep, the lead filters runs with val_auc > 0.94 and reviews confusion-matrix panels. Run sweep-champion-17 logs resnet50-champion:v1 and links it to the registry under the staging alias. A shadow deployment on the factory edge server compares staging against production for seven days before promotion.

This workflow mirrors the Optuna + MLflow pattern in our hyperparameter tuning guide, but W&B replaces the separate sweep UI and spreadsheet review with one linked dashboard.
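The seven-day shadow comparison in the worked example can be expressed as a small promotion gate. This is a sketch under stated assumptions, not Harbor Analytics' actual rule: ready_to_promote, the mean-based comparison, and the min_gain threshold are all hypothetical, and the daily scores are invented.

```python
from statistics import fmean


def ready_to_promote(staging_auc, production_auc, min_days=7, min_gain=0.0):
    """Compare daily shadow scores; both lists hold one val_auc per day."""
    if len(staging_auc) != len(production_auc):
        raise ValueError("need one staging and one production score per day")
    if len(staging_auc) < min_days:
        return False  # shadow window not complete yet
    # Promote only if staging beats production on average by more than min_gain.
    return fmean(staging_auc) - fmean(production_auc) > min_gain


full_week = ready_to_promote([0.95] * 7, [0.94] * 7)
too_early = ready_to_promote([0.95] * 6, [0.94] * 6)
```

A gate like this would typically run in deployment CI before the production alias is moved.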

Tooling decision table

Need	W&B fit	Alternative
Collaborative experiment dashboards	Strong default	Neptune, Comet
Rich image/audio/3D logging	Strong	TensorBoard (local)
Self-hosted, air-gapped only	Enterprise self-host	MLflow, ClearML
Open-source model registry on-prem	Partial	MLflow Registry
Define-by-run conditional HPO	Via Optuna bridge	Optuna native
Git-native dataset versioning	Artifacts (complement)	DVC
LLM trace debugging	W&B Weave (emerging)	LangSmith, Phoenix
Zero external SaaS policy	Weak (cloud default)	MLflow file store

Common pitfalls

Logging every mini-batch — floods the UI and slows training. Log per epoch unless debugging a specific instability.
Forgetting wandb.finish() — leaves zombie runs marked “running” in the dashboard. Use try/finally or context managers.
API keys in committed code — use wandb login locally and WANDB_API_KEY in CI secrets only.
Massive unversioned artifacts — logging 10 GB checkpoints every epoch exhausts storage quotas. Log final champion only or use reference links.
Duplicate registry systems — promoting models in both W&B and MLflow without sync creates conflicting “production” versions.
Offline runs never synced — wandb.init(mode="offline") requires explicit wandb sync before results appear for the team.
PII in logged images — manufacturing photos may contain employee badges or serial numbers. Scrub or blur before wandb.Image upload.
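For the API key pitfall, a training entry point can fail fast when the key is missing instead of falling back to an interactive login prompt in CI. WANDB_API_KEY is the environment variable the W&B client reads; require_wandb_api_key is a hypothetical helper written for this sketch.

```python
import os


def require_wandb_api_key(env=None) -> str:
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WANDB_API_KEY is not set; add it as a CI secret "
            "or run `wandb login` on a workstation"
        )
    return key


# Demo with an explicit mapping so the sketch does not depend on the real environment.
demo_key = require_wandb_api_key({"WANDB_API_KEY": "example-key"})
```

Because the key is read from the environment, nothing secret needs to appear in the repository or in the training script.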

Practitioner checklist

Create one project per model family; use consistent run naming conventions.
Log hyperparameters via wandb.config at init — never only as metrics.
Version datasets and models as Artifacts with explicit dependency links.
Hold out a final test set that no sweep or run selection ever sees.
Set sweep early_terminate (Hyperband) for long training jobs.
Tag runs with git commit, data snapshot ID, and author email.
Review sample predictions visually before promoting registry aliases.
Configure storage retention policies for artifact versions in enterprise settings.
Wire registry webhooks to deployment CI instead of manual UI promotion.
Document whether W&B or MLflow is the single source of registry truth.
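The early_terminate item in the checklist corresponds to one extra key in the sweep config. The fragment below is a sketch: the min_iter value of 3 is an assumed starting point rather than a recommendation, and the parameter list is trimmed to a single entry for brevity.

```python
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2, "distribution": "log_uniform_values"},
    },
    # Hyperband early stopping; min_iter=3 is an assumed value, tune it per job.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}
# In a real script: sweep_id = wandb.sweep(sweep_config, project="harbor-defect-classifier")
```

Early termination only helps when the training loop logs the target metric regularly, since the controller needs intermediate values to decide which runs to stop.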

Key takeaways

W&B centers on collaborative, media-rich experiment dashboards accessible from any browser.
Artifacts provide versioned lineage between datasets, preprocessing, and model checkpoints.
Sweeps run Bayesian, grid, or random search across distributed agents with minimal boilerplate.
Model Registry stages champions for staging and production with deployment integrations.
Pair W&B with Optuna or MLflow when you need conditional search spaces or self-hosted registry governance.