Guide

PyTorch fundamentals explained

You have labeled support tickets and want a classifier that runs in production. NumPy can multiply matrices, but it cannot automatically compute gradients through a 12-layer network, shard batches across GPUs, or export to ONNX for serving. PyTorch is the open-source deep learning framework that fills that gap. Meta, OpenAI, Hugging Face, and most academic labs use it because its Python-first API feels like writing normal code while still dispatching to fast CUDA kernels. This guide covers tensors and device placement, automatic differentiation with autograd, building models with nn.Module, feeding data through DataLoader, the standard training loop every project shares, GPU and mixed-precision basics, and torch.compile optimizations. It closes with a Harbor Support ticket-router worked example, a framework comparison table, common pitfalls, and a practitioner checklist, and it sits alongside our deep learning primer, gradient descent guide, and backpropagation explainer.

What PyTorch provides

PyTorch is a tensor computation library with automatic differentiation built in. Tensors are n-dimensional arrays (like NumPy ndarray) that can live on CPU or GPU, track computation history for gradient calculation, and dispatch to optimized BLAS/cuDNN kernels. The high-level torch.nn package supplies layers and loss functions; torch.optim supplies optimizers and learning-rate schedulers; torch.utils.data handles batching and shuffling; torch.distributed scales training across machines.

Unlike graph-first frameworks that build a static computation graph before execution, PyTorch uses eager execution by default: each operation runs immediately, which makes debugging with print() and Python breakpoints natural. Since PyTorch 2.0, torch.compile can capture and fuse operations into optimized graphs when you need more speed, without giving up the eager development experience.

Core modules you will touch daily

torch: tensor creation, math, random, device management
torch.nn: Module, Linear, Conv2d, LayerNorm, loss functions
torch.optim: SGD, Adam, AdamW, learning-rate schedulers
torch.utils.data: Dataset, DataLoader
torch.amp: automatic mixed precision (FP16/BF16); older code imports the same tools from torch.cuda.amp

Tensors: shape, dtype, and device

Everything in PyTorch is a tensor. Shape matters: a batch of 32 RGB images at 224×224 is (32, 3, 224, 224) in NCHW order (batch, channels, height, width). Dtype controls memory and precision — float32 for training, float16 or bfloat16 for mixed precision, int64 for class labels.

The device tells PyTorch where memory lives: cpu, cuda:0 (first GPU), or mps on Apple Silicon. A classic bug is creating a model on GPU but leaving input tensors on CPU — operations fail with device-mismatch errors. The fix is consistent .to(device) calls or moving the model once and passing device-aware batches.
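
A minimal sketch of the shape, dtype, and device ideas above. The ten-class label range and the variable names are assumptions made for illustration; the device check falls back to CPU when no accelerator is present.

```python
import torch

# Pick a device once and reuse it for the model and every batch.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# A batch of 32 RGB images at 224x224 in NCHW order (random stand-in data).
images = torch.randn(32, 3, 224, 224, dtype=torch.float32)
# Class labels are integer indices; randint produces int64 by default.
labels = torch.randint(0, 10, (32,))

# Move both tensors to the same device before they meet the model.
images, labels = images.to(device), labels.to(device)

print(images.shape)  # torch.Size([32, 3, 224, 224])
print(images.dtype)  # torch.float32
print(labels.dtype)  # torch.int64
print(images.device.type == labels.device.type)  # True
```

Choosing the device in one place and passing it around is what keeps device-mismatch errors out of the rest of the script.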

Broadcasting and views

PyTorch follows NumPy-style broadcasting: shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or one of them is 1, so no explicit expansion is needed. .view() changes shape without copying but requires a compatible memory layout; .reshape() returns a view when it can and copies otherwise. .permute() reorders dimensions (essential when interfacing with libraries that expect channels-last layout) and usually leaves the tensor non-contiguous. In-place operations (suffix _) save memory but can break autograd if they overwrite values needed for the backward pass.
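
The following sketch illustrates broadcasting and the difference between a view and a copy on a small tensor. The values are arbitrary and chosen only so the results are easy to check by hand.

```python
import torch

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)  # shape (3, 4)

# Broadcasting: (3, 4) minus (4,) subtracts the column means from every row.
col_mean = x.mean(dim=0)      # shape (4,)
centered = x - col_mean       # shape (3, 4)

# view() reinterprets the same storage with a new shape; no data is copied.
v = x.view(4, 3)
print(v.data_ptr() == x.data_ptr())  # True

# permute() only changes strides, so the result is non-contiguous.
t = x.permute(1, 0)           # shape (4, 3)
print(t.is_contiguous())      # False

# reshape() has to copy here because a flat view of t is not possible.
r = t.reshape(12)
print(r.data_ptr() == x.data_ptr())  # False
```

Calling t.view(12) on the permuted tensor would raise a RuntimeError, which is why .reshape() (or .contiguous() followed by .view()) is the safer default after a permute.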

Autograd: automatic differentiation

requires_grad=True tells PyTorch to track operations on a tensor so it can compute partial derivatives. When you call loss.backward(), gradients flow backward through the computation graph and accumulate in tensor.grad. Optimizers read those gradients and update nn.Parameter weights.

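Wrap inference and validation in torch.no_grad() (or the newer torch.inference_mode()) to skip graph construction. No intermediate activations are kept for a backward pass, so evaluation uses noticeably less memory and runs faster; the exact saving depends on the model. During training, call optimizer.zero_grad() before loss.backward() on every step, because backward() adds to whatever is already stored in .grad. Skip the reset only when you accumulate gradients over several batches on purpose.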
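
A small hand-checkable sketch of the behavior described above: gradients land in .grad, a second backward() adds to them, and no_grad() produces tensors with no history. The two-element weight vector is an illustrative assumption, not a real model.

```python
import torch

w = torch.tensor([2.0, -1.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

# loss = (w . x - 1)^2 = (6 - 4 - 1)^2 = 1
loss = ((w * x).sum() - 1.0) ** 2
loss.backward()
# d(loss)/dw = 2 * (w . x - 1) * x = 2 * 1 * [3, 4]
g1 = w.grad.clone()
print(g1)  # tensor([6., 8.])

# A second backward pass without a reset ADDS to the stored gradient.
loss2 = ((w * x).sum() - 1.0) ** 2
loss2.backward()
g2 = w.grad.clone()
print(g2)  # tensor([12., 16.])

# Manual reset; optimizer.zero_grad() clears the gradients of every
# parameter the optimizer manages.
w.grad.zero_()

# Inside no_grad() no graph is recorded, so the result has no history.
with torch.no_grad():
    y = (w * x).sum()
print(y.requires_grad)  # False
```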

Detaching and stopping gradients

Use .detach() when you need a tensor’s value but not its gradient — common in GANs, reinforcement learning target networks, and teacher-student distillation. with torch.no_grad(): around metric computation prevents accidental graph retention that balloons GPU memory across epochs.
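
As an illustration of stopping gradients, here is a toy teacher-student sketch. The two nn.Linear layers stand in for real networks and are purely hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Linear(4, 3)   # stand-in for a frozen, pretrained model
student = nn.Linear(4, 3)   # stand-in for the model being trained
x = torch.randn(8, 4)

# detach() keeps the teacher's output values but cuts the graph,
# so no gradient flows back into the teacher.
target = teacher(x).detach()
loss = F.mse_loss(student(x), target)
loss.backward()

print(teacher.weight.grad is None)      # True
print(student.weight.grad is not None)  # True

# For logging, convert to a Python float so no graph is kept alive.
running_loss = 0.0
running_loss += loss.item()
```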

nn.Module: building models

Subclass nn.Module and define layers in __init__, then wire them in forward(). PyTorch registers parameters automatically when you assign nn.Linear, nn.Conv2d, etc. as attributes. The forward method is what runs on each batch; do not call it directly in training loops — use model(inputs) so hooks and torch.compile wrappers work correctly.

Sequential containers stack layers for simple MLPs. For transformers and U-Nets, explicit forward logic with skip connections is clearer. Use model.train() before training (enables dropout and batch-norm updates) and model.eval() before validation (freezes batch-norm running stats and disables dropout).
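
A minimal nn.Module sketch to make the pattern concrete. The class name TicketMLP, the layer sizes, and the dropout rate are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class TicketMLP(nn.Module):
    """Hypothetical two-layer classifier; sizes are illustrative only."""

    def __init__(self, in_features=16, hidden=32, num_classes=3, p_drop=0.5):
        super().__init__()
        # Assigning layers as attributes registers their parameters.
        self.fc1 = nn.Linear(in_features, hidden)
        self.drop = nn.Dropout(p_drop)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc2(self.drop(h))  # raw logits, no softmax


model = TicketMLP()
x = torch.randn(4, 16)

# eval(): dropout is disabled, so two passes give identical outputs.
model.eval()
with torch.no_grad():
    a = model(x)   # call the module, not model.forward(x)
    b = model(x)
print(torch.equal(a, b))  # True

# train(): dropout is active again for the next training phase.
model.train()
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 643 = (16*32 + 32) + (32*3 + 3)
```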

Loss functions and optimizers

Pair nn.CrossEntropyLoss with raw logits (no softmax — the loss applies log-softmax internally). For binary tasks, BCEWithLogitsLoss is numerically stable. AdamW is the default optimizer for most transformer fine-tuning; SGD with momentum still wins on some vision tasks with careful learning-rate schedules. Attach a scheduler — cosine annealing or warmup-then-decay — when training beyond a few epochs.
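
The sketch below pairs CrossEntropyLoss with raw logits and checks that it matches log-softmax followed by negative log-likelihood. The layer size, labels, and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

logits = model(torch.randn(5, 8))        # raw scores, no softmax applied
labels = torch.tensor([0, 2, 1, 1, 0])   # int64 class indices

loss = criterion(logits, labels)

# CrossEntropyLoss is log-softmax followed by negative log-likelihood.
manual = F.nll_loss(torch.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss, manual))  # True

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()   # cosine schedule lowers the learning rate after each step
lr_after = scheduler.get_last_lr()[0]
```

Because the loss already applies log-softmax, the model's forward() should return logits and leave probabilities to inference-time code.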

DataLoader: batches, workers, and pinning

Implement Dataset.__getitem__ to return one sample and __len__ for the count. DataLoader wraps it with batching, shuffling, and parallel loading via num_workers. Set pin_memory=True when training on CUDA so batches land in pinned (page-locked) host memory; host-to-device copies from pinned memory are faster and can overlap with compute when you pass non_blocking=True to .to(device).

Collate functions handle variable-length sequences — pad text token IDs to the longest item in the batch and return an attention mask. For large datasets that do not fit in RAM, use IterableDataset or memory-mapped formats. Shuffle training loaders every epoch; keep validation loaders deterministic (no shuffle).
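
A sketch of a Dataset with a padding collate function, assuming token IDs are already available as Python lists. TokenDataset, pad_collate, and the pad ID of 0 are hypothetical names and values; a real project would take the pad ID from its tokenizer.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TokenDataset(Dataset):
    """Hypothetical dataset of pre-tokenized sequences and integer labels."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx]), self.labels[idx]


def pad_collate(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build a mask."""
    seqs, labels = zip(*batch)
    max_len = max(len(s) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros(len(seqs), max_len, dtype=torch.long)
    for i, s in enumerate(seqs):
        input_ids[i, : len(s)] = s
        attention_mask[i, : len(s)] = 1
    return input_ids, attention_mask, torch.tensor(labels)


data = TokenDataset([[5, 6, 7], [8, 9], [3], [4, 4, 4, 4]], [0, 1, 2, 1])
loader = DataLoader(
    data,
    batch_size=2,
    shuffle=False,                          # deterministic, as for validation
    collate_fn=pad_collate,
    num_workers=0,                          # raise for real datasets
    pin_memory=torch.cuda.is_available(),   # only useful with CUDA
)

first_ids, first_mask, first_labels = next(iter(loader))
print(first_ids)   # tensor([[5, 6, 7], [8, 9, 0]])
print(first_mask)  # tensor([[1, 1, 1], [1, 1, 0]])
```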

The standard training loop

Every PyTorch project follows the same skeleton:

Load a batch: for inputs, labels in train_loader (each iteration yields one batch)
Move to device: inputs, labels = inputs.to(device), labels.to(device)
Zero gradients: optimizer.zero_grad()
Forward pass: outputs = model(inputs)
Compute loss: loss = criterion(outputs, labels)
Backward pass: loss.backward()
Optimizer step: optimizer.step()
Scheduler step (if per-batch): scheduler.step()
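
The steps above can be sketched as one runnable loop. The synthetic dataset, the small Sequential model, and the hyperparameters are assumptions chosen so the example trains in under a second on CPU; the per-batch scheduler step is omitted for brevity.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic binary task: the label depends only on the sign of feature 0.
X = torch.randn(256, 10)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

epoch_losses = []
for epoch in range(5):
    model.train()
    total = 0.0
    for inputs, labels in loader:                               # load a batch
        inputs, labels = inputs.to(device), labels.to(device)   # move to device
        optimizer.zero_grad()                                   # zero gradients
        outputs = model(inputs)                                 # forward pass
        loss = criterion(outputs, labels)                       # compute loss
        loss.backward()                                         # backward pass
        optimizer.step()                                        # optimizer step
        total += loss.item() * inputs.size(0)   # .item() keeps no graph alive
    epoch_losses.append(total / len(loader.dataset))

print(epoch_losses[-1] < epoch_losses[0])  # True: the loss went down
```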

Track training loss and validation metrics each epoch. Save model.state_dict() — not the whole model object — for portable checkpoints. Include optimizer state when you need to resume long runs exactly. Use torch.save / torch.load with map_location when moving checkpoints between CPU and GPU machines.
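
A sketch of a resumable checkpoint, assuming a dictionary layout with the hypothetical keys "epoch", "model_state", and "optimizer_state". The temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one step so the optimizer has internal state worth saving.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

# Save state_dicts (plain tensors and numbers), not the model object itself.
checkpoint = {
    "epoch": 3,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(checkpoint, path)

# map_location lets a checkpoint written on a GPU machine load on CPU.
restored = torch.load(path, map_location="cpu", weights_only=True)

new_model = nn.Linear(4, 2)
new_model.load_state_dict(restored["model_state"])
new_optimizer = torch.optim.AdamW(new_model.parameters(), lr=1e-3)
new_optimizer.load_state_dict(restored["optimizer_state"])

print(torch.equal(new_model.weight, model.weight))  # True
```

Rebuilding the model from code and then loading the weights is what makes the checkpoint portable: the file holds only tensors and plain Python values.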

Mixed precision with autocast

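torch.autocast(device_type="cuda") (spelled torch.cuda.amp.autocast() in older code) runs matmuls and convolutions in FP16/BF16 while keeping precision-sensitive ops such as softmax and loss computation in FP32. With FP16, pair it with GradScaler to prevent gradient underflow. On modern NVIDIA GPUs (Ampere and later), BF16 has the same exponent range as FP32 and often needs no loss scaling. Mixed precision can nearly double throughput with minimal accuracy impact on large models.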
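
A minimal BF16 sketch of one mixed-precision training step, written so it also runs on CPU. It assumes BF16 without loss scaling, as described above; an FP16 setup would additionally wrap backward() and step() with a GradScaler. The model and batch are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 16, device=device)
labels = torch.randint(0, 3, (8,), device=device)

optimizer.zero_grad()
# Inside autocast, linear layers compute in bfloat16; the weights stay FP32.
with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
    logits = model(inputs)

# Compute the loss in FP32 outside the autocast region.
loss = F.cross_entropy(logits.float(), labels)
loss.backward()
optimizer.step()

print(logits.dtype)           # torch.bfloat16
print(loss.dtype)             # torch.float32
print(model[0].weight.dtype)  # torch.float32
```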

GPU, distributed training, and torch.compile

Check GPU availability with torch.cuda.is_available(). Multi-GPU on one machine typically uses torch.nn.DataParallel (simple but slower) or DistributedDataParallel (DDP — preferred: one process per GPU, gradients synced via NCCL). For multi-node training, launch with torchrun and initialize the process group with dist.init_process_group.

torch.compile(model) (PyTorch 2+) captures the model's Python code into graphs with TorchDynamo and generates fused kernels with TorchInductor, its default backend. It can yield 20–40% inference speedups with no code changes beyond wrapping the model, though the gain depends on the model and hardware. Compile after debugging in eager mode, because stack traces from compiled graphs are harder to read. Use mode="reduce-overhead" for small-batch inference and mode="max-autotune" for offline benchmarking.
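
One way to follow the "eager first, compile later" advice is to put compilation behind a switch and check output parity before trusting the compiled model. The helper names prepare_for_inference and outputs_match, the use_compile flag, and the tolerance are hypothetical; the demo call below deliberately stays on the eager path, since compiling needs a working compiler toolchain.

```python
import torch
import torch.nn as nn


def prepare_for_inference(model, use_compile=True, mode="reduce-overhead"):
    """Return the model in eval mode, optionally wrapped by torch.compile."""
    model.eval()
    if use_compile:
        # Compilation is lazy: the first call on real inputs triggers it.
        return torch.compile(model, mode=mode)
    return model


def outputs_match(reference, candidate, sample, atol=1e-5):
    """Compare two callables on the same sample batch within a tolerance."""
    with torch.no_grad():
        return torch.allclose(reference(sample), candidate(sample), atol=atol)


model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
sample = torch.randn(4, 16)

# Start in eager mode; switch to use_compile=True once the model is debugged.
candidate = prepare_for_inference(model, use_compile=False)
parity_ok = outputs_match(model, candidate, sample)
print(parity_ok)  # True
```

With use_compile=True, the same outputs_match call becomes a parity test between the compiled and eager models, which is worth running before any latency benchmark.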

Worked example: Harbor Support ticket router

Harbor Support receives 4,000 tickets per day across billing, technical, and account categories. A team fine-tunes a small classifier in PyTorch:

Dataset: CSV with text and label columns; Dataset tokenizes with a Hugging Face tokenizer, returning input_ids and attention_mask tensors.
Model: AutoModelForSequenceClassification from transformers with three output classes, that is, a pretrained DistilBERT encoder with a freshly initialized classification head for the task.
Training: batch size 32, AdamW lr=2e-5, 3 epochs, DataLoader with num_workers=4, pin_memory=True, mixed precision via autocast.
Evaluation: macro F1 on a held-out 15% split; early stopping on validation loss with patience 2.
Export: model.save_pretrained() plus ONNX export for the C++ inference server — 12 ms p95 latency vs 45 ms in pure Python eager mode after torch.compile.

The entire training script is ~80 lines. The team spent more time on label quality and class balance than on framework mechanics — which is the point of choosing PyTorch over writing CUDA by hand.
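
The evaluation step of the example can be sketched without any external libraries. The macro_f1 function and EarlyStopping class below are hypothetical helpers written for illustration, not code from the Harbor Support script; the predictions and loss values are made up.

```python
import torch


def macro_f1(preds, labels, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = ((preds == c) & (labels == c)).sum().item()
        fp = ((preds == c) & (labels != c)).sum().item()
        fn = ((preds != c) & (labels == c)).sum().item()
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return sum(scores) / num_classes


class EarlyStopping:
    """Signal a stop after `patience` epochs without a new best loss."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Three classes: 0 = billing, 1 = technical, 2 = account.
preds = torch.tensor([0, 1, 2, 2, 1, 0])
labels = torch.tensor([0, 1, 2, 1, 1, 0])
score = macro_f1(preds, labels, num_classes=3)
print(round(score, 4))  # 0.8222 = (1.0 + 0.8 + 0.6667) / 3

stopper = EarlyStopping(patience=2)
decisions = [stopper.step(v) for v in [0.90, 0.70, 0.72, 0.71]]
print(decisions)  # [False, False, False, True]
```

Macro F1 weights every class equally, which is why it suits a router where a rare category matters as much as a common one.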

Framework decision table

Need	PyTorch	TensorFlow / Keras	JAX
Research flexibility, dynamic graphs	Best fit — eager by default	Eager mode available; Keras 3 unifies APIs	Functional transformations; steep learning curve
LLM / transformer ecosystem	Hugging Face, vLLM, PyTorch native	Smaller hub; improving	Flax + Hugging Face; common in Google stacks
Mobile / edge deployment	TorchScript, ExecuTorch, ONNX	TFLite mature on Android	Less edge tooling
TPU training	Supported via XLA bridge	Native TPU integration	First-class pmap / jit
Production serving at scale	TorchServe, TensorRT, Triton	TF Serving, TFX pipelines	Custom; often research-first

For most teams shipping NLP or vision models in 2026, PyTorch is the default choice unless you are locked into TFLite on mobile or TPU-only Google infra.

Common pitfalls

Device mismatch: model on GPU, data on CPU. Fix with consistent .to(device).
Forgetting zero_grad(): gradients accumulate across steps, causing unstable training.
Training in eval mode: dropout is disabled and batch-norm running stats are frozen; the model appears to stop learning.
Double softmax: applying softmax before CrossEntropyLoss runs softmax twice (the loss applies log-softmax internally), which flattens gradients and slows or stalls learning.
Shuffling validation data: makes epoch-to-epoch metrics incomparable; shuffle train only.
num_workers=0 on large datasets: GPU sits idle waiting for CPU loading.
Not setting seeds: results are irreproducible across runs; use torch.manual_seed and torch.cuda.manual_seed_all.
Saving the wrong object: pickling the entire model ties the checkpoint to your class definitions and can break across PyTorch versions; save state_dict instead.
Memory leaks in the training loop: appending loss tensors to a list for logging keeps each step's computation graph alive; store loss.item() (a Python float) instead.
Compiling too early: debug in eager mode first; torch.compile obscures errors.
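
For the seeding pitfall, a small sketch of a helper that seeds the generators named above. The function name set_seed and the seed value are assumptions; projects that use NumPy would seed it as well.

```python
import random

import torch


def set_seed(seed):
    """Seed Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call without a GPU; it is ignored when CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)


set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)
print(torch.equal(a, b))  # True: same seed, same random draw
```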

Practitioner checklist

Pin PyTorch and CUDA versions in requirements.txt — wheels are CUDA-specific.
Verify tensor shapes at the model boundary with a single batch before long training runs.
Use model.train() / model.eval() explicitly each epoch phase.
Log learning rate, loss, and primary metric every epoch (Weights & Biases, TensorBoard, or MLflow).
Save best checkpoint by validation metric, not final epoch.
Run a CPU-only smoke test in CI so imports and forward passes do not regress.
Profile one epoch with torch.profiler before scaling num_workers or batch size.
Export to ONNX or TorchScript only after accuracy parity tests against eager inference.
Document random seeds and data splits for auditability.
Plan GPU memory: gradient checkpointing and smaller batch sizes beat OOM crashes mid-run.

Key takeaways

PyTorch tensors extend NumPy with GPU placement and autograd for gradient-based learning.
nn.Module + optimizer + loss function + DataLoader is the universal training stack.
The training loop is seven lines of core logic repeated for every batch of every epoch; master it once.
Mixed precision and torch.compile are low-effort performance wins once correctness is proven.
Save state_dict checkpoints and keep train/eval modes explicit to avoid subtle bugs.