PyTorch fundamentals explained
You have labeled support tickets and want a classifier that runs in production.
NumPy can multiply matrices, but it cannot automatically compute gradients through
a 12-layer network, shard batches across GPUs, or export to ONNX for serving.
PyTorch is the open-source deep learning framework that fills that
gap. It is used by Meta, OpenAI, Hugging Face, and most academic labs because its
Python-first API feels like writing normal code while still dispatching to fast
CUDA kernels. This guide covers tensors and device placement, automatic
differentiation with autograd, building models with
nn.Module, feeding data through DataLoader, the standard
training loop every project shares, GPU and mixed-precision basics,
torch.compile optimizations, a Harbor Support ticket-router worked
example, a framework comparison table, common pitfalls, and a practitioner
checklist alongside our deep learning primer, gradient descent guide, and
backpropagation explainer.
What PyTorch provides
PyTorch is a tensor computation library with automatic
differentiation built in. Tensors are n-dimensional arrays (like NumPy
ndarray) that can live on CPU or GPU, track computation history
for gradient calculation, and dispatch to optimized BLAS/cuDNN kernels. The
high-level torch.nn package supplies layers, loss functions, and
optimizers; torch.utils.data handles batching and shuffling;
torch.distributed scales training across machines.
Unlike graph-first frameworks that build a static computation graph before
execution, PyTorch uses eager execution by default: each
operation runs immediately, which makes debugging with print() and
Python breakpoints natural. Since PyTorch 2.0, torch.compile can
fuse operations into optimized graphs when you need inference speed without
sacrificing the eager development experience.
Core modules you will touch daily
- torch: tensor creation, math, random, device management
- torch.nn: Module, Linear, Conv2d, LayerNorm, loss functions
- torch.optim: SGD, Adam, AdamW, learning-rate schedulers
- torch.utils.data: Dataset, DataLoader
- torch.cuda.amp: automatic mixed precision (FP16/BF16)
Tensors: shape, dtype, and device
Everything in PyTorch is a tensor. Shape matters: a batch of
32 RGB images at 224×224 is (32, 3, 224, 224) in
NCHW order (batch, channels, height, width). Dtype controls
memory and precision — float32 for training,
float16 or bfloat16 for mixed precision,
int64 for class labels.
The device tells PyTorch where memory lives:
cpu, cuda:0 (first GPU), or mps on Apple
Silicon. A classic bug is creating a model on GPU but leaving input tensors on
CPU — operations fail with device-mismatch errors. The fix is consistent
.to(device) calls or moving the model once and passing device-aware
batches.
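As a minimal sketch of the shape, dtype, and device ideas above (the variable names are illustrative, and the device fallback order is an assumption rather than a rule), the following picks whatever accelerator is available and moves a batch onto it:

```python
import torch

# Pick the best available device; fall back to CPU so the sketch runs anywhere.
mps_backend = getattr(torch.backends, "mps", None)
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif mps_backend is not None and mps_backend.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# A batch of 32 RGB images at 224x224 in NCHW order, stored as float32.
images = torch.zeros(32, 3, 224, 224, dtype=torch.float32)
# One int64 class label per image.
labels = torch.zeros(32, dtype=torch.int64)

# Move both tensors to the same device the model would live on.
images = images.to(device)
labels = labels.to(device)

print(images.shape, images.dtype, images.device)
```

Keeping a single `device` variable and passing it to every `.to()` call is one simple way to avoid the device-mismatch bug described above.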
Broadcasting and views
PyTorch follows NumPy-style broadcasting: tensors with
compatible trailing dimensions can operate without explicit expansion. Use
.view() or .reshape() to change shape without copying
data when possible; .permute() reorders dimensions (essential when
interfacing with libraries that expect channels-last layout). In-place
operations (suffix _) save memory but can break autograd if they
overwrite values needed for backward passes.
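A small sketch of broadcasting, views, and permutation, using made-up tensors, may help make these rules concrete:

```python
import torch

x = torch.arange(6, dtype=torch.float32).reshape(2, 3)  # shape (2, 3)
bias = torch.tensor([10.0, 20.0, 30.0])                 # shape (3,)

# Broadcasting: bias is applied to every row without an explicit expand.
y = x + bias                                            # shape (2, 3)

# view() shares storage with x, so a write through the view also changes x.
flat = x.view(6)
flat[0] = 100.0

# permute() reorders dimensions, here NCHW -> NHWC (channels-last layout).
nchw = torch.zeros(8, 3, 32, 32)
nhwc = nchw.permute(0, 2, 3, 1)                         # shape (8, 32, 32, 3)

# The permuted tensor is no longer contiguous, so view() would raise here;
# reshape() works because it copies when it has to.
flat_pixels = nhwc.reshape(8, -1)
```

The write through `flat` illustrates why in-place edits on views deserve care: the original tensor changes too.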
Autograd: automatic differentiation
requires_grad=True tells PyTorch to track operations on a tensor
so it can compute partial derivatives. When you call loss.backward(),
gradients flow backward through the computation graph and accumulate in
tensor.grad. Optimizers read those gradients and update
nn.Parameter weights.
Wrap inference and validation in torch.no_grad() (or the newer
torch.inference_mode()) to skip graph construction. Because intermediate
activations are no longer kept for a backward pass, memory use drops, often
substantially, and evaluation runs faster. During training, always
call optimizer.zero_grad() before loss.backward(); otherwise
gradients from the previous step are added to the new ones.
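The accumulation behaviour is easiest to see on a toy scalar loss. This is a hand-worked sketch with invented values, not a training recipe:

```python
import torch

w = torch.tensor([2.0, -1.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

# loss = (w . x - 1)^2 = (6 - 4 - 1)^2 = 1
loss = ((w * x).sum() - 1.0) ** 2
loss.backward()
# d loss / d w = 2 * (w . x - 1) * x = [6, 8]
first = w.grad.clone()

# A second backward pass without zeroing adds to the stored gradient.
loss = ((w * x).sum() - 1.0) ** 2
loss.backward()
second = w.grad.clone()  # [12, 16]: twice the true gradient

# Resetting by hand here; optimizer.zero_grad() does this for every
# parameter the optimizer manages.
w.grad.zero_()

# Inside no_grad() no graph is recorded, so the result cannot be backpropagated.
with torch.no_grad():
    pred = (w * x).sum()

print(first, second, pred.requires_grad)
```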
Detaching and stopping gradients
Use .detach() when you need a tensor’s value but not its
gradient — common in GANs, reinforcement learning target networks, and
teacher-student distillation. Wrapping metric computation in
with torch.no_grad(): prevents accidental graph retention that balloons GPU
memory across epochs.
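A rough distillation-style sketch shows the effect of .detach(). The logits are invented, and the teacher tensor is marked as requiring gradients only to demonstrate that none reach it:

```python
import torch

student_logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)
teacher_logits = torch.tensor([[0.8, 2.5, 0.1]], requires_grad=True)

# Detach the teacher so the loss only sends gradients to the student.
target = torch.softmax(teacher_logits.detach(), dim=-1)
log_probs = torch.log_softmax(student_logits, dim=-1)
loss = -(target * log_probs).sum()  # soft-label cross-entropy
loss.backward()

print(student_logits.grad is not None)  # the student received a gradient
print(teacher_logits.grad is None)      # the teacher did not

# For logging, .item() returns a plain Python float with no graph attached,
# so a history list does not keep computation graphs alive.
history = [loss.item()]
```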
nn.Module: building models
Subclass nn.Module and define layers in __init__, then
wire them in forward(). PyTorch registers parameters automatically
when you assign nn.Linear, nn.Conv2d, etc. as
attributes. The forward method is what runs on each batch; do not
call it directly in training loops — use model(inputs) so hooks
and torch.compile wrappers work correctly.
Sequential containers stack layers for simple MLPs.
For transformers and U-Nets, explicit forward logic with skip
connections is clearer. Use model.train() before training (enables
dropout and batch-norm updates) and model.eval() before validation
(freezes batch-norm running stats and disables dropout).
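As an illustration, here is a small hypothetical module (the name TicketMLP and its layer sizes are assumptions made for this sketch) showing parameter registration and the effect of train() versus eval():

```python
import torch
from torch import nn


class TicketMLP(nn.Module):
    """Hypothetical classifier: feature vector in, class logits out."""

    def __init__(self, in_features: int = 16, hidden: int = 32, num_classes: int = 3):
        super().__init__()
        # Assigning layers as attributes registers their parameters.
        self.fc1 = nn.Linear(in_features, hidden)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(x))
        h = self.dropout(h)
        return self.fc2(h)  # raw logits, no softmax


model = TicketMLP()
batch = torch.randn(4, 16)

model.train()             # dropout active
train_out = model(batch)  # call the module, not model.forward(...)

model.eval()              # dropout disabled, so outputs are deterministic
with torch.no_grad():
    eval_a = model(batch)
    eval_b = model(batch)

num_params = sum(p.numel() for p in model.parameters())
print(train_out.shape, num_params)
```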
Loss functions and optimizers
Pair nn.CrossEntropyLoss with raw logits (no softmax — the loss
applies log-softmax internally). For binary tasks, BCEWithLogitsLoss
is numerically stable. AdamW is the default optimizer for most
transformer fine-tuning; SGD with momentum still wins on some vision tasks with
careful learning-rate schedules. Attach a scheduler — cosine annealing or
warmup-then-decay — when training beyond a few epochs.
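To make the "raw logits" point concrete, this sketch (random data, arbitrary hyperparameters) checks that nn.CrossEntropyLoss matches an explicit log-softmax followed by negative log-likelihood, then takes one AdamW step with a cosine schedule:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

inputs = torch.randn(5, 8)
labels = torch.tensor([0, 2, 1, 1, 0])  # int64 class indices

logits = model(inputs)            # raw scores, no softmax applied
loss = criterion(logits, labels)

# The same value computed by hand: log-softmax, then negative log-likelihood.
manual = nn.functional.nll_loss(torch.log_softmax(logits, dim=-1), labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()                  # learning rate starts to decay

print(loss.item(), scheduler.get_last_lr())
```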
DataLoader: batches, workers, and pinning
Implement Dataset.__getitem__ to return one sample and
__len__ for the count. DataLoader wraps it with
batching, shuffling, and parallel loading via num_workers.
Set pin_memory=True when training on CUDA so batches are staged in pinned
(page-locked) host memory; host-to-device copies can then run asynchronously
when you pass non_blocking=True to .to().
Collate functions handle variable-length sequences — pad text
token IDs to the longest item in the batch and return an attention mask.
For large datasets that do not fit in RAM, use IterableDataset or
memory-mapped formats. Shuffle training loaders every epoch; keep validation
loaders deterministic (no shuffle).
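A sketch of a padding collate function follows. The dataset class, the pad_collate helper, and a pad id of 0 are all assumptions for illustration; a real tokenizer defines its own pad id.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TokenDataset(Dataset):
    """Hypothetical dataset of pre-tokenized sequences and integer labels."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx], dtype=torch.long), self.labels[idx]


def pad_collate(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build a mask."""
    ids, labels = zip(*batch)
    max_len = max(seq.size(0) for seq in ids)
    input_ids = torch.full((len(ids), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros(len(ids), max_len, dtype=torch.long)
    for row, seq in enumerate(ids):
        input_ids[row, : seq.size(0)] = seq
        attention_mask[row, : seq.size(0)] = 1  # 1 marks real tokens
    return input_ids, attention_mask, torch.tensor(labels, dtype=torch.long)


dataset = TokenDataset([[5, 6, 7], [8], [9, 10]], [0, 1, 2])
loader = DataLoader(
    dataset,
    batch_size=3,
    shuffle=False,                         # deterministic, as for validation
    collate_fn=pad_collate,
    num_workers=0,                         # raise for real datasets
    pin_memory=torch.cuda.is_available(),  # only useful with CUDA
)
input_ids, attention_mask, labels = next(iter(loader))
print(input_ids)
```

num_workers is left at 0 so the sketch runs as a plain script on any platform; worker processes are worth enabling once loading becomes the bottleneck.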
The standard training loop
Every PyTorch project follows the same skeleton:
- Load a batch: inputs, labels = next(iter_loader)
- Move to device: inputs, labels = inputs.to(device), labels.to(device)
- Zero gradients: optimizer.zero_grad()
- Forward pass: outputs = model(inputs)
- Compute loss: loss = criterion(outputs, labels)
- Backward pass: loss.backward()
- Optimizer step: optimizer.step()
- Scheduler step (if per-batch): scheduler.step()
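Put together, the steps above can be sketched as a complete loop. The data here is synthetic and the model, learning rate, and epoch count are arbitrary choices made only so the example runs quickly on CPU:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic, linearly separable data: the label depends on the first feature.
features = torch.randn(256, 10)
targets = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

epoch_losses = []
for epoch in range(5):
    model.train()
    running = 0.0
    for inputs, labels in loader:                              # load a batch
        inputs, labels = inputs.to(device), labels.to(device)  # move to device
        optimizer.zero_grad()                                  # zero gradients
        outputs = model(inputs)                                # forward pass
        loss = criterion(outputs, labels)                      # compute loss
        loss.backward()                                        # backward pass
        optimizer.step()                                       # optimizer step
        running += loss.item() * inputs.size(0)                # log a float
    epoch_losses.append(running / len(loader.dataset))

print(epoch_losses)
```

Iterating the DataLoader directly replaces the explicit next() call, and a per-batch scheduler step, if used, would go right after optimizer.step().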
Track training loss and validation metrics each epoch. Save
model.state_dict() — not the whole model object — for portable
checkpoints. Include optimizer state when you need to resume long runs exactly.
Use torch.save / torch.load with map_location
when moving checkpoints between CPU and GPU machines.
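A minimal checkpoint round trip might look like the following. The dictionary keys and the temporary file path are illustrative choices, not a required layout:

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Bundle everything needed to resume; the key names are arbitrary.
checkpoint = {
    "epoch": 3,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(checkpoint, path)

# Restore into freshly constructed objects. map_location="cpu" lets a
# checkpoint written on a GPU machine load on a CPU-only one.
restored = nn.Linear(4, 2)
restored_optimizer = torch.optim.AdamW(restored.parameters(), lr=1e-3)
state = torch.load(path, map_location="cpu")
restored.load_state_dict(state["model_state"])
restored_optimizer.load_state_dict(state["optimizer_state"])

print(state["epoch"])
```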
Mixed precision with autocast
torch.autocast (also exposed as torch.cuda.amp.autocast(), which recent
releases deprecate in favor of the device-agnostic form) runs matmuls in
FP16/BF16 while keeping precision-sensitive ops such as losses and reductions
in FP32. Pair FP16 with GradScaler to prevent
gradient underflow. On modern NVIDIA GPUs (Ampere and later), BF16 often needs
no loss scaling. Mixed precision can nearly double throughput with minimal
accuracy impact on large models.
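The sketch below uses the device-agnostic torch.autocast with BF16 so that it also runs on CPU; the tiny model and random data are placeholders. With FP16 on CUDA you would additionally wrap the backward pass and optimizer step with a GradScaler, which is not shown here.

```python
import torch
from torch import nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(64, 8).to(device_type)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

inputs = torch.randn(16, 64, device=device_type)
labels = torch.randint(0, 8, (16,), device=device_type)

optimizer.zero_grad()
# Inside the context, the linear layer's matmul runs in bfloat16.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits = model(inputs)

# Compute the loss in float32; the explicit cast keeps this sketch
# independent of per-device autocast rules for loss functions.
loss = nn.functional.cross_entropy(logits.float(), labels)
loss.backward()
optimizer.step()

# Parameters themselves stay in float32; only the compute was downcast.
print(logits.dtype, model.weight.dtype, loss.item())
```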
GPU, distributed training, and torch.compile
Check GPU availability with torch.cuda.is_available(). Multi-GPU
on one machine typically uses torch.nn.DataParallel (simple but
slower) or DistributedDataParallel (DDP — preferred: one process
per GPU, gradients synced via NCCL). For multi-node training, launch with
torchrun and initialize the process group with
dist.init_process_group.
torch.compile(model) (PyTorch 2+) captures the model's graph with TorchDynamo,
generates fused kernels through TorchInductor, and can yield 20–40% inference
speedups, depending on model and hardware, with no code changes
beyond wrapping the model. Compile after debugging in eager mode, because stack traces
from compiled graphs are harder to read. Use mode="reduce-overhead"
for small-batch inference and mode="max-autotune" for offline
benchmarking.
Worked example: Harbor Support ticket router
Harbor Support receives 4,000 tickets per day across billing, technical, and account categories. A team fine-tunes a small classifier in PyTorch:
- Dataset: CSV with text and label columns; Dataset tokenizes with a Hugging Face tokenizer, returning input_ids and attention_mask tensors.
- Model: AutoModelForSequenceClassification from transformers with three output classes, a pretrained DistilBERT with its head swapped for the task.
- Training: batch size 32, AdamW lr=2e-5, 3 epochs, DataLoader with num_workers=4, pin_memory=True, mixed precision via autocast.
- Evaluation: macro F1 on a held-out 15% split; early stopping on validation loss with patience 2.
- Export: model.save_pretrained() plus ONNX export for the C++ inference server; 12 ms p95 latency vs 45 ms in pure Python eager mode after torch.compile.
The entire training script is ~80 lines. The team spent more time on label quality and class balance than on framework mechanics — which is the point of choosing PyTorch over writing CUDA by hand.
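The evaluation pieces of this setup can be sketched in plain PyTorch. The macro_f1 function and EarlyStopping class below are hypothetical helpers written for this guide, not the team's actual code, and the predictions and validation losses are invented:

```python
import torch


def macro_f1(preds: torch.Tensor, labels: torch.Tensor, num_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = ((preds == c) & (labels == c)).sum().item()
        fp = ((preds == c) & (labels != c)).sum().item()
        fn = ((preds != c) & (labels == c)).sum().item()
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return sum(scores) / num_classes


class EarlyStopping:
    """Signal a stop after `patience` epochs without a new best loss."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Classes 0, 1, 2 stand in for billing, technical, and account.
preds = torch.tensor([0, 1, 2, 2, 1, 0])
labels = torch.tensor([0, 1, 2, 1, 1, 0])
score = macro_f1(preds, labels, num_classes=3)

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.6]):
    if stopper.step(val_loss):
        stopped_at = epoch  # two epochs in a row failed to beat 0.7
        break

print(round(score, 4), stopped_at)
```

Macro F1 weights each category equally, which matters when one ticket class is much rarer than the others.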
Framework decision table
| Need | PyTorch | TensorFlow / Keras | JAX |
|---|---|---|---|
| Research flexibility, dynamic graphs | Best fit — eager by default | Eager mode available; Keras 3 unifies APIs | Functional transformations; steep learning curve |
| LLM / transformer ecosystem | Hugging Face, vLLM, PyTorch native | Smaller hub; improving | Flax + Hugging Face; common in Google stacks |
| Mobile / edge deployment | TorchScript, ExecuTorch, ONNX | TFLite mature on Android | Less edge tooling |
| TPU training | Supported via XLA bridge | Native TPU integration | First-class pmap / jit |
| Production serving at scale | TorchServe, TensorRT, Triton | TF Serving, TFX pipelines | Custom; often research-first |
For most teams shipping NLP or vision models in 2026, PyTorch is the default choice unless you are locked into TFLite on mobile or TPU-only Google infra.
Common pitfalls
- Device mismatch: model on GPU, data on CPU. Fix with consistent .to(device).
- Forgetting zero_grad(): gradients accumulate across steps, causing unstable training.
- Training in eval mode: dropout disabled and batch-norm frozen; model appears to stop learning.
- Double softmax: applying softmax before CrossEntropyLoss compresses the logits and hurts numerical stability.
- Shuffling validation data: makes epoch-to-epoch metrics incomparable; shuffle train only.
- num_workers=0 on large datasets: GPU sits idle waiting for CPU loading.
- Not setting seeds: results are irreproducible across runs; use torch.manual_seed and torch.cuda.manual_seed_all.
- Saving the wrong object: pickling the entire model breaks across PyTorch versions; save state_dict.
- Memory leaks in training loop: retaining computation graphs via logging tensor lists; use .item() for scalars.
- Compiling too early: debug in eager mode first; torch.compile obscures errors.
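For the seeding pitfall, a small helper along these lines is a common starting point. The name seed_everything is an assumption, and full determinism on GPU can need further settings beyond seeds:

```python
import random

import torch


def seed_everything(seed: int) -> None:
    """Seed Python's and PyTorch's random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call without a GPU; it is ignored when CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)


seed_everything(42)
first = torch.rand(3)

seed_everything(42)
second = torch.rand(3)

# Re-seeding reproduces the same random draw.
print(torch.equal(first, second))
```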
Practitioner checklist
- Pin PyTorch and CUDA versions in requirements.txt; wheels are CUDA-specific.
- Verify tensor shapes at the model boundary with a single batch before long training runs.
- Use model.train() / model.eval() explicitly each epoch phase.
- Log learning rate, loss, and primary metric every epoch (Weights & Biases, TensorBoard, or MLflow).
- Save best checkpoint by validation metric, not final epoch.
- Run a CPU-only smoke test in CI so imports and forward passes do not regress.
- Profile one epoch with torch.profiler before scaling num_workers or batch size.
- Export to ONNX or TorchScript only after accuracy parity tests against eager inference.
- Document random seeds and data splits for auditability.
- Plan GPU memory: gradient checkpointing and smaller batch sizes beat OOM crashes mid-run.
Key takeaways
- PyTorch tensors extend NumPy with GPU placement and autograd for gradient-based learning.
- nn.Module + optimizer + loss function + DataLoader is the universal training stack.
- The training loop is seven lines of core logic repeated every batch; master it once.
- Mixed precision and torch.compile are low-cost performance wins after correctness is proven.
- Save state_dict checkpoints and keep train/eval modes explicit to avoid subtle bugs.
Related reading
- Deep learning explained — neural network concepts that PyTorch implements
- Backpropagation explained — the algorithm autograd automates
- Gradient descent explained — how optimizers use computed gradients
- Python fundamentals explained — language basics before diving into ML code