Explainer · 7 June 2026
How Linux containers, cgroups, and namespaces work
When someone says "we run it in Docker," they usually picture a lightweight virtual machine — a guest OS booting inside a box. That mental model is wrong and leads to bad capacity planning. A Linux container is an ordinary process (or tree of processes) on the host kernel, wrapped in namespaces that change what it can see and cgroups that cap what it can consume. There is no second kernel, no emulated CPU, and no full hardware virtualization layer unless you explicitly add one. Understanding that distinction explains why containers start in milliseconds, why they share the host's syscall surface, and why a kernel CVE can affect every pod on a node.
Containers vs virtual machines
A virtual machine (VM) runs on a hypervisor that emulates or partitions hardware. The guest boots its own kernel, manages its own page tables, and schedules its own processes, isolated at the hardware or ring-privilege boundary. A container skips the guest kernel: your Node.js or Rust binary calls the same Linux kernel as the host, subject to extra rules.
- Startup: VMs pay for firmware and kernel boot; containers run on the already-running host kernel and only start userspace.
- Density: hundreds of containers per host are routine; dozens of VMs are already heavy.
- Isolation strength: VM escape is rare; container escape via kernel bugs or misconfigured privileges is a realistic threat.
- Portability: images bundle userspace libraries, but the host kernel version and the syscalls it offers must still match what the image expects.
Production fleets often combine both: Kubernetes nodes as VMs for hard multi-tenant boundaries, pods as containers for fast deploy cycles behind a load balancer.
Namespaces — faking separate machines
Linux namespaces partition kernel data structures so a process group sees a customized view of the system. The container runtime creates a new namespace set, then execs your entrypoint inside it. Key types:
- PID: process ID 1 inside the container is your app, not systemd on the host; ps lists only processes in the same PID namespace.
- Mount: separate filesystem tree; bind mounts inject host paths (sockets, secrets) at chosen mount points.
- Network: own interfaces, routing table, and iptables/nft rules; often a veth pair connects to a bridge or CNI plugin on the host.
- UTS: unique hostname (my-service-7f3a).
- IPC: isolated SysV IPC and POSIX message queues.
- User: maps container UID 0 to an unprivileged host UID so root inside is not root outside (when configured).
- Cgroup (namespace): virtualizes the process's view of the cgroup hierarchy, so the container sees its own cgroup as the root.
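One way to see namespace membership is to read the symlinks under /proc/<pid>/ns, whose targets look like pid:[4026531836]. The sketch below is a minimal, Linux-only illustration; the function names parse_ns_link and namespaces_of are made up for this example.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: inode number} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # entry unreadable or in an unexpected format
        result[entry] = inode
    return result


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for name, inode in namespaces_of().items():
        print(f"{name:20} {inode}")
```

Two processes are in the same namespace of a given type when the inode numbers match. Comparing the output inside a container with the output on the host typically shows different numbers for pid, mnt, and net.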
Namespaces are not security boundaries by themselves. A process with
CAP_SYS_ADMIN or access to the host mount namespace can break
out. Hardening stacks seccomp (syscall allowlists), AppArmor/SELinux, and
read-only root filesystems on top of namespace isolation.
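To check whether a process actually holds CAP_SYS_ADMIN, one option is to decode the CapEff bitmask in /proc/<pid>/status. This is a sketch under the assumption that the status text is passed in as a string; the helper names are illustrative, and the sample masks in the usage note are examples rather than values taken from any particular runtime.

```python
CAP_SYS_ADMIN = 21  # bit index defined in linux/capability.h


def effective_caps(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """True when the capability bit is set in the mask."""
    return bool((mask >> cap) & 1)


def can_sys_admin(status_text):
    """True when the process holds CAP_SYS_ADMIN in its effective set."""
    return has_cap(effective_caps(status_text), CAP_SYS_ADMIN)
```

For example, can_sys_admin("CapEff:\t00000000a80425fb\n") is False because bit 21 is clear in that mask, while a mask such as 0000003fffffffff has it set. On a live system the input would come from open("/proc/self/status").read().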
cgroups — CPU, memory, and I/O budgets
Control groups (cgroups) attach resource limits and accounting
to processes. Modern Linux uses cgroup v2 unified hierarchy;
orchestrators write limits under /sys/fs/cgroup. Common knobs:
- memory.max — hard RAM cap; exceed it and the OOM killer terminates container processes (not the whole host, if limits are correct).
- cpu.max — bandwidth quota per period (e.g. 50% of one core); prevents a runaway parser from starving neighbors on the same node.
- pids.max — fork bomb protection.
- io.max — throttle disk read/write bytes per second on shared SSDs.
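The knobs above are plain text files, so reading a container's effective limits is mostly string parsing. The following is a minimal sketch assuming cgroup v2 and a caller-supplied cgroup directory; read_limits and its dictionary keys are names invented for this example.

```python
from pathlib import Path


def parse_cpu_max(text):
    """cpu.max holds '<quota> <period>' in microseconds, or 'max <period>'.

    Returns the allowance in cores, or None when unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_limit(text):
    """memory.max and pids.max hold a number or the literal 'max'."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_limits(cgroup_dir):
    """Read CPU, memory, and PID limits from one cgroup v2 directory."""
    d = Path(cgroup_dir)
    return {
        "cpu_cores": parse_cpu_max((d / "cpu.max").read_text()),
        "memory_bytes": parse_limit((d / "memory.max").read_text()),
        "pids": parse_limit((d / "pids.max").read_text()),
    }
```

A cpu.max of "50000 100000" parses to 0.5, matching the "50% of one core" example. These files do not exist in the root cgroup, so read_limits is meant for a container's own cgroup directory.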
cgroup memory limits interact with virtual memory: a container can hit its cap while the host still has free RAM because page cache charged to the cgroup counts toward the limit. Setting requests and limits in Kubernetes without measuring working set leads to mysterious OOMKilled pods that look fine in host-level free -m output.
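To see how much of a cgroup's charge is page cache rather than application memory, the anon and file counters in memory.stat can be compared against the limit. This is a rough sketch: it ignores kernel memory and other counters that also count toward memory.max, and memory_breakdown is a name chosen for this example.

```python
def parse_memory_stat(text):
    """Parse the '<key> <bytes>' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def memory_breakdown(stat_text, limit_bytes):
    """Approximate how anonymous memory and page cache fill the limit."""
    stats = parse_memory_stat(stat_text)
    anon = stats.get("anon", 0)   # heap, stack, and other anonymous pages
    cache = stats.get("file", 0)  # page cache charged to this cgroup
    return {
        "anon_bytes": anon,
        "page_cache_bytes": cache,
        "fraction_of_limit": (anon + cache) / limit_bytes,
    }
```

With 100 MiB of anon, 140 MiB of file, and a 256 MiB limit, fraction_of_limit comes out at 0.9375 even though the application itself holds well under half the limit.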
CPU throttling shows up as high latency without high utilization — the cgroup exhausted its quota mid-request. Pair cgroup metrics with RED metrics (rate, errors, duration) to distinguish saturation from misconfigured limits.
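Throttling can be measured from the cgroup's cpu.stat file, which in cgroup v2 reports nr_periods, nr_throttled, and throttled_usec. The sketch below compares two snapshots; the function names are illustrative, and any alert threshold is left to the reader.

```python
def parse_cpu_stat(text):
    """Parse the '<key> <value>' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before, after):
    """Fraction of enforcement periods in which the cgroup was throttled.

    Both arguments are dicts from parse_cpu_stat, taken at two points in time.
    """
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0
```

A ratio of 0.25 means the quota ran out in one period out of four during the sampling window, which can explain latency spikes while average utilization looks modest.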
Images, layers, and overlay filesystems
A container image is a stack of read-only filesystem layers
plus metadata (entrypoint, env vars, exposed ports). Each Dockerfile
instruction that modifies files adds a layer; unchanged layers are cached and
shared across images — two services both FROM debian:bookworm
reuse the same base tarballs on disk.
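Layer sharing works because layers are stored by content digest, so an identical base layer has the same key no matter which image references it. The following toy model illustrates the idea with an in-memory dictionary; BlobStore is a hypothetical class for this example, not the storage code of Docker or containerd.

```python
import hashlib


def digest(blob):
    """Content digest in the 'sha256:<hex>' form used for image blobs."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


class BlobStore:
    """Toy content-addressed store: identical bytes are kept once."""

    def __init__(self):
        self.blobs = {}

    def put(self, blob):
        key = digest(blob)
        self.blobs.setdefault(key, blob)  # no-op if already present
        return key


store = BlobStore()
base = b"pretend this is the debian:bookworm layer tarball"
image_a = [store.put(base), store.put(b"service A files")]
image_b = [store.put(base), store.put(b"service B files")]

# Both images reference the same base digest, stored once.
print(image_a[0] == image_b[0], len(store.blobs))
```

Here two images with two layers each occupy three blobs in the store, not four.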
At runtime, an overlay filesystem (overlayfs) merges the read-only lower layers with a thin writable upper layer. Writes go to upper; reads fall through to lowers. Modifying a lower-layer file first copies it up into upper, and deleting one creates a "whiteout" marker in upper. Removing the container discards the writable layer unless you commit it, so container state is ephemeral by design, like copy-on-write snapshots at the deployment unit level.
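The lookup rules can be modeled in a few lines. This is a toy model using dictionaries in place of directories, intended only to show the read, write, and whiteout behavior; it does not model directories, metadata, or how the kernel implements overlayfs.

```python
WHITEOUT = object()  # marker meaning "deleted in the upper layer"


def lookup(path, upper, lowers):
    """Resolve a path: upper first, then lowers from top-most to bottom."""
    if path in upper:
        if upper[path] is WHITEOUT:
            raise FileNotFoundError(path)
        return upper[path]
    for layer in lowers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)


def write(path, data, upper):
    """All writes land in the upper layer; lowers are never modified."""
    upper[path] = data


def delete(path, upper, lowers):
    """Hide a lower-layer file with a whiteout, or drop an upper-only file."""
    lookup(path, upper, lowers)  # raises if the path is not visible
    if any(path in layer for layer in lowers):
        upper[path] = WHITEOUT
    else:
        del upper[path]
```

Discarding the upper dictionary restores the original view, which is the toy equivalent of throwing away a container's writable layer.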
The Open Container Initiative (OCI) defines image format
(layers as tarballs + JSON config) and runtime spec (how to construct
namespaces, mounts, and cgroups). runc is the reference low-level
runtime; containerd and CRI-O manage images and call runc; Docker and
Kubernetes sit above that stack.
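To make the runtime spec concrete, here is a sketch that builds a fragment of an OCI runtime config.json as a Python dictionary. The field names follow the runtime spec, but the values (UID 1000, the rootfs path, the chosen limits) are assumptions for illustration, and a config a real runtime accepts usually needs more (mounts, capabilities, and so on).

```python
import json


def minimal_runtime_config(args, memory_bytes, cpu_quota_us, cpu_period_us, pids_limit):
    """Build an illustrative fragment of an OCI runtime config."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),
            "cwd": "/",
            "user": {"uid": 1000, "gid": 1000},  # assumed non-root user
        },
        "root": {"path": "rootfs", "readonly": True},
        "hostname": "my-service-7f3a",
        "linux": {
            # One new namespace of each listed type is created.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "mount", "ipc", "uts")
            ],
            "resources": {
                "memory": {"limit": memory_bytes},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
                "pids": {"limit": pids_limit},
            },
        },
    }


if __name__ == "__main__":
    config = minimal_runtime_config(["nginx"], 256 * 1024 * 1024, 50000, 100000, 256)
    print(json.dumps(config, indent=2))
```

The namespaces and resources sections correspond directly to the namespace types and cgroup knobs described earlier; the low-level runtime translates them into kernel calls and cgroup file writes.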
From docker run to a Kubernetes pod
docker run nginx does roughly the following: pull the image manifest and layers, create the namespaces (mount, network, PID, UTS, IPC), apply cgroup limits, overlay-mount the layers, set capabilities and a seccomp profile, then exec nginx as PID 1. The Docker daemon (or a rootless alternative) holds privileges the CLI user lacks.
Kubernetes schedules a pod: one or more containers that share a network namespace by default, so sidecars (Envoy proxy, log shipper, service mesh) can reach each other over localhost. Sharing a PID namespace is opt-in via shareProcessNamespace. The kubelet talks CRI to containerd; CNI plugins wire pod IPs; kube-proxy or eBPF programs set up Service routing across the cluster. A failed liveness probe restarts the container in place; the pod is not rescheduled unless the node fails. A failed readiness probe only takes the pod out of Service endpoints. Distinguish liveness from readiness when debugging traffic routed through a circuit breaker upstream.
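The pod-level settings discussed here map onto a handful of fields in a Pod manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod name, probe paths (/healthz, /ready), port, and resource figures are hypothetical placeholders.

```python
import json


def example_pod(name, image):
    """Build an illustrative Pod manifest with probes and resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # False is the Kubernetes default; True shares one PID namespace.
            "shareProcessNamespace": False,
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    "resources": {
                        "requests": {"memory": "200Mi", "cpu": "250m"},
                        "limits": {"memory": "256Mi", "cpu": "500m"},
                    },
                    # Failing liveness restarts the container in place.
                    "livenessProbe": {
                        "httpGet": {"path": "/healthz", "port": 8080},
                        "periodSeconds": 10,
                        "failureThreshold": 3,
                    },
                    # Failing readiness only removes the pod from endpoints.
                    "readinessProbe": {
                        "httpGet": {"path": "/ready", "port": 8080},
                        "periodSeconds": 5,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(example_pod("my-service", "nginx"), indent=2))
```

Keeping the two probes on separate endpoints makes it easier to tell "the process is wedged" apart from "the process is up but not ready for traffic".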
Security and operational pitfalls
- Privileged containers: disable most isolation; treat like root on the host.
- HostPath mounts: expose host disks or the Docker socket; compromising the container then means compromising the host.
- Image supply chain: pin digests, scan layers, sign with cosign; :latest is not a version.
- Single-process PID 1: without a tiny init, zombie processes accumulate; use tini or dumb-init.
- Kernel compatibility: eBPF, io_uring, and newer syscalls may behave differently across host versions; test on production-like AMIs.
- Noisy neighbor at the node: cgroup limits help but shared kernel locks and disk queues still correlate failures; spread critical workloads across nodes.
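The PID 1 pitfall is easier to see with a sketch of what a tiny init does: start the real program as a child, then keep calling wait so that any exited child, including orphans re-parented to PID 1, is reaped. This is a simplified illustration that assumes Python 3.9 or later and omits signal forwarding; it is not how tini or dumb-init are implemented.

```python
import os


def reap_until_exit(main_pid):
    """Reap every exited child; return the main child's exit code."""
    while True:
        pid, status = os.wait()  # blocks until some child exits
        if pid == main_pid:
            return os.waitstatus_to_exitcode(status)
        # Any other pid was an orphan or helper; waiting on it removed the zombie.


def run_with_init(argv):
    """Fork, exec argv in the child, and act as a reaping parent."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # conventional "command not found" exit code
    return reap_until_exit(pid)


if __name__ == "__main__":
    print(run_with_init(["sh", "-c", "exit 3"]))
```

Orphans are only re-parented to this process when it really is PID 1 in its PID namespace (or a registered subreaper), so outside a container the loop simply waits for the one child.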
Practical checklist
- Run as non-root inside the container; map UIDs with user namespaces where possible.
- Set memory requests near measured working set; leave headroom for page cache spikes.
- Read-only root + tmpfs for /tmp reduces runtime mutation attack surface.
- Drop capabilities; use seccomp and minimal base images (distroless, Alpine with eyes open).
- Log and alert on OOMKilled and CPU throttling events — they precede user-visible outages.
- Understand you are shipping processes, not VMs — kernel patching is still fleet-wide critical.
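The first checklist item can be verified from inside a container by reading /proc/self/uid_map, where each line is "<ID inside> <ID outside> <length>". The sketch below parses that format; the function names are illustrative, and "outside" here means the parent user namespace, which is the host when namespaces are nested only one level deep.

```python
def parse_id_map(text):
    """Parse uid_map or gid_map text into (inside, outside, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def host_uid(container_uid, id_map):
    """Translate a UID inside the namespace to the UID outside, or None."""
    for inside, outside, length in id_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # this UID is not mapped at all


def root_maps_to_host_root(id_map):
    """True when UID 0 inside is also UID 0 outside (no remapping)."""
    return host_uid(0, id_map) == 0
```

An identity map such as "0 0 4294967295" means no user namespace remapping is in effect, while "0 100000 65536" maps container root to unprivileged UID 100000 outside. Pairing this with a check that os.geteuid() is non-zero covers both halves of the item.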