Explainer · 7 June 2026

How Linux containers, cgroups, and namespaces work

When someone says "we run it in Docker," they usually picture a lightweight virtual machine — a guest OS booting inside a box. That mental model is wrong and leads to bad capacity planning. A Linux container is an ordinary process (or tree of processes) on the host kernel, wrapped in namespaces that change what it can see and cgroups that cap what it can consume. There is no second kernel, no emulated CPU, and no full hardware virtualization layer unless you explicitly add one. Understanding that distinction explains why containers start in milliseconds, why they share the host's syscall surface, and why a kernel CVE can affect every pod on a node.

Containers vs virtual machines

A virtual machine (VM) runs on a hypervisor that emulates or partitions hardware. The guest boots its own kernel, manages its own page tables, and schedules its own processes, all isolated at the hardware or ring-privilege boundary. A container skips the guest kernel: your Node.js or Rust binary calls the same Linux kernel as the host, subject to extra rules.

Startup — VMs pay for firmware and kernel boot; containers reuse the already-running host kernel and only start userspace.
Density — hundreds of containers per host are routine; dozens of VMs is already heavy.
Isolation strength — VM escape is rare; container escape via kernel bugs or misconfigured privileges is a real threat model.
Portability — images bundle userspace libraries; the kernel version and available syscalls still must match expectations.

Production fleets often combine both: Kubernetes nodes as VMs for hard multi-tenant boundaries, pods as containers for fast deploy cycles behind a load balancer.

Namespaces — faking separate machines

Linux namespaces partition kernel data structures so a process group sees a customized view of the system. The container runtime creates a new namespace set, then execs your entrypoint inside it. Key types:

PID — process ID 1 inside the container is your app, not systemd on the host; ps lists only processes in the same PID namespace.
Mount — separate filesystem tree; bind mounts inject host paths (sockets, secrets) at chosen mount points.
Network — own interfaces, routing table, and iptables/nft rules; often a veth pair connects to a bridge or CNI plugin on the host.
UTS — unique hostname (my-service-7f3a).
IPC — isolated SysV IPC and POSIX message queues.
User — maps container UID 0 to an unprivileged host UID so root inside is not root outside (when configured).
Cgroup (namespace) — virtualizes the view of the cgroup hierarchy, so the process sees its own cgroup as the root and cannot read the host's cgroup paths.
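One way to see these namespaces on a live system is to compare the links under /proc/<pid>/ns for two processes: each link target names the namespace type and an inode, and two processes share a namespace exactly when the inodes match. The sketch below assumes a Linux host with /proc mounted; the helper names are illustrative, not part of any runtime's API.

```python
import os

# Namespace link names as they appear under /proc/<pid>/ns on Linux.
NAMESPACE_TYPES = ("cgroup", "ipc", "mnt", "net", "pid", "user", "uts")


def parse_ns_link(target: str) -> tuple:
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    kind, _, rest = target.partition(":")
    if not (rest.startswith("[") and rest.endswith("]")):
        raise ValueError(f"unexpected namespace link: {target!r}")
    return kind, int(rest[1:-1])


def read_namespaces(pid="self") -> dict:
    """Return {namespace type: inode} for a process (Linux only)."""
    result = {}
    for kind in NAMESPACE_TYPES:
        try:
            target = os.readlink(f"/proc/{pid}/ns/{kind}")
        except OSError:
            continue  # type not supported by this kernel, or permission denied
        name, inode = parse_ns_link(target)
        result[name] = inode
    return result


def shared_namespaces(a: dict, b: dict) -> list:
    """Namespace types where both processes point at the same inode."""
    return sorted(k for k in a.keys() & b.keys() if a[k] == b[k])
```

Calling shared_namespaces(read_namespaces(), read_namespaces(some_pid)) from a host shell against a container's PID would typically show few or no shared types, while two containers in the same pod would share at least net.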

Namespaces are not security boundaries by themselves. A process with CAP_SYS_ADMIN or access to the host mount namespace can break out. Hardening stacks seccomp (syscall allowlists), AppArmor/SELinux, and read-only root filesystems on top of namespace isolation.
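The user namespace mapping mentioned in the list above is visible in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The following sketch translates a container UID to the host UID under that format; the function names and the sample range (100000, 65536) are assumptions chosen for illustration.

```python
def parse_id_map(text: str) -> list:
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def to_host_id(entries: list, container_id: int):
    """Return the host ID for a container ID, or None if it is unmapped."""
    for inside, outside, length in entries:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Hypothetical mapping: container IDs 0..65535 map to host IDs 100000..165535.
SAMPLE_MAP = "         0     100000      65536\n"
```

With that sample mapping, root inside the container (UID 0) is host UID 100000, which has no special privileges on the host.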

cgroups — CPU, memory, and I/O budgets

Control groups (cgroups) attach resource limits and accounting to processes. Modern Linux uses the cgroup v2 unified hierarchy; orchestrators write limits to files under /sys/fs/cgroup. Common knobs:

memory.max — hard RAM cap; when usage reaches it and the kernel cannot reclaim enough, the OOM killer terminates processes in the container (not the whole host, if limits are correct).
cpu.max — bandwidth quota per period (e.g. 50% of one core); prevents a runaway parser from starving neighbors on the same node.
pids.max — fork bomb protection.
io.max — throttle disk read/write bytes per second on shared SSDs.
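These files hold plain text, so they are easy to inspect. In cgroup v2, cpu.max contains a quota and a period in microseconds (or the word max for no quota), and memory.max and pids.max contain either a number or max. A minimal sketch of reading them, assuming a non-root cgroup directory whose path you supply yourself:

```python
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Convert cpu.max content ('50000 100000') to a limit in cores.

    Returns None when the quota is 'max', meaning no bandwidth limit.
    """
    fields = text.split()
    quota = fields[0]
    period = int(fields[1]) if len(fields) > 1 else 100000  # kernel default
    if quota == "max":
        return None
    return int(quota) / period


def parse_limit(text: str) -> Optional[int]:
    """Parse memory.max or pids.max content; None means unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_limits(cgroup_dir: str) -> dict:
    """Read the three limits from one cgroup v2 directory (path is caller-supplied)."""
    base = Path(cgroup_dir)
    return {
        "cpu_cores": parse_cpu_max((base / "cpu.max").read_text()),
        "memory_bytes": parse_limit((base / "memory.max").read_text()),
        "pids": parse_limit((base / "pids.max").read_text()),
    }
```

A cpu.max of "50000 100000" is the "50% of one core" case from the list: 50 ms of CPU time in every 100 ms period.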

cgroup memory limits interact with virtual memory: a container can hit its cap while the host still has free RAM because page cache charged to the cgroup counts toward the limit. Setting requests and limits in Kubernetes without measuring working set leads to mysterious OOMKilled pods that look fine in host-level free -m output.
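To make the page cache effect concrete, the sketch below estimates a working set from two cgroup v2 files: memory.current (total charged bytes) and memory.stat (a key/value breakdown). It subtracts inactive_file, the part of the page cache the kernel can reclaim most easily. Container monitoring tools commonly use this formula, but treat it as an assumption and check how your own metrics pipeline defines working set.

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse 'key value' lines, the format used by memory.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(memory_current: int, memory_stat_text: str) -> int:
    """Estimate working set as charged memory minus inactive file cache."""
    inactive_file = parse_flat_keyed(memory_stat_text).get("inactive_file", 0)
    return max(memory_current - inactive_file, 0)


def headroom_ratio(working_set: int, memory_max: int) -> float:
    """Fraction of the limit still free after the working set."""
    return 1.0 - working_set / memory_max
```

A container whose memory.current sits near memory.max but whose working set is far lower is mostly holding reclaimable cache; one whose working set itself is near the limit is a candidate for OOMKilled.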

CPU throttling shows up as high latency without high utilization — the cgroup exhausted its quota mid-request. Pair cgroup metrics with RED metrics (rate, errors, duration) to distinguish saturation from misconfigured limits.
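Throttling can be measured directly. When a CPU quota is set, the cgroup v2 cpu.stat file reports nr_periods, nr_throttled, and throttled_usec. A small sketch that turns those counters into a ratio (the function name and returned keys are illustrative):

```python
def throttle_summary(cpu_stat_text: str) -> dict:
    """Summarize CPU throttling from the content of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])

    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    return {
        # Share of enforcement periods in which the cgroup ran out of quota.
        "throttled_ratio": throttled / periods if periods else 0.0,
        # Total time tasks spent waiting for the next period, in seconds.
        "throttled_seconds": stats.get("throttled_usec", 0) / 1_000_000,
    }
```

The counters are cumulative, so in practice you would sample twice and compare the deltas; a rising ratio alongside modest CPU utilization points at the limit rather than at real saturation.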

Images, layers, and overlay filesystems

A container image is a stack of read-only filesystem layers plus metadata (entrypoint, env vars, exposed ports). Each Dockerfile instruction that modifies files adds a layer; unchanged layers are cached and shared across images — two services both FROM debian:bookworm reuse the same base tarballs on disk.

At runtime, an overlay filesystem (overlayfs) merges the read-only lower layers with a thin writable upper layer. Writes go to upper; reads fall through to lowers. Deleting a file from a lower layer creates a "whiteout" marker in upper. Removing the container discards the writable layer unless you commit it first (a stopped container still keeps it). The design is ephemeral on purpose, like copy-on-write snapshots at the deployment unit level.
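The lookup rules can be modeled in a few lines. The class below is an in-memory toy, not overlayfs itself: layers are plain dictionaries, and it ignores directories, permissions, and whiteouts that can appear inside image layers. It only illustrates the upper-first lookup, write-to-upper, and whiteout-on-delete behavior described above.

```python
WHITEOUT = object()  # marker meaning "deleted in the upper layer"


class OverlayModel:
    """Toy model of overlay lookup: one writable upper over read-only lowers."""

    def __init__(self, lowers):
        self.lowers = list(lowers)  # topmost lower layer first
        self.upper = {}

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is WHITEOUT:
                raise FileNotFoundError(path)
            return value
        for layer in self.lowers:  # fall through to the lower layers
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data  # lower layers are never modified

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if not visible
        if any(path in layer for layer in self.lowers):
            self.upper[path] = WHITEOUT  # hide the lower copy
        else:
            del self.upper[path]  # file existed only in upper
```

Because the lower dictionaries are never touched, several OverlayModel instances can share the same base layer, which mirrors how containers share image layers on disk.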

The Open Container Initiative (OCI) defines image format (layers as tarballs + JSON config) and runtime spec (how to construct namespaces, mounts, and cgroups). runc is the reference low-level runtime; containerd and CRI-O manage images and call runc; Docker and Kubernetes sit above that stack.
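To show how the runtime spec ties namespaces and cgroups together, here is a sketch that builds a fragment of an OCI config.json as a Python dictionary. The field names follow the OCI runtime specification, but this is a deliberately incomplete fragment with hypothetical default values; a full, valid config (such as the one generated by runc spec) contains mounts, capabilities, and more.

```python
import json


def minimal_oci_config(args, memory_bytes, cpu_quota_us,
                       cpu_period_us=100000, pids_limit=128):
    """Build an illustrative fragment of an OCI runtime config."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),  # the entrypoint and its arguments
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            # One entry per new namespace the runtime should create.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "network", "ipc", "uts", "mount")
            ],
            # Limits the runtime translates into cgroup settings.
            "resources": {
                "memory": {"limit": memory_bytes},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
                "pids": {"limit": pids_limit},
            },
        },
    }


def to_json(config) -> str:
    """Serialize the config the way it would be stored in config.json."""
    return json.dumps(config, indent=2)
```

The point of the structure is that a container is fully described by data: a root filesystem path, a process to exec, a list of namespaces, and resource limits.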

From docker run to a Kubernetes pod

docker run nginx does roughly the following: pulls the image manifest and layers, creates new namespaces (PID, mount, network, and others), applies cgroup limits, overlay-mounts the layers, sets capabilities and the seccomp profile, then execs nginx as PID 1. The Docker daemon (or rootless alternatives) holds privileges the CLI user lacks.

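Kubernetes schedules a pod: one or more containers that share a network namespace (and an IPC namespace) by default, so sidecars such as an Envoy proxy for a service mesh or a log shipper can reach each other over localhost. Sharing a PID namespace is opt-in via shareProcessNamespace. The kubelet talks CRI to containerd; CNI plugins wire pod IPs; kube-proxy or eBPF programs set up Service routing across the cluster. A failed liveness probe makes the kubelet restart the container in place on the same node; a replacement pod is scheduled elsewhere only if the node fails or the pod is evicted. A failed readiness probe only removes the pod from Service endpoints. Distinguish liveness from readiness when debugging traffic routed through a circuit breaker upstream.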
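A pod with a sidecar and both probe types might look like the manifest built below. It is a sketch: the image names, port 8080, and the /healthz and /ready paths are hypothetical, and shareProcessNamespace is set explicitly so the PID namespace choice is visible in the spec. kubectl accepts JSON as well as YAML, so the dictionary can be serialized as is.

```python
import json


def sidecar_pod(name, app_image, sidecar_image, share_pid=False):
    """Build a Pod manifest with an app container and a log-shipper sidecar."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Controls whether containers in the pod see each other's processes.
            "shareProcessNamespace": share_pid,
            "containers": [
                {
                    "name": "app",
                    "image": app_image,
                    "ports": [{"containerPort": 8080}],
                    # Failing this probe restarts the container.
                    "livenessProbe": {
                        "httpGet": {"path": "/healthz", "port": 8080},
                        "periodSeconds": 10,
                        "failureThreshold": 3,
                    },
                    # Failing this probe only stops traffic to the pod.
                    "readinessProbe": {
                        "httpGet": {"path": "/ready", "port": 8080},
                        "periodSeconds": 5,
                    },
                    "resources": {
                        "requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits": {"memory": "512Mi"},
                    },
                },
                {"name": "log-shipper", "image": sidecar_image},
            ],
        },
    }


def to_manifest(pod) -> str:
    """Serialize the pod as JSON suitable for kubectl apply -f."""
    return json.dumps(pod, indent=2)
```

Keeping the two probes on separate endpoints lets an overloaded instance drop out of rotation without being restarted.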

Security and operational pitfalls

Privileged containers — disable most isolation; treat like root on the host.
HostPath mounts — expose host disks or the Docker socket; with the socket mounted, a compromised container is effectively a compromised host.
Image supply chain — pin digests, scan layers, sign with cosign; :latest is not a version.
Application as PID 1 — if the app spawns children and never reaps orphans, zombie processes accumulate, and signals it has no handler for are ignored; use tini or dumb-init as a tiny init.
Kernel compatibility — eBPF, io_uring, and newer syscalls may behave differently across host versions; test on production-like AMIs.
Noisy neighbor at the node — cgroup limits help but shared kernel locks and disk queues still correlate failures; spread critical workloads across nodes.
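The supply chain item above lends itself to an automated check. The sketch below classifies an image reference as digest-pinned or as falling back to latest. It uses a simple pattern for sha256 digests and a simplified view of reference syntax, so treat it as a lint-style heuristic rather than a full parser of image references.

```python
import re

# A reference pinned by content digest ends with @sha256:<64 hex characters>.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True when the reference names an exact image by sha256 digest."""
    return bool(_DIGEST.search(image_ref))


def resolves_to_latest(image_ref: str) -> bool:
    """True when the reference has no digest and no tag other than latest."""
    if is_digest_pinned(image_ref):
        return False
    # Only the last path segment can carry a tag; earlier colons are ports.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag given, so the runtime assumes latest
    return last_segment.rsplit(":", 1)[1] == "latest"
```

A CI step could run a check like this over every image reference in a deployment manifest and fail the build when resolves_to_latest returns True.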

Practical checklist

Run as non-root inside the container; map UIDs with user namespaces where possible.
Set memory requests near measured working set; leave headroom for page cache spikes.
Read-only root + tmpfs for /tmp reduces runtime mutation attack surface.
Drop capabilities; use seccomp and minimal base images (distroless, Alpine with eyes open).
Log and alert on OOMKilled and CPU throttling events — they precede user-visible outages.
Understand you are shipping processes, not VMs — kernel patching is still fleet-wide critical.
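Several checklist items map directly onto docker run flags. The helper below assembles such a command line as an argument list. The flags are standard Docker CLI options, while the UID, memory, CPU, and PID values are placeholder assumptions to tune per workload. Many images need extra writable paths or a specific capability added back before they start cleanly under these settings.

```python
import shlex


def hardened_run_args(image, uid=10001, gid=10001,
                      memory="256m", cpus="0.5", pids_limit=128):
    """Assemble a docker run command that applies the checklist defaults."""
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",           # run as non-root
        "--read-only",                      # read-only root filesystem
        "--tmpfs", "/tmp",                  # writable scratch space in memory
        "--cap-drop", "ALL",                # drop every Linux capability
        "--security-opt", "no-new-privileges",
        "--memory", memory,                 # hard memory limit for the cgroup
        "--cpus", cpus,                     # CPU bandwidth quota
        "--pids-limit", str(pids_limit),    # fork bomb protection
        image,
    ]


def as_shell_command(args) -> str:
    """Render the argument list as a safely quoted shell command."""
    return shlex.join(args)
```

Passing the list to subprocess.run avoids shell quoting problems; as_shell_command is only for logging or documentation.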