Explainer · 7 June 2026
How CPU process scheduling works
A laptop with eight logical CPUs can run hundreds of programs at once — but each core executes one instruction stream at a time. The operating system scheduler is the piece of kernel code that decides which process or thread runs next, for how long, and when to yank the CPU away. Get scheduling wrong and you see the classic symptoms: p99 latency spikes while average CPU looks idle, containers that starve each other, or a background indexer that accidentally hogs every core. Understanding the run queue, context switches, and policies like Linux's Completely Fair Scheduler (CFS) explains those failures without guessing.
Processes, threads, and the run queue
A process is an address space plus resources (open files, environment, security context). A thread is a schedulable unit inside that process — it shares memory but has its own stack and program counter. The kernel tracks each runnable thread in a run queue (often one per CPU or per scheduling domain). When a thread blocks on disk I/O, a lock, or a network read, it leaves the run queue; when it becomes ready again it re-enters and competes for CPU time.
User space does not pick the next thread directly. Your code calls read() or waits on a mutex; the kernel puts you to sleep and schedules someone else. That indirection is why a single hot mutex can make the whole machine feel slow even when top reports moderate CPU: threads are parked waiting for the lock or spinning on it, not doing useful work. Pair this mental model with our virtual memory explainer: scheduling picks who runs; paging decides where their bytes live in RAM.
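To see how crowded the run queues are at a given instant, one option on Linux is the fourth field of /proc/loadavg, which reports currently runnable scheduling entities over the total. The sketch below is a minimal illustration; the function names are hypothetical, and the count includes the process doing the reading.

```python
import os

def runnable_per_cpu(loadavg_text, cpu_count):
    """Return runnable scheduling entities per logical CPU.

    The fourth field of /proc/loadavg has the form "runnable/total",
    for example "4/80".
    """
    fields = loadavg_text.split()
    runnable, _total = fields[3].split("/")
    return int(runnable) / cpu_count

def read_runnable_per_cpu(path="/proc/loadavg"):
    """Linux only: returns None where the file is unavailable."""
    try:
        with open(path) as f:
            text = f.read()
    except OSError:
        return None
    return runnable_per_cpu(text, os.cpu_count() or 1)
```

A value that stays well above 1.0 suggests threads are queueing for CPU rather than running; a single sample is noisy, so sampling repeatedly gives a fairer picture.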
Context switches and their hidden tax
When the scheduler picks a new thread, the CPU must save the outgoing thread's registers and stack pointer and load the incoming thread's state. If the incoming thread belongs to a different process, the kernel also switches page tables, which can flush or retag translation lookaside buffer (TLB) entries. That event is a context switch. A few microseconds per switch sounds cheap until you do tens of thousands per second: at an illustrative 5 µs each, 20,000 switches per second cost about 10% of a core in switching alone, and every switch leaves the incoming thread with caches full of another thread's data.
High switch rates usually mean too many runnable threads for too few cores,
aggressive time-slicing, or threads that wake each other constantly (chatty
locks, busy-wait retries). Tools like pidstat -w, vmstat,
and eBPF sched tracepoints show context-switch storms before they show up in
application logs. Fixing them is often "fewer threads, better batching, shorter
critical sections" — not a faster CPU.
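As a rough way to measure switch rates without extra tooling, Linux exposes per-task counters in /proc/<pid>/status. The sketch below parses the two relevant fields and turns two samples into a rate; the function names are hypothetical. Voluntary switches mean the thread blocked; nonvoluntary switches mean it was preempted.

```python
def context_switch_counts(status_text):
    """Extract context-switch counters from the text of /proc/<pid>/status."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counts[key] = int(value.strip())
    return counts

def switch_rate(before, after, seconds):
    """Context switches per second between two samples of the counters."""
    return (sum(after.values()) - sum(before.values())) / seconds
```

A high nonvoluntary share points toward too many runnable threads per core, while a high voluntary share points toward chatty locks or very small I/O operations.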
Preemptive vs cooperative multitasking
Modern general-purpose OS kernels use preemptive scheduling: a timer interrupt fires every few milliseconds and the scheduler may replace the current thread even if it never voluntarily yields. That prevents one runaway loop from freezing the desktop. Older cooperative systems (early Mac OS, some embedded runtimes) required threads to call yield points; one bug could hang the machine.
Languages and runtimes add another layer. JavaScript on the browser main thread is cooperative at the language level — long synchronous loops block painting even though the OS could preempt the process. Go's goroutines are multiplexed onto OS threads by the Go scheduler; Rust async tasks need an executor. The OS scheduler is always underneath, but your runtime can defeat or complement it.
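The cooperative model can be illustrated with a toy scheduler built on Python generators, where each yield is a voluntary yield point. This is a sketch for intuition only, not how any real kernel or runtime is implemented; all names are hypothetical.

```python
from collections import deque

def task(name, steps, log):
    """A well-behaved task: does one unit of work, then yields."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative yield point: hand control back to the scheduler

def greedy(name, steps, log):
    """A badly behaved task: does all its work before yielding once."""
    for i in range(steps):
        log.append(f"{name}:{i}")
    yield

def run_cooperative(tasks):
    """Round-robin over tasks; a task keeps the 'CPU' until it yields."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)
        except StopIteration:
            continue  # task finished; drop it from the queue
        queue.append(current)
```

With two well-behaved tasks the log interleaves (a:0, b:0, a:1, b:1). Swap one for greedy and the other task records nothing until the greedy one has finished all its work, which is the failure a preemptive timer interrupt exists to prevent.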
Linux CFS, nice values, and fairness
From kernel 2.6.23 through 6.5, Linux's default policy for normal workloads was the Completely Fair Scheduler (CFS); kernel 6.6 replaced it with EEVDF, which keeps the same weighted-fairness accounting but picks the next thread by virtual deadline. Under CFS, each runnable thread accumulates vruntime (virtual runtime) proportional to actual CPU time used, adjusted by priority. The thread with the lowest vruntime runs next. The goal is proportional shares: two CPU-bound threads with equal priority should each get about half a core over a long window.
Nice values (-20 to 19) bias that
proportion — lower nice means more CPU share, not real-time guarantees.
nice -n 10 batch-job is the politeness knob for background work.
CFS is fair in the long run but makes no latency promise: when a burst of
threads becomes runnable at once, each still waits its turn, which is why
interactive services isolate thread pools (the bulkhead pattern) instead of
sharing one unbounded pool.
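The vruntime idea can be sketched as a toy single-core simulation: always run the thread with the lowest vruntime, and charge lower-priority threads more virtual time per millisecond of real CPU. The weight formula below is an approximation (each nice step changes the share by roughly 1.25x, with nice 0 at 1024); the real kernel uses a lookup table and handles sleepers, granularity, and multiple cores, none of which is modelled here.

```python
import heapq

NICE_0_WEIGHT = 1024

def nice_to_weight(nice):
    """Approximate scheduler weight for a nice value (assumption: 1.25x per step)."""
    return NICE_0_WEIGHT / (1.25 ** nice)

def simulate(threads, slice_ms=1.0, total_ms=1000.0):
    """threads maps name -> nice. Returns CPU milliseconds each thread received."""
    heap = [(0.0, name) for name in sorted(threads)]
    heapq.heapify(heap)
    cpu_ms = {name: 0.0 for name in threads}
    elapsed = 0.0
    while elapsed < total_ms:
        vruntime, name = heapq.heappop(heap)  # lowest vruntime runs next
        cpu_ms[name] += slice_ms
        # Heavier (lower-nice) threads accrue vruntime more slowly.
        vruntime += slice_ms * NICE_0_WEIGHT / nice_to_weight(threads[name])
        heapq.heappush(heap, (vruntime, name))
        elapsed += slice_ms
    return cpu_ms
```

Two threads at nice 0 split the simulated second evenly, while a nice 0 thread competing with a nice 5 thread ends up with roughly three quarters of it. Neither result says anything about how long a newly woken thread waits, which is the latency gap described above.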
Real-time classes and when not to use them
Linux also exposes SCHED_FIFO and SCHED_RR real-time classes. A runnable RT thread runs ahead of every CFS thread on that CPU until it blocks or yields; SCHED_RR additionally rotates among RT threads of equal priority when a time quantum expires. That is appropriate for audio DSP or industrial control with audited code paths, and dangerous on a general server: a CPU-bound RT loop can starve kernel threads, SSH, and your database. Containers and systemd can confine RT bandwidth, but default deployments should stay on CFS unless you truly need deterministic wake latency.
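Before reaching for a real-time class, it can help to audit which policy a process already runs under. The sketch below maps Linux policy numbers to names and wraps Python's os.sched_getscheduler behind a guard for platforms that lack it; the helper names are hypothetical.

```python
import os

# Linux scheduling policy numbers.
POLICY_NAMES = {
    0: "SCHED_OTHER",
    1: "SCHED_FIFO",
    2: "SCHED_RR",
    3: "SCHED_BATCH",
    5: "SCHED_IDLE",
    6: "SCHED_DEADLINE",
}
SCHED_RESET_ON_FORK = 0x40000000  # flag bit that may be OR-ed into the policy

def describe_policy(policy_id):
    """Return the policy name, ignoring the reset-on-fork flag."""
    base = policy_id & ~SCHED_RESET_ON_FORK
    return POLICY_NAMES.get(base, f"unknown({base})")

def is_realtime(policy_id):
    """True for the two real-time classes discussed here."""
    return describe_policy(policy_id) in ("SCHED_FIFO", "SCHED_RR")

def current_policy(pid=0):
    """Policy name for a pid (0 = caller), or None where unsupported."""
    if not hasattr(os, "sched_getscheduler"):
        return None
    return describe_policy(os.sched_getscheduler(pid))
```

On a default deployment current_policy() would be expected to report SCHED_OTHER, the class the normal fair scheduler serves.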
Multi-core load balancing and CPU affinity
On SMP machines the scheduler periodically load-balances:
moves runnable threads between per-CPU run queues so one core is not idle while
another backs up. Balancing has a cost — cache cold starts on the new core —
so the kernel tries not to bounce threads unnecessarily. You can pin threads
with taskset or sched_setaffinity when you know
access patterns (NUMA-local memory on big iron, or isolating a latency-sensitive
polling thread).
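Pinning can be sketched from Python with os.sched_setaffinity, which is available on Linux. The parser below accepts the comma-and-range CPU list form that taskset -c uses (for example "0-3,8") but not stride syntax; both function names are hypothetical.

```python
import os

def parse_cpu_list(spec):
    """Turn a CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_calling_thread(spec):
    """Linux only: restrict the caller to the listed CPUs.

    Returns the resulting affinity set, or None where unsupported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    target = parse_cpu_list(spec) & allowed  # never request CPUs we may not use
    if not target:
        raise ValueError(f"none of {spec!r} is in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

Intersecting with the currently allowed set matters inside containers, where a cpuset may already hide some CPUs from the process.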
NUMA (non-uniform memory access) adds another wrinkle: memory
attached to socket 0 is faster for cores on socket 0. Blind load balancing can
place a thread on socket 1 while its heap still lives on socket 0, so every
access crosses the interconnect. Production tuning uses numactl, per-socket
memory pools, or Kubernetes topology hints; scheduling and memory placement
are one problem viewed from two angles.
cgroup CPU limits and containers
Linux cgroups v2 let orchestrators cap CPU without nice values. A
cpu.max quota (e.g. 100000 200000 = 50% of one core per 200 ms period; the
literal max in place of the quota means unlimited) throttles a cgroup by
freezing its threads until the next period once the budget is exhausted,
visible as nr_throttled in the cgroup's cpu.stat.
cpu.weight (1–10000) adjusts relative share among siblings when
CPUs contend, similar in spirit to CFS niceness but scoped to the container
hierarchy.
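Both files are plain text, so checking a cgroup's cap and how often it hits it takes only a few lines. The sketch below assumes the cgroup v2 formats: cpu.max holds "quota period" in microseconds (with the literal max for no limit), and cpu.stat holds one "key value" pair per line. The function names are hypothetical.

```python
def cpu_max_cores(cpu_max_text):
    """Convert cpu.max contents into a core count, or None if unlimited."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def throttled_fraction(cpu_stat_text):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods
```

For instance, "100000 200000" works out to 0.5 cores, and a cpu.stat showing 50 throttled periods out of 200 means the workload ran out of budget in a quarter of its periods.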
Kubernetes requests and limits map to these knobs.
Under-provisioned limits cause throttling and tail latency; over-provisioned
requests waste cluster headroom. CPU limits do not create cores — they time-slice
existing ones — so a Java service with a huge thread pool can still spend most
of its quota context-switching instead of serving requests.
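The arithmetic behind a CPU limit is short enough to write out. The sketch below assumes the common default enforcement period of 100 ms and converts a limit in millicores into the cpu.max string, then estimates how many pool threads compete for each core of budget; both helpers are hypothetical.

```python
def limit_to_cpu_max(millicores, period_us=100_000):
    """Render a CPU limit in millicores as a cpu.max 'quota period' string."""
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"

def threads_per_core_of_budget(pool_threads, millicores):
    """How many threads share each core's worth of quota if all are runnable."""
    return pool_threads / (millicores / 1000)
```

A 500m limit becomes "50000 100000", and a 200-thread pool under a 2000m limit leaves 100 threads contending for each core of quota when they are all runnable, which is where the context-switch overhead comes from.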
What to watch in production
- Runnable vs running — many runnable threads per core means queueing delay dominates; scale out or reduce concurrency.
- Run queue latency — eBPF sched metrics show how long threads wait before executing; correlates with p99 better than average CPU.
- Throttling — cgroup CPU throttling counters climbing means limits are too tight for the actual workload.
- Steal time — on VMs, hypervisor scheduling shows as %st in top; your process is ready but the host gave the core to another tenant.
- Priority inversions — a low-priority thread holds a lock a high-priority thread needs; the middle priority thread runs instead. Fix with priority inheritance mutexes or lock ordering discipline.
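Where eBPF tooling is not available, a coarse run queue latency figure can be derived from /proc/<pid>/schedstat on kernels built with scheduler statistics. Its three fields are nanoseconds spent on CPU, nanoseconds spent waiting on a run queue, and the number of timeslices run. The sketch below diffs two samples; the function names are hypothetical.

```python
def parse_schedstat(text):
    """Parse the three fields of /proc/<pid>/schedstat."""
    on_cpu_ns, wait_ns, timeslices = (int(x) for x in text.split()[:3])
    return {"on_cpu_ns": on_cpu_ns, "wait_ns": wait_ns, "timeslices": timeslices}

def mean_wait_us(before, after):
    """Average run-queue wait per timeslice between two samples, in microseconds."""
    slices = after["timeslices"] - before["timeslices"]
    if slices <= 0:
        return 0.0
    return (after["wait_ns"] - before["wait_ns"]) / slices / 1000
```

This is an average over the sampling interval, so it hides the tail; it is useful as a cheap first signal, with per-event tracing as the follow-up when the number climbs.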
Scheduling sits below almost every backend pattern — RPC timeouts, connection pools, and distributed tracing all assume threads eventually get CPU. When they do not, breakers trip, retries multiply, and the outage looks like a dependency failure when it is really on-box contention.