Explainer · 7 June 2026

How file systems, inodes, and journaling work

When you save report.pdf or clone a Git repository, you think in paths and folders. The kernel thinks in block devices, inodes, and ordered writes to sectors that may live on NVMe flash, a cloud volume, or a loopback file inside a container. A file system is the translation layer: it turns human-readable names into metadata structures on disk, maps file bytes to physical blocks, and recovers consistency after power loss. Misunderstanding that layer is why databases corrupt after a crash, why docker pull thrashes disk, and why "we called write()" does not mean the bits survived a reboot until fsync() completes.

Block devices, partitions, and mount points

Storage hardware exposes a linear array of fixed-size sectors (traditionally 512 bytes; many NVMe drives use 4 KiB). The kernel's block layer turns /dev/nvme0n1 or /dev/sda into a queue of read/write requests. A partition table (GPT on modern machines) slices the device into regions; each region can host a different file system or swap space.

Mounting attaches a file system's root inode to a directory in the global tree: typically / for the root file system and /home for a separate volume. Linux mount namespaces (used by containers) give each group of processes its own view of what is mounted where, which is why a pod sees /app/data as local even when it is a network-attached volume underneath.
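
To make the idea of "which file system owns this path" concrete, here is a minimal sketch that parses text in the whitespace-separated layout Linux uses for /proc/self/mounts (device, mount point, type, options) and picks the longest matching mount point for a path. The sample table, the Mount type, and the helper names are hypothetical; octal escapes such as \040 in mount points are not handled.

```python
import posixpath
from typing import NamedTuple, Optional


class Mount(NamedTuple):
    device: str
    mount_point: str
    fstype: str
    options: tuple


def parse_mounts(text):
    """Parse lines shaped like /proc/self/mounts into Mount records."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        mounts.append(Mount(device, mount_point, fstype, tuple(options.split(","))))
    return mounts


def mount_for(path, mounts) -> Optional[Mount]:
    """Return the mount whose mount point is the longest prefix of `path`."""
    path = posixpath.normpath(path)
    best = None
    for m in mounts:
        mp = m.mount_point
        covers = mp == "/" or path == mp or path.startswith(mp + "/")
        # ">=" lets a later mount on the same directory shadow an earlier one.
        if covers and (best is None or len(mp) >= len(best.mount_point)):
            best = m
    return best


# Hypothetical mount table for illustration only.
SAMPLE = """\
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
/dev/nvme0n1p3 /home xfs rw,noatime 0 0
10.0.0.5:/export /app/data nfs4 rw,relatime 0 0
"""

if __name__ == "__main__":
    table = parse_mounts(SAMPLE)
    for p in ("/app/data/db.sqlite", "/home/alice/notes.txt", "/homework"):
        m = mount_for(p, table)
        print(p, "->", m.mount_point, m.fstype)
```

On a Linux machine the same functions can be fed the contents of /proc/self/mounts; inside a container that file reflects the container's own mount namespace.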

User programs rarely touch raw sectors directly (specialized tools such as partitioners and disk imagers are the exception). They call open(), read(), and write() on paths; the VFS (virtual file system) dispatches to the ext4, XFS, or Btrfs implementation.

Inodes: where metadata lives

A file name is not stored inside the file. Names live in directories; the actual file identity is an inode (index node) — a fixed-size record containing:

File type (regular, directory, symlink, device node)
Owner, group, and permission bits
Timestamps (atime, mtime, ctime — and sometimes birth time)
Size in bytes
Pointers to data blocks on disk
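
Most of the fields above can be read from user space with stat(). The sketch below uses Python's os.lstat(); describe_inode is a hypothetical helper name, and the block pointers themselves are not exposed through this interface.

```python
import os
import stat
import tempfile


def describe_inode(path):
    """Return the inode fields that stat() exposes for `path`."""
    st = os.lstat(path)  # lstat: report on a symlink itself, not its target
    mode_string = stat.filemode(st.st_mode)  # e.g. '-rw-r--r--'
    return {
        "inode": st.st_ino,
        "type": mode_string[0],  # '-' regular, 'd' directory, 'l' symlink
        "mode": mode_string,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "links": st.st_nlink,
        "size": st.st_size,
        "atime": st.st_atime,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,  # inode change time on Unix, not creation time
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "report.pdf")
        with open(p, "wb") as f:
            f.write(b"x" * 1234)
        for key, value in describe_inode(p).items():
            print(f"{key:6} {value}")
```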

Classic Unix inodes use direct block pointers plus indirect blocks for large files. Modern file systems often store extents — contiguous run-length tuples (start block + length) — which compress metadata for multi-gigabyte files and improve sequential read performance. The inode number is what ls -i prints; hard links are multiple directory entries pointing at the same inode with a reference count.
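
As a rough illustration of why extents are compact, the following toy model maps a file-relative block number to a physical block with a binary search over a sorted extent list. The Extent layout (including the logical_start field) and the block numbers are assumptions of this sketch, not the on-disk format of any particular file system.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    logical_start: int   # first file-relative block covered by this extent
    physical_start: int  # first on-disk block of the contiguous run
    length: int          # number of blocks in the run


def lookup(extents, logical_block) -> Optional[int]:
    """Map a file block to a physical block; None means a hole (unmapped)."""
    # `extents` must be sorted by logical_start and non-overlapping.
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None
    e = extents[i]
    offset = logical_block - e.logical_start
    return e.physical_start + offset if offset < e.length else None


# Three extents describe 14 mapped blocks; blocks 12..19 are a hole.
FILE_EXTENTS = [
    Extent(logical_start=0, physical_start=1000, length=8),
    Extent(logical_start=8, physical_start=5000, length=4),
    Extent(logical_start=20, physical_start=9000, length=2),
]

if __name__ == "__main__":
    for block in (0, 7, 9, 12, 21):
        print(block, "->", lookup(FILE_EXTENTS, block))
```

A per-block pointer scheme would need one entry for each of the 14 mapped blocks; here three tuples cover them, and the saving grows with file size as long as allocation stays contiguous.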

When you delete a file, the kernel drops a directory entry and decrements the inode link count. Data blocks are freed only when the count hits zero and no process still has the file open — which is why Unix lets you unlink a log file while a daemon keeps writing to the anonymous inode until restart.
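
The link-count behavior can be observed directly. This POSIX-only sketch creates a file, adds a hard link, removes both names while a descriptor is still open, and keeps reading and writing through that descriptor. The file names and the function name are hypothetical, and the reported link count after unlinking is typically 0 on local file systems but may differ on some network file systems.

```python
import os
import tempfile


def link_and_unlink_demo(directory):
    """Hard-link a file, unlink every name while it is open, keep using it."""
    first = os.path.join(directory, "app.log")
    second = os.path.join(directory, "alias.log")
    with open(first, "wb") as f:
        f.write(b"before unlink\n")

    os.link(first, second)  # second directory entry for the same inode
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    links_with_two_names = os.stat(first).st_nlink

    fd = os.open(first, os.O_RDWR | os.O_APPEND)
    os.unlink(first)
    os.unlink(second)  # no names remain, but the open fd keeps the inode alive
    links_with_no_names = os.fstat(fd).st_nlink

    os.write(fd, b"after unlink\n")  # still succeeds: the inode exists
    os.lseek(fd, 0, os.SEEK_SET)
    content = os.read(fd, 4096)
    os.close(fd)  # last reference dropped; the blocks can now be freed

    return {
        "same_inode": same_inode,
        "links_with_two_names": links_with_two_names,
        "links_with_no_names": links_with_no_names,
        "content": content,
        "names_left": sorted(os.listdir(directory)),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for key, value in link_and_unlink_demo(d).items():
            print(key, "=", value)
```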

Directories as maps, not folders

A directory is a special file whose data blocks contain a list of (name, inode_number) pairs (plus file type hints in dirent structures for performance). Path lookup walks the tree: /var/log/syslog starts at the root inode (number 2 on ext4), reads the var entry, loads that inode, reads log, and so on. Symlinks store a target path string (inline in the inode when it is short enough) and require extra resolution steps.
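
The lookup walk can be sketched against a toy in-memory inode table. Everything here is illustrative: the inode numbers, the table contents, and the resolve function are assumptions of the example, and "." and ".." components are deliberately not handled.

```python
ROOT_INO = 2  # root inode number in this toy model (the ext-family convention)

# inode number -> (kind, payload)
#   "dir":     payload is {name: inode_number}
#   "file":    payload is the file's bytes
#   "symlink": payload is the target path string
INODES = {
    2: ("dir", {"var": 11, "etc": 12}),
    11: ("dir", {"log": 21}),
    12: ("dir", {"hostname": 31}),
    21: ("dir", {"syslog": 41, "messages": 42}),
    31: ("file", b"example-host\n"),
    41: ("file", b"boot ok\n"),
    42: ("symlink", "/var/log/syslog"),
}


def resolve(path, max_symlinks=8):
    """Walk an absolute path one component at a time; return the final inode."""
    if not path.startswith("/"):
        raise ValueError("toy resolver only accepts absolute paths")
    ino = ROOT_INO
    parts = [p for p in path.split("/") if p]
    while parts:
        name = parts.pop(0)
        kind, entries = INODES[ino]
        if kind != "dir":
            raise NotADirectoryError(name)
        if name not in entries:
            raise FileNotFoundError(name)
        child = entries[name]
        child_kind, child_payload = INODES[child]
        if child_kind == "symlink":
            if max_symlinks == 0:
                raise OSError("too many levels of symbolic links")
            max_symlinks -= 1
            if child_payload.startswith("/"):
                ino = ROOT_INO  # absolute target: restart from the root
            # relative target: keep walking from the current directory
            parts = [p for p in child_payload.split("/") if p] + parts
        else:
            ino = child
    return ino


if __name__ == "__main__":
    print(resolve("/var/log/syslog"))
    print(resolve("/var/log/messages"))  # symlink resolves to the same inode
```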

This design mirrors how content-addressable stores separate human names from immutable hashes: in Git, refs such as branch names are mutable labels that point at commits identified by hash, while file systems use mutable paths over inode tables. Both must answer: "given this identifier, where are the bytes?"

The page cache: fast reads, delayed writes

Disk is orders of magnitude slower than RAM. The kernel keeps a page cache of recently accessed file blocks in memory — the same 4 KiB pages the virtual memory subsystem manages. A read often hits cached pages; a write typically marks cache pages dirty and returns to your program long before the storage device sees the data.

That buffering is why a text editor "saves" instantly and why a database must call fsync() (or fdatasync()) on its WAL file before acknowledging a commit. Without an explicit flush, a power loss can discard writes the application believed were durable. fsync() blocks until the file's dirty pages have been written out and the device reports them as stable; on SSDs the resulting writes may also trigger garbage collection inside the flash translation layer (FTL).
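
A common way to apply this in application code is the write-to-temp, fsync, rename, fsync-the-directory pattern. The sketch below is a POSIX-oriented illustration rather than a complete durability layer; durable_write is a hypothetical helper, and real systems also need to decide how to handle fsync() errors.

```python
import os
import tempfile


def durable_write(path, data: bytes):
    """Replace `path` with `data` so that a crash leaves old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # file system. Note that mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # Python's buffer -> kernel page cache
            os.fsync(f.fileno())  # page cache -> storage device
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # The rename changed the directory, so flush the directory as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "state.json")
        durable_write(target, b'{"version": 1}')
        durable_write(target, b'{"version": 2}')
        with open(target, "rb") as f:
            print(f.read())
```

Skipping either fsync() call leaves a window in which the application has returned success but a power loss could still lose the new content or the rename.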

Crash consistency: journaling and copy-on-write

Updating a file can require multiple disk writes: new data blocks, an updated inode, and a directory entry. If power fails between them, the file system can end up with allocated blocks not linked to any file (leaks) or pointers to garbage (corruption). Production file systems use structured recovery:

Journaling (ext4, XFS metadata journaling)

A journal is a circular log of pending metadata changes. Before applying an operation to the main structures, the file system appends a transaction describing the intended end state, followed by a commit record. After a crash, recovery replays transactions that were fully committed and discards incomplete ones. ext4 defaults to data=ordered: metadata is journaled; data blocks are written before the metadata commits, reducing torn-file risk without journaling every payload byte.
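
The replay rule can be shown with a deliberately tiny in-memory model. This is a teaching sketch, not how ext4 or XFS lay out their journals: the ToyJournalFS class, the record shapes, and the key names are all invented for the example.

```python
class ToyJournalFS:
    """In-memory model: a journal of intended changes plus the main 'disk'."""

    def __init__(self):
        self.disk = {}     # main metadata structures: key -> value
        self.journal = []  # ("write", txid, key, value) or ("commit", txid)
        self._next_txid = 1

    def log_transaction(self, updates, crash_before_commit=False):
        """Append one transaction; optionally simulate a crash mid-way."""
        txid = self._next_txid
        self._next_txid += 1
        for key, value in updates.items():
            self.journal.append(("write", txid, key, value))
        if crash_before_commit:
            return txid  # power lost: no commit record reached the journal
        self.journal.append(("commit", txid))
        return txid

    def replay(self):
        """Recovery: apply committed transactions in order, drop the rest."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for rec in self.journal:
            if rec[0] == "write" and rec[1] in committed:
                _, _, key, value = rec
                self.disk[key] = value
        self.journal.clear()  # the journal space can now be reused


if __name__ == "__main__":
    fs = ToyJournalFS()
    fs.log_transaction({"inode:7:size": 4096, "dir:/:report.pdf": 7})
    fs.log_transaction({"inode:8:size": 512}, crash_before_commit=True)
    fs.replay()
    print(fs.disk)
```

The point of the model is the all-or-nothing outcome: after replay, the first transaction's two updates are both present and the second transaction leaves no trace.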

Copy-on-write (Btrfs, ZFS)

COW file systems never overwrite live blocks in place. A write allocates new blocks, writes new copies of the metadata blocks that point at them all the way up the tree, and then atomically switches to the new tree root. Old blocks remain reachable until nothing references them, which enables cheap snapshots and send/receive replication. The trade-off is fragmentation, higher write amplification on random I/O, and RAM hunger for ZFS's adaptive replacement cache (ARC).
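
The "new root, shared old blocks" idea is the same one behind persistent (immutable) trees. The sketch below uses an unbalanced binary search tree as a stand-in for the file system's metadata tree; it is an analogy under that assumption, not the Btrfs or ZFS structure, and Node, cow_put, and get are names made up for the example.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: str
    value: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def cow_put(root, key, value):
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=cow_put(root.left, key, value))
    if key > root.key:
        return root._replace(right=cow_put(root.right, key, value))
    return root._replace(value=value)


def get(root, key):
    """Look up `key` starting from a particular root (live tree or snapshot)."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


if __name__ == "__main__":
    live = None
    for k, v in (("m", b"1"), ("a", b"2"), ("z", b"3")):
        live = cow_put(live, k, v)
    snapshot = live                 # taking a snapshot = keeping the old root
    live = cow_put(live, "a", b"new")
    print(get(snapshot, "a"), get(live, "a"))
    print(live.right is snapshot.right)  # untouched subtree is shared
```

Keeping a reference to the old root is all a snapshot costs in this model, which is why snapshots on COW file systems are cheap until the live tree diverges.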

Databases that ship their own storage engine (RocksDB and other LSM-tree engines) still sit on a file system unless given raw block devices; mounting with -o noatime, aligning partitions and I/O, and avoiding double journaling all matter at scale.

SSD-specific behavior: TRIM, wear, and queues

Spinning disks care about seek order; SSDs care about erase blocks and over-provisioning. When you delete a file, the file system frees logical blocks but the drive may not know those pages are stale until you issue TRIM (DISCARD). Without TRIM, garbage collection inside the SSD competes with foreground I/O and latency spikes.

NVMe queues are deep and parallel, very different from a single SATA link. File systems and the block layer batch I/O; misaligned small writes still hurt. For server workloads, XFS and ext4 with the none or mq-deadline I/O scheduler (noop on older, pre-multi-queue kernels) and properly aligned partitions often beat desktop defaults tuned for laptop power saving.
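
The cost of misalignment is simple arithmetic: a write that straddles a block boundary touches more device blocks than its size suggests. The following sketch assumes a 4 KiB device block and uses two illustrative partition start sectors (63 and 2048, with 512-byte sectors); the function names are hypothetical.

```python
def blocks_touched(offset_bytes, size_bytes, block_bytes=4096):
    """Count the device blocks a write of `size_bytes` at `offset_bytes` hits."""
    if size_bytes <= 0:
        return 0
    first = offset_bytes // block_bytes
    last = (offset_bytes + size_bytes - 1) // block_bytes
    return last - first + 1


def is_aligned(offset_bytes, block_bytes=4096):
    """True when the offset falls exactly on a block boundary."""
    return offset_bytes % block_bytes == 0


if __name__ == "__main__":
    sector = 512
    for start_sector in (63, 2048):
        start = start_sector * sector
        print(
            f"partition at sector {start_sector}: aligned={is_aligned(start)}, "
            f"a 4 KiB write touches {blocks_touched(start, 4096)} block(s)"
        )
```

A partition that starts on a misaligned sector turns every 4 KiB file system block into a two-block device operation, which is why partitioning tools default to 1 MiB boundaries.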

Comparing ext4, XFS, Btrfs, and ZFS

ext4: default on many Linux distros; mature journaling, reasonable small-file performance, online resize, no built-in snapshots. Safe general-purpose root file system.
XFS: excels at large files and parallel metadata ops; allocation groups reduce lock contention. Common for database and media servers. Delayed allocation improves contiguous writes but postpones block placement until writeback, so some errors surface only at flush time.
Btrfs: COW, checksums, snapshots, subvolumes; part of the mainline kernel. Good for desktops and NAS-like setups; it historically had edge-case bugs, so verify kernel version and backup strategy.
ZFS: checksumming, compression, snapshots, RAID-Z; on Linux it ships as an out-of-tree kernel module via OpenZFS rather than in the mainline kernel. Strong data integrity story; needs plenty of RAM and careful pool layout. Popular for backups and archives.

Cloud block volumes (EBS, Persistent Disk) add another layer: network latency, burst credits, and replication beneath your chosen file system. A Postgres pod on slow storage shows up as checkpoint spikes long before query plans look wrong.

Practical checklist for builders

Treat write() success as "buffered" unless you fsync() (or use direct I/O with eyes open).
Separate WAL/log volumes from data when you can; match file system to workload (XFS for large sequential, ext4 for mixed).
Enable periodic TRIM on SSDs; monitor iostat await and queue depth, not just CPU.
Snapshot before major upgrades; COW file systems make this cheap — use it.
In containers, remember the writable layer is a file system on a file system; lots of small writes to overlayfs can be slow — mount volumes for databases.
Backups must read consistent blocks (snapshots + pg_dump, not blind cp of a live data directory mid-write).