Explainer · 7 June 2026
How file systems, inodes, and journaling work
When you save report.pdf or clone a Git repository, you think in
paths and folders. The kernel thinks in block devices,
inodes, and ordered writes to sectors that may live on NVMe
flash, a cloud volume, or a loopback file inside a
container.
A file system is the translation layer: it turns human-readable
names into metadata structures on disk, maps file bytes to physical blocks,
and recovers consistency after power loss. Misunderstanding that layer is why
databases corrupt after a crash, why docker pull thrashes disk,
and why "we called write()" does not mean the bits will survive a
reboot until fsync() completes.
Block devices, partitions, and mount points
Storage hardware exposes a linear array of fixed-size sectors
(traditionally 512 bytes; many NVMe drives use 4 KiB). The kernel's block
layer turns /dev/nvme0n1 or /dev/sda into a queue
of read/write requests. A partition table (GPT on modern
machines) slices the device into regions; each region can host a different
file system or swap space.
Mounting attaches a file system's root inode to a directory
in the global tree — typically / for the root file system and
/home for a separate volume. Linux
mount namespaces (used by containers) give each group of processes
its own view of what is mounted where (cgroups, by contrast, limit
resources rather than visibility), which is why a pod sees
/app/data as local even when it is a network-attached volume
underneath.
User programs never touch raw sectors directly (except specialized tools).
They call open(), read(), write() on
paths; the VFS (virtual file system) dispatches to ext4, XFS, or btrfs
implementations.
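As a rough illustration of how a path maps to a mounted file system, the sketch below parses text in the format of /proc/mounts and picks the mount point that is the longest prefix of a path. The function names and the SAMPLE table are made up for this example, and the parser ignores details such as escaped spaces in mount points.

```python
import os

def parse_mounts(text):
    """Parse /proc/mounts-style lines into (device, mount_point, fs_type) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[0], fields[1], fields[2]))
    return mounts

def find_mount(path, mounts):
    """Return the entry whose mount point is the longest prefix of path."""
    path = os.path.normpath(path)
    best = None
    for entry in mounts:
        mount_point = entry[1]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = entry
    return best

# Hypothetical mount table, invented for illustration.
SAMPLE = """\
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
/dev/nvme0n1p3 /home xfs rw,noatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

if __name__ == "__main__":
    table = parse_mounts(SAMPLE)
    print(find_mount("/home/alice/report.pdf", table))
    print(find_mount("/var/log/syslog", table))
```

On a Linux machine the same functions can be fed the contents of /proc/self/mounts, which reflects the calling process's own mount namespace.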
Inodes: where metadata lives
A file name is not stored inside the file. Names live in directories; the actual file identity is an inode (index node) — a fixed-size record containing:
- File type (regular, directory, symlink, device node)
- Owner, group, and permission bits
- Timestamps (atime, mtime, ctime — and sometimes birth time)
- Size in bytes
- Pointers to data blocks on disk
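Most of these fields are visible from user space through stat(). The sketch below uses Python's os.lstat to collect them into a dictionary; describe_inode and file_type are names chosen for this example. Note that the result contains no file name, and that block pointers are not exposed through this interface.

```python
import os
import stat
import tempfile

def file_type(mode):
    """Translate the type bits of st_mode into a readable label."""
    if stat.S_ISREG(mode):
        return "regular"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return "device node"
    return "other"

def describe_inode(path):
    """Return the inode metadata the kernel reports for path (symlinks not followed)."""
    st = os.lstat(path)
    return {
        "inode": st.st_ino,
        "type": file_type(st.st_mode),
        "mode": stat.filemode(st.st_mode),  # e.g. "-rw-------"
        "uid": st.st_uid,
        "gid": st.st_gid,
        "links": st.st_nlink,
        "size": st.st_size,
        "atime": st.st_atime,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,
    }

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello")
    try:
        for key, value in describe_inode(f.name).items():
            print(f"{key:>6}: {value}")
    finally:
        os.unlink(f.name)
```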
Classic Unix inodes use direct block pointers plus indirect blocks for large
files. Modern file systems often store extents — contiguous
run-length tuples (start block + length) — which compress metadata for
multi-gigabyte files and improve sequential read performance. The inode
number is what ls -i prints; hard links are multiple directory
entries pointing at the same inode with a reference count.
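To make the extent idea concrete, here is a toy lookup that maps a file-relative block number to an on-disk block. It is a simplified model, not any file system's actual on-disk format: the Extent fields (including the logical_start offset) and the block numbers are assumptions chosen for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple

class Extent(NamedTuple):
    logical_start: int   # first file-relative block covered by this extent
    physical_start: int  # first on-disk block of the contiguous run
    length: int          # number of blocks in the run

def physical_block(extents, logical_block):
    """Map a file block to a disk block; return None if it falls in a hole.

    extents must be sorted by logical_start and must not overlap.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None
    extent = extents[i]
    offset = logical_block - extent.logical_start
    if offset >= extent.length:
        return None
    return extent.physical_start + offset

if __name__ == "__main__":
    # Two extents describe 1280 blocks (5 MiB at 4 KiB per block).
    layout = [Extent(0, 1000, 256), Extent(256, 5000, 1024)]
    print(physical_block(layout, 0))
    print(physical_block(layout, 300))
    print(physical_block(layout, 1280))
```

Two small tuples stand in for what would otherwise be 1280 individual block pointers, which is the metadata saving the paragraph above describes.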
When you delete a file, the kernel drops a directory entry and decrements the inode link count. Data blocks are freed only when the count hits zero and no process still has the file open — which is why Unix lets you unlink a log file while a daemon keeps writing to the anonymous inode until restart.
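The link-count behavior can be observed directly on a POSIX system. The sketch below (function and file names are hypothetical) creates a hard link, then removes both names while a handle is still open and keeps writing through that handle.

```python
import os
import tempfile

def link_and_unlink_demo():
    """Show hard-link counts and writing to a file after its names are gone."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "app.log")
        alias = os.path.join(d, "app.log.hardlink")
        with open(original, "w") as f:
            f.write("first line\n")

        os.link(original, alias)  # second directory entry, same inode
        results["same_inode"] = os.stat(original).st_ino == os.stat(alias).st_ino
        results["links_after_link"] = os.stat(original).st_nlink

        handle = open(original, "a+")
        try:
            os.unlink(original)  # drop both names; the inode stays alive
            os.unlink(alias)     # because the handle is still open
            results["links_after_unlink"] = os.fstat(handle.fileno()).st_nlink
            handle.write("written after unlink\n")
            handle.flush()
            handle.seek(0)
            results["contents"] = handle.read()
        finally:
            handle.close()       # only now can the kernel free the blocks
    return results

if __name__ == "__main__":
    for key, value in link_and_unlink_demo().items():
        print(f"{key}: {value!r}")
```

This relies on POSIX unlink semantics; Windows does not normally allow deleting a file that is open this way.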
Directories as maps, not folders
A directory is a special file whose data blocks contain a list of
(name, inode_number) pairs (plus file type hints in
dirent structures for performance). Path lookup walks the tree:
/var/log/syslog starts at the root inode (inode 2 on ext4), reads the
var entry, loads that inode, reads log, and so on.
Symlinks store a target path string inside the inode and require extra
resolution steps.
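The lookup walk can be sketched with an in-memory inode table. Everything here is a toy model: the INODES contents, the inode numbers other than the root, and the resolve function are invented for illustration, and the sketch leaves out "." and "..", permissions, and mount crossings.

```python
ROOT_INO = 2  # root inode number on ext-family file systems, used here by convention

# Toy inode table: directories map names to inode numbers,
# symlinks hold a target path, regular files hold bytes.
INODES = {
    2: {"type": "dir", "entries": {"var": 11, "logs": 14}},
    11: {"type": "dir", "entries": {"log": 12}},
    12: {"type": "dir", "entries": {"syslog": 13}},
    13: {"type": "file", "data": b"boot ok\n"},
    14: {"type": "symlink", "target": "/var/log"},
}

def resolve(path, inodes=INODES, max_symlinks=40):
    """Walk an absolute path one component at a time; return the final inode number."""
    if not path.startswith("/"):
        raise ValueError("only absolute paths are supported in this sketch")
    pending = [c for c in path.split("/") if c]
    current = ROOT_INO
    followed = 0
    while pending:
        name = pending.pop(0)
        node = inodes[current]
        if node["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in node["entries"]:
            raise FileNotFoundError(name)
        child = node["entries"][name]
        if inodes[child]["type"] == "symlink":
            followed += 1
            if followed > max_symlinks:
                raise OSError("too many levels of symbolic links")
            target = inodes[child]["target"]
            # Splice the target's components in front of what is left to resolve.
            pending = [c for c in target.split("/") if c] + pending
            if target.startswith("/"):
                current = ROOT_INO  # absolute target restarts at the root
        else:
            current = child
    return current

if __name__ == "__main__":
    print(resolve("/var/log/syslog"))
    print(resolve("/logs/syslog"))  # goes through the symlink
```

Each loop iteration corresponds to one directory read, which is why deep paths and symlink chains cost extra lookups when nothing is cached.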
This design mirrors how content-addressable stores separate human names from immutable hashes: in Git, refs such as branch names are mutable pointers to hash-addressed commit objects, while file systems use mutable paths over inode tables. Both must answer: "given this identifier, where are the bytes?"
The page cache: fast reads, delayed writes
Disk is orders of magnitude slower than RAM. The kernel keeps a page cache of recently accessed file blocks in memory — the same 4 KiB pages the virtual memory subsystem manages. A read often hits cached pages; a write typically marks cache pages dirty and returns to your program long before the storage device sees the data.
That buffering is why a text editor "saves" instantly and why a database
must call fsync() (or fdatasync()) on its WAL
file before acknowledging a commit. Without an explicit flush, a power loss
can discard writes the application believed were durable. fsync() blocks
until the file's dirty pages have been written back and the device has been
asked to flush its volatile write cache; on SSDs the extra writes may also
trigger FTL garbage collection internally.
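One common pattern for making a small file update durable is write to a temporary file, fsync it, rename it over the target, then fsync the directory. The sketch below shows that pattern under POSIX assumptions; durable_write is a name chosen for this example, and real databases use more elaborate schemes for their WAL.

```python
import os
import tempfile

def durable_write(path, data):
    """Atomically replace path with data and push it to stable storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # Python's buffer -> kernel page cache
            os.fsync(f.fileno())  # page cache -> storage device
        os.replace(tmp_path, path)  # atomic rename over any old version
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # The rename lives in the directory's data, so flush the directory too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "state.json")
        durable_write(target, b'{"committed": 1}\n')
        durable_write(target, b'{"committed": 2}\n')
        with open(target, "rb") as f:
            print(f.read())
```

The temporary file is created in the same directory as the target so that the rename stays within one file system, which is what makes it atomic.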
Crash consistency: journaling and copy-on-write
Updating a file can require multiple disk writes: new data blocks, an updated inode, and a directory entry. If power fails between them, the file system can end up with allocated blocks not linked to any file (leaks) or pointers to garbage (corruption). Production file systems use structured recovery:
Journaling (ext4, XFS metadata journaling)
A journal is a circular log of pending metadata changes.
Before applying an operation to the main structures, the file system appends
a transaction describing the intended end state. After a crash, replay
reapplies fully committed transactions and discards incomplete ones. ext4 defaults
to data=ordered: metadata is journaled; data blocks are written
before metadata commits, reducing torn-file risk without journaling every
payload byte.
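The replay idea can be sketched with a toy in-memory journal. This is a conceptual model only, not how ext4's journal (jbd2) or the XFS log is laid out: the record format, key names, and functions are invented for illustration.

```python
def log_transaction(journal, txid, changes, crash_before_commit=False):
    """Append a transaction's intended end state, then its commit record."""
    journal.append(("begin", txid, dict(changes)))
    if crash_before_commit:
        return  # simulate power loss before the commit record is written
    journal.append(("commit", txid, None))

def replay(journal, state):
    """Reapply committed transactions in log order; skip uncommitted ones."""
    committed = {txid for kind, txid, _ in journal if kind == "commit"}
    recovered = dict(state)
    for kind, txid, changes in journal:
        if kind == "begin" and txid in committed:
            recovered.update(changes)
    return recovered

if __name__ == "__main__":
    journal = []
    # Transaction 1: create a file (directory entry plus inode size).
    log_transaction(journal, 1, {"dirent:/a.txt": 12, "inode:12:size": 4096})
    # Transaction 2: interrupted before its commit record reached the log.
    log_transaction(journal, 2, {"dirent:/b.txt": 13}, crash_before_commit=True)

    recovered = replay(journal, {})
    print(recovered)
    # Records describe end states, so replaying twice changes nothing.
    print(replay(journal, recovered) == recovered)
```

Because each record states the end result rather than a delta, recovery can be interrupted and restarted without harm, which is the property that makes replay safe after repeated crashes.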
Copy-on-write (Btrfs, ZFS)
COW file systems never overwrite live blocks in place. A write allocates new blocks, builds a new inode snapshot pointing at them, and atomically swaps the tree root. Old blocks remain reachable until reference counts drop — enabling cheap snapshots and send/receive replication. The trade-off is fragmentation, higher write amplification on random I/O, and RAM hunger for ZFS's adaptive replacement cache (ARC).
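A minimal way to see why snapshots are cheap under copy-on-write is to model the file system as an immutable map from names to block tuples. The sketch below is a toy with a two-level structure; real Btrfs and ZFS trees are far deeper, and the cow_write name and layout are assumptions made for this example.

```python
from types import MappingProxyType

EMPTY_ROOT = MappingProxyType({})

def cow_write(root, name, block_index, data):
    """Return a new root with one block replaced; the old root is left untouched."""
    blocks = list(root.get(name, ()))
    while len(blocks) <= block_index:
        blocks.append(b"")           # extend the file with empty blocks if needed
    blocks[block_index] = data       # the new block lives in a fresh list
    new_root = dict(root)            # copy only the top-level map
    new_root[name] = tuple(blocks)   # unchanged files keep sharing their tuples
    return MappingProxyType(new_root)

if __name__ == "__main__":
    v1 = cow_write(EMPTY_ROOT, "a.txt", 0, b"hello")
    v1 = cow_write(v1, "b.txt", 0, b"world")

    snapshot = v1                    # a snapshot is just a reference to a root
    v2 = cow_write(v1, "a.txt", 0, b"HELLO")

    print(snapshot["a.txt"], v2["a.txt"])
    # The untouched file is shared between versions, not copied.
    print(v2["b.txt"] is snapshot["b.txt"])
```

Taking the snapshot costs one reference, while every later write pays for new blocks and a new path to the root, which is where the write amplification mentioned above comes from.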
Databases that ship their own storage engine (RocksDB and other LSM-tree
designs) still sit on a file system unless given raw block devices; mount
options such as noatime, partition alignment, and avoiding double journaling
all matter at scale.
SSD-specific behavior: TRIM, wear, and queues
Spinning disks care about seek order; SSDs care about erase blocks and
over-provisioning. When you delete a file, the file system frees logical
blocks but the drive may not know those pages are stale until you issue
TRIM (DISCARD). Without TRIM, garbage
collection inside the SSD keeps relocating stale pages, competes with
foreground I/O, and causes latency spikes.
NVMe queues are deep and parallel — very different from a single SATA link.
File systems and the block layer batch I/O; misaligned small writes still
hurt. For server workloads, XFS and ext4 with the none I/O scheduler (noop
on older kernels without blk-mq) and properly partitioned volumes often beat
desktop defaults tuned for laptop power saving.
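On Linux the active I/O scheduler for a device is exposed in /sys/block/<device>/queue/scheduler, with the active choice shown in square brackets. The helper below parses that format; the function names are made up for this example, and treating a lone unbracketed entry as active is an assumption of this sketch.

```python
def parse_scheduler(text):
    """Return (active, available) from a sysfs scheduler line such as
    "[none] mq-deadline kyber"."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    if active is None and len(available) == 1:
        active = available[0]  # assumption: a single entry is the active one
    return active, available

def read_scheduler(device):
    """Read the scheduler line for a block device name such as "nvme0n1" (Linux only)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())

if __name__ == "__main__":
    print(parse_scheduler("[none] mq-deadline kyber bfq\n"))
    print(parse_scheduler("mq-deadline [kyber] none\n"))
```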
Comparing ext4, XFS, Btrfs, and ZFS
- ext4: default on many Linux distros; mature journaling, reasonable small-file performance, online resize, no built-in snapshots. Safe general-purpose root file system.
- XFS: excels at large files and parallel metadata ops; allocation groups reduce lock contention. Common for database and media servers. Delayed allocation improves contiguous writes but can hide ENOSPC until flush.
- Btrfs: COW, checksums, snapshots, subvolumes; part of the mainline kernel. Good for desktops and NAS-like setups; historically had edge-case bugs, so verify kernel version and backup strategy.
- ZFS: checksumming, compression, snapshots, RAID-Z; on Linux it runs as an out-of-tree kernel module via OpenZFS rather than as part of the mainline kernel. Strong data integrity story; needs plenty of RAM and careful pool layout. Popular for backups and archives.
Cloud block volumes (EBS, Persistent Disk) add another layer: network latency, burst credits, and replication beneath your chosen file system. A Postgres pod on slow storage shows up as checkpoint spikes long before query plans look wrong.
Practical checklist for builders
- Treat write() success as "buffered" unless you fsync() (or use direct I/O with eyes open).
- Separate WAL/log volumes from data when you can; match file system to workload (XFS for large sequential, ext4 for mixed).
- Enable periodic TRIM on SSDs; monitor iostat await and queue depth, not just CPU.
- Snapshot before major upgrades; COW file systems make this cheap, so use it.
- In containers, remember the writable layer is a file system on a file system; lots of small writes to overlayfs can be slow, so mount volumes for databases.
- Backups must read consistent blocks (snapshots + pg_dump, not blind cp of a live data directory mid-write).