Chapter 3: Storage and Retrieval

How Databases Store and Retrieve Data: From the World's Simplest Key-Value Store to High-Performance LSM-Trees and B-Trees.

At its most fundamental level, a database is a tool for storing data and finding it again. This chapter explores the storage engines that power modern databases, comparing log-structured approaches with page-oriented structures.

The World's Simplest Database

Consider a database implemented as two Bash functions:

#!/bin/bash
db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

This implements a simple append-only log. - Write: Extremely fast (O(1)) because it just appends to the end of a file. - Read: Terrible performance (O(n)) because it must scan the entire file to find a key.

The "So What?": To make reads efficient, we need an Index. However, any index slows down writes because it must be updated every time data is written. This is the fundamental trade-off in storage systems.

I. Hash Indexes

The simplest indexing strategy is a Hash Map kept in memory, where every key is mapped to a byte offset in the data file.

i. Bitcask (Riak's Default Engine)

Bitcask offers high-performance reads and writes but requires all keys to fit in RAM. - Efficiency: Values can be larger than memory because they are loaded with a single disk seek. - Workload: Ideal for scenarios where a limited number of keys are updated frequently (e.g., a play counter for a URL).

ii. Compaction and Merging

To prevent running out of disk space, the log is broken into segments. Compaction throws away duplicate keys, keeping only the most recent update. Multiple segments are merged in the background to keep the total number of files manageable.

II. SSTables and LSM-Trees

To improve upon hash indexes, we require that the sequence of key-value pairs is sorted by key. This format is called a Sorted String Table (SSTable).

i. The LSM-Tree Flow

Log-Structured Merge-Trees (LSM-Trees) use SSTables as their core storage format, managing data from volatile RAM to immutable disk segments.

graph TD
    Write[Write Operation] --> WAL[Write-Ahead Log]
    Write --> Mem[Memtable]
    Mem -->|Flush| SST[SSTable Segments]
    SST -->|Compaction| SST

    Read[Read Request] --> Mem
    Mem -.->|Not Found| Bloom{Bloom Filter}
    Bloom -->|Possible Hit| SST

- Write --> WAL: Every write is first appended to the Write-Ahead Log to ensure durability.

Write --> Memtable: The write is simultaneously added to a sorted in-memory balanced tree (Memtable).
Memtable --> SSTable: When full, the Memtable is flushed to disk as a sorted, immutable SSTable segment.
SSTable --> Compaction: Background processes merge and compact segments to reclaim space and reduce read amplification.
Read --> Memtable: The system first checks the Memtable to retrieve the most recent version of a key.
Mem --> Bloom Filter: If missing from RAM, the system checks the Bloom Filter to avoid unnecessary disk probes.
Bloom --> SSTable: If the Bloom Filter indicates a possible match, the system searches the on-disk segments.

ii. Bloom Filters

LSM-Trees can be slow for non-existent keys. Bloom Filters are memory-efficient structures that can quickly confirm if a key definitely does not exist, saving unnecessary disk I/O.

[!TIP] Hands-on Implementation: I built a production-grade LSM-Tree in Go. Check out the Go-LSM Project to see these concepts in action.

III. B-Trees

While LSM-Trees are gaining popularity, B-Trees remain the most widely used indexing structure. They break the database down into fixed-size pages (usually 4 KB).

i. Architecture

One page is the root. It contains keys and pointers to child pages. Each child covers a continuous range of keys. - Height: A B-tree with n keys always has a height of $O(\log n)$. - Update-in-place: Unlike LSM-trees, B-trees overwrite pages on disk with new data.

ii. Durability: WAL and Latches

Write-Ahead Log (WAL): Since B-trees overwrite pages, a crash during an update could corrupt the tree. The WAL is used to restore the tree to a consistent state.
Latches: Lightweight locks used to protect the tree's internal state during concurrent access.

IV. LSM-Trees vs. B-Trees: The Trade-offs

Feature	LSM-Tree	B-Tree
Write Pattern	Append-only (Sequential)	Update-in-place (Random)
Throughput	High Write Throughput	Consistent Read Performance
Storage Overhead	Fragmented (Needs Compaction)	Fixed-size Pages
Consistency	Eventual (via Merging)	Strong (Single Key Location)

V. Indexing Secondary Structures

Heap Files: A common pattern where an index stores references to rows stored in a separate "heap" file.
Multi-Column Indexes:
- Concatenated Index: Appends one column to another to form a single key.
- Multi-dimensional Index: Crucial for geospatial data (e.g., searching within a rectangular map area). PostGIS uses R-trees for this purpose.

VI. OLTP vs. OLAP

Modern data systems are split between transaction processing and analytics.

Feature	OLTP (Transaction)	OLAP (Analytics)
Read Pattern	Small number of records by key	Aggregates over millions of rows
Write Pattern	Random-access, low-latency	Bulk import (ETL)
User	End user (via Web App)	Internal Analyst
Dataset Size	Gigabytes to Terabytes	Terabytes to Petabytes

Data Warehousing

To avoid slowing down production OLTP systems, companies use Data Warehouses. The process of moving data from OLTP to the warehouse is known as Extract-Transform-Load (ETL).

graph TD
    subgraph "Users"
        Customer((Customer))
        Worker((Warehouse Worker))
        Driver((Truck Driver))
    end

    subgraph "OLTP Systems"
        Ecom[E-commerce site]
        Stock[Stock-keeping app]
        Route[Vehicle route planner]

        SalesDB[(Sales DB)]
        InvDB[(Inventory DB)]
        GeoDB[(Geo DB)]
    end

    subgraph "OLAP Systems (ETL Process)"
        T1[Transform]
        T2[Transform]
        T3[Transform]
        WH[(Data Warehouse)]
        Analyst((Business Analyst))
    end

    Customer --> Ecom
    Worker --> Stock
    Driver --> Route

    Ecom --> SalesDB
    Stock --> InvDB
    Route --> GeoDB

    SalesDB -->|Extract| T1
    InvDB -->|Extract| T2
    GeoDB -->|Extract| T3

    T1 -->|Load| WH
    T2 -->|Load| WH
    T3 -->|Load| WH

    Analyst -->|Query| WH

- User --> App: Different user roles interact with specialized applications to perform business tasks.

App --> DB: Each application writes its operational data to a dedicated OLTP database.
DB --> Transform: Data is extracted from various production databases to begin the ETL lifecycle.
Transform --> WH: Cleaned and formatted data is loaded into the central Data Warehouse for consolidation.
Analyst --> WH: Business analysts run complex analytical queries against the warehouse to drive decision-making.

i. Stars and Snowflakes: Schemas for Analytics

Unlike the complex, highly normalized schemas in OLTP, data warehouses often use a Star Schema (also known as dimensional modeling). - Fact Tables: The center of the star. Each row represents an event (e.g., a single sale) and contains foreign keys to dimension tables. - Dimension Tables: The points of the star. They contain descriptive data (e.g., product details, date information, store locations). - Snowflake Schema: A variation where dimensions are further broken down into sub-dimensions (e.g., "brand" is a sub-dimension of "product").

ii. Column-Oriented Storage

Transactional databases store data row-by-row on disk. For analytics, this is inefficient because most queries only need a few columns but scan millions of rows. - The Solution: Store all values from each column together on disk. - Efficiency: If a query only needs total_price, the engine only reads that specific column's file, skipping every other field.

iii. Column Compression

Storing columns together allows for extreme compression because values in a column are often repetitive. - Bitmap Encoding: For columns with low cardinality (e.g., "gender" or "product_category"), we can use a bitmask for each value. - Run-length Encoding: Compressing repeated sequences of the same value.

iv. Aggregations: Data Cubes and Materialized Views

In some cases, even columnar storage is too slow. - Materialized Views: Storing the actual results of a query on disk (like a cache) to avoid re-calculating common aggregates. - Data Cubes: A multi-dimensional materialized view that pre-computes sums for every combination of dimensions (e.g., sales by product, by store, by day).

Last Updated: April 26, 2026

End Note: Storage engines are not one-size-fits-all. A Bitcask engine is perfect for a cache, an LSM-tree for a high-write time-series database, and a B-tree for a traditional relational store. Understanding the underlying data structures is the first step toward choosing the right tool for the job.