Chapter 2: Data Models and Query Languages

Data models are perhaps the most important part of developing software because they have a profound effect: not only on how the software is written but also on how we think about the problem we are solving.

I. Relational Model vs. Document Model

The Relational Data Model, introduced by Edgar Codd in 1970, organizes data into tables (relations) made up of rows (tuples). It originated from business data processing needs in the 1960s–70s, such as banking, reservations, and payroll systems.

In contrast, the Document Model (and NoSQL) emerged in the 2010s to challenge relational dominance, focusing on distributed systems and flexibility.

Feature	Relational Model	Document Model
Structure	Tables/Rows (Tuples)	Nested documents (JSON/BSON)
Best For	Many-to-many relationships	One-to-many / Tree-like structures
Joins	Native support	Often requires manual application-side joins
Schema	Explicit (Schema-on-write)	Implicit (Schema-on-read)

II. The Advantages of Using IDs

Whether you store an ID or a text string is a question of duplication.

Normalization: Using an ID (e.g., 34) instead of a string (e.g., Philanthropy) ensures that human-meaningful information is stored in only one place.
Immutability: Because an ID has no inherent meaning to humans, it never needs to change. Even if the text representation changes, the reference remains valid.
Redundancy: If information is duplicated, updating it requires finding and changing every record, which is error-prone.

graph LR
    UserRecord[User Record] -->|ref: dept_id=42| DeptTable[(Department Table)]
    DeptTable -->|42| DeptName[Philanthropy]

    subgraph "The ID Benefit"
    DeptName -.->|Change Label| NewName[Social Impact]
    UserRecord -.->|Still Points To| DeptTable
    end

- UserRecord --> DeptTable: User Record references a department via a numeric dept_id instead of a literal string.

DeptTable --> DeptName: The Department Table maps the numeric ID to the human-readable department name.
DeptName --> NewName: Updating the label in the Department Table (e.g., to "Social Impact") propagates to all records without changing the User Record.
UserRecord --> DeptTable (Stable): The User Record remains stable as it still points to the same invariant ID in the Department Table.

III. Are Document Databases Repeating History?

In the 1970s, IBM’s Information Management System (IMS) used the Hierarchical Model, which is remarkably similar to the JSON model used by document databases today.

The Hierarchical Model (IMS)

IMS represented all data as a tree of records nested within records. - Pros: Worked well for simple one-to-many relationships. - Cons: Many-to-many relationships were difficult, and it did not support joins. Developers had to manually resolve references or denormalize data.

graph TD
    Root[User Profile]
    Root --> Positions[Work Positions]
    Root --> Education[Education History]
    Positions --> Pos1[Engineer]
    Positions --> Pos2[Manager]
    Education --> Edu1[University]

- Root --> Positions: The Root User Profile serves as the parent for nested Work Positions.

Root --> Education: Education History is similarly nested as a sub-document of the user profile.
Positions --> JobRoles: Specific job roles (Engineer, Manager) are leaf nodes within the Work Positions branch.
Education --> Institution: Specific institutions are leaf nodes within the Education History branch.

IV. The Network Model (CODASYL)

The Network Model was standardized by CODASYL. It allowed a record to have multiple parents, enabling many-to-one and many-to-many relationships.

The Concept of Access Paths

Unlike foreign keys, links in the network model were like pointers stored on disk. To access a record, you had to follow a specific path from a root record along these chains.

Why it Failed

Manual Navigation: Querying required moving a cursor through the database by iterating over records.
Complexity: If a record had multiple parents, the developer had to keep track of all relationships. CODASYL committee members admitted this was like "navigating around an n-dimensional data space."

V. The Relational Solution

The relational model simplified everything by laying data out in the open: a table is simply a collection of rows. There are no labyrinthine nested structures or complicated manual access paths.

The Query Optimizer

In a relational database, you simply describe the data you want (Declarative). The Query Optimizer automatically decides: - Which parts of the query to execute first. - Which indexes to use.

These choices are the "access paths," but they are managed automatically by the database, not the developer.

VI. Schema Flexibility

Concept	Schema-on-write (Relational)	Schema-on-read (Document)
Enforcement	Database ensures all data conforms to the schema.	Database stores anything; code interprets structure.
Analogy	Statically typed programming language.	Dynamically typed programming language.
Migration	Requires `ALTER TABLE` (can be slow).	Just start writing new fields to new documents.

[!WARNING] MySQL Limitation: Most relational systems execute ALTER TABLE in milliseconds. However, MySQL copies the entire table on ALTER TABLE, which can cause hours of downtime for large datasets.

VII. Declarative vs. Imperative Query Languages

Imperative: Tells the computer how to perform operations in a specific order (e.g., iterating through a list).
Declarative (SQL): Describes what data is needed, but not the implementation details.

Example: Finding all animals of species "Sharks"

Imperative Approach (JavaScript)

function getSharks(animals) {
    let sharks = [];
    for (let i = 0; i < animals.length; i++) {
        if (animals[i].species === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}

Declarative Approach (SQL)

SELECT * FROM animals WHERE species = 'Sharks';

Why Declarative is Better:

Optimization: The database can optimize the execution path behind the scenes (e.g., reclaiming disk space without breaking queries).
Parallelism: Declarative languages are much easier to execute in parallel because they don't specify a rigid order of operations.

VIII. Graph-like Data Models

When many-to-many relationships become the core of your data, the Graph Model is the most natural choice. A graph consists of Vertices (Nodes) and Edges (Relationships).

Common Graph Examples: - Social Graphs: Vertices are people; edges are friendships. - Web Graphs: Vertices are pages; edges are HTML links. - Logistics: Vertices are junctions; edges are roads/rails.

Property Graphs

In this model: - Vertices have a unique ID, incoming/outgoing edges, and property KVs. - Edges have a unique ID, head/tail vertices, a label, and property KVs.

Example: Cypher Query Language

Cypher is a declarative language (created for Neo4j) used to describe patterns in graphs.

CREATE
  (NAmerica:Location {name:'North America', type:'continent'}),
  (USA:Location {name:'United States', type:'country' }),
  (Idaho:Location {name:'Idaho', type:'state' }),
  (Lucy:Person {name:'Lucy' }),
  (Idaho) -[:WITHIN]-> (USA) -[:WITHIN]-> (NAmerica),
  (Lucy) -[:BORN_IN]-> (Idaho)

IX. Comparison: Graph Databases vs. Network Model

Feature	CODASYL (Network Model)	Graph Databases
Schema	Rigid; specified which types could nest.	Flexible; any vertex can link to any other.
Access	Only via manual access paths.	Direct ID reference or property indexes.
Ordering	Children were an ordered set (manual management).	No defined ordering; sort only in queries.

Last Updated: April 11, 2026

End Note: Data models provide the interface between components. Choosing a model that doesn't match the application's data structure leads to "impedance mismatch" and unnecessary complexity. Whether it's the simplicity of Relational, the flexibility of Document, or the connectivity of Graph, the goal remains the same: reducing the gap between how we think and how the data is stored.