Apache Iceberg Architecture: Snapshots, Manifests, and Catalogs
To truly master the Open Data Lakehouse, a data engineer must move beyond high-level definitions and understand the internal structure of the Apache Iceberg architecture. Iceberg is not a file format; it is a deterministic metadata tracking protocol designed to enforce ACID transaction guarantees on top of object storage systems like Amazon S3, which offer no native transaction support of their own.
The entire architecture is designed to solve one hard computer science problem: how do you let many distributed servers simultaneously read and write a single massive dataset without corrupting files or slowing down queries? Iceberg achieves this through a hierarchical Metadata Tree and a strict implementation of Optimistic Concurrency Control.
The Metadata Tree Hierarchy
The physical files of an Iceberg table reside entirely in object storage. They are organized in a strict top-down hierarchy.
```mermaid
graph TD
    subgraph CatalogLayer[The Iceberg Catalog]
        Catalog["Apache Polaris / REST Catalog"]
    end
    subgraph MetadataLayer[Table Metadata Layer]
        Metadata["TableMetadata v12.json"]
    end
    subgraph SnapshotLayer[Snapshot Layer]
        Snapshot["Snapshot ID: 89347598"]
    end
    subgraph ManifestListLayer[Manifest List Layer]
        ManifestList["ManifestList.avro"]
    end
    subgraph ManifestLayer[Manifest File Layer]
        Manifest1["ManifestFile_A.avro"]
        Manifest2["ManifestFile_B.avro"]
    end
    subgraph DataLayer[Physical Data Layer]
        Parquet1["DataFile_1.parquet"]
        Parquet2["DataFile_2.parquet"]
        Parquet3["DataFile_3.parquet"]
    end
    Catalog -->|Points to Current| Metadata
    Metadata -->|Contains Array of| Snapshot
    Snapshot -->|Points to One| ManifestList
    ManifestList -->|Tracks Bounds of| Manifest1
    ManifestList -->|Tracks Bounds of| Manifest2
    Manifest1 -->|Points to URI| Parquet1
    Manifest1 -->|Points to URI| Parquet2
    Manifest2 -->|Points to URI| Parquet3
```
Level 1: The Iceberg Catalog
The Catalog is the source of truth. When Apache Spark wants to read a table named sales_data, it asks the Catalog. The Catalog is essentially a small transactional key-value store (backed by a service like Apache Polaris, AWS Glue, or Nessie). Its sole job is to store a single pointer per table: the exact S3 URI of the current Table Metadata JSON file.
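To make this concrete, here is a toy in-memory catalog showing what Polaris, Glue, or Nessie conceptually store for each table: nothing but a name-to-current-metadata-URI pointer. The class and method names are illustrative, not a real Iceberg client API.

```python
class ToyCatalog:
    """A toy stand-in for an Iceberg catalog: one pointer per table."""

    def __init__(self):
        self._pointers = {}  # table name -> URI of current metadata JSON

    def register(self, table, metadata_uri):
        self._pointers[table] = metadata_uri

    def load_table(self, table):
        # This single string is everything an engine needs to start planning:
        # it downloads this file and walks the metadata tree from there.
        return self._pointers[table]


catalog = ToyCatalog()
catalog.register("sales_data", "s3://bucket/metadata/v12.json")
print(catalog.load_table("sales_data"))  # s3://bucket/metadata/v12.json
```

Everything else about the table — schema, snapshots, file lists — lives in object storage, which is why the catalog itself can stay tiny and fast.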
Level 2: The Table Metadata JSON
The Catalog hands Spark a string: s3://bucket/metadata/v12.json.
Spark downloads this JSON file, which contains the table's entire history and configuration. It explicitly lists the current schema, the partition specification (e.g., days(timestamp)), and an array of all historical Snapshots. Spark reads the JSON, identifies the ID of the current Snapshot, and follows its pointer.
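The lookup an engine performs on this file can be sketched with a heavily trimmed-down metadata document. The field names below (current-snapshot-id, snapshots, manifest-list) follow the Iceberg table spec, but the content is illustrative and far smaller than a real metadata file:

```python
import json

# A trimmed-down stand-in for an Iceberg TableMetadata JSON file.
metadata = json.loads("""
{
  "current-snapshot-id": 89347598,
  "snapshots": [
    {"snapshot-id": 11111111,
     "manifest-list": "s3://bucket/metadata/snap-111.avro"},
    {"snapshot-id": 89347598,
     "manifest-list": "s3://bucket/metadata/snap-893.avro"}
  ]
}
""")

# Step 1: identify the current snapshot ID.
current_id = metadata["current-snapshot-id"]

# Step 2: find that snapshot in the history array and read its pointer.
snapshot = next(s for s in metadata["snapshots"]
                if s["snapshot-id"] == current_id)
print(snapshot["manifest-list"])  # the next file the engine downloads
```

Note that older snapshots stay in the array: time travel is just picking a different snapshot-id from this list instead of the current one.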
Level 3: The Snapshot and Manifest List
The Snapshot points to exactly one Manifest List (an Avro file). The Manifest List is an index of manifests: it contains one row for every Manifest File that belongs to this specific snapshot. Crucially, each row stores the upper and lower partition-value bounds of the data tracked by that manifest. If the query filters for the year 2026 and the Manifest List shows that a given Manifest File only contains 2024 data, the engine skips that Manifest File entirely; it is never even downloaded.
Level 4: The Manifest Files
The remaining Manifest Files (also Avro files) are downloaded. These files contain the actual S3 URI paths of the underlying Apache Parquet data files, along with per-file statistics: the row count and the min/max values of every column in each file.
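The two levels of pruning described above can be sketched as follows. Plain dicts stand in for the Avro records, and the year bounds are invented for illustration; only the file names echo the diagram earlier in this article.

```python
# Level 1 input: the Manifest List, one row per manifest, with partition bounds.
manifest_list = [
    {"path": "ManifestFile_A.avro", "lower": 2023, "upper": 2024},
    {"path": "ManifestFile_B.avro", "lower": 2025, "upper": 2026},
]

# Level 2 input: each manifest's rows, with per-column min/max stats per file.
manifests = {
    "ManifestFile_A.avro": [
        {"path": "DataFile_1.parquet", "stats": {"year": {"min": 2023, "max": 2023}}},
        {"path": "DataFile_2.parquet", "stats": {"year": {"min": 2024, "max": 2024}}},
    ],
    "ManifestFile_B.avro": [
        {"path": "DataFile_3.parquet", "stats": {"year": {"min": 2025, "max": 2026}}},
    ],
}

def prune(target_year):
    """Return only the data files that could contain target_year."""
    files = []
    for m in manifest_list:
        # Level 1: bounds in the Manifest List drop whole manifests
        # without ever opening them.
        if not (m["lower"] <= target_year <= m["upper"]):
            continue
        # Level 2: per-column min/max stats drop individual data files.
        for f in manifests[m["path"]]:
            col = f["stats"]["year"]
            if col["min"] <= target_year <= col["max"]:
                files.append(f["path"])
    return files

print(prune(2024))  # ['DataFile_2.parquet']
```

A filter for 2024 drops ManifestFile_B at level 1, then drops DataFile_1 at level 2, so only DataFile_2.parquet is ever read from S3.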
The Commit Flow and Concurrency
When a data pipeline updates a table, it must do so without breaking the queries of analysts who are currently reading that table. Iceberg achieves this via Snapshot Isolation.
- The Read Phase: The analyst executes a query. They read v12.json and its associated snapshots, and are isolated on that consistent view of the past.
- The Write Phase: The data pipeline (Spark) writes new Parquet data files to S3. It then generates new Manifest Files, a new Manifest List, and a brand new v13.json metadata file.
- The Commit Phase: Spark sends a request to the Catalog: "Please swap the table pointer from v12.json to v13.json."
- Optimistic Concurrency Control: The Catalog checks the current state. If no one else has updated the table, the swap is atomic and effectively instantaneous. If another pipeline committed its own v13.json a moment earlier, the Catalog rejects Spark's request. Spark must then download the new v13.json, re-evaluate its metadata tree, generate a v14.json, and retry the commit.
This architectural rigidity is the reason Apache Iceberg is trusted by large enterprises to handle petabyte-scale transactional workloads.
Learn More
To learn more about the Data Lakehouse, read the book “Lakehouse for Everyone” by Alex Merced. You can find this and other books by Alex Merced at books.alexmerced.com.