Introduction


Description

Modern data platforms need the flexibility of data lakes and the reliability of warehouses. Apache Iceberg delivers both. In this course, you’ll go beyond the hype and learn the inner workings of this powerful open table format. We’ll walk through the complete Iceberg architecture and explore how it brings schema evolution, time travel, and performant analytics to your Lakehouse setup.

This is a hands-on course grounded in real-world data engineering practice. You’ll set up a local lab using Docker, Spark, and MinIO to build and interact with Iceberg tables. From writing data and exploring metadata to optimizing queries and rewriting partitions, you’ll gain the experience needed to confidently use Iceberg in production-like environments.

By the end of this course, you’ll not only understand how Apache Iceberg works under the hood; you’ll also have a running system, project-ready notebooks, and a strong grasp of the table operations that matter in real Lakehouse architectures.


Why Iceberg?

Iceberg solves long-standing challenges in big data: slow queries, complex schema changes, and tightly coupled storage and compute. You’ll learn why leading tech companies like Netflix, Stripe, and Apple use Iceberg to power their data platforms, and how you can apply the same principles in your own stack.

Building the Lakehouse Lab

You’ll create a complete Iceberg-based Lakehouse lab using Docker Compose. This includes Spark for processing, a REST-based Iceberg catalog, and MinIO as your S3-compatible object storage. Everything runs locally, so you can explore, break, and rebuild with confidence.
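
To give you a taste of the wiring involved, here’s a minimal PySpark sketch of how a Spark session can be pointed at a REST catalog and MinIO. The catalog name ("lab"), endpoints, ports, and bucket are placeholder assumptions; the course’s Docker Compose setup defines the actual values.

    from pyspark.sql import SparkSession

    # Minimal sketch: catalog name, endpoints, and bucket are placeholders
    # taken from common defaults; your docker-compose.yml defines the real ones.
    spark = (
        SparkSession.builder
        .appName("iceberg-lab")
        # Enable Iceberg's SQL extensions (needed later for partition DDL)
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        # Register an Iceberg catalog named "lab", backed by the REST catalog service
        .config("spark.sql.catalog.lab", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lab.type", "rest")
        .config("spark.sql.catalog.lab.uri", "http://localhost:8181")
        # Keep table data and metadata in MinIO through the S3-compatible FileIO
        .config("spark.sql.catalog.lab.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.lab.warehouse", "s3://warehouse/")
        .config("spark.sql.catalog.lab.s3.endpoint", "http://localhost:9000")
        .getOrCreate()
    )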

Writing and Managing Tables

Start by creating your first Iceberg table using a fun Pokémon dataset (which you can replace with your own later on). You’ll define schemas, write data with PySpark, and inspect how Iceberg tracks table metadata through manifests, snapshots, and partitioning.
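
Here’s a hedged preview of that first step, reusing the hypothetical "lab" catalog from the sketch above; the rows and columns are made up, and the course dataset defines the real schema.

    from pyspark.sql.functions import col

    # Made-up rows standing in for the course's Pokémon dataset
    df = spark.createDataFrame(
        [(1, "Bulbasaur", "Grass"), (4, "Charmander", "Fire")],
        ["id", "name", "primary_type"],
    )

    # Create the namespace, then a partitioned Iceberg table holding the rows
    spark.sql("CREATE NAMESPACE IF NOT EXISTS lab.pokedex")
    df.writeTo("lab.pokedex.pokemon").partitionedBy(col("primary_type")).createOrReplace()

    # Iceberg's bookkeeping is exposed as queryable metadata tables
    spark.sql("SELECT snapshot_id, operation FROM lab.pokedex.pokemon.snapshots").show()
    spark.sql("SELECT path, added_data_files_count FROM lab.pokedex.pokemon.manifests").show()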

Schema Evolution and Partitioning

Iceberg makes schema changes easy. You’ll walk through column adds, renames, and type changes, and see how they’re reflected in metadata. Then, dive into advanced partitioning techniques, including hidden and derived partitions, to improve query performance without changing your data model.
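
To give a feel for the DDL involved, here’s a sketch against the hypothetical table from above. All of these are metadata-only operations, so no Parquet files are rewritten; the ADD PARTITION FIELD statement needs the Iceberg SQL extensions enabled in the earlier session sketch.

    # Add a column, widen its type, and rename an existing column
    spark.sql("ALTER TABLE lab.pokedex.pokemon ADD COLUMN generation INT")
    spark.sql("ALTER TABLE lab.pokedex.pokemon ALTER COLUMN generation TYPE BIGINT")
    spark.sql("ALTER TABLE lab.pokedex.pokemon RENAME COLUMN primary_type TO type1")

    # Hidden partitioning: partition by a transform of id instead of a
    # manually maintained derived column; file pruning happens automatically
    spark.sql("ALTER TABLE lab.pokedex.pokemon ADD PARTITION FIELD bucket(8, id)")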

Row-Level Operations and Time Travel

Learn how to perform fine-grained table changes like row-level deletes and how to query previous versions of a dataset using Iceberg’s powerful time travel capabilities. These features are essential for building reliable, traceable data pipelines.
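
As a preview, here’s a hedged sketch against the hypothetical table from earlier, assuming a Spark version recent enough (3.3+) for the VERSION AS OF syntax.

    # Row-level delete: Iceberg commits a new snapshot rather than rewriting the table
    spark.sql("DELETE FROM lab.pokedex.pokemon WHERE name = 'Charmander'")

    # Time travel in SQL: read the table as of its first snapshot
    first = spark.sql(
        "SELECT snapshot_id FROM lab.pokedex.pokemon.snapshots ORDER BY committed_at"
    ).first().snapshot_id
    spark.sql(f"SELECT * FROM lab.pokedex.pokemon VERSION AS OF {first}").show()

    # The same historical read through the DataFrame API
    spark.read.option("snapshot-id", first).table("lab.pokedex.pokemon").show()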

Iceberg Architecture Deep Dive

Unpack how Iceberg works under the hood, from immutable Parquet files and manifest lists to metadata snapshots and catalog integrations. You’ll develop a clear mental model of the layered architecture that powers Iceberg’s performance and flexibility.
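
A nice property of that layered design is that every layer is queryable. As a small sketch, again using the hypothetical table from above:

    # Snapshot lineage: which snapshot was current at which point in time
    spark.sql("SELECT made_current_at, snapshot_id FROM lab.pokedex.pokemon.history").show()

    # The immutable Parquet data files the current snapshot reaches via its manifests
    spark.sql("SELECT file_path, record_count FROM lab.pokedex.pokemon.files").show()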

Inspecting Data in MinIO

Use MinIO’s UI to visualize how data and metadata are stored. You'll explore partitioned file layouts, see snapshots evolve over time, and understand how Iceberg structures physical data for analytical performance.
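
If you’d rather script that inspection than click through the UI, here’s a hedged sketch using boto3 against MinIO’s S3-compatible API. The endpoint, bucket, prefix, and credentials are placeholder assumptions; match them to your lab’s Docker Compose settings.

    import boto3

    # Connect to MinIO through its S3-compatible API (placeholder credentials)
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="admin",
        aws_secret_access_key="password",
    )

    # Expect a data/ prefix with partitioned Parquet files and a metadata/
    # prefix with JSON table metadata, manifest lists, and manifest files
    resp = s3.list_objects_v2(Bucket="warehouse", Prefix="pokedex/pokemon/")
    for obj in resp.get("Contents", []):
        print(obj["Key"])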

SQL on Iceberg

Finish by running analytical SQL queries directly on Iceberg tables using PySpark. You’ll group, join, and filter data like in any warehouse, but now on top of decoupled object storage, with Iceberg handling schema and snapshot management behind the scenes.
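
For example, a closing sketch over the hypothetical table built up in the earlier snippets:

    # Warehouse-style analytics, running directly on object storage
    spark.sql("""
        SELECT type1, COUNT(*) AS pokemon_count
        FROM lab.pokedex.pokemon
        GROUP BY type1
        ORDER BY pokemon_count DESC
    """).show()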

By the way: for this course, it’s no problem at all if you aren’t that familiar with Spark yet.

If you do want to dive into Spark fundamentals, though, check out our Apache Spark Fundamentals course in the Academy.


About the Instructor

David Reger is a Cloud Data Engineer at MSG Systems, where he builds scalable Lakehouse platforms using Azure, Databricks, and open-source technologies like Apache Spark and Iceberg. With a background in IoT and years of hands-on experience in data integration and architecture, David brings a strong mix of theory and real-world practice to this course. He’s passionate about helping engineers master modern data tools and sharing battle-tested insights from the field.

Connect with David on LinkedIn.