Leo Delmouly

This is part 3 in a series on modern real-time data architectures.

Iceberg Won - Now Kafka Needs to Keep Up.

How Apache Iceberg quietly won the table format war - and why Kafka needs to catch up

A few years ago, Apache Iceberg was just another open table format. A bet on open data, big compute and future-proof analytics.

Today, it's the standard.

Iceberg is everywhere.

Born at Netflix, endorsed by Snowflake, embraced by Databricks, and deeply integrated into platforms like Confluent, AWS, Dremio...

We've seen this movie before: a slow, quiet shift... And then one day, it's the default.

Hudi was first out of the gate. Delta had Databricks' muscle behind it. But Iceberg's design - atomic commits, schema evolution, flexible partition specs, and first-class support across engines - made it inevitable.

Iceberg won by solving real pain points (sketched in code below):

  • Partitioning logic in Hive buried in directory layouts and query filters? Iceberg made partition specs explicit, flexible, and transparent to queries.
  • Backfills breaking schemas? Iceberg supports schema evolution + snapshot rollback.
  • Stale metadata? Iceberg added snapshot isolation and atomic metadata commits.
  • Multiple engines, same table? Iceberg said yes to Trino, Spark, Flink, Snowflake, Dremio, and more.
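
To make those fixes concrete, here's a minimal sketch in PySpark. Everything here is illustrative, not prescriptive: the runtime package version, the "demo" catalog, and the table and column names are all assumptions.

```python
# A minimal sketch of Iceberg's pain-point fixes, in PySpark.
# Assumptions: the iceberg-spark-runtime version, the "demo" catalog,
# and all table/column names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-pain-points")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Explicit, flexible partitioning: declare a transform, not a directory convention.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT, user_id BIGINT, ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Partition spec evolution: the new layout applies only to newly written data.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD bucket(16, user_id)")

# Rollback after a bad backfill (snapshot id is illustrative):
# spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 123456789)")
```

Every one of those statements is a metadata operation: no data files get rewritten, and each commit is atomic.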

So What's the Big Deal?

Well, Iceberg isn't just a new table format.
It's the foundation for the next generation of analytics:

  • Cloud-native
  • Streaming-aware
  • Engine-agnostic
  • Open by design

It brings the reliability of a database table to the flexibility of a data lake. And it does so without vendor lock-in.

That's a big deal if you're a data platform team trying to unify messy pipelines, decouple compute and storage, and move faster without rewriting everything.

But there's a catch...

Kafka Wasn't Invited

As Iceberg adoption soars, most of the attention has gone to files and batch systems - Parquet, ORC, object stores, lakehouses.

But what about Kafka?

The irony is painful: Kafka is the system where the data is born. Real-time, high fidelity, full of operational truth.

Yet Iceberg tables? They sit downstream.

Way downstream. After CDC pipelines, Flink jobs, ETL chains and cloud storage syncs.

The result?

Hours of delay.

Complex stateful logic.

And entire classes of real-time data that never make it to Iceberg at all.

Why Kafka-to-Iceberg Matters

If Iceberg is going to be the standard table format

And if Kafka continues to be the standard "real-time bus"

Then getting data from Kafka to Iceberg should be table stakes.

And yet it's not.

  • It's fragile.
  • It's expensive.
  • It's slow.
  • And it's disconnected from the very streaming model Iceberg claims to support.

Which is why the industry is moving fast to solve it.

You can see it in the momentum and integrations:

  • Confluent now supports Iceberg as a sink for streaming data
  • Databricks embraced Iceberg alongside Delta to meet open standards
  • Snowflake supports Iceberg tables as part of its external table strategy
  • And Tabular (the company founded by Iceberg’s creators, since acquired by Databricks) built the core infrastructure to make Iceberg the lakehouse standard

Everyone's racing to make real-time sync into Iceberg easier, cheaper and more reliable.

But that raises the real question:

Why is it so hard to get Kafka data into Iceberg in the first place?

If the last 5 years were about making lakes queryable, the next 5 will be about making streams queryable. And Iceberg is uniquely positioned to bridge that... if Kafka can catch up.

On paper, it sounds simple: stream data into tables. But the challenges are real:

  • Kafka is an append-only log; Iceberg expects files and atomic commits.
  • Kafka events may evolve quickly; Iceberg needs schema tracking and compatibility.
  • Streams don’t partition themselves; someone has to design the layout.
  • There’s no built-in notion of snapshot completeness.
  • And achieving exactly-once delivery at scale? Not easy.

In short: Kafka is fast. Iceberg is consistent. Bridging the two takes real work.
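
To see what that work looks like, here's a hedged sketch of the simplest possible Kafka-to-Iceberg writer. It assumes confluent-kafka and pyiceberg are installed, a PyIceberg catalog named "default" is configured, and the target table already exists; the topic name, schema, and batch size are all illustrative.

```python
# A minimal Kafka-to-Iceberg writer: buffer events, then commit each batch
# as one atomic Iceberg snapshot. Topic, table, schema, and batch size are
# illustrative; a real sink also needs schema evolution, partition layout,
# and failure handling.
import json
import pyarrow as pa
from confluent_kafka import Consumer
from pyiceberg.catalog import load_catalog

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "iceberg-sink",
    "enable.auto.commit": False,  # offsets move only after the table commits
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

# Assumes a PyIceberg catalog named "default" (e.g. in ~/.pyiceberg.yaml)
# and a table whose schema matches the incoming events.
table = load_catalog("default").load_table("db.events")

BATCH_SIZE = 10_000
buffer = []
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    buffer.append(json.loads(msg.value()))
    if len(buffer) >= BATCH_SIZE:
        # One atomic Iceberg commit per batch: readers see all rows or none.
        table.append(pa.Table.from_pylist(buffer))
        # Commit Kafka offsets only after the Iceberg commit succeeds.
        # Crash between the two commits and the batch is redelivered and
        # rewritten: that's at-least-once, not exactly-once.
        consumer.commit(asynchronous=False)
        buffer.clear()
```

Even this toy version exposes the gap: latency is bounded by batch size, exactly-once requires storing offsets transactionally with the table commit, and any schema change breaks the append.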

We’re working on a way to make Kafka-to-Iceberg as easy and reliable as batch.

And it raises a bigger question:

What should a real-time-first data architecture actually look like?

👀 Stay tuned for Part 4: Why Kafka-to-Iceberg is still so painful

🔍 Want a preview of our approach? streambased.io