
Why Iceberg doesn’t make sense for Kafka (and why what it means does)

Iceberg is a great fix for issues experienced with legacy data formats, but is it the right solution for new data sources, formats and paradigms? This blog post takes a look at Iceberg's suitability for streaming data and why it may not be the silver bullet here that it appears to be for traditional warehousing.

June 10, 2024
Iceberg is everywhere! The battle of the table formats has been popcorn fodder for over a year now and seems to be finally drawing to a close. Iceberg, as the winner, claims the spoils. The sudden influx of interest has meant new features are being added by the day, and every data platform is at pains to stress its support for the new hotness.

Kafka has not been left out. Kafka is the de facto standard for real-time data storage, and Confluent announced TableFlow (an integration between Kafka data and Iceberg) at this year’s Kafka Summit. This promises to capitalise on a global trend towards the unification of operational and analytical data, and using the latest and greatest technology stack in Iceberg to do this just makes sense…

… except it doesn’t.

To understand why, we must take a look at what Iceberg (and table formats in general) are and the role they perform. Table formats are enrichers; they take an underlying data format (usually a bunch of files like Parquet) and provide a logical abstraction over them that can increase performance when querying the data, enable extra capabilities such as transaction processing, and ease some of the uglier problems like schema evolution. When you have an existing investment in a particular format (as most do), table formats provide a shortcut to efficiency without having to go the more traditional migration route.
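To make the "logical abstraction" idea concrete, here is a minimal sketch of one thing a table format can do: keep per-file statistics so a query engine can skip files that cannot match a filter. This is a toy illustration, not Iceberg's actual metadata layout; the `DataFile` class, the `prune` function, the file names and the timestamp ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """One entry in a toy table manifest: a file path plus column statistics."""
    path: str
    min_ts: int  # smallest event timestamp stored in the file
    max_ts: int  # largest event timestamp stored in the file


def prune(manifest: list[DataFile], lo: int, hi: int) -> list[str]:
    """Return only the files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.max_ts >= lo and f.min_ts <= hi]


# Hypothetical manifest describing three Parquet files (assumed for illustration).
MANIFEST = [
    DataFile("events-0.parquet", 0, 99),
    DataFile("events-1.parquet", 100, 199),
    DataFile("events-2.parquet", 200, 299),
]

if __name__ == "__main__":
    # A filter on timestamps 150..180 needs to open only one of the three files.
    print(prune(MANIFEST, 150, 180))
```

Note that the sketch only decides which files to read. Something else still has to open those files and run the query, which is the role of the processing engine.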

It’s important to note that table formats alone do not provide access to data, any more than the underlying formats they accelerate do. A directory full of Parquet files requires a reading and processing engine such as Apache Hive before it becomes really useful, and as an enrichment/acceleration technology, table formats share the same limitation. No analyst is connecting their Tableau directly to Iceberg.

Kafka quite obviously is not a bunch of legacy files (actually it is, but that’s not how we interact with it ;-) ) and there is no earlier commitment to using Kafka data for analytical purposes. The goal here is to provide our analyst with a view on Kafka data in their tools, so why introduce the extra hop to Iceberg? Kafka -> Iceberg -> Query engine -> Analytical tool is, by definition, more complex than going direct with Kafka -> Query engine -> Analytical tool, and because the analytical interface to Kafka is brand new, the extra hop is unnecessary.

The lack of prior art means we are free to define the analytical interface to Kafka in any way we see fit. Streaming data is not batch data and comes with a whole different set of considerations, so why try to shoehorn enrichment and acceleration principles explicitly designed for batch data into Kafka? This point is driven home by a bunch of proposals we’ve seen where Kafka data is transformed to Parquet to provide better integration with Iceberg (Kafka -> Parquet -> Iceberg -> Query engine -> Analytical tool, this is getting out of hand…).

So why does Iceberg on Kafka excite me? Well, it means that we are finally thinking about fully-fledged analytical solutions for Kafka. This means thinking about Kafka as a data repository rather than just a buffer between services, and that’s a genuinely exciting concept that I can really get behind. For a lot of businesses, Kafka is the entry point, exit point, and internal conduit for data into the organisation and, as such, the ultimate store for the kind of raw data that provides life-changing (and often unexpected) insights. We should embrace the convergence of operational and analytical workloads and the incredible opportunity it represents. Only, please, can we use one of the many decades-old connection interfaces like JDBC/ODBC, SQLAlchemy etc., that bring capability to users, instead of introducing yet another ~~complicated~~ abstraction in an already too-long chain just because it’s new and popular?
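As a rough illustration of what a standard connection interface gives the user, the sketch below runs an analytical query through Python's DB-API (PEP 249), the Python counterpart to JDBC/ODBC. An in-memory SQLite database stands in for a Kafka-backed SQL endpoint, which is an assumption made purely so the example runs anywhere; the `orders` table, its rows and the `top_customers` helper are hypothetical.

```python
import sqlite3


def top_customers(conn, limit):
    """Run a plain SQL aggregate through any DB-API style connection."""
    cur = conn.cursor()
    # "?" is SQLite's placeholder style; other DB-API drivers may use a different one.
    cur.execute(
        "SELECT customer, SUM(amount) AS total FROM orders "
        "GROUP BY customer ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def make_stand_in():
    """Build an in-memory SQLite database standing in for a streaming SQL endpoint."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 30), ("bob", 10), ("alice", 5)],
    )
    return conn


if __name__ == "__main__":
    print(top_customers(make_stand_in(), 2))
```

The analyst-facing code in `top_customers` knows nothing about where the rows live. With a standard interface, pointing it at a different backend would, in principle, mean changing the connection and little else.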


Copyright 2024 Streambased Platform Limited. Company Number 14709247.