Streambased And TableFlow - Two Very Different Approaches To The Same Problem

Introduction

Like everyone else, we’re thrilled at the announcement of Confluent’s new TableFlow feature (great name!). The convergence of operational and analytical data systems is the core value of our business and having it validated by such a large and reputable company as Confluent is brilliant!

One question I was asked constantly at the recent Kafka Summit was how Streambased differs from this Confluent offering. There aren’t many technical details available on how Confluent intends to achieve its goals but there are enough to highlight some key similarities and some key differences in the Confluent and Streambased approaches.

Please bear in mind this is based on the (very limited) information on TableFlow available today and subject to our understanding of it. This is not a fully fledged comparison of the two approaches (that will follow later) but a few initial thoughts based on what we’ve seen so far.

Similarities between TableFlow and Streambased

1. Both TableFlow and Streambased recognise Kafka as a valuable analytical data source – It wasn’t long ago that Kafka was viewed as a mere transport mechanism for communicating data between services. By exposing message data for analytics, TableFlow and Streambased greatly expand this role. Feature enhancements (e.g. practically unlimited cheap storage (KIP-405) and expanded metadata capabilities (KIP-833)) and the rise of “streaming first” architectures mean that Kafka is often the landing point for all new data into an organisation: a true source of high volume, raw, historical data or, to put it another way, a Datalake!

2. TableFlow and Streambased are both focused on Analysts rather than Data Engineers – A good friend of mine once said “The Streaming Datalake is Kafka for people that don’t have a GitHub account”. I love this as it neatly explains that analytical users are likely to be more focused on the data itself than the manner in which it is obtained. Creating a frictionless experience for them will be key. TableFlow does this via Iceberg and Streambased via JDBC but the goal is the same: to ensure that Kafka datasets are available with the tools and techniques that analysts are used to and with minimal engineering requirements.

That’s not to say we don’t like data engineers at Streambased, we love them! In fact we love them so much that we want to save them the grunt work of surfacing the data by providing direct connections instead. This leaves engineering free to focus on governance, performance and all the other needle-moving tasks that get lost in the forest of ELT/ETL pipeline requests.

3. Both rely on the Kafka ecosystem for metadata – The Kafka ecosystem has created a solid set of tools and principles for describing the data held in topics and ensuring that this data can evolve in a way that is handled gracefully by the applications that work with it. We can take these same tools and principles (mostly encapsulated in Confluent’s Schema Registry and equivalents) and apply them to analytical workloads. This smooths the transition from stream to table representation. It’s not a fully fledged data catalogue solution but it’s a great start!

And now some differences

1. TableFlow speaks files, Streambased speaks messages – It’s hard to get more fundamentally different than this! To achieve its analytical goals, TableFlow will transform Kafka data into a file format more suited to analytics. Streambased will never move, copy or transform the underlying Kafka data, instead retaining it in the raw form in which it was created. This guarantees total consistency between operational and analytical workloads and ensures:

A. All the Kafka goodness we’ve become used to (transactions, compaction etc.) is preserved with no further work.

B. Like any other client, Streambased can transparently mix hot and cold set data ensuring minimum latency in data availability for analytics.

C. Streambased seamlessly handles data that is manipulated at rest. For example, if messages are encrypted at rest, the Streambased client can interact with them in the same way any other client would. There is no need for any special handling of keys or processes to get at the data for analytics.

2. Streambased degrades gracefully – If your data isn’t fully loaded for analytical purposes yet, Streambased can fall back to a simple read from Kafka. There is an obvious performance hit to doing this but the query will complete and will be correct. If you are relying on Iceberg/Parquet to satisfy your queries, then a query can only complete after all the data relevant to it has been converted. Let’s say you have a large topic and decide to use it for analytical purposes. If you flip the switch to enable this (both Streambased and TableFlow have switches 😉 ) and then immediately run a query, it’s likely that the extra information required to use the topic for analytical purposes will not be ready yet. You can end up in a situation where the query you are executing touches data that has not yet been transformed.

In Streambased the query will still execute correctly (albeit with degraded performance): the read simply becomes regular consumer access to the data. It’s not the best experience but it will finish and give the correct answer. It remains to be seen how this will be handled in TableFlow but the signs we have seen so far indicate that it will be a much more binary experience. Either the data has been converted and is available, or it is not and the query cannot complete successfully.
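
To make the idea of graceful degradation concrete, here is a minimal sketch in plain Python. It is a toy model and not Streambased’s implementation: `ToyTopic`, `build_index` and `sum_for_key` are hypothetical names invented for this illustration, the “topic” is an in-memory list, and we assume offsets are simply list positions. The indexed part of the topic is answered from the index and anything not yet indexed is read sequentially, so the answer is the same either way and only the amount of data read changes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    offset: int
    key: str
    value: int


@dataclass
class ToyTopic:
    """In-memory stand-in for one topic partition (hypothetical, not a Kafka client).

    Assumption: records[i].offset == i, so offsets double as list positions.
    """
    records: List[Record]
    indexed_up_to: int = 0  # offsets below this watermark are covered by the index
    index: Dict[str, List[int]] = field(default_factory=dict)

    def build_index(self, up_to: int) -> None:
        """Extend the key -> offsets index to cover offsets below `up_to`."""
        for rec in self.records[self.indexed_up_to:up_to]:
            self.index.setdefault(rec.key, []).append(rec.offset)
        self.indexed_up_to = max(self.indexed_up_to, up_to)


def sum_for_key(topic: ToyTopic, key: str) -> Tuple[int, int]:
    """Sum the values for `key`; return (total, number_of_records_read)."""
    total = 0
    records_read = 0
    # Fast path: the indexed prefix only touches records that match.
    for offset in topic.index.get(key, []):
        total += topic.records[offset].value
        records_read += 1
    # Fallback: the unindexed tail is read like an ordinary consumer would.
    for rec in topic.records[topic.indexed_up_to:]:
        records_read += 1
        if rec.key == key:
            total += rec.value
    return total, records_read


# Six records with alternating keys "a" and "b" and values 1..6.
topic = ToyTopic([Record(i, "ab"[i % 2], i + 1) for i in range(6)])
print(sum_for_key(topic, "a"))  # nothing indexed yet: full sequential read
topic.build_index(4)
print(sum_for_key(topic, "a"))  # partly indexed: same total, fewer records read
```

In this sketch the total for key "a" is identical before and after indexing; the index only reduces how many records have to be read to get there.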

3. Streambased brings analytics principles to the stream rather than streams to analytics – Using features such as indexing, pre-aggregation, topic statistics and many more, data access to Kafka can be massively accelerated for analytical workloads. This is different to the TableFlow approach, in which a much more traditional ETL/ELT operation is performed to bring the data from the stream to the analytics. There are pros and cons to both approaches, but not being tied to a particular format and feature set means the Streambased approach is free to explore new, exciting and experimental advances in analytics principles without having to wait for them to be supported by Iceberg or other dependencies.
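
As an illustration of what “statistics” can buy an analytical query, the sketch below applies a classic technique, per-block min/max statistics, to a list of values standing in for a topic. This is an assumption-laden toy and not a description of how Streambased stores or uses its statistics: `build_block_stats` and `count_greater_than` are hypothetical helpers and the block size is arbitrary. Blocks that cannot match are skipped, blocks that fully match are counted from the statistics alone, and only the remaining blocks are actually read.

```python
from typing import List, Tuple

# (start_offset, end_offset_exclusive, min_value, max_value)
BlockStats = Tuple[int, int, int, int]


def build_block_stats(values: List[int], block_size: int) -> List[BlockStats]:
    """Record the min and max value for each block of consecutive offsets."""
    stats = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        stats.append((start, start + len(block), min(block), max(block)))
    return stats


def count_greater_than(
    values: List[int], stats: List[BlockStats], threshold: int
) -> Tuple[int, int]:
    """Count values > threshold; return (count, number_of_values_read)."""
    count = 0
    values_read = 0
    for start, end, lo, hi in stats:
        if hi <= threshold:
            continue  # no value in this block can match: skip it entirely
        if lo > threshold:
            count += end - start  # every value matches: no read needed
            continue
        for v in values[start:end]:  # mixed block: has to be read
            values_read += 1
            if v > threshold:
                count += 1
    return count, values_read


values = [1, 2, 3, 4, 10, 11, 12, 13, 5, 20, 6, 21]
stats = build_block_stats(values, block_size=4)
print(count_greater_than(values, stats, threshold=9))
```

Here only the last block has to be read; the first is skipped and the second is answered from its statistics, which is the kind of saving the features above are aiming for.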

It’s too early to determine the relative benefits and fit of one solution over another and this post doesn’t claim to do that in any way. I’m sure more posts will follow in this regard but it is exciting (and rare) to see such a variety of solutions to this problem. We’re aware of a number of vendors addressing this market and I’m glad to say they have all tackled this from a different angle. I look forward to continuing the same positive and energetic conversation we have had up until now. Go Streaming Datalake!
