I.S.K. Architecture

Tom Scott
Introduction to Apache Iceberg
Apache Iceberg is an open table format designed for large-scale, high-performance analytics on big data. It was built to address the limitations of older table formats like Hive and to support modern data lake architectures.
In this article we'll take a deep dive into I.S.K. (Iceberg Service for Kafka), an Iceberg projection over real-time, event-based data stored in operational systems.
Because the projection is materialized at runtime, I.S.K. decouples the physical structure of the data from its logical, table-based representation.
With I.S.K., you can:
- Access up-to-date (zero-latency) data from analytical applications.
- Impose structure on read (schema-on-read) in your analytical applications.
- Logically repartition with no physical data movement.
- Surface multiple “views” on the data with different characteristics defined at runtime.
- Take advantage of analytical practices such as indexing and column statistics to achieve substantial performance improvements.
High-Level Architecture
Iceberg Architecture Duality
| Iceberg Unit | I.S.K. Unit |
|---|---|
| Iceberg Catalog | Custom REST Iceberg Catalog implementation that generates metadata files on the fly during the LoadTable call, based on the underlying streaming-system topic. |
| Metadata File | Metadata file content is returned by the catalog during that LoadTable call; metadata files are ephemeral and not stored persistently. |
| Manifest List | Ephemeral manifest list with references to manifest files. |
| Manifest File | Ephemeral manifest file, computed at request time with the data splits relevant to the query. Enables per-query optimization, efficient pruning, and schema-on-read. |
| Data File | Ephemeral data files, served via an S3-compatible interface but actually read directly from the source streaming system with per-request pruning/splits. |
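To a client, the I.S.K. catalog behaves like any other Iceberg REST catalog, so a stock Iceberg client can load a topic as a table. The sketch below uses PyIceberg; the endpoint, credentials, and namespace/table names are hypothetical placeholders, and the LoadTable call is the point at which I.S.K. generates the ephemeral metadata described above.

```python
# Minimal sketch, assuming a hypothetical I.S.K. REST catalog endpoint.
from pyiceberg.catalog import load_catalog

# I.S.K. exposes a standard Iceberg REST catalog, so a stock client works.
catalog = load_catalog(
    "isk",
    **{
        "type": "rest",
        "uri": "https://isk.example.com/catalog",  # hypothetical I.S.K. endpoint
        "credential": "client-id:client-secret",   # placeholder credentials
    },
)

# Loading the table triggers I.S.K. to build table metadata on the fly
# from the underlying Kafka/Pulsar topic.
table = catalog.load_table("my_kafka_cluster.DataTopic")
print(table.schema())
```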
Stream Table Duality
I.S.K. maps common streaming concepts to Iceberg-specific terms, enabling seamless representation of event streaming data (e.g., from Apache Pulsar or Apache Kafka) in Iceberg-compatible systems and tools.
| Kafka Unit | I.S.K. Unit | Description |
|---|---|---|
| Cluster | Namespace | Each Kafka or Pulsar cluster is represented as an Iceberg namespace to indicate the separation of resources. |
| Topic | Table | As logical collections of data points, topics and tables are equivalent. |
| Message | Row | An individual message within a topic can be viewed as a row within a table when combined with a schema. |
| Message field | Column | A field within an event message is similar to a column within a table row in Iceberg. |
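Under this mapping, ordinary Iceberg catalog listing calls are enough to discover clusters and topics. A small sketch with PyIceberg, again assuming a hypothetical I.S.K. endpoint and made-up cluster/topic names:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical I.S.K. catalog endpoint.
catalog = load_catalog("isk", **{"type": "rest", "uri": "https://isk.example.com/catalog"})

# Each Kafka/Pulsar cluster appears as an Iceberg namespace...
for namespace in catalog.list_namespaces():
    print("cluster/namespace:", namespace)
    # ...and each topic in that cluster appears as a table.
    for table_id in catalog.list_tables(namespace):
        print("  topic/table:", table_id)
```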
Indexing
Consumers of event streaming systems and analytical tools that consume Apache Iceberg tables have vastly different performance requirements:
- Streaming systems iterate through small volumes of messages quickly and frequently.
- Analytical systems perform less frequent but much higher-volume reads.
I.S.K. bridges this gap using indexing techniques that selectively read only relevant data for a query.
Example: for the predicate WHERE customer_name = 'X', I.S.K. indexing drastically reduces the number of messages read by pruning to only those that match the predicate.
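In practice the predicate travels through the standard Iceberg scan API, so I.S.K. can prune splits server-side using its indexes. A sketch with PyIceberg, assuming the `table` handle loaded earlier; the column names are illustrative only:

```python
from pyiceberg.expressions import EqualTo

# Assuming `table` was loaded from the I.S.K. catalog as shown earlier.
scan = table.scan(
    row_filter=EqualTo("customer_name", "X"),       # predicate pushed down to I.S.K.
    selected_fields=("customer_name", "order_id"),  # illustrative column pruning
)

# Only the messages that match the predicate are materialized into data files.
results = scan.to_arrow()
print(results.num_rows)
```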
Note: This is not an exhaustive list of I.S.K.'s performance-enhancing techniques, which continue to evolve.
Logical Architecture Diagram
Request Processing Flow
- Indexing
  - Background indexing of data in the streaming system keeps indexes up to date.
- Table metadata
  - Analytical application issues a query (e.g., SELECT … FROM DataTopic WHERE customer_name = 'X').
  - Analytical application queries I.S.K. for table metadata.
  - I.S.K. generates table metadata at runtime by querying the stream-based indexer service.
  - I.S.K. returns the metadata, including references to manifest files (e.g., S3 paths).
- Manifests
  - Analytical application reads manifest files directly from the storage service (SSK).
  - SSK queries I.S.K. for manifest files; I.S.K. generates manifests with partitions/splits for efficient pruning.
  - SSK returns the manifests to the analytical application.
  - Analytical application uses the manifests to locate data files.
- Data files
  - SSK fetches the relevant data from the source streaming system, applying skips/splits for pruning.
  - SSK aggregates events into batched data files in a streaming fashion (files are never held fully in memory).
  - SSK returns the batched data files to the analytical application.
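From the client's perspective, all of the steps above are hidden behind a normal Iceberg read. The end-to-end sketch below mirrors the flow using PyIceberg; the I.S.K. and SSK endpoints and credentials are hypothetical placeholders, while the property names are standard PyIceberg/FileIO settings.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical I.S.K. (catalog) and SSK (storage) endpoints.
catalog = load_catalog(
    "isk",
    **{
        "type": "rest",
        "uri": "https://isk.example.com/catalog",
        # Manifests and data files are fetched from SSK's S3-compatible API.
        "s3.endpoint": "https://ssk.example.com",
        "s3.access-key-id": "placeholder",
        "s3.secret-access-key": "placeholder",
    },
)

# Table metadata: generated at runtime from the stream-based indexer service.
table = catalog.load_table("my_kafka_cluster.DataTopic")

# Manifests and data files: the scan reads ephemeral manifests and batched
# data files served by SSK, which pulls the matching events from the topic.
rows = table.scan(row_filter="customer_name == 'X'").to_arrow()
print(rows.num_rows)
```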
Summary
I.S.K. (Iceberg Service for Kafka) bridges the gap between real-time event streaming and modern analytical workflows by providing a logical, Iceberg-compliant interface over operational data systems. By materializing metadata and data views on the fly, I.S.K. enables zero-latency access to streaming data using familiar analytical tools and paradigms. Through ephemeral metadata, runtime partitioning, and indexing techniques, it delivers powerful performance optimizations without requiring changes to the underlying streaming infrastructure.
Whether you're looking to unify streaming and batch processing, simplify your analytical data architecture, or unlock real-time insights without data duplication, I.S.K. offers a scalable and elegant solution built on the foundations of Apache Iceberg.