A.S.K. architecture

Tom Scott

March 24, 2025

blog

For this post we're going to take a more technical dive into the most mature product in Streambased: A.S.K. (Analytics Service for Kafka)

A.S.K. offers a high-performance, SQL-oriented query engine specifically designed for Kafka, outperforming existing solutions like ksqlDB and Apache Flink, especially under workloads where Kafka is traditionally less efficient. By directly leveraging Kafka's native architecture and integrating indexing to provide predicate pushdown, pre-aggregation and join optimisation, A.S.K. delivers fast, low-latency queries on both historical and real-time data, achieving performance comparable to modern data lakes without the complexity or overhead of traditional data transformation.

Unlike other SQL-on-Kafka platforms, A.S.K. does not require additional tooling or complex access management infrastructure. It is fully compatible with the Kafka ecosystem, utilizing Kafka's security, schema registry, and authorization components. This integration reduces both complexity and cost, offering a lightweight, scalable solution that allows businesses to seamlessly query Kafka data without needing separate data processing systems or extensive engineering resources.

Core Features

Real-time Data Access
A.S.K. enables real-time querying of data in Kafka, allowing analysts to access and analyze data as it streams in, without waiting for batch processes or ETL pipelines.
No ETL Required
Our solution removes the need for complex ETL operations by providing direct access to Kafka data, simplifying your data workflow and accelerating time to insight.
Integration with SQL Tools
A.S.K. integrates effortlessly with SQL-based analytical tools via JDBC/ODBC, enabling analysts to query Kafka data as they would with traditional databases.
Optimized Performance
A.S.K. utilizes indexing and query optimization techniques (such as predicate pushdown and pre-aggregation) to accelerate queries on Kafka, delivering up to 30x faster performance for key queries compared to traditional SQL-on-Kafka systems.

Additional Capabilities

Access up-to-date (zero latency) data from analytical applications.
Impose structure on read in your analytical applications.
Logically repartition with no physical data movement.
Surface multiple “views” on the data with different characteristics defined at runtime.
Take advantage of analytical practices such as indexing and column statistics to achieve massive performance enhancements.

High Level Architecture

ARCH

The architectural components and interactions within Streambased. The numbered arrows show a typical query flow.

Streambased is a parallel query execution engine that runs separate to the underlying Kafka cluster that provides the input data. Streambased is designed to be horizontally scalable to limits defined only by the underlying Kafka cluster. Streambased interacts with the Kafka cluster using only the public Kafka API and, this means the compute engine (Streambased), is completely decoupled from the storage engine (Kafka).

A Streambased deployment is made up of 2 services. The Streambased Server is responsible for accepting queries from client processes and orchestrating their execution across the cluster. It is also responsible for executing individual query fragments on behalf of other Streambased Servers. The other service in a Streambased cluster is the Index Server, this is responsible for collecting index information from the underlying data and serving it to requesting Streambased Servers. Index data is typically 50-100X smaller in size than the data it represents.

When a new query is accepted it is first planned and optimised by a single Streambased Server, this server becomes the coordinator for this query and assumes responsibility for orchestrating its execution among the other Streambased Servers in the cluster (1).

From here the query is parsed and split into a number of fragments (2 and 3) that can be executed concurrently. Each of these fragments is distributed to a node for execution (4) and these nodes consult Streambased index services to optimise their execution (5). When all fragments are complete the results are returned to the query co-ordinator (6) and (when requested) back to the client (7).

Note: there is a separate background process (7) that reads directly from Kafka and computes Streambased indexes for later serving for acceleration.

Stream Table Duality

A.S.K. maps common streaming concepts to Database-specific terms. This mapping allows event streaming data from Apache Pulsar or Apache Kafka to be seamlessly represented.

Kafka Unit	A.S.K. Unit	Description
Cluster	Database Schema	Each Kafka cluster is represented as a database schema to indicate resource separation.
Topic	Table	Topics and tables are logical collections of data points and are equivalent.
Message	Row	A message in a topic can be seen as a row in a table with an applied schema.
Message field	Column	A field within a message maps to a column within a table.

Indexing

The performance needs of event streaming consumers and analytical tools differ significantly. To bridge this gap, A.S.K. uses indexing techniques to read only the data needed by the query.

Techniques Employed

Predicate Pushdown

Predicates from incoming SQL are parsed and used to reduce read volume using indexes.
Example:

SELECT * 
FROM transactions
WHERE accountId = 1001001

Streambased would identify the predicate associated with the accountId column, allowing the read tasks spawned from the query to consult any indexes that exist for that column and vastly reduce the amount of data required to be read for the query.

Pre-aggregation

Pre-aggregation is the process of pre-computing aggregates (e.g., max, min, sum, count, etc.) for configured sets of grouping columns and aggregate columns. The Streambased engine parses incoming queries for elements that may benefit from such computations and applies the pre-computed values.

For example, the query:

SELECT accountId, sum(depositAmount)
FROM transactions
GROUP BY accountId

Would identify accountId as a grouping column and depositAmount as an aggregate column. The execution engine would match an eligible stored element with pre-computed values from the Streambased index rather than computing the values by reading from Kafka, greatly improving query performance.

Join optimisation

A.S.K. is a distributed system that is capable of joining data from multiple Kafka topics (and clusters) in order to achieve analytical results. These joins can involve data movement between nodes. In order to optimise this movement Streambased assesses the volume and shape of fragments to be joined and will move data between nodes only when necessary and according to a number of policies built into the streambased engine. The result is acceleration beyond the read phase of complex queries.

Conclusion

A.S.K. represents a significant evolution in the way organizations can interact with real-time streaming data. By combining the flexibility and familiarity of SQL with the performance and scalability of native Kafka infrastructure, A.S.K. bridges the gap between streaming and analytics—without introducing the complexity, latency, or overhead of traditional ETL pipelines and batch systems.

If you're looking to bring true analytical power to your Kafka infrastructure without the trade-offs, A.S.K. is ready to meet you there.