Landscape

How the Streaming Lakehouse compares

Kafka + Iceberg, Apache Fluss, Apache Hudi, RisingWave, Databricks — and the one capability none of them fully deliver.

The fundamental split: table-first vs log-first

Almost every "streaming + analytics" system on the market today started as a table and bolted streaming on afterwards: Hudi, Iceberg, Delta, and Fluss all think table-first. StreamBricks takes the opposite route: it begins as a log (a Pulsar topic on BookKeeper) and evolves that log into a lakehouse.

Pulsar Topic → BookKeeper Ledger → Columnar Pages → Temporal Table → SQL

That ordering matters. Because we begin from Pulsar's log, we inherit topics, subscriptions (Pulsar's counterpart to consumer groups), replay, retention, multi-tenancy, tiered storage, and geo-replication for free, and then add the lakehouse on the same physical bytes. The result is closest to an event-sourced lakehouse: every stream is also a historical table, and every table is also a stream.
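To make the stream/table duality concrete, here is a minimal Python sketch. It is not StreamBricks or Pulsar code: `EventLog`, `replay`, and `table_as_of` are hypothetical names, the log lives in memory, and events are assumed to be appended in event-time order. It only illustrates how one append-only sequence can serve both a replayable stream view and a point-in-time table view.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass(frozen=True)
class Event:
    offset: int
    event_time: int
    key: str
    value: Any


@dataclass
class EventLog:
    """Hypothetical in-memory append-only log (stands in for a topic's ledger)."""
    _events: List[Event] = field(default_factory=list)

    def append(self, event_time: int, key: str, value: Any) -> int:
        # Assumption: callers append in non-decreasing event_time order.
        offset = len(self._events)
        self._events.append(Event(offset, event_time, key, value))
        return offset

    def replay(self, from_offset: int = 0) -> Iterator[Event]:
        """Stream view: read events in order, starting at an offset."""
        return iter(self._events[from_offset:])

    def table_as_of(self, event_time: int) -> Dict[str, Any]:
        """Table view: latest value per key at or before event_time."""
        state: Dict[str, Any] = {}
        for e in self._events:
            if e.event_time <= event_time:
                state[e.key] = e.value
        return state


log = EventLog()
log.append(1, "c1", "bronze")
log.append(5, "c1", "gold")
log.append(7, "c2", "silver")

snapshot_at_4 = log.table_as_of(4)              # table view of the same events
replayed = [e.offset for e in log.replay(1)]    # stream view of the same events
```

Both views read the same list of events; nothing is copied into a second store, which is the property the log-first ordering is meant to preserve.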

The capability matrix

Capability	StreamBricks	Kafka + Iceberg + Spark	Apache Fluss	Apache Hudi
Pub-sub messaging API	● Native Pulsar	● Kafka	—	—
Streaming storage	●	●	●	◐ batch-oriented
Single physical copy	●	— multiple copies	●	◐ multiple views
No ETL between stream & lake	●	—	●	—
Columnar storage	● Vortex	● Parquet	◐ log / table	● Parquet
Historical versions	●	●	●	●
Point-in-time query (AS OF)	●	● time travel	●	●
Native point-in-time joins	◐ on roadmap	— external engine	— limited	— external engine
Native SQL engine	◐ DataFusion, roadmap	— Spark / Trino	— uses Flink	— Spark / Trino
Message replay & consumer groups	●	●	—	—
Data lake / analytical scans	●	●	●	●

● available; ◐ on roadmap / partial; — not supported

Kafka + Iceberg + Spark

The default modern stack, and the one StreamBricks most directly replaces. It does everything, but only by maintaining a separate copy at every layer: Kafka for the stream, object storage for the raw data, Iceberg for the table, Spark/Trino for compute. Every copy is another pipeline, another bill, and another source of staleness. Those extra copies are precisely what StreamBricks' single-copy architecture removes.

Apache Hudi

Hudi is fundamentally a data-lake storage system, not a streaming system. It solves upserts, CDC, historical snapshots, and incremental queries on Parquet in S3/HDFS, but there is no producer/consumer pub-sub layer. You don't subscribe to a Hudi table. StreamBricks aims to serve pub-sub, lakehouse, and temporal queries from the same storage; Hudi targets a table format on top of an object store.

Apache Fluss

Fluss is the most conceptually similar project — it explicitly merges Kafka-style streaming with a lakehouse into one system, with streaming tables you can query while data is still arriving. We share the same north star: one copy, one storage, streaming + analytics.

The difference is the starting point. Fluss still thinks in terms of table storage and leans on Flink for SQL. StreamBricks thinks in terms of message storage with full Pulsar semantics — so it inherits the entire Pulsar ecosystem (millions of topics, multi-tenancy, tiered storage, functions, connectors, transactions, geo-replication) rather than rebuilding it. Fluss does not have that pub-sub ecosystem.

RisingWave

RisingWave is the closest conceptual match for the query side — streaming SQL with materialized views, joins, and historical state. But it is a streaming database, not a pub-sub platform: you don't get topics, consumer groups, and replay as first-class primitives. StreamBricks keeps the messaging substrate and adds the analytics on top.

Databricks

Interestingly, the closest commercial competitor isn't Hudi or Fluss — it's Databricks, whose Kafka → Delta → Photon → SQL vision also targets "streaming + tables + analytics." The distinction is the same one that runs through this whole comparison: Databricks still maintains separate layers and copies. StreamBricks collapses them into one.

Pinot / Druid

Pinot and Druid share the philosophy of columnar storage with min/max ranges and bloom filters for low-latency analytics, ingesting from Kafka/Pulsar. But they are analytics sinks — they don't offer full pub-sub, replay, or consumer groups. StreamBricks is both the stream and the analytics store.
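As a rough illustration of why min/max ranges help low-latency scans, the sketch below skips any page whose value range cannot overlap the query range. It is a toy model, not Pinot, Druid, or StreamBricks code: `Page` and `scan_range` are hypothetical names, pages are assumed non-empty, and bloom filters are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    """Hypothetical columnar page holding one column's values plus min/max stats."""
    values: List[int]

    def __post_init__(self) -> None:
        # Statistics are computed once at write time (assumes a non-empty page).
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_range(pages: List[Page], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] and the number of pages actually read."""
    hits: List[int] = []
    pages_read = 0
    for page in pages:
        # Prune: the page's range does not overlap [lo, hi], so skip it unread.
        if page.max_value < lo or page.min_value > hi:
            continue
        pages_read += 1
        hits.extend(v for v in page.values if lo <= v <= hi)
    return hits, pages_read


pages = [Page([1, 2, 3]), Page([10, 11, 12]), Page([20, 25])]
hits, pages_read = scan_range(pages, 10, 20)
```

In this example the first page is skipped on its statistics alone, so only two of the three pages are read.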

The one thing none of them deliver

Native temporal joins over streaming history, backed by a single physical copy:

SELECT * FROM Orders o JOIN Customer c ON o.customer_id = c.id AS OF o.event_time

Because BookKeeper storage is append-only and retains full history, StreamBricks is designed to answer "what was true at the moment of this event" without maintaining Iceberg snapshots or separate historical tables. That is the single most valuable capability for AI, because it yields leakage-free, point-in-time-correct features and training sets, and it is exactly where the log-first design pays off. Neither Hudi nor Fluss fully delivers it today.
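The semantics of such a join can be sketched in a few lines of Python. This is an illustration of a generic as-of join, not the StreamBricks engine: `as_of_join`, the `valid_from` field, and the sample records are assumptions made for the example. Each order is matched to the latest customer version whose `valid_from` is at or before the order's `event_time`, so no future information leaks into the result.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def as_of_join(orders: List[Row], customer_versions: List[Row]) -> List[Tuple[Row, Optional[Row]]]:
    """Pair each order with the customer version valid at the order's event_time."""
    # Group customer versions by id, sorted by the time each version took effect.
    history: Dict[Any, List[Row]] = defaultdict(list)
    for version in sorted(customer_versions, key=lambda v: v["valid_from"]):
        history[version["id"]].append(version)
    start_times = {cid: [v["valid_from"] for v in vs] for cid, vs in history.items()}

    joined = []
    for order in orders:
        cid = order["customer_id"]
        times = start_times.get(cid, [])
        # Index of the first version that starts strictly after the order's event_time.
        i = bisect_right(times, order["event_time"])
        match = history[cid][i - 1] if i > 0 else None
        joined.append((order, match))
    return joined


customers = [
    {"id": 1, "valid_from": 0, "tier": "bronze"},
    {"id": 1, "valid_from": 10, "tier": "gold"},
]
orders = [
    {"order_id": "A", "customer_id": 1, "event_time": 5},
    {"order_id": "B", "customer_id": 1, "event_time": 10},
    {"order_id": "C", "customer_id": 2, "event_time": 3},
]
result = as_of_join(orders, customers)
```

Order A sees the customer as it was at time 5 rather than its later state, which is the point-in-time correctness that training sets need. Order C has no customer history and joins to `None`.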

An honest take

This is a hard, multi-year systems effort, and several pieces — native joins, the full SQL engine, distributed execution — are on the roadmap rather than shipping today (the matrix marks them honestly as ◐). For a generic data warehouse, Iceberg + Spark will likely remain simpler. But for an organization already heavily invested in event streams, the ability to run analytics, historical queries, and temporal joins without copying data into a separate lake is genuinely compelling. The differentiator isn't "another lakehouse" — it's that every table is also a stream, every stream is also a historical table, all on one copy in BookKeeper.

Want to see how this maps to your stack? We'll walk through your current pipeline and where the copies disappear.
