Platform deep dive

How the Streaming Lakehouse works

One copy of data for messaging, AI features, and SQL — built on Apache Pulsar & BookKeeper.

The problem: AI keeps re-copying the same data

Almost every data platform is a relay race of copies. A single event is written, then copied, then copied again, just to become usable for messaging, then analytics, then AI:

Producer → Kafka/Pulsar → Object Store (S3/HDFS) → Iceberg/Delta → Spark/Flink/Trino → Feature Store / Vector DB

Each hop is real cost and real complexity:

Multiple copies of data — the same bytes live in the streaming layer, in object storage, and again in a table format.
ETL pipelines — jobs to move and reshape data between every layer, built, scheduled, monitored, and paid for.
High storage cost and a large infrastructure footprint.
Stale data — AI sees features minutes or hours after the event happened.
Two worlds — a streaming stack and an analytics stack, operated by different teams.

The idea: make the streaming layer the lakehouse

The Streaming Lakehouse collapses that relay race into a single platform on Apache Pulsar and Apache BookKeeper:

Producer → Pulsar → BookKeeper Streaming Lakehouse

One write. One copy. One storage engine. From that single copy you get pub-sub messaging, lakehouse-style analytical scans, SQL table access, historical replay, and native point-in-time joins — with no ETL and no duplicate storage.

The key innovation: columnar pages inside BookKeeper

BookKeeper already does something convenient: after writing incoming data to its journal (a write-ahead log), it buffers entries in a write cache before batching them to the ledger device. The Streaming Lakehouse hooks into exactly that moment.

Instead of flushing row-oriented log entries straight to the EntryLog, we convert the buffered data into immutable, compressed columnar pages using Apache Vortex, and write those pages to disk:

Message Stream → Journal → Write Cache → Vortex Columnar Page Builder → Ledger Storage

Each page (written in 2–4 MB batches by default, configurable) is a self-describing analytics unit carrying columnar data, compression, statistics (min/max, null counts, cardinality), bloom filters, time ranges, and join indexes. For an Employee topic, each page records both a creation-time range and an employee_id range — so a query for employee_id = 123 skips every page whose range can't contain it, reading only the few pages that could match. This turns BookKeeper from a log store into a Streaming Lakehouse storage engine, while still batching exactly as it does today.

Two brokers over one storage layer

1. Pub-Sub Broker

Reads the pages, reconstructs entries, and delivers them as normal Pulsar consumer messages.
Existing Pulsar semantics and existing Java, Python, and Go clients keep working — no client changes.

2. DataLake Broker

Reads the columnar pages, loads page metadata, prunes irrelevant pages, and runs filters, projections, aggregations, and joins in memory.
Returns analytical and SQL results — including joined output — from the same copy of data.

To keep low-latency messaging and heavy analytics from fighting each other, topics come in two flavors: a Messaging Topic (full Pulsar behavior) and a Streaming Lakehouse Topic (sequential, cumulative-ack, columnar, analytics-enabled).

Where it gets interesting: native point-in-time joins

Because BookKeeper storage is append-only and immutable, every topic already contains the full history of every record. That means you can answer "as of" questions natively — no Iceberg snapshots, no separate historical tables. This is exactly what AI needs for point-in-time-correct features and leakage-free training sets.

Consider an Employee topic and a Department topic:

Employee:  employee_id=1, name=John, dept=10 @ 10:00   then   employee_id=1, dept=20 @ 11:00
Department: dept=10, name=Engineering @ 09:00     and    dept=20, name=Platform @ 10:30

The query:

SELECT * FROM Employee e JOIN Department d ON e.dept = d.dept AS OF '10:15'

correctly returns John → Engineering — because at 10:15 John was still in dept 10, and dept 20 ("Platform") did not yet exist. This is a temporal hash join, very similar to how MVCC databases resolve versions.

Three layers make it scale

Scanning everything would be hopeless — imagine a billion-row Employee topic joined to ten million Departments. Layered metadata means the engine reads almost nothing it doesn't need:

Layer 1 — Columnar Pages: the 2–4 MB Vortex page batches holding the actual data.
Layer 2 — Page Metadata: page_id, min_ts, max_ts, min_emp_id, max_emp_id for pruning.
Layer 3 — Join / Version Index: for each key, the page, row group, and time range of every version — sorted by (key, event_time) so the broker binary-searches for the latest version ≤ the query timestamp.

Without Layer 3, joins degrade into full scans; with it, they become fast lookups.

Execution: stand on Arrow, Vortex, and DataFusion

Rather than writing a SQL engine from scratch, the DataLake Broker turns pages into Arrow record batches and lets a vectorized engine do the work:

BookKeeper Page → Arrow RecordBatch → DataFusion → Hash Join → Result

Filters, projections, joins, and aggregations come almost for free, exposed through SQL over Arrow Flight, JDBC, and REST — including SELECT, JOIN, GROUP BY, WINDOW, and AS OF.

The value proposition is simple: One Write. One Copy. One Storage Engine. Streaming + Lakehouse + Historical SQL — purpose-built to feed AI.

Where this fits

The Streaming Lakehouse is a natural fit for ML feature stores, agent memory and RAG, real-time personalization, fraud and risk, observability, Customer-360, financial transactions, and IoT analytics — anywhere teams pay today to keep a streaming copy and an analytics copy of the same data in sync.

The Streaming Lakehouse is an actively evolving platform StreamBricks is building on Apache Pulsar & BookKeeper. Want to pilot it on your workload?

Request early access →