How the Streaming Lakehouse works
One copy of data for messaging, AI features, and SQL — built on Apache Pulsar & BookKeeper.
The problem: AI keeps re-copying the same data
Almost every data platform is a relay race of copies. A single event is written, then copied, then copied again, just to become usable for messaging, then analytics, then AI:
Producer → Kafka/Pulsar → Object Store (S3/HDFS) → Iceberg/Delta → Spark/Flink/Trino → Feature Store / Vector DB
Each hop is real cost and real complexity:
- Multiple copies of data — the same bytes live in the streaming layer, in object storage, and again in a table format.
- ETL pipelines — jobs to move and reshape data between every layer, built, scheduled, monitored, and paid for.
- High storage cost and a large infrastructure footprint.
- Stale data — AI sees features minutes or hours after the event happened.
- Two worlds — a streaming stack and an analytics stack, operated by different teams.
The idea: make the streaming layer the lakehouse
The Streaming Lakehouse collapses that relay race into a single platform on Apache Pulsar and Apache BookKeeper:
Producer → Pulsar → BookKeeper Streaming Lakehouse
One write. One copy. One storage engine. From that single copy you get pub-sub messaging, lakehouse-style analytical scans, SQL table access, historical replay, and native point-in-time joins — with no ETL and no duplicate storage.
The key innovation: columnar pages inside BookKeeper
BookKeeper already does something convenient: after writing incoming data to its journal (a write-ahead log), it buffers entries in a write cache before batching them to the ledger device. The Streaming Lakehouse hooks into exactly that moment.
Instead of flushing row-oriented log entries straight to the EntryLog, we convert the buffered data into immutable, compressed columnar pages using Apache Vortex, and write those pages to disk:
Message Stream → Journal → Write Cache → Vortex Columnar Page Builder → Ledger Storage
Each page (written in 2–4 MB batches by default, configurable) is a self-describing analytics unit carrying columnar data, compression, statistics (min/max, null counts, cardinality), bloom filters, time ranges, and join indexes. For an Employee topic, each page records both a creation-time range and an employee_id range — so a query for employee_id = 123 skips every page whose range can't contain it, reading only the few pages that could match. This turns BookKeeper from a log store into a Streaming Lakehouse storage engine, while still batching exactly as it does today.
Two brokers over one storage layer
1. Pub-Sub Broker
- Reads the pages, reconstructs entries, and delivers them as normal Pulsar consumer messages.
- Existing Pulsar semantics and existing Java, Python, and Go clients keep working — no client changes.
2. DataLake Broker
- Reads the columnar pages, loads page metadata, prunes irrelevant pages, and runs filters, projections, aggregations, and joins in memory.
- Returns analytical and SQL results — including joined output — from the same copy of data.
To keep low-latency messaging and heavy analytics from fighting each other, topics come in two flavors: a Messaging Topic (full Pulsar behavior) and a Streaming Lakehouse Topic (sequential, cumulative-ack, columnar, analytics-enabled).
Where it gets interesting: native point-in-time joins
Because BookKeeper storage is append-only and immutable, every topic already contains the full history of every record. That means you can answer "as of" questions natively — no Iceberg snapshots, no separate historical tables. This is exactly what AI needs for point-in-time-correct features and leakage-free training sets.
Consider an Employee topic and a Department topic:
Employee: employee_id=1, name=John, dept=10 @ 10:00 then employee_id=1, dept=20 @ 11:00
Department: dept=10, name=Engineering @ 09:00 and dept=20, name=Platform @ 10:30
The query:
SELECT * FROM Employee e JOIN Department d ON e.dept = d.dept AS OF '10:15'
correctly returns John → Engineering — because at 10:15 John was still in dept 10, and dept 20 ("Platform") did not yet exist. This is a temporal hash join, very similar to how MVCC databases resolve versions.
Three layers make it scale
Scanning everything would be hopeless — imagine a billion-row Employee topic joined to ten million Departments. Layered metadata means the engine reads almost nothing it doesn't need:
- Layer 1 — Columnar Pages: the 2–4 MB Vortex page batches holding the actual data.
- Layer 2 — Page Metadata:
page_id, min_ts, max_ts, min_emp_id, max_emp_idfor pruning. - Layer 3 — Join / Version Index: for each key, the page, row group, and time range of every version — sorted by
(key, event_time)so the broker binary-searches for the latest version≤the query timestamp.
Without Layer 3, joins degrade into full scans; with it, they become fast lookups.
Execution: stand on Arrow, Vortex, and DataFusion
Rather than writing a SQL engine from scratch, the DataLake Broker turns pages into Arrow record batches and lets a vectorized engine do the work:
BookKeeper Page → Arrow RecordBatch → DataFusion → Hash Join → Result
Filters, projections, joins, and aggregations come almost for free, exposed through SQL over Arrow Flight, JDBC, and REST — including SELECT, JOIN, GROUP BY, WINDOW, and AS OF.
The value proposition is simple: One Write. One Copy. One Storage Engine. Streaming + Lakehouse + Historical SQL — purpose-built to feed AI.
Where this fits
The Streaming Lakehouse is a natural fit for ML feature stores, agent memory and RAG, real-time personalization, fraud and risk, observability, Customer-360, financial transactions, and IoT analytics — anywhere teams pay today to keep a streaming copy and an analytics copy of the same data in sync.
The Streaming Lakehouse is an actively evolving platform StreamBricks is building on Apache Pulsar & BookKeeper. Want to pilot it on your workload?
Request early access →