From Kafka to Delta Lake: The Streaming Pattern That Replaced My SQL Pipeline
For most of my career, a "real-time pipeline" meant reading from a SQL Server table on a short schedule — every 5 minutes, every minute, as fast as you could run the query. It worked, but it had edge cases: what if the job ran slower than the interval? What if a row got deleted between runs? What if you needed sub-minute latency?
Structured Streaming in Spark, writing to Delta Lake, is what replaced that pattern on my projects. Here's what the real-world version looks like.
The Basic Stream-to-Delta Pattern
from pyspark.sql.functions import get_json_object, col, to_timestamp, current_timestamp
# Read from Kafka
raw_stream = (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:9092")
.option("subscribe", "order-events")
.option("startingOffsets", "latest")
.load())
# Parse the JSON payload
orders_parsed = raw_stream.select(
get_json_object(col("value").cast("string"), "$.order_id").cast("long").alias("order_id"),
get_json_object(col("value").cast("string"), "$.customer_id").cast("long").alias("customer_id"),
to_timestamp(get_json_object(col("value").cast("string"), "$.created_at")).alias("order_ts"),
get_json_object(col("value").cast("string"), "$.total_amount").cast("decimal(18,4)").alias("total_amount"),
get_json_object(col("value").cast("string"), "$.region").alias("region"),
current_timestamp().alias("ingested_at")
)
# Write to Delta bronze layer
query = (orders_parsed.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
.trigger(processingTime="30 seconds")
.start("/mnt/myproject/bronze/orders"))Why Delta as the Stream Target
Writing a stream to plain Parquet files works but creates a small-files problem instantly — a 30-second trigger interval writes a new set of files every 30 seconds. Over a week, you have tens of thousands of tiny files that destroy query performance.
Writing to Delta handles this differently. Delta writes files in streaming mode and they can be compacted later with OPTIMIZE without interrupting readers. The transaction log makes each micro-batch's writes atomic, so a reader always sees consistent data even while the stream is writing.
Reading the Stream Output With a Batch Query
Once data is landing in Delta, any batch query can read it immediately:
# Batch query runs concurrently with the stream write — safe in Delta
recent_orders = (spark.read
.format("delta")
.load("/mnt/myproject/bronze/orders")
.filter("order_ts >= '2019-06-28' AND region = 'West'"))
display(recent_orders.orderBy("order_ts", ascending=False).limit(20))This is the "Lambda architecture collapse" that makes Delta interesting: you don't need separate batch and stream paths anymore. The stream writes to Delta, the batch queries Delta, and they don't interfere with each other.
Checkpoint Files: Don't Skip This
The checkpointLocation option is not optional. The checkpoint directory stores the stream's state: which Kafka offsets have been processed, what micro-batch the stream is on. If you restart the stream without a checkpoint, it starts from the beginning (or from startingOffsets) and you process everything again. If you have a checkpoint, restart resumes exactly where it left off.
Keep your checkpoint directory on durable storage (ADLS, S3 — not local disk) and never delete it unless you're intentionally replaying from the beginning. That directory is the stream's memory. As always, I'm here to help.