The original Spark Streaming API — DStreams — was a competent piece of engineering surrounded by an awkward abstraction. You worked with DStream objects that were not quite RDDs and not quite DataFrames. The Python and Java APIs behaved differently in subtle ways. Windowing operations were powerful but verbose. Most annoyingly, DStreams were a separate universe from the rest of Spark SQL — you could not write a window function over a DStream the same way you could over a DataFrame.
Structured Streaming, introduced in Spark 2.0 and declared production-ready in 2.2, fixed most of this. The core idea: a stream is just a table that keeps getting new rows. You write queries against it using the DataFrame API or Spark SQL. The engine handles the incremental execution.
The Model: Unbounded Table
In Structured Streaming, Kafka topic data is modeled as an append-only table. Each new message is a new row. You write a query that selects, filters, or aggregates from this table — the same query you would write against a static DataFrame. Spark handles the mechanics of pulling new data, tracking progress, and writing results incrementally.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StringType, LongType, DoubleType
spark = SparkSession.builder .appName("SensorRawLanding") .config("spark.sql.streaming.checkpointLocation", "/mnt/checkpoints/sensor-raw") .getOrCreate()
sensor_schema = StructType() .add("sensor_id", StringType()) .add("value", DoubleType()) .add("unit", StringType()) .add("ts_ms", LongType())
# Read from Kafka — each row is a Kafka message
raw_stream = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "kafka-broker-01:9092") .option("subscribe", "sensor_readings_raw") .option("startingOffsets", "earliest") .load()
# Stage 1: land raw — keep Kafka metadata alongside raw value bytes
raw_landed = raw_stream.select(
col("topic"),
col("partition"),
col("offset"),
col("timestamp").alias("kafka_ts"),
col("value").cast("string").alias("raw_payload"),
current_timestamp().alias("arrived_at")
)
# Write raw to Delta Lake (Stage 1 output)
raw_query = raw_landed.writeStream .format("delta") .outputMode("append") .option("checkpointLocation", "/mnt/checkpoints/sensor-raw") .start("/mnt/datalake/raw/sensor_readings")
Triggers: You Control the Micro-Batch Interval
Structured Streaming does not have to be always-on real-time. The trigger controls how frequently it processes new data:
from pyspark.sql.streaming import Trigger
# Process continuously with minimal latency (true streaming)
.trigger(processingTime='0 seconds')
# Micro-batch: process every 5 minutes
.trigger(processingTime='5 minutes')
# Run once and stop — the "scheduled batch" mode
.trigger(availableNow=True) # Spark 3.3+; earlier: once=True
availableNow=True (formerly once=True) is the micro-batch sweet spot: it processes all available data, commits checkpoints, and exits. Run it on a schedule. You get the operational simplicity of a batch job — run, complete, stop — with the checkpoint-based exactly-once semantics of streaming. The cluster only exists while the job runs.
Checkpoints: State That Survives Restarts
Structured Streaming tracks progress in a checkpoint directory — a folder on HDFS, S3, ADLS, or a Delta table. The checkpoint records which Kafka offsets have been committed to the output. On restart, Spark reads the checkpoint and continues from exactly where it left off, with no duplicate writes and no data gaps.
Do not delete checkpoint directories between runs unless you want to replay from the beginning. Do not share checkpoint directories between two streaming queries (they will corrupt each other). One checkpoint per streaming query, always.
Why This Is Better Than DStreams
One API for batch and streaming. The same from_json, groupBy, window, and join functions work in both contexts. Code that you write to explore a static dataset in a notebook translates directly to a streaming job. Schema enforcement, null handling, and type casting all work the same way. And when something goes wrong, the error messages are the same ones you already know how to interpret.
DStreams were not bad. Structured Streaming is just a meaningfully better abstraction for the same problem space, and it will likely be where Spark streaming development is focused going forward. I am here to help if you are migrating an existing DStream job or starting fresh.