MQTT: The Protocol Running Your Building's Sensors (And What to Do With the Data)

A client called me earlier this year with an interesting problem. They had a building automation system — HVAC, power meters, door sensors — all generating data. They wanted that data in their analytics stack. The problem was that the vendor-supplied monitoring software only showed real-time dashboards and stored nothing past 30 days. "We need the history," they said. "We need to trend it."

What they had was an MQTT broker they did not know about. What they needed was someone to explain what to do with it.

What MQTT Is

MQTT (Message Queuing Telemetry Transport) is a publish/subscribe protocol designed for constrained environments: low bandwidth, unreliable connections, tiny devices. It was invented in 1999 at IBM for pipeline monitoring in remote oil fields and has found a second life as the backbone of IoT device communication.

The model is simple. Devices publish messages to a broker on a topic. Subscribers connect to the broker and receive messages on topics they care about. The broker is the hub — it handles fanout, connection management, and QoS negotiation. Devices do not talk to each other directly.

Topics use a slash-separated hierarchy: building/floor2/hvac/zone3/temp. Subscribers can use wildcards: building/+/hvac/+/temp matches any floor and any zone. building/# matches everything under building/. This is flexible in a way that Kafka topics are not — it is part of the protocol, not a consumer-side concern.
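
To make the wildcard rules concrete, here is a minimal sketch of topic-filter matching in plain Python. The function name topic_matches is mine, not part of any library, and it ignores the special handling of topics that start with $. In real code you would more likely use the paho client's own topic_matches_sub helper.

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name."""
    f_levels = filter_.split('/')
    t_levels = topic.split('/')
    for i, level in enumerate(f_levels):
        if level == '#':
            # Multi-level wildcard: only valid as the last level; it matches
            # the parent level and everything below it.
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False  # filter is deeper than the topic
        if level != '+' and level != t_levels[i]:
            return False  # literal level differs ('+' accepts any single level)
    # No '#' seen, so both must have the same number of levels
    return len(f_levels) == len(t_levels)


print(topic_matches('building/+/hvac/+/temp', 'building/floor2/hvac/zone3/temp'))
print(topic_matches('building/#', 'building/floor2/power/meter1/kwh'))
print(topic_matches('building/+/hvac/+/temp', 'building/floor2/power/meter1/kwh'))
```

The first two calls match and the third does not, because the literal hvac level differs.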

QoS: The Part Everyone Skips Until It Bites Them

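MQTT defines three QoS levels that control delivery guarantees. QoS applies per hop: once between the publisher and the broker, and again between the broker and each subscriber. The broker forwards a message to a subscriber at the lower of the QoS it was published with and the QoS the subscriber requested. Seen from the broker-to-subscriber side: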

  • QoS 0 — fire and forget. The broker sends the message once with no acknowledgment. If the subscriber is offline, the message is gone.
  • QoS 1 — at least once. The broker retries until it gets an ack. The subscriber may receive duplicates.
  • QoS 2 — exactly once. Four-step handshake. Guaranteed delivery without duplicates. Highest overhead.

For a temperature sensor reading every 30 seconds, QoS 0 is fine — missing one sample is acceptable. For a door access event that triggers an audit log, QoS 1 or 2 is the right call. Choose based on whether you care more about throughput or completeness. Most IoT analytics work lands on QoS 1.
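
One way to write that choice down is a per-topic subscription list. This is a sketch under assumptions: the topic filters (including the doors/access path) are hypothetical names for this building, and effective_qos is a helper of my own that models the rule that the broker forwards at the lower of the publish QoS and the subscription QoS.

```python
# Hypothetical topic filters and QoS choices, in the (topic, qos) shape
# that paho's client.subscribe() accepts as a list.
SUBSCRIPTIONS = [
    ('building/+/hvac/+/temp', 0),     # periodic readings: a lost sample is tolerable
    ('building/+/doors/+/access', 1),  # audit events: must arrive, duplicates handled downstream
]


def effective_qos(publish_qos: int, subscribe_qos: int) -> int:
    """QoS the broker uses when forwarding a message to one subscriber."""
    for qos in (publish_qos, subscribe_qos):
        if qos not in (0, 1, 2):
            raise ValueError(f"invalid QoS level: {qos}")
    return min(publish_qos, subscribe_qos)


# A sensor that publishes at QoS 0 cannot be upgraded by subscribing at QoS 1.
print(effective_qos(publish_qos=0, subscribe_qos=1))
```

The practical consequence: if the door controller publishes at QoS 0, subscribing at QoS 1 does not buy you anything. Check what the devices actually publish at before trusting the subscriber side.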

The Data Engineering Problem

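Here is the thing about MQTT brokers: they are not designed for history. Mosquitto, the most common open-source broker, holds on to a message in only two cases. If the publisher sets the retain flag, the broker keeps that message, but only the last one per topic, so a subscriber that connects later gets a single value. If a subscriber holds a persistent session, the broker queues QoS 1 and 2 messages while that subscriber is offline, but the queue is a bounded delivery buffer that empties as soon as the messages are delivered, and nothing else can query it. Neither is a time series. The first is a snapshot and the second is a backlog.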

If you want history — trending, anomaly detection, capacity planning — you need something that stores every message. MQTT alone does not do that.
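
The difference between a retained value and a history is easy to see in a toy model. Both classes below are illustrations I made up for this post, not broker code: RetainedSlot keeps one value per topic the way retained messages do, and HistoryLog keeps everything, which is what trending needs.

```python
from collections import defaultdict


class RetainedSlot:
    """Toy model of a broker's retained messages: one slot per topic."""

    def __init__(self):
        self._last = {}

    def publish(self, topic, payload):
        self._last[topic] = payload  # each publish overwrites the previous value

    def read(self, topic):
        return self._last.get(topic)


class HistoryLog:
    """Toy model of what trending needs: every message, in arrival order."""

    def __init__(self):
        self._rows = defaultdict(list)

    def publish(self, topic, payload):
        self._rows[topic].append(payload)

    def read(self, topic):
        return list(self._rows[topic])


topic = 'building/floor2/hvac/zone3/temp'
retained, history = RetainedSlot(), HistoryLog()
for reading in ('21.5', '21.7', '22.1'):
    retained.publish(topic, reading)
    history.publish(topic, reading)

print(retained.read(topic))  # only the most recent reading survives
print(history.read(topic))   # all three readings, in order
```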

There are two common solutions:

  1. Write a persistent subscriber that records every message to a database as it arrives
  2. Bridge MQTT to Kafka, and let Kafka be the durable store

Option 1 is quick to stand up but creates a fragile single point of ingestion. Option 2 is more infrastructure but gets you replay, fanout, and decoupled consumers.
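
For option 2, most of the design work is deciding how an MQTT message maps onto a Kafka record. Here is one possible mapping, as a sketch only. The function name to_kafka_record, the mqtt prefix, and the one-Kafka-topic-per-root-level layout are all my assumptions, and the actual producer call is left out because it depends on which Kafka client or bridge you use.

```python
import re


def to_kafka_record(mqtt_topic: str, payload: bytes, prefix: str = 'mqtt') -> dict:
    """Map one MQTT message to the topic, key, and value of a Kafka record."""
    root = mqtt_topic.split('/', 1)[0]
    # Kafka topic names allow only letters, digits, '.', '_' and '-',
    # so the slash-separated MQTT topic cannot be used as-is.
    safe_root = re.sub(r'[^A-Za-z0-9._-]', '_', root) or 'unknown'
    return {
        'topic': f'{prefix}.{safe_root}',
        # Full MQTT topic as the key: records with the same key go to the
        # same partition, which keeps each sensor's readings in order.
        'key': mqtt_topic.encode('utf-8'),
        'value': payload,  # forwarded untouched, no parsing in the bridge
    }


record = to_kafka_record('building/floor2/hvac/zone3/temp', b'21.5')
print(record['topic'], record['key'])
```

Keeping the full MQTT topic in the key means nothing about the hierarchy is lost, even though the Kafka topic itself is coarse.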

A Persistent Subscriber in Python

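import json
import os
import time
import uuid

import paho.mqtt.client as mqtt

BROKER_HOST = 'mqtt-broker.local'
BROKER_PORT = 1883
RAW_STORE = '/data/raw/mqtt'  # write raw payloads here first
CLIENT_ID = 'raw-lander'      # fixed ID so the broker can keep a session for this subscriber

def on_connect(client, userdata, flags, rc):
    print(f"Connected with result code {rc}")
    # Subscribe inside on_connect so the subscription is renewed on every reconnect
    client.subscribe('building/#', qos=1)

def on_message(client, userdata, msg):
    # Land raw: topic, arrival timestamp, payload as text
    ts_ms = int(time.time() * 1000)
    record = {
        'topic': msg.topic,
        'ts': ts_ms,
        'payload': msg.payload.decode('utf-8', errors='replace'),
        'qos': msg.qos
    }
    # Write to raw zone; do NOT parse payload here.
    # The UUID suffix stops two messages in the same millisecond from overwriting each other.
    path = os.path.join(RAW_STORE, f"{ts_ms}-{uuid.uuid4().hex}.json")
    with open(path, 'w') as f:
        json.dump(record, f)

os.makedirs(RAW_STORE, exist_ok=True)

# paho-mqtt 1.x constructor; on 2.x, pass mqtt.CallbackAPIVersion.VERSION1 as the first argument.
# clean_session=False asks the broker to queue QoS 1 messages while this client is offline.
client = mqtt.Client(client_id=CLIENT_ID, clean_session=False)
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER_HOST, BROKER_PORT, 60)
client.loop_forever()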

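Notice what this does not do: it does not try to parse the payload. It does not validate the message structure. It lands the payload as text (decoded as UTF-8 for JSON storage, but without interpretation) and timestamps it on arrival. One caveat: errors='replace' substitutes any bytes that are not valid UTF-8, so if your devices publish binary payloads, base64-encode msg.payload instead of decoding it. Deserialization is a separate job.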
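
That separate job can be very small. The sketch below assumes raw records were landed as one JSON file each with topic, ts, and payload fields, as in the subscriber above. The function name deserialize_raw and the quarantine list are my own choices, and parse_payload is whatever parser you supply for your devices.

```python
import json
from pathlib import Path


def deserialize_raw(raw_dir, parse_payload):
    """Parse every landed record; return (parsed, quarantined)."""
    parsed, quarantined = [], []
    for path in sorted(Path(raw_dir).glob('*.json')):
        record = json.loads(path.read_text(encoding='utf-8'))
        try:
            value = parse_payload(record['topic'], record['payload'])
        except (ValueError, TypeError, KeyError) as exc:
            # A bad payload is set aside with its reason. The raw file is
            # left in place, so it can be re-parsed once the parser is fixed.
            quarantined.append({'file': path.name, 'error': str(exc)})
            continue
        parsed.append({'topic': record['topic'], 'ts': record['ts'], 'value': value})
    return parsed, quarantined
```

Called as deserialize_raw('/data/raw/mqtt', lambda topic, payload: float(payload)), it returns the readings that parsed as numbers plus a list of the files that did not. A parsing failure costs you a re-run, not the data.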

Why This Architecture Works

Your building automation vendor will change the payload format. I promise. At one client I worked with, the HVAC vendor pushed a firmware update that changed temperature readings from Celsius strings to Fahrenheit integers. The subscriber kept running, the raw records kept landing, and we caught the format change during deserialization without losing a single measurement.

If the subscriber had been doing inline deserialization — parsing the payload and writing structured records to the database directly — that firmware update would have either crashed the subscriber or, worse, silently inserted wrong values for every reading until someone noticed the temperature data looked weird.
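
Catching that kind of change takes a parser that is strict about what it expects. The sketch below is illustrative rather than what ran at that client: it assumes the payload is a JSON-encoded string such as "21.5", and the plausibility range of -40 to 60 degrees Celsius is a number I picked for the example.

```python
import json


def parse_celsius(payload: str) -> float:
    """Strictly parse a temperature payload expected to be a JSON string in Celsius."""
    value = json.loads(payload)
    if not isinstance(value, str):
        # A type change is the first sign that the vendor changed the format.
        raise TypeError(f"expected a Celsius string, got {type(value).__name__}: {value!r}")
    celsius = float(value)
    # Hypothetical plausibility bounds: a unit change can keep the type
    # and still push values out of range.
    if not -40.0 <= celsius <= 60.0:
        raise ValueError(f"temperature out of plausible Celsius range: {celsius}")
    return celsius


print(parse_celsius('"21.5"'))
```

Given the old format, this returns a float. Given the new one (a bare integer like 71), it raises instead of quietly storing a Fahrenheit number in a Celsius column.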

Land raw first. Always. I will keep saying this until it stops being necessary.

Next in this series: Kafka — the durable event store that makes MQTT's persistence problem go away at scale, and the natural landing point for any high-volume IoT ingestion pipeline. If you are already running an MQTT broker and want to talk through bridge options, I am here to help.
