Systems · Mar 2026 · 10 min read

Real-Time IoT at Scale: WebSockets, MQTT, and Lessons Learned

A breakdown of the architecture behind a national law-enforcement IoT dashboard, and how we handled hundreds of concurrent device connections without melting the server.

By Asrafirrizal · Mar 2026

The brief: a real-time monitoring dashboard for a national law enforcement agency. Hundreds of field devices — vehicles, sensors, body cameras — streaming location, status, and event data. Sub-second latency requirement. 24/7 uptime. No room for failure.

Here's the architecture we built, what broke, and what held.

The Problem With Polling

The naive approach, where every client polls the server once a second, dies at scale. 500 devices × 1 request/second = 500 req/s just for status updates, before any user traffic. Polling is also not real-time: a device state change shows up anywhere from 0 to 1000 ms late depending on poll timing, roughly 500 ms on average.
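The arithmetic is easy to reproduce for other fleet sizes and intervals. A minimal sketch, where `pollingCost` is a hypothetical helper and state changes are assumed to land at a uniformly random point inside the polling interval:

```javascript
// Hypothetical helper: request load and staleness for a fixed polling interval.
function pollingCost(clients, intervalMs) {
  return {
    requestsPerSecond: clients * (1000 / intervalMs),
    worstCaseDelayMs: intervalMs,
    // A change waits, on average, half an interval for the next poll.
    meanDelayMs: intervalMs / 2,
  };
}

console.log(pollingCost(500, 1000));
```

Shortening the interval cuts the delay but multiplies the request rate by the same factor, which is the trade-off a push architecture avoids.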

Real-time IoT needs a push architecture. The device pushes data when something changes; the server pushes to clients immediately. No polling, no lag.

The Stack: MQTT + WebSockets

We split the problem in two:

  • Device to Server: MQTT. Lightweight, designed for IoT, handles unreliable connections gracefully, runs on constrained hardware.
  • Server to Browser: WebSockets. Full-duplex, works everywhere, no polling overhead.

Field devices
  -> MQTT (port 8883, TLS)
  -> MQTT Broker (Mosquitto)
  -> Node.js bridge service
  -> Redis Pub/Sub
  -> WebSocket servers (multiple instances)
  -> Browser clients

Redis Pub/Sub is the critical middle layer. It decouples the MQTT bridge from the WebSocket servers, letting you scale each independently. A message published to Redis reaches every WebSocket server instance in under 5ms, and each instance then forwards it to its own connected browser clients.
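The bridge itself can be very small. The sketch below is not our production code: `startBridge` is a hypothetical name, payloads are assumed to be JSON, `mqttClient` is assumed to emit `'message'` with `(topic, payloadBuffer)` the way MQTT.js does, and `redisPublisher` is assumed to expose `publish(channel, string)` the way node-redis does.

```javascript
// Sketch of the bridge: every valid MQTT message becomes one Redis publish.
function startBridge(mqttClient, redisPublisher, channel = 'device-updates') {
  mqttClient.on('message', (topic, payload) => {
    // Topics look like devices/{device_id}/{kind}
    const [root, deviceId, kind] = topic.split('/');
    if (root !== 'devices' || !deviceId || !kind) return; // ignore anything else

    let data;
    try {
      data = JSON.parse(payload.toString());
    } catch {
      return; // drop malformed payloads rather than crash the bridge
    }

    const envelope = JSON.stringify({ deviceId, kind, data });
    Promise.resolve(redisPublisher.publish(channel, envelope))
      .catch((err) => console.error('Redis publish failed:', err.message));
  });
}
```

Passing both clients in as arguments keeps the bridge logic testable without a running broker or Redis instance.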

MQTT: The Device Layer

MQTT runs on a publish/subscribe model. Devices publish to topics; subscribers receive messages. Our topic structure:

devices/{device_id}/location      # GPS coordinates, heading, speed
devices/{device_id}/status        # online/offline, battery, signal
devices/{device_id}/events        # alerts, triggers, incidents
devices/+/location                # wildcard subscription: all device locations
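In MQTT topic filters, `+` matches exactly one topic level and `#` matches all remaining levels. The broker does this matching for you; the sketch below only illustrates the rules. `topicMatches` is a hypothetical helper, and it leaves out the special handling of topics that start with `$`.

```javascript
// Returns true if an MQTT topic filter matches a concrete topic name.
function topicMatches(filter, topic) {
  const f = filter.split('/');
  const t = topic.split('/');
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return true;        // multi-level wildcard: matches the rest
    if (i >= t.length) return false;      // filter is longer than the topic
    if (f[i] !== '+' && f[i] !== t[i]) return false; // '+' matches any single level
  }
  return f.length === t.length;           // no unmatched trailing topic levels
}

console.log(topicMatches('devices/+/location', 'devices/unit-7/location'));
console.log(topicMatches('devices/+/location', 'devices/unit-7/status'));
```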

QoS level matters. We use QoS 1 (at-least-once delivery) for events and QoS 0 (fire-and-forget) for location updates. Location data is high-frequency and stale the moment it arrives — a dropped packet doesn't matter. An incident event must be delivered.
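One way to keep that policy in a single place is a lookup keyed on the last topic level. This is a sketch under assumptions: `publishOptionsFor` is a hypothetical helper, and treating status messages like events (QoS 1) is our guess for the example rather than something stated above.

```javascript
// QoS per message kind: location is fire-and-forget, everything else is at-least-once.
const QOS_BY_KIND = new Map([
  ['location', 0],
  ['status', 1], // assumption for this sketch
  ['events', 1],
]);

function publishOptionsFor(topic) {
  const kind = topic.split('/').pop();
  // Unknown kinds default to QoS 1: a duplicate is cheaper than a lost message.
  return { qos: QOS_BY_KIND.get(kind) ?? 1 };
}

// With an MQTT.js client this would be used as:
//   client.publish(topic, payload, publishOptionsFor(topic))
console.log(publishOptionsFor('devices/unit-7/location'));
```

QoS 1 can deliver the same message more than once, so whatever consumes events should tolerate duplicates.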

Handling Disconnections

Field devices go offline constantly: tunnels, dead zones, reboots. MQTT's Last Will and Testament (LWT) handles this gracefully. The device registers a "will" message when it connects, and the broker publishes it automatically if the connection drops without a clean disconnect, which the broker detects through a socket error or a missed keep-alive window. No application-level heartbeat logic needed.

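// Device connects with LWT configured
// (sketch assuming the MQTT.js client; the broker URL is a placeholder)
import mqtt from 'mqtt'

const client = mqtt.connect('mqtts://broker.example:8883', {
  will: {
    topic: `devices/${deviceId}/status`,
    // The will payload is fixed at connect time, so this timestamp records
    // when the device connected, not when it dropped off
    payload: JSON.stringify({ online: false, timestamp: Date.now() }),
    qos: 1,
    retain: true   // new subscribers see last known state immediately
  }
})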

Retained messages are equally important — a new browser client connecting to the dashboard sees the current state of all devices instantly, without waiting for the next update from each device.
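A retained "offline" will stays on the topic until something replaces it, so the device also needs to publish a retained "online" message each time it connects. A minimal sketch of that pairing, with `statusMessages` as a hypothetical helper and the MQTT.js calls in the comments given as an assumed usage:

```javascript
// Builds the two retained status messages for one device.
function statusMessages(deviceId) {
  const topic = `devices/${deviceId}/status`;
  const options = { qos: 1, retain: true };
  return {
    // Registered at connect time; the broker publishes it on an unclean disconnect.
    will: { topic, payload: JSON.stringify({ online: false }), ...options },
    // Published by the device after each successful connect;
    // replaces whatever retained status was there before.
    online: () => ({
      topic,
      payload: JSON.stringify({ online: true, timestamp: Date.now() }),
      options,
    }),
  };
}

// Assumed usage with MQTT.js:
//   const { will, online } = statusMessages('unit-7')
//   const client = mqtt.connect(brokerUrl, { will })
//   client.on('connect', () => {
//     const msg = online()
//     client.publish(msg.topic, msg.payload, msg.options)
//   })
console.log(statusMessages('unit-7').will);
```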

WebSockets: The Browser Layer

We run multiple Node.js WebSocket server instances behind a load balancer. The problem: WebSocket connections are stateful. A browser connected to Instance A can't receive messages published by Instance B — unless they share state.

Redis Pub/Sub solves this. Every WebSocket instance subscribes to the same Redis channels. Every message from a device reaches every instance, which forwards it to connected browser clients.

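// WebSocket server (simplified; parseSubscription and the room helpers live elsewhere)
import { createClient } from 'redis'
import { WebSocketServer } from 'ws'

const wss = new WebSocketServer({ port: 8080 })  // port is a placeholder

const redisSubscriber = createClient()
await redisSubscriber.connect()  // node-redis v4+ needs an explicit connect
await redisSubscriber.subscribe('device-updates', (message) => {
  const update = JSON.parse(message)
  broadcastToRoom(update.deviceId, update)
})

wss.on('connection', (ws, req) => {
  const { deviceIds } = parseSubscription(req)
  deviceIds.forEach(id => addClientToRoom(id, ws))
  ws.on('close', () => {
    deviceIds.forEach(id => removeClientFromRoom(id, ws))
  })
})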
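The snippet above leans on three room helpers it does not define. One way they might look is a map from device ID to the set of sockets watching that device. This is an illustrative sketch rather than the production implementation; sockets are assumed to expose `readyState` and `send` as the `ws` library does.

```javascript
// Minimal room registry: deviceId -> Set of sockets subscribed to that device.
const rooms = new Map();
const OPEN = 1; // WebSocket readyState value for an open connection

function addClientToRoom(deviceId, ws) {
  if (!rooms.has(deviceId)) rooms.set(deviceId, new Set());
  rooms.get(deviceId).add(ws);
}

function removeClientFromRoom(deviceId, ws) {
  const room = rooms.get(deviceId);
  if (!room) return;
  room.delete(ws);
  if (room.size === 0) rooms.delete(deviceId); // don't keep empty rooms around
}

// Sends the update to every open socket in the room; returns how many were sent.
function broadcastToRoom(deviceId, update) {
  const room = rooms.get(deviceId);
  if (!room) return 0;
  const frame = JSON.stringify(update); // serialize once, send many times
  let sent = 0;
  for (const ws of room) {
    if (ws.readyState === OPEN) {
      ws.send(frame);
      sent++;
    }
  }
  return sent;
}
```

Each instance only holds rooms for its own connections, so an update for a device nobody on that instance is watching costs a single map lookup.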

What Broke at Scale

The Memory Leak

At around 800 concurrent WebSocket connections, memory climbed and never came back down. Root cause: event listeners on the ws object weren't being cleaned up on disconnect. Every closed connection left a dangling listener. Fixed with explicit cleanup in the close handler and a WeakMap for client tracking.
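The shape of that fix, as a sketch rather than the exact production code: per-socket bookkeeping lives in a WeakMap, and a single close handler undoes everything that was registered for the socket. `track` and `untrack` are hypothetical names, and sockets are assumed to be event emitters, as they are in the `ws` library.

```javascript
// ws -> Set of deviceIds. The WeakMap does not keep a socket alive by itself.
const subscriptions = new WeakMap();
// deviceId -> Set of sockets. These ARE strong references, which is why
// untrack() must run on every close.
const rooms = new Map();

function track(ws, deviceIds) {
  subscriptions.set(ws, new Set(deviceIds));
  for (const id of deviceIds) {
    if (!rooms.has(id)) rooms.set(id, new Set());
    rooms.get(id).add(ws);
  }
  ws.once('close', () => untrack(ws));
}

function untrack(ws) {
  for (const id of subscriptions.get(ws) ?? []) {
    const room = rooms.get(id);
    if (!room) continue;
    room.delete(ws);
    if (room.size === 0) rooms.delete(id);
  }
  subscriptions.delete(ws);
  ws.removeAllListeners(); // drop every handler still registered on this socket
}
```

Watching `rooms.size` and the process heap side by side during a connect/disconnect soak test is a quick way to confirm that nothing is left behind.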

Message Storm on Reconnect

When the broker restarted, all 400+ devices reconnected simultaneously and published their retained state. The bridge service received 400 messages in ~200ms, overwhelmed the Redis pipeline, and backed up. Fixed with connection jitter (random 0-5s reconnect delay on device firmware) and a message queue with backpressure on the bridge.
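Both halves of the fix are small. With a 0-5s jitter, 400 devices reconnect at roughly 80 per second on average instead of all at once. The sketch below is illustrative only: `reconnectDelayMs` and `BoundedQueue` are hypothetical names, the real jitter lives in device firmware rather than JavaScript, and what happens to a refused message (drop or retry) is a policy choice this example leaves to the caller.

```javascript
// Random reconnect delay in [0, maxMs); `random` is injectable for testing.
function reconnectDelayMs(maxMs = 5000, random = Math.random) {
  return Math.floor(random() * maxMs);
}

// Bounded FIFO queue: when full, offer() refuses the message instead of
// letting the backlog grow without limit.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
  }

  get depth() {
    return this.items.length;
  }

  offer(item) {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Hands messages to `handler` one at a time, waiting for each to finish,
  // so a slow downstream slows the consumer instead of piling up work.
  async drain(handler) {
    while (this.items.length > 0) {
      await handler(this.items.shift());
    }
  }
}
```

The queue's `depth` is also the natural value to export as the bridge's queue-depth metric.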

The Database Write Problem

We were writing every location update to PostgreSQL in real time. At 2 updates/second per device across 400 devices, that came to 800 writes/second. Postgres handled it, but barely, and query latency spiked. Solution: write location to Redis (fast, ephemeral) for real-time display, and batch-write to Postgres every 30 seconds for historical queries. Different data, different storage, different access patterns.

Real-time and persistent are different requirements. Don't force the same storage layer to serve both.
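The batching side of that split can be sketched as a buffer that is flushed on a timer. At 800 updates/second, a 30-second window is about 24,000 rows per flush. The names here are hypothetical: `LocationBatcher` is illustrative, `writeBatch` stands in for whatever performs the multi-row insert, and buffering every update (rather than only the latest per device) is an assumption of this example.

```javascript
// Buffers location updates in memory and hands them to `writeBatch` in one call.
class LocationBatcher {
  constructor(writeBatch, intervalMs = 30_000) {
    this.writeBatch = writeBatch;
    this.buffer = [];
    this.timer = setInterval(() => {
      this.flush().catch((err) => console.error('Batch write failed:', err.message));
    }, intervalMs);
    this.timer.unref(); // don't keep the process alive just for the timer
  }

  add(update) {
    this.buffer.push(update);
  }

  // Writes everything buffered so far; resolves to the number of rows written.
  async flush() {
    if (this.buffer.length === 0) return 0;
    const rows = this.buffer;
    this.buffer = []; // swap first so updates arriving during the write are kept
    try {
      await this.writeBatch(rows);
    } catch (err) {
      this.buffer = rows.concat(this.buffer); // put the rows back for the next attempt
      throw err;
    }
    return rows.length;
  }

  stop() {
    clearInterval(this.timer);
  }
}
```

A crash loses at most one window of history, which is acceptable only because the live view is served from Redis and not from this buffer.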

Monitoring

A real-time system that breaks silently is the worst outcome. We instrument:

  • MQTT broker: connected clients, message rate, dropped connections
  • Bridge service: queue depth, processing latency, Redis publish errors
  • WebSocket servers: connected clients per instance, message broadcast latency
  • End-to-end: synthetic device-to-browser latency measured every 30 seconds

The end-to-end synthetic test is the most valuable. It's the only metric that catches cascading failures across multiple layers simultaneously.
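A synthetic probe boils down to stamping a message when it is published and subtracting that stamp when it arrives at a test WebSocket client. The sketch below covers only the bookkeeping. `LatencyStats` is a hypothetical name, percentiles use the nearest-rank method, and both timestamps are assumed to come from the same clock (for example, probe sender and receiver running on one host).

```javascript
// Collects latency samples and reports the mean and nearest-rank percentiles.
class LatencyStats {
  constructor() {
    this.samples = [];
  }

  record(sentAtMs, receivedAtMs) {
    this.samples.push(receivedAtMs - sentAtMs);
  }

  mean() {
    if (this.samples.length === 0) return NaN;
    return this.samples.reduce((sum, x) => sum + x, 0) / this.samples.length;
  }

  percentile(p) {
    if (this.samples.length === 0) return NaN;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
    return sorted[Math.max(rank, 1) - 1];
  }
}

const stats = new LatencyStats();
stats.record(1000, 1100);
stats.record(2000, 2400);
console.log(stats.mean(), stats.percentile(99));
```

Alerting on the p99 rather than the mean surfaces a degraded layer sooner, because a single slow hop shows up in the tail first.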

Numbers

  • Peak concurrent devices: ~600
  • Peak concurrent browser clients: ~120
  • Average device to browser latency: 180ms
  • p99 device to browser latency: 420ms
  • Uptime over 12 months: 99.94%

The full stack: Mosquitto as MQTT broker, Node.js for the bridge and WebSocket servers, Redis 7 for Pub/Sub and hot data, PostgreSQL for historical data, React on the frontend with a custom WebSocket hook, deployed on on-premises servers behind Nginx. No managed services: the client's security requirements mandated an on-premises deployment.
