Alex Vanderpot 72aa6be3d3 ingest: resubscribe on reconnect + staleness watchdog for zombie MQTT conns (#38)
The letsmesh broker was migrated behind Cloudflare and changed its topic
layout on 2026-06-02, which left prod's MQTT client in a zombie state:
connected per paho's IsConnected() (so the 30s monitor never rebuilt it) but
receiving zero messages, because the subscription was established only once
after the initial connect and never re-applied on paho auto-reconnects. Result:
12 days of silently missing letsmesh ingestion while davekeogh masked the loss.

Make reconnection robust instead of relying on broker-side session persistence:

- Subscribe inside the OnConnect handler so every (re)connect, including paho
  auto-reconnects, restores delivery. Use CleanSession(true)+ResumeSubs(false)
  so we never depend on the broker remembering our session.
- Add a per-broker data-staleness watchdog: a broker that reports connected but
  delivers no messages for MQTT_STALE_AFTER_SECONDS (default 300) is treated as a
  zombie and force-rebuilt (disconnect + fresh connect/subscribe). This catches
  exactly the failure IsConnected() misses.
- Reduce the external monitor to that watchdog role; transient drops are left to
  paho auto-reconnect rather than racing it with a brand-new client.
- Use stable per-broker client IDs (by index) and pre-size the MQTTClients slice
  so indices stay aligned when an earlier broker fails; guard
  BrokerStatus/lastActivity with a mutex; promote connect/subscribe logs to Info
  for visibility.
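
A minimal sketch of the staleness check described above, using only the Go
standard library. It is not the code from this change: the names
parseStaleAfter, watchdog, markActivity and zombies are hypothetical, the
fallback to the default on invalid input is an assumption, and the real client
would feed markActivity from its paho message handler and rebuild any broker
that zombies returns.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
	"time"
)

// defaultStaleAfter mirrors the documented default of 300 seconds.
const defaultStaleAfter = 300 * time.Second

// parseStaleAfter converts the raw MQTT_STALE_AFTER_SECONDS value into a
// duration. Falling back to the default for empty, non-numeric, or
// non-positive input is an assumption of this sketch.
func parseStaleAfter(raw string) time.Duration {
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return defaultStaleAfter
	}
	return time.Duration(secs) * time.Second
}

// watchdog records when each broker (by index) last delivered a message.
type watchdog struct {
	mu           sync.Mutex
	staleAfter   time.Duration
	lastActivity []time.Time
}

// newWatchdog seeds every broker with the current time so that a freshly
// connected broker gets a full staleAfter grace period.
func newWatchdog(brokers int, staleAfter time.Duration) *watchdog {
	w := &watchdog{staleAfter: staleAfter, lastActivity: make([]time.Time, brokers)}
	now := time.Now()
	for i := range w.lastActivity {
		w.lastActivity[i] = now
	}
	return w
}

// markActivity would be called from the message handler of broker idx.
func (w *watchdog) markActivity(idx int, at time.Time) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.lastActivity[idx] = at
}

// zombies returns the indices of brokers that report connected but have been
// silent for longer than staleAfter. Disconnected brokers are skipped, since
// transient drops are left to auto-reconnect.
func (w *watchdog) zombies(connected []bool, now time.Time) []int {
	w.mu.Lock()
	defer w.mu.Unlock()
	var out []int
	for i, last := range w.lastActivity {
		if i < len(connected) && connected[i] && now.Sub(last) > w.staleAfter {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	start := time.Now()
	w := newWatchdog(2, parseStaleAfter(""))
	w.markActivity(1, start.Add(4*time.Minute))
	// Six minutes in, broker 0 has been silent for about 360s and broker 1
	// for 120s, so only broker 0 is reported.
	fmt.Println(w.zombies([]bool{true, true}, start.Add(6*time.Minute))) // prints [0]
}
```

Keeping the check a pure function of the connected flags and a supplied
timestamp makes it straightforward to unit test without a live broker.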

Adds unit tests for the watchdog and env parsing; documents the new env var.

Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:55:33 -04:00