Files
MarekWo 422e7a3b34 feat(watchdog): catch sluggish-device failures via soft-pattern counting
The container watchdog only restarted on three legacy "device clearly dead"
log lines, so today's failure mode (firmware briefly stalls and get_stats_*
/ get_battery commands time out with an empty error while passive RX
keeps working) never tripped it — leaving the user with 10-15 s freezes
several times a day and no automatic recovery.

DeviceManager now tracks two liveness signals:
- _last_rx_at, bumped on every RX_LOG_DATA event
- _consecutive_stats_failures, incremented on get_stats_* / get_bat
  exceptions and cleared on success

New /health/strict endpoint exposes these to the watchdog. It returns 503
when the device is connected but has 5+ consecutive stats failures, or
when no RX event has been seen for over 5 minutes on a serial transport.
The cheap /health endpoint keeps its lenient behavior so Docker's
healthcheck doesn't suddenly start tripping.

The watchdog's check_device_unresponsive() gains a "soft" pattern class
with a count threshold of 5 in the last 2 minutes — matching against
get_stats_core/radio/packets failed:, Failed to get battery:, and
Failed to get channel. Hard patterns still trigger on a single hit.

Deploy note: the watchdog runs as a host-level systemd service and is
NOT restarted by mcupdate, so after deploy run:
  sudo systemctl restart mc-webui-watchdog.service

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 09:43:43 +02:00
..