docs: cover analyzer settings, vacuum/optimize, path apply, watchdog soft patterns

User-guide: new Settings > Analyzer tab (custom analyzer services with default/disabled toggles and {packetHash} placeholder), apply-path upload button in DM Path Management, Backup modal's Optimize button + live size label, console change_path now accepts arrow/whitespace separators with consistent multi-byte chunk length and "path" output shows hop count and byte size.

Architecture: new /api/analyzers CRUD + default endpoints, /api/db/size and the split /api/db/vacuum kickoff + /api/db/vacuum/status polling (worker-thread VACUUM to survive proxy idle timeouts), /api/contacts/<key>/paths/<id>/apply, /health and /health/strict top-level routes, analyzers table and direct_messages.delivery_path_hash_size column, recombined path_len byte storage.

DeviceManager: per-send channel-secret refresh, liveness telemetry (_last_rx_at + _consecutive_stats_failures), TCP self-heal via _liveness_watcher_loop + in-place reconnect.

Retention scheduler: on-by-default 90/90/60/30, post-cleanup VACUUM at >=1000 deletions, app-context wrapping, archiver emoji-name fallback.

Socket.IO clients forced to polling transport.

Watchdog: documented hard- vs soft-pattern detection (5 hits in 2 min for sluggish get_stats / get_battery failures), pointer to /health/strict, and the systemd-restart deploy note for scripts/watchdog/ changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-07-05 17:31:39 +02:00 · 2026-06-08 11:53:41 +02:00
parent 7e9ff2e3aa
commit 53ef2759d5
3 changed files with 83 additions and 4 deletions
@@ -78,6 +78,9 @@ The `DeviceManager` handles the connection to the MeshCore device via a direct s
 - **BLE keepalive & reconnect** - When using Bluetooth transport, a 60s keepalive loop detects "zombie" connections (reads still succeed but writes silently fail). On disconnect or keepalive failure, the manager marks the session as permanently failed and the `/health` endpoint returns 503, letting the Docker healthcheck trigger a fast container restart (~5s) to get a clean BLE state rather than attempting unreliable in-process reconnects
 - **Echo correlation** - Sent channel messages pre-compute their expected `pkt_payload` using the channel secret and send timestamp (±3s for clock drift), so incoming echoes are matched exactly instead of only by 1-byte channel hash (prevents misattribution when two messages go out simultaneously on the same channel)
 - **Per-channel region scope** - Before each channel send, the channel's mapped region scope key (16 bytes) is pushed to the firmware via `CMD_SET_FLOOD_SCOPE_KEY` (54). The scope-set + send pair is serialised under a `_send_lock` so concurrent sends on different channels can't swap each other's scope. Channels without a mapping get an all-zero key so a previously-set scope doesn't leak across channels
+- **Per-send channel-secret refresh** - Channel indices on the device compact down after a deletion, so the boot-time `_load_channel_secrets()` cache can drift. `send_channel_message` calls `_refresh_channel_secret(idx)` first (one extra `get_channel(idx)` round-trip) to fetch the current secret straight from firmware, update the in-memory cache and DB if they had drifted, and use it for the `pkt_payload` echo correlation
+- **Liveness telemetry** - Tracks `_last_rx_at` (bumped on every `RX_LOG_DATA` event) and `_consecutive_stats_failures` (incremented on `get_stats_*` / `get_bat` exceptions, cleared on success). Surfaced via `/health/strict` for the external watchdog
+- **TCP self-heal** - A `_liveness_watcher_loop` task on the DM event loop calls `force_reconnect()` when no RX event has arrived for `HEALTH_STRICT_MAX_RX_STALE_SEC` (5 min). `send_channel_message` also detects empty-string `concurrent.futures.TimeoutError` from `set_flood_scope_key` (the symptom of a degraded long-lived TCP) and runs an in-place reconnect + one retry before failing. A 30 s cooldown and `_reconnect_lock` prevent churn; `_intentional_disconnect` keeps the DISCONNECTED handler from racing the reconnect
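
The liveness bookkeeping and the reconnect decision described in the last two bullets can be sketched as below. This is a minimal illustration, not the `DeviceManager` code: the `Liveness` class, its method names, and the two constants are hypothetical stand-ins, with the 5 min staleness limit and 30 s cooldown taken from the text.

```python
from dataclasses import dataclass

MAX_RX_STALE_SEC = 300        # stands in for HEALTH_STRICT_MAX_RX_STALE_SEC (5 min)
RECONNECT_COOLDOWN_SEC = 30   # cooldown quoted in the text


@dataclass
class Liveness:
    """Hypothetical holder for the counters the text calls
    _last_rx_at and _consecutive_stats_failures."""
    last_rx_at: float
    last_reconnect_at: float = float("-inf")
    consecutive_stats_failures: int = 0

    def on_rx(self, now: float) -> None:
        # Bumped on every received RX event.
        self.last_rx_at = now

    def on_stats_result(self, ok: bool) -> None:
        # Cleared on success, incremented on failure.
        self.consecutive_stats_failures = (
            0 if ok else self.consecutive_stats_failures + 1
        )

    def should_reconnect(self, now: float) -> bool:
        # Reconnect only when RX is stale AND the cooldown has elapsed.
        stale = (now - self.last_rx_at) > MAX_RX_STALE_SEC
        cooled_down = (now - self.last_reconnect_at) >= RECONNECT_COOLDOWN_SEC
        return stale and cooled_down

    def mark_reconnected(self, now: float) -> None:
        self.last_reconnect_at = now
```

A watcher loop would call `should_reconnect()` periodically and, when it returns true, perform the reconnect and then call `mark_reconnected()`.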

 ---

@@ -138,9 +141,22 @@ Key tables:
 - `regions` - User-curated MeshCore flood scopes (`name`, `key_hex`, `is_default`)
 - `channel_scopes` - Per-channel region mapping (`channel_idx` → `region_id`, CASCADE on region delete; absent row = no override → firmware default applies)
 - `read_status` - Per-channel read counters and favorites (`is_favorite` column; used to pin channels in the sidebar/dropdown sort order)
+- `analyzers` - User-configured MeshCore Analyzer services (`name`, `url_template` with `{packetHash}` placeholder, `is_default`, `is_disabled`; partial unique index enforces a single default)
+
+`direct_messages` gained a `delivery_path_hash_size` column (auto-migrated, defaults to 1) so reloaded DM bubbles render multi-byte routes correctly. The `path_len` column on `channel_messages`, `direct_messages`, and `paths` now stores the raw firmware byte (masked hop count plus path_hash_mode in the upper bits), recombined at write time via `pack_path_len()`; the API endpoints decode it back into `path_hash_size` on read.
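
To make the recombined byte concrete, here is a minimal sketch of packing and unpacking. The bit layout is an assumption (hop count in the lower 6 bits, `path_hash_mode` in the upper 2 bits, with hash size equal to mode + 1), and the signatures are illustrative; only the name `pack_path_len()` comes from the text.

```python
HOP_COUNT_MASK = 0x3F  # assumed: lower 6 bits hold the hop count
MODE_SHIFT = 6         # assumed: upper 2 bits hold path_hash_mode


def pack_path_len(hop_count: int, path_hash_size: int) -> int:
    """Combine hop count and hash size into one raw byte (assumed layout)."""
    if not 0 <= hop_count <= HOP_COUNT_MASK:
        raise ValueError("hop_count out of range")
    if path_hash_size not in (1, 2, 3):
        raise ValueError("path_hash_size must be 1, 2, or 3")
    mode = path_hash_size - 1  # assumed: mode 0 means 1-byte hashes
    return (mode << MODE_SHIFT) | hop_count


def unpack_path_len(raw: int) -> tuple[int, int]:
    """Return (hop_count, path_hash_size) from the raw byte."""
    return raw & HOP_COUNT_MASK, (raw >> MODE_SHIFT) + 1
```

Under these assumptions a 2-hop path with 2-byte hashes is stored as `0x42` and decodes back to `(2, 2)`.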

 The use of SQLite allows for fast queries, reliable data storage, full-text search, and complex filtering (such as contact ignoring/blocking) without the risk of file corruption inherent to flat JSON files.

+### Retention scheduler
+
+Retention is enabled by default with `90 / 90 / 60 / 30` days for `channel_messages / direct_messages / advertisements / diagnostics`. The job runs daily at 03:30 local time (`TZ` from `.env`). `cleanup_old_messages()` also deletes from `echoes`, `paths`, and `acks` (the diagnostic tables, historically the bulk of DB size). When at least 1000 rows are removed in a pass, the scheduler immediately runs `VACUUM` to reclaim file space, because a SQLite `DELETE` only marks pages free.
+
+The retention/cleanup scheduler runs APScheduler jobs in worker threads, so each job is decorated with `@_with_app_context` and the Flask app is passed in via `set_flask_app()`; the `init_*_schedule()` callers also wrap themselves in `app.app_context()` so the boot-time read of `current_app.db` doesn't blow up with "Working outside of application context".
+
+The archiver builds the `.msgs` path from `device_name`, but the `meshcore` library strips non-ASCII when writing the file (so a device renamed to include an emoji breaks the strict path match). The archiver now falls back to globbing the data directory for a single non-archive `.msgs` file when the expected path is missing — mirroring `migrate_v1`.
+
+The channels API reads from the `channels` DB table rather than iterating device slots. `_load_channel_secrets()` syncs the table on every device connect (and prunes stale rows), `set_channel()` / `remove_channel()` update it synchronously with the device, and `_refresh_channel_secret()` refreshes the affected row before each send. This makes `/api/channels` a single sub-millisecond `SELECT` that is unaffected by device responsiveness, which fixes the original symptom: only "Public" showing up after a refresh when the device briefly stalled.
+
 ---

 ## API Reference
@@ -188,6 +204,7 @@ The use of SQLite allows for fast queries, reliable data storage, full-text sear
 | PUT | `/api/contacts/<key>/paths/<id>` | Update path (star, label) |
 | DELETE | `/api/contacts/<key>/paths/<id>` | Delete path |
 | POST | `/api/contacts/<key>/paths/reorder` | Reorder paths |
+| POST | `/api/contacts/<key>/paths/<id>/apply` | Push a configured path to the firmware as the active route (mirrors `change_path`); invalidates the contacts cache |
 | POST | `/api/contacts/<key>/paths/reset_flood` | Reset to FLOOD routing |
 | POST | `/api/contacts/<key>/paths/clear` | Clear all paths |
 | GET | `/api/contacts/<key>/no_auto_flood` | Get "Keep path" flag |
@@ -219,6 +236,21 @@ The use of SQLite allows for fast queries, reliable data storage, full-text sear
 | POST | `/api/regions/<id>/default` | Mark default in DB AND push to firmware (CMD_SET_DEFAULT_FLOOD_SCOPE = 63, requires firmware v1.15+) |
 | DELETE | `/api/regions/default` | Clear default region in DB and on firmware |

+The `PUT /api/channels/<index>/scope` endpoint accepts any `index` in `[0, device_manager._max_channels)` (40 on current firmwares; falls back to 8 if the DM is unreachable).
+
+### Analyzers
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/analyzers` | List configured analyzer services |
+| POST | `/api/analyzers` | Create analyzer (`{name, url_template}`); template must contain `{packetHash}` |
+| PUT | `/api/analyzers/<id>` | Update analyzer (name / url / is_disabled) |
+| DELETE | `/api/analyzers/<id>` | Delete analyzer |
+| POST | `/api/analyzers/<id>/default` | Mark as default (enforced single-default via partial unique index) |
+| DELETE | `/api/analyzers/default` | Clear the default analyzer |
+
+The backend no longer ships a pre-built `analyzer_url` per message — channel-message payloads include `packet_hash` instead, and the frontend substitutes `{packetHash}` in the chosen URL template at click time.
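
The validation and click-time substitution can be sketched as follows. The real substitution happens in the frontend; this Python version only illustrates the rule, and the function names and the `analyzer.example` URL are hypothetical.

```python
from urllib.parse import quote

PLACEHOLDER = "{packetHash}"


def validate_url_template(url_template: str) -> str:
    """Reject templates that lack the required placeholder."""
    if PLACEHOLDER not in url_template:
        raise ValueError(f"url_template must contain {PLACEHOLDER}")
    return url_template


def build_analyzer_url(url_template: str, packet_hash: str) -> str:
    """Substitute the packet hash into a validated template."""
    template = validate_url_template(url_template)
    # Percent-encode the hash so it is safe inside a URL.
    return template.replace(PLACEHOLDER, quote(packet_hash, safe=""))


# Hypothetical template; any URL containing {packetHash} works the same way.
url = build_analyzer_url("https://analyzer.example/packets?hash={packetHash}", "AB12CD")
```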
+
 ### Direct Messages

 | Method | Endpoint | Description |
@@ -259,6 +291,18 @@ The use of SQLite allows for fast queries, reliable data storage, full-text sear
 | GET | `/api/backup/list` | List database backups |
 | POST | `/api/backup/create` | Create database backup |
 | GET | `/api/backup/download` | Download backup file |
+| GET | `/api/db/size` | Current DB file size (bytes) |
+| POST | `/api/db/vacuum` | Kick off SQLite `VACUUM` in a worker thread. Returns 202 immediately; 409 if already running. Kickoff is deliberately split from status polling so reverse proxies with ~30 s idle timeouts can't kill the request mid-rewrite |
+| GET | `/api/db/vacuum/status` | Poll vacuum progress: `{running, elapsed_seconds, size_before, size_after}` |
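
The kickoff/poll split can be sketched without any web framework. The `VacuumJob` class below is a hypothetical stand-in: `start()` returns the status code the kickoff endpoint would send, and `status()` returns the fields listed for the polling endpoint.

```python
import os
import sqlite3
import threading
import time


class VacuumJob:
    """Runs VACUUM in a worker thread and exposes pollable state."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._lock = threading.Lock()
        self._thread = None
        self._started_at = 0.0
        self._state = {"running": False, "elapsed_seconds": 0.0,
                       "size_before": None, "size_after": None}

    def start(self) -> int:
        """Return 202 when the job was started, 409 when one is already running."""
        with self._lock:
            if self._state["running"]:
                return 409
            self._started_at = time.monotonic()
            self._state = {"running": True, "elapsed_seconds": 0.0,
                           "size_before": os.path.getsize(self.db_path),
                           "size_after": None}
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
        return 202

    def _run(self) -> None:
        conn = sqlite3.connect(self.db_path)  # connection owned by this thread
        try:
            conn.execute("VACUUM")
        finally:
            conn.close()
            with self._lock:
                self._state["running"] = False
                self._state["elapsed_seconds"] = time.monotonic() - self._started_at
                self._state["size_after"] = os.path.getsize(self.db_path)

    def status(self) -> dict:
        """Snapshot of the job state for the polling endpoint."""
        with self._lock:
            snapshot = dict(self._state)
            if snapshot["running"]:
                snapshot["elapsed_seconds"] = time.monotonic() - self._started_at
        return snapshot
```

Because the HTTP request that starts the job returns right away, a proxy idle timeout can only interrupt a short status poll, never the rewrite itself.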
+
+### Health endpoints
+
+These are top-level routes (not under `/api/`), consumed by Docker's healthcheck and the host-level watchdog.
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/health` | Lenient liveness check. Returns 503 only when BLE reconnection has permanently failed (so Docker triggers a container restart to clear BLE state). Returns 200 otherwise |
+| GET | `/health/strict` | Strict device-health check for the external watchdog. JSON response. Returns 503 when (a) BLE permanently failed, (b) `_consecutive_stats_failures` ≥ 5, or (c) transport is serial/usb/tcp and no RX event for > `HEALTH_STRICT_MAX_RX_STALE_SEC` (5 min). Returns 200 with the same counters when healthy |
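
The three failure conditions of `/health/strict` reduce to a small decision function. The sketch below is illustrative only: the function name, argument names, reason strings, and response fields are hypothetical, while the thresholds come from the table.

```python
MAX_STATS_FAILURES = 5   # threshold (b) from the table
MAX_RX_STALE_SEC = 300   # stands in for HEALTH_STRICT_MAX_RX_STALE_SEC (5 min)
RX_CHECKED_TRANSPORTS = ("serial", "usb", "tcp")


def strict_health(transport: str, ble_failed: bool,
                  stats_failures: int, rx_age_sec: float):
    """Return (http_status, body) following conditions (a) to (c)."""
    reasons = []
    if ble_failed:
        reasons.append("ble_permanently_failed")
    if stats_failures >= MAX_STATS_FAILURES:
        reasons.append("stats_failures")
    if transport in RX_CHECKED_TRANSPORTS and rx_age_sec > MAX_RX_STALE_SEC:
        reasons.append("rx_stale")
    body = {
        "healthy": not reasons,
        "reasons": reasons,
        "consecutive_stats_failures": stats_failures,
        "rx_age_sec": rx_age_sec,
    }
    return (503 if reasons else 200), body
```

The same counters are returned in both the healthy and the unhealthy case, so a monitor can log trends before the threshold is crossed.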

 ### Other

@@ -284,6 +328,8 @@ The use of SQLite allows for fast queries, reliable data storage, full-text sear

 ## WebSocket API

+All Socket.IO clients (`/chat`, `/console`, `/logs`) are configured with `transports: ['polling']`. The Werkzeug dev server can't upgrade WebSockets, so every `io()` upgrade attempt previously returned HTTP 500 and clients fell into a polling/upgrade reconnect loop — visible as 10–15 s freezes on app load. Long-polling keeps real-time pushes working with ~1–2 s latency.
+
 ### Console Namespace (`/console`)

 Interactive console via Socket.IO WebSocket connection.
@@ -456,6 +456,7 @@ Configure message routing paths for individual contacts:
  - **Repeater picker** - Browse available repeaters by name or ID
  - **Map picker** - Select repeaters from a map view showing their GPS locations
  - **Import current path** - Import the path currently stored on the device
+- **Apply to device** (upload-arrow icon) - Push a configured path to the firmware as the active route without leaving the modal. The device-path line refreshes once the change is confirmed, mirroring the console's `change_path` command
 - **Reorder** - Drag paths to change priority (starred path is used first)
 - **Star** - Mark a preferred primary path (used first in retry rotation)
 - **Delete** - Remove individual paths
@@ -500,8 +501,8 @@ The console supports a comprehensive set of MeshCore commands organized into cat
 - `.pending_contacts` - List pending contacts
 - `add_pending <key>` - Approve pending contact
 - `remove_contact <name>` - Remove contact
-- `change_path <name> <path>` - Change contact's routing path. Accepts comma-separated hex bytes (`D1,90,05`), continuous hex (`D19005`), or space-separated bytes. Use the keyword `direct` to set a Direct (0-hop) path. Hash size is auto-detected from the chunk length. Use `reset_path <name>` to switch back to Flood
-- `path <name>` - Show the current path for a contact
+- `change_path <name> <path>` - Change contact's routing path. Accepts comma-, whitespace-, or arrow-separated hex chunks (`D1,90,05`, `D103 5E34`, `D1->90->05`) or continuous hex (`D19005`). For multi-byte paths all chunks must have a consistent length — that length determines the hash-size mode (1, 2, or 3 bytes per hop). Use the keyword `direct` to set a Direct (0-hop) path; use `reset_path <name>` to switch back to Flood
+- `path <name>` - Show the current path for a contact (e.g. `D103,5E34 (2 hops, 2B)` — hop count and byte size)
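
The accepted `change_path` input forms can be illustrated with a small parser. This is a sketch under assumptions, not the console's implementation: the function names are hypothetical, continuous hex is assumed to mean 1-byte hops, and the Unicode arrow `→` is accepted alongside `->` as a guess.

```python
import re

_SEPARATORS = re.compile(r"(?:->|→|,|\s)+")
_HEX = re.compile(r"[0-9A-Fa-f]+")


def parse_path(text: str):
    """Return (chunks, hash_size_bytes) for a change_path argument."""
    text = text.strip()
    if text.lower() == "direct":
        return [], 1  # Direct (0-hop) path
    chunks = [c for c in _SEPARATORS.split(text) if c]
    if not chunks:
        raise ValueError("empty path")
    if not all(_HEX.fullmatch(c) for c in chunks):
        raise ValueError("path must be hexadecimal")
    if len(chunks) == 1:
        # Assumption: continuous hex is split into 1-byte hops.
        blob = chunks[0]
        if len(blob) % 2:
            raise ValueError("odd number of hex digits")
        chunks = [blob[i:i + 2] for i in range(0, len(blob), 2)]
    lengths = {len(c) for c in chunks}
    if len(lengths) != 1:
        raise ValueError("all chunks must have the same length")
    digits = lengths.pop()
    if digits not in (2, 4, 6):
        raise ValueError("hash size must be 1, 2, or 3 bytes per hop")
    return [c.upper() for c in chunks], digits // 2


def format_path(chunks, hash_size: int) -> str:
    """Render a path in the style shown for the `path` command."""
    return f"{','.join(chunks)} ({len(chunks)} hops, {hash_size}B)"
```

With these rules `D103 5E34` parses as two 2-byte hops, and mixing chunk lengths such as `D1,9005` is rejected.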

 **Device & Channel Management:**
 - `infos` / `ver` - Device info / firmware version
@@ -592,7 +593,7 @@ Access the Settings modal to configure application behavior:
 1. Click the menu icon (☰) in the navbar (or tap the gear FAB button)
 2. Select "Settings" from the menu

-The modal is organized into tabs: **Device**, **Messages**, **Group Chat**, **Interface**, **Appearance**, **Contacts**, **Regions**, and **Notifications**. A global **Close** button at the bottom of the modal dismisses Settings from any tab.
+The modal is organized into tabs: **Device**, **Messages**, **Group Chat**, **Interface**, **Appearance**, **Contacts**, **Regions**, **Analyzer**, and **Notifications**. A global **Close** button at the bottom of the modal dismisses Settings from any tab.

 ### Device Tab

@@ -676,6 +677,22 @@ Manage MeshCore region scopes (also called flood scopes). See [Region Scopes](#r
 - Pick **None** to clear the firmware default
 - Delete regions you no longer need (channels using a deleted region revert to "no scope")

+### Analyzer Tab
+
+Configure MeshCore Analyzer services used by the chart icon under each group-chat message. The icon resolves at click time depending on what you configure here:
+
+- **No custom analyzers (or all disabled)** → opens the built-in Letsmesh analyzer
+- **One default analyzer set** → opens that service directly
+- **Multiple enabled analyzers, no default** → opens a chooser modal
+
+Each row supports:
+
+- **Star toggle** — mark this analyzer as the default. Only one default is allowed
+- **Enabled switch** — temporarily disable a service without deleting it
+- **Edit / Delete** buttons
+
+When adding or editing, the URL template must contain the placeholder `{packetHash}` — it is substituted with the message's packet hash at click time.
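
For readers who prefer code, the click-time resolution described in this section can be sketched as a decision function. The real logic lives in the frontend; the function name and return values here are hypothetical, and two cases the text does not cover are filled with labeled assumptions.

```python
def resolve_analyzer_action(analyzers):
    """Decide what the chart icon does, given the configured analyzers.

    Each analyzer is a dict with `is_default` and `is_disabled` keys.
    Returns ("builtin", None), ("open", analyzer), or ("choose", [analyzers]).
    """
    enabled = [a for a in analyzers if not a.get("is_disabled")]
    if not enabled:
        return "builtin", None  # no custom analyzers, or all disabled
    # Assumption: a default that is disabled is ignored.
    default = next((a for a in enabled if a.get("is_default")), None)
    if default is not None:
        return "open", default
    if len(enabled) == 1:
        # Assumption: a single enabled analyzer opens directly.
        return "open", enabled[0]
    return "choose", enabled
```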
+
 ### Notifications Tab

 Enable or disable browser push notifications for new messages received while the app is hidden or in the background.
@@ -710,6 +727,8 @@ Create and manage database backups:
 - **Create backup** - Creates a timestamped copy of the current database
 - **List backups** - View all available backups with timestamps and file sizes
 - **Download** - Download any backup file to your local machine
+- **Current size** - Live label showing the active DB file size
+- **Optimize now** - Run `VACUUM` on demand to reclaim free pages left behind by the retention job. The kickoff returns immediately and the UI polls for completion; a toast reports `freed X bytes in Y s` when done. A concurrent request returns HTTP 409. A nightly `VACUUM` already runs automatically when the retention job deletes 1000+ rows, so use this only when you want to reclaim space before the next 03:30 run

 Backups are stored in the `./data/` directory alongside the main database.

@@ -5,7 +5,7 @@ The Container Watchdog is a systemd service that monitors the `mc-webui` Docker
 ## Features

 - **Health monitoring** - Checks container status every 30 seconds
-- **Log monitoring** - Monitors `mc-webui` logs for specific "unresponsive LoRa device" errors
+- **Log monitoring** - Two pattern classes (see [Failure detection](#failure-detection))
 - **Automatic restart** - Restarts the container when issues are detected
 - **Auto-start stopped container** - Starts the container if it has stopped (configurable)
 - **Hardware USB reset** - Performs a low-level USB bus reset (unbind/bind or DTR/RTS) if the LoRa device freezes. *Note: USB reset is automatically skipped if a TCP connection is used.*
@@ -13,6 +13,20 @@ The Container Watchdog is a systemd service that monitors the `mc-webui` Docker
 - **HTTP status endpoint** - Query watchdog status via HTTP API
 - **Restart history** - Tracks all automatic restarts with timestamps

+## Failure detection
+
+`check_device_unresponsive()` scans the last 2 minutes of container logs against two pattern classes:
+
+- **Hard patterns** — any single occurrence triggers a restart. These are the long-standing "device clearly dead" messages: `No response from meshcore node, disconnecting`, `Device connected but self_info is empty`, `Failed to connect after 10 attempts`.
+- **Soft patterns** — any of these failing **5 or more times in the last 2 minutes** triggers a restart. Catches the "sluggish but not dead" mode where the firmware briefly stalls on `get_stats_*` / `get_battery` commands (empty-string `concurrent.futures.TimeoutError`) while passive RX still works: `get_stats_core failed:`, `get_stats_radio failed:`, `get_stats_packets failed:`, `Failed to get battery:`, `Failed to get channel`.
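
The two pattern classes can be sketched as a single pass over the recent log lines. This is an illustration, not the body of `check_device_unresponsive()`: the function name is hypothetical, the input is assumed to be already limited to the last 2 minutes, and soft hits are assumed to be counted in total across all soft patterns.

```python
HARD_PATTERNS = (
    "No response from meshcore node, disconnecting",
    "Device connected but self_info is empty",
    "Failed to connect after 10 attempts",
)
SOFT_PATTERNS = (
    "get_stats_core failed:",
    "get_stats_radio failed:",
    "get_stats_packets failed:",
    "Failed to get battery:",
    "Failed to get channel",
)
SOFT_THRESHOLD = 5  # from the text: 5 or more hits in the 2-minute window


def device_looks_unresponsive(recent_log_lines) -> bool:
    """Return True when the container should be restarted."""
    soft_hits = 0
    for line in recent_log_lines:
        if any(p in line for p in HARD_PATTERNS):
            return True  # a single hard match is enough
        if any(p in line for p in SOFT_PATTERNS):
            soft_hits += 1
    return soft_hits >= SOFT_THRESHOLD
```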
+
+In parallel, the app exposes [`/health/strict`](architecture.md#health-endpoints) — a stricter device-health check that the watchdog (or any external monitor) can consume to react before the soft-pattern threshold is reached.
+
+> **Deploy note:** the watchdog runs as a host-level systemd service and is **not** restarted by `mcupdate`. After deploying changes to `scripts/watchdog/`, run:
+> ```bash
+> sudo systemctl restart mc-webui-watchdog.service
+> ```
+
 ## Installation

 ```bash