The packets-message parser typed every metric field as a string, so any
gateway that emits SNR, RSSI, len, payload_len, packet_type, score, or
duration as a bare JSON number (including negative and fractional values)
failed json.Unmarshal and dropped the entire packet — including the raw
bytes we need.
Introduce a flexString type that accepts either a JSON string or a bare
number (kept verbatim as text) and apply it to those numeric metric
fields. Add tests covering both the numeric and string encodings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
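
A minimal sketch of the flexString idea from the parser fix above, assuming
the type is a plain string with a custom UnmarshalJSON. The packet struct,
its field names, and the parsePacket helper are hypothetical and only there
to make the example runnable; the real parser may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// flexString accepts either a JSON string or a bare JSON number and keeps
// the value as text. Sketch only; not the actual type from the commit.
type flexString string

func (f *flexString) UnmarshalJSON(data []byte) error {
	data = bytes.TrimSpace(data)
	if len(data) == 0 || bytes.Equal(data, []byte("null")) {
		return nil // leave the zero value for null
	}
	if data[0] == '"' {
		var s string
		if err := json.Unmarshal(data, &s); err != nil {
			return err
		}
		*f = flexString(s)
		return nil
	}
	// Anything else must be a JSON number; json.Number keeps its text verbatim,
	// so negative and fractional values survive unchanged.
	var n json.Number
	if err := json.Unmarshal(data, &n); err != nil {
		return fmt.Errorf("flexString: want string or number, got %s", data)
	}
	*f = flexString(n.String())
	return nil
}

// packet is a hypothetical subset of the packets message.
type packet struct {
	SNR  flexString `json:"SNR"`
	RSSI flexString `json:"RSSI"`
	Raw  string     `json:"raw"`
}

// parsePacket is a hypothetical helper wrapping json.Unmarshal.
func parsePacket(msg string) (packet, error) {
	var p packet
	err := json.Unmarshal([]byte(msg), &p)
	return p, err
}

func main() {
	// One gateway sends a bare number, another sends a string; both decode.
	p, err := parsePacket(`{"SNR": -7.5, "RSSI": "-101", "raw": "0a0b"}`)
	fmt.Println(p.SNR, p.RSSI, p.Raw, err)
}
```

Values that are neither strings nor numbers (booleans, objects) still fail,
which keeps the type from silently accepting malformed metrics.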
fetchNodes() reads config.selectedRegion but its useCallback dep array
omitted it, so changing the region produced a stale closure that refetched
with the previous region — neighbor lines stayed on the old region until a
full page refresh. Add config.selectedRegion to the dep array so the fetch
picks up the new region immediately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
getAllNodeNeighbors() hard-defaulted to region='SEA' when no region was
selected, so the map only ever returned PNW (Seattle/Salish catch-all)
neighbor edges even though node dots render for all regions. Drop the
region filter entirely when no region or group is selected so edges show
for every region, still bounded by the viewport bbox, lastSeen, and node
types. The neighbor-edge MV is intra-region only, so this never produces
cross-region lines.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a sub-option under "Show all neighbors" that filters neighbor lines
to only direct (MQTT) connections, hiding multi-hop path connections.
Defaults to off. Also hides the path-traffic gradient legend rows when
the option is enabled, leaving the "MQTT connections" entry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Node search now scans/merges meshcore_adverts_latest once and fans each row
out to matching query branches via arrayJoin + a window function, instead of
one UNION ALL branch per query each re-merging the full AggregatingMergeTree.
Chat message queries push the region filter onto the raw table's stored
`region` column before the GROUP BY (instead of a post-aggregation arrayExists
over origin_path_info), eliminating the multi-second full-table scans.
Grafana: rewrite the nodes-over-time panel to drop the rolling self-join, and
fix the messages-by-channel-hash panel which referenced a nonexistent
`regions` column (ClickHouse error code 47).
Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MeshCore path_length byte encodes hop_count in its low 6 bits and a
hash_size_code in its high 2 bits (0/1/2 -> 1/2/3 bytes per hop), and
transport route types (TRANSPORT_FLOOD/TRANSPORT_DIRECT) insert a 4-byte
transport_codes field between the header and path_length. The packet
decoder previously assumed every hop was a single byte and that
path_length always sat at byte 2, so it only handled 1-byte-hash,
non-transport packets; anything else decoded to an over-long path and an
empty payload.
Migration 007 reworks the meshcore_packets read-time aliases to honor the
transport_codes offset and compute the path as hop_count * hash_size
bytes, and exposes hop_count / hash_size_code / hash_size (bytes per hop)
as columns. payload, path and packet_hash now decode correctly for every
route type and hash size; the adverts and public-channel derived tables
are rebuilt from the corrected decode (invalid hash_size_code 3 packets
are skipped per spec).
hash_size is carried through the chat and advert APIs so the path
visualization splits a path into hops of the correct width
(pathUtils/PathVisualization), instead of always slicing one byte per hop.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
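
The decode rule above can be sketched in Go as follows. This is an
illustration, not the migration's SQL: it assumes a 1-byte header, and it
takes the "has transport codes" decision as a parameter because the route
type encoding is not given here. decodePath is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

// decodePath splits a raw packet into its hops and payload.
// Assumptions: the header is 1 byte, and the caller already knows whether
// the route type is a transport one (4-byte transport_codes present).
func decodePath(pkt []byte, transport bool) (hops [][]byte, payload []byte, err error) {
	off := 1 // skip the assumed 1-byte header
	if transport {
		off += 4 // transport_codes sits between header and path_length
	}
	if len(pkt) <= off {
		return nil, nil, errors.New("packet too short for path_length")
	}
	b := pkt[off]
	hopCount := int(b & 0x3F) // low 6 bits
	code := int(b >> 6)       // high 2 bits
	if code == 3 {
		return nil, nil, errors.New("invalid hash_size_code 3")
	}
	hashSize := code + 1 // 0/1/2 -> 1/2/3 bytes per hop
	start := off + 1
	end := start + hopCount*hashSize
	if end > len(pkt) {
		return nil, nil, errors.New("path runs past end of packet")
	}
	for i := start; i < end; i += hashSize {
		hops = append(hops, pkt[i:i+hashSize])
	}
	return hops, pkt[end:], nil
}

func main() {
	// path_length 0x42: hash_size_code 1 (2 bytes per hop), hop_count 2.
	pkt := []byte{0x00, 0x42, 0xAA, 0xBB, 0xCC, 0xDD, 0x01, 0x02}
	hops, payload, err := decodePath(pkt, false)
	fmt.Printf("%x %x %v\n", hops, payload, err)
}
```

A decoder that assumed one byte per hop would have read only AA and BB as
the path here and treated CC DD as the start of the payload.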
Operators can opt a node out of the public-facing surfaces by putting an
opt-out emoji in the node name. Hidden nodes are removed from the map,
neighbor edges/lists, search, and their own detail page (server-side
ClickHouse filters), and chat messages from a hidden sender are dropped
client-side after decryption. Matching keys on the base codepoint so the
variation-selector form (⛔️) is caught too.
Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
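
A small Go sketch of the base-codepoint match described above. The real
filters live in ClickHouse and in the client, so this is only an
illustration of the matching rule; isHidden is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// optOutBase is the base codepoint of the opt-out emoji (U+26D4). The
// variation-selector form is the same rune followed by U+FE0F.
const optOutBase = '\u26D4'

// isHidden reports whether a node name carries the opt-out emoji in either
// form. Matching the base rune alone covers both, because the variation
// selector is a separate trailing codepoint.
func isHidden(name string) bool {
	return strings.ContainsRune(name, optOutBase)
}

func main() {
	fmt.Println(isHidden("Repeater \u26D4"))       // bare form
	fmt.Println(isHidden("Repeater \u26D4\uFE0F")) // variation-selector form
	fmt.Println(isHidden("Repeater"))
}
```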
* ingest: batch ClickHouse inserts to stop MQTT flapping & packet loss
The meshcore handler did a synchronous per-message ClickHouse insert on
paho's single inbound goroutine. At ~86ms/insert (single-row inserts +
async_insert wait + materialized views) the goroutine couldn't keep up
with the high-volume letsmesh feed, so it stalled past PingTimeout and
paho declared "pingresp not received" and reconnected — ~847 cycles in
19.5h, ~45% downtime, ~50% of letsmesh packets lost. The low-volume
davekeogh broker never saturated the goroutine and was unaffected.
Decouple receipt from insertion: the handler now enqueues decoded rows
onto a buffered channel and a single background writer flushes them to
meshcore_packets in batched native inserts (every
MESHCORE_BATCH_FLUSH_SECONDS or MESHCORE_BATCH_MAX_ROWS rows). The inbound
goroutine never blocks, so PINGRESP is always processed in time.
- New batch writer with env-configurable flush interval / max rows /
buffer size (MESHCORE_BATCH_*), wired in docker-compose.
- Drop server-side async_insert (redundant once we batch app-side).
- Bump PingTimeout 10s -> 20s (env MQTT_PING_TIMEOUT_SECONDS) for margin
against Cloudflare WebSocket buffering jitter.
- Enqueue is non-blocking; rows are dropped+counted only if the buffer
fills (ClickHouse unavailable). A failed batch is dropped and retried
by the next flush (native blocks commit atomically).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
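
The receipt/insert decoupling above can be sketched as a buffered channel
plus one writer goroutine. This is a simplified illustration, not the
ingest code: row, batchWriter, and demoBatchSizes are hypothetical names,
the flush callback stands in for the batched ClickHouse insert, and a
failed batch is simply logged and discarded here.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// row stands in for a decoded packet row; the real struct is not shown.
type row struct{ raw string }

type batchWriter struct {
	rows    chan row
	maxRows int
	every   time.Duration
	flush   func([]row) error // stand-in for one batched insert
	dropped atomic.Int64
	done    chan struct{}
}

func newBatchWriter(buffer, maxRows int, every time.Duration, flush func([]row) error) *batchWriter {
	w := &batchWriter{
		rows: make(chan row, buffer), maxRows: maxRows, every: every,
		flush: flush, done: make(chan struct{}),
	}
	go w.run()
	return w
}

// enqueue never blocks the caller: if the buffer is full the row is
// counted as dropped instead of stalling the MQTT inbound goroutine.
func (w *batchWriter) enqueue(r row) bool {
	select {
	case w.rows <- r:
		return true
	default:
		w.dropped.Add(1)
		return false
	}
}

// run flushes when the batch reaches maxRows or when the ticker fires.
func (w *batchWriter) run() {
	defer close(w.done)
	tick := time.NewTicker(w.every)
	defer tick.Stop()
	batch := make([]row, 0, w.maxRows)
	send := func() {
		if len(batch) == 0 {
			return
		}
		if err := w.flush(batch); err != nil {
			fmt.Println("flush failed:", err) // simplification: batch is discarded
		}
		batch = batch[:0]
	}
	for {
		select {
		case r, ok := <-w.rows:
			if !ok {
				send() // final flush on shutdown
				return
			}
			batch = append(batch, r)
			if len(batch) >= w.maxRows {
				send()
			}
		case <-tick.C:
			send()
		}
	}
}

// close stops accepting rows and waits for the final flush.
func (w *batchWriter) close() {
	close(w.rows)
	<-w.done
}

// demoBatchSizes enqueues n rows and returns the size of each flushed batch.
func demoBatchSizes(n, maxRows int) []int {
	var sizes []int
	w := newBatchWriter(n, maxRows, time.Hour, func(b []row) error {
		sizes = append(sizes, len(b))
		return nil
	})
	for i := 0; i < n; i++ {
		w.enqueue(row{raw: fmt.Sprint(i)})
	}
	w.close()
	return sizes
}

func main() {
	fmt.Println(demoBatchSizes(7, 3)) // [3 3 1]
}
```

The key property is that enqueue uses select with a default case, so the
handler returns immediately even when the writer is slow or ClickHouse is
unavailable.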
* ingest: make MQTT KeepAlive configurable (MQTT_KEEPALIVE_SECONDS)
As a near-silent subscriber, paho emits a PINGREQ roughly every KeepAlive
seconds; lowering it sends client->server frames more often to keep the
Cloudflare-proxied WebSocket path warm in both directions, a lever for the
residual mid-stream "pingresp not received" stalls on the letsmesh broker.
Default unchanged (30s); wired through docker-compose.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ingest: add configurable MQTT write timeout (MQTT_WRITE_TIMEOUT_SECONDS)
Bounds PINGREQ/SUBSCRIBE writes so a stalled write through the Cloudflare
WebSocket proxy can't hang the client. Default 0 (paho's existing no-timeout
behavior); wired through docker-compose. Recommended ~20s when behind a
buffering reverse proxy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the Location History coordinate table with an embedded Leaflet
map (new LocationHistoryMap component): one circle marker per position
with timestamp/coord popups, the most-recent point highlighted, and a
subtle polyline connecting points chronologically, auto-fit to bounds.
Cap Recent Adverts at the 5 most recent with a "Show more"/"Show less"
toggle and a count in the subtitle.
Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The map page slowed the browser at high node counts (~5k) because every
marker eagerly rendered its popover to an HTML string at creation, and the
visual-update effects re-rendered every marker's icon and popup on each
selection change. Hovering a meshcore node (the default type) re-rendered
all markers.
Bind popups lazily so PopupContent is only rendered when a popup actually
opens, drop the now-unnecessary popup setContent calls, and re-skin only the
markers whose selected state changed instead of the whole set.
Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the per-request argMax/GROUP-BY views with insert-triggered
(incremental) materialized views so map node positions, node search, and
public-channel chat read pre-aggregated state instead of re-scanning all of
meshcore_packets on every query.
- 005: meshcore_adverts_latest_state (AggregatingMergeTree of argMaxState/
min/maxState) + incremental MV + backfill; meshcore_adverts_latest becomes a
-Merge view with the identical column contract. Node search reads it directly;
map (unified_latest_nodeinfo) is unchanged.
- 006: meshcore_public_channel_messages_raw, a decoded payload_type=5 MergeTree
keyed (channel_hash, ingest_timestamp); chat dedups by message_id at read time
over a timestamp-bounded scan. Streaming/pagination push channel+cursor onto
the primary key.
- Neighbor-edge MVs stay hourly REFRESH (they read the preserved view).
Verified against full prod data (14.5M rows): exact parity (0 mismatches) and
5-9x faster reads with no regressions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-curated region groups with data-driven ones and make the
region_groups ClickHouse table the single source of truth.
- scripts/generate-region-groups.ts: offline generator — clusters regions by
cross-region packet co-occurrence (min-share single-linkage) at two levels
(broad "region" + tight "metro"), names clusters via `claude -p`, reconciles
codes by member overlap so permalinks stay stable, and emits the region_groups
seed. Migration 004 reseeded with the resulting 39 groups.
- Groups are DB-sourced: getRegionGroups() (cached) feeds /api/regions and the
dropdown/labels; filtering resolves a selector in SQL to a region
(region = 'X') or a group (region IN / hasAny ... SELECT region_code FROM
region_groups WHERE group_code = ...). No hardcoded membership in TS;
resolveSelector removed.
- Drop the TS<->SQL parity script (no membership left to sync); regionSql and
the migration ALIAS are kept in sync by hand.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace hardcoded (broker, topic) region slugs with uppercase IATA codes
derived from the meshcore/{IATA} base topic, discovered dynamically from
data (adding a region needs no code change). Adds region groups, Grafana
region/group filtering, and fixes the neighbor graph.
- regions.ts: single source of truth — regionFromTopic / normalizeRegion /
regionSql / resolveSelector / selectorLabel. Legacy slugs (seattle->SEA)
and bare meshcore + meshcore/salish -> SEA still resolve.
- regionGroups.ts + seeded region_groups table: PNW/CAL/DEU/POL.
- migration 004: region ALIAS column on meshcore_packets; 001 views expose
region / regions[]; reworked neighbor MV (region-scoped, no cross-region
edges, drops implausible >150km and (0,0) edges); scheduled meshcore_regions MV.
- API/streaming/actions resolve selectors; stream routes drop the hardcoded
region allow-lists; map node query excludes (0,0) sentinel nodes.
- Dynamic region/group dropdowns (useRegions/RegionSelect); /api/regions.
- Grafana: cascading $region / $region_group template vars + panel filters.
- region-parity.ts (npm run check:regions) guards TS<->SQL drift.
- nix dev shell (flake.nix, Node 24).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
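
A Go sketch of the topic-to-region rule described above. The real
implementation is regionFromTopic in regions.ts; this port is only an
illustration. The legacy map contents beyond seattle and salish, and the
handling of MQTT wildcards, are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// legacySlugs maps pre-IATA names to their region code. Only the two
// entries named in the commit message are included.
var legacySlugs = map[string]string{
	"seattle": "SEA",
	"salish":  "SEA",
}

// regionFromTopic derives the region from a meshcore/{IATA}/... topic.
// It returns "" for topics outside the meshcore tree.
func regionFromTopic(topic string) string {
	parts := strings.Split(topic, "/")
	if parts[0] != "meshcore" {
		return ""
	}
	// Bare "meshcore" (or a wildcard right after it, an assumption here)
	// falls back to SEA, matching the legacy behavior.
	if len(parts) == 1 || parts[1] == "" || parts[1] == "#" || parts[1] == "+" {
		return "SEA"
	}
	seg := parts[1]
	if code, ok := legacySlugs[strings.ToLower(seg)]; ok {
		return code
	}
	return strings.ToUpper(seg)
}

func main() {
	for _, t := range []string{"meshcore/pdx/abc/packets", "meshcore", "meshcore/salish", "other/x"} {
		fmt.Printf("%q -> %q\n", t, regionFromTopic(t))
	}
}
```

Because the code is derived from the topic itself, a new region shows up
as soon as a broker publishes under a new meshcore/{IATA} prefix.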
The letsmesh broker was migrated behind Cloudflare and changed its topic
layout on 2026-06-02, which left prod's MQTT client in a zombie state:
connected per paho's IsConnected() (so the 30s monitor never rebuilt it) but
receiving zero messages, because the subscription was established only once
after the initial connect and never re-applied on paho auto-reconnects. Result:
12 days of silently missing letsmesh ingestion while davekeogh masked the loss.
Make reconnection robust instead of relying on broker-side session persistence:
- Subscribe inside the OnConnect handler so every (re)connect — including paho
auto-reconnects — restores delivery. Use CleanSession(true)+ResumeSubs(false)
so we never depend on the broker remembering our session.
- Add a per-broker data-staleness watchdog: a broker that reports connected but
delivers no messages for MQTT_STALE_AFTER_SECONDS (default 300) is treated as a
zombie and force-rebuilt (disconnect + fresh connect/subscribe). This catches
exactly the failure IsConnected() misses.
- Reduce the external monitor to that watchdog role; transient drops are left to
paho auto-reconnect rather than racing it with a brand-new client.
- Stable per-broker client IDs (by index) and pre-sized MQTTClients slice so
indices stay aligned when an earlier broker fails; guard BrokerStatus/lastActivity
with a mutex; promote connect/subscribe logs to Info for visibility.
Adds unit tests for the watchdog and env parsing; documents the new env var.
Co-authored-by: Alex Vanderpot <alex@Alexs-MacBook-Pro-2.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
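
The staleness check above comes down to "connected, but no message for
longer than the threshold". A minimal sketch, assuming a mutex-guarded
lastActivity timestamp per broker; brokerWatch and zombieAfter are
hypothetical names, and the paho wiring (OnConnect subscribe, forced
rebuild) is left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// brokerWatch tracks what the watchdog needs for one broker.
type brokerWatch struct {
	mu           sync.Mutex
	connected    bool
	lastActivity time.Time
}

// onConnect resets the activity clock so a fresh connection gets a full
// staleness window before it can be judged.
func (b *brokerWatch) onConnect(now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.connected = true
	b.lastActivity = now
}

func (b *brokerWatch) onDisconnect() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.connected = false
}

// onMessage is called for every delivered message.
func (b *brokerWatch) onMessage(now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lastActivity = now
}

// isZombie is true only for the case IsConnected() cannot see: the client
// reports connected but nothing has arrived for staleAfter. Disconnected
// brokers are left to auto-reconnect and are never reported here.
func (b *brokerWatch) isZombie(now time.Time, staleAfter time.Duration) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.connected && now.Sub(b.lastActivity) >= staleAfter
}

// zombieAfter simulates a broker whose last message arrived idleSeconds
// ago, using the default threshold of 300 seconds.
func zombieAfter(idleSeconds int, connected bool) bool {
	const staleAfter = 300 * time.Second
	start := time.Unix(0, 0)
	var b brokerWatch
	b.onConnect(start)
	b.onMessage(start)
	if !connected {
		b.onDisconnect()
	}
	return b.isZombie(start.Add(time.Duration(idleSeconds)*time.Second), staleAfter)
}

func main() {
	fmt.Println(zombieAfter(299, true), zombieAfter(300, true), zombieAfter(900, false))
}
```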
Fixes a 25.x global memory-tracker drift where the tracker pinned at the
max-memory cap (RSS far below it), causing the OvercommitTracker to kill every
query (map/stats/neighbors all 500ing). Deployed in-place on prod over the
existing data dir after a cold backup.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The /api/chat endpoint queried the meshcore_public_channel_messages
VIEW, which does GROUP BY payload over all payload_type=5 packets.
Filtering its output on ingest_timestamp/channel_hash can't push below
the GROUP BY (they're max(...)/derived-from-grouped-payload), so every
call re-aggregated the entire history (~8M rows / 1.2 GiB / ~700ms),
ignoring the ingest_timestamp primary key.
Replace the view reference with an inline subquery
(publicChannelMessagesSubquery) that pushes the time/channel filters
into the inner meshcore_packets scan, so partition + primary-key
pruning applies. Region filtering stays on the outer query since
origin_path_info only exists post-aggregation. Same change to the chat
streaming poller.
Verified on prod: identical output, 8.06M->114K rows read, ~700ms->28ms.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Point the seattle region at the letsmesh broker (wss://mqtt-us-v1.letsmesh.net:443,
topic meshcore/SEA) where Seattle traffic now lives.
- Fix a pre-existing bug in the path-edge extraction: `path` is a hex string of
1-byte hop prefixes, so use substring(path, 2*i-1, 2) instead of
hex(substring(path, i, 1)) (which re-hexed a single hex char and never matched
the 2-char repeater prefixes -> path edges were always empty). Seattle now yields
path edges again.
Verified on a full prod snapshot: the MV-backed "show all neighbors" query drops
from ~1.6s / 145M rows / 11.8 GiB to ~1ms / 108 rows / 3.8 KiB.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
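
The path-edge bug above is easier to see outside SQL. The Go sketch below
mirrors both expressions with the same 1-based index i; the function names
are hypothetical, and the loop bounds are assumptions since the query's
range is not shown.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// hopPrefixes mirrors substring(path, 2*i-1, 2): path is already a hex
// string, so each 1-byte hop prefix is two characters wide.
func hopPrefixes(path string) []string {
	hops := make([]string, 0, len(path)/2)
	for i := 1; 2*i <= len(path); i++ {
		hops = append(hops, path[2*i-2:2*i])
	}
	return hops
}

// buggyHopPrefixes mirrors hex(substring(path, i, 1)): it takes a single
// hex character and hex-encodes its ASCII code, so the result never equals
// a real 2-character repeater prefix.
func buggyHopPrefixes(path string) []string {
	hops := make([]string, 0, len(path))
	for i := 1; i <= len(path); i++ {
		hops = append(hops, strings.ToUpper(hex.EncodeToString([]byte(path[i-1:i]))))
	}
	return hops
}

func main() {
	fmt.Println(hopPrefixes("a1b2c3"))    // [a1 b2 c3]
	fmt.Println(buggyHopPrefixes("a1b2")) // [61 31 62 32]
}
```

The buggy form turns the character "a" into "61" (its ASCII code in hex),
which is why the join against repeater prefixes always came back empty.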
The two slow neighbor queries now read precomputed tables maintained by
refreshable materialized views (REFRESH EVERY 1 HOUR), instead of
re-aggregating meshcore_packets per request:
- meshcore_all_neighbor_edges: the global per-region edge graph (direct path_len=0
adverts + repeater-prefix path edges) with endpoint details. getAllNodeNeighbors
now filters it by region + bbox + lastSeen + has_location.
- meshcore_node_direct_neighbors: per-node direct adjacency (both directions) with
neighbor details. getMeshcoreNodeNeighbors now filters it by node_public_key.
Also add the meshcore/SEA topic to the seattle region. Validated on a clean local
stack: migration 001->003 applies, both refreshable MVs create + refresh + populate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The readonly profile's max_rows_to_read / max_bytes_to_read (500MB) is exceeded by
the map/stats views, which scan the full (growing) meshcore_packets table -> the web
app failed with TOO_MANY_BYTES. Remove the read-size caps; readonly=1, allow_ddl=0,
max_memory_usage and max_execution_time remain the guardrails.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ClickHouse's internal diagnostics grew unbounded (text_log at Trace level and the
1s query profiler -> trace_log accumulated ~160G over months). Add short TTLs to
all system *_log tables, cap text_log at warning level, and disable the query
profiler in both profiles so trace_log stays empty.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pin the bundled service images to known versions for reproducible releases and
safe in-place reuse of an existing data dir (matching the production deployment).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the MeshCore dashboard (exported from prod) as a provisioned file
dashboard, with a file provider config. Pin the ClickHouse datasource
uid to "clickhouse" so the dashboard's panel datasource references
resolve at provision time.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bundle Grafana (127.0.0.1:3000) with the grafana-clickhouse-datasource plugin
and an auto-provisioned ClickHouse datasource using the read-only user. Adds
GRAFANA_ADMIN_PASSWORD to .env.example. Verified: datasource health returns OK.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Module is now github.com/ajvpot/meshexplorer/ingest (the code lives under
ingest/ in the meshexplorer repo), updated from the old standalone
clickhouse-meshingest path. build/vet/test pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Add a deploy-focused root README; update the web app README (meshcore-only,
point Docker usage at the unified root compose).
- Fix the migration runner: set the goose clickhouse dialect (it defaulted to
postgres and failed to create its version table). Migrations now apply cleanly.
- Remove the unused meshcore decrypt UDF (meshcore_try_decrypt was never called
by any view/query/code) and simplify the ClickHouse image to a single stage.
Verified end-to-end: `docker compose up` brings up clickhouse -> migrate ->
meshcoreingest + meshexplorer; live ingestion from the real MQTT brokers lands
packets in ClickHouse and the web API serves decoded meshcore nodes via the
readonly user.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Single root compose brings up the whole stack on one internal network:
clickhouse (healthchecked) -> migrate (one-shot) -> meshcoreingest + meshexplorer,
with the discord-bot behind a "bot" profile. Web app/bot connect as the readonly
ClickHouse user; ingest/migrate use the default user. Named volume replaces the
host /tank path. .env.example documents every variable with placeholders; root
.gitignore keeps real .env out of git. Drops the per-project compose files.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Meshtastic existed only as a UI filter (there is no meshtastic data backend). Drop it as a
node type/option, and simplify the map marker/cluster/popup rendering now that
every node is meshcore. Update product copy to MeshCore-only. The nodeTypes
query plumbing stays (the unified view's type is always 'meshcore').
Production build passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Vendor the ingest service under ingest/ and move the web app under meshexplorer/.
The ingest builds the meshcoreingest daemon and the goose migration runner,
applies the meshcore ClickHouse schema (packets, adverts, unified node view),
and loads its MQTT broker list and ClickHouse settings entirely from environment
variables (MQTT_BROKERS as a JSON array, CLICKHOUSE_*). No credentials are baked
into the source.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>