mirror of
https://github.com/rightup/pyMC_Repeater.git
synced 2026-05-03 20:22:15 +02:00
Add self._tx_lock (asyncio.Lock) to RepeaterHandler and acquire it inside delayed_send after the per-packet sleep completes. Problem 1 — radio interleave: concurrent delayed_send coroutines (one per queued packet) could both exit their sleep at nearly the same moment and call dispatcher.send_packet simultaneously, interleaving SPI/serial register writes to the half-duplex LoRa radio. Problem 2 — TOCTOU gap: the upfront can_transmit() check in __call__ and the record_tx() call in delayed_send are separated by the entire TX delay (up to several seconds). Under burst conditions two tasks both pass the check before either has recorded its airtime, causing both to transmit and the duty-cycle budget to be exceeded. Fix: acquire _tx_lock after the sleep so delay timers still run concurrently (matching firmware behaviour), then immediately re-check can_transmit() inside the lock before sending. Because only one task holds the lock at a time, airtime state is stable; check and record_tx() are effectively atomic — no TOCTOU window. Airtime is recorded only on a successful send, so a radio failure never inflates the budget. Also move `import random` from inside _calculate_tx_delay to module level (stdlib imports belong at the top; the lazy-import pattern is unnecessary here). Docs: docs/pr_tx_serialization.md — problem statement, root-cause analysis, alternative approaches considered, invariant table, full unit + field test plan, and proof of correctness for the asyncio.Lock approach. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
396 lines
16 KiB
Markdown
396 lines
16 KiB
Markdown
# PR: Serialise Radio TX and Close Duty-Cycle TOCTOU Race
|
||
|
||
**Branch:** `fix/tx-serialization`
|
||
**Base:** `rightup/fix-perfom-speed`
|
||
**Files changed:** `repeater/engine.py` (1 file, ~30 lines net)
|
||
|
||
---
|
||
|
||
## Problem
|
||
|
||
Two separate bugs share the same root cause: concurrent `delayed_send` coroutines
|
||
racing each other at transmission time.
|
||
|
||
### Bug 1 — Interleaved SPI/serial commands to the radio
|
||
|
||
The queue loop (added in an earlier commit) dispatches each incoming packet as an
|
||
`asyncio.create_task`, so multiple `delayed_send` coroutines can have their sleep
|
||
timers running concurrently. That is correct and intentional — it mirrors how
|
||
firmware nodes use a hardware timer so the radio keeps listening during a TX delay.
|
||
|
||
However the LoRa radio is **half-duplex**: it can only transmit one packet at a
|
||
time. When two delay timers expire at nearly the same moment both coroutines call
|
||
`dispatcher.send_packet` simultaneously. `send_packet` issues a sequence of
|
||
SPI/serial register writes to the radio; two tasks interleaving these writes
|
||
produces undefined radio state and the transmission of neither packet is reliable.
|
||
|
||
### Bug 2 — TOCTOU gap in duty-cycle enforcement
|
||
|
||
`__call__` calls `can_transmit()` before scheduling a task:
|
||
|
||
```python
|
||
# __call__ (before this fix)
|
||
can_tx, wait_time = self.airtime_mgr.can_transmit(airtime_ms)
|
||
if not can_tx:
|
||
... # drop or defer
|
||
tx_task = await self.schedule_retransmit(fwd_pkt, delay, airtime_ms, ...)
|
||
```
|
||
|
||
`record_tx()` is only called later, inside `delayed_send`, after the sleep
|
||
completes. Between the check and the debit there is a window that spans the
|
||
entire TX delay (up to several seconds). Two packets that both pass the check
|
||
before either has slept and recorded its airtime will **both** be transmitted even
|
||
if transmitting both would exceed the duty-cycle budget.
|
||
|
||
Under normal single-packet conditions this window is harmless. Under burst
|
||
conditions — multi-hop amplification, collision retries, or a busy mesh segment
|
||
where several packets arrive within the same delay window — multiple tasks pass
|
||
the advisory check simultaneously, and the duty-cycle limit is exceeded.
|
||
|
||
---
|
||
|
||
## Root Cause
|
||
|
||
There is no mutual exclusion around the radio send path. Each `delayed_send`
|
||
coroutine independently checks duty-cycle, sleeps, and transmits without
|
||
coordinating with any other concurrent coroutine doing the same thing.
|
||
|
||
---
|
||
|
||
## Solution
|
||
|
||
Add `self._tx_lock = asyncio.Lock()` (initialised in `__init__`) and acquire it
|
||
inside `delayed_send` **after** the sleep completes:
|
||
|
||
```
|
||
Delay timers run concurrently (unchanged):
|
||
Task A: sleep(1.2s) ──────────────────► acquire _tx_lock → check → TX A → release
|
||
Task B: sleep(0.9s) ──────────────────► acquire _tx_lock (waits) ──────────► check → TX B → release
|
||
Task C: sleep(2.1s) ────────────────────────────────────────────────────────────────► ...
|
||
|
||
Radio: one packet at a time, duty-cycle state always stable inside the lock.
|
||
```
|
||
|
||
Inside the lock, a **second** `can_transmit()` call is made immediately before
|
||
sending. Because only one task holds the lock at a time, airtime state is stable
|
||
at this point and `record_tx()` follows on success — check and debit are
|
||
effectively atomic. This closes the TOCTOU window completely.
|
||
|
||
The upfront `can_transmit()` in `__call__` is retained as an **advisory** fast
|
||
path: it still drops or defers packets that are obviously over budget before a
|
||
delay task is even scheduled, avoiding unnecessary sleep timers. It is no longer
|
||
the enforcement point.
|
||
|
||
---
|
||
|
||
## Why This Is the Right Approach
|
||
|
||
### Alternative A — Move `record_tx()` before the sleep
|
||
|
||
```python
|
||
# hypothetical
|
||
self.airtime_mgr.record_tx(airtime_ms) # reserve before sleeping
|
||
await asyncio.sleep(delay)
|
||
await self.dispatcher.send_packet(...) # actual TX
|
||
```
|
||
|
||
Records airtime even if the send fails (exception, LBT busy, radio error) —
|
||
the budget is debited for a packet that was never transmitted. Over time this
|
||
inflates the apparent airtime, causing the node to throttle legitimate traffic
|
||
it actually has budget for. Requires a compensating `release_airtime()` on
|
||
every failure path, creating new complexity and failure modes.
|
||
|
||
### Alternative B — A single global advisory check (status quo before this PR)
|
||
|
||
Already demonstrated to fail under burst conditions (two tasks both pass before
|
||
either records its airtime).
|
||
|
||
### Alternative C — asyncio.Lock (this PR)
|
||
|
||
- Delay timers remain concurrent — no regression on the primary non-blocking TX
|
||
improvement.
|
||
- The check-and-debit pair is atomic within the lock — no TOCTOU window.
|
||
- No phantom airtime on send failure — `record_tx()` is only called on success.
|
||
- One `asyncio.Lock` object, no new state machines or compensating paths.
|
||
- The lock is `async`, so it only blocks other TX tasks, not the event loop or
|
||
the packet RX queue.
|
||
|
||
### Why `asyncio.Lock` rather than `threading.Lock`
|
||
|
||
The entire repeater runs on a single asyncio event loop. `asyncio.Lock` only
|
||
yields at `await` points; it does not involve OS threads or context switches.
|
||
A `threading.Lock` would work but is semantically wrong here (this is not a
|
||
thread-safety problem) and would block the event loop thread if held across an
|
||
`await`.
|
||
|
||
---
|
||
|
||
## Changes
|
||
|
||
### `repeater/engine.py`
|
||
|
||
**1. Move `import random` to module level**
|
||
|
||
```python
|
||
# before (inside _calculate_tx_delay):
|
||
def _calculate_tx_delay(self, packet, snr=0.0):
|
||
import random
|
||
...
|
||
|
||
# after (top of file, with other stdlib imports):
|
||
import random
|
||
```
|
||
|
||
This is a housekeeping fix bundled with this PR because `random` is a stdlib
|
||
module that should never be imported inside a hot-path function — Python caches
|
||
the import after the first call, but the attribute lookup and cache check still
|
||
run on every call. Moving it to module level is the standard pattern.
|
||
|
||
**2. Add `self._tx_lock` to `__init__`**
|
||
|
||
```python
|
||
# Serialise all radio TX calls.
|
||
#
|
||
# Background: since the queue loop dispatches each packet as an
|
||
# asyncio.create_task, multiple _route_packet coroutines can have their
|
||
# TX delay timers running concurrently — which is the intended behaviour
|
||
# (firmware nodes do the same with a hardware timer). However, the
|
||
# LoRa radio is half-duplex: it can only transmit one packet at a time.
|
||
# Without serialisation, two tasks whose delay timers expire near-
|
||
# simultaneously both call dispatcher.send_packet, interleaving SPI/serial
|
||
# commands to the radio and both passing the LBT check before either has
|
||
# actually transmitted.
|
||
#
|
||
# _tx_lock is acquired after each delay sleep and held for the entire
|
||
# send_packet call. Delays still run concurrently; only the radio
|
||
# access is serialised. This also eliminates the TOCTOU gap in duty-cycle
|
||
# enforcement — see schedule_retransmit / delayed_send for details.
|
||
self._tx_lock = asyncio.Lock()
|
||
```
|
||
|
||
**3. Acquire lock inside `delayed_send`, add authoritative duty-cycle gate**
|
||
|
||
```python
|
||
async def delayed_send():
|
||
await asyncio.sleep(delay)
|
||
|
||
# Acquire the TX lock *after* the delay so that delay timers for
|
||
# multiple packets still run concurrently (matching firmware). Only
|
||
# one coroutine enters the radio send path at a time.
|
||
async with self._tx_lock:
|
||
# ── Authoritative duty-cycle gate ─────────────────────────────
|
||
# The upfront can_transmit() call in __call__ is advisory: it
|
||
# avoids scheduling packets that are obviously over budget, but
|
||
# it cannot prevent a race between two tasks whose delay timers
|
||
# expire at almost the same moment. Both tasks pass the advisory
|
||
# check before either has recorded its airtime, then both try to
|
||
# transmit.
|
||
#
|
||
# Inside _tx_lock only one task runs at a time, so airtime state
|
||
# is stable here. The check and the subsequent record_tx() are
|
||
# effectively atomic — no TOCTOU window.
|
||
if airtime_ms > 0:
|
||
can_tx_now, _ = self.airtime_mgr.can_transmit(airtime_ms)
|
||
if not can_tx_now:
|
||
logger.warning(
|
||
"Packet dropped at TX time: duty-cycle exceeded "
|
||
"(airtime=%.1fms)", airtime_ms,
|
||
)
|
||
return
|
||
|
||
last_error = None
|
||
for attempt in range(2 if local_transmission else 1):
|
||
try:
|
||
await self.dispatcher.send_packet(fwd_pkt, wait_for_ack=False)
|
||
self._record_packet_sent(fwd_pkt)
|
||
if airtime_ms > 0:
|
||
self.airtime_mgr.record_tx(airtime_ms)
|
||
...
|
||
```
|
||
|
||
---
|
||
|
||
## Invariants Maintained
|
||
|
||
| Property | Before | After |
|
||
|----------|--------|-------|
|
||
| Delay timers run concurrently | ✅ | ✅ |
|
||
| Radio accessed by one task at a time | ❌ | ✅ |
|
||
| Duty-cycle check and debit atomic | ❌ | ✅ |
|
||
| Airtime recorded only on TX success | ✅ | ✅ |
|
||
| Event loop not blocked by lock | ✅ | ✅ (asyncio.Lock) |
|
||
|
||
---
|
||
|
||
## Test Plan
|
||
|
||
### Unit tests (can run without hardware)
|
||
|
||
**T1 — Serial TX ordering**
|
||
|
||
```python
|
||
import asyncio
|
||
from unittest.mock import AsyncMock, MagicMock, patch
|
||
|
||
async def test_tx_serialized():
|
||
"""Two tasks whose delays expire simultaneously must not interleave."""
|
||
send_order = []
|
||
send_lock = asyncio.Lock()
|
||
|
||
async def mock_send(pkt, **kw):
|
||
# Confirm the _tx_lock is already held when we enter send_packet
|
||
assert send_lock.locked(), "send_packet called without _tx_lock held"
|
||
send_order.append(pkt)
|
||
await asyncio.sleep(0) # yield; a second task must not enter here
|
||
|
||
engine._tx_lock = send_lock # replace with the mock lock reference
|
||
engine.dispatcher.send_packet = mock_send
|
||
|
||
t1 = asyncio.create_task(engine.schedule_retransmit(pkt_a, delay=0.01, airtime_ms=100))
|
||
t2 = asyncio.create_task(engine.schedule_retransmit(pkt_b, delay=0.01, airtime_ms=100))
|
||
await asyncio.gather(t1, t2)
|
||
|
||
assert len(send_order) == 2 # both transmitted
|
||
assert send_order[0] is not send_order[1] # different packets
|
||
```
|
||
|
||
**T2 — Authoritative duty-cycle gate blocks over-budget second packet**
|
||
|
||
```python
|
||
async def test_second_packet_dropped_when_over_budget():
|
||
"""When first TX fills the budget, second task must be dropped inside the lock."""
|
||
# Set a tiny budget: 50ms per minute
|
||
engine.airtime_mgr.max_airtime_per_minute = 50
|
||
|
||
sent = []
|
||
async def mock_send(pkt, **kw):
|
||
sent.append(pkt)
|
||
|
||
engine.dispatcher.send_packet = mock_send
|
||
|
||
# Each packet costs ~111ms (SF8, BW125, 30-byte payload) — first passes, second must not
|
||
t1 = asyncio.create_task(engine.schedule_retransmit(pkt_a, delay=0.01, airtime_ms=111))
|
||
t2 = asyncio.create_task(engine.schedule_retransmit(pkt_b, delay=0.01, airtime_ms=111))
|
||
await asyncio.gather(t1, t2)
|
||
|
||
assert len(sent) == 1, f"Expected 1 TX, got {len(sent)}"
|
||
```
|
||
|
||
**T3 — Airtime not debited on TX failure**
|
||
|
||
```python
|
||
async def test_airtime_not_recorded_on_send_failure():
|
||
before = engine.airtime_mgr.total_airtime_ms
|
||
|
||
async def failing_send(pkt, **kw):
|
||
raise RuntimeError("radio error")
|
||
|
||
engine.dispatcher.send_packet = failing_send
|
||
|
||
with pytest.raises(RuntimeError):
|
||
await engine.schedule_retransmit(pkt, delay=0, airtime_ms=100)
|
||
|
||
assert engine.airtime_mgr.total_airtime_ms == before, \
|
||
"Airtime must not be recorded when send raises"
|
||
```
|
||
|
||
**T4 — Advisory check still drops before scheduling (fast path not regressed)**
|
||
|
||
```python
|
||
async def test_advisory_check_still_drops_obvious_overage():
|
||
"""__call__ should not even schedule a task when clearly over budget."""
|
||
engine.airtime_mgr.max_airtime_per_minute = 0 # budget exhausted
|
||
|
||
tasks_created = []
|
||
original = asyncio.create_task
|
||
asyncio.create_task = lambda coro: tasks_created.append(coro) or original(coro)
|
||
|
||
await engine(over_budget_packet, metadata={})
|
||
|
||
assert not tasks_created, "No task should be created when advisory check fails"
|
||
```
|
||
|
||
### Integration / field tests (with hardware)
|
||
|
||
**T5 — Burst scenario: 5 packets arrive within the same delay window**
|
||
|
||
1. Connect the repeater to a radio.
|
||
2. Using a second node, send 5 FLOOD packets in quick succession (< 100 ms apart)
|
||
with a low RSSI score so the repeater's delay is ~1–2 s for all of them.
|
||
3. Monitor the radio with a spectrum analyser or a third node running in monitor
|
||
mode.
|
||
|
||
**Expected (after this fix):**
|
||
- Transmissions are sequential — no overlapping on-air signals.
|
||
- `Retransmitted packet` log lines appear one after another, each with a non-zero
|
||
airtime value.
|
||
- No `Retransmit failed` errors in the log.
|
||
- Duty-cycle log shows airtime accumulating correctly.
|
||
|
||
**Expected (before this fix, to confirm the bug existed):**
|
||
- Occasional `Retransmit failed` errors under burst load.
|
||
- Airtime tracking diverging from actual on-air time (double-counted or missed).
|
||
|
||
**T6 — Duty-cycle enforcement under burst**
|
||
|
||
1. Set `max_airtime_per_minute` to a low value (e.g. 500 ms) in config.
|
||
2. Send 10 packets rapidly so the repeater tries to forward all 10.
|
||
3. Observe logs.
|
||
|
||
**Expected:**
|
||
- First N packets transmitted (total airtime ≤ 500 ms).
|
||
- Subsequent packets log `"Packet dropped at TX time: duty-cycle exceeded"` from
|
||
inside `delayed_send` (not just the advisory drop).
|
||
- `airtime_mgr.get_stats()["utilization_percent"]` reads ≤ 100%.
|
||
|
||
**T7 — Normal single-packet forwarding not regressed**
|
||
|
||
1. Send one packet every 5 seconds (well within duty-cycle budget).
|
||
2. Verify each packet is forwarded with correct airtime logged.
|
||
3. Verify no lock contention warnings in the log.
|
||
|
||
**T8 — Local TX retry path (local_transmission=True) still works**
|
||
|
||
1. Send a command that triggers a local transmission (e.g. a ping reply).
|
||
2. Briefly block the radio (simulate with a mock) so the first attempt fails.
|
||
3. Verify the retry fires after 1 s and the packet is eventually transmitted.
|
||
|
||
---
|
||
|
||
## Proof of Correctness
|
||
|
||
### Why `asyncio.Lock` is sufficient (no OS-level synchronisation needed)
|
||
|
||
Python's asyncio event loop is **single-threaded**. All coroutines share one
|
||
thread and only yield execution at `await` points. Between two consecutive
|
||
`await` calls in a coroutine, the event loop does not switch to another coroutine.
|
||
|
||
`asyncio.Lock.acquire()` suspends the current coroutine if the lock is held,
|
||
returning control to the event loop. `asyncio.Lock.release()` wakes the next
|
||
waiter. Because `send_packet` is awaited inside the lock, no other TX task can
|
||
run until the current one releases the lock and the event loop gets a chance to
|
||
schedule the next waiter.
|
||
|
||
There is no possibility of the race seen with `threading.Lock` where an OS thread
|
||
can be preempted mid-instruction.
|
||
|
||
### Why the advisory check in `__call__` cannot be removed
|
||
|
||
The advisory check is still necessary as a fast path. If it were removed, every
|
||
incoming packet — even when the node is clearly at 100% duty-cycle — would
|
||
schedule a `delayed_send` task that would sleep for the full TX delay (up to
|
||
several seconds) before the lock drops it. Under a sustained flood of incoming
|
||
packets this wastes memory and CPU. The advisory check prunes the queue early at
|
||
negligible cost.
|
||
|
||
### Why `record_tx()` must be inside the lock (not before or after)
|
||
|
||
- **Before the send:** records airtime for a packet that may never be transmitted
|
||
(send could fail, LBT could reject it). Budget is overcounted.
|
||
- **After releasing the lock:** a second task could pass the authoritative
|
||
`can_transmit()` check between `send_packet` returning and `record_tx()` being
|
||
called — the TOCTOU window reopens at a smaller scale.
|
||
- **Inside the lock, after a successful send:** the budget is debited exactly once
|
||
for exactly the packets that were actually transmitted. The lock ensures no
|
||
other task reads airtime state between the check and the debit.
|