# PR: Serialise Radio TX and Close Duty-Cycle TOCTOU Race

**Branch:** `fix/tx-serialization`
**Base:** `rightup/fix-perfom-speed`
**Files changed:** `repeater/engine.py` (1 file, ~30 lines net)

---

## Problem

Two separate bugs share the same root cause: concurrent `delayed_send` coroutines racing each other at transmission time.

### Bug 1 — Interleaved SPI/serial commands to the radio

The queue loop (added in an earlier commit) dispatches each incoming packet as an `asyncio.create_task`, so multiple `delayed_send` coroutines can have their sleep timers running concurrently. That is correct and intentional — it mirrors how firmware nodes use a hardware timer so the radio keeps listening during a TX delay.

However, the LoRa radio is **half-duplex**: it can only transmit one packet at a time. When two delay timers expire at nearly the same moment, both coroutines call `dispatcher.send_packet` simultaneously. `send_packet` issues a sequence of SPI/serial register writes to the radio; two tasks interleaving these writes leave the radio in an undefined state, and neither packet is reliably transmitted.

### Bug 2 — TOCTOU gap in duty-cycle enforcement

`__call__` calls `can_transmit()` before scheduling a task:

```python
# __call__ (before this fix)
can_tx, wait_time = self.airtime_mgr.can_transmit(airtime_ms)
if not can_tx:
    ...  # drop or defer
tx_task = await self.schedule_retransmit(fwd_pkt, delay, airtime_ms, ...)
```

`record_tx()` is only called later, inside `delayed_send`, after the sleep completes. Between the check and the debit there is a window that spans the entire TX delay (up to several seconds). Two packets that both pass the check before either has slept and recorded its airtime will **both** be transmitted, even if transmitting both would exceed the duty-cycle budget.

Under normal single-packet conditions this window is harmless. Under burst conditions — multi-hop amplification, collision retries, or a busy mesh segment where several packets arrive within the same delay window — multiple tasks pass the advisory check simultaneously and the duty-cycle limit is exceeded.

---

## Root Cause

There is no mutual exclusion around the radio send path. Each `delayed_send` coroutine independently checks the duty cycle, sleeps, and transmits without coordinating with any other concurrent coroutine doing the same thing.

---

## Solution

Add `self._tx_lock = asyncio.Lock()` (initialised in `__init__`) and acquire it inside `delayed_send` **after** the sleep completes:

```
Delay timers run concurrently (unchanged):

Task A: sleep(1.2s) ────────────► acquire _tx_lock (waits) ──► check → TX A → release
Task B: sleep(0.9s) ─────────► acquire _tx_lock → check → TX B → release
Task C: sleep(2.1s) ──────────────────────────────────────────► acquire _tx_lock → ...

Radio: one packet at a time, duty-cycle state always stable inside the lock.
```

Inside the lock, a **second** `can_transmit()` call is made immediately before sending. Because only one task holds the lock at a time, airtime state is stable at this point, and `record_tx()` follows on success — check and debit are effectively atomic. This closes the TOCTOU window completely.

The upfront `can_transmit()` in `__call__` is retained as an **advisory** fast path: it still drops or defers packets that are obviously over budget before a delay task is even scheduled, avoiding unnecessary sleep timers. It is no longer the enforcement point.
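To make the shape of the fix concrete, here is a minimal, runnable sketch of the pattern with toy stand-ins (`AirtimeBudget`, `radio_send`, and the packet names are inventions for illustration, not code from `engine.py`):

```python
import asyncio

class AirtimeBudget:
    """Toy stand-in for airtime_mgr: a flat budget in milliseconds."""

    def __init__(self, budget_ms: float):
        self.budget_ms = budget_ms
        self.used_ms = 0.0

    def can_transmit(self, airtime_ms: float) -> bool:
        return self.used_ms + airtime_ms <= self.budget_ms

    def record_tx(self, airtime_ms: float) -> None:
        self.used_ms += airtime_ms

budget = AirtimeBudget(budget_ms=150)
tx_lock = asyncio.Lock()

async def radio_send(name: str) -> None:
    # Stand-in for dispatcher.send_packet: must never run concurrently.
    print(f"TX {name}")
    await asyncio.sleep(0.05)  # simulated on-air time

async def delayed_send(name: str, delay: float, airtime_ms: float) -> None:
    await asyncio.sleep(delay)                   # timers run concurrently (unchanged)
    async with tx_lock:                          # radio access serialised
        if not budget.can_transmit(airtime_ms):  # authoritative re-check
            print(f"drop {name}: duty-cycle exceeded")
            return
        await radio_send(name)
        budget.record_tx(airtime_ms)             # debit only after a successful send

async def main():
    # Both packets pass the upfront advisory check, but only one fits the
    # 150 ms toy budget; the other is dropped by the re-check inside the lock.
    if budget.can_transmit(100):                 # advisory fast path, as in __call__
        await asyncio.gather(
            delayed_send("A", delay=0.01, airtime_ms=100),
            delayed_send("B", delay=0.01, airtime_ms=100),
        )

asyncio.run(main())
```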
---

## Why This Is the Right Approach

### Alternative A — Move `record_tx()` before the sleep

```python
# hypothetical
self.airtime_mgr.record_tx(airtime_ms)   # reserve before sleeping
await asyncio.sleep(delay)
await self.dispatcher.send_packet(...)   # actual TX
```

This records airtime even if the send fails (exception, LBT busy, radio error): the budget is debited for a packet that was never transmitted. Over time this inflates the apparent airtime, causing the node to throttle legitimate traffic it actually has budget for. It would also require a compensating `release_airtime()` on every failure path, creating new complexity and failure modes.

### Alternative B — A single advisory check in `__call__` (status quo before this PR)

Already demonstrated to fail under burst conditions: two tasks both pass the check before either records its airtime.

### Alternative C — `asyncio.Lock` (this PR)

- Delay timers remain concurrent — no regression on the primary non-blocking TX improvement.
- The check-and-debit pair is atomic within the lock — no TOCTOU window.
- No phantom airtime on send failure — `record_tx()` is only called on success.
- One `asyncio.Lock` object; no new state machines or compensating paths.
- The lock is `async`, so it only blocks other TX tasks, not the event loop or the packet RX queue.

### Why `asyncio.Lock` rather than `threading.Lock`

The entire repeater runs on a single asyncio event loop, and coroutines yield only at `await` points, so there are no OS threads or context switches to guard against. A `threading.Lock` would work but is semantically wrong here (this is not a thread-safety problem), and its blocking `acquire()` would stall the entire event loop whenever the lock is contended while the holder is suspended at an `await`.
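A self-contained illustration of that difference (a standalone sketch, assuming nothing from the repeater codebase): while one coroutine holds an `asyncio.Lock` across an `await`, other waiters suspend, but unrelated coroutines on the same loop keep running.

```python
import asyncio

lock = asyncio.Lock()

async def tx(name: str) -> None:
    async with lock:               # second caller suspends here without blocking the loop
        print(f"{name}: TX start")
        await asyncio.sleep(0.1)   # lock deliberately held across an await
        print(f"{name}: TX done")

async def rx_heartbeat() -> None:
    # Keeps printing while the lock is held: the event loop is never blocked.
    for _ in range(4):
        print("rx queue still being serviced")
        await asyncio.sleep(0.05)

async def main():
    await asyncio.gather(tx("A"), tx("B"), rx_heartbeat())

asyncio.run(main())
```

With a `threading.Lock` in the same position, the second `acquire()` would block the one and only loop thread, and `rx_heartbeat()` would freeze along with it.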
---

## Changes

### `repeater/engine.py`

**1. Move `import random` to module level**

```python
# before (inside _calculate_tx_delay):
def _calculate_tx_delay(self, packet, snr=0.0):
    import random
    ...

# after (top of file, with other stdlib imports):
import random
```

This is a housekeeping fix bundled with this PR because `random` is a stdlib module that should never be imported inside a hot-path function: Python caches the module after the first import, but the `sys.modules` lookup and name binding still run on every call. Moving it to module level is the standard pattern.

**2. Add `self._tx_lock` to `__init__`**

```python
# Serialise all radio TX calls.
#
# Background: since the queue loop dispatches each packet as an
# asyncio.create_task, multiple _route_packet coroutines can have their
# TX delay timers running concurrently — which is the intended behaviour
# (firmware nodes do the same with a hardware timer). However, the
# LoRa radio is half-duplex: it can only transmit one packet at a time.
# Without serialisation, two tasks whose delay timers expire near-
# simultaneously both call dispatcher.send_packet, interleaving SPI/serial
# commands to the radio and both passing the LBT check before either has
# actually transmitted.
#
# _tx_lock is acquired after each delay sleep and held for the entire
# send_packet call. Delays still run concurrently; only the radio
# access is serialised. This also eliminates the TOCTOU gap in duty-cycle
# enforcement — see schedule_retransmit / delayed_send for details.
self._tx_lock = asyncio.Lock()
```

**3. Acquire lock inside `delayed_send`, add authoritative duty-cycle gate**

```python
async def delayed_send():
    await asyncio.sleep(delay)

    # Acquire the TX lock *after* the delay so that delay timers for
    # multiple packets still run concurrently (matching firmware). Only
    # one coroutine enters the radio send path at a time.
    async with self._tx_lock:
        # ── Authoritative duty-cycle gate ─────────────────────────────
        # The upfront can_transmit() call in __call__ is advisory: it
        # avoids scheduling packets that are obviously over budget, but
        # it cannot prevent a race between two tasks whose delay timers
        # expire at almost the same moment. Both tasks pass the advisory
        # check before either has recorded its airtime, then both try to
        # transmit.
        #
        # Inside _tx_lock only one task runs at a time, so airtime state
        # is stable here. The check and the subsequent record_tx() are
        # effectively atomic — no TOCTOU window.
        if airtime_ms > 0:
            can_tx_now, _ = self.airtime_mgr.can_transmit(airtime_ms)
            if not can_tx_now:
                logger.warning(
                    "Packet dropped at TX time: duty-cycle exceeded "
                    "(airtime=%.1fms)",
                    airtime_ms,
                )
                return

        last_error = None
        for attempt in range(2 if local_transmission else 1):
            try:
                await self.dispatcher.send_packet(fwd_pkt, wait_for_ack=False)
                self._record_packet_sent(fwd_pkt)
                if airtime_ms > 0:
                    self.airtime_mgr.record_tx(airtime_ms)
                ...
```
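The gate only assumes this much of `airtime_mgr`: `can_transmit()` is a pure read and `record_tx()` is the debit, with atomicity supplied by the caller holding `_tx_lock`. Here is a toy sliding-window version of that contract (an illustration only, not the project's actual airtime manager; the real `can_transmit()` may differ, e.g. by tolerating a single overshoot):

```python
import time

class SlidingWindowAirtime:
    """Toy duty-cycle tracker: a budget per rolling 60 s window."""

    def __init__(self, max_airtime_per_minute: float):
        self.max_airtime_per_minute = max_airtime_per_minute
        self._tx_log: list[tuple[float, float]] = []  # (timestamp, airtime_ms)

    def _used_ms(self, now: float) -> float:
        # Drop entries older than the 60 s window, then sum what remains.
        self._tx_log = [(t, ms) for t, ms in self._tx_log if now - t < 60.0]
        return sum(ms for _, ms in self._tx_log)

    def can_transmit(self, airtime_ms: float) -> tuple[bool, float]:
        used = self._used_ms(time.monotonic())
        ok = used + airtime_ms <= self.max_airtime_per_minute
        wait_s = 0.0 if ok else 60.0  # crude upper bound on the wait
        return ok, wait_s

    def record_tx(self, airtime_ms: float) -> None:
        self._tx_log.append((time.monotonic(), airtime_ms))
```

Nothing in such a class is task-safe on its own: two tasks calling `can_transmit()` back to back both see the same `used` value, which is exactly why the check and the debit must sit inside `_tx_lock`.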
"Airtime must not be recorded when send raises" ``` **T4 — Advisory check still drops before scheduling (fast path not regressed)** ```python async def test_advisory_check_still_drops_obvious_overage(): """__call__ should not even schedule a task when clearly over budget.""" engine.airtime_mgr.max_airtime_per_minute = 0 # budget exhausted tasks_created = [] original = asyncio.create_task asyncio.create_task = lambda coro: tasks_created.append(coro) or original(coro) await engine(over_budget_packet, metadata={}) assert not tasks_created, "No task should be created when advisory check fails" ``` ### Integration / field tests (with hardware) **T5 — Burst scenario: 5 packets arrive within the same delay window** 1. Connect the repeater to a radio. 2. Using a second node, send 5 FLOOD packets in quick succession (< 100 ms apart) with a low RSSI score so the repeater's delay is ~1–2 s for all of them. 3. Monitor the radio with a spectrum analyser or a third node running in monitor mode. **Expected (after this fix):** - Transmissions are sequential — no overlapping on-air signals. - `Retransmitted packet` log lines appear one after another, each with a non-zero airtime value. - No `Retransmit failed` errors in the log. - Duty-cycle log shows airtime accumulating correctly. **Expected (before this fix, to confirm the bug existed):** - Occasional `Retransmit failed` errors under burst load. - Airtime tracking diverging from actual on-air time (double-counted or missed). **T6 — Duty-cycle enforcement under burst** 1. Set `max_airtime_per_minute` to a low value (e.g. 500 ms) in config. 2. Send 10 packets rapidly so the repeater tries to forward all 10. 3. Observe logs. **Expected:** - First N packets transmitted (total airtime ≤ 500 ms). - Subsequent packets log `"Packet dropped at TX time: duty-cycle exceeded"` from inside `delayed_send` (not just the advisory drop). - `airtime_mgr.get_stats()["utilization_percent"]` reads ≤ 100%. **T7 — Normal single-packet forwarding not regressed** 1. Send one packet every 5 seconds (well within duty-cycle budget). 2. Verify each packet is forwarded with correct airtime logged. 3. Verify no lock contention warnings in the log. **T8 — Local TX retry path (local_transmission=True) still works** 1. Send a command that triggers a local transmission (e.g. a ping reply). 2. Briefly block the radio (simulate with a mock) so the first attempt fails. 3. Verify the retry fires after 1 s and the packet is eventually transmitted. --- ## Proof of Correctness ### Why `asyncio.Lock` is sufficient (no OS-level synchronisation needed) Python's asyncio event loop is **single-threaded**. All coroutines share one thread and only yield execution at `await` points. Between two consecutive `await` calls in a coroutine, the event loop does not switch to another coroutine. `asyncio.Lock.acquire()` suspends the current coroutine if the lock is held, returning control to the event loop. `asyncio.Lock.release()` wakes the next waiter. Because `send_packet` is awaited inside the lock, no other TX task can run until the current one releases the lock and the event loop gets a chance to schedule the next waiter. There is no possibility of the race seen with `threading.Lock` where an OS thread can be preempted mid-instruction. ### Why the advisory check in `__call__` cannot be removed The advisory check is still necessary as a fast path. 
### Integration / field tests (with hardware)

**T5 — Burst scenario: 5 packets arrive within the same delay window**

1. Connect the repeater to a radio.
2. Using a second node, send 5 FLOOD packets in quick succession (< 100 ms apart) with a low RSSI score so the repeater's delay is ~1–2 s for all of them.
3. Monitor the radio with a spectrum analyser or a third node running in monitor mode.

**Expected (after this fix):**

- Transmissions are sequential — no overlapping on-air signals.
- `Retransmitted packet` log lines appear one after another, each with a non-zero airtime value.
- No `Retransmit failed` errors in the log.
- The duty-cycle log shows airtime accumulating correctly.

**Expected (before this fix, to confirm the bug existed):**

- Occasional `Retransmit failed` errors under burst load.
- Airtime tracking diverging from actual on-air time (double-counted or missed).

**T6 — Duty-cycle enforcement under burst**

1. Set `max_airtime_per_minute` to a low value (e.g. 500 ms) in config.
2. Send 10 packets rapidly so the repeater tries to forward all 10.
3. Observe the logs.

**Expected:**

- The first N packets are transmitted (total airtime ≤ 500 ms).
- Subsequent packets log `"Packet dropped at TX time: duty-cycle exceeded"` from inside `delayed_send` (not just the advisory drop).
- `airtime_mgr.get_stats()["utilization_percent"]` reads ≤ 100%.

**T7 — Normal single-packet forwarding not regressed**

1. Send one packet every 5 seconds (well within the duty-cycle budget).
2. Verify each packet is forwarded with the correct airtime logged.
3. Verify there are no lock-contention warnings in the log.

**T8 — Local TX retry path (`local_transmission=True`) still works**

1. Send a command that triggers a local transmission (e.g. a ping reply).
2. Briefly block the radio (simulate with a mock) so the first attempt fails.
3. Verify the retry fires after 1 s and the packet is eventually transmitted.

---

## Proof of Correctness

### Why `asyncio.Lock` is sufficient (no OS-level synchronisation needed)

Python's asyncio event loop is **single-threaded**. All coroutines share one thread and yield execution only at `await` points; between two consecutive `await`s, the event loop cannot switch to another coroutine. `asyncio.Lock.acquire()` suspends the current coroutine if the lock is held, returning control to the event loop; `asyncio.Lock.release()` wakes the next waiter. Because `send_packet` is awaited inside the lock, no other TX task can run until the current one releases the lock and the event loop schedules the next waiter. The preemption hazard that makes OS-level locks necessary for threads (a thread can be interrupted mid-operation) simply does not exist here.

### Why the advisory check in `__call__` cannot be removed

The advisory check is still necessary as a fast path. If it were removed, every incoming packet — even when the node is clearly at 100% duty cycle — would schedule a `delayed_send` task that sleeps for the full TX delay (up to several seconds) before the authoritative gate inside the lock drops it. Under a sustained flood of incoming packets this wastes memory and CPU. The advisory check prunes hopeless packets early at negligible cost.

### Why `record_tx()` must be inside the lock (not before or after)

- **Before the send:** airtime is recorded for a packet that may never be transmitted (the send could fail, LBT could reject it). The budget is overcounted.
- **After releasing the lock:** a second task could pass the authoritative `can_transmit()` check between `send_packet` returning and `record_tx()` being called — the TOCTOU window reopens at a smaller scale.
- **Inside the lock, after a successful send:** the budget is debited exactly once, for exactly the packets that were actually transmitted. The lock ensures no other task reads airtime state between the check and the debit.
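The second bullet can be made concrete with a toy simulation (hypothetical names, not `engine.py` code). If the debit happens after the lock is released, any suspension point between release and debit lets another task read stale airtime state, and both packets pass a 150 ms budget check:

```python
import asyncio

budget_ms = 150
used_ms = 0.0
lock = asyncio.Lock()

async def bad_delayed_send(name: str, airtime_ms: float) -> None:
    global used_ms
    async with lock:
        if used_ms + airtime_ms > budget_ms:
            print(f"drop {name}")
            return
        await asyncio.sleep(0)  # the TX itself
    # BUG: the lock is released before the debit. The sleep below stands in
    # for any suspension point between release and record_tx(); it lets the
    # other task run and read the stale used_ms value.
    await asyncio.sleep(0)
    used_ms += airtime_ms
    print(f"sent {name}, used={used_ms:.0f} ms")

async def main():
    await asyncio.gather(bad_delayed_send("A", 100), bad_delayed_send("B", 100))

asyncio.run(main())  # prints "sent A" and "sent B": 200 ms against a 150 ms budget
```

Moving the debit back inside the `async with lock:` block makes the second task see `used_ms == 100`, fail the check, and drop, restoring the invariant.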