# PR: Bounded In-Flight Task Counter + Simplified Route Task Management

**Branch:** `perf/in-flight-cap`  
**Base:** `rightup/fix-perfom-speed`  
**Files changed:** `repeater/packet_router.py` (1 file, ~33 lines net)

---

## Background

The queue loop dispatches each incoming packet as an `asyncio.create_task` so TX
delay timers run concurrently — this is correct behaviour.  The previous
implementation tracked these tasks in a `set[asyncio.Task]` (`_route_tasks`) for
two reasons:

1. **Error surfacing** — the done-callback read `task.result()` to log exceptions.
2. **Shutdown cancellation** — `stop()` cancelled and awaited all tasks in the set.

This PR replaces the set with a simple integer counter and tightens the companion
deduplication prune threshold.

---

## Problems

### Problem 1 — Unbounded task accumulation

LoRa airtime naturally limits steady-state throughput to a handful of in-flight
tasks at any time.  But burst arrivals can spike the count temporarily:

- **Multi-hop flood amplification**: a single source packet is forwarded by every
  repeater in range, each of which re-broadcasts it. A node at a mesh junction
  may receive 5–10 copies within 100 ms, each scheduling a separate `delayed_send`
  task.
- **Collision retries**: hardware-level collisions produce duplicate RF bursts that
  all arrive within the same RX window.
- **Bridge nodes**: high-traffic gateway nodes connect multiple mesh segments and
  forward both directions simultaneously.

Under these conditions `_route_tasks` can accumulate dozens of sleeping tasks.
Each holds a reference to the packet, the forwarded packet copy, a closure over
`delayed_send`, and associated asyncio task overhead.  There is no cap; the set
grows until the duty-cycle gate finally fires for each task.

### Problem 2 — `_route_tasks` set adds O(1) cost on every packet but O(n) cost on shutdown

Every packet adds one entry to `_route_tasks` and removes it in the done-callback.
This is O(1) per operation, but the `stop()` shutdown path iterates the entire set
to cancel and gather all tasks — O(n) where n is however many tasks happen to be
in-flight at shutdown time.  On a busy node this could delay clean shutdown.

### Problem 3 — `_COMPANION_DEDUPE_PRUNE_THRESHOLD = 1000` is too high

The companion delivery deduplication dict prunes itself only when it exceeds 1000
entries.  With a 60-second TTL, each PATH/protocol-response packet adds one entry.
On a busy mesh with 50+ nodes sending adverts and PATH packets, the dict can grow
to hundreds of entries before a prune is triggered — keeping stale entries in
memory for up to 60 seconds × 1000/rate entries worth of time.

---

## Solution

### Replace `_route_tasks` set with `_in_flight` counter

An integer counter provides the same protection (tasks complete; done-callback
fires) without holding strong references to each task object:

```python
# __init__
self._in_flight: int = 0
self._max_in_flight: int = 30

# _process_queue — drop early if cap reached
if self._in_flight >= self._max_in_flight:
    logger.warning("In-flight task cap reached (%d/%d), dropping packet", ...)
    continue
self._in_flight += 1
task = asyncio.create_task(self._route_packet(packet))
task.add_done_callback(self._on_route_done)

# done-callback
def _on_route_done(self, task):
    self._in_flight -= 1
    if not task.cancelled() and task.exception():
        logger.error("_route_packet raised: %s", task.exception(), ...)
```

### Cap at 30 concurrent in-flight tasks

30 is chosen as a ceiling that is:
- **Never reached in normal operation**: LoRa airtime at SF8/125 kHz limits
  throughput to ~2–3 packets per second; with delays of 0.5–5 s each, the
  steady-state in-flight count is at most 5–15 tasks.
- **High enough not to drop legitimate traffic**: a burst of 30 nearly-simultaneous
  packets would require every node in a large mesh to transmit within 1 second.
- **Low enough to protect against pathological scenarios**: a misconfigured node
  flooding the channel or a software bug causing infinite re-queuing.

### Tighten companion dedup prune threshold to 200

200 entries at 60 s TTL means a sweep is triggered after ~200 unique PATH/response
packets arrive without any expiry.  This is far more than a typical companion
session (which sees a handful of active connections) but prevents multi-hour
accumulation on a busy mesh.

---

## Trade-off: Shutdown Cancellation

The previous `_route_tasks` set allowed `stop()` to explicitly cancel and await
all in-flight tasks on shutdown.  The counter approach does not.

**Why this is acceptable:**

1. In-flight `_route_packet` tasks are sleeping inside `delayed_send` (waiting for
   their TX delay timer).  When the event loop is shut down — whether via
   `asyncio.run()` completing, `loop.stop()`, or `SIGTERM` handling — Python
   cancels all pending tasks automatically.

2. Even under the old approach, cancelling a sleeping `delayed_send` means the
   packet is not transmitted.  The result is the same whether cancellation happens
   explicitly in `stop()` or implicitly when the event loop closes.

3. For a graceful shutdown where we want to *wait* for in-flight packets to
   complete transmission, the right mechanism is `stop()` awaiting the queue to
   drain *before* cancelling the router task — not cancelling sleeping tasks.
   Neither the old code nor this PR implements that, so no regression.

---

## Why This Is the Right Approach

### Alternative A — Keep `_route_tasks` set, add a size cap

```python
if len(self._route_tasks) >= 30:
    logger.warning(...)
    continue
```

Works, but the set still holds a strong reference to every Task object for the
duration of its sleep.  The counter holds an integer.  Task objects in Python 3.12+
are already strongly referenced by the event loop scheduler; the set reference is
redundant for preventing GC cancellation.

### Alternative B — `asyncio.Semaphore`

```python
self._sem = asyncio.Semaphore(30)
async with self._sem:
    await self._route_packet(packet)
```

Correct but changes the queue loop from fire-and-forget to blocking: the loop
would wait at `async with self._sem` for a slot to open, stalling packet reads
while a slot is occupied.  That reintroduces the queue freeze the concurrent
dispatch was designed to prevent.  A semaphore is the right tool for *rate-
limiting* producers; a counter cap at the dispatch site is the right tool for
bounding *background* tasks.

### Alternative C — Integer counter (this PR)

- O(1) increment and decrement.
- No strong reference to task objects beyond the event loop's own reference.
- Drop decision is synchronous and immediate — no sleeping on semaphore.
- Error logging preserved in `_on_route_done`.
- Simpler code, easier to reason about.

---

## Changes — `repeater/packet_router.py` only

| Location | Change | Reason |
|----------|--------|--------|
| Module level | Remove `_COMPANION_DEDUPE_PRUNE_THRESHOLD = 1000` | Replaced with inline literal `200`; no need for a named constant for a single usage site |
| `__init__` | Remove `self._route_tasks = set()`; add `self._in_flight = 0`, `self._max_in_flight = 30` | Replace set-based tracking with counter |
| `stop()` | Remove `_route_tasks` cancellation block | Tasks complete or are cancelled by event loop shutdown; explicit cancellation not needed |
| `_on_route_task_done` → `_on_route_done` | Simpler done-callback: decrement counter + log exceptions | Error logging preserved; set management removed |
| `_should_deliver_path_to_companions` | `> _COMPANION_DEDUPE_PRUNE_THRESHOLD` → `> 200` with explanatory comment | Lower threshold; comment explains the sizing rationale |
| `_process_queue` | Check `_in_flight >= _max_in_flight` before `create_task`; increment `_in_flight`; use `_on_route_done` | Cap accumulation; counter tracks live task count |

---

## Test Plan

### Unit tests (no hardware)

**T1 — Counter increments and decrements correctly**

```python
async def test_in_flight_counter():
    router = PacketRouter(mock_daemon)
    await router.start()

    assert router._in_flight == 0

    # Enqueue a packet that takes time to process
    async def slow_route(pkt):
        await asyncio.sleep(0.1)

    router._route_packet = slow_route
    await router.enqueue(make_test_packet())
    await asyncio.sleep(0.01)  # let queue loop run

    assert router._in_flight == 1   # task is sleeping

    await asyncio.sleep(0.15)       # task finishes
    assert router._in_flight == 0   # counter decremented by done-callback
```

**T2 — Cap enforced: packet dropped when at limit**

```python
async def test_cap_drops_packet_at_limit():
    router = PacketRouter(mock_daemon)
    router._max_in_flight = 2
    router._in_flight = 2   # simulate cap reached

    dropped = []
    original_create_task = asyncio.create_task
    asyncio.create_task = lambda coro: dropped.append(coro)

    await router._process_queue_once(make_test_packet())

    assert dropped == [], "create_task must not be called when cap is reached"
    asyncio.create_task = original_create_task
```

**T3 — Exceptions in `_route_packet` are logged, not swallowed**

```python
async def test_exception_logged():
    router = PacketRouter(mock_daemon)

    async def failing_route(pkt):
        raise ValueError("simulated error")

    router._route_packet = failing_route
    with patch("repeater.packet_router.logger") as mock_log:
        task = asyncio.create_task(failing_route(make_test_packet()))
        router._in_flight = 1
        task.add_done_callback(router._on_route_done)
        await asyncio.gather(task, return_exceptions=True)
        mock_log.error.assert_called_once()

    assert router._in_flight == 0
```

**T4 — Companion dedup dict pruned at 200, not 1000**

```python
def test_companion_dedup_prune_threshold():
    router = PacketRouter(mock_daemon)
    future_time = time.time() + 999

    # Fill with 199 entries (all unexpired) — no prune
    router._companion_delivered = {f"key{i}": future_time for i in range(199)}
    pkt = make_path_packet()
    router._should_deliver_path_to_companions(pkt)
    assert len(router._companion_delivered) == 200   # added one, no prune yet

    # 201st entry triggers prune — all unexpired so count stays at 201
    router._companion_delivered[f"key_extra"] = future_time
    assert len(router._companion_delivered) == 201

    # Force prune by making all existing entries expired
    past_time = time.time() - 1
    router._companion_delivered = {f"key{i}": past_time for i in range(201)}
    router._should_deliver_path_to_companions(pkt)
    # All expired entries pruned; only the new entry remains
    assert len(router._companion_delivered) == 1
```

### Integration / field tests (with hardware)

**T5 — Burst flood: verify cap fires under pathological load**

1. Configure a test mesh with 4+ nodes all in range of the repeater.
2. Have all nodes send a flood packet simultaneously.
3. Observe repeater logs.

**Expected:** `_in_flight` peaks in low single digits (LoRa airtime prevents
large bursts); no `"In-flight task cap reached"` warning fires under normal
conditions, confirming the cap is never a bottleneck in practice.

**T6 — Counter reaches zero after all packets processed**

1. Send a burst of 10 packets.
2. Wait 10 seconds (longer than max TX delay of 5 s).
3. Query `router._in_flight` from a debug endpoint or log.

**Expected:** `_in_flight == 0` after all delays expire and packets transmit.

**T7 — Error in `_route_packet` is logged and counter is decremented**

1. Temporarily introduce a deliberate exception in `_route_packet`.
2. Send a packet.
3. Check logs for the error message and verify the repeater continues operating
   (counter decremented, queue still draining).

**T8 — Normal forwarding throughput unchanged**

1. Send packets at a steady rate of 1 every 10 seconds for 5 minutes.
2. Verify all packets are forwarded with no warnings or errors.
3. Confirm `_in_flight` never exceeds 3–4 during normal operation.

---

## Proof of Correctness

### Counter vs set: why the counter is sufficient

The `_route_tasks` set solved two problems:

1. **GC protection**: In Python < 3.12, a task with no strong references other
   than the event loop's internal weakref could be garbage collected before
   completing.  Python 3.12+ strengthened task references in the event loop.
   However, even in earlier versions, the set was unnecessary once `create_task`
   returns — the caller holds the reference, and the done-callback fires reliably
   because the event loop holds the task alive until completion.

2. **Explicit shutdown cancellation**: The counter loses this.  As argued above,
   the outcome is identical — sleeping tasks are cancelled either explicitly by
   `stop()` or implicitly by the event loop at shutdown — and no packet that
   hasn't been transmitted yet can complete its send after the radio is shut down
   anyway.

### Why `_on_route_done` is a done-callback and not a `try/finally` inside `_route_packet`

A `try/finally` block inside `_route_packet` would also decrement the counter.
Done-callbacks are preferable because:

- They fire even if the task is externally cancelled (e.g. by event loop shutdown),
  whereas `finally` may not run if `CancelledError` is not caught.
- They decouple counter management from `_route_packet` logic — `_route_packet`
  has no knowledge of or dependency on the cap mechanism.
- They keep the pattern consistent with the rest of the codebase's use of
  `add_done_callback` for task lifecycle management.

### Why 30 and not a smaller number like 10

At SF8, 125 kHz bandwidth, a 30-byte payload takes ~111 ms airtime and produces
a TX delay of roughly 0.5–3 s.  With a 60-second duty-cycle window and 3.6 s
max airtime, the node can forward at most ~32 packets per minute at full budget.
If all 32 arrive within one second (they cannot physically, but as an upper
bound), 32 tasks would be in-flight simultaneously.  A cap of 30 is aggressive
enough to protect against unbounded growth but not so low that it would drop
legitimate traffic under any realistic burst scenario.