## zlay memory leak investigation — status update

### Setup

zlay is a Zig 0.15 AT Protocol relay crawling ~2,700 PDS hosts. Steady-state memory growth is ~240 MiB/hr with both LRU caches (validator 250K, DID 500K) already full. All bounded data structures have been ruled out — the leak is somewhere in the per-resolve allocation path.

---

### Experiments Run

**1. `RESOLVER_RECYCLE_INTERVAL=0` (disable periodic resolver destroy/recreate)**

Steady-state slope ~313 MiB/hr — worse than the ~240 MiB/hr baseline. Confirms that recycling was partially containing the problem by periodically destroying accumulated state.

**2. `RESOLVER_THREADS=0` (disable DID resolution entirely)**

Malloc flat at ~277 MiB after 49 minutes. The entire firehose pipeline (2,681 PDS connections, CBOR decode, RocksDB writes, broadcasting) runs without memory growth. **This definitively isolates the leak to the resolver path.**

**3. `RESOLVER_KEEP_ALIVE=false` (current, running ~75 min)**

Disables HTTP keepalive on the resolver's `std.http.Client`, so TLS connections are not reused across resolves. The early signal is promising — the initial slope is visibly flatter than any previous run at the same stage. The caches are still filling (DID cache at 47%, validator at 19%), so the steady-state slope can't be measured yet.

---

### What We Think Is Happening

The leak is inside Zig's `std.http.Client` connection-reuse path. With keepalive enabled, the client maintains a connection pool (bounded at 32 slots) holding TLS sessions. Something in the TLS connection lifecycle — possibly session tickets, certificate-chain buffers, or connection metadata — accumulates and is never fully reclaimed, even when connections are evicted from the pool.
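The suspected failure mode — a bounded pool whose evicted entries leave per-connection state behind — can be modeled in a few lines. This is a hypothetical Python sketch of the *pattern*, not the actual `std.http.Client` internals: the names (`ConnectionPool`, `session_tickets`) and the ticket size are made up for illustration.

```python
# Hypothetical model of the suspected leak: the pool itself is bounded
# (32 slots), but state created per connection (here, a "session ticket")
# outlives eviction, so total memory grows with connection churn even
# though the pool never exceeds its cap.
from collections import OrderedDict

POOL_SIZE = 32
session_tickets = []  # per-connection state that eviction never reclaims


class Connection:
    def __init__(self, host):
        self.host = host
        self.ticket = bytearray(256)          # stands in for TLS session state
        session_tickets.append(self.ticket)   # the "forgotten" reference


class ConnectionPool:
    def __init__(self):
        self.slots = OrderedDict()  # host -> Connection, in LRU order

    def acquire(self, host):
        if host in self.slots:
            self.slots.move_to_end(host)      # reuse: no new state created
            return self.slots[host]
        conn = Connection(host)
        self.slots[host] = conn
        if len(self.slots) > POOL_SIZE:
            self.slots.popitem(last=False)    # evict LRU; its ticket lives on
        return conn


pool = ConnectionPool()
for i in range(10_000):                        # resolves against ~2,700 hosts
    pool.acquire(f"pds-{i % 2_700}.example")

print(len(pool.slots))       # bounded: 32
print(len(session_tickets))  # unbounded: 10000 — one per connection ever built
```

With ~2,700 distinct hosts cycling through a 32-slot LRU pool, effectively every resolve is a pool miss, so per-connection state accumulates at the resolve rate — which is consistent with both the `THREADS=0` (flat) and `RECYCLE=0` (faster growth) observations.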
**Evidence:**

- `THREADS=0` (no HTTP at all) → flat memory
- `RECYCLE=0` (never destroy the client) → faster growth than baseline
- baseline with `RECYCLE=1000` (destroy the client every 1,000 resolves) → slower growth, but still leaking
- `KEEP_ALIVE=false` (no connection reuse) → early signal looks flat, pending confirmation

---

### What We're Waiting For

Both caches need to fill completely (~2–3 hours from deploy), then 1–2 hours of steady state to measure the residual slope. If the slope drops to near-zero with keepalive disabled, the fix is confirmed — the options are then to either ship `keepalive=false` as the production config (at the cost of slightly higher resolve latency from per-request TLS handshakes), or dig into `std.http.Client` to find the actual leak.

**ETA for conclusive data: ~3 hours from now.**
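Once the steady-state window arrives, the residual slope can be read off with a plain least-squares fit over (elapsed hours, RSS in MiB) samples. A minimal sketch — the sample window below is invented for illustration and is not zlay telemetry:

```python
# Least-squares slope of RSS over time, in MiB per hour.
def rss_slope(samples):
    """samples: list of (hours_elapsed, rss_mib) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


# Hypothetical 2-hour steady-state window with flat-ish RSS around 1,900 MiB.
window = [(0.0, 1900), (0.5, 1902), (1.0, 1901), (1.5, 1903), (2.0, 1902)]
print(f"{rss_slope(window):.1f} MiB/hr")  # -> 1.0 MiB/hr: near-zero vs ~240 baseline
```

A slope within measurement noise of zero over the 1–2 hour window would confirm the keepalive hypothesis; anything resembling the ~240 MiB/hr baseline would send the investigation back to the resolver's own allocation path.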