## zlay memory leak investigation — status update

### Setup

zlay is a Zig 0.15 AT Protocol relay crawling ~2,700 PDS hosts. Steady-state memory growth is ~240 MiB/hr with both LRU caches (validator 250K, DID 500K) already full. All bounded data structures have been ruled out — the leak is somewhere in the per-resolve allocation path.

---

### Experiments Run

**1. `RESOLVER_RECYCLE_INTERVAL=0` (disable periodic resolver destroy/recreate)**

Steady-state slope ~313 MiB/hr — worse than the ~240 MiB/hr baseline. Confirms that recycling was partially containing the problem by periodically destroying accumulated state.

**2. `RESOLVER_THREADS=0` (disable DID resolution entirely)**

Malloc flat at ~277 MiB after 49 minutes. The entire firehose pipeline (2,681 PDS connections, CBOR decode, RocksDB writes, broadcasting) runs without memory growth. **This definitively isolates the leak to the resolver path.**

**3. `RESOLVER_KEEP_ALIVE=false` (current, running ~75 min)**

Disables HTTP keepalive on the resolver's `std.http.Client`, so TLS connections are not reused across resolves. The early signal is promising — the initial slope is visibly flatter than any previous run at the same stage. The caches are still filling (DID cache at 47%, validator at 19%), so the steady-state slope can't be measured yet.

---

### What We Think Is Happening

The leak is inside Zig's `std.http.Client` connection-reuse path. With keepalive enabled, the client maintains a connection pool (bounded at 32 slots) holding TLS sessions. Something in the TLS connection lifecycle — possibly session tickets, certificate-chain buffers, or connection metadata — accumulates and is never fully reclaimed, even when connections are evicted from the pool.
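The suspected failure mode — a bounded pool whose evicted entries leave per-connection state behind — can be modeled in a few lines. This is a hypothetical Python sketch of the *pattern*, not the actual `std.http.Client` internals: the names (`ConnectionPool`, `session_tickets`) and the ticket size are made up for illustration.

```python
# Hypothetical model of the suspected leak: the pool itself is bounded
# (32 slots), but state created per connection (here, a "session ticket")
# outlives eviction, so total memory grows with connection churn even
# though the pool never exceeds its cap.
from collections import OrderedDict

POOL_SIZE = 32
session_tickets = []  # per-connection state that eviction never reclaims


class Connection:
    def __init__(self, host):
        self.host = host
        self.ticket = bytearray(256)          # stands in for TLS session state
        session_tickets.append(self.ticket)   # the "forgotten" reference


class ConnectionPool:
    def __init__(self):
        self.slots = OrderedDict()  # host -> Connection, in LRU order

    def acquire(self, host):
        if host in self.slots:
            self.slots.move_to_end(host)      # reuse: no new state created
            return self.slots[host]
        conn = Connection(host)
        self.slots[host] = conn
        if len(self.slots) > POOL_SIZE:
            self.slots.popitem(last=False)    # evict LRU; its ticket lives on
        return conn


pool = ConnectionPool()
for i in range(10_000):                        # resolves against ~2,700 hosts
    pool.acquire(f"pds-{i % 2_700}.example")

print(len(pool.slots))       # bounded: 32
print(len(session_tickets))  # unbounded: 10000 — one per connection ever built
```

With ~2,700 distinct hosts cycling through a 32-slot LRU pool, effectively every resolve is a pool miss, so per-connection state accumulates at the resolve rate — which is consistent with both the `THREADS=0` (flat) and `RECYCLE=0` (faster growth) observations.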
**Evidence:**

- `THREADS=0` (no HTTP at all) → flat memory
- `RECYCLE=0` (never destroy the client) → faster growth than baseline
- baseline with `RECYCLE=1000` (destroy the client every 1,000 resolves) → slower growth, but still leaking
- `KEEP_ALIVE=false` (no connection reuse) → early signal looks flat, pending confirmation

---

### What We're Waiting For

Both caches need to fill completely (~2–3 hours from deploy), then 1–2 hours of steady state to measure the residual slope. If the slope drops to near-zero with keepalive disabled, the fix is confirmed — the options are then to either ship `keepalive=false` as the production config (at the cost of slightly higher resolve latency from per-request TLS handshakes), or dig into `std.http.Client` to find the actual leak.

**ETA for conclusive data: ~3 hours from now.**
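Once the steady-state window arrives, the residual slope can be read off with a plain least-squares fit over (elapsed hours, RSS in MiB) samples. A minimal sketch — the sample window below is invented for illustration and is not zlay telemetry:

```python
# Least-squares slope of RSS over time, in MiB per hour.
def rss_slope(samples):
    """samples: list of (hours_elapsed, rss_mib) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


# Hypothetical 2-hour steady-state window with flat-ish RSS around 1,900 MiB.
window = [(0.0, 1900), (0.5, 1902), (1.0, 1901), (1.5, 1903), (2.0, 1902)]
print(f"{rss_slope(window):.1f} MiB/hr")  # -> 1.0 MiB/hr: near-zero vs ~240 baseline
```

A slope within measurement noise of zero over the 1–2 hour window would confirm the keepalive hypothesis; anything resembling the ~240 MiB/hr baseline would send the investigation back to the resolver's own allocation path.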