atproto utils for zig zat.dev
atproto sdk zig

possible std.http.Client allocator leak in long-lived DidResolver/HandleResolver #6

closed · opened by zat.dev

the problem#

zlay (zig atproto relay) uses a single DidResolver instance for the lifetime of the process, calling resolve() at sustained ~700+ req/sec. over time, malloc_in_use grows linearly at ~119 KB/sec (~411 MiB/hr), eventually OOMing.

the suspected leak is inside glibc malloc-managed memory — we've confirmed via prometheus attribution metrics that all application-level data structures (caches, buffers, thread stacks, RocksDB) are stable. the growth is entirely in malloc_in_use from c_allocator.

where we think it leaks#

at transport.zig:76, self.http_client.fetch() accumulates allocations inside std.http.Client's internal connection pool. the allocating writer at line 42 is properly deferred, and the duped body at line 89 is freed by the caller. but std.http.Client itself retains per-connection TLS state and connection pool entries, and these grow over time when making requests to many distinct hosts.

the test at did_resolver.zig:102 acknowledges this:

  // use arena for http client internals that may leak
  var arena = std.heap.ArenaAllocator.init(std.testing.allocator);

in tests this is fine — the arena catches everything. but in production, DidResolver.init(allocator) passes c_allocator straight through to std.http.Client, so the leaked internal state accumulates forever.
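for reference, the pass-through looks roughly like this (a simplified sketch; field and type names are approximate, not copied from the actual source):

  // simplified sketch of the suspected pass-through (names approximate)
  pub const DidResolver = struct {
      http_client: std.http.Client,

      pub fn init(allocator: std.mem.Allocator) DidResolver {
          return .{
              // whatever allocator the caller passes (c_allocator in zlay)
              // is handed directly to std.http.Client, so any state the
              // client retains per connection is never reclaimed until
              // deinit() -- which a long-lived resolver never calls
              .http_client = .{ .allocator = allocator },
          };
      }

      pub fn deinit(self: *DidResolver) void {
          self.http_client.deinit();
      }
  };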

reproduction#

  const allocator = std.heap.c_allocator;
  var resolver = zat.DidResolver.init(allocator);
  defer resolver.deinit();

  // resolve ~10K distinct DIDs (simulating relay traffic)
  for (dids) |did| {
      var doc = resolver.resolve(did) catch continue;
      doc.deinit();
  }
  // check: allocator has retained memory from http_client internals
  // (visible via mallinfo() or process RSS growth)

at scale (700 req/sec × hours), this grows to GiBs. the leak rate is proportional to request volume and number of distinct target hosts (plc.directory is one host, but did:web resolves to thousands).

suggested fix options#

  1. periodic client recycling: add a request counter to HttpTransport, and after N requests (e.g., 10K), call self.http_client.deinit() and reinitialize. this caps the leaked state at a bounded amount.
  2. arena-wrap the http client (like the test does): use an ArenaAllocator for the std.http.Client's allocator, periodically reset it.
  3. expose a reset() or clearConnections() method on HttpTransport so callers can manage the lifecycle.

option 1 is the simplest and most effective. the connection pool doesn't benefit much from long-term persistence since TLS sessions to plc.directory will be re-established quickly.
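a minimal sketch of option 1, assuming HttpTransport owns both the client and its allocator (the counter field, threshold, and method name are illustrative, not existing API):

  const recycle_threshold: u32 = 10_000; // requests between recycles (tunable)

  pub const HttpTransport = struct {
      allocator: std.mem.Allocator,
      http_client: std.http.Client,
      request_count: u32 = 0,

      /// call before each request: after N requests, tear down the
      /// client (dropping pooled connections and retained TLS state)
      /// and recreate it with the same allocator, so leaked internal
      /// state is bounded instead of growing without limit.
      pub fn maybeRecycle(self: *HttpTransport) void {
          self.request_count += 1;
          if (self.request_count >= recycle_threshold) {
              self.http_client.deinit();
              self.http_client = .{ .allocator = self.allocator };
              self.request_count = 0;
          }
      }
  };

one caveat: deinit() while other threads have requests in flight is unsafe, so in a concurrent relay the recycle would need to be serialized with request issuance (e.g. behind the same mutex that guards the client).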

environment#

  • zig 0.15.0, zat v0.2.10
  • linux/amd64, glibc, MALLOC_ARENA_MAX=2
  • ~2700 concurrent PDS connections, ~700 frames/sec sustained
  • each frame triggers 0-1 DID resolutions (depending on cache miss rate)
AT URI
at://did:plc:mkqt76xvfgxuemlwlx6ruc3w/sh.tangled.repo.issue/3mgihbz5kyv22