atproto utils for zig zat.dev

docs: devlog 007 — proving the firehose

sync 1.1 verification in production, SDK API surface analysis,
lightrail's influence on the collection index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+88
devlog/007-proving-the-firehose.md
# proving the firehose

the previous devlog covered building [zlay](https://tangled.org/zzstoatzz.io/zlay): architecture, deployment, war stories. this one picks up where it left off: wiring the inductive proof chain into the relay's hot path, and what that taught us about zat's API surface.

## the verification gap

zlay launched with signature verification but not structural verification. every commit's ECDSA signature was checked against the account's signing key (via `zat.verifyCommitCar`), but the MST inversion step (proving that the list of operations actually explains the state transition) was skipped. the relay trusted that if the signature was valid, the operations were complete and correct.

this is the same assumption most relay implementations make. [indigo](https://github.com/bluesky-social/indigo) doesn't do MST inversion. [collectiondir](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) trusts its upstream entirely. only fig's [lightrail](https://tangled.org/microcosm.blue/lightrail), a Rust service that implements `listReposByCollection`, was working toward full sync 1.1 verification.

but the spec is clear: a relay *should* verify `blocks` against `ops` and `prevData` via MST inversion.

## the inductive proof, briefly

the sync spec describes a mechanism called operation inversion. each `#commit` event carries a CAR with the new partial MST, a list of record operations, and `prevData` (the MST root CID from the previous commit). the verification:

1. load the partial MST from the CAR; blocks not included become stubs (known CID, trusted content)
2. copy the tree
3. invert each operation: create → delete, update → restore old value, delete → reinsert
4. compute the root CID of the inverted tree
5. compare against `prevData`

if they match, the operations fully explain the state change. if not, something is missing, reordered, or fabricated.

it's inductive because each verified transition chains to the last. commit N proves against commit N-1, which proved against N-2, back to a known-good state. the relay only needs to track two values per account: `rev` (the last verified TID) and `data` (the last verified MST root CID).

zat has had all the primitives since v0.2.8: `verifyCommitDiff`, `loadCommitFromCAR`, `MstOperation`, `normalizeOps`, `invertOp`. the question was how they compose in production.

## observation before enforcement

zlay's approach was to wire up the chain checks without enforcing them. every commit now runs two continuity checks:

- **since/rev check**: the event's `since` field should match the stored `rev` for that account
- **prevData check**: the event's `prevData` CID should match the stored MST root

mismatches increment a prometheus counter (`relay_chain_breaks_total`) and log the details, but the commit still flows through. this lets the operator measure the chain break rate across ~2,750 PDS hosts before deciding whether strict enforcement is safe.

separately, `verifyCommitDiff` itself is wired but behind a config flag, disabled in production for now. the path is: observation → confidence → enforcement.

the conditional state upsert matters here too. frame pool workers process commits concurrently, so two commits from the same account can race. the postgres upsert uses `WHERE rev < new_rev` to prevent a stale worker from rolling back an account's chain state.

## what the relay taught us about the SDK

### the extractOps bug

the first attempt to wire `verifyCommitDiff` failed silently. zlay's `extractOps` was looking for separate `collection` and `rkey` fields in the CBOR payload. the firehose wire format uses a single `path` field (`collection/rkey`). the operations array arrived empty, and `verifyCommitDiff` with zero ops trivially succeeded.

this is a spec-level mapping that every consumer must get right: firehose `repoOp` has `{action, path, cid, prev}`, but zat's `MstOperation` infers the action from which fields are null. the conversion is simple once you know it, but it's easy to miss, and the failure mode is silent success.

### `loadCommitFromCAR`: already right

when we studied the integration, we asked: what should the SDK expose that it doesn't? the answer turned out to be nothing new. the function we wanted (parse a commit from a CAR without running full verification) was already public since v0.2.8.

zlay uses exactly this pattern. on a validator cache miss (signing key not yet resolved), it needs the commit's DID to know *which* key to look up, but can't verify yet. `loadCommitFromCAR` extracts the commit metadata and pre-computes the unsigned bytes for later verification. parse now, verify when the key arrives.

the principle this validates: **parsing should be separable from validation.** an SDK that only offers all-or-nothing verification forces consumers to reimplement the parsing layer. zat's decomposition (`loadCommitFromCAR` for parsing, `verifyCommitCar`/`verifyCommitDiff` for verification) matches how relays actually work.

### what we didn't add

we considered three other affordances:

1. **standalone frame parsing**: expose `parseFirehoseFrame(header, payload) → FirehoseEvent` independent of the firehose client. principled in theory (wire format parsing ≠ connection lifecycle), but zlay is the only consumer so far. waiting for a second signal.

2. **operation type bridging**: a function to convert firehose `repoOp` to `MstOperation`. would have prevented the extractOps bug, but the conversion is ~5 lines. probably belongs in documentation, not code.

3. **chain continuity helper**: a pure function that compares stored (rev, data) against incoming (since, prevData). useful, but also ~5 lines of comparison logic.

for a pre-1.0 library: *yes is forever, no is temporary.* none of these cleared the bar yet.

## lightrail and the collection index

fig's [lightrail](https://tangled.org/microcosm.blue/lightrail) is a Rust implementation of `listReposByCollection`, the endpoint that answers "which accounts have records in a given collection?" it takes a different approach from indigo's collectiondir: instead of trusting the upstream relay, lightrail inspects the CAR blocks in each `#commit` to detect when an account's first record appears in (or last record disappears from) a collection.

this is a deeper use of the partial MST than pure verification. lightrail doesn't just prove the diff is valid; it reads the adjacent keys in the CAR slice to determine collection boundaries. it's the difference between asking "is this diff correct?" and asking "what does this diff mean?"

zlay's collection index drew on lightrail's design. the dual-column-family pattern, `rbc` (collection→DID for the query) and `cbr` (DID→collection for deletion), comes from lightrail. so does the philosophy of indexing inline from the firehose rather than running a separate sidecar process.

where they differ: lightrail aims for CAR-based probing (inspecting MST adjacency to detect add/remove), while zlay currently does simpler operation-level tracking: for any `create` op, extract the collection from the path and add that (DID, collection) pair to the index. accurate removal (knowing when the *last* record in a collection is deleted) requires either getRecord probing or MST adjacency analysis, and zlay defers this to a later phase.

the backfill strategy also differs. collectiondir calls `describeRepo` on every account (O(N) API calls, N ≈ 30M+). zlay calls `listReposByCollection` on an upstream relay that already has the data, bootstrapping from the network rather than re-crawling. first backfill: 1,287 collections, 61M DIDs.

## what this means for zat

the relay exercises every module in the SDK at scale: 50M+ CBOR decodes per day, thousands of signature verifications per minute, continuous DID resolution. but the sync 1.1 integration is where the API surface gets tested hardest, because it requires *composing* modules: CBOR for frame decode → CAR for block extraction → `loadCommitFromCAR` for commit parsing → `verifyCommitDiff` for structural proof → multibase for CID storage.

the composition works. the decomposition between parsing and verification works. the main friction was at the boundary (the firehose wire format vs the MST internal format), and that's a documentation problem, not an API problem.

we're watching the chain break metrics. when they stabilize, the relay enables strict enforcement, and the full inductive proof chain goes live.
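
since the two "~5 line" helpers from *what we didn't add* were judged to belong in documentation rather than code, here is a minimal sketch of both, plus the stale-write guard behind `WHERE rev < new_rev`. it is python for brevity, not zig, and it is not zat's or zlay's code: `RepoOp`, `MstOp`, `ChainState`, `to_mst_op`, `chain_breaks` and `advance` are hypothetical names, CIDs are plain strings, and the field layout of `MstOp` is an assumed stand-in for zat's `MstOperation` (all the devlog says is that the action is inferred from which fields are null). the strict `prev` requirement assumes sync 1.1 events.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RepoOp:
    """firehose wire shape: {action, path, cid, prev}."""
    action: str           # "create" | "update" | "delete"
    path: str             # single field: "collection/rkey"
    cid: Optional[str]    # new record CID, None on delete
    prev: Optional[str]   # previous record CID, None on create


@dataclass(frozen=True)
class MstOp:
    """hypothetical stand-in for MstOperation: no action field,
    the action is implied by which of value/prev is None."""
    path: str
    value: Optional[str]  # None means the key is removed
    prev: Optional[str]   # None means the key is new


# which of (cid, prev) must be present for each action
_SHAPE = {"create": (True, False), "update": (True, True), "delete": (False, True)}


def to_mst_op(op: RepoOp) -> MstOp:
    """bridge a wire op to the MST form, failing loudly instead of
    silently producing nothing (the extractOps failure mode)."""
    collection, sep, rkey = op.path.partition("/")
    if not (collection and sep and rkey):
        raise ValueError(f"malformed repoOp path: {op.path!r}")
    if op.action not in _SHAPE:
        raise ValueError(f"unknown action: {op.action!r}")
    if (op.cid is not None, op.prev is not None) != _SHAPE[op.action]:
        raise ValueError(f"cid/prev do not match action {op.action!r}")
    return MstOp(path=op.path, value=op.cid, prev=op.prev)


@dataclass(frozen=True)
class ChainState:
    rev: str    # last seen TID for the account
    data: str   # last seen MST root CID


def chain_breaks(stored: Optional[ChainState],
                 since: Optional[str],
                 prev_data: Optional[str]) -> List[str]:
    """return the names of the continuity checks that failed."""
    if stored is None:
        return []  # first commit seen for this account: nothing to compare
    breaks = []
    if since != stored.rev:
        breaks.append("since/rev")
    if prev_data != stored.data:
        breaks.append("prevData")
    return breaks


def advance(stored: Optional[ChainState], incoming: ChainState) -> ChainState:
    """in-memory analogue of the conditional upsert: only move forward.
    relies on TIDs sorting lexicographically."""
    if stored is None or stored.rev < incoming.rev:
        return incoming
    return stored


# placeholder values, not real TIDs or CIDs
stored = ChainState(rev="3kaaa", data="root-1")
op = to_mst_op(RepoOp("create", "app.bsky.feed.post/3kbbb", "rec-1", None))
broken = chain_breaks(stored, since="3kaaa", prev_data="root-1")
stored = advance(stored, ChainState(rev="3kbbb", data="root-2"))
```

in observation mode the caller would count and log whatever `chain_breaks` returns and still forward the commit; under strict enforcement a non-empty result would reject it. a caller that wants to avoid the trivially-succeeding zero-ops case can also check that the number of converted ops equals the number of `repoOp` entries on the wire before handing them to verification.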