atproto utils for zig zat.dev

feat: O(1) block lookup in CAR parser + fix verifyRepo size limits

The CAR parser now builds a StringHashMap index during read();
findBlock() does an O(1) hash lookup instead of a linear scan.
verifyRepo bypasses the default 2 MB / 10k-block limits for repos it
fetches itself.

Clarify Rust ecosystem in devlog: rsky (BlackSky), jacquard
(@nonbinary.computer), and hand-rolled RustCrypto bench.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

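The trade the commit describes, swapping a linear scan over CAR blocks for a hash index built once at parse time, can be sketched outside Zig. A minimal Rust sketch with hypothetical `Block`/`Car` types (the real code lives in `src/internal/repo/car.zig`):

```rust
use std::collections::HashMap;

// hypothetical stand-in for a parsed CAR block: raw CID bytes + payload
struct Block {
    cid_raw: Vec<u8>,
    data: Vec<u8>,
}

struct Car {
    blocks: Vec<Block>,
    // CID bytes -> block position, built in the same pass that reads blocks
    block_index: HashMap<Vec<u8>, usize>,
}

impl Car {
    fn new(blocks: Vec<Block>) -> Self {
        let block_index = blocks
            .iter()
            .enumerate()
            .map(|(i, b)| (b.cid_raw.clone(), i))
            .collect();
        Car { blocks, block_index }
    }

    // O(1) average-case lookup instead of an O(n) scan per call
    fn find_block(&self, cid_raw: &[u8]) -> Option<&[u8]> {
        self.block_index
            .get(cid_raw)
            .map(|&i| self.blocks[i].data.as_slice())
    }
}
```

The cost moves to parse time (one insert per block), which is already O(n); every subsequent `find_block` is a hash probe.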
+25 -7

CHANGELOG.md (+6)
···
 # changelog

+## 0.2.5
+
+- **feat**: O(1) block lookup in CAR parser — `StringHashMap` index built during `read()`/`readWithOptions()`, `findBlock()` uses index instead of linear scan
+- **fix**: `verifyRepo` bypasses default 2 MB / 10k block limits so large repos (e.g. pfrazee.com at 70 MB / 243k blocks) actually work
+- **docs**: devlog 005 — clarify Rust ecosystem (rsky, jacquard, hand-rolled RustCrypto)
+
 ## 0.2.4

 - **feat**: configurable CAR size limits — `max_size` and `max_blocks` options in `readWithOptions` for large repo verification
devlog/005-three-way-verify.md (+4 -4)
···

 **go (indigo)** — uses bluesky's official Go SDK: `identity.BaseDirectory` for handle/DID resolution, `repo.LoadRepoFromCAR` for parsing, `commit.VerifySignature` for sig verify, `MST.Walk()` + `MST.RootCID()` for MST.

-**rust (RustCrypto)** — manual implementation since no indigo-equivalent exists in Rust. HTTP + DNS TXT handle resolution, plc.directory DID resolution, hand-rolled CAR parser with SHA-256, k256/p256 for ECDSA, recursive CBOR MST traversal. skips MST rebuild (no crate for it).
+**rust (RustCrypto)** — hand-rolled using the same low-level crates that [rsky](https://github.com/blacksky-algorithms/rsky) (Rudy Fraser / BlackSky) uses internally: k256/p256 for ECDSA, serde_ipld_dagcbor for CBOR, sha2 for hashing. no Rust equivalent of indigo's all-in-one `LoadRepoFromCAR` exists yet — rsky provides the building blocks (rsky-repo for MST/CAR, rsky-crypto for signatures, rsky-identity for DID resolution) and [jacquard](https://tangled.sh/@nonbinary.computer/jacquard) (@nonbinary.computer) also has MST/CAR/identity support, but the end-to-end verify pipeline is assembled manually here. HTTP + DNS TXT handle resolution, plc.directory DID resolution, hand-rolled CAR parser with SHA-256, recursive CBOR MST traversal. skips MST rebuild (no crate for it in either rsky or jacquard yet).

 ## the O(n) bug

···

 the story is different from the decode benchmarks. there, zig was 19x faster than Go. here, the gap is ~1.4x. the reason: signature verification is a single ECDSA verify (sub-millisecond for everyone), and CAR parsing on a 70 MB file is less dominated by per-block overhead than the firehose's thousands of small CARs. the MST rebuild (zig-only) is the biggest single cost — serializing 192k entries into a fresh tree and hashing.

-go's MST walk is fastest (5.8ms vs zig's 45.5ms) because indigo's `LoadRepoFromCAR` builds the MST in memory during CAR parse. walking it is just pointer chasing. zig and rust decode MST nodes from raw CBOR on each visit.
+go's MST walk is fastest (5.8ms vs zig's 45.5ms) because indigo's MST nodes are decoded from CBOR once on first access and cached as Go structs — subsequent traversal is pure pointer chasing. zig and rust decode MST nodes from raw CBOR on each visit. the same pattern explains go's 0.0ms MST rebuild: `LoadRepoFromCAR` pre-computes and caches the root CID during load.

 ## what changed in zat

-the O(n) block lookup was the only code change — CAR blocks are now indexed in a `StringHashMap` instead of a flat slice. the rest of the verify pipeline (handle resolution, DID resolution, CAR parsing, signature verification, MST walk + rebuild) was already working from 0.2.0.
+two changes in the CAR parser: blocks are now indexed in a `StringHashMap` for O(1) lookup (the O(n) linear scan was the 79s → 48ms fix), and `verifyRepo` now bypasses the default 2 MB / 10k block limits so large repos like pfrazee's 70 MB actually work.

-also exported the `jwt` module directly (not just the `Jwt` type) so the verify tool can call `jwt.verifySecp256k1` without reaching into internals. and made CAR size limits configurable (`max_size`, `max_blocks` in `readWithOptions`) — pfrazee's 70 MB repo blows past the 2 MB default.
+also exported the `jwt` module directly (not just the `Jwt` type) so the verify tool can call `jwt.verifySecp256k1` without reaching into internals, and made CAR size limits configurable (`max_size`, `max_blocks` in `readWithOptions`) for callers who need custom limits.

 the three-way comparison and chart tooling live in [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench).
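The devlog's explanation of go's walk speed (decode each MST node from CBOR once, cache the struct, then traverse pointers) is a plain memoization pattern. A hedged Rust sketch, where `Node`, `decode_node`, and `NodeCache` are illustrative stand-ins, not rsky or indigo APIs:

```rust
use std::collections::HashMap;

// stand-in for a decoded MST node; real nodes hold entries + subtree links
#[derive(Clone, Debug, PartialEq)]
struct Node {
    keys: Vec<String>,
}

// pretend CBOR decode: expensive, so we only want to pay for it once per CID
fn decode_node(raw: &[u8]) -> Node {
    Node { keys: raw.iter().map(|b| b.to_string()).collect() }
}

struct NodeCache {
    raw_blocks: HashMap<Vec<u8>, Vec<u8>>, // CID -> raw CBOR bytes
    decoded: HashMap<Vec<u8>, Node>,       // CID -> cached decoded node
    decode_count: usize,                   // how many decodes actually ran
}

impl NodeCache {
    // decode-on-first-access, cached thereafter: the indigo approach,
    // versus re-decoding raw CBOR on every visit (zig/rust in the bench)
    fn get(&mut self, cid: &[u8]) -> Option<&Node> {
        if !self.decoded.contains_key(cid) {
            let raw = self.raw_blocks.get(cid)?;
            self.decode_count += 1;
            let node = decode_node(raw);
            self.decoded.insert(cid.to_vec(), node);
        }
        self.decoded.get(cid)
    }
}
```

The trade-off is memory: the cache holds every visited node alive for the duration of the walk.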
src/internal/repo/car.zig (+10 -1)
···
 pub const Car = struct {
     roots: []const cbor.Cid,
     blocks: []const Block,
+    /// CID bytes → block data for O(1) lookup. built by read/readWithOptions.
+    /// empty for manually-constructed Cars (findBlock falls back to linear scan).
+    block_index: std.StringHashMapUnmanaged([]const u8) = .empty,
 };

 pub const CarError = error{
···

     // read blocks
     var blocks: std.ArrayList(Block) = .{};
+    var block_index: std.StringHashMapUnmanaged([]const u8) = .empty;

     while (pos < data.len) {
         // block: [varint total_len] [CID bytes] [data bytes]
···
             .cid_raw = cid_bytes,
             .data = content,
         });
+        try block_index.put(allocator, cid_bytes, content);

         pos = block_end;
     }
···
     return .{
         .roots = try roots.toOwnedSlice(allocator),
         .blocks = try blocks.toOwnedSlice(allocator),
+        .block_index = block_index,
     };
 }
···
     return pos + digest_len_usize;
 }

-/// find a block by matching CID bytes
+/// find a block by matching CID bytes.
+/// uses the hash index when available (O(1)), falls back to linear scan for
+/// manually-constructed Cars without an index.
 pub fn findBlock(c: Car, cid_raw: []const u8) ?[]const u8 {
+    if (c.block_index.count() > 0) return c.block_index.get(cid_raw);
     for (c.blocks) |block| {
         if (std.mem.eql(u8, block.cid_raw, cid_raw)) return block.data;
     }
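The block framing the parser loop walks, `[varint total_len] [CID bytes] [data bytes]`, uses unsigned LEB128 varints, as in the rest of the CARv1 format. A small Rust sketch of just the varint decode step (not zat's code, only the encoding):

```rust
// decode an unsigned LEB128 varint: 7 data bits per byte, low bits first,
// high bit set on every byte except the last. returns (value, bytes read).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate() {
        if i >= 10 {
            return None; // too many continuation bytes for a u64
        }
        value |= u64::from(byte & 0x7f) << (7 * i as u32);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // ran out of input mid-varint
}
```

With the length in hand, the block's CID and payload are the next `total_len` bytes, which is where `cid_bytes` and `content` in the diff above come from.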
src/internal/repo/repo_verifier.zig (+5 -2)
···
     // 5. fetch repo CAR
     const car_bytes = try fetchRepo(allocator, pds_endpoint, did_str);

-    // 6. parse CAR
-    const repo_car = car.read(allocator, car_bytes) catch return error.InvalidCommit;
+    // 6. parse CAR (no size limits — we fetched this ourselves from the PDS)
+    const repo_car = car.readWithOptions(allocator, car_bytes, .{
+        .max_size = car_bytes.len,
+        .max_blocks = car_bytes.len, // effectively unlimited
+    }) catch return error.InvalidCommit;
     if (repo_car.roots.len == 0) return error.NoRootsInCar;

     // 7. find commit block
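Why `car_bytes.len` works as "effectively unlimited" for `max_blocks`: every block occupies at least one byte, so the block count can never exceed the byte length of the input. The same guard pattern, sketched in Rust with hypothetical option and function names:

```rust
// hypothetical options mirroring readWithOptions' two knobs
struct ReadOptions {
    max_size: usize,
    max_blocks: usize,
}

// derive limits from the input itself: block_count <= data_len always
// holds (each block is >= 1 byte), so these limits can never trip
fn unlimited_for(data_len: usize) -> ReadOptions {
    ReadOptions { max_size: data_len, max_blocks: data_len }
}

fn check_limits(
    data_len: usize,
    block_count: usize,
    opts: &ReadOptions,
) -> Result<(), &'static str> {
    if data_len > opts.max_size {
        return Err("CAR exceeds max_size");
    }
    if block_count > opts.max_blocks {
        return Err("CAR exceeds max_blocks");
    }
    Ok(())
}
```

The defaults still protect untrusted inputs; only the self-fetched path opts out.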