Fast and robust atproto CAR file processing in rust

get ready for release

+27 -11
+2 -2
Cargo.toml
```diff
 [package]
 name = "repo-stream"
-version = "0.2.2"
+version = "0.3.0"
 edition = "2024"
 license = "MIT OR Apache-2.0"
-description = "A robust CAR file -> MST walker for atproto"
+description = "Fast and robust atproto CAR file processing"
 repository = "https://tangled.org/@microcosm.blue/repo-stream"

 [dependencies]
```
+11
changelog.md
```diff
+# v0.3.0
+
+_2026-01-15_
+
+- drop sqlite, pick up fjall v3 for some speeeeeeed (and code simplification and easier build requirements)
+- no more `Processable` trait, process functions are just `Vec<u8> -> Vec<u8>` now (bring your own ser/de). there's a potential small cost here where processors now have to actually go through serialization even for in-memory car walking, but i think zero-copy approaches (eg. rkyv) are low-cost enough
+- custom deserialize for MST nodes that does as much depth calculation and rkey validation as possible in-line (not clear if it actually made anything faster)
+- check MST depth at every node properly (previously it could do some walking before being able to check, and included some assumptions)
+- check MST for empty leaf nodes (which are not allowed)
+- shave 0.6 nanoseconds (really) from MST depth calculation (don't ask)
+- drop and swap some dependencies: `bincode`, `futures`, `futures-core`, `ipld-core` -> `cid`, `multibase`, `rusqlite` -> `fjall`. and add `hashbrown` because it benchmarked a bit faster. (we hash on user-controlled CIDs -- is the lower DoS-resistance a risk to worry about?)
```
+14 -9
readme.md
````diff
 ```

 more recent todo
+- [ ] add a zero-copy rkyv process function example
 - [ ] repo car slices
 - [ ] lazy-value stream (rkey -> CID diffing for tap-like `#sync` handling)
 - [x] get an *empty* car for the test suite
 - [x] implement a max size on disk limit

+some ideas
+- [ ] since the disk k/v get/set interface is now so similar to HashMap (blocking, no transactions), it's probably possible to make a single `Driver` and move the thread stuff from the disk one to generic helper functions. (might create async footguns though)
+- [ ] fork iroh-car into a sync version so we can drop tokio as a hard requirement, and offer async via wrapper helpers
+- [ ] feature-flag the sha2 crate for hmac-sha256? if someone wanted fewer deps?? then maybe make `hashbrown` also optional vs the builtin hashmap?

 -----
````
````diff
 - [x] car file test fixtures & validation tests
 - [x] make sure we can get the did and signature out for verification
   -> yeah the commit is returned from init
-- [ ] spec compliance todos
+- [x] spec compliance todos
   - [x] assert that keys are ordered and fail if not
   - [x] verify node mst depth from key (possibly pending [interop test fixes](https://github.com/bluesky-social/atproto-interop-tests/issues/5))
-- [ ] performance todos
+- [x] performance todos
   - [x] consume the serialized nodes into a mutable efficient format
-  - [ ] maybe customize the deserialize impl to do that directly?
+  - [x] maybe customize the deserialize impl to do that directly?
   - [x] benchmark and profile
-- [ ] robustness todos
-  - [ ] swap the blocks hashmap for a BlockStore trait that can be dumped to redb
-  - [ ] maybe keep the redb function behind a feature flag?
-  - [ ] can we assert a max size for node blocks?
+- [x] robustness todos
+  - [x] swap the blocks hashmap for a BlockStore trait that can be dumped to redb
+  - [x] maybe keep the redb function behind a feature flag?
+  - [ ] can we assert a max size of entries for node blocks?
   - [x] figure out why asserting the upper nibble of the fourth byte of a node fails fingerprinting
     -> because it's the upper 3 bytes, not upper 4 byte nibble, oops.
-  - [ ] max mst depth (there is actually a hard limit but a malicious repo could do anything)
-  - [ ] i don't *think* we need a max recursion depth for processing cbor contents since we leave records to the user to decode
+  - [x] max mst depth (too expensive to attack actually)
+  - [x] i don't *think* we need a max recursion depth for processing cbor contents since we leave records to the user to decode

 newer ideas
````