
Update RESTORE.md for JuiceFS architecture

PDS blobs are on native S3 now, not in backups. Added
post-receive hook permissions fix for knot restores.
Removed stale Hetzner Volumes gotcha.

+47 -28
RESTORE.md
···
 
 Procedures for restoring PDS and knot from S3 backups after data loss.
 
-## Diagnosis
+## Architecture
 
-Check if volumes are empty (filesystem overhead only = data loss):
+All persistent volumes are backed by JuiceFS (S3-backed FUSE filesystem). Data survives node rescheduling and pod restarts — the old Hetzner Volumes data-loss-on-reschedule bug is eliminated.
 
-```sh
-kubectl exec -n pds deployment/pds -- df -h /pds
-kubectl exec -n knot deployment/knot -c knot -- sh -c 'stat /home/git/'
-```
+**What's backed up:**
+- **PDS**: SQLite databases only (`account.sqlite`, `did_cache.sqlite`, `sequencer.sqlite`). Blob storage is natively on S3 — not on the PVC, not in backups.
+- **Knot**: SQLite database (`knotserver.db`) + git repositories (`repositories/`).
+
+**What's not backed up:**
+- **Zot registry**: Container images are rebuildable artifacts. No backups.
+- **PDS blobs**: Stored natively in S3 by the PDS process. Already durable — not part of backup/restore.
 
-PDS: ~8MB on 20GB = empty. Knot: ~7MB on 10GB = empty.
+**Schedule:** PDS at 02:00 UTC, knot at 02:30 UTC. Daily.
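Given the daily 02:00 UTC schedule, choosing a restore point is mechanical: take the newest snapshot timestamped strictly before the data loss. A throwaway sketch of that selection in plain shell — the filenames are hypothetical samples in the `name-YYYYMMDD-HHMMSS.sqlite` format the doc uses, and a real run would pipe in the rclone listing instead:

```sh
#!/bin/sh
# Pick the newest snapshot strictly before a known data-loss time.
# YYYYMMDD-HHMMSS (UTC) timestamps sort correctly as plain strings,
# so "newest before cutoff" is just: filter, sort, take last.
loss_at="20260225-090000"   # hypothetical: data loss noticed at 09:00 UTC

# Hypothetical sample listing; in practice this comes from rclone.
candidates="account-20260223-020011.sqlite
account-20260224-020009.sqlite
account-20260225-020012.sqlite"

pick=$(printf '%s\n' "$candidates" \
  | sed 's/^account-//; s/\.sqlite$//' \
  | awk -v cutoff="$loss_at" '$0 < cutoff' \
  | sort | tail -n 1)

echo "restore from: account-${pick}.sqlite"
```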
 
-Verify services are running on stale data:
+## Diagnosis
+
+Check if services have lost their data:
 
 ```sh
 # PDS — should return your account, not RepoNotFound
···
 curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
 ```
 
+Check volume contents directly:
+
+```sh
+kubectl exec -n pds deployment/pds -- ls -la /pds/
+kubectl exec -n knot deployment/knot -c knot -- ls -la /home/git/data/
+```
+
 ## Inspect S3 Backups
 
-List bucket contents to find available snapshots:
+Get S3 credentials:
+
+```sh
+kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
+kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
+```
+
+List available snapshots:
 
 ```sh
 kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
···
   --s3-region nbg1 --s3-no-check-bucket
 ```
 
-Get the S3 credentials from either namespace:
-
-```sh
-kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
-kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
-```
-
-### What to look for
-
-**DB snapshots** are timestamped copies (`rclone copyto`) — older ones survive even if a bad backup ran. Pick the newest snapshot from *before* the data loss.
-
-**Directory backups** (`actors/`, `blocks/`, `repositories/`) use `rclone copy` and are not deleted by subsequent runs. They represent the latest state at backup time.
+DB snapshots are timestamped (e.g. `account-20260225-020012.sqlite`). Pick the newest one from *before* the data loss.
 
 ## Restore PDS
 
···
 
 ### 2. Run restore job
 
-Pick the right DB timestamp. The backup runs daily at 02:00 UTC.
+Replace `TIMESTAMP` with the chosen snapshot timestamp (e.g. `20260225-020012`).
+
+Only SQLite databases are restored — blob storage lives natively on S3 and doesn't need restoration.
 
 ```yaml
 # kubectl apply -f - <<'YAML'
···
           rclone copyto ":s3:sans-self-net/pds/db/account-TIMESTAMP.sqlite" /data/account.sqlite ${S3}
           rclone copyto ":s3:sans-self-net/pds/db/did_cache-TIMESTAMP.sqlite" /data/did_cache.sqlite ${S3}
           rclone copyto ":s3:sans-self-net/pds/db/sequencer-TIMESTAMP.sqlite" /data/sequencer.sqlite ${S3}
-
-          rclone copy ":s3:sans-self-net/pds/actors" /data/actors ${S3}
-          rclone copy ":s3:sans-self-net/pds/blocks" /data/blocks ${S3}
 
           ls -la /data/
           echo "PDS restore complete"
···
           claimName: knot-data
 ```
 
-### 3. Fix repo ACLs (if needed)
+### 3. Fix post-receive hooks
+
+Restored git repositories may have non-executable `post-receive` hooks (FUSE `default_permissions` prevents root-in-container from chmod on files owned by `git`). Fix as the `git` user:
+
+```sh
+kubectl exec -n knot deploy/knot -- su -s /bin/sh git -c \
+  'find /home/git/repositories -name post-receive -exec chmod +x {} \;'
+```
+
+Without this, pushes land in the bare repo but knot never processes them — no feed updates, no diff indexing, no notifications.
+
+### 4. Fix repo ACLs (if needed)
 
 The knot DB stores per-repo ACL entries. If a repo was created after the backup, its ACL will be missing and pushes will fail with `access denied: user not allowed` even though SSH auth succeeds.
 
···
           claimName: knot-data
 ```
 
-### 4. Scale up and verify
+### 5. Scale up and verify
 
 ```sh
 kubectl scale deployment -n knot knot --replicas=1
···
 
 ## Known Gotchas
 
-- **Hetzner volumes lose data on node reschedule.** The volumes stay `Bound` and mounted but are empty. The pod restarts fine on the empty volume, masking the data loss.
+- **PDS blobs are not in backups.** They live natively on S3 via the PDS process. If the S3 bucket itself is lost, blobs are gone. The backup only covers SQLite databases.
 - **Choose the right DB snapshot.** Check all available timestamps in S3. The most recent snapshot before data loss is usually best, but if accounts were created between backups, a later snapshot might have more complete account records.
 - **Sequencer cursor mismatch kills federation.** Posts succeed locally but don't reach Bluesky. Always bump the sequencer autoincrement past the relay's cursor after restore.
 - **Knot ACLs are per-repo.** The server owner can push to repos that have ACL entries. Repos created after the backup will have git data on disk but no ACL — you must add entries manually.
+- **Knot post-receive hooks may lose execute permissions.** After restoring from S3, hooks may not be executable due to FUSE `default_permissions`. Must chmod as the `git` user, not root.
 - **SSH host keys change on pod restart.** Every knot scale-down/up regenerates sshd host keys. Run `ssh-keygen -R` to clear stale entries.
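The post-receive permission fix in the diff can be rehearsed outside the cluster before running it as the `git` user in the pod. A local dry-run with a throwaway directory standing in for `/home/git/repositories` — repo names and paths here are made up:

```sh
#!/bin/sh
# Local dry-run of the post-receive chmod fix: recreate the restored-repo
# layout in a temp dir, then apply the same find/-exec pattern as in the doc.
root=$(mktemp -d)
mkdir -p "$root/repositories/infra.git/hooks" "$root/repositories/blog.git/hooks"
for r in infra blog; do
  echo '#!/bin/sh' > "$root/repositories/$r.git/hooks/post-receive"
  chmod 644 "$root/repositories/$r.git/hooks/post-receive"  # simulate lost +x
done

# Same pattern as the in-cluster command, pointed at the temp tree.
find "$root/repositories" -name post-receive -exec chmod +x {} \;

# Count hooks that now carry the owner-execute bit (-perm -100); should
# equal the number of repos.
fixed=$(find "$root/repositories" -name post-receive -perm -100 | wc -l | tr -d ' ')
echo "executable hooks: $fixed"
```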