sans-self.org/infrastructure
Added backups restoration instructions
sans-self.org · 4 weeks ago
d67c62bf c38693f3
2 changed files, +336
**CHANGELOG.md** (+1)

```diff
 - Add tarpit for vulnerability scanners hitting known exploit paths (#18)

 ### Added
+- Add backup restoration guide for PDS and knot (#35)
 - Create vesper and nyx accounts on PDS (#31)
 - Add daily S3 backup cronjob for Tangled knot data (#9)
 - Add Tangled knot with Spindle CI/CD to k3s cluster (#1)
```
**RESTORE.md** (+335)
# Backup Restoration Guide

Procedures for restoring PDS and knot from S3 backups after data loss.

## Diagnosis

Check if volumes are empty (filesystem overhead only = data loss):

```sh
kubectl exec -n pds deployment/pds -- df -h /pds
kubectl exec -n knot deployment/knot -c knot -- sh -c 'stat /home/git/'
```

PDS: ~8MB on 20GB = empty. Knot: ~7MB on 10GB = empty.
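The emptiness check can be scripted for a quick yes/no answer. This is a sketch: the 1% threshold is an arbitrary assumption, and the parsing assumes the standard `df` column layout (Use% in column 5 of the second line).

```sh
# Sketch: flag the PDS volume as suspiciously empty.
# Assumption: <1% used on a 20GB volume means the data is gone.
used=$(kubectl exec -n pds deployment/pds -- df /pds \
  | awk 'NR==2 {gsub(/%/, "", $5); print $5}')
if [ "${used}" -lt 1 ]; then
  echo "WARNING: /pds is ${used}% used, volume may have lost its data"
fi
```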
Verify services are running on stale data:

```sh
# PDS — should return your account, not RepoNotFound
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye"

# Knot — should return branches, not RepoNotFound
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
```

## Inspect S3 Backups

List bucket contents to find available snapshots:

```sh
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  ls :s3:sans-self-net/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket
```

Get the S3 credentials from either namespace:

```sh
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
```

### What to look for

**DB snapshots** are timestamped copies (`rclone copyto`) — older ones survive even if a bad backup ran. Pick the newest snapshot from *before* the data loss.

**Directory backups** (`actors/`, `blocks/`, `repositories/`) use `rclone copy` and are not deleted by subsequent runs. They represent the latest state at backup time.
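To pick a snapshot, it helps to list just the DB prefix with the newest name first. A sketch reusing the credentials and flags from the listing command above: `rclone lsf` prints bare file names, and since the names embed a `YYYYMMDD-HHMMSS` stamp, a lexical reverse sort is also chronological.

```sh
# Sketch: list PDS DB snapshots newest-first to choose a restore point.
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  lsf :s3:sans-self-net/pds/db/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket \
  | sort -r
```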
## Restore PDS

### 1. Scale down

```sh
kubectl scale deployment -n pds pds --replicas=0
kubectl wait --for=delete pod -n pds -l app=pds --timeout=60s
```

### 2. Run restore job

Pick the right DB timestamp. The backup runs daily at 02:00 UTC.

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-restore
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"

              rm -rf /data/*

              # Replace TIMESTAMP with the chosen snapshot (e.g. 20260225-020012)
              rclone copyto ":s3:sans-self-net/pds/db/account-TIMESTAMP.sqlite" /data/account.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/did_cache-TIMESTAMP.sqlite" /data/did_cache.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/sequencer-TIMESTAMP.sqlite" /data/sequencer.sqlite ${S3}

              rclone copy ":s3:sans-self-net/pds/actors" /data/actors ${S3}
              rclone copy ":s3:sans-self-net/pds/blocks" /data/blocks ${S3}

              ls -la /data/
              echo "PDS restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 3. Fix sequencer cursor

The relay (bsky.network) tracks the last sequence number it consumed. After a restore, the sequencer's autoincrement is behind the relay's cursor, so new events are invisible to the network.

Check the relay's cursor from PDS logs after scaling back up:

```sh
kubectl logs -n pds deployment/pds --tail=100 | grep subscribeRepos
# Look for: "cursor":NNN
```
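If the cursor shows up as a JSON field in those log lines, it can be extracted directly. A sketch only: the exact log format is an assumption, so adjust the grep pattern to what your logs actually show.

```sh
# Sketch: pull the last reported relay cursor out of the PDS logs.
# Assumes the log line contains a literal "cursor":NNN JSON field.
cursor=$(kubectl logs -n pds deployment/pds --tail=200 \
  | grep -o '"cursor":[0-9]*' | tail -n1 | cut -d: -f2)
echo "relay cursor: ${cursor}; set seq to at least $((cursor + 100))"
```

Whatever value the fix job writes into `seq` must exceed this cursor.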
Then bump the autoincrement past that cursor. Scale down again first:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-seq-fix
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              # Set to at least relay_cursor + 100
              sqlite3 /data/sequencer.sqlite "UPDATE sqlite_sequence SET seq = 1000 WHERE name = 'repo_seq';"
              sqlite3 /data/sequencer.sqlite "SELECT seq FROM sqlite_sequence WHERE name='repo_seq';"
              chown 1000:1000 /data/sequencer.sqlite
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 4. Scale up and request crawl

```sh
kubectl scale deployment -n pds pds --replicas=1
kubectl wait --for=condition=ready pod -n pds -l app=pds --timeout=120s

# Tell the relay to re-subscribe
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d '{"hostname": "sans-self.org"}'
```

### 5. Verify

```sh
# All accounts resolve
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:sg4udwrlnokqtpteaswzcps5" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:uog7vhnxiskidenntic67g3z" | jq .handle

# Test that new posts propagate — create a post via the app and check it appears on bsky.app
```

## Restore Knot

### 1. Scale down

```sh
kubectl scale deployment -n knot knot --replicas=0
kubectl wait --for=delete pod -n knot -l app=knot --timeout=60s
```

### 2. Run restore job

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-restore
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"

              rm -rf /data/*

              # Replace TIMESTAMP (e.g. 20260224-023011)
              mkdir -p /data/data
              rclone copyto ":s3:sans-self-net/knot/db/knotserver-TIMESTAMP.db" /data/data/knotserver.db ${S3}
              rclone copy ":s3:sans-self-net/knot/repositories" /data/repositories ${S3}

              ls -la /data/
              echo "knot restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 3. Fix repo ACLs (if needed)

The knot DB stores per-repo ACL entries. If a repo was created after the backup, its ACL will be missing and pushes will fail with `access denied: user not allowed` even though SSH auth succeeds.

Copy the DB out, inspect, and patch:

```sh
# Copy DB out of the running pod (after scale-up)
kubectl cp knot/$(kubectl get pod -n knot -l app=knot -o jsonpath='{.items[0].metadata.name}'):/home/git/data/knotserver.db /tmp/knotserver.db -c knot

# Check existing ACLs
sqlite3 /tmp/knotserver.db "SELECT * FROM acl;"
```

To add ACL entries for a missing repo, scale down and run:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-acl-fix
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              DID="did:plc:wydyrngmxbcsqdvhmd7whmye"
              REPO="${DID}/REPO_NAME"
              sqlite3 /data/data/knotserver.db "
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:settings','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:push','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:owner','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:invite','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:delete','','');
                INSERT INTO acl VALUES ('p','server:owner','thisserver','${REPO}','repo:delete','','');
              "
              chown 1000:1000 /data/data/knotserver.db
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 4. Scale up and verify

```sh
kubectl scale deployment -n knot knot --replicas=1
kubectl wait --for=condition=ready pod -n knot -l app=knot --timeout=120s

# Verify repos resolve
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"

# Test push
git push --dry-run origin main
```

## Post-Restore Cleanup

Delete completed restore jobs:

```sh
kubectl delete job -n pds pds-restore pds-seq-fix 2>/dev/null
kubectl delete job -n knot knot-restore knot-acl-fix 2>/dev/null
```

Remove stale SSH host keys (knot regenerates host keys on every pod restart):

```sh
ssh-keygen -R knot.sans-self.org
```

## Known Gotchas

- **Hetzner volumes lose data on node reschedule.** The volumes stay `Bound` and mounted but are empty. The pod restarts fine on the empty volume, masking the data loss.
- **Choose the right DB snapshot.** Check all available timestamps in S3. The most recent snapshot before data loss is usually best, but if accounts were created between backups, a later snapshot might have more complete account records.
- **Sequencer cursor mismatch kills federation.** Posts succeed locally but don't reach Bluesky. Always bump the sequencer autoincrement past the relay's cursor after restore.
- **Knot ACLs are per-repo.** The server owner can push to repos that have ACL entries. Repos created after the backup will have git data on disk but no ACL — you must add entries manually.
- **SSH host keys change on pod restart.** Every knot scale-down/up regenerates sshd host keys. Run `ssh-keygen -R` to clear stale entries.