sans-self.org/infrastructure
Added backups restoration instructions
sans-self.org · 4 weeks ago
d67c62bf c38693f3
2 changed files, +336
**CHANGELOG.md** (+1)

```diff
 - Add tarpit for vulnerability scanners hitting known exploit paths (#18)

 ### Added
+- Add backup restoration guide for PDS and knot (#35)
 - Create vesper and nyx accounts on PDS (#31)
 - Add daily S3 backup cronjob for Tangled knot data (#9)
 - Add Tangled knot with Spindle CI/CD to k3s cluster (#1)
```
**RESTORE.md** (+335)
# Backup Restoration Guide

Procedures for restoring PDS and knot from S3 backups after data loss.

## Diagnosis

Check if volumes are empty (filesystem overhead only = data loss):

```sh
kubectl exec -n pds deployment/pds -- df -h /pds
kubectl exec -n knot deployment/knot -c knot -- sh -c 'stat /home/git/'
```

PDS: ~8MB on 20GB = empty. Knot: ~7MB on 10GB = empty.
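The emptiness check can be scripted for a quick yes/no answer. This is a sketch: the 1% threshold is an arbitrary assumption, and the parsing assumes the standard `df` column layout (Use% in column 5 of the second line).

```sh
# Sketch: flag the PDS volume as suspiciously empty.
# Assumption: <1% used on a 20GB volume means the data is gone.
used=$(kubectl exec -n pds deployment/pds -- df /pds \
  | awk 'NR==2 {gsub(/%/, "", $5); print $5}')
if [ "${used}" -lt 1 ]; then
  echo "WARNING: /pds is ${used}% used, volume may have lost its data"
fi
```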
Verify services are running on stale data:

```sh
# PDS — should return your account, not RepoNotFound
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye"

# Knot — should return branches, not RepoNotFound
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"
```

## Inspect S3 Backups

List bucket contents to find available snapshots:

```sh
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  ls :s3:sans-self-net/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket
```

Get the S3 credentials from either namespace:

```sh
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.access-key}' | base64 -d
kubectl get secret -n pds pds-s3-credentials -o jsonpath='{.data.secret-key}' | base64 -d
```

### What to look for

**DB snapshots** are timestamped copies (`rclone copyto`) — older ones survive even if a bad backup ran. Pick the newest snapshot from *before* the data loss.

**Directory backups** (`actors/`, `blocks/`, `repositories/`) use `rclone copy` and are not deleted by subsequent runs. They represent the latest state at backup time.
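To pick a snapshot, it helps to list just the DB prefix with the newest name first. A sketch reusing the credentials and flags from the listing command above: `rclone lsf` prints bare file names, and since the names embed a `YYYYMMDD-HHMMSS` stamp, a lexical reverse sort is also chronological.

```sh
# Sketch: list PDS DB snapshots newest-first to choose a restore point.
kubectl run s3-check --rm -it --restart=Never --image=rclone/rclone:1.69 -- \
  lsf :s3:sans-self-net/pds/db/ \
  --s3-provider Other \
  --s3-access-key-id "${S3_ACCESS_KEY}" \
  --s3-secret-access-key "${S3_SECRET_KEY}" \
  --s3-endpoint nbg1.your-objectstorage.com \
  --s3-region nbg1 --s3-no-check-bucket \
  | sort -r
```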
## Restore PDS

### 1. Scale down

```sh
kubectl scale deployment -n pds pds --replicas=0
kubectl wait --for=delete pod -n pds -l app=pds --timeout=60s
```

### 2. Run restore job

Pick the right DB timestamp. The backup runs daily at 02:00 UTC.

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-restore
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"

              rm -rf /data/*

              # Replace TIMESTAMP with the chosen snapshot (e.g. 20260225-020012)
              rclone copyto ":s3:sans-self-net/pds/db/account-TIMESTAMP.sqlite" /data/account.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/did_cache-TIMESTAMP.sqlite" /data/did_cache.sqlite ${S3}
              rclone copyto ":s3:sans-self-net/pds/db/sequencer-TIMESTAMP.sqlite" /data/sequencer.sqlite ${S3}

              rclone copy ":s3:sans-self-net/pds/actors" /data/actors ${S3}
              rclone copy ":s3:sans-self-net/pds/blocks" /data/blocks ${S3}

              ls -la /data/
              echo "PDS restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: pds-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 3. Fix sequencer cursor

The relay (bsky.network) tracks the last sequence number it consumed. After a restore, the sequencer's autoincrement is behind the relay's cursor, so new events are invisible to the network.

Check the relay's cursor from PDS logs after scaling back up:

```sh
kubectl logs -n pds deployment/pds --tail=100 | grep subscribeRepos
# Look for: "cursor":NNN
```
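If the cursor shows up as a JSON field in those log lines, it can be extracted directly. A sketch only: the exact log format is an assumption, so adjust the grep pattern to what your logs actually show.

```sh
# Sketch: pull the last reported relay cursor out of the PDS logs.
# Assumes the log line contains a literal "cursor":NNN JSON field.
cursor=$(kubectl logs -n pds deployment/pds --tail=200 \
  | grep -o '"cursor":[0-9]*' | tail -n1 | cut -d: -f2)
echo "relay cursor: ${cursor}; set seq to at least $((cursor + 100))"
```

Whatever value the fix job writes into `seq` must exceed this cursor.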
Then bump the autoincrement past that cursor. Scale down again first:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: pds-seq-fix
  namespace: pds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              # Set to at least relay_cursor + 100
              sqlite3 /data/sequencer.sqlite "UPDATE sqlite_sequence SET seq = 1000 WHERE name = 'repo_seq';"
              sqlite3 /data/sequencer.sqlite "SELECT seq FROM sqlite_sequence WHERE name='repo_seq';"
              chown 1000:1000 /data/sequencer.sqlite
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pds-data
```

### 4. Scale up and request crawl

```sh
kubectl scale deployment -n pds pds --replicas=1
kubectl wait --for=condition=ready pod -n pds -l app=pds --timeout=120s

# Tell the relay to re-subscribe
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d '{"hostname": "sans-self.org"}'
```

### 5. Verify

```sh
# All accounts resolve
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:wydyrngmxbcsqdvhmd7whmye" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:sg4udwrlnokqtpteaswzcps5" | jq .handle
curl -s "https://sans-self.org/xrpc/com.atproto.repo.describeRepo?repo=did:plc:uog7vhnxiskidenntic67g3z" | jq .handle

# Test that new posts propagate — create a post via the app and check it appears on bsky.app
```

## Restore Knot

### 1. Scale down

```sh
kubectl scale deployment -n knot knot --replicas=0
kubectl wait --for=delete pod -n knot -l app=knot --timeout=60s
```

### 2. Run restore job

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-restore
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: restore
          image: rclone/rclone:1.69
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              S3="--s3-provider Other --s3-access-key-id ${S3_ACCESS_KEY} --s3-secret-access-key ${S3_SECRET_KEY} --s3-endpoint nbg1.your-objectstorage.com --s3-region nbg1 --s3-no-check-bucket"

              rm -rf /data/*

              # Replace TIMESTAMP (e.g. 20260224-023011)
              mkdir -p /data/data
              rclone copyto ":s3:sans-self-net/knot/db/knotserver-TIMESTAMP.db" /data/data/knotserver.db ${S3}
              rclone copy ":s3:sans-self-net/knot/repositories" /data/repositories ${S3}

              ls -la /data/
              echo "knot restore complete"
          env:
            - name: S3_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: access-key }
            - name: S3_SECRET_KEY
              valueFrom:
                secretKeyRef: { name: knot-s3-credentials, key: secret-key }
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 3. Fix repo ACLs (if needed)

The knot DB stores per-repo ACL entries. If a repo was created after the backup, its ACL will be missing and pushes will fail with `access denied: user not allowed` even though SSH auth succeeds.

Copy the DB out, inspect, and patch:

```sh
# Copy DB out of the running pod (after scale-up)
kubectl cp knot/$(kubectl get pod -n knot -l app=knot -o jsonpath='{.items[0].metadata.name}'):/home/git/data/knotserver.db /tmp/knotserver.db -c knot

# Check existing ACLs
sqlite3 /tmp/knotserver.db "SELECT * FROM acl;"
```

To add ACL entries for a missing repo, scale down and run:

```yaml
# kubectl apply -f - <<'YAML'
apiVersion: batch/v1
kind: Job
metadata:
  name: knot-acl-fix
  namespace: knot
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext: { fsGroup: 1000, runAsUser: 0 }
      containers:
        - name: fix
          image: keinos/sqlite3:3.47.2
          command: ["sh", "-c"]
          args:
            - |
              set -eux
              DID="did:plc:wydyrngmxbcsqdvhmd7whmye"
              REPO="${DID}/REPO_NAME"
              sqlite3 /data/data/knotserver.db "
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:settings','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:push','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:owner','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:invite','','');
                INSERT INTO acl VALUES ('p','${DID}','thisserver','${REPO}','repo:delete','','');
                INSERT INTO acl VALUES ('p','server:owner','thisserver','${REPO}','repo:delete','','');
              "
              chown 1000:1000 /data/data/knotserver.db
          volumeMounts:
            - { name: data, mountPath: /data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: knot-data
```

### 4. Scale up and verify

```sh
kubectl scale deployment -n knot knot --replicas=1
kubectl wait --for=condition=ready pod -n knot -l app=knot --timeout=120s

# Verify repos resolve
curl -s "https://knot.sans-self.org/xrpc/sh.tangled.repo.branches?repo=did:plc:wydyrngmxbcsqdvhmd7whmye/infrastructure"

# Test push
git push --dry-run origin main
```

## Post-Restore Cleanup

Delete completed restore jobs:

```sh
kubectl delete job -n pds pds-restore pds-seq-fix 2>/dev/null
kubectl delete job -n knot knot-restore knot-acl-fix 2>/dev/null
```

Remove stale SSH host keys (knot regenerates host keys on every pod restart):

```sh
ssh-keygen -R knot.sans-self.org
```

## Known Gotchas

- **Hetzner volumes lose data on node reschedule.** The volumes stay `Bound` and mounted but are empty. The pod restarts fine on the empty volume, masking the data loss.
- **Choose the right DB snapshot.** Check all available timestamps in S3. The most recent snapshot before data loss is usually best, but if accounts were created between backups, a later snapshot might have more complete account records.
- **Sequencer cursor mismatch kills federation.** Posts succeed locally but don't reach Bluesky. Always bump the sequencer autoincrement past the relay's cursor after restore.
- **Knot ACLs are per-repo.** The server owner can push to repos that have ACL entries. Repos created after the backup will have git data on disk but no ACL — you must add entries manually.
- **SSH host keys change on pod restart.** Every knot scale-down/up regenerates sshd host keys. Run `ssh-keygen -R` to clear stale entries.