
Add self-hosted Spindle CI runner with Podman rootless

Spindle runs as a systemd user service on cluster nodes, using Podman
rootless for container execution. Binary packaged as OCI image in Zot
and deployed via k8s Jobs.

- Bake podman + podman-docker into MicroOS snapshot
- postinstall_exec provisions spindle user on new/replaced nodes
- Healthcheck CronJob with ConfigMap leader election and auto-failover
- Traefik ingress at spindle.sans-self.org via selectorless Service
- Makefile targets: build, push, update, start, logs
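
The auto-failover in the healthcheck bullet above reduces to a staleness check on the heartbeat timestamp stored in the `spindle-leader` ConfigMap. A minimal standalone sketch of that decision, assuming GNU/busybox `date -d` (the timestamp value is illustrative; the real one is read from the ConfigMap):

```shell
# Staleness check mirroring the healthcheck CronJob's failover decision.
# HEARTBEAT is an illustrative value; in the CronJob it comes from the
# spindle-leader ConfigMap's .data.heartbeat field.
HEARTBEAT="2024-01-01T00:00:00Z"
HEARTBEAT_EPOCH=$(date -d "$HEARTBEAT" +%s 2>/dev/null || echo 0)
NOW_EPOCH=$(date +%s)
AGE=$(( NOW_EPOCH - HEARTBEAT_EPOCH ))
if [ "$AGE" -lt 600 ]; then
  echo "healthy"   # fresh heartbeat: leave the current leader alone
else
  echo "stale"     # older than the 10-minute window: take over here
fi
```

An unparseable or missing timestamp falls through to epoch 0, so it always reads as stale — failing toward takeover rather than toward a dead runner.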

+434 -1
+3
.gitignore
··· 33 33 *.yaml.backup 34 34 k3s_kustomization_backup.yaml 35 35 36 + # Spindle binary (built externally, not checked in) 37 + spindle/spindle 38 + 36 39 # Claude 37 40 settings.local.json
+38
Makefile
··· 14 14 15 15 clean-secrets: 16 16 rm -f $(JUICEFS_METAURL) 17 + 18 + # Spindle CI runner 19 + # Full flow: build-spindle → push-spindle → update-spindle → start-spindle (first time only) 20 + # Updates: build-spindle → push-spindle → update-spindle (restarts where already active) 21 + SPINDLE_CORE ?= /tmp/tangled-core 22 + 23 + .PHONY: build-spindle push-spindle update-spindle start-spindle logs-spindle 24 + 25 + build-spindle: 26 + @test -d "$(SPINDLE_CORE)" || { echo "error: tangled core not found at $(SPINDLE_CORE)"; echo "Clone: git clone git@tangled.org:tangled.org/core.git $(SPINDLE_CORE)"; exit 1; } 27 + docker run --rm -v "$(SPINDLE_CORE)":/src -v "$(CURDIR)/spindle":/out -w /src \ 28 + -e GOARCH=arm64 -e GOOS=linux -e CGO_ENABLED=1 -e GOTOOLCHAIN=auto \ 29 + -e CC=aarch64-linux-gnu-gcc \ 30 + golang:1.25rc1 sh -c 'apt-get update -qq && apt-get install -y -qq gcc-aarch64-linux-gnu >/dev/null 2>&1 && go build -o /out/spindle ./cmd/spindle' 31 + 32 + push-spindle: spindle/spindle 33 + docker build --platform linux/arm64 -f spindle/Containerfile -t zot.sans-self.org/infra/spindle:latest spindle/ 34 + docker push zot.sans-self.org/infra/spindle:latest 35 + 36 + spindle/spindle: 37 + @echo "Binary missing. Run: make build-spindle SPINDLE_CORE=/path/to/tangled/core" 38 + @exit 1 39 + 40 + update-spindle: 41 + kubectl delete job spindle-update --ignore-not-found 42 + kubectl apply -f spindle/update-job.yaml 43 + 44 + start-spindle: 45 + kubectl delete job spindle-start --ignore-not-found 46 + kubectl apply -f spindle/start-job.yaml 47 + 48 + logs-spindle: 49 + @NODE_NAME=$$(kubectl get configmap spindle-leader -o jsonpath='{.data.node}' 2>/dev/null) && \ 50 + NODE_IP=$$(kubectl get node $$NODE_NAME -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}' 2>/dev/null) && \ 51 + test -n "$$NODE_IP" || { echo "error: could not find spindle leader node"; exit 1; } && \ 52 + echo "==> Streaming logs from $$NODE_NAME ($$NODE_IP)" && \ 53 + ssh -p 22222 -o StrictHostKeyChecking=no -i keypair/id_ed25519_homelab root@$$NODE_IP \ 54 + 'SPINDLE_UID=$$(id -u spindle) && runuser -u spindle -- env XDG_RUNTIME_DIR=/run/user/$$SPINDLE_UID DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$$SPINDLE_UID/bus journalctl --user -u spindle.service -f --no-pager'
+52
README.md
··· 52 52 3. After bootstrap, fetch a fresh kubeconfig from the node — the one in tofu state will have the wrong CA. 53 53 4. JuiceFS CSI on SELinux (MicroOS) requires `sidecarPrivileged: true` in `juicefs-csi-values.yaml` under `node:`. Without it, the CSI socket has a label mismatch and sidecars can't connect. 54 54 55 + ## Spindle CI Runner 56 + 57 + Self-hosted [Spindle](https://tangled.org) runner at `spindle.sans-self.org` for pipeline execution on our knot. Runs as a systemd user service with Podman rootless — no privileged containers, no DinD. 58 + 59 + ### Architecture 60 + 61 + Spindle runs outside k8s (needs direct access to the Podman socket) but is exposed via Traefik through a selectorless k8s Service + Endpoints. A CronJob healthcheck (`spindle-healthcheck`) runs every 5 minutes: it heartbeats to a `spindle-leader` ConfigMap and keeps the Endpoints pointed at the active node. If the heartbeat goes stale (>10 min), the healthcheck auto-starts Spindle on whichever node it lands on. 62 + 63 + Node provisioning is automatic: `postinstall_exec` in kube.tf creates the spindle user, configures rootless Podman, and pulls the binary from Zot (`zot.sans-self.org/infra/spindle:latest`, anonymous pull). This runs on every new or replaced node. 64 + 65 + ### Build 66 + 67 + ```sh 68 + git clone git@tangled.org:tangled.org/core.git ~/Projects/tangled 69 + make build-spindle SPINDLE_CORE=~/Projects/tangled 70 + ``` 71 + 72 + Requires Docker (cross-compiles with `aarch64-linux-gnu-gcc` for CGo/sqlite3). 73 + 74 + ### Initial setup 75 + 76 + ```sh 77 + make build-spindle # compile ARM64 binary via Docker 78 + make push-spindle # build OCI image, push to Zot (needs docker login) 79 + make update-spindle # deploy binary to all nodes via k8s Job 80 + make start-spindle # start service on one node (run once) 81 + kubectl apply -f spindle/ingress.yaml 82 + kubectl apply -f spindle/healthcheck-cronjob.yaml 83 + ``` 84 + 85 + Then add the runner in the tangled.org UI with hostname `spindle.sans-self.org`. 86 + 87 + ### Binary updates 88 + 89 + ```sh 90 + make build-spindle # recompile from latest source 91 + make push-spindle # push new image to Zot 92 + make update-spindle # rolls out to all nodes, restarts where active 93 + ``` 94 + 95 + ### Node replacement 96 + 97 + Automatic. The healthcheck CronJob detects the stale heartbeat and starts Spindle on another node within 5 minutes. The Endpoints object is updated to route traffic to the new node. 98 + 99 + ### Operations 100 + 101 + ```sh 102 + make logs-spindle # stream journal from active node 103 + make start-spindle # manual start (if healthcheck hasn't kicked in yet) 104 + make update-spindle # redeploy binary + restart 105 + ``` 106 + 55 107 ## Backups 56 108 57 109 Daily S3 backups via CronJobs (02:00 PDS, 02:30 knot). See [RESTORE.md](RESTORE.md) for recovery procedures.
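
The Architecture section's endpoint repair hinges on one extraction step: the healthcheck pulls the leader node's InternalIP out of the Node object with jq before rewriting the `spindle` Endpoints. A minimal sketch of that filter against a canned payload (addresses are illustrative; the field shape follows the core/v1 Node schema):

```shell
# Extract a node's InternalIP the same way the healthcheck script does,
# using a hand-written fragment of a core/v1 Node object as input.
NODE_JSON='{"status":{"addresses":[{"type":"Hostname","address":"node-1"},{"type":"InternalIP","address":"10.0.0.7"}]}}'
NODE_IP=$(echo "$NODE_JSON" | jq -r '.status.addresses[] | select(.type=="InternalIP") | .address')
echo "$NODE_IP"
```

Note this is InternalIP (the in-cluster address the Endpoints should carry), while `make logs-spindle` deliberately uses ExternalIP, since it reaches the node over SSH from outside.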
+1 -1
hcloud-microos-snapshots.pkr.hcl
··· 36 36 } 37 37 38 38 locals { 39 - needed_packages = join(" ", concat(["restorecond policycoreutils policycoreutils-python-utils setools-console audit bind-utils wireguard-tools fuse open-iscsi nfs-client xfsprogs cryptsetup lvm2 git cifs-utils bash-completion mtr tcpdump udica qemu-guest-agent"], var.packages_to_install)) 39 + needed_packages = join(" ", concat(["restorecond policycoreutils policycoreutils-python-utils setools-console audit bind-utils wireguard-tools fuse open-iscsi nfs-client xfsprogs cryptsetup lvm2 git cifs-utils bash-completion mtr tcpdump udica qemu-guest-agent podman podman-docker"], var.packages_to_install)) 40 40 41 41 # Add local variables for inline shell commands 42 42 download_image = "wget --timeout=5 --waitretry=5 --tries=5 --retry-connrefused --inet4-only "
+13
kube.tf
··· 95 95 } 96 96 ] 97 97 98 + # Spindle CI runner — provision user + rootless Podman on every node. 99 + # Binary pulled from Zot registry (fails gracefully on first bootstrap when Zot isn't up yet). 100 + postinstall_exec = [ 101 + "useradd --create-home --shell /bin/bash spindle 2>/dev/null || true", 102 + "grep -q '^spindle:' /etc/subuid || usermod --add-subuids 100000-165535 spindle", 103 + "grep -q '^spindle:' /etc/subgid || usermod --add-subgids 100000-165535 spindle", 104 + "loginctl enable-linger spindle", 105 + "mkdir -p /var/lib/spindle/logs && chown -R spindle:spindle /var/lib/spindle", 106 + "mkdir -p /home/spindle/.config/systemd/user && chown -R spindle:spindle /home/spindle/.config", 107 + "podman pull zot.sans-self.org/infra/spindle:latest && CID=$(podman create zot.sans-self.org/infra/spindle:latest) && podman cp $CID:/spindle /usr/local/bin/spindle && podman cp $CID:/spindle.service /home/spindle/.config/systemd/user/spindle.service && podman rm $CID && chmod 755 /usr/local/bin/spindle && chown -R spindle:spindle /home/spindle/.config || echo 'WARN: spindle image not available, run make setup-spindle after cluster is ready'", 108 + "sleep 2 && SPINDLE_UID=$(id -u spindle) && runuser -u spindle -- env XDG_RUNTIME_DIR=/run/user/$${SPINDLE_UID} DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$${SPINDLE_UID}/bus systemctl --user daemon-reload && runuser -u spindle -- env XDG_RUNTIME_DIR=/run/user/$${SPINDLE_UID} DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$${SPINDLE_UID}/bus systemctl --user enable --now podman.socket || true", 109 + ] 110 + 98 111 create_kubeconfig = false 99 113
+3
spindle/Containerfile
··· 1 + FROM busybox:1.37 2 + COPY spindle /spindle 3 + COPY spindle.service /spindle.service
+159
spindle/healthcheck-cronjob.yaml
··· 1 + apiVersion: v1 2 + kind: ServiceAccount 3 + metadata: 4 + name: spindle-healthcheck 5 + --- 6 + apiVersion: rbac.authorization.k8s.io/v1 7 + kind: Role 8 + metadata: 9 + name: spindle-healthcheck 10 + rules: 11 + - apiGroups: [""] 12 + resources: ["configmaps"] 13 + verbs: ["get", "create", "update", "patch"] 14 + - apiGroups: [""] 15 + resources: ["endpoints"] 16 + verbs: ["get", "update", "patch"] 17 + --- 18 + apiVersion: rbac.authorization.k8s.io/v1 19 + kind: ClusterRole 20 + metadata: 21 + name: spindle-healthcheck-nodes 22 + rules: 23 + - apiGroups: [""] 24 + resources: ["nodes"] 25 + verbs: ["get", "list"] 26 + --- 27 + apiVersion: rbac.authorization.k8s.io/v1 28 + kind: RoleBinding 29 + metadata: 30 + name: spindle-healthcheck 31 + subjects: 32 + - kind: ServiceAccount 33 + name: spindle-healthcheck 34 + roleRef: 35 + apiGroup: rbac.authorization.k8s.io 36 + kind: Role 37 + name: spindle-healthcheck 38 + --- 39 + apiVersion: rbac.authorization.k8s.io/v1 40 + kind: ClusterRoleBinding 41 + metadata: 42 + name: spindle-healthcheck-nodes 43 + subjects: 44 + - kind: ServiceAccount 45 + name: spindle-healthcheck 46 + namespace: default 47 + roleRef: 48 + apiGroup: rbac.authorization.k8s.io 49 + kind: ClusterRole 50 + name: spindle-healthcheck-nodes 51 + --- 52 + apiVersion: batch/v1 53 + kind: CronJob 54 + metadata: 55 + name: spindle-healthcheck 56 + spec: 57 + schedule: "*/5 * * * *" 58 + concurrencyPolicy: Forbid 59 + successfulJobsHistoryLimit: 1 60 + failedJobsHistoryLimit: 3 61 + jobTemplate: 62 + spec: 63 + template: 64 + metadata: 65 + labels: 66 + app: spindle-healthcheck 67 + spec: 68 + serviceAccountName: spindle-healthcheck 69 + hostPID: true 70 + tolerations: 71 + - operator: Exists 72 + containers: 73 + - name: check 74 + image: alpine:3.21 75 + securityContext: 76 + privileged: true 77 + env: 78 + - name: NODE_NAME 79 + valueFrom: 80 + fieldRef: 81 + fieldPath: spec.nodeName 82 + command: ["/bin/sh", "-c"] 83 + args: 84 + - | 85 + apk add --no-cache --quiet util-linux curl jq >/dev/null 2>&1 86 + TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) 87 + CA=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt 88 + NS=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace) 89 + K8S="https://kubernetes.default.svc" 90 + AUTH="-H \"Authorization: Bearer $TOKEN\"" 91 + 92 + kapi() { curl -sk --cacert $CA -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" "$@"; } 93 + 94 + # Check if spindle is running on THIS node 95 + ACTIVE=$(nsenter -t 1 -m -u -i -n -- bash -c ' 96 + uid=$(id -u spindle 2>/dev/null) || exit 1 97 + XDG_RUNTIME_DIR=/run/user/$uid DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus \ 98 + runuser -u spindle -- systemctl --user is-active --quiet spindle.service 2>/dev/null && echo "yes" || echo "no" 99 + ') 100 + 101 + update_endpoint() { 102 + local node="$1" 103 + # Get node's InternalIP 104 + NODE_IP=$(kapi "$K8S/api/v1/nodes/$node" 2>/dev/null | jq -r '.status.addresses[] | select(.type=="InternalIP") | .address') 105 + if [ -z "$NODE_IP" ] || [ "$NODE_IP" = "null" ]; then 106 + echo "WARN: could not resolve InternalIP for $node" 107 + return 1 108 + fi 109 + EPBODY="{\"apiVersion\":\"v1\",\"kind\":\"Endpoints\",\"metadata\":{\"name\":\"spindle\"},\"subsets\":[{\"addresses\":[{\"ip\":\"$NODE_IP\"}],\"ports\":[{\"port\":6555,\"protocol\":\"TCP\"}]}]}" 110 + kapi -X PUT "$K8S/api/v1/namespaces/$NS/endpoints/spindle" -d "$EPBODY" >/dev/null 2>&1 111 + echo "Endpoints updated: $node ($NODE_IP)" 112 + } 113 + 114 + update_heartbeat() { 115 + local node="$1" 116 + TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) 117 + BODY="{\"apiVersion\":\"v1\",\"kind\":\"ConfigMap\",\"metadata\":{\"name\":\"spindle-leader\"},\"data\":{\"node\":\"$node\",\"heartbeat\":\"$TIMESTAMP\"}}" 118 + kapi -X PUT "$K8S/api/v1/namespaces/$NS/configmaps/spindle-leader" -d "$BODY" >/dev/null 2>&1 || \ 119 + kapi -X POST "$K8S/api/v1/namespaces/$NS/configmaps" -d "$BODY" >/dev/null 2>&1 120 + } 121 + 122 + if [ "$ACTIVE" = "yes" ]; then 123 + update_heartbeat "$NODE_NAME" 124 + update_endpoint "$NODE_NAME" 125 + echo "Spindle active on $NODE_NAME, heartbeat + endpoint updated" 126 + exit 0 127 + fi 128 + 129 + # Not running here. Check if someone else has a recent heartbeat. 130 + LEADER_DATA=$(kapi "$K8S/api/v1/namespaces/$NS/configmaps/spindle-leader" 2>/dev/null) 131 + HEARTBEAT=$(echo "$LEADER_DATA" | jq -r '.data.heartbeat // empty') 132 + LEADER_NODE=$(echo "$LEADER_DATA" | jq -r '.data.node // empty') 133 + 134 + if [ -n "$HEARTBEAT" ]; then 135 + HEARTBEAT_EPOCH=$(date -d "$HEARTBEAT" +%s 2>/dev/null || echo 0) 136 + NOW_EPOCH=$(date +%s) 137 + AGE=$(( NOW_EPOCH - HEARTBEAT_EPOCH )) 138 + if [ "$AGE" -lt 600 ]; then 139 + echo "Spindle healthy on $LEADER_NODE (heartbeat ${AGE}s ago), nothing to do" 140 + exit 0 141 + fi 142 + echo "Stale heartbeat from $LEADER_NODE (${AGE}s ago), taking over" 143 + fi 144 + 145 + # No healthy spindle. Start it on this node. 146 + echo "Starting spindle on $NODE_NAME" 147 + nsenter -t 1 -m -u -i -n -- bash -c ' 148 + uid=$(id -u spindle 2>/dev/null) || exit 1 149 + XDG_RUNTIME_DIR=/run/user/$uid DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus \ 150 + runuser -u spindle -- systemctl --user enable --now podman.socket 151 + XDG_RUNTIME_DIR=/run/user/$uid DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus \ 152 + runuser -u spindle -- systemctl --user enable --now spindle.service 153 + ' 154 + 155 + update_heartbeat "$NODE_NAME" 156 + update_endpoint "$NODE_NAME" 157 + echo "Spindle started on $NODE_NAME" 158 + restartPolicy: Never 159 + backoffLimit: 0
+43
spindle/ingress.yaml
··· 1 + apiVersion: cert-manager.io/v1 2 + kind: Certificate 3 + metadata: 4 + name: spindle-sans-self-org 5 + spec: 6 + secretName: spindle-sans-self-org-tls 7 + issuerRef: 8 + name: letsencrypt-prod 9 + kind: ClusterIssuer 10 + dnsNames: 11 + - spindle.sans-self.org 12 + --- 13 + apiVersion: v1 14 + kind: Service 15 + metadata: 16 + name: spindle 17 + spec: 18 + ports: 19 + - port: 6555 20 + targetPort: 6555 21 + protocol: TCP 22 + --- 23 + apiVersion: v1 24 + kind: Endpoints 25 + metadata: 26 + name: spindle 27 + subsets: [] 28 + --- 29 + apiVersion: traefik.io/v1alpha1 30 + kind: IngressRoute 31 + metadata: 32 + name: spindle 33 + spec: 34 + entryPoints: 35 + - websecure 36 + tls: 37 + secretName: spindle-sans-self-org-tls 38 + routes: 39 + - match: Host(`spindle.sans-self.org`) 40 + kind: Rule 41 + services: 42 + - name: spindle 43 + port: 6555
+19
spindle/spindle.service
··· 1 + [Unit] 2 + Description=Spindle CI Runner 3 + After=podman.socket 4 + 5 + [Service] 6 + Type=simple 7 + Environment=DOCKER_HOST=unix:///run/user/%U/podman/podman.sock 8 + Environment=SPINDLE_SERVER_LISTEN_ADDR=0.0.0.0:6555 9 + Environment=SPINDLE_SERVER_DB_PATH=/var/lib/spindle/spindle.db 10 + Environment=SPINDLE_SERVER_HOSTNAME=spindle.sans-self.org 11 + Environment=SPINDLE_SERVER_OWNER=did:plc:wydyrngmxbcsqdvhmd7whmye 12 + Environment=SPINDLE_PIPELINES_LOG_DIR=/var/lib/spindle/logs 13 + Environment=SPINDLE_PIPELINES_WORKFLOW_TIMEOUT=10m 14 + ExecStart=/usr/local/bin/spindle 15 + Restart=on-failure 16 + RestartSec=5 17 + 18 + [Install] 19 + WantedBy=default.target
+34
spindle/start-job.yaml
··· 1 + apiVersion: batch/v1 2 + kind: Job 3 + metadata: 4 + name: spindle-start 5 + spec: 6 + completions: 1 7 + parallelism: 1 8 + template: 9 + metadata: 10 + labels: 11 + app: spindle-start 12 + spec: 13 + hostPID: true 14 + tolerations: 15 + - operator: Exists 16 + containers: 17 + - name: start 18 + image: alpine:3.21 19 + securityContext: 20 + privileged: true 21 + command: ["/bin/sh", "-c"] 22 + args: 23 + - | 24 + apk add --no-cache --quiet util-linux >/dev/null 2>&1 25 + nsenter -t 1 -m -u -i -n -- bash -c ' 26 + uid=$(id -u spindle) 27 + export XDG_RUNTIME_DIR=/run/user/$uid 28 + export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus 29 + runuser -u spindle -- systemctl --user enable --now podman.socket 30 + runuser -u spindle -- systemctl --user enable --now spindle.service 31 + echo "Spindle started" 32 + ' 33 + restartPolicy: Never 34 + backoffLimit: 1
+69
spindle/update-job.yaml
··· 1 + apiVersion: batch/v1 2 + kind: Job 3 + metadata: 4 + name: spindle-update 5 + spec: 6 + completions: 3 7 + parallelism: 3 8 + template: 9 + metadata: 10 + labels: 11 + app: spindle-update 12 + spec: 13 + hostPID: true 14 + affinity: 15 + podAntiAffinity: 16 + requiredDuringSchedulingIgnoredDuringExecution: 17 + - labelSelector: 18 + matchLabels: 19 + app: spindle-update 20 + topologyKey: kubernetes.io/hostname 21 + tolerations: 22 + - operator: Exists 23 + initContainers: 24 + - name: extract 25 + image: zot.sans-self.org/infra/spindle:latest 26 + securityContext: 27 + privileged: true 28 + command: ["/bin/sh", "-c"] 29 + args: 30 + - cp /spindle /host-bin/spindle && cp /spindle.service /host-service/spindle.service 31 + volumeMounts: 32 + - name: host-bin 33 + mountPath: /host-bin 34 + - name: host-service 35 + mountPath: /host-service 36 + containers: 37 + - name: deploy 38 + image: alpine:3.21 39 + securityContext: 40 + privileged: true 41 + command: ["/bin/sh", "-c"] 42 + args: 43 + - | 44 + apk add --no-cache --quiet util-linux >/dev/null 2>&1 45 + nsenter -t 1 -m -u -i -n -- bash -c ' 46 + chmod 755 /usr/local/bin/spindle 47 + chown -R spindle:spindle /home/spindle/.config 48 + uid=$(id -u spindle 2>/dev/null) || { echo "spindle user missing"; exit 0; } 49 + export XDG_RUNTIME_DIR=/run/user/$uid 50 + export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus 51 + runuser -u spindle -- systemctl --user daemon-reload 52 + if runuser -u spindle -- systemctl --user is-active --quiet spindle.service 2>/dev/null; then 53 + runuser -u spindle -- systemctl --user restart spindle.service 54 + echo "Spindle restarted" 55 + else 56 + echo "Spindle not active on this node, skipping" 57 + fi 58 + ' 59 + volumes: 60 + - name: host-bin 61 + hostPath: 62 + path: /usr/local/bin 63 + type: Directory 64 + - name: host-service 65 + hostPath: 66 + path: /home/spindle/.config/systemd/user 67 + type: DirectoryOrCreate 68 + restartPolicy: Never 69 + backoffLimit: 3