Add Telegram alerting for cluster health

CronJob (every 5min) monitors node conditions, pod crash loops, and
memory pressure via the k8s API. Alerts sent to Telegram when nodes
go NotReady, pods enter CrashLoopBackOff, or memory exceeds 85% of
limits. Secrets managed via git-crypt.
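
The 85% rule described above comes down to integer arithmetic in POSIX shell. A minimal sketch, assuming usage and limit are already expressed in Mi; the helper names `mem_percent` and `over_threshold` are hypothetical and do not appear in the commit:

```shell
#!/bin/sh
# Hypothetical helpers illustrating the 85% threshold check.
MEMORY_PERCENT_THRESHOLD=85

# Print usage as an integer percentage of the limit (both in Mi).
# Shell arithmetic truncates, so 85.9% is reported as 85.
mem_percent() {
  usage_mi=$1
  limit_mi=$2
  # Guard against division by zero for pods without a usable limit.
  [ "$limit_mi" -gt 0 ] || return 1
  echo $((usage_mi * 100 / limit_mi))
}

# Succeed when usage is at or above the threshold.
over_threshold() {
  pct=$(mem_percent "$1" "$2") || return 1
  [ "$pct" -ge "$MEMORY_PERCENT_THRESHOLD" ]
}
```

For example, `over_threshold 55 64 && echo alert` fires because 55 of 64 Mi truncates to 85%, while 54 of 64 Mi truncates to 84% and stays silent.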

sans-self.org 98e14346 048d7c9d

+196
+2
CHANGELOG.md
···
 - Add tarpit for vulnerability scanners hitting known exploit paths (#18)
 
 ### Added
+- Add Telegram alerting for NotReady cluster nodes (#69)
 - Add Traefik ingress for Spindle CI runner at spindle.sans-self.org (#59)
 - Add self-hosted Spindle CI runner with Podman rootless (#57)
 - Add Zot container registry with S3 storage and CVE scanning (#56)
···
 - Add Tangled knot with Spindle CI/CD to k3s cluster (#1)
 
 ### Fixed
+- Fix knot signing key rotating on pod restart (#68)
 - Fix shellcheck SC2086 warnings in backup.sh (#62)
 - Fix Spindle CI runner provisioning for all nodes (#61)
 - Fix knot post-receive hooks not being executable (#54)
+43
k8s/alerting/cronjob.yaml
···
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: node-alert
+  namespace: kube-system
+spec:
+  schedule: "*/5 * * * *"
+  concurrencyPolicy: Forbid
+  successfulJobsHistoryLimit: 1
+  failedJobsHistoryLimit: 3
+  jobTemplate:
+    spec:
+      activeDeadlineSeconds: 120
+      template:
+        spec:
+          serviceAccountName: node-alert
+          restartPolicy: Never
+          containers:
+            - name: alert
+              image: alpine:3.21
+              command: ["/bin/sh", "-c", "apk add --no-cache -q curl jq && /scripts/node-alert.sh"]
+              volumeMounts:
+                - name: scripts
+                  mountPath: /scripts
+                  readOnly: true
+                - name: secrets
+                  mountPath: /secrets
+                  readOnly: true
+              resources:
+                requests:
+                  cpu: 10m
+                  memory: 32Mi
+                limits:
+                  cpu: 100m
+                  memory: 64Mi
+          volumes:
+            - name: scripts
+              configMap:
+                name: node-alert-script
+                defaultMode: 0755
+            - name: secrets
+              secret:
+                secretName: telegram-alert
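
Since the schedule only fires every five minutes, a one-off run can be useful when testing a change. A minimal sketch, assuming `kubectl` is configured against the cluster; the function name `run_alert_now` and the `node-alert-manual-` job name prefix are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: start a one-off Job from the node-alert CronJob
# instead of waiting for the next scheduled run.
run_alert_now() {
  # Timestamp suffix keeps repeated manual runs from colliding on the name.
  job_name="node-alert-manual-$(date +%s)"
  kubectl create job "$job_name" \
    --namespace kube-system \
    --from=cronjob/node-alert
}
```

The resulting Job inherits the CronJob's pod template, so it exercises the same image, mounts, and ServiceAccount as a scheduled run.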
+23
k8s/alerting/kustomization.yaml
···
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+  - cronjob.yaml
+  - rbac.yaml
+
+generatorOptions:
+  disableNameSuffixHash: true
+
+configMapGenerator:
+  - name: node-alert-script
+    namespace: kube-system
+    files:
+      - node-alert.sh
+
+secretGenerator:
+  - name: telegram-alert
+    namespace: kube-system
+    type: Opaque
+    files:
+      - bot-token=telegram-bot-token.secret
+      - chat-id=telegram-chat-id.secret
+97
k8s/alerting/node-alert.sh
···
+#!/bin/sh
+set -eu
+
+BOT_TOKEN=$(cat /secrets/bot-token)
+CHAT_ID=$(cat /secrets/chat-id)
+MEMORY_PERCENT_THRESHOLD=85
+
+TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
+CA=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+API="https://kubernetes.default.svc"
+
+kube_get() {
+  curl -sf --cacert "$CA" -H "Authorization: Bearer $TOKEN" "$API$1"
+}
+
+send_alert() {
+  curl -sf -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
+    -d chat_id="$CHAT_ID" \
+    -d parse_mode=Markdown \
+    -d text="$1" > /dev/null
+}
+
+# --- Node health ---
+# Alerts on: NotReady, MemoryPressure, DiskPressure
+
+NODES=$(kube_get "/api/v1/nodes")
+
+printf '%s' "$NODES" | jq -r '
+  .items[] |
+  .metadata.name as $name |
+  [ (.status.conditions[] |
+      select(.type=="Ready" and .status!="True") | "NotReady"),
+    (.status.conditions[] |
+      select(.type=="MemoryPressure" and .status=="True") | "MemoryPressure"),
+    (.status.conditions[] |
+      select(.type=="DiskPressure" and .status=="True") | "DiskPressure")
+  ] | select(length > 0) |
+  "\($name)\t\(join(" "))"
+' | while IFS="$(printf '\t')" read -r NAME PROBLEMS; do
+  send_alert "🔴 *Node Unhealthy: $NAME*
+Status: $PROBLEMS
+Cluster: sans-self.org"
+done
+
+# --- Pod crash loops ---
+# Only alerts on pods currently in CrashLoopBackOff or Error waiting state
+
+PODS=$(kube_get "/api/v1/pods")
+
+printf '%s' "$PODS" | jq -r '
+  .items[] |
+  select(.status.containerStatuses != null) |
+  .metadata.namespace as $ns | .metadata.name as $pod |
+  .status.containerStatuses[] |
+  select(.state.waiting.reason == "CrashLoopBackOff" or .state.waiting.reason == "Error") |
+  "\($ns)\t\($pod)\t\(.restartCount)\t\(.state.waiting.reason)"
+' | while IFS="$(printf '\t')" read -r NS POD RESTARTS REASON; do
+  send_alert "🔄 *Pod Crash Loop: $POD*
+Namespace: $NS
+Restarts: ${RESTARTS} (${REASON})
+Cluster: sans-self.org"
+done
+
+# --- Resource pressure (memory) ---
+# Uses metrics.k8s.io API; skip if unavailable
+
+METRICS=$(kube_get "/apis/metrics.k8s.io/v1beta1/pods" 2>/dev/null || echo "")
+[ -z "$METRICS" ] && exit 0
+
+printf '%s' "$METRICS" | jq -r '
+  .items[] |
+  .metadata.namespace as $ns | .metadata.name as $pod |
+  [.containers[].usage.memory | rtrimstr("Ki") | tonumber] | add // 0 |
+  . / 1024 | floor |
+  "\($ns)\t\($pod)\t\(.)"
+' | while IFS="$(printf '\t')" read -r NS POD MEM_MI; do
+  MEM_LIMIT=$(printf '%s' "$PODS" | jq -r \
+    --arg ns "$NS" --arg pod "$POD" \
+    '[.items[] | select(.metadata.namespace==$ns and .metadata.name==$pod) | .spec.containers[].resources.limits.memory // empty] | first // empty')
+
+  case "$MEM_LIMIT" in
+    *Gi) LIMIT_MI=$(echo "$MEM_LIMIT" | sed 's/Gi//' | awk '{printf "%.0f", $1 * 1024}') ;;
+    *Mi) LIMIT_MI=$(echo "$MEM_LIMIT" | sed 's/Mi//') ;;
+    *) continue ;;
+  esac
+
+  [ -z "$LIMIT_MI" ] && continue
+  [ "$LIMIT_MI" -eq 0 ] 2>/dev/null && continue
+
+  PERCENT=$((MEM_MI * 100 / LIMIT_MI))
+  if [ "$PERCENT" -ge "$MEMORY_PERCENT_THRESHOLD" ]; then
+    send_alert "⚠️ *Memory Pressure: $POD*
+Namespace: $NS
+Usage: ${MEM_MI}Mi / ${LIMIT_MI}Mi (${PERCENT}%)
+Cluster: sans-self.org"
+  fi
+done
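
The memory section of the script assumes usage is reported with a `Ki` suffix and only converts limits written as `Gi` or `Mi`; limits in any other unit are skipped, and a usage value without `Ki` would appear to make `tonumber` fail. One possible way to widen this is a single conversion helper. A minimal sketch, assuming only `awk` is available; the name `to_mi` is hypothetical, and suffixes beyond the six handled here are deliberately rejected:

```shell
#!/bin/sh
# Hypothetical helper: convert a Kubernetes memory quantity to whole Mi.
# Handles binary suffixes (Ki, Mi, Gi), decimal suffixes (k, M, G) and
# plain bytes. Anything else makes the function fail with status 1.
to_mi() {
  printf '%s\n' "$1" | awk '
    {
      value = $0
      factor = 1   # plain bytes by default
      if      (value ~ /Ki$/) { factor = 1024;     sub(/Ki$/, "", value) }
      else if (value ~ /Mi$/) { factor = 1024 ^ 2; sub(/Mi$/, "", value) }
      else if (value ~ /Gi$/) { factor = 1024 ^ 3; sub(/Gi$/, "", value) }
      else if (value ~ /k$/)  { factor = 1000;     sub(/k$/, "", value) }
      else if (value ~ /M$/)  { factor = 1000 ^ 2; sub(/M$/, "", value) }
      else if (value ~ /G$/)  { factor = 1000 ^ 3; sub(/G$/, "", value) }
      # Reject anything that is not a plain decimal number after stripping.
      if (value !~ /^[0-9]+(\.[0-9]+)?$/) exit 1
      # Truncate to whole Mi.
      printf "%d\n", (value * factor) / (1024 * 1024)
    }'
}
```

With such a helper, both sides of the comparison could go through the same path, for example `LIMIT_MI=$(to_mi "$MEM_LIMIT") || continue`.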
+30
k8s/alerting/rbac.yaml
···
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: node-alert
+  namespace: kube-system
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: node-alert
+rules:
+  - apiGroups: [""]
+    resources: ["nodes", "pods"]
+    verbs: ["get", "list"]
+  - apiGroups: ["metrics.k8s.io"]
+    resources: ["pods"]
+    verbs: ["get", "list"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: node-alert
+subjects:
+  - kind: ServiceAccount
+    name: node-alert
+    namespace: kube-system
+roleRef:
+  kind: ClusterRole
+  name: node-alert
+  apiGroup: rbac.authorization.k8s.io
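
The ClusterRole grants read access to exactly the three resources the script queries. A minimal sketch for confirming the binding after it is applied, assuming `kubectl` is configured and the caller is allowed to impersonate ServiceAccounts; the function name `check_alert_rbac` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: verify that the node-alert ServiceAccount can list
# every resource the alert script reads.
check_alert_rbac() {
  sa="system:serviceaccount:kube-system:node-alert"
  # pods.metrics.k8s.io is the resource.group form for the metrics API.
  for resource in nodes pods pods.metrics.k8s.io; do
    if kubectl auth can-i list "$resource" --as="$sa" --all-namespaces >/dev/null 2>&1; then
      echo "ok: list $resource"
    else
      echo "denied: list $resource"
      return 1
    fi
  done
}
```

The function stops at the first denied resource and returns a non-zero status, so it can gate a deploy step.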
k8s/alerting/telegram-bot-token.secret

This is a binary file and will not be displayed.

k8s/alerting/telegram-chat-id.secret

This is a binary file and will not be displayed.
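
The two `.secret` files show as binary because git-crypt stores them encrypted. If the repository is still locked when the manifests are built, the `secretGenerator` would presumably load ciphertext instead of the real token. A minimal sketch of a pre-deploy check, assuming git-crypt's usual file header (a NUL byte, the string `GITCRYPT`, and another NUL); the function name `is_gitcrypt_locked` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file still carries the git-crypt
# header, meaning the working copy has not been unlocked.
is_gitcrypt_locked() {
  # Read the first 10 bytes and drop the NULs surrounding the magic string.
  header=$(head -c 10 "$1" | tr -d '\000')
  [ "$header" = "GITCRYPT" ]
}
```

A deploy script could call it as `is_gitcrypt_locked k8s/alerting/telegram-bot-token.secret && exit 1` before applying the kustomization.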

+1
k8s/kustomization.yaml
···
   - pds
   - knot
   - registry
+  - alerting
 
 generatorOptions:
   disableNameSuffixHash: true