8 Commits

Author SHA1 Message Date
7e61af372b PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
3b5bf20902 PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
15d5e58d6c migrate: change repoURLs from GitHub to Gitea
Update all ArgoCD Application references to use Gitea (github0213.com)
instead of GitHub for K3S-HOME/observability repository.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:43:29 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
ffed27419a REFACTOR(blackbox-exporter): revert to http_2xx module
- Remove http_auth module workaround
- Authelia now bypasses internal cluster traffic
- All endpoints use standard http_2xx module
2026-01-09 21:42:35 +09:00
37c216c433 FIX(blackbox-exporter): handle Authelia-protected endpoints
- Add http_auth module accepting 401/403 status codes
- Apply http_auth to grafana, code-server, pgweb, velero-ui
- These services return 401 when accessed without authentication
2026-01-09 21:42:35 +09:00
884a38d8ad FEAT(blackbox-exporter): add external endpoint monitoring
- Add blackbox-exporter with prometheus-community Helm chart
- Configure HTTP probes for 25 external endpoints
- Include SSL certificate expiry alerting rules
- Add probe failure and slow response alerts
- Deploy 2 replicas with anti-affinity for HA
2026-01-09 21:42:35 +09:00