17 Commits

Author SHA1 Message Date
7e61af372b PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
3b5bf20902 PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
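For readers comparing the figures in this commit, a minimal sketch of how Kubernetes resource quantities such as `15m` or `100Mi` convert to plain numbers. The helper names are hypothetical, and reading each `a/b` pair as request/limit is an assumption; the commit message does not say which is which.

```python
from decimal import Decimal

# Suffixes accepted for memory quantities: binary (Ki, Mi, ...) and decimal (k, M, ...).
_MEM_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '15m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '100Mi' to bytes."""
    for suffix, factor in _MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already in bytes


def split_pair(pair: str) -> tuple[str, str]:
    """Split an 'a/b' pair from the commit message into its two quantities."""
    first, second = pair.split("/")
    return first, second


# Example: the opentelemetry line, assumed to read request/limit.
cpu_first, cpu_second = split_pair("34m/410m")
mem_first, mem_second = split_pair("142Mi/1024Mi")
print(cpu_millicores(cpu_first), cpu_millicores(cpu_second))
print(memory_bytes(mem_first), memory_bytes(mem_second))
```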
7cbc0c810e FIX(tempo): move resources to correct helm path
- Move resources from top-level to tempo.resources
- Fix memory limit not being applied to container
2026-01-12 00:21:12 +09:00
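The fix above amounts to relocating a key inside the Helm values tree. A rough sketch of that move, with the values modelled as a Python dict; `move_under` is a hypothetical helper, and the memory figure is illustrative rather than taken from the repository.

```python
import copy


def move_under(values: dict, key: str, section: str) -> dict:
    """Return a copy of `values` with top-level `key` moved to values[section][key]."""
    result = copy.deepcopy(values)
    if key in result:
        moved = result.pop(key)
        result.setdefault(section, {})[key] = moved
    return result


# Before: `resources` sits at the top level, where (per the commit) the chart
# did not apply it to the Tempo container.
before = {"resources": {"limits": {"memory": "100Mi"}}, "tempo": {}}

# After: the same block lives under the `tempo` section.
after = move_under(before, "resources", "tempo")
print(after)
```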
15d5e58d6c migrate: change repoURLs from GitHub to Gitea
Update all ArgoCD Application references to use Gitea (github0213.com)
instead of GitHub for K3S-HOME/observability repository.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:43:29 +09:00
7d5780cb97 PERF(tempo): switch from MinIO to local filesystem storage
- Change storage backend from S3 to local filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 3 days to 1 day for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Remove anti-affinity (single replica only)
2026-01-10 15:58:34 +09:00
ef7c7c2593 PERF(loki,tempo): reduce replicas to 1
- Reduce Loki singleBinary replicas from 2 to 1
- Reduce Tempo replicas from 2 to 1
- Decrease MinIO CPU load (0.5 → 0.1 cores expected)
2026-01-10 15:32:07 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replicas at 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
fa4d97eede REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret
- Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret)
- Remove manifests source from ArgoCD application
2026-01-09 21:42:35 +09:00
b378c6ec06 FIX(tempo): move extraEnv under tempo section for S3 credentials
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
2026-01-09 21:42:35 +09:00
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Neither service uses the boot volume (/dev/sda1) any longer
- WAL data is temporary, persistent data stored in MinIO
2026-01-09 21:42:35 +09:00
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests, a prerequisite for the Guaranteed QoS class (which also requires CPU limits equal to CPU requests).
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
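As a reference for the QoS wording in this commit, a minimal sketch of the Kubernetes QoS classification rules. A pod is Guaranteed only when every container has both CPU and memory limits equal to its requests, so matching memory alone leaves it Burstable. `qos_class` is a hypothetical helper, it compares quantity strings literally, and the sample values are illustrative.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources, following the Kubernetes QoS rules."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Memory limit equals memory request, but no CPU limit is set.
memory_only = [{"requests": {"cpu": "15m", "memory": "100Mi"},
                "limits": {"memory": "100Mi"}}]

# Both CPU and memory limits equal their requests.
cpu_and_memory = [{"requests": {"cpu": "15m", "memory": "100Mi"},
                   "limits": {"cpu": "15m", "memory": "100Mi"}}]

print(qos_class(memory_only))
print(qos_class(cpu_and_memory))
```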
fd6c1952ad FIX(tempo): enable env var expansion in config
- Add extraArgs config.expand-env=true
- Required for ${VAR} substitution in tempo.yaml
2026-01-09 21:41:52 +09:00
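To illustrate what `${VAR}` substitution does to a config file, a small sketch using Python's `string.Template` as a stand-in. This is not Tempo's implementation; it only mimics the effect. The environment variable names and values are hypothetical.

```python
from string import Template

# Fragment of a tempo.yaml that refers to credentials by variable name.
config_text = """\
storage:
  trace:
    backend: s3
    s3:
      access_key: ${S3_ACCESS_KEY}
      secret_key: ${S3_SECRET_KEY}
"""


def expand_env(text: str, env: dict[str, str]) -> str:
    """Replace each ${NAME} in `text` with env[NAME]; raises KeyError if a name is missing."""
    return Template(text).substitute(env)


expanded = expand_env(
    config_text,
    {"S3_ACCESS_KEY": "demo-key", "S3_SECRET_KEY": "demo-secret"},
)
print(expanded)
```

Without expansion enabled, the literal text `${S3_ACCESS_KEY}` would be passed through as the credential value.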
5f926cb6cf FEAT(tempo): configure S3 storage with MinIO
- Enable env var expansion in config
- Configure extraEnv for S3 credentials
- Fix OTel Collector image settings
2026-01-09 21:41:52 +09:00