15 Commits

Author SHA1 Message Date
7e61af372b PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
3b5bf20902 PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
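
The values above use Kubernetes quantity suffixes (`m` for millicores, `Mi` for mebibytes). The commit does not say what the two numbers in each pair denote, so the sketch below only normalizes individual quantities for comparison. `parse_quantity` is a hypothetical helper, not part of any Kubernetes client, and it covers only a common subset of suffixes.

```python
from decimal import Decimal

# Binary (power-of-two) suffixes, checked first so "Mi" is not mistaken for "M".
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
# Decimal suffixes; "m" means one thousandth (millicores for CPU).
_DECIMAL = {"m": Decimal("0.001"), "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text):
    """Convert a quantity string such as '15m' or '100Mi' to a Decimal.

    CPU results are in cores, memory results are in bytes.
    Exponent notation and the larger suffixes are intentionally not handled.
    """
    for suffix, factor in _BINARY.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    for suffix, factor in _DECIMAL.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


if __name__ == "__main__":
    for raw in ("15m", "410m", "100Mi", "1024Mi"):
        print(raw, "=", parse_quantity(raw))
```

Normalizing this way makes it straightforward to compare entries, for example to confirm that `1024Mi` and `1Gi` describe the same amount of memory.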
67db06cf8b PERF(observability): reduce replicas to 1
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests, as the Guaranteed QoS class requires.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
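
The message above mentions the Guaranteed QoS class. As a rough illustration of how Kubernetes assigns QoS classes, the sketch below applies the documented rules to a simplified pod description. The function name and the dict layout are assumptions for this example, and quantities are compared as plain strings. Note that a pod with memory limits equal to requests but no CPU limits is classified as Burstable under these rules.

```python
def qos_class(containers):
    """Classify a pod as 'Guaranteed', 'Burstable', or 'BestEffort'.

    `containers` is a list of dicts with optional 'requests' and 'limits'
    mappings (a simplified, hypothetical stand-in for a pod spec).
    Quantities are compared as strings, so '1' and '1000m' would not match.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    memory_only = [{
        "requests": {"cpu": "15m", "memory": "100Mi"},
        "limits": {"memory": "100Mi"},  # no CPU limit
    }]
    print(qos_class(memory_only))
```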
7ed4d69c51 PERF(alertmanager): add HA with 2 replicas
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
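
A "soft" pod anti-affinity rule corresponds to the `preferredDuringSchedulingIgnoredDuringExecution` field of the Kubernetes pod spec. The sketch below builds such a rule as a plain dict. The label `app.kubernetes.io/name: alertmanager` and the helper name are assumptions for illustration, since the commit does not show the actual values or where the chart expects them.

```python
import json


def soft_anti_affinity(match_labels, weight=100):
    """Build a preferred (soft) pod anti-affinity rule spreading pods across nodes.

    `weight` must be between 1 and 100. The scheduler prefers, but does not
    require, nodes that run no other pod matching `match_labels`.
    """
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": dict(match_labels)},
                        # One topology domain per node.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    # Hypothetical label; the real selector depends on the chart.
    rule = soft_anti_affinity({"app.kubernetes.io/name": "alertmanager"})
    print(json.dumps({"affinity": rule}, indent=2))
```

Because the rule is only a preference, both replicas can still be scheduled on a single-node cluster, which a hard (`required...`) rule would prevent.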
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU throttling blocks app startup; it does not prevent crashes
- Memory OOM is the real cause of cascading failures
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
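
To illustrate why a tight CPU limit can stall startup, the sketch below works through the arithmetic of CFS quota enforcement, assuming the default 100 ms period. The limit and workload figures are hypothetical and do not come from this repository.

```python
def cfs_quota_ms(limit_millicores, period_ms=100):
    """CPU time in milliseconds a container may use per CFS period."""
    return limit_millicores * period_ms / 1000


def min_wall_seconds(cpu_seconds_needed, limit_millicores):
    """Approximate lower bound on wall-clock time for single-threaded work.

    A single thread cannot use more than one core, so limits above
    1000m do not shorten the result further.
    """
    usable_cores = min(limit_millicores / 1000, 1.0)
    return cpu_seconds_needed / usable_cores


if __name__ == "__main__":
    # Hypothetical case: startup needs 2 CPU-seconds under a 100m limit.
    print("quota per period (ms):", cfs_quota_ms(100))
    print("minimum startup time (s):", min_wall_seconds(2, 100))
```

With a 100m limit the container gets roughly 10 ms of CPU per 100 ms period, so 2 CPU-seconds of startup work needs about 20 seconds of wall-clock time, which can exceed probe timeouts. A CPU request without a limit still gives the pod its scheduling share without this cap.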
864c2c45d8 REFACTOR(alertmanager): change storageClass
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
997893284b FEAT(alertmanager): add ServiceMonitor
- Create servicemonitor.yaml for Prometheus to scrape Alertmanager
- alertmanager chart does not include ServiceMonitor, must be added separately
- Enables Grafana Alertmanager dashboard to display data
2026-01-05 00:40:01 +09:00
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
2c8095a1db FIX(alertmanager): alertmanager smtp auth by loading config from secret
- Add ExternalSecret to generate alertmanager.yml with SMTP password
  from Vault
- Disable helm chart config (ConfigMap) and use extraSecretMounts
  instead
- Fixes "535 5.7.8 Error: authentication failed" SMTP error
2026-01-05 00:40:01 +09:00
0ce1f99fb4 CHORE(goldilocks): disable goldilocks and cancel trivy installation
- Comment out goldilocks/argocd.yaml from kustomization
- Comment out trivy/argocd.yaml from kustomization
- Disable autoSync in both applications
- Server overload mitigation
2026-01-05 00:40:01 +09:00
ac2abde8b5 FIX(prometheus): servicemonitor namespace from monitoring to prometheus
2026-01-04 23:38:05 +09:00
5c4676ca9a REFACTOR(repo): restructure monitoring folder and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml,
  ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00