13 Commits

Author SHA1 Message Date
67db06cf8b PERF(observability): reduce replicas to 1
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
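As a rough illustration of what this change amounts to at the pod level, the fragment below uses the standard Kubernetes `emptyDir.sizeLimit` field. The volume name `storage` is a placeholder, and the Helm value keys that render this fragment differ per chart and are not shown in the commit.

```yaml
# Pod spec fragment (standard Kubernetes fields).
# "storage" is a hypothetical volume name, not taken from the repository.
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 2Gi   # value from the commit (loki, tempo)
```

An `emptyDir` volume is deleted together with its pod, and a pod that writes past `sizeLimit` is evicted by the kubelet, so this trades durability for simpler storage management.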
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests so memory is not overcommitted.
Note: the Guaranteed QoS class also requires CPU limits equal to CPU requests.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
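A minimal sketch of the resulting `resources` block, assuming placeholder sizes (the actual values per component are not listed in the commit):

```yaml
# Container resources fragment (standard Kubernetes fields).
# The sizes are illustrative assumptions, not the repository's values.
resources:
  requests:
    cpu: 100m        # CPU request kept for scheduling
    memory: 256Mi
  limits:
    memory: 256Mi    # equal to the memory request
    # no cpu limit, matching the earlier "remove CPU limits" commit
```

With no CPU limit set, Kubernetes classifies such a pod as Burstable. The memory request and limit still match, so the scheduler reserves exactly the amount the container is allowed to use.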
7ed4d69c51 PERF(alertmanager): add HA with 2 replicas
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
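The "soft" anti-affinity mentioned here corresponds to the Kubernetes `preferredDuringSchedulingIgnoredDuringExecution` rule. The sketch below assumes the chart accepts a top-level `affinity` value and that pods carry the `app.kubernetes.io/name: alertmanager` label; both are assumptions to verify against the chart in use.

```yaml
# Helm values sketch. replicaCount comes from the commit message;
# the affinity key and the pod label are assumptions.
replicaCount: 2
affinity:
  podAntiAffinity:
    # "preferred" makes this a soft rule: the scheduler tries to spread
    # replicas across nodes but still schedules them if it cannot.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: alertmanager
```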
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU throttling delays or blocks app startup; it does not cause crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
864c2c45d8 REFACTOR(alertmanager): change storageClass
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
997893284b FEAT(alertmanager): add ServiceMonitor
- Create servicemonitor.yaml for Prometheus to scrape Alertmanager
- alertmanager chart does not include ServiceMonitor, must be added separately
- Enables Grafana Alertmanager dashboard to display data
2026-01-05 00:40:01 +09:00
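For orientation, a `servicemonitor.yaml` of this kind might look roughly like the following. The `ServiceMonitor` kind and its fields come from the Prometheus Operator API; the namespace, label selector, and port name are assumptions and must match the actual Alertmanager Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: alertmanager
  namespace: alertmanager          # assumed namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed Service label
  endpoints:
    - port: http                   # assumed Service port name
      path: /metrics
      interval: 30s
```

Prometheus only picks this up if its ServiceMonitor selectors (namespace and labels) cover the resource.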
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
2c8095a1db FIX(alertmanager): alertmanager smtp auth by
- loading config from secret
- Add ExternalSecret to generate alertmanager.yml with SMTP password
  from Vault
- Disable helm chart config (ConfigMap) and use extraSecretMounts
  instead
- Fixes "535 5.7.8 Error: authentication failed" SMTP error
2026-01-05 00:40:01 +09:00
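A sketch of the approach described above, using the External Secrets Operator templating feature to render `alertmanager.yml` with the password injected. The store name, Vault path, property name, SMTP host, and addresses are all hypothetical, and the `apiVersion` may need to be `external-secrets.io/v1` on newer operator releases.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-config        # hypothetical name
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend            # hypothetical store name
  target:
    name: alertmanager-config      # Secret to mount via extraSecretMounts
    template:
      engineVersion: v2
      data:
        alertmanager.yml: |
          global:
            smtp_smarthost: "smtp.example.com:587"
            smtp_from: "alerts@example.com"
            smtp_auth_username: "alerts@example.com"
            smtp_auth_password: "{{ .smtpPassword }}"
          route:
            receiver: default
          receivers:
            - name: default
              email_configs:
                - to: "oncall@example.com"
  data:
    - secretKey: smtpPassword
      remoteRef:
        key: alertmanager/smtp     # hypothetical Vault path
        property: password         # hypothetical property
```

Rendering the whole file into a Secret keeps the password out of the Helm-managed ConfigMap, which is the reason the commit disables the chart's own config.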
0ce1f99fb4 CHORE(goldilocks): disable goldilocks
- and cancel trivy installation
- Comment out goldilocks/argocd.yaml from kustomization
- Comment out trivy/argocd.yaml from kustomization
- Disable autoSync in both applications
- Server overload mitigation
2026-01-05 00:40:01 +09:00
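The kustomization change could look roughly like this. The two commented paths are named in the commit; the remaining entry is an illustrative assumption.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - prometheus/argocd.yaml       # assumed example of an entry left enabled
  # - goldilocks/argocd.yaml     # disabled: server overload mitigation
  # - trivy/argocd.yaml          # disabled: installation cancelled
```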
ac2abde8b5 FIX(prometheus): servicemonitor namespace
- from monitoring to prometheus
2026-01-04 23:38:05 +09:00
5c4676ca9a REFACTOR(repo): restructure monitoring folder
- and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml,
  ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00
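To make the restructured layout concrete, here is a sketch of one application after the change: a `namespace.yaml` carrying the Goldilocks opt-in label, and an ArgoCD Application whose `valueFiles` entry points at the flattened `helm-values.yaml`. The repository URL, chart version, and folder path are hypothetical, and the multi-source `$values` pattern is an assumption about how the values file is referenced.

```yaml
# namespace.yaml (namespace name matches the folder name)
apiVersion: v1
kind: Namespace
metadata:
  name: alertmanager
  labels:
    goldilocks.fairwinds.com/enabled: "true"
---
# argocd.yaml (sketch; URLs, versions, and paths are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: alertmanager
  namespace: argocd
spec:
  project: default
  sources:
    - repoURL: https://prometheus-community.github.io/helm-charts
      chart: alertmanager
      targetRevision: 1.0.0                  # placeholder version
      helm:
        valueFiles:
          - $values/monitoring/alertmanager/helm-values.yaml   # hypothetical path
    - repoURL: https://git.example.com/org/repo.git            # hypothetical repo
      targetRevision: main
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: alertmanager                  # matches the folder name
```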