7e61af372b
PERF(observability): remove CPU limits for stability
...
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
3b5bf20902
PERF(observability): optimize resources via VPA
...
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
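The "(upperBound)" note on the grafana entry refers to a VPA recommendation field. As a minimal sketch, assuming the values came from recommendation-only VPA objects such as the ones Goldilocks creates, such an object might look like the following. The target name and kind are assumptions, not taken from the repo.

```yaml
# Hypothetical recommendation-only VPA; names are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grafana
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grafana          # assumed workload name
  updatePolicy:
    updateMode: "Off"      # only compute recommendations, never evict or patch pods
```

The recommender writes its results to `status.recommendation.containerRecommendations`, where each container gets `lowerBound`, `target`, and `upperBound` values for CPU and memory.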
67db06cf8b
PERF(observability): reduce replicas to 1
...
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
ad9573e998
FIX(alertmanager): remove duplicate volume config
...
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a
CHORE(repo): switch to emptyDir with sizeLimit
...
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
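For reference, a size-limited emptyDir in a raw pod spec looks roughly like the sketch below. The container name, volume name, and mount path are assumptions; the Helm charts involved expose this through their own values keys, which may differ.

```yaml
# Pod spec fragment; volume name and mount path are hypothetical.
spec:
  containers:
    - name: loki
      volumeMounts:
        - name: data
          mountPath: /var/loki
  volumes:
    - name: data
      emptyDir:
        sizeLimit: 2Gi     # the kubelet evicts the pod if usage exceeds this
```

Unlike a PVC, an emptyDir is deleted when the pod is removed from its node, so the data does not survive rescheduling.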
5089e8607d
CHORE(resources): set memory limits equal to memory requests
...
Set memory limits equal to memory requests so memory is never overcommitted. Full Guaranteed QoS would also require CPU limits equal to CPU requests.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
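A minimal sketch of the resulting container resources, combined with the earlier removal of CPU limits, could look like this. The numbers are placeholders rather than values from this commit.

```yaml
# Illustrative container resources; values are placeholders.
resources:
  requests:
    cpu: 15m          # used for scheduling and relative CPU share
    memory: 100Mi
  limits:
    memory: 100Mi     # equal to the request, so memory is not overcommitted
    # no cpu limit: the container is not throttled by a CFS quota
```

Kubernetes assigns the Guaranteed QoS class only when every container has both CPU and memory limits equal to its requests. With the CPU limit omitted, a pod like this is classified as Burstable.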
7ed4d69c51
PERF(alertmanager): add HA with 2 replicas
...
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
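"Soft" anti-affinity corresponds to the preferred (not required) scheduling rule. A sketch of such a rule, assuming the pods carry the label shown, might be:

```yaml
# Affinity fragment; the label selector is an assumption about the chart's pod labels.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: alertmanager
```

Because the rule is only preferred, the scheduler still places both replicas on one node if no other node fits.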
4286296591
PERF(resources): remove CPU limits - keep memory limits only
...
- CPU throttling prevents apps from starting; it does not cause crashes
- Memory OOM kills are the real cause of cascading failures
- CPU requests still ensure fair scheduling
2026-01-07 23:48:35 +09:00
864c2c45d8
REFACTOR(alertmanager): change storageClass
...
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
997893284b
FEAT(alertmanager): add ServiceMonitor
...
- Create servicemonitor.yaml for Prometheus to scrape Alertmanager
- alertmanager chart does not include ServiceMonitor, must be added separately
- Enables Grafana Alertmanager dashboard to display data
2026-01-05 00:40:01 +09:00
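A standalone ServiceMonitor of the kind this commit describes could be sketched as follows. The namespace, labels, and port name are assumptions and must match the actual Alertmanager Service.

```yaml
# Hypothetical servicemonitor.yaml; selector and port name are assumed.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: alertmanager
  namespace: alertmanager
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager
  endpoints:
    - port: http           # name of the Service port, not its number
      path: /metrics
      interval: 30s
```

Prometheus only picks the object up if its own ServiceMonitor selectors match this resource's namespace and labels.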
e4b477a510
REFACTOR(longhorn): migrate to local-path
...
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
2c8095a1db
FIX(alertmanager): alertmanager SMTP auth by loading config from secret
...
- Add ExternalSecret to generate alertmanager.yml with SMTP password from Vault
- Disable helm chart config (ConfigMap) and use extraSecretMounts instead
- Fixes "535 5.7.8 Error: authentication failed" SMTP error
2026-01-05 00:40:01 +09:00
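The approach can be sketched with an ExternalSecret that renders the whole alertmanager.yml from a template. Everything specific here is an assumption: the store name, the Vault path, the SMTP host, and the addresses.

```yaml
# Hypothetical ExternalSecret; store name, Vault path, and SMTP details are placeholders.
apiVersion: external-secrets.io/v1beta1   # newer operator releases also serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: alertmanager-config
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: alertmanager-config
    template:
      data:
        alertmanager.yml: |
          global:
            smtp_smarthost: smtp.example.com:587
            smtp_from: alerts@example.com
            smtp_auth_username: alerts@example.com
            smtp_auth_password: "{{ .smtp_password }}"
          route:
            receiver: email
          receivers:
            - name: email
              email_configs:
                - to: oncall@example.com
  data:
    - secretKey: smtp_password       # name used inside the template
      remoteRef:
        key: alertmanager/smtp       # assumed Vault path
        property: password
```

The generated Secret is then mounted into the Alertmanager pod in place of the chart's ConfigMap, so the password never appears in Git.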
0ce1f99fb4
CHORE(goldilocks): disable goldilocks and cancel trivy installation
...
- Comment out goldilocks/argocd.yaml from kustomization
- Comment out trivy/argocd.yaml from kustomization
- Disable autoSync in both applications
- Server overload mitigation
2026-01-05 00:40:01 +09:00
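Commenting the applications out of a kustomization might look like the sketch below. The surviving entry is a hypothetical sibling; only the two commented paths come from the commit message.

```yaml
# Hypothetical kustomization.yaml after the change.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - prometheus/argocd.yaml       # assumed sibling entry, still enabled
  # - goldilocks/argocd.yaml     # disabled to reduce server load
  # - trivy/argocd.yaml          # installation cancelled
```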
ac2abde8b5
FIX(prometheus): servicemonitor namespace from monitoring to prometheus
...
2026-01-04 23:38:05 +09:00
5c4676ca9a
REFACTOR(repo): restructure monitoring folder and add namespace resou...
...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml, ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00
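A namespace.yaml with a Goldilocks label could be sketched as below. The namespace name is an example, and the exact labels used in the repo are an assumption; `goldilocks.fairwinds.com/enabled` is the label Goldilocks uses to opt a namespace in.

```yaml
# Hypothetical namespace.yaml; the name matches the application folder.
apiVersion: v1
kind: Namespace
metadata:
  name: alertmanager
  labels:
    goldilocks.fairwinds.com/enabled: "true"   # lets Goldilocks create VPAs for workloads here
```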