14 Commits

Author SHA1 Message Date
7e61af372b PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
3b5bf20902 PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
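As a rough sketch of how one entry above might land in a chart's values: this assumes each `a/b` pair means request/limit (the commit message does not say so) and that the chart exposes a top-level `resources` key, which varies by chart.

```yaml
# blackbox-exporter: "CPU 15m/32m, memory 100Mi/100Mi"
# Assumption: the first number is the request, the second the limit.
resources:
  requests:
    cpu: 15m
    memory: 100Mi
  limits:
    cpu: 32m
    memory: 100Mi
```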
203a8debac REFACTOR(repo): remove control-plane scheduling
- Remove nodeSelector for control-plane node
- Remove tolerations for control-plane taint
- Allow pods to schedule on any available node
2026-01-10 18:35:15 +09:00
67db06cf8b PERF(observability): reduce replicas to 1
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests so memory is never overcommitted. Note that the Guaranteed QoS class also requires CPU limits equal to CPU requests.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
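A minimal sketch of the resulting container resources, with illustrative numbers rather than values taken from the repository:

```yaml
resources:
  requests:
    cpu: 15m         # illustrative value
    memory: 100Mi    # illustrative value
  limits:
    memory: 100Mi    # set equal to the memory request
    # no cpu limit here, following the earlier commit 4286296591
```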
735166fc9c REFACTOR(repo): standardize taint to control-plane
- Change node-role.kubernetes.io/master to control-plane
- Update vpa, goldilocks, kube-state-metrics tolerations
- Remove deprecated master taint from promtail
2026-01-09 21:41:52 +09:00
4511fd5b2e FIX(repo): correct nodeSelector label value
- Change master label value from "" to "true"
- Fix pod scheduling failure due to label mismatch
2026-01-09 21:41:52 +09:00
1c6a9dc491 PERF(repo): move system pods to master node
- Add nodeSelector for master node placement
- Add tolerations for NoExecute taint
- kube-state-metrics: schedule on master
- goldilocks-controller: schedule on master, reduce to 1 replica
- vpa-recommender: schedule on master, remove anti-affinity
- Free worker node resources for applications
2026-01-09 21:41:52 +09:00
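The placement described in this commit could be sketched as the pod-spec fragment below. The taint key `node-role.kubernetes.io/master` is the one named in 735166fc9c. The nodeSelector label key is an assumption, and its value is the one 4511fd5b2e later settles on.

```yaml
nodeSelector:
  node-role.kubernetes.io/master: "true"   # assumed label key
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoExecute    # tolerate the NoExecute taint on the master node
```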
2b7ee1fe51 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
1b7a4294dc FIX(goldilocks): merge duplicate dashboard sections
- Merge dashboard.affinity into dashboard section
- Fix YAML structure to prevent OutOfSync status
2026-01-09 21:41:52 +09:00
4515ea0b33 FEAT(observability): enable HA with replica 2 and soft anti-affinity
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics

FIX(loki): revert to replica 1 for Single Binary mode

- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
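For the FEAT part of this commit, a soft pod anti-affinity rule might look roughly like the fragment below. The `replicaCount` key comes from the commit message. The label selector and weight are hypothetical, and the real key layout depends on each chart.

```yaml
replicaCount: 2
affinity:
  podAntiAffinity:
    # "preferred" makes the rule soft: the scheduler tries to spread
    # replicas across nodes but still schedules them if it cannot.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: alertmanager   # hypothetical label
```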
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU throttling blocks app startup but does not cause crashes
- Memory OOM kills are the real cause of cascading failures
- CPU requests still ensure fair scheduling
2026-01-07 23:48:35 +09:00
7b9abaf9c8 REFACTOR(obs): integrate ingress to helm-values
- alertmanager: move ingress to karma inline, servicemonitor to manifests
- goldilocks: move ingress to helm-values
- grafana: move ingress to helm-values
- uptime-kuma: move ingress to helm-values
2026-01-06 01:57:03 +09:00
5c4676ca9a REFACTOR(repo): restructure monitoring folder and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml,
  ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00
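A per-application `namespace.yaml` of the kind this commit describes might look like the sketch below. The namespace name is hypothetical. The commit does not list the exact labels, so this assumes the standard Goldilocks opt-in label `goldilocks.fairwinds.com/enabled`.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: grafana    # hypothetical; the commit says namespaces match folder names
  labels:
    goldilocks.fairwinds.com/enabled: "true"   # assumed Goldilocks label
```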