a3003d597f
PERF(observability): adjust resources based on VPA
...
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
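For illustration, the tempo change in the commit above might look like this in a Helm values file. This is only a sketch: the top-level `resources` key and the assumption that the figures are requests (the values VPA recommends) are not stated in the commit, and the real key path depends on each chart.

```yaml
# Hypothetical values fragment for tempo.
# The numbers come from the commit; the key path is assumed.
resources:
  requests:
    cpu: 15m       # was 50m
    memory: 100Mi  # was 128Mi
```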
5089e8607d
CHORE(resources): set memory limits equal to memory requests
...
Align memory limits with memory requests (Guaranteed QoS would additionally require CPU limits equal to CPU requests).
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
735166fc9c
REFACTOR(repo): standardize taint to control-plane
...
- Change node-role.kubernetes.io/master to control-plane
- Update vpa, goldilocks, kube-state-metrics tolerations
- Remove deprecated master taint from promtail
2026-01-09 21:41:52 +09:00
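A toleration for the control-plane taint named in the commit above could be sketched as follows. Where the `tolerations` key sits in each chart's values is an assumption, and the commit does not show the exact operator or effect that was used.

```yaml
# Sketch of a pod-spec toleration for the control-plane taint.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists  # no effect is given, so every effect of this taint is tolerated
```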
4511fd5b2e
FIX(repo): correct nodeSelector label value
...
- Change master label value from "" to "true"
- Fix pod scheduling failure due to label mismatch
2026-01-09 21:41:52 +09:00
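A nodeSelector only matches when the label value is exactly equal, which is why a selector value of "" never matches a node labeled "true". A minimal sketch of the corrected selector follows; the label key is an assumption, because the commit only refers to it as the master label.

```yaml
# Hypothetical label key; the value must equal the node's label exactly.
nodeSelector:
  node-role.kubernetes.io/master: "true"  # quoted so YAML keeps it a string, not a boolean
```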
1c6a9dc491
PERF(repo): move system pods to master node
...
- Add nodeSelector for master node placement
- Add tolerations for NoExecute taint
- kube-state-metrics: schedule on master
- goldilocks-controller: schedule on master, reduce to 1 replica
- vpa-recommender: schedule on master, remove anti-affinity
- Free worker node resources for applications
2026-01-09 21:41:52 +09:00
4515ea0b33
FEAT(observability): enable HA with replica 2 and soft anti-affinity
...
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics
FIX(loki): revert to replica 1 for Single Binary mode
- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
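The soft anti-affinity described in the commit above corresponds to the `preferredDuringSchedulingIgnoredDuringExecution` form of pod anti-affinity, which asks the scheduler to spread replicas across nodes without blocking scheduling when it cannot. A sketch for one component follows; the pod label and weight are assumptions, not values taken from the commit.

```yaml
# Hypothetical values fragment for alertmanager.
replicaCount: 2
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # "soft": a preference, not a requirement
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname         # spread across distinct nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: alertmanager    # assumed pod label
```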
4286296591
PERF(resources): remove CPU limits - keep memory limits only
...
- CPU throttling from limits can stall app startup; it does not protect against crashes
- Memory exhaustion (OOM) is the real cause of cascading failures
- CPU requests still ensure fair CPU sharing between pods
2026-01-07 23:48:35 +09:00
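The policy in the commit above amounts to a resources block with a memory limit but no CPU limit. The following is a sketch with placeholder numbers; none of the values or the key path come from the commit.

```yaml
# Placeholder values; only the shape (no limits.cpu) reflects the commit.
resources:
  requests:
    cpu: 50m       # still used for scheduling and CPU shares
    memory: 128Mi
  limits:
    memory: 128Mi  # memory stays capped; CPU is left unlimited on purpose
```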
ac2abde8b5
FIX(prometheus): servicemonitor namespace
...
- Change ServiceMonitor namespace from monitoring to prometheus
2026-01-04 23:38:05 +09:00
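One possible reading of the fix above is that the ServiceMonitor's own namespace changed. A sketch under that assumption follows; the resource name, selector label, and port name are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app      # hypothetical name
  namespace: prometheus  # previously monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app  # hypothetical Service label
  endpoints:
    - port: metrics      # hypothetical named port on the Service
```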
5c4676ca9a
REFACTOR(repo): restructure monitoring folder
...
- and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml,
ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00
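A namespace.yaml of the kind the commit above describes might look like the sketch below. The namespace name is an assumption, and the label shown is the one Goldilocks documents for opting a namespace in; the commit does not list the exact labels used.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus  # assumed; the commit says namespaces match folder names
  labels:
    goldilocks.fairwinds.com/enabled: "true"  # lets Goldilocks create VPA objects here
```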