21 Commits

Author SHA1 Message Date
eea6420544 PERF(loki): switch from MinIO to local filesystem storage
- Change storage type from S3 to filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 7 days to 3 days for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Eliminates MinIO CPU load from Loki operations
2026-01-10 15:57:50 +09:00
001aa9253d PERF(loki): disable canary to reduce MinIO load
- Disable lokiCanary which queries Loki every second
- Reduces continuous S3 read operations on MinIO
2026-01-10 15:43:19 +09:00
ef7c7c2593 PERF(loki,tempo): reduce replicas to 1
- Reduce Loki singleBinary replicas from 2 to 1
- Reduce Tempo replicas from 2 to 1
- Decrease MinIO CPU load (0.5 → 0.1 cores expected)
2026-01-10 15:32:07 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replica 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use boot volume (/dev/sda1)
- WAL data is temporary, persistent data stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests, a prerequisite for the Guaranteed QoS class (which also requires CPU limits equal to CPU requests).
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
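
Note on 5089e8607d: the sketch below illustrates how Kubernetes assigns a QoS class from requests and limits. It is a simplified model, not the kubelet's implementation: `qos_class` and the dict layout are hypothetical names chosen for illustration, and quantities are compared as plain strings (so "1" and "1000m" would wrongly count as different). It suggests that a container with equal memory request and limit but no CPU limit would land in Burstable rather than Guaranteed.

```python
def qos_class(containers):
    """Return the QoS class for a pod, given each container's resources.

    Each container is a dict with optional "requests" and "limits" dicts,
    keyed by "cpu" and "memory" (hypothetical layout for this sketch).
    """
    any_set = False      # does any container set a cpu/memory request or limit?
    guaranteed = True    # stays True only if every limit exists and equals its request

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            # Simplification: quantities are compared as strings, not parsed.
            if limit is None or request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Memory limit equals request, CPU has a request only (values are illustrative).
memory_only = [{
    "requests": {"cpu": "23m", "memory": "100Mi"},
    "limits": {"memory": "100Mi"},
}]
print(qos_class(memory_only))
```

Under this model, adding a CPU limit equal to the CPU request is what would move the pod to Guaranteed.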
9f3b768cd9 FIX(loki): fix lokiCanary config path
- Move lokiCanary to top-level config
- Fix toleration not being applied to DaemonSet
2026-01-09 21:41:52 +09:00
a1c347e4ff FEAT(loki): enable loki-canary with control-plane toleration
- Enable lokiCanary for log ingestion monitoring
- Add toleration for control-plane node
2026-01-09 21:41:52 +09:00
2cf35d0f76 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
2b7ee1fe51 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
4515ea0b33 FEAT(observability): enable HA with replica 2 and soft anti-affinity
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics

FIX(loki): revert to replica 1 for Single Binary mode

- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU limits cause throttling that can stall app startup; they do not prevent crashes
- Memory exhaustion (OOM kills) is the real cause of cascading failures
- CPU requests alone ensure fair scheduling
2026-01-07 23:48:35 +09:00
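
Note on 4286296591: a rough sketch of why a small CPU limit can stall startup. It assumes the default 100 ms CFS period used for CPU limits and a single-threaded startup phase. The function names and the 2000 ms startup workload are hypothetical, and the wall-clock figure is an approximation that ignores scheduling details.

```python
def cfs_quota_us(cpu_limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time (microseconds) a container may use per CFS period."""
    return cpu_limit_millicores * period_us // 1000


def approx_wall_clock_ms(cpu_work_ms: float, cpu_limit_millicores: int) -> float:
    """Approximate wall-clock time for single-threaded work under a CPU limit.

    With a limit below one core, the thread runs for its quota and is then
    throttled until the next period, so the work is stretched by 1000 / limit.
    """
    stretch = max(1.0, 1000 / cpu_limit_millicores)
    return cpu_work_ms * stretch


# A 100m limit allows 10 ms of CPU time in every 100 ms period.
print(cfs_quota_us(100))
# Hypothetical startup needing 2000 ms of CPU time under that limit.
print(approx_wall_clock_ms(2000, 100))
```

With no CPU limit the same work is bounded only by the CPU actually available on the node, which is consistent with the commit's choice to keep CPU requests and drop CPU limits.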
864c2c45d8 REFACTOR(alertmanager): change storageClass
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
b7ac39d68c FEAT(prometheus): enable Loki ServiceMonitor
- Enable Loki ServiceMonitor for Prometheus metrics collection
- Add monitoring configuration
2026-01-05 00:40:01 +09:00
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
5c4676ca9a REFACTOR(repo): restructure monitoring folder
- and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml,
  ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00