7e61af372b
PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
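Background for 7e61af372b: a CPU limit is enforced as a CFS quota per scheduling period, so a very small limit leaves a container only a sliver of runtime before it is throttled. The Python sketch below is illustrative only. It assumes the default 100 ms CFS period and a 1 ms minimum quota, and the name cfs_quota_us is hypothetical, not taken from this repository.

```python
CFS_PERIOD_US = 100_000  # assumption: default CFS period of 100 ms
MIN_QUOTA_US = 1_000     # assumption: minimum quota of 1 ms


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (in microseconds) a container may use per period."""
    quota = limit_millicores * period_us // 1000
    # Very small limits are clamped up to the minimum quota.
    return max(quota, MIN_QUOTA_US)


if __name__ == "__main__":
    # A 15m limit allows 1500 us of CPU time in each 100 ms period.
    print(cfs_quota_us(15))
```

Once the quota is used up, the container waits for the next period even when the node has idle CPU, which is the throttling that removing the limit avoids.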
3b5bf20902
PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
c1214029a2
refactor: update Vault secret paths to new categorized structure
- alertmanager: alertmanager → observability/alertmanager
- grafana: postgresql → storage/postgresql
- prometheus: postgresql → storage/postgresql, minio → storage/minio
- thanos: minio → storage/minio
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 22:36:22 +09:00
15d5e58d6c
migrate: change repoURLs from GitHub to Gitea
Update all ArgoCD Application references to use Gitea (github0213.com)
instead of GitHub for K3S-HOME/observability repository.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:43:29 +09:00
a3003d597f
PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
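The adjustments in a3003d597f are easier to compare once the quantities are normalized. The Python sketch below is a minimal, assumed helper set (cpu_millicores, memory_bytes, and percent_change are hypothetical names). It handles only the suffixes that appear in this log plus Ki and Gi, not the full Kubernetes quantity grammar.

```python
_MEMORY_FACTORS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '23m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '175Mi' to bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a bare number is already in bytes


def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


if __name__ == "__main__":
    # grafana memory 425Mi -> 175Mi, taken from the commit body above
    change = percent_change(memory_bytes("425Mi"), memory_bytes("175Mi"))
    print(f"grafana memory change: {change:.1f}%")
```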
9e218a8adc
PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
94af545120
REFACTOR(thanos): remove S3 storage integration
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
2026-01-09 21:42:35 +09:00
14bd244b98
FIX(thanos): increase compactor memory to 256Mi
- Compactor was OOMKilled with 128Mi limit
- Set to 256Mi for stability during compaction
2026-01-09 21:42:35 +09:00
5089e8607d
CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests, a prerequisite for the Guaranteed QoS class (CPU requests and limits must match as well).
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
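To make the QoS reasoning in 5089e8607d concrete, the Python sketch below classifies a single container's resources block using the standard Kubernetes rules. It is a simplification under stated assumptions: one container per pod, quantities compared as literal strings, and a hypothetical function name qos_class. The example values are illustrative, not copied from the manifests.

```python
def qos_class(resources: dict) -> str:
    """Classify a single-container pod's QoS class from its resources block."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if not requests and not limits:
        return "BestEffort"
    for name in ("cpu", "memory"):
        limit = limits.get(name)
        # When a request is omitted, it defaults to the limit.
        request = requests.get(name, limit)
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


if __name__ == "__main__":
    memory_only = {
        "requests": {"cpu": "15m", "memory": "100Mi"},
        "limits": {"memory": "100Mi"},
    }
    # Equal memory request and limit, but no CPU limit.
    print(qos_class(memory_only))
```

Under these rules, matching memory values alone is not enough: a container without a CPU limit is classified as Burstable.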
aecb15031d
FEAT(grafana): add Thanos as default datasource
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus
REFACTOR(thanos): move all components to master node
- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)
FIX(thanos): allow non-Bitnami images (quay.io/thanos)
FIX(thanos): correct nodeSelector value to 'true'
2026-01-09 21:41:52 +09:00
9b052b49cf
FEAT(thanos): add Thanos for Prometheus HA
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure OCI registry for Bitnami chart
- Fix Vault secret path and image settings
- Add nodeSelector for master node
2026-01-09 21:41:52 +09:00
6b576d6a16
FEAT(thanos): add Thanos for Prometheus HA and long-term storage
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure Prometheus replicas: 2 with pod anti-affinity
- Add ExternalSecrets for MinIO credentials
- Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d
2026-01-09 21:41:52 +09:00