- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
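A minimal sketch of what the `medium-priority` class referenced above might look like. The `value` and `description` are assumptions made for illustration; the actual class definition is not part of this log.

```yaml
# Hypothetical PriorityClass; only the name "medium-priority" comes from the log.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 1000            # assumed value, pick one that fits the cluster's other classes
globalDefault: false   # workloads must opt in explicitly
description: "Monitoring components that should outrank best-effort workloads."
```

Workloads would then opt in by setting `priorityClassName: medium-priority` in their pod spec, typically through the corresponding Helm value of each chart.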
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously empty interval caused $__rate_interval mismatch
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
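For reference, the `loki.resource.labels` hint mentioned above is usually set with a `resource` processor in the OpenTelemetry Collector. The fragment below is a sketch that assumes the contrib Loki exporter's hint mechanism and assumes the two attributes named in the `logs.json` change; the real `collector.yaml` may differ.

```yaml
# Sketch of a collector.yaml fragment; the attribute list is an assumption.
processors:
  resource:
    attributes:
      - action: insert
        key: loki.resource.labels
        # Resource attributes to promote to Loki labels. Dots become
        # underscores in Loki, e.g. k8s.namespace.name -> k8s_namespace_name.
        value: k8s.namespace.name, k8s.container.name
```

The processor also has to be listed in the logs pipeline for the hint to take effect.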
- Replace timeseries with stat panel for OOM detection
- Show total count of OOMKilled pods instead of timeline
- Gauge metric not suitable for timeseries visualization
- Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason
- Fix OOM events not showing after pod restart
- cAdvisor metric resets on pod restart, kube-state-metrics persists
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus
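The datasource change above could be expressed in Grafana's datasource provisioning format roughly as follows. The datasource names and service URLs are hypothetical; only the default/backup split comes from the log.

```yaml
# Sketch of a Grafana datasource provisioning file; URLs are placeholders.
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus        # Thanos Query speaks the Prometheus HTTP API
    access: proxy
    url: http://thanos-query.monitoring.svc:9090
    isDefault: true         # dashboards without an explicit datasource use this
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090
    isDefault: false        # kept as a backup
```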
REFACTOR(thanos): move all components to master node
- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)
FIX(thanos): allow non-Bitnami images (quay.io/thanos)
FIX(thanos): correct nodeSelector value to 'true'
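A sketch of the scheduling settings these Thanos commits describe, to be applied to each affected component (query, storegateway, compactor). The taint and label keys are assumed to be the standard `node-role.kubernetes.io/control-plane`; the label value `"true"` follows the fix above.

```yaml
# Assumed keys; the exact Helm values path depends on the chart in use.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule      # tolerate the control-plane:NoSchedule taint
nodeSelector:
  # Quoted so YAML treats it as a string, not a boolean.
  node-role.kubernetes.io/control-plane: "true"
```

The toleration only permits scheduling on the master node; the `nodeSelector` is what actually pins the pods there.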
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
- Use manual import instead of automatic provisioning
- Remove configMapGenerator for dashboards
- Remove sidecar and dashboards helm config
- Keep JSON files in dashboards/ for manual import reference
- to JSON and use sidecar ConfigMaps
- Export 14 dashboards to JSON files
- Use kustomize configMapGenerator for dashboard ConfigMaps
- Enable Grafana sidecar to load dashboards from ConfigMaps
- Keep Longhorn and Traefik Official from grafana.com
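The sidecar-based setup described above can be sketched with a kustomize `configMapGenerator`. The ConfigMap name and the `grafana_dashboard` label are assumptions (the label must match whatever the Grafana sidecar is configured to watch); `dashboards/logs.json` stands in for any of the exported files.

```yaml
# Sketch of a kustomization.yaml; names and label are illustrative.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
  - name: grafana-dashboard-logs      # hypothetical name
    files:
      - dashboards/logs.json
    options:
      disableNameSuffixHash: true     # keep a stable ConfigMap name
      labels:
        grafana_dashboard: "1"        # assumed sidecar discovery label
```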
- from grafana and prometheus...
- Remove 'path: grafana' source from grafana Application
- Remove 'path: prometheus' source from prometheus Application
- ExternalSecret and Ingress will be managed manually via kubectl apply -k
- Fixes circular dependency issue causing Progressing state
- Comment out argocd.yaml in all app kustomization.yaml files
- Prevents circular reference when apps have 'path:' source (grafana,
  prometheus)
- ArgoCD Applications are managed manually, not via kustomize
- pattern
- monitoring/kustomization.yaml now only manages application.yaml (App
  of Apps)
- Each app independently manages its own ArgoCD Application via
  kustomization.yaml
- Apps are fully self-contained: argocd.yaml, namespace.yaml, and
  app-specific resources
- Cleaner separation: no central app list to maintain
- to self-manage ArgoCD Applications
- Each app now includes its own argocd.yaml in kustomization.yaml
- Main monitoring/kustomization.yaml references app folders instead of
  individual argocd.yaml files
- Better separation of concerns - each app is self-contained and
  independently managed
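The self-managed layout described in this last commit might look roughly like the two files below. The folder names `grafana` and `prometheus` come from the log; the exact resource lists are assumptions.

```yaml
# monitoring/kustomization.yaml (sketch): references app folders only
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - grafana
  - prometheus
---
# monitoring/grafana/kustomization.yaml (sketch): the app lists its own
# ArgoCD Application alongside its other resources
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - argocd.yaml
```

Each referenced folder needs its own `kustomization.yaml` for kustomize to resolve it as a resource.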