- Change storage backend from S3 to local filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 3 days to 1 day for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Remove anti-affinity (single replica only)
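Assuming a Helm-based deployment, a change like this typically lands in values as below (key names are illustrative and vary by chart; only the sizes and the S3→emptyDir direction come from the change above):

```yaml
# Hypothetical values sketch: S3 backend replaced by emptyDir-backed
# local storage. Exact keys depend on the chart in use.
storage:
  type: filesystem        # was: s3
retention: 24h            # was: 72h; shorter window for volatile local storage
persistence:
  enabled: false          # no PVC; data lives in the pod's emptyDir
extraVolumes:
  - name: data
    emptyDir:
      sizeLimit: 5Gi      # was: 2Gi
extraVolumeMounts:
  - name: data
    mountPath: /var/data
```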
- Change storage type from S3 to filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 7 days to 3 days for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Eliminates MinIO CPU load from Loki operations
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to worker nodes (remove tolerations)
- Add medium-priority priorityClassName to Prometheus, Loki, Thanos, Tempo
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
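In kube-prometheus-stack these relabelings sit under `kubelet.serviceMonitor`. A sketch of the override, with the drop regex abbreviated for illustration (the chart's real default list is longer and varies by version); the point is that `container_cpu_cfs_throttled_seconds_total` no longer matches the drop pattern:

```yaml
kubelet:
  serviceMonitor:
    # Overrides the chart default wholesale; keep the metrics you still
    # want dropped, minus container_cpu_cfs_throttled_seconds_total.
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        regex: container_cpu_cfs_periods_total|container_memory_failures_total
```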
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously empty interval caused $__rate_interval mismatch
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics
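A sketch of what an affected panel target looks like after the change (the query expression is illustrative; only the `interval` value and the panel id come from the change above):

```json
{
  "id": 77,
  "title": "Global CPU Usage",
  "targets": [
    {
      "expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (instance)",
      "interval": "$resolution"
    }
  ]
}
```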
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping
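The `loki.resource.labels` hint is typically injected with a resource processor, telling the Loki exporter which resource attributes to promote to Loki labels. A sketch (processor name and pipeline wiring assumed):

```yaml
processors:
  resource/loki:
    attributes:
      # Hint consumed by the Loki exporter: promote these OTel resource
      # attributes to Loki stream labels.
      - key: loki.resource.labels
        value: k8s_namespace_name, k8s_container_name
        action: insert
```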
- Set CPU limits to null for manager container
- Set CPU limits to null for kube-rbac-proxy container
- Disable chart default CPU limits to prevent throttling
- Update opentelemetry-operator manager from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
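A hedged sketch of the corresponding opentelemetry-operator chart values (key names assumed; check the chart's values schema for the exact structure):

```yaml
manager:
  resources:
    requests:
      memory: 256Mi   # was: 64Mi
    limits:
      cpu: null       # drop the chart-default CPU limit to avoid CFS throttling
      memory: 256Mi
kubeRBACProxy:
  resources:
    requests:
      memory: 64Mi    # was: 32Mi
    limits:
      cpu: null
      memory: 64Mi
```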
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for home cluster
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment+TA) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
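A trimmed sketch of the two `OpenTelemetryCollector` resources (required receiver fields and pipelines omitted for brevity):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-logs
spec:
  mode: daemonset          # one pod per node: tails local files, receives OTLP
  config:
    receivers:
      filelog: {}          # include paths omitted here
      otlp: {}
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics
spec:
  mode: deployment         # scrape targets spread across replicas by the TA
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
```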
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
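With the Prometheus Operator this maps to `prometheusSpec.tsdb`; in kube-prometheus-stack values:

```yaml
prometheus:
  prometheusSpec:
    tsdb:
      # Accept samples up to 5m older than the TSDB head, so remote-written
      # samples from multiple collectors aren't rejected as out-of-order.
      outOfOrderTimeWindow: 5m
```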
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
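A sketch of the collector spec for this setup (Target Allocator service endpoint and env var wiring assumed):

```yaml
spec:
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true              # discover ServiceMonitors/PodMonitors
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []     # targets are assigned by the Target Allocator
        target_allocator:
          endpoint: http://otel-metrics-targetallocator  # assumed service name
          interval: 30s
          collector_id: ${POD_NAME}
```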
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
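The corrected placement, sketched (env var and secret key names are assumptions; the chart only renders these fields when nested under `tempo`):

```yaml
tempo:
  extraEnv:
    - name: AWS_ACCESS_KEY_ID          # consumed by Tempo's S3 client
      valueFrom:
        secretKeyRef:
          name: minio-s3-credentials
          key: access-key              # assumed key name
  extraVolumeMounts:
    - name: wal
      mountPath: /var/tempo/wal
```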
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use the boot volume (/dev/sda1)
- WAL data is temporary, persistent data stored in MinIO
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
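A sketch of the resulting Loki chart values (endpoint, bucket name, and schema date are placeholders; credentials would come from the `minio-s3-credentials` secret via env vars):

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks                     # assumed bucket name
    s3:
      endpoint: http://minio.minio.svc:9000   # assumed in-cluster endpoint
      s3ForcePathStyle: true                  # MinIO needs path-style access
      insecure: true
  schemaConfig:
    configs:
      - from: "2024-01-01"                    # illustrative date
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
```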
- Loki: switch s3 backend to filesystem with local-path PVC
- Tempo: switch s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
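Sketched values for the slimmed-down Thanos setup (chart key names assumed, e.g. a Bitnami-style Thanos chart alongside kube-prometheus-stack):

```yaml
# Thanos chart: only the query path stays up
storegateway:
  enabled: false      # no object store, nothing to serve from buckets
compactor:
  enabled: false      # local retention handles cleanup
---
# kube-prometheus-stack: keep the sidecar but drop objectStorageConfig
prometheus:
  prometheusSpec:
    thanos: {}        # sidecar still exposes StoreAPI to Thanos Query
```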
- Add http_auth module accepting 401/403 status codes
- Apply http_auth to grafana, code-server, pgweb, velero-ui
- These services return 401 when accessed without authentication
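In blackbox_exporter configuration the module looks like this (module name from above; timeout illustrative):

```yaml
modules:
  http_auth:
    prober: http
    timeout: 5s
    http:
      # Auth-protected endpoints answer 401/403 when probed anonymously;
      # treating those as valid confirms the service is up without credentials.
      valid_status_codes: [200, 401, 403]
      method: GET
```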