Commit Graph

167 Commits

da89c8dbf0 FIX(grafana): restore gauge design with percentage display
- Restore original gauge panel type
- Keep * 100 query and percent unit
- Set max to 100 for proper gauge range
2026-01-10 17:58:11 +09:00
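
As a reference for this restore, here is a minimal sketch of the gauge as a YAML rendering of the panel JSON. The node-CPU expression is illustrative; only the `* 100` scaling, `percent` unit, `max: 100`, and the 50%/80% thresholds (the latter from the stat-panel commit 7e375e20c6 below) come from the commit messages.

```yaml
# Sketch of the restored gauge (YAML rendering of the panel JSON).
# The expression is illustrative; unit, max, and thresholds are from
# the commit messages.
type: gauge
title: CPU Usage
targets:
  - expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
fieldConfig:
  defaults:
    unit: percent
    min: 0
    max: 100                     # full gauge range
    thresholds:
      mode: absolute
      steps:
        - { color: green, value: null }
        - { color: orange, value: 50 }
        - { color: red, value: 80 }
```
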
11f9457236 fix: increase CPU pressure threshold to 30% 2026-01-10 17:57:34 +09:00
7e375e20c6 FIX(grafana): show CPU Usage as percentage per node
- Change panel type from gauge to stat
- Add * 100 to query for percentage
- Show each node's CPU usage horizontally
- Set thresholds at 50% (orange), 80% (red)
2026-01-10 17:57:05 +09:00
b818a8c1fe fix: update CPU throttling panels to use PSI metrics with 10% threshold 2026-01-10 17:54:55 +09:00
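
The PSI-based panels plausibly query node-exporter's pressure collector (which requires kernel PSI support); a hedged sketch of one target follows. The metric name is an assumption, and the threshold shown is the 10% from this commit, later raised to 30% by 11f9457236 above.

```yaml
# Sketch of a PSI-based CPU pressure target; the metric name assumes
# node-exporter's pressure collector.
targets:
  - expr: rate(node_pressure_cpu_waiting_seconds_total[5m])   # fraction of time stalled on CPU
    legendFormat: '{{instance}}'
thresholds:
  steps:
    - { color: green, value: null }
    - { color: red, value: 0.1 }   # 10% here; raised to 30% by a later commit
```
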
2b1667e643 FIX(grafana): replace rate_interval with 5m in MinIO dashboard
- Change all $__rate_interval to 5m
- Fix "No data" issues in rate() queries
2026-01-10 17:50:47 +09:00
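
`$__rate_interval` resolves from the datasource's configured scrape interval, so when that setting is shorter than the actual scrape period, the window can contain fewer than two samples and `rate()` returns nothing. Pinning a 5m window sidesteps this. An illustrative before/after, with `minio_s3_requests_total` as an assumed metric:

```yaml
# Illustrative before/after for one MinIO panel query.
before: rate(minio_s3_requests_total{job="minio"}[$__rate_interval])  # can resolve too narrow -> "No data"
after:  rate(minio_s3_requests_total{job="minio"}[5m])                # always spans several samples
```
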
38e0c68ddb CHORE(grafana): rearrange Bucket Scans panels side by side
- Move Finished to left (x=0)
- Move Started next to Finished (x=12, same y)
2026-01-10 17:48:43 +09:00
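
In dashboard JSON the layout lives in each panel's `gridPos` on a 24-column grid, so "side by side" means x=0 and x=12 at the same y. A sketch (y, w, and h are assumptions):

```yaml
# Bucket Scans panels after the rearrangement (24-column grid).
- title: Bucket Scans Finished
  gridPos: { x: 0, y: 30, w: 12, h: 8 }    # y/w/h assumed
- title: Bucket Scans Started
  gridPos: { x: 12, y: 30, w: 12, h: 8 }
```
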
4afdf04ef2 CHORE(grafana): remove KMS panels from MinIO dashboard
- Remove 5 KMS-related panels (KMS not configured)
- KMS Uptime, Request rates, Online/Offline status
2026-01-10 17:46:45 +09:00
20b796f9e4 FIX(grafana): fix MinIO CPU Usage panel query
- Hardcode job=minio and 5m interval
- Change unit from 's' to 'percentunit'
- Set max to 1 for proper gauge display
2026-01-10 17:33:54 +09:00
fa4c2ce8f6 FIX(grafana): set default value for MinIO dashboard variable
- Set scrape_jobs default to 'minio'
- Hide variable selector (only one option)
2026-01-10 17:32:23 +09:00
fc4f825b6d FIX(grafana): fix MinIO dashboard scrape_jobs variable
- Query only MinIO-related jobs
- Set includeAll and multi to false
2026-01-10 17:15:53 +09:00
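
Combining the two variable fixes above, the `scrape_jobs` variable plausibly ends up like the sketch below; the `label_values()` query and the metric name inside it are assumptions.

```yaml
# Sketch of the scrape_jobs template variable after both fixes.
templating:
  list:
    - name: scrape_jobs
      type: query
      query: label_values(minio_node_process_uptime_seconds, job)  # assumed query
      includeAll: false
      multi: false
      hide: 2                                 # hide the selector entirely
      current: { text: minio, value: minio }  # default to the only option
```
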
7d5780cb97 PERF(tempo): switch from MinIO to local filesystem storage
- Change storage backend from S3 to local filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 3 days to 1 day for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Remove anti-affinity (single replica only)
2026-01-10 15:58:34 +09:00
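
A sketch of the resulting Tempo values, assuming the grafana/tempo chart's key layout; retention and sizes come from the commit message, the key paths and volume wiring are assumptions.

```yaml
# Sketch of the Tempo storage switch (grafana/tempo chart layout assumed).
tempo:
  retention: 24h               # 1 day, down from 3
  storage:
    trace:
      backend: local           # was s3 (MinIO)
      local:
        path: /var/tempo/traces
persistence:
  enabled: false               # traces live in an emptyDir instead
extraVolumes:
  - name: tempo-data
    emptyDir:
      sizeLimit: 5Gi           # up from 2Gi
```
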
eea6420544 PERF(loki): switch from MinIO to local filesystem storage
- Change storage type from S3 to filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 7 days to 3 days for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Eliminates MinIO CPU load from Loki operations
2026-01-10 15:57:50 +09:00
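
The matching Loki change, sketched against the grafana/loki single-binary layout; retention and sizes are from the commit, the key paths are assumptions. The canary line anticipates the commit below.

```yaml
# Sketch of the Loki storage switch (grafana/loki chart layout assumed).
loki:
  storage:
    type: filesystem           # was s3 (MinIO)
  limits_config:
    retention_period: 72h      # 3 days, down from 7
singleBinary:
  persistence:
    enabled: false
  extraVolumes:
    - name: loki-data
      emptyDir:
        sizeLimit: 5Gi         # up from 2Gi
lokiCanary:
  enabled: false               # disabled by the commit below to cut MinIO reads
```
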
001aa9253d PERF(loki): disable canary to reduce MinIO load
- Disable lokiCanary, which queries Loki every second
- Reduces continuous S3 read operations on MinIO
2026-01-10 15:43:19 +09:00
ef7c7c2593 PERF(loki,tempo): reduce replicas to 1
- Reduce Loki singleBinary replicas from 2 to 1
- Reduce Tempo replicas from 2 to 1
- Decrease MinIO CPU load (0.5 → 0.1 cores expected)
2026-01-10 15:32:07 +09:00
b4b48c6e89 FIX(opentelemetry-operator): restore memory to 256Mi
- The VPA-recommended 75Mi was too low, causing informer sync timeouts
- Restore original memory value for stability
2026-01-10 14:52:24 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
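
These figures were presumably read from recommendation-only VPA objects. A generic sketch, with the grafana target purely illustrative; the restore commit above (b4b48c6e89) shows why such recommendations should not be applied blindly.

```yaml
# Sketch of a recommendation-only VPA of the kind these figures come from.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grafana                # illustrative target
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grafana
  updatePolicy:
    updateMode: "Off"          # recommend only; requests are applied by hand
```
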
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replicas at 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
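
Soft anti-affinity is a scheduling preference rather than a hard rule, so both replicas can still land on one node if the cluster is tight. A sketch with assumed label keys:

```yaml
# Sketch of the soft (preferred) anti-affinity added to Loki and Tempo.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: loki      # assumed label
```
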
395c79ad9e PERF(alertmanager): reduce karma replicas to 1
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:36:14 +09:00
67db06cf8b PERF(observability): reduce replicas to 1
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
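
The `medium-priority` name is from the commit; the class definition itself is not shown in the log, so the value below is an assumption. Each workload then opts in via `priorityClassName: medium-priority` in its pod spec.

```yaml
# Sketch of the assumed medium-priority class.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 500000                  # assumed; must sit between the low and high tiers
globalDefault: false
description: Observability core components (Prometheus, Loki, Thanos, Tempo)
```
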
c34f56945a feat(prometheus): enable container CPU throttling metrics collection
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:55:36 +09:00
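
kube-prometheus-stack's default `cAdvisorMetricRelabelings` drop several high-cardinality series, including `container_cpu_cfs_throttled_seconds_total`. The override plausibly trims the drop regex, roughly as below; the default list is reproduced from memory and varies by chart version.

```yaml
# Sketch of the kube-prometheus-stack override; verify against the
# chart version in use.
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        # cfs_throttled_seconds_total removed from the default regex so
        # the throttling panels get data
        regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
```
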
823edfbd88 fix(grafana): restrict main dashboard datasource to Thanos only
- Set regex filter "/Thanos/" on datasource variable
- Set default value to "Thanos"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:51:44 +09:00
dc8706fb02 fix(grafana): set explicit 2m interval on CPU query targets
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously, the empty interval caused a $__rate_interval mismatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:50:44 +09:00
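
For reference, the per-target `interval` field in question, as a YAML rendering of the dashboard JSON; the expression is illustrative.

```yaml
# One CPU target after the fix; the expr is illustrative.
targets:
  - refId: A
    interval: 2m     # was "", which let $__rate_interval drift per panel
    expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))
```
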
3516a860db fix(grafana): standardize CPU panel intervals to 2m
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:48:21 +09:00
64e129128f fix(grafana): sync interval for CPU panels in main dashboard
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:46:15 +09:00
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
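
The `loki.resource.labels` hint tells the collector's (since-deprecated) Loki exporter which resource attributes to promote to Loki labels; dots become underscores, which is why the logs dashboard now queries `k8s_namespace_name` and `k8s_container_name`. A sketch of the hint, typically set via the resource processor; the exact attribute list in the repo may differ.

```yaml
# Sketch of the Loki label hint in the collector config.
processors:
  resource/loki:
    attributes:
      - action: insert
        key: loki.resource.labels
        value: k8s.namespace.name, k8s.container.name
```
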
de81ca68c9 FIX(opentelemetry-operator): fix ServiceMonitor config path
- Move serviceMonitor config from metrics to manager section
- Fix Helm schema validation error
- Disable ServiceMonitor creation to prevent conflicts
2026-01-10 02:38:53 +09:00
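
After the move, the opentelemetry-operator values plausibly look like this; the key layout follows the commit, though the chart's schema may differ across versions.

```yaml
# Sketch of the corrected values: serviceMonitor under manager, disabled.
manager:
  serviceMonitor:
    enabled: false   # avoids the "servicemonitor already exists" conflict
```
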
dac5fc7bcf FIX(opentelemetry-operator): disable ServiceMonitor creation
- Set metrics.serviceMonitor.enabled to false
- Prevent ServiceMonitor conflict causing CrashLoopBackOff
- Resolve the "servicemonitor already exists" error
2026-01-10 02:36:12 +09:00
8a050dd303 CHORE(opentelemetry-operator): disable CPU limits
- Set CPU limits to null for manager container
- Set CPU limits to null for kube-rbac-proxy container
- Disable chart default CPU limits to prevent throttling
2026-01-10 02:32:53 +09:00
466ec6210c CHORE(observability): align memory requests with limits
- Update opentelemetry-operator manager from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
2026-01-10 02:31:19 +09:00
507395aca7 CHORE(otel-operator): schedule on master node
- Add tolerations and nodeSelector to run operator on control-plane node
2026-01-10 01:18:41 +09:00
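
A sketch of the control-plane pinning, assuming the standard kubeadm taint and label:

```yaml
# Schedule the operator on the control-plane node.
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```
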
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for the home cluster
2026-01-10 01:18:35 +09:00
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- Cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00
328d952cc1 FIX(otel): increase metrics collector memory to 1Gi
- OTel metrics collector pods were OOMKilled with 512Mi limit
- Increased memory requests to 512Mi and limits to 1Gi
2026-01-10 01:18:18 +09:00
5bc0caa324 FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled
- Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi)
- Increased memory requests and limits from 768Mi to 1536Mi
2026-01-10 01:18:11 +09:00
8ce6f95d92 FIX(otel): use statefulset mode for metrics collector
- Change from deployment to statefulset mode
- Target Allocator requires statefulset (not deployment)
2026-01-10 00:01:22 +09:00
5b70f19b12 REFACTOR(otel): split collector into logs and metrics
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment+TA) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
2026-01-09 23:50:21 +09:00
12ee5b61c0 FIX(prometheus): enable out-of-order time window
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
2026-01-09 23:43:01 +09:00
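
With several collectors pushing samples, arrival order is not guaranteed, and the TSDB rejects anything older than the newest sample unless an out-of-order window is set. A sketch, assuming the kube-prometheus-stack values layout:

```yaml
# Sketch of the out-of-order window (kube-prometheus-stack layout assumed).
prometheus:
  prometheusSpec:
    tsdb:
      outOfOrderTimeWindow: 5m   # accept samples up to 5m behind the head
```
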
a3c5a8dbcf CHORE(prometheus): disable direct scraping
- Disable ServiceMonitor/PodMonitor scraping in Prometheus
- OTel Collector now handles all metrics collection
- Prevents out-of-order sample errors from duplicate scraping
2026-01-09 23:39:30 +09:00
31f15e230d FIX(otel): add scrape_configs for Target Allocator
- Add minimal scrape_configs (required by Operator)
- Keep self-metrics scraping alongside Target Allocator
2026-01-09 23:36:55 +09:00
254687225c FIX(otel): use per-node strategy for DaemonSet mode
- Change allocationStrategy to per-node (required for DaemonSet)
- Operator rejects consistent-hashing with DaemonSet mode
2026-01-09 23:32:56 +09:00
1fdbb5e1dd FEAT(otel): enable Target Allocator for metrics
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
2026-01-09 23:30:41 +09:00
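
Putting this commit together with the per-node and statefulset follow-ups above, the metrics collector CR plausibly ended up like the sketch below; names are illustrative.

```yaml
# Sketch of the metrics collector CR after the follow-up fixes above.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics
spec:
  mode: statefulset              # required by the Target Allocator
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true              # discover ServiceMonitors/PodMonitors
```
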
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
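
A minimal sketch of a DaemonSet collector with a filelog receiver, following the common upstream pattern rather than the repo's actual file; the debug exporter stands in for the real Loki/Tempo exporters.

```yaml
# Sketch of a logs/traces collector CR; exporters are placeholders.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-logs
spec:
  mode: daemonset
  config:
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
        operators:
          - type: container          # parse CRI log lines, derive k8s attributes
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}                      # stand-in for Loki/Tempo exporters
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [debug]
        traces:
          receivers: [otlp]
          exporters: [debug]
```
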
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence is disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
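
All four services end up on the same volume pattern, sized per the commit:

```yaml
# The shared emptyDir pattern; sizeLimit caps node disk usage.
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 2Gi   # 5Gi for prometheus, 500Mi for alertmanager
```
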
fa4d97eede REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret
- Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret)
- Remove manifests source from ArgoCD application
2026-01-09 21:42:35 +09:00
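
A hedged sketch of the `ClusterExternalSecret` now supplying the MinIO credentials; the resource name is from the commit, while the namespace selector, store reference, and remote key are assumptions.

```yaml
# Sketch of the cluster-wide secret source (external-secrets.io).
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: minio-s3-credentials
spec:
  externalSecretName: minio-s3-credentials
  namespaceSelector:                   # assumed opt-in label
    matchLabels:
      minio-s3-credentials: "true"
  externalSecretSpec:
    secretStoreRef:
      kind: ClusterSecretStore
      name: secret-store               # assumed store name
    target:
      name: minio-s3-credentials
    dataFrom:
      - extract:
          key: minio                   # assumed remote key
```
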
b378c6ec06 FIX(tempo): move extraEnv under tempo section for S3 credentials
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
2026-01-09 21:42:35 +09:00
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use the boot volume (/dev/sda1)
- WAL data is temporary; persistent data is stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00