Commit Graph

147 Commits

SHA1 Message Date
c34f56945a feat(prometheus): enable container CPU throttling metrics collection
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:55:36 +09:00
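
A sketch of what this override might look like in kube-prometheus-stack values (shortened; the chart's full default drop list is longer and must be copied over, since overriding replaces it wholesale):

```yaml
# values.yaml (kube-prometheus-stack) -- illustrative fragment
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        # cfs_throttled_seconds_total removed from the regex, so
        # container_cpu_cfs_throttled_seconds_total is now kept
        regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
```
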
823edfbd88 fix(grafana): restrict main dashboard datasource to Thanos only
- Set regex filter "/Thanos/" on datasource variable
- Set default value to "Thanos"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:51:44 +09:00
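
The corresponding dashboard change, rendered here as the dashboard JSON fields in YAML form (field names per Grafana's templating schema; only the relevant fields shown):

```yaml
# templating.list entry in the dashboard JSON (YAML rendering)
- name: datasource
  type: datasource
  query: prometheus     # list all Prometheus-type datasources...
  regex: /Thanos/       # ...then filter the list down to Thanos
  current:
    text: Thanos        # default selection
    value: Thanos
```
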
dc8706fb02 fix(grafana): set explicit 2m interval on CPU query targets
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously the interval was empty, causing a $__rate_interval mismatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:50:44 +09:00
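
One affected target after the fix, again as dashboard JSON in YAML form (the expression is illustrative, not the dashboard's actual query):

```yaml
targets:
  - refId: A
    legendFormat: Real Linux
    # Per-query minimum interval. When this was empty, Grafana derived
    # $__rate_interval from the scrape interval alone, so the query
    # disagreed with the panels pinned to 2m.
    interval: 2m
    expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))
```
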
3516a860db fix(grafana): standardize CPU panel intervals to 2m
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:48:21 +09:00
64e129128f fix(grafana): sync interval for CPU panels in main dashboard
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:46:15 +09:00
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
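
The Loki labelling hint mentioned above could look roughly like this in collector.yaml (assuming the contrib resource processor and Loki exporter; pipeline wiring omitted):

```yaml
processors:
  resource:
    attributes:
      - action: insert
        key: loki.resource.labels
        # Resource attributes promoted to Loki labels; dots become
        # underscores, yielding the k8s_namespace_name and
        # k8s_container_name labels used by logs.json.
        value: k8s.namespace.name, k8s.container.name
```
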
de81ca68c9 FIX(opentelemetry-operator): fix ServiceMonitor config path
- Move serviceMonitor config from metrics to manager section
- Fix Helm schema validation error
- Disable ServiceMonitor creation to prevent conflicts
2026-01-10 02:38:53 +09:00
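
A minimal sketch of the corrected values, assuming the opentelemetry-operator chart's manager section (the old metrics.serviceMonitor path failed schema validation):

```yaml
# opentelemetry-operator chart values -- illustrative fragment
manager:
  serviceMonitor:
    # Disabled to avoid a second ServiceMonitor colliding with the
    # existing one ("servicemonitor already exists").
    enabled: false
```
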
dac5fc7bcf FIX(opentelemetry-operator): disable ServiceMonitor creation
- Set metrics.serviceMonitor.enabled to false
- Prevent ServiceMonitor conflict causing CrashLoopBackOff
- Resolve error: servicemonitor already exists
2026-01-10 02:36:12 +09:00
8a050dd303 CHORE(opentelemetry-operator): disable CPU limits
- Set CPU limits to null for manager container
- Set CPU limits to null for kube-rbac-proxy container
- Disable chart default CPU limits to prevent throttling
2026-01-10 02:32:53 +09:00
466ec6210c CHORE(observability): align memory requests with limits
- Update opentelemetry-operator manager from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
2026-01-10 02:31:19 +09:00
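
Taken together, the two resource commits above amount to values along these lines (key names per the opentelemetry-operator chart; the CPU requests shown are illustrative):

```yaml
# opentelemetry-operator chart values -- illustrative fragment
manager:
  resources:
    requests:
      cpu: 100m        # hypothetical; request kept for scheduling
      memory: 256Mi
    limits:
      cpu: null        # drop the chart's default CPU limit (no throttling)
      memory: 256Mi    # limit == request
kubeRBACProxy:
  resources:
    requests:
      memory: 64Mi
    limits:
      cpu: null
      memory: 64Mi
```
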
507395aca7 CHORE(otel-operator): schedule on master node
- Add tolerations and nodeSelector to run operator on control-plane node
2026-01-10 01:18:41 +09:00
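
A sketch of the scheduling constraints (the control-plane label value varies by distribution, so treat the empty string as an assumption):

```yaml
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
```
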
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for the home cluster
2026-01-10 01:18:35 +09:00
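
The resulting logs/traces-only collector could be sketched as the CR below (endpoints, exporter choice, and filelog parsing operators are illustrative; host-path mounts omitted):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: daemonset                  # one collector per node for file tailing
  config:
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
      otlp:
        protocols:
          grpc: {}
    exporters:
      loki:                        # hypothetical endpoint
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push
      otlp/tempo:                  # hypothetical endpoint
        endpoint: tempo.observability.svc:4317
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [loki]
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo]
```
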
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- Cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00
328d952cc1 FIX(otel): increase metrics collector memory to 1Gi
- OTel metrics collector pods were OOMKilled with 512Mi limit
- Increased memory requests to 512Mi and limits to 1Gi
2026-01-10 01:18:18 +09:00
5bc0caa324 FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled
- Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi)
- Increased memory requests and limits from 768Mi to 1536Mi
2026-01-10 01:18:11 +09:00
8ce6f95d92 FIX(otel): use statefulset mode for metrics collector
- Change from deployment to statefulset mode
- Target Allocator requires statefulset (not deployment)
2026-01-10 00:01:22 +09:00
5b70f19b12 REFACTOR(otel): split collector into logs and metrics
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment+TA) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
2026-01-09 23:50:21 +09:00
12ee5b61c0 FIX(prometheus): enable out-of-order time window
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
2026-01-09 23:43:01 +09:00
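
In kube-prometheus-stack values this maps to a single TSDB field:

```yaml
prometheus:
  prometheusSpec:
    tsdb:
      # Accept samples up to 5 minutes behind the newest one, absorbing
      # timing skew between distributed collectors.
      outOfOrderTimeWindow: 5m
```
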
a3c5a8dbcf CHORE(prometheus): disable direct scraping
- Disable ServiceMonitor/PodMonitor scraping in Prometheus
- OTel Collector now handles all metrics collection
- Prevents out-of-order sample errors from duplicate scraping
2026-01-09 23:39:30 +09:00
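
One way to express "select no monitors" in kube-prometheus-stack values is a selector that deliberately matches nothing (the label here is hypothetical; the commit does not show the exact mechanism used):

```yaml
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        scrape: never     # hypothetical, intentionally unmatched
    podMonitorSelector:
      matchLabels:
        scrape: never
```
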
31f15e230d FIX(otel): add scrape_configs for Target Allocator
- Add minimal scrape_configs (required by Operator)
- Keep self-metrics scraping alongside Target Allocator
2026-01-09 23:36:55 +09:00
254687225c FIX(otel): use per-node strategy for DaemonSet mode
- Change allocationStrategy to per-node (required for DaemonSet)
- Operator rejects consistent-hashing with DaemonSet mode
2026-01-09 23:32:56 +09:00
1fdbb5e1dd FEAT(otel): enable Target Allocator for metrics
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
2026-01-09 23:30:41 +09:00
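
Together with the statefulset fix further up the log, the Target Allocator setup could be sketched like this (remote-write endpoint and self-scrape job are illustrative):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics
spec:
  mode: statefulset                # required by the Target Allocator
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true                # discover ServiceMonitors/PodMonitors
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:          # minimal config required by the Operator;
            - job_name: otel-self  # real jobs come from the allocator
              static_configs:
                - targets: ["0.0.0.0:8888"]
    exporters:
      prometheusremotewrite:       # hypothetical endpoint
        endpoint: http://prometheus.observability.svc:9090/api/v1/write
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```
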
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
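
Each of these reduces to the same volume shape (fragment; the kubelet evicts the pod if usage exceeds the limit):

```yaml
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 2Gi   # 5Gi for prometheus, 500Mi for alertmanager
```
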
fa4d97eede REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret
- Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret)
- Remove manifests source from ArgoCD application
2026-01-09 21:42:35 +09:00
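
For reference, a ClusterExternalSecret fans one definition out to every matching namespace; a hedged sketch (selector label, store name, and backend key are hypothetical):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: minio-s3-credentials
spec:
  externalSecretName: minio-s3-credentials
  namespaceSelector:
    matchLabels:
      minio-credentials: "true"      # hypothetical
  externalSecretSpec:
    secretStoreRef:
      name: cluster-secret-store     # hypothetical
      kind: ClusterSecretStore
    target:
      name: minio-s3-credentials
    dataFrom:
      - extract:
          key: minio-s3              # hypothetical backend key
```
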
b378c6ec06 FIX(tempo): move extraEnv under tempo section for S3 credentials
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
2026-01-09 21:42:35 +09:00
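
A sketch of the corrected layout, folding in the expand-env flag from the commit further down the log (secret key names are assumptions; the matching WAL emptyDir volume is omitted):

```yaml
tempo:
  extraArgs:
    config.expand-env: "true"        # allow ${VAR} substitution in tempo.yaml
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: minio-s3-credentials
          key: access-key            # hypothetical key name
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: minio-s3-credentials
          key: secret-key            # hypothetical key name
  extraVolumeMounts:
    - name: tempo-wal
      mountPath: /var/tempo
```
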
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use boot volume (/dev/sda1)
- WAL data is temporary, persistent data stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
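
A hedged sketch of the Loki side (endpoint, bucket name, and schema dates are illustrative):

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks                     # hypothetical bucket
    s3:
      endpoint: http://minio.minio.svc:9000   # hypothetical endpoint
      accessKeyId: ${AWS_ACCESS_KEY_ID}
      secretAccessKey: ${AWS_SECRET_ACCESS_KEY}
      s3ForcePathStyle: true                  # required for MinIO
      insecure: true
  schemaConfig:
    configs:
      - from: "2024-01-01"                    # illustrative date
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
```
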
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: switch s3 backend to filesystem with local-path PVC
- Tempo: switch s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
94af545120 REFACTOR(thanos): remove S3 storage integration
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
2026-01-09 21:42:35 +09:00
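
In Bitnami thanos chart terms, roughly:

```yaml
query:
  enabled: true      # keep deduplicated querying across HA replicas
storegateway:
  enabled: false     # no S3 blocks left to serve
compactor:
  enabled: false     # nothing to compact without object storage
```
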
ffed27419a REFACTOR(blackbox-exporter): revert to http_2xx module
- Remove http_auth module workaround
- Authelia now bypasses internal cluster traffic
- All endpoints use standard http_2xx module
2026-01-09 21:42:35 +09:00
37c216c433 FIX(blackbox-exporter): handle Authelia-protected endpoints
- Add http_auth module accepting 401/403 status codes
- Apply http_auth to grafana, code-server, pgweb, velero-ui
- These services return 401 when accessed without authentication
2026-01-09 21:42:35 +09:00
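
The workaround module, as blackbox_exporter module config (later reverted once Authelia bypassed internal traffic):

```yaml
modules:
  http_auth:
    prober: http
    timeout: 5s
    http:
      # 401/403 from Authelia still proves the endpoint is up and serving.
      valid_status_codes: [200, 401, 403]
```
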
884a38d8ad FEAT(blackbox-exporter): add external endpoint monitoring
- Add blackbox-exporter with prometheus-community Helm chart
- Configure HTTP probes for 25 external endpoints
- Include SSL certificate expiry alerting rules
- Add probe failure and slow response alerts
- Deploy 2 replicas with anti-affinity for HA
2026-01-09 21:42:35 +09:00
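
The SSL expiry alerting mentioned above could be sketched as a rule-group fragment like this (the 14-day threshold is illustrative):

```yaml
groups:
  - name: blackbox
    rules:
      - alert: SSLCertExpiringSoon
        # probe_ssl_earliest_cert_expiry is a Unix timestamp, so this is
        # "fewer than 14 days of validity left".
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"
```
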
01c5742d7a FIX(grafana): change OOM panel to stat type
- Replace timeseries with stat panel for OOM detection
- Show total count of OOMKilled pods instead of timeline
- Gauge metric not suitable for timeseries visualization
2026-01-09 21:42:35 +09:00
7cd778313a FIX(prometheus): disable PrometheusDuplicateTimestamps alert
- Low severity alert that fires repeatedly in HA setup
- 0.05 samples/s drop rate is negligible
2026-01-09 21:42:35 +09:00
bb8b1c193e FIX(alertmanager): improve OOMKilled alert detection
- Only fire when container restarted in last 10 minutes
- Prevent stale alerts from old OOM events
2026-01-09 21:42:35 +09:00
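
One way to express this gate in PromQL, combining the kube-state-metrics reason gauge with a recent-restart check (a sketch, not necessarily the rule as committed):

```yaml
- alert: ContainerOOMKilled
  expr: |
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
    and on (namespace, pod, container)
    increase(kube_pod_container_status_restarts_total[10m]) > 0
  labels:
    severity: warning
```
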
e3c615b5c1 FEAT(alertmanager): add OOMKilled alert rule
- Add PrometheusRule to alert when containers are OOMKilled
- Severity: warning, fires immediately
2026-01-09 21:42:35 +09:00
539f4be497 FIX(grafana): use kube-state-metrics for OOM detection
- Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason
- Fix OOM events not showing after pod restart
- The cAdvisor metric resets on pod restart; the kube-state-metrics one persists
2026-01-09 21:42:35 +09:00
14bd244b98 FIX(thanos): increase compactor memory to 256Mi
- Compactor was OOMKilled with 128Mi limit
- Set to 256Mi for stability during compaction
2026-01-09 21:42:35 +09:00
4a4a43ed82 FIX(prometheus): increase memory to 768Mi
- Prometheus was OOMKilled with 512Mi limit
- Set both requests and limits to 768Mi
2026-01-09 21:42:35 +09:00
8c2a9badf8 FIX(alertmanager): set karma memory limits equal to requests
- Align memory limits with requests for guaranteed QoS
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests for guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
fd6c1952ad FIX(tempo): enable env var expansion in config
- Add extraArgs config.expand-env=true
- Required for ${VAR} substitution in tempo.yaml
2026-01-09 21:41:52 +09:00
5f926cb6cf FEAT(tempo): configure S3 storage with MinIO
- Enable env var expansion in config
- Configure extraEnv for S3 credentials
- Fix OTel Collector image settings
2026-01-09 21:41:52 +09:00
7139f3e5a2 FIX(prometheus): correct ArgoCD metrics service names
- Update controller target to argocd-application-controller-metrics
- Update repo-server target to argocd-repo-server-metrics
2026-01-09 21:41:52 +09:00
034a5f32a2 CHORE(repo): remove application.yaml reference
- Remove from kustomization.yaml
2026-01-09 21:41:52 +09:00
87420d842d CHORE(repo): remove self-referencing application.yaml
- Delete application.yaml (managed by platform)
2026-01-09 21:41:52 +09:00
445cabb900 FIX(prometheus): add ExternalSecret default values to fix OutOfSync 2026-01-09 21:41:52 +09:00
aecb15031d FEAT(grafana): add Thanos as default datasource
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus

REFACTOR(thanos): move all components to master node

- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)

FIX(thanos): allow non-Bitnami images (quay.io/thanos)

FIX(thanos): correct nodeSelector value to 'true'
2026-01-09 21:41:52 +09:00
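
The datasource addition maps to kube-prometheus-stack values along these lines (the URL is illustrative; how the built-in Prometheus datasource loses its default flag is not shown in the commit):

```yaml
grafana:
  additionalDataSources:
    - name: Thanos
      type: prometheus
      access: proxy
      url: http://thanos-query.observability.svc:9090   # hypothetical
      isDefault: true    # deduplicated view across the HA Prometheus pair
```
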