Commit Graph

137 Commits

Author SHA1 Message Date
507395aca7 CHORE(otel-operator): schedule on master node
- Add tolerations and nodeSelector to run operator on control-plane node
2026-01-10 01:18:41 +09:00
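Note: a minimal sketch of the scheduling settings this commit describes, written as a Helm values fragment. The top-level key placement and the label key are assumptions; the label value "true" follows commit 4511fd5b2e and the control-plane taint name follows commit 735166fc9c.

```yaml
# Hypothetical values fragment; the exact key placement depends on the chart.
nodeSelector:
  node-role.kubernetes.io/control-plane: "true"   # assumed label key; value per 4511fd5b2e
tolerations:
  - key: node-role.kubernetes.io/control-plane    # taint name per 735166fc9c
    operator: Exists
    effect: NoSchedule
```

The toleration only permits scheduling on the tainted node; the nodeSelector is what actually pins the pod there.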
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for home cluster
2026-01-10 01:18:35 +09:00
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- Cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00
328d952cc1 FIX(otel): increase metrics collector memory to 1Gi
- OTel metrics collector pods were OOMKilled with 512Mi limit
- Increased memory requests to 512Mi and limits to 1Gi
2026-01-10 01:18:18 +09:00
5bc0caa324 FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled
- Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi)
- Increased memory requests and limits from 768Mi to 1536Mi
2026-01-10 01:18:11 +09:00
8ce6f95d92 FIX(otel): use statefulset mode for metrics collector
- Change from deployment to statefulset mode
- Target Allocator requires statefulset (not deployment)
2026-01-10 00:01:22 +09:00
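Note: a sketch of what the statefulset-mode collector with a Target Allocator might look like as an OpenTelemetryCollector resource. The API version is an assumption, the name otel-metrics comes from commit 5b70f19b12, and the collector pipeline config is omitted.

```yaml
apiVersion: opentelemetry.io/v1beta1   # assumed API version
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics                   # name from commit 5b70f19b12
spec:
  mode: statefulset                    # changed from deployment in this commit
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing   # strategy from commit 5b70f19b12
    prometheusCR:
      enabled: true                    # ServiceMonitor/PodMonitor discovery (1fdbb5e1dd)
  # config: receivers, processors, exporters omitted
```

For a DaemonSet-mode collector, commit 254687225c used allocationStrategy: per-node instead.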
5b70f19b12 REFACTOR(otel): split collector into logs and metrics
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment+TA) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
2026-01-09 23:50:21 +09:00
12ee5b61c0 FIX(prometheus): enable out-of-order time window
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
2026-01-09 23:43:01 +09:00
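Note: a sketch of the TSDB setting named in this commit, shown on the Prometheus custom resource of the Prometheus Operator. That the repository sets it through Helm values rather than directly on the resource is likely but not stated, so treat the placement as an assumption.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus            # hypothetical name
spec:
  tsdb:
    outOfOrderTimeWindow: 5m  # accept samples up to 5 minutes older than the newest one
```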
a3c5a8dbcf CHORE(prometheus): disable direct scraping
- Disable ServiceMonitor/PodMonitor scraping in Prometheus
- OTel Collector now handles all metrics collection
- Prevents out-of-order sample errors from duplicate scraping
2026-01-09 23:39:30 +09:00
31f15e230d FIX(otel): add scrape_configs for Target Allocator
- Add minimal scrape_configs (required by Operator)
- Keep self-metrics scraping alongside Target Allocator
2026-01-09 23:36:55 +09:00
254687225c FIX(otel): use per-node strategy for DaemonSet mode
- Change allocationStrategy to per-node (required for DaemonSet)
- Operator rejects consistent-hashing with DaemonSet mode
2026-01-09 23:32:56 +09:00
1fdbb5e1dd FEAT(otel): enable Target Allocator for metrics
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
2026-01-09 23:30:41 +09:00
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
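Note: a minimal sketch of an emptyDir volume with a size limit, as it would appear in a pod spec. The volume name is hypothetical; the 2Gi value is the one this commit gives for loki and tempo.

```yaml
volumes:
  - name: storage        # hypothetical volume name
    emptyDir:
      sizeLimit: 2Gi     # the pod is evicted if usage exceeds this limit
```

Data in an emptyDir is lost when the pod is removed from the node, which fits the later note (commit 8ac76d17f3) that only temporary WAL data lives there.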
fa4d97eede REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret
- Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret)
- Remove manifests source from ArgoCD application
2026-01-09 21:42:35 +09:00
b378c6ec06 FIX(tempo): move extraEnv under tempo section for S3 credentials
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
2026-01-09 21:42:35 +09:00
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use boot volume (/dev/sda1)
- WAL data is temporary, persistent data stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
94af545120 REFACTOR(thanos): remove S3 storage integration
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
2026-01-09 21:42:35 +09:00
ffed27419a REFACTOR(blackbox-exporter): revert to http_2xx module
- Remove http_auth module workaround
- Authelia now bypasses internal cluster traffic
- All endpoints use standard http_2xx module
2026-01-09 21:42:35 +09:00
37c216c433 FIX(blackbox-exporter): handle Authelia-protected endpoints
- Add http_auth module accepting 401/403 status codes
- Apply http_auth to grafana, code-server, pgweb, velero-ui
- These services return 401 when accessed without authentication
2026-01-09 21:42:35 +09:00
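Note: a sketch of what the http_auth module could look like in the blackbox-exporter configuration. The module name comes from this commit; whether 200 was also listed among the accepted codes is an assumption.

```yaml
modules:
  http_auth:
    prober: http
    http:
      # 401/403 per the commit message; including 200 is an assumption
      valid_status_codes: [200, 401, 403]
```

Without valid_status_codes, the http prober accepts only 2xx responses, which is why endpoints behind Authelia were reported as failing.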
884a38d8ad FEAT(blackbox-exporter): add external endpoint monitoring
- Add blackbox-exporter with prometheus-community Helm chart
- Configure HTTP probes for 25 external endpoints
- Include SSL certificate expiry alerting rules
- Add probe failure and slow response alerts
- Deploy 2 replicas with anti-affinity for HA
2026-01-09 21:42:35 +09:00
01c5742d7a FIX(grafana): change OOM panel to stat type
- Replace timeseries with stat panel for OOM detection
- Show total count of OOMKilled pods instead of timeline
- Gauge metric not suitable for timeseries visualization
2026-01-09 21:42:35 +09:00
7cd778313a FIX(prometheus): disable PrometheusDuplicateTimestamps alert
- Low severity alert that fires repeatedly in HA setup
- 0.05 samples/s drop rate is negligible
2026-01-09 21:42:35 +09:00
bb8b1c193e FIX(alertmanager): improve OOMKilled alert detection
- Only fire when container restarted in last 10 minutes
- Prevent stale alerts from old OOM events
2026-01-09 21:42:35 +09:00
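Note: one way to express "OOMKilled and restarted in the last 10 minutes" as a PrometheusRule. The resource name, group name, alert name, and exact expression are illustrative and not taken from the repository; only the kube-state-metrics metric from commit 539f4be497, the 10-minute window, and the warning severity come from the log.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oomkilled-alert            # hypothetical name
spec:
  groups:
    - name: oom                    # hypothetical group name
      rules:
        - alert: ContainerOOMKilled   # hypothetical alert name
          expr: |
            (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
            and on (namespace, pod, container)
            (increase(kube_pod_container_status_restarts_total[10m]) > 0)
          labels:
            severity: warning      # no "for:" clause, so the alert fires immediately
```

The restart condition is what drops stale alerts: the last-terminated-reason gauge stays at 1 long after the OOM event, while the restart counter only increases around the event itself.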
e3c615b5c1 FEAT(alertmanager): add OOMKilled alert rule
- Add PrometheusRule to alert when containers are OOMKilled
- Severity: warning, fires immediately
2026-01-09 21:42:35 +09:00
539f4be497 FIX(grafana): use kube-state-metrics for OOM detection
- Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason
- Fix OOM events not showing after pod restart
- cAdvisor metric resets on pod restart, kube-state-metrics persists
2026-01-09 21:42:35 +09:00
14bd244b98 FIX(thanos): increase compactor memory to 256Mi
- Compactor was OOMKilled with 128Mi limit
- Set to 256Mi for stability during compaction
2026-01-09 21:42:35 +09:00
4a4a43ed82 FIX(prometheus): increase memory to 768Mi
- Prometheus was OOMKilled with 512Mi limit
- Set both requests and limits to 768Mi
2026-01-09 21:42:35 +09:00
8c2a9badf8 FIX(alertmanager): set karma memory limits equal to requests
- Align memory limits with requests for guaranteed QoS
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests for guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
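Note: a sketch of the requests-equal-limits pattern these commits apply. The 768Mi figure is borrowed from commit 4a4a43ed82 purely for illustration.

```yaml
resources:
  requests:
    memory: 768Mi
  limits:
    memory: 768Mi   # equal to the request
```

Kubernetes assigns the Guaranteed QoS class only when every container in the pod has CPU and memory requests equal to their limits. Whether CPU was aligned as well is not stated in these commits; with memory alone matched, the pod is classed as Burstable.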
fd6c1952ad FIX(tempo): enable env var expansion in config
- Add extraArgs config.expand-env=true
- Required for ${VAR} substitution in tempo.yaml
2026-01-09 21:41:52 +09:00
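Note: a sketch of the kind of tempo.yaml fragment that depends on this flag. The bucket, endpoint, and environment variable names are hypothetical; the substitution happens only when Tempo is started with -config.expand-env=true.

```yaml
# Requires Tempo to run with -config.expand-env=true
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo                      # hypothetical bucket name
      endpoint: minio.example.svc:9000   # hypothetical MinIO endpoint
      access_key: ${S3_ACCESS_KEY}       # hypothetical variable names
      secret_key: ${S3_SECRET_KEY}
      insecure: true                     # assumed plain-HTTP MinIO inside the cluster
```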
5f926cb6cf FEAT(tempo): configure S3 storage with MinIO
- Enable env var expansion in config
- Configure extraEnv for S3 credentials
- Fix OTel Collector image settings
2026-01-09 21:41:52 +09:00
7139f3e5a2 FIX(prometheus): correct ArgoCD metrics service names
- Update controller target to argocd-application-controller-metrics
- Update repo-server target to argocd-repo-server-metrics
2026-01-09 21:41:52 +09:00
034a5f32a2 CHORE(repo): remove application.yaml reference
- Remove from kustomization.yaml
2026-01-09 21:41:52 +09:00
87420d842d CHORE(repo): remove self-referencing application.yaml
- Delete application.yaml (managed by platform)
2026-01-09 21:41:52 +09:00
445cabb900 FIX(prometheus): add ExternalSecret default values to fix OutOfSync
2026-01-09 21:41:52 +09:00
aecb15031d FEAT(grafana): add Thanos as default datasource
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus

REFACTOR(thanos): move all components to master node

- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)

FIX(thanos): allow non-Bitnami images (quay.io/thanos)

FIX(thanos): correct nodeSelector value to 'true'
2026-01-09 21:41:52 +09:00
9b052b49cf FEAT(thanos): add Thanos for Prometheus HA
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure OCI registry for Bitnami chart
- Fix Vault secret path and image settings
- Add nodeSelector for master node
2026-01-09 21:41:52 +09:00
ea4d7d4ecf PERF(prometheus): reduce CPU request from 200m to 50m
- Actual usage is ~17m, 200m was over-provisioned
- Fixes "Insufficient cpu" scheduling error for replica 2
2026-01-09 21:41:52 +09:00
6b576d6a16 FEAT(thanos): add Thanos for Prometheus HA and long-term storage
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure Prometheus replicas: 2 with pod anti-affinity
- Add ExternalSecrets for MinIO credentials
- Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d
2026-01-09 21:41:52 +09:00
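Note: the retention tiers above map onto the Thanos Compactor retention flags. The sketch below shows them as container arguments; how the chart actually passes them (dedicated values keys or extra flags) is not shown in the log and is an assumption.

```yaml
# Hypothetical container args for the Thanos Compactor
args:
  - compact
  - --retention.resolution-raw=7d    # raw samples
  - --retention.resolution-5m=30d    # 5-minute downsampled blocks
  - --retention.resolution-1h=90d    # 1-hour downsampled blocks
```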
9f3b768cd9 FIX(loki): fix lokiCanary config path
- Move lokiCanary to top-level config
- Fix toleration not being applied to DaemonSet
2026-01-09 21:41:52 +09:00
a1c347e4ff FEAT(loki): enable loki-canary with control-plane toleration
- Enable lokiCanary for log ingestion monitoring
- Add toleration for control-plane node
2026-01-09 21:41:52 +09:00
30f028fae4 CHORE(prometheus): disable CPU/Memory overcommit alerts
- Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts
- Cluster uses replica=2 with pod anti-affinity for HA
2026-01-09 21:41:52 +09:00
6da4eba1dc CHORE(grafana): remove admin login secret for SSO
- Remove grafana-admin-password ExternalSecret
- Remove admin section from helm-values.yaml
- Authentication handled by Authelia SSO middleware
2026-01-09 21:41:52 +09:00
735166fc9c REFACTOR(repo): standardize taint to control-plane
- Change node-role.kubernetes.io/master to control-plane
- Update vpa, goldilocks, kube-state-metrics tolerations
- Remove deprecated master taint from promtail
2026-01-09 21:41:52 +09:00
7ed4d69c51 PERF(alertmanager): add HA with 2 replicas
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
4511fd5b2e FIX(repo): correct nodeSelector label value
- Change master label value from "" to "true"
- Fix pod scheduling failure due to label mismatch
2026-01-09 21:41:52 +09:00