Commit Graph

12 Commits

Author SHA1 Message Date
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
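
The `loki.resource.labels` hint named in the commit above is an attribute that the Collector's Loki exporter reads to decide which resource attributes become Loki labels. A minimal sketch of how such a hint can be set, assuming it is added through the `resource` processor and that the two promoted attributes are the ones logs.json now queries; the actual collector.yaml is not shown in this log, so treat every value as illustrative:

```yaml
# Sketch only: fragment of an OpenTelemetry Collector config,
# not the repository's actual collector.yaml.
processors:
  resource:
    attributes:
      # Hint read by the Loki exporter: a comma-separated list of resource
      # attributes to promote to Loki labels. Dots become underscores in the
      # resulting label names, e.g. k8s.namespace.name -> k8s_namespace_name.
      - action: insert
        key: loki.resource.labels
        value: k8s.namespace.name, k8s.container.name
```

This only works if the listed resource attributes are already present on the log records, which is assumed here.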
466ec6210c CHORE(observability): align memory requests with limits
- Update opentelemetry-operator manager memory request from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy memory request from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
2026-01-10 02:31:19 +09:00
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for home cluster
2026-01-10 01:18:35 +09:00
328d952cc1 FIX(otel): increase metrics collector memory to 1Gi
- OTel metrics collector pods were OOMKilled with 512Mi limit
- Increased memory requests to 512Mi and limits to 1Gi
2026-01-10 01:18:18 +09:00
8ce6f95d92 FIX(otel): use statefulset mode for metrics collector
- Change from deployment to statefulset mode
- Target Allocator requires statefulset (not deployment)
2026-01-10 00:01:22 +09:00
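
The mode change described above could look roughly like the following custom resource fragment. This is a sketch under assumptions: the `apiVersion` is a guess at the CRD version in use, the name is borrowed from the otel-metrics collector introduced in the earlier refactor commit, and the collector `config` itself is omitted.

```yaml
# Sketch only: OpenTelemetryCollector fragment with the collector config omitted.
apiVersion: opentelemetry.io/v1beta1   # assumption: CRD version not stated in the log
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics                   # name taken from the refactor commit
spec:
  mode: statefulset                    # previously deployment
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true                    # discover ServiceMonitor/PodMonitor resources
```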
5b70f19b12 REFACTOR(otel): split collector into logs and metrics
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment + Target Allocator) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
2026-01-09 23:50:21 +09:00
31f15e230d FIX(otel): add scrape_configs for Target Allocator
- Add minimal scrape_configs (required by Operator)
- Keep self-metrics scraping alongside Target Allocator
2026-01-09 23:36:55 +09:00
254687225c FIX(otel): use per-node strategy for DaemonSet mode
- Change allocationStrategy to per-node (required for DaemonSet)
- Operator rejects consistent-hashing with DaemonSet mode
2026-01-09 23:32:56 +09:00
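
For comparison, the DaemonSet pairing described above can be sketched as the fragment below. The `apiVersion` and the resource name are assumptions, and the collector `config` is omitted.

```yaml
# Sketch only: OpenTelemetryCollector fragment with the collector config omitted.
apiVersion: opentelemetry.io/v1beta1   # assumption: CRD version not stated in the log
kind: OpenTelemetryCollector
metadata:
  name: otel                           # hypothetical name
spec:
  mode: daemonset
  targetAllocator:
    enabled: true
    # Each collector pod scrapes only the targets on its own node,
    # which is the strategy the commit reports as required in this mode.
    allocationStrategy: per-node
```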
1fdbb5e1dd FEAT(otel): enable Target Allocator for metrics
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
2026-01-09 23:30:41 +09:00
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests for the Guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
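
The alignment described above comes down to a `resources` stanza of the following shape on each container. The 512Mi figure is purely illustrative and not taken from any of the listed charts.

```yaml
# Sketch only: container resources with memory request equal to memory limit.
resources:
  requests:
    memory: 512Mi   # illustrative value
  limits:
    memory: 512Mi   # same as the request
```

Note that Kubernetes assigns the Guaranteed class only when every container in the pod has both CPU and memory requests equal to their limits; with memory alone aligned, a pod is classed as Burstable.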
5f926cb6cf FEAT(tempo): configure S3 storage with MinIO
- Enable env var expansion in config
- Configure extraEnv for S3 credentials
- Fix OTel Collector image settings
2026-01-09 21:41:52 +09:00
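
A rough sketch of the Tempo side of this change, assuming Tempo is started with `-config.expand-env=true` so that `${VAR}` references in the config are resolved from the environment. The bucket name, endpoint, and environment variable names are hypothetical; the commit does not state them.

```yaml
# Sketch only: fragment of a Tempo config using S3-compatible storage (MinIO).
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces                # hypothetical bucket
      endpoint: minio.example.svc:9000    # hypothetical MinIO endpoint
      access_key: ${S3_ACCESS_KEY}        # hypothetical env var, injected via extraEnv
      secret_key: ${S3_SECRET_KEY}        # hypothetical env var, injected via extraEnv
      insecure: true                      # assumption: MinIO served over plain HTTP
```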