184 Commits

Author SHA1 Message Date
b145881fa2 PERF(prometheus): increase memory limit to 1Gi
- Increase memory request from 768Mi to 1Gi
- Increase memory limit from 768Mi to 1Gi
- Prevents OOMKills; usage had reached 97% of the previous limit
2026-01-12 03:16:40 +09:00
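For context, a change like the one above would look roughly like this in the chart's helm-values.yaml — a sketch assuming the kube-prometheus-stack layout; the exact key path depends on the chart in use:

```yaml
# Sketch for kube-prometheus-stack values; the prometheusSpec path is an assumption.
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 1Gi   # was 768Mi
      limits:
        memory: 1Gi   # was 768Mi; the pod was OOM-killed near the old limit
```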
7e61af372b PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
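Removing a CPU limit while keeping the request means the container still gets a scheduling guarantee but is never CFS-throttled. A generic sketch (values illustrative, not taken from the repo):

```yaml
# Keep the CPU request for scheduling; omit limits.cpu entirely so the
# kernel's CFS quota never throttles the container. Memory limit stays.
resources:
  requests:
    cpu: 25m
    memory: 256Mi
  limits:
    memory: 256Mi   # cpu limit intentionally absent
```

Memory limits are kept because memory, unlike CPU, is not a compressible resource.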
3b5bf20902 PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
a70403d1ae FEAT(grafana): add Tempo datasource
- Add Tempo datasource for distributed tracing
- Configure URL to tempo.tempo.svc.cluster.local:3100
2026-01-12 00:34:50 +09:00
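With the Grafana Helm chart, a datasource like this is typically provisioned under the `datasources` values key — a minimal sketch using the URL from the commit; the `access` mode is an assumption:

```yaml
# Grafana chart datasource provisioning; only the URL is taken from the commit.
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Tempo
        type: tempo
        url: http://tempo.tempo.svc.cluster.local:3100
        access: proxy
```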
7cbc0c810e FIX(tempo): move resources to correct helm path
- Move resources from top-level to tempo.resources
- Fix memory limit not being applied to container
2026-01-12 00:21:12 +09:00
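The fix above is about values nesting: keys placed at the top level of helm-values.yaml are silently ignored unless the chart consumes them there. A sketch of the before/after (memory value illustrative):

```yaml
# Before — ignored, because the chart only reads tempo.resources:
# resources:
#   limits:
#     memory: 128Mi

# After — picked up by the tempo subchart and applied to the container:
tempo:
  resources:
    limits:
      memory: 128Mi
```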
904cc3cab6 PERF(grafana): increase memory limits
- Increase requests from 175Mi to 256Mi
- Increase limits from 175Mi to 256Mi
- Fix OOM and timeout issues
2026-01-11 23:32:09 +09:00
c1214029a2 refactor: update Vault secret paths to new categorized structure
- alertmanager: alertmanager → observability/alertmanager
- grafana: postgresql → storage/postgresql
- prometheus: postgresql → storage/postgresql, minio → storage/minio
- thanos: minio → storage/minio

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 22:36:22 +09:00
4aa7e37f76 PERF(otel): reduce resources based on VPA recommendation
- Add fullnameOverride to simplify pod names
- Reduce memory request from 512Mi to 400Mi
- Reduce CPU request from 50m to 25m
2026-01-11 21:33:58 +09:00
4bdcaf8fcd REFACTOR(otel): rename folder to opentelemetry
- Rename opentelemetry-collector to opentelemetry
- Update ArgoCD Application name to opentelemetry
- Simplify folder structure after operator removal
2026-01-11 21:27:54 +09:00
43cf7e9de7 REFACTOR(otel): migrate collector from Operator to Helm
- Remove opentelemetry-operator (no longer needed)
- Convert opentelemetry-collector to direct Helm Chart
- Remove CRD-based manifests (collector.yaml, rbac.yaml)
- Update helm-values.yaml with Loki labels and env vars
- Simplify architecture: Helm -> DaemonSet (no Operator)
2026-01-11 21:22:39 +09:00
15d5e58d6c migrate: change repoURLs from GitHub to Gitea
Update all ArgoCD Application references for the K3S-HOME/observability
repository to use Gitea (github0213.com) instead of GitHub.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:43:29 +09:00
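In each ArgoCD Application, this is a one-field change on `spec.source.repoURL`. A trimmed sketch — the org/repo path mirrors the commit message, everything else (name, revision, path) is illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: observability
spec:
  source:
    repoURL: https://github0213.com/K3S-HOME/observability.git  # was github.com
    targetRevision: main
    path: .
  # destination/project omitted for brevity
```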
7d0c8aa5f3 FIX(opentelemetry-operator): remove cpu null values
- Remove cpu: null (not allowed in new chart schema)
- Keep only memory limits
2026-01-10 18:55:23 +09:00
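Newer chart versions with a strict values.schema.json reject `null` as a resource quantity, so the key has to be omitted rather than nulled out. A sketch of the change:

```yaml
# Before — fails schema validation on the 0.102.0 chart:
# resources:
#   limits:
#     cpu: null
#     memory: 256Mi

# After — leave the cpu key out entirely; no CPU limit is applied:
resources:
  limits:
    memory: 256Mi
```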
9c00c42946 CHORE(opentelemetry-operator): upgrade chart to 0.102.0
- Fix ServiceMonitor duplicate creation bug (Issue #3446)
- Upgrade from 0.74.0 to 0.102.0
2026-01-10 18:53:34 +09:00
a08d989fc3 FIX(opentelemetry-operator): remove invalid serviceMonitor
- Remove top-level serviceMonitor (not in chart schema)
- Keep manager.serviceMonitor.enabled: false
2026-01-10 18:42:02 +09:00
203a8debac REFACTOR(repo): remove control-plane scheduling
- Remove nodeSelector for control-plane node
- Remove tolerations for control-plane taint
- Allow pods to schedule on any available node
2026-01-10 18:35:15 +09:00
c128ece672 FIX(opentelemetry-operator): disable serviceMonitor
- Add top-level serviceMonitor.enabled: false
- Prevent duplicate ServiceMonitor creation on restart
2026-01-10 18:28:12 +09:00
bcf60b2428 fix: set CPU pressure threshold to 10% 2026-01-10 18:00:06 +09:00
da89c8dbf0 FIX(grafana): restore gauge design with percentage display
- Restore original gauge panel type
- Keep * 100 query and percent unit
- Set max to 100 for proper gauge range
2026-01-10 17:58:11 +09:00
11f9457236 fix: increase CPU pressure threshold to 30% 2026-01-10 17:57:34 +09:00
7e375e20c6 FIX(grafana): show CPU Usage as percentage per node
- Change panel type from gauge to stat
- Add * 100 to query for percentage
- Show each node's CPU usage horizontally
- Set thresholds at 50% (orange), 80% (red)
2026-01-10 17:57:05 +09:00
b818a8c1fe fix: update CPU throttling panels to use PSI metrics with 10% threshold 2026-01-10 17:54:55 +09:00
2b1667e643 FIX(grafana): replace rate_interval with 5m in MinIO dashboard
- Change all $__rate_interval to 5m
- Fix No data issues in rate() queries
2026-01-10 17:50:47 +09:00
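The substitution above turns queries like the first line into the second (metric name illustrative). `$__rate_interval` can resolve to a window too short to span two scrape samples, which makes `rate()` return nothing; a fixed 5m window always covers enough points:

```promql
rate(minio_s3_requests_total[$__rate_interval])  # "No data" when the window is too narrow
rate(minio_s3_requests_total[5m])                # fixed window, always enough samples
```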
38e0c68ddb CHORE(grafana): rearrange Bucket Scans panels side by side
- Move Finished to left (x=0)
- Move Started next to Finished (x=12, same y)
2026-01-10 17:48:43 +09:00
4afdf04ef2 CHORE(grafana): remove KMS panels from MinIO dashboard
- Remove 5 KMS-related panels (KMS not configured)
- KMS Uptime, Request rates, Online/Offline status
2026-01-10 17:46:45 +09:00
20b796f9e4 FIX(grafana): fix MinIO CPU Usage panel query
- Hardcode job=minio and 5m interval
- Change unit from 's' to 'percentunit'
- Set max to 1 for proper gauge display
2026-01-10 17:33:54 +09:00
fa4c2ce8f6 FIX(grafana): set default value for MinIO dashboard variable
- Set scrape_jobs default to 'minio'
- Hide variable selector (only one option)
2026-01-10 17:32:23 +09:00
fc4f825b6d FIX(grafana): fix MinIO dashboard scrape_jobs variable
- Query only MinIO-related jobs
- Set includeAll and multi to false
2026-01-10 17:15:53 +09:00
7d5780cb97 PERF(tempo): switch from MinIO to local filesystem storage
- Change storage backend from S3 to local filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 3 days to 1 day for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Remove anti-affinity (single replica only)
2026-01-10 15:58:34 +09:00
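A hedged sketch of what the Tempo storage switch looks like in values form — key paths follow the grafana/tempo single-binary chart and may differ by chart version:

```yaml
# Switch the trace backend from S3/MinIO to node-local storage.
tempo:
  storage:
    trace:
      backend: local          # was: s3
      local:
        path: /var/tempo/traces   # path is illustrative
  retention: 24h              # was 72h; emptyDir capacity is finite
```

The trade-off is explicit: traces no longer survive pod rescheduling, in exchange for removing all S3 read/write load from MinIO.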
eea6420544 PERF(loki): switch from MinIO to local filesystem storage
- Change storage type from S3 to filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 7 days to 3 days for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Eliminates MinIO CPU load from Loki operations
2026-01-10 15:57:50 +09:00
001aa9253d PERF(loki): disable canary to reduce MinIO load
- Disable lokiCanary which queries Loki every second
- Reduces continuous S3 read operations on MinIO
2026-01-10 15:43:19 +09:00
ef7c7c2593 PERF(loki,tempo): reduce replicas to 1
- Reduce Loki singleBinary replicas from 2 to 1
- Reduce Tempo replicas from 2 to 1
- Decrease MinIO CPU load (0.5 → 0.1 cores expected)
2026-01-10 15:32:07 +09:00
b4b48c6e89 FIX(opentelemetry-operator): restore memory to 256Mi
- VPA-recommended 75Mi was too low, causing informer sync timeouts
- Restore original memory value for stability
2026-01-10 14:52:24 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replica 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
395c79ad9e PERF(alertmanager): reduce karma replicas to 1
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:36:14 +09:00
67db06cf8b PERF(observability): reduce replicas to 1
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
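The "medium-priority" class referenced above is presumably a PriorityClass resource plus a `priorityClassName` on each workload. A sketch — only the class name comes from the commit; the value is illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority   # name taken from the commit message
value: 100000             # illustrative; must sit between low/high tiers
globalDefault: false
description: "Core observability workloads evicted after critical tiers"
---
# Then, in each workload's pod template (or the chart key mapping to it):
# spec:
#   priorityClassName: medium-priority
```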
c34f56945a feat(prometheus): enable container CPU throttling metrics collection
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:55:36 +09:00
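kube-prometheus-stack's default kubelet ServiceMonitor drops a number of cAdvisor series by regex, including the CFS throttling counter. Restoring it means overriding that relabeling list with the throttling metric removed — a sketch with a shortened regex (the real default list is much longer):

```yaml
# Sketch for kube-prometheus-stack; regex abbreviated for illustration.
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
        # container_cpu_cfs_throttled_seconds_total removed from the drop regex,
        # so throttling panels in Grafana now have data
```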
823edfbd88 fix(grafana): restrict main dashboard datasource to Thanos only
- Set regex filter "/Thanos/" on datasource variable
- Set default value to "Thanos"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:51:44 +09:00
dc8706fb02 fix(grafana): set explicit 2m interval on CPU query targets
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously the empty interval caused a $__rate_interval mismatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:50:44 +09:00
3516a860db fix(grafana): standardize CPU panel intervals to 2m
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:48:21 +09:00
64e129128f fix(grafana): sync interval for CPU panels in main dashboard
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:46:15 +09:00
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
de81ca68c9 FIX(opentelemetry-operator): fix ServiceMonitor config path
- Move serviceMonitor config from metrics to manager section
- Fix Helm schema validation error
- Disable ServiceMonitor creation to prevent conflicts
2026-01-10 02:38:53 +09:00
dac5fc7bcf FIX(opentelemetry-operator): disable ServiceMonitor creation
- Set metrics.serviceMonitor.enabled to false
- Prevent ServiceMonitor conflict causing CrashLoopBackOff
- Resolve error: servicemonitor already exists
2026-01-10 02:36:12 +09:00
8a050dd303 CHORE(opentelemetry-operator): disable CPU limits
- Set CPU limits to null for manager container
- Set CPU limits to null for kube-rbac-proxy container
- Disable chart default CPU limits to prevent throttling
2026-01-10 02:32:53 +09:00
466ec6210c CHORE(observability): align memory requests with limits
- Update opentelemetry-operator manager from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
2026-01-10 02:31:19 +09:00
507395aca7 CHORE(otel-operator): schedule on master node
- Add tolerations and nodeSelector to run operator on control-plane node
2026-01-10 01:18:41 +09:00
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for home cluster
2026-01-10 01:18:35 +09:00
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- The cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00