Commit Graph

167 Commits

da89c8dbf0 FIX(grafana): restore gauge design with percentage display
- Restore original gauge panel type
- Keep * 100 query and percent unit
- Set max to 100 for proper gauge range
2026-01-10 17:58:11 +09:00
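
As a reference for this restore, here is a minimal sketch of the gauge as a YAML rendering of the panel JSON. The node-CPU expression is illustrative; only the `* 100` scaling, `percent` unit, `max: 100`, and the 50%/80% thresholds (the latter from the stat-panel commit 7e375e20c6 below) come from the commit messages.

```yaml
# Sketch of the restored gauge (YAML rendering of the panel JSON).
# The expression is illustrative; unit, max, and thresholds are from
# the commit messages.
type: gauge
title: CPU Usage
targets:
  - expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
fieldConfig:
  defaults:
    unit: percent
    min: 0
    max: 100                     # full gauge range
    thresholds:
      mode: absolute
      steps:
        - { color: green, value: null }
        - { color: orange, value: 50 }
        - { color: red, value: 80 }
```
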
11f9457236 fix: increase CPU pressure threshold to 30% 2026-01-10 17:57:34 +09:00
7e375e20c6 FIX(grafana): show CPU Usage as percentage per node
- Change panel type from gauge to stat
- Add * 100 to query for percentage
- Show each node's CPU usage horizontally
- Set thresholds at 50% (orange), 80% (red)
2026-01-10 17:57:05 +09:00
b818a8c1fe fix: update CPU throttling panels to use PSI metrics with 10% threshold 2026-01-10 17:54:55 +09:00
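
The PSI-based panels plausibly query node-exporter's pressure collector (which requires kernel PSI support); a hedged sketch of one target follows. The metric name is an assumption, and the threshold shown is the 10% from this commit, later raised to 30% by 11f9457236 above.

```yaml
# Sketch of a PSI-based CPU pressure target; the metric name assumes
# node-exporter's pressure collector.
targets:
  - expr: rate(node_pressure_cpu_waiting_seconds_total[5m])   # fraction of time stalled on CPU
    legendFormat: '{{instance}}'
thresholds:
  steps:
    - { color: green, value: null }
    - { color: red, value: 0.1 }   # 10% here; raised to 30% by a later commit
```
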
2b1667e643 FIX(grafana): replace rate_interval with 5m in MinIO dashboard
- Change all $__rate_interval to 5m
- Fix "No data" issues in rate() queries
2026-01-10 17:50:47 +09:00
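
`$__rate_interval` resolves from the datasource's configured scrape interval, so when that setting is shorter than the actual scrape period, the window can contain fewer than two samples and `rate()` returns nothing. Pinning a 5m window sidesteps this. An illustrative before/after, with `minio_s3_requests_total` as an assumed metric:

```yaml
# Illustrative before/after for one MinIO panel query.
before: rate(minio_s3_requests_total{job="minio"}[$__rate_interval])  # can resolve too narrow -> "No data"
after:  rate(minio_s3_requests_total{job="minio"}[5m])                # always spans several samples
```
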
38e0c68ddb CHORE(grafana): rearrange Bucket Scans panels side by side
- Move Finished to left (x=0)
- Move Started next to Finished (x=12, same y)
2026-01-10 17:48:43 +09:00
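
In dashboard JSON the layout lives in each panel's `gridPos` on a 24-column grid, so "side by side" means x=0 and x=12 at the same y. A sketch (y, w, and h are assumptions):

```yaml
# Bucket Scans panels after the rearrangement (24-column grid).
- title: Bucket Scans Finished
  gridPos: { x: 0, y: 30, w: 12, h: 8 }    # y/w/h assumed
- title: Bucket Scans Started
  gridPos: { x: 12, y: 30, w: 12, h: 8 }
```
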
4afdf04ef2 CHORE(grafana): remove KMS panels from MinIO dashboard
- Remove 5 KMS-related panels (KMS not configured)
- KMS Uptime, Request rates, Online/Offline status
2026-01-10 17:46:45 +09:00
20b796f9e4 FIX(grafana): fix MinIO CPU Usage panel query
- Hardcode job=minio and 5m interval
- Change unit from 's' to 'percentunit'
- Set max to 1 for proper gauge display
2026-01-10 17:33:54 +09:00
fa4c2ce8f6 FIX(grafana): set default value for MinIO dashboard variable
- Set scrape_jobs default to 'minio'
- Hide variable selector (only one option)
2026-01-10 17:32:23 +09:00
fc4f825b6d FIX(grafana): fix MinIO dashboard scrape_jobs variable
- Query only MinIO-related jobs
- Set includeAll and multi to false
2026-01-10 17:15:53 +09:00
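
Combining the two variable fixes above, the `scrape_jobs` variable plausibly ends up like the sketch below; the `label_values()` query and the metric name inside it are assumptions.

```yaml
# Sketch of the scrape_jobs template variable after both fixes.
templating:
  list:
    - name: scrape_jobs
      type: query
      query: label_values(minio_node_process_uptime_seconds, job)  # assumed query
      includeAll: false
      multi: false
      hide: 2                                 # hide the selector entirely
      current: { text: minio, value: minio }  # default to the only option
```
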
7d5780cb97 PERF(tempo): switch from MinIO to local filesystem storage
- Change storage backend from S3 to local filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 3 days to 1 day for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Remove anti-affinity (single replica only)
2026-01-10 15:58:34 +09:00
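
A sketch of the resulting Tempo values, assuming the grafana/tempo chart's key layout; retention and sizes come from the commit message, the key paths and volume wiring are assumptions.

```yaml
# Sketch of the Tempo storage switch (grafana/tempo chart layout assumed).
tempo:
  retention: 24h               # 1 day, down from 3
  storage:
    trace:
      backend: local           # was s3 (MinIO)
      local:
        path: /var/tempo/traces
persistence:
  enabled: false               # traces live in an emptyDir instead
extraVolumes:
  - name: tempo-data
    emptyDir:
      sizeLimit: 5Gi           # up from 2Gi
```
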
eea6420544 PERF(loki): switch from MinIO to local filesystem storage
- Change storage type from S3 to filesystem (emptyDir)
- Remove MinIO/S3 configuration and credentials
- Reduce retention from 7 days to 3 days for local storage
- Increase emptyDir size from 2Gi to 5Gi
- Eliminates MinIO CPU load from Loki operations
2026-01-10 15:57:50 +09:00
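
The matching Loki change, sketched against the grafana/loki single-binary layout; retention and sizes are from the commit, the key paths are assumptions. The canary line anticipates the commit below.

```yaml
# Sketch of the Loki storage switch (grafana/loki chart layout assumed).
loki:
  storage:
    type: filesystem           # was s3 (MinIO)
  limits_config:
    retention_period: 72h      # 3 days, down from 7
singleBinary:
  persistence:
    enabled: false
  extraVolumes:
    - name: loki-data
      emptyDir:
        sizeLimit: 5Gi         # up from 2Gi
lokiCanary:
  enabled: false               # disabled by the commit below to cut MinIO reads
```
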
001aa9253d PERF(loki): disable canary to reduce MinIO load
- Disable lokiCanary, which queries Loki every second
- Reduces continuous S3 read operations on MinIO
2026-01-10 15:43:19 +09:00
ef7c7c2593 PERF(loki,tempo): reduce replicas to 1
- Reduce Loki singleBinary replicas from 2 to 1
- Reduce Tempo replicas from 2 to 1
- Decrease MinIO CPU load (0.5 → 0.1 cores expected)
2026-01-10 15:32:07 +09:00
b4b48c6e89 FIX(opentelemetry-operator): restore memory to 256Mi
- The VPA-recommended 75Mi was too low, causing informer sync timeouts
- Restore original memory value for stability
2026-01-10 14:52:24 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
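
These figures were presumably read from recommendation-only VPA objects. A generic sketch, with the grafana target purely illustrative; the restore commit above (b4b48c6e89) shows why such recommendations should not be applied blindly.

```yaml
# Sketch of a recommendation-only VPA of the kind these figures come from.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grafana                # illustrative target
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grafana
  updatePolicy:
    updateMode: "Off"          # recommend only; requests are applied by hand
```
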
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replicas at 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
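
Soft anti-affinity is a scheduling preference rather than a hard rule, so both replicas can still land on one node if the cluster is tight. A sketch with assumed label keys:

```yaml
# Sketch of the soft (preferred) anti-affinity added to Loki and Tempo.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: loki      # assumed label
```
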
395c79ad9e PERF(alertmanager): reduce karma replicas to 1
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:36:14 +09:00
67db06cf8b PERF(observability): reduce replicas to 1
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
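
The `medium-priority` name is from the commit; the class definition itself is not shown in the log, so the value below is an assumption. Each workload then opts in via `priorityClassName: medium-priority` in its pod spec.

```yaml
# Sketch of the assumed medium-priority class.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 500000                  # assumed; must sit between the low and high tiers
globalDefault: false
description: Observability core components (Prometheus, Loki, Thanos, Tempo)
```
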
c34f56945a feat(prometheus): enable container CPU throttling metrics collection
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:55:36 +09:00
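
kube-prometheus-stack's default `cAdvisorMetricRelabelings` drop several high-cardinality series, including `container_cpu_cfs_throttled_seconds_total`. The override plausibly trims the drop regex, roughly as below; the default list is reproduced from memory and varies by chart version.

```yaml
# Sketch of the kube-prometheus-stack override; verify against the
# chart version in use.
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        # cfs_throttled_seconds_total removed from the default regex so
        # the throttling panels get data
        regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
```
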
823edfbd88 fix(grafana): restrict main dashboard datasource to Thanos only
- Set regex filter "/Thanos/" on datasource variable
- Set default value to "Thanos"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:51:44 +09:00
dc8706fb02 fix(grafana): set explicit 2m interval on CPU query targets
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously, the empty interval caused a $__rate_interval mismatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:50:44 +09:00
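
For reference, the per-target `interval` field in question, as a YAML rendering of the dashboard JSON; the expression is illustrative.

```yaml
# One CPU target after the fix; the expr is illustrative.
targets:
  - refId: A
    interval: 2m     # was "", which let $__rate_interval drift per panel
    expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))
```
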
3516a860db fix(grafana): standardize CPU panel intervals to 2m
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:48:21 +09:00
64e129128f fix(grafana): sync interval for CPU panels in main dashboard
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:46:15 +09:00
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
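
The `loki.resource.labels` hint tells the collector's (since-deprecated) Loki exporter which resource attributes to promote to Loki labels; dots become underscores, which is why the logs dashboard now queries `k8s_namespace_name` and `k8s_container_name`. A sketch of the hint, typically set via the resource processor; the exact attribute list in the repo may differ.

```yaml
# Sketch of the Loki label hint in the collector config.
processors:
  resource/loki:
    attributes:
      - action: insert
        key: loki.resource.labels
        value: k8s.namespace.name, k8s.container.name
```
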
de81ca68c9 FIX(opentelemetry-operator): fix ServiceMonitor config path
- Move serviceMonitor config from metrics to manager section
- Fix Helm schema validation error
- Disable ServiceMonitor creation to prevent conflicts
2026-01-10 02:38:53 +09:00
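
After the move, the opentelemetry-operator values plausibly look like this; the key layout follows the commit, though the chart's schema may differ across versions.

```yaml
# Sketch of the corrected values: serviceMonitor under manager, disabled.
manager:
  serviceMonitor:
    enabled: false   # avoids the "servicemonitor already exists" conflict
```
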
dac5fc7bcf FIX(opentelemetry-operator): disable ServiceMonitor creation
- Set metrics.serviceMonitor.enabled to false
- Prevent ServiceMonitor conflict causing CrashLoopBackOff
- Resolve the "servicemonitor already exists" error
2026-01-10 02:36:12 +09:00
8a050dd303 CHORE(opentelemetry-operator): disable CPU limits
- Set CPU limits to null for manager container
- Set CPU limits to null for kube-rbac-proxy container
- Disable chart default CPU limits to prevent throttling
2026-01-10 02:32:53 +09:00
466ec6210c CHORE(observability): align memory requests with limits
- Update opentelemetry-operator manager from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
2026-01-10 02:31:19 +09:00
507395aca7 CHORE(otel-operator): schedule on master node
- Add tolerations and nodeSelector to run operator on control-plane node
2026-01-10 01:18:41 +09:00
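
A sketch of the control-plane pinning, assuming the standard kubeadm taint and label:

```yaml
# Schedule the operator on the control-plane node.
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```
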
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for the home cluster
2026-01-10 01:18:35 +09:00
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- Cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00
328d952cc1 FIX(otel): increase metrics collector memory to 1Gi
- OTel metrics collector pods were OOMKilled with 512Mi limit
- Increased memory requests to 512Mi and limits to 1Gi
2026-01-10 01:18:18 +09:00
5bc0caa324 FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled
- Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi)
- Increased memory requests and limits from 768Mi to 1536Mi
2026-01-10 01:18:11 +09:00
8ce6f95d92 FIX(otel): use statefulset mode for metrics collector
- Change from deployment to statefulset mode
- Target Allocator requires statefulset (not deployment)
2026-01-10 00:01:22 +09:00
5b70f19b12 REFACTOR(otel): split collector into logs and metrics
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment+TA) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
2026-01-09 23:50:21 +09:00
12ee5b61c0 FIX(prometheus): enable out-of-order time window
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
2026-01-09 23:43:01 +09:00
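
With several collectors pushing samples, arrival order is not guaranteed, and the TSDB rejects anything older than the newest sample unless an out-of-order window is set. A sketch, assuming the kube-prometheus-stack values layout:

```yaml
# Sketch of the out-of-order window (kube-prometheus-stack layout assumed).
prometheus:
  prometheusSpec:
    tsdb:
      outOfOrderTimeWindow: 5m   # accept samples up to 5m behind the head
```
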
a3c5a8dbcf CHORE(prometheus): disable direct scraping
- Disable ServiceMonitor/PodMonitor scraping in Prometheus
- OTel Collector now handles all metrics collection
- Prevents out-of-order sample errors from duplicate scraping
2026-01-09 23:39:30 +09:00
31f15e230d FIX(otel): add scrape_configs for Target Allocator
- Add minimal scrape_configs (required by Operator)
- Keep self-metrics scraping alongside Target Allocator
2026-01-09 23:36:55 +09:00
254687225c FIX(otel): use per-node strategy for DaemonSet mode
- Change allocationStrategy to per-node (required for DaemonSet)
- Operator rejects consistent-hashing with DaemonSet mode
2026-01-09 23:32:56 +09:00
1fdbb5e1dd FEAT(otel): enable Target Allocator for metrics
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
2026-01-09 23:30:41 +09:00
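
Putting this commit together with the per-node and statefulset follow-ups above, the metrics collector CR plausibly ended up like the sketch below; names are illustrative.

```yaml
# Sketch of the metrics collector CR after the follow-up fixes above.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics
spec:
  mode: statefulset              # required by the Target Allocator
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true              # discover ServiceMonitors/PodMonitors
```
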
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
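
A minimal sketch of a DaemonSet collector with a filelog receiver, following the common upstream pattern rather than the repo's actual file; the debug exporter stands in for the real Loki/Tempo exporters.

```yaml
# Sketch of a logs/traces collector CR; exporters are placeholders.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-logs
spec:
  mode: daemonset
  config:
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
        operators:
          - type: container          # parse CRI log lines, derive k8s attributes
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}                      # stand-in for Loki/Tempo exporters
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [debug]
        traces:
          receivers: [otlp]
          exporters: [debug]
```
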
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence is disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
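
All four services end up on the same volume pattern, sized per the commit:

```yaml
# The shared emptyDir pattern; sizeLimit caps node disk usage.
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 2Gi   # 5Gi for prometheus, 500Mi for alertmanager
```
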
fa4d97eede REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret
- Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret)
- Remove manifests source from ArgoCD application
2026-01-09 21:42:35 +09:00
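
A hedged sketch of the `ClusterExternalSecret` now supplying the MinIO credentials; the resource name is from the commit, while the namespace selector, store reference, and remote key are assumptions.

```yaml
# Sketch of the cluster-wide secret source (external-secrets.io).
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: minio-s3-credentials
spec:
  externalSecretName: minio-s3-credentials
  namespaceSelector:                   # assumed opt-in label
    matchLabels:
      minio-s3-credentials: "true"
  externalSecretSpec:
    secretStoreRef:
      kind: ClusterSecretStore
      name: secret-store               # assumed store name
    target:
      name: minio-s3-credentials
    dataFrom:
      - extract:
          key: minio                   # assumed remote key
```
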
b378c6ec06 FIX(tempo): move extraEnv under tempo section for S3 credentials
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
2026-01-09 21:42:35 +09:00
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use the boot volume (/dev/sda1)
- WAL data is temporary; persistent data is stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00