Commit Graph

147 Commits

SHA1 Message Date
c34f56945a feat(prometheus): enable container CPU throttling metrics collection
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:55:36 +09:00
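
A sketch of what this override might look like in kube-prometheus-stack values (shortened; the chart's full default drop list is longer and must be copied over, since overriding replaces it wholesale):

```yaml
# values.yaml (kube-prometheus-stack) -- illustrative fragment
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        # cfs_throttled_seconds_total removed from the regex, so
        # container_cpu_cfs_throttled_seconds_total is now kept
        regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
```
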
823edfbd88 fix(grafana): restrict main dashboard datasource to Thanos only
- Set regex filter "/Thanos/" on datasource variable
- Set default value to "Thanos"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:51:44 +09:00
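
The corresponding dashboard change, rendered here as the dashboard JSON fields in YAML form (field names per Grafana's templating schema; only the relevant fields shown):

```yaml
# templating.list entry in the dashboard JSON (YAML rendering)
- name: datasource
  type: datasource
  query: prometheus     # list all Prometheus-type datasources...
  regex: /Thanos/       # ...then filter the list down to Thanos
  current:
    text: Thanos        # default selection
    value: Thanos
```
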
dc8706fb02 fix(grafana): set explicit 2m interval on CPU query targets
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously the interval was empty, causing a $__rate_interval mismatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:50:44 +09:00
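
One affected target after the fix, again as dashboard JSON in YAML form (the expression is illustrative, not the dashboard's actual query):

```yaml
targets:
  - refId: A
    legendFormat: Real Linux
    # Per-query minimum interval. When this was empty, Grafana derived
    # $__rate_interval from the scrape interval alone, so the query
    # disagreed with the panels pinned to 2m.
    interval: 2m
    expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))
```
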
3516a860db fix(grafana): standardize CPU panel intervals to 2m
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:48:21 +09:00
64e129128f fix(grafana): sync interval for CPU panels in main dashboard
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:46:15 +09:00
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
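
The Loki labelling hint mentioned above could look roughly like this in collector.yaml (assuming the contrib resource processor and Loki exporter; pipeline wiring omitted):

```yaml
processors:
  resource:
    attributes:
      - action: insert
        key: loki.resource.labels
        # Resource attributes promoted to Loki labels; dots become
        # underscores, yielding the k8s_namespace_name and
        # k8s_container_name labels used by logs.json.
        value: k8s.namespace.name, k8s.container.name
```
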
de81ca68c9 FIX(opentelemetry-operator): fix ServiceMonitor config path
- Move serviceMonitor config from metrics to manager section
- Fix Helm schema validation error
- Disable ServiceMonitor creation to prevent conflicts
2026-01-10 02:38:53 +09:00
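
A minimal sketch of the corrected values, assuming the opentelemetry-operator chart's manager section (the old metrics.serviceMonitor path failed schema validation):

```yaml
# opentelemetry-operator chart values -- illustrative fragment
manager:
  serviceMonitor:
    # Disabled to avoid a second ServiceMonitor colliding with the
    # existing one ("servicemonitor already exists").
    enabled: false
```
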
dac5fc7bcf FIX(opentelemetry-operator): disable ServiceMonitor creation
- Set metrics.serviceMonitor.enabled to false
- Prevent ServiceMonitor conflict causing CrashLoopBackOff
- Resolve error: servicemonitor already exists
2026-01-10 02:36:12 +09:00
8a050dd303 CHORE(opentelemetry-operator): disable CPU limits
- Set CPU limits to null for manager container
- Set CPU limits to null for kube-rbac-proxy container
- Disable chart default CPU limits to prevent throttling
2026-01-10 02:32:53 +09:00
466ec6210c CHORE(observability): align memory requests with limits
- Update opentelemetry-operator manager from 64Mi to 256Mi
- Update opentelemetry-operator kube-rbac-proxy from 32Mi to 64Mi
- Update opentelemetry-collector memory request from 256Mi to 512Mi
2026-01-10 02:31:19 +09:00
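
Taken together, the two resource commits above amount to values along these lines (key names per the opentelemetry-operator chart; the CPU requests shown are illustrative):

```yaml
# opentelemetry-operator chart values -- illustrative fragment
manager:
  resources:
    requests:
      cpu: 100m        # hypothetical; request kept for scheduling
      memory: 256Mi
    limits:
      cpu: null        # drop the chart's default CPU limit (no throttling)
      memory: 256Mi    # limit == request
kubeRBACProxy:
  resources:
    requests:
      memory: 64Mi
    limits:
      cpu: null
      memory: 64Mi
```
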
507395aca7 CHORE(otel-operator): schedule on master node
- Add tolerations and nodeSelector to run operator on control-plane node
2026-01-10 01:18:41 +09:00
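
A sketch of the scheduling constraints (the control-plane label value varies by distribution, so treat the empty string as an assumption):

```yaml
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
```
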
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for the home cluster
2026-01-10 01:18:35 +09:00
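
The resulting logs/traces-only collector could be sketched as the CR below (endpoints, exporter choice, and filelog parsing operators are illustrative; host-path mounts omitted):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: daemonset                  # one collector per node for file tailing
  config:
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
      otlp:
        protocols:
          grpc: {}
    exporters:
      loki:                        # hypothetical endpoint
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push
      otlp/tempo:                  # hypothetical endpoint
        endpoint: tempo.observability.svc:4317
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [loki]
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo]
```
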
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- Cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00
328d952cc1 FIX(otel): increase metrics collector memory to 1Gi
- OTel metrics collector pods were OOMKilled with 512Mi limit
- Increased memory requests to 512Mi and limits to 1Gi
2026-01-10 01:18:18 +09:00
5bc0caa324 FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled
- Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi)
- Increased memory requests and limits from 768Mi to 1536Mi
2026-01-10 01:18:11 +09:00
8ce6f95d92 FIX(otel): use statefulset mode for metrics collector
- Change from deployment to statefulset mode
- Target Allocator requires statefulset (not deployment)
2026-01-10 00:01:22 +09:00
5b70f19b12 REFACTOR(otel): split collector into logs and metrics
- Create otel-logs (DaemonSet) for logs and traces collection
- Create otel-metrics (Deployment+TA) for metrics collection
- Use consistent-hashing strategy for full target coverage
- Remove old unified collector.yaml
2026-01-09 23:50:21 +09:00
12ee5b61c0 FIX(prometheus): enable out-of-order time window
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
2026-01-09 23:43:01 +09:00
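
In kube-prometheus-stack values this maps to a single TSDB field:

```yaml
prometheus:
  prometheusSpec:
    tsdb:
      # Accept samples up to 5 minutes behind the newest one, absorbing
      # timing skew between distributed collectors.
      outOfOrderTimeWindow: 5m
```
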
a3c5a8dbcf CHORE(prometheus): disable direct scraping
- Disable ServiceMonitor/PodMonitor scraping in Prometheus
- OTel Collector now handles all metrics collection
- Prevents out-of-order sample errors from duplicate scraping
2026-01-09 23:39:30 +09:00
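
One way to express "select no monitors" in kube-prometheus-stack values is a selector that deliberately matches nothing (the label here is hypothetical; the commit does not show the exact mechanism used):

```yaml
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        scrape: never     # hypothetical, intentionally unmatched
    podMonitorSelector:
      matchLabels:
        scrape: never
```
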
31f15e230d FIX(otel): add scrape_configs for Target Allocator
- Add minimal scrape_configs (required by Operator)
- Keep self-metrics scraping alongside Target Allocator
2026-01-09 23:36:55 +09:00
254687225c FIX(otel): use per-node strategy for DaemonSet mode
- Change allocationStrategy to per-node (required for DaemonSet)
- Operator rejects consistent-hashing with DaemonSet mode
2026-01-09 23:32:56 +09:00
1fdbb5e1dd FEAT(otel): enable Target Allocator for metrics
- Enable Target Allocator with consistent-hashing strategy
- Configure prometheus receiver to use Target Allocator
- Add RBAC permissions for secrets and events
- Use prometheusCR for ServiceMonitor/PodMonitor discovery
2026-01-09 23:30:41 +09:00
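
Together with the statefulset fix further up the log, the Target Allocator setup could be sketched like this (remote-write endpoint and self-scrape job are illustrative):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics
spec:
  mode: statefulset                # required by the Target Allocator
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true                # discover ServiceMonitors/PodMonitors
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:          # minimal config required by the Operator;
            - job_name: otel-self  # real jobs come from the allocator
              static_configs:
                - targets: ["0.0.0.0:8888"]
    exporters:
      prometheusremotewrite:       # hypothetical endpoint
        endpoint: http://prometheus.observability.svc:9090/api/v1/write
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```
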
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
ad9573e998 FIX(alertmanager): remove duplicate volume config
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
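
Each of these reduces to the same volume shape (fragment; the kubelet evicts the pod if usage exceeds the limit):

```yaml
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 2Gi   # 5Gi for prometheus, 500Mi for alertmanager
```
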
fa4d97eede REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret
- Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret)
- Remove manifests source from ArgoCD application
2026-01-09 21:42:35 +09:00
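
For reference, a ClusterExternalSecret fans one definition out to every matching namespace; a hedged sketch (selector label, store name, and backend key are hypothetical):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: minio-s3-credentials
spec:
  externalSecretName: minio-s3-credentials
  namespaceSelector:
    matchLabels:
      minio-credentials: "true"      # hypothetical
  externalSecretSpec:
    secretStoreRef:
      name: cluster-secret-store     # hypothetical
      kind: ClusterSecretStore
    target:
      name: minio-s3-credentials
    dataFrom:
      - extract:
          key: minio-s3              # hypothetical backend key
```
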
b378c6ec06 FIX(tempo): move extraEnv under tempo section for S3 credentials
- Move extraEnv from top-level to tempo section where chart expects it
- Move extraVolumeMounts under tempo section for proper WAL mounting
- Fixes Access Denied error when connecting to MinIO
2026-01-09 21:42:35 +09:00
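
A sketch of the corrected layout, folding in the expand-env flag from the commit further down the log (secret key names are assumptions; the matching WAL emptyDir volume is omitted):

```yaml
tempo:
  extraArgs:
    config.expand-env: "true"        # allow ${VAR} substitution in tempo.yaml
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: minio-s3-credentials
          key: access-key            # hypothetical key name
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: minio-s3-credentials
          key: secret-key            # hypothetical key name
  extraVolumeMounts:
    - name: tempo-wal
      mountPath: /var/tempo
```
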
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use boot volume (/dev/sda1)
- WAL data is temporary, persistent data stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
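
A hedged sketch of the Loki side (endpoint, bucket name, and schema dates are illustrative):

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks                     # hypothetical bucket
    s3:
      endpoint: http://minio.minio.svc:9000   # hypothetical endpoint
      accessKeyId: ${AWS_ACCESS_KEY_ID}
      secretAccessKey: ${AWS_SECRET_ACCESS_KEY}
      s3ForcePathStyle: true                  # required for MinIO
      insecure: true
  schemaConfig:
    configs:
      - from: "2024-01-01"                    # illustrative date
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
```
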
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: switch s3 backend to filesystem with local-path PVC
- Tempo: switch s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
94af545120 REFACTOR(thanos): remove S3 storage integration
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
2026-01-09 21:42:35 +09:00
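
In Bitnami thanos chart terms, roughly:

```yaml
query:
  enabled: true      # keep deduplicated querying across HA replicas
storegateway:
  enabled: false     # no S3 blocks left to serve
compactor:
  enabled: false     # nothing to compact without object storage
```
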
ffed27419a REFACTOR(blackbox-exporter): revert to http_2xx module
- Remove http_auth module workaround
- Authelia now bypasses internal cluster traffic
- All endpoints use standard http_2xx module
2026-01-09 21:42:35 +09:00
37c216c433 FIX(blackbox-exporter): handle Authelia-protected endpoints
- Add http_auth module accepting 401/403 status codes
- Apply http_auth to grafana, code-server, pgweb, velero-ui
- These services return 401 when accessed without authentication
2026-01-09 21:42:35 +09:00
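
The workaround module, as blackbox_exporter module config (later reverted once Authelia bypassed internal traffic):

```yaml
modules:
  http_auth:
    prober: http
    timeout: 5s
    http:
      # 401/403 from Authelia still proves the endpoint is up and serving.
      valid_status_codes: [200, 401, 403]
```
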
884a38d8ad FEAT(blackbox-exporter): add external endpoint monitoring
- Add blackbox-exporter with prometheus-community Helm chart
- Configure HTTP probes for 25 external endpoints
- Include SSL certificate expiry alerting rules
- Add probe failure and slow response alerts
- Deploy 2 replicas with anti-affinity for HA
2026-01-09 21:42:35 +09:00
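
The SSL expiry alerting mentioned above could be sketched as a rule-group fragment like this (the 14-day threshold is illustrative):

```yaml
groups:
  - name: blackbox
    rules:
      - alert: SSLCertExpiringSoon
        # probe_ssl_earliest_cert_expiry is a Unix timestamp, so this is
        # "fewer than 14 days of validity left".
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"
```
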
01c5742d7a FIX(grafana): change OOM panel to stat type
- Replace timeseries with stat panel for OOM detection
- Show total count of OOMKilled pods instead of timeline
- Gauge metric not suitable for timeseries visualization
2026-01-09 21:42:35 +09:00
7cd778313a FIX(prometheus): disable PrometheusDuplicateTimestamps alert
- Low severity alert that fires repeatedly in HA setup
- 0.05 samples/s drop rate is negligible
2026-01-09 21:42:35 +09:00
bb8b1c193e FIX(alertmanager): improve OOMKilled alert detection
- Only fire when container restarted in last 10 minutes
- Prevent stale alerts from old OOM events
2026-01-09 21:42:35 +09:00
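
One way to express this gate in PromQL, combining the kube-state-metrics reason gauge with a recent-restart check (a sketch, not necessarily the rule as committed):

```yaml
- alert: ContainerOOMKilled
  expr: |
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
    and on (namespace, pod, container)
    increase(kube_pod_container_status_restarts_total[10m]) > 0
  labels:
    severity: warning
```
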
e3c615b5c1 FEAT(alertmanager): add OOMKilled alert rule
- Add PrometheusRule to alert when containers are OOMKilled
- Severity: warning, fires immediately
2026-01-09 21:42:35 +09:00
539f4be497 FIX(grafana): use kube-state-metrics for OOM detection
- Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason
- Fix OOM events not showing after pod restart
- The cAdvisor metric resets on pod restart; the kube-state-metrics one persists
2026-01-09 21:42:35 +09:00
14bd244b98 FIX(thanos): increase compactor memory to 256Mi
- Compactor was OOMKilled with 128Mi limit
- Set to 256Mi for stability during compaction
2026-01-09 21:42:35 +09:00
4a4a43ed82 FIX(prometheus): increase memory to 768Mi
- Prometheus was OOMKilled with 512Mi limit
- Set both requests and limits to 768Mi
2026-01-09 21:42:35 +09:00
8c2a9badf8 FIX(alertmanager): set karma memory limits equal to requests
- Align memory limits with requests for guaranteed QoS
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests for guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
fd6c1952ad FIX(tempo): enable env var expansion in config
- Add extraArgs config.expand-env=true
- Required for ${VAR} substitution in tempo.yaml
2026-01-09 21:41:52 +09:00
5f926cb6cf FEAT(tempo): configure S3 storage with MinIO
- Enable env var expansion in config
- Configure extraEnv for S3 credentials
- Fix OTel Collector image settings
2026-01-09 21:41:52 +09:00
7139f3e5a2 FIX(prometheus): correct ArgoCD metrics service names
- Update controller target to argocd-application-controller-metrics
- Update repo-server target to argocd-repo-server-metrics
2026-01-09 21:41:52 +09:00
034a5f32a2 CHORE(repo): remove application.yaml reference
- Remove from kustomization.yaml
2026-01-09 21:41:52 +09:00
87420d842d CHORE(repo): remove self-referencing application.yaml
- Delete application.yaml (managed by platform)
2026-01-09 21:41:52 +09:00
445cabb900 FIX(prometheus): add ExternalSecret default values to fix OutOfSync 2026-01-09 21:41:52 +09:00
aecb15031d FEAT(grafana): add Thanos as default datasource
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus

REFACTOR(thanos): move all components to master node

- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)

FIX(thanos): allow non-Bitnami images (quay.io/thanos)

FIX(thanos): correct nodeSelector value to 'true'
2026-01-09 21:41:52 +09:00
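
The datasource addition maps to kube-prometheus-stack values along these lines (the URL is illustrative; how the built-in Prometheus datasource loses its default flag is not shown in the commit):

```yaml
grafana:
  additionalDataSources:
    - name: Thanos
      type: prometheus
      access: proxy
      url: http://thanos-query.observability.svc:9090   # hypothetical
      isDefault: true    # deduplicated view across the HA Prometheus pair
```
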