observability

Author	SHA1	Message	Date
Mayne0213	507395aca7	CHORE(otel-operator): schedule on master node - Add tolerations and nodeSelector to run operator on control-plane node	2026-01-10 01:18:41 +09:00
Mayne0213	9e87e6fbcb	REVERT(otel): remove metrics collection, keep logs/traces only - Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors - OTel Collector only handles logs (filelog) and traces (otlp) - Remove Target Allocator and metrics-related config - This reduces complexity and resource usage for home cluster	2026-01-10 01:18:35 +09:00
Mayne0213	a506ca3f58	FIX(prometheus): reduce replicas to 1 due to resource constraints - Cluster has insufficient memory to schedule 2 Prometheus replicas - Thanos sidecar still provides HA query capability	2026-01-10 01:18:26 +09:00
Mayne0213	328d952cc1	FIX(otel): increase metrics collector memory to 1Gi - OTel metrics collector pods were OOMKilled with 512Mi limit - Increased memory requests to 512Mi and limits to 1Gi	2026-01-10 01:18:18 +09:00
Mayne0213	5bc0caa324	FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled - Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi) - Increased memory requests and limits from 768Mi to 1536Mi	2026-01-10 01:18:11 +09:00
Mayne0213	8ce6f95d92	FIX(otel): use statefulset mode for metrics collector - Change from deployment to statefulset mode - Target Allocator requires statefulset (not deployment)	2026-01-10 00:01:22 +09:00
Mayne0213	5b70f19b12	REFACTOR(otel): split collector into logs and metrics - Create otel-logs (DaemonSet) for logs and traces collection - Create otel-metrics (Deployment+TA) for metrics collection - Use consistent-hashing strategy for full target coverage - Remove old unified collector.yaml	2026-01-09 23:50:21 +09:00
Mayne0213	12ee5b61c0	FIX(prometheus): enable out-of-order time window - Set outOfOrderTimeWindow to 5m for TSDB - Allow slightly out-of-order samples from distributed collectors - Prevents data loss from timing differences	2026-01-09 23:43:01 +09:00
Mayne0213	a3c5a8dbcf	CHORE(prometheus): disable direct scraping - Disable ServiceMonitor/PodMonitor scraping in Prometheus - OTel Collector now handles all metrics collection - Prevents out-of-order sample errors from duplicate scraping	2026-01-09 23:39:30 +09:00
Mayne0213	31f15e230d	FIX(otel): add scrape_configs for Target Allocator - Add minimal scrape_configs (required by Operator) - Keep self-metrics scraping alongside Target Allocator	2026-01-09 23:36:55 +09:00
Mayne0213	254687225c	FIX(otel): use per-node strategy for DaemonSet mode - Change allocationStrategy to per-node (required for DaemonSet) - Operator rejects consistent-hashing with DaemonSet mode	2026-01-09 23:32:56 +09:00
Mayne0213	1fdbb5e1dd	FEAT(otel): enable Target Allocator for metrics - Enable Target Allocator with consistent-hashing strategy - Configure prometheus receiver to use Target Allocator - Add RBAC permissions for secrets and events - Use prometheusCR for ServiceMonitor/PodMonitor discovery	2026-01-09 23:30:41 +09:00
Mayne0213	02faf93555	FEAT(otel): add OTel Collector for logs and traces - Add OpenTelemetry Operator for CR management - Deploy OTel Collector as DaemonSet via CR - Enable filelog receiver for container log collection - Replace Promtail with OTel filelog receiver - Keep Prometheus for ServiceMonitor-based metrics scraping	2026-01-09 23:23:51 +09:00
Mayne0213	ad9573e998	FIX(alertmanager): remove duplicate volume config - Remove extraVolumes and extraVolumeMounts - Chart uses emptyDir automatically when persistence disabled	2026-01-09 21:42:35 +09:00
Mayne0213	470a08f78a	CHORE(repo): switch to emptyDir with sizeLimit - Add sizeLimit 2Gi to loki emptyDir - Add sizeLimit 2Gi to tempo emptyDir - Change prometheus from PVC to emptyDir 5Gi - Change alertmanager from PVC to emptyDir 500Mi	2026-01-09 21:42:35 +09:00
Mayne0213	fa4d97eede	REFACTOR(tempo): remove redundant ExternalSecret, use ClusterExternalSecret - Remove tempo-s3-secret ExternalSecret (now using minio-s3-credentials from ClusterExternalSecret) - Remove manifests source from ArgoCD application	2026-01-09 21:42:35 +09:00
Mayne0213	b378c6ec06	FIX(tempo): move extraEnv under tempo section for S3 credentials - Move extraEnv from top-level to tempo section where chart expects it - Move extraVolumeMounts under tempo section for proper WAL mounting - Fixes Access Denied error when connecting to MinIO	2026-01-09 21:42:35 +09:00
Mayne0213	8ac76d17f3	FEAT(loki,tempo): use MinIO with emptyDir for WAL - Loki: disable PVC, use emptyDir for /var/loki - Tempo: switch backend from local to s3 (MinIO) - Tempo: disable PVC, use emptyDir for /var/tempo - Both services no longer use boot volume (/dev/sda1) - WAL data is temporary, persistent data stored in MinIO	2026-01-09 21:42:35 +09:00
Mayne0213	2e6b4cecbf	FEAT(loki): switch storage backend to MinIO S3 - Change storage type from filesystem to s3 - Configure MinIO endpoint and bucket settings - Add S3 credentials from minio-s3-credentials secret - Update schema config to use s3 object_store	2026-01-09 21:42:35 +09:00
Mayne0213	24747b98cf	REFACTOR(loki,tempo): switch from MinIO to local-path storage - Loki: s3 backend to filesystem with local-path PVC - Tempo: s3 backend to local backend with local-path PVC - Remove MinIO/S3 credentials and configuration	2026-01-09 21:42:35 +09:00
Mayne0213	94af545120	REFACTOR(thanos): remove S3 storage integration - Disable Store Gateway and Compactor - Remove Sidecar objectStorageConfig - Keep Thanos Query + Sidecar for HA query - 3-day local retention is sufficient	2026-01-09 21:42:35 +09:00
Mayne0213	ffed27419a	REFACTOR(blackbox-exporter): revert to http_2xx module - Remove http_auth module workaround - Authelia now bypasses internal cluster traffic - All endpoints use standard http_2xx module	2026-01-09 21:42:35 +09:00
Mayne0213	37c216c433	FIX(blackbox-exporter): handle Authelia-protected endpoints - Add http_auth module accepting 401/403 status codes - Apply http_auth to grafana, code-server, pgweb, velero-ui - These services return 401 when accessed without authentication	2026-01-09 21:42:35 +09:00
Mayne0213	884a38d8ad	FEAT(blackbox-exporter): add external endpoint monitoring - Add blackbox-exporter with prometheus-community Helm chart - Configure HTTP probes for 25 external endpoints - Include SSL certificate expiry alerting rules - Add probe failure and slow response alerts - Deploy 2 replicas with anti-affinity for HA	2026-01-09 21:42:35 +09:00
Mayne0213	01c5742d7a	FIX(grafana): change OOM panel to stat type - Replace timeseries with stat panel for OOM detection - Show total count of OOMKilled pods instead of timeline - Gauge metric not suitable for timeseries visualization	2026-01-09 21:42:35 +09:00
Mayne0213	7cd778313a	FIX(prometheus): disable PrometheusDuplicateTimestamps alert - Low severity alert that fires repeatedly in HA setup - 0.05 samples/s drop rate is negligible	2026-01-09 21:42:35 +09:00
Mayne0213	bb8b1c193e	FIX(alertmanager): improve OOMKilled alert detection - Only fire when container restarted in last 10 minutes - Prevent stale alerts from old OOM events	2026-01-09 21:42:35 +09:00
Mayne0213	e3c615b5c1	FEAT(alertmanager): add OOMKilled alert rule - Add PrometheusRule to alert when containers are OOMKilled - Severity: warning, fires immediately	2026-01-09 21:42:35 +09:00
Mayne0213	539f4be497	FIX(grafana): use kube-state-metrics for OOM detection - Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason - Fix OOM events not showing after pod restart - cAdvisor metric resets on pod restart, kube-state-metrics persists	2026-01-09 21:42:35 +09:00
Mayne0213	14bd244b98	FIX(thanos): increase compactor memory to 256Mi - Compactor was OOMKilled with 128Mi limit - Set to 256Mi for stability during compaction	2026-01-09 21:42:35 +09:00
Mayne0213	4a4a43ed82	FIX(prometheus): increase memory to 768Mi - Prometheus was OOMKilled with 512Mi limit - Set both requests and limits to 768Mi	2026-01-09 21:42:35 +09:00
Mayne0213	8c2a9badf8	FIX(alertmanager): set karma memory limits equal to requests - Align memory limits with requests for guaranteed QoS	2026-01-09 21:42:35 +09:00
Mayne0213	5089e8607d	CHORE(resources): set memory limits equal to memory requests - Align memory limits with memory requests for guaranteed QoS class - prometheus, thanos (query, storegateway, compactor) - alertmanager, tempo, goldilocks (dashboard, controller) - node-exporter, opentelemetry-collector, vpa, kube-state-metrics	2026-01-09 21:42:35 +09:00
Mayne0213	fd6c1952ad	FIX(tempo): enable env var expansion in config - Add extraArgs config.expand-env=true - Required for ${VAR} substitution in tempo.yaml	2026-01-09 21:41:52 +09:00
Mayne0213	5f926cb6cf	FEAT(tempo): configure S3 storage with MinIO - Enable env var expansion in config - Configure extraEnv for S3 credentials - Fix OTel Collector image settings	2026-01-09 21:41:52 +09:00
Mayne0213	7139f3e5a2	FIX(prometheus): correct ArgoCD metrics service names - Update controller target to argocd-application-controller-metrics - Update repo-server target to argocd-repo-server-metrics	2026-01-09 21:41:52 +09:00
Mayne0213	034a5f32a2	CHORE(repo): remove application.yaml reference - Remove from kustomization.yaml	2026-01-09 21:41:52 +09:00
Mayne0213	87420d842d	CHORE(repo): remove self-referencing application.yaml - Delete application.yaml (managed by platform)	2026-01-09 21:41:52 +09:00
Mayne0213	445cabb900	FIX(prometheus): add ExternalSecret default values to fix OutOfSync	2026-01-09 21:41:52 +09:00
Mayne0213	aecb15031d	FEAT(grafana): add Thanos as default datasource - Add Thanos Query as default Prometheus datasource - Keep original Prometheus datasource as backup - Thanos provides deduplicated metrics from HA Prometheus REFACTOR(thanos): move all components to master node - Add tolerations for control-plane:NoSchedule - Add nodeSelector for control-plane node - Affects: query, storegateway, compactor - PVC will be recreated on master node (data in S3) FIX(thanos): allow non-Bitnami images (quay.io/thanos) FIX(thanos): correct nodeSelector value to 'true'	2026-01-09 21:41:52 +09:00
Mayne0213	9b052b49cf	FEAT(thanos): add Thanos for Prometheus HA - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure OCI registry for Bitnami chart - Fix Vault secret path and image settings - Add nodeSelector for master node	2026-01-09 21:41:52 +09:00
Mayne0213	ea4d7d4ecf	PERF(prometheus): reduce CPU request from 200m to 50m - Actual usage is ~17m, 200m was over-provisioned - Fixes "Insufficient cpu" scheduling error for replica 2	2026-01-09 21:41:52 +09:00
Mayne0213	6b576d6a16	FEAT(thanos): add Thanos for Prometheus HA and long-term storage - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure Prometheus replicas: 2 with pod anti-affinity - Add ExternalSecrets for MinIO credentials - Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d	2026-01-09 21:41:52 +09:00
Mayne0213	9f3b768cd9	FIX(loki): fix lokiCanary config path - Move lokiCanary to top-level config - Fix toleration not being applied to DaemonSet	2026-01-09 21:41:52 +09:00
Mayne0213	a1c347e4ff	FEAT(loki): enable loki-canary with control-plane toleration - Enable lokiCanary for log ingestion monitoring - Add toleration for control-plane node	2026-01-09 21:41:52 +09:00
Mayne0213	30f028fae4	CHORE(prometheus): disable CPU/Memory overcommit alerts - Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts - Cluster uses replica=2 with pod anti-affinity for HA	2026-01-09 21:41:52 +09:00
Mayne0213	6da4eba1dc	CHORE(grafana): remove admin login secret for SSO - Remove grafana-admin-password ExternalSecret - Remove admin section from helm-values.yaml - Authentication handled by Authelia SSO middleware	2026-01-09 21:41:52 +09:00
Mayne0213	735166fc9c	REFACTOR(repo): standardize taint to control-plane - Change node-role.kubernetes.io/master to control-plane - Update vpa, goldilocks, kube-state-metrics tolerations - Remove deprecated master taint from promtail	2026-01-09 21:41:52 +09:00
Mayne0213	7ed4d69c51	PERF(alertmanager): add HA with 2 replicas - Increase replicaCount from 1 to 2 - Add soft pod anti-affinity to spread across nodes - Improve availability during node failures	2026-01-09 21:41:52 +09:00
Mayne0213	4511fd5b2e	FIX(repo): correct nodeSelector label value - Change master label value from "" to "true" - Fix pod scheduling failure due to label mismatch	2026-01-09 21:41:52 +09:00

137 Commits