observability

Author	SHA1	Message	Date
Mayne0213	7cd778313a	FIX(prometheus): disable PrometheusDuplicateTimestamps alert - Low severity alert that fires repeatedly in HA setup - 0.05 samples/s drop rate is negligible	2026-01-09 21:42:35 +09:00
Mayne0213	bb8b1c193e	FIX(alertmanager): improve OOMKilled alert detection - Only fire when container restarted in last 10 minutes - Prevent stale alerts from old OOM events	2026-01-09 21:42:35 +09:00
Mayne0213	e3c615b5c1	FEAT(alertmanager): add OOMKilled alert rule - Add PrometheusRule to alert when containers are OOMKilled - Severity: warning, fires immediately	2026-01-09 21:42:35 +09:00
Mayne0213	539f4be497	FIX(grafana): use kube-state-metrics for OOM detection - Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason - Fix OOM events not showing after pod restart - cAdvisor metric resets on pod restart, kube-state-metrics persists	2026-01-09 21:42:35 +09:00
Mayne0213	14bd244b98	FIX(thanos): increase compactor memory to 256Mi - Compactor was OOMKilled with 128Mi limit - Set to 256Mi for stability during compaction	2026-01-09 21:42:35 +09:00
Mayne0213	4a4a43ed82	FIX(prometheus): increase memory to 768Mi - Prometheus was OOMKilled with 512Mi limit - Set both requests and limits to 768Mi	2026-01-09 21:42:35 +09:00
Mayne0213	8c2a9badf8	FIX(alertmanager): set karma memory limits equal to requests - Align memory limits with requests for guaranteed QoS	2026-01-09 21:42:35 +09:00
Mayne0213	5089e8607d	CHORE(resources): set memory limits equal to memory requests Align memory limits with memory requests for guaranteed QoS class. - prometheus, thanos (query, storegateway, compactor) - alertmanager, tempo, goldilocks (dashboard, controller) - node-exporter, opentelemetry-collector, vpa, kube-state-metrics	2026-01-09 21:42:35 +09:00
Mayne0213	fd6c1952ad	FIX(tempo): enable env var expansion in config - Add extraArgs config.expand-env=true - Required for ${VAR} substitution in tempo.yaml	2026-01-09 21:41:52 +09:00
Mayne0213	5f926cb6cf	FEAT(tempo): configure S3 storage with MinIO - Enable env var expansion in config - Configure extraEnv for S3 credentials - Fix OTel Collector image settings	2026-01-09 21:41:52 +09:00
Mayne0213	7139f3e5a2	FIX(prometheus): correct ArgoCD metrics service names - Update controller target to argocd-application-controller-metrics - Update repo-server target to argocd-repo-server-metrics	2026-01-09 21:41:52 +09:00
Mayne0213	034a5f32a2	CHORE(repo): remove application.yaml reference - Remove from kustomization.yaml	2026-01-09 21:41:52 +09:00
Mayne0213	87420d842d	CHORE(repo): remove self-referencing application.yaml - Delete application.yaml (managed by platform)	2026-01-09 21:41:52 +09:00
Mayne0213	445cabb900	FIX(prometheus): add ExternalSecret default values to fix OutOfSync	2026-01-09 21:41:52 +09:00
Mayne0213	aecb15031d	FEAT(grafana): add Thanos as default datasource - Add Thanos Query as default Prometheus datasource - Keep original Prometheus datasource as backup - Thanos provides deduplicated metrics from HA Prometheus REFACTOR(thanos): move all components to master node - Add tolerations for control-plane:NoSchedule - Add nodeSelector for control-plane node - Affects: query, storegateway, compactor - PVC will be recreated on master node (data in S3) FIX(thanos): allow non-Bitnami images (quay.io/thanos) FIX(thanos): correct nodeSelector value to 'true'	2026-01-09 21:41:52 +09:00
Mayne0213	9b052b49cf	FEAT(thanos): add Thanos for Prometheus HA - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure OCI registry for Bitnami chart - Fix Vault secret path and image settings - Add nodeSelector for master node	2026-01-09 21:41:52 +09:00
Mayne0213	ea4d7d4ecf	PERF(prometheus): reduce CPU request from 200m to 50m - Actual usage is ~17m, 200m was over-provisioned - Fixes "Insufficient cpu" scheduling error for replica 2	2026-01-09 21:41:52 +09:00
Mayne0213	6b576d6a16	FEAT(thanos): add Thanos for Prometheus HA and long-term storage - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure Prometheus replicas: 2 with pod anti-affinity - Add ExternalSecrets for MinIO credentials - Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d	2026-01-09 21:41:52 +09:00
Mayne0213	9f3b768cd9	FIX(loki): fix lokiCanary config path - Move lokiCanary to top-level config - Fix toleration not being applied to DaemonSet	2026-01-09 21:41:52 +09:00
Mayne0213	a1c347e4ff	FEAT(loki): enable loki-canary with control-plane toleration - Enable lokiCanary for log ingestion monitoring - Add toleration for control-plane node	2026-01-09 21:41:52 +09:00
Mayne0213	30f028fae4	CHORE(prometheus): disable CPU/Memory overcommit alerts - Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts - Cluster uses replica=2 with pod anti-affinity for HA	2026-01-09 21:41:52 +09:00
Mayne0213	6da4eba1dc	CHORE(grafana): remove admin login secret for SSO - Remove grafana-admin-password ExternalSecret - Remove admin section from helm-values.yaml - Authentication handled by Authelia SSO middleware	2026-01-09 21:41:52 +09:00
Mayne0213	735166fc9c	REFACTOR(repo): standardize taint to control-plane - Change node-role.kubernetes.io/master to control-plane - Update vpa, goldilocks, kube-state-metrics tolerations - Remove deprecated master taint from promtail	2026-01-09 21:41:52 +09:00
Mayne0213	7ed4d69c51	PERF(alertmanager): add HA with 2 replicas - Increase replicaCount from 1 to 2 - Add soft pod anti-affinity to spread across nodes - Improve availability during node failures	2026-01-09 21:41:52 +09:00
Mayne0213	4511fd5b2e	FIX(repo): correct nodeSelector label value - Change master label value from "" to "true" - Fix pod scheduling failure due to label mismatch	2026-01-09 21:41:52 +09:00
Mayne0213	1c6a9dc491	PERF(repo): move system pods to master node - Add nodeSelector for master node placement - Add tolerations for NoExecute taint - kube-state-metrics: schedule on master - goldilocks-controller: schedule on master, reduce to 1 replica - vpa-recommender: schedule on master, remove anti-affinity - Free worker node resources for applications	2026-01-09 21:41:52 +09:00
Mayne0213	bbdd908b27	CHORE(uptime-kuma): remove uptime-kuma application - Delete uptime-kuma folder and configuration - Using Grafana + Prometheus for monitoring instead	2026-01-09 21:41:52 +09:00
Mayne0213	6a8e1f5a47	PERF(vpa): fix config and reduce CPU request - Merge duplicate recommender sections - Reduce CPU: 50m → 15m - Change replicas: 2 → 1 (single recommender sufficient)	2026-01-09 21:41:52 +09:00
Mayne0213	2cf35d0f76	FEAT(loki): configure storage and HA - Rename extraVolume to avoid duplicate name - Add emptyDir for /var/loki cache - Migrate to shared storage with MinIO - Configure HA with 2 replicas - Revert to single replica for Single Binary mode	2026-01-09 21:41:52 +09:00
Mayne0213	2b7ee1fe51	FEAT(loki): configure storage and HA - Rename extraVolume to avoid duplicate name - Add emptyDir for /var/loki cache - Migrate to shared storage with MinIO - Configure HA with 2 replicas - Revert to single replica for Single Binary mode	2026-01-09 21:41:52 +09:00
Mayne0213	1b7a4294dc	FIX(goldilocks): merge duplicate dashboard sections - Merge dashboard.affinity into dashboard section - Fix YAML structure to prevent OutOfSync status	2026-01-09 21:41:52 +09:00
Mayne0213	4515ea0b33	FEAT(observability): enable HA with replica 2 and soft anti-affinity - Add replicaCount: 2 to goldilocks, vpa, alertmanager - Add replicas: 2 to loki singleBinary - Add soft pod anti-affinity for node distribution - Keep kube-state-metrics at replica 1 to prevent duplicate metrics FIX(loki): revert to replica 1 for Single Binary mode - Single Binary mode cannot run more than 1 replica without object storage - Remove affinity configuration for single replica - Keep filesystem storage backend	2026-01-09 21:41:51 +09:00
Mayne0213	855321bebf	FIX(prometheus): ignore relabelings default value diff in ServiceMonitor	2026-01-08 00:15:09 +09:00
Mayne0213	4286296591	PERF(resources): remove CPU limits - keep memory limits only - CPU throttling prevents app startup, not crashes - Memory OOM is the real cascading failure cause - CPU request ensures fair scheduling	2026-01-07 23:48:35 +09:00
Mayne0213	69dc3b34be	REFACTOR(secrets): flatten Vault paths - Change secret paths from <category>/<app> to <app> - monitoring/alertmanager → alertmanager - monitoring/grafana → grafana - databases/postgresql → postgresql	2026-01-06 16:52:58 +09:00
Mayne0213	7888aeff36	REFACTOR(repo): move vault/ to manifests/ - Move ExternalSecret files from vault/ to manifests/secret.yaml - Update kustomization.yaml references - Remove vault/ folders Apps: alertmanager, grafana, prometheus	2026-01-06 16:42:33 +09:00
Mayne0213	7b9abaf9c8	REFACTOR(obs): integrate ingress to helm-values - alertmanager: move ingress to karma inline, servicemonitor to manifests - goldilocks: move ingress to helm-values - grafana: move ingress to helm-values - uptime-kuma: move ingress to helm-values	2026-01-06 01:57:03 +09:00
Mayne0213	28ba50d1a3	REFACTOR(repo): observability repo structure - Add application.yaml for ArgoCD app-of-apps - Add kustomization.yaml with observability components - Add renovate.json for automated updates - Update all component argocd.yaml repoURLs to observability repo Components: prometheus, alertmanager, grafana, loki, promtail, node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa	2026-01-05 00:40:01 +09:00
Mayne0213	8dcb563ae4	REFACTOR(grafana): change Grafana storageClass - Update storageClass to local-path - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	c472035499	FEAT(grafana): add Grafana monitoring - Add Grafana monitoring configuration - Enable metrics collection	2026-01-05 00:40:01 +09:00
Mayne0213	18bb20075e	REFACTOR(grafana): remove trivy ui - use grafana dashboard instead - Delete trivy-ui ArgoCD Application - Delete trivy ingress.yaml - Update kustomization.yaml	2026-01-05 00:40:01 +09:00
Mayne0213	864c2c45d8	REFACTOR(alertmanager): change storageClass - Update storageClass to local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	7653f2c4c8	FIX(grafana): fix storageClass key name - Correct storageclass key spelling - Fix Helm values configuration	2026-01-05 00:40:01 +09:00
Mayne0213	aad4c249e2	CHORE(grafana): disable auto dashboard provision - Use manual import instead of automatic provisioning - Remove configMapGenerator for dashboards - Remove sidecar and dashboards helm config - Keep JSON files in dashboards/ for manual import reference	2026-01-05 00:40:01 +09:00
Mayne0213	ababd677d4	FEAT(repo): enable ServerSideApply - Enable ServerSideApply to handle large configmaps - Fix resource management issues	2026-01-05 00:40:01 +09:00
Mayne0213	9583be9b46	FEAT(grafana): export dashboards - to JSON and use sidecar ConfigMaps - Export 14 dashboards to JSON files - Use kustomize configMapGenerator for dashboard ConfigMaps - Enable Grafana sidecar to load dashboards from ConfigMaps - Keep Longhorn and Traefik Official from grafana.com	2026-01-05 00:40:01 +09:00
Mayne0213	b7ac39d68c	FEAT(prometheus): enable Loki ServiceMonitor - Enable Loki ServiceMonitor for Prometheus metrics collection - Add monitoring configuration	2026-01-05 00:40:01 +09:00
Mayne0213	c356493707	FEAT(alertmanager): add datasource to Grafana - Add Alertmanager datasource configuration - Enable alert visualization	2026-01-05 00:40:01 +09:00
Mayne0213	997893284b	FEAT(alertmanager): add ServiceMonitor - Create servicemonitor.yaml for Prometheus to scrape Alertmanager - alertmanager chart does not include ServiceMonitor, must be added separately - Enables Grafana Alertmanager dashboard to display data	2026-01-05 00:40:01 +09:00
Mayne0213	699b31cc67	FEAT(repo): add uptime kuma for service monitoring - Add Uptime Kuma Helm chart deployment (dirsigler/uptime-kuma v2.24.0) - Configure ingress for kuma0213.kro.kr - Enable ServiceMonitor for Prometheus integration - Use local-path storage class with 1Gi	2026-01-05 00:40:01 +09:00

112 Commits