observability

Author	SHA1	Message	Date
Mayne0213	6b576d6a16	FEAT(thanos): add Thanos for Prometheus HA and long-term storage - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure Prometheus replicas: 2 with pod anti-affinity - Add ExternalSecrets for MinIO credentials - Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d	2026-01-09 21:41:52 +09:00
Mayne0213	9f3b768cd9	FIX(loki): fix lokiCanary config path - Move lokiCanary to top-level config - Fix toleration not being applied to DaemonSet	2026-01-09 21:41:52 +09:00
Mayne0213	a1c347e4ff	FEAT(loki): enable loki-canary with control-plane toleration - Enable lokiCanary for log ingestion monitoring - Add toleration for control-plane node	2026-01-09 21:41:52 +09:00
Mayne0213	30f028fae4	CHORE(prometheus): disable CPU/Memory overcommit alerts - Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts - Cluster uses replica=2 with pod anti-affinity for HA	2026-01-09 21:41:52 +09:00
Mayne0213	6da4eba1dc	CHORE(grafana): remove admin login secret for SSO - Remove grafana-admin-password ExternalSecret - Remove admin section from helm-values.yaml - Authentication handled by Authelia SSO middleware	2026-01-09 21:41:52 +09:00
Mayne0213	735166fc9c	REFACTOR(repo): standardize taint to control-plane - Change node-role.kubernetes.io/master to control-plane - Update vpa, goldilocks, kube-state-metrics tolerations - Remove deprecated master taint from promtail	2026-01-09 21:41:52 +09:00
Mayne0213	7ed4d69c51	PERF(alertmanager): add HA with 2 replicas - Increase replicaCount from 1 to 2 - Add soft pod anti-affinity to spread across nodes - Improve availability during node failures	2026-01-09 21:41:52 +09:00
Mayne0213	4511fd5b2e	FIX(repo): correct nodeSelector label value - Change master label value from "" to "true" - Fix pod scheduling failure due to label mismatch	2026-01-09 21:41:52 +09:00
Mayne0213	1c6a9dc491	PERF(repo): move system pods to master node - Add nodeSelector for master node placement - Add tolerations for NoExecute taint - kube-state-metrics: schedule on master - goldilocks-controller: schedule on master, reduce to 1 replica - vpa-recommender: schedule on master, remove anti-affinity - Free worker node resources for applications	2026-01-09 21:41:52 +09:00
Mayne0213	bbdd908b27	CHORE(uptime-kuma): remove uptime-kuma application - Delete uptime-kuma folder and configuration - Using Grafana + Prometheus for monitoring instead	2026-01-09 21:41:52 +09:00
Mayne0213	6a8e1f5a47	PERF(vpa): fix config and reduce CPU request - Merge duplicate recommender sections - Reduce CPU: 50m → 15m - Change replicas: 2 → 1 (single recommender sufficient)	2026-01-09 21:41:52 +09:00
Mayne0213	2cf35d0f76	FEAT(loki): configure storage and HA - Rename extraVolume to avoid duplicate name - Add emptyDir for /var/loki cache - Migrate to shared storage with MinIO - Configure HA with 2 replicas - Revert to single replica for Single Binary mode	2026-01-09 21:41:52 +09:00
Mayne0213	2b7ee1fe51	FEAT(loki): configure storage and HA - Rename extraVolume to avoid duplicate name - Add emptyDir for /var/loki cache - Migrate to shared storage with MinIO - Configure HA with 2 replicas - Revert to single replica for Single Binary mode	2026-01-09 21:41:52 +09:00
Mayne0213	1b7a4294dc	FIX(goldilocks): merge duplicate dashboard sections - Merge dashboard.affinity into dashboard section - Fix YAML structure to prevent OutOfSync status	2026-01-09 21:41:52 +09:00
Mayne0213	4515ea0b33	FEAT(observability): enable HA with replica 2 and soft anti-affinity - Add replicaCount: 2 to goldilocks, vpa, alertmanager - Add replicas: 2 to loki singleBinary - Add soft pod anti-affinity for node distribution - Keep kube-state-metrics at replica 1 to prevent duplicate metrics FIX(loki): revert to replica 1 for Single Binary mode - Single Binary mode cannot run more than 1 replica without object storage - Remove affinity configuration for single replica - Keep filesystem storage backend	2026-01-09 21:41:51 +09:00
Mayne0213	855321bebf	FIX(prometheus): ignore relabelings default value diff in ServiceMonitor	2026-01-08 00:15:09 +09:00
Mayne0213	4286296591	PERF(resources): remove CPU limits - keep memory limits only - CPU throttling prevents app startup, not crashes - Memory OOM is the real cascading failure cause - CPU request ensures fair scheduling	2026-01-07 23:48:35 +09:00
Mayne0213	69dc3b34be	REFACTOR(secrets): flatten Vault paths - Change secret paths from <category>/<app> to <app> - monitoring/alertmanager → alertmanager - monitoring/grafana → grafana - databases/postgresql → postgresql	2026-01-06 16:52:58 +09:00
Mayne0213	7888aeff36	REFACTOR(repo): move vault/ to manifests/ - Move ExternalSecret files from vault/ to manifests/secret.yaml - Update kustomization.yaml references - Remove vault/ folders Apps: alertmanager, grafana, prometheus	2026-01-06 16:42:33 +09:00
Mayne0213	7b9abaf9c8	REFACTOR(obs): integrate ingress to helm-values - alertmanager: move ingress to karma inline, servicemonitor to manifests - goldilocks: move ingress to helm-values - grafana: move ingress to helm-values - uptime-kuma: move ingress to helm-values	2026-01-06 01:57:03 +09:00
Mayne0213	28ba50d1a3	REFACTOR(repo): observability repo structure - Add application.yaml for ArgoCD app-of-apps - Add kustomization.yaml with observability components - Add renovate.json for automated updates - Update all component argocd.yaml repoURLs to observability repo Components: prometheus, alertmanager, grafana, loki, promtail, node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa	2026-01-05 00:40:01 +09:00
Mayne0213	8dcb563ae4	REFACTOR(grafana): change Grafana storageClass - Update storageClass to local-path - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	c472035499	FEAT(grafana): add Grafana monitoring - Add Grafana monitoring configuration - Enable metrics collection	2026-01-05 00:40:01 +09:00
Mayne0213	18bb20075e	REFACTOR(grafana): remove trivy ui - use grafana dashboard instead - Delete trivy-ui ArgoCD Application - Delete trivy ingress.yaml - Update kustomization.yaml	2026-01-05 00:40:01 +09:00
Mayne0213	864c2c45d8	REFACTOR(alertmanager): change storageClass - Update storageClass to local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	7653f2c4c8	FIX(grafana): fix storageClass key name - Correct storageclass key spelling - Fix Helm values configuration	2026-01-05 00:40:01 +09:00
Mayne0213	aad4c249e2	CHORE(grafana): disable auto dashboard provision - Use manual import instead of automatic provisioning - Remove configMapGenerator for dashboards - Remove sidecar and dashboards helm config - Keep JSON files in dashboards/ for manual import reference	2026-01-05 00:40:01 +09:00
Mayne0213	ababd677d4	FEAT(repo): enable ServerSideApply - Enable ServerSideApply to handle large configmaps - Fix resource management issues	2026-01-05 00:40:01 +09:00
Mayne0213	9583be9b46	FEAT(grafana): export dashboards - to JSON and use sidecar ConfigMaps - Export 14 dashboards to JSON files - Use kustomize configMapGenerator for dashboard ConfigMaps - Enable Grafana sidecar to load dashboards from ConfigMaps - Keep Longhorn and Traefik Official from grafana.com	2026-01-05 00:40:01 +09:00
Mayne0213	b7ac39d68c	FEAT(prometheus): enable Loki ServiceMonitor - Enable Loki ServiceMonitor for Prometheus metrics collection - Add monitoring configuration	2026-01-05 00:40:01 +09:00
Mayne0213	c356493707	FEAT(alertmanager): add datasource to Grafana - Add Alertmanager datasource configuration - Enable alert visualization	2026-01-05 00:40:01 +09:00
Mayne0213	997893284b	FEAT(alertmanager): add ServiceMonitor - Create servicemonitor.yaml for Prometheus to scrape Alertmanager - alertmanager chart does not include ServiceMonitor, must be added separately - Enables Grafana Alertmanager dashboard to display data	2026-01-05 00:40:01 +09:00
Mayne0213	699b31cc67	FEAT(repo): add uptime kuma for service monitoring - Add Uptime Kuma Helm chart deployment (dirsigler/uptime-kuma v2.24.0) - Configure ingress for kuma0213.kro.kr - Enable ServiceMonitor for Prometheus integration - Use local-path storage class with 1Gi	2026-01-05 00:40:01 +09:00
Mayne0213	200c6e97ae	REFACTOR(repo): migrate repoURL to K3S-HOME - Update repository URL to K3S-HOME organization - Change from personal to organization repo	2026-01-05 00:40:01 +09:00
renovate[bot]	1e1cde4cd9	CHORE(deps): update alertmanager to v1.30.0 - Upgrade Alertmanager chart version - Apply dependency updates	2026-01-05 00:40:01 +09:00
Mayne0213	f7e11efe03	FEAT(repo): route InfoInhibitor to null - Route InfoInhibitor alerts to null - Prevent unnecessary alert notifications	2026-01-05 00:40:01 +09:00
Mayne0213	925b8f2e01	FEAT(goldilocks): add Authelia SSO - Add Authelia SSO to goldilocks, karma, trivy ingress - Enable single sign-on authentication	2026-01-05 00:40:01 +09:00
Mayne0213	939ae13c5d	CHORE(grafana): disable local auth, add SSO - Enable anonymous auth with Admin role - Disable login form - Add Authelia middleware to ingress	2026-01-05 00:40:01 +09:00
Mayne0213	e4b477a510	REFACTOR(longhorn): migrate to local-path - alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	60dfa5cf7b	CHORE(resources): disable apiserver/etcd metrics - Disable kubeApiServer ServiceMonitor (~37k series) - Disable kubeEtcd ServiceMonitor (~26k series) - Expected memory reduction: ~30-40%	2026-01-05 00:40:01 +09:00
Mayne0213	823b2ba495	REFACTOR(repo): remove global panel from Grafana - Remove global panel configuration - Clean up dashboard settings	2026-01-05 00:40:01 +09:00
Mayne0213	f6ceb50503	REFACTOR(grafana): remove dashboard 15757 - Remove Windows-specific queries dashboard - Clean up unused dashboards	2026-01-05 00:40:01 +09:00
Mayne0213	658a81b4c1	REFACTOR(repo): remove ServerSideApply - Remove ServerSideApply configuration - Add RespectIgnoreDifferences syncOption	2026-01-05 00:40:01 +09:00
Mayne0213	2c841c2b6e	FEAT(vault): add ignoreDiff for ES/SM - Add ignoreDifferences for ExternalSecret - Prevent ArgoCD sync drift	2026-01-05 00:40:01 +09:00
Mayne0213	d8360c10a1	FEAT(repo): add cAdvisor metrics_path relabel - Add relabeling for cAdvisor metrics - Support recording rules	2026-01-05 00:40:01 +09:00
Mayne0213	0617611d22	FIX(grafana): restore dashboard 15757 - Restore Kubernetes Global with CPU Real dashboard - Re-enable monitoring visualization	2026-01-05 00:40:01 +09:00
Mayne0213	1befeb68c4	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-05 00:40:01 +09:00
Mayne0213	685563b92c	REFACTOR(grafana): remove duplicated Dashboard - Remove duplicate Grafana dashboard - Clean up configuration	2026-01-05 00:40:01 +09:00
Mayne0213	cd575d94a6	PERF(prometheus): optimize prometheus memory usage - Increase scrapeInterval: 30s → 60s - Increase evaluationInterval: 30s → 60s - Reduce retention: 7d → 3d - Add memory limit: 1Gi (prevent unlimited growth) - Increase memory request: 256Mi → 512Mi (reflect actual usage)	2026-01-05 00:40:01 +09:00
Mayne0213	2ee651b98d	FEAT(node-exporter): add toleration - for all taints to node-exporter - Allows node-exporter to run on master node with NoExecute taint - Enables metrics collection from all nodes including master	2026-01-05 00:40:01 +09:00

95 Commits