Commit Graph

95 Commits

Author SHA1 Message Date
6b576d6a16 FEAT(thanos): add Thanos for Prometheus HA and long-term storage
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure Prometheus replicas: 2 with pod anti-affinity
- Add ExternalSecrets for MinIO credentials
- Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d
2026-01-09 21:41:52 +09:00
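The retention tiers listed in this commit line up with the three retention flags that `thanos compact` accepts. A minimal sketch of that mapping follows, assuming the values are passed as plain compactor arguments. How this repository actually wires them (most likely through Helm values) is not shown in the log, and the `RETENTION` and `compactor_retention_args` names are hypothetical.

```python
# Retention tiers copied from the commit message above.
RETENTION = {"raw": "7d", "5m": "30d", "1h": "90d"}


def compactor_retention_args(retention):
    """Build the per-resolution retention flags understood by `thanos compact`.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    return [
        f"--retention.resolution-{resolution}={period}"
        for resolution, period in retention.items()
    ]


if __name__ == "__main__":
    for arg in compactor_retention_args(RETENTION):
        print(arg)
```

Keeping the tiers in one mapping makes it easier to check that downsampled data always outlives the raw data it was derived from.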
9f3b768cd9 FIX(loki): fix lokiCanary config path
- Move lokiCanary to top-level config
- Fix toleration not being applied to DaemonSet
2026-01-09 21:41:52 +09:00
a1c347e4ff FEAT(loki): enable loki-canary with control-plane toleration
- Enable lokiCanary for log ingestion monitoring
- Add toleration for control-plane node
2026-01-09 21:41:52 +09:00
30f028fae4 CHORE(prometheus): disable CPU/Memory overcommit alerts
- Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts
- Cluster uses replica=2 with pod anti-affinity for HA
2026-01-09 21:41:52 +09:00
6da4eba1dc CHORE(grafana): remove admin login secret for SSO
- Remove grafana-admin-password ExternalSecret
- Remove admin section from helm-values.yaml
- Authentication handled by Authelia SSO middleware
2026-01-09 21:41:52 +09:00
735166fc9c REFACTOR(repo): standardize taint to control-plane
- Change node-role.kubernetes.io/master to control-plane
- Update vpa, goldilocks, kube-state-metrics tolerations
- Remove deprecated master taint from promtail
2026-01-09 21:41:52 +09:00
7ed4d69c51 PERF(alertmanager): add HA with 2 replicas
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
4511fd5b2e FIX(repo): correct nodeSelector label value
- Change master label value from "" to "true"
- Fix pod scheduling failure due to label mismatch
2026-01-09 21:41:52 +09:00
1c6a9dc491 PERF(repo): move system pods to master node
- Add nodeSelector for master node placement
- Add tolerations for NoExecute taint
- kube-state-metrics: schedule on master
- goldilocks-controller: schedule on master, reduce to 1 replica
- vpa-recommender: schedule on master, remove anti-affinity
- Free worker node resources for applications
2026-01-09 21:41:52 +09:00
bbdd908b27 CHORE(uptime-kuma): remove uptime-kuma application
- Delete uptime-kuma folder and configuration
- Using Grafana + Prometheus for monitoring instead
2026-01-09 21:41:52 +09:00
6a8e1f5a47 PERF(vpa): fix config and reduce CPU request
- Merge duplicate recommender sections
- Reduce CPU: 50m → 15m
- Change replicas: 2 → 1 (single recommender sufficient)
2026-01-09 21:41:52 +09:00
2cf35d0f76 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
2b7ee1fe51 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
1b7a4294dc FIX(goldilocks): merge duplicate dashboard sections
- Merge dashboard.affinity into dashboard section
- Fix YAML structure to prevent OutOfSync status
2026-01-09 21:41:52 +09:00
4515ea0b33 FEAT(observability): enable HA with replica 2 and soft anti-affinity
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics

FIX(loki): revert to replica 1 for Single Binary mode

- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
855321bebf FIX(prometheus): ignore relabelings default value diff in ServiceMonitor
2026-01-08 00:15:09 +09:00
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU limits cause throttling that can block app startup; they do not prevent crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
69dc3b34be REFACTOR(secrets): flatten Vault paths
- Change secret paths from <category>/<app> to <app>
- monitoring/alertmanager → alertmanager
- monitoring/grafana → grafana
- databases/postgresql → postgresql
2026-01-06 16:52:58 +09:00
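The path change in this commit amounts to dropping the category prefix and keeping only the app name. A small sketch of that rule, assuming every old path has the `<category>/<app>` shape shown above; `flatten_secret_path` is a hypothetical name, not something from the repository.

```python
def flatten_secret_path(path):
    """Return the app name from a `<category>/<app>` secret path.

    Paths that are already flat are returned unchanged.
    """
    return path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    # Old paths taken from the commit message above.
    for old in ("monitoring/alertmanager", "monitoring/grafana", "databases/postgresql"):
        print(old, "->", flatten_secret_path(old))
```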
7888aeff36 REFACTOR(repo): move vault/ to manifests/
- Move ExternalSecret files from vault/ to manifests/secret.yaml
- Update kustomization.yaml references
- Remove vault/ folders

Apps: alertmanager, grafana, prometheus
2026-01-06 16:42:33 +09:00
7b9abaf9c8 REFACTOR(obs): integrate ingress to helm-values
- alertmanager: move ingress to karma inline, servicemonitor to manifests
- goldilocks: move ingress to helm-values
- grafana: move ingress to helm-values
- uptime-kuma: move ingress to helm-values
2026-01-06 01:57:03 +09:00
28ba50d1a3 REFACTOR(repo): observability repo structure
- Add application.yaml for ArgoCD app-of-apps
- Add kustomization.yaml with observability components
- Add renovate.json for automated updates
- Update all component argocd.yaml repoURLs to observability repo

Components: prometheus, alertmanager, grafana, loki, promtail,
node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa
2026-01-05 00:40:01 +09:00
8dcb563ae4 REFACTOR(grafana): change Grafana storageClass
- Update storageClass to local-path
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
c472035499 FEAT(grafana): add Grafana monitoring
- Add Grafana monitoring configuration
- Enable metrics collection
2026-01-05 00:40:01 +09:00
18bb20075e REFACTOR(grafana): remove trivy ui
- Use Grafana dashboard instead
- Delete trivy-ui ArgoCD Application
- Delete trivy ingress.yaml
- Update kustomization.yaml
2026-01-05 00:40:01 +09:00
864c2c45d8 REFACTOR(alertmanager): change storageClass
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
7653f2c4c8 FIX(grafana): fix storageClass key name
- Correct storageclass key spelling
- Fix Helm values configuration
2026-01-05 00:40:01 +09:00
aad4c249e2 CHORE(grafana): disable auto dashboard provision
- Use manual import instead of automatic provisioning
- Remove configMapGenerator for dashboards
- Remove sidecar and dashboards helm config
- Keep JSON files in dashboards/ for manual import reference
2026-01-05 00:40:01 +09:00
ababd677d4 FEAT(repo): enable ServerSideApply
- Enable ServerSideApply to handle large configmaps
- Fix resource management issues
2026-01-05 00:40:01 +09:00
9583be9b46 FEAT(grafana): export dashboards to JSON and use sidecar ConfigMaps
- Export 14 dashboards to JSON files
- Use kustomize configMapGenerator for dashboard ConfigMaps
- Enable Grafana sidecar to load dashboards from ConfigMaps
- Keep Longhorn and Traefik Official from grafana.com
2026-01-05 00:40:01 +09:00
b7ac39d68c FEAT(prometheus): enable Loki ServiceMonitor
- Enable Loki ServiceMonitor for Prometheus metrics collection
- Add monitoring configuration
2026-01-05 00:40:01 +09:00
c356493707 FEAT(alertmanager): add datasource to Grafana
- Add Alertmanager datasource configuration
- Enable alert visualization
2026-01-05 00:40:01 +09:00
997893284b FEAT(alertmanager): add ServiceMonitor
- Create servicemonitor.yaml for Prometheus to scrape Alertmanager
- alertmanager chart does not include ServiceMonitor, must be added separately
- Enables Grafana Alertmanager dashboard to display data
2026-01-05 00:40:01 +09:00
699b31cc67 FEAT(repo): add uptime kuma for service monitoring
- Add Uptime Kuma Helm chart deployment (dirsigler/uptime-kuma v2.24.0)
- Configure ingress for kuma0213.kro.kr
- Enable ServiceMonitor for Prometheus integration
- Use local-path storage class with 1Gi
2026-01-05 00:40:01 +09:00
200c6e97ae REFACTOR(repo): migrate repoURL to K3S-HOME
- Update repository URL to K3S-HOME organization
- Change from personal to organization repo
2026-01-05 00:40:01 +09:00
renovate[bot]
1e1cde4cd9 CHORE(deps): update alertmanager to v1.30.0
- Upgrade Alertmanager chart version
- Apply dependency updates
2026-01-05 00:40:01 +09:00
f7e11efe03 FEAT(repo): route InfoInhibitor to null
- Route InfoInhibitor alerts to null
- Prevent unnecessary alert notifications
2026-01-05 00:40:01 +09:00
925b8f2e01 FEAT(goldilocks): add Authelia SSO
- Add Authelia SSO to goldilocks, karma, trivy ingress
- Enable single sign-on authentication
2026-01-05 00:40:01 +09:00
939ae13c5d CHORE(grafana): disable local auth, add SSO
- Enable anonymous auth with Admin role
- Disable login form
- Add Authelia middleware to ingress
2026-01-05 00:40:01 +09:00
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
60dfa5cf7b CHORE(resources): disable apiserver/etcd metrics
- Disable kubeApiServer ServiceMonitor (~37k series)
- Disable kubeEtcd ServiceMonitor (~26k series)
- Expected memory reduction: ~30-40%
2026-01-05 00:40:01 +09:00
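The figures in this commit can be cross-checked with simple arithmetic. The sketch below assumes Prometheus memory scales roughly linearly with the number of active series, which is an approximation and not something the commit states. Under that assumption, the expected 30-40% reduction implies a total series count before the change.

```python
# Approximate series counts quoted in the commit message above.
APISERVER_SERIES = 37_000
ETCD_SERIES = 26_000


def implied_total_series(dropped, reduction_fraction):
    """Total series count implied by dropping `dropped` series for a given
    fractional memory reduction, assuming memory is proportional to series."""
    return round(dropped / reduction_fraction)


if __name__ == "__main__":
    dropped = APISERVER_SERIES + ETCD_SERIES
    print("series dropped:", dropped)
    print("implied total at 40% reduction:", implied_total_series(dropped, 0.40))
    print("implied total at 30% reduction:", implied_total_series(dropped, 0.30))
```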
823b2ba495 REFACTOR(repo): remove global panel from Grafana
- Remove global panel configuration
- Clean up dashboard settings
2026-01-05 00:40:01 +09:00
f6ceb50503 REFACTOR(grafana): remove dashboard 15757
- Remove Windows-specific queries dashboard
- Clean up unused dashboards
2026-01-05 00:40:01 +09:00
658a81b4c1 REFACTOR(repo): remove ServerSideApply
- Remove ServerSideApply configuration
- Add RespectIgnoreDifferences syncOption
2026-01-05 00:40:01 +09:00
2c841c2b6e FEAT(vault): add ignoreDiff for ES/SM
- Add ignoreDifferences for ExternalSecret
- Prevent ArgoCD sync drift
2026-01-05 00:40:01 +09:00
d8360c10a1 FEAT(repo): add cAdvisor metrics_path relabel
- Add relabeling for cAdvisor metrics
- Support recording rules
2026-01-05 00:40:01 +09:00
0617611d22 FIX(grafana): restore dashboard 15757
- Restore Kubernetes Global with CPU Real dashboard
- Re-enable monitoring visualization
2026-01-05 00:40:01 +09:00
1befeb68c4 FEAT(prometheus): add ServerSideApply
- Enable ServerSideApply for CRD annotation handling
- Fix resource management
2026-01-05 00:40:01 +09:00
685563b92c REFACTOR(grafana): remove duplicated Dashboard
- Remove duplicate Grafana dashboard
- Clean up configuration
2026-01-05 00:40:01 +09:00
cd575d94a6 PERF(prometheus): optimize prometheus memory usage
- Increase scrapeInterval: 30s → 60s
- Increase evaluationInterval: 30s → 60s
- Reduce retention: 7d → 3d
- Add memory limit: 1Gi (prevent unlimited growth)
- Increase memory request: 256Mi → 512Mi (reflect actual usage)
2026-01-05 00:40:01 +09:00
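To see how the interval and retention changes in this commit combine, the sketch below counts the samples retained per series before and after. It is a rough illustration only: it counts samples, not bytes, and actual Prometheus memory use also depends on series count, label cardinality, and query load. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def samples_retained_per_series(scrape_interval_s, retention_days):
    """Number of samples kept for one series over the retention window."""
    return (SECONDS_PER_DAY // scrape_interval_s) * retention_days


if __name__ == "__main__":
    before = samples_retained_per_series(30, 7)  # 30s scrape, 7d retention
    after = samples_retained_per_series(60, 3)   # 60s scrape, 3d retention
    print("before:", before)
    print("after:", after)
    print("ratio:", after / before)
```

Doubling the scrape interval halves the ingestion rate per series, and the shorter retention reduces what is kept on disk further.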
2ee651b98d FEAT(node-exporter): add toleration for all taints to node-exporter
- Allows node-exporter to run on master node with NoExecute taint
- Enables metrics collection from all nodes including master
2026-01-05 00:40:01 +09:00