Commit Graph

32 Commits

Author SHA1 Message Date
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
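
The net effect of the request changes listed in a3003d597f can be checked with a little arithmetic. The sketch below is a minimal illustration, not part of the repository: the figures are copied from the commit message, the helper names are hypothetical, and it assumes one pod per component (node-exporter, a DaemonSet, would really count once per node).

```python
# Requests (before, after) copied from the commit message of a3003d597f.
CHANGES = {
    "blackbox-exporter": {"cpu": ("15m", "23m"), "memory": ("64Mi", "100Mi")},
    "grafana": {"cpu": ("11m", "23m"), "memory": ("425Mi", "175Mi")},
    "loki": {"cpu": ("23m", "63m"), "memory": ("462Mi", "363Mi")},
    "tempo": {"cpu": ("50m", "15m"), "memory": ("128Mi", "100Mi")},
    "thanos": {"memory": ("128Mi", "283Mi")},
    "node-exporter": {"memory": ("64Mi", "100Mi")},
    "kube-state-metrics": {"memory": ("100Mi", "105Mi")},
    "opentelemetry-operator": {"cpu": ("10m", "11m"), "memory": ("256Mi", "75Mi")},
    "vpa": {"memory": ("128Mi", "100Mi")},
}


def millicores(quantity: str) -> int:
    """Parse a CPU quantity given in millicores, e.g. '23m' -> 23."""
    if not quantity.endswith("m"):
        raise ValueError(f"expected a millicore quantity, got {quantity!r}")
    return int(quantity[:-1])


def mebibytes(quantity: str) -> int:
    """Parse a memory quantity given in Mi, e.g. '100Mi' -> 100."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"expected a Mi quantity, got {quantity!r}")
    return int(quantity[:-2])


def net_delta(changes: dict, resource: str, parse) -> int:
    """Sum (after - before) over every component that lists the resource."""
    return sum(
        parse(after) - parse(before)
        for before, after in (
            c[resource] for c in changes.values() if resource in c
        )
    )


print("cpu delta (m):", net_delta(CHANGES, "cpu", millicores))
print("memory delta (Mi):", net_delta(CHANGES, "memory", mebibytes))
```

Under the one-pod-per-component assumption, CPU requests rise slightly while memory requests fall overall, which is the kind of rebalancing VPA recommendations tend to produce.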
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replica 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
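
"Soft" anti-affinity in c3084225b7 presumably refers to the Kubernetes `preferredDuringSchedulingIgnoredDuringExecution` form of pod anti-affinity, which asks the scheduler to spread replicas across nodes without blocking scheduling when it cannot. The sketch below builds such a stanza as a Python dict. The label selector and the weight are assumptions for illustration; the actual values used in the repository are not shown in the commit log.

```python
import json


def soft_anti_affinity(match_labels: dict, weight: int = 100) -> dict:
    """Build a preferred (soft) pod anti-affinity stanza keyed on node hostname."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": dict(match_labels)},
                        # One topology domain per node, so replicas prefer
                        # to land on different nodes.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


# Hypothetical label; the chart's real pod labels may differ.
affinity = soft_anti_affinity({"app.kubernetes.io/name": "loki"})
print(json.dumps(affinity, indent=2))
```

The hard variant (`requiredDuringSchedulingIgnoredDuringExecution`) would leave the second replica pending on a cluster with too few eligible nodes, which is likely why the soft form was chosen here.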
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
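
For reference, an `emptyDir` volume with a `sizeLimit` has a small, fixed shape in the pod spec, and the kubelet evicts a pod whose usage exceeds the limit. The sketch below generates the volume entries for the four limits named in 470a08f78a and totals them. The volume names are hypothetical, and the total assumes one pod per component.

```python
# sizeLimit values copied from the commit message of 470a08f78a.
SIZE_LIMITS = {
    "loki": "2Gi",
    "tempo": "2Gi",
    "prometheus": "5Gi",
    "alertmanager": "500Mi",
}


def empty_dir_volume(name: str, size_limit: str) -> dict:
    """Return a pod-spec volume entry backed by a size-limited emptyDir."""
    return {"name": name, "emptyDir": {"sizeLimit": size_limit}}


def to_mebibytes(quantity: str) -> int:
    """Convert a 'Mi' or 'Gi' quantity to MiB."""
    for suffix, factor in (("Mi", 1), ("Gi", 1024)):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


# Volume names below are illustrative, not taken from the repository.
volumes = [empty_dir_volume(f"{app}-data", limit) for app, limit in SIZE_LIMITS.items()]
total_mib = sum(to_mebibytes(limit) for limit in SIZE_LIMITS.values())

print(volumes)
print("worst-case ephemeral storage (MiB):", total_mib)
```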
8ac76d17f3 FEAT(loki,tempo): use MinIO with emptyDir for WAL
- Loki: disable PVC, use emptyDir for /var/loki
- Tempo: switch backend from local to s3 (MinIO)
- Tempo: disable PVC, use emptyDir for /var/tempo
- Both services no longer use boot volume (/dev/sda1)
- WAL data is temporary, persistent data stored in MinIO
2026-01-09 21:42:35 +09:00
2e6b4cecbf FEAT(loki): switch storage backend to MinIO S3
- Change storage type from filesystem to s3
- Configure MinIO endpoint and bucket settings
- Add S3 credentials from minio-s3-credentials secret
- Update schema config to use s3 object_store
2026-01-09 21:42:35 +09:00
24747b98cf REFACTOR(loki,tempo): switch from MinIO to local-path storage
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests for guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
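
A note on the QoS claim in 5089e8607d: Kubernetes assigns the Guaranteed class only when every container has CPU and memory limits that equal their requests. If the CPU limits removed in 4286296591 were still absent at this point (an assumption; the log does not say), these pods would be classed as Burstable, although matching memory requests and limits still makes memory usage predictable. The sketch below is a simplified rendering of the classification rules; it compares quantities as plain strings and ignores init containers.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' resources (simplified).

    Each container is a dict with optional "requests" and "limits" dicts,
    e.g. {"requests": {"memory": "100Mi"}, "limits": {"memory": "100Mi"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults an omitted request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


memory_only = [{
    "requests": {"cpu": "23m", "memory": "100Mi"},
    "limits": {"memory": "100Mi"},
}]
print(qos_class(memory_only))
```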
9f3b768cd9 FIX(loki): fix lokiCanary config path
- Move lokiCanary to top-level config
- Fix toleration not being applied to DaemonSet
2026-01-09 21:41:52 +09:00
a1c347e4ff FEAT(loki): enable loki-canary with control-plane toleration
- Enable lokiCanary for log ingestion monitoring
- Add toleration for control-plane node
2026-01-09 21:41:52 +09:00
2cf35d0f76 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
2b7ee1fe51 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
4515ea0b33 FEAT(observability): enable HA with replica 2 and soft anti-affinity
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics

FIX(loki): revert to replica 1 for Single Binary mode

- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
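
The constraint stated in the embedded FIX message (Single Binary mode cannot run more than 1 replica without object storage) is easy to encode as a pre-merge check. The sketch below is a hypothetical guard, not something from the repository; the function name and the `storage_type` values are assumptions modelled on the `filesystem` and `s3` backends mentioned elsewhere in this log.

```python
def validate_single_binary(replicas: int, storage_type: str) -> None:
    """Reject Single Binary configs that scale out on local filesystem storage."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if replicas > 1 and storage_type == "filesystem":
        raise ValueError(
            "Single Binary mode with more than 1 replica needs object storage "
            f"(got replicas={replicas}, storage_type={storage_type!r})"
        )


validate_single_binary(1, "filesystem")  # single replica on local disk: accepted
validate_single_binary(2, "s3")          # two replicas on object storage: accepted
```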
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU throttling prevents app startup, not crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
28ba50d1a3 REFACTOR(repo): observability repo structure
- Add application.yaml for ArgoCD app-of-apps
- Add kustomization.yaml with observability components
- Add renovate.json for automated updates
- Update all component argocd.yaml repoURLs to observability repo

Components: prometheus, alertmanager, grafana, loki, promtail,
node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa
2026-01-05 00:40:01 +09:00
864c2c45d8 REFACTOR(alertmanager): change storageClass
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
b7ac39d68c FEAT(prometheus): enable Loki ServiceMonitor
- Enable Loki ServiceMonitor for Prometheus metrics collection
- Add monitoring configuration
2026-01-05 00:40:01 +09:00
200c6e97ae REFACTOR(repo): migrate repoURL to K3S-HOME
- Update repository URL to K3S-HOME organization
- Change from personal to organization repo
2026-01-05 00:40:01 +09:00
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
ea4152a0d6 REFACTOR(gitea): migrate repoURL from Gitea to GitHub
2026-01-04 23:38:05 +09:00
5ec1a3323d REFACTOR(goldilocks): use managedNamespaceMetad...
- Remove namespace.yaml files
- Add managedNamespaceMetadata with Goldilocks label
- Set CreateNamespace=true in syncOptions
- Update kustomization.yaml to remove namespace.yaml references
2026-01-04 23:38:05 +09:00
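
The change in 5ec1a3323d replaces standalone namespace.yaml files with namespace metadata managed by the ArgoCD Application itself. The sketch below shows roughly what the resulting `syncPolicy` fragment could look like, built as a Python dict. The Goldilocks label key and value are an assumption based on the label Goldilocks commonly uses to opt a namespace in; the repository's actual label is not visible in the truncated commit log.

```python
import json


def sync_policy(namespace_labels: dict) -> dict:
    """Build an ArgoCD Application syncPolicy that creates and labels the namespace."""
    return {
        "managedNamespaceMetadata": {"labels": dict(namespace_labels)},
        # ArgoCD only manages namespace metadata when it also creates
        # the namespace, so both settings go together.
        "syncOptions": ["CreateNamespace=true"],
    }


# Assumed label; check the Goldilocks documentation for the exact key.
policy = sync_policy({"goldilocks.fairwinds.com/enabled": "true"})
print(json.dumps({"syncPolicy": policy}, indent=2))
```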
bbf6fa5001 CHORE(repo): clean kustomization files
- Remove unused entries from kustomization
- Clean up configuration
2026-01-04 23:38:05 +09:00
2309254fc9 FIX(repo): circular reference in app kustomizes
- Comment out argocd.yaml in all app kustomization.yaml files
- Prevents circular reference when apps have 'path:' source (grafana,
  prometheus)
- ArgoCD Applications are managed manually, not via kustomize
2026-01-04 23:38:05 +09:00
b4ec13618a REFACTOR(repo): to independent app management pattern
- monitoring/kustomization.yaml now only manages application.yaml (App
  of Apps)
- Each app independently manages its own ArgoCD Application via
  kustomization.yaml
- Apps are fully self-contained: argocd.yaml, namespace.yaml, and
  app-specific resources
- Cleaner separation: no central app list to maintain
2026-01-04 23:38:05 +09:00
078850f77a FIX(argocd): sharedresourcewarning by referencing argocd.yaml files d...
- Change monitoring/kustomization.yaml to reference argocd.yaml files
  instead of folders
- Comment out argocd.yaml in each app's kustomization.yaml
- Matches applications folder pattern to avoid resource conflicts
2026-01-04 23:38:05 +09:00
6dec7e0a46 REFACTOR(argocd): monitoring apps to self-manage ArgoCD Applications
- Each app now includes its own argocd.yaml in kustomization.yaml
- Main monitoring/kustomization.yaml references app folders instead of
  individual argocd.yaml files
- Better separation of concerns - each app is self-contained and
  independently managed
2026-01-04 23:38:05 +09:00
5c4676ca9a REFACTOR(repo): restructure monitoring folder and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml,
  ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00
1bf40d431b REVERT(grafana): grafana to local-path storageclass
Due to storage constraints, reverting from longhorn to local-path.
Only Loki, Alertmanager, and Gitea remain on longhorn.
2026-01-04 23:38:05 +09:00
1a2f15c468 REFACTOR(longhorn): migrate monitoring PVCs from local-path to Longhorn
- Grafana: 2Gi (replica=3)
- Loki: 10Gi (replica=3)
- Alertmanager: 1Gi (replica=3)
- Prometheus: 5Gi (replica=3)
- Use dedicated 50GB Longhorn storage on each node
2026-01-04 23:38:05 +09:00
b86386a98d PERF(loki): reduce loki resource requests for worker-node-2 optimizat...
2026-01-04 23:38:05 +09:00
a11a9ab329 CHORE(argocd): update ArgoCD applications to point to monitoring repo...
2025-12-17 15:12:56 +09:00
baee94b69d INIT(repo): monitoring stack setup
2025-12-17 15:06:58 +09:00