24747b98cf
REFACTOR(loki,tempo): switch from MinIO to local-path storage
...
- Loki: s3 backend to filesystem with local-path PVC
- Tempo: s3 backend to local backend with local-path PVC
- Remove MinIO/S3 credentials and configuration
2026-01-09 21:42:35 +09:00
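For illustration, the backend switch in 24747b98cf might correspond to native config fragments like the ones below. This is a sketch, not the committed values: the directory paths and the schema start date are assumptions, and the Helm values that wrap these settings and bind the local-path PVC are not shown.

```yaml
# Loki (loki.yaml): filesystem object store instead of S3.
# Paths and the schema start date are assumed, not taken from the commit.
common:
  path_prefix: /var/loki
  storage:
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
---
# Tempo (tempo.yaml): local trace backend instead of S3.
# Paths are assumed.
storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
```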
94af545120
REFACTOR(thanos): remove S3 storage integration
...
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
2026-01-09 21:42:35 +09:00
ffed27419a
REFACTOR(blackbox-exporter): revert to http_2xx module
...
- Remove http_auth module workaround
- Authelia now bypasses internal cluster traffic
- All endpoints use standard http_2xx module
2026-01-09 21:42:35 +09:00
37c216c433
FIX(blackbox-exporter): handle Authelia-protected endpoints
...
- Add http_auth module accepting 401/403 status codes
- Apply http_auth to grafana, code-server, pgweb, velero-ui
- These services return 401 when accessed without authentication
2026-01-09 21:42:35 +09:00
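A module of the kind 37c216c433 describes could look roughly like this in the blackbox-exporter chart values. The timeout and the inclusion of 200 alongside 401/403 are assumptions; only the module name and the 401/403 codes come from the commit message.

```yaml
config:
  modules:
    http_auth:
      prober: http
      timeout: 5s                            # assumed value
      http:
        # Treat "authentication required" responses as a healthy endpoint.
        # 200 is included by assumption so an unprotected response also passes.
        valid_status_codes: [200, 401, 403]
```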
884a38d8ad
FEAT(blackbox-exporter): add external endpoint monitoring
...
- Add blackbox-exporter with prometheus-community Helm chart
- Configure HTTP probes for 25 external endpoints
- Include SSL certificate expiry alerting rules
- Add probe failure and slow response alerts
- Deploy 2 replicas with anti-affinity for HA
2026-01-09 21:42:35 +09:00
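The alert types listed in 884a38d8ad (probe failure, slow response, certificate expiry) can be sketched as a PrometheusRule like the one below. The rule names, thresholds, `for` durations, and severities are all hypothetical; only the metric names are standard blackbox-exporter metrics.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-exporter-alerts          # hypothetical name
spec:
  groups:
    - name: blackbox
      rules:
        - alert: ProbeFailed
          expr: probe_success == 0
          for: 5m                         # assumed
          labels:
            severity: critical            # assumed
        - alert: ProbeSlowResponse
          expr: probe_duration_seconds > 2    # assumed threshold (seconds)
          for: 10m
          labels:
            severity: warning
        - alert: SSLCertExpiringSoon
          # Fires when the earliest certificate in the chain expires
          # in less than 14 days (assumed window).
          expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
          for: 1h
          labels:
            severity: warning
```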
01c5742d7a
FIX(grafana): change OOM panel to stat type
...
- Replace timeseries with stat panel for OOM detection
- Show total count of OOMKilled pods instead of timeline
- Gauge metric not suitable for timeseries visualization
2026-01-09 21:42:35 +09:00
7cd778313a
FIX(prometheus): disable PrometheusDuplicateTimestamps alert
...
- Low severity alert that fires repeatedly in HA setup
- 0.05 samples/s drop rate is negligible
2026-01-09 21:42:35 +09:00
bb8b1c193e
FIX(alertmanager): improve OOMKilled alert detection
...
- Only fire when container restarted in last 10 minutes
- Prevent stale alerts from old OOM events
2026-01-09 21:42:35 +09:00
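One common way to express "only fire when the container restarted in the last 10 minutes" is to join the kube-state-metrics termination reason with a recent restart increase. The sketch below assumes that approach; the rule and alert names are hypothetical and the committed expression may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oomkilled-alert                   # hypothetical name
spec:
  groups:
    - name: oomkilled
      rules:
        - alert: ContainerOOMKilled       # hypothetical alert name
          # The last-terminated-reason gauge stays at 1 long after the event,
          # so it is combined with a restart seen in the last 10 minutes
          # to avoid alerting on stale OOM events.
          expr: |
            (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
            and on (namespace, pod, container)
            (increase(kube_pod_container_status_restarts_total[10m]) > 0)
          for: 0m                         # fires immediately
          labels:
            severity: warning
```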
e3c615b5c1
FEAT(alertmanager): add OOMKilled alert rule
...
- Add PrometheusRule to alert when containers are OOMKilled
- Severity: warning, fires immediately
2026-01-09 21:42:35 +09:00
539f4be497
FIX(grafana): use kube-state-metrics for OOM detection
...
- Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason
- Fix OOM events not showing after pod restart
- cAdvisor metric resets on pod restart, kube-state-metrics persists
2026-01-09 21:42:35 +09:00
14bd244b98
FIX(thanos): increase compactor memory to 256Mi
...
- Compactor was OOMKilled with 128Mi limit
- Set to 256Mi for stability during compaction
2026-01-09 21:42:35 +09:00
4a4a43ed82
FIX(prometheus): increase memory to 768Mi
...
- Prometheus was OOMKilled with 512Mi limit
- Set both requests and limits to 768Mi
2026-01-09 21:42:35 +09:00
8c2a9badf8
FIX(alertmanager): set karma memory limits equal to requests
...
- Align memory limits with requests for guaranteed QoS
2026-01-09 21:42:35 +09:00
5089e8607d
CHORE(resources): set memory limits equal to memory requests
...
Align memory limits with memory requests for guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
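As a sketch, a resources block following this convention would look like the fragment below, using the Prometheus figures mentioned elsewhere in this log (768Mi memory, 50m CPU request). Note that Kubernetes assigns the Guaranteed QoS class only when CPU requests and limits are also set and equal; with memory matched and no CPU limit, the pod is classed as Burstable, although its memory is still never overcommitted.

```yaml
resources:
  requests:
    cpu: 50m
    memory: 768Mi
  limits:
    memory: 768Mi      # equal to the memory request
    # No CPU limit is set (see 4286296591), so the pod's QoS class is Burstable.
```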
fd6c1952ad
FIX(tempo): enable env var expansion in config
...
- Add extraArgs config.expand-env=true
- Required for ${VAR} substitution in tempo.yaml
2026-01-09 21:41:52 +09:00
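Roughly, the flag lets Tempo substitute `${VAR}` references in tempo.yaml from the container environment. The sketch below assumes the `extraArgs` key sits under `tempo:` in the chart values; the bucket, endpoint, and variable names are hypothetical.

```yaml
# Helm values fragment (nesting under "tempo:" is an assumption).
tempo:
  extraArgs:
    config.expand-env: "true"     # passed to Tempo as -config.expand-env=true
---
# tempo.yaml fragment: placeholders are resolved at startup when the flag is set.
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo                       # hypothetical
      endpoint: minio.example.svc:9000    # hypothetical
      access_key: ${S3_ACCESS_KEY}        # hypothetical variable name
      secret_key: ${S3_SECRET_KEY}        # hypothetical variable name
      insecure: true
```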
5f926cb6cf
FEAT(tempo): configure S3 storage with MinIO
...
- Enable env var expansion in config
- Configure extraEnv for S3 credentials
- Fix OTel Collector image settings
2026-01-09 21:41:52 +09:00
7139f3e5a2
FIX(prometheus): correct ArgoCD metrics service names
...
- Update controller target to argocd-application-controller-metrics
- Update repo-server target to argocd-repo-server-metrics
2026-01-09 21:41:52 +09:00
034a5f32a2
CHORE(repo): remove application.yaml reference
...
- Remove from kustomization.yaml
2026-01-09 21:41:52 +09:00
87420d842d
CHORE(repo): remove self-referencing application.yaml
...
- Delete application.yaml (managed by platform)
2026-01-09 21:41:52 +09:00
445cabb900
FIX(prometheus): add ExternalSecret default values to fix OutOfSync
2026-01-09 21:41:52 +09:00
aecb15031d
FEAT(grafana): add Thanos as default datasource
...
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus
REFACTOR(thanos): move all components to master node
- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)
FIX(thanos): allow non-Bitnami images (quay.io/thanos)
FIX(thanos): correct nodeSelector value to 'true'
2026-01-09 21:41:52 +09:00
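The datasource change in aecb15031d could be provisioned with a Grafana datasource file along these lines. The datasource names and service URLs are hypothetical; only the default/backup arrangement comes from the commit message.

```yaml
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus              # Thanos Query speaks the Prometheus HTTP API
    access: proxy
    url: http://thanos-query.monitoring.svc:9090     # hypothetical URL
    isDefault: true
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090   # hypothetical URL
    isDefault: false              # kept as a backup datasource
```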
9b052b49cf
FEAT(thanos): add Thanos for Prometheus HA
...
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure OCI registry for Bitnami chart
- Fix Vault secret path and image settings
- Add nodeSelector for master node
2026-01-09 21:41:52 +09:00
ea4d7d4ecf
PERF(prometheus): reduce CPU request from 200m to 50m
...
- Actual usage is ~17m, 200m was over-provisioned
- Fixes "Insufficient cpu" scheduling error for replica 2
2026-01-09 21:41:52 +09:00
6b576d6a16
FEAT(thanos): add Thanos for Prometheus HA and long-term storage
...
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure Prometheus replicas: 2 with pod anti-affinity
- Add ExternalSecrets for MinIO credentials
- Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d
2026-01-09 21:41:52 +09:00
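The retention tiers in 6b576d6a16 map onto the Thanos compactor's per-resolution retention flags. A minimal sketch of the container arguments, assuming they are passed directly rather than through chart-specific values:

```yaml
args:
  - compact
  - --retention.resolution-raw=7d     # raw samples
  - --retention.resolution-5m=30d     # 5-minute downsampled blocks
  - --retention.resolution-1h=90d     # 1-hour downsampled blocks
```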
9f3b768cd9
FIX(loki): fix lokiCanary config path
...
- Move lokiCanary to top-level config
- Fix toleration not being applied to DaemonSet
2026-01-09 21:41:52 +09:00
a1c347e4ff
FEAT(loki): enable loki-canary with control-plane toleration
...
- Enable lokiCanary for log ingestion monitoring
- Add toleration for control-plane node
2026-01-09 21:41:52 +09:00
30f028fae4
CHORE(prometheus): disable CPU/Memory overcommit alerts
...
- Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts
- Cluster uses replica=2 with pod anti-affinity for HA
2026-01-09 21:41:52 +09:00
6da4eba1dc
CHORE(grafana): remove admin login secret for SSO
...
- Remove grafana-admin-password ExternalSecret
- Remove admin section from helm-values.yaml
- Authentication handled by Authelia SSO middleware
2026-01-09 21:41:52 +09:00
735166fc9c
REFACTOR(repo): standardize taint to control-plane
...
- Change node-role.kubernetes.io/master to control-plane
- Update vpa, goldilocks, kube-state-metrics tolerations
- Remove deprecated master taint from promtail
2026-01-09 21:41:52 +09:00
7ed4d69c51
PERF(alertmanager): add HA with 2 replicas
...
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
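"Soft" anti-affinity here means a scheduling preference rather than a hard requirement, so both replicas can still land on one node if no other node fits. A sketch, assuming the pods carry an `app.kubernetes.io/name: alertmanager` label (the label and weight are assumptions):

```yaml
replicaCount: 2
affinity:
  podAntiAffinity:
    # "preferred" makes this a soft rule: the scheduler tries to spread
    # replicas across nodes but does not block scheduling if it cannot.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: alertmanager   # assumed pod label
```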
4511fd5b2e
FIX(repo): correct nodeSelector label value
...
- Change master label value from "" to "true"
- Fix pod scheduling failure due to label mismatch
2026-01-09 21:41:52 +09:00
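A nodeSelector matches on both label key and value, so a selector of `""` does not match a node labelled `"true"`. The fix might look like the fragment below; the label key and the toleration are assumptions, since the commit message gives only the value change.

```yaml
nodeSelector:
  node-role.kubernetes.io/master: "true"    # assumed key; value was "" before the fix
tolerations:
  # Assumed taint key. Omitting "effect" tolerates every effect for this key.
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
```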
1c6a9dc491
PERF(repo): move system pods to master node
...
- Add nodeSelector for master node placement
- Add tolerations for NoExecute taint
- kube-state-metrics: schedule on master
- goldilocks-controller: schedule on master, reduce to 1 replica
- vpa-recommender: schedule on master, remove anti-affinity
- Free worker node resources for applications
2026-01-09 21:41:52 +09:00
bbdd908b27
CHORE(uptime-kuma): remove uptime-kuma application
...
- Delete uptime-kuma folder and configuration
- Using Grafana + Prometheus for monitoring instead
2026-01-09 21:41:52 +09:00
6a8e1f5a47
PERF(vpa): fix config and reduce CPU request
...
- Merge duplicate recommender sections
- Reduce CPU: 50m → 15m
- Change replicas: 2 → 1 (single recommender sufficient)
2026-01-09 21:41:52 +09:00
2cf35d0f76
FEAT(loki): configure storage and HA
...
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
2b7ee1fe51
FEAT(loki): configure storage and HA
...
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
1b7a4294dc
FIX(goldilocks): merge duplicate dashboard sections
...
- Merge dashboard.affinity into dashboard section
- Fix YAML structure to prevent OutOfSync status
2026-01-09 21:41:52 +09:00
4515ea0b33
FEAT(observability): enable HA with replica 2 and soft anti-affinity
...
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics
FIX(loki): revert to replica 1 for Single Binary mode
- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
855321bebf
FIX(prometheus): ignore relabelings default value diff in ServiceMonitor
2026-01-08 00:15:09 +09:00
4286296591
PERF(resources): remove CPU limits - keep memory limits only
...
- CPU throttling slows or blocks app startup but does not cause crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
69dc3b34be
REFACTOR(secrets): flatten Vault paths
...
- Change secret paths from <category>/<app> to <app>
- monitoring/alertmanager → alertmanager
- monitoring/grafana → grafana
- databases/postgresql → postgresql
2026-01-06 16:52:58 +09:00
7888aeff36
REFACTOR(repo): move vault/ to manifests/
...
- Move ExternalSecret files from vault/ to manifests/secret.yaml
- Update kustomization.yaml references
- Remove vault/ folders
Apps: alertmanager, grafana, prometheus
2026-01-06 16:42:33 +09:00
7b9abaf9c8
REFACTOR(obs): integrate ingress to helm-values
...
- alertmanager: move ingress to karma inline, servicemonitor to manifests
- goldilocks: move ingress to helm-values
- grafana: move ingress to helm-values
- uptime-kuma: move ingress to helm-values
2026-01-06 01:57:03 +09:00
28ba50d1a3
REFACTOR(repo): observability repo structure
...
- Add application.yaml for ArgoCD app-of-apps
- Add kustomization.yaml with observability components
- Add renovate.json for automated updates
- Update all component argocd.yaml repoURLs to observability repo
Components: prometheus, alertmanager, grafana, loki, promtail,
node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa
2026-01-05 00:40:01 +09:00
8dcb563ae4
REFACTOR(grafana): change Grafana storageClass
...
- Update storageClass to local-path
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
c472035499
FEAT(grafana): add Grafana monitoring
...
- Add Grafana monitoring configuration
- Enable metrics collection
2026-01-05 00:40:01 +09:00
18bb20075e
REFACTOR(grafana): remove trivy ui
...
- Use Grafana dashboard instead
- Delete trivy-ui ArgoCD Application
- Delete trivy ingress.yaml
- Update kustomization.yaml
2026-01-05 00:40:01 +09:00
864c2c45d8
REFACTOR(alertmanager): change storageClass
...
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
7653f2c4c8
FIX(grafana): fix storageClass key name
...
- Correct storageClass key spelling
- Fix Helm values configuration
2026-01-05 00:40:01 +09:00
aad4c249e2
CHORE(grafana): disable auto dashboard provision
...
- Use manual import instead of automatic provisioning
- Remove configMapGenerator for dashboards
- Remove sidecar and dashboards helm config
- Keep JSON files in dashboards/ for manual import reference
2026-01-05 00:40:01 +09:00