395c79ad9e
PERF(alertmanager): reduce karma replicas to 1
...
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:36:14 +09:00
67db06cf8b
PERF(observability): reduce replicas to 1
...
- Reduce alertmanager replicas to 1
- Reduce karma replicas to 1
- Reduce goldilocks dashboard replicas to 1
2026-01-10 13:31:39 +09:00
ad9573e998
FIX(alertmanager): remove duplicate volume config
...
- Remove extraVolumes and extraVolumeMounts
- Chart uses emptyDir automatically when persistence disabled
2026-01-09 21:42:35 +09:00
470a08f78a
CHORE(repo): switch to emptyDir with sizeLimit
...
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
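For reference, an emptyDir with a sizeLimit looks roughly like the fragment below in plain Kubernetes terms. This is a sketch only: the volume name, container name, and mount path are illustrative assumptions, and where the setting lives in each chart's values depends on the chart.

```yaml
# Pod spec fragment (hypothetical names); not taken from the repo.
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 500Mi        # the kubelet evicts the pod if usage exceeds this
containers:
  - name: alertmanager
    volumeMounts:
      - name: storage
        mountPath: /alertmanager   # assumed data directory
```

Unlike a PVC, the contents of an emptyDir are deleted when the pod is removed from the node.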
bb8b1c193e
FIX(alertmanager): improve OOMKilled alert detection
...
- Only fire when container restarted in last 10 minutes
- Prevent stale alerts from old OOM events
2026-01-09 21:42:35 +09:00
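One way such a rule could be expressed is sketched below. It assumes kube-state-metrics is being scraped and uses its `kube_pod_container_status_restarts_total` and `kube_pod_container_status_last_terminated_reason` metrics; the resource name, namespace, group name, and alert name are hypothetical, and the actual expression in the repo may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oomkilled-alert          # hypothetical name
  namespace: alertmanager        # assumed namespace
spec:
  groups:
    - name: oomkilled
      rules:
        - alert: ContainerOOMKilled
          # Fire only if the container restarted in the last 10 minutes
          # AND its last termination reason was OOMKilled.
          expr: |
            (increase(kube_pod_container_status_restarts_total[10m]) > 0)
            and ignoring (reason)
            (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) was OOMKilled"
```

The restart condition matters because the last-terminated-reason metric keeps reporting OOMKilled until the container terminates for another reason, which is what produces stale alerts.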
e3c615b5c1
FEAT(alertmanager): add OOMKilled alert rule
...
- Add PrometheusRule to alert when containers are OOMKilled
- Severity: warning, fires immediately
2026-01-09 21:42:35 +09:00
8c2a9badf8
FIX(alertmanager): set karma memory limits equal to requests
...
- Align memory limits with requests for guaranteed QoS
2026-01-09 21:42:35 +09:00
5089e8607d
CHORE(resources): set memory limits equal to memory requests
...
Align memory limits with memory requests for guaranteed QoS class.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
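A container resources block following this policy, combined with the earlier removal of CPU limits, might look like the sketch below. The values are illustrative assumptions, not the ones used in the repo.

```yaml
# Container-level fragment; numbers are placeholders.
resources:
  requests:
    cpu: 50m          # CPU request only, for fair scheduling
    memory: 256Mi
  limits:
    memory: 256Mi     # equal to the request, so memory is never overcommitted
    # no cpu limit, so the container is not throttled
```

Note that Kubernetes assigns the Guaranteed QoS class only when every container has both CPU and memory requests equal to their limits. With a CPU request and no CPU limit, as above, the pod is classed as Burstable, although its memory reservation still matches its limit.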
7ed4d69c51
PERF(alertmanager): add HA with 2 replicas
...
- Increase replicaCount from 1 to 2
- Add soft pod anti-affinity to spread across nodes
- Improve availability during node failures
2026-01-09 21:41:52 +09:00
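Soft anti-affinity in Kubernetes is expressed with a preferred (not required) rule, roughly as sketched below. The label selector is an assumption about how the pods are labeled, and how the chart exposes the `affinity` block in its values may differ.

```yaml
# Pod spec fragment; the label is an assumed value.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: alertmanager
```

Because the rule is only a preference, the scheduler still places both replicas on one node when no other node has room.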
4515ea0b33
FEAT(observability): enable HA with replica 2 and soft anti-affinity
...
- Add replicaCount: 2 to goldilocks, vpa, alertmanager
- Add replicas: 2 to loki singleBinary
- Add soft pod anti-affinity for node distribution
- Keep kube-state-metrics at replica 1 to prevent duplicate metrics
FIX(loki): revert to replica 1 for Single Binary mode
- Single Binary mode cannot run more than 1 replica without object storage
- Remove affinity configuration for single replica
- Keep filesystem storage backend
2026-01-09 21:41:51 +09:00
4286296591
PERF(resources): remove CPU limits - keep memory limits only
...
- CPU throttling prevents app startup, not crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
69dc3b34be
REFACTOR(secrets): flatten Vault paths
...
- Change secret paths from <category>/<app> to <app>
- monitoring/alertmanager → alertmanager
- monitoring/grafana → grafana
- databases/postgresql → postgresql
2026-01-06 16:52:58 +09:00
7888aeff36
REFACTOR(repo): move vault/ to manifests/
...
- Move ExternalSecret files from vault/ to manifests/secret.yaml
- Update kustomization.yaml references
- Remove vault/ folders
Apps: alertmanager, grafana, prometheus
2026-01-06 16:42:33 +09:00
7b9abaf9c8
REFACTOR(obs): integrate ingress to helm-values
...
- alertmanager: move ingress to karma inline, servicemonitor to manifests
- goldilocks: move ingress to helm-values
- grafana: move ingress to helm-values
- uptime-kuma: move ingress to helm-values
2026-01-06 01:57:03 +09:00
28ba50d1a3
REFACTOR(repo): observability repo structure
...
- Add application.yaml for ArgoCD app-of-apps
- Add kustomization.yaml with observability components
- Add renovate.json for automated updates
- Update all component argocd.yaml repoURLs to observability repo
Components: prometheus, alertmanager, grafana, loki, promtail,
node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa
2026-01-05 00:40:01 +09:00
864c2c45d8
REFACTOR(alertmanager): change storageClass
...
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
997893284b
FEAT(alertmanager): add ServiceMonitor
...
- Create servicemonitor.yaml for Prometheus to scrape Alertmanager
- alertmanager chart does not include ServiceMonitor, must be added separately
- Enables Grafana Alertmanager dashboard to display data
2026-01-05 00:40:01 +09:00
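A minimal ServiceMonitor of this kind could look like the sketch below. The namespace, selector label, and port name are assumptions and must match the Service the chart actually creates.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: alertmanager
  namespace: alertmanager        # assumed namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed Service label
  endpoints:
    - port: http                 # assumed Service port name
      path: /metrics
      interval: 30s
```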
200c6e97ae
REFACTOR(repo): migrate repoURL to K3S-HOME
...
- Update repository URL to K3S-HOME organization
- Change from personal to organization repo
2026-01-05 00:40:01 +09:00
renovate[bot]
1e1cde4cd9
CHORE(deps): update alertmanager to v1.30.0
...
- Upgrade Alertmanager chart version
- Apply dependency updates
2026-01-05 00:40:01 +09:00
f7e11efe03
FEAT(repo): route InfoInhibitor to null
...
- Route InfoInhibitor alerts to null
- Prevent unnecessary alert notifications
2026-01-05 00:40:01 +09:00
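In Alertmanager configuration, routing an alert to a receiver with no notification configs drops it silently. A fragment along these lines would do that for InfoInhibitor; the `email` receiver and its address are placeholders, and the global SMTP settings are omitted.

```yaml
# alertmanager.yml fragment; receiver names other than "null" are hypothetical.
route:
  receiver: email
  routes:
    - receiver: "null"
      matchers:
        - alertname = "InfoInhibitor"
receivers:
  - name: "null"      # no *_configs, so matching alerts go nowhere
  - name: email
    email_configs:
      - to: oncall@example.com
```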
925b8f2e01
FEAT(goldilocks): add Authelia SSO
...
- Add Authelia SSO to goldilocks, karma, trivy ingress
- Enable single sign-on authentication
2026-01-05 00:40:01 +09:00
e4b477a510
REFACTOR(longhorn): migrate to local-path
...
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
2c8095a1db
FIX(alertmanager): alertmanager SMTP auth by loading config from secret
...
- Add ExternalSecret to generate alertmanager.yml with SMTP password from Vault
- Disable helm chart config (ConfigMap) and use extraSecretMounts instead
- Fixes "535 5.7.8 Error: authentication failed" SMTP error
2026-01-05 00:40:01 +09:00
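The approach can be sketched with an External Secrets Operator template that renders the whole alertmanager.yml into a Secret. Everything specific here is an assumption: the store name, the Vault key and property, the SMTP host and addresses, and the `apiVersion`, which depends on the operator version installed.

```yaml
apiVersion: external-secrets.io/v1beta1   # newer operator releases use v1
kind: ExternalSecret
metadata:
  name: alertmanager-config
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend              # hypothetical store name
  target:
    name: alertmanager-config        # Secret to be mounted into the pod
    template:
      data:
        alertmanager.yml: |
          global:
            smtp_smarthost: smtp.example.com:587
            smtp_from: alerts@example.com
            smtp_auth_username: alerts@example.com
            smtp_auth_password: "{{ .smtp_password }}"
          route:
            receiver: email
          receivers:
            - name: email
              email_configs:
                - to: oncall@example.com
  data:
    - secretKey: smtp_password
      remoteRef:
        key: alertmanager            # hypothetical Vault path
        property: smtp_password      # hypothetical property
```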
0ce1f99fb4
CHORE(goldilocks): disable goldilocks and cancel trivy installation
...
- Comment out goldilocks/argocd.yaml from kustomization
- Comment out trivy/argocd.yaml from kustomization
- Disable autoSync in both applications
- Server overload mitigation
2026-01-05 00:40:01 +09:00
5002d352fb
FEAT(alertmanager): add Karma to Alertmanager
...
- Add Karma dashboard for alert aggregation
- Enable alert visualization
2026-01-04 23:38:05 +09:00
ea4152a0d6
REFACTOR(gitea): migrate repoURL from Gitea to GitHub
2026-01-04 23:38:05 +09:00
5ec1a3323d
REFACTOR(goldilocks): use managedNamespaceMetad...
...
- Remove namespace.yaml files
- Add managedNamespaceMetadata with Goldilocks label
- Set CreateNamespace=true in syncOptions
- Update kustomization.yaml to remove namespace.yaml references
2026-01-04 23:38:05 +09:00
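With this pattern, Argo CD creates the namespace itself and applies the labels declared on the Application, so a separate namespace.yaml is unnecessary. A sketch, assuming placeholder values for the repository URL, path, and names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: goldilocks
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/repo.git   # placeholder
    path: goldilocks                            # placeholder
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: goldilocks
  syncPolicy:
    managedNamespaceMetadata:
      labels:
        goldilocks.fairwinds.com/enabled: "true"
    syncOptions:
      - CreateNamespace=true   # managedNamespaceMetadata only applies with this option
```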
ac2abde8b5
FIX(prometheus): servicemonitor namespace from monitoring to prometheus
2026-01-04 23:38:05 +09:00
bbf6fa5001
CHORE(repo): clean kustomization files
...
- Remove unused entries from kustomization
- Clean up configuration
2026-01-04 23:38:05 +09:00
2309254fc9
FIX(repo): circular reference in app kustomizes
...
- Comment out argocd.yaml in all app kustomization.yaml files
- Prevents circular reference when apps have 'path:' source (grafana, prometheus)
- ArgoCD Applications are managed manually, not via kustomize
2026-01-04 23:38:05 +09:00
b4ec13618a
REFACTOR(repo): to independent app management pattern
...
- monitoring/kustomization.yaml now only manages application.yaml (App of Apps)
- Each app independently manages its own ArgoCD Application via kustomization.yaml
- Apps are fully self-contained: argocd.yaml, namespace.yaml, and app-specific resources
- Cleaner separation: no central app list to maintain
2026-01-04 23:38:05 +09:00
078850f77a
FIX(argocd): SharedResourceWarning by referencing
...
- argocd.yaml files d...
- Change monitoring/kustomization.yaml to reference argocd.yaml files instead of folders
- Comment out argocd.yaml in each app's kustomization.yaml
- Matches applications folder pattern to avoid resource conflicts
2026-01-04 23:38:05 +09:00
6dec7e0a46
REFACTOR(argocd): monitoring apps to self-manage ArgoCD Applications
...
- Each app now includes its own argocd.yaml in kustomization.yaml
- Main monitoring/kustomization.yaml references app folders instead of individual argocd.yaml files
- Better separation of concerns: each app is self-contained and independently managed
2026-01-04 23:38:05 +09:00
5c4676ca9a
REFACTOR(repo): restructure monitoring folder
...
- and add namespace resou...
- Remove argocd/, helm-values/, ingress/ subdirectories
- Move files to parent directory (argocd.yaml, helm-values.yaml, ingress.yaml)
- Update helm valueFiles paths in ArgoCD Applications
- Add namespace.yaml to all applications with Goldilocks labels
- Update destination namespaces to match folder names
- Update kustomization.yaml files to reference new structure
2026-01-04 23:38:05 +09:00
1a2f15c468
REFACTOR(longhorn): migrate monitoring PVCs from local-path to Longhorn
...
- Grafana: 2Gi (replica=3)
- Loki: 10Gi (replica=3)
- Alertmanager: 1Gi (replica=3)
- Prometheus: 5Gi (replica=3)
- Use dedicated 50GB Longhorn storage on each node
2026-01-04 23:38:05 +09:00
a11a9ab329
CHORE(argocd): update ArgoCD applications
...
- to point to monitoring repo...
2025-12-17 15:12:56 +09:00
baee94b69d
INIT(repo): monitoring stack setup
2025-12-17 15:06:58 +09:00