62 Commits

Author SHA1 Message Date
b145881fa2 PERF(prometheus): increase memory limit to 1Gi
- Increase memory request from 768Mi to 1Gi
- Increase memory limit from 768Mi to 1Gi
- Prevents OOM at 97% memory usage
2026-01-12 03:16:40 +09:00
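
A minimal sketch of what this change might look like as Helm values. It assumes the kube-prometheus-stack values layout (`prometheus.prometheusSpec.resources`); the commit does not name the chart, so the key path is an assumption.

```yaml
# Assumed kube-prometheus-stack values layout; adjust to the chart actually in use.
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 1Gi   # raised from 768Mi
      limits:
        memory: 1Gi   # raised from 768Mi, kept equal to the request
```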
7e61af372b PERF(observability): remove CPU limits for stability
- Remove CPU limits from all observability components
- Prevents CPU throttling issues across monitoring stack
2026-01-12 02:10:54 +09:00
3b5bf20902 PERF(observability): optimize resources via VPA
- alertmanager: CPU 15m/15m, memory 100Mi/100Mi
- blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi
- goldilocks: controller 15m/25m, dashboard 15m/15m
- grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound)
- kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi
- loki: CPU 10m/69m, memory 225Mi/323Mi
- node-exporter: CPU 15m/15m, memory 100Mi/100Mi
- opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi
- prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi
- tempo: CPU 15m/15m, memory 100Mi/109Mi
- thanos: CPU 15m/15m, memory 100Mi/126Mi
- vpa: CPU 15m/15m, memory 100Mi/100Mi
2026-01-12 01:07:58 +09:00
c1214029a2 refactor: update Vault secret paths to new categorized structure
- alertmanager: alertmanager → observability/alertmanager
- grafana: postgresql → storage/postgresql
- prometheus: postgresql → storage/postgresql, minio → storage/minio
- thanos: minio → storage/minio

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 22:36:22 +09:00
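
For illustration, an ExternalSecret pointing at one of the new categorized paths could look roughly like this. The resource name, store name, and property names are hypothetical; only the `observability/alertmanager` path comes from the commit message.

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: alertmanager-secret          # hypothetical name
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend              # hypothetical store name
  target:
    name: alertmanager-secret        # Kubernetes Secret to create
  data:
    - secretKey: slack-webhook-url   # hypothetical key in the resulting Secret
      remoteRef:
        key: observability/alertmanager   # new categorized Vault path
        property: slack-webhook-url       # hypothetical property
```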
15d5e58d6c migrate: change repoURLs from GitHub to Gitea
Update all ArgoCD Application references to use Gitea (github0213.com)
instead of GitHub for K3S-HOME/observability repository.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:43:29 +09:00
c3084225b7 PERF(observability): add HA for Loki and Tempo
- Loki: replicas 1→2 with soft anti-affinity
- Tempo: replicas 1→2 with soft anti-affinity
- Thanos/Prometheus: keep replica 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 13:46:02 +09:00
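
"Soft" anti-affinity usually means the preferred (not required) form of pod anti-affinity, so the scheduler tries to spread replicas across nodes but still schedules them if it cannot. A minimal sketch of such a pod template fragment, with an illustrative label selector that is not taken from the repository:

```yaml
# Hypothetical pod template fragment; the label value is illustrative.
affinity:
  podAntiAffinity:
    # "preferred" makes this a soft rule: best effort, never blocks scheduling
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: loki
```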
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
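
A sketch of how a `medium-priority` class might be defined and referenced. Only the class name comes from the commit message; the numeric value and description are assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 1000            # hypothetical value; higher numbers are scheduled first
globalDefault: false
description: "Observability workloads (illustrative description)"
---
# Pod template fragment referencing the class
spec:
  priorityClassName: medium-priority
```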
c34f56945a feat(prometheus): enable container CPU throttling metrics collection
- Override default cAdvisorMetricRelabelings
- Remove cfs_throttled_seconds_total from drop regex
- Enables CPU Throttled panels in Grafana dashboards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:55:36 +09:00
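
A possible shape for this override, assuming the kube-prometheus-stack values key `kubelet.serviceMonitor.cAdvisorMetricRelabelings`. The remaining metric names in the regex are recalled from the chart defaults and should be checked against the chart version in use. Overriding the list replaces all default rules, so any other default drop rules that should remain would need to be copied in as well.

```yaml
# Assumed kube-prometheus-stack values layout.
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      # Same drop rule as the default, minus cfs_throttled_seconds_total,
      # so container_cpu_cfs_throttled_seconds_total is kept.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)'
```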
9e87e6fbcb REVERT(otel): remove metrics collection, keep logs/traces only
- Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors
- OTel Collector only handles logs (filelog) and traces (otlp)
- Remove Target Allocator and metrics-related config
- This reduces complexity and resource usage for home cluster
2026-01-10 01:18:35 +09:00
a506ca3f58 FIX(prometheus): reduce replicas to 1 due to resource constraints
- Cluster has insufficient memory to schedule 2 Prometheus replicas
- Thanos sidecar still provides HA query capability
2026-01-10 01:18:26 +09:00
5bc0caa324 FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled
- Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi)
- Increased memory requests and limits from 768Mi to 1536Mi
2026-01-10 01:18:11 +09:00
12ee5b61c0 FIX(prometheus): enable out-of-order time window
- Set outOfOrderTimeWindow to 5m for TSDB
- Allow slightly out-of-order samples from distributed collectors
- Prevents data loss from timing differences
2026-01-09 23:43:01 +09:00
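
As a sketch, the setting could be expressed like this in Helm values, assuming the kube-prometheus-stack layout that passes `tsdb` through to the Prometheus custom resource:

```yaml
# Assumed kube-prometheus-stack values layout.
prometheus:
  prometheusSpec:
    tsdb:
      outOfOrderTimeWindow: 5m   # accept samples up to 5 minutes older than the newest sample
```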
a3c5a8dbcf CHORE(prometheus): disable direct scraping
- Disable ServiceMonitor/PodMonitor scraping in Prometheus
- OTel Collector now handles all metrics collection
- Prevents out-of-order sample errors from duplicate scraping
2026-01-09 23:39:30 +09:00
02faf93555 FEAT(otel): add OTel Collector for logs and traces
- Add OpenTelemetry Operator for CR management
- Deploy OTel Collector as DaemonSet via CR
- Enable filelog receiver for container log collection
- Replace Promtail with OTel filelog receiver
- Keep Prometheus for ServiceMonitor-based metrics scraping
2026-01-09 23:23:51 +09:00
470a08f78a CHORE(repo): switch to emptyDir with sizeLimit
- Add sizeLimit 2Gi to loki emptyDir
- Add sizeLimit 2Gi to tempo emptyDir
- Change prometheus from PVC to emptyDir 5Gi
- Change alertmanager from PVC to emptyDir 500Mi
2026-01-09 21:42:35 +09:00
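
A minimal example of an `emptyDir` volume with a size limit, as a pod spec fragment. The volume name is hypothetical; how each chart exposes this setting differs.

```yaml
# Hypothetical pod spec fragment.
volumes:
  - name: storage
    emptyDir:
      sizeLimit: 2Gi   # the kubelet evicts the pod if usage exceeds this
```

Note that `emptyDir` data is deleted when the pod is removed from its node, so this trades persistence for simpler storage management.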
94af545120 REFACTOR(thanos): remove S3 storage integration
- Disable Store Gateway and Compactor
- Remove Sidecar objectStorageConfig
- Keep Thanos Query + Sidecar for HA query
- 3-day local retention is sufficient
2026-01-09 21:42:35 +09:00
7cd778313a FIX(prometheus): disable PrometheusDuplicateTimestamps alert
- Low severity alert that fires repeatedly in HA setup
- 0.05 samples/s drop rate is negligible
2026-01-09 21:42:35 +09:00
4a4a43ed82 FIX(prometheus): increase memory to 768Mi
- Prometheus was OOMKilled with 512Mi limit
- Set both requests and limits to 768Mi
2026-01-09 21:42:35 +09:00
5089e8607d CHORE(resources): set memory limits equal to memory requests
Align memory limits with memory requests so memory is never overcommitted. Full Guaranteed QoS would also need CPU limits equal to CPU requests.
- prometheus, thanos (query, storegateway, compactor)
- alertmanager, tempo, goldilocks (dashboard, controller)
- node-exporter, opentelemetry-collector, vpa, kube-state-metrics
2026-01-09 21:42:35 +09:00
7139f3e5a2 FIX(prometheus): correct ArgoCD metrics service names
- Update controller target to argocd-application-controller-metrics
- Update repo-server target to argocd-repo-server-metrics
2026-01-09 21:41:52 +09:00
445cabb900 FIX(prometheus): add ExternalSecret default values to fix OutOfSync
2026-01-09 21:41:52 +09:00
9b052b49cf FEAT(thanos): add Thanos for Prometheus HA
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure OCI registry for Bitnami chart
- Fix Vault secret path and image settings
- Add nodeSelector for master node
2026-01-09 21:41:52 +09:00
ea4d7d4ecf PERF(prometheus): reduce CPU request from 200m to 50m
- Actual usage is ~17m, 200m was over-provisioned
- Fixes "Insufficient cpu" scheduling error for replica 2
2026-01-09 21:41:52 +09:00
6b576d6a16 FEAT(thanos): add Thanos for Prometheus HA and long-term storage
- Add Thanos Query, Store Gateway, Compactor
- Enable Prometheus Sidecar with S3 (MinIO) storage
- Configure Prometheus replicas: 2 with pod anti-affinity
- Add ExternalSecrets for MinIO credentials
- Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d
2026-01-09 21:41:52 +09:00
30f028fae4 CHORE(prometheus): disable CPU/Memory overcommit alerts
- Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts
- Cluster uses replica=2 with pod anti-affinity for HA
2026-01-09 21:41:52 +09:00
855321bebf FIX(prometheus): ignore relabelings default value diff in ServiceMonitor
2026-01-08 00:15:09 +09:00
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU throttling slows app startup; it does not cause crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
69dc3b34be REFACTOR(secrets): flatten Vault paths
- Change secret paths from <category>/<app> to <app>
- monitoring/alertmanager → alertmanager
- monitoring/grafana → grafana
- databases/postgresql → postgresql
2026-01-06 16:52:58 +09:00
7888aeff36 REFACTOR(repo): move vault/ to manifests/
- Move ExternalSecret files from vault/ to manifests/secret.yaml
- Update kustomization.yaml references
- Remove vault/ folders

Apps: alertmanager, grafana, prometheus
2026-01-06 16:42:33 +09:00
28ba50d1a3 REFACTOR(repo): observability repo structure
- Add application.yaml for ArgoCD app-of-apps
- Add kustomization.yaml with observability components
- Add renovate.json for automated updates
- Update all component argocd.yaml repoURLs to observability repo

Components: prometheus, alertmanager, grafana, loki, promtail,
node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa
2026-01-05 00:40:01 +09:00
864c2c45d8 REFACTOR(alertmanager): change storageClass
- Update storageClass to local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
200c6e97ae REFACTOR(repo): migrate repoURL to K3S-HOME
- Update repository URL to K3S-HOME organization
- Change from personal to organization repo
2026-01-05 00:40:01 +09:00
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
60dfa5cf7b CHORE(resources): disable apiserver/etcd metrics
- Disable kubeApiServer ServiceMonitor (~37k series)
- Disable kubeEtcd ServiceMonitor (~26k series)
- Expected memory reduction: ~30-40%
2026-01-05 00:40:01 +09:00
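
The `kubeApiServer` and `kubeEtcd` names match top-level keys in the kube-prometheus-stack chart, so the change is probably along these lines (the chart is an assumption, since the commit does not name it):

```yaml
# Assumed kube-prometheus-stack values layout.
kubeApiServer:
  enabled: false   # drops the API server ServiceMonitor (~37k series)
kubeEtcd:
  enabled: false   # drops the etcd ServiceMonitor (~26k series)
```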
658a81b4c1 REFACTOR(repo): remove ServerSideApply
- Remove ServerSideApply configuration
- Add RespectIgnoreDifferences syncOption
2026-01-05 00:40:01 +09:00
2c841c2b6e FEAT(vault): add ignoreDiff for ES/SM
- Add ignoreDifferences for ExternalSecret
- Prevent ArgoCD sync drift
2026-01-05 00:40:01 +09:00
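
A sketch of an ArgoCD Application fragment that ignores server-defaulted ExternalSecret fields. The specific paths are assumptions chosen for illustration; the commit does not say which fields were ignored.

```yaml
# Fragment of an ArgoCD Application spec; the paths are illustrative.
spec:
  ignoreDifferences:
    - group: external-secrets.io
      kind: ExternalSecret
      jqPathExpressions:
        - .spec.data[].remoteRef.conversionStrategy
        - .spec.data[].remoteRef.decodingStrategy
        - .spec.data[].remoteRef.metadataPolicy
```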
d8360c10a1 FEAT(repo): add cAdvisor metrics_path relabel
- Add relabeling for cAdvisor metrics
- Support recording rules
2026-01-05 00:40:01 +09:00
1befeb68c4 FEAT(prometheus): add ServerSideApply
- Enable ServerSideApply for CRD annotation handling
- Fix resource management
2026-01-05 00:40:01 +09:00
cd575d94a6 PERF(prometheus): optimize prometheus memory usage
- Increase scrapeInterval: 30s → 60s
- Increase evaluationInterval: 30s → 60s
- Reduce retention: 7d → 3d
- Add memory limit: 1Gi (prevent unlimited growth)
- Increase memory request: 256Mi → 512Mi (reflect actual usage)
2026-01-05 00:40:01 +09:00
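
Expressed as Helm values, these settings might look like the following. This is a sketch assuming the kube-prometheus-stack layout; the values themselves are taken from the commit message.

```yaml
# Assumed kube-prometheus-stack values layout.
prometheus:
  prometheusSpec:
    scrapeInterval: 60s       # was 30s
    evaluationInterval: 60s   # was 30s
    retention: 3d             # was 7d
    resources:
      requests:
        memory: 512Mi         # was 256Mi
      limits:
        memory: 1Gi           # newly added upper bound
```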
2ec87ca7a5 PERF(prometheus): increase Prometheus CPU request from 50m to 200m
- Increase CPU request based on actual usage
- Optimize resource allocation
2026-01-05 00:40:01 +09:00
b3ad6338ac FIX(prometheus): grafana prometheus datasource
- url with full namespace
2026-01-04 23:38:05 +09:00
340c6fea11 FIX(alertmanager): prometheus alertingendpoints
- to connect to alertma...
2026-01-04 23:38:05 +09:00
bc1cf0d223 REFACTOR(argocd): remove serversideapply
- from argocd applications
2026-01-04 23:38:05 +09:00
79b34aaca6 FEAT(prometheus): add ServerSideApply
- Enable ServerSideApply for CRD annotation handling
- Fix resource management
2026-01-04 23:38:05 +09:00
0cb7438d79 CHORE(external-secrets): update ESO API version from v1beta1 to v1
- Update ExternalSecret API version
- Migrate to stable API
2026-01-04 23:38:05 +09:00
c75798065f CHORE(postgresql): update PostgreSQL namespace reference
- Update namespace reference for PostgreSQL
- Fix service discovery
2026-01-04 23:38:05 +09:00
ea4152a0d6 REFACTOR(gitea): migrate repoURL from Gitea
- to GitHub
2026-01-04 23:38:05 +09:00
5ec1a3323d REFACTOR(goldilocks): use managedNamespaceMetad...
- Remove namespace.yaml files
- Add managedNamespaceMetadata with Goldilocks label
- Set CreateNamespace=true in syncOptions
- Update kustomization.yaml to remove namespace.yaml references
2026-01-04 23:38:05 +09:00
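
A sketch of the ArgoCD Application fragment this describes. The label shown is the one Goldilocks commonly uses to opt a namespace in; treat it as an assumption, since the commit message does not spell it out.

```yaml
# Fragment of an ArgoCD Application spec.
spec:
  syncPolicy:
    managedNamespaceMetadata:
      labels:
        goldilocks.fairwinds.com/enabled: "true"   # assumed Goldilocks opt-in label
    syncOptions:
      - CreateNamespace=true   # required for managedNamespaceMetadata to apply
```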
ac2abde8b5 FIX(prometheus): servicemonitor namespace
- from monitoring to prometheus
2026-01-04 23:38:05 +09:00
bbf6fa5001 CHORE(repo): clean kustomization files
- Remove unused entries from kustomization
- Clean up configuration
2026-01-04 23:38:05 +09:00