Commit Graph

63 Commits

Author SHA1 Message Date
b818a8c1fe fix: update CPU throttling panels to use PSI metrics with 10% threshold 2026-01-10 17:54:55 +09:00
2b1667e643 FIX(grafana): replace rate_interval with 5m in MinIO dashboard
- Change all $__rate_interval to 5m
- Fix No data issues in rate() queries
2026-01-10 17:50:47 +09:00
38e0c68ddb CHORE(grafana): rearrange Bucket Scans panels side by side
- Move Finished to left (x=0)
- Move Started next to Finished (x=12, same y)
2026-01-10 17:48:43 +09:00
4afdf04ef2 CHORE(grafana): remove KMS panels from MinIO dashboard
- Remove 5 KMS-related panels (KMS not configured)
- KMS Uptime, Request rates, Online/Offline status
2026-01-10 17:46:45 +09:00
20b796f9e4 FIX(grafana): fix MinIO CPU Usage panel query
- Hardcode job=minio and 5m interval
- Change unit from 's' to 'percentunit'
- Set max to 1 for proper gauge display
2026-01-10 17:33:54 +09:00
fa4c2ce8f6 FIX(grafana): set default value for MinIO dashboard variable
- Set scrape_jobs default to 'minio'
- Hide variable selector (only one option)
2026-01-10 17:32:23 +09:00
fc4f825b6d FIX(grafana): fix MinIO dashboard scrape_jobs variable
- Query only MinIO-related jobs
- Set includeAll and multi to false
2026-01-10 17:15:53 +09:00
a3003d597f PERF(observability): adjust resources based on VPA
- Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi
- Update grafana cpu 11m→23m, memory 425Mi→175Mi
- Update loki cpu 23m→63m, memory 462Mi→363Mi
- Update tempo cpu 50m→15m, memory 128Mi→100Mi
- Update thanos memory 128Mi→283Mi
- Update node-exporter memory 64Mi→100Mi
- Update kube-state-metrics memory 100Mi→105Mi
- Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi
- Update vpa memory 128Mi→100Mi
2026-01-10 14:33:40 +09:00
9e218a8adc PERF(observability): reduce replicas, add priority
- Reduce Prometheus replicas from 2 to 1
- Reduce Grafana replicas from 2 to 1
- Reduce Blackbox-exporter replicas from 2 to 1
- Move Loki, Thanos, Tempo to workers (remove tolerations)
- Add medium-priority to Prometheus, Loki, Thanos, Tempo
2026-01-10 13:15:03 +09:00
823edfbd88 fix(grafana): restrict main dashboard datasource to Thanos only
- Set regex filter "/Thanos/" on datasource variable
- Set default value to "Thanos"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:51:44 +09:00
dc8706fb02 fix(grafana): set explicit 2m interval on CPU query targets
- Global CPU Usage: set interval="2m" on Real Linux/Windows targets
- CPU Usage: set interval="2m" on Real Linux/Windows targets
- Previously empty interval caused $__rate_interval mismatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:50:44 +09:00
3516a860db fix(grafana): standardize CPU panel intervals to 2m
- Revert Overview panels to 2m (rate() needs sufficient data points)
- Change Cluster CPU Utilization targets to 2m for consistency
- All CPU panels now update at the same rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:48:21 +09:00
64e129128f fix(grafana): sync interval for CPU panels in main dashboard
- Change hardcoded "2m" interval to "$resolution" variable
- Affected panels: Global CPU Usage (id 77), CPU Usage (id 37)
- Ensures consistent refresh rate across all CPU metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:46:15 +09:00
518b5c31ef fix: update dashboards and OTel collector for proper metrics/logs
- certmanager.json: use Thanos datasource, fix variable regex
- argocd.json: use Thanos datasource via $datasource variable
- logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name)
- collector.yaml: add loki.resource.labels hint for proper Loki label mapping

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 03:36:37 +09:00
01c5742d7a FIX(grafana): change OOM panel to stat type
- Replace timeseries with stat panel for OOM detection
- Show total count of OOMKilled pods instead of timeline
- Gauge metric not suitable for timeseries visualization
2026-01-09 21:42:35 +09:00
539f4be497 FIX(grafana): use kube-state-metrics for OOM detection
- Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason
- Fix OOM events not showing after pod restart
- cAdvisor metric resets on pod restart, kube-state-metrics persists
2026-01-09 21:42:35 +09:00
aecb15031d FEAT(grafana): add Thanos as default datasource
- Add Thanos Query as default Prometheus datasource
- Keep original Prometheus datasource as backup
- Thanos provides deduplicated metrics from HA Prometheus

REFACTOR(thanos): move all components to master node

- Add tolerations for control-plane:NoSchedule
- Add nodeSelector for control-plane node
- Affects: query, storegateway, compactor
- PVC will be recreated on master node (data in S3)

FIX(thanos): allow non-Bitnami images (quay.io/thanos)

FIX(thanos): correct nodeSelector value to 'true'
2026-01-09 21:41:52 +09:00
6da4eba1dc CHORE(grafana): remove admin login secret for SSO
- Remove grafana-admin-password ExternalSecret
- Remove admin section from helm-values.yaml
- Authentication handled by Authelia SSO middleware
2026-01-09 21:41:52 +09:00
2cf35d0f76 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
2b7ee1fe51 FEAT(loki): configure storage and HA
- Rename extraVolume to avoid duplicate name
- Add emptyDir for /var/loki cache
- Migrate to shared storage with MinIO
- Configure HA with 2 replicas
- Revert to single replica for Single Binary mode
2026-01-09 21:41:52 +09:00
4286296591 PERF(resources): remove CPU limits - keep memory limits only
- CPU throttling prevents app startup, not crashes
- Memory OOM is the real cascading failure cause
- CPU request ensures fair scheduling
2026-01-07 23:48:35 +09:00
69dc3b34be REFACTOR(secrets): flatten Vault paths
- Change secret paths from <category>/<app> to <app>
- monitoring/alertmanager → alertmanager
- monitoring/grafana → grafana
- databases/postgresql → postgresql
2026-01-06 16:52:58 +09:00
7888aeff36 REFACTOR(repo): move vault/ to manifests/
- Move ExternalSecret files from vault/ to manifests/secret.yaml
- Update kustomization.yaml references
- Remove vault/ folders

Apps: alertmanager, grafana, prometheus
2026-01-06 16:42:33 +09:00
7b9abaf9c8 REFACTOR(obs): integrate ingress to helm-values
- alertmanager: move ingress to karma inline, servicemonitor to manifests
- goldilocks: move ingress to helm-values
- grafana: move ingress to helm-values
- uptime-kuma: move ingress to helm-values
2026-01-06 01:57:03 +09:00
28ba50d1a3 REFACTOR(repo): observability repo structure
- Add application.yaml for ArgoCD app-of-apps
- Add kustomization.yaml with observability components
- Add renovate.json for automated updates
- Update all component argocd.yaml repoURLs to observability repo

Components: prometheus, alertmanager, grafana, loki, promtail,
node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa
2026-01-05 00:40:01 +09:00
8dcb563ae4 REFACTOR(grafana): change Grafana storageClass
- Update storageClass to local-path
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
c472035499 FEAT(grafana): add Grafana monitoring
- Add Grafana monitoring configuration
- Enable metrics collection
2026-01-05 00:40:01 +09:00
7653f2c4c8 FIX(grafana): fix storageClass key name
- Correct storageclass key spelling
- Fix Helm values configuration
2026-01-05 00:40:01 +09:00
aad4c249e2 CHORE(grafana): disable auto dashboard provision
- Use manual import instead of automatic provisioning
- Remove configMapGenerator for dashboards
- Remove sidecar and dashboards helm config
- Keep JSON files in dashboards/ for manual import reference
2026-01-05 00:40:01 +09:00
ababd677d4 FEAT(repo): enable ServerSideApply
- Enable ServerSideApply to handle large configmaps
- Fix resource management issues
2026-01-05 00:40:01 +09:00
9583be9b46 FEAT(grafana): export dashboards
- to JSON and use sidecar ConfigMaps
- Export 14 dashboards to JSON files
- Use kustomize configMapGenerator for dashboard ConfigMaps
- Enable Grafana sidecar to load dashboards from ConfigMaps
- Keep Longhorn and Traefik Official from grafana.com
2026-01-05 00:40:01 +09:00
c356493707 FEAT(alertmanager): add datasource to Grafana
- Add Alertmanager datasource configuration
- Enable alert visualization
2026-01-05 00:40:01 +09:00
200c6e97ae REFACTOR(repo): migrate repoURL to K3S-HOME
- Update repository URL to K3S-HOME organization
- Change from personal to organization repo
2026-01-05 00:40:01 +09:00
renovate[bot]
1e1cde4cd9 CHORE(deps): update alertmanager to v1.30.0
- Upgrade Alertmanager chart version
- Apply dependency updates
2026-01-05 00:40:01 +09:00
939ae13c5d CHORE(grafana): disable local auth, add SSO
- Enable anonymous auth with Admin role
- Disable login form
- Add Authelia middleware to ingress
2026-01-05 00:40:01 +09:00
e4b477a510 REFACTOR(longhorn): migrate to local-path
- alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain
- Change storage backend configuration
2026-01-05 00:40:01 +09:00
823b2ba495 REFACTOR(repo): remove global panel from Grafana
- Remove global panel configuration
- Clean up dashboard settings
2026-01-05 00:40:01 +09:00
f6ceb50503 REFACTOR(grafana): remove dashboard 15757
- Remove Windows-specific queries dashboard
- Clean up unused dashboards
2026-01-05 00:40:01 +09:00
0617611d22 FIX(grafana): restore dashboard 15757
- Restore Kubernetes Global with CPU Real dashboard
- Re-enable monitoring visualization
2026-01-05 00:40:01 +09:00
685563b92c REFACTOR(grafana): remove duplicated Dashboard
- Remove duplicate Grafana dashboard
- Clean up configuration
2026-01-05 00:40:01 +09:00
ebc5af24ef FEAT(repo): add Grafana Global panel
- Add global panel to Grafana dashboard
- Enable overview visualization
2026-01-05 00:40:01 +09:00
d0fc55d403 FEAT(grafana): add uid to Grafana datasources
- for dashboard compatibi...
2026-01-05 00:40:01 +09:00
912b3aa38f REFACTOR(minio): remove minio dashboard
- using manually imported one
2026-01-04 23:38:05 +09:00
8e964afe42 FEAT(grafana): add grafana dashboards
- for cluster monitoring
2026-01-04 23:38:05 +09:00
b3ad6338ac FIX(prometheus): grafana prometheus datasource
- url with full namespace
2026-01-04 23:38:05 +09:00
a30dbf138f REFACTOR(traefik): switch ingress to Traefik
- Update ingressClassName from haproxy to traefik
- Remove haproxy.org annotations
2026-01-04 23:38:05 +09:00
0cb7438d79 CHORE(external-secrets): update ESO API version from v1beta1 to v1
- Update ExternalSecret API version
- Migrate to stable API
2026-01-04 23:38:05 +09:00
ea4152a0d6 REFACTOR(gitea): migrate repoURL from Gitea
- to GitHub
2026-01-04 23:38:05 +09:00
5ec1a3323d REFACTOR(goldilocks): use managedNamespaceMetad...
- Remove namespace.yaml files
- Add managedNamespaceMetadata with Goldilocks label
- Set CreateNamespace=true in syncOptions
- Update kustomization.yaml to remove namespace.yaml references
2026-01-04 23:38:05 +09:00
bbf6fa5001 CHORE(repo): clean kustomization files
- Remove unused entries from kustomization
- Clean up configuration
2026-01-04 23:38:05 +09:00