observability

Author	SHA1	Message	Date
Mayne0213	4afdf04ef2	CHORE(grafana): remove KMS panels from MinIO dashboard - Remove 5 KMS-related panels (KMS not configured) - KMS Uptime, Request rates, Online/Offline status	2026-01-10 17:46:45 +09:00
Mayne0213	20b796f9e4	FIX(grafana): fix MinIO CPU Usage panel query - Hardcode job=minio and 5m interval - Change unit from 's' to 'percentunit' - Set max to 1 for proper gauge display	2026-01-10 17:33:54 +09:00
Mayne0213	fa4c2ce8f6	FIX(grafana): set default value for MinIO dashboard variable - Set scrape_jobs default to 'minio' - Hide variable selector (only one option)	2026-01-10 17:32:23 +09:00
Mayne0213	fc4f825b6d	FIX(grafana): fix MinIO dashboard scrape_jobs variable - Query only MinIO-related jobs - Set includeAll and multi to false	2026-01-10 17:15:53 +09:00
Mayne0213	823edfbd88	fix(grafana): restrict main dashboard datasource to Thanos only - Set regex filter "/Thanos/" on datasource variable - Set default value to "Thanos" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:51:44 +09:00
Mayne0213	dc8706fb02	fix(grafana): set explicit 2m interval on CPU query targets - Global CPU Usage: set interval="2m" on Real Linux/Windows targets - CPU Usage: set interval="2m" on Real Linux/Windows targets - Previously empty interval caused $__rate_interval mismatch Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:50:44 +09:00
Mayne0213	3516a860db	fix(grafana): standardize CPU panel intervals to 2m - Revert Overview panels to 2m (rate() needs sufficient data points) - Change Cluster CPU Utilization targets to 2m for consistency - All CPU panels now update at the same rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:48:21 +09:00
Mayne0213	64e129128f	fix(grafana): sync interval for CPU panels in main dashboard - Change hardcoded "2m" interval to "$resolution" variable - Affected panels: Global CPU Usage (id 77), CPU Usage (id 37) - Ensures consistent refresh rate across all CPU metrics Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:46:15 +09:00
Mayne0213	518b5c31ef	fix: update dashboards and OTel collector for proper metrics/logs - certmanager.json: use Thanos datasource, fix variable regex - argocd.json: use Thanos datasource via $datasource variable - logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name) - collector.yaml: add loki.resource.labels hint for proper Loki label mapping Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:36:37 +09:00
Mayne0213	01c5742d7a	FIX(grafana): change OOM panel to stat type - Replace timeseries with stat panel for OOM detection - Show total count of OOMKilled pods instead of timeline - Gauge metric not suitable for timeseries visualization	2026-01-09 21:42:35 +09:00
Mayne0213	539f4be497	FIX(grafana): use kube-state-metrics for OOM detection - Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason - Fix OOM events not showing after pod restart - cAdvisor metric resets on pod restart, kube-state-metrics persists	2026-01-09 21:42:35 +09:00
Mayne0213	c472035499	FEAT(grafana): add Grafana monitoring - Add Grafana monitoring configuration - Enable metrics collection	2026-01-05 00:40:01 +09:00
Mayne0213	9583be9b46	FEAT(grafana): export dashboards to JSON and use sidecar ConfigMaps - Export 14 dashboards to JSON files - Use kustomize configMapGenerator for dashboard ConfigMaps - Enable Grafana sidecar to load dashboards from ConfigMaps - Keep Longhorn and Traefik Official from grafana.com	2026-01-05 00:40:01 +09:00

13 Commits