observability

Author	SHA1	Message	Date
Mayne0213	b818a8c1fe	fix: update CPU throttling panels to use PSI metrics with 10% threshold	2026-01-10 17:54:55 +09:00
Mayne0213	2b1667e643	FIX(grafana): replace rate_interval with 5m in MinIO dashboard - Change all $__rate_interval to 5m - Fix No data issues in rate() queries	2026-01-10 17:50:47 +09:00
Mayne0213	38e0c68ddb	CHORE(grafana): rearrange Bucket Scans panels side by side - Move Finished to left (x=0) - Move Started next to Finished (x=12, same y)	2026-01-10 17:48:43 +09:00
Mayne0213	4afdf04ef2	CHORE(grafana): remove KMS panels from MinIO dashboard - Remove 5 KMS-related panels (KMS not configured) - KMS Uptime, Request rates, Online/Offline status	2026-01-10 17:46:45 +09:00
Mayne0213	20b796f9e4	FIX(grafana): fix MinIO CPU Usage panel query - Hardcode job=minio and 5m interval - Change unit from 's' to 'percentunit' - Set max to 1 for proper gauge display	2026-01-10 17:33:54 +09:00
Mayne0213	fa4c2ce8f6	FIX(grafana): set default value for MinIO dashboard variable - Set scrape_jobs default to 'minio' - Hide variable selector (only one option)	2026-01-10 17:32:23 +09:00
Mayne0213	fc4f825b6d	FIX(grafana): fix MinIO dashboard scrape_jobs variable - Query only MinIO-related jobs - Set includeAll and multi to false	2026-01-10 17:15:53 +09:00
Mayne0213	a3003d597f	PERF(observability): adjust resources based on VPA - Update blackbox-exporter cpu 15m→23m, memory 64Mi→100Mi - Update grafana cpu 11m→23m, memory 425Mi→175Mi - Update loki cpu 23m→63m, memory 462Mi→363Mi - Update tempo cpu 50m→15m, memory 128Mi→100Mi - Update thanos memory 128Mi→283Mi - Update node-exporter memory 64Mi→100Mi - Update kube-state-metrics memory 100Mi→105Mi - Update opentelemetry-operator cpu 10m→11m, memory 256Mi→75Mi - Update vpa memory 128Mi→100Mi	2026-01-10 14:33:40 +09:00
Mayne0213	9e218a8adc	PERF(observability): reduce replicas, add priority - Reduce Prometheus replicas from 2 to 1 - Reduce Grafana replicas from 2 to 1 - Reduce Blackbox-exporter replicas from 2 to 1 - Move Loki, Thanos, Tempo to workers (remove tolerations) - Add medium-priority to Prometheus, Loki, Thanos, Tempo	2026-01-10 13:15:03 +09:00
Mayne0213	823edfbd88	fix(grafana): restrict main dashboard datasource to Thanos only - Set regex filter "/Thanos/" on datasource variable - Set default value to "Thanos" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:51:44 +09:00
Mayne0213	dc8706fb02	fix(grafana): set explicit 2m interval on CPU query targets - Global CPU Usage: set interval="2m" on Real Linux/Windows targets - CPU Usage: set interval="2m" on Real Linux/Windows targets - Previously empty interval caused $__rate_interval mismatch Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:50:44 +09:00
Mayne0213	3516a860db	fix(grafana): standardize CPU panel intervals to 2m - Revert Overview panels to 2m (rate() needs sufficient data points) - Change Cluster CPU Utilization targets to 2m for consistency - All CPU panels now update at the same rate Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:48:21 +09:00
Mayne0213	64e129128f	fix(grafana): sync interval for CPU panels in main dashboard - Change hardcoded "2m" interval to "$resolution" variable - Affected panels: Global CPU Usage (id 77), CPU Usage (id 37) - Ensures consistent refresh rate across all CPU metrics Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:46:15 +09:00
Mayne0213	518b5c31ef	fix: update dashboards and OTel collector for proper metrics/logs - certmanager.json: use Thanos datasource, fix variable regex - argocd.json: use Thanos datasource via $datasource variable - logs.json: update to use OTel labels (k8s_namespace_name, k8s_container_name) - collector.yaml: add loki.resource.labels hint for proper Loki label mapping Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:36:37 +09:00
Mayne0213	01c5742d7a	FIX(grafana): change OOM panel to stat type - Replace timeseries with stat panel for OOM detection - Show total count of OOMKilled pods instead of timeline - Gauge metric not suitable for timeseries visualization	2026-01-09 21:42:35 +09:00
Mayne0213	539f4be497	FIX(grafana): use kube-state-metrics for OOM detection - Replace container_oom_events_total with kube_pod_container_status_last_terminated_reason - Fix OOM events not showing after pod restart - cAdvisor metric resets on pod restart, kube-state-metrics persists	2026-01-09 21:42:35 +09:00
Mayne0213	aecb15031d	FEAT(grafana): add Thanos as default datasource - Add Thanos Query as default Prometheus datasource - Keep original Prometheus datasource as backup - Thanos provides deduplicated metrics from HA Prometheus REFACTOR(thanos): move all components to master node - Add tolerations for control-plane:NoSchedule - Add nodeSelector for control-plane node - Affects: query, storegateway, compactor - PVC will be recreated on master node (data in S3) FIX(thanos): allow non-Bitnami images (quay.io/thanos) FIX(thanos): correct nodeSelector value to 'true'	2026-01-09 21:41:52 +09:00
Mayne0213	6da4eba1dc	CHORE(grafana): remove admin login secret for SSO - Remove grafana-admin-password ExternalSecret - Remove admin section from helm-values.yaml - Authentication handled by Authelia SSO middleware	2026-01-09 21:41:52 +09:00
Mayne0213	2cf35d0f76	FEAT(loki): configure storage and HA - Rename extraVolume to avoid duplicate name - Add emptyDir for /var/loki cache - Migrate to shared storage with MinIO - Configure HA with 2 replicas - Revert to single replica for Single Binary mode	2026-01-09 21:41:52 +09:00
Mayne0213	2b7ee1fe51	FEAT(loki): configure storage and HA - Rename extraVolume to avoid duplicate name - Add emptyDir for /var/loki cache - Migrate to shared storage with MinIO - Configure HA with 2 replicas - Revert to single replica for Single Binary mode	2026-01-09 21:41:52 +09:00
Mayne0213	4286296591	PERF(resources): remove CPU limits - keep memory limits only - CPU throttling prevents app startup, not crashes - Memory OOM is the real cascading failure cause - CPU request ensures fair scheduling	2026-01-07 23:48:35 +09:00
Mayne0213	69dc3b34be	REFACTOR(secrets): flatten Vault paths - Change secret paths from <category>/<app> to <app> - monitoring/alertmanager → alertmanager - monitoring/grafana → grafana - databases/postgresql → postgresql	2026-01-06 16:52:58 +09:00
Mayne0213	7888aeff36	REFACTOR(repo): move vault/ to manifests/ - Move ExternalSecret files from vault/ to manifests/secret.yaml - Update kustomization.yaml references - Remove vault/ folders Apps: alertmanager, grafana, prometheus	2026-01-06 16:42:33 +09:00
Mayne0213	7b9abaf9c8	REFACTOR(obs): integrate ingress to helm-values - alertmanager: move ingress to karma inline, servicemonitor to manifests - goldilocks: move ingress to helm-values - grafana: move ingress to helm-values - uptime-kuma: move ingress to helm-values	2026-01-06 01:57:03 +09:00
Mayne0213	28ba50d1a3	REFACTOR(repo): observability repo structure - Add application.yaml for ArgoCD app-of-apps - Add kustomization.yaml with observability components - Add renovate.json for automated updates - Update all component argocd.yaml repoURLs to observability repo Components: prometheus, alertmanager, grafana, loki, promtail, node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa	2026-01-05 00:40:01 +09:00
Mayne0213	8dcb563ae4	REFACTOR(grafana): change Grafana storageClass - Update storageClass to local-path - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	c472035499	FEAT(grafana): add Grafana monitoring - Add Grafana monitoring configuration - Enable metrics collection	2026-01-05 00:40:01 +09:00
Mayne0213	7653f2c4c8	FIX(grafana): fix storageClass key name - Correct storageclass key spelling - Fix Helm values configuration	2026-01-05 00:40:01 +09:00
Mayne0213	aad4c249e2	CHORE(grafana): disable auto dashboard provision - Use manual import instead of automatic provisioning - Remove configMapGenerator for dashboards - Remove sidecar and dashboards helm config - Keep JSON files in dashboards/ for manual import reference	2026-01-05 00:40:01 +09:00
Mayne0213	ababd677d4	FEAT(repo): enable ServerSideApply - Enable ServerSideApply to handle large configmaps - Fix resource management issues	2026-01-05 00:40:01 +09:00
Mayne0213	9583be9b46	FEAT(grafana): export dashboards - to JSON and use sidecar ConfigMaps - Export 14 dashboards to JSON files - Use kustomize configMapGenerator for dashboard ConfigMaps - Enable Grafana sidecar to load dashboards from ConfigMaps - Keep Longhorn and Traefik Official from grafana.com	2026-01-05 00:40:01 +09:00
Mayne0213	c356493707	FEAT(alertmanager): add datasource to Grafana - Add Alertmanager datasource configuration - Enable alert visualization	2026-01-05 00:40:01 +09:00
Mayne0213	200c6e97ae	REFACTOR(repo): migrate repoURL to K3S-HOME - Update repository URL to K3S-HOME organization - Change from personal to organization repo	2026-01-05 00:40:01 +09:00
renovate[bot]	1e1cde4cd9	CHORE(deps): update alertmanager to v1.30.0 - Upgrade Alertmanager chart version - Apply dependency updates	2026-01-05 00:40:01 +09:00
Mayne0213	939ae13c5d	CHORE(grafana): disable local auth, add SSO - Enable anonymous auth with Admin role - Disable login form - Add Authelia middleware to ingress	2026-01-05 00:40:01 +09:00
Mayne0213	e4b477a510	REFACTOR(longhorn): migrate to local-path - alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	823b2ba495	REFACTOR(repo): remove global panel from Grafana - Remove global panel configuration - Clean up dashboard settings	2026-01-05 00:40:01 +09:00
Mayne0213	f6ceb50503	REFACTOR(grafana): remove dashboard 15757 - Remove Windows-specific queries dashboard - Clean up unused dashboards	2026-01-05 00:40:01 +09:00
Mayne0213	0617611d22	FIX(grafana): restore dashboard 15757 - Restore Kubernetes Global with CPU Real dashboard - Re-enable monitoring visualization	2026-01-05 00:40:01 +09:00
Mayne0213	685563b92c	REFACTOR(grafana): remove duplicated Dashboard - Remove duplicate Grafana dashboard - Clean up configuration	2026-01-05 00:40:01 +09:00
Mayne0213	ebc5af24ef	FEAT(repo): add Grafana Global panel - Add global panel to Grafana dashboard - Enable overview visualization	2026-01-05 00:40:01 +09:00
Mayne0213	d0fc55d403	FEAT(grafana): add uid to Grafana datasources - for dashboard compatibi...	2026-01-05 00:40:01 +09:00
Mayne0213	912b3aa38f	REFACTOR(minio): remove minio dashboard - using manually imported one	2026-01-04 23:38:05 +09:00
Mayne0213	8e964afe42	FEAT(grafana): add grafana dashboards - for cluster monitoring	2026-01-04 23:38:05 +09:00
Mayne0213	b3ad6338ac	FIX(prometheus): grafana prometheus datasource - url with full namespace	2026-01-04 23:38:05 +09:00
Mayne0213	a30dbf138f	REFACTOR(traefik): switch ingress to Traefik - Update ingressClassName from haproxy to traefik - Remove haproxy.org annotations	2026-01-04 23:38:05 +09:00
Mayne0213	0cb7438d79	CHORE(external-secrets): update ESO API version from v1beta1 to v1 - Update ExternalSecret API version - Migrate to stable API	2026-01-04 23:38:05 +09:00
Mayne0213	ea4152a0d6	REFACTOR(gitea): migrate repoURL from Gitea - to GitHub	2026-01-04 23:38:05 +09:00
Mayne0213	5ec1a3323d	REFACTOR(goldilocks): use managedNamespaceMetad... - Remove namespace.yaml files - Add managedNamespaceMetadata with Goldilocks label - Set CreateNamespace=true in syncOptions - Update kustomization.yaml to remove namespace.yaml references	2026-01-04 23:38:05 +09:00
Mayne0213	bbf6fa5001	CHORE(repo): clean kustomization files - Remove unused entries from kustomization - Clean up configuration	2026-01-04 23:38:05 +09:00

63 Commits