observability

Author	SHA1	Message	Date
Mayne0213	b145881fa2	PERF(prometheus): increase memory limit to 1Gi - Increase memory request from 768Mi to 1Gi - Increase memory limit from 768Mi to 1Gi - Prevents OOM at 97% memory usage	2026-01-12 03:16:40 +09:00
Mayne0213	7e61af372b	PERF(observability): remove CPU limits for stability - Remove CPU limits from all observability components - Prevents CPU throttling issues across monitoring stack	2026-01-12 02:10:54 +09:00
Mayne0213	3b5bf20902	PERF(observability): optimize resources via VPA - alertmanager: CPU 15m/15m, memory 100Mi/100Mi - blackbox-exporter: CPU 15m/32m, memory 100Mi/100Mi - goldilocks: controller 15m/25m, dashboard 15m/15m - grafana: CPU 22m/24m, memory 144Mi/242Mi (upperBound) - kube-state-metrics: CPU 15m/15m, memory 100Mi/100Mi - loki: CPU 10m/69m, memory 225Mi/323Mi - node-exporter: CPU 15m/15m, memory 100Mi/100Mi - opentelemetry: CPU 34m/410m, memory 142Mi/1024Mi - prometheus-operator: CPU 15m/15m, memory 100Mi/100Mi - tempo: CPU 15m/15m, memory 100Mi/109Mi - thanos: CPU 15m/15m, memory 100Mi/126Mi - vpa: CPU 15m/15m, memory 100Mi/100Mi	2026-01-12 01:07:58 +09:00
Mayne0213	c1214029a2	refactor: update Vault secret paths to new categorized structure - alertmanager: alertmanager → observability/alertmanager - grafana: postgresql → storage/postgresql - prometheus: postgresql → storage/postgresql, minio → storage/minio - thanos: minio → storage/minio Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-11 22:36:22 +09:00
Mayne0213	15d5e58d6c	migrate: change repoURLs from GitHub to Gitea Update all ArgoCD Application references to use Gitea (github0213.com) instead of GitHub for K3S-HOME/observability repository. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 20:43:29 +09:00
Mayne0213	c3084225b7	PERF(observability): add HA for Loki and Tempo - Loki: replicas 1→2 with soft anti-affinity - Tempo: replicas 1→2 with soft anti-affinity - Thanos/Prometheus: keep replica 1 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 13:46:02 +09:00
Mayne0213	9e218a8adc	PERF(observability): reduce replicas, add priority - Reduce Prometheus replicas from 2 to 1 - Reduce Grafana replicas from 2 to 1 - Reduce Blackbox-exporter replicas from 2 to 1 - Move Loki, Thanos, Tempo to workers (remove tolerations) - Add medium-priority to Prometheus, Loki, Thanos, Tempo	2026-01-10 13:15:03 +09:00
Mayne0213	c34f56945a	feat(prometheus): enable container CPU throttling metrics collection - Override default cAdvisorMetricRelabelings - Remove cfs_throttled_seconds_total from drop regex - Enables CPU Throttled panels in Grafana dashboards Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:55:36 +09:00
Mayne0213	9e87e6fbcb	REVERT(otel): remove metrics collection, keep logs/traces only - Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors - OTel Collector only handles logs (filelog) and traces (otlp) - Remove Target Allocator and metrics-related config - This reduces complexity and resource usage for home cluster	2026-01-10 01:18:35 +09:00
Mayne0213	a506ca3f58	FIX(prometheus): reduce replicas to 1 due to resource constraints - Cluster has insufficient memory to schedule 2 Prometheus replicas - Thanos sidecar still provides HA query capability	2026-01-10 01:18:26 +09:00
Mayne0213	5bc0caa324	FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled - Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi) - Increased memory requests and limits from 768Mi to 1536Mi	2026-01-10 01:18:11 +09:00
Mayne0213	12ee5b61c0	FIX(prometheus): enable out-of-order time window - Set outOfOrderTimeWindow to 5m for TSDB - Allow slightly out-of-order samples from distributed collectors - Prevents data loss from timing differences	2026-01-09 23:43:01 +09:00
Mayne0213	a3c5a8dbcf	CHORE(prometheus): disable direct scraping - Disable ServiceMonitor/PodMonitor scraping in Prometheus - OTel Collector now handles all metrics collection - Prevents out-of-order sample errors from duplicate scraping	2026-01-09 23:39:30 +09:00
Mayne0213	02faf93555	FEAT(otel): add OTel Collector for logs and traces - Add OpenTelemetry Operator for CR management - Deploy OTel Collector as DaemonSet via CR - Enable filelog receiver for container log collection - Replace Promtail with OTel filelog receiver - Keep Prometheus for ServiceMonitor-based metrics scraping	2026-01-09 23:23:51 +09:00
Mayne0213	470a08f78a	CHORE(repo): switch to emptyDir with sizeLimit - Add sizeLimit 2Gi to loki emptyDir - Add sizeLimit 2Gi to tempo emptyDir - Change prometheus from PVC to emptyDir 5Gi - Change alertmanager from PVC to emptyDir 500Mi	2026-01-09 21:42:35 +09:00
Mayne0213	94af545120	REFACTOR(thanos): remove S3 storage integration - Disable Store Gateway and Compactor - Remove Sidecar objectStorageConfig - Keep Thanos Query + Sidecar for HA query - 3-day local retention is sufficient	2026-01-09 21:42:35 +09:00
Mayne0213	7cd778313a	FIX(prometheus): disable PrometheusDuplicateTimestamps alert - Low severity alert that fires repeatedly in HA setup - 0.05 samples/s drop rate is negligible	2026-01-09 21:42:35 +09:00
Mayne0213	4a4a43ed82	FIX(prometheus): increase memory to 768Mi - Prometheus was OOMKilled with 512Mi limit - Set both requests and limits to 768Mi	2026-01-09 21:42:35 +09:00
Mayne0213	5089e8607d	CHORE(resources): set memory limits equal to memory requests Align memory limits with memory requests for guaranteed QoS class. - prometheus, thanos (query, storegateway, compactor) - alertmanager, tempo, goldilocks (dashboard, controller) - node-exporter, opentelemetry-collector, vpa, kube-state-metrics	2026-01-09 21:42:35 +09:00
Mayne0213	7139f3e5a2	FIX(prometheus): correct ArgoCD metrics service names - Update controller target to argocd-application-controller-metrics - Update repo-server target to argocd-repo-server-metrics	2026-01-09 21:41:52 +09:00
Mayne0213	445cabb900	FIX(prometheus): add ExternalSecret default values to fix OutOfSync	2026-01-09 21:41:52 +09:00
Mayne0213	9b052b49cf	FEAT(thanos): add Thanos for Prometheus HA - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure OCI registry for Bitnami chart - Fix Vault secret path and image settings - Add nodeSelector for master node	2026-01-09 21:41:52 +09:00
Mayne0213	ea4d7d4ecf	PERF(prometheus): reduce CPU request from 200m to 50m - Actual usage is ~17m, 200m was over-provisioned - Fixes "Insufficient cpu" scheduling error for replica 2	2026-01-09 21:41:52 +09:00
Mayne0213	6b576d6a16	FEAT(thanos): add Thanos for Prometheus HA and long-term storage - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure Prometheus replicas: 2 with pod anti-affinity - Add ExternalSecrets for MinIO credentials - Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d	2026-01-09 21:41:52 +09:00
Mayne0213	30f028fae4	CHORE(prometheus): disable CPU/Memory overcommit alerts - Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts - Cluster uses replica=2 with pod anti-affinity for HA	2026-01-09 21:41:52 +09:00
Mayne0213	855321bebf	FIX(prometheus): ignore relabelings default value diff in ServiceMonitor	2026-01-08 00:15:09 +09:00
Mayne0213	4286296591	PERF(resources): remove CPU limits - keep memory limits only - CPU throttling prevents app startup, not crashes - Memory OOM is the real cascading failure cause - CPU request ensures fair scheduling	2026-01-07 23:48:35 +09:00
Mayne0213	69dc3b34be	REFACTOR(secrets): flatten Vault paths - Change secret paths from <category>/<app> to <app> - monitoring/alertmanager → alertmanager - monitoring/grafana → grafana - databases/postgresql → postgresql	2026-01-06 16:52:58 +09:00
Mayne0213	7888aeff36	REFACTOR(repo): move vault/ to manifests/ - Move ExternalSecret files from vault/ to manifests/secret.yaml - Update kustomization.yaml references - Remove vault/ folders Apps: alertmanager, grafana, prometheus	2026-01-06 16:42:33 +09:00
Mayne0213	28ba50d1a3	REFACTOR(repo): observability repo structure - Add application.yaml for ArgoCD app-of-apps - Add kustomization.yaml with observability components - Add renovate.json for automated updates - Update all component argocd.yaml repoURLs to observability repo Components: prometheus, alertmanager, grafana, loki, promtail, node-exporter, kube-state-metrics, goldilocks, uptime-kuma, vpa	2026-01-05 00:40:01 +09:00
Mayne0213	864c2c45d8	REFACTOR(alertmanager): change storageClass - Update storageClass to local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	200c6e97ae	REFACTOR(repo): migrate repoURL to K3S-HOME - Update repository URL to K3S-HOME organization - Change from personal to organization repo	2026-01-05 00:40:01 +09:00
Mayne0213	e4b477a510	REFACTOR(longhorn): migrate to local-path - alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	60dfa5cf7b	CHORE(resources): disable apiserver/etcd metrics - Disable kubeApiServer ServiceMonitor (~37k series) - Disable kubeEtcd ServiceMonitor (~26k series) - Expected memory reduction: ~30-40%	2026-01-05 00:40:01 +09:00
Mayne0213	658a81b4c1	REFACTOR(repo): remove ServerSideApply - Remove ServerSideApply configuration - Add RespectIgnoreDifferences syncOption	2026-01-05 00:40:01 +09:00
Mayne0213	2c841c2b6e	FEAT(vault): add ignoreDiff for ES/SM - Add ignoreDifferences for ExternalSecret - Prevent ArgoCD sync drift	2026-01-05 00:40:01 +09:00
Mayne0213	d8360c10a1	FEAT(repo): add cAdvisor metrics_path relabel - Add relabeling for cAdvisor metrics - Support recording rules	2026-01-05 00:40:01 +09:00
Mayne0213	1befeb68c4	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-05 00:40:01 +09:00
Mayne0213	cd575d94a6	PERF(prometheus): optimize prometheus memory usage - Increase scrapeInterval: 30s → 60s - Increase evaluationInterval: 30s → 60s - Reduce retention: 7d → 3d - Add memory limit: 1Gi (prevent unlimited growth) - Increase memory request: 256Mi → 512Mi (reflect actual usage)	2026-01-05 00:40:01 +09:00
Mayne0213	2ec87ca7a5	PERF(prometheus): increase Prometheus CPU request from 50m to 200m - Increase CPU request based on actual usage - Optimize resource allocation	2026-01-05 00:40:01 +09:00
Mayne0213	b3ad6338ac	FIX(prometheus): grafana prometheus datasource - url with full namespace	2026-01-04 23:38:05 +09:00
Mayne0213	340c6fea11	FIX(alertmanager): prometheus alertingendpoints - to connect to alertma...	2026-01-04 23:38:05 +09:00
Mayne0213	bc1cf0d223	REFACTOR(argocd): remove serversideapply - from argocd applications	2026-01-04 23:38:05 +09:00
Mayne0213	79b34aaca6	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-04 23:38:05 +09:00
Mayne0213	0cb7438d79	CHORE(external-secrets): update ESO API version from v1beta1 to v1 - Update ExternalSecret API version - Migrate to stable API	2026-01-04 23:38:05 +09:00
Mayne0213	c75798065f	CHORE(postgresql): update PostgreSQL namespace reference - Update namespace reference for PostgreSQL - Fix service discovery	2026-01-04 23:38:05 +09:00
Mayne0213	ea4152a0d6	REFACTOR(gitea): migrate repoURL from Gitea - to GitHub	2026-01-04 23:38:05 +09:00
Mayne0213	5ec1a3323d	REFACTOR(goldilocks): use managedNamespaceMetad... - Remove namespace.yaml files - Add managedNamespaceMetadata with Goldilocks label - Set CreateNamespace=true in syncOptions - Update kustomization.yaml to remove namespace.yaml references	2026-01-04 23:38:05 +09:00
Mayne0213	ac2abde8b5	FIX(prometheus): servicemonitor namespace - from monitoring to prometheus	2026-01-04 23:38:05 +09:00
Mayne0213	bbf6fa5001	CHORE(repo): clean kustomization files - Remove unused entries from kustomization - Clean up configuration	2026-01-04 23:38:05 +09:00

62 Commits