observability

Author	SHA1	Message	Date
Mayne0213	c3084225b7	PERF(observability): add HA for Loki and Tempo - Loki: replicas 1→2 with soft anti-affinity - Tempo: replicas 1→2 with soft anti-affinity - Thanos/Prometheus: keep replica 1 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 13:46:02 +09:00
Mayne0213	9e218a8adc	PERF(observability): reduce replicas, add priority - Reduce Prometheus replicas from 2 to 1 - Reduce Grafana replicas from 2 to 1 - Reduce Blackbox-exporter replicas from 2 to 1 - Move Loki, Thanos, Tempo to workers (remove tolerations) - Add medium-priority to Prometheus, Loki, Thanos, Tempo	2026-01-10 13:15:03 +09:00
Mayne0213	c34f56945a	feat(prometheus): enable container CPU throttling metrics collection - Override default cAdvisorMetricRelabelings - Remove cfs_throttled_seconds_total from drop regex - Enables CPU Throttled panels in Grafana dashboards Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 03:55:36 +09:00
Mayne0213	9e87e6fbcb	REVERT(otel): remove metrics collection, keep logs/traces only - Revert to simpler architecture where Prometheus scrapes metrics directly via ServiceMonitors - OTel Collector only handles logs (filelog) and traces (otlp) - Remove Target Allocator and metrics-related config - This reduces complexity and resource usage for home cluster	2026-01-10 01:18:35 +09:00
Mayne0213	a506ca3f58	FIX(prometheus): reduce replicas to 1 due to resource constraints - Cluster has insufficient memory to schedule 2 Prometheus replicas - Thanos sidecar still provides HA query capability	2026-01-10 01:18:26 +09:00
Mayne0213	5bc0caa324	FIX(prometheus): increase memory limit to 1536Mi to resolve OOMKilled - Prometheus pods were crashing with OOMKilled due to insufficient memory (768Mi) - Increased memory requests and limits from 768Mi to 1536Mi	2026-01-10 01:18:11 +09:00
Mayne0213	12ee5b61c0	FIX(prometheus): enable out-of-order time window - Set outOfOrderTimeWindow to 5m for TSDB - Allow slightly out-of-order samples from distributed collectors - Prevents data loss from timing differences	2026-01-09 23:43:01 +09:00
Mayne0213	a3c5a8dbcf	CHORE(prometheus): disable direct scraping - Disable ServiceMonitor/PodMonitor scraping in Prometheus - OTel Collector now handles all metrics collection - Prevents out-of-order sample errors from duplicate scraping	2026-01-09 23:39:30 +09:00
Mayne0213	02faf93555	FEAT(otel): add OTel Collector for logs and traces - Add OpenTelemetry Operator for CR management - Deploy OTel Collector as DaemonSet via CR - Enable filelog receiver for container log collection - Replace Promtail with OTel filelog receiver - Keep Prometheus for ServiceMonitor-based metrics scraping	2026-01-09 23:23:51 +09:00
Mayne0213	470a08f78a	CHORE(repo): switch to emptyDir with sizeLimit - Add sizeLimit 2Gi to loki emptyDir - Add sizeLimit 2Gi to tempo emptyDir - Change prometheus from PVC to emptyDir 5Gi - Change alertmanager from PVC to emptyDir 500Mi	2026-01-09 21:42:35 +09:00
Mayne0213	94af545120	REFACTOR(thanos): remove S3 storage integration - Disable Store Gateway and Compactor - Remove Sidecar objectStorageConfig - Keep Thanos Query + Sidecar for HA query - 3-day local retention is sufficient	2026-01-09 21:42:35 +09:00
Mayne0213	7cd778313a	FIX(prometheus): disable PrometheusDuplicateTimestamps alert - Low severity alert that fires repeatedly in HA setup - 0.05 samples/s drop rate is negligible	2026-01-09 21:42:35 +09:00
Mayne0213	4a4a43ed82	FIX(prometheus): increase memory to 768Mi - Prometheus was OOMKilled with 512Mi limit - Set both requests and limits to 768Mi	2026-01-09 21:42:35 +09:00
Mayne0213	5089e8607d	CHORE(resources): set memory limits equal to memory requests Align memory limits with memory requests for guaranteed QoS class. - prometheus, thanos (query, storegateway, compactor) - alertmanager, tempo, goldilocks (dashboard, controller) - node-exporter, opentelemetry-collector, vpa, kube-state-metrics	2026-01-09 21:42:35 +09:00
Mayne0213	7139f3e5a2	FIX(prometheus): correct ArgoCD metrics service names - Update controller target to argocd-application-controller-metrics - Update repo-server target to argocd-repo-server-metrics	2026-01-09 21:41:52 +09:00
Mayne0213	ea4d7d4ecf	PERF(prometheus): reduce CPU request from 200m to 50m - Actual usage is ~17m, 200m was over-provisioned - Fixes "Insufficient cpu" scheduling error for replica 2	2026-01-09 21:41:52 +09:00
Mayne0213	6b576d6a16	FEAT(thanos): add Thanos for Prometheus HA and long-term storage - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure Prometheus replicas: 2 with pod anti-affinity - Add ExternalSecrets for MinIO credentials - Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d	2026-01-09 21:41:52 +09:00
Mayne0213	30f028fae4	CHORE(prometheus): disable CPU/Memory overcommit alerts - Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts - Cluster uses replica=2 with pod anti-affinity for HA	2026-01-09 21:41:52 +09:00
Mayne0213	4286296591	PERF(resources): remove CPU limits - keep memory limits only - CPU throttling prevents app startup, not crashes - Memory OOM is the real cascading failure cause - CPU request ensures fair scheduling	2026-01-07 23:48:35 +09:00
Mayne0213	864c2c45d8	REFACTOR(alertmanager): change storageClass - Update storageClass to local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	e4b477a510	REFACTOR(longhorn): migrate to local-path - alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	60dfa5cf7b	CHORE(resources): disable apiserver/etcd metrics - Disable kubeApiServer ServiceMonitor (~37k series) - Disable kubeEtcd ServiceMonitor (~26k series) - Expected memory reduction: ~30-40%	2026-01-05 00:40:01 +09:00
Mayne0213	d8360c10a1	FEAT(repo): add cAdvisor metrics_path relabel - Add relabeling for cAdvisor metrics - Support recording rules	2026-01-05 00:40:01 +09:00
Mayne0213	1befeb68c4	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-05 00:40:01 +09:00
Mayne0213	cd575d94a6	PERF(prometheus): optimize prometheus memory usage - Increase scrapeInterval: 30s → 60s - Increase evaluationInterval: 30s → 60s - Reduce retention: 7d → 3d - Add memory limit: 1Gi (prevent unlimited growth) - Increase memory request: 256Mi → 512Mi (reflect actual usage)	2026-01-05 00:40:01 +09:00
Mayne0213	2ec87ca7a5	PERF(prometheus): increase Prometheus CPU request from 50m to 200m - Increase CPU request based on actual usage - Optimize resource allocation	2026-01-05 00:40:01 +09:00
Mayne0213	b3ad6338ac	FIX(prometheus): grafana prometheus datasource - url with full namespace	2026-01-04 23:38:05 +09:00
Mayne0213	340c6fea11	FIX(alertmanager): prometheus alertingendpoints - to connect to alertma...	2026-01-04 23:38:05 +09:00
Mayne0213	79b34aaca6	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-04 23:38:05 +09:00
Mayne0213	ac2abde8b5	FIX(prometheus): servicemonitor namespace - from monitoring to prometheus	2026-01-04 23:38:05 +09:00
Mayne0213	5c4676ca9a	REFACTOR(repo): restructure monitoring folder - and add namespace resou... - Remove argocd/, helm-values/, ingress/ subdirectories - Move files to parent directory (argocd.yaml, helm-values.yaml, ingress.yaml) - Update helm valueFiles paths in ArgoCD Applications - Add namespace.yaml to all applications with Goldilocks labels - Update destination namespaces to match folder names - Update kustomization.yaml files to reference new structure	2026-01-04 23:38:05 +09:00

31 Commits