observability

Author	SHA1	Message	Date
Mayne0213	4a4a43ed82	FIX(prometheus): increase memory to 768Mi - Prometheus was OOMKilled with 512Mi limit - Set both requests and limits to 768Mi	2026-01-09 21:42:35 +09:00
Mayne0213	5089e8607d	CHORE(resources): set memory limits equal to memory requests - Align memory limits with memory requests for guaranteed QoS class. - prometheus, thanos (query, storegateway, compactor) - alertmanager, tempo, goldilocks (dashboard, controller) - node-exporter, opentelemetry-collector, vpa, kube-state-metrics	2026-01-09 21:42:35 +09:00
Mayne0213	7139f3e5a2	FIX(prometheus): correct ArgoCD metrics service names - Update controller target to argocd-application-controller-metrics - Update repo-server target to argocd-repo-server-metrics	2026-01-09 21:41:52 +09:00
Mayne0213	ea4d7d4ecf	PERF(prometheus): reduce CPU request from 200m to 50m - Actual usage is ~17m, 200m was over-provisioned - Fixes "Insufficient cpu" scheduling error for replica 2	2026-01-09 21:41:52 +09:00
Mayne0213	6b576d6a16	FEAT(thanos): add Thanos for Prometheus HA and long-term storage - Add Thanos Query, Store Gateway, Compactor - Enable Prometheus Sidecar with S3 (MinIO) storage - Configure Prometheus replicas: 2 with pod anti-affinity - Add ExternalSecrets for MinIO credentials - Retention: raw 7d, 5m downsampled 30d, 1h downsampled 90d	2026-01-09 21:41:52 +09:00
Mayne0213	30f028fae4	CHORE(prometheus): disable CPU/Memory overcommit alerts - Disable KubeCPUOvercommit and KubeMemoryOvercommit alerts - Cluster uses replica=2 with pod anti-affinity for HA	2026-01-09 21:41:52 +09:00
Mayne0213	4286296591	PERF(resources): remove CPU limits - keep memory limits only - CPU throttling prevents app startup, not crashes - Memory OOM is the real cascading failure cause - CPU request ensures fair scheduling	2026-01-07 23:48:35 +09:00
Mayne0213	864c2c45d8	REFACTOR(alertmanager): change storageClass - Update storageClass to local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	e4b477a510	REFACTOR(longhorn): migrate to local-path - alertmanager, grafana, loki, prometheus: storageClass -> local-path-retain - Change storage backend configuration	2026-01-05 00:40:01 +09:00
Mayne0213	60dfa5cf7b	CHORE(resources): disable apiserver/etcd metrics - Disable kubeApiServer ServiceMonitor (~37k series) - Disable kubeEtcd ServiceMonitor (~26k series) - Expected memory reduction: ~30-40%	2026-01-05 00:40:01 +09:00
Mayne0213	d8360c10a1	FEAT(repo): add cAdvisor metrics_path relabel - Add relabeling for cAdvisor metrics - Support recording rules	2026-01-05 00:40:01 +09:00
Mayne0213	1befeb68c4	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-05 00:40:01 +09:00
Mayne0213	cd575d94a6	PERF(prometheus): optimize prometheus memory usage - Increase scrapeInterval: 30s → 60s - Increase evaluationInterval: 30s → 60s - Reduce retention: 7d → 3d - Add memory limit: 1Gi (prevent unlimited growth) - Increase memory request: 256Mi → 512Mi (reflect actual usage)	2026-01-05 00:40:01 +09:00
Mayne0213	2ec87ca7a5	PERF(prometheus): increase Prometheus CPU request from 50m to 200m - Increase CPU request based on actual usage - Optimize resource allocation	2026-01-05 00:40:01 +09:00
Mayne0213	b3ad6338ac	FIX(prometheus): grafana prometheus datasource - url with full namespace	2026-01-04 23:38:05 +09:00
Mayne0213	340c6fea11	FIX(alertmanager): prometheus alertingendpoints - to connect to alertma...	2026-01-04 23:38:05 +09:00
Mayne0213	79b34aaca6	FEAT(prometheus): add ServerSideApply - Enable ServerSideApply for CRD annotation handling - Fix resource management	2026-01-04 23:38:05 +09:00
Mayne0213	ac2abde8b5	FIX(prometheus): servicemonitor namespace - from monitoring to prometheus	2026-01-04 23:38:05 +09:00
Mayne0213	5c4676ca9a	REFACTOR(repo): restructure monitoring folder - and add namespace resou... - Remove argocd/, helm-values/, ingress/ subdirectories - Move files to parent directory (argocd.yaml, helm-values.yaml, ingress.yaml) - Update helm valueFiles paths in ArgoCD Applications - Add namespace.yaml to all applications with Goldilocks labels - Update destination namespaces to match folder names - Update kustomization.yaml files to reference new structure	2026-01-04 23:38:05 +09:00

19 Commits