---
sidebar_position: 3
---

# Monitoring Stack

## Overview

Our monitoring stack provides end-to-end observability through metrics, logs, and visualization.

## Components

### Prometheus

**Metrics collection and storage**

- Scrapes metrics from all services
- Stores time-series data
- Powers alerting rules

Access: Internal only (no direct UI exposure)

### Grafana

**Visualization and dashboards**

- Pre-built and custom dashboards
- Query Prometheus data
- Alert management UI

Access: https://grafana0213.kro.kr

### Loki

**Log aggregation**

- Collects logs from all pods
- Indexed for fast searching
- Integrated with Grafana

### Promtail

**Log shipping agent**

- Runs on each node
- Forwards logs to Loki
- Adds metadata labels

### Alertmanager

**Alert routing and notification**

- Receives alerts from Prometheus
- Routes alerts to the correct channels
- Deduplication and grouping

## Dashboards

### Pre-built Dashboards

1. **Cluster Overview**
   - Node health
   - Resource usage
   - Pod status

2. **Application Metrics**
   - Request rate
   - Error rate
   - Response time

3. **Infrastructure**
   - CPU, Memory, Disk
   - Network traffic
   - Storage usage

### Creating Custom Dashboards

```bash
# Export an existing dashboard (a Grafana API token is required)
curl -s -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  "http://grafana:3000/api/dashboards/uid/<dashboard-uid>" > dashboard.json

# Import via UI:
# Grafana → Dashboards → Import → Upload JSON
```

## Querying Metrics

### PromQL Examples

```promql
# CPU usage by pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Memory usage
container_memory_working_set_bytes

# HTTP request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])
```

## Alerts

### Viewing Alerts

```bash
# List Prometheus rules
sudo kubectl get prometheusrules -n monitoring

# View Alertmanager status
sudo kubectl get alertmanagers -n monitoring
```

### Common Alerts

- **HighCPUUsage**: Pod using >80% CPU
- **HighMemoryUsage**: Pod using >80% memory
- **PodCrashLooping**: Pod restarting frequently
- **DiskSpaceLow**: Node disk >85% full

## Log Queries

### LogQL Examples

```logql
# All logs from a namespace
{namespace="my-app"}

# Error logs
{namespace="my-app"} |= "error"

# Parse JSON logs
{namespace="my-app"} | json | level="error"

# Count errors
count_over_time({namespace="my-app"} |= "error" [5m])
```

## Accessing Monitoring Data

### Grafana UI

1. Navigate to https://grafana0213.kro.kr
2. Log in with your credentials
3. Browse dashboards or create queries

### Port Forwarding (Development)

```bash
# Prometheus UI
sudo kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Access at http://localhost:9090

# Alertmanager UI
sudo kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# Access at http://localhost:9093
```

## Troubleshooting

### No Metrics Showing

```bash
# Validate the Prometheus configuration
sudo kubectl exec -n monitoring prometheus-0 -- promtool check config /etc/prometheus/prometheus.yml

# Check scrape target health (with the Prometheus port-forward from above active)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Verify service monitors
sudo kubectl get servicemonitors -A
```

### Grafana Not Loading Data

```bash
# Check Grafana logs
sudo kubectl logs -n monitoring deployment/grafana

# Verify datasource configuration
sudo kubectl get secret -n monitoring grafana-datasources -o yaml
```

### High Cardinality Issues

Too many unique label combinations can cause performance issues:

```bash
# Check series count per metric
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```

## Best Practices

1. **Set up alerts proactively**: Don't wait for incidents (see the rule sketch below)
2. **Use labels wisely**: Avoid high cardinality
3. **Create focused dashboards**: One purpose per dashboard
4. **Set retention policies**: Balance storage cost against history
5. **Document custom metrics**: Help future maintainers
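To put the first point into practice, alerts like the ones under Common Alerts are typically defined as `PrometheusRule` resources that the operator loads into Prometheus. A minimal sketch of a crash-loop rule follows; the resource name, the `release` label (which must match your Prometheus rule selector), and the threshold are assumptions to adapt, and the expression assumes kube-state-metrics is running:

```bash
# Sketch only: names, labels, and thresholds below are placeholders.
sudo kubectl apply -n monitoring -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-pod-alerts
  labels:
    release: prometheus   # assumption: must match the rule selector of your Prometheus CR
spec:
  groups:
    - name: example.pod.rules
      rules:
        - alert: PodCrashLooping
          # Fires when a container restarts more than 3 times within 15 minutes
          # (metric comes from kube-state-metrics).
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
EOF
```

Once applied, the rule should appear in `sudo kubectl get prometheusrules -n monitoring` and, after the next rule reload, on the Prometheus Alerts page.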
## Metrics to Monitor

### Application Level

- Request rate
- Error rate
- Response time (latency)
- Saturation (queue depth)

A query sketch for these signals appears at the end of this page.

### Infrastructure Level

- CPU usage
- Memory usage
- Disk I/O
- Network throughput

### Business Level (Optional)

- User signups
- Active sessions
- Feature usage
- Transaction volume

## Next Steps

- [Kubernetes Operations](./kubernetes)
- [ArgoCD Configuration](./argocd)
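The right queries for the application-level signals depend on how each service is instrumented. As a rough sketch, assuming the widely used `http_requests_total` counter and `http_request_duration_seconds` histogram (your metric names may differ), the first three signals can be pulled straight from the Prometheus API with the port-forward from earlier active:

```bash
# Assumed metric names; adjust to your application's instrumentation.

# Request rate: requests per second over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))'

# Error rate: share of requests answered with a 5xx status
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Response time: 95th percentile latency
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
```

The same expressions work as Grafana panel queries. Saturation is usually service-specific (queue depth, worker utilization), so there is no single standard metric for it.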