REFACTOR(docs): detach services,ingress from docs

This commit is contained in:
2025-12-29 14:22:41 +09:00
parent cdbf94bc81
commit 0996187c82
21 changed files with 52 additions and 195 deletions

View File

@@ -0,0 +1,225 @@
---
sidebar_position: 3
---
# Monitoring Stack
## Overview
Our monitoring stack provides complete observability with metrics, logs, and visualization.
## Components
### Prometheus
**Metrics collection and storage**
- Scrapes metrics from all services
- Stores time-series data
- Powers alerting rules
Access: Internal only (no direct UI exposure)
### Grafana
**Visualization and dashboards**
- Beautiful dashboards
- Query Prometheus data
- Alert management UI
Access: https://grafana0213.kro.kr
### Loki
**Log aggregation**
- Collects logs from all pods
- Indexed for fast searching
- Integrated with Grafana
### Promtail
**Log shipping agent**
- Runs on each node
- Forwards logs to Loki
- Adds metadata labels
### Alertmanager
**Alert routing and notification**
- Receives alerts from Prometheus
- Routes to correct channels
- Deduplication and grouping
## Dashboards
### Pre-built Dashboards
1. **Cluster Overview**
- Node health
- Resource usage
- Pod status
2. **Application Metrics**
- Request rate
- Error rate
- Response time
3. **Infrastructure**
- CPU, Memory, Disk
- Network traffic
- Storage usage
### Creating Custom Dashboards
```bash
# Export existing dashboard
curl -s http://grafana:3000/api/dashboards/uid/<uid> > dashboard.json
# Import via UI
Grafana → Dashboards → Import → Upload JSON
```
## Querying Metrics
### PromQL Examples
```promql
# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])
# Memory usage
container_memory_working_set_bytes
# HTTP request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
```
## Alerts
### Viewing Alerts
```bash
# List Prometheus rules
sudo kubectl get prometheusrules -n monitoring
# View Alertmanager status
sudo kubectl get alertmanagers -n monitoring
```
### Common Alerts
- **HighCPUUsage**: Pod using >80% CPU
- **HighMemoryUsage**: Pod using >80% memory
- **PodCrashLooping**: Pod restarting frequently
- **DiskSpaceLow**: Node disk >85% full
## Log Queries
### LogQL Examples
```logql
# All logs from a namespace
{namespace="my-app"}
# Error logs
{namespace="my-app"} |= "error"
# Parse JSON logs
{namespace="my-app"} | json | level="error"
# Count errors
count_over_time({namespace="my-app"} |= "error" [5m])
```
## Accessing Monitoring Data
### Grafana UI
1. Navigate to https://grafana0213.kro.kr
2. Log in with credentials
3. Browse dashboards or create queries
### Port Forwarding (Development)
```bash
# Prometheus UI
sudo kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Access at http://localhost:9090
# Alertmanager UI
sudo kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# Access at http://localhost:9093
```
## Troubleshooting
### No Metrics Showing
```bash
# Check Prometheus targets
sudo kubectl exec -n monitoring prometheus-0 -- promtool check config /etc/prometheus/prometheus.yml
# Verify service monitors
sudo kubectl get servicemonitors -A
```
### Grafana Not Loading Data
```bash
# Check Grafana logs
sudo kubectl logs -n monitoring deployment/grafana
# Verify datasource configuration
sudo kubectl get secret -n monitoring grafana-datasources -o yaml
```
### High Cardinality Issues
Too many unique label combinations can cause performance issues:
```bash
# Check series count
curl http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```
## Best Practices
1. **Set up alerts proactively**: Don't wait for incidents
2. **Use labels wisely**: Avoid high cardinality
3. **Create focused dashboards**: One purpose per dashboard
4. **Set retention policies**: Balance storage vs history
5. **Document custom metrics**: Help future maintainers
## Metrics to Monitor
### Application Level
- Request rate
- Error rate
- Response time (latency)
- Saturation (queue depth)
### Infrastructure Level
- CPU usage
- Memory usage
- Disk I/O
- Network throughput
### Business Level (Optional)
- User signups
- Active sessions
- Feature usage
- Transaction volume
## Next Steps
- [Kubernetes Operations](./kubernetes)
- [ArgoCD Configuration](./argocd)