Monitoring & Metrics
This suite includes a practical observability starter stack built around Prometheus, Grafana, and Jaeger.
The app also exposes a dependency-aware /ready endpoint so operators can distinguish between basic process health and backing-service readiness.
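As a minimal sketch of the difference between process health and readiness, the snippet below aggregates per-dependency probes into a readiness result. The dependency names (database, cache), the probe callables, and the response shape are assumptions for illustration and are not taken from this repository.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """Process-level health: succeeds whenever the process can respond."""
    return 200, {"status": "ok"}


def readiness(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Dependency-aware readiness: 503 if any backing-service probe fails."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises is treated as an unavailable dependency.
            results[name] = False
    ready = all(results.values())
    status = 200 if ready else 503
    return status, {"status": "ready" if ready else "not_ready", "checks": results}


def _failing_probe() -> bool:
    # Hypothetical failure used only to demonstrate the not-ready path.
    raise ConnectionError("cache unreachable")


# Hypothetical probes; real ones would ping the actual backing services.
example_probes = {"database": lambda: True, "cache": _failing_probe}
status_code, body = readiness(example_probes)
```

With these example probes the process is alive but not ready, which is the case an orchestrator or load balancer needs to tell apart from a crashed process.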
📊 Real-Time Dashboard (Grafana)
The Docker Compose stack provisions the Prometheus data source and the API Reliability SLO Overview dashboard automatically.
Key Metrics Tracked
- SLO Tracking: Success ratio, p99 latency, request rate, and error-budget burn rate.
- Latency Distribution: Plots the p99 latency recording rule (5-minute window) to show tail behavior.
- System Resilience: Visualizes the circuit breaker state in real time.
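The SLO metrics above are related by simple arithmetic. The sketch below shows one common definition of burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target and the request counts are assumed values for illustration; the actual recording rules in this project may use different windows or targets.

```python
def success_ratio(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return 1.0 - failed_requests / total_requests


def burn_rate(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Error ratio divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; values above 1.0 mean it is being spent faster.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 0.0
    error_ratio = failed_requests / total_requests
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# Assumed numbers: 50 failures out of 10,000 requests against a 99.9% SLO.
# The error ratio is 0.5% and the budget is 0.1%, so the burn rate is roughly 5.
example_burn = burn_rate(total_requests=10_000, failed_requests=50, slo_target=0.999)
```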
Open the Provisioned Dashboard
- Start the stack with make stack-up.
- Access Grafana at http://localhost:3030 with admin / admin.
- Open http://localhost:3030/d/api-reliability-slo.
Provisioning files live in:
- infra/grafana/provisioning/datasources/prometheus.yml
- infra/grafana/provisioning/dashboards/dashboards.yml
- infra/grafana/dashboards/api-reliability-overview.json
🔗 Distributed Tracing (Jaeger)
Requests can be traced using OpenTelemetry, providing visibility into how traffic moves through the application during local runs and demos.
Trace Propagation
- Inbound: Middleware automatically injects trace_id, span_id, and a user-facing correlation_id.
- Outbound: The instrumented HTTP client (src/infrastructure/http_client.py) propagates context to external services (like Groq or OpenAI) via W3C traceparent headers.
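To illustrate what travels in a W3C traceparent header, here is a standard-library sketch that parses an inbound header and builds the outbound one for a child span. It is not the code in src/infrastructure/http_client.py, which presumably relies on OpenTelemetry's propagators; the function names are hypothetical, and only version 00 of the header format is handled.

```python
import re
import secrets
from typing import Dict, Optional

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex trace flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a version-00 traceparent into its fields, rejecting invalid IDs."""
    match = _TRACEPARENT_RE.match(header.strip().lower())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace_id or span_id is invalid")
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outbound_headers(inbound: Optional[str]) -> Dict[str, str]:
    """Continue the inbound trace with a new span, or start a new trace."""
    if inbound is not None:
        parent = parse_traceparent(inbound)
        trace_id, sampled = parent["trace_id"], parent["sampled"]
    else:
        trace_id, sampled = secrets.token_hex(16), True
    child_span_id = secrets.token_hex(8)  # 16 hex characters
    return {"traceparent": build_traceparent(trace_id, child_span_id, sampled)}
```

The trace_id stays constant across services while each hop contributes its own span_id, which is what lets Jaeger stitch the spans into a single trace.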
Dashboard Access
View live traces at http://localhost:16686.
🚨 Alerting Strategy
The local Prometheus container loads Golden Signal alert rules and forwards the resulting alerts to Alertmanager so you can inspect firing conditions during demos and manual testing.
| Alert | Condition | Severity |
|---|---|---|
| HighErrorRate | 5xx Errors > 5% | Critical |
| HighLatency | P99 Latency > 1s | Warning |
| CircuitBreakerOpen | Breaker state is "Open" | Warning |
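
The thresholds in the table can be mirrored in a few lines of Python for quick reasoning or unit tests. This is only an illustrative sketch: the real rules are Prometheus expressions in the alert rules file, and the Snapshot fields and the function name here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of the signals the alerts watch."""
    error_ratio_5xx: float      # fraction of responses that were 5xx, 0.0 to 1.0
    p99_latency_seconds: float  # 99th percentile latency in seconds
    breaker_state: str          # e.g. "closed", "half_open", "open"


def firing_alerts(snapshot: Snapshot) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs whose condition holds."""
    alerts = []
    if snapshot.error_ratio_5xx > 0.05:
        alerts.append(("HighErrorRate", "critical"))
    if snapshot.p99_latency_seconds > 1.0:
        alerts.append(("HighLatency", "warning"))
    if snapshot.breaker_state.lower() == "open":
        alerts.append(("CircuitBreakerOpen", "warning"))
    return alerts


# Assumed values: 8% errors, 400 ms p99, breaker closed.
example_alerts = firing_alerts(Snapshot(0.08, 0.4, "closed"))
```

Both numeric comparisons are strict, so a value exactly at the threshold does not fire.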
Alert Config
Rules are defined in infra/prometheus/alert_rules.yml and mounted into the local Prometheus container at /etc/prometheus/alert_rules.yml. Firing alerts are routed to Alertmanager at http://localhost:9093.
Recording Rules
Prometheus also records SLO-oriented series for request rate, error ratio, p99 latency, and error-budget burn rate. See docs/load-testing.md for the smoke profile and the latest recorded local baseline.